Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow a wider range of characters #27

Open
wojas opened this issue Feb 16, 2023 · 4 comments
Open

Allow a wider range of characters #27

wojas opened this issue Feb 16, 2023 · 4 comments

Comments

@wojas
Copy link
Member

wojas commented Feb 16, 2023

Currently simpleblob is quite restrictive about the characters that are allowed in names, constraining users to alphanumerical names with a few special chars (".", "-", "_"). The constraint primarily follows from using unescaped filenames in the fs backend.

We have a use case where we want to use a version specific prefix within a bucket (e.g. "v5/"), and also to be able to write a program that can discover all versions in use. This requires listing blobs with "/" is the name.

Amazon writes the following about S3 limitations: https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html
Basically they allow any UTF-8 character in the name, with a few caveats and exceptions.

Ideally we would like to allow an as wide as possible range of names, while also constraining to names that any current and future backend can safely support. These are conflicting goals. Perhaps we should use the same approach as the S3 documentation: define and recommend 'safe characters' that must work with any backend, while allowing almost all characters.

Potential solution

  • Allow almost all UTF-8 strings
  • But reject non-canonical and non-local paths like ../../foo///bar
  • Perform escaping in the fs backend to make the allowed names safe

Escaping and validation algorithm for fs:

  • Reject any backslash \ in the name
  • Call path.Clean(name) to check for non-canonical paths. Reject is the output differs from the input.
  • Reject if the name starts with .., because it could be something like ../../../etc/passwd (path.Clean will not touch this)
  • Call url.QueryEscape(name) to escape unsafe characters
  • If the name starts with a ., replace that character by %2E to avoid hidden files on UNIX

The validation and escaping functions can be exposed for other backends to reuse. The validation function should be called by every backend, the escaping function is optional.

There is another issue with special reserved device names on Windows. Go 1.20 introduces a new IsLocal function to check for these, but I don't think we want to depend on this, and it's only available in filepath. Perhaps always prepend _ to the filename to avoid this? This would also solve the UNIX hidden files issue, but be a breaking change, and it could be useful for the fs backend to produce unescaped files when restricting oneself to safe characters.

Cc @ahouene @nvaatstra

@ahouene
Copy link
Contributor

ahouene commented Feb 16, 2023

Good idea!

We could use fs.ValidPath to make sure we're dealing with a canonical path that does not contain ...
We should indeed have a way to keep files internal/private (e.g. S3 update marker), and making them .hidden is a familiar way to do so, thus always prepending _ sounds good! Or, we could keep private files in a separate, unreachable place, such as, for S3, another bucket? That would allow not forbidding any filename that would otherwise be valid.

Also, maybe we should limit key length?

@nvaatstra
Copy link
Contributor

PR #28 covers 2 specific parts of this issue:

  • "We have a use case where we want to use a version specific prefix within a bucket" > GlobalPrefix will allow you to work natively within a prefix in a bucket
  • "and also to be able to write a program that can discover all versions in use" > Recursive allows a user to List() objects which are further down in the prefix-tree inside a bucket

@udf2457
Copy link
Contributor

udf2457 commented Oct 3, 2023

Basically they allow any UTF-8 character in the name, with a few caveats and exceptions.

Just as an FYI if #59 is going to eventually merged, the safe characters for NATS is more limited: https://github.com/nats-io/nats.go/blob/9e5e70676ec02a86975d09193c6dd0a95450800c/kv.go#L331

@wojas
Copy link
Member Author

wojas commented Oct 5, 2023

Summary of a discussion on this topic with @ahouene:

An application SHOULD limit key names to a small set of of characters that are expected to be safe for any backend. This is probably the set [a-zA-Z0-9._-] with additional constraints on the dot (cannot be the first or last character and also not be used multiple times in a row). A backend MAY interpret names in a case-insensitive manner.

The safe set is not enforced on the simpleblob level. An application MAY use any character that a specific backend actually allows, but this use may not be portable to other backends. Particular care should be taken with / as used in folders, because backends may have special behavior for those.

A backend SHOULD support the basic safe set. It MAY support any characters. It MUST make sure that the use of a wider range of characters does not lead to security or operational issues.

Simpleblob does not enforce a universal limit on the length of a key. A backend MAY limit the allowed length of names.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants