Aws-cli: Unable to set `charset` when mime-types are guessed (S3)

Created on 27 May 2015 · 24 Comments · Source: aws/aws-cli

We are syncing a directory of various file types to an S3 bucket, and aws-cli is correctly guessing the MIME types; however, in our case it's important that it also append the charset. For example, the guessed Content-Type for index.html might look like this:

Content-Type: text/html

But we'd like a way to tell aws-cli that the charset for all the synced files is UTF-8, for instance:

Content-Type: text/html; charset=utf-8

Version: aws-cli/1.7.26 Python/2.7.6 Darwin/14.4.0

feature-request s3 s3mimetype

Most helpful comment

Based on community feedback, we have decided to return feature requests to GitHub issues.

All 24 comments

+1

:+1:

:+1:

:panda_face:

:+1:

:+1:

Marking as feature request. Any suggestions on how you would like to see it exposed in the CLI would be appreciated.

@kyleknap perhaps via a --charset option, whose value would be appended to the guessed MIME type?
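For illustration only, the suggested flag might look like this (hypothetical; no such option exists in aws-cli today, and utf-8 would simply be appended to each guessed Content-Type):

$ aws s3 sync ./public s3://BUCKET --charset utf-8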

You can explicitly set the content type for the s3 cp/sync commands and the s3api put-object API.

For s3 cp/sync, use the --content-type option.

$ aws s3 cp --content-type 'text/plain; charset=utf-8' index.html s3://BUCKET/index.html
upload: ./index.html to s3://BUCKET/index.html
$ aws s3api head-object --bucket BUCKET --key index.html
{
    "AcceptRanges": "bytes",
    "ContentType": "text/plain; charset=utf-8",
    "LastModified": "Thu, 28 May 2015 14:18:42 GMT",
    "ContentLength": 12,
    "ETag": "\"6f5902ac237024bdd0c176cb93063dc4\"",
    "Metadata": {}
}

$ aws s3 sync foo s3://BUCKET/foo  --content-type 'text/html; charset=utf-8'
upload: foo/index.html to s3://BUCKET/foo/index.html
$ aws s3api head-object --bucket BUCKET --key foo/index.html
{
    "AcceptRanges": "bytes",
    "ContentType": "text/html; charset=utf-8",
    "LastModified": "Thu, 28 May 2015 14:30:54 GMT",
    "ContentLength": 12,
    "ETag": "\"6f5902ac237024bdd0c176cb93063dc4\"",
    "Metadata": {}
}

For s3api put-object, use the --content-type option.

$ aws s3api put-object --content-type 'text/html; charset=latin-1' --bucket BUCKET --key index2.html --body index.html
$ aws s3api head-object --bucket BUCKET --key index2.html
{
    "AcceptRanges": "bytes",
    "ContentType": "text/html; charset=latin-1",
    "LastModified": "Thu, 28 May 2015 14:26:03 GMT",
    "ContentLength": 12,
    "ETag": "\"6f5902ac237024bdd0c176cb93063dc4\"",
    "Metadata": {}
}

Is this different from what you want?

@quiver yes, it's a bit different; we have a client-side app that we're syncing, with a multitude of different file types. We'd really like to continue to take advantage of the MIME-type guessing feature, which saves us from having to batch-upload files based on type.

@gf3
Got it. Thanks for your reply.

@kyleknap just checking in here; anything I can do to help this along?

+1

As a stop-gap, could the default "guessed" MIME type for HTML be changed to text/html; charset=utf-8 somehow?
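For what it's worth, that change is easy to express in terms of Python's mimetypes registry, which aws-cli uses for guessing. A minimal interpreter sketch of what the override would mean (illustration only; the CLI exposes no hook for this today):

>>> import mimetypes
>>> mimetypes.add_type('text/html; charset=utf-8', '.html')
>>> mimetypes.guess_type('index.html')
('text/html; charset=utf-8', None)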

Any updates on this perhaps?

When we use --content-type "text/html; charset=utf-8", the files actually default to text/plain, which in turn causes browsers to download the index.html file instead of rendering it. How do I address this? I have the same scenario as @gf3, where I'm trying to sync up a client-side app.

Thanks!

Running into this problem myself now, trying to sync a bunch of static website files to a bucket, and s3cmd is not setting the correct charset=utf-8 content type when uploading the files that contain UTF-8 characters.

I'd like to keep the deployment job simple by just syncing the directory up the pipe, instead of having to define the content type on a per-file-type or per-file basis. Is there any way to do this now?

As @dmahlow mentioned, you can define the content-type on a per-file-type basis. Just to illustrate what that might look like:

aws s3 sync --exclude "*" --include "*.html" --content-type "text/html; charset=utf-8" --delete ./public s3://www.example.com
aws s3 sync --include "*" --exclude "*.html" --delete ./public s3://www.example.com
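Note that ordering matters here: the s3 commands apply --exclude/--include filters in the order given, with later filters taking precedence, which is why the first command excludes everything before re-including *.html.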

Good Morning!

We're closing this issue here on GitHub, as part of our migration to UserVoice for feature requests involving the AWS CLI.

This will let us get the most important features to you, by making it easier to search for and show support for the features you care the most about, without diluting the conversation with bug reports.

As a quick UserVoice primer (if not already familiar): after an idea is posted, people can vote on the ideas, and the product team will be responding directly to the most popular suggestions.

We've imported existing feature requests from GitHub - search for this issue there!

And don't worry, this issue will still exist on GitHub for posterity's sake. As it's a text-only import of the original post into UserVoice, we'll still be keeping in mind the comments and discussion that already exist here on the GitHub issue.

GitHub will remain the channel for reporting bugs.

Once again, this issue can now be found by searching for the title on: https://aws.uservoice.com/forums/598381-aws-command-line-interface

-The AWS SDKs & Tools Team

Based on community feedback, we have decided to return feature requests to GitHub issues.

Thanks to @perennialmind's comment for a reasonable workaround. It would be nice to be able to specify mappings of some kind, though, to avoid finicky configs like this.

@gf3 First of all, the NERVE of appearing in my ONLINE EXPERIENCE. How dare you.

Second: the mimetypes module happily returns the guessed encoding (same as using file --mime-encoding [filename] on the command line) as the second element of its return tuple, but it's currently getting thrown away here:

https://github.com/aws/aws-cli/blob/072688cc07578144060aead8b75556fd986e0f2f/awscli/customizations/s3/utils.py#L294
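For concreteness, here is the tuple in question from a stock Python session; the second element is whatever encoding guess_type inferred, which is the part being discarded:

>>> import mimetypes
>>> mimetypes.guess_type('index.html')
('text/html', None)
>>> mimetypes.guess_type('notes.txt.gz')
('text/plain', 'gzip')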

My Python environment is hosed, but I'll take a run at a patch unless somebody beats me to it.

There's an argument to be made that it isn't aws-cli's responsibility to include the charset= portion of Content-Type for text/html files, but it's _such_ a common use case (and the resulting mojibake so terrifying when it's omitted) that it seems worthwhile to me.

Alright, so guess_type doesn't actually use libmagic under the hood, and only understands/guesses compression encodings, not text encodings. The following commit "works" to set a charset automatically on uploaded files:

https://github.com/aws/aws-cli/compare/develop...pnc:libmagic?expand=1

However, it:

  1. Probably doesn't work on Windows (at least without Cygwin)
  2. Needs to be tweaked so it doesn't cause the S3 copy unit tests to fail (I think they rely on it guessing based on filename alone?)
  3. Adds another dependency

Leaving it for posterity in case someone wants to pick up the torch, but this doesn't seem super viable unless someone from the core team encourages it.
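For anyone picking up that torch, the heart of the approach is roughly the following sketch, assuming file(1) from libmagic is on PATH (the helper name is mine, not the commit's):

import subprocess

def guess_charset(path):
    # Ask file(1) for the text encoding, e.g. "utf-8", "us-ascii", or "binary".
    # Hypothetical helper; the linked commit may structure things differently.
    out = subprocess.check_output(["file", "--brief", "--mime-encoding", path])
    return out.decode().strip()

A guessed text/html could then be extended to text/html; charset=utf-8 before upload, skipping files that report binary.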

+1 for getting this solved correctly, please! My s3 copy commands are littered with include and exclude statements now :-1:... looking very similar to justatheory's

Still an issue; I've updated my blog publish script from the broken link above to this script. Sure wish I could specify mappings explicitly and call it once!
