Aws-cli: Filename encoding errors

Created on 4 Jun 2015 · 29 comments · Source: aws/aws-cli

Hello all

I'm using aws s3 sync to upload to a bucket, but I'm getting errors on many files:

    Please check your locale settings.  The filename was decoded as: ANSI_X3.4-1968
    On posix platforms, check the LC_CTYPE environment variable.

My system is Ubuntu Server 14.04 running aws-cli 1.7.5 and Python 2.7.6. I've altered my installation's locale to en_US.UTF-8 using export LC_ALL=en_US.UTF-8, but that didn't change anything. The filesystem is ext4, and the files in question were created by OS X via Netatalk 3.1.7 (AFP).

Any suggestions very much welcome...

locale s3 unicode

Most helpful comment

Just type on your commandline
$ export LC_ALL=en_US.UTF-8

All 29 comments

How do your locale settings look when you run locale? This is what mine looks like:

$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

Maybe try setting your LC_CTYPE to en_US.UTF-8? That is what I usually suggest to other people who run into this issue.
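For example (the paths here are placeholders, not taken from the reporter's setup):

$ export LC_CTYPE=en_US.UTF-8
$ locale        # confirm LC_CTYPE now shows en_US.UTF-8
$ aws s3 sync /path/to/files s3://mybucket/backup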

@kyleknap I have the same locale and I'm receiving the following error https://github.com/aws/aws-cli/issues/1386#issuecomment-112649485

Seeing the same error with s3 sync. I have a file with "í" in the name, and the s3 warning prints it as "\xc3\xad" (an ls in the terminal shows it correctly). Also have locale all set to "en_US.UTF-8".

With clients creating PDFs and document names containing special characters, this is becoming a real problem for nightly backups. Has there been any activity to rectify the issue?

I had a chance to look at this again. I can confirm with @ireuben that filenames with some special characters will fail. I found the specific error message when copying the file directly:

$ aws s3 cp /var/www/html/website/public/media/uploads/512a90f7b0069-John\ Smith\ R\xe9sum\xe9.pdf s3://mybucket/backup/

'utf8' codec can't decode byte 0xe9 in position 92: invalid continuation byte

If you rename the file and execute again, there is no issue:

$ mv /var/www/html/website/public/media/uploads/512a90f7b0069-John\ Smith\ R?sum?.pdf /var/www/html/website/public/media/uploads/512a90f7b0069-John_Smith_Resume.pdf
$ aws s3 cp /var/www/html/website/public/media/uploads/512a90f7b0069-John_Smith_Resume.pdf s3://mybucket/backup/

upload: /var/www/html/website/public/media/uploads/512a90f7b0069-John_Smith_Resume.pdf to s3://mybucket/backup/512a90f7b0069-John_Smith_Resume.pdf

So I'm guessing this is why $ aws s3 sync has errors: the underlying copy operation fails.

Again, here is the error message with sync:

$ aws s3 sync /var/www/html/website/public/media/uploads s3://mybucket/backup

warning: Skipping file '512a90f7b0069-John\ Smith\ R\xe9sum\xe9.pdf'. There was an error trying to decode the the file '512a90f7b0069-John\ Smith\ R\xe9sum\xe9.pdf' in directory "/var/www/html/website/public/media/uploads". 
Please check your locale settings.  The filename was decoded as: UTF-8
On posix platforms, check the LC_CTYPE environment variable.

You can see the sync output gives no indication of the actual decode error, which the more verbose $ aws s3 cp output does show.

I'm pretty sure this is a bug in the way special characters in filenames are handled. I've tried several locale combinations, but my attempts do not circumvent the issue.

If anyone is interested, I found a utility in the EPEL repo called detox. I wouldn't necessarily suggest this as a solution, but it's nonetheless a pretty helpful tool. The following cleans up the file names and allows transfers to succeed with $ aws s3 commands:

$ detox -rv -s iso8859_1-only /var/www/html/website/public/media/uploads/

AFAIK, POSIX doesn't mandate any encoding at all for file system entries. It is not feasible to expect -- or, worse, force -- users to use any specific encoding. Users are free to choose what encoding they want their data to be displayed/stored as (that's why LC_ALL et al. were created), and this includes how they name file system objects. It's not always feasible, nor desirable, to convert all your file names just to please your backup software. From an OS standpoint, path names are just a sequence of bytes, and that's how aws-cli should treat them.
This is a severe bug. It may be preventing people from backing up important data on their servers.
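As a minimal illustration of the "just a sequence of bytes" point (a sketch in plain Python, not aws-cli code; the directory path is made up):

import os

UPLOADS = b'/var/www/uploads'  # a bytes path, chosen arbitrarily for the example

# Passing bytes to os.listdir() makes Python return raw bytes entries,
# so every POSIX file name round-trips with no decoding step at all.
for name in os.listdir(UPLOADS):
    path = os.path.join(UPLOADS, name)   # still bytes; no encoding assumed
    print(os.stat(path).st_size, path)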

I'm running into the same problem, I think.

A sync operation just exits with no messages and exit code 1. If I add the --debug flag I get a lot of output, including:

upload failed: uploads/2016/03/Coordonnateurtrice-des-relations-avec-la-client\udcc3\udca8leFR.pdf to s3://my-bucket/uploads/2016/03/Coordonnateurtrice-des-relations-avec-la-client\udcc3\udca8leFR.pdf 'utf-8' codec can't encode character '\udcc3' in position 76: surrogates not allowed

I found that setting LC_CTYPE to en_US.UTF-8 fixes it for me. It was en_CA.UTF-8. Why would one be OK but not the other?

The fact that the program issues an error and exits with a non-zero status alleviates the problem (at least the admin can be alerted, hopefully before someone asks for the restore of a file that was skipped), but it doesn't solve it. It takes just a single user uploading a file with an "invalid" byte sequence in the name (from the aws-cli locale's point of view) to ruin your backup.
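For what it's worth, the non-zero exit status can at least be turned into an alert in a nightly job. A rough sketch (the paths and mail address are placeholders):

#!/bin/sh
aws s3 sync /opt/storage s3://mybucket/backup
if [ $? -ne 0 ]; then
    echo "aws s3 sync reported errors; some files may have been skipped" \
        | mail -s "backup warning" admin@example.com
fi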

I am not aware of any differences in conversion between en_US.UTF-8 and en_CA.UTF-8, so I cannot provide concrete answers about why it works with one and not the other, but this might be caused by many factors, including bad/corrupt locale tables lying on your servers. But notice that changing locales is not an actual fix, as users can create files/directories using any byte sequences they want.

It's a fix in my case, at least for the current set of files. I agree at least that it's certainly not a satisfactory fix.

I'm running a stock AWS Ubuntu instance here, so I can't imagine locale tables being corrupt. I didn't set the locale to Canada myself, and can't find any configuration which mentions en_CA; perhaps that is inherited from the machine I'm logging into it from.

Basically, what Python is trying to do in this case is convert an arbitrary sequence of bytes (a file/directory name) into UTF-8. Since this conversion is not always possible, it complains (and aws-cli ignores the entry). However, which locale it is converting to should not make much difference to this error, provided the locales use the same encoding. So setting LC_ALL to en_US.UTF-8, fr_CA.UTF-8, or even C.UTF-8 should behave the same in this case (the key here is UTF-8). That's why I said yours is a different issue -- it _may_ be corrupt locale tables, or many other (unrelated) problems.
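To make that concrete, this is roughly the failure in plain Python (not aws-cli internals). A Latin-1 encoded name simply isn't valid UTF-8:

>>> b'R\xe9sum\xe9.pdf'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1: invalid continuation byte
>>> b'R\xe9sum\xe9.pdf'.decode('latin-1')   # the same bytes are perfectly valid Latin-1
'Résumé.pdf'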

And, yes, your locale environment variables are forwarded to remote machines by ssh. So, if you use en_CA in your local machine, most likely it will be set on remotes as well when you ssh to them.

Just type on your commandline
$ export LC_ALL=en_US.UTF-8

Unfortunately this won't solve the problem.

If you happen to have file names with byte sequences that are not valid in UTF-8, the command will spit out a warning and skip the file.

File names should be treated as a sequence of bytes, with no implied or assumed encoding.
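If anyone wants to reproduce this for testing (assuming a bash shell), a single non-UTF-8 byte in a name is enough; the sync then emits the "Skipping file" warning shown above:

$ touch "$(printf 'caf\xe9.txt')"    # 0xe9 is 'é' in Latin-1, invalid as UTF-8
$ aws s3 sync . s3://mybucket/backup --dryrun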

It's a valid short-term remedy for most use cases, I imagine. Doing a similar thing helped me get my backup script functional, at least. But a better permanent solution is certainly required.

I wouldn't say _most_ use cases, because for most people the charset and encoding are already set up properly. For those who can guarantee that all their users will create file names in a single encoding, I expect the default and per-user locales are already set properly; otherwise they'd face many other problems before even thinking about a backup.

Anyway, this issue is about having two files A and B, where A's name is encoded as X and B's as Y, so we can't back up both.

But, fair enough, your point is valid, as it may help others.

P.S.: I was rereading my reply to your post above, and one thing that might have caused your en_US/en_CA problem is that the en_CA locale table might not have been available on the remote server when you connected.

Yes; that's a possibility. I would have hoped it would just fall back to en_<something else>...

My locale is all en_US.UTF-8 and I even tried export LC_ALL=en_US.UTF-8 but I'm still getting
"Please check your locale settings. The filename was decoded as: UTF-8
On posix platforms, check the LC_CTYPE environment variable."
on various files when I try to sync them up to s3.

I can't just rename things; I need to leave them as is.

Any idea when this issue is going to be addressed?

@tremby Regarding LC_CTYPE not being set correctly, if you are ssh'ing in, your local machine may be forwarding that locale to the server.
In your ssh_config look for

Host *
   SendEnv LANG LC_*

and comment it out

In OS X (< Sierra) it is located at /etc/ssh_config
In Debian/Ubuntu, and now in macOS Sierra, it is located at /etc/ssh/ssh_config

I'm encountering this issue as well even with proper UTF-8 locale settings as the default:

'utf-8' codec can't encode character '\udce2' in position 53: surrogates not allowed

Is there a proper way to work around this without modifying the filename? If not then this issue should probably be closed with a won't/can't fix note.

@gravyboat That might indicate that your Python was compiled with narrow Unicode. You can check this by looking at sys.maxunicode. A value of 1114111 is wide, while 65535 is narrow.
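A quick way to check, for reference:

$ python -c "import sys; print(sys.maxunicode)"
1114111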

@JordonPhillips Thanks for the note, I double-checked and it's 1114111. I think it had to do with the fact that the Unicode value doesn't actually exist (http://unicode.scarfboy.com/?s=U%2Bdce2), which created the problem. The conclusion we came to was just to change the name of the file.
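For context: U+DCE2 is indeed not a real character. It is how Python 3 represents the undecodable byte 0xe2 when reading filenames with the surrogateescape error handler, and such lone surrogates cannot be re-encoded strictly, e.g.:

>>> b'\xe2'.decode('utf-8', 'surrogateescape')
'\udce2'
>>> '\udce2'.encode('utf-8')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in position 0: surrogates not allowed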

Similar issue here.
I did some tests, and finally realized that it is my broken UTF-8 filename: config - ?ƻs.yml.
So I just changed the filename.
Thanks @gravyboat

$ docker run --rm -v ~/.aws:/root/.aws -v ~:/data sstc/awscli:alpine ls -l /data/temp/
total 20
-rw-r--r--    1 root     root           803 Jul 29 22:04 config - ?ƻs.yml
-rw-r--r--    1 root     root             0 Jul 29 22:04 中文
-rw-r--r--    1 root     root             5 Jul 29 22:04 中文2

(alpine + python2) dryrun

one file failed;
all UTF-8 filenames show up garbled in the output.

$ docker run --rm -v ~/.aws:/root/.aws -v ~:/data sstc/awscli:alpine-python2 aws s3 sync /data/temp/ s3://backup.up9cloud.net/a/ --dryrun
warning: Skipping file 'config - \xbd\xc6\xbbs.yml'. There was an error trying to decode the the file 'config - \xbd\xc6\xbbs.yml' in directory "/data/temp/". 
Please check your locale settings.  The filename was decoded as: UTF-8
On posix platforms, check the LC_CTYPE environment variable.
(dryrun) upload: temp/?? to s3://backup.up9cloud.net/a/??     
(dryrun) upload: temp/??2 to s3://backup.up9cloud.net/a/??2

(alpine + python3) dryrun

all ok

$ docker run --rm -v ~/.aws:/root/.aws -v ~:/data sstc/awscli:alpine aws s3 sync /data/temp/ s3://backup.up9cloud.net/a/ --dryrun
(dryrun) upload: temp/config - ?ƻs.yml to s3://backup.up9cloud.net/a/config - ?ƻs.yml
(dryrun) upload: temp/中文 to s3://backup.up9cloud.net/a/中文
(dryrun) upload: temp/中文2 to s3://backup.up9cloud.net/a/中文2

(alpine + python2) real world

one file decode failed.

$ docker run --rm -v ~/.aws:/root/.aws -v ~:/data sstc/awscli:alpine-python2 aws s3 sync /data/temp/ s3://backup.up9cloud.net/a/         
warning: Skipping file 'config - \xbd\xc6\xbbs.yml'. There was an error trying to decode the the file 'config - \xbd\xc6\xbbs.yml' in directory "/data/temp/". 
Please check your locale settings.  The filename was decoded as: UTF-8
On posix platforms, check the LC_CTYPE environment variable.
upload: temp/??2 to s3://backup.up9cloud.net/a/??2            
upload: temp/?? to s3://backup.up9cloud.net/a/??

(alpine + python3) real world

one file encode failed.

$ docker run --rm -v ~/.aws:/root/.aws -v ~:/data sstc/awscli:alpine aws s3 sync /data/temp/ s3://backup.up9cloud.net/a/
upload failed: temp/config - \udcbdƻs.yml to s3://backup.up9cloud.net/a/config - \udcbdƻs.yml 'utf-8' codec can't encode character '\udcbd' in position 11: surrogates not allowed
upload: temp/中文 to s3://backup.up9cloud.net/a/中文                                     
upload: temp/中文2 to s3://backup.up9cloud.net/a/中文2

(alpine + python2) debug

only a warning, no Python stack traces.

$ docker run --rm -v ~/.aws:/root/.aws -v ~:/data sstc/awscli:alpine-python2 aws s3 sync /data/temp/ s3://backup.up9cloud.net/a/ --debug
...
2017-07-30 07:18:47,407 - MainThread - botocore.hooks - DEBUG - Event choosing-s3-sync-strategy: calling handler <bound method DeleteSync.use_sync_strategy of <awscli.customizations.s3.syncstrategy.delete.DeleteSync object at 0x7f2967134bd0>>
warning: Skipping file 'config - \xbd\xc6\xbbs.yml'. There was an error trying to decode the the file 'config - \xbd\xc6\xbbs.yml' in directory "/data/temp/". 
Please check your locale settings.  The filename was decoded as: UTF-8
On posix platforms, check the LC_CTYPE environment variable.
2017-07-30 07:18:47,415 - MainThread - botocore.loaders - DEBUG - Loading JSON file: /usr/local/lib/python2.7/site-packages/botocore/data/s3/2006-03-01/paginators-1.json
...

(alpine + python3) debug

Don't know why it needs to do input_str = input_str.encode('utf-8')

$ docker run --rm -v ~/.aws:/root/.aws -v ~:/data sstc/awscli:alpine aws s3 sync /data/temp/ s3://backup.up9cloud.net/a/ --debug
...
2017-07-30 05:24:26,617 - <concurrent.futures.thread.ThreadPoolExecutor object at 0x7f85b443eda0>_0 - s3transfer.tasks - DEBUG - Exception raised.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/s3transfer/tasks.py", line 126, in __call__
    return self._execute_main(kwargs)
  File "/usr/local/lib/python3.6/site-packages/s3transfer/tasks.py", line 150, in _execute_main
    return_value = self._main(**kwargs)
  File "/usr/local/lib/python3.6/site-packages/s3transfer/upload.py", line 679, in _main
    client.put_object(Bucket=bucket, Key=key, Body=body, **extra_args)
  File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 310, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 573, in _make_api_call
    api_params, operation_model, context=request_context)
  File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 628, in _convert_to_request_dict
    api_params, operation_model)
  File "/usr/local/lib/python3.6/site-packages/botocore/validate.py", line 293, in serialize_to_request
    operation_model)
  File "/usr/local/lib/python3.6/site-packages/botocore/serialize.py", line 409, in serialize_to_request
    partitioned['uri_path_kwargs'])
  File "/usr/local/lib/python3.6/site-packages/botocore/serialize.py", line 431, in _render_uri_template
    params[template_param[:-1]], safe='/~')
  File "/usr/local/lib/python3.6/site-packages/botocore/utils.py", line 332, in percent_encode
    input_str = input_str.encode('utf-8')
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcbd' in position 11: surrogates not allowed
...

Has anybody found a viable workaround or solution for this?
I have the issue with awscli, but with sftp it works flawlessly.

All locales are utf-8.

+1

Changing the filename isn't really a valid solution, unless of course there's nothing else relying on that file.

But, setting the locale as the error suggests doesn't seem to solve the problem either:

# aws s3 sync /opt/storage/private/client-files/xxxxxx s3://xxxxxxxx/private/client-files/xxxxxx
warning: Skipping file 'xxxxxxx-d\xb5mbrosio.pdf'. There was an error trying to decode the the file 'xxxxxxx-d\xb5mbrosio.pdf' in directory "/opt/storage/private/client-files/xxxxxx/".
Please check your locale settings.  The filename was decoded as: UTF-8
On posix platforms, check the LC_CTYPE environment variable.

# LC_CTYPE=en_US.UTF-8 aws s3 sync /opt/storage/private/client-files/xxxxxx s3://xxxxxxxx/private/client-files/xxxxxx
warning: Skipping file 'xxxxxxx-d\xb5mbrosio.pdf'. There was an error trying to decode the the file 'xxxxxxx-d\xb5mbrosio.pdf' in directory "/opt/storage/private/client-files/xxxxxx/".
Please check your locale settings.  The filename was decoded as: UTF-8
On posix platforms, check the LC_CTYPE environment variable.

I tried the above with setting LC_ALL instead, but received the same result.

EDIT: Actually, I got there in the end with LC_ALL=en_US.iso88591. Two issues with that: first, setting LC_CTYPE as the error message suggests did not make a difference; LC_ALL was required instead. Secondly, I'm not sure this is really workable, especially if different files might be encoded differently... (I suppose that is the point of this issue existing; it's just a pity that it's existed for so long!)

We were facing the same problem and we found the cause. Even if the locale command prints LC_CTYPE="en_US.UTF-8", if you try echo $LC_CTYPE it will be empty, which means the env var is not actually set. We found two solutions for this problem. One is export LANG=utf8, and the other is simply to prepend LC_CTYPE=en_US.UTF-8 on the same line before the aws s3 sync command, like LC_CTYPE=en_US.UTF-8 aws s3 sync ...
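In other words, something like this (the local path is a placeholder):

$ locale | grep LC_CTYPE
LC_CTYPE="en_US.UTF-8"
$ echo "$LC_CTYPE"         # prints an empty line -- the variable itself is unset
$ LC_CTYPE=en_US.UTF-8 aws s3 sync /data/uploads s3://mybucket/backup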

Just type on your commandline
$ export LC_ALL=en_US.UTF-8

This fixed my issue after hours of searching


This is a workaround that works most of the time, but not a proper fix.

POSIX file names are a sequence of bytes, thus it doesn't make much sense to talk about file name encoding. Some file systems may enforce specific encodings, but unfortunately this is not the rule. On the other side, apparently AWS-CLI (and perhaps S3?) handles all file names as string data in Python (which implies an encoding), hence the issue.

Just so you know, it would suffice for any user (or perhaps an erratic app) to create a single file with a name that contains invalid byte sequences to trigger the problem and make AWS-CLI bail. There is no LC_ALL-fu you can do to prevent this.

The actual solution should be to use os.fsencode()/os.fsdecode()/pathlib in Python 3. Similar solutions could be backported to Python 2, but given Python 2's EOL I'm not sure it is worth the effort. In any case, this may be a hard problem to solve, one that probably touches several parts of AWS-CLI.
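A rough sketch of the round-trip that makes possible (plain Python 3, not aws-cli code; assumes a UTF-8 filesystem encoding):

import os

raw = b'512a90f7b0069-John Smith R\xe9sum\xe9.pdf'   # Latin-1 bytes, not valid UTF-8

name = os.fsdecode(raw)            # undecodable bytes become lone surrogates ('\udce9')
assert os.fsencode(name) == raw    # the exact original bytes are recovered

# The part that must never happen is a strict re-encode of such a name:
# name.encode('utf-8') raises "surrogates not allowed", which is exactly the
# error seen in the tracebacks above. The S3 key would instead have to be
# derived from the raw bytes (or the surrogates escaped deliberately).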

Anyway, the fact that this is a hard problem to solve may explain why it is taking so long to "fix". A proper fix, however, is much needed, as unfortunately there is no way to work around this safely. There is no room for "let's wait for, or force, everybody to migrate to UTF-8 filenames" because, as I said, this is not something admins control. Just one example: last time I checked, Drupal allowed people to upload filenames in any encoding, and I bet other CMSes do too. If an attacker wants to DoS your backups, all they need to do is upload a file with a carefully crafted, encoding-incompatible filename -- which by no means makes it an _invalid_ filename, as no such thing exists in a POSIX file system as far as encoding goes (only basenames containing NUL and / (slash) characters are truly invalid).
