Paramiko: Crashes on filenames that are not UTF-8 (listdir, maybe others)

Created on 27 Jun 2015  路  40Comments  路  Source: paramiko/paramiko

In Obnam, my backup program, I use Paramiko to access files over SFTP. Mostly, Obnam users use this to access the backup repository, but some use it to access their live data. Live data files have all sorts of nasty filenames, and someone has just reported problems with non-UTF-8 filenames.

To reproduce this with Obnam 1.9 on Debian 8 (jessie):

$ mkdir data
$ touch data/"$(printf '\xff')"
$ obnam backup --no-default-config -r repo sftp://localhost$(pwd)/data
00h00m00s 1 files 0 B scanned: scanning for files in sftp://localhost/tmp/x/dataTraceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cliapp/app.py", line 190, in _run
    self.process_args(args)
...
  File "/usr/lib/python2.7/dist-packages/obnamlib/plugins/sftp_plugin.py", line 455, in listdir2
    attrs = self.sftp.listdir_attr(pathname)
  File "/usr/lib/python2.7/dist-packages/paramiko/sftp_client.py", line 208, in listdir_attr
    filename = msg.get_text()
  File "/usr/lib/python2.7/dist-packages/paramiko/message.py", line 199, in get_text
    return u(self.get_bytes(self.get_size()))
  File "/usr/lib/python2.7/dist-packages/paramiko/py3compat.py", line 51, in u
    return s.decode(encoding)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte

What Obnam does is to get the list of files using listdir_attr, and that one fails, if there's a non-UTF8 filename on the remote side.

Bug Needs investigation SFTP

Most helpful comment

On 2.0.2, the offending line is https://github.com/paramiko/paramiko/blob/master/paramiko/py3compat.py#L148

Changing:
s.decode(encoding)

To:
s.decode(encoding, 'ignore')

Would at least stop paramiko from failing catastrophically, but it still raises the issue that if the encoding is wrong (or characters are ignored) I assume you can't actually access the file later on during operations as well.

Personally, @larswirzenius suggestion to return the actual bytes (opt-in) would be the best for me though

All 40 comments

Program to exhibit the problem.

#!/usr/bin/python2

import os
import tempfile

import paramiko

filename = os.path.abspath('non-utf-filename-\xff')
with open(filename, 'w') as f:
    f.write('this file has a non-UTF8 filename\n')
print 'Created file with non-UTF8 filename:', repr(filename)

print 'Connecting to localhost over SSH'
remote = ('localhost', 22)
transport = paramiko.Transport(remote)
transport.connect()

print 'Authenticating using keys in SSH agent'
agent = paramiko.Agent()
for key in agent.get_keys():
    try:
        transport.auth_publickey(os.environ['USER'], key)
        break
    except paramiko.SSHException:
        pass

print 'Listing filenames over SFTP'
sftp = paramiko.SFTPClient.from_transport(transport)
sftp.listdir(os.getcwd())

The output I get with the program above, on Debian 8 (paramiko 1.15.1):

liw@exolobe1$ python nonutf8name.py 
Created file with non-UTF8 filename: '/home/liw/paramiko/non-utf-filename-\xff'
Connecting to localhost over SSH
Authenticating using keys in SSH agent
Listing filenames over SFTP
Traceback (most recent call last):
  File "nonutf8name.py", line 29, in <module>
    sftp.listdir(os.getcwd())
  File "/usr/lib/python2.7/dist-packages/paramiko/sftp_client.py", line 172, in listdir
    return [f.filename for f in self.listdir_attr(path)]
  File "/usr/lib/python2.7/dist-packages/paramiko/sftp_client.py", line 208, in listdir_attr
    filename = msg.get_text()
  File "/usr/lib/python2.7/dist-packages/paramiko/message.py", line 199, in get_text
    return u(self.get_bytes(self.get_size()))
  File "/usr/lib/python2.7/dist-packages/paramiko/py3compat.py", line 51, in u
    return s.decode(encoding)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 17: invalid start byte

One should have an option to select encoding of the server, for example I have a serverthet returns list of files in cp1250. I get exatly the same error in get_text. There should be encoding passed to u() function (from py3compat)

Having a server-wide encoding isn't good enough, I'm afraid. There is no guarantee that all filenames on one server, or one filesystem, or one directory, are all in the same encoding.

The only safe way seems to be to treat filenames as byte strings.

Yes, the thing is there is no way to get the list as byte strings. listdir tries to convert all entries to utf-8, no questions asked.

As far as I understand, there is no need to try to convert things to Unicode at all. The SFTP protocol can handle non-Unicode filenames just fine. It's Paramiko that insists on conversions. Or am I wrong?

Yes, it's paramiko. However, the client is unusable, since listdir uses get_text() for getting names of files (and folders) in a path instead of get_str(). If the server serves them in encoding other than utf-8, listdir crashes.

I think the solution should be trying to convert to ascii, then to utf-8. If that fails, listdir should return plain bytes it received from server. There is even a function written in sftp_client.py - _to_unicode. Please review my commit to this file in my fork of this repo (I did other changes too, which were wrong, so sorry).

I don't understand why Paramiko wants to do any kind of conversion of filenames. For my needs, I need the actual filename, not some possibly-converted name. Can I have that, please?

@larswirzenius I agree that bytes should be available, but that should be an opt-in, not the default. Two similar cases in stdlib that I can think of are the functions in the os module such as getcwdb(), environb() and getenvb() and open(..., "r")/open(..., "rb"). It should be possible to specify either an encoding or that bytes should be returned. One possibility would be having listdir() and listdirb(). It would also be useful to add an "encoding" keyword parameter to SFTPClient.from_transport() and other relevant methods for non-utf8 encodings (utf8 should be the default, as it will work in most cases). Note that on Python 2 str is bytes == True whist it is false on Python 3, where str is more like Python 2's unicode.

Any news on this bug?

Still seems to be an issue here in version 2.0.1.

On 2.0.2, the offending line is https://github.com/paramiko/paramiko/blob/master/paramiko/py3compat.py#L148

Changing:
s.decode(encoding)

To:
s.decode(encoding, 'ignore')

Would at least stop paramiko from failing catastrophically, but it still raises the issue that if the encoding is wrong (or characters are ignored) I assume you can't actually access the file later on during operations as well.

Personally, @larswirzenius suggestion to return the actual bytes (opt-in) would be the best for me though

I acknowledge the desire to have access to raw bytes, though as noted by others, changing the default behavior isn't going to fly as it'd be backwards incompatible behavior. (I don't know that I'd change it even in a backwards incompat release, I'd guess the average Python (3) user would still expect "a filename" to be a Unicode string by default; but certainly open to debate.)

Changing the decoding to ignore errors seems like a decent step, as most recently noted, and I am also generally a fan of allowing users to specify a runtime override to the default encoding used. Another override stating "give me bytes objects for the filenames" is also possible but I'd need to look at how best to fit it in API-wise.

Of note, I cannot reproduce the error on OS X, but can on Debian. Haven't dug deep enough to see if this is a shell environmental issue or an SSH server issue, but it doesn't matter a ton.

Not reproducible on OS X because hfs+ does utf8 NFD normalization on filenames, which other unixes do not do.

Linux:

$ printf 't1\xc3\xa9z\n'
t1茅z
$ printf 't1\xc3\xa9z\n' | hexdump -C
00000000  74 31 c3 a9 7a 0a                                 |t1..z.|
00000006
$ echo blah >$(printf 't1\xc3\xa9z')
$ ls t1* | hexdump -C
00000000  74 31 c3 a9 7a 0a                                 |t1..z.|
00000006

macOS:

$ printf 't1\xc3\xa9z\n'
t1茅z
$ printf 't1\xc3\xa9z\n' | hexdump -C
00000000  74 31 c3 a9 7a 0a                                 |t1..z.|
00000006
$ echo blah >$(printf 't1\xc3\xa9z')
$ ls t1* | hexdump -C
00000000  74 31 65 cc 81 7a 0a                              |t1e..z.|
00000007

See, \xc3\xa9 is the utf-8 encoding of the "NFC" form of "茅", and macOS / hfs+ changes it into \x65\xcc\x81 which is the utf-8 encoding of the "NFD" form of "茅". (This can be a real and confusing problem, because NFC form is more common, and many unix tools do byte comparison of filenames, and are surprised when the filename they wrote is not there anymore.)

Linux passes through \xff, but macOS escapes it to %FF

$ printf 't2\xffz\n'
t2?z
$ printf 't2\xffz' | hexdump -C
00000000  74 32 ff 7a                                       |t2.z|
00000004
$ echo blah >$(printf 't2\xffz')
$ echo t2*
t2%FFz
$ echo t2* | hexdump -C
00000000  74 32 25 46 46 7a 0a                              |t2%FFz.|
00000007

Enabling the user to opt-into bytes representation of filenames is the sane solution, IMHO.

See also the "surrogateescape" error handler, and the os.fsdecode() and os.fsencode() functions, which were introduced in python 3.1 and 3.2 respectively (_not_ conveniently available in python 2.x)
https://docs.python.org/3.5/howto/unicode.html?highlight=surrogateescape#files-in-an-unknown-encoding
https://docs.python.org/3.5/library/codecs.html?highlight=surrogateescape#error-handlers
https://www.python.org/dev/peps/pep-0529/#background

In poking at a workaround for this, I'm actually seeing test fails on Python 3(.4.2), on my Debian 8 VM, before my changes are even present. Why these aren't present on Travis or on OS X is unclear.

At least one of them seems to stem from the test stub server hitting a "wtf this isn't ascii" in os.mkdir() (e.g. in the tests that create unicode-named folders). The other failures are that same spot or in os.rename, same symptom.

For now I'm already way deeper into fucking with this than I should be so I'm just going to ignore these for the time being, but it's a bit perplexing. (Possibly of note is the awful eval(compile()) bit in test_sftp.py that complains it's for Python 3.2...don't recall how that snuck in but we no longer even support 3.2, so.)

@ploxiln Yea, I was seeing that %FF translation - interesting stuff and something I was unaware of. Thanks.

Grump, testing this brings up a wider issue, namely that one of course cannot input these kinds of strings into a Paramiko _server_ either, without the same error. And while in the client-side situation I can get away with 1-2 places changed (one spot each in listdir_iter and listdir_attr), the server side involves many more calls to Message.get_text(), all of which would need updating with eg errors='replace'.

(And this holds true even if we were to switch the workaround to "get bytes instead".)

Makes me wonder if the "real" fix is to change the _default_ behavior of Message.get_text(); if the packet claims there's a string field from bytes N through M, and it has that many bytes, and they more-or-less can be decoded into a Unicode object...and if on top of that the Python user side of things is truly expecting Unicode text and not a bytes object...then shouldn't we always attempt to deliver some workable Unicode string even if it's got garbage in it?

Arguably, it's not the transport or server/client's fault if the "text" put into the packet by the other end happens to not be cleanly decodable, regardless of the reason (i.e. we know, here, that there is a good reason because the source of truth is an encoding-unspecified filesystem; but there could be all sorts of other sources for text in a protocol message.)


I've definitely gone way over my threshold for this ticket and it wasn't even one I was intending to work on today (it's not a security ticket and it's no longer truly low hanging fruit) so I'm gonna drop this for now. Anyone else who wants to can pick up the branch I pushed (see inline link above) and run with it, if they want. But I'd really prefer a test and for a test to work, the server needs to work...:)

@bitprophet would it make sense to have a form= parameter that allows one to select the normalization form, as per unicodedata.normalize()? There could be an 'auto' option that compares byte representations br to unicodedata.normalize(form, br.decode(encoding)).encode(encoding) (if I've understood the docs correctly) and picks one for you, which would be the default. I'll have a play with doing this using the byte string examples above, though not immediately as I have something else to do.

@ploxiln When you said "not conveniently available" were you aware of the backport in python-future, or did you just mean os.fsen/decode()? Would the BSD license be an issue?
https://github.com/PythonCharmers/python-future/blob/master/src/future/utils/surrogateescape.py

@bitprophet I notice that in Python3, os.listdir() returns bytes if the path is specified as bytes and does the same for str paths. The \xff file is a surrogate escaped str().
In Python2, os.listdir(u'.') returns unicode for all entries except the one that can't be decoded, which is returned as bytes/Py2 str.
Contradicting the suggested specifics of my previous comment, perhaps the least surprising thing would be to imitate what Python2 and 3 do on each? This would require client code to check the type of results in cases as in the OP. Would it work the same as the current implementation otherwise?

I was not aware when I said that, but did find it afterwards. I'm not sure how acceptable it would be to add a dependency on python-future, but perhaps it wouldn't be too much to copy the minimal code needed into py3compat.py

Is paramiko going to be updated as the @bilal414 commit ?

I think the best way to go by far, is to have a default False option that tells the API to treat filenames as bytes. Always coercing bytes to some encoding cannot always work. It'd probably be better, if we could go back in time, to just always return bytes, but we can't. The solutions with errors='whatever' are also problematic. Coercing to unicode and falling back on bytes if that fails is also troublesome, because then code has to be smart enough to deal with both (return) types, lest client code give weird tracebacks or b'abc' "strings" (for EG).

There's an essential difference here: Many application programs that an enduser would specify a filename to (like openoffice or inkscape or something) will probably be better off assuming utf-8. But programs like backup tools, or xargs or whatever, really need to just treat bytes as bytes.

In C:\ProgramData\Anaconda3\lib\site-packages\paramiko\py3compat.py
Change def u(s, encoding="utf8"): to def u(s, encoding="latin"): It works !

My idea would be to add an optional named parameter. This would ensure backward compatibility. Most users know what locale the target system is using.
Might the attached patch work for you?

add_encoding.patch.txt
The idea is to add encoding to all related functions using get_text(). I hope that I didn't miss something.

I would discourage the use of error=None, error="ignore" etc, because you won't be able to get files without their correct name.

Simply adding the optional encoding to all related functions would solve the problem while keeping backward compatibility.
You mostly know what the targeted system is using, so you could simply try sftp.listdir(".", encoding="sjis") etc.

Without the encoding, you would never be able to fetch files whose names contain non ASCII. The error option only prevents the crash, but it does not solve the problem at all.

I'm aware of five ways to handle it; all are appropriate in some circumstances:
1) Guess the encoding. There are ways, but this is best seen as a last resort and shouldn't be made part of a production pipeline.
2) Assume. Treat everything as UTF-8, for example.
3) Ask. Get the user or caller to specify the encoding. This is actually a pretty good way to go for applications like inkscape or blender or something.
4) Replace nonconvertible characters with sentinels.
5) Treat everything as bytes. This allows you to avoid encoding issues altogether. It's the best option for tools like a filesystem backup program on Unix/Linux (including Mac) that shouldn't have to care about encoding.

It actually might be good to support 3 and 5 in paramiko.

BTW, I am unaware of any encoding that gives a 100% successful roundtrip for all possible inputs. Even ISO-8859-1, latin-1 and ascii fail on some inputs.

@dstromberg
My proposal is to to treat everything as utf-8 unless the user specifies something. This includes your no. 2 and no. 3.

  1. Is not realistic, since you will never be able to work with the files
  2. Is not backward compatible, but might be an option for a future release.

3 is good, but doesn't solve all problems.

5 can be done in a backward compatible way by introducing a new callable or specifying a new argument that defaults to dealing with encodings.

One good way, IMO, would be to change the current single callable into two callables. The first callable always returns bytes, the second gets bytes from the first and converts to the user-provided (or default UTF-8) encoding. The second would have the current name for backward compat, while the first would have a totally new name (like ending in _b or something).

This way, if you need text, you can get that, and if you need bytes, you can get that instead. No code should break.

5 can be done in a backward compatible way by introducing a new callable or specifying a new argument that defaults to dealing with encodings.

This requires quite some deep changes to the code. I don't think that it's a bad idea though. It might raise the efficiency/performance of the code as well.

My patch simply adds the default argument, so that you could just call listdir(".") as before, and you will get UTF-8. You will also be able to do listdir(".", "sjis") or listdir(".", encoding="sjis") or some other encoding of your choice.

Sorry, seems I was a little bit naive about the changes. I ran tests with pytest, but quite a lot failed.
My new patch fixes listdir() and related functions and also readlines() for the result of exec_command().
I tested this and also ran pytest to ensure not breaking code.
The new style after including the patch would be:

(in_, out_, err_) = ssh.exec_command("ls -la")
print(out_.readlines(encoding="sjis"))  # for sjis
# print(out_.readlines(50, "sjis"))  # sizehint=50 and encoding=sjis
# print(out_.readlines())  # for utf-8 
ssh.listdir() # for utf-8
ssh.listdir(".", "sjis") # or ssh.listdir(encoding="sjis") 

And this would be the necessary changes:
add_encoding.patch.txt

It might still be a good idea to add the error option as well... I'll leave that to the decision of the paramiko developers/maintainers ;)

Instead of attaching patch files, you should open a pull-request (and you can then push up additional updates as you refine it). A pull request is more convenient for viewing/reviewing the diff, and for pulling down with the git client for testing, and you can still get the raw diff if you want like https://github.com/paramiko/paramiko/pull/1693.patch (for example)

I just didn't want to make a fork...
Well, okay. I'll do it. Maybe this evening (Tokyo time)

Can we get some speed behind this. This is still a much awaited functionality which is still missing since 2015!

Just try my fork, it will work for all encodings.
Not as beautiful as it could be, but it works flawlessly.
I already created a PR, but there has been no response of any kind.
I covered stat(), open(), utime(), listdir(), listdir_attr() and many more.

Just try my fork, it will work for all encodings.
Not as beautiful as it could be, but it works flawlessly.
I already created a PR, but there has been no response of any kind.
I covered stat(), open(), utime(), listdir(), listdir_attr() and many more.

For now I will be using your fork, thanks for creating a fix for this problem. I hope this will get merged soon since now I will be using the original paramiko and the forked one under another name at the same time.

I'm experiencing this problem right now. Migrating files from an on prem windows SFTP server to a cloud Linux based SFTP server. Every special character (damn all who name folders and files with special characters) blows up my script. Cant change the folder names because it will break existing automated processes for us and our clients. This issue might not crop up often, but it is making this library unusable for me.

Why don't you just use surrogateescape in py3compat.b() and u() and it will at least work for any filenames in python 3.
Like s.encode(encoding, 'surrogateescape') and s.decode(encoding, 'surrogateescape').
I currently monkey patch b() and u() to get it to work with mixed encodings.

What ever the solution, the authors keep ignoring them all. I already gave up on paramiko and went over to ssh2-python. It's much faster as well. But not as easy to use though.
This issue here is quite a huge drawback, I wonder why nothing is done. Maybe they want to completely re-write something including a solution...

@bitprophet any update on this ? This is still an annoying issue ...

Was this page helpful?
0 / 5 - 0 ratings