Can someone confirm that Python 2.6 ftplib does NOT support Unicode file names? Or must Unicode file names be specially encoded in order to be used with the ftplib module?
The following email exchange seems to support my conclusion that the ftplib module only supports ASCII file names.
Should ftplib use UTF-8 instead of latin-1 encoding?
http://mail.python.org/pipermail/python-dev/2009-January/085408.html
Any recommendations on a 3rd party Python FTP module that supports Unicode file names? I’ve googled this question without success [1], [2].
The official Python documentation does not mention Unicode file names [3].
Thank you,
Malcolm
[1] ftputil wraps ftplib and inherits ftplib’s apparent ASCII only support?
[2] Paramiko’s SFTP library does support Unicode file names, however I’m looking specifically for ftp (vs. sftp) support relative to our current project.
[3] http://docs.python.org/library/ftplib.html
WORKAROUND:
The encodings.idna.ToASCII and .ToUnicode methods can be used to convert Unicode path names to an ASCII format. If you wrap all your remote path names and the output of the dir/nlst methods with these functions, then you can create a way to preserve Unicode path names using the standard ftplib (and also preserve Unicode file names on file systems that don’t support Unicode paths). The downside to this technique is that other processes on the server will also have to use encodings.idna when referencing the files that you upload to the server. BTW: I understand that this is an abuse of the encodings.idna library.
Thank you Peter and Bob for your comments which I found very helpful.
ftplibhas no knowledge of Unicode whatsoever. It is intended to be passed byte-strings for filenames, and it’ll return byte strings when asked for a directory list. Those are the exact strings of bytes passed-to/returned-from the server.If you pass a Unicode string to
ftplibin Python 2.x, it’ll end up getting coerced to bytes when it’s sent to the underlying socket object. This coercion uses Python’s ‘default’ encoding, ie. US-ASCII for safety, with exceptions generated for non-ASCII characters.The python-dev message to which you linked is talking about
ftplibin Python 3.x, where strings are Unicode by default. This leaves modules likeftplibin a tricky situation because although they now use Unicode strings at their front-end, the actual protocol behind it is byte-based. There therefore has to be an extra level of encoding/decoding involved, and without explicit intervention to specify what encoding is in use, there’s a fair change it’ll choose wrong.ftplibin 3.x chose to default to ISO-8859-1 in order to preserve each byte as a character inside the Unicode string. Unfortunately this will give unexpected results in the common case where the target server uses a UTF-8 collation for filenames (whether or not the FTP daemon itself knows that filenames are UTF-8, which it commonly won’t). There are a number of cases like this where the Python standard libraries have been brutally hacked to Unicode strings with negative consequences; Python 3’s batteries-included are still leaking corrosive fluid IMO.