I have an application written in Python 2.7 that reads users' files from the hard drive using os.walk.
The application requires a UTF-8 system locale (we check the env variables before it starts) because we handle files with Unicode characters (audio files with the artist name in it for example), and want to make sure we can save these files with the correct file name to the filesystem.
Some of our users have UTF-8 locales (therefore a UTF-8 fs), but still somehow manage to have ISO-8859-1 files stored on their drive. This causes problems when our code tries to os.walk() these directories as Python throws an exception when trying to decode this sequence of ISO-8859-1 bytes using UTF-8.
So my question is: how do I get Python to ignore such a file and move on to the next one instead of aborting the entire os.walk()? Should I just roll my own os.walk() function?
Edit: Until now we've been telling our users to use the convmv Linux command to correct their filenames. However, many users have files in several different encodings (8859-1, 8859-2, etc.), and convmv requires the user to make an educated guess about which files have which encoding before running it on each one individually.
Please read Unicode filenames, part of the Python Unicode how-to. Most importantly, filesystem encodings are not necessarily the same as the current LANG setting in the terminal.
Specifically,
os.walk is built upon os.listdir, and will thus switch between unicode and 8-bit bytes depending on whether or not you give it a unicode path. Pass it an 8-bit path instead, and your code will work properly; then decode from UTF-8 or ISO 8859-1 as needed.
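As a rough sketch of that approach (not a drop-in fix): the helper names decode_name and walk_bytes below are made up for illustration, and the choice of ISO 8859-1 as the only fallback is an assumption. Since your users have several encodings, you may need a smarter guess there. The raw byte path is kept for opening the file, and the decoded name is only used for display or metadata.

```python
import os

def decode_name(raw, fallback="iso-8859-1"):
    """Decode a byte filename as UTF-8, else with a fallback encoding.

    The fallback is an assumption; ISO 8859-1 never fails to decode,
    but it may produce the wrong characters for other encodings.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)

def walk_bytes(top):
    """Yield (raw_path, decoded_name) for every file under top.

    Hypothetical helper: walking with a byte path makes os.walk
    return byte strings, so no decoding happens inside os.walk.
    """
    if not isinstance(top, bytes):
        # On Python 2, str is already bytes; only unicode is encoded.
        top = top.encode("utf-8")
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            yield os.path.join(dirpath, name), decode_name(name)
```

You would then iterate with something like `for raw_path, name in walk_bytes("/home/user/music")`, passing raw_path to open() and using name wherever you need unicode text. If you would rather skip undecodable files than guess, catch the UnicodeDecodeError in the loop and continue instead of falling back.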