macOS generates a .DS_Store file in my training data set directory, and load_files loads it and raises an exception like
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 1116
How can I filter out the .DS_Store file without deleting it?
Looking at the documentation, there doesn't seem to be any way to filter directly in load_files (or, rather, you can whitelist categories, but you can't whitelist files within the categories, or blacklist at either level).

You might want to consider filing a feature request with the scikit-learn project. Alternatively, you might consider it a bug that hidden files are loaded (with "hidden" defined appropriately for the platform, but on OS X and other POSIX systems that should include files whose names start with .), and file a bug report on that.

Meanwhile, there is a load_content flag that you can set. Pass False, and it will just find the filenames for you, which you can then filter however you want (e.g., filenames = (filename for filename in ret.filenames if not os.path.basename(filename).startswith('.'))), then load manually. The check goes through os.path.basename because the returned filenames are paths that include the container and category directories, so testing the whole path for a leading . would not catch .DS_Store. This seems like the best solution available with the given tools.
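As a rough sketch of the load_content=False route, assuming Python 3 and UTF-8 text files: the helper names drop_hidden and load_visible_files are made up for illustration and are not part of scikit-learn, and the encoding default is an assumption about your data.

```python
import os


def drop_hidden(filenames, targets):
    """Keep only the (filename, target) pairs whose base name does not start with '.'."""
    return [
        (filename, target)
        for filename, target in zip(filenames, targets)
        if not os.path.basename(filename).startswith(".")
    ]


def load_visible_files(container_path, encoding="utf-8"):
    """Hypothetical helper: load every non-hidden file; returns (texts, targets)."""
    # Imported lazily so drop_hidden() can be used without scikit-learn installed.
    from sklearn.datasets import load_files

    # load_content=False only collects paths and labels; nothing is decoded yet.
    bunch = load_files(container_path, load_content=False)
    texts, targets = [], []
    for filename, target in drop_hidden(bunch.filenames, bunch.target):
        with open(filename, encoding=encoding) as f:
            texts.append(f.read())
        targets.append(target)
    return texts, targets
```

Since .DS_Store is never opened, the UnicodeDecodeError should not be triggered by it; the category names are still available from the target_names attribute of the object load_files returns, if you need them.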
On the other hand, given how simple load_files actually is (especially if you don't use the extra features like categories or shuffle), it might be simpler to just not use it, and instead use os.walk or just os.listdir. In this case, given that the files are exactly 2 levels deep, rather than at an arbitrary depth, the latter is probably simpler.
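Something along these lines might do for the os.listdir version. It is only a sketch, assuming a container/category/file layout, Python 3, and UTF-8 files; the function name load_two_level_dir and its return shape are my own choices, not anything from scikit-learn.

```python
import os


def load_two_level_dir(container_path, encoding="utf-8"):
    """Read container_path/<category>/<file>, skipping names that start with '.'.

    Returns (texts, targets, target_names), where targets[i] indexes target_names.
    """
    # Category folders, sorted so the integer labels are stable between runs.
    target_names = sorted(
        name
        for name in os.listdir(container_path)
        if not name.startswith(".")
        and os.path.isdir(os.path.join(container_path, name))
    )
    texts, targets = [], []
    for label, category in enumerate(target_names):
        category_dir = os.path.join(container_path, category)
        for name in sorted(os.listdir(category_dir)):
            path = os.path.join(category_dir, name)
            # Skip hidden entries such as .DS_Store, and anything that is not a file.
            if name.startswith(".") or not os.path.isfile(path):
                continue
            with open(path, encoding=encoding) as f:
                texts.append(f.read())
            targets.append(label)
    return texts, targets, target_names
```

You would call it as texts, targets, target_names = load_two_level_dir(path_to_train_dir), where path_to_train_dir is whatever directory you were passing to load_files.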