I have a Python project containing hundreds of modules. In Python 2.6, the encoding of source files (modules) must be ASCII unless
an explicit encoding declaration is present. Is there a simple way to find out which Python modules contain non-ASCII characters, so that I can correct them?
Regards,
Have a look at the chardet Python package. You can use the same os.walk approach as agf, call the chardet.detect method on each file, and flag files that are not detected as ASCII (or that are detected with a low confidence value).

This does leave some room for error, though, so if you want to be more sure, you could also scan each file for characters that are unlikely to appear in a Python file (non-alphanumeric, non-punctuation, etc.). However, this will not detect things like UTF-16 characters whose two bytes have the same values as two 7-bit, zero-padded ASCII characters, e.g. U+4141 (decimal 16705) <-> AA.

That said, if the characters that you want to exclude come from a limited number of character sets, you should be able to locate them with high confidence.
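As a rough illustration of the scanning idea, here is a minimal sketch that uses only the standard library instead of chardet: it walks a directory tree and reports the first line of each .py file that fails to decode as ASCII. The function name find_non_ascii_modules and its return format are made up for this example, and it has the same blind spot described above for bytes that all fall in the 7-bit range.

```python
import os


def find_non_ascii_modules(root):
    """Return (path, line_number) for the first non-ASCII line of each .py file under root.

    Hypothetical helper for illustration; files are read as raw bytes so
    no encoding is assumed up front.
    """
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                for lineno, line in enumerate(handle, 1):
                    try:
                        # Any byte above 0x7F makes the ASCII decode fail.
                        line.decode("ascii")
                    except UnicodeDecodeError:
                        hits.append((path, lineno))
                        break  # one report per file is enough
    return hits
```

Calling find_non_ascii_modules(".") from the project root and printing each pair should give a list of files to fix or to mark with an encoding declaration. Files that declare an encoding will still be reported, so you may want to check the first two lines of each hit before changing anything.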