After crawling many websites, I receive data with broken encoding from some of them. I can't do anything with it; I just need to detect it. For example, text like:
·ç¼wÃdª«¦Ê³f
or
ãà³n³¾å¢
How can I recognize text like that? The pages can be in any language, so searching for non-English text is not an option. The only option I can think of is the guess-language module.
NLTK has a guess_encoding function that takes a byte string and tries all of the available encodings. Would this serve your purpose?
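To illustrate the "try each encoding in turn" idea with only the standard library, here is a minimal sketch. It is not NLTK's implementation; the function names `first_clean_decoding` and `looks_broken` and the default candidate list are assumptions made for this example. It only works while you still have the raw bytes from the crawl, before they have been decoded into text.

```python
def first_clean_decoding(data, candidates=("ascii", "utf-8")):
    """Return (encoding, text) for the first candidate that decodes
    `data` without error, or None if every candidate fails.

    `candidates` is a hypothetical default; extend it with the
    encodings you expect to meet (e.g. "shift_jis", "big5").
    """
    for encoding in candidates:
        try:
            return encoding, data.decode(encoding)
        except UnicodeDecodeError:
            # This encoding cannot represent the byte sequence; try the next.
            continue
    return None


def looks_broken(data, candidates=("ascii", "utf-8")):
    """Flag byte strings that none of the candidate encodings can decode."""
    return first_clean_decoding(data, candidates) is None
```

One caveat: do not put "latin-1" in the candidate list as a test, because it maps every possible byte to a character and therefore never fails. A page that passes this check can still be garbled if the bytes happen to be valid in the wrong encoding, so treat it as a first filter rather than a guarantee.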