I’m trying to parse a web page using Python’s beautiful soup Python parser, and am running into an issue.
The header of the HTML we get from them declares a utf-8 character set, so Beautiful Soup encodes the whole document in utf-8, and indeed the HTML tags are encoded in UTF-8 so we get back a nicely structured HTML page.
The trouble is, this stupid website injects gb2312-encoded body text into the page that gets parsed as utf-8 by beautiful soup. Is there a way to convert the text from this “gb2312 pretending to be utf-8” state to “proper expression of the character set in utf-8?”
The simplest way might be to parse the page twice, once as UTF-8, and once as GB2312. Then extract the relevant section from the GB2312 parse.
I don’t know much about GB2312, but looking it up it appears to at least agree with ASCII on the basic letters, numbers, etc. So you should still be able to parse the HTML structure using GB2312, which would hopefully give you enough information to extract the part you need.
This may be the only way to do it, actually. In general, GB2312-encoded text won’t be valid UTF-8, so trying to decode it as UTF-8 should lead to errors. The BeautifulSoup documentation says:
This makes it sound like BeautifulSoup just ignores decoding errors and replaces the erroneous characters with U+FFFD. If this is the case (i.e., if your document has
contains_replacement_characters == True), then there is no way to get the original data back from document once it’s been decoded as UTF-8. You will have to do something like what I suggested above, decoding the entire document twice with different codecs.