I’m trying to parse a web page using Python’s Beautiful Soup parser, and

Question

Asked: June 9, 2026 (2026-06-09T00:44:04+00:00)

I’m trying to parse a web page using Python’s Beautiful Soup parser and am running into an issue.

The header of the HTML we get from them declares a UTF-8 character set, so Beautiful Soup decodes the whole document as UTF-8. The HTML tags really are UTF-8, so we get back a nicely structured HTML page.

The trouble is, this stupid website injects GB2312-encoded body text into the page, and Beautiful Soup parses that as UTF-8 too. Is there a way to convert the text from this “GB2312 pretending to be UTF-8” state into a proper UTF-8 representation of the same characters?

1 Answer

Editorial Team · Answer 1 · 2026-06-09T00:44:06+00:00

The simplest way might be to parse the page twice, once as UTF-8 and once as GB2312, then extract the relevant section from the GB2312 parse.

I don’t know much about GB2312, but from what I can find it agrees with ASCII on the basic letters, digits, and punctuation. So you should still be able to parse the HTML structure using GB2312, which would hopefully give you enough information to extract the part you need.
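As a rough sketch of the two-parse idea, assuming the bs4 package is installed and that you have the raw response bytes rather than an already-decoded string: the function name parse_both_ways, the sample page, and the "content" id are all made up for illustration. Decoding the bytes yourself and handing Beautiful Soup a str should sidestep its own encoding detection.

```python
from bs4 import BeautifulSoup


def parse_both_ways(raw: bytes):
    """Decode the same bytes twice and parse each result separately."""
    # errors="replace" avoids an exception on bytes that don't fit the codec.
    utf8_soup = BeautifulSoup(raw.decode("utf-8", errors="replace"), "html.parser")
    gb_soup = BeautifulSoup(raw.decode("gb2312", errors="replace"), "html.parser")
    return utf8_soup, gb_soup


# Hypothetical page: ASCII markup declaring UTF-8, with a GB2312-encoded body.
raw = (
    b'<html><head><meta charset="utf-8"><title>Demo</title></head>'
    b'<body><div id="content">' + "中文".encode("gb2312") + b"</div></body></html>"
)

utf8_soup, gb_soup = parse_both_ways(raw)

# Take the parts that really are UTF-8 (or plain ASCII) from the UTF-8 parse...
title = utf8_soup.title.get_text()
# ...and the injected body text from the GB2312 parse.
body_text = gb_soup.find("div", id="content").get_text()
```

If the real pages contain characters outside GB2312 proper, Python's gbk or gb18030 codecs cover supersets of it and may be worth trying in place of "gb2312".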

This may be the only way to do it, actually. In general, GB2312-encoded text won’t be valid UTF-8, so trying to decode it as UTF-8 should lead to errors. The BeautifulSoup documentation says:

In rare cases (usually when a UTF-8 document contains text written in a completely different encoding), the only way to get Unicode may be to replace some characters with the special Unicode character “REPLACEMENT CHARACTER” (U+FFFD, �). If Unicode, Dammit needs to do this, it will set the .contains_replacement_characters attribute to True on the UnicodeDammit or BeautifulSoup object.

This makes it sound like Beautiful Soup just ignores decoding errors and replaces the offending characters with U+FFFD. If that is the case (i.e., if your document has contains_replacement_characters == True), then there is no way to get the original data back from the document once it has been decoded as UTF-8. You will have to do something like what I suggested above and decode the entire document twice with different codecs.
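To see why the replaced text can't be recovered, here is a small standard-library illustration. The two-character sample string is just a stand-in for the injected body text, and it shows plain bytes.decode behaviour rather than what Beautiful Soup does internally.

```python
# Stand-in for the injected body text, as GB2312 bytes.
original = "中文".encode("gb2312")

# Strict UTF-8 decoding rejects these bytes outright.
try:
    original.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding succeeds, but each undecodable byte becomes U+FFFD.
lossy = original.decode("utf-8", errors="replace")

# Re-encoding gives the UTF-8 form of U+FFFD, not the original bytes,
# so the GB2312 data is gone once this step has happened.
round_tripped = lossy.encode("utf-8")

# Decoding the untouched bytes with the right codec still works.
recovered = original.decode("gb2312")
```

The practical upshot is to hang on to the raw bytes of the response, since only those can be decoded a second time with a different codec.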
