The Archive Base

Editorial Team
Asked: May 14, 2026 (2026-05-14T04:43:19+00:00)

Trying to decode an invalid encoded UTF-8 HTML page gives different results in Python, Firefox and Chrome

Decoding an HTML page that contains invalid UTF-8 gives different results in
Python, Firefox and Chrome.

The invalidly encoded fragment from the test page looks like 'PREFIX\xe3\xabSUFFIX':

>>> fragment = 'PREFIX\xe3\xabSUFFIX'
>>> fragment.decode('utf-8', 'strict')
...
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-8: invalid data

UPDATE: This question led to a bug report against Python's Unicode component. The issue is reported to be fixed in Python 2.7.11 and 3.5.2.


What follows are the replacement policies used to handle decoding errors in
Python, Firefox and Chrome. Note how they differ, and especially how the
Python builtin removes the valid S along with the invalid sequence of bytes.

Python

The builtin replace error handler replaces the invalid \xe3\xab plus the
S from SUFFIX with a single U+FFFD:

>>> fragment.decode('utf-8', 'replace')
u'PREFIX\ufffdUFFIX'
>>> print _
PREFIX�UFFIX

Browsers

To test how browsers decode the invalid sequence of bytes, we will use a CGI script:

#!/usr/bin/env python
print """\
Content-Type: text/plain; charset=utf-8

PREFIX\xe3\xabSUFFIX"""

Firefox and Chrome browsers rendered:

PREFIX�SUFFIX
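
The script above is Python 2, where a plain string literal already holds raw bytes. On Python 3 the same experiment needs a bytes literal written to the binary stdout stream, otherwise the interpreter would re-encode the text. A minimal sketch, assuming Python 3 and a CGI-capable server; the helper names build_response and emit are made up for illustration:

```python
#!/usr/bin/env python3
import sys

# CGI header followed by the blank line that separates it from the body.
HEADER = b"Content-Type: text/plain; charset=utf-8\n\n"


def build_response(body):
    """Return the full CGI response: header, blank line, then the raw body bytes."""
    return HEADER + body


def emit(body):
    """Write the response through the binary layer so no re-encoding happens."""
    sys.stdout.buffer.write(build_response(body))
    sys.stdout.buffer.flush()


# As a CGI script, the last line would be:
#     emit(b"PREFIX\xe3\xabSUFFIX")
```

Keeping the body as bytes from start to finish is what lets the invalid sequence reach the browser unchanged.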

Why is the builtin replace error handler for str.decode removing the S from SUFFIX?

(Was UPDATE 1)

According to Wikipedia's UTF-8 article (thanks mjv),
the following ranges of bytes are used to indicate the start of a multi-byte
sequence:

  • 0xC2-0xDF : Start of 2-byte sequence
  • 0xE0-0xEF : Start of 3-byte sequence
  • 0xF0-0xF4 : Start of 4-byte sequence

The test fragment 'PREFIX\xe3\xabSUFFIX' contains 0xE3, which tells the Python
decoder that a 3-byte sequence follows. The sequence is found to be invalid, and
the decoder discards the whole sequence, including '\xabS', then continues after
it, ignoring any possibly valid sequence that starts in the middle.

This means that an invalid encoded sequence like '\xF0SUFFIX' decodes to
u'\ufffdFIX' instead of u'\ufffdSUFFIX'.
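
The policy described above can be modelled in a few lines. This is only a sketch, written in Python 3 syntax, of the skip-by-lead-byte behaviour as this question describes it; it is not CPython's actual decoder, and the names expected_length and legacy_replace are hypothetical.

```python
REPLACEMENT = "\ufffd"


def expected_length(lead):
    """Sequence length announced by a lead byte (1 for any other byte)."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 1


def legacy_replace(data):
    """Model of the old policy: on error, skip as many bytes as the lead byte announced."""
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        chunk = data[i:i + n]
        try:
            # A well-formed chunk decodes to exactly one character.
            out.append(chunk.decode("utf-8"))
        except UnicodeDecodeError:
            # The whole announced length is thrown away, valid bytes included.
            out.append(REPLACEMENT)
        i += n
    return "".join(out)


print(repr(legacy_replace(b"PREFIX\xe3\xabSUFFIX")))
print(repr(legacy_replace(b"\xf0SUFFIX")))
```

Under this model the first call loses the S and the second loses SUF, matching the results shown in the question.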

Example 1: Introducing DOM parsing bugs

>>> '<div>\xf0<div>Price: $20</div>...</div>'.decode('utf-8', 'replace')
u'<div>\ufffdv>Price: $20</div>...</div>'
>>> print _
<div>�v>Price: $20</div>...</div>

Example 2: Security issues (Also see Unicode security considerations):

>>> '\xf0<!-- <script>alert("hi!");</script> -->'.decode('utf-8', 'replace')
u'\ufffd- <script>alert("hi!");</script> -->'
>>> print _
�- <script>alert("hi!");</script> -->

Example 3: Removing valid information in a scraping application

>>> '\xf0' + u'it\u2019s'.encode('utf-8') # "it’s"
'\xf0it\xe2\x80\x99s'
>>> _.decode('utf-8', 'replace')
u'\ufffd\ufffd\ufffds'
>>> print _
���s

Using a cgi script to render this in browsers:

#!/usr/bin/env python
print """\
Content-Type: text/plain; charset=utf-8

\xf0it\xe2\x80\x99s"""

Rendered:

�it’s

Is there an officially recommended way of handling decoding replacements?

(Was UPDATE 2)

In a public review, the Unicode Technical Committee has opted for option 2
of the following candidates:

  1. Replace the entire ill-formed subsequence by a single U+FFFD.
  2. Replace each maximal subpart of the ill-formed subsequence by a single U+FFFD.
  3. Replace each code unit of the ill-formed subsequence by a single U+FFFD.

The UTC resolution is dated 2008-08-29. Source: http://www.unicode.org/review/resolved-pri-100.html

UTC Public Review 121 also includes an invalid bytestream as an example,
'\x61\xF1\x80\x80\xE1\x80\xC2\x62', and shows the decoding result for each
option.

            61      F1      80      80      E1      80      C2      62
      1   U+0061  U+FFFD                                          U+0062
      2   U+0061  U+FFFD                  U+FFFD          U+FFFD  U+0062
      3   U+0061  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+0062

In plain Python the three results are:

  1. u'a\ufffdb' shows as a�b
  2. u'a\ufffd\ufffd\ufffdb' shows as a���b
  3. u'a\ufffd\ufffd\ufffd\ufffd\ufffd\ufffdb' shows as a������b

And here is what python does for the invalid example bytestream:

>>> '\x61\xF1\x80\x80\xE1\x80\xC2\x62'.decode('utf-8', 'replace')
u'a\ufffd\ufffd\ufffd'
>>> print _
a���

Again, using a cgi script to test how browsers render the buggy encoded bytes:

#!/usr/bin/env python
print """\
Content-Type: text/plain; charset=utf-8

\x61\xF1\x80\x80\xE1\x80\xC2\x62"""

Both Chrome and Firefox rendered:

a���b

Note that the result rendered by the browsers matches option 2 of the PR121 recommendation.

While option 3 looks easy to implement in Python, options 2 and 1 are a challenge.

>>> import codecs
>>> replace_option3 = lambda exc: (u'\ufffd', exc.start+1)
>>> codecs.register_error('replace_option3', replace_option3)
>>> '\x61\xF1\x80\x80\xE1\x80\xC2\x62'.decode('utf-8', 'replace_option3')
u'a\ufffd\ufffd\ufffd\ufffd\ufffd\ufffdb'
>>> print _
a������b
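
Options 2 and 1 can be approximated with a small hand-written decoder instead of an error handler. The sketch below, in Python 3 syntax, walks the bytes using the well-formed byte ranges from Table 3-7 of the Unicode standard and emits one U+FFFD per maximal subpart (option 2). Passing collapse_runs=True merges adjacent replacements into one, which is one reading of option 1. The names decode_replace, collapse_runs, second_byte_range and continuation_count are hypothetical, and this is not how any browser or CPython is known to implement it.

```python
REPLACEMENT = "\ufffd"


def second_byte_range(lead):
    """Allowed range for the byte right after `lead` (Unicode Table 3-7)."""
    if lead == 0xE0:
        return 0xA0, 0xBF
    if lead == 0xED:
        return 0x80, 0x9F
    if lead == 0xF0:
        return 0x90, 0xBF
    if lead == 0xF4:
        return 0x80, 0x8F
    return 0x80, 0xBF


def continuation_count(lead):
    """Number of continuation bytes a lead byte announces, or None if invalid."""
    if 0xC2 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF4:
        return 3
    return None


def decode_replace(data, collapse_runs=False):
    """Decode UTF-8 bytes, replacing each maximal ill-formed subpart with U+FFFD."""
    out = []
    prev_was_error = False
    i = 0
    while i < len(data):
        lead = data[i]
        j = i + 1
        if lead < 0x80:
            ok = True
        else:
            need = continuation_count(lead)
            ok = False
            if need is not None:
                lo, hi = second_byte_range(lead)
                # Consume only bytes that can legally continue this sequence.
                while j < len(data) and j - i <= need and lo <= data[j] <= hi:
                    j += 1
                    lo, hi = 0x80, 0xBF
                ok = (j - i == need + 1)
        if ok:
            out.append(data[i:j].decode("utf-8"))
        elif not (collapse_runs and prev_was_error):
            out.append(REPLACEMENT)
        prev_was_error = not ok
        i = j  # Resume at the first byte that was not consumed.
    return "".join(out)


example = b"\x61\xF1\x80\x80\xE1\x80\xC2\x62"
print(repr(decode_replace(example)))                      # option 2
print(repr(decode_replace(example, collapse_runs=True)))  # option 1
```

Because the decoder stops consuming at the first byte that cannot continue the current sequence, a valid byte such as the final 0x62 is never swallowed by an error.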
1 Answer

  1. Editorial Team
    Added an answer on May 14, 2026 at 4:43 am

    You know that your S is valid, with the benefit of both look-ahead and hindsight 🙂 Suppose there was originally a legal 3-byte UTF-8 sequence there, and the 3rd byte was corrupted in transmission … with the change that you mention, you’d be complaining that a spurious S had not been replaced. There is no “right” way of doing it without the benefit of error-correcting codes, or a crystal ball, or a tambourine.

    Update

    As @mjv remarked, the UTC issue is all about how many U+FFFD should be included.

    In fact, Python is not using ANY of the UTC’s 3 options.

    Here is the UTC’s sole example:

          61      F1      80      80      E1      80      C2      62
    1   U+0061  U+FFFD                                          U+0062
    2   U+0061  U+FFFD                  U+FFFD          U+FFFD  U+0062
    3   U+0061  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+FFFD  U+0062
    

    Here is what Python does:

    >>> bad = '\x61\xf1\x80\x80\xe1\x80\xc2\x62cdef'
    >>> bad.decode('utf8', 'replace')
    u'a\ufffd\ufffd\ufffdcdef'
    >>>
    

    Why?

    F1 should start a 4-byte sequence, but the E1 is not valid. One bad sequence, one replacement.
    Start again at the next byte, the 3rd 80. Bang, another FFFD.
    Start again at the C2, which introduces a 2-byte sequence, but C2 62 is invalid, so bang again.

    It’s interesting that the UTC didn’t mention what Python is doing (restarting after the number of bytes indicated by the lead character). Perhaps this is actually forbidden or deprecated somewhere in the Unicode standard. More reading required. Watch this space.

    Update 2 Houston, we have a problem.

    === Quoted from Chapter 3 of Unicode 5.2 ===

    Constraints on Conversion Processes

    The requirement not to interpret any ill-formed code unit subsequences in a string as characters (see conformance clause C10) has important consequences for conversion processes.

    Such processes may, for example, interpret UTF-8 code unit sequences as Unicode character
    sequences. If the converter encounters an ill-formed UTF-8 code unit sequence which
    starts with a valid first byte, but which does not continue with valid successor bytes (see
    Table 3-7), it must not consume the successor bytes as part of the ill-formed subsequence
    whenever those successor bytes themselves constitute part of a well-formed UTF-8 code
    unit subsequence.

    If an implementation of a UTF-8 conversion process stops at the first error encountered,
    without reporting the end of any ill-formed UTF-8 code unit subsequence, then the
    requirement makes little practical difference. However, the requirement does introduce a
    significant constraint if the UTF-8 converter continues past the point of a detected error,
    perhaps by substituting one or more U+FFFD replacement characters for the uninterpretable,
    ill-formed UTF-8 code unit subsequence. For example, with the input UTF-8 code
    unit sequence <C2 41 42>, such a UTF-8 conversion process must not return <U+FFFD>
    or <U+FFFD, U+0042>, because either of those outputs would be the result of
    misinterpreting a well-formed subsequence as being part of the ill-formed
    subsequence. The expected return value for such a process would instead be
    <U+FFFD, U+0041, U+0042>.

    For a UTF-8 conversion process to consume valid successor bytes is not only non-conformant,
    but also leaves the converter open to security exploits. See Unicode Technical Report
    #36, “Unicode Security Considerations.”

    === End of quote ===

    It then goes on to discuss at length, with examples, the “how many FFFD to emit” issue.

    Using their example in the 2nd last quoted paragraph:

    >>> bad2 = "\xc2\x41\x42"
    >>> bad2.decode('utf8', 'replace')
    u'\ufffdB'
    # FAIL
    

    Note that this is a problem with both the 'replace' and 'ignore' options of str.decode('utf_8'). It’s all about omitting data, not about how many U+FFFD are emitted; get the data-emitting part right and the U+FFFD issue falls out naturally, as explained in the part that I didn’t quote.
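
The <C2 41 42> example from the quoted text makes a convenient regression check for any decoder. The sketch below is written in Python 3 syntax, so it uses bytes.decode rather than the Python 2 str.decode shown above; the helper name passes_c2_41_42_check is hypothetical. On an interpreter that includes the fix mentioned in the question, both checks would be expected to report True.

```python
def passes_c2_41_42_check(decode):
    """True if `decode` keeps the well-formed bytes 41 42 that follow the bad C2."""
    return decode(b"\xc2\x41\x42").endswith("\x41\x42")


def builtin_replace(data):
    """Builtin decoder with the 'replace' error handler."""
    return data.decode("utf-8", "replace")


def builtin_ignore(data):
    """Builtin decoder with the 'ignore' error handler."""
    return data.decode("utf-8", "ignore")


replace_ok = passes_c2_41_42_check(builtin_replace)
ignore_ok = passes_c2_41_42_check(builtin_ignore)
print("replace keeps valid successors:", replace_ok)
print("ignore keeps valid successors:", ignore_ok)
```

The check only looks at whether A and B survive, so it is independent of how many U+FFFD a decoder chooses to emit.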

    Update 3 Current versions of Python (including 2.7) have unicodedata.unidata_version as '5.1.0' which may or may not indicate that the Unicode-related code is intended to conform to Unicode 5.1.0. In any case, the wordy prohibition of what Python is doing didn’t appear in the Unicode standard until 5.2.0. I’ll raise an issue on the Python tracker without mentioning the word 'oht'.encode('rot13').

    Reported here

© 2021 The Archive Base. All Rights Reserved