I’m using this function to unescape HTML entities

Asked: May 27, 2026

I’m using this function to unescape HTML entities:

import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.

def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub(r"&#?\w+;", fixup, text)

Most of the text I process works, but for some input Python throws this error:

File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
  return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xae' in position 348: character maps to <undefined>

I have tried encoding the text string a million different ways (ASCII, UTF, unicode, all that stuff which I really don’t understand), and nothing has worked so far.

1 Answer

Editorial Team · Answer 1 · 2026-05-27T21:13:02+00:00

Based on the error message, it looks like you may be attempting to convert a Unicode string into CP437 (an IBM PC character set). This doesn’t appear to be occurring in your function, but it could happen when printing the resulting string to your console. I ran a quick test with the input string "&#xae; some text" (the character reference for ®) and was able to reproduce the failure when printing the resulting string:

print unescape("&#xae; some text")
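The same failure can be checked without a console by trying the encoding step directly. This is only a sketch: `can_encode` is a hypothetical helper that is not part of the original post, and `sample` stands in for the value that `unescape("&#xae; some text")` returns. It avoids the `print` statement so that it should run on Python 2.7 as well as Python 3.

```python
# Hypothetical helper (not from the original post): report whether every
# character of a Unicode string can be represented in a given encoding.
def can_encode(text, encoding):
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Stand-in for the result of unescape("&#xae; some text"):
# U+00AE is the registered sign.
sample = u"\xae some text"

print(can_encode(sample, "cp437"))  # False: CP437 has no registered sign
print(can_encode(sample, "utf-8"))  # True: UTF-8 covers all of Unicode
```

If the first check fails for your console's code page, printing the Unicode string directly to that console will raise the same `UnicodeEncodeError`.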

You can avoid this by specifying the encoding you want to convert the unicode string to:

print unescape("&#xae; some text").encode('utf-8')

If you print this encoded string to the console, you’ll see stray non-ASCII characters in place of the ones you expect. However, if you write it to a file and open it in a viewer that supports UTF-8 encoded documents, you should see the characters you expect.
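Writing the result to a file could look roughly like the sketch below. The helper names `write_utf8` and `read_utf8` and the file name `unescaped.txt` are assumptions made for illustration, and `sample` again stands in for the output of `unescape`. `io.open` is used because it accepts an `encoding` argument on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile


# Hypothetical helpers (not from the original post).
def write_utf8(path, text):
    """Write a Unicode string to `path`, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def read_utf8(path):
    """Read `path` back as a Unicode string, decoding from UTF-8."""
    with io.open(path, "r", encoding="utf-8") as handle:
        return handle.read()


# Stand-in for the result of unescape("&#xae; some text").
sample = u"\xae some text"

# Illustrative location: a file inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "unescaped.txt")
write_utf8(path, sample)

# The text survives the round trip unchanged.
print(read_utf8(path) == sample)  # True
```

Letting the file object do the encoding keeps the string as Unicode everywhere else in the program, so `.encode('utf-8')` is only needed at the point where bytes actually leave it.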
