I’m working on a web parser in Python and am now stuck on special symbols like ★, ✿, • and others. Sometimes I get them as "&#10047;" and sometimes as a unicode string: u"\xe2\x80\xa2". I have found a table of them, but the only thing I can do is:
set = []
set.append([u"\xe2\x80\xa2", "•"])
set.append(["&#10047;", "✿"])
for i in set:
    s = s.replace(i[0], i[1])
I wrote these entries by hand, because I could not find a table that maps the forms on the left to the characters on the right.
Can you help me, please?
Given a Unicode string containing a single character, it can be converted to the HTML syntax by writing its code point as a decimal number between "&#" and ";".
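A minimal sketch of that conversion, assuming Python 3; the helper name to_numeric_ref is made up for illustration and is not part of any library:

```python
def to_numeric_ref(ch):
    """Return the decimal HTML reference for a single character."""
    # ord() gives the Unicode code point of a one-character string.
    return "&#{};".format(ord(ch))


s = "✿"
print(to_numeric_ref(s))  # &#10047;
```

On Python 2 the same idea applies, provided s is a unicode string (u"✿") rather than a byte string.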
To convert back, extract the number by stripping off the leading "&#" and the trailing ";", convert it to an integer, and then use chr (Python 3) or unichr (Python 2).
If you need to deal with input that did not come from the above conversion, you may need to handle hexadecimal references too, which look like &#xZZZ; where ZZZ is a run of hexadecimal digits. To detect these, check whether the number part starts with an x; if it does, parse the remainder with radix 16.
Furthermore, you may need to deal with named entities. See the last two paragraphs for that.
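The decoding steps above can be sketched as follows, assuming Python 3. The function name decode_numeric_refs and the regular expression are illustrative choices, not something taken from a library, and named entities are deliberately left untouched:

```python
import re

# Matches decimal (&#10047;) and hexadecimal (&#x273F;) references.
_NUMERIC_REF = re.compile(r"&#([xX][0-9a-fA-F]+|[0-9]+);")


def decode_numeric_refs(text):
    """Replace numeric HTML references in text with the characters they name."""

    def replace(match):
        body = match.group(1)
        if body[0] in "xX":
            code_point = int(body[1:], 16)  # hexadecimal form
        else:
            code_point = int(body)          # decimal form
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Not a valid code point: leave the reference as it was.
            return match.group(0)

    return _NUMERIC_REF.sub(replace, text)


print(decode_numeric_refs("&#10047; &#x2022; &amp;"))  # ✿ • &amp;
```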
If you want Python to deal with the encoding of a whole string, you can encode it to ASCII with the xmlcharrefreplace error handler.
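A short sketch of whole-string encoding, assuming Python 3 and assuming that the standard xmlcharrefreplace error handler is the mechanism meant here; the wrapper name encode_refs is hypothetical:

```python
def encode_refs(text):
    """Replace every non-ASCII character with a decimal HTML reference."""
    # Characters that ASCII cannot represent go through the
    # "xmlcharrefreplace" handler, which emits &#N; for each of them.
    encoded = text.encode("ascii", "xmlcharrefreplace")
    # Decode again so the caller gets a str rather than bytes.
    return encoded.decode("ascii")


print(encode_refs("★ ✿ • <b>"))  # &#9733; &#10047; &#8226; <b>
```

Note that the ASCII characters < and > pass through unchanged.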
Unfortunately, there is no equivalent for decoding, and this also does not escape potentially hazardous HTML characters such as < (which may or may not be what you want). If you need to decode, perhaps use a proper HTML parser, which will also be able to deal with named entities like &clubs; (♣).
If you want to deal with named entities and do not want to use a real HTML parser, there is a machine-readable list of entities (readable with Python’s json module).
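For named entities, here is a sketch that stays inside the standard library, assuming Python 3.4 or later, where html.unescape and the html.entities tables are available. The wrapper names are hypothetical, and this is an alternative to loading the JSON list, not the same thing:

```python
import html
from html.entities import name2codepoint


def decode_entities(text):
    """Decode numeric and named references in one pass."""
    return html.unescape(text)


def named_entity_to_char(name):
    """Look up a single entity name (without & and ;), e.g. 'clubs'."""
    return chr(name2codepoint[name])


def escape_and_encode(text):
    """Escape <, > and & first, then turn non-ASCII characters into references."""
    escaped = html.escape(text)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(decode_entities("&clubs; &#10047; &#x2022; &lt;"))  # ♣ ✿ • <
print(named_entity_to_char("clubs"))                      # ♣
print(escape_and_encode("✿ < •"))                         # &#10047; &lt; &#8226;
```

Escaping before encoding matters here: doing it the other way round would turn the & of each &#N; reference into &amp;.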