I read a webpage that contains Hebrew characters, using:
response = ('').join(opener.open(url).readlines())
The result I get is mixed: some of the characters come back as Unicode, which I can handle,
but some of the response seems garbled, in a format I can't recognize.
An example of the received text is:
שלך
More precisely, it looks like this (only a snippet…):
<h3 class="_52r al aps">About גדי</h3><div>שלך ....</div>
The text between the div tags seems scrambled. Can I convert it to Unicode?
You are looking at HTML entities; use the HTMLParser library to decode these:
To read a full urllib2 response, just use .read():
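A minimal sketch of both steps, not a definitive implementation. It assumes the raw response carries numeric character references (the sample string below is illustrative, written as the entity form of the Hebrew text in the question) and that the page is served as UTF-8, which the question does not state. The helper name fetch_and_decode and its charset parameter are hypothetical. The import fallback is only there so the same snippet also runs on Python 3, where HTMLParser moved and html.unescape does this job.

```python
# -*- coding: utf-8 -*-
try:
    # Python 2, as used in the question (urllib2 era).
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape
except ImportError:
    # Python 3: the equivalent function lives in the html module.
    from html import unescape

# Illustrative input: numeric character references for three Hebrew letters.
snippet = '<div>&#1513;&#1500;&#1498; ....</div>'

# Each &#NNNN; reference becomes the Unicode character with that code point.
decoded = unescape(snippet)


def fetch_and_decode(opener, url, charset='utf-8'):
    """Read the whole response body at once, then decode entities.

    `opener` is any object with an urllib2-style open(url) method.
    `charset` is an assumption; check the page's Content-Type header.
    """
    raw = opener.open(url).read()          # one call instead of joining readlines()
    return unescape(raw.decode(charset))   # bytes -> text -> entities resolved
```

With the opener from the question, fetch_and_decode(opener, url) should return a Unicode string in which the entity runs are replaced by the actual Hebrew characters.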