How to get a Cyrillic string from a document?
I have the following code:
import urllib
from BeautifulSoup import BeautifulSoup
page = urllib.urlopen("http://habrahabr.ru/")
soup = BeautifulSoup(page.read())
for topic in soup.findAll(True, 'topic'):
    print topic
    print
raw_input()
There are Cyrillic words on the site, but Python displays the wrong characters.
I would be very grateful for any help with this issue.
PS.
I changed
soup = BeautifulSoup(page.read())
to
soup = BeautifulSoup(page.read(), fromEncoding="utf-8")
and still no results…
The data on the HTML page is encoded in UTF-8. It appears that you are printing it to your console, where sys.stdout.encoding is cp1251. That accounts for the rubbish that you are seeing.
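To illustrate the mismatch without touching the network, here is a minimal sketch. The sample word and the names text, utf8_bytes, garbled and to_console are my own illustrations, not part of your code or of BeautifulSoup, and I am assuming the console really is cp1251.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: the Russian word "Privet" written with escapes,
# so the file itself stays pure ASCII.
text = u"\u041f\u0440\u0438\u0432\u0435\u0442"

# What the server sends: each Cyrillic letter becomes two bytes in UTF-8.
utf8_bytes = text.encode("utf-8")

# What a cp1251 console does with those bytes: it treats every byte as
# one character, so six letters turn into twelve wrong ones.
garbled = utf8_bytes.decode("cp1251")


def to_console(raw_bytes, console_encoding, source_encoding="utf-8"):
    """Decode the page bytes first, then re-encode them for the console.

    Characters the console encoding cannot represent become '?'.
    """
    decoded = raw_bytes.decode(source_encoding)
    return decoded.encode(console_encoding, "replace")


# Six bytes that a cp1251 console would display as the original word.
fixed = to_console(utf8_bytes, "cp1251")
```

The point of the sketch is the order of operations: decode with the encoding the page actually uses, and only then encode with whatever sys.stdout.encoding reports.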
Here are the results of inspecting the first 8 bytes of the first topic, using IDLE: