The following Python code uses urllib2 and BeautifulStoneSoup to fetch and parse the LibraryThing API information for Tolkien’s ‘The Children of Húrin’.
    import urllib2
    from BeautifulSoup import BeautifulStoneSoup

    URL = ('http://www.librarything.com/services/rest/1.0/'
           '?method=librarything.ck.getwork&id=1907912'
           '&apikey=2a2e596b887f554db2bbbf3b07ff812a')

    soup = BeautifulStoneSoup(urllib2.urlopen(URL),
                              convertEntities=BeautifulStoneSoup.ALL_ENTITIES)
    title_field = soup.find('field', attrs={'name': 'canonicaltitle'})
    print title_field.find('fact').string
Unfortunately, instead of ‘Húrin’, it prints out ‘HÃºrin’. This is obviously an encoding issue, but I can’t work out what I need to do to get the expected output. Help would be greatly appreciated.
In the source of the web page it looks like this:
    The Children of HÃºrin
So the encoding is already broken somewhere on their side before it even gets converted to XML. If it’s a general issue with all the books and you need to work around it, this seems to work:
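A minimal sketch of that kind of workaround, assuming the title is UTF-8 text that was decoded as Latin-1 somewhere upstream (which is what ‘Ã’ followed by ‘º’ in place of ‘ú’ suggests). The idea is to reverse the mistake: encode the string back to Latin-1 to recover the original bytes, then decode those bytes as UTF-8. The helper name `fix_mojibake` is made up for illustration and is not part of BeautifulSoup.

```python
# -*- coding: utf-8 -*-

def fix_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Latin-1.

    Hypothetical helper for illustration; not a BeautifulSoup API.
    """
    try:
        # Latin-1 maps code points 0-255 straight back to bytes 0-255,
        # so this recovers the original UTF-8 byte sequence.
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not Latin-1 mojibake (or already correct): leave it unchanged.
        return text


# 'Ã' is U+00C3 and 'º' is U+00BA: the two UTF-8 bytes of 'ú' read as Latin-1.
broken = u'The Children of H\xc3\xbarin'
fixed = fix_mojibake(broken)  # u'The Children of H\xfarin', i.e. 'Húrin'
```

In the question’s code this would be applied to `title_field.find('fact').string` before printing. Because the helper falls back to the input when the round trip fails, titles that are already correct should pass through untouched, but it is only a patch over data that is broken at the source.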