I'm writing a program that fetches and edits articles on Wikipedia, and I'm having trouble handling Unicode characters that show up as \u escapes. I've tried .encode("utf8"), but it doesn't do the trick. How can I properly encode these \u-prefixed values to POST to Wikipedia? See this edit for my problem.
Here is some code:
To get the page:
url = "http://en.wikipedia.org/w/api.php?action=query&format=json&titles="+urllib.quote(name)+"&prop=revisions&rvprop=content"
articleContent = ClientCookie.urlopen(url).read().split('"*":"')[1].split('"}')[0].replace("\\n", "\n").decode("utf-8")
Before I POST the page:
data = dict([(key, value.encode('utf8')) for key, value in data.iteritems()])
data["text"] = data["text"].replace("\\", "")
editInfo = urllib2.Request("http://en.wikipedia.org/w/api.php", urllib.urlencode(data))
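You are downloading JSON data without decoding it. Use the json library for that, instead of splitting the raw response string by hand.

JSON-encoded data looks a lot like Python; it uses \u escaping as well, but it is in fact a subset of JavaScript.

Once decoded, the data variable holds a deep data structure. Judging by the string splitting, you wanted the value stored under the "*" key of the page's first revision.

Taken from the decoded structure, articleContent is an actual unicode() instance: the revision text of the page you were looking for, with the \u escapes already turned into real characters.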
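The steps above can be sketched roughly as follows. This is a minimal illustration, not the exact code for your program: the sample response string, the page id 12345, and the helper name extract_revision_text are made up for the example, and the nesting (query, pages, revisions, "*") is assumed from the string splitting in the question. With a live request you would pass the response body to json.loads(), or the file-like response object to json.load().

```python
import json

# Hypothetical, trimmed-down API response. In the real program this text
# would be the body returned by the api.php query URL from the question.
SAMPLE_RESPONSE = (
    '{"query": {"pages": {"12345": {"title": "Caf\\u00e9", '
    '"revisions": [{"*": "Caf\\u00e9 au lait\\nSecond line"}]}}}}'
)


def extract_revision_text(raw_json):
    """Decode a JSON response and return the first revision's text.

    json.loads() turns JSON escapes such as \\u00e9 and \\n into real
    characters, so no manual replace() calls are needed afterwards.
    """
    data = json.loads(raw_json)
    pages = data["query"]["pages"]
    # The pages mapping is keyed by page id, which is not known up front,
    # so take the first (here: only) entry.
    page = next(iter(pages.values()))
    return page["revisions"][0]["*"]


article_content = extract_revision_text(SAMPLE_RESPONSE)
```

Because the escapes are resolved by the JSON decoder, there should be no leftover backslashes to strip before posting; the text only needs to be encoded to UTF-8 when the request body is built.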