import urllib.request as u
zipcode = str(47401)
url = 'http://watchdog.net/us/?zip=' + zipcode
con = u.urlopen(url)
page = str(con.read())
value3 = int(page.find("<title>")) + 7
value4 = int(page.find("</title>")) - 15
district = str(page[value3:value4])
print(district)
newdistrict = district.replace("\xe2\x80\x99","'")
print(newdistrict)
For some reason, my code is pulling in the title in the following format: IN-09: Indiana\xe2\x80\x99s 9th. I know the \xe2\x80\x99 sequence stands for the ’ character, but I can’t figure out how to get Python to replace that set of characters with the ' symbol. I’ve tried decoding the string, but it’s already Unicode, and the replace call above doesn’t change anything. Any advice as to what I’m doing incorrectly?
When you call con.read(), it returns a bytes object. Calling str() on it without specifying an encoding returns a string of its representation, so the escapes are used rather than the real characters. (That means your string ends up containing \\xe2\\x80\\x99, as well as all sorts of other undesired things.)

bytes is mostly like str in Python 2: it doesn’t have any encoding information stored. str in Python 3 is like unicode in Python 2: it holds text that has already been decoded. So, when turning a bytes object into a str object, you need to tell it what encoding the bytes are actually in. In this case, that’s utf-8.

Instead of calling str() on it with an encoding argument, you would be better off using bytes.decode; it’s the same thing, just neater. The only functional change needed is to decode the bytes object as 'utf-8'.
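To make the difference concrete, here is a minimal sketch that runs without a network connection. The raw value is a hypothetical sample standing in for whatever con.read() returns (the real page contains far more markup), and the variable names are only illustrative.

```python
# Hypothetical sample standing in for the bytes returned by con.read().
raw = b"<title>IN-09: Indiana\xe2\x80\x99s 9th</title>"

# str() without an encoding gives the repr of the bytes object,
# so the escapes become literal backslash sequences in the string.
as_repr = str(raw)

# Decoding with the right encoding gives the real text.
as_text = raw.decode("utf-8")

# Same title extraction as in the question, on the decoded text.
start = as_text.find("<title>") + len("<title>")
end = as_text.find("</title>")
district = as_text[start:end]
print(district)  # IN-09: Indiana’s 9th

# Optional: swap the curly apostrophe (U+2019) for a plain one.
plain = district.replace("\u2019", "'")
print(plain)  # IN-09: Indiana's 9th
```

In the script from the question, this would amount to replacing str(con.read()) with con.read().decode('utf-8'). After that, replace works because the string holds the actual ’ character rather than backslash escapes.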