I know this is not very pretty code, and I'm sure there is an easier way, but I'm more concerned about why Python is not stripping the characters I asked it to.
import urllib, sgmllib
zip_code = raw_input('Give me a zip code> ')
url = 'http://www.uszip.com/zip/' + zip_code
print url
conn = urllib.urlopen('http://www.uszip.com/zip/' + zip_code)
i = 0
while i < 1000:
    for line in conn.fp:
        if i == 1:
            print line[7:-10]
            i += 1
        elif i == 344:
            line1 = line.strip()
            line2 = line1.strip('<td>') #its not stripping the characters
            print line2[17:-60]
            i += 1
        else:
            i += 1
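The way you call it, strip removes any occurrence of the <, >, t, and d characters, and only at the beginning or end of the string. The argument is treated as a set of characters, not as a substring.
If you want to remove every occurrence of the substring <td>, use replace, for example line1.replace('<td>', '').
Note that if you want to use that for some kind of input sanitization, it can be easily circumvented: removing '<td>' from '<t<td>d>' leaves '<td>' behind.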
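To make the difference concrete, here is a small sketch. The sample strings are made up for illustration and are not taken from the actual page; the code should run under Python 2 or 3.

```python
# Made-up sample strings, not the real page content.
cell = '<td>data</td>'
middle = 'a<td>b<td>c'

# strip() treats '<td>' as the character set {'<', 't', 'd', '>'} and trims
# those characters from both ends only. Note that it also eats the leading
# 'd' of 'data' and stops at '/' on the right.
stripped = cell.strip('<td>')
print(stripped)            # ata</

# Occurrences in the middle of the string are never touched by strip().
untouched = middle.strip('<td>')
print(untouched)           # a<td>b<td>c

# replace() removes every occurrence of the exact substring '<td>'.
replaced = middle.replace('<td>', '')
print(replaced)            # abc

# A single replace() pass is easy to circumvent: removing the inner '<td>'
# glues the outer pieces back together into a new '<td>'.
bypassed = '<t<td>d>'.replace('<td>', '')
print(bypassed)            # <td>
```

The first result is the surprising one: because the argument is a set of characters, strip can remove more than the tag itself.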
This is only one of many ways people typically get screwed when they try to write their own HTML parsing code, so you may rather want to use an HTML parsing library like BeautifulSoup or an XML parser like lxml.
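As a rough illustration of letting a parser do the work, here is a minimal sketch. It deliberately swaps in the standard library's HTMLParser instead of BeautifulSoup or lxml so that it runs without installing anything. The class name CellTextParser and the sample markup are hypothetical, and the code is written for Python 3 (in Python 2 the module is named HTMLParser).

```python
from html.parser import HTMLParser  # Python 3 module name


class CellTextParser(HTMLParser):
    """Hypothetical helper: collect the text inside each <td> element."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # Start a new, empty cell whenever a <td> opens.
        if tag == 'td':
            self.in_cell = True
            self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self.in_cell:
            self.cells[-1] += data


# Made-up sample markup; the real page layout is an assumption.
sample = '<table><tr><td>Population</td><td> 12,345 </td></tr></table>'

parser = CellTextParser()
parser.feed(sample)
parser.close()

# Here strip() with no argument is the right tool: it trims whitespace only.
cells = [text.strip() for text in parser.cells]
print(cells)  # ['Population', '12,345']
```

This sketch does not handle nested tables or malformed markup, which is where a dedicated library such as BeautifulSoup or lxml is more forgiving.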