I have built a web scraper with a for loop. I don’t know why, but it returns a URL (which is what I want it to return), and then, before fetching the next URL in the list, it returns a NoneType object. Apart from making the script slower, that wouldn’t be a big deal, except that I can’t get it to print more than the first URL.
from BeautifulSoup import BeautifulSoup
from mechanize import Browser
br = Browser()
page = br.open("https://bdkv2.borger.dk/foa/Sider/default.aspx?fk=22&foaid=11541520")
html = page.read()
soup = BeautifulSoup(html)
link = soup.findAll('a')
kommunelink = link[21:116]
for kommune in kommunelink:
    kommuneside = br.open(kommune['href'])
    html2 = kommuneside.read()
    soup2 = BeautifulSoup(html2)
    hjemmesidelink = soup2.find('a', id='_uscAncHomesite')
    print hjemmesidelink['href']
This way my output is like this:
http://www.albertslund.dk
Traceback (most recent call last):
  File "C:\Users\kba\Desktop\kommuneskraber.py", line 14, in <module>
    print hjemmesidelink['href']
TypeError: 'NoneType' object has no attribute '__getitem__'
I’ve tried messing around with checks along the lines of "if the variable is of a specific class, then print", but that doesn’t work. Examples:
If hjemmesidelink['href'] == <class 'BeautifulSoup.Tag'>:
    print hjemmesidelink['href']
if hjemmesidelink.class == BeautifulSoup.Tag:
    print hjemmesidelink['href']
Any idea how it should be? Or maybe even better, any idea where/why my script fetches a ‘NoneType’ object every second time it iterates through the loop? Thanks a bunch.
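The type check attempted above can be expressed in valid Python with isinstance or an "is not None" test. The sketch below is only an illustration: Tag is a hypothetical stand-in class (not the real BeautifulSoup.Tag) so that the snippet runs without the library, href_or_none is a made-up helper name, and the syntax is Python 3 rather than the Python 2 used in the question. It also shows why the traceback appears: find() hands back None when nothing matches, and None cannot be subscripted.

```python
class Tag(object):
    """Hypothetical stand-in for BeautifulSoup.Tag, so this runs without the library."""

    def __init__(self, attrs):
        self.attrs = dict(attrs)

    def __getitem__(self, key):
        return self.attrs[key]


def href_or_none(result):
    # A find()-style call gives back either a tag or None;
    # only a tag can be subscripted with ['href'].
    if isinstance(result, Tag):
        return result['href']
    return None


found = Tag({'href': 'http://www.albertslund.dk'})
missing = None  # what a find()-style call returns when no element matches

print(href_or_none(found))
print(href_or_none(missing))

# Subscripting None directly is what raises the TypeError in the traceback.
try:
    missing['href']
except TypeError as error:
    print('TypeError:', error)
```

With the real library, the same idea would presumably be isinstance(hjemmesidelink, BeautifulSoup.Tag), or simply testing hjemmesidelink is not None, which says more directly what is being checked.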
This is not a complete answer, but, going by the comments, it covers just the part about not producing an error.
At this part of the code:
    print hjemmesidelink['href']
replace with:
    if hjemmesidelink:
        print hjemmesidelink['href']
The if hjemmesidelink: line checks whether hjemmesidelink has a value. If it does, the href is printed; if not, the loop continues with the next link.
My results:
and counting.
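To look into the "where/why" part of the question, one option is to keep the guard but also record which pages had no matching anchor. The following is a minimal sketch under stated assumptions, not the original script: it uses Python 3's standard html.parser as a stand-in for BeautifulSoup so that it runs without third-party packages, the sample pages and their keys are made up, and AnchorFinder and find_homepage_link are hypothetical names. Only the id _uscAncHomesite and the first URL come from the question.

```python
from html.parser import HTMLParser  # Python 3 standard library


class AnchorFinder(HTMLParser):
    """Remember the attributes of the first <a> tag with a given id."""

    def __init__(self, wanted_id):
        HTMLParser.__init__(self)
        self.wanted_id = wanted_id
        self.found = None  # stays None when the page has no such anchor

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.found is None and tag == 'a' and attrs.get('id') == self.wanted_id:
            self.found = attrs


def find_homepage_link(html):
    # Returns a dict of attributes, or None when nothing matches.
    finder = AnchorFinder('_uscAncHomesite')
    finder.feed(html)
    return finder.found


# Made-up sample pages; the second one deliberately lacks the homepage anchor.
pages = {
    'page-1': '<a id="_uscAncHomesite" href="http://www.albertslund.dk">Hjemmeside</a>',
    'page-2': '<p>No homepage link here</p>',
}

hrefs, skipped = [], []
for url in sorted(pages):
    link = find_homepage_link(pages[url])
    if link is None:
        skipped.append(url)  # note which page produced no match
        continue
    hrefs.append(link['href'])

print(hrefs)
print(skipped)
```

In the original loop, the equivalent would be printing kommune['href'] whenever hjemmesidelink is None. That should show whether every second entry in link[21:116] points at something other than a municipality page.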