Logical flow of the scraper: article links extracted from an XML feed are put into a list called self.raw_html. The following [simplified] method is then called to find the container each article is in and extract the text from it:
def fetch_article_contents(self):
    for article in self.raw_html:
        if self.css_selector_type == 'class':
            # Look up the article container by tag name and CSS class.
            soup = article.find(self.html_element,
                                self.css_selector)
            soup = soup.get_text()
            self.article_html.append(soup)
    return self.article_html
This works well on most feeds, but for two notable exceptions (Forbes and the Official Google Blog) it fails with the following message when get_text() is called:
AttributeError: 'NoneType' object has no attribute 'get_text'
My first logical step in debugging was to see what was returning a NoneType object, so I stuck a print type(soup) right before soup = soup.get_text(). I found:
<class 'bs4.element.Tag'> (25 times, condensed to save space)
<type 'NoneType'>
This also strikes me as strange, because there are currently 29 articles in self.raw_html when fetching the Forbes XML feed, as verified by len(self.raw_html) when the class is initialized.
The Official Google Blog returns:
<class 'bs4.element.Tag'> (just once this time)
<type 'NoneType'>
and in reality has 25 fetched articles.
What is the problem I’m encountering? Thanks!
You don’t show what self.html_element and self.css_selector are, but it seems clear that the article.find method is not finding a match for them and is returning None.
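
This would also explain the counts you printed: the AttributeError is raised on the first article where find returns None, so the loop stops there and the remaining articles are never reached. A minimal sketch of one way to guard against this is below. The names extract_texts, FakeTag and FakeArticle are hypothetical; the stand-in classes only exist so the example runs without bs4 installed, and the function itself assumes nothing more than objects with a find method that returns either a tag or None, as BeautifulSoup objects do.

```python
def extract_texts(articles, html_element, css_selector):
    """Return (texts, misses): the text of each matched container, and the
    indexes of the articles where find() returned None."""
    texts, misses = [], []
    for index, article in enumerate(articles):
        container = article.find(html_element, css_selector)
        if container is None:
            # No matching element in this article: record it and move on
            # instead of calling get_text() on None.
            misses.append(index)
            continue
        texts.append(container.get_text())
    return texts, misses


# Hypothetical stand-ins for parsed articles (not part of bs4).
class FakeTag(object):
    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


class FakeArticle(object):
    def __init__(self, tag):
        self._tag = tag

    def find(self, name, attrs=None):
        # Returning None simulates "no element matched".
        return self._tag


articles = [
    FakeArticle(FakeTag("first")),
    FakeArticle(None),
    FakeArticle(FakeTag("third")),
]
texts, misses = extract_texts(articles, "div", "article-body")
print(texts)   # ['first', 'third']
print(misses)  # [1]
```

The list of misses tells you which articles to inspect by hand. For those, it is worth checking whether the page really contains the tag and class you are searching for, since a different layout on some pages would produce exactly this symptom.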