I'm using BeautifulSoup to parse data from baseball-reference.com, and it works fine for every page except a few, like this one. Pages with the same layout (different data) work perfectly, e.g. this one.
I'm trying to pick out the tables that have 'stats_table' as one of their classes. I use this code:
from bs4 import BeautifulSoup, SoupStrainer
bs = BeautifulSoup(stream, 'lxml', parse_only=SoupStrainer('table'))
and then I do something like:
for table in bs.find_all('table'):
    print(table.attrs)
    # ... rest of the processing ...
The table.attrs output makes it clear that this code doesn't see the batting and pitching tables, even though they are on the page. I repeat: the same code works fine for almost all other pages like this.
Looking over str(bs) clearly shows that
ANY ideas?
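As you noted in the comments, the page contains HTML errors. Malformed markup can make the parser drop or misplace elements, which would explain the missing tables. Use HTML Tidy to clean the page up before parsing it: http://pypi.python.org/pypi/pytidylib/0.2.1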
You can see HTML Tidy at work here: http://validator.w3.org/
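As a rough sketch of the cleanup step, the snippet below runs a markup string through pytidylib's tidy_document before it is handed to the parser. The helper name tidy_html and the sample markup are made up for illustration, and falling back to the untouched input when Tidy is not installed is an assumption, not part of the library.

```python
def tidy_html(raw_html):
    """Return (cleaned_html, tidy_messages) for a markup string.

    Hypothetical helper: falls back to the unchanged input when pytidylib
    or the underlying libtidy C library is not installed.
    """
    try:
        from tidylib import tidy_document  # provided by the pytidylib package
        # tidy_document returns a (document, errors) pair
        return tidy_document(raw_html)
    except (ImportError, OSError):
        # Assumption: passing the markup through untouched is acceptable here
        return raw_html, ""


# Made-up sample with unclosed <td> and <tr> tags
broken = "<table class='stats_table'><tr><td>1</table>"
cleaned, messages = tidy_html(broken)
print(messages)  # Tidy's warnings; empty if Tidy was unavailable
```

The cleaned string can then be passed to BeautifulSoup in place of stream, with the same SoupStrainer('table') filter as in the question.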