I wrote the line below:
[x['href'] for x in BeautifulSoup(data, parseOnlyThese=SoupStrainer('a'))]
The data is obtained with urllib.urlopen(XXX).read() in Python 2.7.
It works well when XXX is a page that consists entirely of English characters, such as http://python.org. But when the page contains Chinese characters, it fails.
It raises a KeyError, and [x for ...] returns an empty list.
What's more, without parseOnlyThese=SoupStrainer('a'), it works for both pages.
Is this a bug in SoupStrainer?
from BeautifulSoup import BeautifulSoup, SoupStrainer
import urllib
data = urllib.urlopen('http://tudou.com').read()
[x['href'] for x in BeautifulSoup(data, parseOnlyThese=SoupStrainer('a'))]
gives the traceback:
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
[x['href'] for x in BeautifulSoup(data, parseOnlyThese=SoupStrainer('a'))]
File "F:\ActivePython27\lib\site-packages\beautifulsoup-3.2.1-py2.7.egg\BeautifulSoup.py", line 613, in __getitem__
return self._getAttrMap()[key]
KeyError: 'href'
There are <a> links on that page that do not have an href attribute. For example, it is perfectly normal to declare a link target with <a name="something" />; you are selecting those tags too, but they do not have an href attribute, so your code fails on them. This has nothing to do with the Chinese characters. Restrict the selection to <a> tags that actually have an href attribute instead.
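To make the failure mode concrete, here is a minimal sketch of the "only take anchors that have an href" idea. It deliberately uses the standard library's html.parser (Python 3) rather than BeautifulSoup 3, so it runs without extra packages; the HrefCollector class and the sample markup are made up for illustration and are not taken from the page in the question. In BeautifulSoup 3 itself, the same filter can be expressed by passing href=True to the strainer, i.e. SoupStrainer('a', href=True).

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values from <a> tags, skipping anchors that have none."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr_map = dict(attrs)
        # The failing code did the equivalent of attr_map['href'] for every
        # <a> tag. Here the value is read only when the attribute is present.
        href = attr_map.get('href')
        if href is not None:
            self.hrefs.append(href)


# Hypothetical sample: one named link target without href, two real links.
sample = (
    '<a name="top"></a>'
    '<a href="http://tudou.com/">土豆</a>'
    '<a href="/about">About</a>'
)

collector = HrefCollector()
collector.feed(sample)
print(collector.hrefs)
```

The <a name="top"> anchor is skipped instead of raising KeyError, and the non-ASCII link text plays no part in the outcome.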