I am fetching some HTML table rows with BeautifulSoup using this piece of code:
from bs4 import BeautifulSoup
import urllib2
import re
page = urllib2.urlopen('http://www.something.bla')  # urlopen needs a full URL, including the scheme
soup = BeautifulSoup(page)
rows = soup.findAll('tr', attrs={'class': re.compile('class1.*')})
This is what I get as a result:
<tr class="class1 class2 class3">...</tr>
<tr class="class1 class2 class3">...</tr>
<tr class="class1 class5">...</tr>
<tr class="class1_a class5_a">...</tr>
<tr class="class1 class5">...</tr>
<tr class="class1_a class5_a">...</tr>
<!-- etc. -->
However, I’d like to exclude those rows whose class attribute is exactly class1 class2 class3 (or not select them in the first place).
How can I do that?
Thanks for help!
Perhaps it’s easier without regex. This works with BeautifulSoup 3:
=>
With BeautifulSoup 4, I was able to make it work as follows:
=>
In BS4, multi-valued attributes like class have lists of strings as their values, not strings. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#id12.
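Because class values are lists in BS4, one possible approach is to pass a function to find_all and compare the class list directly. The following is only a minimal sketch, not necessarily the code this answer originally contained: the names wanted_row and EXCLUDED and the sample HTML are made up for illustration, and "some class starts with class1" is assumed as a stand-in for the regex in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical: the exact combination of classes that should be skipped.
EXCLUDED = {"class1", "class2", "class3"}


def wanted_row(tag):
    """Return True for <tr> tags that should be kept."""
    if tag.name != "tr":
        return False
    # In BS4 the class attribute is a list of strings (or missing entirely).
    classes = tag.get("class") or []
    # Assumption: "matches class1.*" means some class starts with "class1".
    if not any(c.startswith("class1") for c in classes):
        return False
    # Set comparison ignores the order in which the classes are written.
    return set(classes) != EXCLUDED


# Made-up sample markup mirroring the rows shown in the question.
html = """
<table>
<tr class="class1 class2 class3"><td>a</td></tr>
<tr class="class1 class5"><td>b</td></tr>
<tr class="class1_a class5_a"><td>c</td></tr>
<tr class="other"><td>d</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all(wanted_row)
kept = [row["class"] for row in rows]
```

With the sample markup, kept should hold only the class lists of the second and third rows. If the order of the classes matters, comparing classes against a list instead of a set would be the stricter check.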