Scraping pages using BeautifulSoup; trying to filter out links that end in “…html#comments”
Code follows:
import urllib.request
import re
from bs4 import BeautifulSoup

base_url = "http://voices.washingtonpost.com/thefix/morning-fix/"
soup = BeautifulSoup(urllib.request.urlopen(base_url)).findAll('a')
links_to_follow = []
for i in soup:
    if i.has_key('href') and \
       re.search(base_url, i['href']) and \
       len(i['href']) > len(base_url) and \
       re.search(r'[^(comments)]', i['href']):
        print(i['href'])
Python 3.2, Windows 7 64-bit.
The above script retains a link ending in “#comments”.
I tried re.search([^comments], i['href']), re.search([^(comments)], i['href']) and re.search([^'comments'], i['href']) — all threw syntax errors.
New to Python, so apologies for banality.
I’m guessing either
(a) I don’t understand enough about the ‘r’ prefix to use it correctly or
(b) in response to [^(foo)] re.search returns not the set of lines that exclude ‘foo’, but the set of lines comprised of more than ‘foo’ alone. e.g., I keep my …#comments link because …texttexttext.html#comments precedes it or
(c) Python interprets “#” as a comment ending the line re.search is supposed to match.
I think I’m wrong on (b).
Sorry, know this is simple. Thanks,
Zack
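[^(comments)] means “one character that is neither a ( nor a c, an o, an m, an e, an n, a t, an s or a )”. Probably not what you intended.
If your goal is to have a regex that only matches if the provided string does not end in #comments, then I would use a pattern anchored to the end of the string with $ and negate the result of re.search, or even better (why use a regex at all if it’s that simple?) skip the regex and test the string with its endswith() method.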
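To make the difference concrete, here is a minimal sketch of both approaches next to the original character class. The sample URLs and the helper name keep_link are made up for illustration and are not part of the original script.

```python
import re

# Hypothetical sample hrefs, invented for illustration.
hrefs = [
    "http://example.com/morning-fix/post.html",
    "http://example.com/morning-fix/post.html#comments",
]

# The character class matches any single character that is not one of
# "(", ")", "c", "o", "m", "e", "n", "t", "s", so almost every URL passes.
char_class_hits = [h for h in hrefs if re.search(r'[^(comments)]', h)]

# Anchored regex: "$" ties the match to the end of the string.
regex_kept = [h for h in hrefs if not re.search(r'#comments$', h)]


def keep_link(href):
    """Hypothetical helper: True unless href ends in '#comments'."""
    return not href.endswith('#comments')


# Plain string method, no regex needed.
plain_kept = [h for h in hrefs if keep_link(h)]

print(len(char_class_hits))  # both links survive the character-class test
print(regex_kept)
print(plain_kept)
```

In the loop from the question, either test would take the place of the last re.search(...) condition, for example: not i['href'].endswith('#comments').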
As for your other questions:
The r'...' notation allows you to write “raw strings”, meaning that backslashes don’t need to be escaped:
- r'\b' means “backslash + b” (which will be interpreted by the regex engine as “word boundary”)
- '\b' means “backspace character”
A # has no special meaning in a regex unless you use the (?x) or re.VERBOSE option. In that case, it does indeed start a comment in a multiline regex.
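A short sketch of both points, raw strings and the # character, may help. The test strings here are invented for illustration.

```python
import re

# '\b' is one backspace character; r'\b' is a backslash followed by 'b'.
backspace_len = len('\b')
raw_len = len(r'\b')

# Passed to the regex engine, r'\b' is a word boundary, so only the
# standalone word "cat" is found here.
words = re.findall(r'\bcat\b', 'cat concat cats')

# Without flags, '#' is an ordinary character in a pattern.
plain = re.search(r'#comments$', 'post.html#comments')

# With re.VERBOSE, whitespace in the pattern is ignored and an unescaped
# '#' starts a comment that runs to the end of the line, so a literal '#'
# has to be escaped.
verbose = re.compile(r"""
    \#comments   # a literal '#comments'
    $            # at the end of the string
""", re.VERBOSE)

print(backspace_len, raw_len)
print(words)
print(plain is not None)
print(verbose.search('post.html#comments') is not None)
```

So guess (c) from the question only applies to patterns compiled in verbose mode; in the original script the # in the URL is matched literally.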