Scraping pages using BeautifulSoup; trying to filter out links that end in …html#comments Code

Question


Editorial Team

Asked: May 31, 2026 (2026-05-31T23:29:46+00:00)

Scraping pages using BeautifulSoup; trying to filter out links that end in “…html#comments”

Code follows:

import urllib.request
import re
from bs4 import BeautifulSoup

base_url = "http://voices.washingtonpost.com/thefix/morning-fix/"
soup = BeautifulSoup(urllib.request.urlopen(base_url)).findAll('a')
links_to_follow = []
for i in soup:
    if i.has_key('href') and \
       re.search(base_url, i['href']) and \
       len(i['href']) > len(base_url) and \
       re.search(r'[^(comments)]', i['href']):
        print(i['href'])

Python 3.2, Windows 7 64-bit.

The above script retains a link ending in “#comments”

I tried re.search([^comments], i['href']), re.search([^(comments)], i['href']) and re.search([^'comments'], i['href']) — all threw syntax errors.

New to Python, so apologies for banality.

I’m guessing either
(a) I don’t understand enough about the ‘r’ prefix to use it correctly or
(b) in response to [^(foo)] re.search returns not the set of lines that exclude ‘foo’, but the set of lines comprised of more than ‘foo’ alone. e.g., I keep my …#comments link because …texttexttext.html#comments precedes it or
(c) Python interprets “#” as a comment ending the line re.search is supposed to match.

I think I’m wrong on (b).

Sorry, know this is simple. Thanks,

Zack


1 Answer

Editorial Team · Answer 1 · 2026-05-31T23:29:48+00:00

[^(comments)]

means “one character that is neither a ( nor a c, an o, an m, an e, an n, a t, an s or a )”. Probably not what you intended.
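A quick way to see this for yourself is the minimal sketch below. The two sample strings are made up for illustration; they are not from the page being scraped.

```python
import re

# A character class: matches any ONE character that is not in the set "()comments".
pattern = re.compile(r'[^(comments)]')

# Hypothetical inputs, for illustration only.
with_comments = "http://example.com/post.html#comments"
only_class_chars = "comments"

# The URL has plenty of characters outside the set (starting with "h"), so it matches.
first_match = pattern.search(with_comments)

# A string built only from characters in the set leaves nothing to match.
no_match = pattern.search(only_class_chars)

print(first_match.group(0))
print(no_match)
```

So the test in the question succeeds for practically every URL, which is why the “#comments” links are kept.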

If your goal is to have a regex that only matches if the provided string does not end in #comments, then I would use

... and not re.search("#comments$", i['href'])

or even better (why use a regex at all if it’s that simple?):

... and not i['href'].endswith("#comments")
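Putting that together, here is a rough sketch of the whole filter without any regex. It works on a plain list of href strings so it can run without fetching the page. The function name and sample hrefs are my own, and I have swapped re.search(base_url, ...) for startswith, which assumes you only want links that begin with the base URL.

```python
def filter_links(hrefs, base_url):
    """Keep hrefs under base_url that are longer than it and do not end in '#comments'."""
    return [
        href for href in hrefs
        if href.startswith(base_url)          # assumption: prefix match is what was intended
        and len(href) > len(base_url)         # skip the index page itself
        and not href.endswith("#comments")    # skip comment anchors
    ]


base_url = "http://voices.washingtonpost.com/thefix/morning-fix/"

# Hypothetical hrefs, for illustration only.
sample = [
    base_url,
    base_url + "some-post.html",
    base_url + "some-post.html#comments",
    "http://example.com/other.html",
]

kept = filter_links(sample, base_url)
print(kept)
```

With BeautifulSoup you would feed it the href values of the anchor tags you collected.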

As for your other questions:

The r'...' notation allows you to write “raw strings”, meaning that backslashes don’t need to be escaped:

r'\b' means “backslash + b” (which will be interpreted by the regex engine as “word boundary”)
'\b' means “backspace character”
etc.
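The difference is easy to check in an interpreter. A small sketch, with a sample sentence that is only an illustration:

```python
import re

raw = r'\b'      # two characters: a backslash and the letter b
cooked = '\b'    # one character: backspace (0x08)

print(len(raw), len(cooked))

text = "the fix is in"  # hypothetical sample text

# Raw string: the regex engine receives \b and treats it as a word boundary.
word_hit = re.search(r'\bfix\b', text)

# Plain string: the pattern contains literal backspace characters,
# which do not occur in the text, so nothing matches.
backspace_hit = re.search('\bfix\b', text)

print(word_hit is not None, backspace_hit is None)
```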

# has no special meaning in a regex unless you use the (?x) or re.VERBOSE option. In that case, it does indeed start a comment in a multiline regex.
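To illustrate, here is a small sketch comparing the same pattern with and without re.VERBOSE. The href is a made-up example.

```python
import re

href = "http://example.com/post.html#comments"  # hypothetical sample

# Default mode: "#" is an ordinary character.
plain = re.search("#comments$", href)

# Verbose mode: "#" starts a comment, so a literal hash has to be escaped.
verbose = re.compile(r"""
    \#comments   # escaped hash, then the literal word
    $            # end of string
""", re.VERBOSE)

# Verbose mode with an unescaped "#": the rest of the line is treated as a
# comment, so the pattern is effectively empty and matches the empty string.
swallowed = re.compile("#comments$", re.VERBOSE)

print(plain.group(0))
print(verbose.search(href).group(0))
print(repr(swallowed.search(href).group(0)))
```

Note that this is about the regex engine only. Inside a Python string literal, “#” never starts a Python comment.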
