I am trying to write a script to scrape a website, and am using

Question


Asked: June 14, 2026


I am trying to write a script to scrape a website, and am using this one (http://www.theericwang.com/scripts/eBayRead.py).

However, I want to use it to crawl sites other than eBay, and to customize it to my needs.

I am fairly new to Python and have limited experience with the re module.

I am unsure of what this line achieves.

for url, title in re.findall(r'href="([^"]+).*class="vip" title=\'([^\']+)', lines):

Could someone please give me some pointers?

Is there anything else I need to consider if I port this for other sites?


1 Answer

Editorial Team · Answer 1 · 2026-06-14T09:07:08+00:00

In general, parsing HTML is best done with a library such as BeautifulSoup, which takes care of virtually all of the heavy lifting for you, leaving you with more intuitive code. Also, read @Tadeck’s link below – regex and HTML shouldn’t be mixed if it can be avoided (to put it lightly).
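To give a feel for the parser-based approach, here is a minimal sketch. It uses the standard library's html.parser as a stand-in rather than BeautifulSoup, so that it runs without installing anything; BeautifulSoup would be shorter. The class name VipLinkParser and the sample markup are made up for illustration and are not taken from the original script.

```python
from html.parser import HTMLParser


class VipLinkParser(HTMLParser):
    """Collects (href, title) pairs from <a> tags whose class includes "vip"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attributes = dict(attrs)
        # The class attribute may hold several space-separated names.
        classes = (attributes.get('class') or '').split()
        if 'vip' in classes:
            self.links.append((attributes.get('href'), attributes.get('title')))


# Hypothetical sample markup; note the attributes are in a different order
# than the regex in the question expects.
sample = '<a title=\'Red bicycle\' class="vip item" href="/item/1">Red bicycle</a>'

parser = VipLinkParser()
parser.feed(sample)
parser.close()
links = parser.links
print(links)
```

Unlike the regex, this does not care about attribute order, quote style, or extra class names, which is most of what tends to break when the same script is pointed at a different site.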

As for your question, that line uses a regular expression to find matching patterns in a text (in this case, HTML). re.findall() is a function that returns a list of all non-overlapping matches, so if we focus on just that:

re.findall(r'href="([^"]+).*class="vip" title=\'([^\']+)', lines)

The r prefix marks the string as ‘raw’, meaning Python leaves backslashes in it untouched instead of treating them as the start of an escape sequence, so they reach the regex engine as written.
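A quick way to see what the r prefix changes (a small sketch; the variable names and the test sentence are invented for illustration):

```python
import re

# A normal string turns \n into one newline character; a raw string keeps
# the backslash and the letter n as two separate characters.
normal_length = len('\n')
raw_length = len(r'\n')

# In the pattern from the question, \' stays as backslash + apostrophe in the
# raw string, and the regex engine then reads \' as a literal apostrophe.
apostrophes = re.findall(r'\'', "it's Eric's script")

print(normal_length, raw_length, apostrophes)
```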

href="([^"]+)

The parentheses indicate a group (the part of the match we want back), and the [^"]+ means ‘match one or more characters that aren’t a double quote’. As you can probably guess, this group captures the URL of the link.

.*class="vip"

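The .* matches any character except a newline, 0 or more times (which here could include other attributes, the closing quote of the link, whitespace, etc.). It is greedy, so it consumes as much of the line as it can while still letting the rest of the pattern match; if several links sit on one line, it will stretch to the last class="vip" on that line. Nothing special with class="vip": it just needs to appear, followed by exactly one space and then the title attribute.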
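The greedy behaviour is easy to miss, so here is a small sketch of it. The one-line snippet is invented for illustration, and the .*? (non-greedy) variant is my own addition, not something the original script uses.

```python
import re

# Hypothetical snippet with two links on a single line.
html = ('<a href="/a" class="vip" title=\'First\'>x</a>'
        '<a href="/b" class="vip" title=\'Second\'>y</a>')

# Greedy .* runs to the last class="vip" on the line, so the first URL
# gets paired with the second title and only one match is found.
greedy = re.findall(r'href="([^"]+).*class="vip" title=\'([^\']+)', html)

# Non-greedy .*? stops at the first class="vip" it can reach.
lazy = re.findall(r'href="([^"]+).*?class="vip" title=\'([^\']+)', html)

# The dot does not match a newline, so a tag split across lines is missed.
split_tag = re.findall(r'href="([^"]+).*class="vip" title=\'([^\']+)',
                       'href="/a"\nclass="vip" title=\'T\'')

print(greedy)
print(lazy)
print(split_tag)
```

If you port the script to a site that puts several links on one line, or wraps tags across lines, this is the part of the pattern most likely to misbehave.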

title=\'([^\']+)

Here you see an escaped apostrophe and then another group like the one above. This time, we are capturing everything after the opening apostrophe of the title attribute, up to (but not including) the next apostrophe.

The end result is that you are iterating through a list of all matches. Because the pattern has two groups, each match is a tuple that looks something like (my_matched_link, my_matched_title), which the for loop unpacks into url and title, after which further processing is done.
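Putting the pieces together, the loop can be tried on a couple of lines of made-up markup. The sample HTML below is hypothetical and only shaped to fit the pattern; real eBay pages (or any other site) will look different.

```python
import re

# Hypothetical sample markup, invented for illustration.
lines = (
    '<a href="http://example.com/item/1" class="vip" title=\'Red bicycle\'>Red bicycle</a>\n'
    '<a href="http://example.com/item/2" class="vip" title=\'Blue kettle\'>Blue kettle</a>\n'
)

pattern = r'href="([^"]+).*class="vip" title=\'([^\']+)'

# findall returns one (url, title) tuple per match because of the two groups.
matches = re.findall(pattern, lines)

for url, title in matches:
    print(url, '->', title)
```

When porting to another site, the parts you would most likely have to change are the class name, the quote style around the title, and the assumption that href comes before class and title on the same line.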
