I am trying to write a script to scrape a website, and am using this one (http://www.theericwang.com/scripts/eBayRead.py).
However, I want to use it to crawl sites other than eBay and to customize it to my needs.
I am fairly new to Python and have limited regex experience.
I am unsure what this line achieves:
for url, title in re.findall(r'href="([^"]+).*class="vip" title=\'([^\']+)', lines):
Could someone please give me some pointers?
Is there anything else I need to consider if I port this for other sites?
In general, parsing HTML is best done with a library such as BeautifulSoup, which takes care of virtually all of the heavy lifting for you, leaving you with more intuitive code. Also, read @Tadeck’s link below – regex and HTML shouldn’t be mixed if it can be avoided (to put it lightly).
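To make the "use a parser" advice concrete, here is a minimal sketch. It uses the standard library's html.parser rather than BeautifulSoup, purely so it runs without installing anything; BeautifulSoup would likely be shorter. The class name VipLinkParser and the sample markup are made up for illustration and are not taken from the eBay script.

```python
from html.parser import HTMLParser


class VipLinkParser(HTMLParser):
    """Collect (href, title) pairs from <a> tags whose class list contains 'vip'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)  # attrs arrives as a list of (name, value) pairs
        classes = (attr.get("class") or "").split()
        if "vip" in classes and attr.get("href"):
            self.links.append((attr["href"], attr.get("title")))


# Hypothetical markup: note the attribute order and quote style differ
# from what the regex in the question expects.
html = (
    '<div><a title="Red bicycle" class="vip" href="http://example.com/item/1">'
    'Red bicycle</a> <a href="http://example.com/help">Help</a></div>'
)

parser = VipLinkParser()
parser.feed(html)
for url, title in parser.links:
    print(url, title)
```

Because the parser hands you attributes by name, the code does not care about attribute order, quote style, or line breaks inside the tag, which is where a regex tends to break when you move to another site.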
As for your question, that line uses a regular expression (regex) to find matching patterns in a piece of text (in this case, HTML).
re.findall() is a function that returns a list of all non-overlapping matches, so let's focus on just the pattern:
r'href="([^"]+).*class="vip" title=\'([^\']+)'
The r prefix indicates that the string will be interpreted 'raw', meaning that backslashes are kept literally instead of being treated as Python string escapes.
href="([^"]+)
The parentheses indicate a capturing group (the part of the match we care about), and [^"]+ means 'match one or more characters that are not a double quote'. As you can probably guess, this group captures the URL of the link.
.*class="vip"
The .* matches any character except a newline, zero or more times (which here could include other tags, the closing quote of the link, whitespace, etc.). Nothing special about class="vip"; it just needs to appear.
title=\'([^\']+)
Here you see an escaped apostrophe and then another group like the one above. This time, we are capturing everything between the two apostrophes after the title attribute.
The end result is that you iterate over a list of all matches. Because the pattern contains two groups, each match is a tuple that looks something like
(my_matched_link, my_matched_title)
which is unpacked into url and title by the for loop, after which further processing is done.
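If it helps to see this run, here is a small sketch that applies the same pattern to some made-up HTML. The sample strings and URLs are invented for illustration; real eBay markup will differ.

```python
import re

pattern = r'href="([^"]+).*class="vip" title=\'([^\']+)'

# Hypothetical page: one link per line, plus a link without class="vip".
sample = (
    '<a href="http://example.com/item/1" class="vip" title=\'Red bicycle\'>Red bicycle</a>\n'
    '<a href="http://example.com/item/2" class="vip" title=\'Blue kettle\'>Blue kettle</a>\n'
    '<a href="http://example.com/help">Help</a>\n'
)

matches = re.findall(pattern, sample)
for url, title in matches:
    print(url, "->", title)

# Caveat when porting: .* is greedy, so with two links on ONE line the first
# href gets paired with the LAST title on that line.
one_line = (
    '<a href="/a" class="vip" title=\'A\'>A</a>'
    '<a href="/b" class="vip" title=\'B\'>B</a>'
)
greedy_matches = re.findall(pattern, one_line)
print(greedy_matches)
```

The first call returns one tuple per matching line, and the help link is skipped because class="vip" never appears after it on the same line. The second call shows why this pattern is fragile on other sites: it depends on the attribute order, the quote style, and there being only one link per line.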