I’m having trouble with some regex code while scraping YouTube playlist pages. It mostly works fine, but it’s picking up a couple of strange results.
Expression:
(?<=v=)[a-zA-Z0-9-_]+(?=&)|(?<=[0-9]/)[^&\n]+|(?<=v=)[^&\n]+
Examples of what to pick out:
yXBckFyiMyU,
opWYnUpNtG8,
YFbLRZCExBk,
I_GZahAl-PQ,
G6F_iP-F7Fw
from links like this:
https://www.youtube.com/watch?v=_ClmClS_Mqs&list=PL6422619E56951B73&index=5&feature=plpp_video
For the most part this appears to be working okay; however, it is also picking up these instances:
data-thumb="//i1.ytimg.com/vi/84GVRtJ1CvY/<FROM RIGHT ONWARDS IS WHAT IT MATCHES>default.jpg" ><span class="vertical-align"></span></span></span></span>
data-thumb="//i4.ytimg.com/vi/WNIPqafd4As/<FROM RIGHT ONWARDS IS WHAT IT MATCHES>default.jpg" alt="" class="thumb"></span></span></span><span class="clip"><span class="centering-offset"><span class="centering"><span class="ie7-vertical-align-hack">
Regex is rather daunting. Does anyone know what’s wrong with the expression?
As a suggestion, the strings you want to match are always 11 characters long. Instead of trying to match “as many characters as possible” using the + quantifier, match “exactly 11 characters” using the {11} quantifier.
This may cure the symptoms of the over-matching problem you are seeing, though I don’t know why it’s matching those strings in the first place. (They don’t start with v=.)
You should probably clarify your alternations | by parenthesising, and if your regex flavour supports verbose regular expressions (comments inside regexes), use them!
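Here is a rough sketch of what that could look like, assuming Python's re module and its re.VERBOSE flag (other flavours spell the verbose option differently). It keeps the three alternatives from the question, puts each one in its own group, and swaps + for {11}. The name VIDEO_ID and the comments on each branch are my own additions, not something from the question.

```python
import re

# Hypothetical rewrite of the question's expression: same three alternatives,
# each wrapped in its own non-capturing group, with + replaced by {11}.
VIDEO_ID = re.compile(
    r"""
      (?: (?<=v=)     [A-Za-z0-9_-]{11} (?=&) )   # after "v=": 11 ID characters, then "&"
    | (?: (?<=[0-9]/) [^&\n]{11}              )   # after a digit and "/": 11 non-"&" characters
    | (?: (?<=v=)     [^&\n]{11}              )   # after "v=": 11 non-"&" characters
    """,
    re.VERBOSE,
)

# The sample link from the question.
url = ("https://www.youtube.com/watch?v=_ClmClS_Mqs"
       "&list=PL6422619E56951B73&index=5&feature=plpp_video")

ids = VIDEO_ID.findall(url)
print(ids)
```

With the groups written out like this it is much easier to see which branch is responsible for a given match, and to delete a branch you do not need.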
As a suggestion: parsing URLs with regex is nasty. I would instead:
1. Extract the links from the page with an HTML parser (for example Python's BeautifulSoup, which makes it very easy to get ‘all links’).
2. Parse each link with parse_url() (more Python), obtaining a dictionary/hash of the GET attributes. For the link in the question, the dictionary would have the keys v, list, index and feature.
Then you can just ask for the GET attribute v. No regexes required.
This is Python specific, but Java will have equivalents. The point is that regex is not always the best tool (just the most general tool).
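A minimal sketch of that approach, with some assumptions stated up front: I can't point to a parse_url() in the Python standard library, so this uses urllib.parse.urlparse and parse_qs as the closest equivalents, and it uses the standard library's html.parser in place of BeautifulSoup so that it runs without installing anything. The LinkCollector class, the video_ids function and the page fragment are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (stdlib stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def video_ids(html):
    """Return the "v" GET attribute of every link that has one."""
    collector = LinkCollector()
    collector.feed(html)
    ids = []
    for link in collector.links:
        # parse_qs returns a dict mapping each name to a list of values.
        query = parse_qs(urlparse(link).query)
        ids.extend(query.get("v", []))
    return ids


# Hypothetical page fragment built from the samples in the question.
page = (
    '<a href="https://www.youtube.com/watch?v=_ClmClS_Mqs'
    '&list=PL6422619E56951B73&index=5&feature=plpp_video">video</a>'
    '<img data-thumb="//i1.ytimg.com/vi/84GVRtJ1CvY/default.jpg">'
)
found = video_ids(page)
print(found)
```

Because only <a> tags are inspected, the data-thumb attributes that caused the stray matches are never looked at.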