I’m having trouble with some Regex code while scraping YouTube playlist pages. It mostly

Question


Asked: May 31, 2026 (2026-05-31T20:40:39+00:00)

I’m having trouble with some regex code while scraping YouTube playlist pages. It mostly works fine, but it’s picking up a couple of strange results.

Expression:

(?<=v=)[a-zA-Z0-9-_]+(?=&)|(?<=[0-9]/)[^&\n]+|(?<=v=)[^&\n]+

Examples of what to pick out:

yXBckFyiMyU,
opWYnUpNtG8,
YFbLRZCExBk,
I_GZahAl-PQ,
G6F_iP-F7Fw

from links like this

https://www.youtube.com/watch?v=_ClmClS_Mqs&list=PL6422619E56951B73&index=5&feature=plpp_video

For the most part this appears to be working okay; however, it is also picking up these instances:

data-thumb="//i1.ytimg.com/vi/84GVRtJ1CvY/<FROM RIGHT ONWARDS IS WHAT IT MATCHES>default.jpg" ><span class="vertical-align"></span></span></span></span>

data-thumb="//i4.ytimg.com/vi/WNIPqafd4As/<FROM RIGHT ONWARDS IS WHAT IT MATCHES>default.jpg" alt="" class="thumb"></span></span></span><span class="clip"><span class="centering-offset"><span class="centering"><span class="ie7-vertical-align-hack">

Regex is rather daunting. Does anyone know what’s wrong with the expression?

1 Answer

Editorial Team · Answer 1 · 2026-05-31T20:40:41+00:00

As a suggestion: the strings you want to match are always 11 characters long. Instead of matching “as many characters as possible” with the + quantifier, match “exactly 11 characters” with the {11} quantifier.
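A minimal sketch of the {11} idea, written in Python purely for illustration (the question does not say which regex flavour is in use). It assumes every video ID is exactly 11 characters long and made up of letters, digits, "-" and "_", and it keeps only the v= branch of the original expression:

```python
import re

# Assumption: a video ID is exactly 11 characters from [A-Za-z0-9_-].
# The hyphen is placed last in the class so it is clearly a literal.
VIDEO_ID = re.compile(r"(?<=v=)[A-Za-z0-9_-]{11}")

url = ("https://www.youtube.com/watch?v=_ClmClS_Mqs"
       "&list=PL6422619E56951B73&index=5&feature=plpp_video")
thumb = 'data-thumb="//i1.ytimg.com/vi/84GVRtJ1CvY/default.jpg"'

ids = VIDEO_ID.findall(url)          # IDs found in a watch link
thumb_ids = VIDEO_ID.findall(thumb)  # nothing here is preceded by "v="

print(ids)        # ['_ClmClS_Mqs']
print(thumb_ids)  # []
```

Because the match can never run past 11 characters, it cannot swallow the rest of a line the way [^&\n]+ can.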

This may cure the symptoms of the over-matching problem you are seeing, though I don’t know why it’s matching those strings in the first place. (They don’t start with v=.)

You should probably clarify your alternations | by parenthesising:

((?<=v=)[a-zA-Z0-9-_]+(?=&))|((?<=[0-9]/)[^&\n]+)|((?<=v=)[^&\n]+)

If your regex flavour supports verbose regular expressions (whitespace and comments inside the regex), use them!
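As a rough illustration, here is the parenthesised expression written in verbose mode, assuming Python's re module and its re.VERBOSE flag (other flavours spell this differently, often as an x flag). The comments are my reading of what each branch does, and the character class is written as [a-zA-Z0-9_-], which matches the same characters as the original:

```python
import re

PATTERN = re.compile(r"""
    ( (?<=v=)     [a-zA-Z0-9_-]+ (?=&) )   # ID after "v=", followed by "&"
  | ( (?<=[0-9]/) [^&\n]+ )                # anything after a digit and a slash
  | ( (?<=v=)     [^&\n]+ )                # ID after "v=" at the end of the URL
""", re.VERBOSE)

url = ("https://www.youtube.com/watch?v=_ClmClS_Mqs"
       "&list=PL6422619E56951B73&index=5&feature=plpp_video")

# group(0) is the whole match, whichever branch produced it.
matches = [m.group(0) for m in PATTERN.finditer(url)]
print(matches)  # ['_ClmClS_Mqs']
```

Laid out like this, it is easier to see that the second branch is the only one not tied to v=, so it is the first place I would look when unexpected text gets matched.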

More generally, parsing URLs with regex is nasty. I would instead:

Get a list of all URLs on the page using an HTML parser (in Python I would use BeautifulSoup, which makes it very easy to get “all links”).
Parse each URL with a URL parser (in Python, urlparse() and parse_qsl() from urllib.parse), obtaining a dictionary/hash of the GET attributes.

For the example link in the question, the dictionary might look like

{
'v' : '_ClmClS_Mqs',
'list' : 'PL6422619E56951B73',
'index' : '5',
'feature' : 'plpp_video',
}

Then you can just ask for the GET attribute v. No regexes required.
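A small sketch of that approach, under a few assumptions. To keep it runnable without extra installs it uses the standard library's html.parser in place of BeautifulSoup, and the names LinkCollector, query_params, video_ids and sample_html are made up for this example:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qsl


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag (stdlib stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def query_params(link):
    """Return the GET attributes of a URL as a plain dict of strings."""
    return dict(parse_qsl(urlparse(link).query))


def video_ids(html):
    """Return the value of 'v' for every link on the page that has one."""
    collector = LinkCollector()
    collector.feed(html)
    ids = []
    for link in collector.links:
        params = query_params(link)
        if "v" in params:
            ids.append(params["v"])
    return ids


# Hypothetical page fragment; "&" is written as "&amp;" as it would be in HTML.
sample_html = """
<a href="https://www.youtube.com/watch?v=_ClmClS_Mqs&amp;list=PL6422619E56951B73&amp;index=5&amp;feature=plpp_video">A</a>
<img data-thumb="//i1.ytimg.com/vi/84GVRtJ1CvY/default.jpg">
<a href="/watch?v=yXBckFyiMyU">B</a>
<a href="/about">About</a>
"""

print(video_ids(sample_html))  # ['_ClmClS_Mqs', 'yXBckFyiMyU']
```

The thumbnail attribute is never looked at because only link targets are collected, and links without a v attribute are skipped.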

This is Python-specific, but Java will have equivalents. The point is that regex is not always the best tool (just the most general tool).
