I am struggling with nailing down a fairly complex regular expression to parse song titles with optional artist attribution from loosely-typed English. The user input comes from a single text field and the regex matches will be used to query a song database to get unique track IDs. I need to be able to get these matches:
\1 = song title
\2 = artist
while being fairly liberal in allowed formats.
Examples
The word "by" should split the string into song title and artist (but only on word boundaries), as should a comma with or without trailing whitespace:
baby one more time by britney spears
baby one more time, britney spears
baby one more time,britney spears
\1 = baby one more time
\2 = britney spears
False positives like these are acceptable:
down by the bay
\1 = down
\2 = the bay
whatever people say i am, that’s what i’m not
\1 = whatever people say i am
\2 = that’s what i’m not
…assuming quotes can be used to mark a run of text as a song title explicitly:
"down by the bay"
\1 = down by the bay
\2 not matched
"whatever people say i am, that’s what i’m not" by arctic monkeys
\1 = whatever people say i am, that’s what i’m not
\2 = arctic monkeys
Single quotes should work too, but obviously not if they appear within the title:
‘whatever people say i am, that’s what i’m not’
\1 = whatever people say i am, that
\2 = s what i’m not’
Additionally, if quotes are in use, the word "by" or a comma is optional:
"down by the bay" raffi
\1 = down by the bay
\2 = raffi
However, if there are no quotes, and more than one "by", then only the last "by" should be used as a delimiter:
down by the bay by raffi
\1 = down by the bay
\2 = raffi
Is this even possible with a single regex? Or would the more sane way be to split it up into multiple expressions? Either way, what might this look like?
Here is an example, using C#:
Output matches your specification, as far as I can tell:
You can actually make it better for the single-quote case by allowing apostrophes inside words:
Which fixes this case:
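The effect of allowing apostrophes inside words can be sketched in Python (not the C# used in this answer). This is only an illustration under some assumptions: the input uses plain ASCII `'` characters, and the names `NAIVE` and `AWARE` are made up for the example. An apostrophe counts as part of the title only when it has a word character on both sides.

```python
import re

# Naive form: the title ends at the first apostrophe it meets.
NAIVE = re.compile(r"^'([^']+)'")

# Apostrophe-aware form: an apostrophe with a word character on both
# sides is kept as part of the title instead of closing the quote.
AWARE = re.compile(r"^'((?:[^']|(?<=\w)'(?=\w))+)'")

text = "'whatever people say i am, that's what i'm not'"

naive_title = NAIVE.match(text).group(1)
aware_title = AWARE.match(text).group(1)

print(naive_title)  # whatever people say i am, that
print(aware_title)  # whatever people say i am, that's what i'm not
```

A title that ends in an apostrophe (for example a dropped "g") would still be cut short by this approach, since the final apostrophe has no word character after it.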
Here’s a commented version of the regex, which explains what each part does (should be matched with RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace):
Edit:
I’ve played around a bit with the PHP code, but I can’t get it to use named capturing groups properly. Here is a version using unnamed capturing groups:
The title will be in group 1, 2, or 3, and the artist in group 4.
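As a rough illustration of that group layout (title in group 1, 2, or 3, artist in group 4), here is a minimal sketch in Python rather than PHP. It is not the pattern from this answer, just one way to meet the rules in the question. It assumes ASCII quotes and apostrophes, and `PATTERN` and `parse` are hypothetical names chosen for the example.

```python
import re

PATTERN = re.compile(r"""
    ^\s*
    (?:
        "([^"]+)"                       # group 1: double-quoted title
        \s* (?: by\b | , )? \s*         # "by" or a comma is optional after quotes
      |
        '((?:[^']|(?<=\w)'(?=\w))+)'    # group 2: single-quoted title, in-word apostrophes allowed
        \s* (?: by\b | , )? \s*
      |
        (.+?)                           # group 3: unquoted title
        (?: \s+by\s+ (?!.*\sby\s)       # split on the last " by " only
          | \s*,\s*                     # or on a comma
          | \s*$                        # or there is no artist at all
        )
    )
    (.*?)                               # group 4: artist (may be empty)
    \s*$
""", re.VERBOSE | re.IGNORECASE)


def parse(text):
    """Return (title, artist); artist is None when no artist was given."""
    m = PATTERN.match(text)
    if m is None:
        return None
    title = m.group(1) or m.group(2) or m.group(3)
    artist = m.group(4) or None
    return title, artist


print(parse("down by the bay by raffi"))   # ('down by the bay', 'raffi')
print(parse('"down by the bay"'))          # ('down by the bay', None)
```

The unquoted branch uses a lazy title plus a negative lookahead so that only the last " by " acts as the delimiter; whether that reads better than splitting the work across several simpler expressions is a matter of taste.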