I’m experiencing some trouble ‘picking’ this data ‘apart’. Although helper functions etc. are an option, I would really like to solve this using a regex only (and processing the match groups after matching).
This is (part of) the data I have:
Belgium
Belgium M_Foo
Belgium A_Bar
Belgium M_FooBar
Belgium S_Whooptee Doo
Belgium Xxx
Belgium S_Foo Bar
United Kingdom
United Kingdom W_Foo-Bar
United Kingdom M_Yay
United Kingdom Xxx
United Kingdom S_Derp
United Kingdom F_Doh Lorem
United Kingdom S_Ipsum Dolor
United States of America L_Foo
Macedonia F.Y.R. Xxx
Macedonia F.Y.R. S_Foo Bar
Cyprus (Greek) M_Foo
Congo (Democratic Republic of)
Congo (Democratic Republic of) Q_Yolo
Essentially this is a “key / value” sort of array of strings. Each string contains a country name (which is not normalized, so I can’t use hard-coded country names or ‘lookups’; it might as well be some string other than a country name), optionally followed by either the keyword Xxx or <random_upcase_char>_<random_text>.
I have come up with the following regex:
^(.+?)(?:\s+(Xxx|[A-Z]_.*)?)
or, small difference in the first matchgroup:
^(.*?)(?:\s+(Xxx|[A-Z]_.*)?)
This works fine for the first strings starting with Belgium. It returns, for these records, the following results:
Group 1         Group 2
================================
Belgium
Belgium         M_Foo
Belgium         A_Bar
Belgium         M_FooBar
Belgium         S_Whooptee Doo
Belgium         Xxx
Belgium         S_Foo Bar
However, the following lines cause trouble:
Group 1         Group 2
================================
United
United
United
United
United
United
United
United
Macedonia
Macedonia
Cyprus
Congo
Congo
What I’d like the regex to do is the following:
Group 1                         Group 2
================================================
United Kingdom
United Kingdom                  W_Foo-Bar
United Kingdom                  M_Yay
United Kingdom                  Xxx
United Kingdom                  S_Derp
United Kingdom                  F_Doh Lorem
United Kingdom                  S_Ipsum Dolor
United States of America        L_Foo
Macedonia F.Y.R.                Xxx
Macedonia F.Y.R.                S_Foo Bar
Cyprus (Greek)                  M_Foo
Congo (Democratic Republic of)
Congo (Democratic Republic of)  Q_Yolo
But I can’t get the first part to match. I’m pretty sure it has something to do with greedy/ungreedy options for the first matchgroup but after fiddling around for some time I can’t get it to work…
I don’t care if extra/other/more matchgroups are returned. The regex is intended to be used in a .Net C# application (in case you’re wondering which ‘dialect’ this is).
Any help would be very much appreciated.
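Sometimes, with non-greedy matches, the anchoring is extremely important. Without an end anchor, the lazy first group stops at the earliest point where the rest of the pattern can succeed, which here is right after the first word. In this case, anchoring to the end of the line solves the problem. Your regexp should be:
^(.+?)(?:\s+(Xxx|[A-Z]_.*))?$
Note that I also moved the optional (?) quantifier outside one more grouping level, so the space is optional.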
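To make the effect of the end anchor concrete, here is a minimal sketch that runs a few of the sample lines through the anchored pattern. It assumes the pattern takes the form ^(.+?)(?:\s+(Xxx|[A-Z]_.*))?$ (the asker's regex with the ? moved out one level and a $ added). It is written in Python only because that is quick to try out; the pattern uses constructs that .NET's Regex also supports, so it should carry over unchanged. The helper name split_line is made up for this illustration.

```python
import re

# Assumed form of the anchored pattern: the lazy first group is forced to
# extend until either the optional suffix or the end of the line matches.
PATTERN = re.compile(r"^(.+?)(?:\s+(Xxx|[A-Z]_.*))?$")

def split_line(line):
    """Return (group 1, group 2); group 2 is None when there is no suffix."""
    match = PATTERN.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)

samples = [
    "Belgium",
    "Belgium S_Whooptee Doo",
    "United Kingdom",
    "United Kingdom Xxx",
    "United States of America L_Foo",
    "Macedonia F.Y.R. S_Foo Bar",
    "Congo (Democratic Republic of) Q_Yolo",
]

for line in samples:
    print(split_line(line))
```

In .NET the second group would not come back as a null value; an unmatched group reports Success == false and an empty Value. If the whole data set is matched as one multi-line string rather than line by line, RegexOptions.Multiline is needed so that ^ and $ apply to each line.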