I am trying to parse a list of items that satisfies the Python regex
r'\A(("[\w\s]+"|\w+)\s+)*\Z'
That is, it’s a whitespace-separated list, except that spaces are allowed inside quoted strings. I would like to get a list of the items it contains (that is, the substrings matched by the
r'("[\w\s]+"|\w+)'
part). So, for example,
>>> parse('foo "bar baz" "bob" ')
['foo', '"bar baz"', '"bob"']
Is there a nice way to do this with Python’s re module?
Several obvious approaches don’t quite work. For example,
>>> re.match(r'\A(("[\w\s]+"|\w+)\s+)*\Z', 'foo "bar baz" "bob" ').group(2)
'"bob"'
returns only the last item that group 2 matched, because a repeated group keeps just its final capture. On the other hand, re.findall returns every item:
>>> re.findall(r'("[\w\s]+"|\w+)', 'foo "bar baz" "bob" ')
['foo', '"bar baz"', '"bob"']
but it also accepts malformed input such as
>>> re.findall(r'("[\w\s]+"|\w+)', 'foo "bar b-&&az" "bob" ')
['foo', 'bar', 'b', 'az', '" "', 'bob']
So is there any way to use the original regex and get all of the items that matched group 2? Something like
>>> re.match_multigroup(r'\A(("[\w\s]+"|\w+)\s+)*\Z', 'foo "bar baz" "bob" ').group(2)
['foo', '"bar baz"', '"bob"']
>>> re.match_multigroup(r'\A(("[\w\s]+"|\w+)\s+)*\Z', 'foo "bar b-&&az" "bob" ')
None
Edit: It is important that I preserve the quotes in the output, so I don’t want
>>> re.match_multigroup(r'\A(("[\w\s]+"|\w+)\s+)*\Z', 'foo "bar baz" "bob" ').group(2)
['foo', 'bar baz', 'bob']
because then I don’t know if bob was quoted or not.
Alright, I ended up deciding to do this in two steps.
First, I check that the whole expression is syntactically valid; second, I break it into individual pieces.
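A minimal sketch of these two steps might look like the following. It assumes the validation step reuses the anchored regex from the question and the splitting step reuses the findall pattern; the names ITEM, VALID, ITEM_RE and parse are illustrative, not taken from any library.

```python
import re

# Pattern for a single item: a quoted string or a bare word (from the question).
ITEM = r'"[\w\s]+"|\w+'

# Step 1 pattern: the whole string must be zero or more items,
# each followed by at least one whitespace character.
VALID = re.compile(r'\A(?:(?:%s)\s+)*\Z' % ITEM)

# Step 2 pattern: pick out the individual items.
ITEM_RE = re.compile(ITEM)


def parse(text):
    """Return the list of items, or None if text is not a valid list."""
    if VALID.match(text) is None:
        return None
    # With no capturing groups, findall returns the full matches,
    # so the surrounding quotes are preserved.
    return ITEM_RE.findall(text)


print(parse('foo "bar baz" "bob" '))      # ['foo', '"bar baz"', '"bob"']
print(parse('foo "bar b-&&az" "bob" '))   # None
```

Because a quoted item can never contain a quote character and a bare word can never contain whitespace, the validated string should only split one way, which is why findall is expected to agree with the validating regex here. Note that, as in the original regex, every item (including the last) must be followed by whitespace.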
I’m about 90% sure that this method works correctly for all strings, but I would still be interested if anyone has a more general solution; this seems somewhat kludgy to me.
Thanks to SilentGhost and Alan Moore for the help. I did not know about Python’s csv module or regex lookaheads before, so it would probably help me to learn about those.
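One possible single-pass alternative, offered only as a sketch: match one item plus its trailing whitespace at a time, starting each match where the previous one ended, and give up as soon as a match fails. The names TOKEN and parse_single_pass are hypothetical; the only library feature relied on is the optional pos argument of a compiled pattern's match method.

```python
import re

# One item (group 1) followed by the whitespace that must separate items.
TOKEN = re.compile(r'("[\w\s]+"|\w+)\s+')


def parse_single_pass(text):
    """Return the list of items, or None if text is not a valid list."""
    items = []
    pos = 0
    while pos < len(text):
        # match() with a pos argument only succeeds if a token starts exactly at pos.
        m = TOKEN.match(text, pos)
        if m is None:
            return None
        items.append(m.group(1))
        pos = m.end()
    return items


print(parse_single_pass('foo "bar baz" "bob" '))      # ['foo', '"bar baz"', '"bob"']
print(parse_single_pass('foo "bar b-&&az" "bob" '))   # None
```

This validates and splits in the same loop, so there is no second pattern that could drift out of sync with the first.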