Consider a string s = "aa,bb11,22 , 33 , 44,cc , dd ".
I would like to split s into the following list of tokens using the regular expressions module in Python, which is similar to the functionality offered by Perl:
"aa,bb11", "22", "33", "44,cc , dd "
Note:
- I want to tokenise on commas, but only if those commas have numbers to either side.
- Any (optional) whitespace around these “numerical commas” that I’m targeting should be removed in the result. The optional whitespace may be more than a single space.
- Any other whitespace should be left as it appears in the original string.
My best attempt so far is the following:
import re
pattern = r'(?<=\d)(\s*),(\s*)(?=\d)'
s = 'aa,bb11,22 , 33 , 44,cc , dd '
print(re.compile(pattern).split(s))
but this prints:
['aa,bb11', '', '', '22', ' ', ' ', '33', ' ', ' ', '44,cc , dd ']
which is close to what I want, in that the four tokens I want are all in the list. I could go through and drop the empty strings and any strings that consist only of spaces or commas, but I’d rather have a single regex that does all of this for me.
Any ideas?
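Don’t put capture groups around the \s*, because re.split also returns the text matched by every capturing group in the pattern (hence the extra '' and ' ' entries):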
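As a minimal sketch, reusing the sample string and lookarounds from the question and only dropping the parentheses around each \s* (the variable name tokens is just illustrative):

```python
import re

s = 'aa,bb11,22 , 33 , 44,cc , dd '

# Same lookbehind/lookahead as in the question, but \s* is no longer
# wrapped in parentheses, so re.split has no captured text to return.
pattern = r'(?<=\d)\s*,\s*(?=\d)'

tokens = re.split(pattern, s)
print(tokens)  # ['aa,bb11', '22', '33', '44,cc , dd ']
```

The whitespace around each "numerical comma" is consumed as part of the separator, so it disappears from the result, while whitespace elsewhere is left untouched. If grouping were ever needed inside the separator, a non-capturing group such as (?:\s*) should behave the same way, since only capturing groups are added to the output of split.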