I’m working with text that uses spaces as thousands separators, e.g. 400 or 40 000 or 40 000 000 or 4 000 000 000. I need to identify the number in the string. Once identified, there are many options to re-format the number. I’m a rookie at regex. This doesn’t work:
import re
line = '40) He had 120 hours to increase from 40 000 units to 20 000 000.'
regex = re.compile("(\d+ *\d+)")
re.findall(regex, line)
['40', '120', '40 000', '20 000', '000']
The following will do it:
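One pattern that fits the explanation below is `\d+(?:\s+\d+)*`. This is a minimal sketch that reuses the `line` from the question; the variable name `numbers` is just for illustration:

```python
import re

line = '40) He had 120 hours to increase from 40 000 units to 20 000 000.'

# Digits, then zero or more repetitions of (whitespace + digits).
# The group is non-capturing, so findall returns each whole match.
regex = re.compile(r"\d+(?:\s+\d+)*")
numbers = regex.findall(line)
print(numbers)  # ['40', '120', '40 000', '20 000 000']
```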
This uses a non-capturing group, (?:...), that matches one or more whitespace characters (\s+) followed by at least one digit (\d+). The entire non-capturing group can appear zero or more times (*).
It is worth pointing out that it’s generally a good idea to use raw strings (r"" or r'') for Python regular expressions.
Finally, I’d probably tighten up the regex like so:
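One way to write a tightened pattern, as a sketch: it assumes the separator is a single ordinary space (swap in `\s` if other whitespace, such as non-breaking spaces, may appear). The names `tight`, `matches` and `values` are illustrative, and the conversion to `int` is just one of the possible re-formatting steps:

```python
import re

line = '40) He had 120 hours to increase from 40 000 units to 20 000 000.'

# Assumption: every group after the first is preceded by exactly one space
# and consists of exactly three digits.
tight = re.compile(r"\d+(?: \d{3})*")
matches = tight.findall(line)
print(matches)  # ['40', '120', '40 000', '20 000 000']

# One possible re-formatting: drop the separators and convert to int.
values = [int(m.replace(" ", "")) for m in matches]
print(values)  # [40, 120, 40000, 20000000]
```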
This requires every group of digits except the first one to be exactly three digits long.