I need to find version numbers within text and replace them with a generic placeholder, e.g. ‘*’.
The problem is writing a regex that captures the version numbers.
Some examples:
Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.1 (KHTML, like Gecko) Ubuntu/11.04 Chromium/14.0.825.0 Chrome/14.0.825.0 Safari/535.1
Mozilla/5.0(iPad; U; CPU iPhone OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B314 Safari/531.21.10gin_lib.cc
Mozilla/5.0 (Windows; U; Windows NT 5.1; pt-PT; rv:1.9.2.7) Gecko/20100713 Firefox/3.6.7 (.NET CLR 3.5.30729)
Version numbers contain:
- alphanumeric characters
- special characters i.e. ‘.-_:’
A simple regex might be r'[0-9._:-]+', but this does not work: a version number needs at least one alphanumeric character, and the special characters may only appear between alphanumeric characters.
Any ideas?
Use the sub function from the re module. It returns a string in which every match of the regex is replaced either by a fixed string or by the return value of a function. The problem is determining which version numbers in each string you want to replace. I’m assuming that you want all version numbers replaced.
gives these results:
The regex isn’t very good. I wanted a repeating set of alphanumerics, each followed by a delimiter, but I couldn’t get it to work. Something like:
([0-9a-zA-Z]+[._:-])+
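One likely reason the pattern above falls short is that every group must end in a delimiter, so for ‘14.0.825.0’ it matches only ‘14.0.825.’ and leaves the final ‘0’ behind. A possible fix is to put the delimiter in front of each repeated group instead. The following is only a sketch under two assumptions that are not in the question: a version starts with a digit (so that ‘en-us’ and ‘pt-PT’ are left alone), and a version does not begin in the middle of a word (so that ‘X11’ and ‘i686’ are left alone). The names VERSION_RE and mask_versions are made up for illustration.

```python
import re

# Assumption (not from the question): a version starts with a digit,
# so tokens such as "en-us" or "pt-PT" are not treated as versions.
VERSION_RE = re.compile(
    r"(?<![0-9A-Za-z])"          # do not start inside a word such as "X11" or "i686"
    r"\d[0-9A-Za-z]*"            # first group: a digit, then optional alphanumerics
    r"(?:[._:-][0-9A-Za-z]+)+"   # one or more "delimiter + alphanumerics" groups
)


def mask_versions(text, placeholder="*"):
    """Replace every version-like token in text with placeholder."""
    # A function is passed as the replacement so the placeholder is inserted literally.
    return VERSION_RE.sub(lambda match: placeholder, text)


ua = ("Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.1 (KHTML, like Gecko) "
      "Ubuntu/11.04 Chromium/14.0.825.0 Chrome/14.0.825.0 Safari/535.1")
print(mask_versions(ua))
# Mozilla/* (X11; Linux i686) AppleWebKit/* (KHTML, like Gecko) Ubuntu/* Chromium/* Chrome/* Safari/*
```

Because at least one delimiter is required, tokens without one, such as ‘20100713’ or ‘7B314’, are not replaced. If those should be masked too, the pattern would need a separate rule for them.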