I have a string in the following format:
some_string = ",,,xxx,,,xxx,,,xxx,,,xxx,,,xxx,,,xxx,,,"
This is the content of a text file called f.
I want to search for a specific term within the xxx fields (let’s say that term is ‘silicon’).
Note that the xxx fields can all be different and can contain any special characters (including metacharacters) except for a newline.
match = re.findall(r",{3}(.*?silicon.*?),{3}", f.read())
print(match)
But this doesn’t work, because it returns results in this format:
["xxx,,,xxx,,,xxx,,,xxx,,,silicon", "xxx,,,xxx,,,xxx,,,xxsiliconxx"]
whereas I only want it to return ["silicon", "xxsiliconxx"].
What am I doing wrong?
Try the following regex:
Example:
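A minimal sketch of a pattern of this kind, assuming fields are delimited by exactly three commas. This is one regex consistent with the explanation in this answer, not necessarily the exact one originally posted; the sample string and the names `sample` and `pattern` are illustrative assumptions.

```python
import re

# Illustrative sample data (an assumption, not the asker's actual file content).
sample = ",,,abc,,,d,e,,,silicon,,,xxsiliconxx,,,foo,,,"

# (?<=,{3})        lookbehind: the field must be preceded by ",,," (not consumed)
# (?:(?!,{3}).)*?  lazily consume characters, but never step onto a ",,," delimiter
# silicon.*?       the search term, then as little as possible...
# (?=,{3})         ...up to the next ",,," (lookahead, not consumed)
pattern = re.compile(r"(?<=,{3})(?:(?!,{3}).)*?silicon.*?(?=,{3})")

matches = pattern.findall(sample)
print(matches)  # ['silicon', 'xxsiliconxx']
```

Note that the field `d,e` contains a single comma and is still treated as one field, because only three consecutive commas count as a delimiter.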
I am assuming that the content in the `xxx` sections can contain commas, just not three consecutive commas, or it would end the field. If the content in the `xxx` sections cannot contain any commas, you can use the following instead:

The reason your current approach doesn’t work is that even though `.*?` will try to match as few characters as possible, the match will still start as early as possible. So, for example, the regex `a*?b` would match the entire string "aaaab". The only time the regex will advance the starting position is when the regex fails to match, and since `,,,` can be matched by the `.*?`, your match will always start at the beginning of the string or just after the previous match.

The lookbehind and lookahead are used to address the issue raised by JaredC in comments: basically, `re.findall()` won’t return overlapping matches, so you need the leading and trailing `,,,` to not be a part of the match.
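The points above can be checked with a short experiment. This is a sketch on made-up sample data; the variable names and the comma-free variant using `[^,\n]*` are assumptions for illustration, not code taken from the original answer.

```python
import re

# 1. A lazy quantifier still starts matching at the earliest possible position.
lazy = re.match(r"a*?b", "aaaab").group()
print(lazy)  # aaaab

# Illustrative sample data (an assumption, not the asker's actual file content).
sample = ",,,abc,,,silicon,,,xxsiliconxx,,,"

# 2. The pattern from the question: the match starts after the first ",,,",
#    so .*? runs across a delimiter before it reaches "silicon".
original = re.findall(r",{3}(.*?silicon.*?),{3}", sample)
print(original)  # ['abc,,,silicon']

# 3. Stopping .*? at delimiters is not enough while the delimiters are consumed:
#    the ",,," after "silicon" is used up, so the adjacent field is skipped.
consuming = re.findall(r",{3}((?:(?!,{3}).)*?silicon.*?),{3}", sample)
print(consuming)  # ['silicon']

# 4. With lookarounds the delimiters are not consumed. If fields never contain
#    commas, a negated character class can stand in for the lazy dot.
no_commas = re.findall(r"(?<=,{3})[^,\n]*silicon[^,\n]*(?=,{3})", sample)
print(no_commas)  # ['silicon', 'xxsiliconxx']
```

Case 3 is the overlapping-match problem: both fields share the `,,,` between them, and `re.findall()` scans on from the end of the previous match.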