I got some help with this earlier today, but I cannot figure out the last part of my problem. This regex search returns all of the matches for the input in the open file. What I also need to do is find which part of the file each match comes from.
Each section is opened and closed with a tag. For example, one of the sections opens with <opera> and ends with </opera>. When I find a match, I want to either go backwards to the opening tag or forwards to the closing tag and include the tag's name, in this case “opera”, in the output. My question is: can I do this with an addition to the regular expression, or is there a better way? Here is the code I have that already works great:
from re import findall
text = open_file.read()
# the test string for this code is "NNP^CC^NNP"
grammarList = raw_input("Enter your grammar string: ")
tags = grammarList.split("^")
tags_pattern = r"\b" + r"\s+".join(r"(\w+)/{0}".format(tag) for tag in tags) + r"\b"
# for the test string this gives r"\b(\w+)/NNP\s+(\w+)/CC\s+(\w+)/NNP\b"
print(findall(tags_pattern, text))
One way to do it would be to find all occurrences of your start and end section tags (say they’re <opera> and </opera>), get their indices, and compare them to the position of each match of tags_pattern. This uses finditer, which is like findall but returns match objects that carry indices too.
(Note: for a match object m, you can get the matched text with m.group(). With no argument, group() defaults to group 0 (i.e. the entire match), and you can use m.group(i) for the ith capturing group.)
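Here is a minimal sketch of that index-comparison idea. It assumes sections are never nested and that tag names are plain words. The sample text and the helper section_of are made up for illustration; in your code, text would come from open_file.read(). For brevity it pairs each opening tag with its closing tag in a single regex instead of collecting the two lists of indices separately.

```python
import re

# Hypothetical sample standing in for open_file.read(); the section names
# and tagged words are invented for this example.
text = (
    "<opera> Tristan/NNP and/CC Isolde/NNP sing/VBP </opera>\n"
    "<ballet> Romeo/NNP and/CC Juliet/NNP dance/VBP </ballet>\n"
)

tags = "NNP^CC^NNP".split("^")
tags_pattern = r"\b" + r"\s+".join(r"(\w+)/{0}".format(tag) for tag in tags) + r"\b"

# Record (start, end, name) for every <name> ... </name> section.
# Assumption: sections are not nested, so a non-greedy match up to the
# first matching close tag is enough.
section_pattern = r"<(\w+)>.*?</\1>"
sections = [(s.start(), s.end(), s.group(1))
            for s in re.finditer(section_pattern, text, re.DOTALL)]


def section_of(pos):
    """Return the name of the section containing index pos, or None."""
    for start, end, name in sections:
        if start <= pos < end:
            return name
    return None


# Pair each grammar match with the section its start index falls in.
results = [(section_of(m.start()), m.groups())
           for m in re.finditer(tags_pattern, text)]
print(results)
```

The linear scan in section_of is fine for a handful of sections. If the file has many, the sections list is already sorted by start index, so a binary search (for example with the bisect module) would be a reasonable swap.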