I got some help with this earlier today but I cannot figure out the

Question

Asked: May 29, 2026 (2026-05-29T10:06:42+00:00)

I got some help with this earlier today, but I cannot figure out the last part of the problem. This regex search returns all of the matches for the input in the open file. What I also need to do is find which part of the file each match comes from.

Each section is opened and closed with a tag. For example, one of the sections opens with <opera> and ends with </opera>. When I find a match, I want to go either backwards to the opening tag or forwards to the closing tag and include the name of the tag, in this case “opera”, in the output. Can I do this with an addition to the regular expression, or is there a better way? Here is the code I already have, which works well:

from re import findall

text = open_file.read()
# the test string for this code is "NNP^CC^NNP"
grammarList = raw_input("Enter your grammar string: ")

tags = grammarList.split("^")
tags_pattern = r"\b" + r"\s+".join(r"(\w+)/{0}".format(tag) for tag in tags) + r"\b"
# gives you r"\b(\w+)/NNP\s+(\w+)/CC\s+(\w+)/NNP\b"

print(findall(tags_pattern, text))

1 Answer

Editorial Team · Answer 1 · 2026-05-29T10:06:43+00:00

One way to do it would be to find all occurrences of your start and end section tags (say they’re <opera> and </opera>), get their indices, and compare them to the start of each match of tags_pattern. This uses re.finditer, which is like findall but yields match objects, so you get the indices too. Something like:

import re

# finditer returns an iterator, so wrap it in list() to index and reuse it
startTags = list(re.finditer("<opera>", text))
endTags   = list(re.finditer("</opera>", text))

matches = re.finditer(tags_pattern, text)

# m.start() gives each match's starting index into `text`.
# If <opera> starts at indices 0, 1000, 2345
# and you get a match starting at index 1100,
# then it's in the 1000-2345 block.
for m in matches:
    # indices of every section whose opening tag comes before this match
    sec = [i for i in range(len(startTags)) if startTags[i].start() <= m.start()]
    if len(sec) == 0:
        print("err couldn't find it")
    else:
        sec = sec[-1]  # the nearest opening tag before the match
        # assumes every <opera> has a matching </opera>, in the same order
        print("found in\n" + text[startTags[sec].start():endTags[sec].end()])
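If the file has several kinds of sections and not only <opera>, a variation on the same idea might be to match each whole section first and then search inside it, so the tag name comes along for free. The following is only a rough sketch, written for Python 3: the helper name find_with_section and the sample text are made up for illustration, and it assumes sections look like <name>...</name> and are never nested.

```python
import re

def find_with_section(text, tags_pattern):
    """Return (section_name, captured_words) for each match of tags_pattern.

    Assumption: sections look like <name>...</name> and are not nested.
    """
    # (\w+) captures the tag name; \1 requires the closing tag to repeat it.
    section_re = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)
    results = []
    for section in section_re.finditer(text):
        name, body = section.group(1), section.group(2)
        for m in re.finditer(tags_pattern, body):
            results.append((name, m.groups()))
    return results

# Made-up sample text in the word/TAG format from the question.
sample = (
    "<opera> Tristan/NNP and/CC Isolde/NNP sang/VBD </opera>\n"
    "<ballet> Romeo/NNP and/CC Juliet/NNP danced/VBD </ballet>\n"
)
tags = "NNP^CC^NNP".split("^")
tags_pattern = r"\b" + r"\s+".join(r"(\w+)/{0}".format(tag) for tag in tags) + r"\b"

for name, words in find_with_section(sample, tags_pattern):
    print(name, words)
```

This trades the index bookkeeping for a second regex pass per section. If sections can nest or tags can be unbalanced, comparing indices as above (or using a real parser) is probably the safer route.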

(Note: you can get the matched text with m.group(). With no argument, group() returns group 0, i.e. the entire match, and you can use m.group(i) for the ith capturing group.)
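To make the note concrete, here is a small self-contained illustration of the match-object methods involved. The sample string and the shortened pattern are made up for the example; the calls themselves (group, start, end) are standard re match-object methods.

```python
import re

m = re.search(r"(\w+)/NNP\s+(\w+)/CC", "Tristan/NNP and/CC Isolde/NNP")

whole = m.group()            # same as m.group(0): the entire matched text
first = m.group(1)           # text of the first capturing group
second = m.group(2)          # text of the second capturing group
span = (m.start(), m.end())  # where the match sits in the searched string

print(whole, first, second, span)
```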
