I have a string of syntactically parsed text:
s = 'ROOT (S (VP (VP (VB the) (SBAR (S (NP (DT same) (NN lecturer)) (VP (VBZ says)'
I’d like to match ‘the same’ against s. It’s key that ‘the’ and ‘same’ only match when they are separated by syntactic markup alone (i.e., (, NP, S, etc.). So ‘the same’ should NOT find a match in s2:
s2 = 'ROOT (S (VP (VP (VB the) (SBAR (S (NP (DT lecturer) (NN same)) (VP (VBZ says)'
I’ve tried a double negative lookahead assertion to no avail:
>>> import re
>>> rx = r'the(?![a-z]*)same(?![a-z]*)'
>>> re.findall(rx, s)
[]
The idea is to match ‘the’ when it is not followed by lowercase characters, and then match ‘same’ when it is not followed by lowercase characters.
Does anyone have a better approach?
So you want to match if all of the characters between ‘the’ and ‘same’ are not lowercase letters. In regex, that means placing a negated character class, [^a-z]*, between the two words: the[^a-z]*same
Note that you might want to add word boundaries as well, so you don’t match something like ‘foothe ... samebar’: \bthe[^a-z]*same\b
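As a rough sketch of the approach described here, the snippet below runs both patterns against the two strings from the question. The names rx_plain and rx_bounded are just illustrative, and the patterns assume that anything other than a lowercase ASCII letter counts as markup. As a side note, the original attempt can never match: [a-z]* is able to match the empty string, so the negative lookahead (?![a-z]*) fails at every position.

```python
import re

s = 'ROOT (S (VP (VP (VB the) (SBAR (S (NP (DT same) (NN lecturer)) (VP (VBZ says)'
s2 = 'ROOT (S (VP (VP (VB the) (SBAR (S (NP (DT lecturer) (NN same)) (VP (VBZ says)'

# Allow only non-lowercase characters between the two words.
rx_plain = re.compile(r'the[^a-z]*same')

# Same idea, with word boundaries so that 'foothe ... samebar' is rejected.
rx_bounded = re.compile(r'\bthe[^a-z]*same\b')

print(rx_bounded.findall(s))    # 'the' and 'same' are separated by markup only
print(rx_bounded.findall(s2))   # 'lecturer' sits between them, so no match

# The unbounded pattern matches inside longer words; the bounded one does not.
print(bool(rx_plain.search('foothe ... samebar')))
print(bool(rx_bounded.search('foothe ... samebar')))
```

If the words can contain uppercase letters or digits, the character class would need adjusting; [^a-z] is only as strict as the question asks for.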