I am trying to scan documents and identify where sections of the document begin and end. Sometimes the document has a table of contents that lists page numbers. I do not want to capture the TOC lines because they do not mark the start of an actual section of the document. I have been messing with this for some time and am stuck: I can't seem to avoid capturing the lines from the table of contents that end in page numbers.
Here is the regular expression
import re
verbose_item_pattern_3 = re.compile(r"""
^ # begin match at newline
\t* # 0-or-more tabspace
[ ]* # 0-or-more blank space
I # a capital I
[tT][eE][mM] # one character from each of the three sets this allows for unknown case
\t* # 0-or-more tabspace
[ ]* # 0-or-more blankspace
\d{1,2} # 1-or-2 digits
[.]? # 0-or-1 literal .
\(? # 0-or-1 literal open paren
[a-e]? # 0-or-1 letter in the range a-e
\)? # 0-or-1 closing paren
.* # any number of unknown characters so we can have words and punctuation
[^0-9] # anything but [0-9]
$ # 1 newline character
""", re.VERBOSE|re.MULTILINE)
Here is an example of a line I DO NOT want to capture:
test_string='\nItem 6. TITLE ITEM 6..................................................25\n'
Here is an example of what I do want to capture
test_string='\nItem 6. TITLE ITEM 6 maybe other words here who knows \n'
But when I run
re.findall(verbose_item_pattern_3,test_string)
the result is
['Item 6. TITLE ITEM 6..................................................25\n']
What is interesting to me is what happens if my test string is this
test_string='PART I\nItem 1. TITLE ITEM 1...................................................1\nItem 2. TITLE ITEM 2..................................................21\n'
and run that with
re.findall(verbose_item_pattern_3,test_string)
the result is closer to what I want but still not correct
['Item 2. TITLE ITEM 2..................................................21\n']
There should not be anything captured
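Your regex matches because of three things.
1. You use `.*`, which eats the entire line, so your last condition `[^0-9]` never gets tested against the last visible character of the line, and that's because:
2. The newline character at the end of the line is not a digit, so it is matched by `[^0-9]`, and the `[^0-9]` can successfully match even though the line ends in a number.
3. With the newline consumed, `$` then succeeds at the end of the string, which is why only the last line of your second test string is returned, with its `\n` included.
The minimal change would be to use a negative look-behind at the end, after the `$`, instead of the `[^0-9]`.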
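A minimal sketch of that change, assuming the rest of the pattern stays exactly as in the question. The name `item_pattern_fixed` and the two test variable names are only illustrative; the `[^0-9]` is gone and a negative look-behind sits after the `$`.

```python
import re

# Sketch only: the question's pattern, with the trailing [^0-9] replaced
# by a negative look-behind placed after the end-of-line anchor.
item_pattern_fixed = re.compile(r"""
^              # start-of-line
\t*            # 0-or-more tabs
[ ]*           # 0-or-more spaces
I              # a capital I
[tT][eE][mM]   # 'tem' in any case
\t*            # 0-or-more tabs
[ ]*           # 0-or-more spaces
\d{1,2}        # 1-or-2 digits
[.]?           # 0-or-1 literal .
\(?            # 0-or-1 literal open paren
[a-e]?         # 0-or-1 letter in the range a-e
\)?            # 0-or-1 closing paren
.*             # the rest of the line
$              # end-of-line
(?<![0-9])     # look-behind: the character just before is not a digit
""", re.VERBOSE | re.MULTILINE)

toc_line = '\nItem 6. TITLE ITEM 6..................................................25\n'
heading_line = '\nItem 6. TITLE ITEM 6 maybe other words here who knows \n'

print(item_pattern_fixed.findall(toc_line))      # []
print(item_pattern_fixed.findall(heading_line))  # ['Item 6. TITLE ITEM 6 maybe other words here who knows ']
```

One consequence of this sketch is that any line whose last character is a digit is rejected, so a bare heading such as `Item 6` with nothing after the number would not match either.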
Note that neither the `^` nor the `$` actually matches a newline character. They match the position right after (`^`) or the position right before (`$`) a newline character, or the very start or end of the string. The newline character itself is never part of the match. I've changed their comments to the more precise start-of-line and end-of-line for that reason.
Also note how I can apply a negative look-behind even after the `$`. Doing it this way is useful to prevent backtracking, making the regex faster.
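The points about anchors and the negated class can be checked with a few short calls. This is a small illustration with made-up sample strings, not part of the original pattern.

```python
import re

text = "first\nsecond\n"

# ^ and $ are zero-width positions: the matched text never contains '\n'.
line_matches = [(m.span(), m.group())
                for m in re.finditer(r"^.+$", text, re.MULTILINE)]
print(line_matches)  # [((0, 5), 'first'), ((6, 12), 'second')]

# A negated class such as [^0-9] does match '\n', which is how a line
# ending in a digit can slip through when [^0-9] is the last token.
newline_hit = re.findall(r"[^0-9]$", "ends in 25\n", re.MULTILINE)
print(newline_hit)  # ['\n']

# A look-behind after $ inspects the last character of the line instead.
kept = re.findall(r"^.+$(?<![0-9])", "ends in 25\nends in text\n", re.MULTILINE)
print(kept)  # ['ends in text']
```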