I am trying to scan documents and identify where sections of the document begin


Asked: June 15, 2026


I am trying to scan documents and identify where sections of the document begin and end. Sometimes the document has a table of contents that lists page numbers. I do not want to capture the TOC lines, because they do not mark the start of an actual section. I have been messing with this for some time and am stuck: I can't seem to avoid capturing the lines from the table of contents that end in page numbers.

Here is the regular expression

verbose_item_pattern_3 = re.compile(r"""
  ^            # begin match at newline
  \t*          # 0-or-more tabspace
  [ ]*         # 0-or-more blank space
  I            # a capital I
  [tT][eE][mM] # one character from each of the three sets this allows for unknown case
  \t*          # 0-or-more tabspace
  [ ]*         # 0-or-more blankspace
  \d{1,2}      # 1-or-2 digits
  [.]?         # 0-or-1 literal .
  \(?          # 0-or-1 literal open paren
  [a-e]?       # 0-or-1 letter in the range a-e
  \)?          # 0-or-1 closing paren
  .*           # any number of unknown characters so we can have words and punctuation
  [^0-9]       # anything but [0-9]
  $           # 1 newline character
  """, re.VERBOSE|re.MULTILINE)

Here is an example of a line I DO NOT want to capture

test_string='\nItem 6.       TITLE ITEM 6..................................................25\n'

Here is an example of what I do want to capture

test_string='\nItem 6.       TITLE ITEM 6 maybe other words here who knows  \n'

But when I run

re.findall(verbose_item_pattern_3,test_string)

the result is

['Item 6.       TITLE ITEM 6..................................................25\n']

What I find interesting is what happens if my test string is this

test_string='PART I\nItem 1.       TITLE ITEM 1...................................................1\nItem 2.       TITLE ITEM 2..................................................21\n'

and run that with
re.findall(verbose_item_pattern_3,test_string)

the result is closer to what I want but still not correct

['Item 2.       TITLE ITEM 2..................................................21\n']

Nothing should be captured here.


Editorial Team · Answer 1 · 2026-06-15T21:57:20+00:00

Your regex matches because of three things.

1. Most of the pattern is optional, so it is very unspecific.
2. The .* consumes the rest of the line, trailing digits included, so the final [^0-9] never gets to test the last visible character of the line. That is because:
3. The newline character itself satisfies [^0-9], so [^0-9] can match successfully even though the line ends in a number.
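One way to check the last two points is to look at how the match ends. The sketch below reuses the pattern from the question with the comments stripped; the names original_pattern, toc_line and two_line_toc, and the 50-dot leader, are only illustrative.

```python
import re

# The pattern from the question, unchanged apart from removed comments.
original_pattern = re.compile(r"""
  ^
  \t*
  [ ]*
  I
  [tT][eE][mM]
  \t*
  [ ]*
  \d{1,2}
  [.]?
  \(?
  [a-e]?
  \)?
  .*
  [^0-9]
  $
  """, re.VERBOSE | re.MULTILINE)

# Illustrative TOC line: title, dot leader, page number, newline.
toc_line = '\nItem 6.       TITLE ITEM 6' + '.' * 50 + '25\n'
match = original_pattern.search(toc_line)

# The match ends in "25\n": [^0-9] consumed the newline, not a visible character.
print(repr(match.group()[-3:]))

# With two TOC lines, only the last one matches. After [^0-9] has consumed
# the first line's newline, $ can only succeed at the very end of the string.
two_line_toc = ('PART I\n'
                'Item 1.       TITLE ITEM 1' + '.' * 50 + '1\n'
                'Item 2.       TITLE ITEM 2' + '.' * 50 + '21\n')
print(original_pattern.findall(two_line_toc))
```

This reproduces both results from the question: the single TOC line is captured together with its newline, and in the two-line case only the final line is returned.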

The minimal change would be to use a negative look-behind at the end:

import re

verbose_item_pattern_3 = re.compile(r"""
  ^            # start-of-line
  \t*          # 0-or-more tabspace
  [ ]*         # 0-or-more blank space
  I            # a capital I
  [tT][eE][mM] # one character from each of the three sets this allows for unknown case
  \t*          # 0-or-more tabspace
  [ ]*         # 0-or-more blankspace
  \d{1,2}      # 1-or-2 digits
  [.]?         # 0-or-1 literal .
  \(?          # 0-or-1 literal open paren
  [a-e]?       # 0-or-1 letter in the range a-e
  \)?          # 0-or-1 closing paren
  .*           # any number of unknown characters so we can have words and punctuation
  $            # end-of-line
  (?<![0-9])   # NOT preceded by a decimal digit (via look-behind)
  """, re.VERBOSE|re.MULTILINE)

Note that neither the ^ nor the $ actually matches a newline character. They match the position right after (^) or the position right before ($) a newline character. The anchors themselves never consume the newline, so it only ends up in a match when some other part of the pattern, such as [^0-9], consumes it.

I’ve changed their comments to the more precise start-of-line and end-of-line for that reason.
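To make the position-versus-character distinction concrete, here is a small sketch on a made-up two-line string (the variable names are illustrative only). It lists where ^ and $ match in MULTILINE mode and shows that those matches have zero width, while a negated character class does consume the newline.

```python
import re

text = "first\nsecond"

# ^ matches at the start of the string and right after each newline.
starts = [m.start() for m in re.finditer(r"^", text, re.MULTILINE)]
# $ matches right before each newline and at the end of the string.
ends = [m.start() for m in re.finditer(r"$", text, re.MULTILINE)]
# Every anchor match is empty: anchors match positions, not characters.
widths = {len(m.group()) for m in re.finditer(r"^|$", text, re.MULTILINE)}

print(starts)  # [0, 6]
print(ends)    # [5, 12]
print(widths)  # {0}

# A negated character class, by contrast, consumes the newline itself.
consumed = re.search(r"[^0-9]$", "25\n", re.MULTILINE)
print(repr(consumed.group()))  # '\n'
```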

Also note that a negative look-behind can be applied even after the $. Placed there, it is evaluated only once the engine has reached the end of the line, which helps limit backtracking and keeps the regex fast.
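As a quick sanity check, the sketch below runs the look-behind version against the test strings from the question. The pattern is the same one, condensed onto fewer lines (verbose mode ignores the whitespace); fixed_pattern and the test-string names are illustrative, not part of the original code.

```python
import re

# Same pattern as above, condensed; the only change from the question's
# version is that [^0-9] was replaced by a look-behind after the $.
fixed_pattern = re.compile(r"""
  ^ \t* [ ]* I [tT][eE][mM] \t* [ ]*
  \d{1,2} [.]? \(? [a-e]? \)?
  .* $
  (?<![0-9])   # the line must not end in a decimal digit
  """, re.VERBOSE | re.MULTILINE)

toc_line = '\nItem 6.       TITLE ITEM 6' + '.' * 50 + '25\n'
heading = '\nItem 6.       TITLE ITEM 6 maybe other words here who knows  \n'
two_line_toc = ('PART I\n'
                'Item 1.       TITLE ITEM 1' + '.' * 50 + '1\n'
                'Item 2.       TITLE ITEM 2' + '.' * 50 + '21\n')

print(fixed_pattern.findall(toc_line))      # TOC line is rejected
print(fixed_pattern.findall(heading))       # heading is kept, without the newline
print(fixed_pattern.findall(two_line_toc))  # nothing is captured

# Caveat: a real heading that happens to end in a digit is rejected too.
print(fixed_pattern.findall('Item 6\n'))
```

One thing to keep in mind with this approach: any heading line whose last character is a digit, such as a bare "Item 6", is treated like a TOC entry and skipped.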
