My input file is a large txt file with concatenated texts I got from

Question

0

Asked: May 31, 20262026-05-31T22:18:32+00:00 2026-05-31T22:18:32+00:00

My input file is a large txt file with concatenated texts I got from

0

My input file is a large txt file with concatenated texts I got from an open text library. I am now trying to extract only the content of the book itself and filter out other stuff such as disclaimers etc. So I have around 100 documents in my large text file (around 50 mb).

I then have identified the start and end markers of the contents themselves, and decided to use a Python regex to find me everything between the start and end marker. To sum it up, the regex should look for the start marker, then match everything after it, and stop looking once the end marker is reached, then repeat these steps until the end of the file is reached.

The following code works flawlessly when I feed a small, 100kb sized file into it:

import codecs
import re

outfile = codecs.open("outfile.txt", "w", "utf-8-sig")
inputfile = codecs.open("infile.txt", "r", "utf-8-sig")
filecontents = inputfile.read()
for result in re.findall(r'START\sOF\sTHE\sPROJECT\sGUTENBERG\sEBOOK.*?\n(.*?)END\sOF\THE\sPROJECT\sGUTENBERG\sEBOOK', filecontents, re.DOTALL):
    outfile.write(result)
outfile.close()

When I use this regex operation on my larger file however, it will not do anything, the program just hangs. I tested it overnight to see if it was just slow and even after around 8 hours the program was still stuck.

I am very sure that the source of the problem is the
(.*?)
part of the regex, in combination with re.DOTALL.
When I use a similar regex on smaller distances, the script will run fine and fast.
My question now is: why is this just freezing up everything? I know the texts between the delimiters are not small, but a 50mb file shouldn’t be too much to handle, right?
Am I maybe missing a more efficient solution?

Thanks in advance.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-31T22:18:33+00:00

You are correct in thinking that using the sequence .*, which appears more than once, is causing problems. The issue is that the solver is trying many possible combinations of .*, leading to a result known as catastrophic backtracking.

The usual solution is to replace the . with a character class that is much more specific, usually the production that you are trying to terminate the first .* with. Something like:

`[^\n]*(.*)`

so that the capturing group can only match from the first newline to the end. Another option is to recognize that a regular expression solution may not be the best approach, and to use either a context free expression (such as pyparsing), or by first breaking up the input into smaller, easier to digest chunks (for example, with corpus.split('\n'))

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

My input file is a large txt file with concatenated texts I got from

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply