I’m trying to extract “entries” from a text file using a regular expression. Each line of the file is a separate entry unless the line begins with whitespace, in which case that line is a continuation of the previous line.
Example:
import re
INPUT = """\
This is entry 1.
This
 is
 entry 2.
And this is entry 3.
This
 is
 entry
 4."""
OUTPUT = ["This is entry 1.",
"This\n is\n entry 2.",
"And this is entry 3.",
"This\n is\n entry\n 4."]
# What should the pattern be?
PATTERN = re.compile("(.+)(?=\n|$)")
assert PATTERN.findall(INPUT) == OUTPUT
What should PATTERN be to match all the entries?
I think I figured it out.
The trick is to match either “.” (which doesn’t match newlines) or a newline followed by whitespace, one or more times.
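As a minimal sketch of that idea (the names ENTRY_PATTERN and split_entries are just illustrative, and I'm assuming continuation lines start with a space or a tab), the pattern could look like this:

```python
import re

# One entry = one or more of:
#   .        any character except a newline
#   \n[ \t]  a newline immediately followed by a space or tab (continuation line)
# Assumption: [ \t] is used instead of \s so that a blank line (two newlines
# in a row) is not treated as a continuation.
ENTRY_PATTERN = re.compile(r"(?:.|\n[ \t])+")


def split_entries(text):
    """Return the list of entries in text, keeping continuation lines attached."""
    # The group is non-capturing, so findall returns the full matches.
    return ENTRY_PATTERN.findall(text)


INPUT = """\
This is entry 1.
This
 is
 entry 2.
And this is entry 3.
This
 is
 entry
 4."""

OUTPUT = ["This is entry 1.",
          "This\n is\n entry 2.",
          "And this is entry 3.",
          "This\n is\n entry\n 4."]

assert split_entries(INPUT) == OUTPUT
```

The group has to be non-capturing, otherwise findall returns only the last repetition of the group rather than the whole entry. Writing \n\s instead of \n[ \t] also works for this input, but \s matches a newline too, so entries separated by a blank line would be merged.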