My software lets users prepare files with regular expressions. I am in the process of adding a default regexp library of common expressions that can be reused to prepare a variety of formats.
One common task is to remove CRLF line breaks in specific parts of a file, but not in others. For instance, this:
<TU>Lorem
Ipsum</TU>
<SOURCE>This is a sentence
that should not contain
any line break.
</SOURCE>
Should become:
<TU>Lorem
Ipsum</TU>
<SOURCE>This is a sentence that should not contain any line break.
</SOURCE>
I have a regexp that does the job pretty nicely:
(?(?<=<SOURCE>(?:(?!</?SOURCE>).)*)(\r\n))
The problem is that it is processing-intensive: with files above 500 KB, it can take 30+ seconds. (The regex is compiled in this case; uncompiled, it is much slower.)
It's not a big issue, but I wonder if there is a better way to achieve the same result with regex.
Thanks in advance for your suggestions.
Try this:
It starts out by matching \r\n, then uses a lookahead to see if the match is between <SOURCE> and </SOURCE>. It does that by looking for a </SOURCE>, but if it finds <SOURCE> first, it fails. Atomic groups prevent it from saving the state information that would be needed for backtracking, because pass or fail, backtracking is never necessary.
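
As a rough illustration of the approach described, here is a minimal sketch in Python. The thread does not say which language the software is written in, so Python is an assumption, and the pattern is my own reconstruction of the idea rather than necessarily the exact expression meant above. The function name `join_source_lines` is hypothetical. I also added a small `(?!</SOURCE>)` check so the line break right before the closing tag is kept, which matches the expected output in the question. The sketch assumes well-formed input in which every </SOURCE> has a matching <SOURCE> before it.

```python
import re

# \r\n                      the line break to replace
# (?!</SOURCE>)             assumption: keep the break right before the closing tag
# (?=(?>...)</SOURCE>)      look ahead: skip text and any tags that are not
#                           <SOURCE> or </SOURCE>, then require </SOURCE>.
#                           If <SOURCE> comes first, the lookahead fails.
ATOMIC = r"\r\n(?!</SOURCE>)(?=(?>[^<]*(?><(?!/?SOURCE>)[^<]*)*)</SOURCE>)"

try:
    # Atomic groups (?>...) are supported by Python's re module from 3.11 on.
    PATTERN = re.compile(ATOMIC)
except re.error:
    # Older Python: fall back to plain non-capturing groups.
    # The matches are the same; only the backtracking behavior differs.
    PATTERN = re.compile(ATOMIC.replace("(?>", "(?:"))


def join_source_lines(text: str) -> str:
    """Replace CRLF inside <SOURCE>...</SOURCE> with a single space."""
    return PATTERN.sub(" ", text)


if __name__ == "__main__":
    sample = (
        "<TU>Lorem\r\nIpsum</TU>\r\n"
        "<SOURCE>This is a sentence\r\nthat should not contain\r\n"
        "any line break.\r\n</SOURCE>"
    )
    print(repr(join_source_lines(sample)))
```

Because the pattern starts at a literal \r\n and only looks forward, the engine never has to rescan from an opening <SOURCE> tag the way a variable-length lookbehind does. Whether this is actually faster on your files is something to measure rather than assume.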