Conditions updated
There is often a situation where you want to extract a substring upto (immediately before) certain characters. For example, suppose you have a text that:
- Does not start with a semicolon or a period,
- Contains several sentences,
- Does not contain any “\n”, and
- Ends with a period,
and you want to extract the sequence from the start upto the closest semicolon or period. Two strategies come to mind:
/[^;.]*//.*?[;.]/
I do either of these quite randomly, with slight preference to the second strategy, and also see both ways in other people’s code. Which is the better way? Is there a clear reason to prefer one over the other, or are there better ways? I personally feel, efficiency aside, that negating something (as with [^]) is conceptually more complex than not doing it. But efficiency may also be a good reason to chose one over the other.
I came up with my answer. The two regexes in my question were actually not expressing the same thing. And the better approach depends on what you want.
If you want a match up to and including a certain character, then using
/.*?[;.]/
is simpler.
If you want a match up to right before (excluding) a certain character, then you should use:
/[^;.]*/