I’m trying to capture a string up until a certain word that is within some group of words.
I only want to capture the string up until the FIRST instance of one of these words, as they may appear many times in the string.
For example:
Group of words: (was, in, for)
String = “Once upon a time there was a fox in a hole”;
would return “Once upon a time there”
Thank you
What you need is called a Lookahead. The exact regex for your situation is:
Here, ^ matches the beginning of the string, .+? is a lazy match (it matches the shortest possible string), (?= … ) is a lookahead meaning "followed by", and (?: … ) is a non-capturing group, which may or may not be necessary for you.
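To make those pieces concrete, here is a minimal Java sketch. It assumes a pattern of the form ^.+?(?=(?:was|in|for)), built only from the elements described above; the class name LookaheadSketch and the method before are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadSketch {
    // ^       anchor at the start of the string
    // .+?     lazy: consume as few characters as possible (but at least one)
    // (?=...) lookahead: succeed only if one of the words comes next,
    //         without making it part of the match
    // Assumed pattern, assembled from the description in the answer.
    private static final Pattern BEFORE_WORD =
            Pattern.compile("^.+?(?=(?:was|in|for))");

    // Returns the text before the first stop word, or null if none occurs.
    static String before(String input) {
        Matcher m = BEFORE_WORD.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Brackets make the trailing space in the result visible.
        System.out.println("[" + before("Once upon a time there was a fox in a hole") + "]");
    }
}
```

Note that this simple form keeps the space before the stop word and also stops at substrings such as the "was" in "washed".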
For bonus points, you should probably use word boundaries to make sure you’re matching a whole word rather than a substring (without them, "The fox washed" would return "The fox "), and allow leading whitespace in the lookahead to drop the trailing space from the match:
Here \s* matches any amount of whitespace (including none at all) and \b matches the beginning or end of a word. \b is a zero-width assertion, meaning it doesn’t consume an actual character.
Or, in Java:
I think that will work. I haven’t used it, but according to the documentation, that exact string should work. Just had to escape all the backslashes.
Edit
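Here I am, more than a year later, and I just realized the regex above does not do what I thought it did at the time. Alternation has the lowest precedence, not the highest, so an ungrouped | splits the entire lookahead into separate alternatives instead of choosing only between the words. With the words grouped, the pattern is more correctly: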
/^.+?(?=\s*\b(?:was|in|for)\b)/
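As a rough Java sketch of why the grouping matters, the snippet below runs the grouped pattern next to an ungrouped variant. The ungrouped string is written here purely for contrast and is an assumption, not a quote of the earlier regex; the class and method names are hypothetical. Every backslash is doubled because it sits inside a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundarySketch {
    // Grouped: \s*\b ... \b applies to every word in the alternation.
    static final Pattern GROUPED =
            Pattern.compile("^.+?(?=\\s*\\b(?:was|in|for)\\b)");

    // Ungrouped (illustrative assumption): the lookahead holds three
    // alternatives, "\s*\bwas", "in", and "for\b", so "in" matches anywhere.
    static final Pattern UNGROUPED =
            Pattern.compile("^.+?(?=\\s*\\bwas|in|for\\b)");

    // Returns the matched prefix, or null when the pattern does not match.
    static String before(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "Nothing for you";
        System.out.println("[" + before(GROUPED, s) + "]");   // stops at the word "for"
        System.out.println("[" + before(UNGROUPED, s) + "]"); // stops at the "in" inside "Nothing"
    }
}
```

With the grouped pattern, the original example yields "Once upon a time there" with no trailing space, and a string with no whole stop word yields no match at all.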
Compare this new regex to my old one. Additionally, future travelers, you may wish to capture the whole string if no such breaker word exists. Try THIS on for size:
This one uses a negative lookahead, which asserts that the pattern does not match at the current position. It’s possibly slower, but it still does the job.
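One common way to write such a pattern is to test the negative lookahead before every single character, so the match simply runs to the end when no breaker word exists. The Java sketch below assumes the form ^(?:(?!\s*\b(?:was|in|for)\b).)*, which may differ from the exact regex intended above; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegativeLookaheadSketch {
    // (?!...) fails wherever optional whitespace plus a whole stop word
    // comes next; otherwise "." consumes one more character. The loop
    // therefore stops just before the first stop word, or at the end of
    // the line if there is none.
    private static final Pattern UP_TO_WORD =
            Pattern.compile("^(?:(?!\\s*\\b(?:was|in|for)\\b).)*");

    // Returns the text before the first stop word, or the whole (single-line)
    // input when no stop word is present.
    static String upTo(String input) {
        Matcher m = UP_TO_WORD.matcher(input);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println("[" + upTo("Once upon a time there was a fox in a hole") + "]");
        System.out.println("[" + upTo("The quick brown fox") + "]");
    }
}
```

Because . does not match line terminators by default, this sketch only handles single-line input; passing Pattern.DOTALL to Pattern.compile would be needed for multi-line strings.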