This is a follow-up to my previous question
I would like to find a minimal sequence of characters of length > N, which starts at a word boundary and ends at the end of input.
For example:
N = 5, input = "aaa bbb cccc dd" result = "cccc dd"
I tried \b.{5,}?$ but it matches the whole input rather than the minimal part.
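A short reproduction of that attempt, assuming Python's re module (the regex flavor is an assumption here; the behavior should be the same in other backtracking engines for this pattern):

```python
import re

text = "aaa bbb cccc dd"

# The engine tries start positions from left to right. Position 0 is already
# a word boundary, and the lazy quantifier just keeps growing until $ fits,
# so the first successful match is the whole input.
match = re.search(r"\b.{5,}?$", text)
print(match.group(0))  # aaa bbb cccc dd
```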
What regex would you suggest?
The problem this time isn’t greediness, it’s eagerness. Regexes naturally try to find the earliest possible match, and getting them to find the last one can be tricky. The easiest way is usually the one @Arcadien demonstrated: use .* to gobble up the whole string, then use backtracking to find the match on the rebound.
I have some questions about your requirements, though.
\b can match the beginning or the end of a word, so if (for example) N=5 and the string ends with "foo1 bar2", the result would be " bar2" (notice the leading space). Do you really want a match that starts at the end of a word, or should it drop the space or back up to the beginning of "foo1"? Also, will all words consist entirely of word characters? If there are any non-word characters, \b will be able to match in even more surprising places.
For the regex below, I redefined “word” to mean any complete chunk of non-whitespace characters. The .* starts out by consuming the whole string, then the lookahead, (?=.{5,}), forces it to backtrack five positions before it tries to match anything. The \s forces the match to start at the beginning of a word, so the rest of the regex captures one or more complete words.
This regex won’t match anything that’s less than five characters long or doesn’t contain whitespace. If that’s a problem, let me know and I’ll tweak it.
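A pattern along the lines described can be sketched in Python as follows. The exact expressions, the function names `tail_from_boundary` and `tail_of_whole_words`, and the use of Python's re module are all assumptions made for illustration, not necessarily the pattern originally posted. The first function shows the \b variant and its leading-space surprise; the second uses the whitespace-based definition of a word. Both treat the limit as "at least min_len characters", so pass N + 1 if the tail must be strictly longer than N.

```python
import re
from typing import Optional


def tail_from_boundary(text: str, min_len: int) -> Optional[str]:
    """Shortest tail of at least min_len chars that starts at a word boundary."""
    # Greedy .* consumes everything, then backs off until the rest of the
    # pattern (boundary + at least min_len chars up to the end) can match.
    m = re.match(r".*(\b.{%d,})$" % min_len, text)
    return m.group(1) if m else None


def tail_of_whole_words(text: str, min_len: int) -> Optional[str]:
    """Shortest tail of at least min_len chars made of whole non-whitespace chunks."""
    # \s pins the capture to the first character after whitespace; the
    # lookahead requires at least min_len characters from there to the end.
    m = re.match(r".*\s(?=.{%d,})(\S+(?:\s+\S+)*)$" % min_len, text)
    return m.group(1) if m else None


print(repr(tail_from_boundary("aaa bbb cccc dd", 5)))   # 'cccc dd'
print(repr(tail_from_boundary("foo1 bar2", 5)))         # ' bar2'
print(repr(tail_of_whole_words("aaa bbb cccc dd", 5)))  # 'cccc dd'
print(repr(tail_of_whole_words("x foo1 bar2", 5)))      # 'foo1 bar2'
print(repr(tail_of_whole_words("abcdefgh", 5)))         # None
```

In this sketch the whitespace version needs a whitespace character in front of the captured tail, so it returns None when the only qualifying tail would be the entire input (for example "foo1 bar2" with a minimum of 5).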