I’m looking for a RegEx that returns either the first [n] words in a paragraph or, if the paragraph contains fewer than [n] words, the complete paragraph.
For example, assuming I need, at most, the first 7 words:
<p>one two <tag>three</tag> four five, six seven eight nine ten.</p><p>ignore</p>
I’d get:
one two <tag>three</tag> four five, six seven
And the same RegEx on a paragraph containing fewer than the requested number of words:
<p>one two <tag>three</tag> four five.</p><p>ignore</p>
Would simply return:
one two <tag>three</tag> four five.
My attempt at the problem resulted in the following RegEx:
^(?:\<p.*?\>)((?:\w+\b.*?){1,7}).*(?:\</p\>)
However, this returns just the first word – “one”. It doesn’t work. I think the .*? (after the \w+\b) is causing problems.
Where am I going wrong? Can anyone present a RegEx that will work?
FYI, I’m using .NET 3.5’s RegEx engine (via C#).
Many thanks
OK, complete re-edit to acknowledge the new “spec” 🙂
I’m pretty sure you can’t do that with one regex. The best tool definitely is an HTML parser. The closest I can get with regexes is a two-step approach.
First, isolate each paragraph’s contents with:
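A minimal sketch of this first step, assuming paragraphs are never nested and no attribute value contains a ">" character. The pattern, the ParagraphExtractor class and its Extract method are illustrative choices, not taken from the answer itself:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ParagraphExtractor
{
    // Assumed pattern: lazily capture everything between <p ...> and the
    // next </p>. The \b keeps "<p" from also matching tags such as <pre>.
    static readonly Regex ParagraphPattern = new Regex(
        @"<p\b[^>]*>(.*?)</p>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the inner text of every paragraph, in document order.
    public static List<string> Extract(string html)
    {
        var contents = new List<string>();
        foreach (Match m in ParagraphPattern.Matches(html))
        {
            contents.Add(m.Groups[1].Value);
        }
        return contents;
    }

    static void Main()
    {
        string html = "<p>one two <tag>three</tag> four five, six seven eight nine ten.</p><p>ignore</p>";
        foreach (string paragraph in Extract(html))
        {
            Console.WriteLine(paragraph);
        }
    }
}
```

The lazy (.*?) stops at the first closing </p>, so each paragraph ends up in its own match instead of one match swallowing the whole document.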
You need to set RegexOptions.Singleline if paragraphs can span multiple lines.
Then, in a next step, iterate over your matches and apply the following regex once on each match’s Groups[1].Value:
That will match the first seven items separated by spaces/tabs/newlines, ignoring any trailing punctuation or non-word characters.
BUT it will treat a tag separated by spaces as one of those items, i.e. in
it will only match up until six. I guess that regex-wise, there’s no way around that.
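The second step and the limitation just described can be sketched as follows. The pattern and the FirstWords.Take helper are assumptions made for illustration, and the third input is a made-up example of a tag surrounded by spaces:

```csharp
using System;
using System.Text.RegularExpressions;

static class FirstWords
{
    // Hypothetical helper: returns at most maxItems whitespace-separated
    // items from the start of text, ending on a word character.
    public static string Take(string text, int maxItems)
    {
        if (maxItems < 1) return string.Empty;

        // (?:\S+\s+){0,n-1}  up to n-1 items, each followed by whitespace
        // \S*\w              one more item, cut back to its last word character
        string pattern = @"^\s*((?:\S+\s+){0," + (maxItems - 1) + @"}\S*\w)";
        Match m = Regex.Match(text, pattern);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    static void Main()
    {
        // Tags glued to a word count as part of that word: seven items.
        Console.WriteLine(Take("one two <tag>three</tag> four five, six seven eight nine ten.", 7));
        // Fewer than seven items: everything except the final period.
        Console.WriteLine(Take("one two <tag>three</tag> four five.", 7));
        // A tag surrounded by spaces is counted as an item of its own,
        // so the result stops at "six".
        Console.WriteLine(Take("one two <br> three four five, six seven eight", 7));
    }
}
```

Because the pattern only knows about whitespace and word characters, it cannot tell a free-standing tag from a word. Stripping or parsing the markup first, ideally with an HTML parser, is the only reliable way to count real words.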