Let me start off by saying I need a regex only solution.
I’m trying to pull a description from HTML files with a third-party program. The program is Java based, but I cannot modify its source code in any way. The program I submit the regex into already has another regex that designates where to grab the description from on every page. It has a handy feature that breaks that text down further into an array if you define the matches within it.
I want to match every sentence in the description regardless of if it is a list item or not. Getting rid of the tags would be ideal since they are causing problems using \b to designate where to start the match.
At first I thought I could just write a regex that captures everything between a word boundary and a sentence-ending character, something like this: \b([^.!]+)[.!] Then I noticed a problem: the description will sometimes have an additional part with list items. What complicates it even more is that sometimes the first part of a list item will be bolded or italicized. Even more rarely there might be a random <br> and </br> tag in there for reasons I don’t understand…
Here is an example description of the common layout from a hilarious article:
Children around the world are constantly exposed to the evil “Dolan”, an evil
duck who encourages rape, murder, pedophilia, stealing, homosexuality and a range
of other sins. ”Dolan” is considered a “meme”: an image that makes its way
around the internet via social networks such as Myspace, Friendster, or
Wikipedia.
<li>The duck is based on the character “Donald” created by the company Disney.
</li><li><b>Dolan, however</b>, is more overtly satanic and enjoys commit crimes
and offending Christianity. </li><li>He is best known for a series of internet
comics created in the socialist nation of Finland. </li><li><i>Being part of
Scandinavia</i>, the Finnish are clearly followers of Satan and Skrillex. </li>
<li>The comics are written in poor English to distract the viewer from how evil
and offensive they truly are.</li>
I tried a couple different things, but am still quite a regex noob and got a variety of different returns that didn’t work correctly. This one broke everything up starting with whatever letter was in a tag:
(?:<li>|<b>|<i>)?\b([^.!<]+)[.!< ][<lbi/ ]
Above code gives an array like this (order gets randomized or at least organized in a way I don’t understand)
i>
Being Part of Scandinavia
i>
b>
Dolan, however
b>
A nearly identical pattern would leave in some of the HTML tags, which I assume is because li> satisfies the word boundary requirement. Note: there is a space at the end of the pattern below.
\b([^.!<]+)[.!]
This gives an array like this
li>The duck is based on the character “Donald”...
li>li>b>Dolan, however/b>, is more overtly satanic...
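To make the leftover li> easier to reproduce outside the third-party program, here is a minimal Java sketch. The class name, method name, and shortened sample string are made up for illustration, and plain java.util.regex may not behave exactly like the program's own matching. It shows that \b first succeeds between < and l, so the capture begins inside the tag:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the third-party program.
final class WordBoundaryDemo {
    // Same pattern as above, including the trailing space.
    private static final Pattern NAIVE = Pattern.compile("\\b([^.!<]+)[.!] ");

    // Returns the first captured group, or null if nothing matches.
    static String firstCapture(String input) {
        Matcher m = NAIVE.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened stand-in for one list item from the description.
        String item = "<li>The duck is old. </li>";
        // "<" is not a word character and "l" is, so the first word
        // boundary sits right after "<" and the capture starts at "li>".
        System.out.println(firstCapture(item)); // li>The duck is old
    }
}
```

The character class excludes < but not >, which is why only the opening angle bracket is dropped.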
Like I said earlier I’m a noob to regex and am more than certain I’m using the lookahead wrong.
Please help me with a solution! I don’t know what to try next.
PS, I didn’t write the article, I copied it from another website. Not trying to be offensive
Don’t bother with \b, it’s just getting in your way. You don’t really need lookarounds, either. The following regex correctly matches all the sentences in your sample text:
[^\s<>.!?][^<>.!?]*(?:<[^<>]+>[^<>.!?]*)*[.!?]
As with @icrf’s regex, any tag that’s inside a sentence will remain there. Getting rid of those will require a second step; I don’t see any way around that.
To break it down:
[^\s<>.!?] starts matching at the next character that isn’t whitespace, an angle bracket, or sentence punctuation.
[^<>.!?]* continues matching desirable characters, which now includes whitespace.
<[^<>]+>: if it finds a left angle bracket, this part attempts to match an HTML tag. Then it goes back to matching non-special characters with [^<>.!?]*. It continues trading off like that until there are no more tags or non-special characters to consume.
[.!?] finally matches the sentence-ending punctuation.
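For anyone who wants to try both steps outside the third-party program, here is a minimal Java sketch. The class and method names are hypothetical and the sample string is a shortened stand-in for the description. The sentence pattern is the one assembled from the pieces described above. The tag-stripping pass is only one guess at what the second step could look like, and the program may or may not offer an equivalent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the third-party program.
final class SentenceExtractor {
    // Step 1: a sentence starts at a non-space, non-bracket, non-punctuation
    // character, may contain whole tags, and ends at ., ! or ?.
    private static final Pattern SENTENCE = Pattern.compile(
        "[^\\s<>.!?][^<>.!?]*(?:<[^<>]+>[^<>.!?]*)*[.!?]");

    // Step 2 (assumed): remove any tag left inside a matched sentence.
    private static final Pattern TAG = Pattern.compile("<[^<>]+>");

    static List<String> sentences(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = SENTENCE.matcher(html);
        while (m.find()) {
            result.add(TAG.matcher(m.group()).replaceAll(""));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "Intro sentence.\n<li>The duck is old.\n"
            + "</li><li><b>Dolan, however</b>, is evil. </li>";
        // Prints one cleaned sentence per line.
        for (String s : sentences(html)) {
            System.out.println(s);
        }
    }
}
```

Without the second pass, the third match would come back as "Dolan, however</b>, is evil." because the opening <b> sits before the sentence starts and only the closing tag falls inside the match.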