The Archive Base

Editorial Team
Asked: June 11, 2026

Let me start off by saying I need a regex only solution.

I’m trying to pull a description from HTML files with a third-party program. This program is Java based, but I cannot manipulate the source code in any way. The program I submit the regex into already has another regex script designating where to grab the description from on every page. It has a handy feature to further break that info down into an array if you define the matches within.

I want to match every sentence in the description, whether or not it is a list item. Getting rid of the tags would be ideal, since they cause problems when using \b to designate where to start the match.

At first I thought I could just write a regex that captures everything between a word boundary and a sentence-ending character, something like \b([^.!]+)[.!]. Then I noticed that the description sometimes has an additional part with list items. To complicate things further, the first part of a list item is sometimes bolded or italicized. Even more rarely, there might be a random <br> and </br> tag in there for reasons I don’t understand…

Here is an example description of the common layout from a hilarious article:

Children around the world are constantly exposed to the evil “Dolan”, an evil 
duckwho encourages rape, murder, pedophilia, stealing, homosexuality and a range
of other sins.  ”Dolan” is considered a “meme”: an image that makes its way
around the internet via social networks such as Myspace, Friendster, or
Wikipedia.

<li>The duck is based on the character “Donald” created by the company Disney. 
</li><li><b>Dolan, however</b>, is more overtly satanic and enjoys commit crimes
and offending Christianity. </li><li>He is best known for a series of internet 
comics created in the socialist nation of Finland. </li><li><i>Being part of
Scandinavia</i>, the Finnish are clearly followers of Satan and Skrillex. </li>
<li>The comics are written in poor English to distract the viewer from how evil
and offensive they truly are.</li>

I tried a couple of different things, but I am still quite a regex noob and got a variety of returns that didn’t work correctly. This one broke everything up, starting with whatever letter was in a tag:

(?:<li>|<b>|<i>)?\b([^.!<]+)[.!< ][<lbi/ ]

The code above gives an array like this (the order gets randomized, or at least organized in a way I don’t understand):

i>
Being Part of Scandinavia
i>
b>
Dolan, however
b>

A nearly identical pattern leaves in some of the HTML tags, which I assume is because li> fills the word boundary requirement. Note: there is a space at the end of the code below.

\b([^.!<]+)[.!] 

This gives an array like this:

li>The duck is based on the character “Donald”...
li>li>b>Dolan, however/b>, is more overtly satanic...
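
The leftover li> fragments are consistent with how \b behaves: a word boundary exists between "<" (a non-word character) and "l" (a word character), so a match can begin inside the tag. Below is a minimal sketch of that effect. It uses Python's re module purely for illustration (the tool in question is Java based, and the assumption here is that \b and these character classes behave the same way for this input), and the pattern is a simplified form of the one above, without the capture group and the trailing space.

```python
import re

# Simplified form of the pattern from the question: start at a word boundary
# and run up to the first sentence-ending character. "<" is excluded from the
# character class, but ">" is not.
NAIVE = re.compile(r"\b[^.!<]+[.!]")

snippet = "<li>The duck is based. </li>"

# \b matches between "<" and "l", so the match begins at "li>" rather than
# at "The".
matches = NAIVE.findall(snippet)
print(matches)
```

The match starts one character into the tag, which is why the "<" is missing from the returned text while "li>" is still there.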

Like I said earlier I’m a noob to regex and am more than certain I’m using the lookahead wrong.

Please help me with a solution! I don’t know what to try next.

PS, I didn’t write the article, I copied it from another website. Not trying to be offensive

1 Answer

  1. Editorial Team
    Added an answer on June 11, 2026 at 10:52 pm

    Don’t bother with \b; it’s just getting in your way. You don’t really need lookarounds, either. The following regex correctly matches all the sentences in your sample text. As with @icrf’s regex, any tag that’s inside a sentence will remain there. Getting rid of those will require a second step; I don’t see any way around that.

    [^\s<>.!?][^<>.!?]*(?:<[^<>]+>[^<>.!?]*)*[.!?]
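
As a quick check, here is a minimal sketch that runs the pattern over a few list items. Python's re module is used only for illustration (the assumption is that Java's regex engine treats this pattern the same way, since it uses nothing beyond character classes and a non-capturing group). SAMPLE is an abridged version of the text in the question, with the curly quotes dropped and the second sentence shortened.

```python
import re

# The pattern from the answer, unchanged.
SENTENCE = re.compile(r"[^\s<>.!?][^<>.!?]*(?:<[^<>]+>[^<>.!?]*)*[.!?]")

# Abridged sample based on the question's list items (an assumption made to
# keep the example short; it is not the full original text).
SAMPLE = (
    "<li>The duck is based on the character Donald created by the company Disney. "
    "</li><li><b>Dolan, however</b>, is more overtly satanic. </li>"
    "<li>He is best known for a series of internet comics created in the "
    "socialist nation of Finland. </li>"
)

# findall returns whole matches here because the only group is non-capturing.
sentences = SENTENCE.findall(SAMPLE)
for s in sentences:
    print(repr(s))
```

Each list item yields one match. Tags that precede a sentence, such as <li> and the opening <b>, are skipped, while a tag that falls inside a sentence, such as the closing </b>, stays in the matched text.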
    

    To break it down:

    1. [^\s<>.!?] starts matching at the next character that isn’t whitespace, an angle bracket, or sentence punctuation.

    2. [^<>.!?]* continues matching desirable characters, which now includes whitespace.

    3. <[^<>]+> : If it finds a left angle bracket, this part attempts to match an HTML tag. Then it goes back to matching non-special characters with [^<>.!?]*. It continues trading off like that until there are no more tags or non-special characters to consume.

    4. And finally, [.!?] matches the sentence-ending punctuation.
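
The second step mentioned above, removing tags that remain inside a match, could be sketched as follows. This assumes there is somewhere to post-process the matches, which the tool in the question may not offer. The names TAG and split_sentences are hypothetical, and Python is used only for illustration.

```python
import re

# Sentence pattern from the answer, unchanged.
SENTENCE = re.compile(r"[^\s<>.!?][^<>.!?]*(?:<[^<>]+>[^<>.!?]*)*[.!?]")

# Hypothetical helper pattern: anything that looks like a single tag.
TAG = re.compile(r"<[^<>]+>")


def split_sentences(html):
    """Return the sentences found in html with leftover tags removed.

    Step 1: find sentences with SENTENCE (tags inside a sentence remain).
    Step 2: delete those tags and collapse runs of whitespace, so a line
    break in the middle of a sentence becomes a single space.
    """
    cleaned = []
    for match in SENTENCE.findall(html):
        without_tags = TAG.sub("", match)
        cleaned.append(" ".join(without_tags.split()))
    return cleaned


result = split_sentences(
    "<li><i>Being part of\nScandinavia</i>, the Finnish are here. </li>"
)
print(result)
```

Because sentence endings are detected only by ".", "!" or "?", text containing abbreviations or decimal numbers would be split at those characters as well. That is a limitation of the pattern itself, not of the cleanup step.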

© 2021 The Archive Base. All Rights Reserved