Editorial Team
Asked: June 9, 2026

I need an efficient and (reasonably) reliable way to strip HTML tags from documents.

I need an efficient and (reasonably) reliable way to strip HTML tags from documents. It needs to be able to handle some fairly adverse circumstances:

  • It’s not known ahead of time whether a document contains HTML at all.
  • More likely than not, any HTML will be very poorly formatted.
  • Individual documents might be very large, perhaps hundreds of megabytes.
  • Non-HTML content might still be littered with angle brackets for whatever odd reason, so naive regular expressions along the lines of <.+/?> are a no go. (And stripping XML is less desirable, anyway.)
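To illustrate the last point, here is a minimal Python sketch of how a naive pattern like <.+/?> misbehaves on text with stray angle brackets. The sample string and the name `NAIVE_TAG_RE` are made up for illustration; only the pattern comes from the bullet above.

```python
import re

# Naive pattern from the list above; `.+` is greedy, so it runs from the
# first "<" on the line to the last ">" on the line.
NAIVE_TAG_RE = re.compile(r"<.+/?>")

# Hypothetical sample: a comparison, one real tag pair, another comparison.
sample = "if a < b then <i>c</i> is > d"

stripped = NAIVE_TAG_RE.sub("", sample)
print(repr(stripped))  # 'if a  d'
```

Everything between the first and last angle bracket is swallowed, including the non-HTML text.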

I’m currently using HTML Agility Pack, and it’s just not cutting the mustard. Performance is poorer than I’d like, it doesn’t always handle truly awful formatting as gracefully as it could, and lately I’ve been running into problems with stack overflows on some of the more upsettingly large files.

I suspect that all of these problems stem from the fact that it’s trying to actually parse the data, which makes it a poor fit for my needs. I don’t want a syntax tree; I just want (most of) the tags to go away.

Regular expressions seem like the obvious candidate, but then I remember this famous answer, and it makes me worry that’s not such a great idea. That diatribe’s points, though, are focused on parsing, not necessarily on dumb tag-stripping. So are regexes OK for this purpose?

Assuming it isn’t a terrible idea, suggestions for regex that would do a good job are very welcome.

1 Answer

  1. Editorial Team
    Added an answer on June 9, 2026 at 9:42 pm

    This regex matches tags while ignoring any angle brackets that appear inside quoted attribute values.

    <[a-zA-Z0-9/_-]+?((".*?")|([^<"']+?)|('.*?'))*?>
    

    It cannot detect escaped quotes inside quoted values (but I think that is unnecessary in HTML).
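As a rough illustration, the pattern can be tried out in Python like this. The names `TAG_RE` and `strip_tags` and the sample string are hypothetical; the pattern itself is copied unchanged from the answer.

```python
import re

# The pattern from the answer, unchanged. A triple-quoted raw string is used
# because the pattern contains both kinds of quote characters.
TAG_RE = re.compile(r"""<[a-zA-Z0-9/_-]+?((".*?")|([^<"']+?)|('.*?'))*?>""")


def strip_tags(text: str) -> str:
    """Remove everything the pattern recognises as a tag."""
    return TAG_RE.sub("", text)


# Hypothetical sample: a ">" inside a quoted attribute value, followed by
# angle brackets that are not part of any tag.
sample = '<p class="a>b">Hello</p> and 1 < 2 but 3 > 2'
print(strip_tags(sample))  # Hello and 1 < 2 but 3 > 2
```

The quoted `a>b` does not end the tag early, and `< 2` is left alone because no tag-name character follows the `<`. Nested lazy quantifiers like these can backtrack heavily on a `<name` that is never closed, so it is worth timing the pattern on representative input before relying on it for very large files.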

    Keeping a list of all allowed tags and substituting it into the first part of the regex, as in <(tag1|tag2|...), could give a more precise solution. I’m afraid an exact solution can’t be found given your assumption about stray angle brackets; consider, for example, something like <a href="test.html"> b<a </a>…
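A minimal sketch of the tag-list variant in Python follows. The tag list, the helper name `build_tag_regex`, the optional leading `/` for closing tags, the `\b` after the tag name (so that `p` does not match inside `<pre>`), and the case-insensitive flag are all assumptions added for illustration; only the attribute part of the pattern is taken from the answer.

```python
import re

# Attribute/tail part of the answer's pattern, unchanged.
TAIL = r"""((".*?")|([^<"']+?)|('.*?'))*?>"""

# Hypothetical list of tag names to strip.
TAGS = ("a", "p", "b", "div")


def build_tag_regex(tags):
    """Build a pattern that only matches tags whose name is in `tags`."""
    names = "|".join(re.escape(t) for t in tags)
    # Assumptions: "/?" admits closing tags, "\b" ends the name cleanly.
    return re.compile(r"</?(?:" + names + r")\b" + TAIL, re.IGNORECASE)


LISTED_TAG_RE = build_tag_regex(TAGS)

print(LISTED_TAG_RE.sub("", "if x<y then <b>bold</b> <foo>"))
# if x<y then bold <foo>

# The ambiguous example from the answer: the unclosed "<a " is left behind.
print(repr(LISTED_TAG_RE.sub("", '<a href="test.html"> b<a </a>…')))
# ' b<a …'
```

Unknown names such as `<foo>` are left in place, which helps when the non-HTML content contains angle brackets, but the second example shows the ambiguity does not go away entirely.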

    EDIT:

    I updated the regex; it performs a lot better than the previous version. Moreover, if you need to strip out code, I suggest doing a little cleaning before the first pass, for example replacing <script.+?</script> with nothing.
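The two-pass idea might look like this in Python. The names `SCRIPT_RE`, `TAG_RE`, and `strip_html` and the sample document are hypothetical, and the `DOTALL` and `IGNORECASE` flags are assumptions: without `DOTALL`, `.` would not cross line breaks and multi-line scripts would survive.

```python
import re

# Pre-cleaning pattern from the answer. DOTALL lets "." span newlines and
# IGNORECASE also catches <SCRIPT>; both flags are assumptions.
SCRIPT_RE = re.compile(r"<script.+?</script>", re.IGNORECASE | re.DOTALL)

# Tag pattern from the answer, unchanged.
TAG_RE = re.compile(r"""<[a-zA-Z0-9/_-]+?((".*?")|([^<"']+?)|('.*?'))*?>""")


def strip_html(text: str) -> str:
    """Drop script blocks (tags and body) first, then the remaining tags."""
    without_scripts = SCRIPT_RE.sub("", text)
    return TAG_RE.sub("", without_scripts)


# Hypothetical sample document with a multi-line script block.
doc = (
    "<p>Hi</p>"
    '<script type="text/javascript">\n'
    'if (a < b) { alert("x>y"); }\n'
    "</script>"
    "<p>Bye</p>"
)
print(strip_html(doc))  # HiBye
```

Running the script pass first matters because the tag pattern alone would remove only the `<script>` and `</script>` tags and leave the JavaScript source in the output.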
