The Archive Base

Editorial Team
Asked: May 17, 2026

I am creating an application that will take a URL as input, retrieve the

I am creating an application that will take a URL as input, retrieve the page’s HTML content from the web and extract everything that isn’t contained in a tag. In other words, the textual content of the page, as seen by a visitor to that page. That includes masking out everything enclosed in <script></script>, <style></style> and <!-- -->, since these portions contain text that is not itself inside a tag (but is best left alone).

I have constructed this regex:

(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)

It correctly selects all the content that I want to ignore, and leaves only the page’s text content unmatched. However, that means that what I want to extract won’t show up in the match collection (I am using VB.Net in Visual Studio 2010).

Is there a way to “invert” the matching of a whole document like this, so that I’d get matches on all the text strings that are left out by the matching in the above regex?
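One way to read “inverting” the match is to keep the markup-only pattern and collect the gaps between consecutive matches. The following is a minimal sketch of that idea, not a definitive implementation: the module name, the function name `TextBetweenMatches` and the sample HTML are made up for illustration, and the pattern is written in .NET named-group syntax (`(?<tag>…)` and `\k<tag>`).

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module InvertMatchesDemo
    ' The markup-only pattern from the question, in .NET named-group syntax.
    Private Const MarkupPattern As String = "(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)"

    ' Hypothetical helper: returns every stretch of the input that lies
    ' between two markup matches (plus any leading or trailing stretch).
    Function TextBetweenMatches(ByVal html As String) As List(Of String)
        Dim gaps As New List(Of String)
        Dim pos As Integer = 0
        For Each m As Match In Regex.Matches(html, MarkupPattern, RegexOptions.IgnoreCase)
            ' Anything between the end of the previous match and this one is text.
            If m.Index > pos Then gaps.Add(html.Substring(pos, m.Index - pos))
            pos = m.Index + m.Length
        Next
        ' Text after the last markup match, if any.
        If pos < html.Length Then gaps.Add(html.Substring(pos))
        Return gaps
    End Function

    Sub Main()
        Dim html As String = "<p>Hello <b>world</b></p><!-- hidden -->!"
        For Each gap As String In TextBetweenMatches(html)
            Console.WriteLine("[" & gap & "]")   ' prints [Hello ], [world] and [!]
        Next
    End Sub
End Module
```

Because `m.Index` and `m.Length` refer to positions in the original string, the same loop can also record where each text run starts, which helps when the text has to be written back into the untouched document.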

So far, what I have done is add another alternative at the end that selects “any sequence that doesn’t contain < or >”, which amounts to the leftover text. I put that last alternative in a named capture group, and when I iterate over the matches, I check whether the “text” group captured anything. This works, but I was wondering whether it is possible to do it all in the regex itself and just end up with matches on the plain text.
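The workaround described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the asker’s actual code: the module name, the function name `ExtractTextRuns` and the sample input are hypothetical, the pattern is given in .NET syntax, and the text alternative uses `+` instead of `*` so that empty matches are not produced.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module TextGroupDemo
    ' Markup alternatives first, then a named "text" group for whatever is left over.
    Private Const Pattern As String = "(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)|(?<text>[^<>]+)"

    ' Hypothetical helper: keeps only the matches where the "text" group took part.
    Function ExtractTextRuns(ByVal html As String) As List(Of String)
        Dim runs As New List(Of String)
        For Each m As Match In Regex.Matches(html, Pattern, RegexOptions.IgnoreCase)
            Dim g As Group = m.Groups("text")
            ' Group.Success is False for the markup alternatives.
            If g.Success AndAlso g.Value.Trim().Length > 0 Then
                runs.Add(g.Value)
            End If
        Next
        Return runs
    End Function

    Sub Main()
        Dim html As String = "<p>Hello <b>world</b></p><script>var x = 1 < 2;</script><!-- note -->"
        For Each run As String In ExtractTextRuns(html)
            Console.WriteLine("[" & run & "]")   ' prints [Hello ] and [world]
        Next
    End Sub
End Module
```

Checking `Group.Success` avoids comparing against an empty string, and `g.Index` together with `g.Length` gives the position of each run in the original HTML.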

This is supposed to work generically, without knowing any specific tags in the HTML. It is supposed to extract all text. Additionally, I need to preserve the original HTML so the page retains all its links and scripts. I only need to extract the text so that I can perform searches and replacements within it, without fear of “renaming” any tags, attributes, script variables and so on. That rules out a simple “replace with nothing” on all the matches: it would leave me with the text I need, but reinserting that text into the correct places of the fully functional document would be a hassle.

I want to know whether this is at all possible using regex (I know about HTML Agility Pack and XPath, but would rather not use them here).

Any suggestions?

Update:
Here is the (regex-based) solution I ended up with: http://www.martinwardener.com/regex/. It is implemented as a demo web application that shows the active regex strings along with a test engine that lets you run the parsing on any online HTML page. It reports parse times and the extracted results (for link, URL and text portions individually), and offers views where all the regex matches are highlighted in place in the complete HTML document.

1 Answer

  1. Editorial Team
    Added an answer on May 17, 2026 at 5:45 pm

    OK, so here’s how I’m doing it:

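    Using my original regex, with an added alternative for the plain text, which is simply whatever is left over after the tag alternatives have had their turn. It is shown here in (?P<name>) notation; .NET writes named groups as (?<name>…) and backreferences as \k<name>, as in the VB.Net code further down: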

    (?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)

    Then in VB.Net:

    Imports System.IO
    Imports System.Text.RegularExpressions

    ' Markup alternatives come first; whatever they leave over lands in the "text" group.
    Dim regexText As New Regex("(?:(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?<text>[^<>]*)", RegexOptions.IgnoreCase)
    Dim source As String = File.ReadAllText("html.txt")
    Dim evaluator As New MatchEvaluator(AddressOf MatchEvalFunction)
    Dim newHtml As String = regexText.Replace(source, evaluator)
    

    The actual replacing of text happens here:

    ' Called once per match: rewrites visible text and returns markup unchanged.
    Private Function MatchEvalFunction(ByVal match As Match) As String
        Dim textGroup As Group = match.Groups("text")
        If textGroup.Success AndAlso textGroup.Length > 0 Then
            ' When the "text" alternative matched, the whole match is the text run.
            Return textGroup.Value.Replace("Original word", "Replacement word")
        End If
        Return match.Value
    End Function
    

    Voila. newHtml now contains an exact copy of the original, except that every occurrence of “Original word” in the page text (as it’s presented in a browser) is replaced with “Replacement word”, while all HTML and script code is left untouched. Of course, one would normally plug in a more elaborate replacement routine, but this shows the basic principle. It comes to about a dozen lines of code, including the function declaration and the loading of the HTML. I’d be very interested in seeing a parallel solution done with the DOM for comparison. Yes, I know this approach can be thrown off by certain nested-tag quirks, for example inside SCRIPT blocks, but the damage from that should be very limited, if any (see some of the comments above), and in general this does the job pretty darn well.
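As one possible shape for the “more elaborate replacement routine” mentioned above, here is a hedged sketch that does a whole-word, case-insensitive replacement inside the text runs only. The module name, the function name `ReplaceVisibleText`, its parameters and the sample HTML are all assumptions made for illustration. It also assumes the word to replace starts and ends with a word character, since `\b` behaves differently otherwise.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ReplaceVisibleTextDemo
    ' Same structure as the pattern above, with + on the text group to skip empty matches.
    Private Const Pattern As String = "(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)|(?<text>[^<>]+)"

    ' Hypothetical helper: replaces oldWord with newWord in visible text only.
    Function ReplaceVisibleText(ByVal html As String, ByVal oldWord As String, ByVal newWord As String) As String
        Dim outer As New Regex(Pattern, RegexOptions.IgnoreCase)
        ' Regex.Escape keeps characters such as "." or "+" in oldWord literal.
        Dim inner As New Regex("\b" & Regex.Escape(oldWord) & "\b", RegexOptions.IgnoreCase)
        ' The inner evaluator returns newWord verbatim, so "$" in it is not
        ' treated as a substitution token.
        Return outer.Replace(html, Function(m As Match) _
            If(m.Groups("text").Success, _
               inner.Replace(m.Value, Function(w As Match) newWord), _
               m.Value))
    End Function

    Sub Main()
        Dim html As String = "<a href=""cat.html"" title=""cat"">The cat sat</a><script>var cat = 1;</script>"
        Console.WriteLine(ReplaceVisibleText(html, "cat", "dog"))
        ' prints <a href="cat.html" title="cat">The dog sat</a><script>var cat = 1;</script>
    End Sub
End Module
```

The attribute values and the script variable keep the word “cat” because they are consumed by the markup alternatives before the text group gets a chance to match. An attribute value that itself contains “>” would still end the tag match early, which is the kind of limitation noted above.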
