The Archive Base

Editorial Team
Asked: May 12, 2026

A recent blog entry by Jeff Atwood says that you should never parse

A recent blog entry by Jeff Atwood says that you should never parse HTML using regular expressions – yet it doesn’t give an alternative.

I want to scrape search results, extracting values:

<div class="used_result_container"> 
   ...
      ...
         <div class="vehicleInfo"> 
            ...
               ...
                  <div class="makemodeltrim">
                     ...
                     <a class="carlink" href="[Url]">[MakeAndModel]</a>
                     ...
                  </div> 
                  <div class="kilometers">[Kilometers]</div> 
                  <div class="price">[Price]</div> 
                  <div class="location">
                     <span class='locationText'>Location:</span>[Location]
                  </div> 
               ...          
            ...
         </div> 
      ...
   ...
</div> 

...and it repeats

You can see the values I want to extract, [enclosed in brackets]:

  • Url
  • MakeAndModel
  • Kilometers
  • Price
  • Location

Assuming we accept the premise that parsing HTML with regular expressions:

  • is generally a bad idea
  • rapidly devolves into madness

What’s the right way to do it?

Assumptions:

  • native Win32
  • loose html

Assumption clarifications:

Native Win32

  • .NET/CLR is not native Win32
  • Java is not native Win32
  • perl, python, ruby are not native Win32
  • assume C++, in Visual Studio 2000, compiled into a native Win32 application

Native Win32 applications can call library code:

  • copied source code
  • DLLs containing function entry points
  • DLLs containing COM objects
  • DLLs containing COM objects that are COM-callable wrappers (CCW) around managed .NET objects

Loose HTML

  • xml is not loose HTML
  • xhtml is not loose HTML
  • strict HTML is not loose HTML

Loose HTML implies that the HTML is not well-formed XML (strict HTML is not well-formed XML anyway), and so an XML parser cannot be used. In reality I was presenting the assumption that any HTML parser must be generous in the HTML it accepts.


Clarification #2

Assuming you like the idea of turning the HTML into a Document Object Model (DOM), how then do you access repeating structures of data? How would you walk a DOM tree? I need a DIV node with class used_result_container that contains a DIV with class vehicleInfo. But the nodes don’t necessarily have to be direct children of one another.
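
The kind of tree walk asked about here can be sketched in a few lines. This is only an illustration under stated assumptions: `Node`, `addChild`, and `findDescendants` are hypothetical names, not part of any real parser's API, and the code assumes a C++11 compiler, which is newer than the Visual Studio version named above. It shows a depth-first search that matches at any depth, so the nodes need not be direct children of one another.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal DOM node, standing in for whatever node type a real
// HTML parser exposes.
struct Node {
    std::string tag;       // element name, e.g. "div"
    std::string cssClass;  // value of the class attribute (may be empty)
    std::vector<std::unique_ptr<Node>> children;
};

// Append a child element to `parent` and return a non-owning pointer to it.
Node* addChild(Node& parent, const std::string& tag,
               const std::string& cssClass) {
    assert(!tag.empty());
    std::unique_ptr<Node> child(new Node());
    child->tag = tag;
    child->cssClass = cssClass;
    parent.children.push_back(std::move(child));
    return parent.children.back().get();
}

// Depth-first walk: collect every descendant of `root`, at any depth, whose
// tag and class both match. `root` itself is not tested.
void findDescendants(const Node& root, const std::string& tag,
                     const std::string& cssClass,
                     std::vector<const Node*>& out) {
    for (const auto& child : root.children) {
        if (child->tag == tag && child->cssClass == cssClass) {
            out.push_back(child.get());
        }
        findDescendants(*child, tag, cssClass, out);
    }
}
```

Calling `findDescendants(container, "div", "vehicleInfo", hits)` on a used_result_container node fills `hits` with every matching DIV beneath it, however many wrapper elements sit in between.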

It sounds like I’m trading one set of regular expression problems for another. If they change the structure of the HTML, I will have to re-write my code to match – as I would with regular expressions. And assuming we want to avoid those problems, because those are the problems with regular expressions, what do I do instead?

And would I not be writing a regular expression parser for DOM nodes? I’d be writing an engine to parse a string of objects, using an internal state machine and forward and back capture. No, there must be a better way – the way that Jeff alluded to.

I intentionally kept the original question vague, so as not to lead people down the wrong path. I didn’t want to imply that the solution, necessarily, had anything to do with:

  • walking a DOM tree
  • xpath queries

Clarification #3

The sample HTML I provided I trimmed down to the important elements and attributes. The way I trimmed it reflects my internal bias toward regular expressions: I naturally think that I need various “sign-posts” in the HTML to look for.

So don’t confuse the presented HTML for the entire HTML. Perhaps some other solution depends on the presence of all the original HTML.

Update 4

The only proposed solutions seem to involve using a library to convert the HTML into a Document Object Model (DOM). The question then would have to become: then what?

Now that I have the DOM, what do I do with it? It seems that I still have to walk the tree with some sort of regular DOM expression parser, capable of forward matching and capture.

In this particular case I need all the used_result_container DIV nodes which contain vehicleInfo DIV nodes as children. Any used_result_container DIV nodes that do not contain vehicleInfo as a child are not relevant.

Is there a DOM regular expression parser with capture and forward matching? I don’t think XPath can select higher level nodes based on criteria of lower level nodes:

\\div[@class="used_result_container" && .\div[@class="vehicleInfo"]]\*

Note: I use XPath so infrequently that I cannot make up hypothetical xpath syntax very goodly.
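
For reference, XPath predicates can test descendants, so the intent of the expression above can be written as `//div[@class="used_result_container"][.//div[@class="vehicleInfo"]]`. The same "select the higher node by a lower node, then capture values" logic can also be sketched directly in C++. This is a minimal sketch under assumptions: `Node`, `Listing`, `addChild`, `firstByClass`, and `collectListings` are hypothetical names, `text` holds only an element's own text (not that of its child elements), and a C++11 compiler is assumed.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal DOM node; a real parser exposes its own node type.
struct Node {
    std::string tag;       // element name, e.g. "div"
    std::string cssClass;  // value of the class attribute (may be empty)
    std::string href;      // value of the href attribute (may be empty)
    std::string text;      // the element's own text, excluding child elements
    std::vector<std::unique_ptr<Node>> children;
};

// The five values the question wants captured for each result.
struct Listing {
    std::string url, makeAndModel, kilometers, price, location;
};

// Append a child element to `parent` and return a non-owning pointer to it.
Node* addChild(Node& parent, const std::string& tag,
               const std::string& cssClass, const std::string& text = "") {
    assert(!tag.empty());
    std::unique_ptr<Node> child(new Node());
    child->tag = tag;
    child->cssClass = cssClass;
    child->text = text;
    parent.children.push_back(std::move(child));
    return parent.children.back().get();
}

// First descendant of `root` (any depth, document order) with the given
// class, or nullptr when there is none.
const Node* firstByClass(const Node& root, const std::string& cssClass) {
    for (const auto& child : root.children) {
        if (child->cssClass == cssClass) return child.get();
        if (const Node* hit = firstByClass(*child, cssClass)) return hit;
    }
    return nullptr;
}

// Text of a node, or an empty string when the node was not found.
std::string textOf(const Node* node) {
    return node ? node->text : std::string();
}

// Visit every used_result_container DIV and keep only those that have a
// vehicleInfo descendant; capture the wanted values from the ones kept.
void collectListings(const Node& root, std::vector<Listing>& out) {
    for (const auto& child : root.children) {
        if (child->tag == "div" && child->cssClass == "used_result_container") {
            const Node* info = firstByClass(*child, "vehicleInfo");
            if (info) {  // predicate on a lower-level node
                const Node* link = firstByClass(*info, "carlink");
                Listing listing;
                listing.url = link ? link->href : std::string();
                listing.makeAndModel = textOf(link);
                listing.kilometers = textOf(firstByClass(*info, "kilometers"));
                listing.price = textOf(firstByClass(*info, "price"));
                listing.location = textOf(firstByClass(*info, "location"));
                out.push_back(listing);
            }
        }
        collectListings(*child, out);
    }
}
```

Because every lookup goes by class name at any depth, adding or removing wrapper elements in the page does not break the extraction; only a renamed class would.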


1 Answer

  1. Editorial Team
    Added an answer on May 12, 2026 at 10:58 pm

    Native Win32

    You can always use IHTMLDocument2. This is built into Windows at this point. With this COM interface, you get native access to a powerful DOM parser (IE’s DOM parser!).
