The Archive Base


Editorial Team
Asked: June 11, 2026 (2026-06-11T10:11:44+00:00)


Hello, I am building a database of factual data about my book collection, i.e. titles, number of pages, width, length, author, author birthdate, publisher name, publisher address, and so on.
For that purpose, I input ISBNs and the application fetches that info from the web, from a few sites I picked myself and that I know will, between them, have all the info I require. At the moment it’s 3 sites, and it will most probably never be more than five. On each of these sites, I cURL a search page with the ISBN as a query parameter, extract the links the search page presents, then cURL these links and extract the above info (birth, title, publisher, etc.) out of them.
The extent of my scraping, therefore, is 3 x (search page + info page) = 6 HTML pages.

These pages all present the relevant information in ludicrous ways. For example, the publisher info has address, phone, email, and website in one HTML tag, with <br>s as separators. Some publishers don’t have one of these fields, so it’s not even always the same number of <br>s.
Another of these sites has <li>s for most of the info, but an <a> for one field, a <p> for another, and a <div> for another.
Etc…
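
For the publisher block described above, one way to cope with the varying number of <br>s is to split on them and classify each piece by its shape rather than by its position. This is only a sketch: the function name `split_publisher_block`, the sample address, and the rough phone/website patterns are all assumptions for illustration, not taken from any of the real sites.

```python
import re

def split_publisher_block(inner_html):
    """Split a publisher block on <br> variants and classify each piece.

    Hypothetical helper: the heuristics below (an "@" means email, a
    digits-only line means phone, ...) are assumptions, not site rules.
    """
    parts = re.split(r"<br\s*/?>", inner_html, flags=re.IGNORECASE)
    fields = {"address": []}
    for part in (p.strip() for p in parts):
        if not part:
            continue  # skip empty pieces left by consecutive <br>s
        if "@" in part:
            fields["email"] = part
        elif re.match(r"(?:https?://|www\.)", part, re.IGNORECASE):
            fields["website"] = part
        elif re.fullmatch(r"[+\d][\d .()/-]{5,}", part):
            fields["phone"] = part
        else:
            # Anything unrecognised is treated as an address line.
            fields["address"].append(part)
    return fields

# Made-up sample with no website field.
sample = "12 rue Exemple<br>75000 Paris<br/>01 23 45 67 89<br>contact@example.org"
print(split_publisher_block(sample))
```

A missing field simply leaves its key out of the result, so the caller can test for `"website" in fields` instead of counting separators.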

I have successfully extracted what I wanted with regex, then with a DOM parser. In the end, the code is far less readable with the DOM parser, as more operations are needed to extract each field. As an example:

<li>Né le : 23/12/1990 (ANGLETERRE)</li>

for a male author birthdate, could also show up for a female one as

<li>Née le : 11/07/1832</li>

With the DOM parser, I need to get a list of <li>s, which is not enough, as some important info is in a <p>, a <div>, and an <a>. Then, for each <li>, I need to check whether it contains “Né le” or “Née le”, which takes either two ifs or a regex, then check whether there is a parenthesized birthplace and extract it, which is at least two more operations.
With a regex, I can get it in one line of code.
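
To make the comparison concrete, here is a minimal Python sketch of the regex approach for the two sample lines above. The pattern is inferred from those two lines only, and the names `BIRTH_RE` and `extract_birth` are made up for the illustration; a real page may need a looser pattern.

```python
import re

# Assumed pattern, based only on the two sample <li> lines above:
# "Né le" or "Née le", a dd/mm/yyyy date, then an optional "(PLACE)".
BIRTH_RE = re.compile(
    r"<li>\s*Née?\s+le\s*:\s*(\d{2}/\d{2}/\d{4})(?:\s*\(([^)]*)\))?\s*</li>"
)

def extract_birth(html):
    """Return (date, place) from the first matching <li>, or None.

    place is None when no parenthesized birthplace is present.
    """
    match = BIRTH_RE.search(html)
    if match is None:
        return None
    return match.group(1), match.group(2)

print(extract_birth("<li>Né le : 23/12/1990 (ANGLETERRE)</li>"))
# ('23/12/1990', 'ANGLETERRE')
print(extract_birth("<li>Née le : 11/07/1832</li>"))
# ('11/07/1832', None)
```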

Moreover, how exactly is a parser built? Does the underlying code use regexes, or is it something else? If it does, I figure there is a high performance cost when using a parsing engine vs. quick and dirty regexes?

So here are my two questions: how is a DOM parser built, is it with underlying regexes? And secondly, for my very limited scope of parsing six to ten pages, mostly for my personal use, shouldn’t I go for code readability (and performance depending on the first question)?

Best regards,
Sebastian


1 Answer

  1. Editorial Team
    Added an answer on June 11, 2026 at 10:11 am (2026-06-11T10:11:45+00:00)

    how is a DOM parser built, is it with underlying regexes?

    It is a parser and normally would not be implemented with regex. Internally, it goes through the HTML one character at a time and uses a state machine to “figure out” what each character means and how it fits into the DOM (this includes fixing broken HTML, closing elements that should be closed, and more).
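
As a rough illustration of the character-by-character idea, here is a toy two-state tokenizer in Python. It is a sketch only and not how any particular library does it: the name `tokenize` is made up, and it ignores attributes, comments, entities, and all of the error recovery a real HTML parser performs.

```python
def tokenize(html):
    """Split markup into ("tag", name) and ("text", data) tokens.

    Toy state machine with two states, TEXT and TAG. No regexes are
    involved: each character either extends the current buffer or
    triggers a state change.
    """
    tokens = []
    state = "TEXT"
    buf = []
    for ch in html:
        if state == "TEXT":
            if ch == "<":
                if buf:
                    tokens.append(("text", "".join(buf)))
                buf = []
                state = "TAG"
            else:
                buf.append(ch)
        else:  # state == "TAG"
            if ch == ">":
                tokens.append(("tag", "".join(buf).strip()))
                buf = []
                state = "TEXT"
            else:
                buf.append(ch)
    # Flush trailing text; an unterminated tag is silently dropped here.
    if buf and state == "TEXT":
        tokens.append(("text", "".join(buf)))
    return tokens

print(tokenize("<li>Née le : 11/07/1832</li>"))
# [('tag', 'li'), ('text', 'Née le : 11/07/1832'), ('tag', '/li')]
```

A real parser has many more states and a tree-building stage on top of the tokens, but the loop keeps the same shape: a single pass over the input, with work proportional to its length.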

    If you can read C# (or Java), I suggest reading the source code for the HTML Agility Pack – in particular the Parse methods. It will show quite clearly how this is done.

    The definitive source for how to parse HTML correctly is section 12.2 of the WHATWG HTML specification (note that the link is to the first page only; there is more). This is not for the faint of heart 😉

    for my very limited scope of parsing six to ten pages, mostly for my personal use, shouldn’t I go for code readability (and performance depending on the first question)?

    Regex for parsing well-known HTML formats is fine. People rage against trying to parse HTML from many different sources with regex, as this is not really possible (HTML not being a regular language, you end up with many exceptions and contradictions).

    If this is for a limited use and limited HTML formats, go ahead and use regex. Do whatever is more readable for you.

© 2021 The Archive Base. All Rights Reserved