The Archive Base

Editorial Team
Asked: May 11, 2026 (2026-05-11T10:45:36+00:00)


I was answering some quiz questions for an interview, and one question asked how I would do screen scraping. That is, picking content out of a web page, assuming you don’t have a better structured way to query the information directly (e.g. a web service).

My solution was to use an XQuery expression. The expression was fairly long because the content I needed was pretty deep in the HTML hierarchy. I had to search up through the ancestors a fair way before I found an element with an id attribute. For example, scraping an Amazon.com page for Product Dimensions looks like this:

//a[@id='productDetails']
  /following-sibling::table
  //h2[contains(child::text(), 'Product Details')]
  /following-sibling::div
  //li
  /b[contains(child::text(), 'Product Dimensions:')]
  /following-sibling::text()

That’s a pretty nasty expression, but that’s why Amazon provides a web service API. Anyway, it’s just one example. The question was not about Amazon; it was about screen scraping.

The interviewer didn’t like my solution. He thought it was fragile, because a change to the page design by Amazon could require rewriting the XQuery expression. Debugging an XQuery expression that doesn’t match anything in the page it’s applied against is hard.

I did not disagree with his statements, but I didn’t think his solution was any improvement: he thought it was better to use a regular expression and search for content and markup near the value you want (here, the product dimensions). For example, using Perl:

$html =~ m{<li>\s*<b>\s*Product Dimensions:\s*</b>\s*(.*?)</li>}s; 

My counter-argument was that this is also susceptible to Amazon changing their HTML code. They could spell HTML tags in capitals (<LI>), add CSS attributes, change <b> to <span>, change the label ‘Product Dimensions:’ to ‘Dimensions:’, or make many other kinds of changes. My point was that regular expressions don’t solve the weaknesses he called out in my XQuery solution.
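To make those failure modes concrete, here is a minimal sketch that runs the interviewer's pattern against a few variations. It is a Python translation of the Perl regex above (`re.DOTALL` stands in for the `/s` flag), and the HTML fragments are made-up illustrations, not real Amazon markup.

```python
import re

# Python translation of the Perl pattern from the question; re.DOTALL mirrors /s.
PATTERN = re.compile(
    r"<li>\s*<b>\s*Product Dimensions:\s*</b>\s*(.*?)</li>", re.DOTALL
)

# Hypothetical page fragments (assumptions for illustration only).
VARIANTS = {
    "original": "<li><b>Product Dimensions:</b> 8 x 5 x 1 inches</li>",
    "uppercase tags": "<LI><B>Product Dimensions:</B> 8 x 5 x 1 inches</LI>",
    "added attribute": '<li class="dim"><b>Product Dimensions:</b> 8 x 5 x 1 inches</li>',
    "b became span": "<li><span>Product Dimensions:</span> 8 x 5 x 1 inches</li>",
    "label renamed": "<li><b>Dimensions:</b> 8 x 5 x 1 inches</li>",
}


def extract(html):
    """Return the captured value, or None when the pattern does not match."""
    match = PATTERN.search(html)
    return match.group(1).strip() if match else None


results = {name: extract(html) for name, html in VARIANTS.items()}

for name, value in results.items():
    print(f"{name}: {value!r}")
```

Only the "original" fragment yields a value; each of the four changes listed above makes the pattern return nothing.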

In addition, regular expressions can produce false positives unless you add enough context to the expression. They can also unintentionally match content that happens to be inside a comment, an attribute string, or a CDATA section.
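A small sketch of the comment case, again in Python with a made-up fragment: the regex picks up a stale value from commented-out markup, while a tokenizing parser (here the standard library's `html.parser`, used only as an example of parser-based extraction) skips the comment. The `DimensionParser` class and its logic are my own assumptions, not something from the interview.

```python
import re
from html.parser import HTMLParser

# Hypothetical fragment: an old value survives inside an HTML comment.
PAGE = """
<ul>
<!-- <li><b>Product Dimensions:</b> 1 x 1 x 1 inches</li> -->
<li><b>Product Dimensions:</b> 8 x 5 x 1 inches</li>
</ul>
"""

PATTERN = re.compile(
    r"<li>\s*<b>\s*Product Dimensions:\s*</b>\s*(.*?)</li>", re.DOTALL
)
# The regex sees raw text, so its first match is the commented-out item.
regex_value = PATTERN.search(PAGE).group(1).strip()


class DimensionParser(HTMLParser):
    """Collect the text following a <b>Product Dimensions:</b> label in an <li>."""

    def __init__(self):
        super().__init__()
        self.in_bold = False
        self.capturing = False
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_bold = False
        elif tag == "li":
            self.capturing = False

    def handle_data(self, data):
        if self.in_bold and data.strip() == "Product Dimensions:":
            self.capturing = True
        elif self.capturing and self.value is None and data.strip():
            self.value = data.strip()

    # handle_comment keeps its default no-op behavior, so comments are ignored.


parser = DimensionParser()
parser.feed(PAGE)
parser.close()
parser_value = parser.value

print("regex :", regex_value)
print("parser:", parser_value)
```

The parser-based version is still tied to the label text and the `<b>` wrapper, so it does not remove the fragility discussed above; it only avoids matching inside the comment.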

My question is, what technology do you use to do screen scraping? Why did you choose that solution? Is there some compelling reason to use one? Or never use the other? Is there a third choice besides those I showed above?

PS: Assume for the sake of argument that there is no web service API or other more direct way to acquire the desired content.

1 Answer

  1. Added an answer on May 11, 2026 at 10:45 am

    I’d use a regular expression, for the reasons the manager gave, plus a few more (more portable, easier for outside programmers to follow, etc.).

    Your counter-argument misses the point that his solution is fragile with regard to local changes, while yours is fragile with regard to global changes. Anything that breaks his will probably break yours, but not vice versa.

    Finally, it’s a lot easier to build slop / flex into his solution (if, for example, you have to deal with multiple minor variations in the input).
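As a rough sketch of what that slop might look like, here is one possible loosened version of the pattern, written in Python rather than Perl. Which variations to tolerate (tag case, attributes, `<strong>`/`<span>` wrappers, an optional "Product" prefix) is my own guess, and the sample fragments are hypothetical.

```python
import re

# One possible loosened pattern; every tolerance here is an illustrative assumption.
FLEXIBLE = re.compile(
    r"""
    <li\b[^>]*>\s*                  # <li>, any case, optional attributes
    <(b|strong|span)\b[^>]*>\s*     # label wrapper: <b>, <strong>, or <span>
    (?:Product\s+)?Dimensions:?\s*  # with or without the Product prefix and colon
    </\1>\s*                        # closing tag that matches the wrapper
    (.*?)\s*                        # the value itself
    </li>
    """,
    re.IGNORECASE | re.DOTALL | re.VERBOSE,
)


def extract(html):
    """Return the dimensions text, or None when nothing matches."""
    match = FLEXIBLE.search(html)
    return match.group(2) if match else None


# Hypothetical fragments covering the variations raised in the question.
SAMPLES = [
    "<li><b>Product Dimensions:</b> 8 x 5 x 1 inches</li>",
    "<LI><B>Product Dimensions:</B> 8 x 5 x 1 inches</LI>",
    '<li class="dim"><b>Product Dimensions:</b> 8 x 5 x 1 inches</li>',
    "<li><span>Product Dimensions:</span> 8 x 5 x 1 inches</li>",
    "<li><b>Dimensions:</b> 8 x 5 x 1 inches</li>",
]

values = [extract(sample) for sample in SAMPLES]
unrelated = extract("<li><b>Shipping Weight:</b> 2 pounds</li>")

print(values)
print(unrelated)
```

The trade-off is the one the question already raises: each tolerance widens what the pattern accepts, so the looser it gets, the more context it needs to avoid false positives.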
