I am trying to do some HTML parsing.
I am dealing with some very dynamic data, and my sources vary widely.
To be more specific, I am trying to parse product information, including
name, price and description, from pages that I do not know in advance.
Across these pages, the only basic information that stays the same is the title of the page,
the name of the item I am querying (the two match each other), and the price.
The only real logic that remains the same throughout different websites is the
proximity between the different sets of information.
So, a price label will be close to the product’s name and close to its description.
I am looking for an HTML parser that lets me narrow down my parsing based on the distance in pixels between the different HTML tags.
Do you know of such a library?
Is there any other way I could try to tackle this issue?
EDIT:
The language, the OS and the resolution don’t matter.
What tools do you know that might help with this problem?
I might decide to change my underlying OS and language if I
find a good enough library.
The price of an item is normally preceded by a special character denoting the currency, inside the same tag as the numerals displaying the value in a eg:
Assuming you are using a search API such as Google or Bing to get a list of pages that contain a specific product's name, and then opening each of those pages, a simple regex will be able to retrieve everything between the currency marker (£, $, ¥ etc.) and the end of the div or span.
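As a rough illustration of the regex idea, here is a minimal Python sketch. The pattern, the function name find_prices and the sample markup are all made up for the example, and the pattern is deliberately narrower than "everything up to the end of the tag": it only captures the number that follows the currency symbol. Real pages will need a more forgiving pattern (currency codes such as "USD", symbols placed after the amount, comma decimal separators and so on).

```python
import re

# Currency symbol, optional whitespace, then digits with optional
# thousands separators and an optional decimal part.
# The symbol list is an assumption; extend it for the currencies you need.
PRICE_RE = re.compile(r"([£$¥€])\s*(\d+(?:,\d{3})*(?:\.\d+)?)")

def find_prices(html):
    """Return (symbol, amount) pairs found anywhere in the raw HTML text."""
    return [(m.group(1), m.group(2)) for m in PRICE_RE.finditer(html)]

# Hypothetical markup, for illustration only.
sample = '<div class="item"><h1>Blue Widget</h1><span class="price">£1,299.99</span></div>'
print(find_prices(sample))  # [('£', '1,299.99')]
```

Running the regex over raw HTML like this also picks up prices inside scripts, comments and attributes, so it is only a first pass.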
However, if the search results throw up pages that contain more than one product or multiple price markers, then this system may not work quite as well as hoped. The only way to be sure is to code individual scraper routines for each site, or to try to scrape somebody else's comparison service.
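For pages with several prices, one cheap heuristic in the spirit of the proximity idea from the question is to pick the price whose text node is closest to the product name in document order. Note that this is not pixel distance: measuring rendered distance would need a real browser engine, whereas this sketch only counts text nodes using Python's standard html.parser. The names TextCollector and nearest_price and the sample page are hypothetical.

```python
import re
from html.parser import HTMLParser

PRICE_RE = re.compile(r"[£$¥€]\s*\d+(?:,\d{3})*(?:\.\d+)?")

class TextCollector(HTMLParser):
    """Collects non-empty text nodes in document order."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

def nearest_price(html, product_name):
    """Return the price whose text node is closest to the name, or None."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    chunks = collector.chunks
    # Indices of text nodes that mention the product name (case-insensitive).
    name_positions = [i for i, c in enumerate(chunks)
                      if product_name.lower() in c.lower()]
    best = None  # (distance in text nodes, price string)
    for i, chunk in enumerate(chunks):
        match = PRICE_RE.search(chunk)
        if not match:
            continue
        for n in name_positions:
            distance = abs(i - n)
            if best is None or distance < best[0]:
                best = (distance, match.group(0))
    return best[1] if best else None

# Hypothetical page listing two products.
page = """
<div><h2>Red Widget</h2><span>$4.50</span><p>A small widget.</p></div>
<div><h2>Blue Widget</h2><span>$19.99</span><p>A large widget.</p></div>
"""
print(nearest_price(page, "Blue Widget"))  # $19.99
```

The heuristic is easy to fool: if a page puts the description between the name and the price, the previous product's price can end up closer than the right one, so treat the result as a guess to be checked rather than a reliable answer.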