I have a Java application that parses HTML pages and extracts data from them. Currently, I have a class that acts as a template, or set of instructions, for how to read a specific web page. The application will need to read from several different sites, each formatted differently. Rather than creating a new template class for each format, I’d like to read an accompanying XML file (or another document) that specifies which data to extract and where to find it.
I’ve attempted to search the internet on how to do this, but I’m guessing I’m not asking the right question or using the right keywords.
The solution doesn’t have to use XML as the template, but it was my first thought.
Can anyone point me in the right direction?
In this high-level design, an Extractor uses ExtractionInstructions to extract the data of interest from a single source. You can later retrieve the extracted data from the extractor.
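To make the two abstractions more concrete, here is a minimal sketch of how they could fit together. Everything in it is an assumption for illustration: the method names (`fromProperties`, `extract`, `getExtractedData`), the idea that instructions map a field name to a CSS-style selector, and the use of a properties-style text instead of XML (chosen only to keep the example short). The HTML lookup itself is passed in as a function, so the sketch does not depend on any particular parsing library.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;
import java.util.function.BiFunction;

/** Hypothetical: maps each field name (e.g. "title") to a selector saying where to find it. */
final class ExtractionInstructions {
    private final Map<String, String> selectorsByField = new TreeMap<>();

    /** Parses lines such as "title=h1.headline" (an assumed format, not a standard one). */
    static ExtractionInstructions fromProperties(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        ExtractionInstructions instructions = new ExtractionInstructions();
        for (String field : props.stringPropertyNames()) {
            instructions.selectorsByField.put(field, props.getProperty(field));
        }
        return instructions;
    }

    Map<String, String> selectorsByField() {
        return Collections.unmodifiableMap(selectorsByField);
    }
}

/** Hypothetical: applies the instructions to one source and keeps the results for later retrieval. */
final class Extractor {
    private final ExtractionInstructions instructions;
    // (html, selector) -> extracted text; the actual HTML lookup is supplied by the caller.
    private final BiFunction<String, String, String> selectText;
    private final Map<String, String> extracted = new TreeMap<>();

    Extractor(ExtractionInstructions instructions,
              BiFunction<String, String, String> selectText) {
        this.instructions = instructions;
        this.selectText = selectText;
    }

    /** Runs every instruction against a single HTML source, replacing earlier results. */
    void extract(String html) {
        extracted.clear();
        instructions.selectorsByField().forEach(
                (field, selector) -> extracted.put(field, selectText.apply(html, selector)));
    }

    /** Returns the data gathered by the most recent call to extract. */
    Map<String, String> getExtractedData() {
        return Collections.unmodifiableMap(extracted);
    }
}
```

With this shape, supporting a new site means writing a new instructions file rather than a new template class. You would call `ExtractionInstructions.fromProperties(...)` with the site's file contents, construct an `Extractor` with those instructions and a lookup function, call `extract(html)`, and then read `getExtractedData()`.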
I suggest using JSoup as the base library to build these abstractions.