I have a Java application that will parse html pages and extract data from

Question

Asked: June 14, 2026

I have a Java application that will parse HTML pages and extract data from them. Currently, I have a class that acts as a template, or set of instructions, on how to read a specific web page. The application will need to read from several different sites that are formatted differently. Rather than creating a new template class for each format, I'd like to be able to read an accompanying XML file (or another document) that provides the instructions on which data to extract and where to find it.

I’ve attempted to search the internet on how to do this, but I’m guessing I’m not asking the right question or using the right keywords.

The solution doesn’t have to use XML as the template, but it was my first thought.

Can anyone point me in the right direction?

1 Answer

Editorial Team · Answer 1 · 2026-06-14T01:19:09+00:00

An Extractor uses ExtractionInstructions to extract the data of interest from a single source. You can later retrieve the extracted data from the extractor.

In this high-level design:

Source: each page that you want to extract the same data from.
Extractor: one instance for each extraction run on a single source.
ExtractionInstructions: a set of instructions that unambiguously describes how to extract data from a single source.
- You can specify the instructions unambiguously by
  - tag id, and/or
  - CSS 3 selectors, and/or
  - XPath, etc.
- You can combine all of the above by chaining them together (chain of responsibility pattern) to improve the chance of success. The idea is that if the data is not found using one type of extraction, you try the other options until you either find the data or run out of instructions.
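The fallback idea above can be sketched roughly as follows. This is only an illustration, not a finished design: the class name `FallbackExtractor` and its `extract` method are hypothetical, and to keep the example free of dependencies it uses the JDK's built-in XPath support, which assumes the source is well-formed markup. Real-world HTML usually is not, so a tolerant parser would take its place in practice.

```java
import java.io.StringReader;
import java.util.List;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical sketch: one extractor per source, trying each instruction in order. */
class FallbackExtractor {
    private final Document source;
    private final XPath xpath = XPathFactory.newInstance().newXPath();

    /** Assumption: the markup is well-formed, so the JDK XML parser can read it. */
    FallbackExtractor(String wellFormedMarkup) {
        try {
            source = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Source is not well-formed markup", e);
        }
    }

    /** Returns the first non-empty match, or empty once the instructions run out. */
    Optional<String> extract(List<String> xpathInstructions) {
        for (String instruction : xpathInstructions) {
            try {
                String value = xpath.evaluate(instruction, source).trim();
                if (!value.isEmpty()) {
                    return Optional.of(value);
                }
            } catch (XPathExpressionException e) {
                // An invalid instruction is skipped so the next one can be tried.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String page = "<html><body><span class=\"cost\">9.99</span></body></html>";
        // The id lookup finds nothing here, so the class-based lookup is used instead.
        List<String> priceInstructions = List.of(
                "//*[@id='price']",
                "//span[@class='cost']");
        System.out.println(new FallbackExtractor(page).extract(priceInstructions).orElse("not found"));
    }
}
```

Keeping the instructions as an ordered list means a new site only needs a new list, not a new class.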

I suggest using JSoup as the base library to build these abstractions.
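To connect this with the XML template idea from the question, the instructions for each site could live in a small XML file that is loaded at startup. The sketch below is one possible shape, not a standard format: the `site`, `field`, and `selector` element names and the `InstructionLoader` class are all made up for illustration. It uses only the JDK's DOM parser and returns each field's selectors in priority order; each selector string could then be handed to jsoup's `Document.select(String)`.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical loader for a made-up per-site instruction format. */
class InstructionLoader {

    /** Maps each field name to its selectors, in the order they should be tried. */
    static Map<String, List<String>> load(String xml) {
        Map<String, List<String>> instructions = new LinkedHashMap<>();
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            NodeList fields = root.getElementsByTagName("field");
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                List<String> selectors = new ArrayList<>();
                NodeList nodes = field.getElementsByTagName("selector");
                for (int j = 0; j < nodes.getLength(); j++) {
                    selectors.add(nodes.item(j).getTextContent().trim());
                }
                instructions.put(field.getAttribute("name"), selectors);
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid instruction file", e);
        }
        return instructions;
    }

    public static void main(String[] args) {
        // Assumed format: one <field> per piece of data, with fallback selectors inside.
        String xml = "<site name=\"example\">"
                + "<field name=\"title\"><selector>h1.product</selector><selector>#title</selector></field>"
                + "<field name=\"price\"><selector>span.cost</selector></field>"
                + "</site>";
        System.out.println(load(xml));
    }
}
```

With a file like this per site, supporting a new page format becomes a matter of writing a new instruction file rather than a new template class.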
