I have some problems with a Java app I’m developing; I’m using HtmlCleaner 2.2

Question


Asked: June 1, 2026 (2026-06-01T13:30:56+00:00)


I have some problems with a Java app I’m developing. I’m using the HtmlCleaner 2.2 library (the one used in the Web-Harvest project), and I have no problem getting the source of a page.

My problem starts when I want to browse the site recursively and build a tree of categories with products as children. I guess that each time the script visits a page it counts as a user entering the site, so after it visits 15 or 20 category or product pages, the website firewall blocks my IP for about an hour.

Two solutions come to mind. First: use proxies, so I don’t get banned and can download faster using threads. Second: open only one connection. I guess it’s a bad idea to use proxies, so I want to ask: in simple code, what is the fastest and simplest way to recursively visit about 300,000 products of a website without being banned?

Putting the page source in a string is enough for a page to count as visited.
I don’t want a debate about the best way, only one well-justified answer.

Clarification: This is a school task. I’m not making any profit from it, and I’m trying to be as harmless to the site as possible.


1 Answer

Editorial Team · Answer 1 · 2026-06-01T13:30:58+00:00

If your spidering provides legitimate business value to the site you are scraping, you could contact the website owner and ask for either a data feed or an exclusion to their banning algorithm (after all, it’s often beneficial for people to have their products exposed to prospective buyers).

UPDATE

Based on your statement that this is a school task, ask your teacher for assistance in finding a website that is willing to be bombarded with traffic in the interest of education, or reach out to the website owner, explain what you are doing, and ask for permission.
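Once you have permission, the single-connection option from the question could look roughly like the sketch below. It is only an illustration, not a guaranteed way around a firewall: it visits pages one at a time and pauses between requests. The names `PoliteCrawler`, `Page`, `fetcher`, and `delayMillis` are hypothetical, and the fetcher is a placeholder for whatever you already use to download a page with HtmlCleaner and extract its links. The right delay is an assumption you would need to agree on with the site owner.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Sequential, throttled crawler sketch (hypothetical class, not part of HtmlCleaner). */
class PoliteCrawler {

    /** One fetched page: its source and the links found in it (hypothetical type). */
    static final class Page {
        final String source;
        final List<String> links;

        Page(String source, List<String> links) {
            this.source = source;
            this.links = links;
        }
    }

    private final Function<String, Page> fetcher; // assumption: wraps your own download code
    private final long delayMillis;               // assumption: pause agreed with the site owner

    PoliteCrawler(Function<String, Page> fetcher, long delayMillis) {
        this.fetcher = fetcher;
        this.delayMillis = delayMillis;
    }

    /** Visits pages breadth-first, one request at a time, and returns the URLs in visit order. */
    List<String> crawl(String startUrl, int maxPages) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();   // prevents requesting the same URL twice
        List<String> visited = new ArrayList<>();
        queue.add(startUrl);
        seen.add(startUrl);

        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            Page page = fetcher.apply(url);   // the only place a request is made
            visited.add(url);
            for (String link : page.links) {
                if (seen.add(link)) {         // add() returns false for already-seen URLs
                    queue.add(link);
                }
            }
            // Pause only if another request will follow.
            if (!queue.isEmpty() && visited.size() < maxPages) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // keep the interrupt flag and stop
                    break;
                }
            }
        }
        return visited;
    }
}
```

With about 300,000 pages, the total run time is roughly the delay multiplied by the page count, so even a one-second pause means several days of crawling. That is one more reason to ask the owner for a data feed first.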
