I'm having a problem with a Java app I'm developing. I'm using the HtmlCleaner 2.2 library (the one used in the Web-Harvest project), and I have no trouble getting the source of a page.
My problem starts when I want to browse the site recursively and build a tree of categories with products as children. I guess that each time the script visits a page it counts as a user entering the site, so after it has visited 15 or 20 category or product pages, the website's firewall blocks my IP for about an hour.
Two solutions come to mind. First: use proxies, so I don't get banned and can download faster using threads. Second: open only one connection. I guess using proxies is a bad idea, so I want to ask: in simple code, what is the fastest and simplest way to recursively visit about 300,000 products on a website without being banned?
Putting a page's source into a string is enough for it to count as visited.
I don't want a debate about the best way, only one well-justified answer.
Clarification: This is a school task. I'm not making any profit from it, and I'm trying to be as harmless to the site as possible.
If your spidering provides legitimate business value to the site you are scraping, you could contact the website owner and ask for either a data feed or an exemption from their banning algorithm (after all, it's often beneficial for sellers to have their products exposed to prospective buyers).
UPDATE
Based on your statement that this is a school task: ask your teacher for help finding a website that is willing to be bombarded with traffic in the interest of education, or reach out to the website owner, explain what you are doing, and ask for permission.
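Once you do have permission, the "only one connection" idea from the question could look roughly like the sketch below: a single-threaded, breadth-first crawl that pauses between requests and never fetches the same URL twice. Everything named here (`PoliteCrawler`, `fetchLinks`, `delayMillis`, `maxPages`) is hypothetical and not part of HtmlCleaner. In a real run, `fetchLinks` would be your own code that downloads the page, parses it, and returns the category and product links it finds. The demo uses an in-memory map instead of a real site, and the one-second delay is only a placeholder for whatever rate the site owner agrees to.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper: single-threaded crawl with a fixed pause between requests.
final class PoliteCrawler {
    // Assumption: given a URL, returns the links found on that page.
    private final Function<String, List<String>> fetchLinks;
    private final long delayMillis;

    PoliteCrawler(Function<String, List<String>> fetchLinks, long delayMillis) {
        this.fetchLinks = fetchLinks;
        this.delayMillis = delayMillis;
    }

    // Visits pages breadth-first and returns them in visit order.
    List<String> crawl(String startUrl, int maxPages) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> visited = new LinkedHashSet<>();
        queue.add(startUrl);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue; // already fetched, do not request it again
            }
            if (visited.size() > 1) {
                try {
                    Thread.sleep(delayMillis); // pause before every request after the first
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break; // stop crawling if the thread is interrupted
                }
            }
            for (String child : fetchLinks.apply(url)) {
                if (!visited.contains(child)) {
                    queue.add(child);
                }
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        // In-memory stand-in for a site: page -> links on that page.
        Map<String, List<String>> site = new HashMap<>();
        site.put("/", Arrays.asList("/cat/a", "/cat/b"));
        site.put("/cat/a", Arrays.asList("/product/1", "/"));
        site.put("/cat/b", Arrays.asList("/product/1", "/product/2"));

        PoliteCrawler crawler = new PoliteCrawler(
                url -> site.getOrDefault(url, Collections.emptyList()),
                1000L); // placeholder delay of one second between requests
        System.out.println(crawler.crawl("/", 100));
    }
}
```

Keep in mind that even at one request per second, 300,000 pages take roughly three and a half days, which is one more reason to ask the owner for a data feed first.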