I’m using pjscrape to scrape a large number of pages.
The problem I’m facing is that servers usually ban you after a certain number of consecutive connections made in quick succession.
The only way I have found to create some delay between one page scrape and the next is to use the ready function, i.e.
    pjs.addSuite({
        // single URL or array
        url: urls,
        ready: function() {
            return $('#MY_LAST_DIV').length > 0;
        },
        // single function or array, evaluated in the client
        scraper: function() {
            //...SCRAPING CODE...
        }
    });
The pjscrape timeout settings seem to deal with other issues
(I am referring to the following):
    pjs.config({
        ...
        timeoutInterval: 20000,
        timeoutLimit: 20000
    });
Is there a way to create an interval between scrapes?
Looking at the source code, there is currently no mechanism to wait for a set amount of time before scraping each page.
But it shouldn’t be difficult to add one. Here is a proto-patch (untested, and only here to give an idea).
It simply wraps the scrape call in a setTimeout, with the timeout set to 0 by default. The first line is the added config key.
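As a standalone illustration of the same idea (this is not the pjscrape patch itself, and it does not touch pjscrape internals), here is a minimal sketch of running a task over a list of URLs with a pause before each one. The names runWithDelay, delayMs and schedule are made up for this example; schedule defaults to setTimeout and is only a parameter so the behaviour can be checked without real timers.

```javascript
// Hypothetical helper, NOT part of pjscrape: calls task(item) for each item
// in order, waiting delayMs milliseconds before starting each one.
function runWithDelay(items, task, delayMs, schedule) {
    // Assumption: schedule has the same (callback, ms) shape as setTimeout.
    schedule = schedule || setTimeout;
    var index = 0;

    function next() {
        if (index >= items.length) {
            return; // nothing left to process
        }
        var item = items[index++];
        // Wrap the call in a timer; a delay of 0 keeps the default behaviour
        // of moving on to the next item as soon as possible.
        schedule(function () {
            task(item);
            next();
        }, delayMs || 0);
    }

    next();
}

// Example usage (placeholder URLs, 5 seconds before each request):
// runWithDelay(['http://example.com/a', 'http://example.com/b'],
//              function (url) { console.log('scraping', url); },
//              5000);
```

Note that in this sketch the next timer starts as soon as task returns, so if the task is asynchronous the delay is measured from when the task was started, not from when it finished.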