I’ve previously written applications, specifically data scrapers, in Node.js. These types of applications had no web front end, but were merely processes timed with cron jobs to asynchronously make a number of possibly complicated HTTP GET requests to pull web pages, and then scrape and store the data from the results.
A sample of a function I might write would be this:
// Node.js
var request = require("request");
function scrapeEverything() {
    var listOfIds = [23423, 52356, 63462, 34673, 67436];
    for (var i = 0; i < listOfIds.length; i++) {
        // Each call returns immediately; the callback runs when the response arrives.
        request({uri: "http://mydatasite.com/?data_id=" + listOfIds[i]},
            function(err, response, body) {
                if (err) { return console.error(err); }
                var jsonobj = JSON.parse(body);
                storeMyData(jsonobj);
            });
    }
}
This function loops through the IDs and makes a bunch of asynchronous GET requests, from which it then stores the data.
I’m now writing a scraper in Python and attempting to do the same thing using Tornado, but everything I see in the documentation refers to Tornado acting as a web server, which is not what I’m looking for. Anyone know how to do this?
This turned out slightly more involved than I expected, but it’s a quick demo of how to use Tornado’s IOLoop and AsyncHTTPClient to fetch some data. I’ve actually written a web crawler in Tornado, so it can definitely be used “headless”, with no web server involved.
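As a rough sketch of the idea (not the original demo code), the Node.js loop from the question might translate to something like the following. It assumes Tornado 5 or later on Python 3.7+, where Tornado's IOLoop runs on top of asyncio, so `asyncio.run` can drive it. `build_url`, `tornado_fetch`, `scrape_everything`, and the `store` callback are hypothetical names chosen for illustration; `store` stands in for the `storeMyData` function in the question.

```python
import asyncio
import json


def build_url(data_id):
    # Same endpoint as the Node.js example in the question.
    return "http://mydatasite.com/?data_id=%d" % data_id


async def tornado_fetch(url):
    """Fetch one URL with Tornado's non-blocking client and return the raw body."""
    # Imported lazily so the helpers in this sketch can be exercised
    # (for example with a fake fetch function) without Tornado installed.
    from tornado.httpclient import AsyncHTTPClient

    # fetch() raises an exception for network errors and non-2xx responses.
    response = await AsyncHTTPClient().fetch(url)
    return response.body


async def scrape_everything(ids, fetch=tornado_fetch, store=print):
    """Fetch every ID concurrently, store each parsed result, return failed IDs."""

    async def scrape_one(data_id):
        body = await fetch(build_url(data_id))
        store(json.loads(body))  # hypothetical stand-in for storeMyData

    # gather() starts all requests at once; return_exceptions=True keeps one
    # failed request from cancelling the rest.
    results = await asyncio.gather(
        *(scrape_one(data_id) for data_id in ids), return_exceptions=True
    )
    return [i for i, r in zip(ids, results) if isinstance(r, Exception)]


# Example invocation (performs real HTTP requests, so it is left commented out):
# failed = asyncio.run(scrape_everything([23423, 52356, 63462, 34673, 67436]))
```

No web server or request handler is created here; the event loop only exists for the duration of `asyncio.run`, which suits a cron-driven process. Passing `fetch` and `store` as parameters is just a design choice to make the function easy to try out with a fake fetcher before pointing it at a real site.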