I am using my node.js- and jsdom-based getratings.js script from the getratings GitHub project to scrape user reviews from sites like NewEgg, BestBuy, etc.
The script is hosted on an EC2 micro instance. It works fine until more than around 12 simultaneous requests are sent to the service. Beyond that, resource and memory utilization on the host is very high and response to the client takes forever.
I’ve tried to take care of memory leaks. Once it’s done processing requests, memory usage does eventually go down, but the peaks are very high.
I was wondering if there is something I can do to make the processing of HTML through jsdom more efficient in terms of resource utilization.
You might want to trigger garbage collection manually. The site linked below describes a general way of doing it.
My suggestion would be to clean up after every request, though that might be too much. You might also need to increase the number of sockets that your machine can create. Again, the code is on the site listed below. I’ll post the basics that I took from the article.
http://dev.caustik.com/websvn/filedetails.php?repname=sprites&path=%2Fsprites-server%2Ftrunk%2Freadme.txt
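After that, you will need to call the gc() function to trigger garbage collection yourself. Note that gc() is only available when node is started with the --expose-gc flag.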
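Here is a minimal sketch of what cleaning up after every request could look like. The names collectGarbage and handleWithCleanup are my own placeholders, not something from the article or from getratings.js, and I am assuming your request handler returns a promise (or a plain value).

```javascript
// Start the process with: node --expose-gc getratings.js
// Without that flag, global.gc is undefined and this helper does nothing.
function collectGarbage() {
  if (typeof global.gc !== 'function') {
    return false; // gc() not exposed; V8 collects on its own schedule
  }
  global.gc();
  return true;
}

// Hypothetical wrapper: runs a request handler, then forces a collection
// once the handler has finished, whether it succeeded or threw.
async function handleWithCleanup(handler, ...args) {
  try {
    return await handler(...args);
  } finally {
    collectGarbage();
  }
}
```

You would wrap your scraping function, e.g. handleWithCleanup(scrapeReviews, url), where scrapeReviews stands in for whatever your script calls. Keep in mind that a forced collection blocks the process while it runs, so if doing it on every request turns out to be too much, you could call collectGarbage() every N requests or on a timer instead.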