I wrote a crawler that will run as a cron job. Its purpose is to go through the posts on my site and extract keywords from them.
Currently, I am optimizing the script for both speed and server load, but I am curious what benchmarks for each are considered “good”.
For example, here are two configurations I have tested, running through 5,000 posts each time (you’ll notice the trade-off between speed and memory):
Test 1 – script optimized for memory conservation:
Run time: 52 seconds
Avg. memory load: ~6 MB
Peak memory load: ~7 MB
Test 2 – script optimized for speed:
Run time: 30 seconds
Avg. memory load: ~40 MB
Peak memory load: ~48 MB
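For anyone wanting to reproduce numbers like these, here is a minimal sketch of how run time and peak memory could be measured. The post does not say what language the crawler is written in, so Python is used purely for illustration, and `benchmark` and `extract_keywords` are hypothetical names, not part of the actual script. Note that `tracemalloc` only counts memory allocated by Python itself and slows the run somewhat, so treat the figures as relative rather than absolute.

```python
import time
import tracemalloc


def benchmark(func, *args, **kwargs):
    """Run func once; return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def extract_keywords(posts):
    """Hypothetical stand-in for the crawler: keep words longer than 4 characters."""
    return [[word for word in post.split() if len(word) > 4] for post in posts]


if __name__ == "__main__":
    # Assumed sample data; a real run would load the 5,000 posts instead.
    posts = ["some example post text about crawlers"] * 5000
    _, seconds, peak = benchmark(extract_keywords, posts)
    print(f"Run time: {seconds:.2f} s, peak memory: {peak / 1024 / 1024:.1f} MB")
```

Running each configuration through the same wrapper keeps the comparison fair, since both are timed and traced the same way.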
Clearly the decision here is speed vs. server load, and I am curious how you react to these numbers. Is 40 MB expensive if it cuts the run time from 52 to 30 seconds (and also minimizes MySQL connections)?
Or is it better to run the script more slowly with more MySQL connections, and keep the memory overhead low?
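The two tests sit at opposite ends of a range, and a batch size in between may be worth trying. Below is a rough sketch of that idea, assuming the posts can be fetched in pages. `fetch_batch(offset, limit)` is a hypothetical callable standing in for whatever query the real script runs, and the default of 500 is an arbitrary starting point, not a recommendation. A larger `batch_size` means fewer round trips to the database but more posts held in memory at once.

```python
def iter_posts_in_batches(fetch_batch, batch_size):
    """Yield lists of posts, calling fetch_batch(offset, limit) once per batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    offset = 0
    while True:
        batch = fetch_batch(offset, batch_size)
        if not batch:
            return
        yield batch
        if len(batch) < batch_size:
            # A short batch means there is nothing left to fetch.
            return
        offset += len(batch)


def process_all(fetch_batch, handle_post, batch_size=500):
    """Process every post; only one batch is held in memory at a time."""
    count = 0
    for batch in iter_posts_in_batches(fetch_batch, batch_size):
        for post in batch:
            handle_post(post)
            count += 1
    return count


if __name__ == "__main__":
    # Assumed in-memory data in place of a real database query.
    fake_posts = [f"post {i}" for i in range(5000)]
    fetch = lambda offset, limit: fake_posts[offset:offset + limit]
    print(process_all(fetch, lambda post: None, batch_size=500))
```

Rerunning the benchmark at a few batch sizes should show where extra memory stops buying meaningful speed.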
This is a really subjective question, given that what is “tolerable” depends on many factors, such as how many concurrent processes will be running, the specs of the hardware it’ll be running on, and how long you expect it to take.
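To make those factors a bit more concrete, here is a small back-of-the-envelope sketch using the figures from the two tests. The 512 MB memory budget is purely an assumed number for illustration, and `throughput` and `max_concurrent` are hypothetical helper names.

```python
POSTS = 5000

# (run time in seconds, peak memory in MB) from the two tests in the question
RUNS = {
    "memory-optimized": (52, 7),
    "speed-optimized": (30, 48),
}


def throughput(posts, seconds):
    """Posts processed per second."""
    return posts / seconds


def max_concurrent(memory_budget_mb, peak_mb):
    """How many copies of the script fit in the budget at peak usage."""
    return memory_budget_mb // peak_mb


if __name__ == "__main__":
    budget_mb = 512  # assumed memory available to the crawler, not a measured value
    for name, (seconds, peak_mb) in RUNS.items():
        rate = throughput(POSTS, seconds)
        copies = max_concurrent(budget_mb, peak_mb)
        print(f"{name}: {rate:.0f} posts/s, up to {copies} concurrent runs in {budget_mb} MB")
```

If only one copy of the cron job ever runs at a time, the peak figure matters far less than it would on a box running many such processes side by side.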