I am developing a website in Rails that will have to run a script once a day. The script reads XML feeds and updates the database. I am using Rails 3.1.1 and run the website on Heroku.
What options do I have to keep the script from completely killing the website while it runs? Adding a dyno would solve it, I think, but that is quite expensive, especially since the extra dyno is not really needed except while the script is running.
Could I run the script on another database and copy it? Run it in the background? In short, what options do I have?
Edit: I wasn’t really clear here. My concern is affecting the webserver/database as little as possible, not how to run the script (with whenever etc). I am planning to run the script during the night to keep the impact low, but I still don’t want the website to be completely down during that hour.
A lot of this depends on the performance characteristics of your script. If it is very CPU intensive but otherwise low impact then I wouldn’t worry: when using something like Heroku Scheduler the job runs in a separate dyno, so it won’t affect your other dynos that are serving requests.
Heavy database usage is another thing altogether. Your database has a finite amount of IO, cache, CPU etc., and if you are pushing it hard (lots of writes is generally worse than lots of reads, since writes bust caches) then you can degrade performance for your other dynos.
It is also possible to stop the website from working at all: if your job ends up taking locks on rows/tables that the rest of the app is trying to access, then your web dynos will be blocked until your job releases those locks.
If you’re parsing a feed and updating db rows one by one as you traverse the feed, you’ll probably be OK: lots of little writes/reads are better than massive ones in terms of lock contention. I also don’t think you’ll be hitting the db that hard, since it sounds like you’d mostly be loading one row at a time via an indexed column, doing some Ruby calculations and then updating a single row.
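To illustrate the "lots of little writes" idea, here is a minimal sketch in plain Ruby. The helper name `update_in_batches`, its `batch_size` and `pause` parameters, and the Hash standing in for the database are all assumptions made for illustration; they are not part of Rails or Heroku. In a real app the block would do the ActiveRecord updates, probably inside one short transaction per batch.

```ruby
# Hypothetical helper (not a Rails API): walks the feed items in small
# slices so each group of writes, and any lock it holds, stays short-lived.
# Returns the number of batches processed.
def update_in_batches(items, batch_size: 100, pause: 0.0)
  batches = 0
  items.each_slice(batch_size) do |batch|
    # The caller's block performs the writes for this slice. In a Rails
    # app this is where a short transaction would go.
    yield batch
    batches += 1
    # Optional breathing room so web requests can get at the database.
    sleep(pause) if pause > 0
  end
  batches
end

# Stand-in for the database: a Hash keyed by an indexed identifier.
store = { "a" => 1, "b" => 2, "c" => 3 }
# Stand-in for parsed feed entries: [identifier, new value] pairs.
feed = [["a", 10], ["b", 20], ["c", 30]]

update_in_batches(feed, batch_size: 2) do |batch|
  batch.each { |key, value| store[key] = value }
end
```

A small `batch_size` and a non-zero `pause` trade total run time for less pressure on the database at any one moment, which fits a job that runs overnight anyway.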
If you do find that performance is being degraded unacceptably, and if the bottleneck is reads, then one way out is to have a read slave (also known as a replica, or in Heroku speak a follower). In a nutshell this is a separate, read-only database server that tracks the main database server (so it’s always pretty much up to date). Anything you do to this server can’t affect your master db, so you can query away without worrying.
This doesn’t help you if the problem is the number of writes you need to do. To an extent this can be solved (at a cost) by switching to a beefier database server. Different types of data store (e.g. MongoDB, Redis) are sometimes more appropriate than a relational database for some usage patterns. Sometimes it is possible to architect out some of your performance hot spots, but clearly you’re the one best placed to consider that.
This is all very abstract: the only way you’ll really know is by trying. Set up a copy of your app, kick off this task and see how performance degrades (or do this against the real app if you’re not worried about a one-off impact).
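One rough way to put numbers on "see how performance degrades" is to time the same request repeatedly, once while the job is idle and once while it is running, and compare the results. The sketch below uses only Ruby's standard `benchmark` library; the helper names `sample_latencies` and `summarize` are made up for this example, and the URL in the usage comment is a placeholder.

```ruby
require "benchmark"

# Hypothetical helper: runs the given block `samples` times and returns
# the wall-clock duration of each run in seconds.
def sample_latencies(samples: 5, interval: 0.0)
  Array.new(samples) do
    elapsed = Benchmark.realtime { yield }
    sleep(interval) if interval > 0
    elapsed
  end
end

# Hypothetical helper: reduces a list of durations to min, max and mean.
def summarize(latencies)
  sorted = latencies.sort
  {
    min: sorted.first,
    max: sorted.last,
    mean: latencies.sum / latencies.size
  }
end

# Usage idea (placeholder URL, needs require "net/http"):
#   baseline = summarize(sample_latencies(samples: 20, interval: 1.0) {
#     Net::HTTP.get(URI("https://your-staging-app.example/"))
#   })
# Run it again while the feed job is active and compare the two summaries.
```

If the mean and max barely move while the job runs, the job is probably not worth optimising further.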