I’ve been developing a website that will import ~12 million lines per hour (~1 GB of data) into a MySQL database. After looking at different VPS providers and then Amazon EC2, I’d like to go with the most cost-effective solution.
As for visitors, the website should only see ~300-600k page views a month (15 GB of bandwidth at most), spread evenly throughout the day.
When I import the data I use “INFILE” (LOAD DATA INFILE); each import loads ~200-350k lines at a time and only takes ~1-3 seconds. The imports are run via a cron job once every minute (that’s 1,440 times a day).
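To put those numbers in perspective, here is a rough back-of-the-envelope sketch in Python. It only uses the figures quoted above (200-350k lines per run, 1-3 seconds per run, one run per minute); the function and variable names are made up for illustration, and the "busy fraction" is just wall-clock time spent importing, not a measurement of actual CPU usage.

```python
# Back-of-the-envelope numbers for the import schedule described above.
# All names here are illustrative; the inputs come from the question itself.

RUNS_PER_HOUR = 60                 # cron fires once a minute
RUNS_PER_DAY = RUNS_PER_HOUR * 24  # 1,440 runs a day


def import_load(rows_per_run, seconds_per_run, interval_seconds=60):
    """Return hourly row volume and the share of each interval spent importing."""
    return {
        "rows_per_hour": rows_per_run * RUNS_PER_HOUR,
        # Wall-clock share only; real CPU use during the import may be lower.
        "busy_fraction": seconds_per_run / interval_seconds,
    }


low = import_load(rows_per_run=200_000, seconds_per_run=1)
high = import_load(rows_per_run=350_000, seconds_per_run=3)

print(RUNS_PER_DAY)
print(low["rows_per_hour"], high["rows_per_hour"])
print(low["busy_fraction"], high["busy_fraction"])
```

So the import itself keeps the server busy for at most about 5% of each minute; the open question is what happens to visitor queries during those short bursts.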
Would it be better to have a VPS, or go with Amazon EC2? And if I go with Amazon EC2, would the CPU spikes be too much for a Micro or even a Small instance (would I need a High-CPU Medium instance)?
I’d appreciate any insight on how much CPU MySQL would actually use in those ~1-3 seconds, or how much CPU time a Micro instance is allowed.
If I go the VPS route, I will get the S1 package from http://x10vps.com/self-managed-vps.php and upgrade to S3 if needed.
It’s difficult to say much about this without knowing the particulars of the data you are loading. Does MySQL have to update indexes in the table? Is everything already sorted properly? How many columns are there, and are you performing any CPU-intensive operations in your LOAD DATA INFILE statement (e.g. replacing parts of strings)? Is old data deleted after a certain period of time, or should all data be retained for the life span of your application?
Having said that, it sounds like a single Micro instance will probably have trouble dealing with this. Importing the data should be OK provided you back the instance with plenty of EBS storage (maybe in a RAID configuration), but if that single instance will also be responsible for running user queries on such a large data set, that’s probably not going to run very smoothly. At best you’ll end up with a few seconds’ latency for user requests while the import script is running. Depending on your application, that may or may not be acceptable.
If you are going to run expensive queries on your data, I can say right now that that is not going to work very well on a single Micro instance 🙂 You could scale up to a larger instance or, depending on your needs, you may want to consider SimpleDB or a similar NoSQL solution instead (though that would take more code in your import script, because you’ll have to do batch puts of at most 25 items per batch).
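To give an idea of the extra code that batching implies, here is a minimal Python sketch that splits rows into groups of at most 25. It makes no actual SimpleDB calls; `batches` is a hypothetical helper, and the place where each batch would be sent is only marked with a comment.

```python
from itertools import islice


def batches(items, size=25):
    """Yield consecutive lists of at most `size` items from any iterable."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Stand-in for one import run of 200,000 rows (the low end quoted in the question).
rows = range(200_000)

batch_count = 0
for batch in batches(rows):
    # A real import script would send `batch` to the data store here.
    batch_count += 1

print(batch_count)
```

At 25 items per batch, a single 200k-line import turns into 8,000 separate batch requests, which is worth weighing against the 1-3 seconds a bulk load takes in MySQL.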
But these are just some general thoughts. AWS actually provides new users with a free usage tier, which lets you run an EC2 Micro instance continuously for a full year without having to pay a dime, so why not sign up for an account and run your own tests? More details here. Here’s some more general information on how Micro instances work and what applications they would be suitable for.