I have used ElasticMapReduce for some time. It is quite convenient, but I can’t run HBase on it since the Hadoop cluster is only temporarily available (I have asked a somewhat related question at HBase and Hadoop).
So I want to try installing Hadoop on a set of EC2 machines. I know Hadoop has an EC2-related directory, src/contrib/ec2. It looks like a Hadoop cluster can be launched simply by typing a command, after which I can log into the master node to run jobs and so on. Before trying this, I would like to hear about any gotchas from people who have been using it. Thanks!
Indeed, there are two options for using Hadoop on Amazon: provisioning your own cluster or using EMR. Orthogonal to this decision, you can use either HDFS or S3 as your file system.
It is not a short story, but I will try to highlight some pros and cons of each of these choices.
You can use EMR if you need to run one or a few jobs a day and do not need a Hadoop cluster all the time. In this case you put your data into S3 and can fully script the process. You also save the time of installing the cluster. The main disadvantage is that it is not easy to customize, use third-party libraries, etc.
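To give an idea of what "fully script the process" can look like, here is a minimal sketch that only builds the request for a transient cluster: it runs one step against data in S3 and shuts down when the step finishes. The key names follow the parameters of boto3's EMR `run_job_flow` call as I understand them. The function name, release label, instance types, role names, and bucket paths are all assumptions for illustration, not something from the original setup.

```python
def build_job_flow_request(name, jar_uri, input_uri, output_uri, node_count=3):
    """Build keyword arguments describing a transient EMR cluster with one step.

    Nothing is sent to AWS here; the dict is only assembled.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumption: placeholder release label
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumption: example type
            "SlaveInstanceType": "m5.xlarge",   # assumption: example type
            "InstanceCount": node_count,
            # Transient cluster: terminate once all steps are done.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "main-step",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                # Input and output both live in S3, so nothing is lost
                # when the cluster goes away.
                "HadoopJarStep": {"Jar": jar_uri, "Args": [input_uri, output_uri]},
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumption: default role name
        "ServiceRole": "EMR_DefaultRole",      # assumption: default role name
    }


request = build_job_flow_request(
    "nightly-job",
    "s3://example-bucket/jobs/job.jar",  # hypothetical paths
    "s3://example-bucket/input/",
    "s3://example-bucket/output/",
)
# With credentials configured, this would presumably be submitted with:
#   boto3.client("emr").run_job_flow(**request)
```

The important bit is `KeepJobFlowAliveWhenNoSteps` set to false: you pay only while the step runs, which is the whole point of EMR for occasional jobs.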
If you want to tweak Hadoop, you should install your own cluster.
When your data is already in S3, or you need to store it there after processing, S3 is a good choice. At the same time, you will probably get lower performance than with HDFS. It should also be said that Amazon instances have very little local storage, so HDFS gets really expensive: you have to keep the cluster running (and pay for it) just to preserve what is stored on it.
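A rough back-of-the-envelope comparison shows why keeping a cluster alive only to hold HDFS data adds up. The hourly price, node count, and job duration below are made-up numbers for illustration, not real Amazon prices, and S3 storage charges and any EMR surcharge are ignored.

```python
def monthly_cost(nodes, hours_per_day, price_per_node_hour, days=30):
    """Instance cost for a cluster that runs hours_per_day on each of `days` days."""
    return nodes * hours_per_day * days * price_per_node_hour


PRICE = 0.25  # hypothetical price per node-hour, not a real Amazon rate
NODES = 4     # hypothetical cluster size

# HDFS as the primary store: the cluster must stay up around the clock.
always_on = monthly_cost(NODES, 24, PRICE)

# S3 as the primary store: the cluster exists only while the job runs
# (assumed here to be 2 hours a day).
transient = monthly_cost(NODES, 2, PRICE)

print(always_on)  # 720.0
print(transient)  # 60.0
```

With these assumed numbers the always-on cluster costs twelve times as much, purely because it has to be there to keep the data.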
I would say that if you really need HDFS with all its throughput, you need your own cluster on your own hardware. When you are working on Amazon, it is most practical to use S3 as your file system.