When running Hadoop on EC2, I seem to have two options:
- A: Manage the cluster myself, using the EC2-specific shell scripts that come with Hadoop.
- B: Use Elastic MapReduce, and pay a little extra for the convenience.
I’m leaning towards B, but I’d appreciate some advice from people with more experience. Here are my questions:
- Are there any tasks that can be done with one of these methods but not the other?
- Are there other options besides these two that I’m overlooking?
- If I choose B, how easy would it be to go back to A? That is, what’s the danger of vendor lock-in?
People close to the Amazon Elastic MapReduce (EMR) development team have told me that EMR has at least two other advantages: a) Amazon actively applies bug fixes and performance enhancements to the Hadoop code base used on EMR, and b) Amazon runs a high-performance network between EMR servers and S3 servers that may not be available between EC2 servers and S3 servers.
UPDATE: See @mat’s comments, which refute the rumored advantages of using EMR.