EC2 provides a very convenient, on-demand, scalable mechanism for running distributable (parallelizable) processes, and S3 provides a reliable storage service.
I was trying to use EC2 nodes for an ETL and analytics process. This process needs a large amount of data (100 GB to 1 TB) ingested very quickly, several times a day, and adequate compute resources made available for a short duration.
The above design needs:
- A high-bandwidth, fast connection between S3 and EC2.
- A reliable S3-to-EC2 connection, since starting nodes, pumping in data, executing processes and terminating nodes all have to be scheduled as tightly as possible, not just to save costs but also because SLAs are involved.
But as yet:
- The only means of pulling data out of S3 seems to be via HTTP, and hence it is constrained by the download bandwidth of the EC2 nodes.
- The data ingestion also goes over the internet and hence can be too unreliable for strict scheduling, necessitating adequate buffering across jobs.
In a private data-center setup one can set up a faster (say 10 Gbps) dedicated line between storage and physical nodes.
Are there any alternatives or service options in AWS that can address the above requirements?
Depends, hugely, on all sorts of things – how much network activity the other EC2 instances on the same physical server are doing, the particular S3 node you’re hitting at any one time, whether you’re in the same region as your S3 endpoint, etc.
You can benchmark yourself, but even then it’ll vary a lot. I’ve gotten multiple megabytes per second at times and a couple hundred kilobytes at other times.
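If you want to run that benchmark, here's a minimal sketch using only the Python standard library. It assumes your test object is reachable over a plain HTTP(S) URL (for example a public or pre-signed S3 URL). The function names `measure_throughput`, `http_chunks` and `hours_to_transfer` are made up for this example and aren't part of any AWS tooling.

```python
import sys
import time
import urllib.request
from typing import Callable, Iterable, Iterator


def measure_throughput(chunks: Iterable[bytes],
                       clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume an iterable of byte chunks and report the observed transfer rate."""
    start = clock()
    total = 0
    for chunk in chunks:
        total += len(chunk)
    elapsed = clock() - start
    rate = total / elapsed if elapsed > 0 else float("inf")
    return {"bytes": total, "seconds": elapsed, "bytes_per_sec": rate}


def http_chunks(url: str, chunk_size: int = 1 << 20) -> Iterator[bytes]:
    """Stream an HTTP(S) object in fixed-size chunks instead of buffering it whole."""
    with urllib.request.urlopen(url) as resp:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            yield chunk


def hours_to_transfer(total_bytes: float, bytes_per_sec: float) -> float:
    """Estimate how long a payload would take at a given sustained rate."""
    return total_bytes / bytes_per_sec / 3600.0


if __name__ == "__main__" and len(sys.argv) > 1:
    url = sys.argv[1]  # assumption: URL of a test object you placed in S3
    for run in range(5):  # repeat, since the rate varies a lot between runs
        result = measure_throughput(http_chunks(url))
        rate = result["bytes_per_sec"]
        print(f"run {run}: {rate / 1e6:.2f} MB/s, "
              f"1 TB would take ~{hours_to_transfer(1e12, rate):.1f} h at this rate")
```

Run it from the EC2 instance itself, a few times and at different times of day, and look at the spread rather than a single number. As a rough sanity check, a single stream sustaining 2 MB/s (an illustrative figure, not a measurement) would need about 139 hours for 1 TB, so one sequential download per node is nowhere near enough for the volumes in the question.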