I’m looking for a cloud computing solution for the following scenario, but I can’t find any service from Amazon AWS or similar providers that matches my problem description.
Do you know any cloud computing platform for my problem?
The general problem:
I want to run some data analysis on a data stream (only about 1k per second).
Data analysis is carried out by a bunch of independent threads that operate on that data stream.
Each thread simply computes a Boolean value.
The more threads I have, the better the computed result.
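To make the setup concrete, here is a minimal sketch of that fan-out in Python. Everything in it is an assumption for illustration: the threshold analyzers, the `analyze_sample` and `majority` helpers, and the idea of combining the Boolean values by majority vote are hypothetical, since the actual analysis and how the results are combined are not described here.

```python
from concurrent.futures import ThreadPoolExecutor


def make_threshold_analyzer(threshold):
    """Hypothetical analyzer: returns True when a sample exceeds its threshold."""
    def analyze(sample):
        return sample > threshold
    return analyze


def analyze_sample(pool, analyzers, sample):
    """Run every independent analyzer on one sample and collect the Booleans."""
    futures = [pool.submit(analyzer, sample) for analyzer in analyzers]
    return [future.result() for future in futures]


def majority(votes):
    """Assumed aggregation: True when more than half of the analyzers say True."""
    return sum(votes) * 2 > len(votes)


# Five independent analyzers; adding more refines the combined answer.
analyzers = [make_threshold_analyzer(t) for t in (10, 20, 30, 40, 50)]

# Stand-in for the incoming data stream.
stream = [5, 25, 45, 60]

with ThreadPoolExecutor(max_workers=len(analyzers)) as pool:
    results = [majority(analyze_sample(pool, analyzers, s)) for s in stream]

print(results)
```

Because the analyzers share nothing, the only thing that limits how many can run at once is the number of available cores. Note that in CPython, threads do not run CPU-bound code in parallel, so a real deployment of this sketch would need a process pool or a runtime with truly parallel threads, such as the JVM.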
My current solution:
I’ve scrounged a box with an Intel Core i7 from another department, but now they want it back :-).
The ideal solution:
Some service that provides me with an abstract machine (like a JVM with unlimited resources) on which I can spawn a great number of threads.
Also there needs to be some kind of connection to stream the input data and get back the computed results (< 1k per second).
Things should happen in real time (in contrast to being scheduled to be executed like “in the next few minutes”).
So the bottleneck is not memory or disk space, but just computing power and latency.
(And since I need the data analysis just every now and then, cloud computing seems to be economically reasonable here.)
For completeness, the major vendors offer a few categories of choices:
Scalable cloud compute: from AWS it’s EC2; from Google it’s Google Compute Engine (still in private beta); from Microsoft it’s Azure Virtual Machines (also still in private beta). There are, of course, many other vendors, such as Rackspace (which uses OpenStack and more). Given your scenario, I believe something in this category would be the best choice for you.
Cloud-based MapReduce (running on Hadoop): from AWS that’s Elastic MapReduce; from Google that’s BigQuery; from Microsoft that’s Hadoop on Azure (which is still in beta). There are other vendors in this space as well, such as Cloudera and Hortonworks; here’s a list.