Context:
We are considering an AMQP-compliant solution as a way to process a constant live stream of data that amounts to 90 GB daily. What we'd like to achieve is more or less live stats, based on all or some combination of the metrics we're observing. The strategy under consideration is to send the data to the queue and have workers process deltas of it, sending the result back onto the queue as an aggregation of the original data.
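To make the strategy concrete, here is a minimal sketch of such a worker. It uses Python's in-process `queue.Queue` as a stand-in for the AMQP broker, and the message shape (`(metric, delta)` tuples), the `batch_size` parameter, and the function names are all assumptions made for illustration, not part of our actual setup.

```python
import queue
import threading
from collections import defaultdict

STOP = None  # hypothetical sentinel telling the worker to shut down


def aggregate_worker(inbox, outbox, batch_size):
    """Sum deltas per metric and publish one aggregate per batch."""
    totals = defaultdict(float)
    seen = 0
    while True:
        msg = inbox.get()
        if msg is STOP:
            break
        metric, delta = msg
        totals[metric] += delta
        seen += 1
        if seen == batch_size:
            # Send the aggregation back "on the queue".
            outbox.put(dict(totals))
            totals.clear()
            seen = 0
    if seen:
        # Flush whatever is left in a partial batch.
        outbox.put(dict(totals))


def run_demo():
    inbox, outbox = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=aggregate_worker, args=(inbox, outbox, 3))
    worker.start()
    for msg in [("hits", 1), ("hits", 2), ("bytes", 512), ("hits", 4)]:
        inbox.put(msg)
    inbox.put(STOP)
    worker.join()
    results = []
    while not outbox.empty():
        results.append(outbox.get())
    return results
```

With a real broker, `inbox` and `outbox` would be two queues on the broker and several such workers would consume in parallel; the sketch only shows the delta-to-aggregate step.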
Observation:
To me, this looks like a job for something like Hadoop, but concerns (and shields) were raised, mainly about speed. I didn't have time to benchmark both, but we're expecting to pump a good amount of data through the queue (somewhere in the neighborhood of 10 to 100 MB/s). I still think it looks like a job for a distributed computing system, and I also feel the queue solution will scale worse than a distributed computing solution.
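As a rough sanity check on those figures, the conversion between a daily volume and a per-second rate can be sketched as below. It assumes decimal units (1 GB = 1000 MB) and traffic spread evenly over the day; the function names are just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def average_mb_per_s(gb_per_day):
    """Average rate in MB/s for a given daily volume in GB."""
    return gb_per_day * 1000 / SECONDS_PER_DAY


def gb_per_day(mb_per_s):
    """Daily volume in GB if a rate in MB/s were sustained all day."""
    return mb_per_s * SECONDS_PER_DAY / 1000
```

Under those assumptions, 90 GB daily averages out to roughly 1 MB/s, while a sustained 10 MB/s would already be 864 GB per day. So the 10 to 100 MB/s range would have to describe bursts rather than a steady rate.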
Question:
Put simply, am I right? I've read a bit about Hadoop and HDFS. I was thinking about using another file system, such as Lustre, to circumvent the NameNode SPOF, and about adding some mechanism so that the whole cluster tolerates the failure of any kind of node.
It's really hard to write your own "distributed environment" solution when you need fault tolerance, good balancing, etc. If you need near-real-time map/reduce, you should check out Storm, which is what Twitter uses for its huge data needs. It's less complicated than Hadoop and better at consuming queue-type input (in my opinion).
Also, if you decide to analyze your data on Hadoop, don't worry too much about the NameNode SPOF; there are ways to avoid it.