I have a task: speed up the current implementation of an inverted index. In my opinion, the best approach is to run it in the cloud:
- Divide the input text into a few parts (or just take a few different text files)
- Send the texts to the nodes
- Run the algorithm on each node with different input data
- Collect the results and merge them
My question is: what is the easiest way to implement it?
My current ideas are:
- Windows Azure with worker roles – is it possible to send different data to each node and merge the results later?
- Windows Azure and HPC Scheduler – isn’t it overkill for a task like this? I am worried about configuration and costs (new node = new worker role?)
- Use another cloud, like Amazon or Google – I’d like to code in C# and I am familiar with Microsoft technologies, so I am a little wary of them
Please give me any advice on how you would achieve this goal. I am new to cloud computing (although I know some basics: MPI, SOA, CUDA, Azure).
This is a case for MapReduce.
In fact, Hadoop was created out of the needs of Nutch (which builds an inverted index).
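To show how the steps in the question line up with MapReduce, here is a minimal single-machine sketch in Python (chosen only for brevity; the same shape carries over to C#). The function names `build_partial_index` and `merge_indexes`, the tokenizer, and the sample file names are all made up for illustration and are not part of Hadoop or any cloud SDK. The map step is what each node would run on its own chunk; the reduce step is the final merge.

```python
import re
from collections import defaultdict

# Hypothetical tokenizer: lowercase runs of word characters.
TOKEN = re.compile(r"\w+")

def build_partial_index(docs):
    """Map step: index one chunk of documents (the work of a single node).

    docs maps a document id to its text.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in TOKEN.findall(text.lower()):
            index[term].add(doc_id)
    return index

def merge_indexes(partials):
    """Reduce step: union the posting sets from all partial indexes."""
    merged = defaultdict(set)
    for partial in partials:
        for term, doc_ids in partial.items():
            merged[term] |= doc_ids
    # Sort the postings so the result is deterministic.
    return {term: sorted(ids) for term, ids in merged.items()}

if __name__ == "__main__":
    # Two chunks stand in for the data sent to two different nodes.
    chunk_a = {"a.txt": "the quick brown fox"}
    chunk_b = {"b.txt": "the lazy dog", "c.txt": "quick dog"}
    index = merge_indexes(build_partial_index(c) for c in (chunk_a, chunk_b))
    print(index["quick"])  # ['a.txt', 'c.txt']
```

In a real cluster the framework takes care of distributing the chunks, running the map step on each node, and grouping the output by term before the merge, so you only write the two functions.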
You could use either:
a) Amazon’s Elastic MapReduce
or
b) Sign up for HDInsight on Azure
There are other providers too (PiCloud is one that comes to mind).