I have a large Mongo database (100GB) hosted in the cloud (MongoLab or MongoHQ).

Question


Asked: June 5, 2026


I have a large Mongo database (100GB) hosted in the cloud (MongoLab or MongoHQ). I would like to run some Map/Reduce tasks on the data to compute some expensive statistics, and I was wondering what the best workflow is for getting this done. Ideally I would like to use Amazon's Map/Reduce services to do this instead of maintaining my own Hadoop cluster.

Does it make sense to copy the data from the database to S3 and then run Amazon Map/Reduce on it? Or are there better ways to get this done?

Also, further down the line I might want to run the queries more frequently, such as every day, so the data on S3 would need to mirror what is in Mongo. Would this complicate things?

Any suggestions/war stories would be super helpful.


1 Answer

Editorial Team · Answer 1 · 2026-06-05T14:42:35+00:00

Amazon EMR provides a utility called S3DistCp for copying data into and out of S3. It is commonly used when you run jobs on EMR and don't want to host your own cluster or keep instances running just to store data. S3 can store all your data for you, and EMR can read from and write to S3.

However, transferring 100GB will take time, and if you plan on doing this more than once (i.e. as more than a one-off batch job), the transfer will become a significant bottleneck in your processing, especially if the data is expected to grow.
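To get a rough feel for that transfer cost, here is a back-of-the-envelope sketch. The throughput figures are assumptions picked purely for illustration, not measurements of MongoLab, MongoHQ, or S3, and `transfer_hours` is a hypothetical helper name.

```python
def transfer_hours(size_gb, throughput_mb_per_s):
    """Estimate how long a bulk copy takes at a sustained throughput.

    Uses decimal units (1 GB = 1000 MB) and ignores export/import
    overhead, retries, and compression.
    """
    size_mb = size_gb * 1000
    seconds = size_mb / throughput_mb_per_s
    return seconds / 3600


# Assumed sustained throughputs in MB/s; real numbers depend on the
# hosting provider, network path, and how the export is parallelised.
for rate in (10, 50):
    print(f"100 GB at {rate} MB/s: about {transfer_hours(100, rate):.2f} hours")
```

Under these assumptions a single copy takes somewhere between half an hour and a few hours, and that cost is paid again on every run unless the copy is incremental.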

It looks like you may not need to use S3 at all. MongoDB has an adapter that lets you run Hadoop map-reduce jobs directly on top of your MongoDB data: http://blog.mongodb.org/post/24610529795/hadoop-streaming-support-for-mongodb

This looks appealing since it lets you write the map-reduce job in Python, JavaScript, or Ruby.

I think this mongo-hadoop setup would be more efficient than copying 100GB of data out to S3.
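As a minimal sketch of the kind of map and reduce functions such a job would use, the plain-Python example below computes per-group statistics over a few documents. The field names `category` and `amount` and the `run_local` driver are assumptions made up for illustration; in a real streaming job the adapter would feed documents to the mapper and grouped values to the reducer, and the small in-memory driver here only stands in for that step.

```python
from collections import defaultdict


def mapper(documents):
    """Emit one (key, value) pair per input document."""
    for doc in documents:
        # 'category' and 'amount' are hypothetical field names.
        yield doc["category"], doc["amount"]


def reducer(key, values):
    """Collapse all values seen for one key into summary statistics."""
    values = list(values)
    total = sum(values)
    return {"_id": key, "count": len(values), "sum": total,
            "avg": total / len(values)}


def run_local(documents):
    """Stand-in for the shuffle phase: group mapper output by key,
    then call the reducer once per key."""
    groups = defaultdict(list)
    for key, value in mapper(documents):
        groups[key].append(value)
    return [reducer(key, groups[key]) for key in sorted(groups)]


sample = [
    {"category": "a", "amount": 10.0},
    {"category": "b", "amount": 4.0},
    {"category": "a", "amount": 20.0},
]
results = run_local(sample)
for row in results:
    print(row)
```

Keeping the mapper and reducer as ordinary functions makes it possible to test the logic on a handful of documents before pointing it at the full 100GB.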

UPDATE: An example of using map-reduce with mongo here.
