Mongo supports map/reduce queries, but they don’t seem to be map/reduce in the Hadoop sense (running in parallel). What is the best way to run queries on a massive Mongo database? Do I need to export it to another place?
Depending on what exactly you need to do, your options (while staying within Mongo) are:
1) Keep using map/reduce in Mongo, but fire up some secondaries for m/r purposes. This is one somewhat easy way of parallelizing map/reduce. There are limits, though: on a secondary you can only use the “out: inline” option, so the results need to be ~16MB or less. This is only really feasible if you haven’t sharded yet.
2) Look into the aggregation framework coming in 2.2 (2.2.0-rc0 is out, we’ve found it to be pretty stable at MongoHQ). This is better optimized on the db level, mostly keeps you out of the janky JavaScript engine, and is one of the more interesting features 10gen has added. It will also work in a sharded environment.
For either of the above, you want to make sure you have enough RAM (or really fast disks) to hold all the input data, the intermediate steps, and the result. Otherwise you’re bound by IO speeds and not getting much out of your CPU.
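To make the two options concrete, here is a rough sketch of what each request could look like, written as plain Python data so it runs without a server. The collection name ("events"), the field name ("user_id"), and both helper function names are made up for illustration; only the command keys ("mapReduce", "out", "$group", "$sort") come from MongoDB itself. Treat it as an illustration of the shapes involved, not a tuned query.

```python
# Sketch only: this builds the request documents and never contacts a server.
# "events" and "user_id" are hypothetical names chosen for the example.

# JavaScript run by the server for option 1: count documents per user.
MAP_JS = "function () { emit(this.user_id, 1); }"
REDUCE_JS = "function (key, values) { return Array.sum(values); }"


def inline_map_reduce_command(collection):
    """Option 1: map/reduce with inline output.

    The whole result comes back in a single reply, which is why it has
    to stay under the ~16MB limit mentioned above.
    """
    return {
        "mapReduce": collection,  # the command name must be the first key
        "map": MAP_JS,
        "reduce": REDUCE_JS,
        "out": {"inline": 1},
    }


def count_per_user_pipeline():
    """Option 2: the same count as an aggregation pipeline, with no JavaScript."""
    return [
        {"$group": {"_id": "$user_id", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
    ]


if __name__ == "__main__":
    print(inline_map_reduce_command("events"))
    print(count_per_user_pipeline())
```

With a driver such as PyMongo, the first document would typically be passed to `db.command(...)` with a secondary read preference, and the second to `db.events.aggregate(...)`. Note that newer MongoDB releases have deprecated the `mapReduce` command in favor of the aggregation pipeline, so option 2 is the one more likely to keep working.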
If you want to step outside of Mongo, you can try the Mongo Hadoop adapter. Hadoop is a much better way of doing map/reduce, and this will let you use your Mongo data as input. This can be operationally complicated, though, which means it is either high effort or fragile.