I have an interesting problem. I have a working M/R version of this but it’s not really a viable solution in a small-scale environment since it’s too slow and the query needs to be executed real-time.
I would like to iterate over each element in a collection and score it, sort by score descending, limit to the top 10 and return the results to the application.
Here is the function I’d like applied to each document in pseudo code.
var score = 0;
foreach(tag in document.Tags) {
    score += someMap[tag];
}
return score;
Since your someMap is changing each time, I don’t see any alternative other than to score all the documents and return the highest-scoring ones. Whatever method you adopt for this type of operation, you’ll have to consider all the documents in the collection, which is going to be slow, and will become more and more costly as the collection you’re scanning grows.
One issue with map reduce is that each mongod instance can only run one concurrent map reduce. This is a limitation of the JavaScript engine, which is single-threaded. Multiple map reduces will be interleaved, but they cannot run concurrently with one another. This means that if you’re relying on map reduce for “real-time” uses, that is, if your web page has to run a map reduce to render, you’ll eventually hit a limit where page load times become unacceptably slow.
You can work around this by querying all the documents into your application, and doing the scoring, sorting, and limiting in your application code. Queries in MongoDB can run concurrently, unlike map reduce, though of course this means that your application servers will have to do a lot of work.
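As a rough illustration of the application-side approach, here is a minimal sketch in plain JavaScript. The helper name topScored and the sample documents are made up for the example. It assumes the documents have already been fetched into an array, that each has a tags array, and that tags missing from someMap count as 0 (the pseudocode in the question does not say what should happen there).

```javascript
// Hypothetical helper: score, sort, and limit in application code.
// Assumes docs were already loaded from MongoDB into a plain array.
function topScored(docs, someMap, limit) {
  var has = Object.prototype.hasOwnProperty;
  return docs
    .map(function (doc) {
      // Sum the map value of every tag; unknown tags add 0 (an assumption).
      var score = (doc.tags || []).reduce(function (sum, tag) {
        return sum + (has.call(someMap, tag) ? someMap[tag] : 0);
      }, 0);
      return { _id: doc._id, score: score };
    })
    .sort(function (a, b) { return b.score - a.score; }) // highest first
    .slice(0, limit);
}

// Made-up sample data for illustration only.
var docs = [
  { _id: 1, tags: ["a", "b"] },
  { _id: 2, tags: ["b"] },
  { _id: 3, tags: ["a", "a", "c"] }
];
var top = topScored(docs, { a: 5, b: 2 }, 2);
// top is [{ _id: 3, score: 10 }, { _id: 1, score: 7 }]
```

Every document still has to cross the network and be scored, so this only moves the work from mongod to the application servers.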
Finally, if you are willing to wait for MongoDB 2.2 to be released (which should be within a few months), you can use the new aggregation framework in place of map reduce. You’ll have to massage the someMap to generate the correct pipeline steps. Here’s an example of what this might look like if someMap were {"a": 5, "b": 2}:
This is a little complicated, and bears explaining:
1. Unwind the tags array, so that each document in the pipeline carries a single tag; the other fields (notably _id) are duplicated for each unwound element.
2. Project one new field per entry in someMap. The $cond/$eq expression for each roughly means (for the tag1score example) “if the value in the document in the ‘tags’ field is equal to ‘a’, then return 5 and assign that value to a new field tag1score, else return 0 and assign that”. This expression would be repeated for each tag/score combination in your someMap. At this point in the pipeline, each document will have N tagNscore fields, but at most one of them will have a non-zero value.
3. Add a score field whose value is the sum of the tagNscore fields in the document.
4. Group by _id, and sum up the value of the score field from the previous step across all documents in each group.
5. Sort by score, descending (i.e. greatest scores first).
I’ll leave it as an exercise to the reader how to convert someMap into the correct set of projections in step 2, and the correct set of fields to add in step 3.
This is essentially the same set of steps that your application code or map reduce would go through, but it has two distinct advantages. Unlike map reduce, the aggregation framework is fully implemented in C++, so it is faster and more concurrent. And unlike querying all the documents into your application, the aggregation framework works with the data on the server side, saving network load. But like the other two approaches, it still has to consider each document, and can only limit the result set once the score has been calculated for all of them.
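One possible way to generate the steps described above from someMap is sketched below. This is a reconstruction from the description, not necessarily the exact pipeline the answer had in mind. The helper name buildScorePipeline, the field name tags, and the trailing $limit stage (taken from the “top 10” requirement in the question) are assumptions. It also assumes someMap has at least one entry.

```javascript
// Hypothetical helper: build aggregation stages from someMap.
// Assumes the array field is called "tags" and someMap is non-empty.
function buildScorePipeline(someMap, limit) {
  var perTag = {};   // step 2: one tagNscore projection per map entry
  var addends = [];  // step 3: references to the fields to add together
  Object.keys(someMap).forEach(function (tag, i) {
    var field = "tag" + (i + 1) + "score";
    // "if tags equals this tag, use its score, else 0"
    perTag[field] = { $cond: [{ $eq: ["$tags", tag] }, someMap[tag], 0] };
    addends.push("$" + field);
  });
  return [
    { $unwind: "$tags" },                                   // step 1
    { $project: perTag },                                   // step 2
    { $project: { score: { $add: addends } } },             // step 3
    { $group: { _id: "$_id", score: { $sum: "$score" } } }, // step 4
    { $sort: { score: -1 } },                               // step 5
    { $limit: limit }                                       // assumed final step
  ];
}

var pipeline = buildScorePipeline({ a: 5, b: 2 }, 10);
// In the mongo shell this would be run along the lines of
// db.collection.aggregate(pipeline), where "collection" is a placeholder.
```

Note that $unwind drops documents whose tags array is empty or missing, so such documents would not appear in the result at all rather than appearing with a score of 0.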