I am looking for a way to generate some summary statistics using Mongo. Suppose I have a collection with many records of the form
{"name" : "Jeroen", "gender" : "m", "age" :27.53 }
Now I want to get the distributions for gender and age. Assume for gender, there are only values "m" and "f". What is the most efficient way of getting the total count of males and females in my collection?
And for age, is there a way to do some ‘binning’ and get a histogram-like summary, i.e. the number of records where age falls in the intervals [0, 2), [2, 4), [4, 6), etc.?
Konstantin’s answer was right. MapReduce gets the job done. Here is the full solution in case others find this interesting.
To count genders, the map function uses the `this.gender` attribute of every record as the key. The reduce function then simply adds them up.

To do the binning, we set the key in the map function to the age rounded down to the nearest multiple of two. So, for example, any value between 10 and 11.9999 gets the same key, "10-12". Then again we simply add them up.