I have a large dataset (about 1.1M documents) that I need to run mapreduce on.
The field to group on is an array named xref. Due to the size of the collection and the fact I’m doing this in a 32-bit environment, I’m trying to reduce the collection to another collection in a new database.
First, here’s a data sample:
{ "_id" : ObjectId("4ec6d3aa61910ad451f12e01"),
"bii" : -32.9867,
"class" : 2456,
"decdeg" : -82.4856,
"lii" : 297.4896,
"name" : "HD 22237",
"radeg" : 50.3284,
"vmag" : 8,
"xref" : ["HD 22237", "CPD -82 65", "-82 64","PPM 376283", "SAO 258336",
"CP-82 65","GC 4125" ] }
{ "_id" : ObjectId("4ec6d44661910ad451f78eba"),
"bii" : -32.9901,
"class" : 2450,
"decdeg" : -82.4781,
"decpm" : 0.013,
"lii" : 297.4807,
"name" : "PPM 376283",
"radeg" : 50.3543,
"rapm" : 0.0357,
"vmag" : 8.4,
"xref" : ["HD 22237", "CPD -82 65", "-82 64","PPM 376283", "SAO 258336",
"CP-82 65","GC 4125" ] }
{ "_id" : ObjectId("4ec6d48a61910ad451feae04"),
"bii" : -32.9903,
"class" : 2450,
"decdeg" : -82.4779,
"decpm" : 0.027,
"hd_component" : 0,
"lii" : 297.4806,
"name" : "SAO 258336",
"radeg" : 50.3543,
"rapm" : 0.0355,
"vmag" : 8,
"xref" : ["HD 22237", "CPD -82 65", "-82 64","PPM 376283", "SAO 258336",
"CP-82 65","GC 4125" ] }
Here are the map and reduce functions (right now I’m only lii and bii fields):
function map() {
try {
emit(this.xref, {lii:this.lii, bii:this.bii});
} catch(e) {
}
}
function reduce(key, values) {
var result = {xref:key, lii: 0.0, bii: 0.0};
try {
values.forEach(function(value) {
if (value.lii && value.bii) {
result.lii += value.lii;
result.bii += value.bii;
}
});
result.bii /= values.length;
result.lii /= values.length;
} catch(e) {
}
return result;
}
Unfortunately, running this eventually comes up with an error message:
db.catalog.mapReduce(map, reduce, {out:{replace:"catalog2", db:"astro2"}});
Wed Nov 23 10:12:25 uncaught exception: map reduce failed:{
"assertion" : "_id cannot be an array",
"assertionCode" : 10099,
"errmsg" : "db assertion failure",
"ok" : 0
The xref field IS an array, but all values are equal in that array. Is it trying to use that array as the id field in the new collections?
Yes it is not possible to set _id as an array, because it has a special behavior for indexing.
The key you emit by is used as _id in the output collection.
Potentially this could work only with an “inline” output mode if the result is small, since it wont go to a collection.
But ideally you would translate the array into a string (for example concat the values) and use that as _id, or make it a sub-object instead of an array.
Also note that the result of your reduce function should not include the key.
Just return {lii: .., bii: ..}