Given:
- Several million records in a mongo collection.
- Each record has 10 fields, of which 4 make up a compound non-unique index; let's call them the KEY.
- Each record has a timestamp.
- Some records have the same KEY value.
- It is possible that the same KEY is found in thousands of records.
I would like to create another collection containing a subset of the original one, where the number of duplicates per KEY is limited to some constant, for instance 1000, and only the most recent duplicates are included.
So, if there are 10000 records with the same KEY, then there will be only the 1000 most recent ones in the new collection.
Below is my code to create an aggregated collection containing all the original records grouped by KEY. I am still missing the part that retains only the most recent 1000, but my code is already extremely inefficient, so I figure I am doing something wrong already:
from pymongo import Connection

def main():
    with Connection() as connection:
        mydb = connection.mydb
        try:
            mydb.aggregated.drop()
            mydb.static.map_reduce("""
                // map
                function() {
                    emit({
                        indexed_field1: this.indexed_field1,
                        indexed_field2: this.indexed_field2,
                        indexed_field3: this.indexed_field3
                    }, {
                        id: this._id,
                        ts: this.ts,
                        // other fields
                    });
                }
                """, """
                // reduce - group the records with the same KEY
                // return the given values array wrapped in an object
                function(key, values) {
                    for (var i = 0; i < values.length; ++i) {
                        if (values[i].items) {
                            values[i] = values[i].items;
                        }
                    }
                    return {items: values};
                }
                """, 'aggregated', finalize="""
                // finalize by flattening the value, which is likely to be an array of nested arrays
                function(key, value) {
                    function flatten(value, collector) {
                        var items = value;
                        if (!(value instanceof Array)) {
                            if (!value.items) {
                                collector.push(value);
                                return;
                            }
                            items = value.items;
                        }
                        for (var i = 0; i < items.length; ++i) {
                            flatten(items[i], collector);
                        }
                    }
                    var collector = [];
                    flatten(value, collector);
                    return collector;
                }
                """)
        except Exception as exc:
            print exc

if __name__ == "__main__":
    main()
Another problem is that I violate the principle that reduce should return the same type as map emits, but I think it is OK in my case, because my reduce and finalize deal with it.
It feels like I am on the wrong track. Any advice?
EDIT
The data looks like this:
{_id: , key1: , key2: , key3: , ts: , bla-bla-bla}
For instance:
- 20,000 records with (key1,key2,key3) == ('yaba', 'daba', 'doo')
- 15,000 records with (key1,key2,key3) == ('yogi', 'bear', '')
- 700 records with (key1,key2,key3) == ('yo', 'ho', 'ho')
- and so on
At the end of the process I need to be left with:
- 1,000 most recent yaba-daba-doo records
- 1,000 most recent yogi-bear records
- all the yo-ho-ho records (because there are fewer than 1,000 of them)
- and so on
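To pin down the expected result, the rule can be sketched in plain Python on an in-memory list. This is only an illustration of the semantics, not something meant to run against millions of records; the function name `latest_per_key` and the tiny sample data are made up for the example, and the limit is set to 3 instead of 1000 to keep it short.

```python
import heapq
from collections import defaultdict


def latest_per_key(records, key_fields, limit=1000, ts_field="ts"):
    """Return at most `limit` most recent records for every KEY value."""
    groups = defaultdict(list)
    for record in records:
        # The KEY is the tuple of values of the indexed fields.
        key = tuple(record.get(field) for field in key_fields)
        groups[key].append(record)

    kept = []
    for group in groups.values():
        # nlargest picks the records with the highest timestamps, newest first.
        kept.extend(heapq.nlargest(limit, group, key=lambda r: r[ts_field]))
    return kept


# Hypothetical sample: 5 yaba-daba-doo records and 2 yo-ho-ho records.
records = (
    [{"key1": "yaba", "key2": "daba", "key3": "doo", "ts": t} for t in range(5)]
    + [{"key1": "yo", "key2": "ho", "key3": "ho", "ts": t} for t in range(2)]
)
result = latest_per_key(records, ["key1", "key2", "key3"], limit=3)
```

With a limit of 3, the three newest yaba-daba-doo records are kept and both yo-ho-ho records survive, because there are fewer of them than the limit.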
I have a bad feeling I am missing something really important here, and should probably sober up before I decide to try and program, but:
Ok so given you have a document that could potentially look like:
You want to take, say, the compound key of [key2, key3, key5, key7] and then pull out the latest 1000 repeated documents with this exact key.
Can’t this be solved with a little hacking? I mean, you are essentially fetching the latest 1k documents for a given key, right?
Doesn’t that do it?
I mean, being really intelligent about it, you can actually write a script that could run faster than an MR by going through the original collection and building a new one based upon these sorts of queries.
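A minimal sketch of such a script, assuming pymongo 3 or later (for `aggregate` and `insert_many`) and the `ts` field from the question. The function names `key_filter` and `copy_latest_per_key` are made up for illustration, and `source` and `target` are expected to be pymongo collection objects. It first collects every distinct KEY with a single `$group` pass, then runs one sorted, limited query per KEY.

```python
def key_filter(key_doc, key_fields):
    """Build an equality filter on the KEY fields from a grouped _id document."""
    # A field absent from the group _id becomes None, which also matches
    # documents where that field is missing.
    return {field: key_doc.get(field) for field in key_fields}


def copy_latest_per_key(source, target, key_fields, limit=1000, ts_field="ts"):
    """Copy at most `limit` most recent documents per KEY from source to target.

    Assumption: `source` and `target` are pymongo Collection objects.
    Returns the number of documents copied.
    """
    # One $group pass yields every distinct combination of the KEY fields.
    group_id = {field: "$" + field for field in key_fields}
    distinct_keys = source.aggregate(
        [{"$group": {"_id": group_id}}], allowDiskUse=True
    )

    copied = 0
    for entry in distinct_keys:
        # Newest first (-1 is descending), capped at `limit` documents.
        newest = list(
            source.find(key_filter(entry["_id"], key_fields))
            .sort(ts_field, -1)
            .limit(limit)
        )
        if newest:
            target.insert_many(newest)
            copied += len(newest)
    return copied
```

Against a live server this might be called as `copy_latest_per_key(db.static, db.latest, ["key1", "key2", "key3"])`, where `latest` is a hypothetical name for the new collection. Extending the compound index on the KEY fields with `ts` should let each per-key query read its documents already in sorted order, though whether this beats the map-reduce would need to be measured on the real data.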