The Archive Base


Editorial Team
Asked: June 1, 2026 (2026-06-01T07:53:55+00:00)

Given: Several million records in a mongo collection. Each record has 10 fields, of

Given:

  • Several million records in a mongo collection.
  • Each record has 10 fields, of which 4 make a compound non-unique index; let's call them the KEY.
  • Each record has a timestamp.
  • Some records have the same KEY value.
  • It is possible that the same KEY is found in thousands of records.

I would like to create another collection containing a subset of the original collection, in which the number of duplicates per KEY is limited to some constant, for instance 1000, and only the most recent duplicates are included.

So, if there are 10000 records with the same KEY, then there will be only the 1000 most recent ones in the new collection.

Below is my code to create an aggregated collection containing all the original records grouped by KEY. I am still missing the part that retains only the most recent 1000, but my code is already extremely inefficient, so I figure I am doing something wrong already:

from pymongo import Connection

def main():
  with Connection() as connection:
    mydb = connection.mydb
    try:
      mydb.aggregated.drop()
      mydb.static.map_reduce("""
// map
function() {
  emit({
    indexed_field1: this.indexed_field1,
    indexed_field2: this.indexed_field2,
    indexed_field3: this.indexed_field3
  }, {
    id: this._id,
    ts: this.ts,
    // other fields
  });
}
""", """
// reduce - group the records with the same KEY
// return the given values array wrapped in an object
function(key, values) {
  for (var i = 0; i < values.length; ++i) {
    if (values[i].items) {
      values[i] = values[i].items;
    }
  }
  return {items: values};
}
""", 'aggregated', finalize="""
// finalize by flattening the value, which is likely to be an array of nested arrays
function(key, value) {
  function flatten(value, collector) {
    var items = value;
    if (!(value instanceof Array)) {
      if (!value.items) {
        collector.push(value);
        return;
      }

      items = value.items;
    }
    for (var i = 0; i < items.length; ++i) {
      flatten(items[i], collector);
    }
  }

  var collector = [];
  flatten(value, collector);
  return collector;
}
""")
    except Exception as exc:
      print(exc)

if __name__ == "__main__":
  main()

Another problem is that I violate the principle that reduce should return the same type as map emits, but I think it is OK in my case, because my reduce and finalize deal with it.

It feels like I am on the wrong track. Any advice?

EDIT

The data looks like this:

{_id: , key1: , key2: , key3: , ts: , bla-bla-bla}

For instance:

  • 20,000 records with (key1,key2,key3) == ('yaba', 'daba', 'doo')
  • 15,000 records with (key1,key2,key3) == ('yogi', 'bear', '')
  • 700 records with (key1,key2,key3) == ('yo', 'ho', 'ho')
  • and so on

At the end of the process I need to be left with:

  • 1,000 most recent yaba-daba-doo records
  • 1,000 most recent yogi-bear records
  • all the yo-ho-ho records (because there are fewer than 1,000 of them)
  • and so on
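
To make the expected result concrete, the selection rule can be written as a small in-memory sketch. It only illustrates the rule and would not scale to millions of records. The function name `keep_most_recent` and the sample records are made up for the example; `key1`, `key2`, `key3` and `ts` follow the record layout shown above.

```python
import heapq
from collections import defaultdict


def keep_most_recent(records, key_fields=("key1", "key2", "key3"), limit=1000):
    """Return at most `limit` records per KEY, keeping the newest by `ts`."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        groups[key].append(rec)

    kept = []
    for recs in groups.values():
        # nlargest picks the `limit` records with the greatest ts, newest first
        kept.extend(heapq.nlargest(limit, recs, key=lambda r: r["ts"]))
    return kept


# Hypothetical sample: 5 'yaba' records and 2 'yo' records, limit of 3 per KEY
sample = [{"key1": "yaba", "key2": "daba", "key3": "doo", "ts": t} for t in range(5)]
sample += [{"key1": "yo", "key2": "ho", "key3": "ho", "ts": t} for t in range(2)]
result = keep_most_recent(sample, limit=3)
```

Here the 'yaba' group is cut down to its three newest records, while the 'yo' group is kept whole because it is below the limit.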
1 Answer

  1. Editorial Team
    Added an answer on June 1, 2026 at 7:53 am

    I have a bad feeling that I am missing something really important here, and I should probably sober up before I try to program, but:

    Ok so given you have a document that could potentially look like:

    {
      _id: {},
      key1: '', key2: '', key3: '', key4: '', key5: '', key6: '', key7: '', key8: '', key9: '', key10: ''
    }
    

    You want to take, say, the compound key of [key2, key3, key5, key7] and then pull out the latest 1000 repeated documents with this exact key.

    Can’t this be solved with a little hacking? I mean you are essentially getting 1k of the latest of a key??

    db.awesome_collection.find({ 'key2': '', 'key3': '', 'key5': '', 'key7': '' }).sort({ 'ts': -1 }).limit(1000);
    

    Doesn’t that do it?

    I mean, if you are smart about it, you can write a script that goes through the original collection and builds a new one from these sorts of queries, and it could run faster than a map-reduce.
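
A rough sketch of such a script follows, assuming `source` and `target` behave like pymongo `Collection` objects and that every document carries all of the key fields. The function name `copy_most_recent` and its parameters are hypothetical; it is one possible way to loop over the distinct KEYs and run the query above for each, not a tested solution for the asker's data.

```python
def copy_most_recent(source, target, key_fields, limit=1000, ts_field="ts"):
    """Copy at most `limit` newest documents per KEY from `source` into `target`.

    `source` and `target` are expected to behave like pymongo Collection objects.
    Assumes every document has all of `key_fields`.
    """
    # $group with a compound _id yields one result per distinct KEY combination.
    group_id = {f: "$" + f for f in key_fields}
    pipeline = [{"$group": {"_id": group_id}}]

    for group in source.aggregate(pipeline, allowDiskUse=True):
        key_query = dict(group["_id"])
        # Newest first (direction -1), capped at `limit` documents for this KEY.
        cursor = source.find(key_query).sort(ts_field, -1).limit(limit)
        docs = list(cursor)
        if docs:
            target.insert_many(docs)
```

With pymongo this might be called as `copy_most_recent(db.static, db.aggregated, ["key1", "key2", "key3"])`, where `db = MongoClient().mydb`. Each `find` can use the existing compound index on the KEY; an index that also ends with `ts` would likely let the sort use the index as well.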


© 2021 The Archive Base. All Rights Reserved