Editorial Team
Asked: May 12, 2026

I asked another question:
https://stackoverflow.com/questions/1180240/best-way-to-sort-1m-records-in-python
where I was trying to determine the best approach for sorting 1 million records. In my case I need to be able to add additional items to the collection and have them resorted. It was suggested that I try using Zope’s BTrees for this task. After doing some reading I am a little stumped as to what data I would put in a set.

Basically, for each record I have two pieces of data. 1. A unique ID which maps to a user and 2. a value of interest for sorting on.

I see that I can add the items to an OOSet as tuples, where the value for sorting on is at index 0. So, (200, 'id1'),(120, 'id2'),(400, 'id3') and the resulting set would be sorted with id2, id1 and id3 in order.

However, part of the requirement for this is that each id appear only once in the set. I will be adding additional data to the set periodically and the new data may or may not include duplicated ‘ids’. If they are duplicated I want to update the value and not add an additional entry. So, based on the tuples above, I might add (405, 'id1'),(10, 'id4') to the set and would want the output to have id4, id2, id3, id1 in order.

Any suggestions on how to accomplish this? Sorry for my newness to the subject.

* EDIT – additional info *

Here is some actual code from the project:

for field in lb_fields:
    t = time.time()
    self.data[field] = [ (v[field], k) for k, v in self.foreign_keys.iteritems() ]
    self.data[field].sort(reverse=True)
    print "Added %s: %03.5f seconds" %(field, (time.time() - t))

foreign_keys is the original data in a dictionary with each id as the key and a dictionary of the additional data as the value. data is a dictionary containing the lists of sorted data.

As a side note, with each iteration of the for field in lb_fields loop, the time to sort increases, not by much, but noticeably. After 1 million records have been sorted for each of the 16 fields it is using about 4 GB of RAM. Eventually this will run on a machine with 48 GB.

1 Answer

  1. Editorial Team
    Added an answer on May 12, 2026 at 6:04 am

    I don’t think BTrees or other traditional sorted data structures (red-black trees, etc) will help you, because they keep order by key, not by corresponding value — in other words, the field they guarantee as unique is the same one they order by. Your requirements are different, because you want uniqueness along one field, but sortedness by the other.
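    Since uniqueness is wanted on the id and ordering on the value, one option is to keep the ids as dict keys and sort by value whenever an ordering is needed. The following is a minimal Python 3 sketch of that idea, using the sample pairs from the question. `ranked_ids` is a hypothetical helper name, and breaking ties by id is an assumption the question does not state.

```python
# Minimal sketch: dict keys give one entry per id, sorting gives the order.
# `ranked_ids` is a hypothetical helper, not part of any library.

def ranked_ids(scores):
    """Return ids ordered by ascending value (ties broken by id, an assumption)."""
    ordered = sorted(scores.items(), key=lambda kv: (kv[1], kv[0]))
    return [uid for uid, _ in ordered]

scores = {}  # id -> value

# Initial load; the question writes its tuples as (value, id).
for value, uid in [(200, 'id1'), (120, 'id2'), (400, 'id3')]:
    scores[uid] = value
print(ranked_ids(scores))  # ['id2', 'id1', 'id3']

# Later batch: 'id1' is overwritten in place, 'id4' is new.
for value, uid in [(405, 'id1'), (10, 'id4')]:
    scores[uid] = value
print(ranked_ids(scores))  # ['id4', 'id2', 'id3', 'id1']
```

    This re-sorts everything on each call, so it shows the shape of the approach rather than a tuned implementation.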

    What are your performance requirements? With a rather simple pure Python implementation based on Python dicts for uniqueness and Python sorts, on a not-blazingly-fast laptop, I get 5 seconds for the original construction (essentially a sort over the million elements, starting with them as a dict), and about 9 seconds for the “update” with 20,000 new id/value pairs, of which half “overlap” (and thus overwrite) an existing id and half are new. I can implement the update in a faster way, about 6.5 seconds, but that implementation has an anomaly: if one of the “new” pairs is exactly identical to one of the “old” ones, in both id and value, it’s duplicated. Warding against such “duplication of identicals” is what pushes me from 6.5 seconds to 9, and I imagine you would need the same kind of precaution.
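    The duplication anomaly is easier to see on a tiny input. The Python 3 sketch below is only a guess at what the faster variant might look like, since the answer does not show it. `update_fast` and `update_safe` are hypothetical names, and the sample pairs are made up for illustration.

```python
from operator import itemgetter

def update_fast(pairs, new_pairs):
    """Assumed shape of the quicker variant: append, sort, drop stale entries."""
    latest = dict(new_pairs)
    merged = sorted(pairs + new_pairs, key=itemgetter(1))
    # Keep a pair only if its value matches the newest value for that id.
    return [(k, v) for k, v in merged if v == latest.get(k, v)]

def update_safe(pairs, new_pairs):
    """Same idea, but first skip new pairs that already exist verbatim."""
    old = dict(pairs)
    latest = dict(new_pairs)
    fresh = [(k, v) for k, v in new_pairs if v != old.get(k)]
    merged = sorted(pairs + fresh, key=itemgetter(1))
    return [(k, v) for k, v in merged if v == latest.get(k, v)]

old_pairs = [('id2', 120), ('id1', 200), ('id3', 400)]
new_pairs = [('id1', 200), ('id4', 10)]  # ('id1', 200) repeats an old pair exactly

print(update_fast(old_pairs, new_pairs))
# [('id4', 10), ('id2', 120), ('id1', 200), ('id1', 200), ('id3', 400)]
print(update_safe(old_pairs, new_pairs))
# [('id4', 10), ('id2', 120), ('id1', 200), ('id3', 400)]
```

    In the assumed fast variant the stale-entry filter cannot tell the old copy of an identical pair from the new one, so both survive. The safe variant builds a dict of the old pairs to filter those repeats out before sorting, and that extra pass over the data would account for at least part of the added cost.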

    How far are these 5- and 9-second times from your requirements (taking into account the actual speed of the machine you’ll be running on vs. the laptop I’m using, a 2.4 GHz Core Duo with 2GB of RAM and the typical laptop performance issues)? IOW, is it close enough to “striking distance” to be worth tinkering and trying to squeeze a last few cycles out of, or do you need orders of magnitude faster performance?

    I’ve tried several other approaches (with a SQL DB, with C++ and its std::sort &c, …) but they’re all slower, so if you need much higher performance I’m not sure what you could do.

    Edit: since the OP says this performance would be fine but he can’t achieve anywhere near it, I guess I’d best show the script I used to measure these times…:

    import gc
    import operator
    import random
    import time
    
    
    nk = 1000
    
    def popcon(d):
      for x in xrange(nk*1000):
        d['id%s' % x] = random.randrange(100*1000)
    
    def sorted_container():
      ctr = dict()
      popcon(ctr)
      start = time.time()
      ctr_sorted = ctr.items()
      ctr_sorted.sort(key=operator.itemgetter(1))
      stend = time.time()
      return stend-start, ctr_sorted
    
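    def do_update(ctr, newones):
      start = time.time()
      # Map each existing id to its current value.
      dicol = dict(ctr)
      # Append only pairs that are new or changed; exact repeats are skipped.
      ctr.extend((k,v) for (k,v) in newones if v!=dicol.get(k,None))
      dicnu = dict(newones)
      ctr.sort(key=operator.itemgetter(1))
      # Drop stale pairs whose id now carries a different value.
      newctr = [(k,v) for (k,v) in ctr if v==dicnu.get(k,v)]
      stend = time.time()
      return stend-start, newctr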
    
    def main():
      random.seed(12345)
      for x in range(3):
        duration, ctr = sorted_container()
        print 'dict-to-sorted, %d: %.2f sec, len=%d' % (x, duration, len(ctr))
        newones = [('id%s' % y, random.randrange(nk*100))
                    for y in xrange(nk*990,nk*1010)]
        duration, ctr = do_update(ctr, newones)
        print 'updt-to-sorted, %d: %.2f sec, len=%d' % (x, duration, len(ctr))
        del ctr
        gc.collect()
    
    main()
    

    and this is a typical run:

    $ time python som.py
    dict-to-sorted, 0: 5.01 sec, len=1000000
    updt-to-sorted, 0: 9.78 sec, len=1010000
    dict-to-sorted, 1: 5.02 sec, len=1000000
    updt-to-sorted, 1: 9.12 sec, len=1010000
    dict-to-sorted, 2: 5.03 sec, len=1000000
    updt-to-sorted, 2: 9.12 sec, len=1010000
    
    real    0m54.073s
    user    0m52.464s
    sys     0m1.258s
    

    the overall elapsed time being a few seconds more than the totals I’m measuring, obviously, because it includes the time needed to populate the container with random numbers, generate the “new data” also randomly, destroy and garbage-collect things at the end of each run, and so forth.

    This is with the system-supplied Python 2.5.2 on a MacBook with Mac OS X 10.5.7, 2.4 GHz Intel Core Duo, and 2GB of RAM (times don’t change much when I use different versions of Python).
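    The script above targets Python 2 (`print` statement, `xrange`, and `dict.items()` returning a list). For readers on Python 3, a port might look like the sketch below. The names `populate` and `run`, the `n` and `overlap` parameters, and the reduced default size are assumptions made for a quick check, so its timings are not comparable to the figures quoted above.

```python
import operator
import random
import time

def populate(n):
    """Build a dict of n ids mapped to random values."""
    return {'id%s' % x: random.randrange(100 * 1000) for x in range(n)}

def sorted_container(n):
    ctr = populate(n)
    start = time.perf_counter()
    # items() is a view in Python 3, so sorted() builds the list directly.
    ctr_sorted = sorted(ctr.items(), key=operator.itemgetter(1))
    return time.perf_counter() - start, ctr_sorted

def do_update(ctr, newones):
    start = time.perf_counter()
    dicol = dict(ctr)
    # Append only pairs that are new or changed.
    ctr.extend((k, v) for k, v in newones if v != dicol.get(k))
    dicnu = dict(newones)
    ctr.sort(key=operator.itemgetter(1))
    # Drop stale pairs whose id now carries a different value.
    newctr = [(k, v) for k, v in ctr if v == dicnu.get(k, v)]
    return time.perf_counter() - start, newctr

def run(n=100_000, overlap=1_000):
    """Sort n records, then apply 2*overlap updates, half of them on existing ids."""
    random.seed(12345)
    duration, ctr = sorted_container(n)
    print('dict-to-sorted: %.2f sec, len=%d' % (duration, len(ctr)))
    newones = [('id%s' % y, random.randrange(100 * 1000))
               for y in range(n - overlap, n + overlap)]
    duration, ctr = do_update(ctr, newones)
    print('updt-to-sorted: %.2f sec, len=%d' % (duration, len(ctr)))
    return ctr

result = run()
```

    Calling `run(1_000_000, 10_000)` would match the record counts used in the original measurements.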
