The Archive Base

Editorial Team
Asked: June 12, 2026

I am having trouble with the time it takes for my python script to

I am having trouble with the time it takes my Python script to iterate over a data set of about 40k documents. That is large enough for the pymongo cursor to issue multiple fetches, which happen internally and are abstracted away from the developer. I simplified my script as much as possible to demonstrate the problem:

from pymongo import Connection  # pymongo 2.x API, Python 2 syntax
import time

APPSERVER = "localhost"  # either localhost or the mongos hostname, depending on the test

def main():
    starttime = time.time()
    cursor = db.survey_answers.find()
    counter = 0
    lastsecond = -1
    for entry in cursor:
        elapsed = int(time.time() - starttime)
        # Print progress only when the elapsed whole second changes
        if elapsed != lastsecond:
            print "loop number:", counter, "   seconds:", elapsed
            lastsecond = elapsed
        counter += 1
    print (time.time() - starttime), "seconds for the mongo query to get rows:", counter

connection = Connection(APPSERVER)
db = connection.beacon

if __name__ == "__main__":
    main()

My setup is as follows: four separate hosts, one APPSERVER running mongos and three shard hosts, each of which is the primary of one replica set and a secondary for the other two.

I can run this from one of the shard servers (with the connection pointing to the APPSERVER hostname) and I get:

loop number: 0    seconds: 0
loop number: 101    seconds: 2
loop number: 7343    seconds: 5
loop number: 14666    seconds: 8
loop number: 21810    seconds: 10
loop number: 28985    seconds: 13
loop number: 36078    seconds: 15
16.0257680416 seconds for the mongo query to get rows: 41541

So it’s obvious what’s going on here: the first batch of a cursor request is about 100 documents (101 in this output), and each subsequent batch is 4 MB of data, which works out to just over 7k documents for me. And each fetch costs 2-3 seconds!

I thought I could fix this by moving my application closer to the mongos instance. I ran the same code on APPSERVER (with the connection pointing to localhost), hoping to reduce network usage, but it was worse!

loop number: 0    seconds: 0
loop number: 101    seconds: 9
loop number: 7343    seconds: 19
loop number: 14666    seconds: 28
loop number: 21810    seconds: 38
loop number: 28985    seconds: 47
loop number: 36078    seconds: 53
53.5974030495 seconds for the mongo query to get rows: 41541

The batch sizes are exactly the same in both tests, which is nice, but each fetch costs 9-10 seconds here!

I know I have four separate hosts that need to communicate, so this can’t be instant. But I will need to iterate over collections of maybe 10 million records. At 2 seconds per 7k documents, that would take just shy of an hour! I can’t have this!
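
The "just shy of an hour" figure can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes the cost scales linearly with the number of batches, and the constant and function names are made up for illustration.

```python
# Rough estimate built from the figures quoted in the question.
TOTAL_DOCS = 10000000      # "maybe 10 million records"
DOCS_PER_BATCH = 7000      # "just over 7k documents" per 4 MB batch
SECONDS_PER_BATCH = 2.0    # lower end of the observed 2-3 seconds


def estimated_minutes(total_docs, docs_per_batch, seconds_per_batch):
    """Return the estimated iteration time in minutes (linear scaling assumed)."""
    batches = float(total_docs) / docs_per_batch
    return batches * seconds_per_batch / 60.0


minutes = estimated_minutes(TOTAL_DOCS, DOCS_PER_BATCH, SECONDS_PER_BATCH)
print("estimated minutes: %.1f" % minutes)  # prints "estimated minutes: 47.6"
```

At the upper end of 3 seconds per batch, the same calculation gives a bit over 70 minutes.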

Btw, I’m new to the Python/MongoDB world. I’m used to PHP and MySQL, where I would expect this to process in a fraction of a second:

$q = mysql_query("select * from big_table"); // let's say 10m rows here ....
$c = 0;
while ($r = mysql_fetch_row($q))
    $c++;
echo $c . " rows examined";

Can somebody explain the gargantuan difference between the pymongo (~1 hour) and php/mysql (<1 sec) approaches I’ve presented? Thanks!

1 Answer

  1. Editorial Team
    Added an answer on June 12, 2026 at 9:28 pm

    I was able to figure this out with the help of A. Jesse Jiryu Davis. It turns out I didn’t have the pymongo C extensions installed. To rule out network latency, I ran another test without the shards: I set up MongoDB on a fresh, clean host, imported my data, and ran my script. It took the same amount of time, so I know the sharding and replica sets didn’t have anything to do with the problem.

    Before the fix, I was able to print:

    pymongo.has_c(): False
    pymongo version 2.3
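
The answer does not show how these two lines were printed. A minimal sketch that produces a similar report is below; `describe_pymongo` is a hypothetical helper name, while `pymongo.has_c()` and `pymongo.version` are part of the driver.

```python
def describe_pymongo():
    """Return a short report on the installed pymongo driver."""
    try:
        import pymongo
    except ImportError:
        return "pymongo is not installed"
    # has_c() reports whether the driver's C extensions are in use.
    return "pymongo.has_c(): %s\npymongo version %s" % (
        pymongo.has_c(), pymongo.version)


print(describe_pymongo())
```

If the first line reports `False`, the driver is running in pure-Python mode, which is the situation described here.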
    

    I then followed the instructions to install the dependencies for the C extensions:

    yum install gcc python-devel
    

    Then I reinstalled the pymongo driver:

    git clone git://github.com/mongodb/mongo-python-driver.git pymongo
    cd pymongo/
    python setup.py install
    

    I reran my script and it now prints:

    pymongo.has_c(): True
    pymongo version 2.3+
    

    And it takes about 1.8 seconds to run as opposed to the 16 above. That still seems long to fetch 40k records and iterate over them, but it’s a significant improvement.
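
To put the improvement in terms of the original 10-million-record concern, here is a rough calculation from the timings reported in this thread. It is a sketch only: it assumes throughput stays constant as the collection grows, which was not measured, and the names are illustrative.

```python
DOCS = 41541  # documents iterated in each test run reported in this thread


def docs_per_second(docs, seconds):
    """Return the average iteration throughput."""
    return docs / float(seconds)


before = docs_per_second(DOCS, 16.0)  # without the C extensions ("the 16 above")
after = docs_per_second(DOCS, 1.8)    # with the C extensions
speedup = after / before

# Extrapolation to 10 million records, assuming linear scaling (an assumption).
minutes_for_10m = 10000000 / after / 60.0

print("before: %.0f docs/s, after: %.0f docs/s" % (before, after))
print("speedup: %.1fx, 10M records: about %.1f minutes" % (speedup, minutes_for_10m))
```

Under that assumption, the post-fix rate works out to roughly 7 minutes for 10 million records, compared with the hour estimated in the question.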

    I will now run these updates on my prod (sharded, replica set) environment to hopefully see the same results.

    **UPDATE**
    I updated the pymongo driver on my prod environment and there was an improvement, though not as large: it took about 2.5-3.5 seconds over a few tests. I presume the sharding accounts for the difference. That still seems incredibly slow to iterate over 40k records.
