I am having trouble with the time it takes for my python script to

Question

Asked: June 12, 2026

I am having trouble with the time it takes my Python script to iterate over a data set of about 40k documents. That is large enough for the pymongo cursor to issue multiple fetches, which happen internally and are hidden from the developer. I simplified my script as much as possible to demonstrate the problem:

from pymongo import Connection  # pymongo 2.3 API
import time

def main():
    starttime = time.time()
    cursor = db.survey_answers.find()
    counter = 0
    lastsecond = -1
    for entry in cursor:
        elapsed = int(time.time() - starttime)
        # Print progress at most once per elapsed second.
        if elapsed != lastsecond:
            print("loop number: %d    seconds: %d" % (counter, elapsed))
            lastsecond = elapsed
        counter += 1
    print("%s seconds for the mongo query to get rows: %d" % (time.time() - starttime, counter))

# APPSERVER is either localhost or the mongos hostname, depending on the test
connection = Connection(APPSERVER)
db = connection.beacon

if __name__ == "__main__":
    main()
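The once-per-second progress print only shows the fetch pauses indirectly. A more direct way to see them is to time how long the iterator takes to hand over each item. The following is a minimal sketch, not part of the original script: the helper name `find_stalls`, its `threshold` parameter, and the injectable `clock` are all hypothetical and chosen for illustration.

```python
import time


def find_stalls(iterable, threshold=0.5, clock=time.perf_counter):
    """Iterate `iterable` and record where waiting for the next item was slow.

    Returns (item_count, stalls). Each stall is a tuple
    (index_of_the_item_that_arrived_late, seconds_waited_for_it).
    """
    stalls = []
    count = 0
    last = clock()  # time at which we started waiting for the next item
    for _ in iterable:
        waited = clock() - last
        if waited >= threshold:
            stalls.append((count, waited))
        count += 1
        last = clock()  # loop body done, start waiting for the next item
    return count, stalls
```

Assuming a pymongo cursor behaves like any other Python iterable, something like `find_stalls(db.survey_answers.find())` should list the document indexes at which a new batch had to be fetched, along with how long each fetch blocked.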

My setup is as follows. I have 4 separate hosts: one APPSERVER running mongos, and 3 shard hosts, each holding the primary of one replica set and secondaries of the other two.

I can run this from one of the shard servers (with the connection pointing to the APPSERVER hostname) and I get:

loop number: 0    seconds: 0
loop number: 101    seconds: 2
loop number: 7343    seconds: 5
loop number: 14666    seconds: 8
loop number: 21810    seconds: 10
loop number: 28985    seconds: 13
loop number: 36078    seconds: 15
16.0257680416 seconds for the mongo query to get rows: 41541

So it’s obvious what’s going on here: the first batch of a cursor request holds about 100 documents (101 here), and each subsequent batch holds 4 MB worth of data, which appears to be just over 7k documents for me. And each fetch costs 2-3 seconds!
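The batch sizes can be read off the counter values in the first test run's output, since the loop only falls behind the clock while it waits for a fetch. The sketch below does that subtraction. The average document size it derives rests on two assumptions of mine: that "4m" means 4 * 1024 * 1024 bytes, and that every full batch is filled to that limit.

```python
# Counter values printed in the first test run, plus the final row count.
boundaries = [0, 101, 7343, 14666, 21810, 28985, 36078, 41541]

# Documents consumed between consecutive pauses.
batch_sizes = [b - a for a, b in zip(boundaries, boundaries[1:])]
print(batch_sizes)  # [101, 7242, 7323, 7144, 7175, 7093, 5463]

# Skip the small first batch and the partial last one.
full_batches = batch_sizes[1:-1]
avg_docs_per_batch = sum(full_batches) / len(full_batches)

# Assumption: a full batch carries 4 * 1024 * 1024 bytes.
ASSUMED_BATCH_BYTES = 4 * 1024 * 1024
avg_doc_bytes = ASSUMED_BATCH_BYTES / avg_docs_per_batch
print("average full batch: %.1f documents" % avg_docs_per_batch)
print("implied average document size: %.0f bytes" % avg_doc_bytes)
```

Under those assumptions the documents average a little under 600 bytes each, so the data set is small and the pauses are unlikely to be explained by sheer volume.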

I thought I could fix this by moving my application closer to the mongos instance. I ran the above code on APPSERVER (with the connection pointing to localhost), hoping to cut down on network traffic, but it was worse!

loop number: 0    seconds: 0
loop number: 101    seconds: 9
loop number: 7343    seconds: 19
loop number: 14666    seconds: 28
loop number: 21810    seconds: 38
loop number: 28985    seconds: 47
loop number: 36078    seconds: 53
53.5974030495 seconds for the mongo query to get rows: 41541

The batch sizes are exactly the same in both tests, which is nice, but each fetch costs 9-10 seconds here!

I know I have four separate hosts that need to communicate, so this can’t be instant. But I will need to iterate over collections of maybe 10 million records. At 2 seconds per 7k documents, that would take just shy of an hour! I can’t have this!
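The "just shy of an hour" figure can be checked with a quick calculation. This is only a back-of-the-envelope sketch: the function name `estimate_seconds` is hypothetical, and it assumes every batch holds about 7,000 documents and costs the same fixed time.

```python
import math


def estimate_seconds(total_docs, docs_per_batch, seconds_per_batch):
    """Estimate total iteration time if every batch costs a fixed fetch time."""
    batches = math.ceil(total_docs / docs_per_batch)
    return batches * seconds_per_batch


TOTAL_DOCS = 10000000  # the 10 million records mentioned above

for seconds_per_batch in (2, 3):
    total = estimate_seconds(TOTAL_DOCS, 7000, seconds_per_batch)
    print("%d s per batch: %d s (%.1f minutes)" % (seconds_per_batch, total, total / 60.0))
```

At 2 seconds per batch this gives 2858 seconds, or roughly 48 minutes. At 3 seconds per batch it is a little over 71 minutes.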

By the way, I’m new to the Python/MongoDB world. I’m used to PHP and MySQL, where I would expect this to process in a fraction of a second:

$q = mysql_query("select * from big_table"); // let's say 10m rows here
$c = 0;
// mysql_fetch_row() returns false once there are no more rows
while ($r = mysql_fetch_row($q))
    $c++;
echo $c." rows examined";

Can somebody explain the gargantuan difference between the pymongo (~1 hour) and PHP/MySQL (<1 sec) approaches I’ve presented? Thanks!

1 Answer

Editorial Team · Answer 1 · 2026-06-12T21:28:37+00:00

I was able to figure this out with the help of A. Jesse Jiryu Davis. It turns out I didn’t have the pymongo C extensions installed. To rule out network latency, I ran another test without the shards: I got a fresh, clean host, set up mongo, imported my data, and ran my script, and it took the same amount of time. So I know the sharding and replica sets had nothing to do with the problem.

Before the fix, I was able to print:

pymongo.has_c(): False
pymongo version 2.3
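The answer does not show the code that printed these two lines. One way to produce a similar report is sketched below, using `pymongo.has_c()`, `bson.has_c()`, and `pymongo.version`. The wrapper function `c_extension_status` is a hypothetical name of mine, and returning `None` when pymongo is missing is just a convenience of this sketch.

```python
def c_extension_status():
    """Report whether the pymongo and bson C extensions are importable.

    Returns None if pymongo itself is not installed.
    """
    try:
        import bson
        import pymongo
    except ImportError:
        return None
    return {
        "pymongo_version": pymongo.version,
        "pymongo_has_c": pymongo.has_c(),
        "bson_has_c": bson.has_c(),
    }


status = c_extension_status()
if status is None:
    print("pymongo is not installed")
else:
    print("pymongo.has_c(): %s" % status["pymongo_has_c"])
    print("bson.has_c(): %s" % status["bson_has_c"])
    print("pymongo version %s" % status["pymongo_version"])
```

If either `has_c()` call returns `False`, the driver is running on its pure-Python fallback, which is the situation described here.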

I then followed the instructions to install the dependencies for the C extensions:

yum install gcc python-devel

Then I reinstalled the pymongo driver:

git clone https://github.com/mongodb/mongo-python-driver.git pymongo
cd pymongo/
python setup.py install

I reran my script and it now prints:

pymongo.has_c(): True
pymongo version 2.3+

And it takes about 1.8 seconds to run, as opposed to the 16 seconds above. That still seems long for fetching 40k records and iterating over them, but it’s a significant improvement.
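To put the improvement in numbers, the two timings can be converted to throughput. This sketch uses only the figures reported in this thread (41541 documents, 16.0257680416 seconds before, about 1.8 seconds after). The helper name `docs_per_second` is hypothetical, and because the 1.8 second figure is rounded, the results are approximate.

```python
DOCS = 41541  # documents iterated in each run


def docs_per_second(seconds, docs=DOCS):
    """Average iteration throughput for one run."""
    return docs / seconds


before = docs_per_second(16.0257680416)  # without the C extensions
after = docs_per_second(1.8)             # with the C extensions
speedup = after / before

print("before: %.0f docs/s" % before)
print("after:  %.0f docs/s" % after)
print("speedup: %.1fx" % speedup)
```

That works out to roughly 2,600 documents per second before the fix and about 23,000 after, close to a ninefold speedup.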

I will now apply these updates to my production (sharded, replica set) environment and hopefully see the same results.

**UPDATE**
I updated the pymongo driver in my production environment and there was an improvement, though not as large: it took about 2.5-3.5 seconds over a few tests. I presume the sharding is responsible for the difference. That still seems incredibly slow for iterating over 40k records.
