I am having trouble with the time it takes for my python script to iterate a data set. The data set is about 40k documents. This is large enough to cause the pymongo cursor to issue multiple fetches which are internal and abstracted away from the developer. I simplified my script down as much as possible to demonstrate the problem:
from pymongo import Connection
import time
def main():
starttime = time.time()
cursor = db.survey_answers.find()
counter=0;
lastsecond=-1;
for entry in cursor:
if int(time.time()-starttime)!=lastsecond:
print "loop number:", counter, " seconds:",int(time.time()-starttime);
lastsecond= int(time.time()-starttime)
counter+=1;
print (time.time()-starttime), "seconds for the mongo query to get rows:",counter;
connection = Connection(APPSERVER)#either localhost or hostname depending on test
db = connection.beacon
if __name__ == "__main__":
main()
My set up is as follows. I have 4 separate hosts, one APPSERVER running mongos, and 3 other shard hosts with each being a primary replica set and secondary replica sets of the other two.
I can run this from one of the shard servers (with the connection pointing to the APPSERVER hostname) and I get:
loop number: 0 seconds: 0
loop number: 101 seconds: 2
loop number: 7343 seconds: 5
loop number: 14666 seconds: 8
loop number: 21810 seconds: 10
loop number: 28985 seconds: 13
loop number: 36078 seconds: 15
16.0257680416 seconds for the mongo query to get rows: 41541
So it’s obvious what’s going on here, the first batchsize of a cursor request is 100, and then each subsequent one is 4m worth of data which appears to be just over 7k documents for me. And each fetch costs 2-3 seconds!!!!
I thought I could fix this problem by moving my application closer to the mongos instance. I ran the above code on APPSERVER (with the connection pointing to localhost) hoping to decrease the network usage …. but it was worse!
loop number: 0 seconds: 0
loop number: 101 seconds: 9
loop number: 7343 seconds: 19
loop number: 14666 seconds: 28
loop number: 21810 seconds: 38
loop number: 28985 seconds: 47
loop number: 36078 seconds: 53
53.5974030495 seconds for the mongo query to get rows: 41541
The cursor sizes are exactly the same in both test, which is nice, but each cursor fetch costs 9-10 seconds here!!!
I know I have four separate hosts that need to communicate, so this can’t be instant. But I will need to iterate over collections of maybe 10m records. At 2 seconds per 7k, that would take just shy of an hour! I can’t have this!
Btw, I’m new to the python/mongo world, I’m used to php and mysql where I would expect this to process in a fraction of a second:
$q=mysql_query("select * from big_table");//let's say 10m rows here ....
$c=0;
while($r=mysql_fetch_rows($q))
$c++;
echo $c." rows examined";
Can somebody explain the gargantuan difference between the pymongo (~1 hour) and php/mysql (<1 sec) approaches I’ve presented? Thanks!
I was able to figure this out with the help of A. Jesse Jiryu Davis. It turns out I didn’t have C extension installed. I wanted to run another test without the shards so I could rule out the network latency as an issue. I got a fresh clean host, set up mongo, imported my data, and ran my script and it took the same amount of time. So I know the sharding/replica sets didn’t have anything to do with the problem.
Before the fix, I was able to print:
I then followed the instructions to install the dependencies for c extensions:
Then I reinstalled the pymongo driver:
I reran my script and it now prints:
And it takes about 1.8 seconds to run as opposed to the 16 above. That still seems long to fetch 40k records and iterate over them, but it’s a significant improvement.
I will now run these updates on my prod (sharded, replica set) environment to hopefully see the same results.
**UPDATE**
I updated my pymongo driver on my prod environment and there was an improvement, though not as much. It took about 2.5-3.5 seconds over a few tests. I presume the sharding nature was the fault here. That still seems incredibly slow to iterate over 40k records.