Problem: Given a set of ~250000 integer user IDs and about a terabyte of JSON-formatted, one-per-line records, load into a database the records whose user ID matches one in the set.
Only about 1% of all the records will match the 250000 user IDs. Rather than JSON-decoding every record, which takes a long time, I am trying to use string matching to determine whether a user ID appears in the raw JSON; if one does, the JSON is decoded, the record is checked, and then it is inserted.
The problem is that matching one string of raw JSON against a set containing ~250k string entries is slow.
Here’s the code so far:
# get the list of integer user IDs
cur.execute('select distinct user_id from users')
# load them as text into a set
users = set()
for result in cur.fetchall():
    users.add(str(result[0]))
# start working on f, the one-json-record-per-line text file
scanned = 0
for line in f:
    scanned += 1
    # substring-test every user ID against the raw line (the slow part)
    if any(user in line for user in users):
        print("got one!")
        # decode json
        # check for correct decoded user ID match
        # do insert
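For reference, the decode-and-check step left as comments above could look roughly like the sketch below. The helper name `confirm_match` and the key name `user_id` are assumptions for illustration; the real records may keep the ID under a different key or nested deeper. The check is needed because a raw substring hit can be a false positive (the digits may appear in some other field).

```python
import json

def confirm_match(line, user_ids, field="user_id"):
    """Decode one raw JSON line; return the record if its user ID is wanted.

    `field` is a hypothetical key name, not taken from the real data.
    """
    try:
        record = json.loads(line)
    except ValueError:  # malformed line (JSONDecodeError subclasses ValueError)
        return None
    if not isinstance(record, dict):
        return None
    value = record.get(field)
    # Compare as integers against the integer ID set, not as substrings.
    if isinstance(value, int) and value in user_ids:
        return record
    return None

wanted = {42, 1337}
hit = confirm_match('{"user_id": 42, "text": "hello"}', wanted)
false_positive = confirm_match('{"user_id": 7, "text": "order 42"}', wanted)
```

Here `hit` is the decoded record, while `false_positive` is `None` even though the string "42" occurs in the raw line.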
Am I approaching this the right way? What’s a faster method of matching these strings? At present, when looking for this many user IDs, it manages ~2 entries per second on a 3 GHz machine (not so good). When the list of user IDs is very short, it manages ~200000 entries per second.
Aho-Corasick appears to be built for this purpose. There’s even a handy Python module for it (easy_install ahocorasick).
With it, throughput reaches roughly 450 entries per second.
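To show what Aho-Corasick does, here is a minimal pure-Python sketch. It is not the API of the ahocorasick module mentioned above; the names `build_automaton`, `find_matches` and `contains_any` are made up for this illustration. The idea is to build one automaton from all the ID strings up front, then scan each line once, instead of running one substring search per ID per line.

```python
from collections import deque

def build_automaton(patterns):
    """Build trie transitions, failure links and output lists (hypothetical helper)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix that is a prefix
    out = [[]]    # out[state] -> patterns that end when this state is reached
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first pass: shallower states get their failure links first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_matches(automaton, text):
    """Yield every pattern occurrence found in a single pass over text."""
    goto, fail, out = automaton
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            yield pattern

def contains_any(automaton, text):
    """True as soon as any pattern occurs in text."""
    return next(find_matches(automaton, text), None) is not None

automaton = build_automaton(["123", "2345", "45"])
found = set(find_matches(automaton, '{"user_id": 12345}'))
```

In the loop above, `contains_any(automaton, line)` would take the place of the `any(user in line for user in users)` test. It is still a substring match, so a short ID can hit inside a longer number, and the decoded record still has to be checked before inserting.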