Problem: Given a set of ~250000 integer user IDs, and about a terabyte of

Asked: 2026-06-14T13:50:41+00:00

Problem: Given a set of ~250000 integer user IDs, and about a terabyte of JSON-formatted one-per-line records, load the records whose user ID matches into a database.

Only about 1% of all the records will match the 250000 user IDs. Rather than JSON-decoding every record, which takes a long time, I am trying to use string matching to determine whether a user ID appears in the raw JSON; if it does, the JSON is decoded, the record is checked, and then it is inserted.

The problem is that matching one string of raw JSON against a set containing ~250k string entries is slow.

Here’s the code so far:

# get the list of integer user IDs
cur.execute('select distinct user_id from users')

# load them as text into a set
users = set()
for result in cur.fetchall():
    users.add(str(result[0]))

# start working on f, the one-json-record-per-line text file
scanned = 0
for line in f:
    scanned += 1
    if any(user in line for user in users):
        print("got one!")
        # decode json
        # check for correct decoded user ID match
        # do insert
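
The decode-and-check step left as comments above might look something like the following minimal sketch. It assumes each record is a JSON object with a top-level integer field named `user_id`; that field name and the helper `confirm_match` are hypothetical, since the record layout isn't shown here. It also illustrates why the check is needed: a substring hit can be a false positive.

```python
import json

def confirm_match(line, user_ids):
    """Decode one JSON line; return the record if its user ID is wanted, else None.

    Assumes a top-level integer field called "user_id" (hypothetical name).
    """
    try:
        record = json.loads(line)
    except ValueError:
        # Malformed line: treat it as a non-match rather than aborting the scan.
        return None
    if not isinstance(record, dict):
        return None
    if record.get("user_id") in user_ids:
        return record
    return None

# user_ids holds integers here, unlike the string set used for substring matching.
user_ids = {123, 456}

hit = '{"user_id": 123, "text": "hello"}'
false_positive = '{"user_id": 91234, "text": "hello"}'

# The substring prefilter fires on "123" inside "91234"...
assert "123" in false_positive
# ...but the decoded check rejects that record and accepts the real hit.
assert confirm_match(false_positive, user_ids) is None
assert confirm_match(hit, user_ids) is not None
```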

Am I approaching this the right way? What’s a faster method of matching these strings? At present, when looking for this many user IDs, it manages ~2 entries a second on a 3 GHz machine (not so good). When the list of user IDs is very short, it manages ~200000 entries/second.

1 Answer

Editorial Team · Answer 1 · 2026-06-14T13:50:42+00:00

The Aho-Corasick algorithm, which searches a text for many keywords in a single pass, appears to be built for this purpose. There’s even a handy Python module for it (easy_install ahocorasick).

import ahocorasick

# build a match structure
print('init empty tree')
tree = ahocorasick.KeywordTree()

cur.execute('select distinct user_id from users')

print('add usernames to tree')
for result in cur.fetchall():
    tree.add(str(result[0]))

print('build fsa')
tree.make()

scanned = 0
for line in f:
    scanned += 1
    if tree.search(line) is not None:
        print("got one!")

This reaches roughly 450 entries per second.
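
A different technique from Aho-Corasick, offered only as an alternative that has not been benchmarked here: since the IDs are integers, one could pull every run of digits out of the raw line with a regular expression and look each run up in the set, so the work per line depends on the line's length rather than on the number of IDs. This is a minimal sketch under stated assumptions: `line_may_match` is a hypothetical helper, and IDs are assumed to appear in the JSON as plain decimal digits with no leading zeros or separators.

```python
import re

# Runs of decimal digits; any one of them could be a user ID.
DIGIT_RUN = re.compile(r"\d+")

def line_may_match(line, users):
    """Return True if any whole digit run in the line is a known user ID string.

    `users` is the set of ID strings built from the database, as in the question.
    """
    return any(token in users for token in DIGIT_RUN.findall(line))

users = {"123", "456"}

print(line_may_match('{"user_id": 123, "text": "hi"}', users))    # True
print(line_may_match('{"user_id": 91234, "text": "hi"}', users))  # False
```

Because whole digit runs are compared, 123 no longer matches inside 91234, but digits in other fields can still trigger a hit, so the decoded record would still need to be checked before the insert.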
