I have a huge python list(A) of lists. The length of list A is around 90,000. Each inner list contain around 700 tuples of (datetime.date,string). Now, I am analyzing this data. What I am doing is I am taking a window of size x in inner lists where- x = len(inner list) * (some fraction <= 1) and I am saving each ordered pair (a,b) where a occurs before b in that window (actually the innerlists are sorted wrt time). I am moving this window upto the last element adding one element at a time from one end and removing from other which takes O(window-size)time as I am considering the new tuples only. My code:
for i in xrange(window_size):
j = i+1;
while j<window_size:
check_and_update(cur, my_list[i][1], my_list[j][1],log);
j=j+1
i=1;
while i<=len(my_list)-window_size:
j=i;
k=i+window_size-1;
while j<k:
check_and_update(cur, my_list[j][1], my_list[k][1],log);
j+=1
i += 1
Here cur is actually a sqlite3 database cursor,my_list is a list containing the tuples and I iterate this code for all the lists in A and log is a opened logfile. In method check_and_update() I am looking up my database to find the tuple if exists or else I insert it, along with its total number of occurrence so far. Code:
def check_and_update(cur,start,end,log):
t = str(start)+":"+ str(end)
cur.execute("INSERT OR REPLACE INTO Extra (tuple,count)\
VALUES ( ? , coalesce((SELECT count +1 from Extra WHERE tuple = ?),1))",[t,t])
As expected this number of tuples is HUGE and I have previously experimented with dictionary which eats up the memory quite fast. So, I resorted to SQLite3, but now it is too slow. I have tried indexing but with no help. Probably the my program is spending way to much time querying and updating the database. Do you have any optimization ideas for this problem? Probably changing the algorithm or some different approach/tools. Thank you!
Edit: My goal here is to find the total number of tuples of strings that occur within the window grouped by the number of different innerlists they occur in. I extract this information with this query:
for i in range(1,size+1):
cur.execute('select * from Extra where count = ?',str(i))
#other stuff
For Example ( I am ignoring the date entries and will write them as ‘dt’):
My_list = [
[ ( dt,'user1') , (dt, 'user2'), (dt, 'user3') ]
[ ( dt,'user3') , (dt, 'user4')]
[ ( dt,'user2') , (dt, 'user3'), (dt,'user1') ]
]
here if I take fraction = 1 then, results:
only 1 occurrence in window: 5 (user 1-2,1-3,3-4,2-1,3-1)
only 2 occurrence in window: 2 (user 2-3)
Let me get this straight.
You have up to about 22 billion potential tuples (for 90000 lists, any of 700, any of the following entries, on average 350) which might be less depending on the window size. You want to find, but number of inner lists that they appear in, how many tuples there are.
Data of this size has to live on disk. The rule for data that lives on disk due to size is, “Never randomly access, instead generate and then sort.”
So I’d suggest that you write out each tuple to a log file, one tuple per line. Sort that file. Now all instances of any given tuple are in one place. Then run through the file, and for each tuple emit the count of how many times it appears in (that is how many inner lists it is in). Sort that second file. Now run through that file, and you can extract how many tuples appeared 1x, 2x, 3x, etc.
If you have multiple machines, it is easy to convert this into a MapReduce. (Which is morally the same approach, but you get to parallelize a lot of stuff.)