I am trying to process over 1.3 million files using ssdeep fuzzy hashes (SSDEEP): http://code.google.com/p/pyssdeep
It generates the hashes (all 1.3 million within 3-6 minutes) and then compares them against each other to get similarity scores. A single comparison is very fast, but a single process would never finish the whole job, so we brought in Python's multiprocessing module.
The result: 1.3 million text files processed within 30 minutes using 18 cores (quad Xeon processors, 24 CPUs in total).
Here is how each process works:
- Generate the SSDEEP sums.
- Split the list of sums into chunks of 5000.
- Compare each chunk, 1 vs 5000, across 18 processes: 18 sums are compared per iteration.
- Group the results based on similarity score (the default threshold is 75).
- Remove the files that have already been checked before the next iteration.
- Start the next group with the next file that scored < 75%.
- Repeat until all groups are done.
- Files that are not similar to any other file are added to the remaining list.
When all processes are done, the remaining files are combined and compared against each other recursively until no result is left.
The problem is that when the list of files is split into smaller chunks (5000 files each), similar files can land in different chunks: a file that was grouped within the first chunk of 5000 is never added to the matching group in another chunk, so the groups end up incomplete.
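A toy illustration of that effect, assuming each chunk is compared only against itself on the first pass. The file IDs and the chunk size of 3 are made up for the example, and `chunked` is a hypothetical helper, not the `groupSafeLimit` function from the code below:

```python
from itertools import combinations

def chunked(items, size):
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def pairs_within_chunks(items, size):
    """Collect every pair that gets compared when chunks are processed independently."""
    seen = set()
    for chunk in chunked(items, size):
        seen.update(combinations(chunk, 2))
    return seen

# Hypothetical file IDs: a1/a2 are similar to each other, as are b1/b2 and c1/c2.
files = ["a1", "b1", "c1", "a2", "b2", "c2"]

all_pairs = set(combinations(files, 2))
compared = pairs_within_chunks(files, 3)
missed = all_pairs - compared   # pairs that no chunk ever looks at

print(len(all_pairs), len(compared), len(missed))
```

Here the pair ("a1", "a2") is in `missed`: the two files sit in different chunks, so the first pass cannot put them in the same group.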
If I run without chunking, a single loop takes a very long time to complete: it ran for over 18 hours without finishing, and I do not know how long it would need.
Please advise.
Modules used: multiprocessing.Pool, ssdeep (Python)
def ssdpComparer(lst, threshold):
    s = ssdeep()
    check_file = []
    result_data = []
    lst1 = lst
    set_lst = set(lst)
    print '>>>START'
    for tup1 in lst1:
        if tup1 in check_file:
            continue
        for tup2 in set_lst:
            score = s.compare(tup1[0], tup2[0])
            if score >= threshold:
                result_data.append((score, tup1[2], tup2[2])) #Score, GroupID, FileID
                check_file.append(tup2)
        set_lst = set_lst.difference(check_file)
    print """####### DONE #######"""
    remain_lst = set(lst).difference(check_file)
    return (result_data, remain_lst)
def parallelProcessing(tochunk_list, total_processes, threshold, source_path, mode, REMAINING_LEN = 0):
    result = []
    remainining = []
    pooled_lst = []
    pair = []
    chunks_toprocess = []
    print 'Total Files:', len(tochunk_list)
    if mode == MODE_INTENSIVE:
        chunks_toprocess = groupWithBlockID(tochunk_list) #blockID chunks
    elif mode == MODE_THOROUGH:
        chunks_toprocess = groupSafeLimit(tochunk_list, TOTAL_PROCESSES) #Chunks by processes
    elif mode == MODE_FAST:
        chunks_toprocess = groupSafeLimit(tochunk_list) #5000 chunks
    print 'No. of files group to process: %d' % (len(chunks_toprocess))
    pool_obj = Pool(processes = total_processes, initializer = poolInitializer, initargs = [None, threshold, source_path, mode])
    pooled_lst = pool_obj.map(matchingProcess, chunks_toprocess) #chunks_toprocess
    tmp_rs, tmp_rm = getResultAndRemainingLists(pooled_lst)
    result += tmp_rs
    remainining += tmp_rm
    print 'RESULT LEN: %s, REMAINING LEN: %s, P.R.L: %s' % (len(result), len(remainining), REMAINING_LEN)
    tmp_r_len = len(remainining)
    if tmp_r_len != REMAINING_LEN and len(result) > 0 :
        result += parallelProcessing(remainining, total_processes, threshold, source_path, mode, tmp_r_len)
    else:
        result += [('','', rf[2]) for rf in remainining]
    return result
def getResultAndRemainingLists(pooled_lst):
    g_result = []
    g_remaining = []
    for tup_result in pooled_lst:
        tmp_result, tmp_remaining = tup_result
        g_result += tmp_result
        if tmp_remaining:
            g_remaining += tmp_remaining
    return (g_result, g_remaining)
First piece of advice: in your case there is no need for check_file to be a list, so change it to a set(); that alone should help (explanation at the end).
If you need chunks, maybe a procedure like this is enough:
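A minimal sketch of one such procedure, under assumptions of my own: the function name `group_chunk` is hypothetical, and `compare(a, b)` is a stand-in for `s.compare` that is assumed to return a 0-100 similarity score. It is passed in as a parameter so the sketch does not depend on any particular ssdeep binding. Each pass takes the first unprocessed element as the base and pulls everything similar to it into one group:

```python
def group_chunk(chunk, threshold, compare):
    """Greedily split one chunk into groups of similar items.

    Returns a list of lists; a list of length 1 is an item that matched nothing.
    """
    result = []
    remaining = list(chunk)
    while remaining:
        base = remaining[0]      # the first element acts as the group's base
        group = [base]
        rest = []
        for item in remaining[1:]:
            if compare(base, item) >= threshold:
                group.append(item)
            else:
                rest.append(item)
        result.append(group)
        remaining = rest         # only unmatched items are looked at again
    return result

# Toy stand-in for ssdeep: items are "similar" when they share a first letter.
def toy_compare(a, b):
    return 100 if a[0] == b[0] else 0

result = group_chunk(["a1", "b1", "a2", "c1", "b2"], 75, toy_compare)
print(result)
```

Every item is compared only against bases, never against other members of a group, which is where the assumption about the first element being the base comes in.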
After that you can filter the result:
groups = filter(lambda x: len(x) > 1, result)
remains = filter(lambda x: len(x) == 1, result)
NOTE: this algorithm assumes that the first element of a chunk is a kind of ‘base’. The quality of the result depends strongly on ssdeep's behavior (I can imagine a strange question here: how transitive is ssdeep?). If this kind of similarity then it should be…
The worst case is when no pair satisfies the threshold condition, i.e. s.compare(fileId1, fileId2) is below the threshold for every pair; the complexity is then n^2, which in your case means 1.3 million * 1.3 million comparisons.
There is no simple way to optimize this case. Imagine a situation where s.compare(file1, file2) is always close to 0: then (as I understand it) even if you know that s.compare(A, B) is very low and s.compare(B, C) is very low, you still cannot say anything about s.compare(A, C), so you need n*n operations.
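To put numbers on that worst case, here is the arithmetic as a few lines of Python. The second figure rests on an assumption not stated in the thread: that the score is symmetric, so each unordered pair only has to be scored once.

```python
n = 1300000  # number of files mentioned in the question

# Every ordered pair, i.e. the n * n figure quoted above.
ordered_pairs = n * n

# Assumption: compare(a, b) == compare(b, a) and self-comparisons are skipped,
# so each unordered pair is scored once.
unordered_pairs = n * (n - 1) // 2

print(ordered_pairs)    # 1690000000000
print(unordered_pairs)  # 844999350000
```

Even the halved figure is on the order of 10^12 comparisons, so symmetry alone does not rescue the worst case.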
Another remark: I suspect you are using too many structures and too many lists. For example, take the statement set_lst = set_lst.difference(check_file).
This statement creates a new set() on every pass, and every element of set_lst and check_file HAS to be touched at least once. Because check_file is a list that keeps growing, every pass walks the whole list again, including entries that were already removed from set_lst on earlier passes, so each pass costs at least len(set_lst) + len(check_file) operations.
Basically: as those structures grow (1.3 million elements is a lot), your computer has to do far more work. If you used check_file = set() instead of [] (a list), the cost of that statement would stay at roughly len(set_lst) + len(check_file), and lookups in check_file become much cheaper.
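One way to avoid rebuilding the set on each pass, sketched under assumptions: `compare_all` and its parameters are hypothetical names, and `compare` again stands in for `s.compare`. The loop mirrors the comparer from the question (including the comparison of an item with itself), but keeps both collections as sets and removes only the newly matched items in place:

```python
def compare_all(items, threshold, compare):
    """Greedy grouping that keeps `checked` and `unchecked` as sets."""
    unchecked = set(items)
    checked = set()
    result_data = []
    for base in items:
        if base in checked:          # hash lookup instead of a list scan
            continue
        matched = set()
        for cand in unchecked:
            score = compare(base, cand)
            if score >= threshold:
                result_data.append((score, base, cand))
                matched.add(cand)
        checked |= matched
        unchecked -= matched         # in place; work is proportional to the new matches
    return result_data, unchecked

# Toy stand-in for ssdeep: items are "similar" when they share a first letter.
def toy_compare(a, b):
    return 100 if a[0] == b[0] else 0

result_data, remaining = compare_all(["a1", "b1", "a2"], 75, toy_compare)
print(sorted(result_data), remaining)
```

The order of `result_data` depends on set iteration order, hence the `sorted` call in the usage line.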
The same goes for checking whether an element is in a Python list (array), as in if tup1 in check_file:
Because check_file is a list, when tup1 is not in it your CPU has to compare tup1 with every element, so the complexity is len(check_file).
If you change check_file to a set, the lookup goes through a hash table and takes roughly constant time on average, independent of len(check_file).
To make the difference more visible, assume len(check_file) = 1 million. How many comparisons do you need?
set: about 1 (a hash lookup, on average)
list: len(check_file) = 1 million
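A small experiment that makes the gap countable rather than timed. The `Counted` wrapper and `eq_calls_for_lookup` helper are hypothetical names introduced just for this sketch; the wrapper counts how often `==` is evaluated while Python looks for an element that is not in the container:

```python
class Counted(object):
    """Wraps a value and counts how often == is evaluated on any instance."""
    eq_calls = 0

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        Counted.eq_calls += 1
        return self.value == other.value

    def __hash__(self):
        return hash(self.value)

def eq_calls_for_lookup(container, needle):
    """Return (found, number of == calls) for `needle in container`."""
    Counted.eq_calls = 0
    found = needle in container
    return found, Counted.eq_calls

items = [Counted(i) for i in range(10000)]
missing = Counted(-1)   # not present in items

list_found, list_cost = eq_calls_for_lookup(items, missing)
set_found, set_cost = eq_calls_for_lookup(set(items), missing)

print(list_cost, set_cost)
```

For the list, `list_cost` equals `len(items)`, because every element is compared once. For the set, `==` is only evaluated on entries whose hash matches, so `set_cost` stays tiny no matter how large the container gets.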