I am trying to process over 1.3 mil files using deephashes (SSDEEP) http://code.google.com/p/pyssdeep What


Asked: June 4, 2026 (2026-06-04T09:39:05+00:00)


I am trying to process over 1.3 mil files using deephashes (SSDEEP) http://code.google.com/p/pyssdeep

It generates the hashes (1.3 million are generated within 3-6 minutes) and then compares them against each other to get similarity results. A single comparison is very fast, but one process alone would never finish the whole job, so we added the Python multiprocessing module.

The result: 1.3 million text files are processed within 30 minutes using 18 cores (quad Xeon processors, 24 CPUs in total).

Here is how each process works:

Generate the SSDEEP sums.
Split the list of sums into chunks of 5000.
Compare within each chunk, 1 sum against the 5000, using 18 processes: 18 sums are compared per iteration.
Group the results by similarity score (the default threshold is 75).
Remove the files that have already been matched before the next iteration.
Start the next group with the next file that scored < 75%.
Repeat until all groups are done.
Files that were not grouped (not similar to any other file) are added to the remaining list.

When all processes are done, the remaining files are combined and compared against each other recursively until no new results are produced.

The problem is that when the list of files is split into smaller chunks (5000 files each), some files are grouped inside their own chunk but are never compared with similar files that landed in another chunk, which leaves the groups incomplete.
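The effect can be reproduced on toy data. This is only a minimal sketch: `toy_score` is a hypothetical stand-in for `s.compare()` (it is not ssdeep), the file names are made up, and a chunk size of 3 stands in for the 5000 used above. It lists the similar pairs that are never found because the two files sit in different chunks.

```python
from itertools import combinations

def toy_score(a, b):
    """Hypothetical stand-in for s.compare(): 100 if the names share a first letter, else 0."""
    return 100 if a[0] == b[0] else 0

def similar_pairs(items, threshold=75):
    """Return every pair in `items` whose score reaches the threshold."""
    return set(pair for pair in combinations(items, 2)
               if toy_score(*pair) >= threshold)

files = ["a1", "b1", "a2", "b2", "a3", "c1"]
chunk_size = 3  # stands in for the 5000 used in the question
chunks = [files[i:i + chunk_size] for i in range(0, len(files), chunk_size)]

# Pairs found when every file is compared with every other file.
all_pairs = similar_pairs(files)

# Pairs found when comparisons stay inside one chunk.
chunk_pairs = set()
for chunk in chunks:
    chunk_pairs |= similar_pairs(chunk)

# Similar pairs that chunking never sees.
missed = sorted(all_pairs - chunk_pairs)
print(missed)  # [('a1', 'a3'), ('a2', 'a3'), ('b1', 'b2')]
```

Only the pair ('a1', 'a2') is found inside a chunk; the other three similar pairs straddle the chunk boundary.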

If I run it without chunking, a single loop takes a very long time to complete: more than 18 hours without finishing, and I do not know how much longer it would need.

Please advise.

Modules used: multiprocessing.Pool, ssdeep (Python)

def ssdpComparer(lst, threshold):
    s = ssdeep()
    check_file = []
    result_data = []
    lst1 = lst
    set_lst = set(lst)

    print '>>>START'
    for tup1 in lst1:
        if tup1 in check_file:
            continue
        for tup2 in set_lst:
            score = s.compare(tup1[0], tup2[0])
            if score >= threshold:
                result_data.append((score, tup1[2], tup2[2])) #Score, GroupID, FileID
                check_file.append(tup2)
        set_lst = set_lst.difference(check_file)
    print """####### DONE #######"""
    remain_lst = set(lst).difference(check_file)

    return (result_data, remain_lst)



def parallelProcessing(tochunk_list, total_processes, threshold, source_path, mode, REMAINING_LEN = 0):
    result = []
    remainining = []
    pooled_lst = []
    pair = []
    chunks_toprocess = []

    print 'Total Files:', len(tochunk_list)

    if mode == MODE_INTENSIVE:
        chunks_toprocess = groupWithBlockID(tochunk_list) #blockID chunks
    elif mode == MODE_THOROUGH:
        chunks_toprocess = groupSafeLimit(tochunk_list, TOTAL_PROCESSES) #Chunks by processes
    elif mode == MODE_FAST:
        chunks_toprocess = groupSafeLimit(tochunk_list) #5000 chunks

    print 'No. of files group to process: %d' % (len(chunks_toprocess))
    pool_obj = Pool(processes = total_processes, initializer = poolInitializer, initargs = [None, threshold, source_path, mode])
    pooled_lst = pool_obj.map(matchingProcess, chunks_toprocess) #chunks_toprocess
    tmp_rs, tmp_rm = getResultAndRemainingLists(pooled_lst)
    result += tmp_rs
    remainining += tmp_rm

    print 'RESULT LEN: %s, REMAINING LEN: %s, P.R.L: %s' % (len(result), len(remainining), REMAINING_LEN)
    tmp_r_len = len(remainining)

    if tmp_r_len != REMAINING_LEN and len(result) > 0 :
        result += parallelProcessing(remainining, total_processes, threshold, source_path, mode, tmp_r_len)
    else:
        result += [('','', rf[2]) for rf in remainining]

    return result

def getResultAndRemainingLists(pooled_lst):
    g_result = []
    g_remaining = []

    for tup_result in pooled_lst:
        tmp_result, tmp_remaining = tup_result
        g_result += tmp_result
        if tmp_remaining:
            g_remaining += tmp_remaining

    return (g_result, g_remaining)


1 Answer

Editorial Team · Answer 1 · 2026-06-04T09:39:07+00:00

First piece of advice: in your case there is no need for check_file to be a list. Change it to a set() and it should perform better (explanation at the end).

If you need chunks, a procedure like this may be enough:

def split_to_chunks(wholeFileList, threshold):
    # wholeFileList holds the ssdeep hashes; ssdeep() is the same object used in the question
    s = ssdeep()
    calculated_chunks = []
    for someFileId in wholeFileList:
        for chunk in calculated_chunks:
            # chunk[0] is the 'base' of the chunk: compare only against it
            if s.compare(chunk[0], someFileId) > threshold:
                chunk.append(someFileId)
                break
        else:  # important: this else belongs to the inner 'for'
            # no 'break' happened, so someFileId becomes the base of a new chunk
            calculated_chunks.append([someFileId])
    return calculated_chunks

After that you can filter the result (here result is the list returned by split_to_chunks):
groups = filter(lambda x: len(x) > 1, result)
remains = filter(lambda x: len(x) == 1, result)

NOTE: this algorithm assumes that the first element of a chunk is a kind of ‘base’. The quality of the result depends strongly on ssdeep's behavior (I can imagine a strange question about it: how transitive is ssdeep?). If this kind of similarity then it should be…
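To see what the ‘base’ assumption means in practice, here is a minimal runnable sketch of the same grouping idea. It is not the pyssdeep API: `toy_compare` is a hypothetical stand-in that scores two equal-length strings by the percentage of matching positions, the compare function is passed in as a parameter, and the three sample strings are invented. List comprehensions are used instead of filter() so that the result is a list on both Python 2 and Python 3.

```python
def toy_compare(a, b):
    """Hypothetical stand-in for s.compare(): percent of positions that agree."""
    same = sum(1 for x, y in zip(a, b) if x == y)
    return 100 * same // max(len(a), len(b))

def split_to_groups(hashes, compare, threshold):
    """Greedy grouping: each hash joins the first group whose base it matches."""
    groups = []
    for h in hashes:
        for group in groups:
            if compare(group[0], h) > threshold:  # group[0] is the base
                group.append(h)
                break
        else:  # no break: h starts a new group and becomes its base
            groups.append([h])
    return groups

def groups_and_remains(result):
    """Separate real groups from hashes that matched nothing."""
    groups = [g for g in result if len(g) > 1]
    remains = [g[0] for g in result if len(g) == 1]
    return groups, remains

# A~B and B~C both score 75, but A~C only scores 50.
A, B, C = "abcdefgh", "abcdefxx", "abcdxxxx"

forward = split_to_groups([A, B, C], toy_compare, 70)   # base is A
backward = split_to_groups([B, A, C], toy_compare, 70)  # base is B

print(groups_and_remains(forward))
print(groups_and_remains(backward))
```

With A as the base, C is left out although it is similar to B; with B as the base, all three land in one group. When similarity is not transitive, the grouping depends on the input order.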

The worst case is when no pair satisfies the threshold condition for s.compare(fileId1, fileId2); then the complexity is n^2, in your case 1.3 mln * 1.3 mln.

There is no simple way to optimize this case. Imagine a situation where s.compare(file1, file2) is always close to 0. Then (as I understand it) even if you know that s.compare(A, B) is very low and s.compare(B, C) is very low, you still cannot say anything about s.compare(A, C), so you need n*n operations.
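To put a number on that worst case, here is a small arithmetic sketch. It assumes the score is symmetric, so each unordered pair is compared once, which halves n*n. The throughput passed to `hours_needed` is a placeholder to be replaced with a measured rate; nothing here was measured.

```python
def pair_count(n):
    """Number of unordered pairs among n files: n * (n - 1) / 2."""
    return n * (n - 1) // 2

def hours_needed(n, compares_per_second):
    """Hours to compare all pairs at an ASSUMED overall throughput."""
    return pair_count(n) / float(compares_per_second) / 3600.0

n_files = 1300000
print(pair_count(n_files))  # 844999350000
```

That is roughly 0.85 * 10^12 comparisons when nothing can be skipped, which is why the run without chunking does not finish in a reasonable time.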

Another remark: I suppose you are using too many structures and too many lists, for example:

set_lst = set_lst.difference(check_file)

This statement creates a new set() on every pass of the outer loop, and every element of set_lst and of check_file has to be touched at least once, so each call costs roughly len(set_lst) + len(check_file) hash operations. Because check_file only grows, the elements that were already removed in earlier passes are walked again every time.

Basically: as those structures grow (1.3 mln is quite a lot), your computer has to do many more calculations. Subtracting only the newly matched elements in place, for example with set_lst.difference_update(...), avoids both the copy and the repeated walk over old entries.

The same goes for checking whether an element is in a Python list:

if tup1 in check_file:

Because check_file is a list, when tup1 is not in it your CPU has to compare tup1 with every element, so the cost is len(check_file).
If you change check_file to a set, the check becomes a hash lookup whose average cost is constant, independent of len(check_file).
To make the difference more visible, assume len(check_file) = 1 mln. How many comparisons do you need?

set: about 1 hash lookup on average

list: up to len(check_file) = 1 mln
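Putting both remarks together, the comparison loop could look roughly like the sketch below. It is an illustration under assumptions, not a drop-in replacement for ssdpComparer: the compare function is passed in as a parameter, `toy_compare` is a hypothetical stand-in for `s.compare()`, the sample tuples are invented (they follow the (hash, group id, file id) layout from the question), and unlike the original the sketch skips comparing an item with itself, so an item with no match ends up in the unmatched list.

```python
def toy_compare(a, b):
    """Hypothetical stand-in for s.compare(): percent of positions that agree."""
    same = sum(1 for x, y in zip(a, b) if x == y)
    return 100 * same // max(len(a), len(b))

def compare_all(items, compare, threshold):
    """items: list of (hash, group_id, file_id) tuples."""
    checked = set()          # already grouped; membership test is a hash lookup
    candidates = set(items)  # still available for matching
    results = []
    unmatched = []
    for base in items:
        if base in checked:
            continue
        candidates.discard(base)  # do not compare an item with itself
        matched = []
        for other in candidates:
            score = compare(base[0], other[0])
            if score >= threshold:
                results.append((score, base[2], other[2]))
                matched.append(other)
        if matched:
            checked.update(matched)
            # Remove only the new matches, in place, instead of rebuilding the set.
            candidates.difference_update(matched)
        else:
            unmatched.append(base)
    return results, unmatched

items = [("aaaa", 0, "f1"), ("aaab", 0, "f2"), ("zzzz", 0, "f3")]
results, unmatched = compare_all(items, toy_compare, 75)
print(results)    # [(75, 'f1', 'f2')]
print(unmatched)  # [('zzzz', 0, 'f3')]
```

The number of compare() calls is unchanged; what goes away is the bookkeeping cost of scanning a growing list and rebuilding a set on every pass.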
