Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 3489216
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 18, 20262026-05-18T11:22:08+00:00 2026-05-18T11:22:08+00:00

I have binary A , which is a debug build with accompanying symbols —

  • 0

I have binary A, which is a debug build with accompanying symbols — built many years ago. I also have binary B, a release build without accompanying symbols and is much more recent. I am looking for the most efficient method of matching the symbols from binary A to potential candidates in binary B.

Given that the debug build is considerably bigger (does more input validation, prints more things to stderr, etc.) and that functions invariably change over time, I figure that trying to fingerprint the individual functions will be wasted time.

Therefore, I’ve decided — quite out of thin air, so I could be barking up the wrong tree — that the best way to fingerprint the functions is to create call graphs of both binaries and try to match the vertices (i.e. the functions).

I have already done some preprocessing, so I have the following data structures:

# binary A
[[60, 60, 8734], # function 0 is called by functions 60 (twice) and 8734
 [193, 441, 505], # function 1 is called by functions 193, 441 and 505
 [193, 742],
 [23],  
 [21],  
 [21],  
 [26],  
 [26, 1508, 1509, 1573],  
 [24],  
 [25],
 ...] # (~10k functions)

# binary B
[[8999], # function 0 is called by function 8999
 [9016], # function 1 is called by function 9016
 [1126], 
 [7904, 7904, 7913], 
 [182, 336, 396, 396], 
 [9010], 
 [407], 
 [182, 632], 
 [20], 
 [24],
 ...] # (~10k functions)

An important note to take away is that there is no correspondence between function “0” in binary A and function “0” in binary B. These are arbitrary IDs I have assigned to each function in each binary.

The next step is that which confounds me. My algorithm-fu is very weak and I cannot think of a clever way to proceed. My (very limited) understanding is that to solve this, I would want to employ some form of inexact graph matching. In other words, which set of mappings Ai -> Bi would maximise the similarity of the two call-graphs?

Given that there are additional debugging functions in binary A and the obvious fact that programs evolve over time, there is likely no exact match. Ideally, I would want output of the form:

[[(37, 0.998), (8432, 0.912), (442, 0.75)], # matching-ness of function "0" in binary A with function "37" in binary B is 0.998, second most likely candidate is function "8432" in binary B with score 0.912, etc.
 [(42, 0.973), (7751, 0.788)], # matching-ness of function "1" in binary A with function "42" in binary B is 0.973, second most likely candidate is function "7751" in binary B with score 0.788, etc.
 [(4579, 0.996), (123, 0.934)], 
 ...] # around ~10k mappings

In reality, I would be happy with if I had to make do with only one candidate and no ranking was provided, but one can dream.

Any SO-goers have an idea of where I should start?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-18T11:22:09+00:00Added an answer on May 18, 2026 at 11:22 am

    Certainly an interesting problem, though I suspect it will be hard to solve. It appears to be an instance of approximate graph isomorphism on directed graphs. I didn’t find much googling for this, but here’s some software for solving for undirected general graphs, a more general case which is NP hard.

    Aligning Executable Code

    I think the most practical thing you may be able to do is to forget about the runtime information and simply take the executable code sections of each version and use a global alignment algorithm (e.g. Needleman-Wunsch, although there do exist much faster but less accurate algorithms) on them that:

    1. Treats entire instructions as characters (this will require building a rudimentary disassembler)
    2. Disregards address components of instructions entirely
    3. Downweights deletions (assuming the “first” file is the debug version, which we expect to be larger)
    4. Upweights matches of CALL instructions, and possibly other “reliable” sequences of instructions.

    Assuming the order that the functions appear in the executables hasn’t changed too much (which it won’t have, unless the optimised version has used some optimisation that gets the linker to place functions that call each other near each other), this should get you a nice first approximation.

    The Assignment Problem (Bipartite Maximum Matching)

    Alternatively, if you can find a way (and my intuition suggests it would need to be an iterative approach, along the lines of how PageRank decides the value of a webpage) to “score” the likelihood that a function f in the debug version corresponds to a function g in the optimised version, then yes you could use a graph matching technique. In this case the vertices of the graph would be all functions in both versions, and there would be a weighted edge between every function from the debug version and every function from the optimised version, with the weight determined by your scoring system.

    This graph would be bipartite because there will never be an edge between 2 functions in the same version. That means it is an instance of the Assignment Problem, for which quite good (and not too complicated) algorithms exist.

    However, the missing piece here is the means of deciding the weights for each pairing. One approximate way of doing this would be to build a vector counting the number of immediate children, grandchildren, great-grandchildren and so on for each function. You could then compare these vectors using any distance measure you like. But I expect doing this here will not work well, because we already expect the debug version to contain far more calls to functions than the optimised version.

    Using the Full Call Tree

    If you have access to the entire call tree for both, this will give you more information: the sequence of calls made within a function, as well as knowledge of the exact hierarchy of calls. You might then be able to build a “signature” for each function by doing the using just the latter:

    1. Extract the list of distinct functions called by a given function
    2. Label the 1st-called function 1, the 2nd-called distinct function 2, etc.
    3. The signature is just this sequence of function labels in the order that they are called in the function.

    Now, Levenshtein distance can be used to compare 2 signatures. For more accuracy at the expense of more computation, you could use a variation in which up to k distinct functions in the debug version are allowed to be deleted, for some small k (e.g. k = 3), and the best Levenshtein distance is taken over all such “slimmed down” versions, with a small penalty attached that is proportional to the number of functions deleted.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

No related questions found

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.