I am looking for a Python implementation of an algorithm which performs the following task:
Given two directed graphs, that may contain cycles, and their roots,
produce a score to the two graphs’ similarity.
(The way that Python’s difflib can perform for two sequences)
Hopefully, such an implementation exists. Otherwise, I’ll try and implement an algorithm myself. In which case, what’s the preferable algorithm to implement (with regard to simplicity).
The way the algorithm works is of no importance to me, though its’ complexity is.
Also, an algorithm which works with a different data-structure is also acceptable, as long as a graph, such as I described, can be represented with this DS.
I’ll emphasize, an implemetation would be much nicer.
Edit:
It seems an isomorphism algortihm is not relevant. It was suggested that graph edit distance is more to the point, which narrows down my search to a solution that either executes graph edit distance or reduces a graph to a tree and then performs tree edit distance.
The nodes themseleves consist of a few lines of assembly code each.
What we ended up doing is implementing an algorithm described in: “Heuristics for Chemical Compound Matching”.
We’re using NetworkX for representing the grap and for finding the max clique.
Edit:
Basically, you create a new graph, each node (v) representing a possible pairing of a node from graph A (a) to a node from graph B (b).
If in your application the two nodes (a,b) are either similar or not, you remove nodes (v) from the new graph which correspond to dissimilar pairings (a,b).
You connect two nodes with an edge if they don’t contradict each other.
For example, the pairings (a,b) and (a,c) contradict each other (see article for a formal definition).
You then find a clique in the new graph which has the maximum amount of nodes.
If in your application the two nodes’ similarity is not binary, you give the new nodes weights within a range (say (0,1)).
You can remove, heuristically, remove new nodes with similarity grades lower than a predefined threshold.
You then find a clique in the new graph which has the maximum weight (the sum of the nodes’ assigned weights).
Either way, you finish up by generating the similarity grade: the size / total weight of the clique divided by a function of the attributes of the original graphs (maximum/minimum/average of the sizes / weights of A and B ).
A nice feature is that the you can deduce the “source” of the similarity from the clique you found – the “stronger” pairings.
Further clarification:
The constraints are application dependent. We used the approach to compare pairs of function control-flow graphs. Generally, the approach finds a matching of some nodes in the first graph to some nodes in the second graph (subgraph to subgraph). Each node in the association graph symbolizes a possible matching of a single node from the first graph to a single node in the second graph. Since eventually a clique (a subset of the nodes) is selected, an edge means two matchings don’t contradict each other. To apply for a different application, you should ask what are the criteria for possible pairings (or what nodes do I create), and how does selecting one pairing influence the selection of another pairing (or how do I connect the nodes with edges).