I have been reading this paper titled Clone Detection using Abstract Syntax Trees by Ira D. Baxter et al. There is a paragraph from the paper that I reproduced below:
In principle, finding sub-tree clones
is easy: compare every subtree to
every other sub-tree for equality. In
practice, several problems arise:
near-miss clone detection, sub-clones
and scale.
…When locating near-miss
clones, hashing on complete subtrees
fails precisely because good hashing
functions include all elements of the
tree, and thus sorts tress with minor
differences into different buckets. We
solved this problem by choosing an
artificially bad hash function. This function must be characterized in
such a way that the main properties
one wants to find on near-miss clones
are preserved. Near miss clones are
usually created by copy and paste
procedures followed by small
modifications. These modifications
usually generate small changes to the
shape of the tree associated with the
copied piece of code. Therefore, we
argue that this kind of near-miss
clone often have only some different
small sub-trees. Based on this
observation, a hash function that
ignores small sub-trees is a
goodchoice. In the experiment
presented here, we used a hash
function that ignores only the
identifier names (leaves in the tree).
Thus our hashing function puts trees
which are similar modulo identifiers
into the same hash bins for
comparison.
I am trying to implement the techniques discussed in this paper but am stuck in trying to understand this one paragraph (that is unfortunately at the beginning of the paper). I understand what the paragraph is saying but the authors do not mention what hash function to choose or how to actually hash the ASTs. Can someone please explain this with a simple example from an implementation standpoint?
Shades that the author himself should answer. Isn’t StackOverflow great 😕
The point of hash functions is that which one you choose doesn’t matter, as long as it distributes input values evenly across a large number of buckets. You need a hash function that can be applied to the entire tree; the usual technique for such is to serialize the tree in any way possible (say, by an in-order tree visit) and then apply the hash function to the stream of values (tree nodes) this produces. (This idea is from the compiler literature on detecting common subexpressions, which was the inspiration for the original CloneDR). If this isn’t clear, you need to spend more energy understanding how hash functions are applied to complex data structures. Wikipedia on hashing is a good place to start; if that’s not enough, you need to find a book on algorithms and study up.
What you feed to the hash function is up to you. The point I made in the paper is that you can compute hash functions that ignore the identifier leaves of an AST, which will cause trees having the same identical structure but different identifiers to hash to the same bucket. Thus, trees which are similar modulo identifiers are easily matched, beause they occur in the same hash bucket.
Of course, there’s a lot more to the whole clone detection algorithm that just matching trees modulo identifiers. You need to worry about matching parameterized sequences (which is sort of the big point in the paper), reporting the results, and of course you need a high-quality language parser for whatever language you care to apply this to.
You can see results of the CloneDR for a number of different languages.