The Archive Base

Editorial Team
Asked: May 21, 2026


I have been reading this paper titled Clone Detection using Abstract Syntax Trees by Ira D. Baxter et al. There is a paragraph from the paper that I reproduced below:

In principle, finding sub-tree clones
is easy: compare every subtree to
every other sub-tree for equality. In
practice, several problems arise:
near-miss clone detection, sub-clones
and scale.
…

When locating near-miss
clones, hashing on complete subtrees
fails precisely because good hashing
functions include all elements of the
tree, and thus sorts trees with minor
differences into different buckets. We
solved this problem by choosing an
artificially bad hash function. This function must be characterized in
such a way that the main properties
one wants to find on near-miss clones
are preserved. Near miss clones are
usually created by copy and paste
procedures followed by small
modifications. These modifications
usually generate small changes to the
shape of the tree associated with the
copied piece of code. Therefore, we
argue that this kind of near-miss
clone often have only some different
small sub-trees. Based on this
observation, a hash function that
ignores small sub-trees is a
good choice. In the experiment
presented here, we used a hash
function that ignores only the
identifier names (leaves in the tree).
Thus our hashing function puts trees
which are similar modulo identifiers
into the same hash bins for
comparison.

I am trying to implement the techniques discussed in this paper but am stuck in trying to understand this one paragraph (that is unfortunately at the beginning of the paper). I understand what the paragraph is saying but the authors do not mention what hash function to choose or how to actually hash the ASTs. Can someone please explain this with a simple example from an implementation standpoint?


1 Answer

  1. Editorial Team
    Added an answer on May 21, 2026 at 7:46 am

    Shades that the author himself should answer. Isn’t StackOverflow great 😕

    The point of hash functions is that it doesn’t much matter which one you choose, as long as it distributes input values evenly across a large number of buckets. You need a hash function that can be applied to an entire tree; the usual technique is to serialize the tree in any consistent way (say, by an in-order tree visit) and then apply the hash function to the resulting stream of values (tree nodes). (This idea comes from the compiler literature on detecting common subexpressions, which was the inspiration for the original CloneDR.) If this isn’t clear, you need to spend more energy understanding how hash functions are applied to complex data structures. Wikipedia on hashing is a good place to start; if that’s not enough, find a book on algorithms and study up.
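To make the "serialize, then hash" step concrete, here is a minimal Python sketch. It is not taken from the paper or from CloneDR: the `Node` class, the pre-order traversal, the child-count marker, and the use of SHA-256 are all assumptions chosen for illustration. Any fixed traversal order and any reasonable hash function would serve the same purpose.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical AST node: a label plus an ordered list of children."""
    label: str                       # node type or token text, e.g. "Add", "x", "1"
    children: list = field(default_factory=list)


def serialize(node):
    """Yield one token per node in a fixed pre-order traversal.

    The child count is appended so that trees with the same labels
    but a different shape produce different token streams.
    """
    yield f"{node.label}/{len(node.children)}"
    for child in node.children:
        yield from serialize(child)


def tree_hash(node):
    """Feed the serialized node stream into an ordinary hash function."""
    h = hashlib.sha256()
    for token in serialize(node):
        h.update(token.encode("utf-8"))
        h.update(b"\x00")            # separator between tokens
    return h.hexdigest()


a = Node("Add", [Node("x"), Node("1")])   # x + 1
b = Node("Add", [Node("x"), Node("1")])   # x + 1 again (exact clone)
c = Node("Add", [Node("y"), Node("1")])   # y + 1 (differs in one leaf)
```

Here `tree_hash(a)` equals `tree_hash(b)`, while `tree_hash(c)` differs because the leaf `y` is part of the stream. That second case is exactly the near-miss situation the quoted paragraph describes: a hash over all elements sends `x + 1` and `y + 1` to different buckets.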

    What you feed to the hash function is up to you. The point I made in the paper is that you can compute hash functions that ignore the identifier leaves of an AST, which will cause trees having identical structure but different identifiers to hash to the same bucket. Thus, trees which are similar modulo identifiers are easily matched, because they occur in the same hash bucket.
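A hedged sketch of a hash that ignores identifiers, using Python's standard `ast` module as the parser. This illustrates the idea only and is not the CloneDR implementation: the function names `shape` and `structural_hash`, and the choice of which `ast` fields count as identifier names (`IDENT_FIELDS`), are assumptions made for this example.

```python
import ast
import hashlib

# Assumption: these ast fields hold identifier names and are left out of the hash.
IDENT_FIELDS = {"id", "name", "arg", "attr"}


def shape(node):
    """Serialize a subtree, keeping node types and literals but skipping identifier names."""
    parts = [type(node).__name__]
    for field_name, value in ast.iter_fields(node):
        if field_name in IDENT_FIELDS:
            continue                              # identifier leaf: deliberately ignored
        if isinstance(value, ast.AST):
            parts.append(shape(value))
        elif isinstance(value, list):
            items = [shape(v) if isinstance(v, ast.AST) else repr(v) for v in value]
            parts.append("[" + ",".join(items) + "]")
        else:
            parts.append(repr(value))             # literal values, None, etc.
    return "(" + " ".join(parts) + ")"


def structural_hash(node):
    """Hash the identifier-free serialization; equal hashes mean 'same bucket'."""
    return hashlib.sha256(shape(node).encode("utf-8")).hexdigest()


h1 = structural_hash(ast.parse("total = price * quantity + tax"))
h2 = structural_hash(ast.parse("area = width * height + border"))   # renamed copy
h3 = structural_hash(ast.parse("area = width * height - border"))   # operator changed
```

With this, `h1 == h2` because the two statements differ only in identifier names, while `h3` differs because the operator node changed from `Add` to `Sub`. Trees that land in the same bucket are only candidates: as the quoted paragraph says, the bins exist "for comparison", so a real detector would still compare the bucketed trees to each other afterwards.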

    Of course, there’s a lot more to the whole clone detection algorithm than just matching trees modulo identifiers. You need to worry about matching parameterized sequences (which is sort of the big point in the paper), reporting the results, and of course you need a high-quality language parser for whatever language you care to apply this to.

    You can see results of the CloneDR for a number of different languages.


© 2021 The Archive Base. All Rights Reserved