I am looking for a way, using static analysis of two JavaScript functions, to tell if they are the same. Let me give several definitions of “the same”.
Level 1: The functions are the same except for possible different whitespace, e.g. TABS, CR, LF and SPACES.
Level 2: The functions may have different whitespace like Level 1, but also may have different variable names.
Level 3: ???
For level one, I think I could just remove all (non-literal, which may be tough) whitespace from each string containing the two JS function definitions, and then compare the strings.
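A minimal sketch of that Level 1 idea, assuming only quoted string literals need protecting. Comments, regex literals and template-literal ${} expressions are not handled, and stripWhitespace is just a name made up for this illustration:

```javascript
// Sketch: drop whitespace that is outside '...', "..." and `...` literals.
// Assumption: no comments or regex literals appear in the source text.
function stripWhitespace(source) {
  let out = '';
  let quote = null; // quote character of the literal we are inside, or null
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (quote !== null) {
      out += ch;
      if (ch === '\\' && i + 1 < source.length) {
        out += source[++i]; // keep the escaped character verbatim
      } else if (ch === quote) {
        quote = null; // the literal is closed
      }
    } else if (ch === '"' || ch === "'" || ch === '`') {
      quote = ch;
      out += ch;
    } else if (!/\s/.test(ch)) {
      out += ch; // non-whitespace outside a literal is kept
    }
  }
  return out;
}

// Level 1 comparison: equal once non-literal whitespace is gone.
function sameExceptWhitespace(srcA, srcB) {
  return stripWhitespace(srcA) === stripWhitespace(srcB);
}
```

Dropping every space also glues tokens together (return x becomes returnx), so this can report a match for sources that tokenize differently; comparing token streams from a real tokenizer would avoid that.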
For level two, I think I would need to use something like SpiderMonkey’s parser to generate two parse trees, and then write a comparer which walks the trees and allows variables to have different names.
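As a rough sketch of such a comparer, assuming the parser hands back ESTree-style nodes (plain objects with a type field, identifiers shaped like { type: 'Identifier', name }). The name sameTree and the list of ignored position keys are assumptions for illustration. It keeps one rename mapping for the whole function, so it ignores nested scopes, and it only special-cases non-computed member properties such as obj.length:

```javascript
// Position metadata some parsers attach; irrelevant for structural equality.
const IGNORED_KEYS = new Set(['loc', 'range', 'start', 'end']);

// Returns true if the trees match up to a consistent one-to-one renaming
// of identifiers. aToB/bToA record the renaming discovered so far.
function sameTree(a, b, aToB = new Map(), bToA = new Map()) {
  if (Array.isArray(a) || Array.isArray(b)) {
    return Array.isArray(a) && Array.isArray(b) && a.length === b.length &&
      a.every((item, i) => sameTree(item, b[i], aToB, bToA));
  }
  if (a === null || b === null || typeof a !== 'object' || typeof b !== 'object') {
    return a === b; // primitives (literal values, operators) must match exactly
  }
  if (a.type !== b.type) return false;
  if (a.type === 'Identifier') {
    if (!aToB.has(a.name) && !bToA.has(b.name)) {
      aToB.set(a.name, b.name); // first sighting of both names: pair them up
      bToA.set(b.name, a.name);
      return true;
    }
    return aToB.get(a.name) === b.name && bToA.get(b.name) === a.name;
  }
  if (a.type === 'MemberExpression' && !a.computed && !b.computed) {
    // obj.prop: the property is a fixed name, not a renamable variable.
    return a.property.name === b.property.name &&
      sameTree(a.object, b.object, aToB, bToA);
  }
  const keysA = Object.keys(a).filter((k) => !IGNORED_KEYS.has(k));
  const keysB = Object.keys(b).filter((k) => !IGNORED_KEYS.has(k));
  return keysA.length === keysB.length &&
    keysA.every((k) => k in b && sameTree(a[k], b[k], aToB, bToA));
}
```

Tracking the mapping in both directions is what rejects a pair like a + b versus x + x, where two distinct variables would otherwise collapse onto one name.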
[Edit] Williham makes a good point below. I do mean identical. Now I’m looking for some practical strategies, particularly with regard to using parse trees.
Reedit:
To expound on my suggestion for determining identical functions, the following flow can be suggested:
Level 1: Remove any whitespace that is not part of a string literal; insert newlines after each {, ; and } and compare. If equal, the functions are identical; if not:
Level 2: Move all variable declarations and assignments that don’t depend on the state of other variables defined in the same scope to the start of the scope they are declared in (or if not wanting to actually parse the JS, the start of the braces); and order them by line length, treating all variable names as being 4 characters long, and falling back to alphabetical ordering ignoring variable names in case of tied lengths. Reorder all collections in alphabetical order, and rename all variables vSNN, where v is literal, S is the number of nested braces and NN is the order in which the variable was encountered. Compare; if equal, the functions are identical; if not:
Level 3: Replace all string literals with "sNN", where " and s are literal, and NN is the order in which the string was encountered. Compare; if equal, the functions are identical; if not:
Level 4: Normalize the names of any functions known to be the same by using the name of the function with the highest priority according to alphabetical order (in the example below, any calls to p_strlen() would be replaced with c_strlen()). Repeat re-orderings as per level 1 if necessary. Compare; if equal, the functions are identical; if not, the functions are almost certainly not identical.
Original answer:
I think you’ll find that you mean “identical”, not “the same”.
The difference, as you’ll find, is critical:
Two functions are identical if, following some manner of normalization, (removing non-literal whitespace, renaming and reordering variables to a normalized order, replacing string literals with placeholders, …) they compare to literally equal.
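To make one of those normalizations concrete, here is a sketch of the string-literal placeholder step, using the "sNN" naming from the flow above. It assumes only single- and double-quoted literals occur (no template literals, comments or regex literals), and replaceStringLiterals is a name invented for the example:

```javascript
// Sketch: replace each quoted string literal with "sNN", where NN is the
// 1-based order in which the literal was encountered (zero-padded to 2 digits).
function replaceStringLiterals(source) {
  let out = '';
  let count = 0;
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === '"' || ch === "'") {
      // Scan to the matching close quote, stepping over backslash escapes.
      let j = i + 1;
      while (j < source.length && source[j] !== ch) {
        if (source[j] === '\\') j++;
        j++;
      }
      count++;
      out += '"s' + String(count).padStart(2, '0') + '"';
      i = j; // continue after the closing quote
    } else {
      out += ch;
    }
  }
  return out;
}

// Two sources that differ only in their string contents compare equal.
function sameExceptStrings(srcA, srcB) {
  return replaceStringLiterals(srcA) === replaceStringLiterals(srcB);
}
```

Numbering by order of appearance means two functions match only if their literals occur in the same positions, whatever the literals contain.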
Two functions are the same if, when called for any given input value, they give the same return value. Consider, in the general case, a programming language which has counted, zero-terminated strings (hybrid Pascal/C strings, if you will). A function p_strlen(str) might look at the character count of the string and return that. A function c_strlen(str) might count the number of characters in the string and return that.
While these functions certainly won’t be identical, they will be the same: for any given (valid) input value they will give the same value.
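Modelled in JavaScript, that pair might look like the following. The hybrid string is simulated with a plain object, and makeCountedString and its field names are invented for the illustration, since JavaScript has no such string type:

```javascript
// Hypothetical hybrid string: a stored character count plus character data
// that ends in a zero terminator. "Valid" input here means the text itself
// contains no '\0' character.
function makeCountedString(text) {
  const chars = Array.from(text);
  return { count: chars.length, chars: chars.concat('\0') };
}

// Pascal style: trust the stored count.
function p_strlen(str) {
  return str.count;
}

// C style: walk the characters until the terminator is reached.
function c_strlen(str) {
  let n = 0;
  while (str.chars[n] !== '\0') {
    n++;
  }
  return n;
}
```

No amount of whitespace, renaming or literal normalization will make these two bodies compare equal, yet they return the same value for every valid string.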
My point is:
Determining whether two functions are identical (what you seem to want to achieve) is a (moderately) trivial problem, done as you describe.
Determining whether two functions are truly the same (what you might actually want to achieve) is non-trivial; in fact, it’s downright Hard: in the general case it is undecidable (a consequence of Rice’s theorem, closely related to the Halting Problem), so it is not something static analysis can settle for arbitrary functions.
Edit: Of course, functions that are identical are also the same; but in a highly specific and rarely useful way for complete analysis.