I am currently working on a source code plagiarism detection project, in which I use various attributes of the input files (source code files) to detect plagiarism among student assignments. For example, I currently use the number of identifiers/variables, the number of methods used, the number of lines of code, and some other attributes to represent each source code file.
However, when I try to count the number of variables, one problem is how to find out whether a variable is actually used, because students could intentionally add unused identifiers to disguise the plagiarism. I have found this really tough to solve. One approach is to use regular expressions in Java to find the identifiers, but after finding them, I am stuck on how to check whether they are used. (What’s more, after this I still need to find out whether a Java method is called or not.) So writing my own regular expressions for all of this could become very complicated.
I know that in some IDEs, such as NetBeans, the editor can instantly find out whether a variable is used and underline it. So I wonder if there is a good way to check whether variables are used.
Any suggestions on how to check for variable usage would be appreciated!
For this kind of code analysis, you absolutely have to look into parser / compiler tools. You cannot determine whether a variable is used by searching for its mere name; you have to take the context of each occurrence into account as well.
I suggest having a look at ANTLR, which is a Java-based language parsing tool. A grammar definition for parsing Java syntax is available for it. Don’t expect to find an easy solution for your problem that can be implemented in a couple of hours.
Another Java-based tool is JavaCC. If you’re looking for example code showing how these tools can be used, take a look at PMD, which uses a parser built with JavaCC to analyze Java code.
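To give a rough idea of what a parser-based check looks like, here is a minimal sketch. Note that it does not use ANTLR or JavaCC; it uses the Compiler Tree API that ships with the JDK (com.sun.source), simply because that needs no extra download. The class and method names (UnusedVariableSketch, unusedNames, StringSource) are made up for this illustration. It only compares names, so it ignores scoping and shadowing, treats a plain assignment as a use, and misses fields that are accessed only as this.x. Treat it as a starting point, not a finished detector.

```java
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.IdentifierTree;
import com.sun.source.tree.VariableTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.TreeScanner;

import javax.tools.JavaCompiler;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

public class UnusedVariableSketch {

    /** Hypothetical helper: an in-memory source file, so nothing is read from disk. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className + ".java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Returns names that are declared as variables but never appear as an identifier expression. */
    static Set<String> unusedNames(String className, String code) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system compiler found; run this on a JDK, not a JRE");
        }
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, null, null, null,
                Collections.singletonList(new StringSource(className, code)));

        final Set<String> declared = new LinkedHashSet<>();
        final Set<String> referenced = new LinkedHashSet<>();

        TreeScanner<Void, Void> scanner = new TreeScanner<Void, Void>() {
            @Override
            public Void visitVariable(VariableTree node, Void unused) {
                // Fields, locals and parameters all arrive here.
                declared.add(node.getName().toString());
                return super.visitVariable(node, unused);
            }

            @Override
            public Void visitIdentifier(IdentifierTree node, Void unused) {
                // Identifiers in real code only; comments and string literals never reach the tree.
                referenced.add(node.getName().toString());
                return super.visitIdentifier(node, unused);
            }
        };

        try {
            // Parse only: no type checking, so the code need not resolve its imports.
            for (CompilationUnitTree unit : task.parse()) {
                scanner.scan(unit, null);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        declared.removeAll(referenced);
        return declared;
    }

    public static void main(String[] args) {
        String code =
                "class Demo {\n"
              + "    int run() {\n"
              + "        int total = 1;\n"
              + "        int decoy = 2; // total decoy\n"
              + "        String note = \"decoy\";\n"
              + "        return total + note.length();\n"
              + "    }\n"
              + "}\n";
        System.out.println(unusedNames("Demo", code)); // prints [decoy]
    }
}
```

The point of the example is that the word "decoy" inside the comment and inside the string literal does not count as a use, because the parser never turns those into identifier nodes. Checking whether a method is called could follow the same pattern, with visitMethod and visitMethodInvocation instead.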
Another possibility is to write a plugin for an IDE that supports code analysis – you’d probably have a much simpler interface there to access the code structure, and as you said, lots of functionality is already available and can simply be called by your plugin.
Yes, you can probably also hack your way with some regexes. Whether you want to do this depends on how exact you want your tool to be. Without parsing the source code, deciding whether an occurrence of a variable name is actually a usage of that variable is merely a heuristic guess.
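To illustrate how rough such a regex guess is, here is a small sketch. The class and method names (RegexUsageGuess, countOccurrences, looksUsed) are invented for the example, and the rule "more than one whole-word occurrence means used" is just an assumed heuristic, not an established technique.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexUsageGuess {

    /** Counts whole-word occurrences of a name anywhere in the text, comments and strings included. */
    static int countOccurrences(String source, String name) {
        Matcher m = Pattern.compile("\\b" + Pattern.quote(name) + "\\b").matcher(source);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    /** Assumed heuristic: the declaration is one occurrence, so anything beyond that is taken as a use. */
    static boolean looksUsed(String source, String name) {
        return countOccurrences(source, name) > 1;
    }

    public static void main(String[] args) {
        String source =
                "int total = 1;\n"
              + "int decoy = 2; // decoy is never read\n"
              + "return total;\n";
        System.out.println(looksUsed(source, "total")); // prints true, which is correct
        System.out.println(looksUsed(source, "decoy")); // prints true, which is wrong: only the comment mentions it
    }
}
```

A student who wants to hide a padding variable only has to mention it in a comment or a string to defeat this check. You can keep patching the regex (strip comments, strip string literals, handle shadowing), but at that point you are writing a parser by hand.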