I am trying to instrument java byte code.
I want to recognize the entry and exit of a java loop, but I have found the identification of loops to be quite challenging.
I have spent a good few hours looking at ASM and open source de-compilers (whom I thought must solve this problem all the time), however, I came up short.
The tool I am augmenting / extending, is using ASM, so ideally I would like to know how to instrument the entry and exit of the different loop constructs in java via ASM. However, I would also welcome a recommendation for a good open source de-compiler, as clearly they would have solved the same problem.
EDIT 4: A bit of background/preamble.
“The only way to jump backward in the code is via a loop.” in Peter’s answer isn’t strictly true. You could jump back and forth without it meaning it’s a loop. A simplified case would be something like this:
Of course, this particular example is very artificial and a bit silly. However, making assumptions as to how the source-to-bytecode compiler is going to behave could lead to surprises. As Peter and I have shown in our respective answers, two popular compilers can produce a rather different output (even without obfuscation). It rarely matters, because all of this tends to be optimised rather well by the JIT compiler when you execute the code.
This being said, in the vast majority of cases, jumping backwards will be a reasonable indication as to where a loop starts. Compared with the rest, finding out the entry point of a loop is the “easy” part.
Before considering any loop start/exit instrumentation, you should look into the definitions of what entry, exit and successors are. Although a loop will only have one entry point, it may have multiple exit points and/or multiple successors, typically caused by
breakstatements (sometimes with labels),returnstatements and/or exceptions (explicitly caught or not). While you haven’t given details regarding the kind of instrumentations you’re investigating, it’s certainly worth considering where you want to insert code (if that’s what you want to do). Typically, some instrumentation may have to be done before each exit statement or instead of each successor statement (in which case you’ll have to move the original statement).Soot is a good framework to do this. It has a number of intermediate representations that make bytecode analysis more convenient (e.g. Jimple).
You can build a BlockGraph based on your method body, for example an ExceptionalBlockGraph. Once you’ve decomposed the control flow graph into such a block graph, from the nodes, you should be able to identity the dominators (i.e. blocks that have an arrow coming back to them). This will give you the start of the loop.
You may find something similar done in sections 4.3 to 4.7 of this dissertation.
EDIT:
Following the discussion with @Peter in comments to his answer. Talking the same example:
This time, compiled with the Eclipse compiler (no specific option: simply autocompilation from within the IDE).
This code hasn’t been obfuscated (apart from being bad code, but that’s a different matter).
Here is the result (from
javap -c):There is a loop between 3 and 12 (jumped in starting a 10) and another loop, due to the exception occurring from the division by zero at 8 to 22.
Unlike the
javaccompiler result, where one could make as guess that there was an outer loop between 0 and 22 and an inner loop between 0 and 12, the nesting is less obvious here.EDIT 2:
To illustrate the kind of problems you may get with a less awkward example. Here is a relatively simple loop:
After (normal) compilation within Eclipse,
javap -cgives this:Before doing anything within the loop, you jump straight from 2 to 15. Block 15 to 17 is the header of the loop (the “entry point”). Sometimes, the header block could contain far more instructions, especially if the exit condition involves more evaluation, or if it’s a
do {} while()loop.The concept of “entry” and “exit” of a loop may not always reflect what you’d write sensibly as Java source code (including the fact that you can rewrite
forloops aswhileloops, for example). Usingbreakcan also lead to multiple exit points.By the way, by “block”, I mean a sequence of bytecode into which you can’t jump and out of which you can’t jump in the middle: they’re only entered from the first line (not necessarily from the previous line, possibly from a jump from somewhere else) and exited from the last (not necessarily to the following line, it can jump somewhere else too).
EDIT 3:
It seems that new classes/methods to analyse loops have been added since last time I had looked at Soot, which make it a bit more convenient.
Here is a complete example.
The class/method to analyse (
TestLoop.foo())When compiled by the Eclipse compiler, this produces this bytecode (
javap -c):Here is a program that loads the class (assuming it’s on the classpath here) using Soot and displays its blocks and loops:
Check the Soot documentation for more details on how to load classes. The
Bodyis a model for the body of the loop, i.e. all the statements made from the bytecode. This uses the intermediate Jimple representation, which is equivalent to the bytecode, but easier to analyse and process.Here is the output of this program:
Body:
Blocks:
Loops:
LoopNestTreeusesLoopFinder, which uses anExceptionalBlockGraphto build the list of blocks.The
Loopclass will give you the entry statement and the exit statements. You should then be able to add extra statements if you wish. Jimple is quite convenient for this (it’s close enough to the bytecode, but has a slightly higher level so as not to deal with everything manually). You can then output your modified.classfile if needed. (See the Soot documentation for this.)