I am running some performance tests on an HTML stripper (written in Java): I pass a string (actually HTML content) to a method of the stripper,
and it returns plain text (without HTML tags and meta information).
Here is the concrete test:
public void performanceTest() throws IOException {
    long totalTime = 0; // must be initialized; a local variable has no default value
    File file = new File("/directory/to/ten/different/htmlFiles");
    for (int i = 0; i < 200; ++i) {
        for (File fileEntry : file.listFiles()) {
            HtmlStripper stripper = new HtmlStripper();
            URL url = fileEntry.toURI().toURL();
            InputStream inputStream = url.openStream();
            String html;
            try {
                html = IOUtils.toString(inputStream, "UTF-8");
            } finally {
                inputStream.close(); // release the file handle on every iteration
            }
            // Time only the stripping call, not the file I/O.
            long start = System.currentTimeMillis();
            String text = stripper.getText(html);
            long end = System.currentTimeMillis();
            totalTime = totalTime + (end - start);
            // Each file is stripped 200 times. The printed duration for a given
            // file decreases over the first passes and then becomes constant.
            System.out.println("time needed for stripping current file: " + (end - start));
        }
    }
    // 10 files x 200 passes = 2000 measurements
    System.out.println("Average time for one document: "
            + (totalTime / 2000));
}
However, the duration measured for each file is not the same across the 200 passes: it decreases at first and then becomes constant. I would expect the duration for one and the same file X to always remain the same. Or does the JVM use some caching technique?
Any help would be appreciated.
Thanks in advance
Horace
N.B:
– I am running the tests locally (no remote access, no HTTP) on my machine.
– I am using Java 6 on Ubuntu 10.04.
This is completely normal. The JIT compiler compiles methods to native code and optimizes them more aggressively the more often they are called. (The “constant” your benchmark eventually converges to is the peak of the JIT’s optimization capabilities.)
You cannot get good benchmarks in Java without running the method many times before you start timing at all.
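As a rough illustration of that advice, here is a minimal sketch of a benchmark with a separate, untimed warm-up phase. Since I don't have your HtmlStripper, the stripTags method below is a hypothetical stand-in (a crude regex, not a real HTML parser), and the class name, method names and run counts are likewise assumptions chosen for the example; appropriate warm-up counts depend on your JVM and workload.

```java
final class WarmupBenchmark {

    // Accumulates results so the JIT cannot discard the measured calls as dead code.
    static volatile int sink;

    // Hypothetical stand-in for HtmlStripper.getText(): removes anything
    // that looks like a tag. Not a real HTML parser.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    // Calls stripTags warmupRuns times without timing, then returns the
    // average duration in nanoseconds over timedRuns measured calls.
    static long averageNanos(String html, int warmupRuns, int timedRuns) {
        int consumed = 0;
        // Warm-up phase: give the JIT a chance to compile and optimize the hot path.
        for (int i = 0; i < warmupRuns; i++) {
            consumed += stripTags(html).length();
        }
        // Measurement phase: only these calls contribute to the result.
        long total = 0;
        for (int i = 0; i < timedRuns; i++) {
            long start = System.nanoTime();
            String text = stripTags(html);
            total += System.nanoTime() - start;
            consumed += text.length();
        }
        sink = consumed;
        return total / timedRuns;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello, <b>world</b>!</p></body></html>";
        long avg = averageNanos(html, 10000, 2000);
        System.out.println("Average time per call after warm-up: " + avg + " ns");
    }
}
```

System.nanoTime() is used here instead of System.currentTimeMillis() because a single call is usually far shorter than a millisecond. If you want to watch the compilation happen on a HotSpot JVM, running with -XX:+PrintCompilation prints methods as they get compiled.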