For one of my applications, the following function has to be called very often. This function takes up a lot of CPU and hence I am wondering if you know how to improve the performance.
The code counts the occurrences of a combination of four characters. During testing, I found that the number of entries in the map are around 100. The length of text is in the range of 100 to 800. The initial size of 200 is a guess and the code seems to run faster than without specifying an initial size. It is probably not the optimal value, though.
private Map<String, Integer> getTetagramCount(final String text) {
final Map<String, Integer> cipherTetagrams = new HashMap<String, Integer>(200);
for (int i = 0; i < text.length() - 4; i++) {
final String tet = text.substring(i, i + 4);
final Integer count = cipherTetagrams.get(tet);
if (count != null) {
cipherTetagrams.put(tet, count + 1);
} else {
cipherTetagrams.put(tet, 1);
}
}
return cipherTetagrams;
}
I do a lot of work in NLP and machine learning, so I have to do this sort of thing all the time, and there are a ton of opportunities for optimization.
A few points to consider:
First of all, you’re getting killed by the standard JDK HashMap class. It’s a nice container for general-purpose computing, but it’s terrible for high-performance computing. For each entry into your collection (a four-character string (8 bytes) and an integer (4 bytes), the standard java HashMap will consume:
So your tiny little 12 bytes of data becomes 64 bytes. And that’s before the HashMap has allocated an array of hash values to use during its lookup operations. Keep in mind that all those tiny little objects mean more work for the GC, but more importantly, it means that your objects span a larger swath of main memory and are less likely to fit within the CPU cache. When you have lots of cache misses, you lose performance.
NOTE: A commenter reminded me that all the substrings will share the same underlying character array, which is a good point that I’d forgotten about. But still, that means each map entry goes from being 64 bytes down to 44 bytes. Which is still a shame, when there should only be 12 bytes.
Boxing and unboxing all those integer values causes your code to run more slowly and to consume more memory. In most cases, we don’t really care about that, and the vanilla HashMap implementation is fine, even with its mandatory boxing and greedy memory consumption. But in your case, if this code is run within a tight inner loop, we’d rather have a specialized class that knows its values will always be integers and eliminates the need for boxing.
If you dig down into the JDK source code, you’ll see that your code will end up invoking string’s
hashCode()andequals()methods twice. Once for themap.get()and once for themap.put(). But there’s a different kind of collection called a HashBag that can perform lookup, insertion, and count incrementation with only one lookup. A "bag" collection is kind of like a "set", except that it can contain duplicates, and it keeps track of how many duplicates there are. For each of your tetragrams, you’d just callbag.put(tetragram)without having to first retrieve and update a count. Unfortunately, there are no bag implementations in the JDK, so you’ll need to find one elsewhere, or write one yourself.Luckily, your tetragrams can be losslessly encoded as
longvalues (since each java character is 2 bytes wide, and alonggives you eight bytes to work with). You can therefore iterate through the character array and convert each tetragram to along, and avoid all the overhead of constructing so many tiny strings. Then, you can keep your results in aLongIntHashMap(from the Trove library). This will be much faster than your current implementation, because you can avoid creating all those tiny little string objects.Although the Trove
LongIntHashMapis quite excellent, it isn’t quite as good as aLongHashBagwould be. There’s noequalsinvocation (since longs can be compared with the == operator), but you’ll still pay the price of twohashCodeinvocations. If you want to get really aggressive with optimization, you could look at the source code of theLongIntHashMapand figure out how to modify it into aLongHashBag. It’s not that difficult, and ultimately, that’s exactly what I’ve done in my own code.Update 1:
Okay, here’s a little bit of code:
Update 2:
I just did some testing of various solutions, and I was about to get about a 25% performance improvement using the
LongHashBagapproach instead of the standardHashMapapproach.However, I was about to get a 300% improvement by recycling the resultant objects. Basically, instead of doing this:
…I’m now doing this…
The calling code is responsible for creating the LongHashBag object and ensuring that we’re done with it by the time we call the count method again.
But it would also work to do this…
… which will add a little bit of overhead for maintaining the pool. And the calling code would have to remember to put the bag back into the pool when it finishes using it. But the performance benefits could definitely be worth it.
Incidentally, these are exactly the kinds of tricks I use every day. Object pooling has become one of my most reliable tricks for improving performance.
But like I said, recycling those objects yields a 300% performance improvement.