I have a large number of strings that need to be stored in a very compact fashion. Currently I am storing the strings (32 characters each, a-f/0-9) in a HashSet<byte[]>. I am simply calling .getBytes() on each string to get the bytes.
My question is, is there a better way to store this data in a hashset?
A HashSet<byte[]> is broken anyway, as byte[] doesn’t override equals() or hashCode(). Calling getBytes() without specifying a character encoding is generally a bad idea – it’s probably okay if you’ve only got hex digits, but I would still avoid it where possible.

If your strings are always 32 hex digits, that’s basically 16 bytes – have you considered either writing a custom collection for this, or possibly just encapsulating them in an object? Given that for any “normal” collection you’ve got to have an object of some description to represent the element, the object overhead is hard to get around – although with a custom collection you could just have two arrays of longs which you kept in sync. That would be about as compact a representation as you could probably find, but just an object with two long fields or four int fields would be my starting point. Then you can override hashCode and equals and actually get HashSet to work with value equality, instead of just reference identity… and you’ll still be using less data per element than a byte array of 32 bytes.
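As a rough sketch of the "object with two long fields" idea, something like the following might work. The class name HexKey and the method fromHex are made up for illustration, and the code assumes every input is exactly 32 hex digits; treat it as a starting point rather than a finished implementation.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical value type: 32 hex digits packed into two longs (16 bytes of payload).
final class HexKey {
    private final long hi; // first 16 hex digits
    private final long lo; // last 16 hex digits

    private HexKey(long hi, long lo) {
        this.hi = hi;
        this.lo = lo;
    }

    // Assumes exactly 32 hex digits; anything else is rejected.
    static HexKey fromHex(String hex) {
        if (hex.length() != 32) {
            throw new IllegalArgumentException("expected 32 hex digits, got " + hex.length());
        }
        return new HexKey(parse16(hex, 0), parse16(hex, 16));
    }

    // Parses 16 hex digits starting at 'start' into one long (upper-case digits are accepted too).
    private static long parse16(String s, int start) {
        long value = 0;
        for (int i = start; i < start + 16; i++) {
            int digit = Character.digit(s.charAt(i), 16);
            if (digit < 0) {
                throw new IllegalArgumentException("not a hex digit: " + s.charAt(i));
            }
            value = (value << 4) | digit;
        }
        return value;
    }

    // Value equality: two keys are equal when both halves match.
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof HexKey)) {
            return false;
        }
        HexKey that = (HexKey) other;
        return hi == that.hi && lo == that.lo;
    }

    // Must be consistent with equals so that HashSet lookups find equal keys.
    @Override
    public int hashCode() {
        return 31 * Long.hashCode(hi) + Long.hashCode(lo);
    }

    // Renders the key back as 32 lower-case hex digits.
    @Override
    public String toString() {
        return String.format("%016x%016x", hi, lo);
    }

    public static void main(String[] args) {
        Set<HexKey> seen = new HashSet<>();
        seen.add(HexKey.fromHex("0123456789abcdef0123456789abcdef"));
        // A separately created key with the same digits is found, unlike with byte[].
        System.out.println(seen.contains(HexKey.fromHex("0123456789abcdef0123456789abcdef")));
    }
}
```

Because equals and hashCode compare the two long values, a second key built from the same digits is found in the set, which is exactly what a byte[] element would not give you. Each element still carries the usual per-object overhead, so if that turns out to matter, the pair of parallel long arrays is the more compact route.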