I’m getting some behavior from the Text constructors that doesn’t really make any sense. Basically, if I construct a Text object from a String, it is not equal to another Text object that I constructed from bytes, even though getBytes() returns the same value for both objects.
So we get weird stuff like this :
//This succeeds
assertEquals(new Text("ACTACGACCA_0"), new Text("ACTACGACCA_0"));
//This succeeds
assertEquals((new Text("ACTACGACCA_0")).getBytes(), (new Text("ACTACGACCA_0")).getBytes());
//This fails. Why?
assertEquals(new Text((new Text("ACTACGACCA_0")).getBytes()), new Text("ACTACGACCA_0"));
This manifests when I’m trying to access a hashmap. Here, I’m trying to do a lookup based on a value returned by org.apache.hadoop.hbase.KeyValue.getRow() :
//This succeeds
assertEquals((new Text("ACTACGACCA_0")).getBytes(), keyValue.getRow());
//This returns a value
hashMap.get(new Text("ACTACGACCA_0"));
//This returns null. Why?
hashMap.get(new Text(keyValue.getRow()));
So what’s going on here, and how do I deal with it? Does this have something to do with encoding?
UPDATE : PROBLEM SOLVED
Thanks to Chris for pointing me in the right direction with this. So, a little background : the keyValue object is captured (using a Mockito ArgumentCaptor) from a call to htable.put(). Basically, I had this chunk of code :
byte[] keyBytes = matchRow.getKey().getBytes();
RowLock rowLock = hTable.lockRow(keyBytes);
Get get = new Get(keyBytes, rowLock);
SetWritable<Text> toWrite = new SetWritable<Text>(Text.class);
toWrite.getValues().addAll(matchRow.getMatches(hTable, get));
Put put = new Put(keyBytes, rowLock);
put.add(Bytes.toBytes(MatchesByHaplotype.MATCHING_COLUMN_FAMILY), Bytes.toBytes(MatchesByHaplotype.UID_QUALIFIER),
SERIALIZATION_HELPER.serialize(toWrite));
hTable.put(put);
where matchRow.getKey() returns a Text object. You see the problem here? I was adding all the bytes of the backing array, including the invalid ones past getLength(). So I created a nice helper function that does this :
public byte[] getValidBytes(Text text) {
    return Arrays.copyOf(text.getBytes(), text.getLength());
}
And changed the first line of that block to this :
byte[] keyBytes = SERIALIZATION_HELPER.getValidBytes(matchRow.getKey());
Problem solved! In retrospect : wow, what a nasty bug! I think what it comes down to is that the behavior of Text.getBytes() is very n00b-unfriendly. Not only does it return something that you may not expect (non-valid bytes), the Text object doesn’t have a function to return only the valid bytes! You would think this would be a common use-case. Maybe they’ll add this in the future?
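For anyone who wants to see the effect without a Hadoop or HBase dependency, here is a minimal sketch using only the JDK. It is not the Text implementation: paddedBacking() is a hypothetical stand-in for Text.getBytes() that simply appends one unused byte, and ByteBuffer is used as the map key only because it compares by content. It shows why a key built from the whole backing array misses in the map while the trimmed copy hits.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ValidBytesSketch {
    // Hypothetical stand-in for Text.getBytes(): the encoded key plus one byte of spare capacity.
    static byte[] paddedBacking(String s) {
        byte[] valid = s.getBytes(StandardCharsets.UTF_8);
        return Arrays.copyOf(valid, valid.length + 1); // trailing byte is 0 and not part of the key
    }

    // Same idea as getValidBytes(Text): keep only the first `length` bytes.
    static byte[] validBytes(byte[] backing, int length) {
        return Arrays.copyOf(backing, length);
    }

    public static void main(String[] args) {
        String key = "ACTACGACCA_0";
        int length = key.getBytes(StandardCharsets.UTF_8).length; // plays the role of Text.getLength()
        byte[] backing = paddedBacking(key);

        Map<ByteBuffer, String> map = new HashMap<>();
        map.put(ByteBuffer.wrap(key.getBytes(StandardCharsets.UTF_8)), "match");

        // The extra trailing byte makes this a different key, so the lookup misses.
        System.out.println(map.get(ByteBuffer.wrap(backing)));                     // prints "null"
        // Trimming to the valid length restores the original key.
        System.out.println(map.get(ByteBuffer.wrap(validBytes(backing, length)))); // prints "match"
    }
}
```

The same reasoning applies to a row key handed to HBase: the padding byte becomes part of the key unless it is trimmed first.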
For the same reason that the following fails:
getBytes() returns the backing byte array, but according to the API, the bytes are only valid up to Text.getLength().
Yes, this does have to do with encoding: the CharsetEncoder.encode method uses a ByteBuffer that is initially allocated at 12 * 1.1 bytes (truncated to 13), but the actual number of valid bytes is still only 12, as you are using solely ASCII characters.
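The over-allocation can be observed with the JDK alone. The sketch below is an illustration, not Hadoop's code: the class and method names are made up for the example, and the exact size of the backing array is an implementation detail of the JDK's encoder, so it is printed without a claimed value. Only the valid length (12 for this ASCII string) is guaranteed.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncoderCapacitySketch {
    // Encode with a fresh UTF-8 CharsetEncoder; the returned buffer may have spare capacity.
    static ByteBuffer encode(String s) {
        try {
            return StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
        } catch (CharacterCodingException e) {
            throw new IllegalStateException(e); // not expected for well-formed input
        }
    }

    // Copy only the valid bytes (assumes a heap buffer whose content starts at array index 0,
    // which is what CharsetEncoder.encode(CharBuffer) returns).
    static byte[] validBytes(ByteBuffer encoded) {
        return Arrays.copyOf(encoded.array(), encoded.limit());
    }

    public static void main(String[] args) {
        ByteBuffer encoded = encode("ACTACGACCA_0");
        System.out.println("valid bytes:   " + encoded.limit());
        System.out.println("backing array: " + encoded.array().length);
        System.out.println(new String(validBytes(encoded), StandardCharsets.UTF_8));
    }
}
```

If the backing array turns out longer than the valid length, anything that consumes the whole array (a row key, a new Text built from it) picks up the extra bytes.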