So here’s the situation: I’ve created a SetWritable class, basically a wrapper for java.util.Set that implements the Writable interface. I have an HBase table with one column family and one column, and the values for that column are serialized SetWritable objects. Right now, if I want to add an element to the set, I need to pull the row from HBase, deserialize it into a SetWritable, add my element, serialize the SetWritable, and then push it back to HBase. That means a full read-modify-write round trip between my mapper and HBase for every element I add. Since I’m working with large data sets, this could kill my performance.
What I’d like to do is just send the new element over to HBase, and have some code on the HBase server that deserializes the SetWritable, adds the element, serializes the SetWritable, then commits it. Is this possible? Can coprocessors help?
Another idea: instead of serializing my set into one column, I could have a column for each known element of the set. One downside: I could wind up with hundreds of thousands (or millions) of columns. Is this a problem?
Serialization, locally or remotely, is not the right way to go. Use the column qualifier to store your values and you get exactly the behavior that you want.
If you use the column qualifier as your set element, then HBase can store your sets sparsely. For example, you could have a million elements in one set and a disjoint million elements in another set, and HBase would store only two million cells.
Adding or deleting set elements is easy: adding is a put(row key, column family, column qualifier), and removing is a delete(row key, column family, column qualifier). To retrieve the whole set, you can just read the row and iterate over its column qualifiers.
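To make the layout concrete, here is a minimal in-memory sketch of the qualifier-as-element idea. It is only a model: the class `QualifierSetModel` and its methods are hypothetical and use plain Java collections, not the HBase client API. Against a real table you would issue `Put` and `Delete` operations instead, but the shape of the data is the same.

```java
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory model of one column family: row key -> (qualifier -> cell value).
// Hypothetical stand-in for an HBase table; this is NOT the HBase client API.
class QualifierSetModel {
    // The cell value carries no information; membership lives in the qualifier.
    private static final byte[] EMPTY = new byte[0];

    private final Map<String, TreeMap<String, byte[]>> rows = new TreeMap<>();

    // Models a put: writing the same qualifier twice still leaves one cell,
    // so adding an element that is already present changes nothing.
    void add(String rowKey, String element) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(element, EMPTY);
    }

    // Models a delete of a single qualifier.
    void remove(String rowKey, String element) {
        TreeMap<String, byte[]> row = rows.get(rowKey);
        if (row != null) {
            row.remove(element);
            if (row.isEmpty()) {
                rows.remove(rowKey); // a row with no cells does not exist
            }
        }
    }

    // Models reading the whole row: the set is the collection of qualifiers.
    NavigableSet<String> members(String rowKey) {
        TreeMap<String, byte[]> row = rows.get(rowKey);
        return row == null ? new TreeSet<>() : new TreeSet<>(row.keySet());
    }

    // Total cells stored: only elements that are present cost anything.
    int storedCells() {
        int total = 0;
        for (TreeMap<String, byte[]> row : rows.values()) {
            total += row.size();
        }
        return total;
    }

    public static void main(String[] args) {
        QualifierSetModel table = new QualifierSetModel();
        table.add("set1", "apple");
        table.add("set1", "banana");
        table.add("set2", "cherry");
        table.remove("set1", "apple");
        System.out.println(table.members("set1"));
        System.out.println(table.storedCells());
    }
}
```

Note that neither `add` nor `remove` needs to read the existing set first, which is what removes the read-modify-write round trip from the mapper.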
It is not even that difficult to modify this approach to use counts instead of binary membership: you just use the atomic increment operation: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#increment%28org.apache.hadoop.hbase.client.Increment%29
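As a rough illustration of the counting variant, the sketch below keeps a long counter under each qualifier and bumps it atomically. Again, this is an in-memory model with hypothetical names (`QualifierCountModel`, `increment`, `count`), not the HBase client; it assumes that `ConcurrentHashMap.merge` is an acceptable local stand-in for the server-side atomic increment linked above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory model of counters stored per qualifier: row key -> (qualifier -> count).
// Hypothetical stand-in for an HBase table; this is NOT the HBase client API.
class QualifierCountModel {
    private final Map<String, ConcurrentHashMap<String, Long>> rows = new ConcurrentHashMap<>();

    // Models an atomic increment: adds delta to the counter for this element
    // and returns the new value. A missing counter starts from zero.
    long increment(String rowKey, String element, long delta) {
        return rows.computeIfAbsent(rowKey, k -> new ConcurrentHashMap<>())
                   .merge(element, delta, Long::sum);
    }

    // Reads the current count; an element that was never incremented counts as 0.
    long count(String rowKey, String element) {
        Map<String, Long> row = rows.get(rowKey);
        return row == null ? 0L : row.getOrDefault(element, 0L);
    }

    public static void main(String[] args) {
        QualifierCountModel table = new QualifierCountModel();
        table.increment("bag1", "apple", 1);
        table.increment("bag1", "apple", 2);
        System.out.println(table.count("bag1", "apple"));
    }
}
```

As with plain membership, the caller never reads the old value before writing, so concurrent mappers can update the same element without coordinating.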