I have been experimenting with using graphs to analyze big data. It’s been working great and has been really fun, but I’m wondering what to do as the data gets bigger and bigger.
Let me know if there’s a better option, but I thought of trying HBase because it scales horizontally and I can use Hadoop to run analytics on the graph (most of my code is already written in Java). However, I’m unsure how to structure a graph in a NoSQL database. I know each node can be an entry in the database, but I’m not sure how to model edges and attach properties to nodes and edges (node names, attributes, PageRank scores, edge weights, etc.).
Since HBase/Hadoop are modeled after Bigtable and MapReduce, I suspect there is a way to do this, but I’m not sure how. Any suggestions?
Also, does what I’m trying to do make sense, or are there better solutions for big-data graphs?
You can store an adjacency list in HBase/Accumulo in a column-oriented fashion. I’m more familiar with Accumulo (HBase terminology might be slightly different), so you might use a schema similar to:
Where CF is the column family and CFQ is the column qualifier within that family.
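To make the adjacency-list idea concrete, here is a minimal in-memory sketch in Java. It does not use the HBase or Accumulo client API; a `TreeMap` stands in for the sorted key-value table. The key layout (row key = source vertex, CF = edge type, CFQ = destination vertex, value = edge weight) is an assumption for illustration, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for a sorted key-value table (NOT the HBase/Accumulo API).
// Assumed layout: row key = source vertex, CF = edge type,
// CFQ = destination vertex, value = edge weight.
final class AdjacencyListSketch {
    // Separator between key parts; assumes vertex ids and edge types never contain it.
    private static final char SEP = '\0';

    private final TreeMap<String, String> table = new TreeMap<>();

    private static String key(String row, String cf, String cfq) {
        return row + SEP + cf + SEP + cfq;
    }

    // One edge becomes one cell: (src, edgeType, dst) -> weight.
    void addEdge(String src, String edgeType, String dst, double weight) {
        table.put(key(src, edgeType, dst), Double.toString(weight));
    }

    // A prefix scan over one row and column family returns all neighbors,
    // in sorted order, because the keys are stored sorted.
    List<String> neighbors(String src, String edgeType) {
        String prefix = src + SEP + edgeType + SEP;
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : table.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // left the range of keys sharing this prefix
            }
            out.add(e.getKey().substring(prefix.length()));
        }
        return out;
    }

    // Point lookup of a single edge's weight; null if the edge does not exist.
    Double weight(String src, String edgeType, String dst) {
        String v = table.get(key(src, edgeType, dst));
        return v == null ? null : Double.valueOf(v);
    }
}
```

With this layout, fetching all outgoing edges of a vertex is a single-row read, which is the access pattern most graph traversals need. Incoming edges would require either a second, reversed table or a separate column family, depending on your queries.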
You might also store node/vertex properties as separate rows using something like:
The PropertyValue could be stored either in the CFQ or in the Value.
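The two placements of the property value can be sketched as follows, again with a `TreeMap` standing in for the table rather than the real HBase/Accumulo API. The layout (row key = vertex, CF = property name) and all class and method names are assumptions for illustration. Putting the value in the Value gives one value per property; putting it in the CFQ allows several values for the same property.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for a sorted key-value table (NOT the HBase/Accumulo API).
// Assumed layout: row key = vertex id, CF = property name.
final class VertexPropertySketch {
    // Separator between key parts; assumes ids and names never contain it.
    private static final char SEP = '\0';

    private final TreeMap<String, String> table = new TreeMap<>();

    private static String key(String row, String cf, String cfq) {
        return row + SEP + cf + SEP + cfq;
    }

    // Option 1: property value in the Value, empty CFQ (single-valued property).
    void setProperty(String vertex, String name, String value) {
        table.put(key(vertex, name, ""), value);
    }

    String getProperty(String vertex, String name) {
        return table.get(key(vertex, name, ""));
    }

    // Option 2: property value in the CFQ, empty Value (multi-valued property).
    void addValue(String vertex, String name, String value) {
        table.put(key(vertex, name, value), "");
    }

    // Prefix scan over (vertex, name) collecting every non-empty CFQ.
    List<String> getValues(String vertex, String name) {
        String prefix = vertex + SEP + name + SEP;
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : table.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // left the range of keys sharing this prefix
            }
            String cfq = e.getKey().substring(prefix.length());
            if (!cfq.isEmpty()) {
                out.add(cfq);
            }
        }
        return out;
    }
}
```

A value such as a PageRank score that is overwritten on each iteration fits option 1, while something like a set of tags fits option 2.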
From a graph-processing perspective, as mentioned by @Arnon Rotem-Gal-Oz, you could look at Apache Giraph, which is an open-source implementation of Google’s Pregel. Pregel is the system Google uses for large-scale graph processing.
Using HBase/Accumulo as input to Giraph was recently (7 Mar 2012) submitted as a feature request for Giraph: HBase/Accumulo Input and Output formats (GIRAPH-153).
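For a feel of the Pregel model mentioned above, here is a rough single-machine sketch of PageRank written as vertex-centric supersteps in plain Java. It does not use the Giraph API; the class name, method name, and parameters are hypothetical. It assumes every vertex appears as a key in the adjacency map, and vertices without outgoing edges simply send no messages, so their rank is not redistributed.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Single-machine illustration of Pregel-style supersteps (NOT the Giraph API).
final class PregelStylePageRank {

    // outEdges maps each vertex to its outgoing neighbors.
    // Assumes every vertex in the graph is present as a key.
    static Map<String, Double> run(Map<String, List<String>> outEdges,
                                   int supersteps, double damping) {
        int n = outEdges.size();
        Map<String, Double> rank = new HashMap<>();
        for (String v : outEdges.keySet()) {
            rank.put(v, 1.0 / n); // uniform starting rank
        }

        for (int step = 0; step < supersteps; step++) {
            // Message phase: each vertex sends rank / outDegree to every neighbor.
            Map<String, Double> inbox = new HashMap<>();
            for (String v : outEdges.keySet()) {
                inbox.put(v, 0.0);
            }
            for (Map.Entry<String, List<String>> e : outEdges.entrySet()) {
                List<String> targets = e.getValue();
                if (targets.isEmpty()) {
                    continue; // dangling vertex: sends nothing in this sketch
                }
                double share = rank.get(e.getKey()) / targets.size();
                for (String t : targets) {
                    inbox.merge(t, share, Double::sum);
                }
            }
            // Compute phase: each vertex updates its rank from the messages it received.
            for (String v : outEdges.keySet()) {
                rank.put(v, (1.0 - damping) / n + damping * inbox.get(v));
            }
        }
        return rank;
    }
}
```

In a real Pregel-style system the two phases run in parallel across machines, with vertices partitioned among workers and the adjacency lists read from storage such as the tables described above.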