I’m still relatively new to Hadoop, and I’ve been learning a bit about it by doing some sample exercises, but I have a question about how it is used in practice. A lot of the applications seem to be geared toward batch processing (such as logfile data), but I’m not sure how HBase fits in here.
Is it common to store the logfile data in HBase and then process and output it to some other storage format? Or is it more common to just pass the raw logfiles into Hadoop and then store the output in HBase? I guess my real question is: is HBase typically used as an input or an output of Hadoop, or both?
HBase is for use wherever you need random, low-latency access to the data, whereas most of the rest of the Hadoop ecosystem is batch-oriented, as you mention.
To use your log parsing example, you can process log files stored in HDFS via MapReduce, but what then? Presumably you want to see traffic patterns over time (minutes, hours, days, whatever). If you store the results in HBase with the timestamp as the row key, then you can efficiently query a particular date range (for example, “Show me all of the data from last week”). HBase keeps rows sorted by row key, so it can seek straight to the start of that range and stop at the end. It will return the result much more quickly than classic MapReduce, because it doesn’t need to scan through all of the data from last month, last year, etc., whereas MapReduce would.
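To make the difference concrete, here is a rough sketch in plain Python. It is not the HBase API, just a toy in-memory model that assumes rows are kept sorted by a timestamp row key, which is the property HBase relies on. The names `rows`, `range_scan`, and `full_scan` and the sample values are made up for illustration.

```python
from bisect import bisect_left

# Toy stand-in for a table whose rows are stored sorted by row key.
# Row keys are ISO-style timestamps, which sort lexicographically in time order.
# The hit counts are made-up sample values.
rows = [
    ("2022-03-08T00:00", 90),
    ("2023-02-10T00:00", 110),
    ("2023-03-05T23:00", 130),
    ("2023-03-06T00:00", 150),
    ("2023-03-09T12:00", 170),
    ("2023-03-12T23:00", 160),
    ("2023-03-13T00:00", 140),
]
keys = [key for key, _ in rows]


def range_scan(start_key, stop_key):
    """Return rows with start_key <= key < stop_key.

    Binary search finds both ends of the range, so only the
    matching slice is read, roughly like a scan with start and stop rows.
    """
    lo = bisect_left(keys, start_key)
    hi = bisect_left(keys, stop_key)
    return rows[lo:hi]


def full_scan(start_key, stop_key):
    """Return the same rows by checking every record.

    This mimics a batch job that has to read the whole data set
    and filter it, no matter how small the requested range is.
    """
    return [(key, value) for key, value in rows if start_key <= key < stop_key]


# "Show me all of the data from last week" as a half-open key range.
last_week = range_scan("2023-03-06", "2023-03-13")
for key, hits in last_week:
    print(key, hits)
```

Both functions return the same rows. The difference is that `range_scan` only touches the requested slice, while `full_scan` reads every record, which is the cost that grows as months and years of logs pile up.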