I’m thinking about writing a query language for HBase. With it, the user will be able to apply filters, map functions across rows, and aggregate/reduce the data (and more; it’s a domain-specific query language). Imagine the data set is very large, as is often the case with HBase.
My question is: how should I handle the intermediate data between the different filterings, mappings, and aggregations? Should I save it on the filesystem? That seems a bit wasteful. Should I try to compose the functions and do everything in one go?
I realize that it depends a bit on what I want to achieve and what my query language will look like. But how is this general problem usually dealt with? Do you have any tips or insights to share? Are there any good articles/resources out there that deal with this problem?
Pig and Hive both do pretty much this (and both will work on HBase). Their approach is two-fold. First, they try to fit as much as they can into each MapReduce (MR) job. However, this is sometimes simply not possible: a single MR job has only one shuffle, so a group, then a transform, then another group cannot be done in one go. Second, for the intermediate data between jobs, they just write out to HDFS. It’s the simplest way to do it, and you’ll have to be writing to disk anyway for any reasonable amount of data. They just delete the intermediate data after they’re done.
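To make the "fit as much as possible into each job" idea concrete, here is a minimal sketch of a planner that splits a pipeline into jobs. It is a simplified model, not how Pig or Hive actually plan queries: it assumes each job can hold at most one group (the shuffle), that filters and maps can ride along on either side of it, and that every job boundary costs one intermediate write. The names `plan_jobs` and the `(kind, label)` tuples are made up for illustration.

```python
def plan_jobs(ops):
    """Split a list of (kind, label) operations into MapReduce-style jobs.

    Assumed model: a job may contain at most one "group" operation.
    Filters and maps before it run map-side, those after it run
    reduce-side. A second group forces a new job.
    """
    jobs = []
    current = []
    has_group = False
    for kind, label in ops:
        if kind == "group" and has_group:
            # This job already has its shuffle: close it and start a new one.
            jobs.append(current)
            current = []
            has_group = False
        current.append((kind, label))
        if kind == "group":
            has_group = True
    if current:
        jobs.append(current)
    return jobs


pipeline = [
    ("filter", "status == 'active'"),
    ("group", "by user"),
    ("map", "count per user"),
    ("group", "by count"),
]
jobs = plan_jobs(pipeline)
# Under this model, each boundary between jobs is one intermediate
# dataset written to (and later deleted from) the filesystem.
intermediate_writes = max(len(jobs) - 1, 0)
print(len(jobs), "jobs,", intermediate_writes, "intermediate write(s)")
```

With the group, transform, group pipeline from above, this model needs two jobs and one intermediate dataset, while a pipeline of only filters and maps collapses into a single job with nothing written in between.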
Also, are you sure you want to re-invent the wheel? You’ve pretty much just described Pig. It might even be worthwhile to have your language “compile” to Pig Latin.
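As a rough illustration of what "compiling" to Pig Latin could look like, the sketch below turns a list of operations into a chain of Pig Latin statements. Everything about the input format is an assumption: `to_pig_latin`, the `(kind, argument)` tuples, and the `s1`, `s2`, ... alias names are hypothetical, and the source relation is assumed to have been loaded already (for example from HBase) under the alias you pass in. It does no validation or escaping of the arguments.

```python
# Templates for the three Pig Latin operators this sketch supports.
TEMPLATES = {
    "filter": "{out} = FILTER {src} BY {arg};",
    "foreach": "{out} = FOREACH {src} GENERATE {arg};",
    "group": "{out} = GROUP {src} BY {arg};",
}


def to_pig_latin(source_alias, ops):
    """Emit one Pig Latin statement per (kind, arg) operation.

    Each statement reads from the alias produced by the previous one,
    starting from source_alias (assumed to be loaded elsewhere).
    """
    lines = []
    src = source_alias
    for i, (kind, arg) in enumerate(ops, start=1):
        out = f"s{i}"  # hypothetical alias naming scheme
        lines.append(TEMPLATES[kind].format(out=out, src=src, arg=arg))
        src = out
    return lines


script = to_pig_latin("rows", [
    ("filter", "age > 30"),
    ("group", "city"),
])
print("\n".join(script))
```

The appeal of this route is that Pig then decides how to pack the statements into jobs and where to put the intermediate data, so your language only has to worry about its own syntax and semantics.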