I’m sorry if this question is basic (I’m new to NoSQL). I have a large mathematical process that I’m splitting up so that different servers each process a part and send the results to an HBase database. Each server that computes the data is also an HBase region server and runs Thrift.
I was thinking of having each server process its data and then update HBase locally (via Thrift). I’m not sure this is the best approach, because I don’t fully understand how the master (name) node will handle the upload and splitting.
I’m wondering what the best practice is for uploading large amounts of data (in total, I expect several million rows). Is it okay to send it to the region servers, or should everything go through the master?
From this blog post,
I am assuming you use the Thrift interface directly. In that case, even if you issue a mutation from a particular regionserver, that regionserver only acts as a client. It contacts the ZooKeeper quorum, looks up which region (and which regionserver) holds each row key in the meta table, and then sends the write there, exactly as it would if the write came from any other machine. The Master assigns regions to regionservers, but the data itself does not pass through it.
Both are the same. There is no such thing as choosing to write to the local regionserver: the region that owns a row is determined by its row key, and the client has to look that up before it can write.
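To make the routing concrete, here is a toy model in Python. It is not the HBase or Thrift API: the region layout, the server names, and the functions `locate_region` and `route_writes` are all made up for illustration. It only shows that the destination of a write is decided by the row key and the region boundaries, not by the machine that issues the write.

```python
from bisect import bisect_right

# Hypothetical region layout: (start_key, hosting_server), sorted by start key.
# The first region starts at the empty key, so every row key falls in some region.
REGIONS = [
    ("", "server-a"),
    ("g", "server-b"),
    ("p", "server-c"),
]


def locate_region(row_key, regions=REGIONS):
    """Return the server hosting the region whose key range contains row_key."""
    start_keys = [start for start, _ in regions]
    # bisect_right gives the position of the first start key greater than
    # row_key; the region just before that position owns the row.
    index = bisect_right(start_keys, row_key) - 1
    return regions[index][1]


def route_writes(row_keys, regions=REGIONS):
    """Group row keys by the server that would receive each write."""
    routed = {}
    for row_key in row_keys:
        routed.setdefault(locate_region(row_key, regions), []).append(row_key)
    return routed


if __name__ == "__main__":
    # The same grouping results no matter which machine runs this code.
    print(route_writes(["apple", "melon", "zebra", "banana"]))
```

In this model a worker that happens to sit on `server-a` still ends up sending "melon" to `server-b`, which is why computing on a regionserver does not by itself make the writes local.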
If you are running a Hadoop MapReduce job and using the Java API for it, then you can use HFileOutputFormat to write HFiles directly and bulk-load them, without going through the HBase write API (TableOutputFormat, by contrast, still issues ordinary writes through the API). It is about 10x faster than using the API.