Besides using Hive, is it a good idea for running ad hoc queries on large-scale log data in HDFS, for SQL programmers?
Is there any similar open-source implementation?
Technically, it should not be that complicated to implement. One conceptual problem I see is that the performance behavior of NoSQL engines is fundamentally different from what the MySQL engine expects from its storage engines. Specifically, they offer good random access but are not that efficient at full or range scans. The question is whether it will be possible to convey all these costs to the optimizer. This applies to any RDBMS engine, not just MySQL: many of them have a concept of pluggable storage engines, with differing levels of flexibility and documentation.
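To make the cost question concrete, here is a minimal sketch in Python. It is not MySQL's actual cost model: `EngineCosts`, `choose_access_path`, and all the numbers are made up for illustration. It only shows how the same query can call for a different access path depending on the per-row costs the storage engine reports.

```python
from dataclasses import dataclass


@dataclass
class EngineCosts:
    """Hypothetical per-row costs a storage engine would report to the optimizer."""
    random_read: float  # cost of fetching one row by key
    scan_row: float     # cost of reading one row during a full or range scan


def choose_access_path(costs: EngineCosts, table_rows: int, matching_rows: int) -> str:
    """Pick the cheaper of: one key lookup per matching row, or scanning every row."""
    lookup_cost = matching_rows * costs.random_read
    scan_cost = table_rows * costs.scan_row
    return "key_lookups" if lookup_cost <= scan_cost else "full_scan"


# Made-up numbers: an engine with cheap scans vs. a key-value store with costly scans.
btree_like = EngineCosts(random_read=10.0, scan_row=1.0)
kv_store = EngineCosts(random_read=2.0, scan_row=5.0)

print(choose_access_path(btree_like, table_rows=1000, matching_rows=200))
print(choose_access_path(kv_store, table_rows=1000, matching_rows=200))
```

If the optimizer assumed the first profile while actually talking to the second kind of engine, it would pick the scan and pay far more than necessary.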
I think that for such an integration to be efficient, we need to be able to push predicates down to the NoSQL engine for full and range scans. I am not 100% sure that MySQL supports this at the level of the storage engine interface.
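A rough illustration of what predicate pushdown buys, as a toy Python sketch. `ToyStore` and both query functions are hypothetical and do not correspond to any MySQL or NoSQL API. The store counts how many rows it hands back to the SQL layer, which is the cost pushdown is meant to reduce.

```python
from typing import Callable, Iterable, Iterator, Optional

Row = dict
Predicate = Callable[[Row], bool]


class ToyStore:
    """Hypothetical storage engine that can optionally filter rows during a scan."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self.rows = list(rows)
        self.rows_returned = 0  # rows shipped from the store to the SQL layer

    def scan(self, predicate: Optional[Predicate] = None) -> Iterator[Row]:
        for row in self.rows:
            if predicate is None or predicate(row):
                self.rows_returned += 1
                yield row


def query_without_pushdown(store: ToyStore, predicate: Predicate) -> list:
    # The store returns every row; the SQL layer filters afterwards.
    return [row for row in store.scan() if predicate(row)]


def query_with_pushdown(store: ToyStore, predicate: Predicate) -> list:
    # The predicate is evaluated inside the store; only matches are returned.
    return list(store.scan(predicate))


logs = [{"id": i, "status": 500 if i % 10 == 0 else 200} for i in range(100)]


def is_error(row: Row) -> bool:
    return row["status"] == 500


plain, pushed = ToyStore(logs), ToyStore(logs)
same_result = query_without_pushdown(plain, is_error) == query_with_pushdown(pushed, is_error)
print(same_result, plain.rows_returned, pushed.rows_returned)
```

Both variants return the same rows, but without pushdown every row crosses the engine boundary before being filtered.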
Another serious problem I see with this approach is that MySQL does not have parallel query, and therefore cannot be very good at processing big data.
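For readers unfamiliar with the term, parallel query roughly means splitting one query across partitions of the data and merging the partial results. Below is a minimal Python sketch of that shape. `count_matching` and `parallel_count` are illustrative names, and threads are used only to keep the example short (in CPython they will not speed up CPU-bound work).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def count_matching(partition: Sequence[int], predicate: Callable[[int], bool]) -> int:
    """Partial aggregate: count matching rows in one partition."""
    return sum(1 for row in partition if predicate(row))


def parallel_count(partitions: Sequence[Sequence[int]],
                   predicate: Callable[[int], bool],
                   workers: int = 4) -> int:
    """Run the partial aggregate on each partition concurrently, then merge."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda part: count_matching(part, predicate), partitions)
        return sum(partials)


partitions = [[1, 2, 3], [4, 5, 6], [7, 8]]
print(parallel_count(partitions, lambda x: x % 2 == 0))
```

A single-threaded executor has to walk the partitions one after another, which is the limitation described above.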