Is there a way to run real time analytics with BigQuery? I used the CSV upload option that kicks off a job and loads the data in an offline mode which can be analyzed once the load completes. But in the announcement regarding BigQuery there are mentions of using BigQuery for realtime analytics. How can this be achieved? Can we append (no updates) data from Google Cloud database to BigQuery in a trickle mode for realtime analytics?
As a sidenote, I noticed that BigQuery CSV data loads are about an order of magnitude slower than LucidDB and InfiniDB running on my local PC using a 10GB data file. It took 34 minutes for the BigQuery job to complete vs. 5 minutes on InfiniDB and LucidDB. The query execution times (for simple aggregates) are twice as slower on BigQuery compared to InfiniDB (6 seconds vs. 3 seconds against the 10GB file loaded with approx. 30+ Million records) but better than LucidDB.
There is currently no way to directly stream data one record at a time with BigQuery. The only supported ingestion methods are to import csv files from either Google Cloud storage, or directly via multi-part mime POST request. Quota for ingestion are 2 imports/min, 1,000 import requests per day, 100GB total data per import request (see: https://developers.google.com/bigquery/docs/quota-policy ).
There’s a distinction to be made between “real time analytics” (ability to run very quick ad-hoc queries over massive data) and the constant collection of “real time data” (which key-value datastores are better suited to support). BigQuery’s current ingestion mechanism does not support constant single record updates, but it does allows developers to run queries very quickly over massive datasets.
While Stack Overflow is not the most appropriate place for benchmark discussion, I would add that it’s difficult to benchmark ingestion and querying speeds without understanding the schema of the data you used. While 10Gb is large amount of data, it would be interesting to see a comparison of ingestion and query speed over much much larger datasets – say, 10 terabytes or more.