I am developing an application that will be integrated with thousands of sensors, each sending information at 15-minute intervals. Let’s assume that the data format is the same for all sensors. What is the best strategy for storing this data so that everything is archived (and remains accessible) and the growing size of the data does not have a negative impact?
The question is about general database design, I suppose, but I would like to mention that I am using Hibernate (with Spring Roo), so perhaps there is something already out there that addresses it.
Edit: The sensors are dumb and off the shelf. It is not possible to extend them. In the case of a network outage, all information is lost. Since the sensors work over GPRS, this scenario is somewhat unlikely (the GPRS provider here in Sweden is a rather good one, but yes, it can go down and one can do nothing about it).
A queuing mechanism was foremost in consideration, and Spring Roo provides easy-to-work-with prototype code based on ActiveMQ.
Let’s assume you have 10,000 sensors sending information every 15 minutes. For better performance on the database side, you may have to partition your database, possibly by date/time, sensor type, category, or some other factor. The right choice also depends on how you will query your data.
http://en.wikipedia.org/wiki/Partition_(database)
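To get a feel for the numbers, here is a minimal sketch that estimates the daily row count and derives a per-month partition name from a reading's timestamp. The class name, the `sensor_reading_YYYY_MM` naming scheme, and the choice of monthly partitions in UTC are assumptions for illustration only; your database's own partitioning feature would normally handle the routing.

```java
import java.time.Instant;
import java.time.YearMonth;
import java.time.ZoneOffset;

class ReadingPartitions {

    // Rows produced per day when every sensor reports once per interval.
    static long rowsPerDay(int sensors, int intervalMinutes) {
        return (long) sensors * (24 * 60 / intervalMinutes);
    }

    // Hypothetical naming scheme: one partition per calendar month (UTC).
    static String partitionFor(Instant readingTime) {
        YearMonth month = YearMonth.from(readingTime.atZone(ZoneOffset.UTC));
        return String.format("sensor_reading_%04d_%02d",
                month.getYear(), month.getMonthValue());
    }

    public static void main(String[] args) {
        // 10,000 sensors * 96 readings per day
        System.out.println(rowsPerDay(10_000, 15)); // prints 960000
        System.out.println(partitionFor(Instant.parse("2011-03-07T10:15:00Z")));
        // prints sensor_reading_2011_03
    }
}
```

At roughly 960,000 rows per day, a monthly partition would hold on the order of 29 million rows, which is one reason date-based partitioning is a natural first candidate if most queries are limited to a time range.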
The other bottleneck would be your Java/Java EE application itself. This depends on your business requirements, for example: are all 10,000 sensors going to send information at the same time? It also depends on what architecture your Java application is going to follow. You will have to read articles on high scalability and performance.
Here is my recommendation for a Java/Java EE solution.
Instead of a single application, have a cluster of applications receiving the data.
Have a controller application that decides which sensor sends data to which application instance in the cluster. An application instance may pull data from a sensor, or a sensor may push data to an application instance, but the controller is the one that decides which application instance is linked to which set of sensors. This controller must be dynamic, so that sensors can be added, removed, or updated, and application instances can join or leave the cluster at any time. Make sure that you build some failover capability into your controller.
So if you have 10,000 sensors and 10 application instances in the cluster, you have 1,000 sensors linked to each application instance at any given time. If you still want better performance, you can have, say, 20 application instances in the cluster, and you will have 500 sensors linked to each instance.
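The split described above can be sketched as a simple round-robin assignment. This is only an illustration of the arithmetic, not a full controller: the class name, the `app-N` instance names, and the use of integer sensor ids are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SensorAssignment {

    // Spreads sensor ids over the instances round-robin.
    // Returns a map of instance name -> sensor ids it is responsible for.
    static Map<String, List<Integer>> assign(List<Integer> sensorIds,
                                             List<String> instances) {
        if (instances.isEmpty()) {
            throw new IllegalArgumentException("at least one instance is required");
        }
        Map<String, List<Integer>> plan = new TreeMap<>();
        for (String instance : instances) {
            plan.put(instance, new ArrayList<>());
        }
        for (int i = 0; i < sensorIds.size(); i++) {
            String owner = instances.get(i % instances.size());
            plan.get(owner).add(sensorIds.get(i));
        }
        return plan;
    }

    public static void main(String[] args) {
        List<Integer> sensors = new ArrayList<>();
        for (int id = 0; id < 10_000; id++) sensors.add(id);

        List<String> instances = new ArrayList<>();
        for (int n = 0; n < 10; n++) instances.add("app-" + n); // hypothetical names

        Map<String, List<Integer>> plan = assign(sensors, instances);
        System.out.println(plan.get("app-0").size()); // prints 1000
    }
}
```

Calling `assign` again whenever an instance joins or leaves gives the dynamic behaviour, at the cost of moving many sensors between instances on every change. A scheme such as consistent hashing would move fewer, but that goes beyond this sketch.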
Application instances can be hosted on the same machine or on multiple machines, so that vertical as well as horizontal scalability is achieved. Each application instance will be multithreaded and have local persistence. This avoids a bottleneck at the main database server and decreases your transaction response time. The local persistence can be SAN file(s), a local RDBMS (like Java DB), or even an MQ. If you persist locally in a database, you can use Hibernate for that as well.
Asynchronously move data from local persistence to the main database. How you do this depends on how you have persisted the data locally.
If you use file-based persistence, you need a separate thread that reads data from the file and inserts it into the main database repository.
If you use a local database, this thread can use Hibernate to read data locally and insert it into the main database repository.
If you use an MQ, you can have a thread or a separate application move data from the queue to the main database repository.
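Whichever option you pick, the mover thread has roughly the same shape. Below is a minimal sketch in which an in-memory `BlockingQueue` stands in for the local persistence and a `Consumer` stands in for the batch insert into the main database. Both stand-ins, the class name, and the string format of a reading are assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

class ReadingForwarder implements Runnable {
    private final BlockingQueue<String> localBuffer;  // stand-in for local persistence
    private final Consumer<List<String>> mainStore;   // stand-in for the main DB insert
    private final int batchSize;

    ReadingForwarder(BlockingQueue<String> localBuffer,
                     Consumer<List<String>> mainStore, int batchSize) {
        if (batchSize < 1) throw new IllegalArgumentException("batchSize must be >= 1");
        this.localBuffer = localBuffer;
        this.mainStore = mainStore;
        this.batchSize = batchSize;
    }

    // Moves at most one batch and returns the number of readings forwarded.
    // Note: readings are removed before the insert, so a failing insert loses
    // the batch. Durable local storage should delete only after a successful commit.
    int forwardBatch() {
        List<String> batch = new ArrayList<>(batchSize);
        localBuffer.drainTo(batch, batchSize);
        if (!batch.isEmpty()) {
            mainStore.accept(batch);
        }
        return batch.size();
    }

    // Background loop: forward batches, sleep briefly when there is nothing to do.
    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                if (forwardBatch() == 0) Thread.sleep(500);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt and exit
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> buffer = new LinkedBlockingQueue<>();
        List<String> mainDb = new ArrayList<>();
        ReadingForwarder forwarder = new ReadingForwarder(buffer, mainDb::addAll, 3);
        for (int i = 1; i <= 5; i++) buffer.add("sensor-" + i + ":21.5");

        System.out.println(forwarder.forwardBatch()); // prints 3
        System.out.println(forwarder.forwardBatch()); // prints 2
        System.out.println(mainDb.size());            // prints 5
    }
}
```

In a running application instance you would start the loop with something like `new Thread(forwarder).start()` instead of calling `forwardBatch()` by hand, and the store used from that thread would need to be safe for concurrent use.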
The drawback of this solution is that there will be some lag between a sensor reporting data and that data appearing in the main database.
The advantage of this solution is that it gives you high performance, scalability, and failover.