I have a task to store a large amount of GPS data and some extra info in a database, and to access it for reporting and some other infrequent tasks.
When I receive a message from a GPS device, it can have a variable number of fields. For example:
Message 1: DeviceId Lat Lon Speed Course DIO1 ADC1
Message 2: DeviceId Lat Course DIO2 IsAlarmOn
Message 3: DeviceId Lat Lon Height Course DIO2 IsAlarmOn
...and so on, up to 20-30 fields.
There is no way to unify the number of fields: different device vendors, different protocols, etc.
Another headache is the size of the database and the need to support as many DB vendors as possible (NHibernate is used).
So I came to the idea of storing messages this way:
Table1 – Tracks
PK – TrackId
TrackStartTime
TrackEndTime
FirstMessageIndex(stores MessageId)
LastMessageIndex(stores MessageId)
DeviceId(not an FK)
Table2 – Messages
PK – MessageId
TimeStamp
FirstDataIndex(stores DataId)
LastDataIndex(stores DataId)
Table3 – MessageData
PK – DataId
double Data
short DataType
All indexes are assigned with hilo. I tuned my queries so NHibernate can handle inserting 3000+k messages very quickly (batching is also used).
I'm happy with the performance at the moment, but I don't know how it will work at a size of 50+ GB or 100+ GB.
I would be very grateful for any tips and hints about my issue and the storage design overall =)
Thanks, Alexey
PS. Sorry for my English =)
In a nutshell, your application, specifically the heterogeneous structure of the messages received from the GPS devices, pushes your design towards an EAV (Entity-Attribute-Value) datastore structure, whereby the Entity is the Message, the Attribute is the "MessageData.DataType" and the Value is systematically a double.
The three-table design you outline in the question, however, seems to depart a bit from a traditional EAV implementation, in the sense that there is an implicit sequence to the way MessageData is stored: all the data points for a given message are sequentially numbered (DataId), and the link from a message to its datapoints is driven by a range of DataId values (FirstDataIndex to LastDataIndex).
That is a bad idea!
There are many problems with that. A notable one is that it introduces an unnecessary bottleneck for the insertion of messages: you can't start inserting a second message until all the datapoints for the previous message have been inserted, otherwise the DataId range would no longer be contiguous.
Another issue is that it makes the relation between message and datapoint difficult to index (the underlying DBMS will not be efficient at it).
==> Suggestion: Make the MessageId a foreign key in the MessageData table. (And possibly drop the DataId PK in the MessageData table altogether, just to save space, at the expense of having to use a composite key to refer to a particular record in this table, for example for maintenance purposes.)
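A minimal sketch of that suggestion, using Python's built-in sqlite3 module purely for illustration (the question uses NHibernate, and the real column types will depend on the target DBMS). The DataType codes, the helper names and the choice of (MessageId, DataType) as the composite key are assumptions, not something stated in the question; that key also assumes a message carries at most one value per DataType.

```python
import sqlite3

# Hypothetical DataType codes; the real mapping is not given in the question.
LAT, LON, SPEED, COURSE = 1, 2, 3, 4

def create_schema(conn):
    # MessageData points back at its message through MessageId,
    # so no FirstDataIndex/LastDataIndex range is needed on Messages.
    conn.executescript("""
        CREATE TABLE Messages (
            MessageId INTEGER PRIMARY KEY,
            TimeStamp TEXT NOT NULL
        );
        CREATE TABLE MessageData (
            MessageId INTEGER NOT NULL REFERENCES Messages(MessageId),
            DataType  INTEGER NOT NULL,
            Data      REAL    NOT NULL,
            PRIMARY KEY (MessageId, DataType)  -- composite key replaces DataId
        );
    """)

def insert_message(conn, message_id, timestamp, fields):
    # fields maps a DataType code to its double value.
    conn.execute("INSERT INTO Messages VALUES (?, ?)", (message_id, timestamp))
    conn.executemany(
        "INSERT INTO MessageData VALUES (?, ?, ?)",
        [(message_id, data_type, value) for data_type, value in fields.items()],
    )

def load_fields(conn, message_id):
    # A plain equality lookup, served by the composite primary key index.
    rows = conn.execute(
        "SELECT DataType, Data FROM MessageData WHERE MessageId = ?",
        (message_id,),
    )
    return dict(rows.fetchall())
```

Because each datapoint row names its own message, messages can be inserted in any order or interleaved, and the lookup is an ordinary indexed equality match rather than a range scan over DataId.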
Another suggestion is to store the most common attributes (datapoints) at the level of the Message table. For example, Lat and Lon, but maybe also Course or some alarms, etc. The reason for having this info right with the message is to optimize queries on the data, by limiting the number of self-joins necessary with the MessageData table.
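To illustrate, here is a rough sketch (Python with the built-in sqlite3 module, for demonstration only) in which Lat, Lon and Course are promoted to nullable columns of Messages. Which attributes to promote, and the function name messages_in_box, are assumptions made for this example. The columns are nullable because, as in Message 2 of the question, a device may not report every field.

```python
import sqlite3

def create_schema(conn):
    conn.executescript("""
        CREATE TABLE Messages (
            MessageId INTEGER PRIMARY KEY,
            TimeStamp TEXT NOT NULL,
            Lat       REAL,   -- promoted common attributes, NULL when absent
            Lon       REAL,
            Course    REAL
        );
        CREATE TABLE MessageData (
            MessageId INTEGER NOT NULL REFERENCES Messages(MessageId),
            DataType  INTEGER NOT NULL,
            Data      REAL    NOT NULL,
            PRIMARY KEY (MessageId, DataType)
        );
    """)

def messages_in_box(conn, lat_min, lat_max, lon_min, lon_max):
    # With Lat/Lon on the message row this is a single-table query.
    # In a pure EAV layout the same filter would need MessageData joined
    # to itself: once for the Lat rows and once for the Lon rows.
    rows = conn.execute(
        """SELECT MessageId FROM Messages
           WHERE Lat BETWEEN ? AND ? AND Lon BETWEEN ? AND ?
           ORDER BY MessageId""",
        (lat_min, lat_max, lon_min, lon_max),
    )
    return [row[0] for row in rows]
```

Messages with a missing Lat or Lon simply never match the filter, since a comparison against NULL is not true in SQL.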
Since both the Messages and the MessageData tables may now contain part of the message, you may also want to rename the latter to MessageDetail, or some such name.
Finally, it may be a good idea to allow for data values other than those of the double type. I anticipate some of the alerts are merely boolean, etc. Aside from allowing you to accept different kinds of datapoints (say, short error message strings…), this may also give you the opportunity to split the datapoints over multiple "detail" tables: one for doubles, one for booleans, one for strings, etc. This approach complicates the schema, in the sense that you then need to build some of these details into the way the queries are produced, but it can provide some potential for performance / scaling gains.
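One way the split could look, again sketched in Python with the built-in sqlite3 module. The table names (DetailDouble, DetailBool, DetailText), the helper names and the routing on the Python value type are all hypothetical choices for this example; in a real application the routing would more likely be driven by the DataType code.

```python
import sqlite3

# Hypothetical detail tables: (table name, SQL column type).
DETAIL_TABLES = (
    ("DetailDouble", "REAL"),
    ("DetailBool", "INTEGER"),
    ("DetailText", "TEXT"),
)

def create_schema(conn):
    for table, sql_type in DETAIL_TABLES:
        conn.execute(
            f"CREATE TABLE {table} ("
            f"MessageId INTEGER NOT NULL, DataType INTEGER NOT NULL, "
            f"Data {sql_type} NOT NULL, PRIMARY KEY (MessageId, DataType))"
        )

def store_value(conn, message_id, data_type, value):
    # bool must be tested before int/float: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        table, stored = "DetailBool", int(value)
    elif isinstance(value, (int, float)):
        table, stored = "DetailDouble", float(value)
    elif isinstance(value, str):
        table, stored = "DetailText", value
    else:
        raise TypeError(f"unsupported value type: {type(value).__name__}")
    conn.execute(f"INSERT INTO {table} VALUES (?, ?, ?)",
                 (message_id, data_type, stored))
    return table

def load_details(conn, message_id):
    # The price of the split: reading a whole message touches every table.
    details = {}
    for table, _ in DETAIL_TABLES:
        rows = conn.execute(
            f"SELECT DataType, Data FROM {table} WHERE MessageId = ?",
            (message_id,),
        )
        for data_type, data in rows:
            details[data_type] = bool(data) if table == "DetailBool" else data
    return details
```

This shows the trade-off mentioned above: each table stays narrow and uniformly typed, but the code that builds queries has to know which table a given kind of datapoint lives in.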