I would like to store data in a queryable format without knowing ahead of time what fields a given packet of data will contain.
The simple/dumb approach seems to be something like a big key-value pair table with a key back to a table of ‘parent’ objects which the data describes.
The data will have the following properties:
- Many pieces of ‘metadata’ will be associated to a single parent object
- The data will always be in key-value pair form
- The data will not be hierarchical (one level of key-value pairs only)
- There will be lots of it, and it will never be purged, only moved to duplicate archive stores if required
For example:
A log file is parsed and its messages are pulled into a defined format, based on some rules, as follows:
- Log/System Name
- Location
- Date
- Time
- Level
- Message
There may be many logs parsed for many different systems. Each system may have different fields.
The Date/Time/Level/Message fields are only known when the rules for parsing the file are created, not when the data store is being built.
How would you go about this? What kind of database/design would you use?
Option 1: Use one of the NoSQL databases like MongoDB. I’m not familiar with these as I live in a mostly SQL Server world. They let you store records as documents, rather than as rows with a fixed set of columns as in relational DBs.
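To make the document idea concrete, here is a minimal sketch in plain Python rather than against any real MongoDB API. It only illustrates the shape of the data: each record carries its own set of fields, and a query matches on whatever fields happen to be present. The field names (`system`, `url`, `duration_s`) and the `find` helper are made up for illustration.

```python
import json

# Two hypothetical parsed log entries from different systems.
# They share some fields but each also has fields the other lacks.
documents = [
    {"system": "web01", "level": "ERROR", "message": "Timeout",
     "url": "/login", "ip": "10.0.0.5"},
    {"system": "db01", "level": "INFO", "message": "Backup done",
     "duration_s": 42},
]

def find(docs, **criteria):
    """Return the documents whose fields equal every key/value in criteria.

    A document that lacks a requested field simply does not match.
    """
    return [d for d in docs
            if all(k in d and d[k] == v for k, v in criteria.items())]

errors = find(documents, level="ERROR")
print(json.dumps(errors, indent=2))
```

A real document store would add indexing and persistence on top of this, but the point stands: no schema change is needed when a new system introduces a new field.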
Option 2: Relational DB
Table: Log {Id (PK), Date, Time, Level, Message}
Table: ExtraFields {Id (PK), FieldName}
Table: AdditionalFields {FieldId (PK), LogId (PK), Value}
Here each parsed entry would get a Log record plus a number of rows in AdditionalFields that link back to it via LogId. You could then load these into a Log object. The ExtraFields table would hold every known field name; if a field doesn’t exist yet when you load a record, you add it. For web logs this might include URL, IP, User-Agent, etc.
Alternatively, you could avoid the ExtraFields table and just put the field name directly in the AdditionalFields table.
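The three-table layout of Option 2 could be sketched roughly as follows. This uses Python's built-in sqlite3 module with an in-memory database purely so the example is runnable; the column types, the UNIQUE constraint on FieldName, the composite primary key, and the helper functions `field_id`, `add_log` and `logs_where` are my assumptions, not part of the design above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Log (
    Id      INTEGER PRIMARY KEY,
    Date    TEXT, Time TEXT, Level TEXT, Message TEXT
);
CREATE TABLE ExtraFields (
    Id        INTEGER PRIMARY KEY,
    FieldName TEXT NOT NULL UNIQUE
);
CREATE TABLE AdditionalFields (
    FieldId INTEGER NOT NULL REFERENCES ExtraFields(Id),
    LogId   INTEGER NOT NULL REFERENCES Log(Id),
    Value   TEXT,
    PRIMARY KEY (FieldId, LogId)
);
""")

def field_id(name):
    """Look up a field name, registering it first if it is new."""
    conn.execute("INSERT OR IGNORE INTO ExtraFields (FieldName) VALUES (?)", (name,))
    row = conn.execute("SELECT Id FROM ExtraFields WHERE FieldName = ?", (name,))
    return row.fetchone()[0]

def add_log(date, time, level, message, **extra):
    """Insert one Log row plus one AdditionalFields row per extra field."""
    cur = conn.execute(
        "INSERT INTO Log (Date, Time, Level, Message) VALUES (?, ?, ?, ?)",
        (date, time, level, message))
    log_id = cur.lastrowid
    for name, value in extra.items():
        conn.execute(
            "INSERT INTO AdditionalFields (FieldId, LogId, Value) VALUES (?, ?, ?)",
            (field_id(name), log_id, str(value)))
    return log_id

def logs_where(name, value):
    """Find logs whose extra field `name` has the given value."""
    return conn.execute("""
        SELECT l.Id, l.Level, l.Message
        FROM Log l
        JOIN AdditionalFields a ON a.LogId = l.Id
        JOIN ExtraFields f ON f.Id = a.FieldId
        WHERE f.FieldName = ? AND a.Value = ?
        ORDER BY l.Id""", (name, value)).fetchall()

add_log("2024-01-01", "12:00:00", "ERROR", "Timeout", URL="/login", IP="10.0.0.5")
add_log("2024-01-01", "12:05:00", "INFO", "Backup done", Duration="42")
print(logs_where("IP", "10.0.0.5"))
```

Note that every extra value is stored as text here, so numeric or date comparisons on those values would need casting. Filtering on several extra fields at once also needs one join (or subquery) per field, which is the usual cost of this kind of key-value layout.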