We are building a document storage solution. To comply with local regulations, each document must carry a lot of extra metadata, ranging from basic data such as title or description to dates of relevant events and disposition and classification rules.
I’ve seen different types of solutions, but none convinces me:
1. Tables that gain a column whenever a new metadata slot is added (so they have as many columns as there are metadata items associated with the documents).
2. Tables with a lot of spare generic columns. Very similar to 1, but the tables don’t grow (so fewer permissions are needed).
3. A table of document IDs, metadata keys and metadata values.
4. Like 3, but with a separate table of metadata definitions, so the metadata keys are replaced by metadata IDs. We used this solution in the past; the tables end up with millions of rows.
5. A text field in the document table, or in an associated table, that stores XML or another structured format holding all the metadata as key-value pairs.
I’m biased towards number 5, combined with a parallel full-text index (Lucene.Net? Something else?) for searching the relevant metadata (not everything has to be “searchable”).
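To make the idea concrete, here is a rough sketch of number 5 in Python with SQLite. Everything in it is an assumption for illustration: the table and column names, the `SEARCHABLE_KEYS` set, and above all the plain `search_index` table, which only stands in for a real full-text engine such as Lucene.Net and does exact word matching, nothing more.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical choice of which metadata keys get indexed for search.
SEARCHABLE_KEYS = {"title", "classification"}


def metadata_to_xml(metadata):
    """Serialize a dict of metadata into one XML string of key-value items."""
    root = ET.Element("metadata")
    for key, value in metadata.items():
        item = ET.SubElement(root, "item", key=key)
        item.text = str(value)
    return ET.tostring(root, encoding="unicode")


def xml_to_metadata(xml_text):
    """Parse the stored XML back into a dict (all values come back as strings)."""
    return {item.get("key"): item.text for item in ET.fromstring(xml_text)}


def open_store():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE document (
            document_id  INTEGER PRIMARY KEY,
            metadata_xml TEXT NOT NULL
        );
        -- Stand-in for the external full-text index: one row per indexed word.
        CREATE TABLE search_index (
            document_id INTEGER NOT NULL,
            meta_key    TEXT NOT NULL,
            term        TEXT NOT NULL
        );
        CREATE INDEX idx_search ON search_index (meta_key, term);
    """)
    return conn


def save_document(conn, document_id, metadata):
    """Store all metadata as XML and re-index only the searchable keys."""
    conn.execute(
        "INSERT OR REPLACE INTO document VALUES (?, ?)",
        (document_id, metadata_to_xml(metadata)),
    )
    conn.execute("DELETE FROM search_index WHERE document_id = ?", (document_id,))
    for key in SEARCHABLE_KEYS.intersection(metadata):
        for term in str(metadata[key]).lower().split():
            conn.execute(
                "INSERT INTO search_index VALUES (?, ?, ?)",
                (document_id, key, term),
            )


def load_metadata(conn, document_id):
    row = conn.execute(
        "SELECT metadata_xml FROM document WHERE document_id = ?", (document_id,)
    ).fetchone()
    return xml_to_metadata(row[0])


def search(conn, key, term):
    """Return the IDs of documents whose indexed key contains the exact word."""
    rows = conn.execute(
        "SELECT DISTINCT document_id FROM search_index "
        "WHERE meta_key = ? AND term = ? ORDER BY document_id",
        (key, term.lower()),
    )
    return [row[0] for row in rows]
```

The point of the split is that the XML column holds everything, while only the keys someone actually needs to search are copied into the index, so adding a new metadata slot needs no schema change.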
Any suggestions? Similar experiences?
Table 1: Document information (PK is document ID)
Table 2: Metadata definitions (PK is metadata definition ID)
Table 3: Document ID, metadata definition ID, metadata value
The biggest drawback of this design is the value column: either every value has to be stored as a single type (varchar, presumably), or Table 3 needs n value columns (where n is the number of data types you’re willing to store), with a column in the metadata definitions table identifying which column of Table 3 to pull the value from.
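Here is a minimal sketch of the second variant (one value column per data type), again in Python with SQLite. The table names, the three value columns and the `value_column` field are my own assumptions for illustration, not taken from any particular product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Table 1: document information
    CREATE TABLE document (
        document_id INTEGER PRIMARY KEY,
        file_name   TEXT NOT NULL
    );
    -- Table 2: metadata definitions; value_column names the typed column
    -- in document_metadata that holds values for this definition.
    CREATE TABLE metadata_definition (
        definition_id INTEGER PRIMARY KEY,
        name          TEXT NOT NULL UNIQUE,
        value_column  TEXT NOT NULL
            CHECK (value_column IN ('text_value', 'number_value', 'date_value'))
    );
    -- Table 3: one row per (document, definition), one column per data type
    CREATE TABLE document_metadata (
        document_id   INTEGER NOT NULL REFERENCES document(document_id),
        definition_id INTEGER NOT NULL REFERENCES metadata_definition(definition_id),
        text_value    TEXT,
        number_value  REAL,
        date_value    TEXT,  -- ISO 8601 date string
        PRIMARY KEY (document_id, definition_id)
    );
""")


def set_metadata(conn, document_id, name, value):
    """Write a value into whichever typed column the definition points at."""
    definition_id, column = conn.execute(
        "SELECT definition_id, value_column FROM metadata_definition WHERE name = ?",
        (name,),
    ).fetchone()
    # column is restricted by the CHECK constraint above, so it is safe to
    # interpolate into the statement; the value itself is still a parameter.
    conn.execute(
        f"INSERT OR REPLACE INTO document_metadata "
        f"(document_id, definition_id, {column}) VALUES (?, ?, ?)",
        (document_id, definition_id, value),
    )


def get_metadata(conn, document_id):
    """Return {name: value}, reading each value from its designated column."""
    rows = conn.execute(
        """
        SELECT d.name, d.value_column, m.text_value, m.number_value, m.date_value
        FROM document_metadata AS m
        JOIN metadata_definition AS d ON d.definition_id = m.definition_id
        WHERE m.document_id = ?
        """,
        (document_id,),
    ).fetchall()
    result = {}
    for name, column, text_value, number_value, date_value in rows:
        by_column = {
            "text_value": text_value,
            "number_value": number_value,
            "date_value": date_value,
        }
        result[name] = by_column[column]
    return result
```

Every read has to consult the definitions table before it knows which column to look at, which is exactly the indirection described above. The single-varchar variant avoids that, at the cost of losing typed comparisons and range queries on dates and numbers.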
My opinions on the 5 solutions listed:
Those are my thoughts. I’ve never designed a system like this, but I have dealt with commercial systems that use several of these schemes.