I need to store data related to “items” where there will be various different item types, all with common attributes, and then each type with its own additional attributes. I expect this is a common requirement; what’s the best-practices solution? We’re using SQL Server.
Let’s use a made-up example:
Vehicle has
- Price
- Make
- Model
- Owner
(In our real data, there will be 10-15 common columns.)
Car is a Vehicle plus:
- Style (sedan, sports, etc.)
- Color
- EngineSize
Boat is a Vehicle plus:
- Displacement
- PortOfOrigin
…etc. for several types of things. In our real data, each specialized type will typically add 2-5 columns; there will be 5 types to start with. We’ll be adding types over time, but probably only 3 or 4 more in total (if that). Adding types requires development, so it’s not like “tags” which can be added willy-nilly by end users. We assume adding a type will require changes to the DB and client tiers, and probably the mid-tier as well. That’s totally fine.
We will do lots of queries across all of the items (vehicles, in the example above); we only worry about the details of a specific item type (Car, Boat) rarely.
I see four ways to store this data:
- Separate tables for Cars, Boats, etc., with duplicated columns.
- One table with the `Vehicle` data, a table for the additional `Car` data, and a table for the additional `Boat` data.
- One table of items, a separate table of item attributes with a row per additional attribute. E.g., a soft schema for the details.
- One table with generic columns given meaning only by non-DB code.
Looking at each:
- Separate tables for Cars, Boats, etc., with duplicated columns. E.g., roughly:

      CREATE TABLE [Cars] (
          [Id] INT IDENTITY PRIMARY KEY,
          -- Common vehicle columns, repeated in every table
          [Price] DECIMAL (19, 4),
          [Make] NVARCHAR(200),
          [Model] NVARCHAR(200),
          [Owner] INT,
          -- Car-specific columns
          [Style] NVARCHAR(200),
          [Color] NVARCHAR(200),
          [EngineSize] DECIMAL(19, 2)
      );

      CREATE TABLE [Boats] (
          [Id] INT IDENTITY PRIMARY KEY,
          -- Common vehicle columns, repeated in every table
          [Price] DECIMAL (19, 4),
          [Make] NVARCHAR(200),
          [Model] NVARCHAR(200),
          [Owner] INT,
          -- Boat-specific columns
          [Displacement] DECIMAL(19, 4),
          [PortOfOrigin] NVARCHAR(200)
      );

  Simple enough, Cars go in `Cars` and Boats go in `Boats`. If we add more vehicle types, we add a table. If we add another common column, we have to go back and add it to all the vehicle tables. Reporting against vehicles in general can be done against a union view of all of the tables (being careful about the `Id` column).

- One table with the `Vehicle` data, a table for the additional `Car` data, and a table for the additional `Boat` data. E.g., roughly:

      CREATE TABLE [Vehicles] (
          [Id] INT IDENTITY PRIMARY KEY,
          [Price] DECIMAL (19, 4),
          [Make] NVARCHAR(200),
          [Model] NVARCHAR(200),
          [Owner] INT,
          [Type] INT -- A type ID, e.g. "Car" vs. "Boat"
      );

      CREATE TABLE [Cars] (
          [Id] INT PRIMARY KEY, -- Same value as the matching Vehicles.Id
          [Style] NVARCHAR(200),
          [Color] NVARCHAR(200),
          [EngineSize] DECIMAL(19, 2)
      );

      CREATE TABLE [Boats] (
          [Id] INT PRIMARY KEY, -- Same value as the matching Vehicles.Id
          [Displacement] DECIMAL(19, 4),
          [PortOfOrigin] NVARCHAR(200)
      );

  So every Car would have one row in `Vehicles` and one linked row in `Cars`. Every Boat would have one row in `Vehicles` and one linked row in `Boats`. If we add more vehicle types, we add a table. Reporting against vehicles in general can be done against just the `Vehicles` table. When we retrieve details of a specific `Car` or `Boat`, we use a join.

- One table of items, a separate table of item attributes with a row per additional attribute (a soft schema for the details). E.g., roughly:

      CREATE TABLE [Vehicles] (
          [Id] INT IDENTITY PRIMARY KEY,
          [Price] DECIMAL (19, 4),
          [Make] NVARCHAR(200),
          [Model] NVARCHAR(200),
          [Owner] INT,
          [Type] INT
      );

      CREATE TABLE [VehicleDetails] (
          [VehicleId] INT,        -- Refers to Vehicles.Id
          [Name] NVARCHAR(200),   -- Attribute name, e.g. "Style"
          [Value] NVARCHAR(MAX)   -- Attribute value, always stored as text
      );

  So every Car gets one row in `Vehicles` and three rows in `VehicleDetails` (one each for “Style”, “Color”, and “EngineSize”). Reporting is largely done against the `Vehicles` table. Reporting on details starts getting messy fast. Soft schemas have their place, mostly around user-defined data, but I’m assuming this wouldn’t be a good choice here.

- One table with generic columns given meaning only by non-DB code:

      CREATE TABLE [Vehicles] (
          [Id] INT IDENTITY PRIMARY KEY,
          [Price] DECIMAL (19, 4),
          [Make] NVARCHAR(200),
          [Model] NVARCHAR(200),
          [Owner] INT,
          [Type] INT,
          -- Generic slots; their meaning depends on [Type]
          [Detail01] NVARCHAR(MAX),
          [Detail02] NVARCHAR(MAX),
          [Detail03] NVARCHAR(MAX),
          [Detail04] NVARCHAR(MAX),
          [Detail05] NVARCHAR(MAX),
          [Detail06] NVARCHAR(MAX),
          [Detail07] NVARCHAR(MAX),
          [Detail08] NVARCHAR(MAX),
          [Detail09] NVARCHAR(MAX),
          [Detail10] NVARCHAR(MAX)
      );

  So Car data would assign Style to `Detail01`, Color to `Detail02`, and EngineSize to `Detail03`; for Boats, we’d put Displacement in `Detail01` and PortOfOrigin in `Detail02`. Similarly, there may be a place for this with end-user defined schemas, but I’m guessing this wouldn’t be a good answer when you can control the DB structure.
It depends.
Approach 1 is best for situations where most attributes will be common to most types.
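For the separate-tables layout as the question describes it, the "union view" used for cross-type reporting might look roughly like the sketch below. The view name `AllVehicles` and the `VehicleType` column are made up for illustration; they are not part of the question's schema. `GO` is the batch separator used by SSMS and sqlcmd, needed here because `CREATE VIEW` has to start its own batch.

```sql
-- Sketch only: [AllVehicles] and [VehicleType] are hypothetical names.
CREATE VIEW [AllVehicles] AS
SELECT 'Car' AS [VehicleType], [Id], [Price], [Make], [Model], [Owner]
FROM [Cars]
UNION ALL
SELECT 'Boat' AS [VehicleType], [Id], [Price], [Make], [Model], [Owner]
FROM [Boats];
GO

-- Each table has its own IDENTITY sequence, so [Id] alone is not unique
-- in the view; the pair ([VehicleType], [Id]) is.
SELECT [Make], COUNT(*) AS [VehicleCount], AVG([Price]) AS [AveragePrice]
FROM [AllVehicles]
GROUP BY [Make];
```

Every new vehicle type means another `UNION ALL` branch, and every new common column means touching each table plus the view.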
Approach 2 is best for situations where few attributes will be common to most types.
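As a rough illustration of the base-table-plus-detail-tables layout from the question, the sketch below adds foreign keys (which the question's rough DDL leaves out) and shows the two query shapes it describes. The constraint names and the literal Id `42` are assumptions for the example only.

```sql
-- Assumes the [Vehicles], [Cars], and [Boats] tables from the question exist.
-- Constraint names are hypothetical.
ALTER TABLE [Cars]
    ADD CONSTRAINT [FK_Cars_Vehicles]
    FOREIGN KEY ([Id]) REFERENCES [Vehicles] ([Id]);

ALTER TABLE [Boats]
    ADD CONSTRAINT [FK_Boats_Vehicles]
    FOREIGN KEY ([Id]) REFERENCES [Vehicles] ([Id]);

-- Cross-type reporting touches only the base table.
SELECT [Type], COUNT(*) AS [VehicleCount], SUM([Price]) AS [TotalPrice]
FROM [Vehicles]
GROUP BY [Type];

-- Type-specific detail needs a single join on the shared key.
SELECT v.[Id], v.[Make], v.[Model], v.[Price],
       c.[Style], c.[Color], c.[EngineSize]
FROM [Vehicles] AS v
INNER JOIN [Cars] AS c ON c.[Id] = v.[Id]
WHERE v.[Id] = 42;  -- hypothetical vehicle Id
```

The foreign keys guarantee that every detail row has a base row, but nothing here stops a vehicle from having rows in both `Cars` and `Boats`; that would need an additional check involving `Type`.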
Approach 3 is essentially approach 1, with an Entity-Attribute-Value approach to handling type-specific attributes. This approach is best for situations where most attributes will be common to most types, and it is difficult to anticipate what additional attributes will be required – it is quite common in situations that require user-created fields.
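To give a feel for what reporting on Entity-Attribute-Value details involves, here is a minimal sketch that pivots the question's `VehicleDetails` rows back into columns. It assumes the attribute names "Style", "Color", and "EngineSize" from the Car example and uses `TRY_CONVERT`, which is available from SQL Server 2012 onward.

```sql
-- Assumes [Vehicles] and [VehicleDetails] as defined in the question.
SELECT v.[Id], v.[Make], v.[Model],
       MAX(CASE WHEN d.[Name] = N'Style'
                THEN CONVERT(NVARCHAR(200), d.[Value]) END) AS [Style],
       MAX(CASE WHEN d.[Name] = N'Color'
                THEN CONVERT(NVARCHAR(200), d.[Value]) END) AS [Color],
       -- Values are stored as text, so numeric attributes must be converted
       -- on the way out; TRY_CONVERT yields NULL for unparseable values.
       MAX(CASE WHEN d.[Name] = N'EngineSize'
                THEN TRY_CONVERT(DECIMAL(19, 2), d.[Value]) END) AS [EngineSize]
FROM [Vehicles] AS v
LEFT JOIN [VehicleDetails] AS d ON d.[VehicleId] = v.[Id]
GROUP BY v.[Id], v.[Make], v.[Model];
```

Each attribute needs its own `CASE` branch and its own type conversion, and vehicles that lack an attribute (Boats, here) simply come back with `NULL` in that column.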
Approach 4 is not a good idea in any situation – it removes semantic content from the metadata layer into the code layer, while retaining the inflexibility of Approach 1.
There is also another possible approach – a pure Entity-Attribute-Value approach (essentially a blend of approaches 3 and 4). This is generally regarded as an anti-pattern, due to the complexity and poor performance produced when implemented on an RDBMS. However, there are some situations where it is the only approach possible – primarily, where the entity relationships are not known in advance.