I’m trying to turn a huge spreadsheet of data into a database to make data analysis easier, but I’m running into problems with too many columns. I’ve tried my best to learn about normalization, but I’m having a hard time applying it to this use case.
Scenario
We are performing N independent measurements on rectangular blocks. Measurements include:
- Length (or Measurement 0)
- Width (or Measurement 1)
- Height (or Measurement 2)
- Mass (or Measurement 3)
- Color (or Measurement 4)
- …
- Measurement N
There are over 7000 measurements (complicated blocks)! The measurements have limits. If a block fails one or more measurements, all measurements are repeated to verify. If it fails again, the block is deemed a failure.
The blocks are serialized and there are thousands of them.
Data Source
A huge spreadsheet (table). The fields are: Block Number, Length, Width, Height, Mass, Color, …, Measurement N. Each row represents one test run or execution of all measurements. Since we have a retest policy, there may be multiple rows with results from the same block.
Help!
This source table seems like an intuitive format, but it doesn’t seem like the best format for a database. At first I tried to put it in an SQLite database and ran into the 2000-column limit. Yes, I could recompile SQLite with a higher column limit or use another database engine, but this sounds more like a database design issue. Do you have a better design idea?
P.S. Sorry this is so long, but thanks for reading!
Sounds like you need a MeasurementType table to hold the names of all measurements and any other information you may want to store about them.
Then you would have a Measurement table referencing both the MeasurementType table and the “original” table that your spreadsheet becomes (i.e. the table left with only the Block Number column):
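To make that concrete, here is a minimal sketch using Python's built-in sqlite3 module. Only MeasurementType, Measurement and Block Number come from the description above; the TestRun table name, every column name, the limit columns and the sample values are assumptions made up for illustration, so adjust them to your data. Each spreadsheet row becomes one TestRun row plus one Measurement row per measured value, so 7000 measurements turn into 7000 rows rather than 7000 columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, just for the sketch

# Hypothetical schema: table and column names other than MeasurementType,
# Measurement and the block number are assumptions.
conn.executescript("""
CREATE TABLE TestRun (
    run_id       INTEGER PRIMARY KEY,   -- one row per spreadsheet row
    block_number INTEGER NOT NULL       -- repeats when a block is retested
);
CREATE TABLE MeasurementType (
    type_id     INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE,   -- e.g. 'Length', 'Mass'
    lower_limit REAL,                   -- assumed place to keep the limits
    upper_limit REAL
);
CREATE TABLE Measurement (
    run_id  INTEGER NOT NULL REFERENCES TestRun(run_id),
    type_id INTEGER NOT NULL REFERENCES MeasurementType(type_id),
    value   REAL,
    PRIMARY KEY (run_id, type_id)       -- one value per type per run
);
""")

# Made-up measurement types and limits.
conn.executemany(
    "INSERT INTO MeasurementType (name, lower_limit, upper_limit) VALUES (?, ?, ?)",
    [("Length", 9.0, 11.0), ("Width", 4.0, 6.0), ("Mass", 95.0, 105.0)],
)


def add_run(conn, block_number, results):
    """Store one spreadsheet row: a TestRun plus one Measurement per column."""
    run_id = conn.execute(
        "INSERT INTO TestRun (block_number) VALUES (?)", (block_number,)
    ).lastrowid
    for name, value in results.items():
        (type_id,) = conn.execute(
            "SELECT type_id FROM MeasurementType WHERE name = ?", (name,)
        ).fetchone()
        conn.execute(
            "INSERT INTO Measurement (run_id, type_id, value) VALUES (?, ?, ?)",
            (run_id, type_id, value),
        )
    return run_id


# Made-up data: block 1001 fails Mass on its first run and is retested.
add_run(conn, 1001, {"Length": 10.2, "Width": 5.1, "Mass": 110.0})
add_run(conn, 1001, {"Length": 10.1, "Width": 5.0, "Mass": 101.0})
add_run(conn, 1002, {"Length": 10.0, "Width": 5.0, "Mass": 100.0})

# Every out-of-limit value, with the block and run it belongs to.
failures = conn.execute("""
    SELECT r.block_number, r.run_id, t.name, m.value
    FROM Measurement AS m
    JOIN TestRun AS r ON r.run_id = m.run_id
    JOIN MeasurementType AS t ON t.type_id = m.type_id
    WHERE m.value NOT BETWEEN t.lower_limit AND t.upper_limit
    ORDER BY r.run_id, t.name
""").fetchall()
print(failures)
```

Adding measurement N+1 then means inserting one more MeasurementType row instead of altering a table. A non-numeric measurement such as Color is left out of this sketch; it would need either a text value column or its own lookup table.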