I have an Excel spreadsheet that I am going to turn into a DB to mine data and build an interactive app. There are about 20 columns and 80,000 records. Practically all records have about half of their columns null, but which columns hold data is random for each record.
The options would be to:
- Create a more normalized DB with a table for each column and use 20 joins to view all the data. I would think the benefit would be a DB with practically no NULL values, so the size would be smaller. One of the major cons would be more code to update each table from the application side.
- Create a flat file with one table that has all the columns. I figure this will be easier for the application side to do updates, but it will result in a table with a lot of empty data space.
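To make the two options concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is cut down to three columns instead of 20, and the table and column names (`flat`, `record`, `col_a`, and so on) are made up for illustration; nothing here comes from the actual spreadsheet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Option 2: one wide table; a missing value is stored as NULL.
cur.execute(
    "CREATE TABLE flat (id INTEGER PRIMARY KEY, col_a TEXT, col_b TEXT, col_c TEXT)"
)
cur.executemany(
    "INSERT INTO flat VALUES (?, ?, ?, ?)",
    [(1, "x", None, None), (2, None, "y", "z")],
)

# Option 1: one table per column; a row exists only where a value exists.
# (Hypothetical names; the real design would have 20 such tables.)
cur.execute("CREATE TABLE record (id INTEGER PRIMARY KEY)")
for col in ("col_a", "col_b", "col_c"):
    cur.execute(
        f"CREATE TABLE {col} ("
        "id INTEGER PRIMARY KEY REFERENCES record(id), value TEXT NOT NULL)"
    )
cur.executemany("INSERT INTO record VALUES (?)", [(1,), (2,)])
cur.execute("INSERT INTO col_a VALUES (1, 'x')")
cur.execute("INSERT INTO col_b VALUES (2, 'y')")
cur.execute("INSERT INTO col_c VALUES (2, 'z')")

# Viewing all the data needs one LEFT JOIN per column table.
joined = cur.execute(
    """
    SELECT r.id, a.value, b.value, c.value
    FROM record r
    LEFT JOIN col_a a ON a.id = r.id
    LEFT JOIN col_b b ON b.id = r.id
    LEFT JOIN col_c c ON c.id = r.id
    ORDER BY r.id
    """
).fetchall()
flat_rows = cur.execute("SELECT * FROM flat ORDER BY id").fetchall()
```

In this sketch both queries return the same rows, NULLs included, so the joined view of the per-column tables looks just like the flat table; the difference is in how the values are stored and updated.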
I don’t get why you think updating a normalized DB is harder than updating a flat table. It’s very much the other way around.
Think about inserting a relation between a customer and a product (basically an order). You’d have to:
What about the first time? What do you do with the initial nulls? Do you modify your selects to ignore them? What if you want the nulls?
What if you delete the last product? Do you change it into an update and set nulls for just a few columns?
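A rough sketch of the difference, using Python's built-in `sqlite3` module. The schemas and names (`customer_flat`, `customer`, `customer_order`, the `widget` product) are assumptions made up for illustration, and the flat layout is only one guess at what such a table might look like.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Flat layout (assumed): the customer row carries the product columns.
cur.execute(
    "CREATE TABLE customer_flat ("
    "customer_id INTEGER PRIMARY KEY, name TEXT, product_name TEXT, quantity INTEGER)"
)
cur.execute("INSERT INTO customer_flat VALUES (1, 'Ann', NULL, NULL)")

# "Adding" the first order is really an UPDATE that fills the initial NULLs.
cur.execute(
    "UPDATE customer_flat SET product_name = 'widget', quantity = 2 "
    "WHERE customer_id = 1"
)
# "Deleting" the last order is another UPDATE that puts the NULLs back.
cur.execute(
    "UPDATE customer_flat SET product_name = NULL, quantity = NULL "
    "WHERE customer_id = 1"
)
flat_after = cur.execute("SELECT * FROM customer_flat").fetchall()

# Normalized layout (assumed): the relation lives in its own table.
cur.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
cur.execute(
    "CREATE TABLE customer_order ("
    "customer_id INTEGER NOT NULL REFERENCES customer(customer_id), "
    "product_name TEXT NOT NULL, quantity INTEGER NOT NULL)"
)
cur.execute("INSERT INTO customer VALUES (1, 'Ann')")

# Adding an order is a plain INSERT; removing it is a plain DELETE.
cur.execute("INSERT INTO customer_order VALUES (1, 'widget', 2)")
cur.execute(
    "DELETE FROM customer_order WHERE customer_id = 1 AND product_name = 'widget'"
)
orders_after = cur.execute("SELECT * FROM customer_order").fetchall()
customers_after = cur.execute("SELECT * FROM customer").fetchall()
```

In the flat version the application has to know whether it is filling NULLs, overwriting an existing order, or clearing columns. In the normalized version the first and the last order are handled the same way as any other.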
Joins aside, working with a normalized table is trivial by design. You pay for that simplicity with performance; that’s the actual trade-off.