Situation: We are working on a project that reads datafeeds into the database at our company. These datafeeds can contain a high number of fields. We match those fields with certain columns.
At this moment we have about 120 types of fields. Those all needs a column. We need to be able to filter and sort all columns.
The problem is that I’m unsure what database design would be best for this. I’m using MySQL for the job but I’m are open for suggestions. At this moment I’m planning to make a table with all 120 columns since that is the most natural way to do things.
Options: My other options are a meta table that stores key and values. Or using a document based database so I have access to a variable schema and scale it when needed.
Question:
What is the best way to store all this data? The row count could go up to 100k rows and I need a storage that can select, sort and filter really fast.
Update:
Some more information about usage. XML feeds will be generated live from this table. we are talking about 100 – 500 requests per hours but this will be growing. The fields will not change regularly but it could be once every 6 months. We will also be updating the datafeeds daily. So checking if items are updated and deleting old and adding new ones.
I’m working on a similar project right now, downloading dumps from the net and loading them into the database, merging changes into the main table and properly adjusting the dictionary tables.
First, you know the data you’ll be working with. So it is necessary to analyze it in advance and pick the best table/column layout. If you have all your 120 columns containing textual data, then a single row will take several K-bytes of disk space. In such situation you will want to make all queries highly selective, so that indexes are used to minimize IO. Full scans might take significant time with such a design. You’ve said nothing about how big your 500/h requests will be, will each request extract a single row, a small bunch of rows or a big portion (up to whole table)?
Second, looking at the data, you might outline a number of columns that will have a limited set of values. I prefer to do the following transformation for such columns:
The transformation is done by triggers written in C, so although it gives me upload penalty, I do have some benefits:
Third, try to split data according to the extracts you’ll be doing. Quite often it turns out that only 30-40% of the fields in the table are typically being used by the all queries, the rest 60-70% are evenly distributed among all of them and used partially. In this case I would recommend splitting main table accordingly: extract the fields that are always used into single “master” table, and create another one for the rest of the fields. In fact, you can have several “another ones”, logically grouping data in a separate tables.
In my practice we’ve had a table that contained customer detailed information: name details, addresses details, status details, banking details, billing details, financial details and a set of custom comments. All queries on such a table were expensive ones, as it was used in the majority of our reports (reports typically perform Full scans). Splitting this table into a set of smaller ones and building a view with rules on top of them (to make external application happy) we’ve managed to gain a pleasant performance boost (sorry, don’t have numbers any longer).
To summarize: you know the data you’ll be working with and you know the queries that will be used to access your database, analyze and design accordingly.