I want to sample data based on a timestamp field. I am reading huge data files, each holding close to a million records per day. I have several such files for every month.
Now I want to read this data but store only, say, 5% or 10% of it in a MySQL database. I do not know the number of records in each data file in advance.
Is there any way I can sample only 5% of the records read from a file? Are there any standard statistical approaches to this kind of problem?
EDIT based on comments below:
Before this sampling idea, I had created a key-based partition and an index on two fields: id and date. The id field is more like a clientId. Even with partitioning, a GROUP BY on 2 fields over 15 million rows takes a criminally long time, in the range of 30-60 minutes. I had also created an additional index on one of the GROUP BY fields.
The EXPLAIN output shows this:
select_type: SIMPLE
table: visits
type: ref
possible_keys: 3ColumnerIndex,2ColumnIndex
key: 2ColumnIndex
key_len: 302
ref: const
rows: 7493642
Extra: Using where; Using filesort
This is the performance I got after giving InnoDB a buffer size of 4 GB!
You need an estimate of the number of records for this to work, but if you don’t have strict requirements on how many samples you need, this shouldn’t be a problem:
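To select k samples from n records, process the records one at a time:
1. With probability k/n, output the current record. Put k := k-1 and n := n-1.
2. Otherwise, skip the record and put n := n-1.
Each record appears in the output with probability k/n (using the initial values of k and n). E.g. the probability of the second record appearing would be: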
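Conditioning on whether the first record was output, this works out to (k/n) * (k-1)/(n-1) + (1 - k/n) * k/(n-1) = k/n, and the same argument carries over to later records.

For illustration, here is a minimal Python sketch of the steps above. The function name selection_sample and the rng parameter are my own choices for the example, not part of the original answer, and n is assumed to be the known or estimated record count.

```python
import random

def selection_sample(records, k, n, rng=random):
    """Yield records so that each is kept with probability k/n.

    k is the desired sample size, n the (estimated) number of records.
    With an exact n, exactly k records are yielded.
    """
    for record in records:
        # Stop once the quota is filled or the estimated count is used up.
        # Assumption: if n was underestimated, trailing records are skipped.
        if k <= 0 or n <= 0:
            break
        # Keep the current record with probability k/n.
        if rng.random() < k / n:
            yield record
            k -= 1
        # One fewer record remains, whether or not it was kept.
        n -= 1

# Hypothetical usage: keep 5% of 1000 records.
sample = list(selection_sample(range(1000), 50, 1000))
```

If n is only an estimate, the sample size will be roughly k rather than exactly k, which matches the caveat above about not having strict requirements on the number of samples.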