I want to sample data based on a timestamp field. I am reading huge

Question

Asked: May 31, 2026 (2026-05-31T16:28:19+00:00)

I want to sample data based on a timestamp field. I am reading huge data files, each holding close to a million records per day. I have several such files for every month.

Now I want to read this data but store only, say, 5% or 10% of it in a MySQL database. I do not know in advance how many records each data file contains.

Is there a way to sample only 5% of the records read from a file? Are there any standard statistical approaches to this kind of problem?

EDIT based on comments below:

Before this sampling idea, I had created a key-based partition and an index on two fields: id and date. The id field is more like a clientId. Even with partitioning, a GROUP BY on two fields over 15 million rows would take a criminally long time, in the range of 30-60 minutes. I had also created an additional index on one of the GROUP BY fields.

My EXPLAIN output shows this:

select_type: SIMPLE
table: visits
type: ref
possible_keys: 3ColumnerIndex,2ColumnIndex
key: 2ColumnIndex
key_len: 302
ref: const
rows: 7493642
Extra: Using where; Using filesort

I got this performance even after giving InnoDB a buffer size of 4 GB!

1 Answer

Editorial Team · Answer 1 · 2026-05-31T16:28:21+00:00

You need an estimate of the number of records for this to work, but if you don't have strict requirements on how many samples you end up with, this shouldn't be a problem:

Suppose you are choosing k samples from n records, where n is your estimate of the record count.
For each record, until you have enough records (k = 0):
1. Produce a random number between 0 and 1.
2. If it is less than k/n, output the current record and set k := k-1 and n := n-1.
3. Otherwise, discard the record and set n := n-1.
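
The steps above could be sketched in Python roughly as follows. This is only an illustration: the function name `selection_sample` and its parameters are my own, and `n` is assumed to be the true record count or a reasonable estimate of it.

```python
import random

def selection_sample(records, k, n, rng=random):
    """Yield about k of n records in a single pass, preserving input order.

    k: number of samples wanted; n: (estimated) number of records.
    Both are decremented exactly as in the numbered steps.
    """
    for record in records:
        if k <= 0 or n <= 0:
            break  # enough records chosen, or the estimate of n is used up
        if rng.random() < k / n:  # rng.random() is uniform on [0, 1)
            yield record
            k -= 1
        n -= 1
```

For a 5% sample of a file with roughly a million lines you might call `selection_sample(lines, 50_000, 1_000_000)`. If n is exact, you get exactly k records. If n is overestimated you get somewhat fewer than k, and if it is underestimated the records beyond the estimate are never considered, so it is safer to err on the high side.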

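Each record appears in the output with probability k/n, where k and n are the starting values. For example, the second record is output either after the first record was output or after the first record was discarded, so its probability is: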

(k/n)*(k-1)/(n-1) + ((n-k)/n)*k/(n-1) = (k-1+n-k)*k/(n*(n-1)) = k/n
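
The same check can be done for every position, not just the second, by tracking exactly how many records have been output so far. The sketch below uses exact fractions and assumes n is the true record count; the function name `inclusion_probabilities` is made up for this illustration.

```python
from fractions import Fraction

def inclusion_probabilities(k, n):
    """Exact probability that each of the n records is output."""
    # dist[j] = probability that j records have been output so far
    dist = {0: Fraction(1)}
    probs = []
    for i in range(n):
        remaining = n - i  # the current value of n in the steps
        p_i = Fraction(0)
        nxt = {}
        for j, p in dist.items():
            take = Fraction(k - j, remaining)  # the current k/n
            p_i += p * take
            nxt[j + 1] = nxt.get(j + 1, 0) + p * take
            nxt[j] = nxt.get(j, 0) + p * (1 - take)
        probs.append(p_i)
        # drop states that cannot occur
        dist = {j: p for j, p in nxt.items() if p}
    return probs
```

For small inputs such as `inclusion_probabilities(3, 10)`, every entry should equal `Fraction(3, 10)`, which matches the k/n result derived by hand above.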
