I need to compute quantiles for a large set of data.
Let’s assume the data is only available in portions (e.g. one row of a large matrix at a time). To compute the Q3 quantile, one needs to collect all the portions of the data, store them somewhere, sort them, and then pick the quantile:
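
    using System;
    using System.Collections.Generic;

    // Collect every portion of the data into a single list
    List<double> allData = new List<double>();
    // This is only an example; the portions of data are not really rows of some matrix
    foreach (var row in matrix)
    {
        allData.AddRange(row);
    }

    // Sort everything, then read the element at the Q3 rank
    allData.Sort();
    double p = 0.75 * allData.Count;
    int idQ3 = (int)Math.Ceiling(p) - 1;
    double Q3 = allData[idQ3];
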
I would like to find a way of obtaining the quantile without storing all the data in an intermediate variable. The best solution would be to compute some intermediate parameters from the first row and then adjust them step by step as the next rows arrive.
Note:
- The datasets are really big (ca. 5000 elements in each row)
- Q3 can be estimated; it doesn’t have to be an exact value.
- I call the portions of data “rows”, but they can have different lengths! Usually the length doesn’t vary much (+/- a few hundred samples), but it does vary!
This question is similar to “On-line” (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis, but I need to compute quantiles.
Also, there are a few articles on this topic, e.g.:
- An Efficient Algorithm for the Approximate Median Selection Problem
- Incremental quantile estimation for massive tracking
Before trying to implement these approaches, I wonder whether there are any other, quicker ways of computing the 0.25/0.75 quantiles.
Inspired by this answer, I created a method that estimates the quantiles quite well. The approximation is close enough for my purposes.
The idea is the following: the 0.75 quantile is in fact the median of all values that lie above the global median. Correspondingly, the 0.25 quantile is the median of all values below the global median.
So if we can approximate the median, we can approximate the quantiles in a similar way.
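The idea above can be sketched roughly as follows. This is only a minimal illustration and not necessarily the exact code used: it assumes a sign-based update in which each estimate is nudged by a small step `eta` toward every incoming value, and it assumes the data can be read twice (first pass for the median, second pass for the quartiles). The names `QuartileSketch`, `Nudge`, `EstimateMedian` and `EstimateQuartiles`, as well as the starting values, are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class QuartileSketch
{
    // One update step: move the estimate by eta toward the new value.
    private static double Nudge(double estimate, double value, double eta)
    {
        return estimate + eta * Math.Sign(value - estimate);
    }

    // First pass: approximate the global median, one value at a time.
    public static double EstimateMedian(IEnumerable<double[]> rows, double eta)
    {
        double median = 0; // assumed starting point
        foreach (var row in rows)
            foreach (var value in row)
                median = Nudge(median, value, eta);
        return median;
    }

    // Second pass: values below the median feed q1, all other values feed q3.
    public static (double Q1, double Q3) EstimateQuartiles(
        IEnumerable<double[]> rows, double median, double eta)
    {
        double q1 = median, q3 = median; // assumed starting points
        foreach (var row in rows)
        {
            foreach (var value in row)
            {
                if (value < median)
                    q1 = Nudge(q1, value, eta);
                else
                    q3 = Nudge(q3, value, eta);
            }
        }
        return (q1, q3);
    }
}
```

Rows of different lengths work without changes, since only individual values are consumed and nothing is stored. A smaller `eta` should give a steadier estimate, but the estimate then needs more samples to travel from its starting value to the true one.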
Remarks:
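- If the distribution of the data is strange, a larger `eta` is needed in order to fit the strange data. But the accuracy will be worse.
- If the total size of the collection (N) is known, you can adjust the `eta` parameter in this way: at the beginning, set `eta` to some large value (e.g. 0.2). As the loop progresses, lower the value of `eta`, so that when you reach almost the end of the collection, `eta` is almost equal to 0 (for example, compute it in the loop like this: `eta = 0.2 - 0.2 * ((double)i / N);`). The cast matters if `i` and `N` are integers, because integer division would otherwise give 0.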
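A decaying `eta` like the one in the last remark might look like the sketch below. It assumes the same sign-based update for the median and a collection whose size is known up front; `DecayingEta`, `EtaAt` and `eta0` are hypothetical names. Note the cast to `double`: with integer `i` and `N`, the expression `i/N` would truncate to 0 in C# and `eta` would never decrease.

```csharp
using System;
using System.Collections.Generic;

public static class DecayingEta
{
    // Linear schedule: eta0 at i = 0, approaching 0 as i approaches n.
    public static double EtaAt(long i, long n, double eta0 = 0.2)
    {
        return eta0 - eta0 * ((double)i / n);
    }

    // Median estimate with a step size that shrinks over the run.
    public static double EstimateMedian(IReadOnlyList<double> values, double eta0 = 0.2)
    {
        double median = 0; // assumed starting point
        int n = values.Count;
        for (int i = 0; i < n; i++)
        {
            double eta = EtaAt(i, n, eta0);
            median += eta * Math.Sign(values[i] - median);
        }
        return median;
    }
}
```

The large early steps let the estimate reach the right region quickly, and the small late steps keep it from jumping around once it is there.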