I am looking for a way to perform basic outlier filtering on a column of data in SQL Server.
Background
I have a log table that contains various actions and the times at which those actions occurred. I am looking to retrieve some data surrounding the average time between two distinct log event types. I’m using a simple query (using DATEDIFF between timestamps) to capture the time duration between those events. Currently, I use an AVG function to get the average time for all paired instances of those two events occurring.
Problem
I would like to perform outlier filtration on the dataset prior to averaging using the following method:
Y is an outlier if Y < (Q1 - 1.5 * IQR)
OR
Y is an outlier if Y > (Q3 + 1.5 * IQR)
Where Q1 is the first quartile boundary value,
Q3 is the third quartile boundary value,
and IQR is Q3 - Q1.
My question has two parts. First, what is the best way to determine quartile values in SQL? Second, is there a way to store this as its own aggregate function, so that I can filter and then average?
Let me presume that you are using SQL Server 2005 or later, since what you want to do requires window functions.
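A minimal sketch of a query along these lines, not a definitive implementation. The table name durations and the column y (the DATEDIFF result) are assumptions for illustration, as are the names ranked, seqnum and total. It keeps the rows inside the two fences and averages them. Because only one set of quartiles is computed here, the aggregates already return a single row and no group by is needed; a per-group variant would add a partition by to the window functions and a matching group by.

```sql
-- Hypothetical table and column: durations(y), where y holds the DATEDIFF result.
WITH ranked AS (
    SELECT y,
           ROW_NUMBER() OVER (ORDER BY y) AS seqnum,  -- position of the row when sorted by y
           COUNT(*) OVER ()               AS total    -- total number of rows
    FROM durations
)
SELECT AVG(CAST(r.y AS float)) AS avg_without_outliers
FROM ranked AS r
CROSS JOIN (
    -- Quartiles: the y values sitting 25% and 75% of the way through the sorted rows.
    SELECT MAX(CASE WHEN seqnum = CAST(total * 0.25 AS int) THEN y END) AS q1,
           MAX(CASE WHEN seqnum = CAST(total * 0.75 AS int) THEN y END) AS q3
    FROM ranked
) AS qs
-- Keep only rows that are NOT outliers: Q1 - 1.5 * IQR <= y <= Q3 + 1.5 * IQR.
WHERE r.y >= qs.q1 - 1.5 * (qs.q3 - qs.q1)
  AND r.y <= qs.q3 + 1.5 * (qs.q3 - qs.q1);
```

Note that with fewer than four rows the 25% position truncates to 0, no row matches it, q1 comes back NULL, and every row is filtered out. The sketch takes the quartile to be a single row value rather than interpolating between neighbours.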
Some notes on how this works. The qs subquery calculates the quartiles explicitly: it sequences the rows by y and takes the values at the rows 25% and 75% of the way through the data. Note that the comparison matches the sequence number against the total number of rows multiplied by that fraction, cast back to an integer.
The group by just puts these values into one row for each calculation. The where clause is the logic that you want to apply for exclusion.