I have two tables:
DATA
DATA_ID | SAMPLE_ID | ASSAY_ID | SIGNAL
101 | 201 | 301 | 2.87964
102 | 201 | 302 | 7.64623
103 | 202 | 301 | 1.98473
...
And SAMPLES:
SAMPLE_ID | SAMPLE_NAME | CATEGORY
201 | SAMP0001 | CAT A
202 | SAMP0002 | CAT B
203 | SAMP0003 | CAT A
...
There are about 20,000 rows in SAMPLES. For each sample, there are about 40,000 rows in DATA. Each ASSAY_ID occurs exactly once per sample in DATA. I need to take a subset of the samples in SAMPLES and calculate a standard/z-score value for each signal value in DATA, grouping by ASSAY_ID. I am trying to create a stored procedure that will be called repeatedly; it will accept a single ASSAY_ID value and return SAMPLE_ID and ZSCORE pairs for all of the samples in the predefined sample subset.
Given a set of sample signal values (X = [3.21, 4.56, 1.12, ..]) for a given assay, the standard/z-score in this case is calculated as
(X[i] - median(X))/(K * MAD)
Where K is a scale factor equal to 1.4826 and MAD is the median absolute deviation, equal to:
median(|X[i]-median(X)|)
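To make the formula concrete, here is a small sanity check on made-up values (1, 2, 3, 4, 100; these are not rows from DATA). It is only a sketch, assuming Oracle's MEDIAN aggregate and the DUAL table. For this toy set the median is 3, the absolute deviations are 2, 1, 0, 1, 97, and so the MAD is 1.

```sql
-- Toy input: hypothetical signal values, not taken from the DATA table.
WITH X AS (
    SELECT 1 AS SIGNAL FROM DUAL UNION ALL
    SELECT 2 FROM DUAL UNION ALL
    SELECT 3 FROM DUAL UNION ALL
    SELECT 4 FROM DUAL UNION ALL
    SELECT 100 FROM DUAL
),
-- median(X)
M AS (
    SELECT MEDIAN(SIGNAL) AS MED FROM X
),
-- MAD = median(|X[i] - median(X)|)
D AS (
    SELECT MEDIAN(ABS(X.SIGNAL - M.MED)) AS MAD
    FROM X CROSS JOIN M
)
-- z-score = (X[i] - median(X)) / (K * MAD), with K = 1.4826
SELECT X.SIGNAL,
       (X.SIGNAL - M.MED) / (1.4826 * D.MAD) AS ZSCORE
FROM X CROSS JOIN M CROSS JOIN D
ORDER BY X.SIGNAL;
```

The outlier 100 ends up with a z-score of roughly 65.4 (97 / 1.4826), while the value 3 sits exactly at 0.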
Got that? Good 🙂 Now, what is the most efficient way to perform this calculation using a SQL query? Execution time is key, given that there are close to a billion rows in DATA and a z-score needs to be calculated for almost every SIGNAL value.
Here is the best query I have been able to come up with so far:
WITH BASE AS (
    SELECT
        S.SAMPLE_ID,
        D.SIGNAL
    FROM
        DATA D
        JOIN SAMPLES S
            ON D.SAMPLE_ID = S.SAMPLE_ID
    WHERE
        S.CATEGORY IN ('CAT A', 'CAT B')
        AND D.ASSAY_ID = 12345
        AND S.SAMPLE_NAME NOT IN ('SAMP0003', 'SAMP0005', 'SAMP0008')
)
SELECT
    A.SAMPLE_ID,
    (A.SIGNAL - B.MED) / (1.4826 * C.MAD) AS ZSCORE
FROM
    BASE A,
    (
        SELECT MEDIAN(X.SIGNAL) AS MED
        FROM BASE X
    ) B,
    (
        SELECT MEDIAN(ABS(Y.SIGNAL - YY.MED)) AS MAD
        FROM BASE Y,
            (SELECT MEDIAN(SIGNAL) AS MED FROM BASE) YY
    ) C
Is there a more efficient way to perform this query?
Bonus Question: Can I write a single SQL query that would perform this calculation for EVERY ASSAY_ID in a single execution?
Can you have a look at:
Is it correct? Is it faster? If it is, just remove the AND D.ASSAY_ID = 301 clause for the bonus question 🙂
On the physical side, I would look into the data type for signal (BINARY_FLOAT or BINARY_DOUBLE are supposedly faster than NUMBER). And, if this is an option, I’d try to physically collocate the assays with partitions.
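One way a rewrite along these lines might look is sketched below. It is not necessarily the query this answer had in mind; it assumes Oracle's analytic form of MEDIAN (MEDIAN(...) OVER (PARTITION BY ...)) and reuses the filters from the question. As far as I know, analytic functions cannot be nested in Oracle, so the median and the MAD are computed in two separate levels. The NULLIF guard and the extra ASSAY_ID output column are my additions, not part of the original query.

```sql
WITH BASE AS (
    -- One pass over the filtered rows; the per-assay median is attached to every row.
    SELECT
        S.SAMPLE_ID,
        D.ASSAY_ID,
        D.SIGNAL,
        MEDIAN(D.SIGNAL) OVER (PARTITION BY D.ASSAY_ID) AS MED
    FROM
        DATA D
        JOIN SAMPLES S
            ON D.SAMPLE_ID = S.SAMPLE_ID
    WHERE
        S.CATEGORY IN ('CAT A', 'CAT B')
        AND S.SAMPLE_NAME NOT IN ('SAMP0003', 'SAMP0005', 'SAMP0008')
        AND D.ASSAY_ID = 301
),
DEV AS (
    -- Second level: MAD = median of |SIGNAL - MED| within each assay.
    SELECT
        SAMPLE_ID,
        ASSAY_ID,
        SIGNAL,
        MED,
        MEDIAN(ABS(SIGNAL - MED)) OVER (PARTITION BY ASSAY_ID) AS MAD
    FROM
        BASE
)
SELECT
    SAMPLE_ID,
    ASSAY_ID,
    -- NULLIF is an added guard: a MAD of 0 yields NULL instead of a divide-by-zero error.
    (SIGNAL - MED) / (1.4826 * NULLIF(MAD, 0)) AS ZSCORE
FROM
    DEV;
```

Because both medians are partitioned by ASSAY_ID, dropping the AND D.ASSAY_ID = 301 line should give z-scores for every assay in one execution. Whether that is actually faster than one call per assay would need to be measured on the real data.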