I have two tables: DATA DATA_ID | SAMPLE_ID | ASSAY_ID | SIGNAL 101 |

Question

Asked: June 17, 2026 (2026-06-17T04:02:34+00:00)

I have two tables:
DATA

DATA_ID  |  SAMPLE_ID  |  ASSAY_ID  |  SIGNAL
101      |  201        |  301       |  2.87964
102      |  201        |  302       |  7.64623
103      |  202        |  301       |  1.98473
...

And SAMPLES:

SAMPLE_ID  |  SAMPLE_NAME  |  CATEGORY
201        |  SAMP0001     |  CAT A  
202        |  SAMP0002     |  CAT B
203        |  SAMP0003     |  CAT A
...

There are about 20,000 rows in SAMPLES. For each sample, there are about 40,000 rows in DATA. Each ASSAY_ID occurs exactly once per sample in DATA. I need to take a subset of the samples in SAMPLES and calculate a standard/z-score value for each signal value in DATA, grouping by ASSAY_ID. I am trying to create a stored procedure that will be called repeatedly, which will accept a single ASSAY_ID value and return SAMPLE_ID and ZSCORE pairs for all of the samples in the predefined sample subset.

Given a set of sample signal values (X = [3.21, 4.56, 1.12, ..]) for a given assay, the standard/z-score in this case is calculated as

(X[i] - median(X))/(K * MAD)

Where K is a scale factor equal to 1.4826 and MAD is the median absolute deviation, equal to:

median(|X[i]-median(X)|)
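
To make the formula concrete, here is a minimal Python sketch of the same calculation for one assay. The function name `robust_zscores`, the sample values, and the guard against a zero MAD are illustrative assumptions, not part of the original question.

```python
from statistics import median

K = 1.4826  # scale factor from the formula above


def robust_zscores(signals):
    """Return (x - median(X)) / (K * MAD) for every x in signals."""
    med = median(signals)
    # MAD: median of the absolute deviations from the median
    mad = median(abs(x - med) for x in signals)
    if mad == 0:
        # Assumption: treat a zero MAD as an error rather than dividing by zero
        raise ValueError("MAD is zero; z-scores are undefined")
    return [(x - med) / (K * mad) for x in signals]


# Hypothetical signal values for a single assay (not taken from the tables above)
example = [1.0, 2.0, 3.0, 4.0, 100.0]
print(robust_zscores(example))
```

With these values the median is 3.0 and the MAD is 1.0, so the outlier 100.0 barely affects the scale used for the other samples, which is the point of using the median and MAD instead of the mean and standard deviation.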

Got that? Good 🙂 Now, what is the most efficient way to perform this calculation using a SQL query? Execution time is key, given that there are close to a billion rows in DATA and a z-score needs to be calculated for almost every SIGNAL value.

Here is the best query I have been able to come up with so far:

WITH BASE AS (
    SELECT 
        S.SAMPLE_ID,
        D.SIGNAL
    FROM
        DATA D
        JOIN SAMPLES S
            ON D.SAMPLE_ID = S.SAMPLE_ID
    WHERE 
        S.CATEGORY IN ('CAT A', 'CAT B')
        AND D.ASSAY_ID = 12345
        AND S.SAMPLE_NAME NOT IN ('SAMP0003', 'SAMP0005', 'SAMP0008')          
)
SELECT  
    A.SAMPLE_ID,
    (A.SIGNAL-B.MED)/(1.4826*C.MAD) AS ZSCORE
FROM 
    BASE A,
    (
        SELECT MEDIAN(X.SIGNAL) AS MED 
        FROM BASE X
    ) B,
    (
        SELECT MEDIAN(ABS(Y.SIGNAL-YY.MED)) AS MAD 
        FROM BASE Y, 
        (SELECT MEDIAN(SIGNAL) AS MED FROM BASE) YY
    ) C

Is there a more efficient way to perform this query?

Bonus Question: Can I write a single SQL query that would perform this calculation for EVERY ASSAY_ID in a single execution?

1 Answer

Editorial Team · Answer 1 · 2026-06-17T04:02:35+00:00

Can you have a look at:

SELECT ASSAY_ID, SAMPLE_ID, 
       -- Robust z-score: (signal - median) / (K * MAD), with K = 1.4826
       (SIGNAL - MED)/(1.4826F * MAD) AS ZSCORE
  FROM (
        -- Second pass: MAD = median of the absolute deviations from MED, per assay
        SELECT ASSAY_ID, SAMPLE_ID, SIGNAL, MED,
               MEDIAN(ABS(SIGNAL - MED)) OVER (PARTITION BY ASSAY_ID) AS MAD
          FROM (
                -- First pass: per-assay median over the selected sample subset
                SELECT ASSAY_ID, SAMPLE_ID, SIGNAL,
                       MEDIAN(SIGNAL) OVER (PARTITION BY ASSAY_ID) AS MED
                  FROM DATA    D
                  JOIN SAMPLES S USING (SAMPLE_ID)
                 WHERE S.CATEGORY IN ('CAT A', 'CAT B')
                   AND S.SAMPLE_NAME NOT IN ('SAMP0003', 'SAMP0005', 'SAMP0008')  
                   AND D.ASSAY_ID = 301
               )
       );

Is it correct? Is it faster? If it is, just remove the AND D.ASSAY_ID = 301 clause for the bonus question 🙂
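
One way to check the "is it correct?" part is to run the same per-assay logic outside the database on a small extract and compare the results. The Python sketch below mimics the two PARTITION BY ASSAY_ID passes of the query above; the function name `zscores_by_assay`, the input row layout, the sample rows, and returning None when the MAD is zero are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median

K = 1.4826  # same scale factor as in the query


def zscores_by_assay(rows):
    """rows: iterable of (assay_id, sample_id, signal), already filtered
    to the sample subset. Returns {(assay_id, sample_id): zscore}."""
    # Equivalent of PARTITION BY ASSAY_ID: bucket the rows per assay
    groups = defaultdict(list)
    for assay_id, sample_id, signal in rows:
        groups[assay_id].append((sample_id, signal))

    result = {}
    for assay_id, members in groups.items():
        med = median(s for _, s in members)             # inner pass: MED
        mad = median(abs(s - med) for _, s in members)  # middle pass: MAD
        for sample_id, signal in members:
            # Assumption: None marks an undefined z-score when MAD is zero
            result[(assay_id, sample_id)] = (
                (signal - med) / (K * mad) if mad else None
            )
    return result


# Hypothetical rows for two assays and three samples
rows = [
    (301, 201, 2.0), (301, 202, 4.0), (301, 203, 9.0),
    (302, 201, 7.0), (302, 202, 7.5), (302, 203, 10.0),
]
for key, z in sorted(zscores_by_assay(rows).items()):
    print(key, z)
```

Comparing this output against the query's rows for a handful of assays should reveal any mismatch in the median or MAD handling before timing the query on the full table.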

On the physical side, I would look into the data type for SIGNAL (BINARY_FLOAT or BINARY_DOUBLE are supposedly faster than NUMBER). And, if this is an option, I'd try to physically co-locate each assay's rows, for example by partitioning DATA on ASSAY_ID.
