I have a table which holds the data of many seq_id. Each seq_id has


Asked: May 25, 2026 (2026-05-25T00:23:47+00:00)


I have a table which holds the data for many seq_id values. Each seq_id has many hits (hit_name_id), stored on separate rows. I want to put sequences into the same group when their hits are similar (i.e. they share around 70-80% of their hits). For example, in the table below sequences 1, 2 and 4 are very similar, so they are most likely the same thing. I want to assign a group id to all similar sequences so that I can later extract just the unique seqs.

I created this query to demonstrate that each seq_id can have many hits that may or may not be shared:

mysql> SELECT seq_id,GROUP_CONCAT(hit_name_id ORDER BY hit_name_id), count(hit_name_id) FROM polished_data
    -> GROUP BY seq_id;
+--------+------------------------------------------------+--------------------+
| seq_id | GROUP_CONCAT(hit_name_id ORDER BY hit_name_id) | count(hit_name_id) |
+--------+------------------------------------------------+--------------------+
|      1 | 4,5,6,9,10,14,19,20,21                         |                  9 |
|      2 | 4,6,9,10,14,18,19,20,21                        |                  9 |
|      3 | 6,12,13,14,18,20                               |                  6 |
|      4 | 4,7,8,11,14,18,19,20,21                        |                  9 |
|      5 | 1,2,3,15,16,17,32                              |                  7 |
+--------+------------------------------------------------+--------------------+

I am not sure whether I can accomplish this in MySQL or whether I will need to program this step in my linked program.


1 Answer

Editorial Team · Answer 1 · 2026-05-25T00:23:47+00:00

This counts, for every other seq_id, the number of hits it shares with a chosen seq_id (written as ### below).

SELECT seq_id, COUNT(*) AS same
FROM polished_data
WHERE 
    hit_name_id IN (SELECT hit_name_id FROM polished_data WHERE seq_id = ###) 
    AND seq_id != ### 
GROUP BY seq_id

You can then extend this to also calculate how many hits differ (a hit that appears in one of the two sequences but not in both), and join the two results together to get a similarity score.

SELECT *, (same/(same+diff)) AS similarity   
FROM
(
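    SELECT 
        t.seq_id,                        -- taken from t so sequences with no shared hits keep their id
        COALESCE(s.same, 0) AS same,     -- the LEFT JOIN yields NULL when nothing is shared
        ((t.total - COALESCE(s.same, 0)) + (ct.total - COALESCE(s.same, 0))) AS diff 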

    FROM 

        (SELECT seq_id, COUNT(*) as total FROM polished_data
         GROUP BY seq_id) AS t  

    LEFT JOIN

        (SELECT seq_id, COUNT(*) AS same
         FROM polished_data
         WHERE 
             hit_name_id IN 
                 (SELECT hit_name_id FROM polished_data 
                  WHERE seq_id = ###) 
         GROUP BY seq_id) AS s

    ON t.seq_id = s.seq_id

    JOIN

        (SELECT COUNT(*) as total FROM polished_data
         WHERE seq_id = ###) AS ct  

) as result

Using random data you get something like this (tested using ### replaced by 1).

+--------+------+------+------------+
| seq_id | same | diff | similarity |
+--------+------+------+------------+
|      1 |   22 |    0 |     1.0000 |
|      2 |    4 |   45 |     0.0816 |
|      3 |    5 |   57 |     0.0806 |
|      4 |    8 |   34 |     0.1905 |
|      5 |    9 |   47 |     0.1607 |
|      6 |    3 |   36 |     0.0769 |
|      7 |    7 |   45 |     0.1346 |
|      8 |    3 |   48 |     0.0588 |
|      9 |    9 |   46 |     0.1636 |
|     10 |    4 |   48 |     0.0769 |
+--------+------+------+------------+

Change the ### in the above SQL to be the seq_id you want to compare against.
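
The query above compares everything against one seq_id; it does not assign group ids. If you do that step in the linked program, one possible approach (an assumption on my part, not something MySQL does for you) is to compute the similarity for every pair and merge pairs at or above a threshold with a small union-find. The sketch below uses the question's sample data; `jaccard`, `group_sequences` and the 0.5 threshold are illustrative choices, and in practice `hits` would be filled from `SELECT seq_id, hit_name_id FROM polished_data`.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared hits divided by distinct hits in either set (same/(same+diff))."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_sequences(hits, threshold):
    """Map each seq_id to a group id; pairs at or above threshold are merged."""
    parent = {seq_id: seq_id for seq_id in hits}

    def find(x):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(sorted(hits), 2):
        if jaccard(hits[a], hits[b]) >= threshold:
            ra, rb = find(a), find(b)
            if ra != rb:
                # The smaller seq_id becomes the root, so the group id is
                # always the lowest seq_id in the group.
                parent[max(ra, rb)] = min(ra, rb)

    return {seq_id: find(seq_id) for seq_id in hits}


# Sample data from the question; normally loaded from polished_data.
hits = {
    1: {4, 5, 6, 9, 10, 14, 19, 20, 21},
    2: {4, 6, 9, 10, 14, 18, 19, 20, 21},
    3: {6, 12, 13, 14, 18, 20},
    4: {4, 7, 8, 11, 14, 18, 19, 20, 21},
    5: {1, 2, 3, 15, 16, 17, 32},
}

groups = group_sequences(hits, threshold=0.5)
print(groups)
```

With a threshold of 0.5 sequences 1, 2 and 4 share group id 1, because 4 is linked through 2 (similarity exactly 0.5) even though it is below the threshold against 1. This merging is transitive, so long chains of loosely related sequences can end up in one group. With a threshold of 0.7 only 1 and 2 are grouped. Keeping one row per group id then gives the unique seqs.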
