I have a table that holds data for many sequences (seq_id). Each seq_id has many hits (hit_name_id), one per row. What I want to do is put sequences into the same group if their hits are similar (i.e. they share around 70-80% of their hits). E.g. in the table below, sequences 1, 2 and 4 are very similar, so more than likely they are the same thing. I want to be able to assign all the similar sequences a group id so that I can later extract just the unique seqs.
I created this query to demonstrate that each seq_id can have many hits that may or may not be shared:
mysql> SELECT seq_id, GROUP_CONCAT(hit_name_id ORDER BY hit_name_id), count(hit_name_id) FROM polished_data
    -> GROUP BY seq_id;
+--------+------------------------------------------------+--------------------+
| seq_id | GROUP_CONCAT(hit_name_id ORDER BY hit_name_id) | count(hit_name_id) |
+--------+------------------------------------------------+--------------------+
|      1 | 4,5,6,9,10,14,19,20,21                         |                  9 |
|      2 | 4,6,9,10,14,18,19,20,21                        |                  9 |
|      3 | 6,12,13,14,18,20                               |                  6 |
|      4 | 4,7,8,11,14,18,19,20,21                        |                  9 |
|      5 | 1,2,3,15,16,17,32                              |                  7 |
+--------+------------------------------------------------+--------------------+
I am not sure whether I can accomplish this in MySQL or whether I will need to program this step in my linked program.
This will count the number of hits that are the same.
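For illustration, here is a minimal sketch of such a count. It is not the original query: it loads the five sequences from the question into an in-memory SQLite database through Python so that it can run standalone. The self-join itself is plain SQL and should carry over to MySQL; the bound parameter `?` plays the role of the ### placeholder, and the aliases `a`, `b` and `shared_hits` are assumptions made for this example.

```python
import sqlite3

# Hit lists copied from the GROUP_CONCAT output in the question.
HITS = {
    1: [4, 5, 6, 9, 10, 14, 19, 20, 21],
    2: [4, 6, 9, 10, 14, 18, 19, 20, 21],
    3: [6, 12, 13, 14, 18, 20],
    4: [4, 7, 8, 11, 14, 18, 19, 20, 21],
    5: [1, 2, 3, 15, 16, 17, 32],
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE polished_data (seq_id INTEGER, hit_name_id INTEGER)")
conn.executemany(
    "INSERT INTO polished_data VALUES (?, ?)",
    [(seq, hit) for seq, hits in HITS.items() for hit in hits],
)

# Self-join on hit_name_id: every matching row pair is one hit that the
# target sequence (a) has in common with another sequence (b).
SHARED_SQL = """
SELECT b.seq_id, COUNT(*) AS shared_hits
FROM polished_data AS a
JOIN polished_data AS b
  ON b.hit_name_id = a.hit_name_id
 AND b.seq_id <> a.seq_id
WHERE a.seq_id = ?
GROUP BY b.seq_id
ORDER BY shared_hits DESC, b.seq_id
"""

shared = conn.execute(SHARED_SQL, (1,)).fetchall()
print(shared)  # [(2, 8), (4, 5), (3, 3)]
```

Because this is an inner join, a sequence with no hits in common with the target (seq_id 5 here) does not appear in the result at all.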
You can then extend this to calculate how many hits are different (i.e. they appear in one sequence or the other, but not both), then join the two results together.
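One way this extension might look, again as a hedged sketch rather than the original SQL: if each (seq_id, hit_name_id) pair occurs only once, the number of differing hits is the two hit counts added together minus twice the shared count. The example below assumes that uniqueness, uses an in-memory SQLite database via Python so it runs standalone, and uses made-up alias names (`t`, `tgt`, `s`, `shared_hits`, `different_hits`). The `:target` named parameter is an SQLite convention; in MySQL you would substitute the seq_id directly or use a prepared statement.

```python
import sqlite3

# Hit lists copied from the GROUP_CONCAT output in the question.
HITS = {
    1: [4, 5, 6, 9, 10, 14, 19, 20, 21],
    2: [4, 6, 9, 10, 14, 18, 19, 20, 21],
    3: [6, 12, 13, 14, 18, 20],
    4: [4, 7, 8, 11, 14, 18, 19, 20, 21],
    5: [1, 2, 3, 15, 16, 17, 32],
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE polished_data (seq_id INTEGER, hit_name_id INTEGER)")
conn.executemany(
    "INSERT INTO polished_data VALUES (?, ?)",
    [(seq, hit) for seq, hits in HITS.items() for hit in hits],
)

# t   : hit count per sequence
# tgt : hit count of the target sequence
# s   : hits shared with the target (LEFT JOIN keeps sequences sharing none)
# different = hits in exactly one of the two sequences
#           = count(seq) + count(target) - 2 * shared
COMPARE_SQL = """
SELECT t.seq_id,
       COALESCE(s.shared_hits, 0) AS shared_hits,
       t.n + tgt.n - 2 * COALESCE(s.shared_hits, 0) AS different_hits
FROM (SELECT seq_id, COUNT(*) AS n
      FROM polished_data GROUP BY seq_id) AS t
CROSS JOIN (SELECT COUNT(*) AS n
            FROM polished_data WHERE seq_id = :target) AS tgt
LEFT JOIN (SELECT b.seq_id, COUNT(*) AS shared_hits
           FROM polished_data AS a
           JOIN polished_data AS b ON b.hit_name_id = a.hit_name_id
           WHERE a.seq_id = :target
           GROUP BY b.seq_id) AS s
       ON s.seq_id = t.seq_id
WHERE t.seq_id <> :target
ORDER BY t.seq_id
"""

rows = conn.execute(COMPARE_SQL, {"target": 1}).fetchall()
for seq_id, shared_hits, different_hits in rows:
    # shared / (shared + different) is the Jaccard similarity of the hit sets.
    similarity = shared_hits / (shared_hits + different_hits)
    print(seq_id, shared_hits, different_hits, round(similarity, 2))
```

With seq_id 1 as the target, sequence 2 comes out at 8 shared and 2 different hits (similarity 0.8), while sequence 5 shares nothing.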
Using random data you get something like this (tested using ### replaced by 1).
Change the ### in the above SQL to be the seq_id you want to compare against.
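If the grouping step ends up in the linked program rather than in MySQL, one possible approach is sketched below. Everything in it is an assumption made for illustration: similarity is measured as shared hits divided by total distinct hits (Jaccard), any two sequences at or above the threshold are merged into one group (single linkage, via a small union-find), and the group id is simply the smallest seq_id in the group. The function names `jaccard` and `assign_groups` are hypothetical.

```python
from itertools import combinations

# Hit sets copied from the GROUP_CONCAT output in the question.
HITS = {
    1: {4, 5, 6, 9, 10, 14, 19, 20, 21},
    2: {4, 6, 9, 10, 14, 18, 19, 20, 21},
    3: {6, 12, 13, 14, 18, 20},
    4: {4, 7, 8, 11, 14, 18, 19, 20, 21},
    5: {1, 2, 3, 15, 16, 17, 32},
}


def jaccard(a, b):
    """Shared hits divided by all distinct hits of the two sequences."""
    return len(a & b) / len(a | b)


def assign_groups(hits, threshold=0.7):
    """Map each seq_id to a group id (the smallest seq_id in its group)."""
    parent = {seq: seq for seq in hits}

    def find(seq):
        # Walk up to the group representative, shortening the path as we go.
        while parent[seq] != seq:
            parent[seq] = parent[parent[seq]]
            seq = parent[seq]
        return seq

    for a, b in combinations(sorted(hits), 2):
        if jaccard(hits[a], hits[b]) >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # Keep the smaller seq_id as the representative.
                parent[max(root_a, root_b)] = min(root_a, root_b)

    return {seq: find(seq) for seq in hits}


groups = assign_groups(HITS, threshold=0.7)
print(groups)  # {1: 1, 2: 1, 3: 3, 4: 4, 5: 5}

looser = assign_groups(HITS, threshold=0.5)
print(looser)  # {1: 1, 2: 1, 3: 3, 4: 1, 5: 5}
```

On the sample data, a 0.7 threshold only merges sequences 1 and 2; sequence 4 joins them once the threshold drops to 0.5, because its similarity to sequence 2 is exactly 0.5. Which measure and threshold are appropriate depends on the real data.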