This should do it: $gs_relations = new GSRelations(); $part =…

Question

0

Asked: May 10, 20262026-05-10T20:51:25+00:00 2026-05-10T20:51:25+00:00

I have a table of the form CREATE TABLE data { pk INT PRIMARY

0

I have a table of the form

CREATE TABLE data {    pk INT PRIMARY KEY AUTO_INCREMENT,    dt BLOB };

It has about 160,000 rows and about 2GB of data in the blob column (avg. 14kb per blob). Another table has foreign keys into this table.

Something like 3000 of the blobs are identical. So what I want is a query that will give me a re map table that will allow me to remove the duplicates.

The naive approach took about an hour on 30-40k rows:

SELECT a.pk, MIN(b.pk)      FROM data AS a      JOIN data AS b   ON a.dt=b.dt   WHERE b.pk < a.pk   GROUP BY a.pk;

I happen to have, for other reasons, a table that has the sizes of the blobs:

CREATE TABLE sizes (    fk INT,  // note: non-unique    sz INT    // other cols );

By building indexes for both fk and another for sz the direct query from that takes about 24 sec with 50k rows:

SELECT da.pk,MIN(db.pk)    FROM data AS da   JOIN data AS db   JOIN sizes AS sa   JOIN sizes AS sb   ON         sa.size=sb.size     AND da.pk=sa.fk     AND db.pk=sb.fk   WHERE         sb.fk<sa.fk     AND da.dt=db.dt    GROUP BY da.pk;

However that is doing a full table scan on da (the data table). Given that the hit rate should be fairly low I’d think that an index scan would be better. With that in mind in added a 3rd copy of data as a 5th join to get that, and lost about 3 sec.

OK so for the question: Am I going to get much better than the second select? If so, how?

A bit of a corollary is: if I have a table where the key column’s get very heavy use but the rest should only get rarely used, will I ever be better off adding another join of that table to encourage an index scan vs. a full table scan?

Xgc on #mysql@irc.freenode.net points out that the adding a utility table like sizes but with a unique constraint on fk might help a lot. Some fun with triggers and what not might make it even not to bad to keep up to date.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

score 0 · Answer 1 · 2026-05-10T20:51:25+00:00

2026-05-10T20:51:25+00:00Added an answer on May 10, 2026 at 8:51 pm

You can always use a hashing function (MD5 or SHA1) for your data and then compare the hashes.

The question is if you can save the hashes in your database?

0

Reply
Share
Share

- Report

How to approach applying for a job at a company ...

How to handle personal stress caused by utterly incompetent and ...

What is a programmer’s life like?

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions