The Archive Base


Editorial Team
Asked: May 10, 2026


I have a table of the form

    CREATE TABLE data (
        pk INT PRIMARY KEY AUTO_INCREMENT,
        dt BLOB
    );

It has about 160,000 rows and about 2 GB of data in the blob column (avg. 14 KB per blob). Another table has foreign keys into this table.

Something like 3,000 of the blobs are identical. So what I want is a query that will give me a remap table (duplicate pk to the pk to keep) that will allow me to remove the duplicates.

The naive approach took about an hour on 30-40k rows:

    SELECT a.pk, MIN(b.pk)
    FROM data AS a
    JOIN data AS b
      ON a.dt = b.dt
    WHERE b.pk < a.pk
    GROUP BY a.pk;

I happen to have, for other reasons, a table that has the sizes of the blobs:

    CREATE TABLE sizes (
        fk INT,  -- note: non-unique
        sz INT
        -- other cols
    );

With one index on fk and another on sz, a query that goes through this table takes about 24 sec with 50k rows:

    SELECT da.pk, MIN(db.pk)
    FROM data AS da
    JOIN data AS db
    JOIN sizes AS sa
    JOIN sizes AS sb
      ON sa.sz = sb.sz
     AND da.pk = sa.fk
     AND db.pk = sb.fk
    WHERE sb.fk < sa.fk
      AND da.dt = db.dt
    GROUP BY da.pk;

However, that is doing a full table scan on da (the data table). Given that the hit rate should be fairly low, I'd think that an index scan would be better. With that in mind I added a third copy of data as a fifth join to get that, and lost about 3 sec.

OK so for the question: Am I going to get much better than the second select? If so, how?

A bit of a corollary: if I have a table where the key columns get very heavy use but the rest should only rarely be used, will I ever be better off adding another join of that table to encourage an index scan vs. a full table scan?


Xgc on #mysql@irc.freenode.net points out that adding a utility table like sizes, but with a unique constraint on fk, might help a lot. Some fun with triggers and whatnot might even make it not too bad to keep up to date.
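
A rough sketch of what that utility table and its triggers might look like in MySQL. The table name blob_sizes, the index name idx_sz, and the trigger names are made up for illustration, and the sketch assumes that pk is never updated in place:

```sql
-- Hypothetical utility table: exactly one row per data row, so fk can be unique.
CREATE TABLE blob_sizes (
    fk INT NOT NULL PRIMARY KEY,  -- unique, unlike sizes.fk
    sz INT,                       -- blob length in bytes (NULL if dt is NULL)
    KEY idx_sz (sz)
);

-- Backfill from the existing blobs.
INSERT INTO blob_sizes (fk, sz)
SELECT pk, LENGTH(dt) FROM data;

-- Keep the utility table in step with data.
-- In an AFTER INSERT trigger, NEW.pk already holds the AUTO_INCREMENT value.
CREATE TRIGGER data_after_insert AFTER INSERT ON data
FOR EACH ROW
    INSERT INTO blob_sizes (fk, sz) VALUES (NEW.pk, LENGTH(NEW.dt));

CREATE TRIGGER data_after_update AFTER UPDATE ON data
FOR EACH ROW
    UPDATE blob_sizes SET sz = LENGTH(NEW.dt) WHERE fk = NEW.pk;

CREATE TRIGGER data_after_delete AFTER DELETE ON data
FOR EACH ROW
    DELETE FROM blob_sizes WHERE fk = OLD.pk;

-- Same idea as the second select, but driven from the unique-key table:
-- pair up rows of equal size first, then compare the blobs only for those pairs.
SELECT sa.fk, MIN(sb.fk)
FROM blob_sizes AS sa
JOIN blob_sizes AS sb ON sb.sz = sa.sz AND sb.fk < sa.fk
JOIN data AS da ON da.pk = sa.fk
JOIN data AS db ON db.pk = sb.fk
WHERE da.dt = db.dt
GROUP BY sa.fk;
```

Whether this beats the 24 sec figure would have to be measured; the point is only that each data row then has one size row, so the size join cannot multiply rows.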


1 Answer

  1. Added an answer on May 10, 2026 at 8:51 pm

    You can always use a hashing function (MD5 or SHA1) for your data and then compare the hashes.

    The question is whether you can save the hashes in your database.
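
    If storing them is an option, a minimal sketch in MySQL could look like the following. The column name dt_hash and the index name idx_dt_hash are assumptions for illustration, not something from the question:

```sql
-- Assumed new column on data holding a 20-byte SHA-1 digest of each blob.
ALTER TABLE data ADD COLUMN dt_hash BINARY(20);

-- One pass over the blobs. SHA1() returns 40 hex characters;
-- UNHEX() packs them into 20 raw bytes.
UPDATE data SET dt_hash = UNHEX(SHA1(dt));

-- Index the digest so candidate duplicates are found by index lookup
-- rather than by comparing 14 KB blobs against each other.
CREATE INDEX idx_dt_hash ON data (dt_hash);

-- Remap table: each duplicate pk paired with the lowest pk holding the same blob.
-- The a.dt = b.dt check is kept so a hash collision cannot merge different blobs.
SELECT a.pk, MIN(b.pk) AS keep_pk
FROM data AS a
JOIN data AS b
  ON a.dt_hash = b.dt_hash
WHERE b.pk < a.pk
  AND a.dt = b.dt
GROUP BY a.pk;
```

    The full blob comparison then only runs for rows whose digests already match, which should be roughly the 3,000 duplicates rather than every pair.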
