The Archive Base

Editorial Team
Asked: May 10, 2026 (2026-05-10T20:51:25+00:00)

I have a table of the form

    CREATE TABLE data (
        pk INT PRIMARY KEY AUTO_INCREMENT,
        dt BLOB
    );

It has about 160,000 rows and about 2 GB of data in the blob column (avg. 14 KB per blob). Another table has foreign keys into this table.

Something like 3000 of the blobs are identical. So what I want is a query that will give me a remap table that will allow me to remove the duplicates.

The naive approach took about an hour on 30-40k rows:

    SELECT a.pk, MIN(b.pk)
    FROM data AS a
    JOIN data AS b ON a.dt = b.dt
    WHERE b.pk < a.pk
    GROUP BY a.pk;

I happen to have, for other reasons, a table that has the sizes of the blobs:

    CREATE TABLE sizes (
        fk INT,  -- note: non-unique
        sz INT
        -- other cols
    );

With one index on fk and another on sz, the direct query using that table takes about 24 sec with 50k rows:

    SELECT da.pk, MIN(db.pk)
    FROM data AS da
    JOIN data AS db
    JOIN sizes AS sa
    JOIN sizes AS sb
      ON  sa.sz = sb.sz
      AND da.pk = sa.fk
      AND db.pk = sb.fk
    WHERE sb.fk < sa.fk
      AND da.dt = db.dt
    GROUP BY da.pk;

However, that does a full table scan on da (the data table). Given that the hit rate should be fairly low, I'd think that an index scan would be better. With that in mind I added a third copy of data as a fifth join to get that, and lost about 3 sec.

OK so for the question: Am I going to get much better than the second select? If so, how?

A bit of a corollary is: if I have a table where the key columns get very heavy use but the rest should only rarely be used, will I ever be better off adding another join of that table to encourage an index scan vs. a full table scan?


Xgc on #mysql@irc.freenode.net points out that adding a utility table like sizes, but with a unique constraint on fk, might help a lot. Some fun with triggers and whatnot might make it not too bad to keep up to date.
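
A minimal sketch of what such a utility table and its triggers might look like in MySQL. The table name data_sizes, the index name, and the trigger names are hypothetical and not from the original post, and the update trigger assumes pk itself is never changed.

```sql
-- Hypothetical utility table: exactly one row per data.pk, so fk can be unique.
CREATE TABLE data_sizes (
    fk INT PRIMARY KEY,   -- unique, unlike sizes.fk
    sz INT,               -- blob length in bytes; NULL when dt is NULL
    KEY idx_sz (sz)
);

-- One-time backfill from the existing rows.
INSERT INTO data_sizes (fk, sz)
SELECT pk, LENGTH(dt) FROM data;

-- Keep the utility table in step with data.
-- In an AFTER INSERT trigger, NEW.pk already holds the generated AUTO_INCREMENT value.
CREATE TRIGGER data_sizes_ai AFTER INSERT ON data
FOR EACH ROW
    INSERT INTO data_sizes (fk, sz) VALUES (NEW.pk, LENGTH(NEW.dt));

CREATE TRIGGER data_sizes_au AFTER UPDATE ON data
FOR EACH ROW
    UPDATE data_sizes SET sz = LENGTH(NEW.dt) WHERE fk = NEW.pk;

CREATE TRIGGER data_sizes_ad AFTER DELETE ON data
FOR EACH ROW
    DELETE FROM data_sizes WHERE fk = OLD.pk;
```

Each trigger body is a single statement, so no BEGIN ... END block or DELIMITER change should be needed in the mysql client.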

1 Answer

  1. Added an answer on May 10, 2026 at 8:51 pm

    You can always use a hashing function (MD5 or SHA1) for your data and then compare the hashes.

    The question is whether you can store the hashes in your database.
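
One way the hashing idea could be sketched in MySQL, assuming the schema may be changed. The column dt_hash, the index idx_dt_hash, and the table remap with its columns old_pk and new_pk are hypothetical names, not from the original post.

```sql
-- Hypothetical extra column holding a 20-byte SHA-1 digest of each blob.
ALTER TABLE data ADD COLUMN dt_hash BINARY(20);

-- SHA1() returns a 40-character hex string; UNHEX() packs it into 20 bytes.
UPDATE data SET dt_hash = UNHEX(SHA1(dt));

CREATE INDEX idx_dt_hash ON data (dt_hash);

-- Remap table: each duplicate pk is paired with the lowest pk holding the same blob.
CREATE TABLE remap AS
SELECT a.pk AS old_pk, MIN(b.pk) AS new_pk
FROM data AS a
JOIN data AS b
  ON  b.dt_hash = a.dt_hash   -- short indexed comparison narrows the candidates
  AND b.pk < a.pk
WHERE b.dt = a.dt             -- full blob comparison guards against hash collisions
GROUP BY a.pk;
```

The join compares 20-byte digests through the index, so the full blobs should only be compared for rows whose hashes already match. With a table like remap in hand, the foreign keys in the referencing table would be rewritten from old_pk to new_pk before the rows listed under old_pk are deleted from data.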


© 2021 The Archive Base. All Rights Reserved