The Archive Base

Editorial Team
Asked: June 10, 2026

I know there have been a lot of questions about sql query performance improvement,

I know there have been a lot of questions about SQL query performance improvement, but I was not able to use the answers to those questions to improve my queries’ performance (enough).

Since I wanted something more flexible than rsync & fslint, I’ve written a little Java tool that walks file trees and stores paths & checksums in a MySQL database.

You’ll find my table structure here:
http://code.google.com/p/directory-scanner/source/browse/trunk/sql/create_table.sql –
at first I only had one table, but then I thought I could save a lot of space if I moved the redundant, quite long directory path strings into a separate table and made it a 1:n relationship.

I’ve defined those two indexes:

CREATE INDEX files_sha1 ON files (sha1);
CREATE INDEX files_size ON files (size);

Now the queries that bug me are these:
http://code.google.com/p/directory-scanner/source/browse/trunk/sql/reporingQueries.sql

The worst of them is the last one, which should almost always return an empty set (it looks for sha1 collisions and for files that were mistakenly inserted more than once):

SELECT 
    d.path, 
    d.id, 
    f.filename, 
    f.id, 
    f.size, 
    f.scandate, 
    f.sha1, 
    f.lastmodified 
FROM files f 
INNER JOIN directories d 
    ON d.id = f.dir_id 
WHERE EXISTS ( /* same sha1 but different size */ 
    SELECT ff.id 
    FROM files ff 
    WHERE ff.sha1 = f.sha1 
    AND ff.size <> f.size 
) 
OR EXISTS ( /* files with same name and path but different id */ 
    SELECT ff2.id 
    FROM files ff2 
    INNER JOIN directories dd2 
        ON dd2.id = ff2.dir_id 
    WHERE ff2.id <> f.id 
    AND ff2.filename = f.filename 
    AND dd2.path = d.path 
) 
ORDER BY f.sha1

It ran well enough, in less than a second, as long as I had only 20k rows (after creating my indexes), but now that I have 750k rows it literally runs for hours, and MySQL fully uses up one of my CPU cores the whole time.

EXPLAIN for this query gives this result:

id ; select_type ; table ; type ; possible_keys ; key ; key_len ; ref ; rows ; filtered ; Extra
1 ; PRIMARY ; d ; ALL ; PRIMARY ; NULL ; NULL ; NULL ; 56855 ; 100.0 ; Using temporary; Using filesort
1 ; PRIMARY ; f ; ref ; dir_id ; dir_id ; 4 ; files.d.id ; 13 ; 100.0 ; Using where
3 ; DEPENDENT SUBQUERY ; dd2 ; ALL ; PRIMARY ; NULL ; NULL ; NULL ; 56855 ; 100.0 ; Using where
3 ; DEPENDENT SUBQUERY ; ff2 ; ref ; dir_id ; dir_id ; 4 ; files.dd2.id ; 13 ; 100.0 ; Using where
2 ; DEPENDENT SUBQUERY ; ff ; ref ; files_sha1 ; files_sha1 ; 23 ; files.f.sha1 ; 1 ; 100.0 ; Using where

My other queries are also not quick with 750k rows, but they at least finish within 15 minutes or so (however, I would like them to also work with millions of rows).

UPDATE: Thanks radashk for the comment, but the indexes you suggested seem to be created automatically by mysql –>

"Table","Non_unique","Key_name","Seq_in_index","Column_name","Collation","Cardinality","Sub_part","Packed","Null","Index_type","Comment","Index_comment"
"files","0","PRIMARY","1","id","A","698397","NULL","NULL",,"BTREE",,
"files","1","dir_id","1","dir_id","A","53722","NULL","NULL",,"BTREE",,
"files","1","scanDir_id","1","scanDir_id","A","16","NULL","NULL","YES","BTREE",,
"files","1","files_sha1","1","sha1","A","698397","NULL","NULL","YES","BTREE",,
"files","1","files_size","1","size","A","174599","NULL","NULL",,"BTREE",,

UPDATE2: Thanks Eugen Rieck! I consider your answer a good replacement for this query. Since it will most likely return an empty set anyway, I will just select the data needed to describe the problem to the user later, in another query.
To make me really happy, it would be great if someone could take a look at my other queries as well 😀

UPDATE3: The answer from Justin Swanhart led me to the following solution: instead of having queries to check for directories and files that have been inserted multiple times unintentionally, just create unique constraints like this:

ALTER TABLE directories ADD CONSTRAINT uc_dir_path UNIQUE (path);
ALTER TABLE files ADD CONSTRAINT uc_files UNIQUE(dir_id, filename);

However, I wonder how much this would negatively affect the performance of insert statements. Could somebody comment on this, please?

UPDATE4:

ALTER TABLE directories ADD CONSTRAINT uc_dir_path UNIQUE (path);

doesn’t work, since the key is too long:

ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes
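One common way around a key-length limit like this is to put the unique constraint on a fixed-length digest of the path instead of on the path itself. The following is only a sketch of that idea, not something from the question: the `path_sha1` column and both helper functions are hypothetical, and an in-memory SQLite database stands in for MySQL (SQLite has no 767-byte limit, so this shows the technique, not the error). Two different paths with the same SHA-1 would be rejected as duplicates, which is extremely unlikely but not impossible.

```python
import hashlib
import sqlite3


def path_digest(path: str) -> str:
    """Hex SHA-1 of the path: always 40 characters, however long the path is."""
    return hashlib.sha1(path.encode("utf-8")).hexdigest()


# SQLite stands in for MySQL; path_sha1 is a hypothetical extra column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE directories ("
    " id INTEGER PRIMARY KEY,"
    " path TEXT NOT NULL,"
    " path_sha1 TEXT NOT NULL UNIQUE)"
)


def insert_directory(conn, path):
    """Hypothetical helper: True if inserted, False if the path already exists."""
    try:
        conn.execute(
            "INSERT INTO directories (path, path_sha1) VALUES (?, ?)",
            (path, path_digest(path)),
        )
        return True
    except sqlite3.IntegrityError:
        return False


# A path far longer than 767 bytes still gets a 40-character key.
long_path = "/" + "/".join(["some_long_directory_name"] * 100)
inserted_first = insert_directory(conn, long_path)
inserted_again = insert_directory(conn, long_path)           # rejected as duplicate
inserted_other = insert_directory(conn, long_path + "/sub")  # different path, accepted
```

The cost is one extra column that the inserting code has to fill in consistently.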

UPDATE5:

Okay, this is the solution I’m gonna use to replace the query I quoted above in my initial question:

For the first part, finding sha1 collisions, I will use this:

SELECT sha1
FROM files
GROUP BY sha1
HAVING COUNT(*)>1
AND MIN(size)<>MAX(size)

And if it returns anything, I will select the details with another query WHERE sha1 = ?

I guess this query will run best, with this index defined:

CREATE INDEX sha1_size ON files (sha1, size);
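The two-step approach (a cheap grouped check first, details only if it finds something) can be tried out on a toy data set. This is a minimal sketch, assuming an in-memory SQLite database in place of MySQL and a cut-down `files` table. The sample rows are invented, and whether the `sha1_size` index actually speeds things up on MySQL is not something this sketch can show.

```python
import sqlite3

# SQLite stands in for MySQL; only the columns the check needs are created.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files ("
    " id INTEGER PRIMARY KEY, dir_id INTEGER, filename TEXT,"
    " sha1 TEXT, size INTEGER)"
)
conn.execute("CREATE INDEX sha1_size ON files (sha1, size)")
conn.executemany(
    "INSERT INTO files (dir_id, filename, sha1, size) VALUES (?, ?, ?, ?)",
    [
        (1, "a.txt", "aaa", 10),
        (2, "a_copy.txt", "aaa", 10),  # ordinary duplicate: same sha1, same size
        (1, "b.bin", "bbb", 20),
        (3, "c.bin", "bbb", 25),       # invented collision: same sha1, other size
    ],
)

# Step 1: the grouped check returns only sha1 values with differing sizes.
suspicious = [
    row[0]
    for row in conn.execute(
        "SELECT sha1 FROM files GROUP BY sha1 "
        "HAVING COUNT(*) > 1 AND MIN(size) <> MAX(size)"
    )
]

# Step 2: fetch details only for the (normally empty) list of suspicious values.
details = []
for sha1 in suspicious:
    details.extend(
        conn.execute(
            "SELECT id, filename, size FROM files WHERE sha1 = ? ORDER BY id",
            (sha1,),
        ).fetchall()
    )
```

In the usual case `suspicious` is empty and the second step never runs, which is the point of splitting the original query.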

For verifying that no duplicated directories exist, I will use this, since MySQL doesn’t allow a unique constraint on the path column (see UPDATE4 above):

SELECT path
FROM directories
GROUP BY path
HAVING COUNT(*)>1

And for the duplicated files I will try to create this constraint:

CREATE UNIQUE INDEX filename_dir ON files (filename, dir_id);

This runs quite fast (15 to 20 sec) and I don’t need to create other indexes before it to make it faster. Also the error message contains the details I need to display the problem to the user (which is unlikely anyway since I check for those things before inserting)
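Using the index creation itself as the duplicate check can be sketched like this, again with in-memory SQLite standing in for MySQL. The sample rows and the `try_create_unique_index` helper are made up for illustration, and the exact wording of the error text differs between database engines.

```python
import sqlite3

# SQLite stands in for MySQL; the table is reduced to the relevant columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files ("
    " id INTEGER PRIMARY KEY, dir_id INTEGER NOT NULL, filename TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO files (dir_id, filename) VALUES (?, ?)",
    [
        (1, "report.pdf"),
        (2, "report.pdf"),  # same name in another directory: allowed
        (1, "report.pdf"),  # same name in the same directory: a duplicate
    ],
)
conn.commit()


def try_create_unique_index(conn):
    """Hypothetical helper: None on success, otherwise the error text."""
    try:
        conn.execute("CREATE UNIQUE INDEX filename_dir ON files (filename, dir_id)")
        return None
    except sqlite3.DatabaseError as exc:
        return str(exc)


# With the duplicate row present, creating the index fails.
error_with_duplicates = try_create_unique_index(conn)

# After removing the duplicate, the same statement succeeds.
conn.execute("DELETE FROM files WHERE id = 3")
conn.commit()
error_after_cleanup = try_create_unique_index(conn)
```

Once the index exists, later inserts of a duplicate `(filename, dir_id)` pair are rejected by the database as well.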

Now there are only 5 more queries to make perform in less time 😉 thanks for the great help so far Eugen & Justin!

UPDATE6: Okay, so since it’s been a few days since the last response from anybody, I’m just gonna accept Justin’s answer, since that was the one that helped me the most. I incorporated what I learned from both of you into my app and released version 0.0.4 here: http://code.google.com/p/directory-scanner/downloads/detail?name=directory-scanner-0.0.4-jar-with-dependencies.jar

1 Answer

    Editorial Team
    Added an answer on June 10, 2026 at 1:05 am

    While I can’t verify this without building and populating your tables, I’d try something like

    -- This checks the SHA1 collisions:
    -- same sha1, but at least two different sizes
    SELECT
      MIN(id) AS id
    FROM files
    GROUP BY sha1
    HAVING COUNT(*)>1
    AND MIN(size)<>MAX(size);
    
    -- This checks for directory duplicates:
    -- the same path stored more than once
    SELECT
      MIN(path) AS path
    FROM directories
    GROUP BY path
    HAVING COUNT(*)>1;
    
    -- This checks for file duplicates:
    -- every row joins to itself, so COUNT(*)>1 means another row
    -- has the same dir_id and filename
    SELECT
      MIN(f.id) AS id
    FROM files AS f
    INNER JOIN files AS ff 
       ON f.dir_id=ff.dir_id
       AND f.filename=ff.filename
    GROUP BY f.id
    HAVING COUNT(*)>1;
    

    Run one after the other.
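To see what the directory and file duplicate checks return, here is a small sketch that runs them one after the other on invented rows. It assumes an in-memory SQLite database as a stand-in for MySQL and tables cut down to the columns the checks use.

```python
import sqlite3

# SQLite stands in for MySQL; all rows below are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE directories (id INTEGER PRIMARY KEY, path TEXT);
    CREATE TABLE files (id INTEGER PRIMARY KEY, dir_id INTEGER, filename TEXT);
    INSERT INTO directories (id, path) VALUES
        (1, '/data'), (2, '/backup'), (3, '/data');
    INSERT INTO files (id, dir_id, filename) VALUES
        (1, 1, 'a.txt'), (2, 1, 'b.txt'), (3, 1, 'a.txt'), (4, 2, 'a.txt');
    """
)

# Directory duplicates: paths that are stored more than once.
duplicate_paths = [
    row[0]
    for row in conn.execute(
        "SELECT MIN(path) AS path FROM directories "
        "GROUP BY path HAVING COUNT(*) > 1"
    )
]

# File duplicates: each row joins to itself, so a count above 1 means
# at least one other row shares its dir_id and filename.
duplicate_file_ids = sorted(
    row[0]
    for row in conn.execute(
        "SELECT MIN(f.id) AS id FROM files AS f "
        "INNER JOIN files AS ff "
        "  ON f.dir_id = ff.dir_id AND f.filename = ff.filename "
        "GROUP BY f.id HAVING COUNT(*) > 1"
    )
)
```

Note that the file check lists every row that takes part in a duplicate (here both ids 1 and 3), not one row per duplicate group.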

    Edit

    3rd query was bogus – sorry for that
