I know there have been a lot of questions about sql query performance improvement,

Question

Asked: June 10, 2026

I know there have been a lot of questions about SQL query performance improvement, but I was not able to use the answers to those questions to improve my queries' performance (enough).

Since I wanted something more flexible than rsync & fslint, I’ve written a little Java tool that walks file trees and stores paths & checksums in a MySQL database.

You’ll find my table structure here:
http://code.google.com/p/directory-scanner/source/browse/trunk/sql/create_table.sql
At first I only had one table, but then I thought I could save a lot of space by moving the redundant, quite long directory path strings into a separate table and making it a 1:n relationship.

I’ve defined those two indexes:

CREATE INDEX files_sha1 ON files (sha1);
CREATE INDEX files_size ON files (size);

Now the queries that bug me are those:
http://code.google.com/p/directory-scanner/source/browse/trunk/sql/reporingQueries.sql

The worst of them is the last one, which should almost always return an empty set (it looks for sha1 collisions and for files that were mistakenly inserted more than once):

SELECT 
    d.path, 
    d.id, 
    f.filename, 
    f.id, 
    f.size, 
    f.scandate, 
    f.sha1, 
    f.lastmodified 
FROM files f 
INNER JOIN directories d 
    ON d.id = f.dir_id 
WHERE EXISTS ( /* same sha1 but different size */ 
    SELECT ff.id 
    FROM files ff 
    WHERE ff.sha1 = f.sha1 
    AND ff.size <> f.size 
) 
OR EXISTS ( /* files with same name and path but different id */ 
    SELECT ff2.id 
    FROM files ff2 
    INNER JOIN directories dd2 
        ON dd2.id = ff2.dir_id 
    WHERE ff2.id <> f.id 
    AND ff2.filename = f.filename 
    AND dd2.path = d.path 
) 
ORDER BY f.sha1

It ran well enough, in less than a second, as long as I had only 20k rows (after creating my indexes), but now that I have 750k rows, it literally runs for hours, and MySQL fully uses one of my CPU cores the whole time.

EXPLAIN for this query gives this result:

id ; select_type ; table ; type ; possible_keys ; key ; key_len ; ref ; rows ; filtered ; Extra
1 ; PRIMARY ; d ; ALL ; PRIMARY ; NULL ; NULL ; NULL ; 56855 ; 100.0 ; Using temporary; Using filesort
1 ; PRIMARY ; f ; ref ; dir_id ; dir_id ; 4 ; files.d.id ; 13 ; 100.0 ; Using where
3 ; DEPENDENT SUBQUERY ; dd2 ; ALL ; PRIMARY ; NULL ; NULL ; NULL ; 56855 ; 100.0 ; Using where
3 ; DEPENDENT SUBQUERY ; ff2 ; ref ; dir_id ; dir_id ; 4 ; files.dd2.id ; 13 ; 100.0 ; Using where
2 ; DEPENDENT SUBQUERY ; ff ; ref ; files_sha1 ; files_sha1 ; 23 ; files.f.sha1 ; 1 ; 100.0 ; Using where

My other queries are also not quick with 750k rows, but they at least finish within 15 minutes or so (however, I would like them to also work with millions of rows).

UPDATE: Thanks radashk for the comment, but the indexes you suggested seem to be created automatically by MySQL:

"Table","Non_unique","Key_name","Seq_in_index","Column_name","Collation","Cardinality","Sub_part","Packed","Null","Index_type","Comment","Index_comment"
"files","0","PRIMARY","1","id","A","698397","NULL","NULL",,"BTREE",,
"files","1","dir_id","1","dir_id","A","53722","NULL","NULL",,"BTREE",,
"files","1","scanDir_id","1","scanDir_id","A","16","NULL","NULL","YES","BTREE",,
"files","1","files_sha1","1","sha1","A","698397","NULL","NULL","YES","BTREE",,
"files","1","files_size","1","size","A","174599","NULL","NULL",,"BTREE",,

UPDATE2: Thanks Eugen Rieck! I consider your answer a good replacement for this query. Since it will most likely return an empty set anyway, I will just select the data needed to describe the problem to the user later, in another query.
To make me really happy it would be great if someone could take a look at my other queries as well 😀

UPDATE3: The answer from Justin Swanhart inspired the following solution: instead of having queries to check for directories and files that have been inserted multiple times unintentionally, just create unique constraints like this:

ALTER TABLE directories ADD CONSTRAINT uc_dir_path UNIQUE (path);
ALTER TABLE files ADD CONSTRAINT uc_files UNIQUE(dir_id, filename);

However, I wonder how much this would negatively affect the performance of INSERT statements. Could somebody comment on this, please?

UPDATE4:

ALTER TABLE directories ADD CONSTRAINT uc_dir_path UNIQUE (path);

doesn’t work, since the key is too long:

ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes
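
One possible workaround, sketched here under assumptions that are not part of the original schema: store a fixed-length hash of the path in an extra column and put the unique constraint on that column instead. The column name path_sha1 and the constraint name uc_dir_path_sha1 are hypothetical. A unique prefix index on path would not be equivalent, because it would only enforce uniqueness of the indexed prefix.

```sql
-- Assumption: path_sha1 is a hypothetical new column, not part of the original schema.
ALTER TABLE directories ADD COLUMN path_sha1 CHAR(40) NULL;

-- SHA1() is a built-in MySQL function that returns a 40-character hex string.
UPDATE directories SET path_sha1 = SHA1(path);

-- A 40-character column stays well under the 767-byte key limit.
-- This statement fails if duplicate paths already exist.
ALTER TABLE directories ADD CONSTRAINT uc_dir_path_sha1 UNIQUE (path_sha1);
```

With this approach the application would also have to fill path_sha1 on every insert, and uniqueness of paths would rest on the hash not colliding.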

UPDATE5:

Okay, this is the solution I’m gonna use to replace the query I quoted above in my initial question:

For the first part, finding sha1 collisions, I will use this:

SELECT sha1
FROM files
GROUP BY sha1
HAVING COUNT(*)>1
AND MIN(size)<>MAX(size)

And if it returns anything, I will select the details with another query WHERE sha1 = ?
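
A minimal sketch of what that follow-up query could look like, using only the columns from the original report query. The statement name collision_details and the variable @sha1 are made up for illustration, and the value is taken from the collision check itself rather than from a literal.

```sql
-- Pick one colliding sha1 from the check above (NULL if there is none).
SET @sha1 = (
    SELECT sha1
    FROM files
    GROUP BY sha1
    HAVING COUNT(*) > 1
    AND MIN(size) <> MAX(size)
    LIMIT 1
);

-- The ? placeholder is bound to @sha1 when the statement is executed.
PREPARE collision_details FROM
    'SELECT d.path, f.filename, f.id, f.size, f.scandate, f.lastmodified
     FROM files f
     INNER JOIN directories d ON d.id = f.dir_id
     WHERE f.sha1 = ?
     ORDER BY f.size';

EXECUTE collision_details USING @sha1;
DEALLOCATE PREPARE collision_details;
```

Because the lookup filters on a single sha1 value, it should be able to use the existing files_sha1 index and touch only the handful of colliding rows.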

I guess this query will run best, with this index defined:

CREATE INDEX sha1_size ON files (sha1, size);

To verify that no duplicated directories exist, I will use this, since MySQL doesn’t allow the constraint (see UPDATE4 above):

SELECT path
FROM directories
GROUP BY path
HAVING COUNT(*)>1

And for the duplicated files I will try to create this constraint:

CREATE UNIQUE INDEX filename_dir ON files (filename, dir_id);

This runs quite fast (15 to 20 sec) and I don’t need to create other indexes before it to make it faster. Also, the error message contains the details I need to display the problem to the user (which is unlikely anyway, since I check for those things before inserting).

Now there are only 5 more queries to speed up 😉 Thanks for the great help so far, Eugen & Justin!

UPDATE6: Okay, so since it’s been a few days since the last response from anybody, I’m just gonna accept Justin’s answer, since that was the one that helped me the most. I incorporated what I learned from both of you into my app and released version 0.0.4 here: http://code.google.com/p/directory-scanner/downloads/detail?name=directory-scanner-0.0.4-jar-with-dependencies.jar

1 Answer

Editorial Team · Answer 1 · 2026-06-10T01:05:52+00:00

While I can’t verify this without building and populating your tables, I’d try something like

-- This checks the SHA1 collisions
SELECT
  MIN(id) AS id
FROM files
GROUP BY sha1
HAVING COUNT(*)>1
AND MIN(size)<>MAX(size)

-- This checks for directory duplicates
SELECT
  MIN(path) AS path
FROM directories
GROUP BY path
HAVING COUNT(*)>1

-- This checks for file duplicates
SELECT
  MIN(f.id) AS id
FROM files AS f
INNER JOIN files AS ff 
   ON f.dir_id=ff.dir_id
   AND f.filename=ff.filename
GROUP BY f.id
HAVING COUNT(*)>1

Run one after the other.

Edit

The 3rd query was bogus, sorry for that.
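
As an untested alternative for the file-duplicate check, the self-join could probably be avoided by grouping directly on the pair that is supposed to be unique. This is only a sketch; the alias copies is made up for illustration.

```sql
-- Lists every (dir_id, filename) pair that occurs more than once,
-- together with the number of rows that share it.
SELECT dir_id, filename, COUNT(*) AS copies
FROM files
GROUP BY dir_id, filename
HAVING COUNT(*) > 1;
```

An index covering dir_id and filename may let MySQL answer this from the index alone, but that would need to be checked with EXPLAIN on the real data.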
