I’m trying to dedup a table, where I know there are ‘close’ (but not

Asked: May 28, 2026 (2026-05-28T22:06:01+00:00)

I’m trying to dedup a table, where I know there are ‘close’ (but not exact) rows that need to be removed.

I have a single table with 22 fields, and uniqueness can be established by comparing 5 of those fields. Of the remaining 17 fields (including the unique key), 3 differ on every row, so each row is technically unique and a standard dedup method will not work.

I was looking at the multi-table delete method outlined here: http://blog.krisgielen.be/archives/111 but I can’t make sense of the final line of code (AND M1.cd*100+M1.track > M2.cd*100+M2.track), as I am unsure what the cd*100 part achieves.
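For reference, the expression `cd*100+track` appears to pack two columns into a single sortable number, so that one `>` comparison orders rows by cd first and track second; in a self-join delete this would presumably mean that, of every duplicate pair, only the row with the larger key is deleted. The following is a minimal Python sketch of the arithmetic only, assuming track numbers stay below 100; the function name `sort_key` and the sample pairs are hypothetical.

```python
def sort_key(cd: int, track: int) -> int:
    # Pack (cd, track) into one integer; only valid while 0 <= track < 100.
    return cd * 100 + track

rows = [(1, 12), (2, 3), (1, 2)]  # hypothetical (cd, track) pairs
packed = sorted(rows, key=lambda r: sort_key(*r))

# Ordering by the packed key matches ordering by the (cd, track) tuple.
assert packed == sorted(rows)
print(packed)
```

In a table that already has a unique AUTO_INCREMENT column, that column could serve as the tiebreaker directly, with no need for a packed key.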

Can anyone assist me with this? I suspect I could do better by exporting the whole thing to Python, doing something with it, then re-importing it, but then (1) I’m stuck with not knowing how to dedup the strings anyway, and (2) I had to break the records into chunks to be able to import them into MySQL, as it was timing out after 300 seconds, so it turned into a whole debacle to get into MySQL in the first place. (I am very much a novice at both MySQL and Python.)

The table is a dump of some 40 log files from some testing. The test set for each log is some 20,000 files. The repeating values are either the test conditions, the file name/parameters or the results of the tests.

    SHOW CREATE TABLE:

    CREATE TABLE `t1` (
     `DROID_V` int(1) DEFAULT NULL,
     `Sig_V` varchar(7) DEFAULT NULL,
     `SPEED` varchar(4) DEFAULT NULL,
     `ID` varchar(7) DEFAULT NULL,
     `PARENT_ID` varchar(10) DEFAULT NULL,
     `URI` varchar(10) DEFAULT NULL,
     `FILE_PATH` varchar(68) DEFAULT NULL,
     `NAME` varchar(17) DEFAULT NULL,
     `METHOD` varchar(10) DEFAULT NULL,
     `STATUS` varchar(14) DEFAULT NULL,
     `SIZE` int(10) DEFAULT NULL,
     `TYPE` varchar(10) DEFAULT NULL,
     `EXT` varchar(4) DEFAULT NULL,
     `LAST_MODIFIED` varchar(10) DEFAULT NULL,
     `EXTENSION_MISMATCH` varchar(32) DEFAULT NULL,
     `MD5_HASH` varchar(10) DEFAULT NULL,
     `FORMAT_COUNT` varchar(10) DEFAULT NULL,
     `PUID` varchar(15) DEFAULT NULL,
     `MIME_TYPE` varchar(24) DEFAULT NULL,
     `FORMAT_NAME` varchar(10) DEFAULT NULL,
     `FORMAT_VERSION` varchar(10) DEFAULT NULL,
     `INDEX` int(11) NOT NULL AUTO_INCREMENT,
     PRIMARY KEY (`INDEX`)
    ) ENGINE=MyISAM AUTO_INCREMENT=960831 DEFAULT CHARSET=utf8

The only unique field is the primary key, ‘INDEX’.

Unique records can be established by looking at DROID_V, Sig_V, SPEED, NAME and PUID.

Of the ~900,000 rows, I have about 10,000 dups that are either a single duplicate of a record or up to 6 repetitions of the record.

Row examples: As Is

    5;"v37";"slow";"10266";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/7";"image/tiff";"Tagged Ima";"3";"191977"
    5;"v37";"slow";"10268";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/8";"image/tiff";"Tagged Ima";"4";"191978"
    5;"v37";"slow";"10269";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/9";"image/tiff";"Tagged Ima";"5";"191979"
    5;"v37";"slow";"10270";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/10";"image/tiff";"Tagged Ima";"6";"191980"
    5;"v37";"slow";"12766";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/7";"image/tiff";"Tagged Ima";"3";"193977"
    5;"v37";"slow";"12768";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/8";"image/tiff";"Tagged Ima";"4";"193978"
    5;"v37";"slow";"12769";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/9";"image/tiff";"Tagged Ima";"5";"193979"
    5;"v37";"slow";"12770";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/10";"image/tiff";"Tagged Ima";"6";"193980"

Row Example: As It should be

    5;"v37";"slow";"10266";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/7";"image/tiff";"Tagged Ima";"3";"191977"
    5;"v37";"slow";"10268";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/8";"image/tiff";"Tagged Ima";"4";"191978"
    5;"v37";"slow";"10269";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/9";"image/tiff";"Tagged Ima";"5";"191979"
    5;"v37";"slow";"10270";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/10";"image/tiff";"Tagged Ima";"6";"191980"
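One way the Python route could look is sketched below: group rows by the five key fields and keep the row with the lowest INDEX in each group. This is only a sketch under stated assumptions, not a tested procedure for the full table; the function name `dedup` and the dict-per-row layout are hypothetical, and the sample data carries only the key fields and INDEX from the eight rows above.

```python
# Columns that define a duplicate, per the question.
KEY_FIELDS = ("DROID_V", "Sig_V", "SPEED", "NAME", "PUID")

def dedup(rows):
    """Keep the row with the lowest INDEX for each distinct key."""
    kept = {}
    for row in sorted(rows, key=lambda r: r["INDEX"]):
        key = tuple(row[f] for f in KEY_FIELDS)
        kept.setdefault(key, row)  # the first (lowest INDEX) row wins
    return list(kept.values())

# Key fields and INDEX taken from the eight sample rows above.
sample = [
    {"DROID_V": 5, "Sig_V": "v37", "SPEED": "slow",
     "NAME": "V1-FL425817.tif", "PUID": puid, "INDEX": index}
    for puid, index in [
        ("fmt/7", 191977), ("fmt/8", 191978), ("fmt/9", 191979), ("fmt/10", 191980),
        ("fmt/7", 193977), ("fmt/8", 193978), ("fmt/9", 193979), ("fmt/10", 193980),
    ]
]

unique_rows = dedup(sample)
print([r["INDEX"] for r in unique_rows])  # [191977, 191978, 191979, 191980]
```

Sorting by INDEX first means the surviving row is always the earliest one, which matches the ‘as it should be’ rows.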

Please note, you can see from the index column at the end that I have cut out some other rows; I have only identified a very small set of repeating rows. Please let me know if you need any more ‘noise’ from the rest of the DB.

Thanks.

1 Answer

Editorial Team · Answer 1 · 2026-05-28T22:06:01+00:00

I figured out a fix using the count function. I had been using COUNT(*), which simply counted every row in the table; by using COUNT(DISTINCT NAME) instead, I am able to weed out the dup rows that fit the dup criteria (as set out by the field selection in the WHERE clause).

Example:

    SELECT `PUID`, `DROID_V`, `SIG_V`, `SPEED`, COUNT(DISTINCT NAME) AS Hit
    FROM sourcelist, main_small
    WHERE sourcelist.SourcePUID = 'MyVariableHere'
      AND main_small.NAME = sourcelist.SourceFileName
    GROUP BY `PUID`, `DROID_V`, `SIG_V`, `SPEED`
    ORDER BY `DROID_V` ASC, `SIG_V` ASC, `SPEED`;
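To illustrate the difference between the two counts, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL. The `demo` table and its four rows are made up for illustration and are not taken from the real data. Note that this only changes what is counted; the repeated rows stay in the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (PUID TEXT, NAME TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [("fmt/7", "a.tif"), ("fmt/7", "a.tif"), ("fmt/7", "b.tif"), ("fmt/8", "a.tif")],
)

# COUNT(*) counts every row per group; COUNT(DISTINCT NAME) counts each name once.
counts = conn.execute(
    "SELECT PUID, COUNT(*), COUNT(DISTINCT NAME) "
    "FROM demo GROUP BY PUID ORDER BY PUID"
).fetchall()
print(counts)  # [('fmt/7', 3, 2), ('fmt/8', 1, 1)]
```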
