i have a table eng-jap which is essentially just a translation table, with an english and a japanese column. a script i made somehow caused every insert to be cloned, so there are now 1000s of duplicate entries in this table, for example:
duplicate example A
eng jap
"mother washes every day" "母は毎日洗濯する"
"mother washes every day" "母は毎日洗濯する"
if it were just one column i could use the query:
SELECT eng, COUNT(*) c FROM `eng-jap` GROUP BY eng HAVING c > 1
but the table can legitimately have duplicates in eng or in jap, as long as it's not in both. for example:
duplicate example B
eng jap
"mother washes every day" "母は毎日洗濯する"
"every day mother washes" "母は毎日洗濯する"
this is to allow one sentence to have more than one translation. so i need to alter the query to find duplicates as a combination of both columns i guess you could say.
once again, to be clear: example B is fine. i want to select all duplicates like example A so i can make a script to remove one row from each duplicated pair. please and Thank you!
I think you just need to group by eng and jap:
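A minimal sketch of that grouping, assuming the table holds the rows from examples A and B. Python's built-in sqlite3 is used here only as a runnable stand-in for MySQL; the SELECT itself should work unchanged on MySQL.

```python
import sqlite3

# SQLite (in memory) stands in for MySQL here; the SELECT is plain SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE `eng-jap` (eng TEXT, jap TEXT)")
conn.executemany(
    "INSERT INTO `eng-jap` (eng, jap) VALUES (?, ?)",
    [
        ("mother washes every day", "母は毎日洗濯する"),  # example A
        ("mother washes every day", "母は毎日洗濯する"),  # example A (clone)
        ("every day mother washes", "母は毎日洗濯する"),  # example B, legitimate
    ],
)

# Group on the pair, so a row only counts as a duplicate
# when BOTH columns match.
FIND_DUPLICATE_PAIRS = """
    SELECT eng, jap, COUNT(*) AS c
    FROM `eng-jap`
    GROUP BY eng, jap
    HAVING COUNT(*) > 1
"""
duplicate_pairs = conn.execute(FIND_DUPLICATE_PAIRS).fetchall()
print(duplicate_pairs)
```

Only the example A pair comes back, with a count of 2; the example B row shares its jap value but not its eng value, so it lands in its own group.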
And if you want to remove all duplicates, and your rows have an id, you can first select all the ids that you have to keep (it’s a trick that uses GROUP_CONCAT to find the first id of every combination of eng/jap), and then select the ids of the rows you have to delete.

I rewrote the second query using just a join; it should be a little faster, but if your table is too big it is probably still slow.
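The original queries are not reproduced here, so the following is only a sketch of the two ideas, assuming an integer id column. It swaps the GROUP_CONCAT trick for MIN(id), which yields the same first id per eng/jap combination and also runs on SQLite (used here as a stand-in for MySQL). The aliases dup and older are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE `eng-jap` (id INTEGER PRIMARY KEY, eng TEXT, jap TEXT)"
)
conn.executemany(
    "INSERT INTO `eng-jap` (id, eng, jap) VALUES (?, ?, ?)",
    [
        (1, "mother washes every day", "母は毎日洗濯する"),
        (2, "mother washes every day", "母は毎日洗濯する"),  # clone of 1
        (3, "every day mother washes", "母は毎日洗濯する"),  # legitimate
        (4, "mother washes every day", "母は毎日洗濯する"),  # clone of 1
    ],
)

# Ids to keep: the lowest id of every (eng, jap) combination.
IDS_TO_KEEP = """
    SELECT MIN(id) AS keep_id
    FROM `eng-jap`
    GROUP BY eng, jap
    ORDER BY keep_id
"""
keep_ids = [row[0] for row in conn.execute(IDS_TO_KEEP)]

# Ids to delete, using only a self-join: a row is redundant when an
# identical (eng, jap) row with a smaller id exists.
IDS_TO_DELETE = """
    SELECT DISTINCT dup.id
    FROM `eng-jap` AS dup
    JOIN `eng-jap` AS older
      ON dup.eng = older.eng
     AND dup.jap = older.jap
     AND dup.id > older.id
    ORDER BY dup.id
"""
delete_ids = [row[0] for row in conn.execute(IDS_TO_DELETE)]

print(keep_ids, delete_ids)
```

DISTINCT matters once a pair has three or more copies: without it, row 4 would be returned twice, once for each older copy it matches.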
If it is still too slow and it still doesn’t work, I would suggest adding two more columns to your table:
eng-hash, where you can save MD5(eng)
jap-hash, where you can save MD5(jap)
then update all of your records like this:
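A sketch of that update, under some assumptions: the CHAR(32) column type is a guess (an MD5 hex digest is 32 characters), and because SQLite has no built-in MD5(), a stand-in is registered from Python's hashlib, assuming UTF-8 text. On MySQL the two ALTER statements and the UPDATE should run as written, using the server's own MD5().

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE `eng-jap` (id INTEGER PRIMARY KEY, eng TEXT, jap TEXT)"
)
conn.executemany(
    "INSERT INTO `eng-jap` (id, eng, jap) VALUES (?, ?, ?)",
    [
        (1, "mother washes every day", "母は毎日洗濯する"),
        (2, "mother washes every day", "母は毎日洗濯する"),
        (3, "every day mother washes", "母は毎日洗濯する"),
    ],
)


def md5_hex(text):
    """Stand-in for MySQL's MD5(): hex digest of the UTF-8 bytes."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Only needed on SQLite; MySQL already provides MD5().
conn.create_function("MD5", 1, md5_hex)

# Add the two hash columns, then fill them for every existing row.
conn.execute("ALTER TABLE `eng-jap` ADD COLUMN `eng-hash` CHAR(32)")
conn.execute("ALTER TABLE `eng-jap` ADD COLUMN `jap-hash` CHAR(32)")
conn.execute(
    "UPDATE `eng-jap` SET `eng-hash` = MD5(eng), `jap-hash` = MD5(jap)"
)

hashes = conn.execute(
    "SELECT `eng-hash`, `jap-hash` FROM `eng-jap` ORDER BY id"
).fetchall()
```

Rows 1 and 2 end up with identical hash pairs, while row 3 shares only its jap-hash with them, so the pair of hashes identifies duplicates exactly as the pair of text columns does.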
then you can add a unique index on the table on both hash columns, ignoring all errors, and let MySQL do the work of eliminating duplicates for you:
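The statement itself is not reproduced here. Judging by the linked question, it was presumably of the form shown in the comment below; note that the IGNORE clause of ALTER TABLE was removed in MySQL 5.7.4, so this only applies to older servers. The runnable part is a sketch of the same idea with a swapped-in technique: copy the rows into a table that already has the unique index and let conflicting rows be skipped. It uses SQLite's INSERT OR IGNORE, and the table name deduped is made up for illustration.

```python
import hashlib
import sqlite3

# Presumed MySQL form (not run here; rejected by MySQL 5.7.4 and later):
#   ALTER IGNORE TABLE `eng-jap`
#       ADD UNIQUE INDEX (`eng-hash`, `jap-hash`);


def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


source_rows = [
    (1, "mother washes every day", "母は毎日洗濯する"),
    (2, "mother washes every day", "母は毎日洗濯する"),  # clone of 1
    (3, "every day mother washes", "母は毎日洗濯する"),  # legitimate
]

COLUMNS = (
    "id INTEGER PRIMARY KEY, eng TEXT, jap TEXT, "
    "`eng-hash` CHAR(32), `jap-hash` CHAR(32)"
)

conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE `eng-jap` ({COLUMNS})")
conn.executemany(
    "INSERT INTO `eng-jap` VALUES (?, ?, ?, ?, ?)",
    [(i, e, j, md5_hex(e), md5_hex(j)) for i, e, j in source_rows],
)

# Hypothetical target table with the unique index on both hashes.
conn.execute(
    f"CREATE TABLE deduped ({COLUMNS}, UNIQUE (`eng-hash`, `jap-hash`))"
)

# Rows that would violate the unique index are silently skipped,
# so the first copy (lowest id) of every pair survives.
conn.execute(
    "INSERT OR IGNORE INTO deduped "
    "SELECT id, eng, jap, `eng-hash`, `jap-hash` "
    "FROM `eng-jap` ORDER BY id"
)

remaining = conn.execute("SELECT id, eng FROM deduped ORDER BY id").fetchall()
print(remaining)
```

The unique index also keeps a buggy script from inserting the same pair again later, since the second insert fails instead of creating a clone.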
(if you get an error while creating index, see this question: MySQL: ALTER IGNORE TABLE gives "Integrity constraint violation")