I need to find duplicates in a table. In MySQL I simply write:
SELECT *,count(id) count FROM `MY_TABLE`
GROUP BY SOME_COLUMN ORDER BY count DESC
This query nicely:
- Finds duplicates based on SOME_COLUMN, giving its repetition count.
- Sorts in desc order of repetition, which is useful to quickly scan major dups.
- Chooses a random value for all remaining columns, giving me an idea of values in those columns.
Similar query in Postgres greets me with an error:
column “MY_TABLE.SOME_COLUMN” must appear in the GROUP BY clause or be
used in an aggregate function
What is the Postgres equivalent of this query?
PS: I know that MySQL behaviour deviates from SQL standards.
Back-ticks are a non-standard MySQL thing. Use the canonical double quotes to quote identifiers (possible in MySQL, too). That is, if your table in fact is named
"MY_TABLE"(all upper case). If you (more wisely) named itmy_table(all lower case), then you can remove the double quotes or use lower case.Also, I use
ctinstead ofcountas alias, because it is bad practice to use function names as identifiers.Simple case
This would work with PostgreSQL 9.1:
It requires primary key column(s) in the
GROUP BYclause. The results are identical to a MySQL query, butctwould always be 1 (or 0 ifid IS NULL) – useless to find duplicates.Group by other than primary key columns
If you want to group by other column(s), things get more complicated. This query mimics the behavior of your MySQL query – and you can use
*.This works because
DISTINCT ON(PostgreSQL specific), likeDISTINCT(SQL-Standard), are applied after the window functioncount(*) OVER (...). Window functions (with theOVERclause) require PostgreSQL 8.4 or later and are not available in MySQL.Works with any table, regardless of primary or unique constraints.
The
1inDISTINCT ONandORDER BYis just shorthand to refer to the ordinal number of the item in theSELECTlist.SQL Fiddle to demonstrate both side by side.
More details in this closely related answer:
count(*)vs.count(id)If you are looking for duplicates, you are better off with
count(*)than withcount(id). There is a subtle difference ifidcan beNULL, becauseNULLvalues are not counted – whilecount(*)counts all rows. Ifidis definedNOT NULL, results are the same, butcount(*)is generally more appropriate (and slightly faster, too).