Let’s assume three models, standard joins:
class Mailbox < ActiveRecord::Base
has_many :addresses
has_many :domains, :through => :addresses
end
class Address < ActiveRecord::Base
belongs_to :mailbox
belongs_to :domain
end
class Domain < ActiveRecord::Base
has_many :addresses
has_many :mailboxes, :through => :addresses
end
Now obviously, if for any given mailbox you want to know in which domains it has addresses, you have two possible ways to go:
m = Mailbox.first
# either: SELECT DISTINCT domains.id, domains.name FROM "domains" INNER JOIN
# "addresses" ON "domains".id = "addresses".domain_id WHERE
# (("addresses".mailbox_id = 1))
m.domains.all(:select => 'DISTINCT domains.id, domains.name')
# or: SELECT domains.id, domains.name FROM "domains" INNER JOIN "addresses" ON
# "domains".id = "addresses".domain_id WHERE (("addresses".mailbox_id = 1))
# GROUP BY domains.id, domains.name
m.domains.all(:select => 'domains.id, domains.name',
:group => 'domains.id, domains.name')
The problem for me is that I don’t know which solution is better. When I don’t specify any other conditions, the PostgreSQL query planner favours solution number two (working as expected), but if I add conditions to the queries, it comes down to “Unique” vs. “Group”:
With “DISTINCT”:
Unique (cost=16.56..16.57 rows=1 width=150)
-> Sort (cost=16.56..16.56 rows=1 width=150)
Sort Key: domains.name, domains.id
-> Nested Loop (cost=0.00..16.55 rows=1 width=150)
-> Index Scan using index_addresses_on_mailbox_id on addresses (cost=0.00..8.27 rows=1 width=4)
Index Cond: (mailbox_id = 1)
-> Index Scan using domains_pkey on domains (cost=0.00..8.27 rows=1 width=150)
Index Cond: (domains.id = addresses.domain_id)
Filter: (domains.active AND domains.selfmgmt)
(9 rows)
With “GROUP BY”:
Group (cost=16.56..16.57 rows=1 width=150)
-> Sort (cost=16.56..16.56 rows=1 width=150)
Sort Key: domains.name, domains.id
-> Nested Loop (cost=0.00..16.55 rows=1 width=150)
-> Index Scan using index_addresses_on_mailbox_id on addresses (cost=0.00..8.27 rows=1 width=4)
Index Cond: (mailbox_id = 1)
-> Index Scan using domains_pkey on domains (cost=0.00..8.27 rows=1 width=150)
Index Cond: (domains.id = addresses.domain_id)
Filter: (domains.active AND domains.selfmgmt)
(9 rows)
I’m really unsure how to determine the better way to retrieve these data. My instincts tell me to go with “GROUP BY”, but I was unable to find any documentation specific enough to solve this problem.
Should I use “:group” or “:select => ‘DISTINCT'”? Is that choice the same with other modern RDBMS like e.g. Oracle, DB2 or MySQL (I don’t have access to those, so I can’t perform tests)?
If you’re using Postgresql < 8.4 (which I guess you are, given the plans) – it’s usually better to use
GROUP BYinstead ofDISTINCTas its plan is simply more efficient.In 8.4 there is no difference as DISTINCT was “taught” to be able to use group operators as well.