Is there a benefit to using multiple columns in the distribution key when creating a table? For instance:
CREATE TABLE data_facts (
    data_id int primary key,
    channel_id smallint,
    chart_id smallint,
    demo_id smallint,
    value numeric)
DISTRIBUTED BY (channel_id, chart_id, demo_id)
I ask because I may need to join data_facts with three different tables (channel, chart and demo) using channel_id, chart_id and demo_id respectively.
Specifically,
- Should I always add distribution and include all id(s) that I’m using for joining in terms of efficiency?
- If so, does the order of these id(s) matter?
- How does this work on an architecture level? (optional)
Thanks!
It depends on how finely you want to shard the database and how few records you want in each partition: the more columns you add to the distribution key, the more finely the data is fragmented across partitions.
It also depends on whether you shard by modulo or by hash …
However, in my opinion, if you have a multi-column primary key and you want to shard by that primary key, distributing by multiple columns (all the columns in the primary key) can make sense; otherwise you should shard by a single column, which is enough in most cases.
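To make the fragmentation point concrete, here is a minimal Python sketch of hash distribution. It is a toy model, not the database's actual algorithm: the segment count, the MD5-based hash, the helper names (segment_for, placement) and the sample rows are all assumptions made up for illustration. It compares distributing by channel_id alone with distributing by (channel_id, chart_id, demo_id) and reports which segments end up holding rows for each channel_id.

```python
import hashlib
from collections import defaultdict

NUM_SEGMENTS = 4  # hypothetical cluster size, chosen only for this sketch


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Map a distribution-key tuple to a segment using a stable toy hash."""
    digest = hashlib.md5(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


def placement(rows, key_columns):
    """Return {channel_id: set of segments that hold rows with that channel_id}."""
    segments_by_channel = defaultdict(set)
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        segments_by_channel[row["channel_id"]].add(segment_for(key))
    return dict(segments_by_channel)


# Made-up fact rows: 3 channels x 5 charts x 5 demos.
rows = [
    {"channel_id": c, "chart_id": h, "demo_id": d}
    for c in range(3) for h in range(5) for d in range(5)
]

single = placement(rows, ["channel_id"])
composite = placement(rows, ["channel_id", "chart_id", "demo_id"])

for channel_id in sorted(single):
    print(channel_id,
          "single-column:", sorted(single[channel_id]),
          "composite:", sorted(composite[channel_id]))
```

In this model, every row with a given channel_id lands on one segment when channel_id is the only distribution column, so a join on channel_id can be resolved locally. With the three-column key, the segment depends on the whole combination, so rows sharing a channel_id are scattered and a join on channel_id alone would likely need data to be moved between segments first. The three-column key only helps a join that matches on all three columns at once.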