I’m about to create a data warehouse with facts and dimensions in a star-schema.
The business questions I want to answer are typically these:
- How much money did we sell for in Q1?
- How much money did we sell for in Q1 to females?
- How much money did we sell for in Q1 to females aged 30-35?
- How much money did we sell for in Q1 to females aged 30-35 living in New York?
- How much money did we sell for in category clothes last year?
- How much money did we sell for of the product blue jeans last year?
- How much money did we sell for of the product blue jeans to males between 40-42 living in Australia last year?
I am thinking of a date dimension with the granularity of an hour (specifying year, month, day, hour, quarter, name of day, name of month etc.)
I am also thinking of a product dimension and a user dimension.
I wonder if these questions could be answered using a single fact table, or if it's proper to create multiple fact tables? I am thinking of a table such as:
FactSales
- DimDate – FK to a table containing information about the date (such as quarter, day of week, year, month, day)
- DimProduct – FK to a table containing information about the product (such as product name)
- DimUser – FK to a table containing information about the user (such as age, gender)
- TotalSales – the SUM of all sales for that particular date, product and user
Also, what if I would like to measure both the total sales (money) and the total number of sales? Would it be proper to create a new fact table with the same dimensions but using TotalNumberOfSales as the fact instead?
Thankful for all input I can get about this.
I think you are on the right track. All the questions above should be possible to answer using only one fact table covering the sales.
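To make that concrete, here is a minimal sketch of the star schema from the question, run through Python's built-in sqlite3 module so it can be tried without a server. The `City` and `Category` columns, the sample rows and the `total_sales` helper are my own assumptions for illustration; the question only lists age, gender and product name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimDate    (DateKey INTEGER PRIMARY KEY, Year INTEGER, Quarter INTEGER, Month INTEGER, Day INTEGER);
-- Category and City are assumed columns, not taken from the question.
CREATE TABLE DimProduct (ProductKey INTEGER PRIMARY KEY, ProductName TEXT, Category TEXT);
CREATE TABLE DimUser    (UserKey INTEGER PRIMARY KEY, Gender TEXT, Age INTEGER, City TEXT);
CREATE TABLE FactSales  (
    DateKey    INTEGER REFERENCES DimDate(DateKey),
    ProductKey INTEGER REFERENCES DimProduct(ProductKey),
    UserKey    INTEGER REFERENCES DimUser(UserKey),
    TotalSales REAL
);
-- Made-up sample data.
INSERT INTO DimDate VALUES (20240115, 2024, 1, 1, 15), (20240520, 2024, 2, 5, 20);
INSERT INTO DimProduct VALUES (1, 'blue jeans', 'clothes'), (2, 'coffee mug', 'kitchen');
INSERT INTO DimUser VALUES (1, 'F', 32, 'New York'), (2, 'M', 41, 'Sydney'), (3, 'F', 33, 'Boston');
INSERT INTO FactSales VALUES
    (20240115, 1, 1, 50.0),
    (20240115, 2, 1, 10.0),
    (20240115, 1, 2, 55.0),
    (20240115, 1, 3, 45.0),
    (20240520, 1, 1, 60.0);
""")

def total_sales(where_clause, params=()):
    """Sum TotalSales over the one fact table, filtered through the dimensions.

    where_clause is trusted SQL written by us (fine for a sketch, not for user input).
    """
    sql = f"""
        SELECT COALESCE(SUM(f.TotalSales), 0)
        FROM FactSales f
        JOIN DimDate    d ON d.DateKey    = f.DateKey
        JOIN DimProduct p ON p.ProductKey = f.ProductKey
        JOIN DimUser    u ON u.UserKey    = f.UserKey
        WHERE {where_clause}
    """
    return conn.execute(sql, params).fetchone()[0]

q1_all = total_sales("d.Year = ? AND d.Quarter = 1", (2024,))
q1_females = total_sales("d.Year = ? AND d.Quarter = 1 AND u.Gender = 'F'", (2024,))
q1_females_ny = total_sales(
    "d.Year = ? AND d.Quarter = 1 AND u.Gender = 'F' "
    "AND u.Age BETWEEN 30 AND 35 AND u.City = ?",
    (2024, "New York"),
)
clothes_2024 = total_sales("d.Year = ? AND p.Category = 'clothes'", (2024,))
```

Each business question only changes the WHERE clause; the fact table and the joins stay the same.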
I think one should start out unaggregated, and rather aggregate later if needed. Considering that one sale can contain multiple products and multiple items, I’d organize it as follows: one fact row for each product in the sale (typically the lines on the invoice, so I’d call it “order lines” or “sale lines”), and maybe three counter attributes:
- NumItems – number of items, i.e. 3 if the customer bought three of the same product.
- NumLines – number of “order lines”; should always be 1. May be useful when aggregating data later (it is a big win to already have sum(NumLines) rather than count(*) in the SQL), or when adding correction items (NumLines = -1).
- NumSales – a fractional number, so it can be summed up to yield the number of sales (i.e. 0.333 if the sale involves three different products and hence contains three order lines).
Now, one runs into a problem getting the right count for something like “number of sales involving black clothes”. We had this problem at my previous workplace. I’m sure there must exist some “best practice” for this; we ended up more or less by introducing a SaleID in the fact table (or TransactionID) and doing count(distinct SaleID). That lacks elegance, but works.
In our setup we had several money attributes. The most important were one for the revenue (what’s left of the income after paying the direct costs attributed to the items sold) and one for the turnover (the price paid by the customer for the item). Sales tax or VAT may add more complications. One can make do with only one money attribute and then split the sales up into multiple lines in the fact table, but I think I would rather recommend multiple money columns in the sales line fact table. Everything in the fact table was counted in “base currency” (Euros, in our case), and then we had an exchange rate dimension to track the exact amounts.
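Here is a rough sketch of such a sale-line fact table, again using Python's sqlite3 module so it runs anywhere. The table name `FactSaleLines`, the `Colour` column stored directly on the fact row (in a real star schema it would live in the product dimension) and the sample rows are assumptions made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One row per order line. Colour is denormalised here only to keep the sketch short.
CREATE TABLE FactSaleLines (
    SaleID      INTEGER,
    ProductName TEXT,
    Colour      TEXT,
    NumItems    INTEGER,  -- items bought on this line
    NumLines    INTEGER,  -- always 1 (or -1 for a correction)
    NumSales    REAL,     -- 1 / number of lines in the sale
    Turnover    REAL
);
-- Sale 1: two black products. Sale 2: one black and one white product.
INSERT INTO FactSaleLines VALUES
    (1, 'black jeans', 'black', 3, 1, 0.5, 150.0),
    (1, 'black shirt', 'black', 1, 1, 0.5,  20.0),
    (2, 'black socks', 'black', 2, 1, 0.5,  10.0),
    (2, 'white shirt', 'white', 1, 1, 0.5,  25.0);
""")

# Without a filter, the counters and the money column all come out of one table.
totals = conn.execute(
    "SELECT SUM(NumItems), SUM(NumLines), SUM(NumSales), SUM(Turnover) FROM FactSaleLines"
).fetchone()

# With a filter, the fractional NumSales no longer adds up to whole sales,
# while count(distinct SaleID) still gives the number of sales involved.
black_fractional, black_distinct = conn.execute(
    "SELECT SUM(NumSales), COUNT(DISTINCT SaleID) FROM FactSaleLines WHERE Colour = 'black'"
).fetchone()
```

In this sample both sales involve black clothes, yet the filtered sum(NumSales) is 1.5 because the white shirt's share of sale 2 is excluded. That is the counting problem described above, and count(distinct SaleID) returns 2 as expected. It also shows that money and the number of sales can sit in the same fact table rather than in two.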
I don’t think it makes sense to have a date dimension containing the hour of the day. At my former work I kept my warehouse in Postgres, and I actually managed quite well without a date dimension at all. Although a date dimension is considered “best business practice”, I found that for all our purposes we got much better performance by using the standard Postgres date functions instead of dragging in a date dimension. I played around with it quite a lot, and in the end I found the best option was to split date and time into two different attributes. (Time zones and daylight saving gave me quite some extra headaches…)
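As a small illustration of working without a date dimension, the sketch below keeps date and time as two separate columns on the fact row and derives year, quarter and hour with built-in date functions. Note that it uses SQLite's `strftime` through Python's sqlite3 module so it is runnable as-is, not the Postgres functions I actually used; in Postgres the closest equivalents would be `extract(quarter from ...)` or `date_trunc`. Table name, column names and data are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Date and time stored as two attributes directly on the fact row.
CREATE TABLE FactSales (SaleDate TEXT, SaleTime TEXT, Amount REAL);
INSERT INTO FactSales VALUES
    ('2024-01-15', '09:30:00', 50.0),
    ('2024-03-31', '23:10:00', 10.0),
    ('2024-04-01', '00:05:00', 55.0),
    ('2024-11-20', '14:45:00', 45.0);
""")

# Quarter is derived from the month: (month + 2) / 3 with integer division.
sales_per_quarter = conn.execute("""
    SELECT CAST(strftime('%Y', SaleDate) AS INTEGER)           AS Year,
           (CAST(strftime('%m', SaleDate) AS INTEGER) + 2) / 3 AS Quarter,
           SUM(Amount)
    FROM FactSales
    GROUP BY Year, Quarter
    ORDER BY Year, Quarter
""").fetchall()

# Hour-of-day analysis only needs the time column.
sales_per_hour = conn.execute("""
    SELECT CAST(strftime('%H', SaleTime) AS INTEGER) AS Hour, SUM(Amount)
    FROM FactSales
    GROUP BY Hour
    ORDER BY Hour
""").fetchall()
```

Whether this beats a date dimension will depend on the database and the queries; it was the case for our workload, so measure before committing to either approach.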