This is my first attempt at creating a data mart/warehouse and I am a little confused about how best to design the schema. Some background on the project: I originally created a relational database that captures information about our clients. A simplified schema is as follows:
ClientTbl:
ID:PK;
FName:String;
LName:String;
ClientEDU(one to many)
ID:PK;
ClientID:FK;
SchoolName:String;
Degree:String;
GPA:String;
ClientJobs(One to many)
ID:PK;
ClientID:FK;
OrganizationName:string;
Industry:String;
StartDate:Date;
EndDate:Date;
Salary:double;
CityLocation: String;
This is a simplified example. In reality I have several more tables holding thousands of records. Whenever I want to run queries on these tables, it can be very time consuming. It seems like creating a data mart would help. This way, we could run a time-consuming update to the data mart nightly, and then have the pre-computed data in our DM, which would be fast to query. I’m just having a hard time working out how best to design the schema. Example questions I would like to have answered in the data mart, based on the example tables above, are:
% of clients that attend each school in our db
% that have each degree in our db
Avg salary of client
Avg length of stay at a job
% of clients that worked in each city, that is found in the db
From my reading, I know that the fact table would contain all the calculated values (avg salary, length, etc.) and each dimension would contain data (jobs or education), but I don't understand how they are tied together. Would my fact table have a row for each client? Just one row?
Any help would be great
thanks
This is a difficult problem because it involves demographic summaries of clients.
You have a Job which appears to be fact-like. It has a duration and a salary which are measures. We know they’re measures because they have proper units.
Given the Job fact, what are the dimensions of this fact?
Client
Time started
Perhaps you know other things about the Job (geography, industry, for example).
Time is a point-in-time. This is usually a table with dates and all of the various reporting categories that dates fall into: quarters, weeks, fiscal periods, etc., etc.
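To make the date dimension concrete, here is a minimal sketch of how its rows might be generated. The column names (`date_key`, `quarter`, `iso_week`) and the `build_date_dimension` helper are my own assumptions, not something from your schema, and fiscal periods are left out because they depend on your organization's calendar.

```python
from datetime import date, timedelta


def build_date_dimension(start: date, end: date) -> list:
    """Return one row per calendar day from start to end, inclusive."""
    rows = []
    current = start
    while current <= end:
        _, iso_week, _ = current.isocalendar()
        rows.append({
            # Hypothetical surrogate key in YYYYMMDD form, e.g. 20240115
            "date_key": int(current.strftime("%Y%m%d")),
            "full_date": current.isoformat(),
            "year": current.year,
            "quarter": (current.month - 1) // 3 + 1,
            "month": current.month,
            "iso_week": iso_week,
        })
        current += timedelta(days=1)
    return rows


# One row per day of 2024; each Job fact would reference a date_key.
rows = build_date_dimension(date(2024, 1, 1), date(2024, 12, 31))
```

Reports by quarter or week then become a join plus a GROUP BY on the relevant column, instead of date arithmetic in every query.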
Client is an individual, and people don’t make a particularly good dimension. They have a lot of dimensions of their own.
Choice 1. A “snowflake” schema. This treats Client as a kind of fact with a lot of dimensions, including their own geography, degree, school and what-not.
Choice 2. A “demographic” dimension. This summarizes degree program, GPA range, school name, and the like. It is, effectively, an association between the proper Job facts and the Clients. A Job belongs to a demographic category. A number of clients also belong to that category.
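As a rough illustration of Choice 2, the sketch below collapses education rows into demographic categories. The GPA band boundaries, the `gpa_band` and `build_demographic_dimension` helpers, and the sample rows are all assumptions made up for the example; only the field names `SchoolName`, `Degree` and `GPA` come from the question.

```python
def gpa_band(gpa: str) -> str:
    """Map a raw GPA string to a coarse reporting band (bands are assumed)."""
    try:
        value = float(gpa)
    except (TypeError, ValueError):
        return "unknown"
    if value >= 3.5:
        return "3.5+"
    if value >= 3.0:
        return "3.0-3.49"
    if value >= 2.0:
        return "2.0-2.99"
    return "below 2.0"


def build_demographic_dimension(education_rows):
    """Assign one surrogate key per distinct (school, degree, GPA band)."""
    dimension = {}
    for row in education_rows:
        category = (row["SchoolName"], row["Degree"], gpa_band(row["GPA"]))
        if category not in dimension:
            dimension[category] = len(dimension) + 1
    return dimension


# Made-up sample data shaped like the ClientEDU table.
education_rows = [
    {"ClientID": 1, "SchoolName": "State U", "Degree": "BS", "GPA": "3.7"},
    {"ClientID": 2, "SchoolName": "State U", "Degree": "BS", "GPA": "3.9"},
    {"ClientID": 3, "SchoolName": "City College", "Degree": "BA", "GPA": "n/a"},
]
demographic_dim = build_demographic_dimension(education_rows)
```

Clients 1 and 2 land in the same category, which is the point: the dimension stays small because it holds categories rather than individuals.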
A fact table has one row for each measurable instance of a fact associated with the various dimensions of that fact.
A Job fact has two measures, salary and duration, and at least two foreign key references to dimensions: start date and demographics. If you have other dimensional attributes of the Job (like geography or industry), these are also foreign keys of the Job fact.
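Here is one way the Job fact and its two dimensions could be laid out, sketched with Python's built-in `sqlite3` so it runs anywhere. All table names, column names and sample values are assumptions for illustration; geography and industry dimensions are omitted to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (
    date_key  INTEGER PRIMARY KEY,
    full_date TEXT,
    year      INTEGER,
    quarter   INTEGER
);
CREATE TABLE demographic_dim (
    demographic_key INTEGER PRIMARY KEY,
    school_name     TEXT,
    degree          TEXT,
    gpa_band        TEXT
);
-- One row per job: two measures plus foreign keys to the dimensions.
CREATE TABLE job_fact (
    job_key         INTEGER PRIMARY KEY,
    start_date_key  INTEGER REFERENCES date_dim(date_key),
    demographic_key INTEGER REFERENCES demographic_dim(demographic_key),
    salary          REAL,
    duration_days   INTEGER
);
""")
conn.executemany("INSERT INTO date_dim VALUES (?, ?, ?, ?)", [
    (20200101, "2020-01-01", 2020, 1),
    (20210701, "2021-07-01", 2021, 3),
])
conn.executemany("INSERT INTO demographic_dim VALUES (?, ?, ?, ?)", [
    (1, "State U", "BS", "3.0-3.49"),
    (2, "City College", "BA", "3.5+"),
])
conn.executemany("INSERT INTO job_fact VALUES (?, ?, ?, ?, ?)", [
    (1, 20200101, 1, 60000.0, 365),
    (2, 20210701, 1, 80000.0, 730),
    (3, 20200101, 2, 70000.0, 200),
])

# Average salary and average length of stay, sliced by degree.
avg_by_degree = conn.execute("""
    SELECT d.degree, AVG(f.salary), AVG(f.duration_days)
    FROM job_fact AS f
    JOIN demographic_dim AS d USING (demographic_key)
    GROUP BY d.degree
    ORDER BY d.degree
""").fetchall()
```

Note that the averages are not stored in the fact table. The fact rows hold the raw measures, and the aggregates come from the query, so the same table can answer the question by degree, by school, by start quarter, and so on.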
A Client Demographic will be associated with one or more jobs.
The same would be true for Geography or Industry.
Because client is a special case, one or more clients will also have FK references to the appropriate client demographic dimension row.
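The client-to-demographic reference can be sketched the same way. This assumes, for simplicity, that each client maps to exactly one demographic row (a client with several education records would need a rule for picking one, or a bridge table); the table names, column names and data are again hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE demographic_dim (
    demographic_key INTEGER PRIMARY KEY,
    school_name     TEXT,
    degree          TEXT
);
-- Each client carries an FK to the demographic category it falls in.
CREATE TABLE client (
    client_key      INTEGER PRIMARY KEY,
    demographic_key INTEGER REFERENCES demographic_dim(demographic_key)
);
""")
conn.executemany("INSERT INTO demographic_dim VALUES (?, ?, ?)", [
    (1, "State U", "BS"),
    (2, "City College", "BA"),
])
conn.executemany("INSERT INTO client VALUES (?, ?)", [
    (1, 1), (2, 1), (3, 1), (4, 2),
])

# Percentage of clients that attended each school.
pct_by_school = conn.execute("""
    SELECT d.school_name,
           100.0 * COUNT(*) / (SELECT COUNT(*) FROM client) AS pct
    FROM client AS c
    JOIN demographic_dim AS d USING (demographic_key)
    GROUP BY d.school_name
    ORDER BY d.school_name
""").fetchall()
```

Swapping `school_name` for `degree` in the query gives the "% that have each degree" figure from the question.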