I have a data warehouse containing typical star schemas, and a whole bunch of code which does stuff like this (obviously a lot bigger, but this is illustrative):
SELECT cdim.x
      ,SUM(fact.y) AS y
      ,dim.z
FROM fact
INNER JOIN conformed_dim AS cdim ON cdim.cdim_dim_id = fact.cdim_dim_id
INNER JOIN nonconformed_dim AS dim ON dim.ncdim_dim_id = fact.ncdim_dim_id
INNER JOIN date_dim AS ddim ON ddim.date_id = fact.date_id
WHERE fact.date_id = @date_id
GROUP BY cdim.x
        ,dim.z
I’m thinking of replacing it with a view (MODEL_SYSTEM_1, say), so that it becomes:
SELECT m.x
      ,SUM(m.y) AS y
      ,m.z
FROM MODEL_SYSTEM_1 AS m
WHERE m.date_id = @date_id
GROUP BY m.x
        ,m.z
But the view MODEL_SYSTEM_1 would have to contain unique column names, and I'm also concerned about performance: since the view would span a whole star, I want all the predicates in the WHERE clause, across the different facts and dimensions, to still be optimized properly, and views cannot be parameterized (boy, wouldn't that be cool!).
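For concreteness, here is a sketch of what such a view might look like for the star above. The column list is illustrative only, and `calendar_date` is a hypothetical `date_dim` column included to show how a duplicate name would be disambiguated with a prefix:

```sql
-- Sketch only: the real view would expose every fact and dimension column,
-- and calendar_date is a made-up date_dim column shown for the prefixing idea.
CREATE VIEW MODEL_SYSTEM_1
AS
SELECT fact.y
      ,fact.date_id
      ,cdim.x
      ,dim.z
      ,ddim.calendar_date AS ddim_calendar_date
FROM fact
INNER JOIN conformed_dim AS cdim ON cdim.cdim_dim_id = fact.cdim_dim_id
INNER JOIN nonconformed_dim AS dim ON dim.ncdim_dim_id = fact.ncdim_dim_id
INNER JOIN date_dim AS ddim ON ddim.date_id = fact.date_id;
```

Note that a plain join view like this is expanded inline by SQL Server, so a predicate such as `WHERE m.date_id = @date_id` in the outer query can generally be pushed down into the underlying join, which is why the lack of view parameters usually doesn't hurt in practice.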
So my questions are –
- Is this approach OK, or is it just going to be an abstraction which hurts performance and doesn't give me anything but a lot nicer syntax?
- What's the best way to code-gen these views, eliminating duplicate column names (even if the view later needs to be tweaked by hand), given that all the appropriate PKs and FKs are in place? Should I just write some SQL to pull it out of INFORMATION_SCHEMA, or is there a good example already available?
Edit: I have tested it, and the performance seems the same, even on the bigger processes, including queries that join multiple stars which each use these views.
The automation is mainly because there are a number of these stars in the data warehouse, and the FK/PK constraints have been set up properly by the designers, but I don't want to have to pick through all the tables or the documentation. I wrote a script to generate the view (it also generates abbreviations for the table names), and it works well: it builds the skeleton automagically from INFORMATION_SCHEMA, which can then be tweaked by hand before the view is actually created.
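For anyone curious, a minimal sketch of the INFORMATION_SCHEMA approach, assuming SQL Server 2017+ for STRING_AGG. The view name, table list, and blanket prefixing rule are all illustrative; a fuller script would also read the FK metadata (sys.foreign_keys / sys.foreign_key_columns) to generate the joins rather than leaving them to be added by hand:

```sql
-- Sketch: emit a SELECT-list skeleton for a view over one star,
-- prefixing every column with its table name to avoid duplicates.
-- The FROM/JOIN clauses are left for hand-tweaking in this version.
DECLARE @fact_table sysname = N'fact';

SELECT 'CREATE VIEW MODEL_SYSTEM_1
AS
SELECT
' + STRING_AGG(
        '     ' + c.TABLE_NAME + '.' + c.COLUMN_NAME
                + ' AS ' + c.TABLE_NAME + '_' + c.COLUMN_NAME
      , ',' + CHAR(10))
FROM INFORMATION_SCHEMA.COLUMNS AS c
WHERE c.TABLE_NAME IN (@fact_table, 'conformed_dim', 'nonconformed_dim', 'date_dim');
```

The output is just a starting point; you would paste it into a query window, add the joins, and trim or rename columns before running the CREATE VIEW.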
If anyone wants the code, I could probably publish it here.
I’ve used this technique on several data warehouses I look after. I have not noticed any performance degradation when running reports based on the views versus querying the tables directly, but I have never performed a detailed analysis.
I created the views using the designer in SQL Server Management Studio and did not use any automated approach. I can’t imagine the schema changing often enough for automation to be worthwhile anyhow. You might spend as long tweaking the generated results as it would have taken to drag all the tables onto the view in the first place!
To remove ambiguity, a good approach is to prefix each column name with the name of the dimension it belongs to. This is helpful to report writers and to anyone running ad hoc queries.
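For example (the table and column names here are made up), two dimensions that both carry a `name` column would be exposed like this:

```sql
-- Hypothetical star: customer_dim and product_dim both have a 'name' column,
-- so the view aliases each with its dimension's name.
CREATE VIEW SALES_MODEL
AS
SELECT fact.amount
      ,cust.name AS customer_name
      ,prod.name AS product_name
FROM sales_fact AS fact
INNER JOIN customer_dim AS cust ON cust.customer_id = fact.customer_id
INNER JOIN product_dim AS prod ON prod.product_id = fact.product_id;
```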