Agile Data Warehouse Modeling: How to Build a Virtual Type 2 Slowly Changing Dimension
One of the ongoing complaints about many data warehouse projects is that they take too long to delivery. This is one of the main reasons that many of us have tried to adopt methods and techniques (like SCRUM) from the agile software world to improve our ability to deliver data warehouse components more quickly.
So, what activity takes the bulk of development time in a data warehouse project?
Writing (and testing) the ETL code to move and transform the data can take up to 80% of the project resources and time.
So if we can eliminate, or at least curtail, some of the ETL work, we can deliver useful data to the end user faster.
One way to do that would be to virtualize the data marts.
For several years Dan Linstedt and I have discussed the idea of building virtual data marts on top of a Data Vault modeled EDW.
In the last few years I have floated the idea among the Oracle community. Fellow Oracle ACE Stewart Bryson and I even created a presentation this year (for #RMOUG and #KScope13) on how to do this using the Business Model (meta-layer) in OBIEE (It worked great!).
While doing this with a BI tool is one approach, I like to be able to prototype the solution first using Oracle views (that I build in SQL Developer Data Modeler of course).
The approach to modeling a Type 1 SCD this way is very straight forward.
How to do this easily for a Type 2 SCD has evaded me for years, until now.
Building a Virtual Type 2 SCD (VSCD2)
So how to create a virtual type 2 dimension (that is “Kimball compliant” ) on a Data Vault when you have multiple Satellites on one Hub?
(NOTE: the next part assumes you understand Data Vault Data Modeling. if you don’t, start by reading my free white paper, but better still go buy the Data Vault book on LearnDataVault.com)
Here is how:
Build an insert only PIT (Point-in-Time) table that keeps history. This is sometimes referred to as a historicized PIT tables. (see the Super Charge book for an explanation of the types of PIT tables)
Add a surrogate Primary Key (PK) to the table. The PK of the PIT table will then serve as the PK for the virtual dimension. This meets the standard for classical star schema design to have a surrogate key on Type 2 SCDs.
To build the VSCD2 you now simply create a view that uses the PIT table to join the Hub and all the Satellites together. Here is an example:Create view Dim2_Customer (Customer_key, Customer_Number, Customer_Name, Customer_Address, Load_DTS) as Select sat_pit.pit_seq, hub.customer_num, sat_1.name, sat_2.address, sat_pit.load_dts from HUB_CUST hub, SAT_CUST_PIT sat_pit, SAT_CUST_NAME sat_1, SAT_CUST_ADDR sat_2 where hub.CSID = sat_pit.CSID and hub.CSID = sat_1.CSID and hub.CSID = sat_2.CSID and sat_pit.NAME_LOAD_DTS = sat_1.LOAD_DTS and sat_pit.ADDRESS_LOAD_DTS = sat_2.LOAD_DTS
Benefits of a VSCD2
- We can now rapidly demonstrate the contents of a type 2 dim prior to ETL programming
- With using PIT tables we don’t need the Load End DTS on the Sats so the Sats become insert only as well (simpler loads, no update pass required)
- Another by product is the Sat is now also Hadoop compliant (again insert only)
- Since the nullable Load End DTS is not needed, you can now more easily partition the Sat table by Hub Id and Load DTS.
The main objection to this approach is that the virtual dimension will perform very poorly. While this may be true for very high volumes, or on poorly tuned or resourced databases, I maintain that with today’s evolving hardware appliances (e.g., Exadata, Exalogic) and the advent of in memory databases, these concerns will soon be a thing of the past.
UPDATE 26-May-2018 – Now 5 years later I have successfully done the above on Oracle. But now we also have Snowflake elastic cloud data warehouse where all the prior constraints are indeed eliminated. With Snowflake you can now easily chose to instantly add compute power if the view is too slow or do the work and processing to materialize the view. (end update)
Worst case, after you have validated the data with your users, you can always turn it into a materialized view or a physical table if you must.
So what do you think? Have you ever tried something like this? Let me know in the comments.
Get virtual, get agile!
The Data Warrior
P.S. I am giving a talk on Agile Data Warehouse Modeling at the East Coast Oracle Conference this week. If you are there, look me up and we can discuss this post in person!
Interesting. I am aware of other work that does not attempt to give each “dimension row version” it’s own skey. How do you create a junk dimension skey or any other skey where several tables need to be joined together and no one table has one row for every projection of the joins?
I have honestly never implemented a traditional junk dimension (never had the need).
In the case I am discussing here using Data Vault, since the Satellite tables are all related via a common Hub (based on a business key) and the joins are projected in the historical PIT table, I don’t have that issue. Without some table to project those joins (like the PIT table) I do not know how you could create the junk table with an skey via view. You would have to build it using traditional ETL techniques similar to how I build the PIT table during the data vault load process.
Without adding the skey to the PIT table, I could not build a virtual dimension with an skey either. (The only other option I ever came up with was to generate a virtual PK by applying an MD5 hash function to the Hub skey plus the PIT load date to get a unique key – but then that key would not be a standard integer surrogate).
Pingback: Reading List: Enterprise Data Architecture – Pipeline LLC