Data Vault vs. The World (2 of 3)
In the first post of this series, I discussed the advantages of using a Data Vault over a replicated ODS. In this post I am going to compare Data Vault with the more classical approach: a Kimball-style dimensional (star schema) design.
To be clear upfront, I will be comparing Data Vault to a dimensional model for the enterprise data warehouse repository. Using star schemas for reporting data marts is not at issue (we use those regularly in the Data Vault framework).
I also want to recognize that there are some very expert architects out there who have had great success building Kimball-style EDWs and mitigated all the risks I am going to mention. Unfortunately, there are not enough of them to go around. So for the rest of us, we might need an alternative approach…
When considering a Kimball-style approach, organizations often start by designing facts and dimensions to hold the operational data from one of their source systems. One downside to this approach is that designing optimal structures requires a very solid understanding of all the types of questions the business needs to answer. Otherwise we risk not having the right fact tables to support the ever-changing needs of the business. This tends to make the approach difficult to implement in an agile, iterative manner and can lead to re-engineering. Conformed dimensions are an especially big risk: while adding new fact tables in an iterative fashion is not a problem, having to redesign or rebuild a large conformed dimension can be a big effort.
Additionally, there may be requirements to extract data for various other analyses. Dimensional models do not lend themselves well to all types of extracts. Not every query can be supported by cube-type designs, especially when the data is spread across multiple fact tables with varying levels of granularity.
With a Data Vault solution we can store all the data at an atomic level, in a form that can easily be projected into dimensional views, 3NF ODS-style structures, or flat, denormalized spreadsheet-style extracts. This allows us to be more agile in addressing changing report and extract requirements than if we had to design new facts and dimensions, then write and test the ETL code to build them.
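To make that concrete, here is a minimal sketch of projecting a dimension as a view over a Hub and its Satellite, rather than instantiating it as a physical table. All the table and column names (hub_customer, sat_customer, the load_end_dts end-dating pattern) are hypothetical, not from any particular implementation:

```sql
-- Hypothetical Hub and Satellite; all names here are illustrative.
-- hub_customer holds the business key; sat_customer holds the
-- descriptive attributes, with load dates tracking history.
CREATE VIEW dim_customer AS
SELECT h.customer_hk   AS customer_key,     -- surrogate/hash key
       h.customer_bk   AS customer_number,  -- business key
       s.customer_name,
       s.customer_region
FROM   hub_customer h
JOIN   sat_customer s
  ON   s.customer_hk = h.customer_hk
WHERE  s.load_end_dts IS NULL;  -- current rows only (Type 1-style view)
```

If the reporting requirement changes, you redefine the view; the atomic data underneath stays put.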
Plus, the instantiated dimensional model (which is highly denormalized) will simply take more space than a Data Vault model, which is more normalized.
Using the dimensional approach as the basis for the foundation layer of an enterprise data warehouse also carries the risk of having to redesign, drop, and reload both facts and dimensions as the OLTP model evolves. That effort can be very expensive and time-consuming.
With the Data Vault, it is much easier to evolve the warehouse model: we simply add new structures (Hubs, Links, Sats) to absorb the changes.
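For example, if a source system adds a block of new customer attributes, the model can absorb them with a new Satellite hung off the existing Hub; nothing already loaded has to be dropped or reloaded. A rough sketch, again with made-up names:

```sql
-- Hypothetical new Satellite added to absorb new source attributes;
-- the existing hub_customer and its current Satellites are untouched.
CREATE TABLE sat_customer_credit (
    customer_hk    CHAR(32)      NOT NULL,  -- key of the existing Hub
    load_dts       TIMESTAMP     NOT NULL,  -- when the row was loaded
    record_source  VARCHAR(50)   NOT NULL,  -- traceability to the source
    credit_limit   NUMERIC(12,2),           -- the new source attributes
    credit_rating  VARCHAR(10),
    PRIMARY KEY (customer_hk, load_dts),
    FOREIGN KEY (customer_hk) REFERENCES hub_customer (customer_hk)
);
```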
Integrating other source systems can also be a challenge with dimensional models, especially when populating conformed dimensions, because you have to account for all the semantic and structural differences between the various systems before you load the data. This can mean a lot of cleansing and transforming of the source data (which can jeopardize your ability to clearly trace data back to its source).
Data Vault avoids this issue by integrating around business keys (via Hubs) and allowing apparently disparate data sets to be associated through Same-As-Link tables, which support dynamic equivalence mappings across the data sets from different sources. In agile terms, these structures can be built in a future sprint, after the data is loaded and profiled. (As a side note, Same-As-Links are also very helpful for doing master data-style work in your vault.)
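Structurally, a Same-As-Link is just a Link whose rows assert that two Hub keys refer to the same real-world thing. A minimal sketch, with hypothetical names as before:

```sql
-- Hypothetical Same-As-Link: each row asserts that two Hub keys
-- refer to the same customer. Mappings can be loaded, corrected,
-- or dropped later without touching the Hubs or Satellites.
CREATE TABLE sal_customer (
    master_customer_hk     CHAR(32)    NOT NULL,
    duplicate_customer_hk  CHAR(32)    NOT NULL,
    load_dts               TIMESTAMP   NOT NULL,
    record_source          VARCHAR(50) NOT NULL,
    PRIMARY KEY (master_customer_hk, duplicate_customer_hk, load_dts),
    FOREIGN KEY (master_customer_hk)    REFERENCES hub_customer (customer_hk),
    FOREIGN KEY (duplicate_customer_hk) REFERENCES hub_customer (customer_hk)
);
```

Because the equivalence lives in its own table, you can load the raw data first and resolve the matching rules in a later sprint, which is exactly the agile sequencing described above.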
Well, those are the primary issues I have run across. I am sure there are more, but this should be enough for now.
Can’t wait to hear your comments! Be sure to tweet this to your friends.
Later,
Kent
P.S. Stay tuned for #3, where I will discuss the traditional Inmon 3NF-style data warehouse.