Data Vault vs. The World (2 of 3)
In the first post of this series, I discussed the advantages of using a Data Vault over a replicated ODS. In this post I am going to compare Data Vault with the more classical approach of a Kimball-style dimensional (star schema) design.
To be clear upfront, I will be comparing the use of Data Vault against a dimensional design for the enterprise data warehouse repository itself. Using star schemas for reporting data marts is not at issue (we use those regularly within the Data Vault framework).
I also want to recognize that there are some very expert architects out there who have had great success building Kimball-style EDWs and mitigated all the risks I am going to mention. Unfortunately, there are not enough of them to go around. So for the rest of us, we might need an alternative approach…
When considering a Kimball-style approach, organizations often design facts and dimensions to hold the operational data from one of their source systems. One downside to this approach is that to design optimal structures we would have to have a very solid understanding of all the types of questions the business needs to answer. Otherwise we risk not having the right fact tables to support the ever-changing needs of the business. This tends to make it difficult to implement in an agile, iterative manner and can lead to the need for re-engineering. This is especially a risk with conformed dimensions. While adding new fact tables in an iterative fashion is not a problem, having to redesign or rebuild a large conformed dimension can be a big effort.
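To make that risk concrete, here is a minimal sketch, in generic SQL with hypothetical table and column names, of a conformed customer dimension shared by two fact tables. Because every fact keys into the same dimension, redesigning or rebuilding dim_customer (say, to change its grain or restructure a hierarchy) means revisiting every fact table and every ETL process that references it:

```sql
-- Hypothetical conformed customer dimension shared by two fact tables.
CREATE TABLE dim_customer (
  customer_key  INTEGER      PRIMARY KEY,   -- surrogate key
  customer_nbr  VARCHAR(20)  NOT NULL,      -- business key from the source
  customer_name VARCHAR(100),
  region_name   VARCHAR(50)                 -- denormalized hierarchy attribute
);

CREATE TABLE fact_sales (
  customer_key INTEGER NOT NULL REFERENCES dim_customer,  -- conformed dim
  date_key     INTEGER NOT NULL,
  sale_amount  NUMERIC(12,2)
);

CREATE TABLE fact_returns (
  customer_key  INTEGER NOT NULL REFERENCES dim_customer, -- same conformed dim
  date_key      INTEGER NOT NULL,
  return_amount NUMERIC(12,2)
);
```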
Additionally, there may be requirements to extract data for various other analyses. Dimensional models do not lend themselves well to all types of extracts. Not every query can be supported by cube-type designs, especially if the data is spread across multiple fact tables with varying levels of granularity.
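Continuing the hypothetical schema above, even a simple question that spans the two fact tables requires a "drill-across" pattern: each fact must first be aggregated to a common grain before the results can be combined, and extracts that cut across several grains get awkward quickly:

```sql
-- Drill-across: facts at different grains cannot be joined row to row;
-- aggregate each to a common grain (customer), then combine the results.
SELECT s.customer_key,
       s.total_sales,
       r.total_returns
FROM      (SELECT customer_key, SUM(sale_amount) AS total_sales
           FROM fact_sales
           GROUP BY customer_key) s
LEFT JOIN (SELECT customer_key, SUM(return_amount) AS total_returns
           FROM fact_returns
           GROUP BY customer_key) r
       ON r.customer_key = s.customer_key;
```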
With a Data Vault solution we can store all the data at an atomic level, in a form that can easily be projected into dimensional views, 3NF ODS-style structures, or flat, denormalized, spreadsheet-style extracts. This allows us to be much more agile in addressing changing report and extract requirements than if we had to design new facts and dimensions, then write and test the ETL code to build them.
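As an illustration, again with hypothetical names, assume a hub_customer table holding the business key and a sat_customer table holding the descriptive attributes over time. A current-state customer dimension can then be projected as a simple view, with no new tables and no new load code:

```sql
-- Project a customer dimension directly from Data Vault structures.
-- load_dts is the load timestamp on each Satellite row.
CREATE VIEW dim_customer_v AS
SELECT h.customer_hkey,
       h.customer_nbr,       -- business key from the Hub
       s.customer_name,      -- descriptive attributes from the Satellite
       s.region_name
FROM hub_customer h
JOIN sat_customer s
  ON s.customer_hkey = h.customer_hkey
WHERE s.load_dts = (SELECT MAX(s2.load_dts)   -- most recent Satellite row only
                    FROM sat_customer s2
                    WHERE s2.customer_hkey = h.customer_hkey);
```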
Plus, the instantiated dimensional model (which is highly denormalized) will simply take more space than a Data Vault model, which is more normalized.
Using the dimensional approach as the basis for the foundation layer of an enterprise data warehouse also carries the risk of having to redesign, drop, and reload both facts and dimensions as the OLTP model evolves. That effort can be very expensive and take a lot of time.
With the Data Vault it is much easier to evolve the warehouse model: we simply add new structures (Hubs, Links, Satellites) to absorb the changes.
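For example, if the source system starts sending a new block of customer attributes, a sketch of the change (hypothetical names, assuming the hub_customer table from earlier) can be as small as one new Satellite hung off the existing Hub; nothing already built or loaded has to change:

```sql
-- New Satellite to absorb additional source attributes; the existing
-- Hub and its current Satellites are not modified at all.
CREATE TABLE sat_customer_credit (
  customer_hkey CHAR(32)    NOT NULL REFERENCES hub_customer,  -- parent Hub
  load_dts      TIMESTAMP   NOT NULL,  -- load timestamp
  record_source VARCHAR(50) NOT NULL,  -- audit: where the row came from
  credit_limit  NUMERIC(12,2),
  credit_rating VARCHAR(10),
  PRIMARY KEY (customer_hkey, load_dts)
);
```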
Integrating other source systems can also be a challenge with dimensional models, especially when populating conformed dimensions, as you have to account for all semantic and structural differences between the various systems before you load the data. This could mean a lot of cleansing and transforming of the source data (which could jeopardize your ability to clearly trace data back to the source).
Data Vault avoids this issue by integrating around business keys (via Hubs) and by allowing apparently disparate data sets to be associated through Same-As-Link tables, which support dynamic equivalence mappings between the data sets from different sources. In agile terms, these structures can be built in a future sprint, after the data is loaded and profiled. (As a side note, Same-As-Links are also very helpful for using your vault to do master data type work.)
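Structurally, a Same-As-Link is just another Link table. A minimal sketch (hypothetical names, again assuming the hub_customer table) that records the equivalence of two customer keys loaded from different source systems might look like this:

```sql
-- Same-As-Link: declares that two Hub rows (e.g., the same customer keyed
-- one way in CRM and another way in Billing) represent one real-world entity.
CREATE TABLE sal_customer (
  master_hkey    CHAR(32)    NOT NULL REFERENCES hub_customer,  -- surviving key
  duplicate_hkey CHAR(32)    NOT NULL REFERENCES hub_customer,  -- equivalent key
  load_dts       TIMESTAMP   NOT NULL,
  record_source  VARCHAR(50) NOT NULL,  -- audit trail back to the source
  PRIMARY KEY (master_hkey, duplicate_hkey)
);
```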
Well, those are the primary issues I have run across. I am sure there are more, but this should be enough for now.
Can’t wait to hear your comments! Be sure to tweet this to your friends.
Later,
Kent
P.S. Stay tuned for #3, where I will discuss the traditional Inmon 3NF-style data warehouse.
Thanks Kent, great summary. Looking forward to part 3!
Great post, Kent! I enjoyed it very much! I believe the most powerful argument, the one I happen to have been looking for, is in the paragraph “Integrating other source systems can also be a challenge with dimensional models,(…)”. In fact, DV cuts a LOT of ETL design effort. Since DV makes everything else easier despite the extra layer, cutting down on ETL design effort may be a very tangible advantage, tangible enough to sell to disbelieving management and challenging dimensional-modeler co-workers. I will try to figure out how to measure it somehow and share it on the DV LinkedIn group.
Glad you liked it. Please do keep us all posted on your findings.
Very nice summary. Of course, I already thought it very problematic to use a dimensional model as the base repository in a DW. Part 3 will be the more challenging (and important) post. I look forward to it.
Thanks for pointing out that integrating a new system with conformed dimensions is one of the biggest drawbacks of the Kimball way. One should note that with a DV it is easier, but you don't get direct results: your customer report will not magically be updated. All the work still has to be done with business rules.
To be clear, if adding a new system means creating a new Hub for those new clients with a different business key, I can understand that the same will happen with orders and most tables. Sure, you can map things with relationships, but the complexity seems to grow when you want to merge all of those to create a reporting star schema.
Keeping all the information and specificity does not reduce complexity at the DV level. It is more about creating a common, clean framework for all source data with a built-in versioning feature. I see it more as an intermediate step before a data warehouse, whose purpose is to add value by consolidating and simplifying the information.
Great series of posts Kent …
My personal opinion (and experience) is that the semantic separation of archival and integration (in the DV) from the presentation layers is one of its biggest values, because it concentrates effort on two completely different sets of requirements.
The true value of the DV is only realized when business rules are pushed downstream, close to the business users (in the marts).
Leaving the core Data Warehouse intact even after business rules change is extremely powerful.
The ability to leave the structures and data integration processes intact even after source system modifications and/or addition of new sources is extremely powerful.
The flexibility you get in mart builds including treating the marts as a completely front-end construct that can be destroyed and reconstructed at will is extremely powerful (The auto-generation tools that the Kimballites are now raving about are much more useful with a DV than they’ll ever be with a BUS architecture – but there is inherent refusal or outright denial on their part).
“I also want to recognize that there are some very expert architects out there who have had great success building Kimball-style EDWs and mitigated all the risks I am going to mention.”
This is a myth. It has never been done because it’s technically impossible. Even the DV has certain inherent risks that can’t be avoided. It just so happens the risk of re-engineering is extremely low in a DV when compared to a BUS.
So, to all those who have claimed successes: they either haven't encountered maintenance yet or are liars. It's also possible they're smaller systems that didn't really require much change.
“Plus, the instantiated dimensional model (which is highly denormalized) will simply take more space than a Data Vault model, which is more normalized.”
Storage is cheap these days; it's memory and query space that have become more important. The flexibility you get with a DV is hard to get with other formats. The DV has comparatively more redundancy than a 3NF DW, and databases have evolved a great deal in terms of storage efficiency for redundant structures.
There's so much more I could talk about, especially with the Kimball comparison. The DV does pick the good things from it and makes the use of star schemas much more efficient.
The comparison to in-memory databases and tools (which I believe are also better served by a foundational DV) is another interesting discussion, especially given the evolving technology landscape in which source databases now include columnar, key-value, graph databases, and more. Sourcing from object-based databases was already an issue from a structural perspective for the Data Warehouse; these new ones throw another wrench into it.
I’m looking forward to your next post. I think you could have easily done many more parts.