Data Vault vs. The World (1 of 3)
Okay, maybe not “the world” but it does sometimes seem like it.
Even though the Data Vault has been around for well over 10 years now, and has multiple books, videos, and plenty of success stories behind it, I am constantly asked to compare and contrast Data Vault to approaches generally accepted in the industry.
What’s up with that?
When was the last time you got asked to justify using a star schema for your data warehouse project?
Or when was that expensive consulting firm even asked “so what data modeling technique do you recommend for our data warehouse?”
Such is the life of the “new guy.” (If you are new to Data Vault, read this first.)
So, over the next few posts, I am going to lay out some of the explanations and justifications I use when comparing Data Vault to other approaches to data warehousing.
The first contestant: Poor man’s ODS vs. Data Vault
This approach entails simply replicating the operational (OLTP) tables to another server for read-only reporting. It can serve as a partial data warehouse solution, using something like Oracle’s GoldenGate to support near-real-time operational reporting while minimizing the impact on the operational system.
This solution, however, does not adequately support dimensional analysis, nor does it allow for historical tracking of changes to the data (beyond any temporal tracking inherent in the OLTP data model).
A big risk of this approach is that as the OLTP structures continue to morph and change over time, reports and other extracts that access the changed structures will break as soon as the change is replicated to the ODS.
How does Data Vault handle this?
Data Vault avoids these problems by using structures that are not tightly coupled to any one source system. As the source systems change, we simply add Satellite and Link structures as needed. In the Data Vault methodology we do not drop any existing structures, so reports will continue to work until we can properly rewrite them to take advantage of the new structures. If entirely new data is added to a source, we would likely add new Hubs as well.
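To make the idea concrete, here is a minimal sketch of those structures using Python’s sqlite3. All table and column names here are hypothetical, not part of any standard; the point is simply that descriptive attributes live in Satellites hung off a Hub, so absorbing a source change means adding a new Satellite rather than altering anything that reports already depend on.

```python
import sqlite3

# Hypothetical sketch of Data Vault structures (names are illustrative).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hub: one row per business key, independent of any source layout.
cur.execute("""
    CREATE TABLE hub_customer (
        customer_hk   INTEGER PRIMARY KEY,   -- surrogate key
        customer_bk   TEXT UNIQUE NOT NULL,  -- business key from the source
        load_dts      TEXT NOT NULL,
        record_source TEXT NOT NULL
    )""")

# Satellite holding the original source's descriptive attributes.
cur.execute("""
    CREATE TABLE sat_customer_crm (
        customer_hk INTEGER NOT NULL REFERENCES hub_customer(customer_hk),
        load_dts    TEXT NOT NULL,
        name        TEXT,
        city        TEXT,
        PRIMARY KEY (customer_hk, load_dts)
    )""")

# When the source changes (or a second source appears), we ADD a new
# satellite -- existing tables are never dropped or altered, so reports
# written against them keep working.
cur.execute("""
    CREATE TABLE sat_customer_billing (
        customer_hk  INTEGER NOT NULL REFERENCES hub_customer(customer_hk),
        load_dts     TEXT NOT NULL,
        credit_limit REAL,
        PRIMARY KEY (customer_hk, load_dts)
    )""")

cur.execute("INSERT INTO hub_customer VALUES (1, 'CUST-001', '2013-01-01', 'CRM')")
cur.execute("INSERT INTO sat_customer_crm VALUES (1, '2013-01-01', 'Acme Co', 'Denver')")
cur.execute("INSERT INTO sat_customer_billing VALUES (1, '2013-02-01', 50000.0)")
conn.commit()
```

Note that the same Hub row now anchors data from two feeds; integrating a second source is just another Satellite keyed to the same business key.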
An additional advantage of this loosely coupled approach is that we can load data from multiple sources. If we simply replicate specific OLTP structures, we cannot easily integrate other source system feeds – we would have to build another repository to do the integration (which would likely entail duplicating quite a bit of the data).
Don’t get me wrong, there is nothing wrong with using replication tools to build real time operational data stores.
In fact it is an excellent solution to getting your operational reporting offloaded from the main production server.
It is a tried and true solution – for a specific problem.
It is however, not the right solution if you are building an enterprise data warehouse and need to integrate multiple sources or need to report on changes to your data over time.
So let’s use the right tool for the right job.
Data Vault is simply the newer, better tool.
In the next two posts I will compare Data Vault to the Kimball-style dimensional approach (part 2 of 3) and then to Inmon-style 3NF (part 3 of 3).
P.S. Be sure to sign up to follow my blog so you don’t miss the next round of Data Vault vs. The World.
Nice post Mr. Gaziano. Do you imply enterprises should have a DV instead of an ODS, or should not use an ODS at all? (I don’t like ODSs, btw, and hence usually don’t build them.) An ODS, as you have said, solves a very specific problem. How would this need (to avoid performance impact on OLTP databases due to real-time reporting) be simplified or made cheaper by having a DV instead of an ODS? I can see you’d save effort on the way to the EDW, but it seems it would be more complicated and expensive to have real-time reporting from the (real-time?) DV than from the ODS. I can see that, after all the work of setting up a DV to support real-time reporting, the savings from isolating the DV pay off when reports do not break after some OLTP structure changes.
Please do not take my questions as a challenge for the DV or to you. I have been asked very similar questions and I was not able to find iron-clad answers. I’m sharing the questioning with you expecting to hear from a more experienced DV practitioner.
If the **only** problem you have to solve is reducing the reporting load on an operational system, then using a data vault will not be simpler, as the data vault is an additional layer you may not need. To do a real-time ODS using Data Vault, you would build the data vault model, implement real-time messaging to feed the vault, then create a virtualized ODS structure on top of the vault for reporting.
If you have experienced data architects familiar with data vault (or who learn fast), this approach should not cost you much more than the replicated ODS to build, but it probably will not be any cheaper in the short term (it really depends on your cost of labor and their experience).
The real savings is in the long term – when you don’t have to re-engineer later to support something more than operational reporting.
So should you have a Data Vault instead of an ODS? I think most organizations would be well served to build a data vault then virtualize the ODS on top. Then they are positioned for the future.
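The “virtualize the ODS on top” idea can be sketched in a few lines. This is an illustrative example only, again using sqlite3 with hypothetical table and column names: the reports query a view that presents the current state of each customer, and when the vault grows new structures only the view definition changes, not the reports.

```python
import sqlite3

# Hypothetical sketch: a virtualized ODS as a view over a tiny vault
# (hub + satellite). Names are illustrative, not standard.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE hub_customer (
        customer_hk INTEGER PRIMARY KEY,
        customer_bk TEXT UNIQUE NOT NULL
    );
    CREATE TABLE sat_customer (
        customer_hk INTEGER NOT NULL,
        load_dts    TEXT NOT NULL,
        name        TEXT,
        city        TEXT,
        PRIMARY KEY (customer_hk, load_dts)
    );
    -- The "ODS" the reports see: the most recent satellite row per
    -- customer, joined back to the hub.  Vault changes are absorbed
    -- here, so the reports never touch the raw vault tables.
    CREATE VIEW ods_customer AS
    SELECT h.customer_bk, s.name, s.city
    FROM hub_customer h
    JOIN sat_customer s ON s.customer_hk = h.customer_hk
    WHERE s.load_dts = (SELECT MAX(load_dts)
                        FROM sat_customer s2
                        WHERE s2.customer_hk = s.customer_hk);

    INSERT INTO hub_customer VALUES (1, 'CUST-001');
    INSERT INTO sat_customer VALUES (1, '2013-01-01', 'Acme Co', 'Denver');
    INSERT INTO sat_customer VALUES (1, '2013-03-01', 'Acme Corp', 'Boulder');
""")
rows = cur.execute("SELECT * FROM ods_customer").fetchall()
print(rows)  # only the latest satellite row per customer appears
```

The satellite keeps full history (both rows above survive), while the view shows only the current picture – which is exactly the operational-reporting shape an ODS consumer expects.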
Thank you very much for such an insightful answer, Kent.
One should note that reporting should go through views, so that a change in the underlying operational schema doesn’t require updating the reports, just the views shared by many reports.
For history, obviously the DV is better than the ODS.
One should note as well that an ODS ideally contains cleaned data and is not just a copy of the operational data. I have seen that done, but there is a tradeoff between the data improvement and the delivery time.
Concerning the whole DV story, my technical concerns are:
– big/big joins: I don’t want those. The only execution plan I want to see is a sequential read of the fact table and hash joins to the dimension tables, with the dimensions in memory – O(n), basically. That is the performance profile of the star schema: factor out as much as you want, but never let a working set exceed memory. Put otherwise, never read data from disk twice, and read it sequentially (less important with SSDs, it seems). Some parts will need a sort (window functions and/or merge joins) during the load step; I can live with that if it’s really needed.
– complexity: I have a poor memory, and a screen that can’t contain a list of a hundred tables on the side. I want to be effective in my daily work.
DV stories are usually about raw data warehousing. Nobody argues against that: it stores rows well, and it’s flexible. There is rarely a discussion about what comes next. It is said that users end up with a star schema anyway (materialized or not), so DV is just an extra step, like the Inmon EDW – it obviously adds interesting features but also complicates the whole stack. I spend most of my time on the business rules, not on fixing the ETL pipeline when the operational system changes. If at some point the business divides the client table into prospects and real clients, the DV will not really help. Evolving the old client hub will be difficult (it now contains some prospects and some real clients), and the whole pipeline will have to change anyway, because most client-related business rules and KPIs will be impacted.
We should evaluate the whole stack and not just one part of it in isolation.
Thanks for taking the time for such a detailed comment!
I agree about using views to base reports on. I am definitely in favor of virtualizing as much as possible to retain agility. With the power of database appliances today (Exadata, TimesTen, columnar stores, etc.), I am much less worried about all the joins in the data vault, but I agree that, depending on your hardware and data volumes, there are still many times when instantiating a star schema on top of the data vault is the right solution.
Regardless, the reporting layer should never see the raw data vault schema.
Interesting series. I’m not sure I buy your argument about consulting companies being asked about the best way to model a data warehouse. Companies hire consultants because they have the knowledge to solve their problem, usually knowledge they don’t have themselves. I’m not sure the company even cares what the ultimate schema is if their problem is solved. However, they do care that the solution delivered is maintainable over time.
As the “new guy”, if you are pitching a solution that isn’t mainstream, of course you’ll have to justify why you are suggesting something that differs. A better question might be, if this approach has been around for 10 years, why isn’t it more mainstream? I think part of that answer lies in a comment you made in part 2: “I also want to recognize that there are some very expert architects out there who have had great success building Kimball-style EDWs and mitigated all the risks I am going to mention. Unfortunately, there are not enough to them to go around.” I think the same can be said of good Data Vault modelers. The problem is compounded by the fact that there are 100 times the number of consultants out there that can understand a 3NF or Star Schema warehouse (or a standard hybrid of the two) than there are DV modelers. If you are in Colorado or Vermont, finding a Data Vault modeler might not be that hard, but there are plenty of areas where that isn’t true. Further, due to scarcity, the standard rates for mainstream data warehouse consultants are likely to be much lower. You can certainly argue “you get what you pay for”, but I’m not sure the architecture decision can be made in a vacuum without considering all of the costs, especially if the company has few to no Data Vault resources and will be left with something that no one understands when the high priced consultants leave.
Last, metaphor aside, Data Vault is not a tool, it is an approach, and one that most people have to implement manually. For 20 years companies have been trying to get out of the software development business, and that trend is reaching closer and closer to the data warehouse. Certainly Amazon Redshift wants to host your data warehouse in the cloud (based on their ParAccel acquisition). Kalido offers a cloud based version (which we use for our training courses, among other things) and can also automate the creation of warehouses (using a similarly flexible DW 2.0 compliant approach) directly on customer on-premise SMP or MPP databases. And there are obviously other solutions. Ultimately customers are going to be weighing multiple options (“buy vs. build”, “cloud vs on-premise”, etc.) and the option they choose will likely inform the final architecture.