The Data Warrior

Changing the world, one data model at a time. How can I help you?

Data Vault vs. The World (3 of 3)

So far in this series I have compared and contrasted the Data Vault approach with using a replicated ODS or a dimensional Kimball-style structure as the enterprise data warehouse repository. Let me thank everyone who has replied or commented for taking these posts as they were intended, keeping the conversation professional and not starting any flame wars. 🙂

So to close out the thread, I must of course discuss the classic Inmon-style CIF (Corporate Information Factory) approach. Many of you may also think of this as the 3NF EDW approach.

To start, it should be noted that I began my data warehouse career building 3NF data warehouses and applying Bill Inmon's principles. Very shortly after I began learning about data warehousing, I had the incredible privilege of not only meeting Mr. Inmon and learning from him, but also co-authoring a book (my first) with him: The Data Model Resource Book. I worked with Bill on the data warehouse chapters and, because of that experience, became well versed in his theories and principles. For many years I even gave talks at user group events about how to convert an enterprise data model into an enterprise data warehouse model.

So how does this approach work? Basically, you do some moderate denormalization of the source system model (where it is not already denormalized) and add a snapshot date to all the primary keys (to track changes over time). This is of course an oversimplification – there are a number of denormalization techniques that can be used in building a data warehouse along the lines of Bill's original thesis.

Additionally, this approach (CIF, or even DW 2.0) calls for building dimensional data marts on top of that EDW (along with other reporting structures as needed). The risks here are similar to those mentioned in the previous posts.

The primary risk is that the resulting EDW data structure is usually pretty tightly coupled to the OLTP model, so the risk of reworking and reloading data is very high as the OLTP structure changes over time. That rework, of course, has downstream impacts on the dimensional models, reports, and dependent extracts.

The addition of snapshot dates to all the PKs in this style of data warehouse model also adds quite a bit of complexity to the load and query logic, as the dates cascade down through parent-child-child relationships. Getting data out ends up requiring lots of nested MAX(date) sub-queries. Miss a sub-query, or get one wrong, and you get the wrong data. Overall, it is a fairly fragile architecture in the long run.
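To make that querying burden concrete, here is a minimal sketch of the pattern, using Python's sqlite3 so it is runnable. The table and column names (customer, snapshot_date, status) are hypothetical, not from any particular warehouse; the point is the correlated MAX(snapshot_date) sub-query that every "current state" question requires in this style of model:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A 3NF-style warehouse table where the snapshot date is part of the PK,
# so every change to a customer produces a new row.
cur.execute("""
    CREATE TABLE customer (
        customer_id   INTEGER,
        snapshot_date TEXT,
        status        TEXT,
        PRIMARY KEY (customer_id, snapshot_date)
    )
""")
cur.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "2012-01-01", "prospect"),
     (1, "2012-06-15", "active"),
     (2, "2012-03-10", "active")],
)

# Every "current state" query needs a correlated MAX(snapshot_date)
# sub-query; omit it (or correlate it wrongly) and you silently get
# stale or duplicated rows. Now imagine this repeated for every table
# in a parent-child-child chain.
rows = cur.execute("""
    SELECT c.customer_id, c.status
    FROM customer c
    WHERE c.snapshot_date = (
        SELECT MAX(c2.snapshot_date)
        FROM customer c2
        WHERE c2.customer_id = c.customer_id
    )
    ORDER BY c.customer_id
""").fetchall()
print(rows)  # [(1, 'active'), (2, 'active')]
```

One table needs one sub-query; a join across several snapshot-dated tables needs one per table, which is where the fragility comes from.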

Also, like the dimensional approach, I have encountered few teams that have been successful implementing this style of data warehouse in an incremental or agile fashion. My bad luck? Maybe…

The loosely coupled Data Vault data model mitigates these risks and also allows for agile deployment.

As discussed in the previous posts, the data model for a Data Vault based data warehouse is based on business keys and processes rather than the model of any one source system. The approach was specifically developed to mitigate the risks and struggles that were evident in the traditional approaches to data warehousing, including what we all considered the Inmon approach.
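For contrast, here is a minimal sketch of what "based on business keys" means in Data Vault terms: a hub holds only the business key, and a satellite holds the descriptive attributes historized by load date, so source-model changes tend to land in new satellites rather than forcing rework of the core. Table and column names here are illustrative assumptions, not from any specific implementation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hub: one row per business key, independent of any one source system's model.
cur.execute("""
    CREATE TABLE hub_customer (
        customer_hkey  INTEGER PRIMARY KEY,
        customer_bk    TEXT UNIQUE,   -- the business key (e.g. an account number)
        load_date      TEXT,
        record_source  TEXT
    )
""")

# Satellite: descriptive attributes, historized by load date. If a source
# system adds or changes attributes, you can add another satellite; the
# hub (and anything joined to it) is untouched.
cur.execute("""
    CREATE TABLE sat_customer_details (
        customer_hkey  INTEGER,
        load_date      TEXT,
        status         TEXT,
        PRIMARY KEY (customer_hkey, load_date),
        FOREIGN KEY (customer_hkey) REFERENCES hub_customer (customer_hkey)
    )
""")

cur.execute(
    "INSERT INTO hub_customer VALUES (1, 'ACCT-1001', '2012-01-01', 'CRM')"
)
cur.executemany(
    "INSERT INTO sat_customer_details VALUES (?, ?, ?)",
    [(1, "2012-01-01", "prospect"),
     (1, "2012-06-15", "active")],
)

# Full history of the business key, with no correlated sub-query needed.
history = cur.execute("""
    SELECT h.customer_bk, s.load_date, s.status
    FROM hub_customer h
    JOIN sat_customer_details s ON s.customer_hkey = h.customer_hkey
    ORDER BY s.load_date
""").fetchall()
print(history)
```

A full Data Vault also has link tables for relationships between hubs, but even this two-table sketch shows the decoupling: the hub is keyed on the business key, not on any source system's primary key.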

As I mentioned earlier I got to interact with Bill Inmon while we worked on a book. The interaction did not stop there. I have had many discussions over the years with Bill on many topics related to data warehousing, which of course includes talking about Data Vault. Both Dan and I talked with Bill about the ideas in the Data Vault approach. I spent a number of lunches telling him about my real-world experience with the approach and how it compared to his original approach (since I had done both). There were both overlaps and differences. Initially, Bill simply agreed it sounded like a reasonable approach (which was a relief to me!).

Over time, through many conversations with many people, plus study and research, we actually won Bill Inmon over and got his endorsement of Data Vault. In June of 2007, Bill Inmon stated for the record:

The Data Vault is the optimal choice for modeling the EDW in the DW 2.0 framework.

So if Bill Inmon agrees that Data Vault is a better approach for modeling an enterprise data warehouse, why would anyone keep using his old methods and not at least consider learning more about Data Vault?

Something to think about, eh?

I hope you enjoyed this little series about Data Vault and will keep it in mind as you get into your new data warehouse projects for 2013.

Kent

P.S. – if you are ready to learn more about Data Vault, check out the introductory paper on my White Papers page, or just go for broke and buy the Super Charge book.


4 thoughts on “Data Vault vs. The World (3 of 3)”

  1. I’ve never attempted Inmon DW modeling, for sheer fear of getting all tangled up in the details (which are quite a bit harder to pin down than Dimensional Modeling’s). The first time I set my eyes upon DV I thought “this is what Inmon is all about!” DV sounds like a natural match for all of Inmon’s theories. I met Bill Inmon in São Paulo in 2001, while he was on his SAS-sponsored lecturing trip to Brazil.

  2. We can argue about the merits of one approach over the other; however, meeting the requirements of Inmon’s DW 2.0 essentially requires a hybrid schema, and there are multiple ways to get there. Successfully implementing such a DW 2.0 compliant hybrid is a continuum from ‘fully automated’ like how Kalido does it to manually using a modeling approach like Data Vault. It is worth noting that Kalido was the first solution to be certified as DW 2.0 compliant:

    http://www.kalido.com/c9a5ca25-5586-4ce9-a46e-2a7256d9f167/press-releases-2007-detail.htm

  3. In case you might not know, a point of interest: Stephen works for Kalido. No problem there, just thought I would point that out.

    Stephen is comparing apples to oranges here: a tool-based approach versus a modeling-based approach, which is not something I think should be done, or could be done appropriately.

    Kalido certainly has merits for use (as an enablement tool in certain situations). What I would like to see (as I think perhaps the readers would as well), is Kalido come back to the table to actually discuss the data modeling constructs that Kalido employs under the covers. That would be really interesting to see and compare with Data Vault Modeling.

    The Data Vault Model on the other hand is just that – an approach to architecture and design. It is tool agnostic – therefore it can be used with or without tools. Makes no difference.

    Furthermore, I wish to state that I have yet to see where in DW2.0 there is an actual specification for “hybrid schema”. Stephen, do you have any references to this? I would very much like to look at the same documentation you have access to, I am interested in researching the requirement “requires a hybrid schema”.

    I’m not saying it doesn’t exist, I’m just curious as to the reference point where this statement has been published, perhaps I missed something in my DW2.0 certification.

    As for the record: I am authorized in writing by Bill Inmon to teach, speak, write, and discuss DW2.0 structures. I am also DW2.0 Certified (in case that matters to anyone out there).

    As for a Hybrid approach – the Data Vault Model is that as well, a Hybrid between normalized data warehouse models and denormalized (star schema) data warehouse models. This is in fact one of the reasons why Bill Inmon has publicly endorsed Data Vault Modeling for use in DW2.0.

    There’s also another option out there called Anchor Modeling – which is increasingly interesting to check out.

  4. Having worked for Kalido for over 15 years and being fairly active in the space and online, I think most folks here pretty much know where I work by now. Certainly Kent does. 🙂 If not, LinkedIn is just a click or two away.

    I defer to you on DW2.0 matters, but as far as I’m aware, only solutions that are hybrid have been DW2.0 approved. Perhaps you can update us here on some that aren’t; I certainly could have missed them. I think non-hybrid approaches are going to run into problems supporting the level of flexibility that Bill prescribes. I also think approaches like Kalido that are driven by metadata, and not just passively recording it, are going to be more closely aligned with Bill’s requirement of tightly coupling metadata to the data warehouse environment.

    Regarding Kalido’s architecture, as we’ve discussed before, the underlying mechanism for Kalido is similar to, though not exactly the same as, your Data Vault approach (e.g. Hub, Link, and Satellite -> Kalido Adaptive Services Core, Transaction Datasets, and Mapping Tables). We used to spend a lot of time talking about the way we worked, and we still do that a fair bit with prospects and customers when asked, but more and more we find it just isn’t that relevant anymore. Why? Because companies buy a solution like Kalido because they don’t really want to deal with the nitty-gritty details. They want to know it is based on sound principles, that it is fast and it adapts to change well, perhaps take it for a quick spin on their Teradata or Oracle Exadata box, but at that point, they’re usually good.

    You said that “the Data Vault Model [is] tool agnostic – therefore it can be used with or without tools.” From a tool perspective, I would argue that is only true if the tools operate at a certain granular level: creating models with a data modeling tool, populating those models with ETL tools, manually configuring your BI tools against those schemas. That’s fine; that’s what most people do. I don’t want to turn this into a Kalido commercial, but I’d say we are on the opposite side of that: we strive to eliminate data modeling and ETL tools and replace them with software automation using best practices driven by a high-level business model, enabled by a flexible hybrid model, and pushing as much processing as we can directly into the database (ELT). Probably others in the data warehouse automation space would say something along similar lines, at least for portions of the process.

    Bottom line, Kent just wrote a three part series on why he thinks Data Vault is the best way to create enterprise data warehouses. I have no issue comparing architectures for manual custom builds. But I also think there is a buy vs. build decision and that depending on what you buy, the architectural decision may be less important or altogether moot.
