Is the Data Vault too complex?
This was a very interesting topic that came up on LinkedIn the other day, so I wanted to address it here too.
There seem to be quite a few people who think that Data Vault models are harder to build and manage than what we do today in most shops. So let me explain how I came to learn Data Vault Data Modeling.
Before learning Data Vault, I had successfully built several 3NF OLTP, 3NF DW, and Kimball-style Dimensional data warehouses (and wrote about it with Bill Inmon and Len Silverston in the original Data Model Resource Book).
In other words, I had a reasonable amount of experience in the subject area (data modeling and data warehousing).
I personally found Data Vault extremely easy to learn as a modeling technique (once I took the time to study it a bit). At the time that meant reading the old white papers, attending some lunch & learns with Dan Linstedt and then building a few sample models.
I was definitely skeptical at first (and asked lots of questions at the public lunch & learns). I did not care about MPP, scalability, or many of the other benefits Dan mentioned. I just knew from experience there were a few issues I had seen with the other approaches when it came to building a historical enterprise data store and was hoping Data Vault might be a solution.
In comparison to trying to learn how to design and load a Type 2 slowly changing dimension, Data Vault was a piece of cake (for me anyway).
Once I was convinced, I then introduced the technique to my team in Denver – who had virtually no data warehouse experience.
It was universal – everyone from the modelers to the DBAs to the ETL programmers found the technique very easy to learn.
Our investment: One week of training from Dan for 7 people and 3 or 4 days of follow-on consulting where Dan came in once a month (for a day) to do a QA review on our models and load routines and mentor us on any issues we were having.
Dan did not make much $$ off of us. 😦
Since then, I have found that experienced 3NF modelers pick up the technique in no time flat.
Why is that?
Because Data Vault relies on solid relational principles, experienced 3NF modelers seem to grasp it pretty fast.
Modelers who only have experience with star schemas, on the other hand, seem to have a bit of a hard time with the approach. For some of them it is a paradigm shift in modeling technique (i.e., feels very unfamiliar – “too many tables and joins”), for others it is almost a dogmatic objection as they were (sadly) taught that dimensional/star was the only “right” way to do data warehousing.
They are just not open to a new approach for any reason (sad but true). 😦
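For readers coming from star schemas, the "too many tables" reaction usually fades once you see how small and uniform each table is. Here is a minimal sketch in generic SQL (the table and column names are my own illustrations, not from any standard): a hub holds just the business key, a satellite holds the descriptive history, and a link records a relationship between hubs.

```sql
-- Hub: one row per unique business key, insert-only
CREATE TABLE hub_customer (
    customer_hkey   CHAR(32)     NOT NULL PRIMARY KEY,  -- hash of the business key
    customer_no     VARCHAR(20)  NOT NULL UNIQUE,       -- the business key itself
    load_dts        TIMESTAMP    NOT NULL,
    record_source   VARCHAR(50)  NOT NULL
);

-- Satellite: descriptive attributes, historized by load date
CREATE TABLE sat_customer_details (
    customer_hkey   CHAR(32)     NOT NULL REFERENCES hub_customer,
    load_dts        TIMESTAMP    NOT NULL,
    name            VARCHAR(100),
    region          VARCHAR(50),
    record_source   VARCHAR(50)  NOT NULL,
    PRIMARY KEY (customer_hkey, load_dts)
);

-- Link: relationship between two hubs, also insert-only
CREATE TABLE lnk_customer_account (
    cust_acct_hkey  CHAR(32)     NOT NULL PRIMARY KEY,
    customer_hkey   CHAR(32)     NOT NULL REFERENCES hub_customer,
    account_hkey    CHAR(32)     NOT NULL,  -- references hub_account (not shown)
    load_dts        TIMESTAMP    NOT NULL,
    record_source   VARCHAR(50)  NOT NULL
);
```

Yes, that is three tables where a star schema might have one, but every table follows the same insert-only pattern – which is exactly why the load routines are so easy to teach and to generate.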
The biggest issue I have seen with clients is a reluctance to try the approach for fear of failure: they don’t personally know anyone (other than me) who has used the approach, and they think it is easier (and cheaper?) to find dimensional modelers.
This happens, even if they agree in concept that Data Vault sounds like a very valid and flexible modeling approach.
As we all know, it takes $$ to train people on star schema design too. So my advice is this: if you have a team of people who know 3NF but don’t know dimensional, train them on Data Vault to build your EDW, then hire one or two dimensional modelers to build your end-user reporting layer (i.e., data marts) off the Data Vault.
So that’s my 25 cent testimonial. (You get it for free!)
If you want to learn more about Data Vault, check out my presentations on SlideShare or click on the Super Charge book cover (below my picture in the sidebar) to buy the Data Vault modeling book.
Check it out and let me know what you think in the comments. How do we get people over the fear of trying Data Vault?
Talk to you later.
I had substantial background in both 3NF and Star solutions when I was exposed to DV. I found it intuitive and natural from the start. But, most importantly, I found it to address many of the nastiest, common problems of the other approaches. Yes, it’s different. But I believe that, were one to learn DW from the ground up in each of the 3 paradigms…having no previous bias…one would conclude that DV was the most appropriate for the front-end of the warehouse. As Kent points out, you’ll still want star-based DMs on the backend. I see DV as a part of the overall architecture…not a full replacement.
Good to hear from you! Thanks for adding your perspective.
Production data models leave a deep impression in operators’ memories. Statistical, analytical, reporting, and mining models are sufficiently different and may seem alien to the uninformed.
The reasons for the differences in model and methodology are easy to understand: each demands a distinct and diverse end-point. Production data serve well-known business processes. Analytical data (my shorthand for all the non-production uses) fuel the derivation engines of the Magic Data 8 Ball. (Answer uncertain. Try again later.)
Like a kata, the production demands, forms, states, stages, and uses are drilled into the early students of production data. It may be harder for some to extend (or even break) the kata before they can explore what may seem far less obvious.
As you point out in your blog entry, the concepts embraced in this “different” data philosophy are natural and understood in other fields: mostly scientific, or meta-mathematical — but rarely in engineering.
As it is the schools of engineering that have influenced, badly in many cases, the foundational training of our next generation of technologists, the rigidity of engineering can sometimes be seen in the difficulty of breaking the norms of the production data kata. A broader understanding of the natural and theoretical worlds would make this transition easier. But it is as it is.
A natural law in chemistry is that the most common structure of a molecule can be found when the internal energy of the molecule is lowest (of all possible combinations of its atoms). I hold the same theory regarding data. While not the lowest energy state possible of all data elements (in a manner of speaking), the data vault provides an obvious reduction in internal energy for a schema.
We may not like to acknowledge the fact, but the structure of business process data demands a higher state of internal energy, reducing the energy consumption necessary in business processes or function. The internal energy level in a proposed analytical data state is theoretically much lower. However, by removing the energy demands to support business processes, the data appear to lose a lot of structure (when, in fact, the data take on the greatest possible structure).
Perhaps this is the greatest impediment to those who expect to see familiar business process structures in their data. They observe a “natural state” of data. Unfortunately for many data professionals, they have never seen data in a natural state.
As you mention, for those who are comfortable with 3NF data, the transition to a more natural state seems easier. I have concluded that 3NF is a good break-away point to migrate from production to analytical states. Those who expect cognitive compatibility between production and analytical data may experience greater transitional dissonance.
In short, complexity is different to different people. After becoming comfortable with a high-complexity field of production data, the natural data state may seem so unfamiliar as to mimic complexity, when in fact it is the reduction of such.
Jeff – Welcome and thanks for posting the comments.
I love your analogies (especially the katas and chemistry). Very well put.
Not hard to learn, but a bit hard to implement. I am in the process of building an SCD Type 2 dimension based upon four hubs with their satellites and three link tables, and the join structures are killing my brain 🙂 I try to build point-in-time tables using autogeneration of SQL (we are on Oracle) and I am searching for relevant and suitable documents on the web telling me exactly how to do that. Any ideas or pointers in the right direction will be much appreciated!
I am not sure you will find that kind of detail on the web (at least not for free). The new book has a lot of details on point-in-time (PIT) tables. Have you looked at that yet? Also, Dan will soon be releasing some online implementation classes. You can get on the notification list for that here: http://learndatavault.com/dv-implementation-early-bird-notification/
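In the meantime, the basic shape of a PIT table may help: for each hub key and each snapshot date, you record the load date of the satellite row that was in effect as of that date. A hedged sketch in generic SQL – the table and column names (pit_customer, sat_customer_details, dim_snapshot_date) are illustrative assumptions, not anything from the books:

```sql
-- PIT sketch: for each customer and snapshot date, capture the
-- latest satellite load date in effect as of that date.
INSERT INTO pit_customer (customer_hkey, snapshot_dts, sat_details_ldts)
SELECT h.customer_hkey,
       d.snapshot_dts,
       (SELECT MAX(s.load_dts)
          FROM sat_customer_details s
         WHERE s.customer_hkey = h.customer_hkey
           AND s.load_dts <= d.snapshot_dts) AS sat_details_ldts
  FROM hub_customer h
 CROSS JOIN dim_snapshot_date d;
```

With a PIT table in place, downstream queries equi-join from the PIT row to each satellite on (hub key, load date), so the brain-killing date-range joins happen once, at PIT build time, instead of in every report.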
People who think the DV is hard without even trying probably haven’t really done maintenance on a complex SCD-II. Take, for example, something with multiple sources like Accounts (in any large company that has disparate systems due to acquisitions, etc.) and try to build a dimension to serve different business groups.
You’ll end up creating something fairly wide and hard to manage where you have to even take into account precedence (which may be different for different groups).
I’ve had to fix a broken load on this in production and it’s ugly. This can happen for any reason, such as a new bug in the code or even hardware issues like the DI server having RAM issues. You’ll start with bandages on the data, which in a few years will look like a mummy.
Contrast this same problem with a DV and tell me with a straight face that it’s not easier to build a DV. Sure, you still have to build the last mile to the customer, but the DV helps much with that too. Separating the engineering components of the build (which are friendly to automation and easier modeling) from the business components ensures a much easier delivery to the actual business, where they’re not stepping on each other’s toes.
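To make the contrast concrete, here is roughly what the multi-source Accounts problem looks like on the DV side (a sketch with illustrative names – stg_crm_accounts and the 'CRM' source label are my assumptions): each source system just inserts the business keys it hasn’t seen, and precedence rules are deferred to the mart layer instead of being baked into one wide dimension.

```sql
-- Load hub_account from one source; repeat the same pattern per source system.
-- Insert-only: add any business keys not yet seen, whichever system owns them.
INSERT INTO hub_account (account_hkey, account_no, load_dts, record_source)
SELECT stg.account_hkey, stg.account_no, CURRENT_TIMESTAMP, 'CRM'
  FROM stg_crm_accounts stg
 WHERE NOT EXISTS (SELECT 1
                     FROM hub_account h
                    WHERE h.account_hkey = stg.account_hkey);
```

Because every hub load looks like this, a new source system means a new copy of the same template, not a redesign of a fragile dimension.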
And, your virtualizing articles pretty much give everyone an easy way to build out at least the test dimensional models, where your front-end developers don’t even need to do much.
The DV is the enemy of the large consulting teams who want a long-term cash cow in a BI project, because it’s so easy to build, automate, and deliver. Sure, you still need to model, and it still starts with the most important task – business key alignment across the enterprise – but nobody can claim that any other technique is simpler, with very few exceptions like very small single-source solutions.
Any sufficiently large enterprise solution that doesn’t use the DV as the foundational warehouse model will eventually end up with something similar (as we’re already seeing with PSAs and other hacks in dimensional models), but without enough of the positives.
As always, for most things it’s usually a skill/knowledge issue, which can always be helped with training. As an example, a client who was paying their hired consultants $100-$150/hour found it relatively inexpensive to simply foot the bill to train them on DV, which gave them a ton of productivity afterwards and reduced time to implementation (3-4 days of the consultants’ billing gets them trained, and saving 3-4 days on a project is a piece of cake with the DV).
The only issue then is not when, but who to get training from. Personally, I’d recommend you, Kent, or someone like you with a ton of real-life experience with various DV solutions, or going directly to Dan.
And, there’s always the new book on DV 2.0 which has already gotten rave reviews.
Thanks Sanjay for your feedback and the vote of confidence!