The Data Warrior

Changing the world, one data model at a time. How can I help you?


Data Vault Informatica Class is Live!

Just a quick note to let you all know that Dan Linstedt has finally released the class on how to easily implement a Data Vault using Informatica.

I wrote about the class here.

I have gone through a few of the lessons already and can tell you the instruction is very clear and easy to follow (even for me!), and the audio and video are excellent. The audio does come on a bit loud, so just be sure you have your volume turned down a bit when you start the videos.

And there is a money-back guarantee if for some reason you decide the class is not for you.

If you did not get on Dan’s early notice list, you can still sign up by going directly here: http://learndatavault.com/kentdvi

And, since you are a reader of my blog, if you sign up in the next few weeks and enter the coupon code DATAWARRIOR13, you can get $100 off!

So if you use Informatica and plan to do a Data Vault, you owe it to yourself to take a look at this course.

Take care.

Kent

Data Vault and the Oracle Reference Architecture

Thanks to Mark Rittman and Twitter, I found out just before RMOUG that Oracle had published a new reference architecture. It used to be called the Data Warehouse Reference Architecture; now it is called the Information Management Reference Architecture.

Oracle Information Management Ref Architecture

Oracle updated the architecture to allow for unstructured and big data to fit into the picture.

In my talks about Data Vault over the last few years, I have been referring to the Foundation Layer of the architecture as the place where Data Vault fits. The new version of the architecture actually fits the definition of the Data Vault even better.

Now the Foundation Layer is defined as “Immutable Enterprise Data with Full History”.

If that is not the definition of Data Vault, I don’t know what is!

Immutable – does not change. Data Vault is insert only, no update – ever.

Enterprise Data – well duh! That pretty well fits any real data warehouse architecture. The model covers an enterprise view of the data, not just a departmental view (like a data mart).

Full History – tracks data changes over time. That is one of the keys to the Data Vault approach. We track all the data change history in Satellites so we can always refer to a view of the data at any point in time. That allows us to build (or re-build) dependent data marts whenever we need or whenever the business changes the rules.
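To make the Full History point concrete, here is a minimal sketch in SQL of a Satellite and a point-in-time query. The table and column names (sat_customer, hub_customer, load_dts, and so on) are just my own illustrations, not anything prescribed by the architecture:

```sql
-- Hypothetical Satellite on a Customer Hub (all names illustrative).
-- Rows are only ever inserted; each change arrives as a new load date.
CREATE TABLE sat_customer (
    customer_key   NUMBER(10)    NOT NULL,  -- FK to hub_customer
    load_dts       TIMESTAMP     NOT NULL,  -- when this version was loaded
    record_source  VARCHAR2(50)  NOT NULL,
    customer_name  VARCHAR2(100),
    credit_limit   NUMBER(12,2),
    PRIMARY KEY (customer_key, load_dts)
);

-- The data as of any point in time: pick, per customer, the latest
-- version loaded on or before the date of interest.
SELECT s.*
  FROM sat_customer s
 WHERE s.load_dts = (SELECT MAX(s2.load_dts)
                       FROM sat_customer s2
                      WHERE s2.customer_key = s.customer_key
                        AND s2.load_dts <= DATE '2013-01-01');
```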

So it is possible to do a Data Vault approach and be compliant with Oracle’s reference architecture.

Guess Dan was just a bit ahead of the game…

Later

Kent

How to Use Informatica to Build a Data Vault

Yes, it’s true – you will soon be able to get online training on how to build a Data Vault data warehouse using Informatica.

Dan Linstedt has been working hard for several months now to put together some top-notch training for all of you who use Informatica.

Dan will teach you all his best practices for getting the job done quickly using Informatica for your ETL tool.

If you want in, it’s not too late to get on the VIP early notification list (which entitles you to some discounts). Get on the list here.

Here are a few common questions that Dan recently answered:

Q. What are the prerequisites?
A. You must know Informatica PowerCenter and Data Vault modeling basics. The training works with Informatica PowerCenter v8.x and v9.x, but the mappings will only import to version 9.x or higher.

Q. Do I need access to an Informatica installation?
A. You will, if you want to do any of the hands-on portions. We can’t help you with this. They used to provide a limited developer edition with a devnet membership, but that seems to have been discontinued.

Q. Will I learn Informatica PowerCenter?
A. No! This course assumes you have at least 3 months of experience with Informatica and know the difference between mapping, session, and workflow objects. If you’ve never worked with Informatica tools, then we recommend that you DO NOT invest in it.

Q. Do I need to know Data Vault Modeling?
A. Yes, and the knowledge in the book “Super Charge Your Data Warehouse” is sufficient for the course. It’s better if you have more hands-on experience though.

Q. Would it benefit me if I’ve gone through the Data Vault Implementation and Best Practices course?
A. Yes.

Want to know more? Check out this video that has more details about the class and what it covers.

That’s it for now.

Later.

Kent

Data Vault vs. The World (3 of 3)

So far in this series I have compared and contrasted the Data Vault approach to that of using a replicated ODS or a dimensional Kimball-style structure for an enterprise data warehouse repository. Let me thank everyone who has replied or commented for taking these posts as they were intended, keeping the conversation professional and not starting any flame wars. 🙂

So to close out the thread I must of course discuss the classic Inmon-style CIF (Corporate Information Factory) approach. Many of you may also think of this as a 3NF EDW approach.

To start, it should be noted that I started my data warehouse career building 3NF data warehouses and applying Bill Inmon’s principles. Very shortly after I began learning about data warehousing, I had the incredible privilege of not only meeting Mr. Inmon and learning from him, but also co-authoring a book (my first) with him. That book was The Data Model Resource Book. I got to work with Bill on the data warehouse chapters and, because of that experience, became well versed in his theories and principles. For many years I even gave talks at user group events about how to convert an enterprise data model to an enterprise data warehouse model.

So how does this approach work? Basically, you do some moderate denormalization of the source system model (where it is not already denormalized) and add a snapshot date to all the primary keys (to track changes over time). This is of course an oversimplification – there are a number of denormalization techniques that could be used in building a data warehouse following Bill’s original thesis.
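For those who have not seen this style of model, here is a minimal sketch in SQL of what a snapshot-dated warehouse table might look like. The table and column names are purely illustrative, not from any particular implementation:

```sql
-- Hypothetical Inmon-style 3NF warehouse table: the natural key from
-- the OLTP source is extended with a snapshot date so that each load
-- captures a new version of the row.
CREATE TABLE dw_customer (
    customer_id    NUMBER(10)  NOT NULL,  -- natural key from the source system
    snapshot_date  DATE        NOT NULL,  -- added to the PK to track history
    customer_name  VARCHAR2(100),
    region_code    CHAR(3),
    PRIMARY KEY (customer_id, snapshot_date)
);
```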

Additionally, this approach (CIF or even DW 2.0) then calls for building dimensional data marts on top of that EDW (along with other reporting structures as needed). The risks here are similar to those mentioned in the previous posts.

The primary risk is that the resulting EDW data structure is usually pretty tightly coupled to the OLTP model, so the risk of reworking and reloading data is very high as the OLTP structure changes over time. This of course has impacts downstream on the dimensional models, reports, and dependent extracts.

The addition of snapshot dates to all the PKs in this style of data warehouse model also adds quite a bit of complexity to the load and query logic, as the dates cascade down through parent-child-child-type relationships. Getting data out ends up needing lots of nested Max(Date) sub-queries. Miss a sub-query, or get one wrong, and you get the wrong data. Overall, it is a fairly fragile architecture in the long run.
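To illustrate, here is a hypothetical “as of” query against the dw_customer table sketched above, joined to an equally hypothetical dw_order table versioned the same way. Every level of the parent-child chain needs its own correlated Max sub-query:

```sql
-- Each level of the parent-child chain needs its own MAX(snapshot_date)
-- sub-query; miss one, or correlate it incorrectly, and the join
-- silently returns the wrong version of the data.
SELECT c.customer_name, o.order_total
  FROM dw_customer c
  JOIN dw_order o
    ON o.customer_id = c.customer_id
   AND o.snapshot_date = (SELECT MAX(o2.snapshot_date)
                            FROM dw_order o2
                           WHERE o2.order_id = o.order_id
                             AND o2.snapshot_date <= DATE '2013-01-01')
 WHERE c.snapshot_date = (SELECT MAX(c2.snapshot_date)
                            FROM dw_customer c2
                           WHERE c2.customer_id = c.customer_id
                             AND c2.snapshot_date <= DATE '2013-01-01');
```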

Also, as with the dimensional approach, I have encountered few teams that have been successful trying to implement this style of data warehouse in an incremental or agile fashion. My bad luck? Maybe…

The loosely coupled Data Vault data model mitigates these risks and also allows for agile deployment.

As discussed in the previous posts, the data model for a Data Vault based data warehouse is based on business keys and processes rather than the model of any one source system. The approach was specifically developed to mitigate the risks and struggles that were evident in the traditional approaches to data warehousing, including what we all considered the Inmon approach.
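For contrast, here is a minimal sketch of a Hub in the same illustrative SQL style as the examples above. The point is that the Hub is keyed on the business key, not on any one source system’s structure:

```sql
-- Hypothetical Hub: one row per business key, with a surrogate key,
-- independent of how any single source system models the entity.
CREATE TABLE hub_customer (
    customer_key   NUMBER(10)    NOT NULL,  -- surrogate (sequence) key
    customer_no    VARCHAR2(30)  NOT NULL,  -- the business key itself
    load_dts       TIMESTAMP     NOT NULL,
    record_source  VARCHAR2(50)  NOT NULL,
    PRIMARY KEY (customer_key),
    UNIQUE (customer_no)
);
```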

As I mentioned earlier I got to interact with Bill Inmon while we worked on a book. The interaction did not stop there. I have had many discussions over the years with Bill on many topics related to data warehousing, which of course includes talking about Data Vault. Both Dan and I talked with Bill about the ideas in the Data Vault approach. I spent a number of lunches telling him about my real-world experience with the approach and how it compared to his original approach (since I had done both). There were both overlaps and differences. Initially, Bill simply agreed it sounded like a reasonable approach (which was a relief to me!).

Over a period of time, through many conversations with many people, study, and research, we actually won Bill Inmon over and got his endorsement for Data Vault. In June of 2007, Bill Inmon stated for the record:

The Data Vault is the optimal choice for modeling the EDW in the DW 2.0 framework.

So if Bill Inmon agrees that Data Vault is a better approach for modeling an enterprise data warehouse, why would anyone keep using his old methods and not at least consider learning more about Data Vault?

Something to think about, eh?

I hope you enjoyed this little series about Data Vault and will keep it in mind as you get into your new data warehouse projects for 2013.

Kent

P.S. – if you are ready to learn more about Data Vault, check out the introductory paper on my White Papers page, or just go for broke and buy the Super Charge book.

Data Vault vs. The World (2 of 3)

In the first post of this series, I discussed advantages of using a Data Vault over a replicated ODS. In this post I am going to discuss Data Vault and the more classical approach of a Kimball-style dimensional (star schema) design.

To be clear upfront, I will be comparing using Data Vault to using a dimensional model for the enterprise data warehouse repository. Using star schemas for reporting data marts is not at issue (we use those in the Data Vault framework regularly).

I also want to recognize that there are some very expert architects out there who have had great success building Kimball-style EDWs and have mitigated all the risks I am going to mention. Unfortunately, there are not enough of them to go around. So for the rest of us, we might need an alternative approach…

When considering a Kimball-style approach, organizations often set out to design facts and dimensions to hold the operational data from one of their source systems. One downside to this approach is that to design optimal structures we would have to have a very solid understanding of all the types of questions the business needs to answer. Otherwise we risk not having the right fact tables designed to support the ever-changing needs of the business. This tends to make it difficult to implement in an agile, iterative manner and can lead to the need for re-engineering. This is especially a risk with conformed dimensions. While adding new fact tables in an iterative fashion is not a problem, having to redesign or rebuild a large conformed dimension could be a big effort.

Additionally, there may be requirements to extract data for various other analyses. Dimensional models do not lend themselves well to all types of extracts. Not every query can be supported by cube-type designs – especially if the data is spread across multiple fact tables with varying levels of granularity.

With a Data Vault solution we can store all the data at an atomic level in a form that can easily be projected into dimensional views, 3NF ODS-style structures, or just flat, denormalized, spreadsheet-style extracts. This allows us to be more agile in addressing changing report and extract requirements than if we had to design new facts and dimensions, then write and test the ETL code to build them.
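As an example of what “projected” can mean in practice, here is a sketch of a virtual dimension built as a plain SQL view over the hypothetical hub_customer and sat_customer tables sketched earlier on this page (again, all names are illustrative):

```sql
-- Hypothetical virtual dimension projected over the Hub and the
-- current Satellite rows: a new "mart" shape without new ETL.
CREATE OR REPLACE VIEW dim_customer AS
SELECT h.customer_key,
       h.customer_no,
       s.customer_name,
       s.credit_limit
  FROM hub_customer h
  JOIN sat_customer s
    ON s.customer_key = h.customer_key
 WHERE s.load_dts = (SELECT MAX(s2.load_dts)
                       FROM sat_customer s2
                      WHERE s2.customer_key = s.customer_key);
```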

Plus, the instantiated dimensional model (which is highly denormalized) will simply take more space than a Data Vault model, which is more normalized.

Using the dimensional approach as the basis for the foundation layer of an enterprise data warehouse also carries the risk of having to redesign, drop, and reload both facts and dimensions as the OLTP model evolves. That effort can be very expensive and take a lot of time.

With the Data Vault, it is much easier to evolve the warehouse model by simply adding new structures (Hubs, Links, Sats) to the model to absorb the changes.

Integrating other source systems can also be a challenge with dimensional models, especially in populating conformed dimensions, as you have to account for all semantic and structural differences between the various systems before you load the data. This could mean a lot of cleansing and transforming of the source data (which could jeopardize your ability to clearly trace data back to the source).

Data Vault avoids this issue by integrating around business keys (via Hubs) and allowing for apparently disparate data sets to be associated through the use of Same-As-Link tables to support dynamic equivalence mappings of the data sets from different sources. In agile terms, these structures can be built in a future sprint after the data is loaded and profiled. (As a side note, Same-As-Links are also very helpful for using your vault to do master data type work.)
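Here is a rough sketch of what a Same-As-Link might look like, continuing with the hypothetical customer tables from above:

```sql
-- Hypothetical Same-As-Link: records that two Hub keys, loaded from
-- different sources, identify the same real-world customer. It can be
-- populated in a later sprint, after the raw data has been profiled.
CREATE TABLE sal_customer (
    master_key     NUMBER(10)    NOT NULL,  -- FK to hub_customer (surviving key)
    duplicate_key  NUMBER(10)    NOT NULL,  -- FK to hub_customer (equivalent key)
    load_dts       TIMESTAMP     NOT NULL,
    record_source  VARCHAR2(50)  NOT NULL,
    PRIMARY KEY (master_key, duplicate_key)
);
```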

Well, those are the primary issues I have run across. I am sure there are more, but this should be enough for now.

Can’t wait to hear your comments! Be sure to tweet this to your friends.

Later,

Kent

P.S. Stay tuned for #3 where I will discuss the traditional Inmon 3NF-style data warehouse.
