The Data Warrior


Do you want a Data Vault 2.0 Bootcamp in The Woodlands, Texas?

Survey time, peeps!

Simple Yes/No question:

Would you like me to hold a DV 2.0 Bootcamp (and private certification) west of the Mississippi?

If I set one up in “America’s Hometown” (really, it is even trademarked), The Woodlands, Texas (20 minutes north of Houston Bush Intercontinental Airport), would you come?


The beautiful Woodlands Waterway and our concert venue, The Pavilion. You can take a water taxi to a concert!

Since I am an authorized DV 2.0 Trainer, I figure it is time I actually teach a class. And why not in my hometown in Texas?

So what is in a DV 2.0 Bootcamp class?

Three days of intense training on all things DV (followed by a chance to become a Certified DV 2.0 Practitioner).  NOTE: If you are new to Data Vault, you must read the Super Charge book before attending the class.

This class covers what you need to know as a practitioner in the world of Data Warehousing and Business Intelligence. It is our foundational course: a three-day, in-person class covering end-to-end best practices. Major topics for this class are:

  • Architecture – Including NoSQL, Big Data, Hybrid Systems and Relational stores
  • Methodology – Including CMMI, Six Sigma, Optimization, Automation, and Generation
  • Implementation – Including Performance and Tuning, Set Logic, ELT vs ETL, Parallelism
  • Modeling – Including replacing surrogate keys with Hash Keys, data layout, and data co-location

This class takes you through the why/what/how of Data Vault 2.0. It covers the business justifications, then follows with the technical descriptions of the architecture, implementation, methodology, and modeling. Topics include achieving agility, practicing Six Sigma, measuring and optimizing at CMMI Level 5, the KPAs and KPIs of Data Warehousing, and more.

We also discuss the use of Hadoop and NoSQL platforms alongside the relational world. The objective is to enrich your understanding of how and when to apply Big Data solutions. The course finishes with descriptions of ETL and ELT design-time paradigms, including templates, best practices, and working SQL. This class is a prerequisite for anyone wishing to achieve DV 2.0 Certified Practitioner status.

When:

Late summer to early fall of 2015, depending on interest.

Cost:

TBD based on how much the space costs me here!

But will definitely be less than $3,000.

Where:

Someplace nice and central in The Woodlands, within walking distance of great food and drink.

Apartments overlooking the waterway that flows to Lake Woodlands. A great natural setting. Not the usual suburban wasteland.


So who is in?

Please respond in the comments so I can tell if I should start setting something up.

Thanks.

Kent

The Data Warrior

P.S. Since you know I am into fitness, we have great options for exercise here too: miles of running trails and even kayaking on the waterway.


The Riva Boathouse, where you can rent single and double kayaks by the hour (look closely when you are here and you will find a picture of my son and me on the sign from when they first opened).

Data Vault vs. The World (3 of 3)

So far in this series I have compared and contrasted the Data Vault approach with using a replicated ODS or a dimensional Kimball-style structure for an enterprise data warehouse repository. Let me thank everyone who has replied or commented for taking these posts as they were intended, keeping the conversation professional and not starting any flame wars. 🙂

So to close out the thread I must, of course, discuss the classic Inmon-style CIF (Corporate Information Factory) approach. Many of you may also think of this as a 3NF EDW approach.

To start, it should be noted that I started my data warehouse career building 3NF data warehouses and applying Bill Inmon’s principles. Very shortly after I began learning about data warehousing, I had the incredible privilege not only of meeting Mr. Inmon and learning from him, but of co-authoring a book (my first) with him. That book was The Data Model Resource Book. I worked with Bill on the data warehouse chapters and, because of that experience, became well versed in his theories and principles. For many years I even gave talks at user group events about how to convert an enterprise data model to an enterprise data warehouse model.

So how does this approach work? Basically, you do some moderate denormalization of the source system model (where it is not already denormalized) and add a snapshot date to all the primary keys (to track changes over time). This is of course an oversimplification – there are a number of denormalization techniques that could be used to build a data warehouse following Bill’s original thesis.
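To make that concrete, here is a minimal sketch of what the snapshot-date pattern looks like in DDL. The table and column names are hypothetical, not from any particular project:

    -- Hypothetical 3NF-style warehouse table: the source CUSTOMER table,
    -- lightly denormalized, with a SNAPSHOT_DATE added to the primary key
    -- so each load can insert a new point-in-time version of every row.
    CREATE TABLE dw_customer (
        customer_id    NUMBER        NOT NULL,  -- natural key from the source
        snapshot_date  DATE          NOT NULL,  -- load date added to the PK
        customer_name  VARCHAR2(100),
        region_name    VARCHAR2(50),            -- denormalized from a REGION table
        credit_limit   NUMBER(12,2),
        CONSTRAINT pk_dw_customer PRIMARY KEY (customer_id, snapshot_date)
    );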

Additionally this approach (CIF or even DW 2.0) calls for then building dimensional data marts on top of that EDW (along with other reporting structures as needed). Risks here are similar to those mentioned in the previous posts.

The primary one being that the resulting EDW data structure is usually pretty tightly coupled to the OLTP model, so the risk of reworking and reloading data is very high as the OLTP structure changes over time. That, of course, has downstream impacts on the dimensional models, reports, and dependent extracts.

The addition of snapshot dates to all the PKs in this style of data warehouse model also adds quite a bit of complexity to the load and query logic, as the dates cascade down through chains of parent-child relationships. Getting data out ends up requiring lots of nested MAX(date) sub-queries. Miss a sub-query, or get one wrong, and you get the wrong data. Overall, it is a fairly fragile architecture in the long run.
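For example, just getting the current version of customers and their orders back out of such a model tends to look something like this (continuing the hypothetical tables from the sketch above, plus an imagined DW_ORDER table):

    -- Hypothetical "current state" query against snapshot-dated tables.
    -- Every table needs its own correlated MAX(snapshot_date) sub-query;
    -- forget one and you get every historical version of those rows,
    -- not just the current one.
    SELECT c.customer_name, o.order_total
    FROM   dw_customer c
    JOIN   dw_order    o ON o.customer_id = c.customer_id
    WHERE  c.snapshot_date = (SELECT MAX(c2.snapshot_date)
                              FROM   dw_customer c2
                              WHERE  c2.customer_id = c.customer_id)
    AND    o.snapshot_date = (SELECT MAX(o2.snapshot_date)
                              FROM   dw_order o2
                              WHERE  o2.order_id = o.order_id);

Multiply that by every parent-child chain in the model and you can see how easy it is to get it wrong.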

Also like the dimensional approach, I have encountered few teams that have been successful trying to implement this style of data warehouse in an incremental or agile fashion. My bad luck? Maybe…

The loosely coupled Data Vault data model mitigates these risks and also allows for agile deployment.

As discussed in the previous posts, the data model for a Data Vault based data warehouse is based on business keys and processes rather than the model of any one source system. The approach was specifically developed to mitigate the risks and struggles that were evident in the traditional approaches to data warehousing, including what we all considered the Inmon approach.

As I mentioned earlier I got to interact with Bill Inmon while we worked on a book. The interaction did not stop there. I have had many discussions over the years with Bill on many topics related to data warehousing, which of course includes talking about Data Vault. Both Dan and I talked with Bill about the ideas in the Data Vault approach. I spent a number of lunches telling him about my real-world experience with the approach and how it compared to his original approach (since I had done both). There were both overlaps and differences. Initially, Bill simply agreed it sounded like a reasonable approach (which was a relief to me!).

Over a period of time, through many conversations with many people, study, and research, we actually won Bill Inmon over and got his endorsement for Data Vault. In June of 2007, Bill Inmon stated for the record:

The Data Vault is the optimal choice for modeling the EDW in the DW 2.0 framework.

So if Bill Inmon agrees that Data Vault is a better approach for modeling an enterprise data warehouse, why would anyone keep using his old methods and not at least consider learning more about Data Vault?

Something to think about, eh?

I hope you enjoyed this little series about Data Vault and will keep it in mind as you get into your new data warehouse projects for 2013.

Kent

P.S. – if you are ready to learn more about Data Vault, check out the introductory paper on my White Papers page, or just go for broke and buy the Super Charge book.

Data Vault vs. The World (2 of 3)

In the first post of this series, I discussed advantages of using a Data Vault over a replicated ODS. In this post I am going to discuss Data Vault and the more classical approach of a Kimball-style dimensional (star schema) design.

To be clear up front, I will be comparing the use of a Data Vault with the use of a dimensional model for the enterprise data warehouse repository. Using star schemas for reporting data marts is not at issue (we use those in the Data Vault framework regularly).

I also want to recognize that there are some very expert architects out there who have had great success building Kimball-style EDWs and mitigating all the risks I am going to mention. Unfortunately, there are not enough of them to go around. So for the rest of us, we might need an alternative approach…

When considering a Kimball-style approach, organizations often consider designing facts and dimensions to hold the operational data from one of their source systems. One downside to this approach is that, to design optimal structures, we would have to have a very solid understanding of all the types of questions the business needs to answer. Otherwise we risk not having the right fact tables designed to support the ever-changing needs of the business. This tends to make it difficult to implement in an agile, iterative manner and can lead to the need for re-engineering. This is especially a risk with conformed dimensions: while adding new fact tables in an iterative fashion is not a problem, having to redesign or rebuild a large conformed dimension can be a big effort.

Additionally, there may be requirements to extract data for various other analyses. Dimensional models do not lend themselves well to all types of extracts. Not every query can be supported by cube-type designs – especially if the data is spread across multiple fact tables with varying levels of granularity.

With a Data Vault solution we can store all the data at an atomic level in a form that can easily be projected into dimensional views, 3NF ODS-style structures, or just flat, denormalized, spreadsheet-style extracts. This allows us to be more agile in addressing changing report and extract requirements than if we had to design new facts and dimensions, then write and test the ETL code to build them.
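As a rough illustration of what I mean by projecting, here is a sketch of a simple dimension built as a view over a Hub and its Satellite. The table and column names are mine, for illustration only, not a prescribed standard:

    -- Hypothetical sketch: project a customer dimension as a view over
    -- a Hub (the business key) and the most recent row of its Satellite.
    CREATE VIEW dim_customer AS
    SELECT h.hub_customer_key,
           h.customer_bk,            -- business key stored on the Hub
           s.customer_name,
           s.region_name
    FROM   hub_customer h
    JOIN   sat_customer s
           ON s.hub_customer_key = h.hub_customer_key
    WHERE  s.load_date = (SELECT MAX(s2.load_date)
                          FROM   sat_customer s2
                          WHERE  s2.hub_customer_key = s.hub_customer_key);

If the reporting requirement changes, we change (or add) a view; the underlying atomic data does not have to be reloaded.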

Plus, the instantiated dimensional model (which is highly denormalized) will simply take more space than a Data Vault model, which is more normalized.

Using the dimensional approach as the basis for the foundation layer of an enterprise data warehouse, there is also the risk of having to redesign, drop, and reload both facts and dimensions as the OLTP model evolves. That effort can be very expensive and take a lot of time.

With the Data Vault it is much easier to evolve the warehouse model by simply adding new structures (Hubs, Links, Sats) to the model to absorb the changes.

Integrating other source systems can also be a challenge with dimensional models, especially in populating conformed dimensions, as you have to account for all semantic and structural differences between the various systems before you load the data. This could mean a lot of cleansing and transforming of the source data (which could jeopardize your ability to clearly trace data back to the source).

Data Vault avoids this issue by integrating around business keys (via Hubs) and allowing apparently disparate data sets to be associated through Same-As-Link tables, which support dynamic equivalence mappings between the data sets from different sources. In agile terms, these structures can be built in a future sprint, after the data is loaded and profiled. (As a side note, Same-As-Links are also very helpful for using your vault to do master-data-type work.)
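Structurally, a Same-As-Link is just a Link whose two legs point at the same Hub. Here is a minimal sketch; the names and the hash-key columns are illustrative only, assuming a DV 2.0-style hash key:

    -- Hypothetical Same-As-Link: records that a customer key loaded from
    -- one source is equivalent to the "master" customer key from another.
    CREATE TABLE sal_customer (
        sal_customer_key    VARCHAR2(32) NOT NULL,  -- hash of the two Hub keys
        hub_customer_master VARCHAR2(32) NOT NULL,  -- surviving/master Hub key
        hub_customer_dup    VARCHAR2(32) NOT NULL,  -- equivalent duplicate Hub key
        load_date           DATE         NOT NULL,
        record_source       VARCHAR2(50) NOT NULL,
        CONSTRAINT pk_sal_customer PRIMARY KEY (sal_customer_key)
    );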

Well, those are the primary issues I have run across. I am sure there are more, but this should be enough for now.

Can’t wait to hear your comments! Be sure to tweet this to your friends.

Later,

Kent

P.S. Stay tuned for #3, where I will discuss the traditional Inmon 3NF-style data warehouse.

Data Vault vs. The World (1 of 3)

Okay, maybe not “the world,” but it does sometimes seem like it.

Even though the Data Vault has been around for well over 10 years now, and has multiple books, videos, and tons of success stories, I am constantly asked to compare and contrast Data Vault with approaches generally accepted in the industry.

What’s up with that?

When was the last time you got asked to justify using a star schema for your data warehouse project?

Or when was that expensive consulting firm even asked “so what data modeling technique do you recommend for our data warehouse?”

Oh…like never.

Such is the life of the “new guy.” (If you are new to Data Vault, read this first.)

So, over the next few posts, I am going to lay out some of the explanations and justifications I use when comparing Data Vault to other approaches to data warehousing.

The first contestant: Poor man’s ODS vs. Data Vault

This approach entails simply replicating the operational (OLTP) tables to another server for read-only reporting. This could be used as a partial data warehouse solution, using something like Oracle’s GoldenGate to support near-real-time operational reporting that minimizes the impact on the operational system.

This solution, however, does not adequately support needs for dimensional analysis, nor does it allow for historical tracking of changes to the data (beyond any temporal tracking inherent in the OLTP data model).

A big risk of this approach is that as the OLTP structures continue to morph and change over time, reports and other extracts that access the changed structures would of course break as soon as the change was replicated to the ODS.

How does Data Vault handle this?

Data Vault avoids these problems by using structures that are not tightly coupled to any one source system. So as the source systems change, we simply add Satellite and Link structures as needed. In the Data Vault methodology we do not drop any existing structures, so reports will continue to work until we can properly rewrite them to take advantage of the new structures. If totally new data is added to a source, we would probably end up adding new Hubs as well.
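To show what I mean by simply adding structures, here is a sketch: suppose a source system adds a new group of customer attributes. We hang a new Satellite off the existing Hub instead of altering anything already deployed (table and column names are illustrative only):

    -- Hypothetical new Satellite added to absorb a new group of attributes
    -- from a changed source system. The existing Hub, its other Satellites,
    -- and the reports that use them are untouched.
    CREATE TABLE sat_customer_loyalty (
        hub_customer_key  VARCHAR2(32) NOT NULL,  -- points to the existing Hub
        load_date         DATE         NOT NULL,
        record_source     VARCHAR2(50) NOT NULL,
        loyalty_tier      VARCHAR2(20),
        points_balance    NUMBER,
        CONSTRAINT pk_sat_customer_loyalty
            PRIMARY KEY (hub_customer_key, load_date)
    );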

An additional advantage is that because Data Vault uses this loosely coupled approach we can load data from multiple sources. If we replicate specific OLTP structures, we would not be able to easily integrate other source system feeds – we would have to build another repository to do the integration (which would likely entail duplicating quite a bit of the data).

Don’t get me wrong, there is nothing wrong with using replication tools to build real time operational data stores.

In fact, it is an excellent solution for offloading your operational reporting from the main production server.

It is a tried and true solution – for a specific problem.

It is, however, not the right solution if you are building an enterprise data warehouse and need to integrate multiple sources or report on changes to your data over time.

So let’s use the right tool for the right job.

Data Vault is the newer, better tool.

In the next two posts I will compare Data Vault to the Kimball-style dimensional approach (part 2 of 3) and then to Inmon-style 3NF (part 3 of 3).

Stay tuned.

Kent

P.S. Be sure to sign up to follow my blog so you don’t miss the next round of Data Vault vs. The World.

 

ODTUG KScope12: Day 4 – Another Day in Paradise

Well folks, it is really late/early, so for now I am just putting up some pictures without a lot of detail.

This was the sunrise that greeted me on the way to Morning Chi Gung. Going to be another hot one!


This is the view of the bluff and waterfall that we see every day when we practice Chi Gung on the lawn. Very soothing and relaxing. It definitely enhances the experience and feeling of connectedness to the earth.

First session of the day was Maria Colgan (the optimizer queen) talking about Tuning SQL in a Data Warehouse. A huge amount of information to digest. Mostly over my head, but very useful for a data warehouse DBA.

She did however forget her glasses this morning and could not really see the people in the back row too well. 🙂

Next up was Mr. Kevin McGinley (BI Track Lead) giving us his thoughts about Exalytics and what is meant by “Speed of Thought”.

This is a picture of his four kids before the session. They did Kevin’s introduction today. Quite cute.

Not sure where they got it from (just say’n). 😉

After Kevin’s entertaining talk, I went to see John Jeffries talk about Oracle GoldenGate. John is one of the world experts on GoldenGate, having published the Oracle GoldenGate 11g Implementer’s Guide.

John had a nice diagram of what you can use GoldenGate for. Very useful.

After lunch I went to see Dr. Holger Friedrich, who gave us a comparison between ODI 11g (Oracle Data Integrator) and OWB (Oracle Warehouse Builder). OWB is going away in the not-too-distant future, so it is important for OWB shops to get a handle on the transition and start to learn about ODI. This presentation was a great start.

Holger is a very interesting and intelligent guy. He is from Switzerland and holds a PhD in Robotics and Machine Learning.

Not sure how he ended up doing Oracle data warehousing.

Tonight was our BIG EVENT: dinner, rodeo, and dancing at the Knibbe Ranch. It was a hoot!

Here I am with my armadillo. I actually “won” an armadillo race.

We had a great BBQ dinner and a great country band to listen to. Hard to find a bad band in this neck of the woods.

After dinner was the main event: the rodeo. The ODTUG board of directors got to ride into the arena on horseback for the opening ceremonies. It did appear that Edward Roske (conference chair) really knew how to ride a horse.

That’s it for now. I have to sleep a little before Morning Chi Gung. Then I have my last presentation tomorrow morning.

Good thing I tested everything and got ready before going to the special event.

Check back in a day or so as I will fill in some details on the technical presentations.

Adios for now.

Kent
