Data Vault and the Oracle Reference Architecture
Thanks to Mark Rittman and Twitter, I found out just before RMOUG that Oracle had published a new reference architecture. It used to be called the Data Warehouse Reference Architecture, now it is called the Information Management Reference Architecture.
Oracle updated the architecture to allow for unstructured and big data to fit into the picture.
In my talks about Data Vault over the last few years I have been referring to the Foundation Layer of the architecture as the place where Data Vault fits. The new version of the architecture actual fits the definition of the Data Vault even better.
Now the Foundation Layer is defined as “Immutable Enterprise Data with Full History”.
If that is not the definition of Data Vault, I don’t know what is!
Immutable – does not change. Data Vault is insert only, no update – ever.
Enterprise Data – well duh! That pretty well fits any real data warehouse architecture. The model covers an enterprise view of the data not just a departmental views (like a data mart).
Full History – tracks data changes over time. That is one of the keys to the data Vault approach. We track all the data change history in Satellites so we can always refer to a view of the data at any point in time That allows us to build (or re-build) dependent data marts whenever we need or whenever the business changes the rules.
So it is possible to do a Data Vault approach and be compliant with Oracle’s reference architecture.
Guess Dan was just a bit ahead of the game…
“Immutable – does not change. Data Vault is insert only, no update – ever.”
Does no update also include technical data = there is no valid_to/valid_from (otherwise valid_to must be updated if a new insert is done)?
Good point – immutable applies to the data from the source system. We never edit, change, scrub, transform source system data. It is allowable to update technical meta data attributes in the vault. Generally the only one I update is the load end date attribute (what you call the valid to).
So that means, if I’m not wrong, update on all the Satellites at level of last changed items. Hm, that means one should update not much as columns, but much on rows level.
Exactly – the could values from the source are never updated for a row. If a value changes that creates a new row for the Sat then the end dts for the previous row is updated.
Remember that using the END DTS is an option in the Data Vault method. You could forgo using END DTS in which case you would never update any rows. If you choose that route, then you probably want to use a Point In Time (PIT) table to keep track of which row is the “active” row in your Sats. The PIT table, which is really a query helper table for extracting data from the vault, is completely a technical meta data table, and itself would have lots of updates.
The third option would be use temporally aware queries for extract that would need nested sub-queries to figure out the max(load dts) for each set of rows in the Sat in order to get the most recent picture of the data. This option of course may have some performance implications depending on your RDBMS platform, configuration, and data volumes.
This certainly adds to the point ‘we’ try to make for several years now. I wonder if Oracle (either formal or informal) advises any kind of modeling strategy. Do you know?
How are their reference models (e.g. Retail) modeled? And do they model them on a conceptual, logical or technical level?
Good to hear from you!
There is at least one product manager I know who thinks data vault is a perfect fit and would work really well on Exadata, but he was never able to get band width or funding to really test it. As for the reference models, the only one I have seen is the Healthcare model and it was a very traditional 3NF style warehouse model with a ton of metadata columns including lots of different timestamps to track load dates and “effectivity” dates, etc. What we were given was DDL (so a technical model) which we had to reverse engineer to even get a diagram. I never saw an ERD (logical) so do not know if they had one or if it was just the physical model.
On the Oracle BI Apps (OBIA) side of the house, those packaged solutions include star schemas as well. The Healthcare model did not – it had pre-defined staging tables then the 3NF warehouse tables. You then had to roll your own for the data mart (reporting) models. I have not seen any of the other industry models.
At the recent RMOUG conference in Denver I did a joint presentation with an OBIEE expert from Rittman-Mead (Stewart) who showed how easy it was to implement virtual stars on a data vault using the OBIEE meta data layer. We included this architecture diagram and discussion in that talk. It was very well received. We will be repeating the presentation at the KScope13 conference in New Orleans this June.
I’m just a bit wary about the contradiction between “immutable” w.r.t. the source system data (which I agree with), versus the position of the “data quality” bubble in the front, which seems to imply cleansing/golden record determination up front. In a DV-focused architecture I’d expect it to be “just another datamart” feeding results from the DQ-checks back into the source.
I agree. That one piece is problematic for a complete data vault compliant solution.
I think you might be reading into that diagram if you feel that that diagram either endorses or excludes a particular real life implementation. IMO, those diagrams are intended to be conceptual architecture diagrams that describe logical containers that provide a purpose and a set of capabilities. Each container aligns to an area where Oracle has specific tools/products for.
If someone used the prior diagram to suggest DV wasn’t valid, than shame on them. I don’t believe the intent of the diagram was to bless one approach relative to another. It’s just conceptual.
In regards to modeling approaches Oracle discusses, the DW guide has this chapter:
They talk about 3NF, star, and snowflake.
Maybe you can talk to the documentation others to get a short intro to the DV approach included. 🙂
In regards to some of the models Oracle provides, you can see the communications, retail, and airlines industry models here:
Actually, no one from Oracle has said anything one way or the other about Data Vault (officially). If I implied otherwise, that was unintentional.
I (and some others) felt even the prior architecture diagram, conceptually, allowed for a Data Vault approach as well. I just think the wording in the new version fits even better.
Overall the point Stewart and I wanted to get across to folks is that, if you are an Oracle shop, using Oracle tools, you can use a Data Vault approach and not violate any of the concepts in the Oracle Reference Architecture. While it is likely true that many (if not most) wouldn’t let that stop them in any case, we all know there are architects, CIO’s, etc, out there that would be afraid to try something (like Data Vault) if it did not appear to fit neatly into someone’s idea of a valid reference architecture.
BTW – I like the idea of getting a mention of Data Vault into the documentation you mentioned. I will look into that. 😉