Better Data Modeling: My Top 3 Reasons why you should put Foreign Keys in your Data Warehouse
This question came up at the recent World Wide Data Vault Consortium. It seems there are still many folks who build a data warehouse (or data mart) without FKs in the database.
The usual reason is that it “slows down” load performance.
No surprise there. Been hearing that for years.
And I say one of two things:
1. So what! I need my data to be correct and to come out fast too!
2. Show me! How slow is it really?
Keep in mind that while getting the data in quickly is important, so is getting the data out.
Who would you rather have complain – the ETL programmer or the business user trying to run a report?
Yes, it has to be a balance, but you should not immediately dismiss including FKs in your warehouse without considering the options and benefits of those options.
So here are my three main reasons why you should include FK constraints in your Oracle data warehouse database:
- The Oracle optimizer uses the constraints to make better decisions on join paths.
- Your data modeling and BI tools can read the FKs from the data dictionary to create correct joins in the metadata of the tool (SDDM, Erwin, OBIEE, Cognos, and Business Objects can all do this).
- It is a good QA check on your ETL. (Yeah, I know… the ETL code is perfect and checks all that stuff, bla, bla, bla)
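As a minimal Oracle sketch of the second point (the fact and dimension names here are purely illustrative, not from the post), declaring the FK is one ALTER TABLE statement, and the modeling/BI tools pick the relationship up straight from the data dictionary:

```sql
-- Hypothetical star-schema pair; table and constraint names are illustrative.
ALTER TABLE sales_fact
  ADD CONSTRAINT fk_sales_customer
  FOREIGN KEY (customer_id)
  REFERENCES customer_dim (customer_id);

-- What SDDM, OBIEE, etc. read from the dictionary to build their join metadata:
SELECT constraint_name, table_name, r_constraint_name
  FROM user_constraints
 WHERE constraint_type = 'R';  -- 'R' = referential (i.e., a foreign key)
```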
Now of course there are compromise options. The three main ones I know are:
- Drop the constraints at the start of the load then add them back in after the load completes. If any fail to build, that tells you immediately where you may have some data quality problems or your model is wrong (or something else changed).
- Build all the constraints as DISABLE NOVALIDATE. This puts them in the database for the BI tools and data modeling tools to see and capture but, since they are not enforced, they put minimal overhead on the load process. And, so I am told by those that know, even a disabled constraint helps the optimizer make a smarter choice on the join path.
- (really 2a) Best of both – disable the constraints, load your data, then re-enable the constraints. You get optimization and quality checks.
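Sketched in Oracle DDL (constraint and table names are again illustrative), the three compromise options look like this:

```sql
-- Option 1: drop before the load, re-add after; a failed ADD pinpoints bad data.
ALTER TABLE sales_fact DROP CONSTRAINT fk_sales_customer;
-- ... run the load ...
ALTER TABLE sales_fact
  ADD CONSTRAINT fk_sales_customer
  FOREIGN KEY (customer_id) REFERENCES customer_dim (customer_id);

-- Option 2: declared but unenforced; visible to BI tools and the optimizer.
ALTER TABLE sales_fact
  ADD CONSTRAINT fk_sales_customer
  FOREIGN KEY (customer_id) REFERENCES customer_dim (customer_id)
  DISABLE NOVALIDATE;

-- Option 3 (2a): disable, load, then re-enable to get the quality check back.
ALTER TABLE sales_fact MODIFY CONSTRAINT fk_sales_customer DISABLE;
-- ... run the load ...
ALTER TABLE sales_fact MODIFY CONSTRAINT fk_sales_customer ENABLE VALIDATE;
```

With option 2, adding RELY (`... RELY DISABLE NOVALIDATE`) tells the optimizer it may trust the relationship for query rewrite, subject to the `QUERY_REWRITE_INTEGRITY` setting.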
So NOW what is your reason for not using FKs in your data warehouse?
I agree completely. Business requirements trump developer convenience every day.
There is no doubt that enforcing RI requires additional compute resources, but that is an impact that is measurable and manageable.
Enforcing RI adds IO against the PK index of the parent table. So distribute that IO to a different IO path and spindle set. In other words, bring parallel resources to bear on the additional load.
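One way to sketch that in Oracle (the tablespace name is hypothetical): back the parent table's PK with an index in a dedicated tablespace on its own storage path, so FK lookups during the load do not compete with the fact-table writes:

```sql
-- Parent PK backed by an index in its own tablespace / IO path (names illustrative).
ALTER TABLE customer_dim
  ADD CONSTRAINT pk_customer PRIMARY KEY (customer_id)
  USING INDEX TABLESPACE dim_idx_ts;
```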
The storage architecture needs to meet the business requirements. If the business requirements of data integrity and throughput (performance) demand a more sophisticated solution then that message needs to get to the sponsor and a business decision made choosing between performance, $$ and data integrity.
Exactly! Thanks for the feedback. Glad to know it is not just me.
You can always just disable the constraints (and triggers) before a data load, just don’t forget to turn them on afterwards.
Yup, that is number three.
I totally agree. However (you can wait for it), in SQL Server option 3 is very expensive, so YMMV as always. I do love Oracle's way of novalidation; alas, SQL Server is a different beast. IMO FKs should always LOGICALLY be there, and only PHYSICALLY dropped when there are serious issues.
BTW you either enable/disable them or create automated functions that test/correct FKs with SQL after load actions, period. Ignoring them is not an option, certainly not with DV and auditability in mind.
@Martijn, I take a more purist approach. There are costs and risks with deferral. I believe it is of far greater business value to know that the db forbids the creation of dis-integrated data, rather than trusting bespoke code to identify all of the instances of dis-integrated data that the db gate keeper (who was put to sleep during the load) just let in and fix them.
The db is a far better gate keeper than bespoke code is a cleaner.
We pay big money for mature DBMSs, then disable many features in the name of performance, rendering them expensive file stores. Yuck. The problem is not that the features are performance hogs; rather, db designers are not correctly specifying and correctly allocating hardware resources. My advice to sponsors is: if you won't pay for the hardware resources and the necessary skills to correctly allocate them, then lower your expectations, don't bother with the expensive DBMSs, get an open source NoSQL store and keep it in Hadoop.
It’s akin to getting a Ferrari and removing 8 of the spark plugs because the fuel is expensive, but still expecting amazing performance. If you want a Ferrari, buy it and use it the way the designers intended: high-octane fuel, on a good circuit, pedal-to-the-metal. It won’t be cheap, but it will be good. (You can have any 2 of: good, fast, and cheap.)
In the long run, preserving data integrity is cheaper, more effective, lower risk with greater residual business value than restored integrity. The latter leaves room for doubt in the minds of the owners.
Well said Russell. I could not agree more.
Yes I agree, data integrity is more important than compromising a little on data-load performance. Also, performance while users run reports is much more important than performance during the batch data load. Besides FKs, we should also keep unique keys/indexes where applicable. This not only gives data integrity but will most likely help reports run faster, e.g. when filtering on that column.
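For example (table and column names are illustrative), a unique constraint on a natural key both enforces integrity and creates an index that report filters on that column can use:

```sql
-- Unique constraint = integrity check + an index the optimizer can use for filters.
ALTER TABLE customer_dim
  ADD CONSTRAINT uk_customer_email UNIQUE (email_address);
```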
In my experience, DW projects fail (or do not deliver their full potential) more often due to lack of data integrity than due to slower data loads.
I recall discussing the issue of using FKs during my Data Vault class back in 2008. In my opinion, storing data should take into account that a certain amount of data does NOT conform to RI. The ETL process can take care of the checking. This allows for the creation of the so-called error marts as discussed in DV theory.
Now, pushing the data to the reporting layer (data marts, cubes, whatever) SHOULD be all about enforcing RI. But again, the ETL process should manage the load in such a fashion that the data mart can be populated with correct data.
RI is important. However, I personally minimize the use of FKs by “duplicating” the referenced data in my data mart. That way the ETL processing load remains high and the space usage is also high, but the reporting and querying is fast. I use “very denormalised” marts.
How do you optimize your query without the foreign keys? Are you pre-joining all the data for reporting?
Yes, I pre-join. On one hand I get the performance I need on queries, and I am also not affected by “changing” referenced data. On the down side, this model is wasteful on storage.
FK, always. Here’s what I’ve written on the three different types of referential integrity:
Nothing is faster than a PK/FK combination for RI performance.