Better Data Modeling: Are you making these 3 beginner mistakes in your data models?
There are lots of people in the database industry that end up building data models.
Some of them are very educated and well trained and are professional data modelers and data architects. (If that is you, you can probably skip this article)
Others have learned on-the-job with little or no training or education on modeling concepts or techniques. They may be database administrators or even programmers that got asked to produce a model diagram by their boss or project manager, after the project was delivered (but there was no data modeler on the project ever).
This article is for this last group of folks who may want to improve their knowledge or skill in data modeling.
So here are three of the most common mistakes I have seen over the years:
1. Only defining surrogate keys
Do all your tables have a primary key defined and that primary key is a single column integer generated by the system? That is a surrogate key.
Instead of that, you should be defining a natural, or business, key for every table in your system. A natural key is a an attribute or set of attributes (that occur naturally in the data set) required to uniquely identify a row in that table. In addition you should define a Unique Key Constraint on those attributes in the database. Then you can be sure you will not get any duplicate data into the tables.
CLARIFICATION: This point has caused a lot of questions and comments. To be clear, the mistake here is to have ONLY defined a surrogate key. i believe that even if using surrogate keys is the best solution for your design, you should ALSO define an alternate unique natural key.
2. Using Visio or PowerPoint to draw your data model diagrams
This is all too common when no data modeler has been hired for a project or there is a really tight (as in NO) budget. The result is a pretty picture that gets out of date very quickly and can’t help you generate DDL code to build (or rebuild) the database.
Instead of that, you should either invest in a real data modeling tool (like ERWin), or better still, get Oracle’s totally free SQL Developer Data Modeler (my favorite!). The point being with a real data modeling tool, you can forward and reverse engineer your database tables, make changes, review them, generate DDL, etc.
3. Not reviewing the model with real business people
Build it and they will come does not work! How do you even know you are using the right terminology or have defined the relationships correctly if you have not talked to a business person about their data and data needs?
Instead of that, you need to involve business users from the very beginning of the project. You need to either set up 1:1 interviews with key stake holders or better yet have group JAD sessions to discuss the project and have them help you define the data model. That is the best way to get buy in to the end result.
Just avoiding these three rookie mistakes can greatly improve your chance of success.
Bonus Tip: The best way to avoid these and other common rookie mistakes is to use a pre-defined data model review checklist.
You can get a jump-start on your own check list by downloading my Kindle book A Checklist for Doing Data Model Design Reviews on Amazon.com.
Get it here: http://www.amazon.com/Check-Doing-Design-Reviews-ebook/dp/B008RG9L5E/, or just search Amazon for author Kent Graziano.
Here’s to building better data models! Are you with me?
The Data Warrior
How about having surrogate keys with alternate key? What potential issues there? Business keys change. Merger and acquisition of companies and data will make it more difficult to maintain.
Depends – are you talking about a custom OLTP system or a Data Warehouse?
What I called Natural Keys is what I assume you refer to as Alternate Keys right?
For the OLTP, I would argue that if the Business Keys change that much, the business must have changed a lot too, so you will be having to redevelop your OLTP anyway. Yes mergers and acquisitions are hard to deal with from a data perspective. There is no simple answer to that.
If you only use surrogate keys with no alternate keys you will always be at a very high risk of entering duplicate records, which will have many negative downstream data quality impacts. Surrogate key si a simple answer, but rarely is it the right answer longterm.
For data warehousing, I use the Data Vault modeling approach for my core, foundation layer. In that case a change in Business Keys would indicate the need for a brand new Hub definition. Existing Hubs with their historical data remain but we add a new Hub and associated Satellites to store the data from systems that have a new/different business key.
These are great tips. In my opinion there is one more that deserves a place with those three. And that is ‘Naming Standards”. So much depends on naming the tables, columns and other objects in the model consistently, that if it is ignored at the modeling stage, total confusion my occur later on. I’ve seen this become very important when designing models for reporting. There are BI tools (like Microstratgy for instance) that read a table and then decompose it into attributes and metrics and make parent-child associations based upon existing attributes. If naming standards are not followed, it becomes a nightmare for the report developers to build good reports and dashboards.
My two cents.
I agree 100%. Consistent naming standards are also vital to success!
Great points Kent. I’m not familiar with “JAD sessions” though – could you give a bit more detail please?
JAD = Joint Application Development. It is a pretty old (and somewhat outdated) term. In many cases superseded by concepts like RAD and Agile/XP/SCRUM.
Basically it is meeting with the users, usually in a group, to define their requirements. As I am using it, I really mean a data modeling session with the business folks.
Thanks for the clarification, I guessed you meant something along those lines from the context, but it’s good to check!
I’ve always been a fan of surrogate keys. Very, very simple to administer/maintain. Very flexible with regards to business changes. Some problem domains don’t lend themselves very well to natural keys (wouldn’t the natural key for an address table be a concatenated key of ALL the values in the address?). Surrogate keys can also be much less intensive to join (single column join rather than multiple column join), etc.
As long as you follow the rule of NEVER exposing the surrogate keys to the users, I like them quite a bit.
However, I’m very willing to be convinced otherwise. Care to convince me?
The mistake I listed was ***only*** defining surrogate keys right? My issue is with systems where every single table has a surrogate key with no alternate, business key identified. I agree that surrogates are very useful in many situations as the Primary Key (fast joins being one of them).
And yes the natural key for some tables/entities/concepts may be a long list of attributes, but that does not mean you should skip the step of defining that key – at least logically. Nicely most relational databases allow you to define both a PK and one or more alternate UKs. So you can have your surrogate id for the PK of the table and have the natural business key defined as well. (And with Oracle you can define the UK as disabled or novalidate so at least you have the meta data available even if you don’t want the enforcement overhead).
Unfortunately I have seen too many systems (that I usually had to reverse engineer) where all there is are surrogate keys to join the tables. (In many of these cases, the tables are simply buckets for writing out data from a front end application.) And the system owners are under the mistaken impression that they have a 3NF design. Without the alternate keys, there is usually no clue for me, or anyone, what the intent of those tables are, and the tables are full of poor quality and duplicate data (which of course messes with reporting, etc).
So I firmly believe that the database designer or data modeler should put in the effort to understand the data, normalize the model, and articulate the business key in the data model (and of course validate it with the business users). Otherwise we all spend even more time de-duping and cleaning up the data later when someone wants a data warehouse and a single source of truth.
So, is there a setting in SQL Dev where you can tell it to add a surrogate (primary) key when generating the relational model from the logical model?
Yes there is! Under Tools -> Preferences ->Data Modeler -> Model – Logical. There are two check boxes for creating and using surrogate keys.
I didn’t grok the way you were using the only at first. (And nice that you updated the article! Many bloggers don’t bother doing that.) You had me a bit worried when I was reading it that you should only use natural keys.
I really like the idea of defining the natural key and then possibly not enabling it. That’s a good tool to add to the belt.
Hi Kent, your clarification on ” August 26, 2013 at 9:28 pm” completed the picture for me, and I agree with everything you said.
I also agree with the naming standards comment made earlier. Even if all players do not agree on what the naming standards are, there is no question in consistency, and that goes a long ways in ensuring a solid enterprise wide model.
The one thing that I will add is making sure that comments (i.e. metadata) describe the intent of the entities and their attributes, so that there is no question in their meaning. Storing these comments in the database goes a long way in communicating this information to developers, data analysts, business users, etc. If a definition is incorrect, or confusing, then you can correct it.
p.s. remember the land records system at BLM in Denver?
I agree 100 % about having comments in the database. It is one of my standards.
Yes I remember land records.
Kent, I recognize these errors from my own experience! Good list.
I would point out the difference between data modeling and database modeling, however. This is a significant difference that is often muddled — so I would suggest that, before engaging in a modeling effort, you need to be clear about what you are modeling — data or database.
Data modeling is about the data that runs the business — significantly, it does not presume that the data is in any kind of database. Much (some say most) of the business data exists outside of databases, even outside of any computer storage at all. There is still paper out there; people talk over drinks in a bar or on a golf course; decisions are made via text message and phone calls and email. These all involve data, and modeling this data — as well as the data that is recorded on a computer, whether in a formal database or not — is an important, and often overlooked, part of data modeling.
Database modeling, of course, is about putting data into a database (typically relational, but also other forms old and new). The data in a database is frequently a subset of the data that runs the business. It is intended for usage by specific processes that have been defined and automated.
Surrogate keys really have no meaning in data modeling, but they may be appropriate in a database model.
Naming standards are often non-existent in data modeling — the data model is a reflection of how people communicate in real-life, and real-life people use different terms for the same thing and the same term for different things — but naming standards are important in database modeling, for the reasons mentioned by others.
Most modelers try to do data modeling within the scope of a database modeling effort (or vice-versa). I’ve found that this leads to very tangled conversations, especially when non-technical business people are in the conversation along with the technicians. Often the discussion skips from data to database and back again, with no clear demarcations to indicate the change of focus.
Thanks for the excellent clarification Bryan! That leads me to a corollary mistake – presuming your database model actually represents useful data without having first at least looked at a valid business data model.