The Data Warrior

Changing the world, one data model at a time. How can I help you?

Archive for the tag “ETL”

Building a Data Lake in the Cloud 

Hey fellow data warriors! Here is a new joint blog post I just did with fellow data warrior Dale Anderson from Talend! Check it out. I hope you find the concept compelling!

So you want to build a Data Lake?  Ok, sure let’s talk about that.  Perhaps you think a Data Lake will eliminate the need for a Data Warehouse and all your business users will merely lure business analytics from it easily.  Maybe you think putting everything into Big Data technologies like Hadoop will resolve all your data challenges and deliver fast data processing with Spark delivering cool Machine Learning insights that magically give you a competitive edge.  And really, with NoSQL, nobody needs a data model anymore, right?

Avoid the data swamp! Use a modern cloud-based DWaaS (Snowflake) and a leading-edge Data Integration tool (Talend) to build a Governed Data Lake.

Read the rest: How to Build a Governed Data Lake in the Cloud with Snowflake and Talend

Enjoy!

Kent

The Data Warrior


#Kscope16 Blog Hop: #BigData and #AdvancedAnalytics Sessions Not to Miss

You are attending #KScope16 right?

Me too.

But there are so many sessions to choose from (mine included), which do you pick? How do you pick?

Well, my fellow bloggers and I are here to help you out with a Blog Hop. We are going to give you our top picks for each track. In this post, I will give you my picks for the Big Data and Advanced Analytics track.

Big Data and Advanced Analytics Sessions

Why did I pick this track? Really because it is a necessary adjunct to BI and Data Warehousing. In fact I find it hard to imagine that these two won’t merge over the next few years (at my company, Snowflake, they really have already). Every company that is investing in BI/DW is also finding that they need to deal with Big Data too. And Advanced Analytics is, to me, the logical extension to BI.

So after looking at the agenda, really most of the sessions are of interest to me (sigh). But in reality I am sure I will not be able to attend them all, so here are my top 5 picks to see at KScope16:

  1. How to Build an Internet of Things Data Pipeline presented by Rex Eng
  2. Oracle Big Data Discovery: Extending into Machine Learning and Advanced Visualizations presented by Mark Rittman
  3. Introduction to Apache Kafka and Real-Time ETL presented by Gwen Shapira
  4. Getting Started with a Data Discovery Lab: You Don’t Have to Go Big to Gain Big presented by Kathryn Watson
  5. Getting Started with Oracle R and OBIEE presented by Kevin McGinley
Why those? Simply because they hit on all the top issues and topics that I see being discussed (or written about) in the field, and I need to get a better grip on these things:
  • IoT – it is here already
  • Machine Learning – I am pretty clueless about this one so far
  • Kafka – ETL/ELT in the cloud
  • Data Discovery – the next step beyond BI
  • R – the language of choice for data scientists

And I actually know all but one of the presenters, so I am sure these will be very informative and lively talks.

The rest of the blog hop:

Thanks for attending this ODTUG blog hop!

Looking for some other juicy cross-track sessions to make your Kscope16 experience more educational? Check out the following session recommendations from fellow experts!

I hope this gives you some great ideas on what to see at KScope16!

See you in Chicago.

Kent

The Data Warrior

P.S. Don’t forget to make time to attend my Morning Chi Gung sessions down by the river to get each day started right with a clear mind and strong heart. Look for signs at the hotel.

 

Better Data Modeling: My Top 3 Reasons why you should put Foreign Keys in your Data Warehouse

This question came up at the recent World Wide Data Vault Consortium. Seems there are still many folks who build a data warehouse (or data mart) that do not include FKs in the database.

The usual reason is that it “slows down” load performance.

No surprise there. Been hearing that for years.

And I say one of two things:

1. So what! I need my data to be correct and to come out fast too!

or

2. Show me! How slow is it really?

Keep in mind that while getting the data in quickly is important, so is getting the data out.

Who would you rather have complain – the ETL programmer or the business user trying to run a report?

Yes, it has to be a balance, but you should not immediately dismiss including FKs in your warehouse without considering the options and benefits of those options.

So here are my three main reasons why you should include FK constraints in your Oracle data warehouse database:

  1. The Oracle optimizer uses the constraints to make better decisions on join paths.
  2. Your Data Modeling and BI tools can read the FKs from the data dictionary to create correct joins in the metadata of the tool (SDDM, Erwin, OBIEE, Cognos, Business Objects can all do this).
  3. It is a good QA check on your ETL. (Yeah, I know… the ETL code is perfect and checks all that stuff, bla, bla, bla)

Now of course there are compromise options. The three main ones I know are:

  1. Drop the constraints at the start of the load then add them back in after the load completes. If any fail to build, that tells you immediately where you may have some data quality problems or your model is wrong (or something else changed).
  2. Build all the constraints as DISABLE NOVALIDATE. This puts them in the database for the BI tools and data modeling tools to see and capture but, since they are not enforced, they put minimal overhead on the load process. And, so I am told by those that know, even a disabled constraint helps the optimizer make a smarter choice on the join path.
  3. (really 2a) Best of both – disable the constraints, load your data, then re-enable the constraints. You get optimization and quality checks.
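
Here is a rough sketch of what options 2 and 3 might look like in Oracle SQL. The table, column, and constraint names (FACT_SALES, DIM_CUSTOMER, customer_key, fk_sales_customer) are made up for illustration, and I am assuming DIM_CUSTOMER already has a primary key on customer_key. Whether the optimizer actually takes advantage of a non-enforced constraint can depend on additional settings (the RELY state is usually mentioned in this context), so treat that part as something to verify on your own version.

```sql
-- Option 2: declare the FK so tools can see it, but do not enforce it
-- and do not check the rows already in the table.
ALTER TABLE fact_sales
  ADD CONSTRAINT fk_sales_customer
  FOREIGN KEY (customer_key)
  REFERENCES dim_customer (customer_key)
  DISABLE NOVALIDATE;

-- Option 3 (2a): disable before the load ...
ALTER TABLE fact_sales DISABLE CONSTRAINT fk_sales_customer;

-- ... run the ETL load here ...

-- ... then re-enable and validate. This statement fails if the load
-- left any orphaned child rows, which is the QA check you want.
ALTER TABLE fact_sales ENABLE VALIDATE CONSTRAINT fk_sales_customer;

-- If the enable fails, this query lists the offending keys.
SELECT f.customer_key, COUNT(*) AS orphan_rows
FROM   fact_sales f
WHERE  f.customer_key IS NOT NULL
AND    NOT EXISTS (SELECT 1
                   FROM   dim_customer d
                   WHERE  d.customer_key = f.customer_key)
GROUP  BY f.customer_key;
```

Timing the load with and without the constraint enabled is also the easiest way to answer the "Show me! How slow is it really?" question with real numbers.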

So NOW what is your reason for not using FKs in your data warehouse?

Happy Modeling!

Kent

Agile Data Warehouse Modeling: How to Build a Virtual Type 2 Slowly Changing Dimension

One of the ongoing complaints about many data warehouse projects is that they take too long to deliver. This is one of the main reasons that many of us have tried to adopt methods and techniques (like SCRUM) from the agile software world to improve our ability to deliver data warehouse components more quickly.

So, what activity takes the bulk of development time in a data warehouse project?

Writing (and testing) the ETL code to move and transform the data can take up to 80% of the project resources and time.

So if we can eliminate, or at least curtail, some of the ETL work, we can deliver useful data to the end user faster.

One way to do that would be to virtualize the data marts.

For several years Dan Linstedt and I have discussed the idea of building virtual data marts on top of a Data Vault modeled EDW.

In the last few years I have floated the idea among the Oracle community. Fellow Oracle ACE Stewart Bryson and I even created a presentation this year (for #RMOUG and #KScope13) on how to do this using the Business Model (meta-layer) in OBIEE (It worked great!).

While doing this with a BI tool is one approach, I like to be able to prototype the solution first using Oracle views (that I build in SQL Developer Data Modeler of course).

The approach to modeling a Type 1 SCD this way is very straightforward.

How to do this easily for a Type 2 SCD has evaded me for years, until now.

Building a Virtual Type 2 SCD (VSCD2)

So how to create a virtual type 2 dimension (that is “Kimball compliant” ) on a Data Vault when you have multiple Satellites on one Hub?

(NOTE: the next part assumes you understand Data Vault Data Modeling. If you don’t, start by reading my free white paper, but better still go buy the Data Vault book on LearnDataVault.com)

Here is how:

Build an insert-only PIT (Point-in-Time) table that keeps history. This is sometimes referred to as a historicized PIT table. (See the Super Charge book for an explanation of the types of PIT tables.)

Add a surrogate Primary Key (PK) to the table. The PK of the PIT table will then serve as the PK for the virtual dimension. This meets the standard for classical star schema design to have a surrogate key on Type 2 SCDs.
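
To make that concrete, here is a minimal sketch of such a PIT table in Oracle SQL. The table and column names match the view example in this post; the datatypes, the sequence name, and the load statement are my assumptions for illustration only, not a standard Data Vault loading pattern.

```sql
-- Sequence that supplies the surrogate key (name is hypothetical).
CREATE SEQUENCE sat_cust_pit_seq;

-- Insert-only, historicized PIT table: one row per Hub key per snapshot.
CREATE TABLE sat_cust_pit (
  pit_seq          NUMBER    NOT NULL,  -- surrogate PK, becomes the dimension key
  csid             NUMBER    NOT NULL,  -- Hub key
  load_dts         TIMESTAMP NOT NULL,  -- when this PIT row was created
  name_load_dts    TIMESTAMP NOT NULL,  -- matching row in SAT_CUST_NAME
  address_load_dts TIMESTAMP NOT NULL,  -- matching row in SAT_CUST_ADDR
  CONSTRAINT pk_sat_cust_pit PRIMARY KEY (pit_seq)
);

-- Add a new PIT row whenever the latest combination of Satellite
-- rows for a customer has not been recorded yet. Nothing is updated.
INSERT INTO sat_cust_pit
  (pit_seq, csid, load_dts, name_load_dts, address_load_dts)
SELECT sat_cust_pit_seq.NEXTVAL, x.csid, SYSTIMESTAMP,
       x.name_load_dts, x.address_load_dts
FROM  (SELECT h.csid,
              (SELECT MAX(n.load_dts) FROM sat_cust_name n
                WHERE n.csid = h.csid) AS name_load_dts,
              (SELECT MAX(a.load_dts) FROM sat_cust_addr a
                WHERE a.csid = h.csid) AS address_load_dts
       FROM   hub_cust h) x
WHERE  x.name_load_dts IS NOT NULL
AND    x.address_load_dts IS NOT NULL
AND    NOT EXISTS (SELECT 1
                   FROM   sat_cust_pit p
                   WHERE  p.csid = x.csid
                   AND    p.name_load_dts = x.name_load_dts
                   AND    p.address_load_dts = x.address_load_dts);
```

Note that this sketch skips customers that do not yet have a row in both Satellites; a real implementation would need a rule (such as a ghost record) for that case.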

To build the VSCD2 you now simply create a view that uses the PIT table to join the Hub and all the Satellites together. Here is an example:

CREATE VIEW Dim2_Customer (Customer_key, Customer_Number, Customer_Name, Customer_Address, Load_DTS)
AS
-- The PIT surrogate key (pit_seq) serves as the dimension key.
SELECT sat_pit.pit_seq, hub.customer_num, sat_1.name, sat_2.address, sat_pit.load_dts
FROM   HUB_CUST hub
JOIN   SAT_CUST_PIT sat_pit
       ON sat_pit.CSID = hub.CSID
-- Each PIT row points at the Satellite rows in effect at that point in time.
JOIN   SAT_CUST_NAME sat_1
       ON  sat_1.CSID = hub.CSID
       AND sat_1.LOAD_DTS = sat_pit.NAME_LOAD_DTS
JOIN   SAT_CUST_ADDR sat_2
       ON  sat_2.CSID = hub.CSID
       AND sat_2.LOAD_DTS = sat_pit.ADDRESS_LOAD_DTS;
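
If your users expect the usual effective-from and effective-to columns on a Type 2 dimension, one way to get them without storing an end date is to derive it in a query on top of the view. This is only a sketch; the effective_from and effective_to column names are my own choice.

```sql
-- Derive the validity window of each dimension row from the next
-- row's Load_DTS for the same customer. The current row has a
-- NULL effective_to because no later row exists.
SELECT customer_key,
       customer_number,
       customer_name,
       customer_address,
       load_dts AS effective_from,
       LEAD(load_dts) OVER (PARTITION BY customer_number
                            ORDER BY load_dts) AS effective_to
FROM   Dim2_Customer;
```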
 

Benefits of a VSCD2

  1. We can now rapidly demonstrate the contents of a type 2 dim prior to ETL programming
  2. By using PIT tables we don’t need the Load End DTS on the Sats, so the Sats become insert-only as well (simpler loads, no update pass required)
  3. Another by-product is that the Sat is now also Hadoop compliant (again, insert-only)
  4. Since the nullable Load End DTS is not needed, you can now more easily partition the Sat table by Hub Id and Load DTS.
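
As a sketch of benefit 4, here is one way an insert-only Sat could be partitioned in Oracle. It assumes the Partitioning option is licensed and a release that supports interval-hash composite partitioning; the datatypes, partition boundary, and subpartition count are placeholders I picked for illustration.

```sql
-- Insert-only Satellite with no Load End DTS column:
-- range (interval) partitioned by load date, hash subpartitioned by Hub key.
CREATE TABLE sat_cust_name (
  csid     NUMBER        NOT NULL,  -- Hub key
  load_dts TIMESTAMP     NOT NULL,  -- load timestamp, never updated
  name     VARCHAR2(200),
  CONSTRAINT pk_sat_cust_name PRIMARY KEY (csid, load_dts)
)
PARTITION BY RANGE (load_dts)
  INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))   -- new partition per month, created on demand
  SUBPARTITION BY HASH (csid) SUBPARTITIONS 8
  (PARTITION p_initial
     VALUES LESS THAN (TIMESTAMP '2013-01-01 00:00:00'));
```

Because both partitioning columns are NOT NULL and never change after insert, rows never have to move between partitions.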

Objections

The main objection to this approach is that the virtual dimension will perform very poorly. While this may be true for very high volumes, or on poorly tuned or under-resourced databases, I maintain that with today’s evolving hardware appliances (e.g., Exadata, Exalogic) and the advent of in-memory databases, these concerns will soon be a thing of the past.

Worst case, after you have validated the data with your users, you can always turn it into a materialized view or a physical table if you must.
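
For example, a fallback along these lines should work in Oracle. The object names mv_dim2_customer and dim2_customer_tbl are hypothetical, and you would pick whichever of the two variants suits your refresh needs.

```sql
-- Variant A: materialize the view, refreshed on demand after each load.
CREATE MATERIALIZED VIEW mv_dim2_customer
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
AS
SELECT * FROM Dim2_Customer;

-- Run after each load to rebuild the materialized view ('C' = complete).
BEGIN
  DBMS_MVIEW.REFRESH('MV_DIM2_CUSTOMER', method => 'C');
END;
/

-- Variant B: a one-off physical copy of the validated dimension.
CREATE TABLE dim2_customer_tbl AS
SELECT * FROM Dim2_Customer;
```

Either way, the view remains the specification of the dimension, so what your users validated is exactly what gets persisted.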

So what do you think? Have you ever tried something like this? Let me know in the comments.

Get virtual, get agile!

Kent

The Oracle Data Warrior

P.S. I am giving a talk on Agile Data Warehouse Modeling at the East Coast Oracle Conference this week. If you are there, look me up and we can discuss this post in person!

Data Vault Informatica Class is Live!

Just a quick note to let you all know that Dan has finally released the class on how to easily implement a Data Vault using Informatica.

I wrote about the class here.

I have gone through a few of the lessons already and can tell you the instruction is very clear and easy to follow (even for me!) and the audio and video are excellent. The audio seems to come on a bit loud, so just be sure you have your volume turned down a bit when you start the videos.

And there is a money back guarantee if for some reason you decide the class is not for you.

If you did not get on Dan’s early notice list you can still sign up by going directly here: http://learndatavault.com/kentdvi

And, since you are a reader of my blog, if you sign up in the next few weeks and enter the coupon code DATAWARRIOR13, you can get $100 off!

So if you use Informatica and plan to do a Data Vault, you owe it to yourself to take a look at this course.

Take care.

Kent
