The Data Warrior

Changing the world, one data model at a time. How can I help you?

Archive for the tag “Hadoop”

The Elephant in the Data Lake and Snowflake

So is Hadoop finally dead? For many use cases, I think it really is. The cloud and the continued evolution of technology have created newer, better ways of working with data at scale. Check out what Jeff has to say about it!

Jeffrey Jacobs, Consulting Data Architect, Snowflake SnowPro Core Certified

Let’s talk about the elephant in the data lake, Hadoop, and the constant evolution of technology.

Hadoop (symbolized by an elephant) was created to handle massive amounts of raw data that were beyond the capabilities of existing database technologies. At its core, Hadoop is simply a distributed file system. There are no restrictions on the types of data files that can be stored, but the primary file contents are structured and semi-structured text. “Data lake” and Hadoop have been largely synonymous, but, as we’ll discuss, it’s time to break that connection with Snowflake’s cloud data warehouse technology.

Hadoop’s infrastructure requires a great deal of system administration, even in cloud managed systems. Administration tasks include replication, adding nodes, creating directories and partitions, performance, workload management, data (re-)distribution, etc. Core security tools are minimal, often requiring add-ons. Disaster recovery is another major headache. Although Hadoop is considered a “shared nothing” architecture, all…


One more time: Do we still need Data Modeling?

More specifically, do we still need to worry about data modeling in the NoSQL, Hadoop, Big Data, Data Lake world?

This keeps coming up. Today it was via email after a presentation I gave last week. This time the query was about the place of data modeling tools in this new world order.

Bottom line: YES, YES, YES! We still need to do data modeling and therefore need good data modeling tools and skills.

Snowflake with RI

A picture can say so much!


In order to get any business value out of the data, regardless of where or how it is stored, you have to understand the data, right?

That means you have to understand the model of the data. Even if the model (or schema) is not needed upfront to store the data, as it is with schema-on-write, you must discern the model in order to use the data (schema-on-read).
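To make the schema-on-read point concrete, here is a minimal sketch in Python. The records, the field names, and the `read_with_schema` helper are all made up for illustration; the point is only that the raw data went in without any schema, yet nothing useful comes out until a model is applied.

```python
import json

# Hypothetical raw records, stored as-is with no schema enforced at write time.
RAW_LINES = [
    '{"cust_id": "42", "name": "Acme", "signup": "2012-10-04"}',
    '{"cust_id": "43", "name": "Globex"}',
    '{"cust_id": "oops", "name": "Initech", "signup": "2012-10-05"}',
]

# The "model" applied at read time: field name -> (converter, required?).
# These fields are assumptions made up for this example.
CUSTOMER_SCHEMA = {
    "cust_id": (int, True),
    "name": (str, True),
    "signup": (str, False),
}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and apply the schema while reading.

    Returns (good_rows, rejected), where rejected pairs each bad line
    with the reason it failed.
    """
    good, rejected = [], []
    for line in lines:
        record = json.loads(line)
        try:
            row = {}
            for field, (convert, required) in schema.items():
                if field not in record:
                    if required:
                        raise KeyError(field)
                    row[field] = None  # optional field missing in the raw data
                else:
                    row[field] = convert(record[field])
            good.append(row)
        except (KeyError, ValueError) as exc:
            rejected.append((line, repr(exc)))
    return good, rejected


good, rejected = read_with_schema(RAW_LINES, CUSTOMER_SCHEMA)
print(len(good), "usable rows,", len(rejected), "rejected")
```

Notice that the storage step never complained about the bad `cust_id`. The problem only shows up when someone who knows the model tries to read the data.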

It is (mostly) impossible to get repeatable, auditable metrics, KPIs, dashboards, or reports that bring value to the business without understanding the semantics of the data – which means you at least need a conceptual or logical model.

And if you want or need to join data from multiple sources, then you really have to understand each source, or there is no way to properly join it all together and get meaningful results.

There are a few data cleansing, discovery, and “virtualization” tools out there that will help you figure out those relationships, but they are expensive and mostly rely on standard data profiling techniques to find similar data objects across the sets and propose “relationships”. Some allow for the definition of fairly sophisticated matching rules, including customizations. But a human still needs to figure those out, test, and validate the results.
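For a feel of what that kind of profiling looks like, here is a rough sketch in Python. It is not how any particular tool works; the `propose_relationships` function, the 0.8 threshold, and the sample tables are assumptions for illustration. It simply checks how many of one column's distinct values also appear in another column and proposes a “relationship” when the overlap is high.

```python
from itertools import product


def column_values(rows, column):
    """Distinct non-null values found in one column of a list of dict rows."""
    return {row[column] for row in rows if row.get(column) is not None}


def propose_relationships(left_rows, right_rows, threshold=0.8):
    """Propose (left_col, right_col, containment) candidates.

    Containment is the share of distinct left values that also occur
    in the right column. The threshold is an arbitrary assumption.
    """
    left_cols = sorted({c for r in left_rows for c in r})
    right_cols = sorted({c for r in right_rows for c in r})
    proposals = []
    for lc, rc in product(left_cols, right_cols):
        lv = column_values(left_rows, lc)
        rv = column_values(right_rows, rc)
        if not lv or not rv:
            continue
        containment = len(lv & rv) / len(lv)
        if containment >= threshold:
            proposals.append((lc, rc, containment))
    return sorted(proposals, key=lambda p: -p[2])


# Made-up sample data for the sketch.
orders = [
    {"order_id": 1, "customer": 42},
    {"order_id": 2, "customer": 43},
    {"order_id": 3, "customer": 42},
]
customers = [
    {"cust_id": 42, "name": "Acme"},
    {"cust_id": 43, "name": "Globex"},
    {"cust_id": 44, "name": "Initech"},
]

proposals = propose_relationships(orders, customers)
print(proposals)
```

The catch is easy to see: if the order numbers happened to run from 42 to 44, this would just as happily propose `order_id` to `cust_id`. Overlapping values are only a hint, which is why a human who knows the data still has to validate the result.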

In the end you still have to know your data.

One of the best ways to do that, in my opinion, is to model that data. Otherwise your data lake will likely become a data swamp!

So keep your data modeling tool and keep building your data dictionary with your business folks.

Final Stage Table

A good modeling tool can act as a visual data dictionary too!

If you agree with me, please share on social media!



The Data Warrior

P.S. If you need a good modeling tool, check out Oracle SQL Developer Data Modeler. And check out my books and training offering for SDDM on the blog sidebar.

Oracle OpenWorld 2012: Day 4

It was another beautiful, sunny day in San Francisco. I started the day, again, with some morning Chi Gung, then I enjoyed the morning keynote watching the big screen in Yerba Buena Gardens. Quite a pleasant way to listen to these talks.

It was a light day session-wise for me, but it did set off a few light bulbs.

The best session of the day (and, for me, of the whole conference) was Gwen Shapira’s (from Pythian) talk on building an integrated data warehouse with Hadoop. Gwen did a superb job of explaining what Big Data is and what it isn’t.

Her simple, and straightforward, definition:

Big Data Defined

The “cheaply” part seems to be the key. Oracle, and other databases, can handle really HUGE amounts of data. Petabytes in fact. But putting all that data into an RDBMS can cost a lot more money than having it stored in a less sophisticated file system on commodity drives (like HDFS).

So just having lots of data in your warehouse does not mean you have Big Data; you just have a Very Large Data Warehouse (VLDW).

She went on to expand the definition:

Big Data Defined 2

This part shed even more light on Big Data for me. It really helped clarify when you might be dealing with Big Data.

The talk was filled with lots of technical details, limitations, and tools (Sqoop, Flume, Fuse-DFS) you can look at for integrating Hadoop into Oracle. Of course there are Oracle’s offerings as well, like Oracle Loader for Hadoop and Oracle Direct Connector for HDFS.

Gwen also gave us several use case examples that illustrated when to use Hadoop. Bottom line – learn to use Hadoop appropriately, not just because it is cool. With tech we can:

Make the impossible, possible. That might not make the possible easy.

If you went to OOW, find and download Gwen’s slides. And follow her on twitter (@gwenshap).

It was Big Data day for me. My other session was Ian Abramson’s session on Agile & Data. Two of my favorite topics.

Ian discussed the Agile Manifesto and Big Data and how he has been able to use agile techniques to make his projects successful.

To start, here is his simple definition of Big Data:

What is Big Data?

Ian had a nice picture of the overall architecture as well:

Another Big Data Picture

To be successful in applying agile to data projects, Ian has determined the projects must be driven by data value. That is, the sprint priorities are set based on the data that can best help the customer achieve their goals. To stay on track and keep velocity, it is important to have daily touch-points with the team members as well. Ian does a daily stand-up for 15 minutes.

Ian shared lots of details and answered a lot of my annoying questions too. He came up with a great tree graphic to illustrate important factors in having a high performance project:

Agile Tree

Again, find and download the slides once Oracle uploads them. In the meantime, follow Ian on twitter (@iabramson). A data-centric agilist is hard to find. For more on agile and data warehousing, check out my classic white paper on the subject.

After Ian’s session I got to go to my first Oracle blogger meet up. It was nice to put more faces to names. Thanks to Pythian and OTN for sponsoring it.

Blogger Meetup

Then back to the hotel to pack and then stand in line (for an hour!) to get to the appreciation event and see Pearl Jam live. It was a good concert. Hard to beat live music outdoors!

Huge crowd for Pearl Jam

Pearl Jam Live!

Well that’s it for me on OOW2012. I am back home in Houston now and heading into the office tomorrow. Then I need to write another abstract or two for KScope13 and RMOUG TD2013. Then it will be time to plan for OOW2013 and The America’s Cup finals…

Nap time.

