Snowflake SQL: Making Schema-on-Read a Reality (Part 2)
This is the 2nd of my articles on the Snowflake blog.
In the first article of this series, I discussed the Snowflake data type VARIANT, showed a simple example of how to load a VARIANT column in a table with a JSON document, and then how easy it is to query data directly from that data type. In this post I will show you how to access an array of data within the JSON document and how we handle nested arrays. Finally, I will give you an example of doing an aggregation using data in the JSON structure and show how simple it is to filter your query results by referring to values within an array.
Check out the rest of the post here:
Snowflake SQL: Making Schema-on-Read a Reality (Part 2) – Snowflake
Enjoy!
Kent
The Data Warrior
Oracle 12c has some similar JSON capabilities. My problem with storing JSON in the database (or XML, for that matter) is that even if you can query it as in Snowflake or 12c there is no standardization. Did they store the last name as lastName or nameLast? Or sometimes one and sometimes the other? When it is in the LAST_NAME column, you know what to query.
Agreed, but when you need to get data loaded quickly for exploration, this feature is very helpful. You can always put a view on top to standardize the column names.
Keep in mind that systems which output these documents tend to output a standard format so it should not be changing randomly.
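For instance, here is a minimal sketch of that kind of standardizing view (my own illustration, reusing the json_demo table, the v variant column, and the fullName/age/gender elements from the example further down in this thread):

create or replace view json_demo_people as
select v:fullName::varchar as full_name,  -- standardized column name, regardless of how the source spells it
       v:age::number       as age,
       v:gender::varchar   as gender
from json_demo;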
On the contrary, what I am seeing is that JSON is being stored in the database because “we don’t want to lock ourselves in with a database design” or “we haven’t designed a database up front – we’ll let it evolve with the application”. And because “a database is just a place to persist your objects”. So it DOES change all the time depending on the last developers to touch the application. Do names change? Well, probably you are right about the fundamental data fields like last name – they were included from the first iteration of the application and they are still there with the same name. But other fields come and go on a whim.
Unfortunately in the application development world there are people doing exactly as you say. And I would call that a bad practice in most cases (other than maybe for prototyping). They think they are being agile, but in the end they may shoot themselves in the foot. In those cases I am not sure why they would bother with Oracle anyway and not just use MongoDB?
Since Snowflake is specifically a data warehouse service, we are not seeing this type of issue. Rather, we have clients streaming/loading in JSON data from known sources so they can do exploration and analytics on the data quickly. While the source sending the JSON could of course change the schema, and occasionally does, that happens less often. And even when it does, as I described, the existing reports keep working regardless. If the customer wants to take advantage of the new elements in the documents, they can adjust their views and reports in the next sprint. Once they are done with discovery, they can then “harden” their analysis and reports by extracting the data into actual tables and/or views for ongoing reports. At the BIWA summit last week, several folks referred to that as using “curated” data designs.
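As a rough sketch of that “hardening” step (my own illustration, not from the post; it reuses the json_demo table and the children array from the example further down in this thread), you could extract a nested array into its own curated table:

create table children_curated as   -- hypothetical table name
select v:fullName::varchar     as parent_name,
       c.value:name::varchar   as child_name,
       c.value:age::number     as child_age,
       c.value:gender::varchar as child_gender
from json_demo, lateral flatten(v:children) c;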
Is there a command that returns the structure of a given variant field?
Yes, there are a few ways. The simplest is just to select the column and view the contents. I will look up the others and post them on Monday.
If I do this: “select f.* from json_demo, table(flatten(v)) f;” it dumps the contents of the variant into multiple rows – one row for each element in the document with a label and a value. For an array element, it gives the name of the array and the value is the contents of the array. It only goes one level down so it does not break out arrays within the array into separate rows. There is a RECURSIVE option that I am told will help. Still researching how to use that properly.
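In case it helps, here is a slight variation of that statement (my wording, same idea) selecting just the label and value columns that FLATTEN returns:

select f.key, f.value
from json_demo, table(flatten(v)) f;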
Keith – Okay got it – with recursive you can see the entire structure of a JSON document (i.e., the content of the variant column). Keep in mind that each row could (in theory) have a different structure.
select path
from json_demo, lateral flatten(v, recursive=>TRUE) a
The output from my example looks like this:
age
children
children[0]
children[0].age
children[0].gender
children[0].name
children[1]
children[1].age
children[1].gender
children[1].name
children[2]
…
fullName
gender
phoneNumber
phoneNumber.areaCode
phoneNumber.subscriberNumber
So not only does it dump the structure and show the hierarchy, but it also lists out every value in an array (probably not what you want if there are hundreds of values). If I add “value” to the select, you get the data at each of those levels as well:
age 42
children [ { "age": "10", "gender": "Male", "name": "Jayden" }, { "age": "8", "gender": "Female", "name": "Emma" }, { "age": "6", "gender": "Female", "name": "M…
children[0] { "age": "10", "gender": "Male", "name": "Jayden" }
children[0].age "10"
children[0].gender "Male"
children[0].name "Jayden"
children[1] { "age": "8", "gender": "Female", "name": "Emma" }
children[1].age "8"
children[1].gender "Female"
children[1].name "Emma"
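For reference, that is just the earlier statement with value added to the select list:

select path, value
from json_demo, lateral flatten(v, recursive=>TRUE) a;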
I assume you are thinking there may be a way to write dynamic SQL to perhaps automatically generate a create table statement that would represent the document schema? I think you can, using “with” and the flatten recursive option. We are ANSI SQL compliant. 🙂
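As a very rough sketch of the idea (mine, not tested against the example data), you could start by listing the distinct paths and observed value types across all rows; that output is the raw material for generating the CREATE TABLE statement:

select f.path, typeof(f.value) as value_type, count(*) as occurrences
from json_demo, lateral flatten(v, recursive=>TRUE) f
where typeof(f.value) not in ('OBJECT', 'ARRAY')   -- skip the intermediate containers
group by 1, 2
order by 1;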
You read me like a book! This makes the Snowflake VARIANT data type truly schema-on-read.
Awesome!!
The VARIANT data type is reminiscent of the old VB Variant data type. With that in mind, is there a means of dynamically determining the data type of a particular row? (i.e. http://www.java2s.com/Code/VBA-Excel-Access-Word/Data-Type/DeterminingtheTypeofaVariant.htm, etc.)
I could see this being a useful tool for the Snowflake developer.
Not sure I follow this, Keith, as each object/sub-column within a variant can have a separate data type. So in my example v:name is varchar while v:age is integer. So each row in the table will not usually have a single data type. It looks like with the VB construct each row could only have one data type (i.e., varchar, numeric, date, etc.).
Now each row could potentially have a different “schema” if you will. That is, the structure of the JSON document loaded into a given row may be different from the “schema” loaded into the prior row (hence the concept of schema-less loading).
Is that more of what you are looking for?
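That said, if what you are after is the data type of a specific element within a given row, Snowflake does have a TYPEOF function you can point at any variant expression. A quick sketch against my example table (element names as in the example above):

select v:fullName         as raw_value,
       typeof(v:fullName) as name_type,   -- VARCHAR for this element
       typeof(v:age)      as age_type     -- INTEGER for this element
from json_demo;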
I can best illustrate with an example. For a given Snowflake table (i.e. stuff…):
– Rows 1, 11, 21 have Person data in the Variant field (i.e. stuff_detail)
– Rows 2, 12, 22 have Place data in the Variant field (i.e. stuff_detail)
– Rows 3, 13, 23 have Thing data in the Variant field (i.e. stuff_detail)
Is there a means to do something like:
select *
from stuff
where lateral flatten(stuff_detail, recursive=>TRUE) = person;
Make sense?
So the answer is even easier – you do not need the flatten at all. Let’s say your person record has a person_id field that none of the other record types have. You could just run
“select stuff_detail from stuff where stuff_detail:person_id is not null”
Then you can even build a view for each of those record types to make it look like three tables.
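For example, a hedged sketch of what one of those views might look like (the person_id field is the hypothetical one above, and the other column names are just placeholders for whatever elements your Person documents carry):

create or replace view person_stuff as    -- hypothetical view name
select stuff_detail:person_id::integer as person_id,   -- the hypothetical identifying field
       stuff_detail:name::varchar      as name         -- placeholder element
from stuff
where stuff_detail:person_id is not null;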
Got it! Let me check on that. Good questions.
Outstanding!!! Thanks for the quick reply. 🙂