Kent Graziano

Right now, what you're touching on is exactly some of the reasons that data vault evolved. Bill Inmon, in his original definition of the data warehouse, talked about it being time variant, and that meant we would be able to see changes over time. And like you said, people in many cases were trying to do that in their operational systems and having performance issues, because the operational system wasn't designed to handle that data, and when you tried to run a report, if you wanted the history, you were having to join together multiple tables.

So we started evolving, in the data warehouse world, to store all of that. What happened is, often, people took the third normal form model and put it into a data warehouse, which really, in those early days, meant they put it on a different server, in a different database, and started collecting the history. Well, how are you going to collect the history without overwriting things constantly? What they discovered, and this was Bill Inmon's approach, was to put a snapshot date on it. So now you had maybe a customer table that would have the same customer in it multiple times, but with different snapshot dates, and you could start doing that trend analysis.

Well, over time, a number of things happened with that. From a performance perspective, depending on the platform you were on, you started hitting the maximum capability to aggregate and read all of that data, and you got into a lot of, I'll say, fancy performance tuning and optimizations that the various database vendors added in to handle this, because most of the databases, well, they all started out as OLTP, and eventually they started adopting and adapting for data warehousing to do those kinds of queries. So as the data got bigger and bigger, the queries got slower and slower. So there was that aspect to it. Then you started getting more specialized technologies and appliances that performed better, that were designed for this type of query.

Now, what has complicated it, and this is kind of where data vault comes in, is that with the third normal form approach with snapshot dates, the queries get very complicated. So if you had a master-detail-detail type of association, and you've got snapshot dates at all levels, and data is changing at different rates in the different tables, trying to put together the query to get a single point-in-time view, you know, what did that look like last week, when some of the data has changed and some of the data hasn't, means a lot of subqueries and BETWEENs to try to line that all up (there's a sketch of what that looks like below). That gets problematic, and it's very hard to maintain.

The other issue you get is if you try to integrate multiple source systems with that kind of model. It's very hard to get a common model and get it aligned. And then, you know, the worst-case scenario is your source system changes. You've got this model, you know, five levels deep with snapshot dates, and the source changes, whether it's new attributes or new tables, and refactoring that model can be very expensive and very time consuming. So this was one of the problems that data vault was really looking to solve.
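A minimal sketch of the kind of point-in-time query described above, assuming hypothetical snapshot-dated tables customer, order_header, and order_line and an arbitrary as-of date; each table needs its own correlated subquery to pick the right snapshot:

```sql
-- Hypothetical 3NF tables, each independently snapshotted:
--   customer(customer_id, customer_name, snapshot_date)
--   order_header(order_id, customer_id, status, snapshot_date)
--   order_line(order_id, line_no, quantity, snapshot_date)
-- "What did it look like last week?": for every table, find the most
-- recent snapshot on or before the as-of date, then join the results.
SELECT c.customer_name, oh.order_id, oh.status, ol.line_no, ol.quantity
FROM customer c
JOIN order_header oh ON oh.customer_id = c.customer_id
JOIN order_line   ol ON ol.order_id    = oh.order_id
WHERE c.snapshot_date = (SELECT MAX(c2.snapshot_date) FROM customer c2
                         WHERE c2.customer_id = c.customer_id
                           AND c2.snapshot_date <= DATE '2020-01-01')
  AND oh.snapshot_date = (SELECT MAX(o2.snapshot_date) FROM order_header o2
                          WHERE o2.order_id = oh.order_id
                            AND o2.snapshot_date <= DATE '2020-01-01')
  AND ol.snapshot_date = (SELECT MAX(l2.snapshot_date) FROM order_line l2
                          WHERE l2.order_id = ol.order_id
                            AND l2.line_no = ol.line_no
                            AND l2.snapshot_date <= DATE '2020-01-01');
-- With effective-from/effective-to ranges instead of single snapshot
-- dates, each predicate becomes a BETWEEN, which is where the
-- "subqueries and BETWEENs" complexity comes from.
```

And this is only three levels deep with one attribute per table; the maintenance burden grows with every table and every change in snapshot cadence.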
Then on the other side, in order to try to get better performance, we had the dimensional modeling approach. And one of the things that I've seen over the, you know, bunch of years that I've been doing this is that the dimensional model, or star schema as it's often called, is really, to me, an articulation of a pivot table in a database format. Many business analysts think very well in pivot tables, and it's a great paradigm for doing business analytics. Well, trying to do that on a third normal form model just didn't work in the older technologies. So people started using dimensional models and doing the denormalization and some pre-aggregation. We run a process to put all this data together in a form that's easier for a business user to consume. It takes up a little more space, because now we're introducing redundancies into the data set, but it would often perform better, because the queries are less complicated, you know, fewer joins, and in the process of building the dimensional model you deal with all the snapshot dates and things like that.

The downside of that was that you've got a fixed model: you've got a fact table and, let's say, five dimensions. Now the same problem happens here when you get a new source system. Say your customer dimension has 20 attributes in it already, and that's just from one source system. Now you get a second source system that has maybe 40 attributes, some of which are similar or equivalent to the 20 you already have, and the others brand new. Well, two things. If you need all that information for your analysis, you need to re-engineer or refactor that customer dimension, and that means adding a bunch of attributes. Well, that's great: for the net new attributes coming from source system two, you add them in and modify your ETL process to populate those. But what about the first 20, the ones that came from source system one that have equivalents in source system two? Now you have to deal with an order of precedence.

I did this in one of the very large data warehouses I worked on for HP years ago, and it wasn't just two, we had five source systems. So we had to say, for column one, customer name, we can get it from five different source systems; which name do we want in our reporting system? And we had to write code that said, if source system one has the name, then we use it; if for some reason it's null in source system one, then we look at number two, and then number three, and number four, and number five (see the sketch below). And of course, that's a very simple example, and it was never actually that simple. So there's a lot of coding involved, having to not only refactor the design but also refactor the load process, and that's, again, very time consuming. As you get into bigger and bigger data sets, you start talking about, you know, terabytes of data and billions of rows, and it takes a lot of time and a lot of cost to re-engineer that.
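Not the actual HP code, but a minimal sketch of that order-of-precedence rule, assuming hypothetical per-source staging tables keyed by a shared customer_id:

```sql
-- Hypothetical staging tables, one per source system. The precedence
-- rule: take customer_name from source 1 if present, else source 2,
-- and so on down to source 5.
SELECT k.customer_id,
       COALESCE(s1.customer_name,
                s2.customer_name,
                s3.customer_name,
                s4.customer_name,
                s5.customer_name) AS customer_name
FROM master_customer_keys k
LEFT JOIN src1_customer s1 ON s1.customer_id = k.customer_id
LEFT JOIN src2_customer s2 ON s2.customer_id = k.customer_id
LEFT JOIN src3_customer s3 ON s3.customer_id = k.customer_id
LEFT JOIN src4_customer s4 ON s4.customer_id = k.customer_id
LEFT JOIN src5_customer s5 ON s5.customer_id = k.customer_id;
```

Multiply that by every overlapping column, with a possibly different precedence order per column, and the load process gets expensive to build and to change.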
The approach with data vault, as you said, getting down to a more granular level, was designed to eliminate that refactoring, to be able to integrate new sources much quicker, but also to have a model that would be resilient, that is business oriented and business focused. So you integrate across the business keys, or natural keys, because you hope that, you know, across your five source systems there's a common set of attributes you can align on.

Now, from a performance perspective, one of the things that means is that in the early days of data vault, people looked at it and went, wow, that's a lot more tables, especially if they were used to dimensional models. Dimensional models are very clean, you know: a fact table, five dimensions, great. You break that out into a data vault model and it can be a lot. I did one of these one time where I had a dimension with three layers of hierarchy in it, and by the time we finished the data vault model that was behind that dimension, we had 25 tables. So depending on the database engine you're working on, that may look like, well, I don't know how we're ever going to get a query to perform to get a report out.

Part of data vault, and I'll say methodology now rather than modeling technique, was that the data vault is the core of your enterprise data warehouse. It really is the history of all the data, time variant, that you're trying to store. But for reporting, we still tend to project those into dimensional models, because most of the BI tools tend to think in terms of facts and dimensions, and in the early days, again, performance was potentially an issue with that many joins. They're simple joins, not like the complex joins you get trying to query snapshot-dated third normal form, but there are a lot of them. Depending on the platform, you might be able to make those work really well and just do views. This is particularly true in Snowflake; we have a lot of customers that are doing that. They're able to just put a view on a bunch of data vault tables, and it looks like a dimensional model, and that's what gets exposed to the reporting side.
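A minimal sketch of that "view on a bunch of data vault tables" idea, assuming hypothetical hub and satellite names and the usual data vault column conventions; a real vault would join many more satellites and links into the view:

```sql
-- Project data vault tables back into a familiar dimension shape.
-- The hub carries the business key, the satellite carries the
-- descriptive attributes; the current row is the latest load date.
CREATE VIEW dim_customer AS
SELECT h.customer_id,           -- business (natural) key from the hub
       s.customer_name,
       s.customer_address
FROM hub_customer h
JOIN sat_customer s
  ON s.hub_customer_hk = h.hub_customer_hk
WHERE s.load_date = (SELECT MAX(s2.load_date)  -- latest satellite row
                     FROM sat_customer s2
                     WHERE s2.hub_customer_hk = s.hub_customer_hk);
```

The reporting tool only ever sees dim_customer; whether it is a view or a physically materialized table becomes a platform-dependent performance decision.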
Previously, we would just, you know, write a little ELT process that would take the data from the data vault and project it into a dimensional model, again, in order to get the performance. And one of the conclusions that I've come to over the years, when people ask, well, why did we have dimensional models, is that it was twofold. One was a business reason: it puts the data in a format that is easy for most business analysts to consume. The other was a performance one, as you kind of indicated earlier: we needed the queries to be fast on large sets of data, and it was simply because the underlying technology couldn't handle the complex joins at the lower level of granularity that you get with third normal form, and even lower with data vault.

Data vault allows you more flexibility in the questions that you ask of the data model and your data warehouse. You don't have to know all the questions to design the schema; you have to know the business, and the business semantics and context of the data, to start building a data vault, where with a dimensional model, you kind of need to know what questions you're going to be asked. So if somebody says, I need a report, I want to know sales by region, by customer, and by product, well, that's a particular dimensional model: a fact and a couple of dimensions. So you go build all of that out, and then they come back and say, oh, I also need to know the sales rep. Well, that's a different dimension, and that's potentially a different level of granularity in the fact table, and that means building or rebuilding that fact table and adding a dimension to it. So again, it's a refactoring or re-engineering that can be very time consuming, especially if this is the only place you've stored the data.

In the data vault world, you simply go back to your data vault and say, okay, where's our information on sales rep? And then we can add that dynamically into the model and into the design of the reporting layer, again, giving us that much more flexibility to be responsive to changing business requirements. And if we don't already have the sales rep information, because of the granularity at which we model in data vault, we go get that information, and we model the hubs and links appropriately to connect all that data together (a sketch of that follows below). So even if we had requirements that we didn't know about, we can evolve the model organically without having to re-engineer what we already did. That's actually one of the huge benefits that I saw in data vault, especially as we moved into trying to do more agile approaches to data warehousing. We've got, you know, pressure to deliver faster; how do we do that? And the data vault modeling technique is the only one that I'd seen that really could accommodate that, that could allow you to model in a very incremental manner, stay aligned with the business, and deliver, but without boxing yourself into a corner that's going to require a lot of work if you didn't quite get it right the first time.
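To make the sales rep example concrete, a minimal sketch of how a new hub and link snap into an existing vault, assuming a pre-existing hypothetical hub_sale and the standard data vault columns (hash key, load date, record source); nothing already built has to change:

```sql
-- New hub for the sales rep business key.
CREATE TABLE hub_sales_rep (
    hub_sales_rep_hk CHAR(32)    NOT NULL PRIMARY KEY, -- hash of business key
    sales_rep_id     VARCHAR(50) NOT NULL,             -- business (natural) key
    load_date        TIMESTAMP   NOT NULL,
    record_source    VARCHAR(50) NOT NULL
);

-- New link relating sales reps to the existing sale hub. Existing
-- hubs, links, and satellites are untouched; the reporting views can
-- now pick up sales rep wherever it's needed.
CREATE TABLE link_sale_sales_rep (
    link_sale_sales_rep_hk CHAR(32)  NOT NULL PRIMARY KEY,
    hub_sale_hk            CHAR(32)  NOT NULL REFERENCES hub_sale (hub_sale_hk),
    hub_sales_rep_hk       CHAR(32)  NOT NULL REFERENCES hub_sales_rep (hub_sales_rep_hk),
    load_date              TIMESTAMP NOT NULL,
    record_source          VARCHAR(50) NOT NULL
);
```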