How far along are we down the road of distributed databases and data management systems?

Usually, when you say “database person,” you precede it by “grumpy, old.” We’re famous for looking at a lot of systems work and saying, “Yeah, we’ve been doing that for 20 years, 30 years.” Even, frankly, if you look at MapReduce — inside of any parallel database system like Teradata, like IBM’s parallel edition, like Oracle’s RAC, inside there is a MapReduce engine. Those techniques have been known for many years. So grumpy, old database people really did figure out a lot of things in the past.

That being said, I do think as a database person that things have really fundamentally changed. Things have probably changed the most since the adoption of the relational model back in the ’80s. The big data ecosystem is really fundamentally different in many ways from traditional data management. In particular, people like to talk about scalability, because the big in big data means you have lots of data.

But again, scale-out techniques have been known for quite some time. They’re a little different now because of some different systems assumptions and so on, and maybe whereas before somebody might think that a thousand-node system was a big system, now it’s easy to talk about 10,000 nodes.

“Compared to a lot of grumpy, old database people … I believe that things have fundamentally changed and they’re not going to change back. I think we’re really at the beginning of a new era.”

To me, what’s really fundamentally different about this new-generation data management isn’t just scalability; it’s really flexibility. If you look at the ability to store data first and then impose structure on it later — sometimes this is called schema on read or schema on need — that’s a complete game changer.

Because the way things used to work if you wanted to do a data management project is you’d say, “OK, step number 1: Figure out every piece of data that you might ever want to store in your system, what it looks like, how it’s organized, and then how it’s related to all the other pieces of data that you might ever want to store in your database system. Then step 2 was to get your hands on some real data. And then step 3 was to try to make the real data conform to the model that you created in step 1.”

Many projects never made it that far, and back when people were first starting to do things like data warehousing, the literature was just full of horror stories where people would throw millions and billions of dollars into these systems and could never get them to work.

In this new regime, where you store the data first and then figure out what to do with it, things have completely changed. Now you can collect all the data you can think of collecting. Yes, you have to do some extra work when you go to use it; and, yes, you might take a little bit of a performance hit because you don’t have the storage completely optimized; and, yes, there may be some consistency problems that you need to understand. But by and large, the friction of getting your data-management system put together has just decreased dramatically.
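The schema-on-read idea described above can be sketched in a few lines. This is an illustrative example, not tied to any particular system: the raw records and field names (`user`, `event`, `ms`) are hypothetical, and the point is simply that records are stored as-is and structure is imposed only at query time.

```python
import json

# Raw, hypothetical event log: stored first, with no schema enforced at write time.
raw_log = [
    '{"user": "ana", "event": "click", "ms": 120}',
    '{"user": "bo", "event": "view"}',             # missing "ms" is fine at write time
    '{"event": "click", "ms": 80, "extra": "x"}',  # unknown fields are fine too
]

def read_clicks(lines):
    """Impose structure at read time: keep only clicks that carry a latency."""
    for line in lines:
        rec = json.loads(line)
        if rec.get("event") == "click" and "ms" in rec:
            # Tolerate missing fields instead of rejecting the record up front.
            yield rec.get("user", "unknown"), rec["ms"]

print(list(read_clicks(raw_log)))  # [('ana', 120), ('unknown', 80)]
```

The extra work (and the possible performance hit) the interviewee mentions shows up here as the parsing and field-checking done on every read, rather than once at load time.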

“The real breakthrough [of the relational model] was the separation of the logical view you have of the data and how you want to work with it, from the physical reality of how the data is actually stored.”

If you look at elastic computing, through cloud computing, and some of the mechanisms that are in Hadoop MapReduce and then things like Spark, just the ability to add more resources and have the system gracefully absorb those resources is something that didn’t exist before. And it’s not just the ability to grow your system, but it’s the ability to expand your system as you need it and then shrink it back down when you don’t need it anymore.

Again, this completely reduces the friction. It used to be that you would have to build your datacenter or your system for the biggest problem you would ever imagine that you’d have to solve, and now you don’t have to do that anymore. Now you can build your system for what you think you’re going to need, and then you can surge with cloud resources when you need to do that, or you can just do the whole thing in the cloud in the first place.

That has changed things pretty fundamentally.

Then this ability to move smoothly between languages like SQL for querying, languages like R for doing statistical processing, graph processing — the things that you can do easily in Spark. That’s completely different, so you no longer have to commit to a single paradigm for working with your data. You can store the data in the system and then you can do the things that make sense with your graph system using that, the things that make sense with relational query processing using that, the things that make sense for statistical processing using that. And you can mix and match them.
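The mix-and-match idea above can be illustrated with a small sketch. This is not Spark code; it is a minimal stand-in using only the standard library, with a hypothetical "follows" dataset, showing the same stored data answered both relationally (a filter) and as a graph (a reachability traversal).

```python
from collections import defaultdict

# One dataset, stored once, queried under two different paradigms.
follows = [("ana", "bo"), ("bo", "cy"), ("ana", "cy"), ("cy", "di")]

# Relational-style question: whom does "ana" follow directly?
direct = sorted(b for a, b in follows if a == "ana")

# Graph-style question: everyone reachable from "ana" along edges.
adj = defaultdict(list)
for a, b in follows:
    adj[a].append(b)

def reachable(start):
    """Depth-first traversal over the same edge data."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(direct)                     # ['bo', 'cy']
print(sorted(reachable("ana")))   # ['bo', 'cy', 'di']
```

In a system like Spark the two questions would go to the SQL engine and the graph library respectively, but the underlying data would be stored once, which is the point being made.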

So compared to a lot of grumpy, old database people you might talk to, I believe that things have fundamentally changed and they’re not going to change back. I think we’re really at the beginning of a new era. Sure, just like the beginning of the relational revolution, there’s a lot of work to do to make systems more robust, there’s a lot of work you have to do to make systems more performant, there’s a lot of work you have to do to make systems easier to use. But we’re just at the start of that journey.

“Even as Hadoop was getting more popular … many of my colleagues and I were just waiting until people realized that writing MapReduce programs directly is a real pain and that there were languages, in particular SQL, that had been designed to solve many of these problems.”

You mentioned SQL. Did you think, as you watched Hadoop and Spark get popular, that SQL would be the focus of so much attention on those systems?

I think I can say without fibbing too much that, yes, even as Hadoop was getting more popular and people were getting more excited about it, many of my colleagues and I were just waiting until people realized that writing MapReduce programs directly is a real pain and that there were languages, in particular SQL, that had been designed to solve many of these problems. I was pretty sure SQL was going to play a big role in these systems.

I guess maybe you could see it coming as far back as Hive.

You don’t even have to go to Hive. This is exactly why database systems caught on in many ways. It’s because it’s just too hard to write that stuff directly. Furthermore, you don’t want to, because the thing that a lot of people don’t realize about the relational model and systems like SQL is that the real breakthrough there wasn’t the language. The language is just sort of an artifact.

The real breakthrough was the separation of the logical view you have of the data and how you want to work with it, from the physical reality of how the data is actually stored. Built into the relational model is that vision; it’s called data independence. What that lets you do is change the layout of your data, the organization of your data, the systems that you’re using, and the machines that you’re using without having to rewrite your applications every time you change something.

Likewise, it lets you write the application in a way that you’re not really too concerned about how the data is organized at any particular minute. That flexibility is absolutely vital for data-oriented systems because once you collect data, you tend to keep it. Applications that you write tend not to go away. You need that ability to evolve the physical layout of the data, and you need that ability to protect developers — even though they may not want to be protected — from those sorts of changes.
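Data independence as described above can be demonstrated concretely. The sketch below uses Python's built-in `sqlite3` with hypothetical table and column names: the application only ever queries a logical name (`orders`, exposed as a view), and the physical storage behind it is reorganized without the application query changing.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Initial physical layout, hidden behind a logical view named "orders".
con.execute("CREATE TABLE orders_v1 (id INTEGER, amount REAL)")
con.executemany("INSERT INTO orders_v1 VALUES (?, ?)", [(1, 9.5), (2, 20.0)])
con.execute("CREATE VIEW orders AS SELECT id, amount FROM orders_v1")

APP_QUERY = "SELECT SUM(amount) FROM orders"  # the application only knows "orders"
before = con.execute(APP_QUERY).fetchone()[0]

# Physical reorganization: new layout, extra column, an index -- then the
# view is repointed. The application query is untouched.
con.execute("CREATE TABLE orders_v2 (id INTEGER PRIMARY KEY, amount REAL, region TEXT)")
con.execute("INSERT INTO orders_v2 SELECT id, amount, NULL FROM orders_v1")
con.execute("CREATE INDEX idx_amount ON orders_v2(amount)")
con.execute("DROP VIEW orders")
con.execute("CREATE VIEW orders AS SELECT id, amount FROM orders_v2")

after = con.execute(APP_QUERY).fetchone()[0]
print(before, after)  # 29.5 29.5 -- same answer, developer protected from the change
```

This is the protection for developers the interviewee describes: the physical layout evolved, and the application kept working without a rewrite.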

Anybody that worked with database systems for any amount of time could see this happening, because Hadoop was basically breaking all those rules, and that was a lesson that had been learned decades earlier.