


Observing the NoSQL hype through the eyes of an SQL performance consultant is an interesting experience. It is, however, very hard to write about NoSQL because it comes in so many forms. After all, NoSQL is nothing more than a marketing term, and one that works well because it speaks to the many developers who struggle with SQL every day.

My unrepresentative observation is that NoSQL is often chosen for performance reasons, probably because SQL performance problems are an everyday experience. NoSQL, on the other hand, is known to “scale well”. However, performance is often a bad reason to choose NoSQL, especially if the side effects, like eventual consistency, are poorly understood.

Most SQL performance problems result from improper indexing. Again, that is my unrepresentative observation, but I believe it so strongly that I am writing a book about SQL indexing. Indexing is not only an SQL topic, though; it applies to NoSQL as well. MongoDB, for example, claims to support “Index[es] on any attribute, just like you’re used to”. It seems there is no way around proper indexing, no matter whether you use SQL or NoSQL. The latest release of my book, “Response Time, Throughput and Horizontal Scalability”, describes that in more detail.
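To make the indexing point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The orders table, its columns, and the index name are made up for illustration; the point is only that the same query changes from a full table scan to an index lookup once a suitable index exists.

```python
import sqlite3

# Hypothetical schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, i * 1.5) for i in range(10000)],
)

query = "SELECT total FROM orders WHERE customer_id = ?"


def plan(sql):
    """Return SQLite's query plan for sql as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (42,)).fetchall()
    # The human-readable description is the last column of each plan row.
    return " | ".join(row[3] for row in rows)


# Without an index, SQLite has to scan the whole table.
plan_before = plan(query)

# With an index on the filtered column, it can search the index instead.
conn.execute("CREATE INDEX orders_customer_idx ON orders (customer_id)")
plan_after = plan(query)

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan differs between SQLite versions, but the first plan reports a scan of orders and the second one names orders_customer_idx. Other databases expose the same information through their own EXPLAIN facilities.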

Performance is almost always the wrong reason for NoSQL. Still, there are cases where NoSQL is a better fit than SQL. As an example, I’ll describe a NoSQL system that I use almost every day: the distributed revision control system Git. Wait! Git is not NoSQL? Well, let’s have a closer look.

Git doesn’t have an SQL front end
Git has specialized interfaces to interact with the repository, either on the command line or integrated into an IDE. There isn’t anything that remotely compares to SQL or a relational model. I never missed it.

Git doesn’t use an SQL back-end
Honestly, if I had to develop a revision control system, I wouldn’t take an SQL database as the back-end. There is no benefit in putting BLOBs into a relational model, and handling BLOBs all the time is just too awkward.

Git is distributed
That’s my favourite Git feature. Working offline is exactly what is meant by ‘partition tolerance’ in Brewer’s CAP Theorem. I can use all Git features without an Internet connection. Others can, of course, still use the server if they can connect to it. There is full functionality on either end. It is partition tolerant.

Conflicts happen anyway
If there is one thing we learned in the 25 years since Larry Wall introduced patch, it is that conflicts happen, no matter what. Software development has a very long “transaction time” and we are mostly using optimistic locking, so conflicts are inevitable. But here comes the famous CAP Theorem again. If we cannot have consistency anyway, let’s focus on the other two CAP properties: availability and partition tolerance. Acknowledging inconsistencies means providing methods and tools to find and resolve them. That involves the software (e.g., Git) as well as the user. But here comes one last unrepresentative observation from my side: most NoSQL users just ignore that. They assume that the system magically resolves contradicting writes automatically. It’s like using a CVS workflow with Git: it works for a while, but you’ll end up in trouble soon.
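The optimistic-locking idea can be sketched in a few lines of Python. This is not how Git is implemented; VersionedStore, ConflictError, and the string "merge" below are hypothetical names and simplifications chosen for illustration. The sketch only shows that a conflicting write has to be detected and that somebody has to resolve it explicitly.

```python
class ConflictError(Exception):
    """Raised when a write is based on an outdated version."""


class VersionedStore:
    """Toy key-value store that rejects writes based on stale versions."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        # Unknown keys are reported as version 0 with no value.
        return self._data.get(key, (0, None))

    def write(self, key, value, based_on):
        current_version, _ = self.read(key)
        if based_on != current_version:
            # Somebody else wrote in the meantime: refuse, don't guess.
            raise ConflictError(
                f"{key}: based on v{based_on}, current is v{current_version}"
            )
        self._data[key] = (current_version + 1, value)
        return current_version + 1


store = VersionedStore()
store.write("README", "hello", based_on=0)

# Two users read the same version and edit independently.
v_alice, text_alice = store.read("README")
v_bob, text_bob = store.read("README")

store.write("README", text_alice + " from Alice", based_on=v_alice)

try:
    store.write("README", text_bob + " from Bob", based_on=v_bob)
    conflict_detected = False
except ConflictError:
    conflict_detected = True
    # Resolution is an explicit step: re-read, merge, write again.
    v_latest, text_latest = store.read("README")
    store.write("README", text_latest + " and Bob", based_on=v_latest)

print(conflict_detected, store.read("README"))
```

Bob's first write is rejected because it was based on a version that Alice has already replaced. A system that silently accepted it would lose Alice's change, which is the trouble that ignoring conflicts leads to.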

I’m not aware of a minimum feature set for NoSQL datastores, so it’s hard to tell whether Git fulfils it. However, Git feels to me like using NoSQL for the right reason.

It’s about choosing the right tool for the job. But I can’t get rid of the feeling that NoSQL is too often chosen for the wrong reasons, query response time in particular. No doubt, NoSQL is a better fit for some applications. However, an index review would often solve the performance problems within a few days. SQL is no better than NoSQL, nor vice versa, because the question is not which is better. The question is which is the better fit for a particular problem.





