The questions we often hear are “Why create a new query language?” and “What’s wrong with SQL?”. This post contains answers to both.

Before we begin, let’s overview some of the history of how the relational model came to be, and how SQL was created.

Today, SQL is by far the most widely used database language. But that does not necessarily mean that it represents the best of what we can do. In fact, SQL’s beginnings as a “simple, ad-hoc” language coupled with “design by implementation” from competing vendors had left the language with a baggage of severe issues.

IBM had an overwhelmingly large influence over the tech market at the time, so SQL became a de facto standard for relational databases, and then a proper standard with the publication of the first ANSI standard in 1989 that essentially circumscribed the most prominent existing implementations of SQL. Subsequent versions of the standard continued to be primarily influenced by the commercial vendors.

In 1974 Donald Chamberlin and Raymond Boyce published a paper [ 2 ], which introduced “a set of simple operations on tabular structures, […] of equivalent power to the first order predicate calculus”. Chamberlin and Boyce felt that the formal relational query languages proposed at the time were too hard to understand for “users without formal training in mathematics or computer programming” and thought that the “predominant use of the language would be for ad-hoc queries” [ 3 ]. Initially, the authors did not consider SQL to be a “serious” language. Nonetheless, the great interest in the commercial application of the relational model had pushed IBM to quickly adopt and productize SQL, which was also picked up by their quickly-rising competitor–Oracle.

Codd’s ideas were revolutionary. For the first time, a database, and a universal way to query it, was described in a succinct, consistent mathematical model. This, naturally, created lots of interest in further research and, importantly, into practical implementation of the relational model.

The relational model was introduced by Edgar F. Codd in a seminal 1970 paper “A Relational Model of Data for Large Shared Data Banks” [ 4 ]. There, Codd postulated that all data in a database can be represented in terms of sets of tuples, called relations. Codd also invented a form of first-order predicate logic to describe the database queries: tuple relational calculus.

Critique of SQL

SQL, especially its earlier versions, was heavily criticized by the experts, including Codd himself [6], as well as C. J. Date, who published a multitude of papers and books on the subject ([7], [9], [10]). While many of the early shortcomings were fixed in the later versions of the standard, some of the serious issues had been only further ingrained.

Some of the complaints here apply to SQL as a whole, while others are specific to a certain implementation. We are primarily using PostgreSQL for the examples.

SQL’s shortcomings can be grouped into these categories:

lack of proper orthogonality — SQL is hard to compose;

lack of compactness — SQL is a large language;

lack of consistency — SQL is inconsistent in syntax and semantics;

poor system cohesion — SQL does not integrate well enough with application languages and protocols.

Lack of Orthogonality Orthogonality in a programming language means that a relatively small set of primitive constructs can be combined in a relatively small number of ways. A language with good orthogonality is smaller, more consistent, and is easier to learn due to there being few exceptions from the overall set of rules. Conversely, bad orthogonality leads to a large language with many exceptions and caveats. A good example of orthogonality in a programming language is the ability to substitute an arbitrary part of an expression with a variable, or a function call, without any effect on the final result. In SQL, such generic substitution is not possible, since there are two mutually incompatible kinds of expressions: A table expression is a SQL expression that yields a table: SELECT * FROM table .

A scalar expression is a SQL expression that yields a single scalar value: SELECT count(*) FROM table . Table expressions can only be used in a FROM clause, in a function, or with an operator that specifically expects a table expression as input. What’s worse the scalar and table expressions may have exactly the same syntax, which can be a source of further confusion. For example, let’s imagine we needed to list the names of all department heads in a company. This query would do the job: SELECT name FROM emp WHERE role = 'dept head' Now let’s say we needed to add this bit to a larger query that extracts information about a department. An intuitive way is to simply add the above as a subquery to the target list of our larger query: SELECT dept . name , . . . ( SELECT name FROM emp WHERE role = 'dept head' AND deptno = dept . no ) FROM dept . . . This is legal, but only if the subquery returns not more than one row. Otherwise, an error would be raised at run time. To account for multiple department heads, we would need to rewrite the query using a join: SELECT dept . name , . . . head . name FROM dept INNER JOIN emp AS head ON ( head . deptno = dept . no AND head . role = 'dept head' ) . . . The difference in structure is large enough to make any sort of source-level query reuse impractical.

Lack of Compactness Few claim that SQL is an elegant language characterized by orthogonality. Some call it an elephant on clay feet. With each addition, its body grows, and it becomes less stable. SQL standardization is largely the domain of database vendors, not academic researchers without commercial interests or users with user interests. [1] Paolo Atzeni et al. SQL is not a small language. At the time of writing the PostgreSQL implementation contains 469 keywords. Just part 2 (out of 14) of the SQL:2016 standard has 1732 pages. The main reason is that SQL, in line with its original goals, strives to be an English-like language, catered to “non-professionals”. However, with the growth of the language, this verbosity has contributed negatively to the ability to write and comprehend SQL queries. We learnt this lesson with COBOL, and the world has long since moved on to newer, more succinct programming languages. In addition to keyword proliferation, the orthogonality issues discussed above make queries more verbose and harder to read.

Lack of Consistency SQL is arbitrarily inconsistent both in its syntax and semantics. What makes things even worse is that different databases have their own version of SQL, often incompatible with other SQL variants. Here are a few examples of entirely different calling conventions in SQL: SELECT substring ( 'PostgreSQL' from 8 for 3 ) ; SELECT substring ( 'PostgreSQL' , 8 , 3 ) ; SELECT trim ( both from 'yxSQLxx' , 'xyz' ) ; SELECT extract ( day from timestamp '2001-02-16 20:38:40' ) ; There are two syntaxes that specify the ordering of the input set in an aggregate function: SELECT array_agg ( foo ORDER BY bar ) and SELECT rank ( foo ) WITHIN GROUP ( ORDER BY bar ) The list of inconsistencies like this does not end here and can be continued, but that’s beyond the scope of this blog post.