HAVING A Blunderful Time

or

Wish You Were WHERE

by

Hugh Darwen

This article was prompted by the appearance of "A Call to Arms" [3]

by Jim Gray and Mark Compton. It is not a response to that article. Rather, it is something that I have long been wanting to put in writing. The authors' apparent claim to the effect that SQL DBMSs are "sagging at the knees" (even if they didn't really mean that), combined with the fact that one of the authors was a member of the famed System R team that gave us the original SQL, has provided a stimulus that is irresistible.

I question the competence in computer language design of the System R team. I do so by examining and criticising the design of SQL's notation for writing database queries. In spite of my strong words, I mean no personal offence to the members of that team, with some of whom I collaborated closely and was on very friendly terms during my time at IBM (1967-2004). Many of them were and are very fine software engineers, and Jim Gray in particular was one whose work (on locking) was of great help to me and my colleagues working on Business System 12 (1978-82). But an engineer does not a language designer make. According to my understanding, the System R team were engaged only in a proof-of-concept exercise and did not set out to devise a language that would take over the database industry in the way that SQL has. So they can perhaps be excused to some extent, though personally I question that the proof of concept was actually achieved by their SQL implementation, because they shirked the responsibility of a relational DBMS never to allow the same tuple to appear more than once in any relation, stored or derived. Happily, Business System 12, inspired by ideas passed on by the ISBL team also referred to in my article, was to show, to the few that saw this system in action during its brief life, that the feared burden of "duplicate row elimination" was not too much of a burden after all.

In the 1970s two research teams in IBM were notable among those working on implementation of E.F. Codd's relational model. One was a comparatively small team working on a project called PRTV [4] and a language called ISBL at IBM United Kingdom's Scientific Centre at Peterlee in the north-east of England. The other was a larger team working in the United States on a project called System R and a language that was eventually called SQL. One of these teams was rather good at computer language design; the other was not. Both teams had some very skilled software engineers.

Perhaps the most important issue left open by Codd's early papers, and therefore to be faced up to by teams such as these, was that of how to design an industry-strength relational database computer language—in particular, a notation for writing relational expressions, as prescribed by the model for defining queries and constraints. Codd's relational algebra and calculus did of course provide an agreeable basis to work on, but certain important issues were left open, such as:

· how to enforce "union compatibility", where required;

· how to handle attributes of the same name when joining relations; and

· how to incorporate "calculations" such as SALARY+BONUS and AVG(SALARY) into queries.

These issues were neatly addressed by the ISBL team, who had sensibly decided that the first thing we needed was a direct implementation of Codd's relational algebra. After all, at that time we already had a very good understanding of how to write expressions by invoking operators (an algebra being, among other things, a collection of operators). Because an argument (denoting a operand) to an invocation can itself be an invocation, we can build expressions of arbitrary complexity and thus be sure that the language satisfies some agreeable criterion of completeness. Codd had given us such a criterion. He called it relational completeness and we understood and accepted it.

Given the existence of a good implementation of relational algebra, it was imagined that "higher level" languages, if desired, could naturally follow.

How to enforce union compatibility?

In retrospect, it seems to me a bit strange that this was an open issue at all. Codd's model required that every attribute value (in the tuples of a relation) be explicitly accessible. It also required that no significance be attached to any ordinal positions of attributes within a relation. In other words, every attribute must be referenceable, but not by its ordinal position. What options remain, if ordinal position is excluded? The ISBL team probably did not think their idea of using names to be very remarkable. The System R team decided instead to violate the model and use ordinal position anyway—a decision that, when I learned about it, I could put down only to careless study of the model or careless consideration of its implications.

The consequences of the System R team's decision were far-reaching and disastrous. The only way of maintaining UNION's commutativity was to make each column of the result anonymous (except perhaps in the special case where each pair of corresponding columns happen to correspond by name as well as by ordinal position). And yet, columns were not referenceable by ordinal position (except in the ORDER BY clauses of cursor declarations), so depriving a column of a name made it completely unreferenceable.

Now, the 1992 edition of the international standard for SQL sought to correct the System R team's error by adding UNION CORRESPONDING to the language, this being a true relational union operator. One has to wonder: if the original UNION had been based on column correspondence by name, would anybody ever have considered including the other form too? A similar remark can be made in connection with NATURAL JOIN, as I now explain.

How to handle attributes of the same name when joining relations?

If relations r1 and r2 are to be joined and each have an attribute named a1, how do we avoid having two attributes, both named a1, in the result?

The ISBL team based their solution directly on predicate logic, noting that JOIN is the relational counterpart of AND. If predicates p1 and p2 both contain the free variable a1, then that free variable must appear more than once in the predicate "p1 and p2". But multiple appearances of the same variable in the same predicate necessarily stand for the same thing! (In case you doubt this, consider the predicate, "a loves b even though b is stupid".)

The ISBL team realized, therefore, that what became known as "natural join" was really the fundamental join, and they defined their join operator (*) to have the semantics of natural join, whereby the join condition is implicitly "equality on common attributes" and only one of each pair of common attributes appears in the result. It had been thought by some that this approach would lack completeness, because you might for example, want the joining condition to involve a "less than" comparison on common attributes, or you might want a pair of common attributes not to be involved at all in the joining condition. The ISBL team recognized that problem and solved it neatly by providing an attribute renaming operator (of which the RENAME in Tutorial D [1] is a direct descendant, of course.)

The ISBL solution had the following two great advantages: (a) there is no need to spell out the joining condition for the joins most commonly needed in practice; and (b) there is no need to write out a long list of attribute names in order to remove unwanted duplicates. SQL users are painfully aware of those two inconveniences (though some products do now at last support the standard's NATURAL and USING varieties of JOIN).

The System R team saw no harm in having two or more columns of the same name in the result of a SELECT-FROM-WHERE expression, because they didn't at that time recognize the requirement for such expressions to be permitted as operands of the FROM clause. Instead of equating column names to predicate variables, they opted for qualification of column names in the style proposed by Codd in his relational calculus, using range variables. At first sight this obviates the need for a RENAME operator and SQL still doesn't have such an operator to this day (though the "AS <column name>" feature that was added to the SELECT clause in the 1992 edition of the international standard went a little way towards addressing the requirement).

As a consequence of the System R team's lack of foresight back in the 1970s, we now have a ridiculous state of affairs in SQL whereby the expression SELECT C1 AS X, C2 AS X FROM T is legal (!) and furthermore, if that expression is written as an operand in a FROM clause, neither of its columns can be referenced in any of the clauses of the outer expression containing that FROM clause. (We also have some really baroque scoping rules for range variables, far too complicated for me to try to explain here.) Even worse (I would say), a query whose result contains two columns of the same name (e.g., SELECT * FROM T1, T2 WHERE T1.C1 = T2.C1) is a legal expression that cannot legally be used as the "AS" operand of a CREATE TABLE ... AS ... statement. (There's a lesson in language design here. We can perhaps forgive the System R team for not thinking of CREATE TABLE ... AS ..., but if they had paid proper attention to language design principles in the first place, CREATE TABLE ... AS ... would have been supported for all queries automatically.)

How to incorporate calculations?

In Tutorial D [1] we have the operators EXTEND and SUMMARIZE for this purpose. EXTEND has a direct counterpart in ISBL, as does the "BY" variety of SUMMARIZE. (The "PER" variety was invented by Chris Date and myself to add support for certain aggregations over the empty set that you don't get with the "BY" variety.)

Here the ISBL team was working in the dark, for Codd had not addressed this issue (at least not in depth). Rather mysteriously, his stance was that calculations should be done in the host languages. This didn't make sense to me. If he meant that the application program has to "do calculations" on the result of a query, then we are restricted to "doing calculations" as the last step of a query and we cannot use the results of calculations in relational joins, restrictions, etc. If instead he meant that host language calculations would somehow be "imbedded" in relational expressions, how were we to do that? And in particular, how could the results of calculations be used to give rise to a relation of higher degree than the input relation (à la EXTEND, as we now know)?

What was even more puzzling to many of us was that Codd's relational completeness criterion did not take EXTEND- and SUMMARIZE-style calculations into account. For us database practitioners of the time, a language that didn't support calculations was in no way complete!

The System R team also grappled with the problem of incorporating calculations. The result of their deliberations is seen in SQL's support for expressions such as SALARY+BONUS and AVG(SALARY) in the SELECT clause, and the GROUP BY and HAVING clauses.

Ah, the HAVING clause! That brings me to the title of this paper and its real motivation. At least, it starts to. But we need to look at SELECT-FROM-WHERE-GROUP BY before we can look properly at HAVING. Bear with me.

SELECT-FROM-WHERE

Do you take SELECT-FROM-WHERE for granted, or do you, like me, find it rather curious that the System R team should have spurned the normal way of writing expressions of arbitrary complexity in favour of something utterly idiosyncratic and, one might say, rather dictatorial? Actually, after much reflection, I have arrived at an educated guess as to why they might have taken this approach. I suppose I could ask some of them if my guess is anywhere near the truth, but I'm not sure if I would get a definitive answer and even if I did get one in the negative I'm not sure it would entirely dispel my suspicion.

The fact is that in the 1960s various scripting languages (as we tend to call such things these days) had come about for the purposes of report generation, especially ad hoc report generation. We had one such language in the prerelational DBMS called Terminal Business System (TBS) that I worked on for IBM from 1969-77. Our language required the user to specify the required report in a series of steps that had to be given in the prescribed order, something like this:

1. FILE <filename> (specifying the input file for the report—the structure of the file was predefined in a language analogous to SQL's CREATE TABLE and was required to be uniform, like that of a table)

2. SET <new field name> = <calculation> (any number of SET statements could be given here, and the <calculation> could reference results of previous SET statements as well as fields in the file).

3. IF <condition> (just like SQL's WHERE)

4. PRINT <fields> (like SQL's SELECT, with field names only)

5. SORT <sort specification> (like SQL's ORDER BY)

6. <control break specifications> (sorry, but I forget the details)

That last step somehow allowed the user to specify, for example, totals and averages that might be required at various "levels" determined by the <sort specification>.

Note carefully that the SET statements came before anything else, thus allowing the introduced variables to be referenced subsequently. Isn't it amazing that the System R team decided that calculations, if their results are to be included in the result of the query, should be required to appear at the end? (the logical end, I mean—the SELECT clause, though this is the physical beginning as it happens!) And isn't it equally amazing that they didn't think of allowing results of calculations to be given names for subsequent reference and to avoid the need for repetition? As a consequence of this apparent thoughtlessness, paying customers had to write expressions such as

SELECT E#, NAME, SALARY + BONUS

FROM EMP

WHERE SALARY + BONUS >= 5000

Why on earth didn't they complain, or (better) just refuse to pay? In my opinion the user community, as much as the System R team, must take a share of the blame for the mess that SQL is. And also in my opinion, the database industry could have been placed on a much sounder footing if Codd himself had been much more critical of SQL at the outset.

Curiously, invocation of TBS's report generator required at least three lines of input: the FILE line, IF line, and PRINT line were all mandatory. TBS's counterpart of SELECT * FROM T was FILE T; IF ANY; PRINT ALL. In effect we had a compulsory WHERE clause as well as compulsory SELECT and FROM clauses!

Now, although I was working on TBS in a mainly British team working in the United Kingdom, we had taken over the development of the product from an IBM team in the USA, and it was that American team that had devised the report generator. A somewhat similar but much more sophisticated report generator was later developed by IBM in the US, as part of a product called (prosaically, as was IBM's style in those days) Generalized Information System (GIS). I cannot remember if GIS could be said to have qualified as a DBMS, as TBS certainly did, but I do remember that it was pretty baroque and much derided in our neck of the woods.

Nor can I remember any details about GIS at all. But I can remember this: when I first looked at SQL, my immediate reaction was "Oh no! Son of GIS? Please not that!" I might have been quite wrong about this. The similarity I perceived might have been illusory and even if it was not, I have no firm evidence that anybody in the System R team was familiar with GIS. The fact remains that the general style of a fixed order of actions was the order of the day at the time. I postulate that SQL's SELECT-FROM-WHERE arose out of this fashion.

At this point it is useful to ponder the origin of the restriction, in System R's SQL and just about all implementations until well into the 1990s, to the effect that each element of the FROM clause must be specifically a table name (as opposed to a table expression of arbitrary complexity). I believe that it arose out of a belief that support for "nested queries" in the FROM clause was not needed for completeness, and it seems possible that Codd himself helped to foster this belief, for I know that he strongly asserted that such "nested queries" were not needed in a language based closely on his relational calculus (the style that he personally favoured, by the way, over the algebra).

The flaw in the completeness claim for SELECT-FROM-WHERE (plus UNION) without "derived tables in the FROM clause" becomes apparent only when you consider that other issue that Codd left pending: the incorporation of calculations. But when you do consider it, it becomes screamingly obvious (though it clearly eluded those System R stalwarts and many others, apparently including Codd himself, for he never referred to the matter in his "Fatal Flaws of SQL" presentations). Consider the question, "How many people do we employ and how many departments do we have?" To express that as a query you need to be able to join SELECT COUNT(*) FROM EMP with SELECT COUNT(*) FROM DEPT. Or the question, "In how many cities do we have either a supplier or a part?" For that you need to be able to aggregate over an invocation of UNION (and, now that you can at last do that, you get the wrong answer if you add the word ALL after UNION, as some consultants advise you always to do!).

Anyway, the SELECT-FROM-WHERE structure catered for the relational algebra's joins, restrictions, projections and extensions, and the added UNION catered for its unions. (EXCEPT catered for difference, but that was also catered for by the use of NOT EXISTS subqueries in the WHERE clause.) Only summarizations really remained to be catered for. (Although RENAME is required for completeness in a proper relational language, SQL's violations of the model in connection with column names make RENAME dispensable in SQL, though it would still be useful.)

I don't know if the System R team was aware of the elegant "glump" operator used for aggregation in ISBL, when they pondered how to add support for summarization to the SELECT-FROM-WHERE structure. Whether they were or not, the counterpart they came up with really is one of the ugliest constructs I have ever come across. To have adopted it at all is pretty shameful in my view; if they adopted it in the light of glump's beauty, then their shame is compounded. But to adopt glump would mean abandoning SELECT-FROM-WHERE altogether and reverting to a conventional language; I think that by this time they must have been irrevocably wedded to their chosen "structure". They were digging themselves into a hole but they just carried on digging.

GROUP BY

The decision to add a GROUP BY clause to the existing structure, in order to support aggregations, had some disastrous effects. The first was the spoiling of the SELECT clause, whose permitted content in the presence of a GROUP BY clause (including the presence of an implicit empty GROUP BY clause!) became quite different from its permitted content in the absence of GROUP BY. I have been trying to teach this stuff to students of Britain's Open University for the past 16 years. It is always a struggle to explain, even though I understand the reasons perfectly well, that every appearance of a column reference in the SELECT clause of a grouped query must either be a reference to a grouping column or be inside an invocation of an aggregate operator such as SUM.

And when I've got over that hurdle, then I have to teach them HAVING. Ah, HAVING!

HAVING

At this stage in my SQL lesson we are looking at the following query, whose result gives the highest mark obtained in each subject in the end-of-semester exams:

SELECT Subject, MAX ( Mark ) AS TopScore

FROM Exam_Marks

GROUP BY Subject

We now wish to consider refining the query so as to restrict the result to just those subjects in which nobody scored more than, say, 70. I ask the students to suggest how this might be done. They are quick to suggest this:

SELECT Subject, MAX ( Mark ) AS TopScore

FROM Exam_Marks

GROUP BY Subject

WHERE TopScore <= 70

and I have to disabuse them of that idea, on more than one count. First, the name TopScore isn't "in scope" (try explaining that if you haven't done so before—I know it isn't really difficult, but remember that you are dealing with beginners who are struggling to get their heads around the whole structure). Secondly, WHERE is ambiguous because it can also be used to apply to the result of the FROM clause. When I point this out, the students immediately protest that the GROUP BY clause is in the way and should therefore disambiguate the situation, so now I have to explain that the GROUP BY clause can be omitted, and what is meant by a SELECT clause including aggregations when the GROUP BY clause is omitted.

It's an awful lot to have to take in, all this, compared with just learning a few conventional operators.

Anyway, so now I show them

SELECT Subject, MAX ( Mark ) AS TopScore

FROM Exam_Marks

GROUP BY Subject

HAVING MAX ( Mark ) <= 70

and explain that HAVING means exactly the same as WHERE, but has different rules concerning what can be written in the condition that follows it and in the SELECT clause that physically precedes it.

And then I explain that if only the System R team had realized that their language was relationally incomplete without support for derived tables in the FROM clause, then they (and after them, Larry Ellison with his Oracle that brought SQL to the unsuspecting public at large) might have included such support and thus allowed their users (a) not to have to learn HAVING, and (b) to write

SELECT *

FROM ( SELECT Subject, MAX ( Mark ) AS TopScore

FROM Exam_Marks

GROUP BY Subject ) AS something

WHERE TopScore <= 70

And then somebody asks what the "AS something" is about, and I have to explain that the catalogue of language design blunders perpetrated by the System R team started a tradition that was continued by their successors. Note that the "AS TopScore", which ought to be mandatory because every column needs a name (and no two should have the same name, of course!), is optional (though if it is omitted here, the WHERE clause cannot be written), and the "AS something", which ought to be optional (because not every table needs a name), is mandatory. How, one might wonder, did this strange topsy-turvy state of affairs over the naming of things come about? Well, I have a hunch—about the column naming part of it at least.

Why "AS <column name>" is optional (perhaps)

I have already mentioned that the System R team mistakenly believed that their language was relationally complete without any need to support derived tables in the FROM clause. Now, if they had been correct in that belief, then it is true that they could just about have got away without requiring every column of a table to have a (unique) name. Mind you, to accept that we would have to accept the cavalier approach they took to UNION, but my point here is that if you can't have derived tables in the FROM clause, then there is never any possibility of referencing columns in such tables, so what does it matter if a column in a derived table has no name, or if two or more columns in the same derived table have the same name?

That said, I'm still surprised that the simple measure of allowing "AS <column name>" to follow an expression in the SELECT clause wasn't adopted at the outset, if only to make life easier for developers of general-purpose front ends. As it is, when "AS <column name>" did eventually get added to the international standard for SQL, we were unable to include a rule to the effect that the specified name must be unique, because the original SQL already permitted derived tables to have two or more columns of the same name. The Shackle of Compatibility means you can never recover from mistakes in language design.

And when at the same time as adding "AS <column name>" to the standard we were able to add support for derived tables in the FROM clause, it was too late to look upon the monster that had now been created, take stock and change tack. In case you think I'm exaggerating, just compare

SELECT *

FROM ( SELECT Subject, MAX ( Mark ) AS TopScore

FROM Exam_Marks

GROUP BY Subject ) AS something

WHERE TopScore <= 70

with

( SUMMARIZE Exam_Marks BY { Subject } ADD ( MAX ( Mark ) AS TopScore ) )

WHERE TopScore <= 70

(and I don't mean compare by counting the number of characters in each expression!)

I have got to the point of this article at last, it seems. At least, I've got to the point of its title. But now I wish to move on to the Real Point.

The Real Point

I have tried to expose the ineptitude of the System R team as language designers. Notice that for once I have made no mention of the very worst ways in which our cavaliers rode roughshod over the relational model, with their three-valued logic and their duplicate rows—those misbegotten concepts have a huge effect on the usability of the language, it is true, but they don't actually have much effect on the language design per se (though I remark in passing that having to explain to my students the need for "IS NULL" rather than "= NULL" can be rather tricky). An SQL without nulls and without duplicate rows would be almost indistinguishable from the real SQL. No DISTINCT after SELECT, no ALL after UNION, INTERSECT and EXCEPT, no IS (NOT) NULL, but the rest stays, and the rest is what I have scratched the surface of in this article, and the rest is, well, just awful.

And now somebody who was a member of that very team puts his name to published statements (in [3]) such as:

· "traditional relational database constructs—always cumbersome at best—are now clearly at risk of collapsing altogether"

· "classic relational database architectures are slowly sagging to their knees"

· "traditional relational databases [were] never designed to allow for the commingling of data and algorithms"

It seems that the authors of [3] equate "relational database" to implementations of SQL. If SQL implementations really are suffering as indicated—a view that I strongly support, by the way—then perhaps now would be a good time to start to turn away from SQL and to turn instead, very belatedly, to the relational model! It seems that Jim Gray and Mark Compton didn't even consider this possibility.

References

1. C.J. Date and Hugh Darwen. Foundation for Future Database Systems: The Third Manifesto (2nd edition). Reading, Mass.: Addison-Wesley (2000). 3rd edition to appear under the title Databases, Types, and The Relational Model, 2005. See also http://www.thethirdmanifesto.com.

2. C.J. Date, Hugh Darwen, and Nikos A. Lorentzos. Temporal Data and The Relational Model. San Francisco, Ca.: Morgan Kaufmann (2003).

3. Jim Gray and Mark Compton. "A Call to Arms", ACM Queue vol. 3, no. 3, April 2005, available at http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=293

4. S.J.P. Todd. "The Peterlee Relational Test Vehicle—A System Overview", IBM System Journal 15, No. 4 (1976).