Practice A Co-Relational Model of Data For Large Shared Data Banks

Credit: iStockPhoto.com

Fueled by their promise to solve the problem of distilling valuable information and business insight from big data in a scalable and programmer-friendly way, noSQL databases have been one of the hottest topics in our field recently. With a plethora of open source and commercial offerings (Membase, CouchDB, Cassandra, MongoDB, Riak, Redis, Dynamo, BigTable, Hadoop, Hive, Pig, among others) and a surrounding cacophony of technical terms (Paxos, CAP, Merkle trees, gossip, vector clocks, sloppy quorums, MapReduce, and so on), however, it is hard for businesses and practitioners to see the forest for the trees. The current noSQL market satisfies the three characteristics of a monopolistically competitive market: the barriers to entry and exit are low; there are many small suppliers; and these suppliers produce technically heterogeneous, highly differentiated products.12 Monopolistically competitive markets are inconsistent with the conditions for perfect competition. Hence in the long run monopolistically competitive firms will make zero economic profit.

In the early 1970s, the database world was in a similar sorry state.14 An overabundance of database products exposed many low-level implementation details and, as database people like to say, forced programmers to work at the physical level instead of the logical level. The landscape changed radically when Ted Codd proposed a new data model and a structured query language (SQL) based on the mathematical concept of relations and foreign-/primary-key relationships.4 In the relational model, data is stored in conceptually simple containers (tables of rows), and queries over this data are expressed declaratively without knowledge of the underlying physical storage organization.

Codd's relational model and SQL allowed implementations from different vendors to be (near) perfect substitutes, and hence provided the conditions for perfect competition. Standardizing on the relational model and SQL created a secondary network effect around complementary producers such as educators, tool vendors, consultants, etc., all targeting the same underlying mathematical principles. Differences between actual relational database implementations and SQL dialects became to a large extent irrelevant.7

Today, the relational database market is a classic example of an oligopoly. The market has a few large players (Oracle, IBM, Microsoft, MySQL), the barriers to entry are high, and all existing SQL-based relational database products are largely indistinguishable. Oligopolies can retain high profits in the long run; today the database industry is worth an estimated $32 billion and still growing in the double digits.

In this article we present a mathematical data model for the most common noSQL databasesnamely, key/value relationshipsand demonstrate that this data model is the mathematical dual of SQL's relational data model of foreign-/primary-key relationships. Following established mathematical nomenclature, we refer to the dual of SQL as coSQL. We also show how a single generalization of the relational algebra over setsnamely, monads and monad comprehensionsforms the basis of a common query language for both SQL and noSQL. Despite common wisdom, SQL and coSQL are not diabolically opposed, but instead deeply connected via beautiful mathematical theory.

Just as Codd's discovery of relational algebra as a formal basis for SQL shifted the database industry from a monopolistically competitive market to an oligopoly and thus propelled a billion-dollar industry around SQL and foreign-/primary-key stores, we believe that our categorical data-model formalization and monadic query language will allow the same economic growth to occur for coSQL key-value stores.

Back to Top

Objects Versus Tables

To set the scene let's look at a simple example of products with authors and recommendations as found in the Amazon SimpleDB samples, and implement it using both object graphs and relational tables.

While we don't often think of it this way, the RAM for storing object graphs is actually a key-value store where keys are addresses (l-values) and values are the data stored at some address in memory (r-values). Languages such as C# and Java make no distinction between r-values and l-values, unlike C or C++, where the distinction is explicit. In C, the pointer dereference operator *p retrieves the value stored at address p in the implicit global store. In the rest of this article we conveniently confuse the words object (graph) and key-value store.

In C# (or any other modern object-oriented language) we can model products using the following class declaration, which for each product has scalar properties for title, author, publication date, and number of pages, and which contains two nested collectionsone for keywords and another for ratings:

Given this class declaration, we can use object initializers to create a product and insert it into a collection using collection initializers:

The program produces in memory the object graph shown in Figure 1.

Note that the two nested collections for the Keywords and Ratings properties are both represented by actual objects with their own identity.

Using the LINQ (Language Integrated Query) comprehension syntax introduced in C# 3.0,11 we can find the titles and keywords of all products that have four-star ratings using the following query:

The LINQ comprehension syntax is just syntactic sugar for a set of standard query operators that can be defined in any modern programming language with closures, lambda expressions (written here as rating==>rating == "****" ), or inner-classes such as Objective-C, Ruby, Python, JavaScript, Java, or C++. The C# compiler translates the previous query to the following de-sugared target expression:

The various values in the query result, in particular the Keywords collection, are fully shared with the original object graph, and the result is a perfectly valid object graph, shown in Figure 2.

Now let's redo this example using the relational model. First of all, we must normalize our nested Product class into three flat tables, for Products, Keywords , and Ratings respectively, as shown in Figure 3. Each value in the relational world requires a new primary key property (here all called ID). Furthermore, the Keywords and Ratings tables require an additional foreign-key property ProductID that encodes the one-to-many association between Products and Keywords and Ratings . Later we will decorate these class declarations with additional metadata to reflect the underlying database tables.

In most commercial relational database systems, tables are defined by executing imperative CREATE TABLE DDL (data definition language) statements.

As usual in the relational world, we do not model the individual collections of keywords and ratings for each product as separate entities, but instead we directly associate multiple keywords and ratings to a particular product. This shortcut works only for one-to-many relationships. The standard practice for many-to-many relationships (multivalued functions) requires intersection tables containing nothing but pairs of foreign keys to link the related rows.

Perhaps surprisingly for a "declarative" language, SQL does not have expressions that denote tables or rows directly. Instead we must fill the three tables in an imperative style using loosely typed DML statements, which we express in C# as shown in Figure 4.

These DML statements create three tables filled with the rows as shown in Figure 5.

An important consequence of normalizing a single type into separate tables is that the database must maintain referential integrity to ensure that: the foreign-/primary-key relationships across related tables remain synchronized across mutations to the tables and rows; the primary key of every row remains unique within its table; and foreign keys always point to a valid primary key. For example, we cannot delete a row in the Products table without also deleting the related rows in the Keywords and Ratings tables.

Referential integrity implies a closed-world assumption where transactions on the database are serialized by (conceptually) suspending the world synchronously, applying the required changes, and resuming the world again when referential integrity has been restored successfully, rolling back any chances otherwise. Assuming a closed world is, as we claim, both a strength and a weakness of the relational model. On the one hand, it simplifies the life of developers via ACID (atomicity, consistency, isolation, durability) transactions (although in practice, for efficiency, one must often deal with much weaker isolation levels) and allows for impressive (statistics-based) query optimization. The closed-world assumption, however, is in direct contradiction with distribution and scale-out. The bigger the world, the harder it is to keep closed.

Returning to our example, we present the naïve query to find the titles and keywords of all products that have four stars, expressed directly in terms of the relational model. It creates the cross-product of all possible combinations of Products, Keywords , and Ratings , and selects only the title and keyword where the keyword and rating are related to the product and the rating has four stars:

The result of this query is the row set shown in Figure 6. Disappointingly, this row set is not itself normalized.

In fact, to return the normalized representation of our object-graph query, we need to perform two queries (within a single transaction) against the database: one to return the title and its primary key, and a second query that selects the related keywords.

What we observe here is SQL's lack of compositionalitythe ability arbitrarily to combine complex values from simpler values without falling outside the system. By looking at the grammar definition for table rows, we can immediately see that SQL lacks compositionality; since there is no recursion, rows can contain only scalar values:

Compare this with the definition for anonymous types, where a row can contain arbitrary values, including other rows (or nested collections):

SQL is rife with noncompositional features. For example, the semantics of NULL is a big mess: why does adding the number 13 to a NULL value, 13+NULL , return NULL, but summing the same two values, SUM(13, NULL) , returns 13?

Also, even though query optimizers in modern SQL implementations are remarkably powerful, the original query will probably run in cubic time when implemented via three nested loops that iterate over every value in the three tables. A seasoned SQL programmer would instead use explicit join syntax to ensure that the query is as efficient as our object-based query:

Depending on the encoding of the nesting of the result of a join using flat result sets, the SQL programmer must choose among various flavors of INNER, OUTER, LEFT , and RIGHT joins.

Back to Top

Impedance Mismatch

In 1984 George Copeland and David Maier recognized the impedance mismatch between the relational and the object-graph model just described,5 and in the quarter century since, we have seen an explosion of O/R (object-relational) mappers that attempt to bridge the gap between the two worlds.

A more skeptical view of O/R mappers is that they are undoing the damage caused by normalizing the original object model into multiple tables. For our running example this means that we have to add back information to the various tables to recover the relationships that existed in the original model. In this particular case we use the LINQ-to-SQL custom metadata annotations; other O/R mappers use similar annotations, which often can be inferred from naming conventions of types and properties.

Note that the resulting object model is necessarily more complex than we started with since we are forced to introduce explicit representations for Rating and Keyword collections that did not exist in our original object model. The existence of the various foreign- and primary-key properties is further evidence that the O/R mapping abstraction is leaky.

Aside from those small differences, the net result of all this work is that we can now write the query to find all products nearly as concisely as we did before normalization:

Since the results must be rendered as object graphs, the O/R mapper will make sure that the proper nested result structures are created. Unfortunately, not every O/R mapper does this efficiently.9

It is not only the programmer who needs to recover the original structure of the code. The database implementer must also jump through hoops to make queries execute efficiently by building indexes that avoid the potential cubic effect that we observed earlier. For one-to-many relationships, indexes are nothing more than nested collections resulting from precomputing joins between tables to quickly find all the rows whose foreign keys point to a row with a particular primary key. Since the relational model is not closed under composition, however, the notion of index has to be defined outside the model.

Two natural indexes correspond to the relationship between Products and Ratings and Products and Keywords , respectively. For each product in the Products table, the ratings index contains the collection of all related ratings:

Similarly, for each product in the Product table, the keywords index contains the collection of all keywords related to that product :

If we visualize the indexes as additional columns on the Products table, the reversal of the original relationships between the tables becomes apparent. Each row in the Products table now has a collection of foreign keys pointing to the Keywords and Ratings tables much as the original object graph, as shown in Figure 7.

One of the advantages touted for normalization over hierarchical data is the ability to perform ad-hoc queriesthat is, to join tables on conditions not envisioned in the original data model. For example, we could try to find all pairs of products where the length of the title of one product is the same as the length of the author's name in the other using the following query:

Without an index, however, this query requires a full table scan and hence takes quadratic time in the length of the Products table.

The ability to create indexes makes a closed-world assumption. For example, if we modify the previous ad-hoc query to find all pairs of Web pages where one page has a URL referencing the other, it should be obvious that building an index for this join is quite a hard task when you do not have the whole Web available inside a single database:

Summarizing what we have learned so far, we see that in order to use a relational database, starting with a natural hierarchical object model, the designer needs to normalize the data model into multiple types that no longer reflect the original intent; the application developer must reencode the original hierarchical structure by decorating the normalized data with extra metadata; and, finally, the database implementer has to speed up queries over the normalized data by building indexes that essentially re-create the original nested structure of the data as well.

Back to Top

noSQL is coSQL

At this point it feels like the conceptual dissonance between the key-value and foreign-/primary-key data models is insurmountable. That would be a pity since clearly each has its strengths and weaknesses. Wouldn't it be great if we could give a more mathematical explanation of where the relational model shines and where the object-graph model works best? As it turns out, we can find the answer to this question by taking a closer look at the (in-memory) structures created for our running example in both models.

Let's start by slightly simplifying the object-graph example. We do so by removing the object identity of the Ratings and Authors collections to reflect more directly how they are modeled in the relational world. We inline the Keywords and Ratings items directly into the parent Product, as if we had value-based collections. Pictorially, we move from the diagram on the left to the one on the right in Figure 8:

For the tabular representation, we show explicit arrows for the relationship from foreign keys to primary keys. Again, pictorially, we move from the diagram on the left to the one on the right in Figure 9:

When we do a side-by-side comparison of the two rightmost diagrams, we notice two interesting facts:

In the object-graph model, the identity of objects is intensional that is, object identity is not part of the values themselves but determined by their keys in the store. In the relational model, object identity is extensional that is, object identity is part of the value itself, in the form of a primary key.

that is, object identity is not part of the values themselves but determined by their keys in the store. In the relational model, object identity is that is, object identity is part of the value itself, in the form of a primary key. Modulo the notion of object identity, the two representations are extremely similar; the only difference is that the arrows are reversed!

At this point, it appears there is a strong correspondence between these two representations: they both consist of a collection of elements (objects or rows) and a collection of arrows between the elements, the only difference being the direction of the arrows. Is there some precise way of describing such a situation? Fortunately, there is an entire area of mathematics designed exactly for this: category theory.1

Obviously, the precise formalization of SQL and noSQL as categories is outside the scope of this article, but it is illustrative to learn a little bit of category theory nonetheless. Category theory arose from studies of mathematical structures and an attempt to relate classes of structures by considering the relations between them. Thus, category theory expresses mathematical concepts in terms of objects, arrows between objects, and the composition of arrows, along with some axioms that the composition of arrows should satisfy. A computational view of category theory is that it is a highly stylized, compact functional language. Small as it is, it's enough to represent all of mathematics. For computer scientists, category theory has proved to be a rich source of techniques that are readily applicable to many real-world problems. For example, Haskell's approach to modeling imperative programming is lifted straight from a categorical model of the problem.

The first powerful concept from category theory that we will use is duality. Examples of duality abound in computer science. Every programmer is familiar with the two dual forms of De Morgan's law:

Other examples of dualities in computer science are between reference counting and tracing garbage collection, between call-by-value and call-by-name, between push- and pull-based collections, and between transactional memory and garbage collection, among many others.

Formally, given a category C of objects and arrows, we obtain the dual category co(C) by reversing all the arrows in C . If a statement T is true in C , then its dual statement co(T) is true in the dual category co(C) . In the context of this article, we can read "opposite statement" for "dual statement".

In the SQL category, child nodes point to parent nodes when the foreign key of a child node equals the primary key of the parent node (Figure 10). In the noSQL category, the arrows are reversed. Parent nodes point to child nodes when the child pointer in the parent equals the address of the child node in the store (see Figure 11).

In other words, the noSQL category is the dual of the SQL categorynoSQL is really coSQL. The implication of this duality is that coSQL and SQL are not in conflict, like good and evil. Instead they are two opposites that coexist in harmony and can transmute into each other like yin and yang. Interestingly, in Chinese philosophy yin symbolizes open and hence corresponds to the open world of coSQL, and yang symbolizes closed and hence corresponds to the closed world of SQL.

Because SQL and coSQL are mathematically dual, we can reason precisely about the tradeoffs between the two instead of relying on rhetoric and anecdotal evidence. The accompanying table gives a number of statements and their duals as they hold for SQL and coSQL, respectively.

If we really take the duality to heart, we may also choose to (but don't have to) fine-tune our model for key-value stores to reflect the duality between values and computations, and that between synchronous ACID and asynchronous BASE (basically available, soft state, eventually consistent).13

Looking up a value given its address or key in a key-value story can involve a computation, which in a truly open world has potential latency and may fail. For example, in the C# language getters and setters, known as properties, can invoke arbitrary code. Perhaps an even better example of a computation-driven key-value store with long latency and high probability of failure (always able to handle a 404) is the Web, with URI (Uniform Resource Identifier) as keys, "resources" as values, and the HTTP verbs as a primitive query and data-manipulation language. On the other hand, in a C-like key-value memory model, we usually make the simplifying assumption that a key lookup in memory takes constant time and does not fail.

Traversing a relationship in the closed world of the relational model involves comparing two values for equality, which is guaranteed to succeed because of referential integrity; and vice versa, referential consistency dictates that relationships are value-based. Otherwise, we could never be sure that referential consistency actually holds.

Note that comparing keys by value requires that objects in the SQL category are strongly typed, at least enough to identify primary and foreign keys; and dually, since we do not need to know anything about the value of a coSQL object to find it using its key, objects in the coSQL world can be loosely typed.

Back to Top

Relationship to the Real World

Our abstract model of the SQL category did not impose any restrictions on the structure of rows; we assumed only that we could determine a primary or foreign key to relate two rows.

In the typical relational model we would further impose the constraint that rows consist of flat sequences of scalar values (the so-called First Normal Form, or 1-NF). If we dualize relations in 1-NF, then we get a key-value model where values consist of either scalars or keys or collections of scalars or keys. Surprisingly, this is precisely the Amazon SimpleDB data model (see Figure 12).

If we assume that rows in the foreign-/primary-key model are simply blobs and keys are strings, then the dual key-value model is exactly the HTML5 key-value storage model:

Back to Top

A Little More Category Theory

So far we have discussed the basic data models for SQL and coSQL, but we have not yet touched upon queries. By applying a little more category theory we can show how a single abstraction, monads and monad comprehensions, can be used as a unified query language for both SQL and coSQL.

To talk about queries, we need to be more precise about what we mean by collections of values. Pure relational algebra is based on sets of rows, but actual relational databases use multisets (bags) or ordered multisets (permutations). To model collections abstractly, we look at sets/bags/permutations of rows and apply the category theory dictum: "What is the interface that these various collections of rows implement?" and "How do we generalize queries based on such an interface?"

First, let us stick with simple set collections. When we write a SQL query such as

the SQL compiler translates that pretty syntax into a relational-algebra expression in terms of selection, projection, joins, and Cartesian product. As is the custom in the relational world, the various operations are denoted using Greek symbols:

There is no reason to restrict the relational algebra operators to work over just sets (or bags, or permutations) of rows. Instead, we can define similar operators on any collection M<T> of values of arbitrary type T .

The interface for such abstract collections M<T> is a straightforward generalization of that of sets. It allows us to create an empty collection using the constant Ø ; create a singleton collection of type M<T> given some value of type T using the function {_} M<T> (the notation T M<T> denotes a function/closure/lambda expression that maps an argument value of type T to a result collection of type M<T> ); and combine two collections into a larger collection using the binary operator (depending on the commutativity and idempotence of , we obtain the various sorts of collections such as lists, permutations, bags, and sets):

Using these constructors, we can generalize the traditional relational algebra operators (selection P , projection F , and Cartesian product ×) to operate over generalized collections using the following signatures:

In fact, if we assume a single operator for correlated subqueries, which we call SelectMany , but is often called CROSS APPLY in SQL

then we can define the other operators in terms of this single one, as follows:

Rather incredibly, an interface of this shape is well known in category theory. It is called a monad, where the type constructor M< _ > is a functor of the monad; the constructor {_} is the unit of the monad; SelectMany is the bind of the monad; and Ø and are the neutral element and addition, respectively. For the rest of us, they are just the signatures for methods defined on an abstract interface for collections.

This is no theoretical curiosity. We can play the same syntactic tricks that SQL does with relational algebra, but using monads instead. Such monad comprehensions have been recognized as a versatile query language by both functional programmers and database researchers.8

LINQ queries are just a more familiar SQL-like syntax for monad comprehensions. Instead of Greek symbols, LINQ uses human-readable identifiers such as xs.Where(P) for P (xs) and xs.Select(F) for F (xs) . To accommodate a wide range of queries, the actual LINQ standard query operators contain additional operators for aggregation and grouping such as

Any data source that implements the standard LINQ query operators can be queried using comprehension syntax, including both SQL and coSQL data sources, as we show in the next section.

The .NET framework defines a pair of standard interfaces IEnumerable<T> and IQueryable<T> that are often implemented by data sources to support querying, but it is by no means necessary to use these particular interfaces. Other standard interfaces that support LINQ query operators include the IObservable<T> and IQbservable<T> interfaces that make it possible to use LINQ for complex event processing.10

Back to Top

Scalability and Distribution

In contrast to most treatments of noSQL, we did not mention scalability as a defining characteristic of coSQL. The openness of the coSQL model eases scalable implementations across large numbers of physically distributed machines.

When using LINQ to query SQL databases, typically similar rows are stored in tables of some concrete type that implements the IQueryable<T> interface. Relationships between rows are lifted to bulk operations over collections that perform joins across these tables; hence, queries take any number of related tables and from those produce a new one. Pictorially for three tables, this is shown in Figure 13.

Because relationships in SQL cross different tables, it is nontrivial to partition the single closed world into independent worlds of "strongly connected" components that can be treated independently. In a closed world, however, query optimizers can leverage all the information and statistics that are available about the various tables. This allows users to write declarative queries focusing on the "what" and letting the system take care of the "how."

In the coSQL case, a typical scenario is to have a single collection of type IQueryable<S> of (pointers to) self-contained denormalized "documents" of type S. In that case, queries have type IQueryable<S> R (see Figure 14).

When there are no cross-table relationships, collections {x 0 , x 1 ,...,x n1 } M<S> that are the source of coSQL queries can be naturally horizontally partitioned, or sharded, into individual subcollections {x 0 } {x} ... {x n-1 }, and each such subcollection {x i } can be distributed across various machines on a cluster.

For a large subset of coSQL queries, the shape of the query closely follows the shape of the data. Such homomorphic queries map a collection xs={x 0 } {x 1 } ... {x n-1 } to the value f(x 0 ) f(x 1 ) ... f(x n-1 )that is, they are of the form xs.Select(f).Aggregate( ) for some function f S R and binary operator RxRR . In fact, Richard Bird's first homomorphism lemma3 says that any function h M<S>R is a homomorphism with respect to if and only if it can be factored into a map followed by a reduce: h(xs) = xs.Select(f).Aggregate( ) . Mathematics dictates that coSQL queries are performed using MapReduce.6

Practical implementations of MapReduce usually slightly generalize Bird's lemma to use SelectMany instead of Select so that the map phase can return a collection instead of a single value, and insert an intermediate GroupBy as a way to "write" equivalence classes of values from the map phase into the key-value store for subsequent processing in the reduce phase, and then aggregate over each subcollection:

For example, DryadLINQ15 uses the type PartitionedTable<S>:IQuer yable<S> to represent the partitioned input collection for LINQ queries and then implements MapReduce over the partitioned collection using the function illustrated in Figure 15.

In an open world where collections are distributed across the network, it is much harder for a query optimizer to perform a global optimization taking into account latency, errors, etc. Hence, most coSQL databases rely on explicit programmatic queries of a certain pattern such as MapReduce that can be executed reliably on the target physical machine configuration or cluster.

Back to Top

Conclusion

The nascent noSQL market is extremely fragmented, with many competing vendors and technologies. Programming, deploying, and managing noSQL solutions requires specialized and low-level knowledge that does not easily carry over from one vendor's product to another.

A necessary condition for the network effect to take off in the noSQL database market is the availability of a common abstract mathematical data model and an associated query language for noSQL that removes product differentiation at the logical level and instead shifts competition to the physical and operational level. The availability of such a common mathematical underpinning of all major noSQL databases can provide enough critical mass to convince businesses, developers, educational institutions, etc. to invest in noSQL.

In this article we developed a mathematical data model for the most common form of noSQLnamely, key-value stores as the mathematical dual of SQL's foreign-/primary-key stores. Because of this deep and beautiful connection, we propose changing the name of noSQL to coSQL. Moreover, we show that monads and monad comprehensions (i.e., LINQ) provide a common query mechanism for both SQL and coSQL and that many of the strengths and weaknesses of SQL and coSQL naturally follow from the mathematics.

In contrast to common belief, the question of big versus small data is orthogonal to the question of SQL versus coSQL. While the coSQL model naturally supports extreme sharding, the fact that it does not require strong typing and normalization makes it attractive for "small" data as well. On the other hand, it is possible to scale SQL databases by careful partitioning.2

What this all means is that coSQL and SQL are not in conflict, like good and evil. Instead they are two opposites that coexist in harmony and can transmute into each other like yin and yang. Because of the common query language based on monads, both can be implemented using the same principles.

Back to Top

Acknowledgments

Many thanks to Brian Beckman, Jimmy "the aggregator" Nilsson, Bedarra-Dave Thomas, Ralf Lämmel, Torsten Grust, Maarten Fokkinga, Rene Bouw, Alexander Stojanovic, and the anonymous referee for their comments that drastically improved the presentation of this paper, and of course to Dave Campbell for supporting work on all cool things LINQ.

Related articles

on queue.acm.org

A Conversation with Erik Meijer and Jose Blakeley

http://queue.acm.org/detail.cfm?id=1394137

BASE: An Acid Alternative

Dan Pritchett

http://queue.acm.org/detail.cfm?id=1394128

Bridging the Object-Relational Divide

Craig Russell

http://queue.acm.org/detail.cfm?id=1394139

Back to Top

References 1. Awodey, S. Category Theory (2nd edition). Oxford University Press, 2010. 2. Baker, J., Bond, C. et al. Megastore: providing scalable, highly available storage for interactive services. Conference on Innovative Data Systems Research. (2011). 3. Bird, R. An introduction to the theory of lists. In Logic Programming and Calculi of Discrete Design. M. Broy, ed. Springer-Verlag (1987), 342. 4. Codd, T. A relational model of data for large shared data banks. Commun. ACM 13 (June 1970). 5. Copeland, G. and Maier, D. Making Smalltalk a database system. In Proceedings of the ACM SIGMOD International Conference on Management of Data.1984. 6. Fokkinga, M. MapReducea two-page explanation for laymen; http://www.cs.utwente.nl/~fokkinga/mmf2008j.pdf. 7. Ghosh, R.A. An economic basis for open standards (2005); flosspols.org. 8. Grust, T. 2003. Monad comprehensions: a versatile representation for queries. In The Functional Approach to Data Management. P. Gray, L. Kerschberg, P. King, and A. Poulovassilis, Eds. Springer Verlag, 2003, 288311. 9. Grust, T., Rittinger, J., Schreiber, T. Avalanche-safe LINQ compilation. Proceedings of the VLDB Endowment 3 (12), 2010. 10. Meijer, E. Subject/Observer is dual to iterator. Presented at FIT: Fun Ideas and Thoughts at the Conference on Programming Language Design and Implementation (2010); http://www.cs.stanford.edu/pldi10/fit.html. 11. Meijer, E., Beckman, B., Bierman, G. LINQ: reconciling objects, relations, and XML in the .NET framework. Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, 2006. 12. Pirayoff, R. Economics Micro & Macro. Cliffs Notes, 2004. 13. Pritchett, D. BASE: an Acid alternative. ACM Queue (July 2008). 14. Stonebraker, M., Hellerstein, J.M. What goes around comes around. In Readings in Database Systems (Fourth Edition). M. Stonebraker, and J.M. Hellerstein, eds. MIT Press, Cambridge, MA, 2005, 241. 15. Yuan Yu, M.I. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. Operating Systems Design and Implementation. 2008.

Back to Top

Authors Erik Meijer (emeijer@microsoft.com) has been working on "Democratizing the Cloud" for the past 15 years. He is perhaps best known for his work on the Haskell language and his contributions to LINQ and the Reactive Framework (Rx). Gavin Bierman (gmb@microsoft.com) is a senior researcher at Microsoft Research Cambridge focusing on database query languages, type systems, semantics, programming language design and implementation, data model integration, separation logic, and dynamic software updating.

Back to Top

Back to Top

Figures Figure 1. Object graph for Products collection. Figure 2. Object graph result for books with four star rating. Figure 3. Data declaration for Product database. Figure 4. Inserting values into Product database. Figure 5. Relational tables for Product database. Figure 6. Tabular result for books with four-star ratings. Figure 7. Keyword and Ratings index on Products table. Figure 8. Object graph for Products collection with keywords and ratings inlined. Figure 9. Relational tables for Products database with explicit relationships. Figure 10. SQL arrow from child to parent. Figure 11. coSQL arrow from parent to child. Figure 12. Amazon simpleDB representation of Products collection. Figure 13. Foreign key relationships between three relational tables. Figure 14. Collection of coSQL documents. Figure 15. Signature for MapReduce in DryadLINQ.

Back to Top

Tables Table. Consequences of the duality between SQL and coSQL.

Back to top

©2011 ACM 0001-0782/11/0400 $10.00 Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2011 ACM, Inc.

Comments

Anonymous

I like the idea but would one need to manually merge the result sets like LINQ'ing over SQL and XML or use the same query for both?

It would be handy to have coSQL take an IQueryable and return an IObservable or preferably Task of IObservable. As most GUIs only need the first x number of reference items for the initial display in the sequence. ie. drop down, lists before scrolling, etc. Having the rest come when ready or having the control cancel the request if not needed would be handy. "JIT data access".

- N2Cheval

Anil Shafeeque

Interesting subject. But I always have a feel that query on object models cannot perform like relational models. Or is this with assumption that performance can be achieved by distributed computing frameworks like Map reduce etc..

+Anil

CACM Administrator

The following letter was published in the Letters to the Editor in the July 2011 CACM (http://cacm.acm.org/magazines/2011/7/109897).

--CACM Administrator

Erik Meijer's and Gavin Bierman's article "A Co-Relational Model of Data for Large Shared Data Banks" (Apr. 2011) overreached by claiming equivalence between the Relational Model and NoSQL "key-value pairs" without regard to the definition of a data model by E.F. Codd more than 30 years ago. Finding similarity in NoSQL systems to some parts of the Relational Model, Meijer and Bierman mistakenly concluded the two are equivalent.

Codd, in his paper "Data Models in Database Management" in Proceedings of the 1980 Workshop on Data Abstraction, Databases and Conceptual Modeling (http://portal.acm.org/citation.cfm?id=806891) defined a data model as comprising three components: data structures to represent well-formed expressions in first-order logic; operators closed over these structures, permitting inferencing; and integrity constraints to enforce internal consistency.

NoSQL systems have no data model so defined. All else is commentary.

Meijer and Bierman ignored logic and inferencing and did not explain how key-value systems recognize, let alone enforce, integrity constraints. They cited referential integrity a form of integrity constraint as an exogenous cost relational databases bear to correct for a deficiency. The truth is actually the opposite; consistency is a central obligation of any database-management system. The lack of constraint-checking in key-value systems imposes the constraint-checking burden on the application, a situation the Relational Model was invented specifically to correct.

Codd encountered a similar lack of understanding in his day. In the same proceedings paper, he wrote, "In comparing data models people often ignore the operators and integrity rules altogether. When this occurs, the resulting comparisons run the risk of being meaningless."

Codd's landmark article "A Relational Model of Data for Large Shared Data Banks" (Communications, June 1970) addressed other points raised by Meijer and Bierman, including path independence. An interested reader would learn much in an evening spent with that one article alone.

Object Relational Mapping libraries and NoSQL systems attempt to solve (through technical means) a nontechnical problem: reluctance of talented people to master the Relational Model, and thus benefit from its data consistency and logical inferencing capabilities. Rather than exploit it and demand more relational functionality from DBMS vendors, they seek to avoid and replace it, unwittingly advocating a return to the fragile, unreliable, illogical systems of the 1960s, minus the green-bar fanfold paper.

James K. Lowden

New York

----------------------------------------------------

AUTHORS' RESPONSE:

Lowden's comment contains a number of errors. Our article was, in fact, explicitly critical of the lack of an agreed data model for NoSQL. We didn't ignore "inferencing," proposing instead a query language based on monad comprehensions interestingly, the same query language we prefer for the relational model. We did not assert that the relational and key-value models are equivalent, but rather dual. The issue of weakening consistency checking goes to the heart of the interest in NoSQL systems and is beyond the scope of our article.

Erik Meijer

Redmond, WA

Gavin Bierman

Cambridge, U.K.

CACM Administrator

The following letter was published in the Letters to the Editor in the June 2011 CACM (http://cacm.acm.org/magazines/2011/6/108669).

--CACM Administrator

I would like to clarify Erik Meijer's and Gavin Bierman's statement concerning the origin of SQL in their article "A Co-Relational Model of Data for Large Shared Data Banks" (Apr. 2011) that said, "The landscape changed radically when Ted Codd proposed a new data model and a structured query language (SQL) based on the mathematical concepts of relations and foreign/primary key relationships." There were actually additional steps and additional people involved getting from Codd's influential 1970 Communications article "A Relational Model of Data for Large Shared Data Banks" (cited by Meijer and Bierman) to the SQL language. In it and in subsequent articles and papers, Codd discussed query languages based on both relational calculus and relational algebra. But it was Raymond Boyce and Donald Chamberlin who, after studying Codd's work, defined the SEQUEL language, later renamed SQL, as first documented by Chamberlin and Boyce.(1) For more on how Codd's ideas influenced Boyce and Chamberlin and the rest of the System R team, see Chamberlin.(2)

Paul McJones

Mountain View, CA

REFERENCES

(1) Chamberlin, D. D. and Boyce, R. F. SEQUEL: A structured English query language. In Proceedings of the 1974 ACM SIGFIDET (Now Sigmod) Workshop on Data Description, Access and Control (Ann Arbor, MI, May 13). ACM Press, New York, 1974, 249264.

(2) Chamberlin, D. Oral History, Paul McJones, interviewer. Computer History Museum, Mountain View, CA, July 21, 2009; http://www.computerhistory.org/collections/accession/102702111

Displaying all 4 comments