08 September 2016

Heterogeneous Collections

A common problem in database design is heterogeneous collections or references. This is solved in many programming languages with sum types, but these are not typically supported in databases, especially relational ones.

The example I will use throughout this article is a cutlery store with a shopping cart. The cart could contain many types items: knifes, forks and spoons. Each item has its own set of attributes which we may want to sort or compare on. These items should be able to be inserted in to our database in a way that does not permit erroneous data, and it should also be possible to display the contents of the cart with a single query returning all attributes for each item. How do we model this?

Contents:

Terminology

Referential integrity is what a foreign key constraint provides. It ensures that an identifier references a valid row in another table at all times.

Entity integrity is what a primary key constraint provides. It ensures that an attribute or set of attributes in a table reference a unique row at all times.

Note

SQL snippets will be PostgreSQL specific. The result of the cart query will always look something like this:

item_type | item_id | price | attributes -----------+---------+-------+-------------------- Fork | 3 | 6.0 | {"prongs" : 2} Knife | 1 | 5.0 | {"sharpness" : 10} Spoon | 4 | 3.0 | {"spooniness" : 8} Fork | 2 | 6.0 | {"prongs" : 1}

Tagged Key

This uses a tag to indicate the type of the ID. It is used in Django’s GenericForeignKey .

One way to display the cart in a single query is like this:

-- The set of all items. (item_type, item_id) is unique in this view. CREATE OR REPLACE VIEW items_view AS SELECT 'Knife' AS item_type , knife_id AS item_id , price , json_build_object ( 'sharpness' , sharpness ):: text AS attributes FROM Knives UNION SELECT 'Fork' AS item_type , fork_id AS item_id , price , json_build_object ( 'prongs' , prongs ):: text AS attributes FROM Forks UNION SELECT 'Spoon' AS item_type , spoon_id AS item_id , price , json_build_object ( 'spooniness' , spooniness ):: text AS attributes FROM Spoons ; -- Join against the item set using (item_type, item_id) CREATE OR REPLACE VIEW cart_view AS SELECT items_view . item_type , items_view . item_id , items_view . attributes FROM CartEntries INNER JOIN items_view ON CartEntries . item_type = items_view . item_type AND CartEntries . item_id = items_view . item_id ;

Advantages

It can refer to anything.

Disadvantages

It can refer to anything.

No enforced referential integrity.

An item cannot be referenced by a single ID. To reference an item both the ID and type are needed.

Multiple Tables with Unique IDs

If it can ensured that the ID of an item is unique amongst all items, it is possible to discard the tag. This can be achieved by by using a suitable UUID or sharing a sequence between all item types.

The cart query is similar to the tagged key query, but simplified as the type does not need to be checked:

-- The set of all items. item_id is unique in this view. CREATE OR REPLACE VIEW items_view AS SELECT 'Knife' AS item_type , item_id , price , json_build_object ( 'sharpness' , sharpness ):: text AS attributes FROM Knives UNION SELECT 'Fork' AS item_type , item_id , price , json_build_object ( 'prongs' , prongs ):: text AS attributes FROM Forks UNION SELECT 'Spoon' AS item_type , item_id , price , json_build_object ( 'spooniness' , spooniness ):: text AS attributes FROM Spoons ; -- Join against the item set using item_id CREATE OR REPLACE VIEW cart_view AS SELECT items_view . item_type , items_view . item_id , items_view . attributes FROM CartEntries INNER JOIN items_view USING ( item_id );

Advantages

An item can be referenced by a single ID. There is no need to keep track of an item’s type as in the ‘tagged key’ example.

Disadvantages

No enforced entity integrity. Primary keys cannot span across tables without additional tricks. A knife, spoon and fork could all have the same ID.

No enforced referential integrity.

Multiple Tables with Integrity

Some of the disadvantages of separate tables can be corrected by changing the way the cart is modelled:

Then the cart can be displayed with a union of joins:

CREATE OR REPLACE VIEW cart_view AS SELECT 'Knife' AS item_type , knife_id AS item_id , price , json_build_object ( 'sharpness' , sharpness ):: text AS attributes FROM CartKnifeEntries INNER JOIN Knives USING ( knife_id ) UNION SELECT 'Fork' AS item_type , fork_id AS item_id , price , json_build_object ( 'prongs' , prongs ):: text AS attributes FROM CartForkEntries INNER JOIN Forks USING ( fork_id ) UNION SELECT 'Spoon' AS item_type , spoon_id AS item_id , price , json_build_object ( 'spooniness' , spooniness ):: text AS attributes FROM CartSpoonEntries INNER JOIN Spoons USING ( spoon_id );

Advantages

Entity integrity.

Referential integrity.

Disadvantages

Cumbersome to create a table per type every time there is a need to refer to an item.

An item cannot be referenced by a single ID.

Product Types > Sum Types

This approach adds multiple nullable ID columns and one of them has an ID.

CREATE OR REPLACE VIEW cart_view as SELECT 'Knife' AS item_type , knife_id AS item_id , price , json_build_object ( 'sharpness' , sharpness ):: text AS attributes FROM CartEntries INNER JOIN Knives USING ( knife_id ) UNION SELECT 'Fork' AS item_type , fork_id AS item_id , price , json_build_object ( 'prongs' , prongs ):: text AS attributes FROM CartEntries INNER JOIN Forks USING ( fork_id ) UNION SELECT 'Spoon' AS item_type , spoon_id AS item_id , price , json_build_object ( 'spooniness' , spooniness ):: text AS attributes FROM CartEntries INNER JOIN Spoons USING ( spoon_id );

Disadvantages

Cumbersome to add a field per type.

An item cannot be referenced by a single ID.

To avoid accidental sporks, it requires a check that one and only one ID is used. In PostgreSQL it would look something like:

ALTER TABLE CartEntries ADD CHECK (( knife_id IS NOT NULL AND fork_id IS NULL AND spoon_id IS NULL ) OR ( knife_id IS NULL AND fork_id IS NOT NULL AND spoon_id IS NULL ) OR ( knife_id IS NULL AND fork_id IS NULL AND spoon_id IS NOT NULL ));

Subclassing

This is the traditional representation, and usually the preferred one. The main reason I see developers avoiding it is because it is not well supported by some ORMs which often have problems querying a set of heterogeneous types (the table definition is the easy part). I recommend building an ORM object on top of the view rather than, or in addition to, the tables themselves.

It’s also fine to use this pattern there are no shared attributes and the base table consists only of a primary key. This ensures entity and referential integrity.

-- The set of all items. item_id is unique in this view. As the -- extended attributes are in separate tables, there is no need to pick -- out individual attributes. The JSON structure is just the -- serialisation of the entire table. CREATE OR REPLACE VIEW items_view AS SELECT 'Knife' AS item_type , item_id , price , row_to_json ( Knives ):: text as attributes FROM Knives JOIN Items USING ( item_id ) UNION SELECT 'Fork' AS item_type , item_id , price , row_to_json ( Forks ):: text as attributes FROM Forks JOIN Items USING ( item_id ) UNION SELECT 'Spoon' AS item_type , item_id , price , row_to_json ( Spoons ):: text as attributes FROM Spoons JOIN Items USING ( item_id ); -- Join by item_id CREATE OR REPLACE VIEW cart_view AS SELECT items_view . item_type , items_view . item_id , items_view . attributes FROM CartEntries INNER JOIN items_view USING ( item_id );

Incidentally, if only the common properties are needed, this is a simple join. For example, for the total price of the items in the cart:

SELECT SUM ( price ) FROM CartEntries INNER JOIN Items USING ( item_id )

Advantages

Trivial to enforce entity and referential integrity.

Attributes that are common to all items can be placed in the base table to avoid repetition.

An item can be referenced by a single ID.

Disadvantages

With basic SQL there is no simple way to enforce how many subclasses are used. In this example the items are complete and disjoint, i.e. an item can only be a knife, fork or spoon; however, without further constraints it is possible to have a knife, fork and spoon with the same item_id or even have an item with no subclass at all. There are a few tricks to enforce this constraint, such as using triggers.

or even have an item with no subclass at all. There are a few tricks to enforce this constraint, such as using triggers. Sometimes has poor ORM support.

The Big Bag of Attributes

CREATE OR REPLACE VIEW cart_view AS SELECT Items . * FROM CartEntries INNER JOIN Items USING ( item_id );

Advantages

No joins needed, so it is one of the best performing choices.

An item can be referenced by a single ID.

Well suited to simplistic ORMs as it can map directly to a single object.

Disadvantages

Table bloat.

Accidental sporks. Awkward to enforce which set of attributes are used for each type and make sure your spoon doesn’t have prongs.

Probably a maintenance nightmare. Forgoes database design principles.

Schemaless

CREATE OR REPLACE VIEW cart_view AS SELECT Items . * FROM CartEntries INNER JOIN Items USING ( item_id );

Advantages

No joins needed, so it is one of the best performing choices.

A disadvantage of all the schema-based approaches are that adding or removing new types of items means manipulating the schema via DDL statements. The schemaless approach does not constrain data and so attributes do not have to be defined in advance. This is useful for user defined attributes.

An item can be referenced by a single ID.

Well suited to simplistic ORMs as it can map directly to a single object.

Simplicity. Initially, at least. It moves the costs to the application layer.

Disadvantages

The reasons why I dislike using schemaless data could be a blog post in itself, but I will try to briefly summarise: