The Ten Rules of Schema Growth

25 December 2016

Data outlives code, and a valuable database supports many applications over time. These ten rules will help grow your database schema without breaking your applications.



1. Prod is not like dev.

2. Grow your schema, and never break it.

growth is providing more schema

breakage is removing schema, or changing the meaning of existing schema
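The two definitions above can be turned into a rough check. This is a minimal sketch, assuming a schema is modeled as a plain mapping from attribute name to type name; `classify_change` and the sample schemas are hypothetical names made up for illustration, not part of any database API.

```python
def classify_change(old: dict[str, str], new: dict[str, str]) -> str:
    """Classify a schema change as 'growth', 'breakage', or 'no change'.

    Each schema is modeled as a mapping from attribute name to a
    description of its meaning (here, simply a type name).
    """
    # A name that disappears is breakage.
    removed = [name for name in old if name not in new]
    # A name that stays but changes meaning is also breakage.
    changed = [name for name in old if name in new and old[name] != new[name]]
    if removed or changed:
        return "breakage"
    if len(new) > len(old):
        return "growth"
    return "no change"


v1 = {":user/id": "string"}
v2 = {":user/id": "string", ":user/primary-email": "string"}
v3 = {":user/primary-email": "string"}

print(classify_change(v1, v2))  # growth
print(classify_change(v2, v3))  # breakage
```

A real check would need a richer notion of "meaning" than a type name, but the shape of the test stays the same: everything old must still be there, unchanged.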

3. The database is the source of truth.
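One way to read Rule 3 in practice is to ask the database itself what schema it holds. The following is a minimal sketch, assuming SQLite through Python's standard `sqlite3` module; the `users` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY)")
conn.execute("ALTER TABLE users ADD COLUMN primary_email TEXT")

# Ask the database, not a file in source control, what the schema is.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]
# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]

print(tables)   # ['users']
print(columns)  # ['id', 'primary_email']
```

SQLite reports what the schema is now but not when each piece was added, so answering "when did growth occur" would need extra bookkeeping in this setup.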

4. Growing is adding.

adding new attributes to an existing 'type'

adding new types

adding relationships between types
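The three kinds of addition listed above might look like this in a SQL database. This is only a sketch, assuming SQLite through Python's standard `sqlite3` module; the `users` and `orders` tables and their columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Iteration 1: an existing 'type' (table) with one attribute.
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY)")
conn.execute("INSERT INTO users (id) VALUES ('ada@example.com')")

# Growth: add a new attribute to an existing type.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Growth: add a new type.
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY)")

# Growth: add a relationship between types.
conn.execute("ALTER TABLE orders ADD COLUMN user_id TEXT REFERENCES users (id)")

# A query written against iteration 1 still works unchanged.
old_query_result = conn.execute("SELECT id FROM users").fetchall()
print(old_query_result)  # [('ada@example.com',)]
```

Nothing was dropped or renamed, so a program that only knows the iteration 1 schema cannot tell that anything happened.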





5. Never remove a name.

6. Never reuse a name.

7. Use aliases.

In iteration 1, users of your system are identified by their email with an attribute named :user/id.

In iteration 2, you discover that users sometimes have non-email identifiers and that you want to store a user's email even when not using the email as an identifier. In short, you wish that :user/id was named :user/primary-email.
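In a SQL database, one rough stand-in for an alias is a view that exposes the same column under both names. This is a sketch under stated assumptions: SQLite through Python's standard `sqlite3` module, with a hypothetical `users` table and a hypothetical `users_named` view. A view is a read-only approximation and not the same thing as a database that lets one attribute carry two names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY)")
conn.execute("INSERT INTO users (id) VALUES ('ada@example.com')")

# Expose the preferred name alongside the old one; nothing is renamed or removed.
conn.execute(
    "CREATE VIEW users_named AS SELECT id, id AS primary_email FROM users"
)

# Older programs keep using the old name, newer programs use the new one.
old_name = conn.execute("SELECT id FROM users").fetchall()
new_name = conn.execute("SELECT primary_email FROM users_named").fetchall()
print(old_name == new_name)  # True
```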

8. Namespace all names.

Examples of namespaced names: :inventory/id, :order/id, :user/id, :user-v2/id

9. Annotate your schema.

You could annotate names that are not recommended for new programs with a :schema/deprecated flag, or you could get fancier still with :schema/deprecated-at or :schema/deprecated-because. Note that such deprecated names are still never removed (Rule Five).

You could provide :schema/see-also or :schema/see-instead pointers to more current conventions.

add N new columns

(optional) add a :schema/split-into attribute on the original column whose value is the new columns, and possibly even the recipe for the split
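Annotations like these can live in the database next to the data they describe. The sketch below assumes SQLite through Python's standard `sqlite3` module; the `schema_annotations` table, its layout, and the `users.id` naming are hypothetical choices for illustration, and only the annotation keys echo the ones named above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, primary_email TEXT)")

# Hypothetical annotation table: one row per (name, annotation key).
conn.execute(
    """
    CREATE TABLE schema_annotations (
        name  TEXT NOT NULL,
        key   TEXT NOT NULL,
        value TEXT NOT NULL,
        PRIMARY KEY (name, key)
    )
    """
)
conn.executemany(
    "INSERT INTO schema_annotations (name, key, value) VALUES (?, ?, ?)",
    [
        ("users.id", "schema/deprecated", "true"),
        ("users.id", "schema/see-instead", "users.primary_email"),
    ],
)

# Both people and programs can now ask the database what is deprecated.
deprecated = conn.execute(
    "SELECT name FROM schema_annotations WHERE key = 'schema/deprecated'"
).fetchall()
print(deprecated)  # [('users.id',)]
```

The deprecated column itself is left in place, so the annotation guides new programs without breaking old ones.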





10. Plan for accretion.

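A program that plans for accretion relies on what the schema promises and ignores what it does not recognize. This is a minimal sketch, assuming entities arrive as plain Python dictionaries keyed by attribute name; `user_summary` and `:user/favorite-color` are hypothetical names.

```python
def user_summary(entity: dict) -> str:
    """Summarize a user entity using only the attributes this program knows.

    Unknown attributes are ignored rather than rejected, so the program
    keeps working as the schema grows.
    """
    user_id = entity[":user/id"]
    if not isinstance(user_id, str):
        # The schema promises a string, so anything else is a real error.
        raise TypeError(":user/id must be a string")
    return f"user {user_id}"


old_entity = {":user/id": "ada@example.com"}
grown_entity = {
    ":user/id": "ada@example.com",
    ":user/primary-email": "ada@example.com",
    ":user/favorite-color": "green",  # hypothetical attribute added later
}

print(user_summary(old_entity))    # user ada@example.com
print(user_summary(grown_entity))  # user ada@example.com
```

The opposite design, rejecting any entity with unexpected keys, would turn every act of schema growth into breakage for this program.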

Rule 1: Production is not development. In production, one or more codebases depend on your data, and the ten rules should be followed exactingly.

A dev environment can be much more relaxed. Alone on your development machine experimenting with a new feature, you have no users to break. You can soften the rules, so long as you harden them when transitioning to production.

Rule 2: The lack of common vocabulary makes it all too easy to automate the wrong practices. I will use the terms growth and breakage as defined in Rich Hickey's Spec-ulation talk. In schema terms, growth provides more schema, while breakage takes schema away or changes what it means.

In contrast to these terms, many people use "migrations", "refactoring", or "evolution". These usages tend to focus on repeatability, convenience, and the needs of new programs, ignoring the distinction between growth and breakage. The problem here is obvious: Breakage is bad, so we don't want it to be more convenient!

Using precise language underscores the costs of breakage. Most migrations are easily categorized as growth or breakage by considering the rules below. Growth migrations are suitable for production, and breakage migrations are, at best, a dev-only convenience. Keep them widely separate.

Rule 3: Schema growth needs to be reproducible from one environment to another. Reproducibility supports the development and testing of new schema before putting it into production and also the reuse of schema in different databases. Schema growth also needs to be evident in the database itself, so that you can determine what the database has, what it needs, and when growth occurred.

For both of these reasons, the database is the proper source of truth for schema growth. When the database is the source of truth, reproducibility and auditability happen via the ordinary query and transaction capabilities of the database. (If your database is not up to the tasks of queries and transactions, you have bigger problems beyond the scope of this article.)

Storing schema in a database is strictly more powerful than storing schema as text files in source control. The database is the actual home for schema, plus it provides validation, structure, query, transactions, and history. A source control system provides only history and is separate from the data itself.

Note that this does not mean "never put schema information in source control". Source control may be convenient for other reasons, e.g. it may be more readily accessible. You may redundantly store schema in source control, but remember that the database is definitive.

Rule 4: As you acquire more information about your domain, grow your schema to match. You can grow a schema by adding new things, and only by adding new things.

Rule 5: Removing a named schema component at any level is a breaking change for programs that depend on that name. Never remove a name.

Rule 6: The meaning of a name is established when the name is first introduced. Reusing that name to mean something substantially different breaks programs that depend on that meaning. This can be even worse than removing the name, as the breakage may not be as immediately obvious.

Rule 7: If you are familiar with database refactoring patterns, the advice in Rules Five and Six may seem stark. After all, one purpose of refactoring is to adopt better names as we discover them. How can we do that if names can never be removed or changed in meaning?

The simple solution is to use more than one alias to refer to the same schema entity. Suppose you wish that an attribute named :user/id had instead been named :user/primary-email. No problem! Just create :user/primary-email as an alias for :user/id. Older programs can continue to use :user/id, and newer programs can use the now-preferred :user/primary-email.

Rule 8: Namespaces greatly reduce the cost of getting a name wrong, as the same local name can safely have different meanings in different namespaces. Continuing the previous example, imagine that the local name id is used to refer to a UUID in several namespaces, e.g. :inventory/id, :order/id, and so on. The fact that :user/id is not a UUID is inconsistent, and newer programs should not have to put up with this.

Namespaces let you improve the situation without breaking existing programs. You can introduce :user-v2/id, and new programs can ignore names in the user namespace. If you don't like v2, you can also pick a more semantic name for the new namespace.

Rule 9: Databases are good at storing data about your schema. Adding annotations to your schema can help both human readers and programs make sense of how the schema grew over time.

In fact, all the database refactoring patterns that are typically implemented as breaking changes could be implemented non-destructively, with the refactoring details recorded as an annotation. For example, the breaking "split column" refactoring might instead be implemented as schema growth, with the new columns added and the split recorded on the original column.

Rule 10: If a system is going to grow at all, then programs must not bake in limiting presumptions. For example, if a schema states that :user/id is a string, then programs can rely on :user/id being a string and not occasionally an integer or a boolean. But a program cannot assume that a user entity will be limited to the set of attributes previously seen, or that it understands the semantics of attributes that it has not seen before.

Are these rules specific to a particular database?

No. These rules apply to almost any SQL or NoSQL database. The rules even apply to the so-called "schemaless" databases. A better word for schemaless is "schema-implicit", i.e. the schema is implicit in your data and the database has no reified awareness of it. With an implicit schema, all the rules still apply, except that the database is impotent to help you (no Rule 3).

In Context

Many of the resources on migrations, refactoring, and database evolution emphasize repeatability and the needs of new programs, without making the top-level distinctions of growth vs. breakage and prod vs. dev. As a result, these resources encourage breaking the rules in this article.

Happily, these resources can easily be recast in growth-only terms. You can grow your schema without breaking your app. You can continuously deploy without continuously propagating breakage. Here's what it looks like in Datomic.