When we started on Dolt, our goal was to apply Git's idea of versioning to data. Whereas Git versions files, Dolt versions tables. As the creator of a database, you have the choice of how to represent any value. We chose to adopt the idea of having a schema for each of our tables, meaning that any one column has a definition of the type of data that it represents, and all cells under that column adhere to that type.

As an example, let's take a look at the value 1 , and let's store it in a column we'll refer to as val .

val 1

If someone were to retrieve the value, what would it represent? Would it represent a count? Would it instead represent a true/false value? Perhaps it represents a character encoded in the ASCII format to denote the start of a message header? It is ambiguous. This could be fixed by using a more descriptive column name.

Number Of Books 1 IV 1101

This now highlights another problem, in that we can add any values that we want, so we inserted the values IV (which is 4 in Roman numerals) and 1101 (which is 13 in binary). Each of these are considered valid values in the column by its name alone. By having a type, the database can verify that the value given conforms to the type, converting if need be, or rejecting if the value is not understood to map to the given type.

Number Of Books int32 1 4 13

This type information is even more handy in a database built for sharing like Dolt. Have you ever tried importing a CSV with N/A in a numerical column to represent NULL? We have. It's more work than it should be. Moreover, the type information of a column communicates the schema creator’s intent. It's like documentation that is built-in.

First there was Noms

We initially built Dolt on top of Noms. Although we have made significant changes along the way, their implementation of types was our starting point. In Noms, each value is stored on disk as an array of bytes, and those bytes are prefixed with an integer.

Kind Kind ↓ Data→ ↓ Data→ ------------------------------------------------- | 1 | | | | | | 3 | | | | | | -------------------------------------------------

This integer represents the kind , which is Noms' way of stating the type that those bytes represent, and also how to interpret those bytes. Noms has a general-purpose type system, but for our purpose of collaborating on tabled data, we started using the three scalar types:

Type Description bool represents either true or false number IEEE-754 double precision number string an array of characters

Each type in Noms has a few properties, such as defining equality, sortability, how to encode and decode from a byte array, and also how to hash the value. When we began working on Dolt, we added our own types into the mix, while also renaming number to float64 .

Type Description bool represents either true or false float64 IEEE-754 double precision number int64 integers from -9223372036854775808 to 9223372036854775807 string an array of characters uint64 integers from 0 to 18446744073709551615 uuid 128-bit value as defined by RFC 4122

And then there was SQL

This was sufficient for a few months, until we began to add SQL support. SQL allows one to use the data in many exciting different ways, and also works as a standard and well-known interface to tabled data. We specifically chose the MySQL variant to target first due to its popularity, and with it came a lot more types that we did not support.

SQL Type SQL Aliases Description BIGINT integers from -9223372036854775808 to 9223372036854775807 BIGINT UNSIGNED integers from 0 to 18446744073709551615 BINARY an array of bytes, up to a maximum of 255 BIT depending on bit depth, holds integers from 0 to 18446744073709551615 BLOB an array of bytes, up to a maximum of 65535 CHAR CHARACTER an array of character, up to a maximum of 255 DATE dates from 1000-01-01 to 9999-12-31 DATETIME dates and times from 1000-01-01 00:00:00 to 9999-12-31 23:59:59 DECIMAL FIXED, DEC, NUMERIC number with a specifiable number of digits, both total and fractional DOUBLE DOUBLE PRECISION, REAL IEEE-754 double precision number ENUM definable collection of allowed strings, with a maximum of 65535 entries FLOAT IEEE-754 single precision number GEOMETRY a supertype that can hold a point, linestring, or polygon GEOMETRYCOLLECTION a collection of any geometry-type object INT INTEGER integers from -2147483648 to 2147483647 INT UNSIGNED INTEGER UNSIGNED integers from 0 to 4294967295 JSON represents the JavaScript Object Notation format for a string LINESTRING a collection of points forming a line LONGBLOB an array of bytes, up to a maximum of 4294967295 LONGTEXT an array of characters, up to a maximum of 4294967295 MEDIUMBLOB an array of bytes, up to a maximum of 16777215 MEDIUMINT integers from -8388608 to 8388607 MEDIUMINT UNSIGNED integers from 0 to 16777215 MEDIUMTEXT LONG, LONG VARCHAR an array of characters, up to a maximum of 4194303 (for utf8 encoding) MULTILINESTRING a collection of linestrings MULTIPOINT a collection of points MULTIPOLYGON a collection of polygons NCHAR NATIONAL CHAR shortcut for appending CHARACTER SET utf8mb3 to CHAR NVARCHAR NATIONAL CHARACTER VARYING shortcut for appending CHARACTER SET utf8mb3 to VARCHAR POINT represents a point in space POLYGON a collection of points forming a shape SET definable collection of allowed strings, with a max of 64 entries, and allows combinations SMALLINT integers from -32768 to 32767 SMALLINT UNSIGNED integers from 0 to 65535 TEXT an array of characters, up to a maximum of 16383 (for utf8 encoding) TIME times from -838:59:59 to 838:59:59 TIMESTAMP dates and times from 1970-01-01 00:00:01 to 2038-01-19 03:14:07 TINYBLOB an array of bytes, up to a maximum of 255 TINYINT BOOLEAN, BOOL integers from -128 to 127 TINYINT UNSIGNED integers from 0 to 255 TINYTEXT an array of characters, up to a maximum of 63 (for utf8 encoding) VARBINARY an array of bytes, up to a maximum of 65535 VARCHAR CHARACTER VARYING an array of characters, up to a maximum of 16383 (for utf8 encoding) YEAR integers from 1901 to 2155, along with 0

Our first solution was to map any types that are more restrictive to their less restrictive variants, such as a TINYINT ( int8 ) to a BIGINT ( int64 ), and store them as our closest Noms-equivalent type. Although the user lost a bit of type information, it would still allow them to export their data from MySQL and import it into Dolt. This solution did not work for other types, such as DATETIME , as there was no clear analog in Dolt, and thus we had to add two more types to our Noms-based backend: timestamp and inlineblob .

Type Description bool represents either true or false float64 IEEE-754 double precision number inlineblob an array of bytes, up to a maximum of 65535 bytes int64 integers from -9223372036854775808 to 9223372036854775807 string an array of characters timestamp dates and times from -200000000-01-01 00:00:00 UTC to 200000000-12-31 23:59:59 UTC uint64 integers from 0 to 18446744073709551615 uuid 128-bit value as defined by RFC 4122

This allowed us to map to many of the common types seen in MySQL databases.

SQL Type Noms Type BIGINT int64 BIGINT UNSIGNED uint64 BINARY BIT BLOB CHAR string DATE timestamp DATETIME timestamp DECIMAL DOUBLE float64 ENUM FLOAT float64 GEOMETRY GEOMETRYCOLLECTION INT int64 INT UNSIGNED uint64 JSON LINESTRING LONGBLOB LONGTEXT string MEDIUMBLOB MEDIUMINT int64 MEDIUMINT UNSIGNED uint64 MEDIUMTEXT string MULTILINESTRING MULTIPOINT MULTIPOLYGON NCHAR NVARCHAR POINT POLYGON SET SMALLINT int64 SMALLINT UNSIGNED int64 TEXT string TIME TIMESTAMP timestamp TINYBLOB TINYINT int64 TINYINT UNSIGNED uint64 TINYTEXT string VARBINARY VARCHAR string YEAR bool inlineblob uuid

Of note, both bool and uuid do not have an equivalent in SQL, and thus are inaccessible as a backing Noms type when creating a table through SQL, since no types map to them. Although inlineblob would satisfy some of the binary / blob types, it would not work for LONGBLOB , so its integration with SQL types was postponed.

Enter go-mysql-server

As work on the SQL engine continued, we began running into the issue that our Noms types worked a bit differently than MySQL's types under some circumstances, and thus we worked to achieve complete parity with their types. Our goal was to be a fully drop-in replacement for MySQL, and that included all of the idiosyncrasies that may not be immediately apparent. This way, one could round-trip data between MySQL and Dolt without any surprises or loss of data. It would also allow anyone to gain the versioning capabilities of Dolt while keeping the workflow that has already been established with their MySQL integration.

We adopted a SQL engine built on top of our fork of the go-mysql-server project. In order to achieve full type compliance, most of the type code had to be rewritten. We referenced MySQL's documentation, however we found a lot of little interactions that were not documented. For example, assigning the string 1 to an integer column ( BIGINT , etc.) would parse the string as though it contained an integer. However, assigning the string 1 to a BIT column would parse the string as though it were an array of utf8-encoded characters, which it would then convert to bytes, giving a value of 49 in this example. Considering how similarly BIT behaves relative to the other integer types, this difference was not assumed to have existed until it was manually found.

We have a test framework for testing correctness, however, the focus of the correctness tests is less about types and more on query execution and features. This meant that a lot of manual testing with a MySQL instance was used to better understand how each type worked, writing tests for the observed behavior to guarantee our implementation's parity.

Putting it all together in Dolt

Once the SQL engine handled types properly, it was on to the Dolt side. In Dolt, a schema’s type was just a reference to the Noms type, and the decision was made to give Dolt its own concept of types, and for those types to handle how SQL and Noms types relate to Dolt. This way, Dolt did not need to map SQL types directly to similar Noms types, and could freely convert between the two representations. This allowed for many SQL types, such as SET , to map to a completely different Noms type, such as uint64 , as the Dolt type could handle the uint64 as a bit-array, and map each bit to a SET entry. This also had another benefit: any Dolt types that do not exist in SQL now have an interface through which SQL can interact with them, retaining compatibility with old Dolt databases. For example, uuid does not have a parallel in SQL, and thus appears as a CHAR(36) when accessed through SQL, however still retains all of the type checking necessary for that type. In the future, if we decide to extend our SQL types to support uuids , then it's as simple as changing the displayed SQL type, without changing anything in Noms. For the majority of SQL types, the Dolt type is simply a conversion from the SQL value to the closest Noms value that can uniquely represent that value, and vice-versa.

SQL Type Dolt Type Noms Type BIGINT number[i64] int64 BIGINT UNSIGNED number[u64] uint64 BINARY varbinary[bin, length] BIT bit[depth] uint64 BLOB varbinary[blob, 65535] CHAR varstring[char] string DATE datetime[date] timestamp DATETIME datetime[dt] timestamp DECIMAL decimal[prec, scale] decimal DOUBLE number[f64] float64 ENUM enum[entries] int64 FLOAT number[f32] float64 GEOMETRY GEOMETRYCOLLECTION INT number[i32] int64 INT UNSIGNED number[u32] uint64 JSON LINESTRING LONGBLOB varbinary[blob, 4294967295] LONGTEXT varstring[text, 4294967295] string MEDIUMBLOB varbinary[blob, 16777215] MEDIUMINT number[i24] int64 MEDIUMINT UNSIGNED number[u24] uint64 MEDIUMTEXT varstring[text, enc_length] string MULTILINESTRING MULTIPOINT MULTIPOLYGON NCHAR varstring[char, length] string NVARCHAR varstring[var, length] string POINT POLYGON SET set[entries] uint64 SMALLINT number[i16] int64 SMALLINT UNSIGNED number[u16] int64 TEXT varstring[text, enc_length] string TIME time int64 TIMESTAMP datetime[ts] timestamp TINYBLOB varbinary[blob, length] TINYINT number[i8] int64 TINYINT UNSIGNED number[u8] uint64 TINYTEXT varstring[text, enc_length] string VARBINARY varbinary[var, length] VARCHAR varstring[var, length] string YEAR year int64

For a SMALLINT ( int16 ), Dolt has a number type that converts it to an int64 for Noms to handle, and back to an int16 when retrieving the value, using the SQL engine to verify the value before storage. This way, a Dolt type does not need to worry about how to hash a value or perform any of the other operations, as the underlying Noms value already does that.

What's still missing?

This has allowed us to add the majority of types from MySQL, however there are still a few missing, such as: JSON , binary types, and GEOMETRY (including all subtypes). We also save character sets and collations on all relevant types, however they are currently ignored. We are working hard to bring you these missing types, so check our release notes on GitHub to stay updated! Better yet, if you need a new type, implement it and send us a pull request.

Conclusion

Types are an integral part of a database meant for sharing data. They communicate intention, while also enforcing homogeneity to the data. It is imperative that we get types right, and our journey through building Dolt has landed us with a type implementation that we feel is stable, extensible, and ready for the challenges that our users will demand.

If you enjoyed reading this blog, why not checkout Dolt, or browse our repositories here on DoltHub?