RDBMS

The traditional SQL database may seem an odd choice; however, in addition to scaling vertically, it can scale across multiple nodes with sharding and read replicas. In the following points I’m focusing on these databases as analytical data stores (relatively few large queries) rather than as traditional transactional databases (massive numbers of relatively small queries).

Overall:
- Powerful ACID guarantees
- Row-level updates and inserts
- Requires structured data, although some databases also support free-form JSON fields
- Can scale to handle large data sizes
- Vertically: modern machines can be quite large, so even a single machine can store significant data
- Horizontally: sharding is possible, although it requires additional manual setup and potentially client logic changes
- There are tradeoffs as you scale (e.g. queries across partitions or complex queries)
- Computing is tied to the storage system in terms of scaling
- Multi-master or automatic failover setups can be tricky to get right, so often a single point of failure exists
- Used by Uber and Facebook to handle massive amounts of data
- There are better purpose-built technologies if you truly need to scale big
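To make the horizontal-scaling tradeoffs concrete, here is a minimal sketch of client-side hash sharding: the client picks a shard from the key, so any query touching many keys must fan out across shards and merge the results. All names (shard names, functions) are illustrative, not from any particular sharding library.

```python
import hashlib

SHARDS = ["users_shard_0", "users_shard_1", "users_shard_2"]

def shard_for(key: str) -> str:
    # Stable hash -> shard index. A simple modulo scheme like this
    # requires resharding (moving rows) if the shard count changes.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

def fan_out_query(keys):
    # A query spanning many keys must be split per shard and merged
    # client-side -- one of the cross-partition tradeoffs noted above.
    plan = {}
    for key in keys:
        plan.setdefault(shard_for(key), []).append(key)
    return plan
```

This is the kind of client logic change that sharding a previously single-node RDBMS tends to require.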

MySQL:
- Open source; PaaS and enterprise versions exist
- Support for JSON data types
- Recent support for window functions
- Commercial support by Oracle (which owns MySQL), PaaS support by AWS
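MySQL’s recent window-function support uses the standard SQL `OVER (...)` syntax. As a runnable illustration, the same standard syntax works in Python’s stdlib sqlite3 module (SQLite 3.25+); the `sales` table here is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10), ("east", 30), ("west", 20)])

# Running total per region -- the kind of analytical query that
# previously required awkward self-joins or session variables.
rows = conn.execute("""
    SELECT region, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY amount) AS running
    FROM sales
    ORDER BY region, amount
""").fetchall()
# rows: [("east", 10, 10), ("east", 30, 40), ("west", 20, 20)]
```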

MariaDB:
- Open source; originally a fork of MySQL
- Supports window functions
- No JSON data type, but native functions for working with JSON
- Support for a columnar storage engine, which significantly speeds up analytical workloads
- Commercial support by MariaDB, PaaS support by AWS

PostgreSQL:
- Open source; PaaS and enterprise versions exist
- Support for JSON data types
- Commercial support by various companies
- Better parallel single-query optimizations than MySQL
- Third-party support for a columnar storage engine, which significantly speeds up analytical workloads
- Support for sharding via PL/Proxy

Amazon Aurora:
- Fully managed MySQL- and PostgreSQL-compatible databases on AWS
- Proprietary PaaS
- Automatically and seamlessly allocates storage
- Data is replicated across and within availability zones
- Claims improved performance compared to the open source versions due to tight coupling with the SSD storage layer; PostgreSQL performance may be lower on Aurora
- Lags behind open source in version support: Aurora MySQL 5.7 support came out over 2 years after MySQL 5.7
- Does not support clustering beyond read replicas


Object/File Storage

These are distributed, scalable ways of storing large amounts of bulk data such as historical logs or image files. These data stores can efficiently read out batches of data for further processing (via Spark, Presto, etc.) and so are capable of acting as a Data Warehouse backend.

Overall:
- Efficient storage of structured, semi-structured and unstructured data
- Not designed for individual row-level reads and writes
- Not optimized for storing small files/objects
- Data processing systems (Spark, MapReduce, etc.) can connect to them
- Computing can scale independently of storage

Apache HDFS:
- The Hadoop Distributed File System (HDFS) provides a distributed way of storing hundreds of petabytes of data
- Open source; PaaS and enterprise versions exist
- Written in Java; first release in 2006
- Distributed, fault-tolerant and scalable
- Data is stored multiple times across the cluster, either as full copies or using erasure coding
- Multiple master nodes allow for seamless failover
- Stores files in directories, but not designed as a mountable file system
- Various other projects have strong support for HDFS, including HBase, Spark, Hive, Hadoop MapReduce, Presto and Flink
- Even with modern tooling (Hortonworks, Cloudera) there is still a non-trivial amount of operational knowledge needed to run your own cluster; PaaS versions exist (Amazon EMR, Azure HDInsight, Google Dataproc) and can lower the operational knowledge needed
- Not optimized for storing large numbers of smaller files (<64 MB) such as images, although you can bundle them together into, for example, SequenceFiles
- Based on the Google File System paper
- If you’re running your own hardware or need the performance then it’s a solid choice, but otherwise a cloud store like S3 makes more sense
- Commercial support provided by Hortonworks and Cloudera
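The choice between full copies and erasure coding is largely a storage-overhead tradeoff, and the arithmetic is simple. A quick sketch, assuming the common defaults of 3x replication and a Reed-Solomon 6+3 erasure coding policy:

```python
def replication_overhead(copies: int) -> float:
    # Three full copies mean 3.0 bytes of raw storage per logical byte.
    return float(copies)

def erasure_coding_overhead(data_blocks: int, parity_blocks: int) -> float:
    # RS(6, 3): 6 data + 3 parity blocks -> 1.5x raw storage,
    # while still tolerating the loss of any 3 blocks.
    return (data_blocks + parity_blocks) / data_blocks

print(replication_overhead(3))        # 3.0
print(erasure_coding_overhead(6, 3))  # 1.5
```

Erasure coding halves the raw storage bill here, at the cost of CPU-heavier reads and reconstruction when blocks are lost.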

Amazon S3:
- A fully managed object/file storage platform provided by Amazon
- Proprietary PaaS
- Distributed, highly available (99.99%) and fault-tolerant (11 9’s of durability)
- Fully managed, so no configuration or manual scaling is necessary
- Can emulate a file system, including listing objects/files in “directories” (technically just uses the prefixes of keys)
- Can be considered an alternative to HDFS, as many projects are able to query data stored in S3 (including MapReduce, Spark, Flink, Presto, etc.)
- HBase can use S3 as a storage backend if using Amazon EMR
- List operations can be slow and are only eventually consistent (i.e. may return stale data); the latest release of Hadoop includes experimental metadata caching support to work around this
- Relatively low cost and lack of operational overhead
- A solid choice for storing batch data if you’re in the Amazon ecosystem
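Since S3 “directories” are just shared key prefixes over a flat keyspace, the listing behavior is easy to illustrate in plain Python. This sketch makes no AWS calls; the bucket contents and function names are made up, but the prefix/delimiter idea mirrors how object-store listings work.

```python
# A flat key -> object mapping, as in an object store; "directories"
# exist only as shared key prefixes.
bucket = {
    "logs/2019/01/app.log": b"...",
    "logs/2019/02/app.log": b"...",
    "images/cat.png": b"...",
}

def list_prefix(store, prefix, delimiter="/"):
    # Emulates listing the immediate "children" under a prefix by
    # filtering keys and collapsing everything past the next delimiter.
    children = set()
    for key in store:
        if key.startswith(prefix):
            rest = key[len(prefix):]
            head, sep, _ = rest.partition(delimiter)
            children.add(head + (delimiter if sep else ""))
    return sorted(children)

print(list_prefix(bucket, "logs/"))  # ['2019/']
```

Because every listing is a scan/filter over the keyspace rather than a directory lookup, list-heavy workloads are slower than on a real file system like HDFS.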

Azure Blob Storage:
- An object/file storage platform akin to S3 provided by Azure
- Proprietary PaaS
- Distributed, highly available (99.99%) and fault-tolerant (11 9’s or greater depending on replication configuration)
- Fully managed, so no configuration or manual scaling is necessary
- Strongly consistent in list operations, unlike S3
- Can be considered an alternative to HDFS, as many projects are able to query data stored in Azure Blob Storage (including MapReduce, Spark, Flink, Presto, etc.)
- HBase can use Azure Blob Storage as a backend natively
- Relatively low cost and lack of operational overhead
- A solid choice for storing batch data if you’re in the Azure ecosystem

Google Cloud Storage:
- An object/file storage platform akin to S3 provided by Google
- Proprietary PaaS
- Distributed, highly available (99.9% to 99.95%) and fault-tolerant (11 9’s)
- Fully managed, so no configuration or manual scaling is necessary
- Strongly consistent in list operations, unlike S3
- Can be considered an alternative to HDFS, as many projects are able to query data stored in Google Cloud Storage (including MapReduce, Spark, Flink, Presto, etc.)
- HBase is not supported; Google instead prefers you use the HBase interface to BigTable
- Relatively low cost and lack of operational overhead
- A solid choice for storing batch data if you’re in the Google ecosystem


Columnar NoSQL

Instead of storing data as rows, these databases store data as columns or groups of columns. This approach allows for much higher performance when only a subset of the columns needs to be read for a given query.
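The difference is easy to see with plain Python lists standing in for storage layouts (the data here is made up): a row layout forces a scan over every field of every record, while a column layout lets a single-column query touch only that column.

```python
# Row-oriented: each record is stored together; computing over one
# column still means walking every field of every row.
rows = [("alice", 30, "NYC"), ("bob", 25, "SF"), ("carol", 35, "LA")]

# Column-oriented: each column is stored contiguously; a query over one
# column scans only that column (and similar values compress better).
columns = {
    "name": ["alice", "bob", "carol"],
    "age": [30, 25, 35],
    "city": ["NYC", "SF", "LA"],
}

avg_age_row = sum(r[1] for r in rows) / len(rows)        # touches whole rows
avg_age_col = sum(columns["age"]) / len(columns["age"])  # touches one column
```

Both compute the same answer; at scale, the columnar scan reads a fraction of the bytes.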

Overall:
- Efficient storage of structured data
- Allow for key-level writes and reads in addition to bulk reads and writes
- Data processing systems (Spark, MapReduce, etc.) can connect to them, and as a result computing can scale independently of storage
- Can act as a regular NoSQL database

Apache Cassandra:
- A masterless database that avoids any single point of failure and aims for high availability
- Open source, but PaaS and enterprise versions exist
- Written in Java
- Initially developed by Facebook and publicly released in 2008
- Availability over consistency (AP) in general; supports tunable per-query consistency with rigorous testing to avoid issues
- Uses a limited SQL syntax for queries (no joins)
- Requires structured data whose schema is defined ahead of time
- No external dependencies needed (like ZooKeeper), which makes deployment relatively easy
- Overall good performance, with an original emphasis on batch write performance
- Supports secondary indexes
- Used by Uber and Netflix
- Based on the Dynamo paper from Amazon
- First-class support for Spark and the lack of dependencies make for a great Spark data storage backend
- Commercial support provided by DataStax
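Tunable per-query consistency comes down to choosing how many of the N replicas must acknowledge a read (R) and a write (W). If R + W > N, every read quorum overlaps every write quorum, so reads see the latest write. A sketch of that check (function and values are illustrative, not Cassandra API):

```python
def is_strongly_consistent(n_replicas: int, write_acks: int, read_acks: int) -> bool:
    # If R + W > N, any read quorum overlaps any write quorum, so a
    # read always contacts at least one replica with the latest write.
    return read_acks + write_acks > n_replicas

# QUORUM writes + QUORUM reads on 3 replicas: 2 + 2 > 3 -> consistent reads.
print(is_strongly_consistent(3, 2, 2))  # True
# ONE write + ONE read on 3 replicas: 1 + 1 <= 3 -> stale reads possible.
print(is_strongly_consistent(3, 1, 1))  # False
```

Dropping R or W to 1 buys latency and availability at the cost of possibly stale reads, which is the AP-leaning default posture described above.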

Apache HBase:
- A strongly consistent (CP) database built on top of HDFS and ZooKeeper
- Open source, but enterprise versions exist
- Written in Java; first release in 2007
- Consistency over availability (CP)
- Native support for being a MapReduce data source; Spark support through a third party and, eventually, built-in support in HBase 3.0.0 (which may be a while, judging by the 2.0.0 releases)
- Support for Coprocessors, which allow custom code to easily run on HBase servers
- Read performance on par with or better than Cassandra, but slower write performance; actual results depend on who’s doing the test (Hortonworks or DataStax)
- Dependency on HDFS and ZooKeeper means that deployment is fairly involved if you haven’t already bought into the Hadoop ecosystem
- Supports secondary indexes
- Used by Facebook among others
- Inspired by BigTable from Google
- Dependency on HDFS makes it harder to justify as a batch data store versus plain HDFS
- Commercial support provided by Hortonworks and Cloudera

Google BigTable:
- A fully managed database that aims for high consistency
- Proprietary PaaS
- Consistency over availability (CP)
- Fully managed
- HBase is inspired by BigTable, and BigTable provides an open source HBase compatibility layer
- Third-party libraries (Spark, MapReduce, etc.) that have HBase support generally also have BigTable support
- No support for custom Coprocessors
- No support for secondary indexes

Amazon DynamoDB:
- A fully managed database that aims for high availability
- Proprietary PaaS
- Availability over consistency (AP) in general, but has support for tunable per-query consistency
- Fully managed; provides auto-scaling for the read and write capacity
- You do not need to define a schema ahead of time for each table, merely the index keys
- Like Cassandra, it’s derived from the Dynamo approach (from Amazon) and so shares various similarities
- There is also support for MapReduce and Spark operations against DynamoDB tables
- Supports secondary indexes
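The “schema is just the keys” model means every item must carry the declared key attributes, while all other attributes are free-form and can differ per item. A plain-Python sketch of that idea (class and attribute names are made up for illustration, not the DynamoDB API):

```python
class KeysOnlyTable:
    """Enforces only the declared key attributes, Dynamo-style."""

    def __init__(self, hash_key, range_key=None):
        self.hash_key, self.range_key = hash_key, range_key
        self.items = {}

    def put_item(self, item: dict):
        # Only the key attributes are validated...
        if self.hash_key not in item:
            raise ValueError(f"missing hash key {self.hash_key!r}")
        if self.range_key and self.range_key not in item:
            raise ValueError(f"missing range key {self.range_key!r}")
        # ...everything else is stored as-is, whatever shape it has.
        key = (item[self.hash_key], item.get(self.range_key))
        self.items[key] = item

table = KeysOnlyTable(hash_key="user_id")
table.put_item({"user_id": "u1", "name": "alice"})
table.put_item({"user_id": "u2", "last_login": "2019-01-01"})  # different attrs: fine
```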


Data Warehouse

These are full-featured Data Warehouses that tie together data storage and data processing into a single entity. In the upcoming Data Processing post in this series we’ll go more into the computing performance of these solutions, which is often a key reason for using them to store data.

Overall:
- Low-latency and high-throughput query performance, but not necessarily faster than other modern batch processing solutions
- Optimized columnar data storage
- Limits on flexibility (data types, UDFs, data processing approaches, etc.)
- Lock-in if used as the primary data store

Druid:
- Columnar data store designed to provide low-latency analytical queries
- Open source; written in Java; open sourced in 2012
- Provides sub-second analytical/OLAP queries
- Supports real-time ingestion of data rather than just batch ingestion
- Provides a limited subset of SQL queries (only large-to-small table joins)
- Seamless scaling of the cluster up/down independently of storage
- Leverages “deep” storage such as S3 or HDFS to avoid data loss if nodes go down
- Complicated infrastructure setup involving multiple types of nodes and distributed storage (S3, HDFS, etc.)
- A number of external dependencies (S3/HDFS, ZooKeeper, an RDBMS), which increases operational overhead
- Well suited for time series data
- Used by Airbnb, eBay, Netflix, Walmart and others
- A solid choice for a specialized analytical/OLAP system, but otherwise other options are more flexible and lower overhead (at the cost of slower queries)
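Part of what makes Druid fast on time series data is rollup at ingestion: raw events are pre-aggregated into per-interval rows keyed by their dimensions. A minimal sketch of hour-level rollup (the event shape and field names are made up):

```python
from collections import defaultdict

events = [
    {"ts": "2019-05-01T10:15", "page": "/home", "clicks": 1},
    {"ts": "2019-05-01T10:45", "page": "/home", "clicks": 2},
    {"ts": "2019-05-01T11:05", "page": "/home", "clicks": 1},
]

def rollup(events):
    # Truncate timestamps to the hour and sum the metric per
    # (hour, dimension) pair -- far fewer rows to store and scan.
    agg = defaultdict(int)
    for e in events:
        hour = e["ts"][:13]  # "YYYY-MM-DDTHH"
        agg[(hour, e["page"])] += e["clicks"]
    return dict(agg)
```

Here three raw events collapse into two pre-aggregated rows, and queries at hour granularity never touch the raw events at all.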

ClickHouse:
- Columnar data store designed to provide low-latency analytical queries and simplicity
- Open source; written in C++; open sourced in 2016 by Yandex
- Significantly higher performance than Druid for some workloads
- Less scalable than Druid or other approaches
- Leverages ZooKeeper, but a single-node cluster can run without it

Amazon Redshift:
- A fully managed data warehouse solution that lets you efficiently store and query data using SQL syntax
- Proprietary PaaS
- General-purpose analytical store that supports full SQL syntax
- Loading/unloading data takes time (hours, potentially)
- No real-time ingestion, only batch, although micro-batches can simulate real-time
- Need to explicitly scale the cluster up/down (with write downtime for the duration)
- Storage and computing are tied together
- Lacks complex data types such as arrays, structs, maps or native JSON
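Simulating near-real-time on a batch-only store comes down to accumulating incoming records and bulk-loading them at intervals. A sketch of the batching half of that pattern (the actual load step, e.g. a bulk COPY, is left as a stand-in comment since it depends on the warehouse):

```python
def micro_batches(records, batch_size):
    # Group a stream of records into fixed-size batches; each batch
    # would then be loaded in one bulk operation (e.g. a COPY from S3).
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

batches = list(micro_batches(range(7), batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

In practice the trigger is usually size *or* time (flush every N records or every few minutes, whichever comes first), which bounds how stale the warehouse can get.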

Google BigQuery:
- A fully managed data warehouse solution that lets you efficiently store and query data using SQL syntax
- Proprietary PaaS
- General-purpose analytical store that supports full SQL syntax
- Real-time ingestion support
- Unlike Redshift, it is serverless: you do not need to manage, scale or pay for a cluster yourself
- Supports complex data types (arrays, structs) but not native JSON

Azure SQL Data Warehouse:
- Proprietary PaaS
- General-purpose analytical store that supports full SQL syntax
- No real-time ingestion, only batch, although micro-batches can simulate real-time
- Computing nodes can be scaled independently of storage
- Can pause computing resources when not in use to save cost
- Storage is in Azure Blob Storage
- Lacks complex data types such as arrays, structs, maps or native JSON



Be sure to tune in for the next installments in this series, which will cover Batch Processing, Stream Processing, NoSQL and Infrastructure. If you liked this post, be sure to reach out on Twitter, comment or share it.