Architectures of Druid and Pinot are almost identical to each other, while ClickHouse stands a little apart from them. I will first compare the architecture of ClickHouse with “generic” Druid/Pinot architecture, and then discuss smaller differences between Druid and Pinot.

Differences between ClickHouse and Druid/Pinot

Data Management: Druid and Pinot

In Druid and Pinot, all data in each “table” (whatever it is called in the terminology of those systems) is partitioned into a specified number of parts. Data is usually also divided along the time dimension, at a specified interval. Those parts of the data are then “sealed” individually into self-contained entities called “segments”. Each segment includes table metadata, compressed columnar data and indexes.
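The way rows end up in segments can be sketched roughly as follows. This is only an illustration: the hourly interval, the two partitions, the CRC32 hash and the names `segment_key` and `build_segments` are all assumptions made for the example, not how Druid or Pinot actually implement it.

```python
import zlib
from collections import defaultdict
from datetime import datetime, timezone


def segment_key(timestamp, partition_value, interval_seconds=3600, num_partitions=2):
    """Return (interval_start_epoch, partition_number) for one row."""
    epoch = int(timestamp.timestamp())
    # Time dimension: round down to the start of the enclosing interval.
    interval_start = epoch - epoch % interval_seconds
    # Partitioning: a stable hash of some row value, modulo the number of parts.
    partition = zlib.crc32(partition_value.encode("utf-8")) % num_partitions
    return interval_start, partition


def build_segments(rows, interval_seconds=3600, num_partitions=2):
    """Group rows into 'segments' keyed by (time interval, partition)."""
    segments = defaultdict(list)
    for row in rows:
        key = segment_key(row["ts"], row["key"], interval_seconds, num_partitions)
        segments[key].append(row)
    return dict(segments)
```

Each value in the returned dictionary corresponds to one segment: all of its rows fall into the same time interval and the same partition.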

Segments are persisted in “deep storage” (e. g. HDFS) and can be loaded on query processing nodes, but the latter are not responsible for the durability of the segments, so query processing nodes can be replaced relatively freely. Segments are not rigidly attached to particular nodes; they can be loaded on more or less any node. A special dedicated server (called “Coordinator” in Druid and “Controller” in Pinot, but I’m going to generically refer to it as “master” below) is responsible for assigning the segments to the nodes, and moving segments between the nodes, if needed. (This doesn’t contradict what I pointed out above in this post, that all three subject systems, including Druid and Pinot, have “static” data distribution between the nodes, because segment loads and movements in Druid (and Pinot, I suppose) are expensive operations that are not done for each particular query, and usually happen only every several minutes, or hours, or days.)

Metadata about segments is persisted in ZooKeeper: directly in Druid, and via the Helix framework in Pinot. In Druid, metadata is also persisted in an SQL database; this is explained in more detail in the section “Differences between Druid and Pinot” below in this post.

Data Management: ClickHouse

ClickHouse doesn’t have “segments” containing data that falls strictly into specific time ranges. There is no “deep storage” for data; nodes in a ClickHouse cluster are responsible for both query processing and persistence/durability of the data stored on them. So no HDFS setup or cloud data storage like Amazon S3 is needed.

ClickHouse has partitioned tables, consisting of specific sets of nodes. There is no “central authority” or metadata server. All nodes, between which some table is partitioned, have full, identical copies of the table metadata, including addresses of all other nodes, on which partitions of this table are stored.

The metadata of a partitioned table includes “weights” of nodes for distribution of the newly written data, e. g. 40% of data should go to the node A, 30% to the node B and 30% to the node C. Normally the distribution should just be equal among the nodes. “Skew”, as in the example above, is only required when a new node is added to the partitioned table, in order to fill the new node with data faster. Updates of those “weights” have to be done manually by ClickHouse cluster administrators, or an automated system has to be built on top of ClickHouse.
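One plausible way to turn such weights into a routing decision is to take some numeric value of each row modulo the total weight and see which node's range the remainder falls into. The sketch below assumes exactly that; the function name `pick_node` and the use of a plain integer as the sharding value are illustrative choices, not ClickHouse's actual code.

```python
from bisect import bisect_right
from itertools import accumulate


def pick_node(sharding_value, weights):
    """Choose a node for a row, proportionally to the configured weights.

    `weights` maps node name -> integer weight, e.g. {"A": 40, "B": 30, "C": 30}.
    """
    nodes = list(weights)
    # Upper bounds of each node's remainder range: [40, 70, 100] in the example.
    bounds = list(accumulate(weights[node] for node in nodes))
    remainder = sharding_value % bounds[-1]
    return nodes[bisect_right(bounds, remainder)]
```

With the weights from the text, remainders 0 to 39 go to node A, 40 to 69 to node B and 70 to 99 to node C, so evenly spread sharding values produce the 40/30/30 split.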

Data Management: Comparison

The approach to data management is simpler in ClickHouse than in Druid and Pinot: no “deep storage” is required, there is just one type of node, and no special dedicated server for data management is required. But the ClickHouse approach becomes somewhat problematic when a table grows so large that it needs to be partitioned between dozens of nodes or more: the query amplification factor becomes as large as the partitioning factor, even for queries that cover a small interval of data:

Data distribution tradeoff in ClickHouse

In the example given in the picture above, the table data is distributed between three nodes in Druid or Pinot, but a query for a small interval of data will usually hit just two nodes (unless that interval crosses a segment interval boundary). In ClickHouse, any query will need to hit three nodes if the table is partitioned between three nodes. In this example it doesn’t seem like a dramatic difference, but imagine that the number of nodes is 100, while the partitioning factor could still be e. g. 10 in Druid or Pinot.
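The arithmetic behind this comparison can be written down in a few lines. The sketch below is a simplification that ignores replication and assumes each segment needed by the query sits on a different node, so it gives an upper bound for Druid/Pinot; the function names are made up for the example.

```python
def max_nodes_hit_segmented(num_nodes, partitions_per_interval, intervals_covered):
    """Upper bound on nodes a Druid/Pinot-style query touches.

    The query needs one segment per (interval, partition) pair, and each
    segment is served by one node, so it cannot touch more nodes than that.
    """
    return min(num_nodes, partitions_per_interval * intervals_covered)


def nodes_hit_clickhouse(num_nodes):
    """Every node holding a part of the table takes part in every query."""
    return num_nodes
```

For the 100-node example from the text, a query covering a single segment interval with a partitioning factor of 10 touches at most 10 nodes in Druid or Pinot, versus all 100 in ClickHouse.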

To mitigate this problem, the largest ClickHouse cluster at Yandex (of hundreds of nodes) is in fact split into many “subclusters” of a few dozen nodes each. This ClickHouse cluster is used to power website analytics, and each point of data has a “website ID” dimension. There is a strict assignment of each website ID to a specific subcluster, where all data for that website ID goes. There is a business logic layer on top of that ClickHouse cluster to manage this data separation on both the data ingestion and querying sides. Thankfully, in their use case few queries need to hit data across multiple website IDs, and such queries don’t come from customers of the service, so they don’t have a strict real-time SLA.
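The business logic layer described here boils down to a lookup from website ID to subcluster that is consulted on both the write path and the read path. A minimal sketch, in which the class name, the method names and the subcluster names are all hypothetical:

```python
class SubclusterRouter:
    """Maps each website ID to exactly one subcluster."""

    def __init__(self, assignments):
        # assignments: {website_id: subcluster_name}
        self._assignments = dict(assignments)

    def subcluster_for(self, website_id):
        # Ingestion and single-site queries both go through the same lookup.
        return self._assignments[website_id]

    def subclusters_for_query(self, website_ids):
        # A query across several websites must fan out to every subcluster involved.
        return sorted({self._assignments[w] for w in website_ids})
```

A query for a single website only ever touches the nodes of one subcluster, which keeps the query amplification factor bounded by the subcluster size rather than by the size of the whole cluster.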

Another drawback of ClickHouse’s approach is that when a cluster grows rapidly, data is not rebalanced automatically: humans have to change the “node weights” in a partitioned table manually.

Tiering of Query Processing Nodes in Druid

Data management with segments is “simple to reason about”. Segments can be moved between the nodes relatively easily. Those two factors helped Druid to implement “tiering” of query processing nodes: old data is automatically moved to servers with relatively larger disks, but less memory and CPU. This significantly reduces the cost of running a large Druid cluster, at the expense of slower queries to older data.
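Conceptually, tiering is a rule that maps the age of a segment to a class of servers. The sketch below assumes two tiers and a 30-day cutoff; the tier names, the cutoff and the function name are hypothetical and only illustrate the idea, not Druid's actual rule configuration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical cutoff; a real cluster would make this configurable per table.
HOT_PERIOD = timedelta(days=30)


def choose_tier(segment_end, now, hot_period=HOT_PERIOD):
    """Pick a tier for a segment based on how old its newest data is."""
    age = now - segment_end
    # Recent segments stay on nodes with more memory and CPU; older ones
    # move to nodes with larger disks but weaker hardware.
    return "hot" if age <= hot_period else "cold"
```

Because a segment is a self-contained unit, the “master” only has to re-evaluate such a rule periodically and move the segments whose tier has changed.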

This feature allows Metamarkets to save hundreds of thousands of dollars of Druid infrastructure spend per month, compared to running a “flat” cluster.

Tiering of query processing nodes in Druid

As far as I know, ClickHouse and Pinot don’t have similar features yet; all nodes in their clusters are supposed to be the same.

Since the architecture of Pinot is very similar to Druid’s, I think it would not be very hard to introduce a similar feature in Pinot. It might be harder to do this in ClickHouse, because the concept of segments is really helpful for implementing such a feature, but it is still possible.

Data Replication: Druid and Pinot

The unit of replication in Druid and Pinot is a single segment. Segments are replicated both in the “deep storage” layer (e. g. three replicas in HDFS, or transparently inside a cloud blob storage such as Amazon S3) and in the query processing layer: typically in both Druid and Pinot, each segment is loaded on two different nodes. The “master” server monitors the replication level of each segment and loads a segment on some server if the replication factor falls below the specified level, e. g. if some node becomes unresponsive.
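The monitoring loop of the “master” can be pictured as a function that compares the current segment placement with the desired replication factor and plans the missing loads. This is a simplified sketch: the function name, the data shapes and the “least loaded node first” policy are assumptions for illustration, not the actual Coordinator or Controller logic.

```python
def plan_replica_loads(segment_to_nodes, live_nodes, replication_factor=2):
    """Return {segment: [nodes that should additionally load it]}."""
    # Count how many segments each live node currently serves.
    load = {node: 0 for node in live_nodes}
    for nodes in segment_to_nodes.values():
        for node in nodes:
            if node in load:
                load[node] += 1

    plan = {}
    for segment in sorted(segment_to_nodes):
        # Only replicas on live nodes count towards the replication factor.
        holders = [n for n in segment_to_nodes[segment] if n in load]
        missing = replication_factor - len(holders)
        if missing <= 0:
            continue
        # Prefer the least loaded live nodes that do not hold the segment yet.
        candidates = sorted(
            (n for n in load if n not in holders), key=lambda n: (load[n], n)
        )
        chosen = candidates[:missing]
        for node in chosen:
            load[node] += 1
        if chosen:
            plan[segment] = chosen
    return plan
```

The segment data itself would be fetched from “deep storage”, which is why losing a query processing node does not endanger durability.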

Data Replication: ClickHouse

The unit of replication in ClickHouse is a table partition on a server, i. e. all data from some table, stored on a server. Similar to partitioning, replication in ClickHouse is “static and specific” rather than “cloud style”, i. e. several servers know that they are replicas of each other (for some specific table; for a different table, replication configuration may be different). Replication provides both durability and query availability. When a disk on some node is corrupted, the data is not lost, because it is stored on some other node too. When some node is temporarily down, queries could be routed to the replica.

In the largest ClickHouse cluster at Yandex, there are two equal sets of nodes in different data centers, and they are paired. In each pair the nodes are replicas of each other (i. e. the replication factor of two is used) and located in different data centers.

ClickHouse depends on ZooKeeper for replication management, but otherwise ZooKeeper is not needed. It means that ZooKeeper is not needed for a single-node ClickHouse deployment.

Data Ingestion: Druid and Pinot

In Druid and Pinot, query processing nodes are specialized to load segments and serve queries to the data in segments, but not to accumulate new data and produce new segments.

When a table can be updated with a delay of an hour or more, segments are created using batch processing engines, such as Hadoop or Spark. Both Druid and Pinot have “first class” out-of-the-box support for Hadoop. There is a third-party plugin for Druid indexing in Spark, but it’s unsupported at the moment. As far as I know, Pinot doesn’t have even that level of support for Spark, i. e. you would have to contribute it yourself: understand Pinot interfaces and code, and write some Java or Scala code. But it shouldn’t be very hard to do. (Update: Ananth PackkilDurai from Slack is contributing support for Spark in Pinot now.)

When a table should be updated in real time, both Druid and Pinot introduce a concept of “real time nodes”, which do three things: accept new data from Kafka (Druid supports other sources too), serve queries to the recent data, and create segments in background, later pushing them to the “deep storage”.

Data Ingestion: ClickHouse

The fact that ClickHouse doesn’t need to prepare “segments” containing strictly all data that falls into specific time intervals allows for a simpler data ingestion architecture. ClickHouse needs neither a batch processing engine like Hadoop, nor “realtime” nodes. Regular ClickHouse nodes, the same ones that store the data and serve queries to it, directly accept batch data writes.

If a table is partitioned, the node that accepts a batch write (e. g. 10k rows) itself distributes the data according to the “weights” of all nodes in the partitioned table (see section “Data management: ClickHouse” above).

Rows written in a single batch form a small “set”. The set is immediately converted into columnar format. There is a background process on each ClickHouse node that merges row sets into larger ones. The ClickHouse documentation heavily refers to this principle as “MergeTree” and highlights its similarity with log-structured merge trees, although IMO it’s a little confusing because data is not organized in trees, it’s in a flat columnar format.
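The write-then-merge cycle can be sketched as follows. The sketch assumes that each set is kept sorted by some key (as MergeTree does with its primary key), which lets the background merge be a plain k-way merge; the function names and the dictionary-of-lists column layout are illustrative only.

```python
import heapq


def _columns(ordered_rows):
    """Turn a list of row dicts into a dict of column lists."""
    names = list(ordered_rows[0]) if ordered_rows else []
    return {name: [row[name] for row in ordered_rows] for name in names}


def make_part(rows, sort_key):
    """Convert one written batch into a small sorted, columnar 'set'."""
    return _columns(sorted(rows, key=lambda row: row[sort_key]))


def part_rows(part):
    """Iterate over a columnar part row by row."""
    names = list(part)
    return [dict(zip(names, values)) for values in zip(*part.values())]


def merge_parts(parts, sort_key):
    """Background merge: combine several sorted sets into one larger set."""
    merged = heapq.merge(
        *(part_rows(part) for part in parts), key=lambda row: row[sort_key]
    )
    return _columns(list(merged))
```

Note that nothing in this sketch is a tree: both the inputs and the output are flat, sorted columns, which is the point made in the paragraph above.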

Data Ingestion: Comparison

Data ingestion in Druid and Pinot is “heavy”: it consists of several different services, and its management is a burden.

Data ingestion in ClickHouse is much simpler (at the expense of more complicated historical data management, see above), although there is one caveat: you should be able to “batch” data in front of ClickHouse itself. Automatic ingestion and batching of data from Kafka is available out-of-the-box, but if you have a different source of real-time data, ranging from queueing infrastructure other than Kafka and stream processing engines to simple HTTP endpoints, you need to create an intermediate batching service, or contribute code to ClickHouse directly.

Query Execution

Druid and Pinot have a dedicated layer of nodes called “brokers”, which accept all queries to the system. They determine to which “historical” query processing nodes subqueries should be issued, based on the mapping from segments to the nodes on which the segments are loaded. Brokers keep this mapping information in memory. Broker nodes send subqueries downstream to query processing nodes, and when the results of those subqueries come back, the broker merges them and returns the final combined result to the user.
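The broker's job is a classic scatter-gather, which can be sketched for a simple row count query. Everything here is illustrative: the function names are made up, the replica choice (`min`) is only there to make the example deterministic, and a real broker would issue the subqueries over the network and in parallel.

```python
from collections import defaultdict


def plan_subqueries(segments_needed, segment_to_nodes):
    """Group the segments a query needs by the node that will serve them."""
    per_node = defaultdict(list)
    for segment in segments_needed:
        # Pick one replica per segment; real brokers would balance load here.
        node = min(segment_to_nodes[segment])
        per_node[node].append(segment)
    return dict(per_node)


def broker_count(segments_needed, segment_to_nodes, count_on_node):
    """Scatter a count query to the nodes and merge the partial results."""
    plan = plan_subqueries(segments_needed, segment_to_nodes)
    partials = [count_on_node(node, segments) for node, segments in plan.items()]
    return sum(partials)
```

The `segment_to_nodes` mapping is the in-memory structure mentioned above; its size grows with the total number of segments in the cluster.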

I can only speculate why the decision to extract one more type of node was made when Druid and Pinot were designed. But now it seems essential, because with the total number of segments in clusters going beyond ten million, segment-to-node mapping information takes gigabytes of memory. It’s too wasteful to allocate so much memory on all query processing nodes. So, this is another drawback imposed by Druid’s and Pinot’s “segmented” data management architecture.

In ClickHouse, dedicating a separate set of nodes to “query brokering” is usually not needed. There is a special ephemeral “distributed” table type in ClickHouse that can be set up on any node, and queries to this table do everything that “broker” nodes are responsible for in Druid and Pinot. Usually such ephemeral tables are set up on each node that participates in the partitioned table, so, in practice, every node can be the “entry point” for a query to a ClickHouse cluster. This node will issue the necessary subqueries to other partitions, process its own part of the query, and merge it with partial results from other partitions.

When a node (either one of the processing nodes in ClickHouse, or a “broker” node in Druid and Pinot) issues subqueries to other nodes, and one or a few subqueries fail for whatever reason, ClickHouse and Pinot handle this situation properly: they merge the results of all successful subqueries and still return a partial result to the user. Druid notoriously lacks this feature at the moment: if any subquery fails, the whole query fails as well.
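The difference between the two behaviours fits into a single flag in the merge step. In the sketch below, `SubqueryFailed`, `gather` and `allow_partial` are hypothetical names; with `allow_partial=True` the function behaves like the ClickHouse and Pinot handling described above, with `allow_partial=False` like the Druid one.

```python
class SubqueryFailed(Exception):
    """Raised by a subquery callable when its node cannot answer."""


def gather(subqueries, allow_partial):
    """Run subqueries and sum their results.

    `subqueries` maps node name -> zero-argument callable returning a number.
    Returns (total, list_of_failed_nodes).
    """
    total = 0
    failed = []
    for node, run in subqueries.items():
        try:
            total += run()
        except SubqueryFailed:
            if not allow_partial:
                # Fail the whole query as soon as one subquery fails.
                raise
            # Otherwise remember the failure and keep merging the rest.
            failed.append(node)
    return total, failed
```

Returning the list of failed nodes alongside the partial total lets the caller tell the user that the result is incomplete.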

ClickHouse vs. Druid or Pinot: Conclusions

The “segmented” approach to data management in Druid and Pinot, versus the simpler data management in ClickHouse, defines many other aspects of the systems. Importantly, however, this difference has little or no implication for the potential compression efficiency (albeit the compression story in all three systems is sad at the moment), or for query processing speed.

ClickHouse resembles a traditional RDBMS, e. g. PostgreSQL. In particular, ClickHouse can be deployed on just a single server. If the projected size of the deployment is small, e. g. no bigger than on the order of 100 CPU cores for query processing and 1 TB of data, I would say that ClickHouse has a significant advantage over Druid and Pinot, due to its simplicity and not requiring additional types of nodes, such as “master”, “real-time ingestion nodes”, “brokers”. In this field, ClickHouse competes with InfluxDB rather than with Druid or Pinot.

Druid and Pinot resemble Big Data systems such as HBase. Not by their performance characteristics, but by dependency on ZooKeeper, dependency on persistent replicated storage (such as HDFS), focus on resilience to failures of single nodes, and autonomous work and data management not requiring regular human attention.

For a wide range of applications, neither ClickHouse nor Druid or Pinot is an obvious winner. First and foremost, I recommend taking into account which system’s source code you are able to understand, fix bugs in, add features to, etc. The section “On Performance Comparisons and Choice of the System” discusses this more.

Secondly, you could look at the table below. Each cell in this table describes a property of some application, that makes either ClickHouse or Druid/Pinot probably a better choice. Rows are not ordered by their importance. The relative importance of each row is different for different applications, but if your application is described by many properties from one column in the table and by no or a few properties from another, it’s likely that you should choose the corresponding system from the column header.

Note: none of the properties above means that you must use the corresponding system(s), or must avoid the other system(s). For example, if your cluster is projected to be big, it doesn’t mean that you should only consider Druid or Pinot, but never ClickHouse. It rather means that Druid or Pinot become more likely to be the better solution, but other properties could outweigh this, and ClickHouse could ultimately be the better choice even for large clusters, in some applications.

Differences between Druid and Pinot

As I noted several times above, Druid and Pinot have very similar architectures. There are several pretty big features that are present in one system and absent in the other, and areas where one system has advanced significantly farther than the other. But all the things that I’m going to mention could be replicated in the other system with a reasonable amount of effort.

There is just one difference between Druid and Pinot that is probably too big to be eliminated in the foreseeable future: the implementation of segment management in the “master” node. Also, the developers of both systems probably wouldn’t want to do that anyway, because each approach has its pros and cons; it’s not that one is totally better than the other.

Segment Management in Druid

The “master” node in Druid (and likewise in Pinot) is not responsible for persisting the metadata of the data segments in the cluster, or the current mapping between segments and the query processing nodes on which the segments are loaded. This information is persisted in ZooKeeper. However, Druid additionally persists this information in an SQL database, which has to be provided in order to set up a Druid cluster. I cannot say why this decision was originally made, but currently it provides the following benefits:

Less data is stored in ZooKeeper. Only minimal information about the mapping from the segment id to the list of query processing nodes on which the segment is loaded is kept in ZooKeeper. The remaining extended metadata, such as the size of the segment, the list of dimensions and metrics in its data, etc., is stored only in the SQL database.

When segments of data are evicted from the cluster because they become too old (this is a commonplace feature of timeseries databases; ClickHouse, Druid and Pinot all have it), they are offloaded from the query processing nodes and metadata about them is removed from ZooKeeper, but not from the “deep storage” and the SQL database. As long as they are not removed manually from those places, this makes it possible to “revive” really old data quickly, in case the data is needed for some reporting or investigation.

This was unlikely to be the original intention, but there are now plans in Druid to make the dependency on ZooKeeper optional. Currently ZooKeeper is used for three different things: segment management, service discovery, and property store, e. g. for realtime data ingestion management. Service discovery and property store functionality could be provided by Consul. Segment management could be implemented with HTTP announcements and commands, and this is partially enabled by the fact that the persistence function of ZooKeeper is “backed up” by the SQL database.

The downside of having an SQL database as a dependency is a greater operational burden, especially if no SQL database is set up in the organization yet. Druid supports MySQL and PostgreSQL, and there is a community extension for Microsoft SQL Server. Also, when Druid is deployed in the cloud, convenient managed RDBMS services such as Amazon RDS can be used.

Segment Management in Pinot

Unlike Druid, which implements all segment management logic itself and relies only on Curator for communication with ZooKeeper, Pinot delegates a big share of segment and cluster management logic to the Helix framework. On the one hand, I can imagine that this frees Pinot developers to focus on other parts of their system. Helix probably has fewer bugs than the logic implemented inside Druid, because it was tested under different conditions, and because probably more time was put into Helix development.

On the other hand, Helix probably constrains Pinot with its “framework bounds”. Helix, and consequently Pinot, are probably going to depend on ZooKeeper forever.