DRBD: a distributed block device


The three R's of high availability are Redundancy, Redundancy, and Redundancy. However, on a typical setup built with commodity hardware, there is a limit to how much redundancy can be added to increase the number of nines in your uptime percentage (i.e., 99.999%). Consider a simple example: an iSCSI server with the cluster nodes using a distributed filesystem such as GFS2 or OCFS2. Even with redundant power supplies and data channels on the iSCSI storage server, there still exists a single point of failure: the storage itself.

The Distributed Replicated Block Device (DRBD) patch, developed by Linbit, introduces duplicated block storage over the network with synchronous data replication. If one of the storage nodes in the replicated environment fails, the system has another block device to rely on and can safely fail over. In short, it can be considered an implementation of RAID1 mirroring using a combination of a local disk and one on a remote node, but with better integration with cluster software such as Heartbeat, and with efficient resynchronization through the exchange of dirty bitmaps and data generation identifiers. DRBD currently works only on two-node clusters, though a hybrid setup can be used to stretch this limit. When both nodes of the cluster are up, writes are replicated: they are sent to both the local disk and the other node. For efficiency reasons, reads are served from the local disk.

The level of data coupling used depends on the protocol chosen:

Protocol A: Writes are considered to complete as soon as the local disk write has completed and the data packet has been placed in the send queue for the peer. In case of a node failure, data loss may occur because the data destined for the remote node's disk may still be in the send queue. The data on the failover node is consistent, but not up-to-date. This protocol is usually used for geographically separated nodes.

Protocol B: Writes on the primary node are considered to be complete as soon as the local disk write has completed and the replication packet has reached the peer node. Data loss may occur if both participating nodes fail simultaneously, because the in-flight data may not have been committed to disk.

Protocol C: Writes are considered complete only after both the local and the remote node's disks have confirmed that the writes are complete. There is no data loss, so this is a popular schema for clustered nodes, but the I/O throughput is dependent on the network bandwidth.
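As a sketch of how this looks in practice, the protocol is chosen per resource in the DRBD configuration file; the resource name, hostnames, devices, and addresses below are hypothetical:

```
resource r0 {
    protocol C;             # synchronous replication: writes complete only
                            # after both disks have confirmed them
    on alice {
        device    /dev/drbd0;
        disk      /dev/sda7;
        address   10.1.1.31:7789;
        meta-disk internal;
    }
    on bob {
        device    /dev/drbd0;
        disk      /dev/sda7;
        address   10.1.1.32:7789;
        meta-disk internal;
    }
}
```

Substituting "A" or "B" for "C" selects the weaker coupling modes described above.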

DRBD classifies the cluster nodes as either "primary" or "secondary." Primary nodes can initiate modifications or writes whereas secondary nodes cannot. This means that a secondary DRBD node does not provide any access and cannot be mounted. Even read-only access is disallowed for cache coherency reasons. The secondary node is present mainly to act as the failover device in case of an error. The secondary node may become primary depending on the network configuration. Role assignment and designation is performed by the cluster management software.

There are different ways in which a node may be designated as primary:

Single Primary: The primary designation is given to one cluster member. Since only one cluster member manipulates the data, this mode is useful with conventional filesystems such as ext3 or XFS.

Dual Primary: Both cluster nodes can be primary and are allowed to modify the data. This mode is typically used with cluster-aware filesystems such as OCFS2. The current DRBD release supports a maximum of two primary nodes in a basic cluster.
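A minimal sketch of enabling dual-primary mode in the configuration file (resource name hypothetical):

```
resource r0 {
    net {
        allow-two-primaries;   # both nodes may be promoted to primary;
                               # requires a cluster-aware filesystem on top
    }
}
```

Without this option, DRBD refuses to promote a node to primary while the peer is already primary.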

Worker Threads

A part of the communication between nodes is handled by threads to avoid deadlocks and complex design issues. The threads used for communication are:

drbd_receiver: handles incoming packets. On the secondary node, it allocates buffers, receives data blocks, and issues write requests to the local disk. If it receives a write barrier, it sleeps until all pending write requests have finished.

drbd_sender: the sender thread for data blocks being transferred in response to a read request. This work is done in a thread other than drbd_receiver to avoid distributed deadlocks. If a resynchronization process is running, its packets are also generated by this thread.

drbd_asender: the acknowledgment sender. Hard drive drivers signal request completion through interrupts, but sending data over the network from an interrupt callback routine could block the handler. So the interrupt handler places the packet in a queue, which is picked up by this thread and sent over the network.

Failures

DRBD requires a small reserved area for metadata in order to handle post-failure operations (such as resynchronization) efficiently. This area can be configured either on a separate device (external metadata) or within the DRBD block device (internal metadata). It holds the metadata for the disk, including the activity log and the dirty bitmap (described below).

Node Failures

If a secondary node dies, it does not affect the system as a whole, because writes are not initiated by the secondary node. If the failed node is primary, data which has yet to be written to disk, and for which completions have not been received, may be lost. To avoid this, DRBD maintains an "activity log": a reserved area on the local disk which records write operations that have not completed. The data is stored in extents and maintained as a least-recently-used (LRU) list. Each change to the activity log causes a metadata update (a single sector write). The size of the activity log is configured by the user; it is a tradeoff between minimizing updates to the metadata and the resynchronization time after the crash of a primary node.
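As an illustrative sketch, the activity-log size is expressed in extents in the configuration file; the value below is arbitrary:

```
resource r0 {
    syncer {
        al-extents 257;   # a larger activity log means fewer metadata
                          # updates during normal operation, but a longer
                          # resynchronization after a primary-node crash
    }
}
```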

DRBD maintains a "dirty bitmap" in case it has to run without a peer node or without a local disk. It describes the pages which have been dirtied by the local node. Writes to the on-disk dirty bitmap are minimized by the activity log: each time an extent is evicted from the activity log, the part of the bitmap associated with it which is no longer covered by the activity log is written to disk. Should a resynchronization become necessary, the dirty bitmaps are sent over the network to communicate which pages are dirty. Bitmaps are compressed (using run-length encoding) before being sent, to reduce network overhead; since most bitmaps are sparse, this proves quite effective.
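Why run-length encoding works so well on a sparse bitmap can be seen with a toy example; this sketch illustrates the general idea only, not DRBD's actual wire format:

```python
def rle_encode(bits):
    """Compress a sequence of bits into (bit, run_length) pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1      # extend the current run
        else:
            runs.append([b, 1])   # start a new run
    return runs

# A mostly-clean bitmap: one small dirty region in a sea of zeroes.
bitmap = [0] * 1000 + [1] * 4 + [0] * 1000
runs = rle_encode(bitmap)
print(runs)        # [[0, 1000], [1, 4], [0, 1000]]
print(len(runs))   # 3 runs describe all 2004 bits
```

A bitmap with only a few dirty regions collapses into a handful of runs, so the data sent over the network is a tiny fraction of the raw bitmap.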

DRBD synchronizes data once the crashed node comes back up, or in response to data inconsistencies caused by an interruption in the link. Synchronization is performed in linear order, by disk offset, using the consistent node's copy of the data. The rate of synchronization can be configured with the rate parameter in the DRBD configuration file.
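The rate parameter mentioned above lives in the syncer section of the configuration file; a minimal example (the bandwidth figure is arbitrary):

```
resource r0 {
    syncer {
        rate 33M;   # cap resynchronization at 33 MB/s, leaving the rest
                    # of the link for application I/O
    }
}
```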

Disk Failures

In case of local disk errors, the system may choose to deal with it in one of the following ways, depending on the configuration:

detach: Detach the node from the backing device and continue in diskless mode. In this situation, the device on the peer node becomes the main disk. This is the recommended configuration for high availability.

pass_on: Pass the error to the upper layers on a primary node. On a secondary node, the disk error is ignored, but logged.

call-local-io-error: Invoke a handler script. This mode can be used to perform a failover to a "healthy" node, automatically shifting the primary designation to the other node.
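A hedged sketch of how these policies are selected in the configuration file (the handler script path is hypothetical):

```
resource r0 {
    disk {
        on-io-error detach;   # or: pass_on, call-local-io-error
    }
    handlers {
        # consulted only when on-io-error is set to call-local-io-error
        local-io-error "/usr/local/sbin/notify-and-failover.sh";
    }
}
```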

Data Inconsistency issues

In the dual-primary case, both nodes may write to the same disk sector, making the data inconsistent. Writes to different offsets require no synchronization. To avoid inconsistency, data packets sent over the network are numbered sequentially to establish the order of writes. However, there are still some corner-case inconsistency problems the system can suffer from:

Simultaneous writes by both nodes. In such a situation, one node's write is discarded. One of the primary nodes is marked with the "discard-concurrent-writes" flag, which causes it to discard write requests from the other node when it detects simultaneous writes. The node with this flag set sends a "discard ACK" to the other node, informing it that its write has been discarded. The other node, on seeing the discard ACK, writes the first node's data instead, keeping the two drives consistent.

Local request while a remote request is in flight: this can happen when the disk latency exceeds the network latency. The local node writes to a given block and sends the write operation to the other node. The remote node then acknowledges the completion of the request and sends a new write of its own to the same block, all before the local write has completed. In this case, the local node holds the new write request until its local write is complete.

Remote request while local request is still pending: this situation comes about if the network reorders packets, causing a remote write to a given block to arrive before the acknowledgment of a previous, locally-generated write. Once again, the receiving node will simply hold the new data until the ACK is received.

Conclusion

DRBD is not the only distributed storage implementation under development. The Distributed Storage (DST) code contributed by Evgeniy Polyakov and accepted into the staging tree takes a different approach. DRBD is limited to two-node clusters, while DST can handle larger numbers of nodes. DST works on a client-server model, where the storage sits at the server end, whereas DRBD is peer-to-peer and designed for high availability rather than for distributing storage. DST, on the other hand, is designed for cumulative storage, with storage nodes that can be added as needed. DST has a pluggable interface which accepts different algorithms for mapping the storage nodes into one cumulative store; a mirroring algorithm would provide the same basic replicated-storage capability as DRBD.

The DRBD code is maintained in the git repository at git://git.drbd.org/linux-2.6-drbd.git, under the "drbd" branch. It incorporates the minor review comments posted on LKML after the patch set was released by Philipp Reisner. For further information, see the several PDF documents mentioned in the DRBD patch posting.

