Ethereum does it too! A deep dive into DHT

The why and the what of DHT

Foreword

If you are either a blockchain developer or maybe just a fan, then you are probably familiar with the term DHT, i.e. Distributed Hash Table. In this article, we want to guide you through the elegance that characterizes the design of DHT. To do this, we will provide two classic examples of implementation: Chord and Kademlia.

Passing notes

Let’s start with a simple example to understand the use and implementation of DHT. A group of students in a class wants to pass notes to share the results of a test. Suppose we want to meet the following requirements:

Request: In your class, pass a note to classmate number 87 asking for their test score.

The following instructions on how to pass the note are what we could refer to as a “Note-passing protocol”:

Note-passing Protocol
1. The front of the note must show the target T, while the back must show the request.
2. The distance between the current holder S and the target T is the straight-line distance between the two students’ seats.
3. The neighbors of T know T’s score.
4. S only knows its neighbors’ scores.
5. S can only pass the note to a receiver R chosen among S’s neighbors, and the distance between R and T must be less than the distance between S and T.

When these rules are followed, a score query can proceed as follows:

Score query
1. The front of the note reads “Give 87” and the back reads “How many points did you score?”
2. Anyone who knows 87’s score can write it on the back of the note.
3. Each student can receive the note only once while it travels toward student 87.
4. After student 87 or one of 87’s neighbors has written the score, the note is passed back to the original sender along the same route.
5. The original sender opens the note, whose back now reads “87 points”. The sender has thus learned that student 87 scored 87 points.

What is DHT?

The note-passing protocol is an extremely simplified version of how a DHT works. A DHT is, in fact, a mechanism for relaying messages through a specific routing scheme. As the name suggests, it is a distributed hash table, and a hash table is a data structure that can read and write data efficiently. The name “hash table” comes from the hash function it uses, which maps any piece of data to a fixed-length, random-looking string. Because the function is one-way and its output is, for practical purposes, unique to its input, that string works like a fingerprint identifying the data, which we will refer to as a key. To read data from the hash table, all you need to do is provide the key, and the hash table returns the complete data mapped to that key.
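To make the key idea concrete, here is a minimal sketch of an ordinary, single-machine hash table whose keys are derived from the data itself. It assumes SHA-1 as the hash function (chosen only because it yields 160-bit keys); the helper names `make_key`, `put`, and `get` are made up for illustration and do not come from any particular DHT.

```python
import hashlib

def make_key(data: bytes) -> str:
    """Map arbitrary data to a fixed-length key (40 hex chars = 160 bits)."""
    return hashlib.sha1(data).hexdigest()

# A plain in-memory hash table; a DHT spreads this table over many nodes.
table = {}

def put(data: bytes) -> str:
    """Store data under its own fingerprint and return that key."""
    key = make_key(data)
    table[key] = data
    return key

def get(key: str) -> bytes:
    """Return the complete data mapped to the given key."""
    return table[key]

key = put(b"student 87 scored 87 points")
print(len(key))   # 40
print(get(key))   # b'student 87 scored 87 points'
```

Whatever the size of the input, the key always has the same length, which is what lets a DHT place data and nodes in one shared identifier space.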

Using a hash table works like the students checking their test scores following their seat numbers: once you know the key (seat number), reading the content (test score) is easy.

DHT has more “decentralized” features with respect to passing a note in a class, though. Just like each student only needs to know the test scores of their neighbors, in DHT you don’t need to know the keys of the whole network; you can just pass the note on and on until the receiver is finally reached through a series of connections.

How does DHT work?

To achieve decentralization, you need a message relay mechanism like the one used in the note-passing protocol. This mechanism is called key-based routing. The neighbor list of each participant in the network, which determines how messages are relayed, is called the routing table, and the participants are called nodes. Each node owns a unique number called the node ID, which, together with the node’s key, forms the set of the node’s identifiers.

As mentioned, the note-passing protocol works like a DHT, only in a very simplified way. The core characteristics that differentiate the two are the following:

Definition of distance: The distance in the note-passing protocol is the straight-line distance in space, while the distance in a DHT is a logical distance. For example, Kademlia defines the distance between two identifiers as the result of XORing them.
Content of the routing table: In the note-passing protocol, the routing table only contains the physical neighbors, while in a DHT the routing table records nodes in different distance intervals separately.
Routing table update mechanism: The neighbors in the note-passing protocol are fixed, so the routing table never needs to be updated. The routing table in a DHT is dynamic, because members can join or leave the network at any time, so an update mechanism must be put in place.
Efficiency: The query time of the note-passing protocol is proportional to the distance between the sender and the target: if the distance doubles, the query time also doubles. A DHT, on the other hand, is very efficient: if the distance grows N times, the query time grows only on the order of log N.

Why put so much effort in inventing DHT?

DHT is built on an overlay network, a structure known as a peer-to-peer network. It has the following important features:

Openness: Participants can dynamically join and leave the network.
Use of idle resources: Participants’ spare resources, which would otherwise go unused, can be put to work.
Decentralization: No central coordinator is required for the system to operate.
Scalability: Routing information is distributed roughly evenly across participants, which increases the total demand the network can handle as a whole. As the number of participants grows, so does the processing capacity available to serve them.

Taken together, DHT was invented as a more decentralized and scalable answer to an Internet with increasingly inadequate bandwidth and excessive centralization.

What can DHT do?

File sharing: DHT-based BitTorrent services enable more efficient upload and download of large files.
Domain name lookup: DHT-based DNS services offer better load balancing.
Censorship-resistant communication: DHT-based communication software, thanks to its decentralized, peer-to-peer transmission, can resist censorship of message content by specific groups.

In Ethereum, DHT is only used as an efficient peer selection mechanism.

DHT classics: Chord & Kademlia

The overarching objective of DHT is using message relay to meet users’ requests. Under such a framework, each DHT design scheme differs with respect to the following characteristics:

Definition of logical distance
Composition of the routing table
Routing table update mechanism

Let’s take a look at the two best known DHTs, Chord and Kademlia.

Chord

The first paper describing Chord was published in 2001 by Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. According to Google Scholar statistics, the paper has been cited more than 13,000 times since its publication. It is therefore safe to say that Chord is one of the most popular academic research projects in the DHT field.

We call the collection of all node identifiers the Key Space. We denote the number of bits of the identifier by m. If m=160, the size of the key space is 2¹⁶⁰.

Chord represents the key space as a ring, which it calls the Identifier Ring. Identifiers increase clockwise, and the distance from one point to another is the number of identifiers traversed going clockwise. Because distance is only ever measured clockwise, it is asymmetric. For example, a Chord ring with a six-bit key length has 64 identifiers; the distance from ID 21 to ID 30 is 9, but the distance from ID 30 to ID 21 is 55.
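The clockwise distance can be written as a single modular subtraction. The short sketch below reproduces the six-bit example; the function name `ring_distance` is just an illustrative choice.

```python
def ring_distance(start: int, end: int, m: int) -> int:
    """Clockwise distance from start to end on a ring of 2**m identifiers."""
    return (end - start) % (2 ** m)

print(ring_distance(21, 30, m=6))  # 9
print(ring_distance(30, 21, m=6))  # 55
```

The two results always add up to the ring size (here 64) unless the two identifiers are equal, which is exactly why the distance is not symmetric.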

For any node on the ring, the first node encountered going clockwise is called its successor, while the first node going counterclockwise is called its predecessor. In the note-passing protocol, we specified a key (seat number), and its surrounding neighbors were responsible for passing on the content (score). Chord works differently: given any key, the key’s successor, that is, the first node at or after the key going clockwise, is responsible for storing the data for that key. As shown in the figure below, N14 is responsible for writing/reading data related to K10; N32 is responsible for writing/reading data related to K24 and K30.
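The responsibility rule can be sketched in a few lines. The example assumes a six-bit ring and a node list of our own choosing; only N8, N14, N21 and N32 are named in this article, and the remaining node IDs are assumptions added to fill the ring.

```python
import bisect

M = 6  # bits per identifier, so the ring has 2**6 = 64 positions
# Assumed node IDs; only 8, 14, 21 and 32 are mentioned in the article.
NODES = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])

def successor(key: int) -> int:
    """First node at or after key going clockwise (wrapping past zero)."""
    key %= 2 ** M
    i = bisect.bisect_left(NODES, key)
    return NODES[i % len(NODES)]

print(successor(10))  # 14
print(successor(24))  # 32
print(successor(30))  # 32
print(successor(60))  # 1  (wraps around the ring)
```

A key that coincides with a node ID is stored on that very node, and a key beyond the highest node ID wraps around to the lowest one.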

Just as anyone in the note-passing protocol can pass a note asking for other people’s scores, any node on the ring can initiate a query for a key. If the node does not know the result of the query, the query is relayed to the next node.

In the note-passing protocol, if the person receiving the note does not know the target’s score, all they can do is pass the note on to a neighbor. Chord supports a more efficient way of relaying messages. Chord’s routing table, also called the Finger Table, holds up to m entries, one per distance interval. The secret behind Chord’s efficiency is that the node in the i-th column of the routing table is the successor of the key at distance 2^(i-1) from the node itself.

The following figure can be taken as an example to understand this tricky concept better. In the routing table of N8, the node in the first column is N14, which is the successor of N8+2^0; the node in the second column is also N14, the successor of N8+2^1; the node in the fourth column is N21, the successor of N8+2^3. Thus, the farther away the target is, the larger the steps with which the relay advances; the closer it is, the shorter the steps.
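The routing table of N8 can be recomputed with a short sketch. As before, the six-bit ring and the full node list are assumptions (only N8, N14, N21 and N32 appear in this article), and the table is built from a global view of the ring, which a real node would not have.

```python
import bisect

M = 6  # bits per identifier
# Assumed node IDs; only 8, 14, 21 and 32 are mentioned in the article.
NODES = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])

def successor(key: int) -> int:
    """First node at or after key going clockwise (wrapping past zero)."""
    i = bisect.bisect_left(NODES, key % 2 ** M)
    return NODES[i % len(NODES)]

def finger_table(n: int) -> list[int]:
    """Column i (1-based) holds successor(n + 2**(i-1)) on the ring."""
    return [successor(n + 2 ** (i - 1)) for i in range(1, M + 1)]

for i, node in enumerate(finger_table(8), start=1):
    print(f"column {i}: N8+{2 ** (i - 1)} -> N{node}")
# column 1: N8+1 -> N14
# column 2: N8+2 -> N14
# column 3: N8+4 -> N14
# column 4: N8+8 -> N21
# column 5: N8+16 -> N32
# column 6: N8+32 -> N42
```

Each column doubles the reach of the previous one, so six entries are enough to cover a ring of 64 identifiers.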

Given these routing table characteristics, the node receiving a query first checks whether the node in the first column of its routing table is the successor of the target key and, if so, asks that node for the content stored under the key. If not, the query is passed on to the routing table entry that most closely precedes the key, and the process repeats until it reaches a node whose first column holds the successor of the queried key. As shown on the right side of the following figure, the mentioned routing system works as follows:
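The relay described above can be simulated on a static snapshot of the ring. This is a simplified sketch, not Chord's actual implementation: the node list is assumed (only N8, N14, N21 and N32 appear in this article), every routing table is derived from a global view, and the names `lookup` and `in_arc` are hypothetical.

```python
import bisect

M = 6
SIZE = 2 ** M
# Assumed node IDs; only 8, 14, 21 and 32 are mentioned in the article.
NODES = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])

def successor(key: int) -> int:
    i = bisect.bisect_left(NODES, key % SIZE)
    return NODES[i % len(NODES)]

def finger_table(n: int) -> list[int]:
    return [successor(n + 2 ** (i - 1)) for i in range(1, M + 1)]

def in_arc(x: int, a: int, b: int, closed_right: bool = False) -> bool:
    """True if x lies on the clockwise arc (a, b), or (a, b] if closed_right."""
    if closed_right and x == b:
        return True
    if a < b:
        return a < x < b
    return x > a or x < b  # the arc wraps past zero

def lookup(start: int, key: int) -> tuple[int, list[int]]:
    """Return the node responsible for key and the nodes the query visited."""
    n, path = start, [start]
    while True:
        fingers = finger_table(n)
        first = fingers[0]  # first column: n's immediate successor
        if in_arc(key, n, first, closed_right=True):
            return first, path
        # Otherwise jump to the farthest entry that still precedes the key.
        n = next(f for f in reversed(fingers) if in_arc(f, n, key))
        path.append(n)

print(lookup(8, 54))  # (56, [8, 42, 51])
print(lookup(8, 10))  # (14, [8])
```

In this snapshot, a query for K54 starting at N8 visits only N42 and N51 before learning that N56 holds the key, instead of walking through every node in between.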

When a new node joins the network, in addition to calculating its own routing table, the new node needs to update its predecessor’s data to ensure that the message relay is correct. The process works as follows:

Kademlia

Kademlia was first described in a 2002 paper by Petar Maymounkov and David Mazières. Although Chord is undoubtedly the academics’ favorite, Kademlia’s practicality makes it the first choice in real-life environments. To provide some examples, BitTorrent is the first large-scale file-sharing application of DHT. Ethereum also uses Kademlia as a node selection mechanism, and IPFS adopted the improved S/Kademlia at its core.

Kademlia’s key-value space can be represented as a binary tree. The identifier can be represented as a leaf, and the distance between identifiers is defined as the result of an XOR operation.

In other words, the distance between two identifiers can be regarded as the degree of difference between them. When every bit of the two identifiers is opposite, the difference is the largest and therefore the distance is also the greatest. On the other hand, when two identifiers are identical, their distance is zero, as there is no difference between them. Kademlia’s concept of distance is very similar to Chord’s. The only difference is that Kademlia’s XOR distance is symmetric, while Chord’s is not: on Chord, the distance depends on which point is the start and which is the end.
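A few lines are enough to check these properties of the XOR distance. The four-bit identifiers below are made-up values chosen to keep the numbers small.

```python
def xor_distance(a: int, b: int) -> int:
    """Kademlia-style distance: the XOR of two identifiers read as an integer."""
    return a ^ b

a, b = 0b0101, 0b1010  # every bit differs
print(xor_distance(a, b))                         # 15, the maximum for 4 bits
print(xor_distance(a, a))                         # 0, identical identifiers
print(xor_distance(a, b) == xor_distance(b, a))   # True, symmetric
```

Compare this with the Chord ring, where swapping the start and end points changes the result.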

Just as in the case of Chord, Kademlia’s routing table is divided into a number of different distance ranges. However, some features differentiate Kademlia’s routing table, namely:

Chord’s routing table includes a single node for each distance interval. Kademlia’s routing table instead uses what are generally referred to as k-buckets, which, as the name suggests, contain up to k nodes each. Each k-bucket corresponds to one distance interval.
Where Chord uses the successor and predecessor concept to define which node stores which data, Kademlia takes a substantially different approach: the data belonging to a key is stored by the k nodes closest to it. This is also one of the most compelling reasons why Kademlia is substantially more robust than Chord, as its redundancy is k times higher than Chord’s.
Given the nature of XOR, each k-bucket can be treated as a subtree. Kademlia’s query process therefore resembles jumping from one subtree to another until the subtree closest to the target is reached. Like Chord, Kademlia thus performs an efficient, binary-search-like lookup.
The symmetry of Kademlia’s routing distance allows each new node joining the network to update the routing tables of its neighbors right away, significantly increasing network efficiency. Furthermore, this newly updated node will appear first in the k-bucket query, which again increases the robustness of Kademlia.
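The link between XOR distance, k-buckets, and the k closest nodes can be sketched as follows. The sketch assumes that bucket i covers distances from 2^i up to (but not including) 2^(i+1); the function names and the four-bit node IDs are illustrative assumptions, not taken from any Kademlia implementation.

```python
def bucket_index(own_id: int, other_id: int) -> int:
    """Index of the k-bucket for other_id: position of the highest differing bit."""
    distance = own_id ^ other_id
    if distance == 0:
        raise ValueError("a node does not keep itself in a bucket")
    return distance.bit_length() - 1

def k_closest(target: int, node_ids: list[int], k: int) -> list[int]:
    """The k known nodes with the smallest XOR distance to target."""
    return sorted(node_ids, key=lambda n: n ^ target)[:k]

own = 0b0110
print(bucket_index(own, 0b0111))  # 0: differs only in the lowest bit
print(bucket_index(own, 0b1110))  # 3: differs in the highest of 4 bits

known = [1, 4, 7, 12, 15]            # assumed node IDs
print(k_closest(own, known, k=2))    # [7, 4]
```

All nodes falling into the same bucket share the same prefix with the local node up to the differing bit, which is why each bucket can be pictured as one subtree of the binary tree.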

Kademlia’s simplicity and robustness make it quite a popular protocol among developers. To mention a well-known example of implementation, Ethereum uses Kademlia as a node selection mechanism for its gossip protocol.

Conclusion

Let’s go back again to our note-passing protocol between classmates.

Note-passing Protocol
1. The front of the note must show the target T, while the back must show the request.
2. The distance between the current holder S and the target T is the straight-line distance between the two students’ seats.
3. The neighbors of T know T’s score.
4. S only knows its neighbors’ scores.
5. S can only pass the note to a receiver R chosen among S’s neighbors, and the distance between R and T must be less than the distance between S and T.

DHT achieves decentralization and scalability by pairing the routing table with its message relay mechanism. However, the open nature of DHT also leaves it prone to malicious attacks.

What are the weaknesses of DHT? Is DHT fit for decentralized ledger technologies? Let’s find out in the next article.