A distributed system is a collection of independent computers (or nodes) that coordinate their actions by passing messages and that, as a whole, behave as one functional unit to the end user.

For example, when you upload or view a picture in Google Drive, you get the illusion that it is stored in one single place and that you are interacting with one server, but behind the scenes it is a coordinated effort of multiple independent servers.

Distributed systems are eating the world. They are everywhere, and you interact with them almost every day: in your bank, in online shops, in your favorite search engine, in your instant messenger. Even when you share pictures of your lunch with the world (not in her wildest dreams would Lady Ada Lovelace have imagined computers being used for this), it’s a distributed system that makes it happen.

Why Distributed Systems?

The primary reasons are:

Scalability

Reliability

Performance

Scalability:

Everyone wants to see that cute kitten on Reddit and go “aww”, but the hosting server has limited resources and won’t be able to cater to the ever-increasing demand. If this is not fixed soon, the world will be filled with hatred and become a gloomy place.

There are two possible ways to fix this:

Vertical Scaling: Upgrade the server’s resources (processor, storage, bandwidth, etc.) to serve the demand. For a long time we were blessed by Moore’s law, and a faster CPU was always just around the corner, but that era has ended, and a single machine cannot practically keep up with ever-increasing demand.

Horizontal Scaling: Distribute the job of processing or serving across multiple servers, and keep adding more as demand increases.
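
To make horizontal scaling a little more concrete, here is a minimal sketch of a round-robin load balancer in Python. The class name, the server names, and the request strings are all made up for illustration; real load balancers also track server health and load, which this sketch ignores.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hypothetical balancer: hands each request to the next server in the pool."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._rotation = cycle(self.servers)

    def add_server(self, name):
        # Scaling out: grow the pool and restart the rotation over it.
        self.servers.append(name)
        self._rotation = cycle(self.servers)

    def route(self, request):
        # The request content is ignored; only arrival order matters here.
        return next(self._rotation)


balancer = RoundRobinBalancer(["server-a", "server-b"])
first = [balancer.route(f"GET /kitten/{i}") for i in range(4)]

# Demand grows, so we add a third machine instead of buying a bigger one.
balancer.add_server("server-c")
later = [balancer.route(f"GET /kitten/{i}") for i in range(3)]

print(first)
print(later)
```

Adding capacity is just another call to add_server, which is the whole appeal of scaling out rather than up.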

Reliability (Fault Tolerance):

It is the end of the month and your salary has just been credited. You are planning to buy all sorts of stuff, but the server holding the transaction information catches fire and melts, and now you are broke.

To prevent this, we replicate: instead of one server holding this information, tens (or even more) of them hold a copy, and they can sit in different countries or continents to safeguard even against natural disasters.
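
Here is a rough sketch of that idea in Python, assuming a toy in-memory Replica class (hypothetical, as are the location names and the key). Every write goes to all reachable replicas, and a read succeeds as long as at least one replica is still alive.

```python
class Replica:
    """Toy in-memory replica; the name stands in for a data center location."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}

    def write(self, key, value):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value

    def read(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]


def replicated_write(replicas, key, value):
    """Write to every reachable replica and return how many accepted it."""
    acks = 0
    for replica in replicas:
        try:
            replica.write(key, value)
            acks += 1
        except ConnectionError:
            pass  # skip replicas that are down
    return acks


def read_any(replicas, key):
    """Return the value from the first replica that answers."""
    for replica in replicas:
        try:
            return replica.read(key)
        except ConnectionError:
            continue
    raise ConnectionError("no replica reachable")


replicas = [Replica("frankfurt"), Replica("singapore"), Replica("virginia")]
acks = replicated_write(replicas, "salary:alice", 5000)

replicas[0].alive = False  # the first server "catches fire"
balance = read_any(replicas, "salary:alice")
print(acks, balance)
```

The sketch deliberately skips the hard part, which is keeping the copies consistent when writes fail on some replicas.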

Performance:

Now you are back to purchasing stuff, and Amazon wants you and all its other customers to buy more, so it recommends things you might like. To do that, it has to analyze the purchase history of all its customers. Amazon has billions of orders and receives more than 10 million orders every day. No single machine can practically process that much information, so the job is distributed across thousands of computers, each crunching a subset of the total data and influencing you to purchase more.
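
The "each machine crunches a subset" pattern can be sketched in a few lines of Python. This is only an illustration: threads stand in for separate machines, the order data is invented, and the function names are hypothetical. It says nothing about how Amazon actually does it.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(orders, parts):
    """Deal the orders out into `parts` roughly equal chunks."""
    return [orders[i::parts] for i in range(parts)]


def count_purchases(chunk):
    """Work done by one 'machine': count items in its own chunk only."""
    return Counter(item for _customer, item in chunk)


def top_items(orders, workers=3, k=2):
    # Each worker thread plays the role of one computer in the cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_purchases, split(orders, workers)))

    # Merge the partial counts into one global result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total.most_common(k)


orders = [
    ("alice", "book"), ("bob", "book"), ("carol", "lamp"),
    ("dave", "book"), ("erin", "lamp"), ("frank", "mug"),
]
print(top_items(orders))
```

The split, process, and merge steps stay the same whether the workers are threads on a laptop or thousands of servers.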

But… What’s the problem?

Building and managing distributed systems is complicated and hard. A few of the problems are:

Asynchrony:

Distributed systems work by passing messages between independent nodes, and the local clocks of those nodes may or may not be in sync. The network connecting the nodes is unreliable and can have unbounded delays. Because of this, the order in which messages arrive is not guaranteed, and delivery time is unpredictable.
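
A tiny simulation shows how delays alone can reorder messages. The send times, delays, and payloads below are invented for illustration, and the deliver function is a hypothetical stand-in for the network.

```python
def deliver(messages):
    """Return payloads in arrival order.

    Each message is (send_time, network_delay, payload); a message
    arrives at send_time + network_delay.
    """
    by_arrival = sorted(messages, key=lambda m: m[0] + m[1])
    return [payload for _sent, _delay, payload in by_arrival]


sent = [
    (0, 50, "deposit 100"),  # sent first, but takes a slow route
    (1, 5, "withdraw 80"),   # sent second, arrives first
]
received = deliver(sent)
print(received)
```

The receiver sees the withdrawal before the deposit, even though the sender issued them the other way around.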

Food for thought:

Two generals have to agree on a time to attack. They can communicate only through messengers, who can be killed or arbitrarily detained; no other form of communication is allowed. If either general attacks alone, his army will be destroyed.

Design a protocol to coordinate a successful attack.

Partial Failures:

As Leslie Lamport describes it:

A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.

Computers fail often. Even if a reliable computer fails (stops working, loses its data, etc.) only once a year, with 1,000 of them (a large data center can have on the order of 100,000) you will see on average about three crashes on any given day. Automatically detecting these crashes with certainty is also practically impossible: because network delays are unbounded, we cannot tell whether a node has crashed or is just slow.
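
The back-of-the-envelope estimate can be checked in a few lines of Python. It assumes failures are independent and spread evenly over a 365-day year; the function name is made up for this sketch.

```python
def expected_crashes_per_day(machines, failures_per_machine_per_year=1.0):
    """Average number of crashes per day across the whole fleet."""
    return machines * failures_per_machine_per_year / 365


small = expected_crashes_per_day(1_000)    # 1000 / 365, roughly 2.7
large = expected_crashes_per_day(100_000)  # 100000 / 365, roughly 274

print(f"{small:.1f} crashes/day with 1,000 machines")
print(f"{large:.0f} crashes/day with 100,000 machines")
```

So with 1,000 machines a crash is a daily event, and at 100,000 machines it happens every few minutes.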

Concurrency And Consistency:

The distributed system as a whole has a global state that is constantly being read and changed, and coordinating these concurrent reads and writes is hard. On top of that, this state is replicated and cached across many nodes, and keeping all the copies in sync gives rise to a different set of problems. One of them is commonly known as “split-brain syndrome” (of computers).

Let’s say your banking system runs on a cluster of three nodes, and assume they are highly available and serve all requests at all times. When any node receives a change request on an account (debit or credit), it propagates the change to the other two nodes over the network, and everything works fine. Now imagine a network partition: node 1 and node 2 cannot talk to node 3 and vice versa, but all three remain connected to the internet and keep receiving requests. If node 1 receives a debit request and node 3 receives another debit request on the same account, both will process their transaction independently. If the two debits together add up to more than the balance, the system ends up in an inconsistent and hard-to-recover state, even after the network partition is fixed.
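
The scenario can be replayed with a small Python simulation. The Node class, the connect helper, the starting balance of 100, and the debit amounts are all assumptions made for this sketch; a real banking system would use a consensus or quorum protocol precisely to avoid this.

```python
class Node:
    """Toy replica holding one account balance."""

    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.peers = []     # nodes currently reachable over the network
        self.accepted = []  # debits this node accepted from clients

    def debit(self, amount):
        # Each node checks only its own local view of the balance.
        if amount > self.balance:
            return False
        self.accepted.append(amount)
        self.balance -= amount
        for peer in self.peers:
            peer.balance -= amount  # propagate to reachable peers only
        return True


def connect(groups):
    """Make nodes in the same group reachable to each other, and no one else."""
    for group in groups:
        for node in group:
            node.peers = [p for p in group if p is not node]


n1, n2, n3 = (Node(name, 100) for name in ("node1", "node2", "node3"))

# Network partition: node1 and node2 on one side, node3 on the other.
connect([[n1, n2], [n3]])

ok_a = n1.debit(70)  # accepted: node1 sees a balance of 100
ok_b = n3.debit(60)  # also accepted: node3 still sees 100 too

total_debited = sum(n1.accepted) + sum(n3.accepted)
merged_balance = 100 - total_debited
print(n1.balance, n2.balance, n3.balance, merged_balance)
```

Both debits succeed, the two sides disagree on the balance, and merging the histories after the partition heals leaves the account overdrawn.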

That’s all! I hope that was helpful. Thanks for reading.