Why Use Erlang?

It isn't obvious what makes the language good, and there are major barriers to entry.

The syntax is unfamiliar. (Not C-like.)

Application design is alien. (Not OOP.)

Only skilled programmers can write it. (There aren't enough examples on Google for amateurs to copy-paste together into applications.)

And yet, it's capable of extremely impressive things.

Massive code bases and ridiculous uptime at Ericsson.

Huge loads at Facebook.

Effective distributed infrastructure at Heroku and GitHub.

Rapid growth for Call of Duty 4 backends.

How does such an obscure language serve so well under fire?

The Trick

It's not the syntax. Pattern matching and bit syntax are great, but not the secret to success.

It's not the code organization. Erlang does fine with a flat module namespace, using that simplicity to do some nice (but non-essential) tricks with hot code reloading.

It's not functional programming, although referential transparency lets you reproduce bugs from a single stack trace.

It's not even amazing programmers: the community is small, and a large amount of impressive code is written by relative amateurs. While savants are struggling to compile PHP into C++ or optimize Ruby, average programmers are getting work done on an old and fairly static VM.

The trick is built into the very foundation of the language, and its effects are subtle. You notice them when you run your first Erlang service. Its ability to take abuse is uncanny: bad input, unanticipated edge cases, connection failures, dumb mistakes, deadlocks...

No matter what, your poorly written code recovers. It somehow runs for weeks on end with no intervention, and it's hard to understand why.

Services in Software as a Service Systems

To understand the trick, let's look at a familiar example.

A recent presentation from GitHub described their internal services. Let's imagine what they could look like, supervised by admins and a process monitor like god. If a service fails, its supervisor restarts it. Supervisors have supervisors too: even admins have someone watching them.

Most services are independent. If metrics is down, chat still works so the problem may be discussed, and patches can be pushed out using the chatbot and deploy service.

Even the ones with dependencies are fairly isolated. If the database must be restarted, only a small window of writes from logging and metrics will be lost.

If temporary data loss is unacceptable, monitoring can be added between dependencies. Logging monitors the database. If it's down, logging can enter a reduced-functionality mode, cache incoming writes until the problem is resolved, or otherwise handle the fault.
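Inside a single Erlang VM, the same idea can be sketched with erlang:monitor/2. This is a minimal illustration, not how any real logging service is written: the module name monitor_sketch, the database/0 and logging/2 functions, and the power_outage reason are all hypothetical stand-ins.

```erlang
-module(monitor_sketch).
-export([demo/0]).

%% Hypothetical stand-in for the database: idles until told to crash.
database() ->
    receive
        crash -> exit(power_outage)
    end.

%% Hypothetical logging service: monitors the database and reports that it
%% has switched to a degraded mode when the database goes down.
logging(Parent, Db) ->
    Ref = erlang:monitor(process, Db),
    Parent ! monitoring,
    receive
        {'DOWN', Ref, process, Db, Reason} ->
            %% The same message arrives whatever the cause of the failure.
            Parent ! {logging_degraded, Reason}
    end.

demo() ->
    Self = self(),
    Db = spawn(fun database/0),
    spawn(fun() -> logging(Self, Db) end),
    %% Wait until the monitor is in place before breaking the database.
    receive monitoring -> ok end,
    Db ! crash,
    receive
        {logging_degraded, Reason} -> {degraded, Reason}
    after 1000 ->
        timeout
    end.
```

Calling monitor_sketch:demo() should return {degraded, power_outage}. The logging process never inspects how the database failed; it only reacts to the 'DOWN' message.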

Monitoring handles every kind of fault the same way! It doesn't matter whether the database is down due to a bug, power outage, network partition, or meteor strike. The fault is still detected and handled.

A stronger version of monitoring is linking. If the chatbot is down, deployment can't be accessed. If deployment is down, the chatbot is useless. Failure in one should cause the other to shut down, minimizing error messages.
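In Erlang terms, a link can be sketched with spawn_link/1. The names below (link_sketch, deployment/0, chatbot/1, the deploy_failed reason) are hypothetical and only mirror the example in the text; the outer monitor exists just so the demo can observe what happens.

```erlang
-module(link_sketch).
-export([demo/0]).

%% Hypothetical deployment service: idles until told to fail.
deployment() ->
    receive
        fail -> exit(deploy_failed)
    end.

%% Hypothetical chatbot: links itself to deployment, then idles.
chatbot(Parent) ->
    Deploy = spawn_link(fun deployment/0),
    Parent ! {deployment, Deploy},
    receive
        stop -> ok
    end.

demo() ->
    Self = self(),
    %% Monitor the chatbot so the demo can see it go down.
    {Chatbot, Ref} = spawn_monitor(fun() -> chatbot(Self) end),
    Deploy = receive {deployment, Pid} -> Pid end,
    Deploy ! fail,
    receive
        {'DOWN', Ref, process, Chatbot, Reason} -> {chatbot_down, Reason}
    after 1000 ->
        timeout
    end.
```

Calling link_sketch:demo() should return {chatbot_down, deploy_failed}: the chatbot contains no error-handling code at all, yet it exits with the same reason as the deployment process it was linked to.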

What have we got?

We've got an architecture made of loosely coupled, isolated services. They monitor and link to each other, so every kind of failure is handled the same way. Distributed across many redundant machines in multiple locations, they form an extremely robust system. Parts can fail, but others will notice and restart them.

Isn't that nice? Wouldn't it be cool if all software were this robust?

Software is Made of Services

Erlang takes the techniques that build robust systems in the large, and applies them to the smallest tasks. Actors (Erlang processes) are extremely cheap to create, and message passing inside the VM is nearly instant. You can design a program like you would a large system, and reap the same benefits!
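To give a feel for how cheap this is, here is a small sketch that starts one process per number and collects the replies by message passing. The module name spawn_sketch and the squaring task are made up for illustration.

```erlang
-module(spawn_sketch).
-export([demo/1]).

%% Spawn N short-lived processes, have each send back the square of its
%% number, and sum the replies.
demo(N) ->
    Self = self(),
    Pids = [spawn(fun() -> Self ! {self(), I * I} end) || I <- lists:seq(1, N)],
    %% Pid is already bound here, so each receive waits for that process's reply.
    Squares = [receive {Pid, Sq} -> Sq end || Pid <- Pids],
    lists:sum(Squares).
```

spawn_sketch:demo(3) should return 14. Trying it with a few thousand processes is a quick way to see that creating them is not something to ration.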

For example, let's pretend the chatbot is written in Erlang.

Each new command is handled in a new actor, so bugs and bad input don't affect anything. Commands are supervised, and can be retried.
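A rough sketch of the one-process-per-command idea follows. The command strings and the handle/1 function are hypothetical; a real chatbot would parse and dispatch differently, and would normally lean on OTP supervisors rather than a hand-written receive.

```erlang
-module(command_sketch).
-export([run/1, demo/0]).

%% Run one command in its own process and report how it went.
run(Input) ->
    Self = self(),
    {Pid, Ref} = spawn_monitor(fun() -> Self ! {result, self(), handle(Input)} end),
    receive
        {result, Pid, Value} ->
            %% Drop the monitor and any 'DOWN' message already queued.
            erlang:demonitor(Ref, [flush]),
            {ok, Value};
        {'DOWN', Ref, process, Pid, Reason} ->
            %% The command crashed; the caller is unaffected.
            {error, Reason}
    end.

%% Hypothetical commands. Anything else crashes with function_clause.
handle("ping") -> pong;
handle("deploy " ++ Target) -> {deploying, Target}.

demo() ->
    [run("ping"), run("bogus"), run("deploy web")].
```

In command_sketch:demo(), the "bogus" command crashes its own process (the VM will typically log an error report), comes back as {error, {function_clause, ...}}, and the following "deploy web" command still runs normally.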

The chatroom service that starts new commands monitors the TCP connection for failure, caching and re-trying during connection problems.

The Chuck Norris joke fetching and ranking service is completely isolated from the important stuff, and doesn't affect critical components.

If the deployment actor crashes (probably due to external failure), the root supervisor will try to restart it a few times. If that fails, the supervisor itself crashes and takes the chatbot down with it, which effectively links the chatbot to the deployment service.
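With OTP, "restart it a few times, then give up" is expressed as a supervisor's restart intensity. The sketch below assumes a single hypothetical deployment worker and arbitrary limits (at most 3 restarts within 5 seconds); the module and function names are invented for illustration.

```erlang
-module(chatbot_sup_sketch).
-behaviour(supervisor).
-export([start_link/0, init/1, start_deployment/0, demo/0]).

start_link() ->
    supervisor:start_link(?MODULE, []).

%% Supervisor callback: restart the worker when it dies, but give up (and
%% exit) if more than 3 restarts are needed within 5 seconds.
init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 3, period => 5},
    Deployment = #{id => deployment,
                   start => {?MODULE, start_deployment, []},
                   restart => permanent},
    {ok, {SupFlags, [Deployment]}}.

%% Hypothetical deployment worker: a linked process that just idles.
start_deployment() ->
    {ok, proc_lib:spawn_link(fun() -> receive stop -> ok end end)}.

%% Kill the worker once and check that the supervisor replaced it.
demo() ->
    {ok, Sup} = start_link(),
    [{deployment, Old, worker, _}] = supervisor:which_children(Sup),
    exit(Old, kill),
    timer:sleep(100),   % give the supervisor a moment to restart the child
    [{deployment, New, worker, _}] = supervisor:which_children(Sup),
    {restarted, is_pid(New) andalso New =/= Old}.
```

chatbot_sup_sketch:demo() should return {restarted, true}. The supervisor is left running and linked to the caller. Killing the worker more than three times within five seconds would make the supervisor exit as well, and that exit would propagate to whatever is linked to it.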

You get the properties of a large, well-designed system at the scale of a single OS process. More importantly, these benefits are baked in: this is the default way Erlang code is written. Even poorly written Erlang code can recover from failures, because everything is done by small, isolated services which can be restarted into a (hopefully) clean state.

This is unique.