Unfortunately, EC2 instances aren’t fault-tolerant. Every virtual server runs on top of a host system, and that host can take your virtual server down with it. Here are a few ways that can happen:

If the host hardware fails, it can no longer run the virtual server on top of it.

If the network connection to or from the host is interrupted, the virtual server loses its network connectivity as well.

If the host system loses its power supply, the virtual server goes down with it.

But the software running on top of the virtual server may also cause a crash:

If your software has a memory leak, you’ll run out of memory. It may take a day, a month, a year, or more, but eventually it will happen.

If your software writes to disk and never deletes its data, you’ll run out of disk space sooner or later.

Your application may not handle edge cases properly and crash instead.

Regardless of whether the host system or your software is the cause of a crash, a single EC2 instance is a single point of failure. If you rely on a single EC2 instance, your system will blow up: the only question is when.

Redundancy can remove a single point of failure

Imagine a production line that makes fluffy cloud pies. Producing a fluffy cloud pie requires several production steps (simplified!):

1. Produce a pie crust.
2. Cool down the pie crust.
3. Put the fluffy cloud mass on top of the pie crust.
4. Cool the fluffy cloud pie.
5. Package the fluffy cloud pie.

The current setup is a single production line. The big problem with this setup is that whenever one of the steps crashes, the entire production line must be stopped. Figure 1 illustrates the problem when the second step (cooling the pie crust) crashes. The following steps stop working too, because they no longer receive cooled pie crusts.
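To make the failure mode concrete, here is a minimal, hypothetical model of the single production line: the steps run in order, and a crashed step blocks everything after it. The function name `run_line` and the step labels are illustrative only; nothing here is an AWS API.

```python
# Hypothetical model of the single production line: every step must
# succeed, in order, for a pie to come out at the end.
STEPS = [
    "produce pie crust",
    "cool pie crust",
    "add fluffy cloud mass",
    "cool pie",
    "package pie",
]


def run_line(failed_steps: set) -> tuple:
    """Run the steps in order; return (pie_produced, steps_completed)."""
    completed = []
    for step in STEPS:
        if step in failed_steps:
            # A crashed step blocks every step that comes after it.
            return False, completed
        completed.append(step)
    return True, completed


# The scenario from Figure 1: cooling the pie crust crashes.
print(run_line({"cool pie crust"}))  # (False, ['produce pie crust'])
```

Only the first step finishes; the three steps after the crashed one never get any input, which is exactly the single point of failure described above.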

Why not have multiple production lines? Instead of one line, suppose we have three. If one of the lines fails, the other two can still produce fluffy cloud pies for all the hungry customers in the world. Figure 2 shows the improvements; the only downside is that we need three times as many machines.
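A rough sketch of why three lines help, under a loudly stated assumption: each line fails independently of the others and is up with some probability `line_availability`. The function names and the 0.99 figure are hypothetical illustrations, not measured values. In practice failures can be correlated (for example, lines sharing one power supply), which this model ignores.

```python
# Hypothetical model of redundant production lines.
# Assumption: each line fails independently and is up with
# probability `line_availability`.


def pies_still_produced(lines_up: list) -> bool:
    """Pies keep coming as long as at least one line is running."""
    return any(lines_up)


def availability(line_availability: float, num_lines: int) -> float:
    """Probability that at least one of `num_lines` independent lines is up."""
    return 1 - (1 - line_availability) ** num_lines


# One of three lines has crashed: the other two keep producing.
print(pies_still_produced([True, False, True]))  # True
print(round(availability(0.99, 1), 6))  # 0.99
print(round(availability(0.99, 3), 6))  # 0.999999
```

Under these assumptions, tripling the number of lines turns a 1% chance of no pies into roughly a one-in-a-million chance, at the cost of three times as many machines.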