Let’s talk about Resilience

Surviving downtime and creating critical moment experiences

Fail Whale from Twitter’s Golden Years

When was the last time your company’s product had downtime? How did it affect your customers?

Dev teams usually turn a blind eye to this question. We expect the app to always be up, and when it goes down, we react. Working reactively is convenient for us developers, but it comes at the cost of the user's experience.

If you are new to service and application development, you might be thinking, what else can I do to keep my uptime high?

Let me introduce the concept of Resiliency.

Resiliency is commonly defined as the capacity to recover quickly from failures, which is close to the idea of elasticity. In this article, I’ll discuss client-side and network-related resilience and how to improve your current stack.

I will be using the awesome axios library to illustrate examples below.

const axios = require('axios');

Timeout

Networks are unpredictable beasts. We cannot predict when and how connections will drop. All we can do is prepare for it.

What happens when my app can’t reach an API? How about a slow response?

Usually, developers leave this alone because we expect the user to always be connected to a fast network. This is a dangerous assumption to make, especially when we do not know who our users are.

To prepare for this, always add a timeout to your requests:

async function MakeRequest() {
  try {
    // Give up if the server has not responded within 5 seconds.
    return await axios.get('/slow', {
      timeout: 5000
    });
  } catch (err) {
    // ...
  }
}
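What goes inside that catch block depends on why the request failed. The sketch below is one way to tell a timeout apart from other failures. It relies on the `code`, `response`, and `request` properties that axios attaches to its errors; `describeFailure` and the category names it returns are hypothetical, chosen only for illustration.

```javascript
// Hypothetical helper (not part of axios): classify a failed request so the
// UI can react differently to each case.
function describeFailure(err) {
  // axios reports an exceeded `timeout` with code 'ECONNABORTED' by default;
  // some configurations report 'ETIMEDOUT' instead, so both are checked.
  if (err.code === 'ECONNABORTED' || err.code === 'ETIMEDOUT') {
    return 'timeout';
  }
  // `err.response` is set when the server answered with an error status.
  if (err.response) {
    return err.response.status >= 500 ? 'server-error' : 'client-error';
  }
  // `err.request` without a response means nothing came back at all.
  if (err.request) {
    return 'no-response';
  }
  return 'unknown';
}

// Usage inside the catch block above:
//   } catch (err) {
//     const reason = describeFailure(err); // e.g. show a "slow network" hint
//   }
```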

A timeout ensures that users won’t be left waiting indefinitely for your app to respond. Talk to your UX designer about the various ways to accommodate this scenario. My favourite is to implement…

Retry

What happens when my request fails or times out?

Client and network errors are abundant, and no developer should ignore that fact. There are many scenarios where requests fail, and we have to think about how the app should react.

A good strategy is to implement retries. The usual threshold is to retry 3 times before actually failing.

async function MakeRequest(retry = 0) {
  try {
    return await axios.get('/failing', {
      timeout: 5000
    });
  } catch (err) {
    // Retry up to 3 times before giving up.
    if (retry < 3) {
      return MakeRequest(retry + 1);
    } else {
      // ...
    }
  }
}

This gives the app several attempts to reach an endpoint that failed once.

Work with your UX designer on this scenario to decide how to handle the interim state while requests are being retried, and what to show on the final failure.
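Retrying immediately can pile more load onto a service that is already struggling. A common refinement, not covered in the snippet above, is to wait a little longer before each attempt (exponential backoff). Here is a minimal sketch, assuming the request is passed in as a function; `retryWithBackoff`, `sleep`, and the default values are hypothetical.

```javascript
// Hypothetical helpers for illustration only.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls `requestFn` once, then retries up to `retries` more times,
// doubling the wait between attempts.
async function retryWithBackoff(requestFn, { retries = 3, baseDelayMs = 200 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await requestFn();
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        // With the defaults this waits 200 ms, 400 ms, then 800 ms.
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}

// Usage with axios:
//   const res = await retryWithBackoff(() => axios.get('/failing', { timeout: 5000 }));
```

Passing the request in as a function keeps the retry policy separate from the call itself, so the same helper can wrap any endpoint.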

Fallback

What if my request fails? What should I show to the user?

Inevitably, downtime will occur. Just because you are using AWS doesn’t mean your app won’t fail. Everything fails eventually, so you need a fallback.

async function MakeRequest(retry = 0, fallback = false) {
  try {
    const url = fallback === false ? '/failing' : '/fallback';
    return await axios.get(url, {
      timeout: 5000
    });
  } catch (err) {
    if (retry < 3) {
      // Retry the same endpoint up to 3 times.
      return MakeRequest(retry + 1, fallback);
    } else {
      if (fallback === false) {
        // Primary endpoint exhausted: start over against the fallback.
        return MakeRequest(0, true);
      } else {
        // ...
      }
    }
  }
}

This ensures that the app still receives something when a request fails. The user's experience is not disrupted, because from their perspective nothing necessarily failed.
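The same idea can be generalized to an ordered list of sources, for example a primary endpoint, a fallback endpoint, and finally a static default. The sketch below is one possible shape, not the only one; `firstAvailable` and the `name`/`load` fields are hypothetical names.

```javascript
// Hypothetical helper: try each source in order and report which one answered.
async function firstAvailable(sources) {
  const errors = [];
  for (const { name, load } of sources) {
    try {
      return { source: name, data: await load() };
    } catch (err) {
      // Remember the failure and move on to the next source.
      errors.push(err);
    }
  }
  const failure = new Error('All sources failed');
  failure.causes = errors; // keep the individual errors for logging
  throw failure;
}

// Usage with axios (the last source never fails, so the UI always gets data):
//   const { source, data } = await firstAvailable([
//     { name: 'primary',  load: () => axios.get('/failing',  { timeout: 5000 }) },
//     { name: 'fallback', load: () => axios.get('/fallback', { timeout: 5000 }) },
//     { name: 'default',  load: async () => ({ data: [] }) },
//   ]);
```

Returning the source name alongside the data lets the UI indicate when it is showing fallback content.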

Logging

How do I know which request/screen/API is failing?

This has to be one of the basics that must be covered before writing an app. Ensure that proper monitoring is available in your development and production environments. This covers client-side and server-side.

Enforce centralized logging to your monitoring tools from your server side. Capture all handled exceptions and send them to a searchable log. This will greatly help your debugging efforts when the worst happens and things fail.

Implement client-side error handling as well and, if possible, send those errors to a separate bucket in your monitoring tool. It’ll help you find points of failure you could not have predicted.
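A log is only searchable if its entries have a consistent shape. As a rough sketch, a caught axios error could be flattened into a structured record like this; `toLogEntry` and the field names are hypothetical, and where the record is sent depends on your monitoring tool.

```javascript
// Hypothetical helper: turn a caught error into a flat, searchable record.
// `context` carries app-specific details such as the screen or user action.
function toLogEntry(err, context = {}) {
  return {
    ...context,
    timestamp: new Date().toISOString(),
    level: 'error',
    message: err.message,
    // These properties are set by axios when available.
    code: err.code || null,
    status: err.response ? err.response.status : null,
    url: err.config ? err.config.url : null,
  };
}

// Usage inside a catch block:
//   console.error(JSON.stringify(toLogEntry(err, { screen: 'checkout' })));
```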

Circuit Breaker

What do I do if a service constantly fails?

This point leans more towards the microservices approach, which in my opinion you should avoid in most cases (ask me why in the comments). If you are already in this situation, let our friend Martin Fowler explain it.

In essence, your app needs to be able to prefer performant service nodes over underperforming ones and to protect your service when those nodes are down. One notable benefit is smarter use of resources: you avoid spending CPU and memory on calls that are likely to fail and can allot them somewhere else.

A combination of fallback, retries, and logging is essential to making this work.
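To make the idea concrete, here is a deliberately small sketch of a circuit breaker. It is not Hystrix and not Martin Fowler's implementation; the class, option names, and defaults are assumptions for illustration. After a number of consecutive failures it stops calling the service for a while, then lets a trial call through.

```javascript
// Minimal circuit breaker sketch (hypothetical names and defaults).
class CircuitBreaker {
  constructor(action, { failureThreshold = 3, resetTimeoutMs = 10000, now = Date.now } = {}) {
    this.action = action;
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now; // injectable clock, handy for tests
    this.failures = 0;
    this.openedAt = null; // null means the circuit is closed
  }

  async call(...args) {
    if (this.openedAt !== null && this.now() - this.openedAt < this.resetTimeoutMs) {
      // Open: fail fast without touching the struggling service.
      throw new Error('Circuit is open');
    }
    // Closed, or the reset timeout has elapsed (half-open trial call).
    try {
      const result = await this.action(...args);
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.openedAt = this.now(); // (re)open the circuit
      }
      throw err;
    }
  }
}

// Usage with axios:
//   const breaker = new CircuitBreaker(() => axios.get('/failing', { timeout: 5000 }));
//   const res = await breaker.call();
```

This simplification allows several concurrent trial calls once the reset timeout has passed; production libraries usually limit that to a single call.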

Please read Martin Fowler’s take on it as he is the best to explain. You can also check the Netflix/Hystrix library for reference.

Conclusion

In the end, all of this is a collaboration between developers, architects, and UX designers to deliver a great experience to the user. This is always the ultimate goal.

Always remember that UX does not end with the design. It’s also how users experience your application through different scenarios, be it failures, slowness, or even outages.

I might not have covered every nook and cranny, but I hope it gives you a push in the right direction when architecting your next stack.

Good Luck!

Read more about resilience on the Netflix blog; they have great stuff over there. Don’t forget to check out Hystrix and Chaos Monkey too.
