Reliability in distributed systems is determined by the weakest component, so even a minor internal function could bring your entire system down. Learn how stability patterns anticipate the hot spots of distributed network behavior, then see five patterns applied to RESTful transactions in Jersey and RESTEasy.

Implementing highly available and reliable distributed systems means learning to expect the unexpected. Sooner or later, if you're running a larger software system, you will be faced with incidents in the production environment. Typically you will find two different types of bugs. The first type is related to functional issues like calculation errors or errors in handling and interpreting data. These bugs are easy to reproduce and will usually be detected and fixed before a software system goes into production.

The second type of bug is more challenging because it only becomes activated under specific infrastructure conditions. These bugs are harder to identify and reproduce and will typically not be found during testing; instead, you'll most likely encounter them in the production environment. Better testing and software quality assurance techniques like code reviews and automated tests will increase your chance of eliminating these bugs when you find them; they won't, however, ensure that your code is bug free.

At worst, bugs in your code could trigger a cascade of errors within the system, potentially leading to a serious system failure. This is especially true for distributed systems where services are shared between other services and clients.

Stabilizing network behavior in distributed systems

The number one hot spot for serious system failure is network communication. Unfortunately, architects and designers of distributed systems are often incorrect in their assumptions about network behavior. Twenty years ago, L. Peter Deutsch and others at Sun Microsystems documented the Fallacies of Distributed Computing, which are still pervasive:

The network is reliable
Latency is zero
Bandwidth is infinite
The network is secure
Topology doesn't change
There is one administrator
Transport cost is zero
The network is homogeneous

Many developers today rely on RESTful systems to address the challenges of network communication in distributed systems. An important characteristic of REST is that it doesn't hide the limitations of network communication behind high-level RPC stubs. But RESTful interfaces and endpoints alone don't ensure that the system is inherently stable; you still have to do more.

In this article I introduce four stability patterns that address common failures in distributed systems. My examples focus on RESTful endpoints, but the patterns can also be applied to other communication endpoints. The patterns demonstrated here were introduced by Michael Nygard in his book, Release It! Design and Deploy Production-Ready Software. The sample code and demos are my own.

Download the source code for this article. Source code for "Stability patterns applied in a RESTful architecture." October 2014. Created by Gregor Roth for JavaWorld.

Stability patterns applied

Stability patterns are used to promote resiliency in distributed systems, using what we know about the hot spots of network behavior to protect our systems against failure. The patterns I introduce in this article are designed to protect distributed systems against common failures in network communication, where integration points such as sockets, remote procedure calls, and database calls (that is, remote calls that are hidden by the database driver) are the first risk to system stability. Using these patterns can prevent a distributed system from shutting down just because some part of the system has failed.

The Webshop demo

In general, online electronic payment systems do not have data about new customers. Instead, these systems often perform an external online credit-score check based on a new user's address data. The Webshop demo application determines which payment methods (such as credit card, PayPal account, pre-payment, or invoice) will be accepted based on the user's credit score.

This demo addresses a key scenario: What happens if a credit check fails? Should the order be rejected? In most cases a payment system will fall back on accepting only more reliable payment methods. Handling this external component failure is both a technology and a business decision; it requires weighing the tradeoff between losing orders and the possibility of a payment default.

Figure 1 shows a system overview of the Webshop demo.

Figure 1. A flow diagram of the electronic payment system

To determine the payment method, the Webshop application uses a payment service internally. The payment service provides functionality to get payment information and to determine the payment methods for a given user. In this example the services are implemented in a RESTful way, meaning that HTTP methods such as GET or POST are used explicitly and service resources are addressed by URIs. This approach is also reflected by the JAX-RS 2.0 annotations in the code examples. The JAX-RS 2.0 specification defines a REST binding for Java and is part of the Java Platform, Enterprise Edition.

Listing 1. Determining the payment methods

@Singleton
@Path("/")
public class PaymentService {
    // ...
    private final PaymentDao paymentDao;
    private final URI creditScoreURI;

    // maps a credit score to the set of accepted payment methods
    private final static Function<Score, ImmutableSet<PaymentMethod>> SCORE_TO_PAYMENTMETHOD = score -> {
        switch (score) {
            case POSITIVE:   // enum constants in case labels are written unqualified
                return ImmutableSet.of(CREDITCARD, PAYPAL, PREPAYMENT, INVOCE);
            case NEGATIVE:
                return ImmutableSet.of(PREPAYMENT);
            default:
                return ImmutableSet.of(CREDITCARD, PAYPAL, PREPAYMENT);
        }
    };

    @Path("paymentmethods")
    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public ImmutableSet<PaymentMethod> getPaymentMethods(@QueryParam("addr") String address) {
        Score score = Score.NEUTRAL;   // fallback if an integration point fails
        try {
            ImmutableList<Payment> payments = paymentDao.getPayments(address, 50);
            score = payments.isEmpty()
                    // no payment history: ask the external credit-score service
                    ? restClient.target(creditScoreURI).queryParam("addr", address).request().get(Score.class)
                    // otherwise derive the score from the payment history
                    : (payments.stream().filter(payment -> payment.isDelayed()).count() >= 1) ? Score.NEGATIVE : Score.POSITIVE;
        } catch (RuntimeException rt) {
            LOG.fine("error occurred while calculating score. Fallback to " + score + " " + rt.toString());
        }
        return SCORE_TO_PAYMENTMETHOD.apply(score);
    }

    @Path("payments")
    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public ImmutableList<Payment> getPayments(@QueryParam("userid") String userid) {
        // ...
    }

    // ...
}

The getPaymentMethods() method in Listing 1 is bound to the URI path segment paymentmethods, which results in a URI such as http://myserver/paymentservice/paymentmethods. The @GET annotation specifies that the annotated method is invoked when an HTTP GET request is received for the given URI. The Webshop application calls getPaymentMethods() to find out which payment methods to offer; the result depends on the user's reliability score, which is based on his or her credit history. If no history data is available, a credit-score service is called. In the case of exceptions on integration points, the system is designed to downgrade its getPaymentMethods() functionality, even if this means accepting a less reliable payment method from an unknown or less trusted customer. If the internal paymentDao query or the creditScoreURI query fails, getPaymentMethods() returns the default payment methods.

The 'Retry' pattern

Apache HttpClient and other network clients implement some stability features out of the box. For instance, the client might execute retries internally under some circumstances. This strategy helps to handle transient network errors such as dropped connections or server failovers. Retrying will not help in the case of permanent errors, however. In this case retrying wastes resources and time on both the client and server side.
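The distinction between transient and permanent errors can be illustrated with a minimal sketch. It is not how Apache HttpClient implements retries; the class RetrySketch, the methods callWithRetry() and simulate(), and the rule "an IOException is transient, anything else is permanent" are assumptions made for this illustration only.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

class RetrySketch {

    // Retries only on IOException (assumed here to signal a transient error).
    // Any other exception is treated as permanent and propagates immediately.
    static <T> T callWithRetry(Callable<T> call, int maxAttempts, long backoffMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (IOException transientError) {
                last = transientError;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt);   // simple linear backoff
                }
            }
        }
        throw last;   // bounded: give up after maxAttempts
    }

    // Hypothetical demo call: fails `failures` times with an IOException, then succeeds.
    // Returns the number of attempts that were made, or -1 if the call never succeeded.
    static int simulate(int failures, int maxAttempts) {
        AtomicInteger attempts = new AtomicInteger();
        try {
            callWithRetry(() -> {
                if (attempts.incrementAndGet() <= failures) {
                    throw new IOException("connection dropped");
                }
                return "ok";
            }, maxAttempts, 10);
            return attempts.get();
        } catch (Exception e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println("attempts used: " + simulate(2, 3));
        System.out.println("result when the error persists: " + simulate(5, 3));
    }
}
```

The upper bound on attempts is the important design choice: against a permanent error, an unbounded retry loop only adds load on both sides.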

Now let's see how we could apply four common stability patterns to address potentially destabilizing errors in the external credit-score component.

'Use Timeouts' pattern

One of the simplest and most efficient stability patterns is to use proper timeouts. Sockets programming is the fundamental technology for enabling software to communicate on a TCP/IP network. Essentially, the Sockets API defines two types of timeouts:

The connection timeout denotes the maximum time elapsed before the connection is established or an error occurs.
The socket timeout defines the maximum period of inactivity between two consecutive data packets arriving on the client side after a connection has been established.
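At the level of the plain java.net API, the two timeouts map to Socket.connect(address, timeout) and Socket.setSoTimeout(timeout). The following sketch is an illustration only: the class SocketTimeoutSketch and the method readTimesOut() are hypothetical names, and the "silent server" is a local ServerSocket that accepts the TCP connection but never sends data.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class SocketTimeoutSketch {

    // Connects to a local server that never sends data and reports whether
    // the blocking read was aborted by the socket timeout.
    static boolean readTimesOut(int connectTimeoutMillis, int socketTimeoutMillis) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket silentServer = new ServerSocket(0, 50, loopback);   // port 0: any free port
             Socket client = new Socket()) {
            // connection timeout: upper bound for establishing the connection
            client.connect(new InetSocketAddress(loopback, silentServer.getLocalPort()),
                           connectTimeoutMillis);
            // socket timeout: upper bound for waiting on the next incoming data
            client.setSoTimeout(socketTimeoutMillis);
            client.getInputStream().read();   // blocks, because the server never writes
            return false;
        } catch (SocketTimeoutException timeout) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read aborted by timeout: " + readTimesOut(1000, 200));
    }
}
```

With a socket timeout of 0 the read() call above would block indefinitely, which is the situation the rest of this section is concerned with.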

In Listing 1, I used the JAX-RS 2.0 client interface to call the credit-score service, but what is a reasonable timeout period? The answer depends on your JAX-RS provider. The current version of Jersey, for example, uses HttpURLConnection. By default Jersey sets both the connection timeout and the socket timeout to 0 milliseconds, meaning that the timeouts are infinite. If you don't think that's bad news, think again.

Consider that the JAX-RS client will be processed within a server/servlet engine, which uses a worker thread pool to handle incoming HTTP requests. If we're using the classic blocking request-handling approach, the getPaymentMethods() method from Listing 1 will be called via an exclusive thread of the pool. During the entire request-processing procedure, one dedicated thread will be bound to the request processing. If the internally called credit-score service (addressed by the creditScoreURI) responds very slowly, all of the worker pool threads will eventually be blocked. Then let's say that another method of the payment service, such as getPayments(), is called. That request won't be handled because all the threads will be waiting for the credit-score response. In the worst case, a badly behaving credit-score service could take down all of our payment service functions.
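The exhaustion effect described above can be reproduced with a plain java.util.concurrent thread pool. This is a simplified model, not servlet-engine code: the class PoolExhaustionSketch and the method fastTaskStarves() are hypothetical, Thread.sleep() stands in for a slow credit-score call, and the fast task stands in for a getPayments() request.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class PoolExhaustionSketch {

    // Returns true if a fast task could NOT complete within waitMillis
    // because every worker thread was occupied by a slow call.
    static boolean fastTaskStarves(int poolSize, long slowCallMillis, long waitMillis) {
        ExecutorService workers = Executors.newFixedThreadPool(poolSize);
        try {
            // occupy every worker with a slow "credit-score" call
            for (int i = 0; i < poolSize; i++) {
                workers.submit(() -> {
                    try {
                        Thread.sleep(slowCallMillis);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // an unrelated, fast request now has to queue behind the slow ones
            Future<String> fast = workers.submit(() -> "payments");
            try {
                fast.get(waitMillis, TimeUnit.MILLISECONDS);
                return false;
            } catch (TimeoutException starved) {
                return true;
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        } finally {
            workers.shutdownNow();   // interrupts the sleeping workers
        }
    }

    public static void main(String[] args) {
        System.out.println("fast task starved: " + fastTaskStarves(2, 2000, 200));
    }
}
```

The fast task does no remote work at all, yet it cannot run while the pool is saturated. Bounding the duration of the slow calls is what frees the workers again.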

Implementing timeouts: Thread pools vs. connection pooling

Reasonable timeouts are fundamental for availability, but the JAX-RS 2.0 client interface doesn't define an interface to set timeouts. Instead, you have to use vendor-specific interfaces. In the code below I've set custom properties for Jersey.

restClient = ClientBuilder.newClient();
restClient.property(ClientProperties.CONNECT_TIMEOUT, 2000);  // jersey specific
restClient.property(ClientProperties.READ_TIMEOUT, 2000);     // jersey specific

In contrast to Jersey, RESTEasy uses the Apache HttpClient by default, which is much more efficient than using HttpURLConnection. The Apache HttpClient supports connection pooling. Connection pooling ensures that after performing an HTTP transaction the connection will be reused for further HTTP transactions, assuming the connection is identified as a persistent connection. This approach saves the overhead of establishing new TCP/IP connections, which is significant. It's not uncommon in a high-load system for hundreds or thousands of outgoing HTTP transactions per second to be performed by a single HTTP client instance.

In order to use Apache HttpClient in Jersey, you would need to set an ApacheConnectorProvider, as shown in Listing 2. Note that the timeout is set within the request-config definition.

Listing 2. Using Apache HttpClient in Jersey

ClientConfig clientConfig = new ClientConfig();                           // jersey specific
clientConfig.connectorProvider(new ApacheConnectorProvider());            // jersey specific

RequestConfig reqConfig = RequestConfig.custom()                          // apache HttpClient specific
                                       .setConnectTimeout(2000)
                                       .setSocketTimeout(2000)
                                       .setConnectionRequestTimeout(200)
                                       .build();

clientConfig.property(ApacheClientProperties.REQUEST_CONFIG, reqConfig);  // jersey specific
restClient = ClientBuilder.newClient(clientConfig);

Note that the connection-pool-specific connection request timeout is also set in the example above. The connection request timeout denotes the maximum time to wait, from the moment a connection is requested, until HttpClient's internal connection-pool manager returns the requested connection. By default the timeout is infinite, which means that the connection-request call blocks until a connection becomes free. The effect is the same as it would be with infinite connection/socket timeouts.
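To make the idea of a connection request timeout more concrete, here is a toy model of a bounded pool. It does not reflect HttpClient's actual connection-pool manager; the class PoolLeaseSketch, its lease() and release() methods, and the use of strings as stand-ins for connections are assumptions for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

class PoolLeaseSketch {

    private final BlockingQueue<String> idle;   // strings stand in for pooled connections

    PoolLeaseSketch(int size) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add("connection-" + i);
        }
    }

    // Waits at most requestTimeoutMillis for a free connection.
    // Returns null if the pool stayed exhausted for the whole period.
    String lease(long requestTimeoutMillis) {
        try {
            return idle.poll(requestTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    // Hands a connection back so that waiting callers can lease it.
    void release(String connection) {
        idle.offer(connection);
    }

    public static void main(String[] args) {
        PoolLeaseSketch pool = new PoolLeaseSketch(1);
        String first = pool.lease(200);
        System.out.println("first lease: " + first);
        System.out.println("second lease while exhausted: " + pool.lease(200));
        pool.release(first);
        System.out.println("lease after release: " + pool.lease(200));
    }
}
```

Without the bounded wait, a caller would block until some other thread released a connection, which ties up the calling worker thread in the same way as an infinite socket timeout.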

As an alternative to using Jersey, you could set the connection request timeout in an indirect way via RESTEasy, as shown in Listing 3.