Bad error messages increase MTTR

Software errors are an unavoidable part of any software development process. Your application will not always function as planned. Things will go wrong, systems will be down, bugs will creep in, and users will do the wrong thing with your software. The real question is how long will it take you, or the operator who probably isn’t you, to recover from that failure. This is called mean time to recovery (MTTR) and it’s an important measure of the health of any software system. Many factors influence MTTR, but the one I want to focus on for this article are error messages. Note that you want a low MTTR.

Good error messages help operators and developers

When a problem arises, error messages help the software operator understand what has gone wrong. If the problem is part of the operational deployment then the error message will help the operator understand what the root cause of the problem is and quickly remedy it. That error message can also help the operator understand that the problem is not rooted in the operational deployment, but instead is a software issue that needs to be fixed by a developer. Finally, the error message may be an error that crops up during the development process. If that error message is well written and contains the appropriate information it will help the developer quickly fix the problem and improve software delivery times.

What are the essential elements of a good error message?

Report enough context so the reader knows what went wrong

Let’s pretend we have two pieces of Java software, simple-reader and complex-reader. Both of them do some kind of batch processing on XML files.

simple-reader only processes files from a single directory which the operator provides.

complex-reader processes several banks of XML files from disparate sources. This system is a connector between several legacy systems and some more modern ones. It reads these XML records from various sources some. The software fetches some from a FTP server, others from a Hadoop cluster, still more from a shared network drive.

Let’s say that simple-reader throws a FileNotFoundException with the path of the file when it encounters a problem reading from the directory. The operator or developer would receive the path to the file, and be able to remedy the situation

Let’s say that complex-reader also throws a FileNotFoundException with the path of the file when it encounters a problem reading from the directory (be it remote or otherwise). The operator or developer would receive the path to the file, and not know which of the subsystems that complex-reader was trying to fetch the file from when it encountered a problem.

complex-reader needs to be fixed by adding more context to the failure. What subsystem was it speaking to when it failed? What was the path on that subsystem? Was it a network failure or an actual read failure? Could the complex-reader report the number of files it had already processed from that endpoint before it failed? All of this information would help the operator find the real problem and fix it quickly. Probably by identifying a subsystem that’s down, or a networking issue that’s preventing complex-reader from fetching and reading the file.

Comparing and contrasting these two examples it becomes clear that more complex software needs more robust error messages. Give the operator the right amount of context to help them solve their problems.

If possible, help the reader know what to do next

The reader reading an error message should receive a call to action if it isn’t obvious from the message itself. This is especially true if the developer has special knowledge that can be encoded in the error message. You have to be careful with this sort of thing. Good advice encoded in an error message may quickly rot as technology and infrastructures change.

For example, if the complex-reader failed during a connection to read from an FTP site, and this site is often flaky, then the error message may read something like

Tried connecting to the FTP service at [49.132.9.1] [3] times before giving up. Service is often unreliable, try again in a couple of minutes.

This helps the reader know that being unable to connect is not an uncommon problem, that there could be a more serious underlying issue with the networking infrastructure, but that they shouldn’t panic until they’ve run the service after some wait time.

Wrap bad exceptions with in better custom ones

You can’t always control what exceptions are coming out of your software. Often times you use frameworks that throw exceptions while they run and you have no ability to control those. For the cases when you DO have the capacity to catch bad exceptions, the best thing you can do is transform them into more useful ones.

java.net.ConnectException as thrown by Java’s Socket class is a great example of a terrible error message because it doesn’t tell the reader the host and port of the destination system.

package com.mpinnegar; import java.io.IOException; import java.net.Socket; public class Connector { public static void main(String[] args) throws IOException { new Socket("localhost", 1111).getOutputStream().write("hello world".getBytes()); } } Invoking the above produces this error message Exception in thread "main" java.net.ConnectException: Connection refused: connect at java.net.DualStackPlainSocketImpl.connect0(Native Method) at java.net.DualStackPlainSocketImpl.socketConnect(DualStackPlainSocketImpl.java:79) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at java.net.Socket.connect(Socket.java:538) at java.net.Socket.<init>(Socket.java:434) at java.net.Socket.<init>(Socket.java:211) at com.mpinnegar.Connector.main(Connector.java:11) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

Now let’s make this better by catching the exception and rethrowing it as our own special exception which includes the host and port. In this case I extended Exception, but I tend to extend RuntimeException whenever possible because I think checked exceptions are often more trouble than they’re worth.

package com.mpinnegar; import java.io.IOException; import java.net.ConnectException; import java.net.Socket; import java.net.UnknownHostException; public class Connector { public static void main(String[] args) throws IOException, BetterConnectException { String host = "localhost"; int port = 1111; try { new Socket(host, port).getOutputStream().write("hello world".getBytes()); } catch (ConnectException | UnknownHostException exception) { throw new BetterConnectException(host, port, exception); } } } Which produces this error message. Notice that we've preserved the original stack trace so we don't lose any information about the original exception, but we've also added our own information Exception in thread "main" com.mpinnegar.BetterConnectException: Failed trying to connect to host [localhost] on port [1111] at com.mpinnegar.Connector.main(Connector.java:15) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144) Caused by: java.net.ConnectException: Connection refused: connect at java.net.DualStackPlainSocketImpl.connect0(Native Method) at java.net.DualStackPlainSocketImpl.socketConnect(DualStackPlainSocketImpl.java:79) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at java.net.Socket.connect(Socket.java:538) at java.net.Socket.<init>(Socket.java:434) at java.net.Socket.<init>(Socket.java:211) at com.mpinnegar.Connector.main(Connector.java:13) ... 5 more

Obfuscate sensitive data in your error messages

Logs, and by extension error messages, should never be considered a secure part of an application. They are often read by many different individuals with varying levels of access and authority. They can also live for months or years in archived form.

You should therefore sanitize any sensitive data that you write to an error message or log. The simplest way to do this in Java is to write an overridden .toString() method that doesn’t output any sensitive fields, outputs a small portion of those fields, or that outputs the full field only in some kind of developer mode to help with debugging. Below is an example of a class that obfuscates the user’s credit card number only when the system is setup to do so.

package com.mpinnegar; import static com.mpinnegar.Sanitizer.sanitize; public class Person { private String firstName; private String lastName; private String creditCardNumber; public Person(String firstName, String lastName, String creditCardNumber) { this.firstName = firstName; this.lastName = lastName; this.creditCardNumber = creditCardNumber; } //Getters and setters are omitted for brevity @Override public String toString() { return "Person{" + "firstName='" + firstName + '\'' + ", lastName='" + lastName + '\'' + ", creditCardNumber='" + sanitize(creditCardNumber) + '\'' + '}'; } }

package com.mpinnegar; public class Sanitizer { private static boolean isSanitizing = Boolean.getBoolean(System.getenv("shouldSanitizeData")); public static String sanitize(String sensitiveData) { if (isSanitizing) { return sensitiveData.replaceAll(".", "*"); } else { return sensitiveData; } } }

Demarcate your parameterized values

Often you will need to pass values into an error message. This helps the reader understand what the state of the system was when it failed. When passing these error messages in you should demarcate them from the text of the message. I use brackets ([ and ]), but you should use what makes the most sense for you. This is most important when you might output something that could be interpreted as part of the error message itself. Consider the following error message.

Tried printing to the screen, but failed.

What’s lost here is that the error message is actually telling the reader exactly what it tried to print. In this case it’s the empty string, but since there’s no demarcation it’s impossible for a reader to know that. Instead consider this improvement.

Tried printing [] to the screen, but failed.

Validate individual steps and tie your error messages to your code whenever possible

It puts more burden on the programmer, but whenever you can reasonably do so, validate input in a layered fashion so that you can respond with an error message specific to the problem at hand. Consider validating some data a reader has entered into a form which accepts first name, last name, and phone number.

In our example we require all fields except for the phone number, and the first name must be 20 characters or less in length. The last name is unbounded, but must be at least one character, and the phone number must be a 10 digit number after removing everything but the numbers.

Here’s a very simple, and very bad example. It provides almost no feedback to the reader on what went wrong. In our case we have a very simple form, only three fields, but in a real world example you may have tens of fields to deal with.

package com.mpinnegar; public class BadValidator { private static final int FIRST_NAME_FIELD_SIZE_LIMIT = 20; private static final int PHONE_NUMBER_FIELD_EXACT_SIZE = 10; private String errorMessage; boolean validate(String firstName, String lastName, String phoneNumber) { if ( firstName == null || firstName.length() == 0 || lastName == null || lastName.length() == 0 || (phoneNumber != null && phoneNumber.replaceAll("[^0-9]", "").length() != PHONE_NUMBER_FIELD_EXACT_SIZE) || firstName.length() > FIRST_NAME_FIELD_SIZE_LIMIT ) { errorMessage = "Error found in form data."; return false; } return true; } public String getErrorMessage() { return errorMessage; } }

Now let’s try again applying some of the lessons learned from above about layered validation. Note that it does take more code to do this. You don’t get this for free, but you’re making your application easier to use either when a developer sends a new request that’s bad and gets a reasonable error message back, or when an operator sends a bad request and now knows what they need to fix.

package com.mpinnegar; public class Validator { private static final int FIRST_NAME_FIELD_SIZE_LIMIT = 20; private static final int PHONE_NUMBER_FIELD_EXACT_SIZE = 10; private String errorMessage; boolean validate(String firstName, String lastName, String phoneNumber) { if (firstName == null || firstName.length() == 0) { errorMessage = missingRequiredValue("firstName"); return false; } if (lastName == null || lastName.length() == 0) { errorMessage = missingRequiredValue("lastName"); return false; } if (phoneNumber != null && phoneNumber.replaceAll("[^0-9]", "").length() != PHONE_NUMBER_FIELD_EXACT_SIZE) { errorMessage = "Phone number [" + phoneNumber + "] is invalid. It must be " + PHONE_NUMBER_FIELD_EXACT_SIZE + " characters long considering only the numbers and dropping everything else."; return false; } if (firstName.length() > FIRST_NAME_FIELD_SIZE_LIMIT) { errorMessage = "The first name field cannot be longer than " + FIRST_NAME_FIELD_SIZE_LIMIT + " characters. " + "The first name submitted was " + firstName + " which is " + (firstName.length() - FIRST_NAME_FIELD_SIZE_LIMIT) + " characters too long."; return false; } return true; } private String missingRequiredValue(String valueName) { return "Missing [" + valueName + "] which is a required field."; } public String getErrorMessage() { return errorMessage; } }

You can see in the above example how the length of the field is being shared between the error message and the actual code doing the validation. This helps cut down on the rot that can occur where error messages don’t match the code that they’re written against.

Notice that this isn’t perfect. We’ve written in the error message for the phone number that we’re dropping all the non-numeric characters. That’s a textual description of what the regex is doing. There’s nothing that prevents a developer from changing the regex without updating the error message.

Use parameterized logging in Java for your error messages (and other logs)

Your logging framework in Java should support parameterized logging (if you logging framework does not, get a new one). This is when you pass a string with placeholders in it (usually curly braces {}) to the logging framework’s log method, and then you pass a series of objects which replace the placeholders. Consider this log statement which does not use parameterized logging.

logger . debug ( "You just broke the world by passing me a this bad BigObject [" + objectWithReallyBigToString + "]" );

In the above form every time the code passes over this .debug() statement, even if your logger code is not in debug mode, you will pay the cost of the .toString() method on the objectWithReallyBigToString. This is because the code has to resolve the parameter’s value before passing it in.

Instead of that let’s do something better and use the logging framework’s parameterized logging. Notice how we use the {} which is where the logging framework will inject the value returned by objectWithReallyBigToString.toString()

logger . debug ( "You just broke the world by passing me a this bad BigObject [{}]" , objectWithReallyBigToString );

Now because we passed the objectWithReallyBigToString object into the second parameter and let the logging framework build the string itself, we don’t incur the cost of calculating it’s string value each time we pass over that piece of code while we aren’t in debug mode. This can make a BIG performance difference in pieces of the code that are frequently visited.

Conclusion

Whenever you’re programming think about the reader who’s coming along after you whether it be another developer or an operator. They need to be able to quickly ascertain what’s going on. When they come along make sure your error messages don’t look like this.

Special thanks to pathslong, MikhailEdoshin, dorkinson, and FranknFreddys for their feedback on this article.