Developers are often provided with a large amount of security advice, and it is not always clear what to do, how to do it, and how important it is. Especially considering that security advice has changed over time, it can be confusing. I was motivated to write this blog because the best guide I found on input validation is very dated, and I wanted to provide clear, modern guidance on the importance of the topic. There is also the OWASP Input Validation Cheat Sheet as another source on this topic.

This blog is targeted to developers and Application Security leads who need to provide guidance to developers on best practices for secure coding.

Input validation is the first line of defence for secure coding. There are many ways that a hacker will go after your software, and it would be naive to assume that you know all of them. The point of input validation is that, when done correctly, it will stop a number of attacks that you will not foresee. When it doesn’t completely stop them, it usually makes them more restricted or more difficult to pull off. Remember: input validation is not about stopping specific attacks, but instead a general defence for stopping any number of attacks.

Input validation is not hard to do, but sometimes it takes time to figure out what character set the data should conform to. This blog will be primarily focused on web applications, but the same concepts apply in other scenarios.

Below is guidance on how to do it, when to do it, and examples of how input validation might save you if you got other parts of the coding wrong. It must be emphasised that input validation is not a panacea, but when done correctly, it sure makes your application a lot more likely to resist a number of attacks.

What is input validation?

Hackers attack websites by sending malicious inputs. This could be through a web form or AJAX request, by sending requests directly to your API with tools such as curl or Python, or by using an intercepting proxy (typically Burp, but other tools include ZAP and Charles), which sits somewhere between the former two methods.

Input validation means to check on the server side that the input supplied by the user/attacker is of the form that you expect it to be. If it is not of the right form, then the data should be rejected, typically with a 400 http status code.

What is meant by right form? At the minimum, there should be a check on the length of the data, and the set of characters in the data. Sometimes, there are more specific restrictions on the data — consider some insightful examples below.

Example: phone number. A phone number is primarily digits, with a maximum of 15 digits. If you allow international numbers, then plus (‘+’) is a valid character. If the area code is in parentheses, then the parentheses are additionally valid characters. If you want to allow the user to enter dashes or spaces, then you can add those to the list of allowed characters as well, though you don’t need to (front end developers can make it so that the user does not need to send such data). In total we have at most 15 different valid characters and at most 15 digits: this is your validation rule. Additionally, you should always have a limit on the total length of the string supplied (including parentheses, spaces, dashes, pluses) to stop malicious users from submitting arbitrarily long input.
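The phone number rule above can be sketched as a small white list check. This is a minimal illustration, not a definitive implementation: the class name `PhoneNumberValidator` and the total length limit of 25 are assumptions made for the example, while the allowed characters and the 15 digit maximum come from the rule described above.

```java
import java.util.regex.Pattern;

final class PhoneNumberValidator {
    // Assumed limit on the raw string, including +, parentheses, spaces and dashes.
    private static final int MAX_TOTAL_LENGTH = 25;
    private static final int MAX_DIGITS = 15;
    // White list: ASCII digits, plus, parentheses, space and dash only.
    private static final Pattern ALLOWED = Pattern.compile("[0-9+() -]+");

    static boolean isValid(String input) {
        if (input == null || input.isEmpty() || input.length() > MAX_TOTAL_LENGTH) {
            return false;
        }
        if (!ALLOWED.matcher(input).matches()) {
            return false;
        }
        // Count only the digits; the formatting characters do not count towards the 15.
        long digits = input.chars().filter(c -> c >= '0' && c <= '9').count();
        return digits >= 1 && digits <= MAX_DIGITS;
    }
}
```

A caller would reject the request (typically with a 400) whenever `isValid` returns false, before doing anything else with the value.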

Example: last name. This one is more complicated, but a Google search gives us a pretty good answer. Importantly, you need to allow hyphens (Hector Sausage-Hausen); spaces, commas, and full stops (Martin Luther King, Jr.); and single quotes (Mathias d’Arras). See also the comments that identify extra characters that should be included if one wants to truly try to accept all international names, but it depends upon your application. As for length limit, if you really want to allow the longest names in the world, you can, but I would personally think a limit like 40 characters is sufficient.
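As a rough sketch of the last name rule, the check below accepts letters plus the punctuation mentioned above, with a 40 character limit. The choice to accept any Unicode letter or combining mark (`\p{L}`, `\p{M}`) and both the straight and typographic apostrophe is an assumption for this example; your application may need a narrower or wider set. The class name `LastNameValidator` is hypothetical.

```java
import java.util.regex.Pattern;

final class LastNameValidator {
    // White list: Unicode letters and combining marks, space, comma, full stop,
    // straight apostrophe, typographic apostrophe (U+2019) and hyphen.
    // Length: 1 to 40 characters.
    private static final Pattern LAST_NAME =
            Pattern.compile("[\\p{L}\\p{M} ,.'\u2019-]{1,40}");

    static boolean isValid(String input) {
        return input != null && LAST_NAME.matcher(input).matches();
    }
}
```

Note that this still allows the single quote, so it does not remove the need for the specific defences discussed later; it only narrows what an attacker can send.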

Example: email address. Email addresses are examples where the character set is well defined, but the format is restricted. There are published regular expressions all over the web that tell you how to validate email addresses, but this one looks quite useful because it tells you how to do it in many different languages.

Example: a quantity. A common requirement for ecommerce applications is to allow the customer to choose some number of items. Quantities should be non-negative integers, and there should be some reasonable upper limit to how many the person can choose. For example, you might allow only the quantities 0, 1, 2, …, 99 and anything else is rejected.
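The quantity rule is simple enough to sketch in a few lines. The example below assumes the quantity arrives as a string (as query parameters and form fields do) and accepts only the values 0 to 99; the class name `QuantityValidator` is made up for illustration.

```java
import java.util.OptionalInt;

final class QuantityValidator {
    // Returns the parsed quantity, or an empty OptionalInt if the input is rejected.
    static OptionalInt parse(String input) {
        // White list: one or two ASCII digits, which covers exactly 0..99.
        // Signs, whitespace, exponents and non-ASCII digits are all rejected.
        if (input == null || !input.matches("[0-9]{1,2}")) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(Integer.parseInt(input));
    }
}
```

Checking the string form before parsing avoids surprises such as a parser accepting a leading plus sign or overflowing on very long input.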

There are two types of input validation strategies: white list and black list. The examples above are white list — we have said exactly what is allowed and anything else is rejected. The alternative approach, black list, tells what is not allowed and accepts everything else. White list validation is a general defence that is not targeted towards a specific attack, and can often stop attacks that you may not foresee. On the other hand, black list validation rules out dangerous characters for a specific attack that you have in mind. Black list should never be relied upon by itself — we will talk more about that in the next section.

When and how to do it?

Validation must happen on the server side, and should be done before you do anything else with the data.

It’s not unusual to see client side validation (for usability), but if you do not also validate on the server side (for security), then it will stop your kid sister and nobody else. Remember, web hackers do not need to run the same JavaScript code that you serve to them. Most of the time, they are going to use an intercepting proxy to modify the request after it leaves the browser but before it reaches the server. For example, see this video. [Footnote: there are some edge cases where client side input validation makes sense for security, such as DOM based XSS, but these are advanced topics that are beyond the scope of this blog.]

What should be validated? Any data that you use that an attacker can manipulate. For web applications, this includes query parameters, http body data, and http headers. For emphasis, you don’t need to validate every http header — instead, only validate the ones that your application uses somewhere in the code. I’ve seen a number of cases where applications use http headers such as User-Agent without realising that an attacker could put anything he wants for such values. For example, with curl one would just use the -H option.

Validation should be done immediately, before anything else (including logging) is done with the data. I’ve seen cases where developers tried to validate strings that were formed by concatenating fixed values with user input. Because the validation happened on the result of the string concatenation, the validation routine had to be liberal enough to allow every character in the fixed string. But one of the characters in the fixed string was key to allowing the attacker to supply his own malicious input. Since the validation had to allow it when coded this way, a hack was trivial to pull off.
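To illustrate the ordering problem, here is a minimal sketch in which the user-controlled part is validated on its own, before it is concatenated with a fixed string. The prefix, the allowed pattern and the class name `ReportKeyBuilder` are all hypothetical. The point is that the fixed prefix contains characters (':' and '/') that the user input itself is never allowed to contain.

```java
import java.util.regex.Pattern;

final class ReportKeyBuilder {
    // Hypothetical fixed prefix. It contains ':' and '/', which a validator run on the
    // concatenated result would be forced to allow.
    private static final String PREFIX = "reports:/2024/";
    // White list for the user-supplied part only.
    private static final Pattern REPORT_NAME = Pattern.compile("[A-Za-z0-9]{1,32}");

    static String build(String userSuppliedName) {
        // Validate the raw input first, before it is combined with anything else.
        if (userSuppliedName == null || !REPORT_NAME.matcher(userSuppliedName).matches()) {
            throw new IllegalArgumentException("invalid report name");
        }
        return PREFIX + userSuppliedName;
    }
}
```

Had the check been applied to the concatenated string instead, an input such as `../secret` would have had to pass, because the validator could not tell the fixed slashes from the attacker's.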

MVC frameworks such as .NET and Spring have good documentation on how to do input validation. For example, Microsoft has nice pages about input validation in various versions of ASP.NET MVC. Similarly, Spring has documentation and additional guidance can be found in various places. In other cases, you might use a framework designed specifically for data validation, or write custom validation methods.

Importantly, white list validation must always be done. Black list validation can supplement white list validation, but it cannot be relied upon by itself to stop an attack. The reason is that hackers are very good at finding ways around black list validation.

Yes, it is your responsibility

The diagram below depicts an example that I have seen a number of times, especially in microservices architectures. System A receives data from the user, and passes it off to internal System B. System B processes the data. Where should input validation happen?

The developers of System A believe they are mainly acting as a proxy, and therefore the responsibility for input validation lies with System B. On the other hand, the developers of System B believe they are getting data from an internal trusted source (System A), and therefore the responsibility for validating it lies with System A. As a consequence, neither development team validates the data. In the event that they get hacked, everyone has an excuse on why they didn’t do it.

The answer is that both are responsible for validating their input data.

System A needs to validate because otherwise it is vulnerable to various injection attacks when the data gets sent to System B, such as JSON injection or HTTP header injection. System B should validate its data because the muppets who wrote System A rarely do any type of meaningful validation on the data you process. In summary, no matter where you are developing, you need to validate your input. Don’t assume that somebody else is going to do it for you, because they won’t.

What about input sanitisation?

Sanitisation refers to changing user input so that it becomes not dangerous. The issue here is that what is dangerous depends upon the context in which it is used.

For example, suppose some user input is first logged and then later displayed to the user in his browser. In the context of logging, the dangerous characters are newlines (ASCII value 10) and carriage returns (ASCII value 13), because they allow an attacker to create a new line in your log file containing anything he wants, an attack known as log forging (discussed in more detail in the examples later). On the other hand, when the input is displayed to the user, the dangerous characters become those that the browser can interpret to do something the web designer did not intend: the characters < > / and " are the first that come to mind (normally we escape these characters).

It is not unusual for developers to get this wrong. I have seen more than once the use of the OWASP Java HTML Sanitizer to attempt to sanitise data written to the log. Wrong tool, wrong context.

Because sanitisation depends upon context, it is not desirable to try to sanitise inputs at the beginning in such a way that they will not be dangerous when later used. In other words, input sanitisation should never be used in place of input validation. Input validation should always be done. Thus, even if you get the sanitisation wrong (e.g. see previous paragraph), input validation will often save you.

Note also that the concept of input sanitisation is making us think in a black list approach rather than a white list approach, i.e. we are thinking about what characters might be harmful in a specific context and what to do with them. This is another reason why input sanitisation should not displace input validation. Even very good developers have learned this the hard way.

The bottom line is that white list input validation is a general defence, whereas input sanitisation is a specific defence. General defences should happen when input comes in (at “source”), specific defences should happen when the data is later used (at “sink”). Never omit the general defence.

Examples

We have given the guidance, but now let’s justify it with examples. Let’s see how input validation can often save us when proper coding defences are lacking somewhere else in the code base. An important takeaway here is that even though input validation does not stop everything, it certainly does stop a lot, and makes other attacks a lot harder. Given how easy it is to perform the validation, the bang-for-your-buck-analysis dictates to always do it.

Example: SQL injection

SQL injection vulnerabilities are most often due to forming SQL queries using string concatenation/substitution with user input. A typical example looks like this (Java):

String query = "SELECT * FROM User where userId='" + request.getParameter("userId") + "'"; // vulnerable

To get all users, the attacker can send the following for the userId parameter: xyz' or 1=1 --

That malicious input will change the query to:

SELECT * FROM User where userId='xyz' or 1=1 -- '

The attacker can do much more than this with more clever inputs, including fetching other columns or deleting the entire database. However, sticking to the simple example, notice the tools the attacker is using: the single quote to end the userId part of the query, the white spaces to add other statements, the equals sign to make a comparison that is always true, and the double dash to comment out the remaining part of the query.

The proper defence against SQL injection, which should always be done, is either prepared statements or parameterised queries. However, let’s consider what would happen if the developer had validated the input but still formed the query with string concatenation. In this case, the code might look like:

// This code is still wrong, but it is better than above
String userId = request.getParameter("userId");
if ( !validateUserId( userId ) ) {
    // Handle error condition, return status code of 400
    ...
}
String query = "SELECT * FROM User where userId='" + userId + "'"; // <--- Don't do this!

For data types like user ids, phone numbers, quantities, email addresses, and many others, input validation would not have allowed the single quote, which has already stopped the attack. However, for a field like a last name, a single quote must be allowed or else O’Malley will throw a fit.

Still, input validation would not have allowed the equal character in the last name and some other characters that attackers like to use. Additionally, limiting the number of characters that an attacker can provide (example: 40 for last name) would also impede the attacker. Truthfully, a good hacker would still succeed in getting an injection, but you might have stopped the script kiddie. The lesson here is: Just because you validated the data does not mean that you can be sloppy elsewhere.

A great website telling how to code SQL queries in various languages is Bobby Tables. They have a nice comic, but unfortunately the punch line is wrong:

As the website correctly says on its about page: The answer is not to “sanitize your database inputs” yourself. It is prone to error. Instead, use prepared statements or parameterised queries.
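For comparison with the vulnerable query above, here is a minimal sketch of the same lookup using a JDBC prepared statement. The class and method names (`UserLookup`, `userExists`) are made up for illustration, and the caller is assumed to supply an open `java.sql.Connection`; the table and column names are taken from the earlier example.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

final class UserLookup {
    // The query text is a constant; user input is never spliced into it.
    static final String FIND_BY_ID = "SELECT * FROM User WHERE userId = ?";

    // Returns true if a row with the given userId exists.
    static boolean userExists(Connection connection, String userId) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(FIND_BY_ID)) {
            // The value is bound as data, so quotes and dashes in it are not parsed as SQL.
            statement.setString(1, userId);
            try (ResultSet rows = statement.executeQuery()) {
                return rows.next();
            }
        }
    }
}
```

The userId should still be validated before this method is called; the prepared statement is the specific defence at the sink, not a replacement for validation at the source.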

Example: Log forging

A well written application should have logging throughout the code base. But logging user input can lead to problems. For example (C#):

log.Info("Value requested is " + Request["value"]); // vulnerable

Log entries are separated by newlines or carriage returns, so if the value from the user contains a newline or a carriage return, then the user is able to put anything he wants into your log file, and it will be very hard for you to distinguish between what is real versus what was written by the attacker. If you did proper validation of your input, there are not many cases where one would allow newlines and/or carriage returns, so the validation would usually save you.

For a good technique to prevent log forging, see this nice blog by John Melton (the same concept works regardless of language).
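As a minimal sketch of one common approach (not necessarily the one described in the blog linked above), untrusted values can be neutralised just before they are written to the log, by replacing carriage returns and newlines with visible escape sequences. The class name `LogSafe` is hypothetical, and this sketch assumes a plain text, line-oriented log format.

```java
final class LogSafe {
    // Rewrites a value so that it cannot start a new log line.
    // Backslashes are escaped first so that an encoded newline can be told apart
    // from a literal backslash followed by 'n' in the original input.
    static String neutralise(String value) {
        if (value == null) {
            return "(null)";
        }
        return value
                .replace("\\", "\\\\")
                .replace("\r", "\\r")
                .replace("\n", "\\n");
    }
}
```

The logging call then becomes something like `log.info("Value requested is " + LogSafe.neutralise(value))`. This is a specific defence for the logging sink only; it does nothing for values that are later displayed in a browser.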

Example: Path manipulation

Many web applications these days allow the user to upload to or read something from the server file system. It is not unusual for the file name to be formed by concatenating a fixed path with a file name provided by the user (C#):

String fn = Request["filename"]; // fn could have dangerous input
String filepath = USER_STORAGE_PATH + fn;
String[] lines = System.IO.File.ReadAllLines(filepath); // vulnerable

The above example does not validate the user provided filename, but normally a filename would only allow alphanumeric characters along with the ‘.’ for file extensions.

The risk here is that a user provides a file name of something like: ../../../system_secrets.txt (fictional file name). This allows the attacker to read the system_secrets.txt file, which is outside the USER_STORAGE_PATH path. Note that the input validation would not have prohibited the use of ‘.’, but it would have prevented the attacker from using the forward or backward slash, which is key to his success.
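A rough Java sketch of both layers is shown below: a white list check on the file name, followed by a check at the point of use that the resolved path is still inside the storage directory. The exact pattern (alphanumeric name, one dot, alphanumeric extension, with assumed length limits) and the class name `UserFileResolver` are illustrative choices, not a standard.

```java
import java.nio.file.Path;
import java.util.regex.Pattern;

final class UserFileResolver {
    // White list: alphanumeric name, a single dot, alphanumeric extension.
    // No slashes, backslashes or consecutive dots can match.
    private static final Pattern FILE_NAME =
            Pattern.compile("[A-Za-z0-9]{1,64}\\.[A-Za-z0-9]{1,8}");

    static Path resolve(Path storageRoot, String fileName) {
        // General defence at the source: reject anything outside the white list.
        if (fileName == null || !FILE_NAME.matcher(fileName).matches()) {
            throw new IllegalArgumentException("invalid file name");
        }
        Path root = storageRoot.toAbsolutePath().normalize();
        Path resolved = root.resolve(fileName).normalize();
        // Specific defence at the sink: the result must still be inside the storage root.
        if (!resolved.startsWith(root)) {
            throw new IllegalArgumentException("path escapes storage root");
        }
        return resolved;
    }
}
```

With the white list in place the second check should never fire, but keeping it means a later loosening of the pattern does not silently reopen the hole.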

More generally, I don’t recommend that users should be able to provide the direct file names to access, and storage should be on a separate system. But even if you go against that advice, proper input validation will save you from path manipulation.

Example: Server side request forgery

This is a nice example because not many people know what it is, but it has recently become quite a nice tool that hackers have added to their toolbox.

Consider that your web application might initiate an http request, where the destination of that request is somehow formed from user input. One might think that making the http request from the server is benign, because it is no different from the user making a similar request from his browser. But that reasoning is wrong, because the request from your server has access to your internal network, whereas the user himself should not.

An example is insightful. The website InterviewCake teaches developers how to solve difficult job interview questions. From the website, you can actually write code in a number of different languages and run it, which runs in an AWS environment. It was then too easy for Christophe Tafani-Dereeper to write some simple Python code on InterviewCake that reveals AWS security credentials from the private network that the application was hosted on:

This type of attack is common in AWS environments: learn about AWS Instance Metadata from Amazon.

Of course, allowing users to run arbitrary code in an environment that you host is extremely dangerous and is hard to defend against. More often we see cases like the following from StackOverflow (PHP):

In the above example, the server will get the url of a file from a query parameter. It assumes that the file type is either gif, png, or jpg, and then it serves that content to the client. The only validation check is that the protocol is http (or https).

Although it appears to restrict to gif, png, or jpg files, the default is to just read the content regardless of type. An attacker could thus request anything he wants, and the request is made with the server's privileges and network access.

To fix it, the server needs to validate that the incoming url is allowed to be accessed by the user. This can be done by verifying that the url matches a white list of allowed URLs. In this case, white list validation defeats the attack completely (when implemented properly), and no other defence is needed.
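One way to sketch such a check in Java is shown below. It is a variant of the idea above that white lists hosts rather than complete URLs, and it makes several assumptions for illustration: the host names are placeholders, only https on the default port is accepted, and the class name `ImageUrlValidator` is hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

final class ImageUrlValidator {
    // Hypothetical allow list of hosts the server is permitted to fetch from.
    private static final Set<String> ALLOWED_HOSTS =
            Set.of("images.example.com", "cdn.example.com");

    static boolean isAllowed(String url) {
        if (url == null || url.length() > 2048) {
            return false;
        }
        final URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return false;
        }
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null || !scheme.equalsIgnoreCase("https")) {
            return false;
        }
        // Reject embedded credentials such as https://allowed-host@other-host/.
        if (uri.getUserInfo() != null) {
            return false;
        }
        // Only the default https port.
        int port = uri.getPort();
        if (port != -1 && port != 443) {
            return false;
        }
        return ALLOWED_HOSTS.contains(host.toLowerCase(Locale.ROOT));
    }
}
```

The code that actually performs the fetch should use the same parsed URI that was validated, and should not follow redirects to hosts outside the list; otherwise the check can be sidestepped.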

Example: Cross site scripting

Cross site scripting (XSS) is an old vulnerability that is still a major problem today. There are three types of XSS, but we’re not going to that level of detail. XSS happens when untrusted input (typically user input) is interpreted in a malicious way in another user’s browser. Let’s look at a simple example (Java JSP):

<% String name = request.getParameter("name"); %> Name provided: <%= name %>

If the input provided contains the query parameter

then that JavaScript will execute in a user’s browser. This type of XSS (reflected XSS) is typically exploited by user A emailing a link to user B with the malicious JavaScript embedded in it.

The proper defence for XSS is to escape the untrusted input. In JSP, this can be done with the JSTL <c:out> tag or fn:escapeXml(). But this needs to happen everywhere untrusted data is displayed, and missing one place can result in a critical security vulnerability.
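To show what escaping does, here is a minimal hand-written sketch of HTML escaping in Java. It is for illustration only: in real code you would rely on the escaping built into your template technology or a maintained encoding library, and this sketch is only suitable for HTML element content and quoted attribute values, not for JavaScript or URL contexts. The class name `HtmlEscaper` is hypothetical.

```java
final class HtmlEscaper {
    // Replaces the characters that have special meaning in HTML with character references.
    static String escape(String input) {
        if (input == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }
}
```

After escaping, markup supplied in the name parameter is shown to the user as text instead of being interpreted by the browser.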

Similar to other examples, input validation will often save you in the event that you missed a single place where the output needs to be escaped. As noted above, the characters < > / and " are particularly dangerous in the context of HTML. These characters are rarely part of a white list of allowed user input.

Despite the dire warnings above, it is great to know that there are frameworks like Angular that escape all inputs by default, thus making XSS extremely unlikely to happen in that framework. Secure by default — what a novel concept.

Conclusion

White list input validation should always be done because it prevents a number of attacks that you may not foresee. This technique should happen as soon as data comes in, and invalid input should be rejected without further consideration. Input validation is not a panacea, so it should be coupled with specific defences that are relevant to the context in which the data is used. Input validation should be applied at the source, whereas the other specific defences are applied at the data sinks.