Data security is important and often undervalued by designers, developers, and clients alike. Since PHP 5.2.0, data sanitization and validation have been made significantly easier with the introduction of data filtering. Today, we're going to take a closer look at these filters, how to use them, and build a few custom functions.

Tutorial Details

Program: PHP

Version: 5.2.0+

Difficulty: Beginner

Estimated Completion Time: 20 minutes

Introduction

I have always felt that it's easy to write code in PHP, and even easier to write bad code in PHP. The proliferation of PHP on the web has been helped along by its use in popular open-source software packages like WordPress, Drupal, and Magento, as well as in major web applications like Facebook. With PHP being used in so many varied settings (dynamic websites, in-depth web applications, blogging platforms, content management systems, and e-commerce being only a subset), the opportunities for dirty data and insecure systems are numerous. This tutorial explains some methods of getting clean with PHP: we'll focus on several different forms of data input and on how to sanitize and validate them with PHP filters and custom functions.

Why Sanitize and Validate?

In this tutorial, we are really focused on data inputs that users or external sources may provide. This means that we do not control the data we are receiving. All we can do is control what is done with it after we receive it. There are all sorts of threats related to data security from user-inputs and third-party data.

Some un-popular data security threats:

Cross-Site Scripting (XSS): A form of code injection where a script is injected onto a website from a completely different website. This is by far the most common security vulnerability online. Two recent, very prominent examples of this technique are the Stalk Daily and Mikeyy Twitter Worms from earlier this year that used poorly sanitized inputs to launch JavaScript via an "infected" Twitter web interface.

SQL Injection: The second most common security vulnerability online, this is another form of code injection in which a script is used to participate in one of numerous exploitative behaviors including (but not limited to) exposing and/or gaining unauthorized access to data, altering data inside of a database, or simply injecting code to be rendered or executed within a website, thereby breaking or altering the website.

Cross-Site Request Forgery (CSRF/XSRF): A less common exploit that relies more on data sources like browser and session cookies than on poorly sanitized and validated data inputs, CSRF (pronounced "sea-surf") can be used to execute commands on a website without the user's permission. One popular CSRF method is using an improperly formed image data URI or src value to execute a script instead of displaying an image.

Improper Data: Not really a "security vulnerability" per se, improper data can cause hosts of problems for a website owner or database administrator. Often, improper data can break poorly coded websites or cause automated systems to crash. An example of this was the ability to alter entire MySpace profile pages by posting all sorts of HTML/CSS hackery (Note: this may still work; I've not used MySpace in a long time).

For our purposes, we are going to only focus on server-side methods of improving data security with PHP, so let's see how the terms "sanitization" and "validation" are defined with relation to PHP. According to the PHP manual:

Validation is used to validate or check if the data meets certain qualifications. For example, passing in FILTER_VALIDATE_EMAIL will determine if the data is a valid email address, but will not change the data itself. Sanitization will sanitize the data, so it may alter it by removing undesired characters. For example, passing in FILTER_SANITIZE_EMAIL will remove characters that are inappropriate for an email address to contain. That said, it does not validate the data.

Essentially, if your website is the nightclub that everybody wants to get into, validation checks the guest list and IDs at the door while sanitization acts as the bouncer that throws out any undesirables that happen to squeak past. With this in mind, let's take a look at the PHP Filters Extension.

What Filters Do I Have?

Not all PHP installations are created equal. While filters were introduced in PHP 5.2.0, not all installations have the same set of filters in their Filters Extension. Most installations will have all of the filters we're going to go over, but to teach you a bit about the Filters Extension, we're going to find out just what you have on your server. In the source download, I have included a file called getfilters.php that, once installed and run on your server, will display all of your filters (both data filters available through the filter_var function and stream filters available through stream_filter_append).

First, we get the array containing the list of all available filters with filter_list, then we loop through the array and echo out the filter name, find out the filter's assigned ID, and echo this ID as well.
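A minimal sketch of what such a script might look like. This is not the getfilters.php from the source download; the $filters variable and the output format are assumptions made for illustration:

```php
<?php
// Collect every data filter registered on this server with its numeric ID.
$filters = array();
foreach (filter_list() as $name) {
    $filters[$name] = filter_id($name);
    echo $name . ' => ' . $filters[$name] . PHP_EOL;
}

// Stream filters are listed separately, by stream_get_filters().
foreach (stream_get_filters() as $name) {
    echo 'stream: ' . $name . PHP_EOL;
}
```

The IDs echoed here are the same integers that constants such as FILTER_VALIDATE_INT resolve to, so either form can be handed to filter_var.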

How Do I Use A Filter?

PHP Filters for validation and sanitization are activated by passing at least two values to the PHP Filters Extension function filter_var: the data to filter and the filter to apply. As an example, let's use the Sanitize Filter for an integer.

Here, a variable $value is passed through the Filters Extension function filter_var using the FILTER_SANITIZE_NUMBER_INT filter, in the form filter_var($value, FILTER_SANITIZE_NUMBER_INT), and the filtered result is echoed out.

The Sanitize Filter for an Integer number removes all non-integer characters from the output and produces a clean integer. Within the download source code, you can try out various inputs and it will apply a number of common filters to your input value. I have included a number of different example strings that you can test out as well.
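As a rough illustration of the call described above, here is a short sketch. The sample string is made up and simply stands in for whatever $value holds in your own script:

```php
<?php
// Sample input (an assumption for this sketch, not the tutorial's original value).
$value = '123abc456def';

// Everything except digits and plus/minus signs is stripped out.
$clean = filter_var($value, FILTER_SANITIZE_NUMBER_INT);

echo $clean; // prints 123456
```

Note that the result is still a string; cast it with (int) if you need a true integer.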

What Do The Different Filters Do?

The list below is not complete, but it does contain the majority of the filters that come standard with 5.2.0+ installations. Custom filters and those added from custom extensions are not included here.

FILTER_VALIDATE_BOOLEAN: Checks whether or not the data passed to the filter represents a boolean value. It returns TRUE for "1", "true", "on", and "yes", and FALSE for anything else. (When the FILTER_NULL_ON_FAILURE flag is set, FALSE is returned only for "0", "false", "off", "no", and the empty string, and NULL is returned for all other non-boolean values.)
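A small sketch of this filter in action. The values assigned to $value01 and $value02 are made-up sample data:

```php
<?php
$value01 = 'yes';   // one of the strings the filter treats as TRUE
$value02 = 'maybe'; // not a boolean-like value

$result01 = filter_var($value01, FILTER_VALIDATE_BOOLEAN);
$result02 = filter_var($value02, FILTER_VALIDATE_BOOLEAN);

echo $result01 ? 'TRUE' : 'FALSE'; // prints TRUE
echo PHP_EOL;
echo $result02 ? 'TRUE' : 'FALSE'; // prints FALSE
echo PHP_EOL;

// With FILTER_NULL_ON_FAILURE, a non-boolean value yields NULL instead of FALSE.
$result03 = filter_var($value02, FILTER_VALIDATE_BOOLEAN, FILTER_NULL_ON_FAILURE);
```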

FILTER_VALIDATE_EMAIL: Checks whether or not the data passed to the filter is a potentially valid e-mail address. It does not check whether the e-mail address actually exists, just that the format of the e-mail address is valid. A value that lacks the required @domain.tld portion, for example, fails validation.
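A quick sketch, with made-up sample addresses in $value01 and $value02. On success filter_var hands back the address itself rather than TRUE, which is why the check compares against FALSE:

```php
<?php
$value01 = 'test@example.com';
$value02 = 'nettuts'; // lacks the @domain.tld portion

$result01 = filter_var($value01, FILTER_VALIDATE_EMAIL);
$result02 = filter_var($value02, FILTER_VALIDATE_EMAIL);

echo ($result01 !== false) ? 'TRUE' : 'FALSE'; // prints TRUE
echo PHP_EOL;
echo ($result02 !== false) ? 'TRUE' : 'FALSE'; // prints FALSE
```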

FILTER_VALIDATE_FLOAT: Checks whether or not the data passed to the filter is a valid float value. A value containing comma separators fails validation, because comma separators are not allowed in float values by default.
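A short sketch with made-up sample values. A valid input comes back converted to a PHP float:

```php
<?php
$value01 = '1.5';
$value02 = '1,500.5'; // comma separator, rejected by default

$result01 = filter_var($value01, FILTER_VALIDATE_FLOAT);
$result02 = filter_var($value02, FILTER_VALIDATE_FLOAT);

echo ($result01 !== false) ? 'TRUE' : 'FALSE'; // prints TRUE
echo PHP_EOL;
echo ($result02 !== false) ? 'TRUE' : 'FALSE'; // prints FALSE
```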

FILTER_VALIDATE_INT: Checks whether or not the data passed to the filter is a valid integer value. Fractions and decimal numbers are not integers, so they fail validation.
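A sketch with made-up sample values. The strict comparison against FALSE matters here, because a valid input of "0" comes back as the integer 0, which a loose check would mistake for a failure:

```php
<?php
$value01 = '123';
$value02 = '12.3'; // a decimal number, not an integer

$result01 = filter_var($value01, FILTER_VALIDATE_INT);
$result02 = filter_var($value02, FILTER_VALIDATE_INT);

echo ($result01 !== false) ? 'TRUE' : 'FALSE'; // prints TRUE
echo PHP_EOL;
echo ($result02 !== false) ? 'TRUE' : 'FALSE'; // prints FALSE
```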

FILTER_VALIDATE_IP: Checks whether or not the data passed to the filter is a potentially valid IP address. It does not check if the IP address would resolve, just that it fits the required data structure for IP addresses.
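A sketch with two made-up sample values, one well formed and one with far too many segments:

```php
<?php
$value01 = '192.168.0.1';
$value02 = '1.2.3.4.5.6.7.8.9'; // too many segments to be an IP address

$result01 = filter_var($value01, FILTER_VALIDATE_IP);
$result02 = filter_var($value02, FILTER_VALIDATE_IP);

echo ($result01 !== false) ? 'TRUE' : 'FALSE'; // prints TRUE
echo PHP_EOL;
echo ($result02 !== false) ? 'TRUE' : 'FALSE'; // prints FALSE
```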

FILTER_VALIDATE_URL: Checks whether or not the data passed to the filter is a potentially valid URL. It does not check if the URL would resolve, just that it fits the required data structure for URLs.
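A sketch with made-up sample values. The second one is missing its scheme (the http:// part), which is enough for it to fail:

```php
<?php
$value01 = 'http://www.example.com';
$value02 = 'www.example.com'; // no scheme

$result01 = filter_var($value01, FILTER_VALIDATE_URL);
$result02 = filter_var($value02, FILTER_VALIDATE_URL);

echo ($result01 !== false) ? 'TRUE' : 'FALSE'; // prints TRUE
echo PHP_EOL;
echo ($result02 !== false) ? 'TRUE' : 'FALSE'; // prints FALSE
```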

FILTER_SANITIZE_STRING: By default, this filter removes any data from a string that is invalid or not allowed in that string. For example, it strips any HTML tags, like <script> or <strong>, from an input string and returns what is left. (This filter has been deprecated as of PHP 8.1.)

FILTER_SANITIZE_ENCODED: Many programmers use PHP's urlencode() function to handle their URL encoding. This filter essentially does the same thing: it encodes any spaces and special characters in an input string, so punctuation, spaces, and brackets all come back URL-encoded.
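A brief sketch with a made-up sample string. One small difference from urlencode() shows up here: the filter encodes a space as %20, where urlencode() would produce a plus sign.

```php
<?php
$value = 'Hello World!';

// Everything except letters, digits, and "-", "_", "." is percent-encoded.
$encoded = filter_var($value, FILTER_SANITIZE_ENCODED);

echo $encoded; // prints Hello%20World%21
```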

FILTER_SANITIZE_SPECIAL_CHARS: This filter will, by default, HTML-encode special characters like quotes, ampersands, and brackets (in addition to characters with ASCII value less than 32). The demo page does not make this abundantly clear, because the browser interprets and renders the HTML-encoded special characters, but if you take a look at the page source you'll see the encoding at work: the special characters are converted into their HTML-encoded selves.
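A sketch with a made-up sample string. This filter writes numeric entities (such as &#60;) rather than named ones (such as &lt;):

```php
<?php
$value = '<b>Tom & Jerry</b>';

$encoded = filter_var($value, FILTER_SANITIZE_SPECIAL_CHARS);

// Run from the command line (or view the page source) to see the raw entities.
echo $encoded; // prints &#60;b&#62;Tom &#38; Jerry&#60;/b&#62;
```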

FILTER_SANITIZE_EMAIL: This filter does exactly what one would think it does. It removes any characters that are invalid in e-mail addresses (like parentheses, angle brackets, colons, etc). For example, let's say you accidentally added parentheses around a letter of your e-mail address (don't ask how, use your imagination). The filter removes those parentheses and you get your beautiful e-mail address back.

This is a great filter to use on e-mail forms in concert with FILTER_VALIDATE_EMAIL to reduce user error or prevent XSS-related attacks (as some past XSS attacks involved the returning of the original data provided in a non-sanitized e-mail field directly to the browser).
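A sketch of the two filters working in concert, using a made-up address with stray parentheses:

```php
<?php
$value = 't(e)st@example.com';

// Step 1: strip characters that cannot appear in an e-mail address.
$clean = filter_var($value, FILTER_SANITIZE_EMAIL);

// Step 2: confirm that what is left has a valid e-mail format.
$valid = filter_var($clean, FILTER_VALIDATE_EMAIL) !== false;

echo $clean;                    // prints test@example.com
echo PHP_EOL;
echo $valid ? 'TRUE' : 'FALSE'; // prints TRUE
```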

FILTER_SANITIZE_URL: Similar to the e-mail address sanitize filter, this filter does exactly what one would think, as well. It removes any characters that are invalid in a URL (like certain UTF-8 characters, etc). For example, let's say you accidentally added a "®" into your website's URL (again, don't ask how, pretend a velociraptor did it). The filter removes the unwanted "®" and you get your handsome URL back.
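A sketch of the velociraptor scenario with a made-up URL. The "®" is written as its UTF-8 byte sequence so the example does not depend on how the file is saved:

```php
<?php
// "\xC2\xAE" is the UTF-8 encoding of the registered-trademark sign.
$value = "http://www.e\xC2\xAExample.com";

$clean = filter_var($value, FILTER_SANITIZE_URL);

echo $clean; // prints http://www.example.com
```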

FILTER_SANITIZE_NUMBER_INT: This filter is similar to FILTER_VALIDATE_INT, but instead of simply checking whether the value is an integer, it actually removes everything non-integer from the value! Handy, indeed, for pesky spambots and tricksters in some input forms. Those silly letters and decimals get thrown right out; only digits and plus and minus signs survive.

FILTER_SANITIZE_NUMBER_FLOAT: This filter is similar to FILTER_VALIDATE_FLOAT, but instead of simply checking whether the value is a float, it actually removes everything that does not belong in a number. By default that includes the decimal point, so again, all those silly letters and decimals get thrown right out.

But what if you wanted to keep a decimal? One of the main reasons why FILTER_SANITIZE_NUMBER_FLOAT and FILTER_SANITIZE_NUMBER_INT are separate filters is to allow for this via a special flag, FILTER_FLAG_ALLOW_FRACTION, that is added as a third value passed to filter_var. With the flag set, the filter keeps the decimal.
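A sketch of both behaviors side by side, using a made-up sample string:

```php
<?php
$value = 'abc456.789xyz';

// Without a flag, the decimal point is removed along with the letters.
$without = filter_var($value, FILTER_SANITIZE_NUMBER_FLOAT);

// With FILTER_FLAG_ALLOW_FRACTION, the decimal point is kept.
$with = filter_var($value, FILTER_SANITIZE_NUMBER_FLOAT, FILTER_FLAG_ALLOW_FRACTION);

echo $without; // prints 456789
echo PHP_EOL;
echo $with;    // prints 456.789
```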

Options, Flags, and Array Controls, OH MY!

The flag in this last example is just one of many more options, flags, and array controls that allow you to have more granular control over what types of data get sanitized, definitions of delimiters, how arrays are processed by the filters, and more. You can find more about these flags and other filter-related functions in the PHP manual's Filters Extension section.
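To give a feel for two of these controls, here is a hedged sketch. The age range, the field names, and the sample input are all assumptions invented for illustration. It passes an options array to FILTER_VALIDATE_INT and then filters a whole array in one call with filter_var_array:

```php
<?php
// Options: only accept integers between 1 and 120 (an arbitrary range for this sketch).
$options = array('options' => array('min_range' => 1, 'max_range' => 120));

$ageOk  = filter_var('42', FILTER_VALIDATE_INT, $options);  // in range
$ageBad = filter_var('150', FILTER_VALIDATE_INT, $options); // out of range

// Array control: apply a different filter to each key of an input array.
$input = array('email' => 't(e)st@example.com', 'qty' => '3 boxes');
$rules = array(
    'email' => FILTER_SANITIZE_EMAIL,
    'qty'   => FILTER_SANITIZE_NUMBER_INT,
);
$clean = filter_var_array($input, $rules);

print_r($clean);
```

Here $ageOk holds the integer 42, $ageBad holds FALSE, and $clean holds the sanitized e-mail address and quantity under their original keys.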

Other Methods of Sanitizing Data with PHP

Now, we'll go over a few key supplemental methods of sanitizing data with PHP to prevent "dirty data" from wreaking havoc on your systems. These are especially useful for applications still running PHP 4, as they were all available when it was released.

htmlspecialchars: This PHP function converts 5 special characters into their corresponding HTML entities:

'&' (ampersand) becomes '&amp;'

'"' (double quote) becomes '&quot;' when ENT_NOQUOTES is not set.

''' (single quote) becomes '&#039;' only when ENT_QUOTES is set.

'<' (less than) becomes '&lt;'

'>' (greater than) becomes '&gt;'

It is used like any other PHP string function: pass in the string and work with the value it returns.
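A sketch with a made-up sample string. ENT_QUOTES is passed explicitly so that the single quote is converted as well:

```php
<?php
$raw = "Tom & Jerry's <b>show</b>";

$safe = htmlspecialchars($raw, ENT_QUOTES);

// Run from the command line (or view the page source) to see the raw entities.
echo $safe; // prints Tom &amp; Jerry&#039;s &lt;b&gt;show&lt;/b&gt;
```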

htmlentities: Like htmlspecialchars, this PHP function converts characters into their corresponding HTML entities. The big difference is that ALL characters that can be converted will be converted. This is a useful method of obfuscating e-mail addresses from some bots that collect e-mail addresses, as not all of them are programmed to read htmlentities.

It is used like any other PHP string function: pass in the string and work with the value it returns.
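A sketch with a made-up sample string. The accented "é" is written as its UTF-8 byte sequence, and the charset is passed explicitly, so the result does not depend on file encoding or PHP version defaults:

```php
<?php
// "\xC3\xA9" is the UTF-8 encoding of "é".
$raw = "Caf\xC3\xA9 <b>menu</b>";

$encoded = htmlentities($raw, ENT_QUOTES, 'UTF-8');

echo $encoded; // prints Caf&eacute; &lt;b&gt;menu&lt;/b&gt;
```

Where htmlspecialchars would have left the "é" alone, htmlentities converts it too.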

mysql_real_escape_string: This MySQL function helps protect against SQL injection attacks. It is considered a best practice (or even a mandatory practice) to pass all data that is being sent to a MySQL query through this function. It escapes any special characters that could be problematic and would cause little Bobby Tables to destroy yet another school's student database. (The mysql extension this function belongs to was removed in PHP 7.0; mysqli and PDO provide equivalents.)

Custom Functions

For many people, these built-in filters and functions are just not good enough. Validation of data like phone numbers, zip codes, or even e-mail addresses often requires stricter validation and masking. To do this, many people create custom functions to validate that their data is real. An example of this may be as simple as using a MySQL query to look up the data in a database of known values.

Other custom functions do not rely on databases of known values, and can be built by checking magic quotes, stripping slashes, and escaping the data for insertion into a database.

The possibilities are endless, especially if you integrate regular expressions, but for most occasions, the PHP Filters Extension should do the trick.
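As one hedged example of mixing regular expressions with the Filters Extension, here is a sketch of a custom validator. The function name validate_zip and the US ZIP code pattern are hypothetical choices for illustration, not part of the original source download:

```php
<?php
// Hypothetical helper: accepts US ZIP codes such as 90210 or 90210-1234.
// Returns the trimmed ZIP code on success, or FALSE on failure.
function validate_zip($value)
{
    $options = array(
        'options' => array('regexp' => '/^\d{5}(-\d{4})?$/'),
    );

    // FILTER_VALIDATE_REGEXP returns the input when the pattern matches.
    return filter_var(trim($value), FILTER_VALIDATE_REGEXP, $options);
}

var_dump(validate_zip(' 90210 '));    // string(5) "90210"
var_dump(validate_zip('90210-1234')); // string(10) "90210-1234"
var_dump(validate_zip('9021'));       // bool(false)
```

Wrapping the pattern in a function keeps the rule in one place, so every form that collects a ZIP code validates it the same way.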
