In our eBook PHP 7 Explained, we have already explained why the successor of PHP 5 is PHP 7 rather than PHP 6.

Since the attempt to create a Unicode-based PHP implementation has failed, PHP 7, just like PHP 5, does not handle Unicode strings natively. The commonly used UTF-8 encoding, for example, is a multibyte encoding, as opposed to ASCII, where each character is represented by a single byte.

Calculating the string length is trivial for ASCII characters: just count the bytes. Calculating the length of a UTF-8 encoded string is more challenging, because UTF-8 is a variable-length encoding in which each character (code point, to be exact) is represented by one to four bytes. As long as a string contains only ASCII characters, everything works smoothly, because UTF-8 is a superset of ASCII. The problems start with non-ASCII characters:

var_dump(strlen('ö'));

This simple script, at least when saved as UTF-8, will produce a most interesting result:

int(2)

Encoded as UTF-8, the single German umlaut occupies two bytes. Since PHP does not know about UTF-8 (or Unicode in general), the built-in strlen() function just counts bytes, so it reports the number of bytes rather than the number of characters.
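To illustrate the variable-length nature of UTF-8, here is a minimal sketch that runs strlen() on one character from each of the four UTF-8 length classes. The sample characters and the variable names are our own choice for illustration. The strings use PHP 7's `\u{...}` escape syntax, so the result does not depend on how the script file itself is saved.

```php
<?php
// One sample character for each UTF-8 length class (one to four bytes).
// "\u{...}" escapes (PHP 7.0+) always produce UTF-8 encoded bytes.
$samples = [
    'U+0061 latin small letter a'     => "a",
    'U+00F6 latin small letter o with diaeresis' => "\u{00F6}",
    'U+20AC euro sign'                => "\u{20AC}",
    'U+1F600 grinning face'           => "\u{1F600}",
];

// strlen() counts bytes, not characters.
// array_map() with a single array preserves the string keys.
$byteLengths = array_map('strlen', $samples);

foreach ($byteLengths as $label => $bytes) {
    printf("%s: %d byte(s)\n", $label, $bytes);
}
```

With the built-in, non-overloaded strlen(), the four samples report 1, 2, 3, and 4 bytes, although each of them is a single code point.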

Commonly used PHP extensions such as iconv or mbstring ("multibyte string") offer Unicode-enabled string handling functions, for example mb_strlen() (which, of course, requires the mbstring extension):

var_dump(mb_strlen('ö'));

This function counts code points rather than bytes and thus yields the correct result:

int(1)

You can do the same with the `iconv` extension, if you have that one installed:

var_dump(iconv_strlen('ö'));

Unsurprisingly, this function yields the same result:

int(1)

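In both cases, we are cheating a little, since we are not specifying that our string is UTF-8 encoded. This works because both functions fall back to a default encoding when none is given, and since PHP 5.6 that default (the default_charset setting) is UTF-8, in line with the convention used pretty much everywhere on the Internet.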
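Not cheating means passing the encoding as an explicit argument. The following sketch assumes that both the mbstring and the iconv extension are loaded. It also shows what happens when the declared encoding does not match the actual bytes. The variable names are ours.

```php
<?php
// "ö" as UTF-8: the two bytes 0xC3 0xB6.
$umlaut = "\u{00F6}";

$lengths = [
    // Correct encoding declared: one code point.
    'mb_strlen, UTF-8'      => mb_strlen($umlaut, 'UTF-8'),
    'iconv_strlen, UTF-8'   => iconv_strlen($umlaut, 'UTF-8'),
    // Wrong encoding declared: ISO-8859-1 is a single-byte encoding,
    // so each of the two bytes is counted as a character of its own.
    'mb_strlen, ISO-8859-1' => mb_strlen($umlaut, 'ISO-8859-1'),
];

var_dump($lengths);
```

The first two calls return 1. The third returns 2, because the function can only be as correct as the encoding it is told to assume.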

Now we will add magic into the mix, and new problems will arise. If you are using the mbstring extension then you can use the php.ini directive mbstring.func_overload to overload built-in PHP functions with the multibyte-enabled mb_ functions. Depending on the value you set mbstring.func_overload to, the mail() function, string functions, and regular expressions (unfortunately not the preg_ ones, but the removed ereg_ ones) can be overloaded.
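The value of mbstring.func_overload is a bitmask: 1 overloads mail(), 2 overloads the string functions, and 4 overloads the ereg_ functions. As a rough sketch, a script could inspect the setting like this. The constant and the function describeFuncOverload() are hypothetical names introduced here for illustration, not part of PHP.

```php
<?php
// Bit flags of mbstring.func_overload as documented for PHP 5 and PHP 7.
// The constant name and the descriptions are our own.
const FUNC_OVERLOAD_GROUPS = [
    1 => 'mail()',
    2 => 'string functions such as strlen()',
    4 => 'ereg_ regular expression functions',
];

/**
 * Returns the function groups that a given func_overload value overloads.
 */
function describeFuncOverload(int $value): array
{
    $active = [];

    foreach (FUNC_OVERLOAD_GROUPS as $bit => $group) {
        if (($value & $bit) !== 0) {
            $active[] = $group;
        }
    }

    return $active;
}

// ini_get() returns false when the directive does not exist,
// for example when mbstring is not loaded.
$raw     = ini_get('mbstring.func_overload');
$current = $raw === false ? 0 : (int) $raw;

if ($current === 0) {
    echo "No functions are overloaded.\n";
} else {
    echo 'Overloaded: ', implode(', ', describeFuncOverload($current)), "\n";
}
```

A value of 7, for example, combines all three flags. A check like this is useful as a one-off diagnostic, but it is not something you want to repeat around every string function call.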

The problem with this magic is that your program cannot know whether PHP's string functions operate with or without support for multibyte characters. And you certainly do not want to wrap an if around every string function. So just like with magic quotes, which we wrote about earlier, using mbstring.func_overload is not a good idea. That is why this php.ini directive has been deprecated in PHP 7.2, and will likely be removed in PHP 8.

Even if it potentially means a lot of work: you have to walk through your code and make explicit which encodings you work with. Do not wait until PHP 8, because by then code that still relies on mbstring.func_overload will keep you from upgrading. You effectively want to get started with your PHP 8 migration right now.
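One possible way to make the intent explicit is to route length calculations through two small helpers, one for bytes and one for characters. This is only a sketch under the assumption that the mbstring extension is available. The function names byteLength() and characterLength() are hypothetical, and the choice to reject invalid input with an exception is ours.

```php
<?php
declare(strict_types=1);

/**
 * Number of bytes in the string.
 * The '8bit' encoding makes mbstring treat every byte as one character,
 * so the result is a byte count even when func_overload is active.
 */
function byteLength(string $string): int
{
    return mb_strlen($string, '8bit');
}

/**
 * Number of code points in the string, with the encoding stated explicitly.
 * Input that is not valid in the given encoding is rejected.
 */
function characterLength(string $string, string $encoding = 'UTF-8'): int
{
    if (!mb_check_encoding($string, $encoding)) {
        throw new InvalidArgumentException(
            sprintf('String is not valid %s', $encoding)
        );
    }

    return mb_strlen($string, $encoding);
}

$umlaut = "\u{00F6}"; // "ö" encoded as UTF-8

var_dump(byteLength($umlaut));      // int(2)
var_dump(characterLength($umlaut)); // int(1)
```

Code written this way no longer depends on whether strlen() has been overloaded, so it behaves the same before and after the directive disappears.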

This article is an excerpt from our eBook PHP 7 Explained, which we have updated for PHP 7.2 recently.
