PHP and Unicode

astorm

This entry is part 4 of 4 in the series Text Encoding and Unicode. Earlier posts include Inspecting Bytes with Node.js Buffer Objects, Unicode vs. UTF-8, and When Good Unicode Encoding Goes Bad. This is the most recent post in the series.

PHP’s Unicode story is, well, not great.

PHP’s strings don’t know anything about text encoding. They are, under the hood, just an array of individual bytes. The lore has it that the abandoned PHP 6 project included attempts to make PHP strings Unicode aware (similar to how Python does it), but that this proved hard to do and was scrapped. The lore also has it that this was a major reason PHP 6 itself was scrapped, and that PHP 7 just skipped trying to bring Unicode strings to PHP.
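To make the "array of bytes" idea concrete, here is a minimal sketch that splits a string into its raw bytes. It assumes the text is UTF-8 and spells the ä out as hex escapes so the result doesn't depend on how the file is saved; the variable names are only for illustration.

```php
<?php
// "Hyvä" written as explicit UTF-8 bytes: ä is 0xC3 0xA4.
$string = "Hyv\xC3\xA4";

// str_split() without a length argument cuts the string into single bytes.
$bytes = array_map('ord', str_split($string));

echo implode(' ', $bytes), "\n"; // 72 121 118 195 164

// Indexing also works per byte, so offset 3 is only half of the ä.
echo bin2hex($string[3]), "\n"; // c3
```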

When I’m in a generous mood I can see one bright side to this state of affairs, and that’s that there’s a certain simplicity to passing around an opaque array of bytes. If there’s no text encoding on the string then there are no opportunities for the programmer to convert things in a wrong or unexpected way.

When I consider the ramifications, this spirit of generosity quickly dissolves.

Counting Char

If you create a PHP source file in a modern text editor that looks like this

# File: test.php
<?php
function main() {
    $string = 'Hyvä';
    echo 'This string is ' . strlen($string) . ' characters long',"\n";
}
main();

and run your program, PHP will probably say the string is five characters long. That’s because strlen counts bytes, and in UTF-8 the ä is encoded as two bytes, so PHP counts each of those bytes as an individual character.

I say probably because it depends on how you’ve saved your source file. Since 'Hyvä' is a string constant, PHP stores whatever bytes your editor wrote into the source file.

Save the same program with a text encoding of ISO 8859-1 and PHP will “correctly” see it as a four character string. This is because in ISO 8859-1 every character is a single byte.

Save the same program in a source file with UTF-16 and PHP (at least the version of 7.4 on my Mac) won’t run the program, likely because the <?php sequence doesn’t look right to the PHP engine when encoded as UTF-16.
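The five versus four comparison can be reproduced without re-saving the file in different encodings. This sketch writes the bytes out by hand, on the assumption that an editor would produce exactly these bytes for UTF-8 and ISO 8859-1.

```php
<?php
// The same four-letter word as two different byte sequences.
$utf8   = "Hyv\xC3\xA4"; // ä as the two UTF-8 bytes 0xC3 0xA4
$latin1 = "Hyv\xE4";     // ä as the single ISO 8859-1 byte 0xE4

// strlen() reports the number of bytes, whatever they are meant to encode.
echo strlen($utf8), "\n";   // 5
echo strlen($latin1), "\n"; // 4
```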

Regular Expressions

OK, so let’s skip string constants and load our Hyvä string from a file instead

# File: test.php
<?php
function main() {
    // the /tmp/source.txt file contains Hyvä encoded as UTF-8
    $string = trim(file_get_contents('/tmp/source.txt'));
    echo 'This string is ' . strlen($string) . ' characters long',"\n";
}
main();

PHP still thinks the string is 5 characters long, but at least we’re immune from problems due to the source file’s encoding now.

The effects of having no text encoding go far beyond string length. Consider a regular expression

# File: test.php
<?php
function main() {
    // the /tmp/source.txt file contains Hyvä encoded as UTF-8
    $string = trim(file_get_contents('/tmp/source.txt'));

    var_dump(
        preg_match('/Hyv[a-z]/', $string)
    );
}
main();

A reasonable person might expect the string to match the regular expression /Hyv[a-z]/, but it won’t. Again, PHP sees the fourth character of the string Hyvä as the first byte of its two byte UTF-8 encoding, and that byte (0xC3) isn’t in the range a-z. All your carefully crafted PHP regular expressions can fall apart if they encounter text that UTF-8 encodes as more than one byte per character.
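One way to see what the regular expression engine is looking at is to capture whatever follows Hyv and print it as hex. This is a sketch that assumes UTF-8 input and uses a literal string as a stand-in for the contents of /tmp/source.txt.

```php
<?php
// Stand-in for the contents of /tmp/source.txt: "Hyvä" as UTF-8 bytes.
$string = "Hyv\xC3\xA4";

// Without the u modifier the pattern is matched byte by byte.
$rangeResult = preg_match('/Hyv[a-z]/', $string);
var_dump($rangeResult); // int(0)

// A dot therefore captures one byte, not the whole ä.
preg_match('/Hyv(.)/', $string, $matches);
echo bin2hex($matches[1]), "\n"; // c3
```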

PHP Multibyte String Handling

There is a solution for anyone who wants to program for users of text outside of the US hegemony, and that’s the multibyte string extension (mbstring). Somewhat frustratingly, this is not a default extension, so depending on where you get your PHP from these functions may or may not be available.

This small program will produce results more in line with what we might expect thanks to the mb_strlen function.

# File: test.php
<?php
function main() {
  $string = trim(file_get_contents('/tmp/source.txt'));
  echo 'This string is ' . mb_strlen($string) . ' characters long',"\n";
}
main();

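But even here it’s not always obvious how the multibyte string functions count things. Emoji still seem to give them a hard time: mb_strlen counts Unicode code points rather than what a reader would see as a single character, so an emoji built from more than one code point is counted as more than one character. It counts the bellhop bell as a two character string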

$string = '🛎️';
echo 'This string is ' . mb_strlen($string) . ' characters long',"\n";
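A likely explanation is that this emoji is really two code points. The sketch below assumes it is the bell code point followed by an emoji variation selector, which is how it is commonly typed, and spells both out. The grapheme_strlen call comes from the intl extension and is guarded because that extension may not be installed.

```php
<?php
// Assumed composition: U+1F6CE (bellhop bell) + U+FE0F (variation selector-16).
$bell = "\u{1F6CE}\u{FE0F}";

$byteCount      = strlen($bell);             // 4 bytes + 3 bytes
$codePointCount = mb_strlen($bell, 'UTF-8'); // one per code point

echo $byteCount, "\n";      // 7
echo $codePointCount, "\n"; // 2

// intl's grapheme functions count user-perceived characters instead.
if (function_exists('grapheme_strlen')) {
    echo grapheme_strlen($bell), "\n";
}
```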

Also — the behavior of these mb_ functions is influenced by the encoding value set via the mb_internal_encoding function. This means it’s still up to you to know something about the strings you’re working with.
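Since the default comes from mb_internal_encoding, one defensive habit is to pass the encoding explicitly as the second argument. A small sketch, assuming the bytes are UTF-8, of how the answer changes when mb_strlen is told the wrong encoding:

```php
<?php
$string = "Hyv\xC3\xA4"; // "Hyvä" as UTF-8 bytes

// Told the truth, mb_strlen() decodes 0xC3 0xA4 as one character.
$asUtf8 = mb_strlen($string, 'UTF-8');

// Told the bytes are ISO 8859-1, it treats every byte as a character.
$asLatin1 = mb_strlen($string, 'ISO-8859-1');

echo $asUtf8, "\n";   // 4
echo $asLatin1, "\n"; // 5
```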

We’ll leave how all these functions work as an exercise for our more intrepid readers.

PHP Unicode Regular Expressions

The multibyte string functions include a series of regular expression functions (the mb_ereg_ family), although their names imply they use the old ereg_ regular expression syntax that was removed from PHP for non-multibyte strings.

It’s also possible to use preg_ (PCRE) regular expressions with Unicode strings via the u pattern modifier. This code will return 0 (preg_match only returns false on an error), indicating the regular expression didn’t match

var_dump(
    preg_match('%Hyv\w%','Hyvä')
);

However, if we add the u modifier to the regular expression

var_dump(
    preg_match('%Hyv\w%u','Hyvä')
);

Suddenly \w is able to match the ä. Be careful here though — some things may not work like you expect. For example, an a-z character range

var_dump(
    preg_match('%Hyv[a-z]%u','Hyvä')
);

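still won’t (at least on my computer) match the ä in Hyvä. The u modifier changes how the bytes are decoded, not which characters a range contains, and ä (U+00E4) sits outside a through z.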
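If the goal is "any letter", a Unicode property class is one option. This sketch assumes UTF-8 input and a PCRE build with Unicode property support, which is what PHP normally ships with.

```php
<?php
$string = "Hyv\u{E4}"; // "Hyvä" as UTF-8

// a-z is a range of code points, and ä (U+00E4) is not in it.
$range = preg_match('%Hyv[a-z]%u', $string);

// \p{L} matches any code point with the Unicode "letter" property.
$letter = preg_match('%Hyv\p{L}%u', $string);

// Listing the extra character explicitly also works.
$explicit = preg_match('%Hyv[a-z\x{E4}]%u', $string);

var_dump($range, $letter, $explicit); // int(0), int(1), int(1), one per line
```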

Iconv and intl

Two other PHP features to be aware of are the iconv and intl extensions. Both of these libraries contain functions and classes that allow you to convert a string that contains bytes encoded in one text format into a string that contains bytes encoded in a different text format. This is useful functionality, but it’s still up to you to correctly identify the current encoding of any string you want to convert.
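As a sketch of both the conversion and the "you have to know the source encoding" caveat, the following uses iconv and assumes the input really is ISO 8859-1. Running the same call on a string that is already UTF-8 double-encodes it. The preg_match('//u', ...) check at the end is a common idiom rather than anything from these extensions, and it only tells you whether the bytes are valid UTF-8, not what encoding they were meant to be.

```php
<?php
$latin1 = "Hyv\xE4"; // "Hyvä" as ISO 8859-1 bytes

// Correct: the input really is ISO 8859-1.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);
echo bin2hex($utf8), "\n"; // 487976c3a4

// Wrong guess: the input is already UTF-8, so each byte of the ä
// is converted again and the result is mojibake.
$doubled = iconv('ISO-8859-1', 'UTF-8', $utf8);
echo bin2hex($doubled), "\n"; // 487976c383c2a4

// The u modifier rejects subjects that are not valid UTF-8.
var_dump(preg_match('//u', $latin1)); // bool(false)
var_dump(preg_match('//u', $utf8));   // int(1)
```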

Takeaways

I’ve been vaguely aware of all this for a long time, but seeing it laid out so plainly is sobering. Most of the popular PHP frameworks in use by professionals the world over don’t use the mb_ string functions, and don’t use u based preg_ regular expressions. This means there’s a sea of subtle bugs just waiting to be stumbled upon or exploited with a bit of Unicode.

Copyright © Alan Storm 1975 – 2021 All Rights Reserved

Originally Posted: 12th February 2021