
As a result of all the noise about UTF-8, I got an email from Marek Gayer with some very smart tips on handling UTF-8. What follows is a discussion illustrating what happens when you get obsessed with performance and optimizations (be warned: it may be boring, depending on your perspective).

Outrunning mbstring case functions with native PHP implementations


The native PHP strtolower / strtoupper functions don't understand UTF-8; they can only handle characters in the ASCII range, plus they (may) examine your server's locale setting for further character information. The latter behaviour actually makes them dangerous to use on a UTF-8 string, because there's a chance that strtolower could mistake bytes in a UTF-8 multi-byte sequence for something it should convert to lowercase, breaking the encoding. That shouldn't be a problem if you're writing code for a server you control, but it is if you're writing software for other people to use.
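To make the risk concrete, here's a purely illustrative sketch; the de_DE.ISO-8859-1 locale name is an assumption, and the exact damage depends on the server's locale tables:

<?php
// Illustrative only: under a single-byte locale, strtolower() works byte by byte,
// so it can remap bytes that are really part of a UTF-8 multi-byte sequence.
setlocale(LC_CTYPE, 'de_DE.ISO-8859-1'); // assumed to be available on this server

$utf8  = "Zürich";           // "ü" is the two-byte UTF-8 sequence 0xC3 0xBC
$lower = strtolower($utf8);  // lowercases "Z", but may also touch 0xC3

// If the locale treats 0xC3 (Ã in ISO-8859-1) as uppercase and maps it to 0xE3,
// the result is no longer valid UTF-8.
var_dump($lower === "zürich"); // may be bool(false)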

Restricting locale behaviour


Turns out you can disable this locale behaviour by restricting your locale to the POSIX locale, which means only characters in the ASCII range will be considered (overriding whatever your server's locale settings are), by executing the following:


<?php
setlocale(LC_CTYPE, 'C');

That should work on any platform (certainly *nix-based and Windows) and affects more than just strtolower() / strtoupper(); other PHP functionality picks up information from the locale, such as the PCRE \w meta character, strcasecmp() and ucfirst(), all of which might have adverse effects on UTF-8. The only issue, as I see it, is if you're writing distributable software: should you be messing with setlocale in the first place? See the warning in the documentation here; it can be a problem on Windows, where you have only a single server process, so you may be affecting other apps running on the server.
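If you do change the locale in distributable code, one small courtesy (a sketch, and it doesn't solve the per-process problem on threaded servers) is to put the original setting back when you're done:

<?php
// Remember the current LC_CTYPE setting, switch to the POSIX locale for the
// string work, then restore the original afterwards.
$oldLocale = setlocale(LC_CTYPE, '0');  // passing "0" just returns the current setting
setlocale(LC_CTYPE, 'C');

// ... ASCII-only case conversion, regexes etc. happen here ...

setlocale(LC_CTYPE, $oldLocale);        // put things back the way they were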

Fast Case Conversion


To make it possible to do case conversion (e.g. strtolower/upper) without depending on mbstring (because who knows if shared hosts have installed it?), applications like Mediawiki (as in Wikipedia) and Dokuwiki solve this by implementing pure-PHP versions of these functions and using lookup arrays like this or this (the $UTF8_LOWER_TO_UPPER variable towards the end of the script). This works because only a limited selection of alphabets have the notion of case in the first place; the array is big, but not so big that it's a terrible performance overhead. What's interesting to note about both those lookup arrays is that they contain characters in the ASCII range. They also support many alphabets. Mediawiki then (essentially) does a strtoupper like this (at least in the 1.7.1 release, see languages/LanguageUtf8.php; this seems to have changed since under SVN):


// ... bunch of stuff removed
return preg_replace(
    "/$x([a-z]|[\\xc0-\\xff][\\x80-\\xbf]*)/e",
    "strtr( \"\$1\" , \$wikiUpperChars )",
    $str
);

It's locating each valid UTF-8 character sequence and executing PHP's strtr() function with the lookup array, via callback using the /e pattern modifier (time to phone a friend?), to convert the case. That keeps memory use minimal, traded against performance (probably not benchmarked): many callbacks / evals. Dokuwiki (and phputf8) uses a similar approach, but first splits the input string into an array of UTF-8 sequences and sees if they match in the lookup array. This is PHP UTF8's implementation, which is almost the same (utf8_to_unicode() converts a UTF-8 string to an array of sequences, representing characters, and utf8_from_unicode() does the reverse):


function utf8_strtolower($string) {
    global $UTF8_UPPER_TO_LOWER;
    $uni = utf8_to_unicode($string);
    if ( !$uni ) {
        return FALSE;
    }
    $cnt = count($uni);
    for ($i = 0; $i < $cnt; $i++) {
        if ( isset($UTF8_UPPER_TO_LOWER[$uni[$i]]) ) {
            $uni[$i] = $UTF8_UPPER_TO_LOWER[$uni[$i]];
        }
    }
    return utf8_from_unicode($uni);
}

That's going to use more memory for a short period, given that it copies the input string as an array (actually that needs fixing!), plus an array needs more space than a string to store the equivalent information, but it (should) be faster. Anyway, enter Marek's approach, which can be summarized as:


function StrToLower ($s) {
    global $TabToLower;
    return strtr (strtolower ($s), $TabToLower);
}

where $TabToLower is the lookup table (now minus the ASCII character lookups, which are handled by strtolower). Note the code Marek showed me uses classes; this is just a simplification. It relies on the POSIX locale being set (otherwise the UTF-8 encoding might get broken) and exploits a facet of UTF-8's design, namely that any complete sequence in a valid UTF-8 string is unique (it can't be mistaken for part of a longer sequence). You also need to read the strtr() documentation very carefully: "strtr() may be called with only two arguments. If called with two arguments it behaves in a new way: from then has to be an array that contains string -> string pairs that will be replaced in the source string. strtr() will always look for the longest possible match first and will *NOT* try to replace stuff that it has already worked on." I've yet to benchmark this, but Marek tells me he's found it to be roughly 3x faster than the equivalent mbstring functions, which I can believe.

Marek also employs some smart tricks for handling the lookup arrays. Both the Dokuwiki and Mediawiki approaches have all possible case conversions defined, i.e. they apply to multiple human languages. While this may be appropriate for user-submitted content, when you're doing stuff like localization of your UI, chances are you'll only be using a single language; you don't need the full lookup table, just the mappings applicable to the language involved, assuming you know what those are (a sketch of that idea follows below). Also, you might think about looking at the incoming $_SERVER['HTTP_ACCEPT_LANGUAGE'] header from the browser. Anyway, when I get some time, I'll figure out how to use Marek's ideas in PHP UTF-8.
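For illustration only, here's what a trimmed, single-language table might look like, plugged into the StrToLower() above; the German entries are my own assumption, not Marek's actual code:

<?php
// Hypothetical trimmed lookup table: only the non-ASCII upper -> lower mappings
// a German-only UI actually needs; ASCII is left to strtolower() itself.
$TabToLower = array(
    "\xC3\x84" => "\xC3\xA4",  // Ä -> ä
    "\xC3\x96" => "\xC3\xB6",  // Ö -> ö
    "\xC3\x9C" => "\xC3\xBC",  // Ü -> ü
);

setlocale(LC_CTYPE, 'C');               // keep strtolower() away from the locale
echo StrToLower("GRÜSSE AUS MÜNCHEN");  // grüsse aus münchen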

Output Conversion
Another smart tip from Marek, which I haven't seen discussed before, is how to deliver content to clients that can't deal with UTF-8, e.g. old browsers, phones(?). His approach is simple and effective: once you've finished building the output page, capture it in an output buffer, check what the client sent as acceptable character sets ($_SERVER['HTTP_ACCEPT_CHARSET']) and convert (downgrade) the output with iconv if necessary. You need to be careful examining the content of that header and processing it correctly. You also need to make sure you've redeclared the charset in the Content-Type header, plus any HTML meta tags or the encoding in an XML processing instruction. But this is certainly the serious / accessible way to solve the problem in PHP.
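Here's a minimal sketch of that idea, assuming the page is built in UTF-8; the Accept-Charset handling is deliberately naive (no q-values) and the ISO-8859-1 fallback is just an example:

<?php
// Sketch only: buffer the whole page, then downgrade it with iconv if the
// client's Accept-Charset header suggests it can't handle UTF-8.
function downgrade_output($html) {
    $accept = isset($_SERVER['HTTP_ACCEPT_CHARSET']) ? $_SERVER['HTTP_ACCEPT_CHARSET'] : '';
    if ($accept !== '' && stripos($accept, 'utf-8') === false && strpos($accept, '*') === false) {
        $target = 'ISO-8859-1'; // illustrative fallback; pick from the header in practice
        $converted = iconv('UTF-8', $target . '//TRANSLIT', $html);
        if ($converted !== false) {
            header('Content-Type: text/html; charset=' . $target);
            // remember any <meta> charset declaration in the markup needs fixing too
            return $converted;
        }
    }
    header('Content-Type: text/html; charset=UTF-8');
    return $html;
}

ob_start('downgrade_output');
// ... build the page in UTF-8 as usual ...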

Moral of the story


is that it's worth talking to people who actually need UTF-8, vs. those in countries complacently using ISO-8859-1 (which doesn't natively support the Euro symbol, BTW!). Given that Mediawiki has done Unicode Normalization in PHP (here), the only remaining piece of the puzzle is Unicode Collation (e.g. for sorting); here's a nice place for inspiration. After that, who needs PHP 6 ;)
