Perl Programming/Unicode UTF-8

Overview
In the context of web application development, Unicode with UTF-8 encoding is the best way to support multiple languages in your web application. Multiple languages can even be supported on the same web page. Unicode (usually in UTF-8 form) is replacing ASCII and the use of 8-bit "code pages" such as ISO-8859-1 and Windows-1252.

Unicode
Unicode [1] is a standard that specifies all of the characters for most of the world's writing systems. Each character is assigned a unique codepoint, such as U+0030. The first 256 code points are the same as ISO-8859-1 [2] to make it trivial to convert existing Western/Latin-1 text. To view properties [3] for a particular codepoint:

    use Unicode::UCD 'charinfo';
    use Data::Dumper;
    print Dumper(charinfo(0x263a));    # U+263a

If you view the Unicode character reference [4], you will notice that not every codepoint has an assigned character. Also, because of backward compatibility with legacy encodings, some characters have multiple codepoints.

UTF-8
UTF-8 [5] is a specific encoding of Unicode, and the most popular one. Other encodings include UTF-7, UTF-16, UTF-32, etc. You will probably want to use UTF-8, if you decide to use Unicode.

An encoding defines how each Unicode codepoint maps to bits and bytes. In UTF-8 encoding, the first 128 Unicode codepoints use one byte. These byte values are the same as US-ASCII [6], making UTF-8 encoding and ASCII encoding interchangeable if only ASCII characters are used. The next 1,920 codepoints use 2-byte encoding in UTF-8. Three or four bytes are needed to encode the remaining codepoints.

Note that although Unicode codepoints 128-255 are the same as ISO-8859-1, UTF-8 encodes each of these codepoints differently. UTF-8 uses two bytes to encode each of these codepoints, whereas ISO-8859-1 only uses one byte for each character in that range. Therefore, ISO-8859-1 and UTF-8 are not interchangeable. (If only ASCII characters are used, then they are all interchangeable, since ASCII, ISO-8859-1, and UTF-8 all share the same encoding for the first 128 Unicode codepoints.)

So, to reiterate, with UTF-8, not all characters are encoded into a single byte (unlike ASCII and ISO-8859-1). Think about that for a moment... how might that affect editors (like Notepad), web pages and forms, databases, Perl itself, Perl IO, your Perl source code (if you want to include a character with a multi-byte encoding)? How might that affect passing strings around, if the strings contain characters with multi-byte encodings? Do regular expressions still work?
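As a rough illustration of the byte counts described above, the following sketch encodes one code point from each length class and counts the resulting octets. It assumes the core Encode module is available; the sample code points and variable names are chosen here purely for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Count the octets UTF-8 needs for one sample code point per length class.
my %utf8_len;
for my $cp (65, 241, 9786, 128512) {    # U+0041, U+00F1, U+263A, U+1F600
    my $octets = encode('UTF-8', chr($cp));
    $utf8_len{$cp} = length $octets;
    printf "U+%04X -> %d octet(s) in UTF-8\n", $cp, $utf8_len{$cp};
}

# The same character (U+00F1) needs one octet in ISO-8859-1 but two in UTF-8.
my $latin1_len = length encode('ISO-8859-1', chr(241));
printf "U+00F1 -> %d octet(s) in ISO-8859-1\n", $latin1_len;
```

The two encodings agree only for the first 128 code points, which is why ASCII-only test data tends to hide encoding mistakes.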

Character Encoding Comparison

Character Encoding         US-ASCII   ISO-8859-1   UTF-8
# characters               128        256          > 100,000
128 US-ASCII characters    1 byte     1 byte       1 byte
Next 128 characters        N/A        1 byte       2 bytes
Remaining characters       N/A        N/A          2 - 4 bytes

As you can see from the table above, codepoints 128-255 (0x80-0xff) are where you need to be careful. Later, you will find out that codepoints 128-159 (0x80-0x9F) are even trickier, due to the fact that the popular Windows-1252 character set (another one-byte-per-character encoding) is incompatible with ISO-8859-1 in this range.

Do I need UTF-8?
If your web application will only ever need to use one language (or one character encoding), then you may not need UTF-8. However, if maybe someday your application will need to handle multiple languages, it is better to start using UTF-8 now, rather than later.

How much does UTF-8 "cost"?
• some functions are slower [7] with UTF-8 encoded strings in Perl
• you have to write some additional Perl code to ensure that data coming into Perl is decoded properly, and that data going out of Perl is encoded properly; but you have to do this anytime you use a character set other than the native 8-bit character set of the platform (which we'll now refer to as N8CS[8]), which is often ISO-8859-1/Latin-1
• you have to interact with your database appropriately: is it using UTF-8?
• you have to ensure your web pages specify that pages are encoded in UTF-8
• you may need to make a web server adjustment (if it is configured to always serve some particular character set, which is not UTF-8)

How do I use UTF-8?
The "best practice" approach is to use UTF-8 everywhere, if possible. This includes web pages and hence web forms, HTML templates, database, files, and strings stored internally in Perl. Reason? It is easier to use UTF-8 everywhere, rather than in just a few places (more about this below). And it is more difficult to convert all of the pieces later.

One exception might be your Perl source code itself. If N8CS is sufficient (i.e., if you don't need any UTF-8 characters or strings in your source code), your source code does not have to be encoded as UTF-8. (Okay, another exception might be your HTML templates. If your templates only require/contain N8CS, they do not have to be encoded as UTF-8 either.)

To properly use UTF-8 in a Perl web application, here is a summary of what must be done:
• All text (non-binary) data/octets coming into Perl (hence form data, database data, file reading, HTML templates, etc.) must be properly decoded. If the incoming text/octets are UTF-8 encoded, they must be UTF-8 decoded. If they are N8CS (usually ISO-8859-1) encoded, they should be N8CS decoded. (If they are encoded with some other character set, they must be decoded with that character set.)
• All text data going out of Perl (hence to the browser, files, databases, etc.) must be properly encoded (into an octet stream). STDOUT (which goes to the browser) must be UTF-8 encoded.
• the browser needs to be told that web pages are UTF-8 encoded via an HTTP header and a <meta> tag

Do not use Perl versions prior to 5.8.1. Although support for UTF-8 began with v5.6.0, regular expressions do not work even in the next release, v5.6.1. v5.8.1 added some speed improvements [7]. (By the way, PHP will not have UTF-8 support until v6.0.)
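The summary above (decode on the way in, encode on the way out) can be sketched as a minimal round trip. This assumes the core Encode module; the input octets are a hypothetical stand-in for form, file, or database data, and the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the UTF-8 octets for U+00F1 (0xC3 0xB1), as they
# might arrive from a form, a file, or a database.
my $octets_in = "\xC3\xB1";

# 1. Decode: octets -> Perl character string.
my $text = decode('UTF-8', $octets_in);

# 2. Process: string functions now see one character, not two octets.
my $char_count = length $text;
my $upper      = uc $text;    # U+00D1

# 3. Encode: character string -> octets for output.
my $octets_out = encode('UTF-8', $upper);

printf "%d character in, %d octets out\n", $char_count, length $octets_out;
```

Skipping step 1 would make length report 2 and leave uc with nothing sensible to do, which is the kind of silent failure the rest of this page tries to prevent.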

Before we start getting into the finer details about how to use UTF-8, we need to first define some terms, and then talk a bit about Perl's dual personality when it comes to internally storing text.

Terminology
A character is a logical entity. Characters must be encoded (using a character set) in order to be used, written, stored, exchanged between programs, etc. Encoding turns a logical character into something we can use in a program. Depending on which character set is used for encoding, a single character may require one or more bytes to represent it.

An octet is a byte, 8 bits. Encoded characters make up an octet stream. We'll use the term octets when referring to data passing into or out of a Perl program.

When an octet stream comes into Perl, the bytes should be decoded (using the correct character set, i.e. the character set they were encoded with) so that Perl can determine which logical characters are contained in the encoded octet stream. Perl can then store these as strings (a sequence of characters).

Binary data also comes in as an octet stream. It should not be decoded using a character set, because it likely either doesn't contain any characters, and hence cannot be decoded with a character set, or it contains information in addition to characters.

Perl strings/text
Internally, Perl stores each string in one of the following encodings:
• native encoding: byte encoding. It uses N8CS[8], the native 8-bit character set of the platform (often ISO-8859-1/Latin-1). This is a one-byte-per-character encoding, and hence a maximum of only 256 characters can be encoded. Strings using this encoding are called byte strings or binary strings. This is the default encoding for all incoming text/octets if Perl is not instructed to decode (bad idea).
• UTF-8 encoding: character encoding. It uses (obviously) UTF-8. Strings using this encoding are called character strings or text strings or Unicode strings.

Perl uses a "UTF8 flag" to keep track of which encoding a string is internally using.

Perl uses N8CS when possible (for backwards compatibility and efficiency reasons). However, if a character can not be represented in N8CS, UTF-8 is used. When creating your own strings, if all code points in a string are <= 0xFF, N8CS is used, otherwise UTF-8 is used.

    $native_string = "\xf1";
    $native_string = chr(0xf1);     # still N8CS, since <= 0xff
    $native_string = "\x{00f1}";    # still N8CS, since <= 0xff
    $utf8_string   = "\x{0100}";

You can convert a N8CS string to a UTF-8 string using utf8::upgrade():

    $my_string = "\xf1";          # N8CS byte string (one byte is used internally to encode)
    utf8::upgrade($my_string);    # UTF-8 character string now (two bytes are used internally to encode)

Your program can have a mix of strings in both of Perl's internal formats. Thankfully, the format/flag follows the string. Perl keeps a string in N8CS as long as possible. However, when a N8CS/native string is used together with a UTF-8 string, the native byte string gets decoded with the native character set, and upgraded (encoded) to UTF-8. The resulting character string will have the UTF8 flag set. In other words, the native string is silently implicitly decoded using N8CS, and then it gets internally encoded into UTF-8.

    $my_string = 'a';            # N8CS byte string
    $my_string .= "\x{0100}";    # UTF-8 character string now

Normally, you should not need to know about how Perl is internally storing/encoding text. An exception to this rule is if you have a natively encoded string with bytes in the 0x80-0xFF range; in other words, a natively encoded string with non-ASCII characters. In this case, some string operations may not work as expected (see Perl 5 "Unicode Bug"). Normally, you should upgrade these N8CS byte strings to UTF-8 character strings using utf8::upgrade(). String operations will then work as expected, although sometimes slower.

UTF-8 Flow
Any Perl IO needs to correctly handle decoding and encoding of strings/text. The typical flow of UTF-8 text/octets in to and out of a Perl program is as follows:
1. Receive an external UTF-8 encoded text/octet stream and correctly decode it, i.e., tell Perl which character set the octets are encoded in (in this case, the encoding is UTF-8). Perl may check for malformed data (bad encoding) while decoding, depending on which decoding method you select. Perl stores the string internally as N8CS or UTF-8, depending on which decoding method you select, and what characters are found to be in the octet stream. (Normally, the string will be internally stored as UTF-8.)
2. Process the string as you normally would.
3. Encode the string into a UTF-8 encoded octet stream and output it.

1. Decoding Text Input
External input includes submitted HTML form data, HTML templates, text files, sockets, other programs, database data (e.g., from SELECT statements), etc. If any of these might contain UTF-8 encoded data/text, you must decode it.

Since there are multiple character encodings in use in the world, Perl can't correctly guess which character encoding was used to encode some particular incoming text/octets, nor can it know which character encoding you want to use for outgoing text/octets. An incoming stream of UTF-8 octets is not the same as, say, an incoming stream of Windows-1252 octets. For example, Unicode character U+201c (left double quotation mark) is encoded in one byte in Windows-1252 (0x93), but UTF-8 encodes it using three octets (0xE2 0x80 0x9C). If you want Perl to interpret your incoming text/octets correctly, you must tell Perl which character set was used to encode them, so they can be decoded properly.

If you don't decode, Perl assumes input text/octets are N8CS encoded, hence each octet is treated as a separate character. Clearly, this is not what you want if you have a multi-byte UTF-8 encoded octet stream/text coming in. Improper decoding can lead to double encoding, and this can be difficult to locate due to implicit decoding (discussed above).

UTF-8 decoding in Perl involves two steps:
1. Decoding the text according to UTF-8 format rules. This may generate decoding errors.
2. Encoding the text (this might be a no-op) and storing it internally as N8CS or UTF-8, depending on which decoding method you select. If it is stored as UTF-8, the UTF8 flag is set.

Using decode() always results in the string being internally stored as UTF-8, with the UTF8 flag set (despite what the documentation for Encode says). Using utf8::decode() may result in N8CS or UTF-8 internal encoding. If the incoming text only contains ASCII characters, N8CS is used, otherwise UTF-8 is used.

If you are certain that the incoming data/octets only contains N8CS (often this means ISO-8859-1) text, you do not need to explicitly decode it (because Perl's default internal encoding is N8CS, which is a one-byte-per-character encoding). However, "best practice" suggests that all incoming data/octets should be explicitly decoded. You can explicitly decode ISO-8859-1, ASCII, and a number of other character encodings.
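The U+201C example above can be checked with a short sketch. It assumes the core Encode module and its 'cp1252' alias for Windows-1252; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same character, LEFT DOUBLE QUOTATION MARK, as two different octet streams.
my $cp1252_octets = "\x93";            # Windows-1252: one octet
my $utf8_octets   = "\xE2\x80\x9C";    # UTF-8: three octets

# Decoding each stream with the character set it was encoded with
# yields the same logical character.
my $from_cp1252 = decode('cp1252', $cp1252_octets);
my $from_utf8   = decode('UTF-8',  $utf8_octets);

printf "cp1252 -> U+%04X, UTF-8 -> U+%04X\n",
    ord($from_cp1252), ord($from_utf8);

# Without decoding, Perl sees three separate one-octet characters.
my $undecoded_length = length $utf8_octets;
my $decoded_length   = length $from_utf8;
```

Decoding the Windows-1252 octet as UTF-8, or the reverse, would not give U+201C, which is why the input encoding has to be known rather than guessed.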

Another important point to make here: you need to know which encoding was used for each input text. Do not guess, do not assume.

Input - Files, File Handles
Perl can automatically decode data as it comes into Perl using PerlIO layers:

    open my $in_fh, "<:encoding(utf8)", $filename
        or die "Cannot open $filename: $!";    # auto UTF-8 decoding on read

If you already have an open filehandle:

    binmode $in2_fh, ':encoding(utf8)';

Do not use :utf8 since it does not check that your incoming text is valid UTF-8, it simply marks it as UTF-8 (see Perlmonks [9]).

Input - HTML Templates
If you are using a CGI framework or template engine to pull in UTF-8 encoded HTML template files, you may need to inform it about the UTF-8 encoding, so that it can "UTF-8 decode" the template files as they are read in. Basically, the framework or template engine needs to do what we talked about in the previous section, automatically.

For Template::Toolkit [10], if you use an appropriate Byte Order Mark (BOM) [11] in your template files to indicate the encoding, the toolkit will decode them appropriately. If the templates do not use BOMs, use the ENCODING option:

    my $template = Template->new({ ENCODING => 'utf8' });

HTML::Template [12] currently does not support decoding of UTF-8 encoded HTML template files. This is a known limitation/bug [13]. There are a few workarounds:
• A patch [13] is available.
• You can use TMPL_VARs to insert UTF-8 content [14] into an N8CS (or even ASCII) encoded template file. UTF-8 decode your parameters/content before inserting them into an HTML template using TMPL_VARs, and implicit decoding should upgrade the resulting text (i.e., the template and the filled-in variables) to UTF-8 internally. For many applications, this is often sufficient.

Input - Web Forms
By default, CGI.pm [15] does not decode your form parameters. You can use the -utf8 pragma, which will treat (and decode) all parameters as UTF-8 strings, but this will fail if you have any binary file upload fields. A better solution involves overriding the param method:

    package CGI::as_utf8;
    BEGIN {
        use strict;
        use warnings;
        use CGI 3.47;    # earlier versions have a UTF-8 double-decoding bug
        {
            no warnings 'redefine';
            my $param_org = \&CGI::param;
            my $might_decode = sub {
                my $p = shift;
                # make sure upload() filehandles are not modified
                return $p if !$p || ( ref $p && fileno($p) );
                utf8::decode($p);    # may fail, but only logs an error
                $p
            };
            *CGI::param = sub {
                # setting a param goes through the original interface
                goto &$param_org if scalar @_ != 2;
                my ($q, $p) = @_;    # assume object calls always
                return wantarray
                    ? map { $might_decode->($_) } $q->$param_org($p)
                    : $might_decode->( $q->$param_org($p) );
            }
        }
    }
    1;

    use CGI::as_utf8;    # put this line in your app, e.g., in your CGI::Application module(s)

The above is rhesa's solution [16] with a slight modification: utf8::decode() is used instead of Encode's [17] decode_utf8(), as it is more efficient when only ASCII characters are involved (since the UTF8 flag is not set). Note that the module assumes that web pages and forms are always UTF-8 encoded, and that the OO interface of CGI.pm is always used.

When a web form is POSTed, browsers should encode form data in the same character encoding that was used to display the form. So, if you are sending UTF-8 forms, you should get UTF-8 encoded data back for text fields. You should not have to use accept-charset in your HTML markup.

Input - STDIN
When a web form is POSTed, form data comes into Perl via STDIN. If you are using CGI.pm, text form data is available via CGI.pm's param() method, and the previous section describes how to properly handle UTF-8 encoded text form data.

If you don't have any file uploads (i.e., all of your data is text), then instead of the CGI::as_utf8 module, you could add the following line of code to the beginning of your script to cause all data received on STDIN (i.e., all POSTed form data) to be automatically decoded as UTF-8:

    binmode STDIN, ":encoding(utf8)";

The approach in the previous section is preferred, since it will "do the right thing" if there is any binary form data (file uploads).

Do not use

    binmode STDIN, ":utf8";    # do NOT use this!

since it does not check that your incoming text is valid UTF-8, it simply marks it as UTF-8 (see Perl 5 Wiki [18]).

If you are writing some other (non-CGI) program that receives data on STDIN, decode appropriately:

    my $utf8_text    = decode('UTF-8', readline STDIN);
    my $iso8859_text = decode('ISO-8859-1', readline STDIN);
    my $binary_data  = read(...);    # don't decode

Note that decode() always sets Perl's internal UTF8 flag.

Input - Database
In the "use UTF-8 everywhere" model, configure your database to store values in UTF-8. When reading data from a UTF-8 database, ensure incoming UTF-8 encoded string field data is UTF-8 decoded, but do not decode incoming binary field data.

Input - MySQL
With MySQL, UTF-8 decoding (and encoding) of string field data is automatic if you use the mysql_enable_utf8 database handle attribute [19]:

    use DBI();
    my $dbh = DBI->connect('dbi:mysql:test_db', $username, $password,
        {mysql_enable_utf8 => 1} );

This means you should not call utf8::decode() (or any other UTF-8 decode function) on incoming string field data; the driver will do that for you. The driver is also smart enough to not decode binary data. If the incoming data for a field only contains ASCII octets, the UTF8 flag is not set for that field (so it appears to be using utf8::decode()).

UTF-8 was first available in MySQL v4.1. As of v5.0, it is the system default. Version 4.004 or higher of DBD::mysql is required.

Input - PostgreSQL
With PostgreSQL, UTF-8 decoding (and encoding) of string field data is automatic if you use the pg_enable_utf8 database handle attribute [20]:

    use DBI();
    my $dbh = DBI->connect('dbi:Pg:dbname=test_db', $username, $password,
        {pg_enable_utf8 => 1} );

This means you should not call utf8::decode() (or any other UTF-8 decode function) on incoming string field data; the DBD::Pg driver will do that for you. The driver is also smart enough to not decode binary data.

You may (TBD: when?) also need to tell PostgreSQL to use UTF-8 when sending data out of the database:

    SET CLIENT_ENCODING TO 'UTF8';

or

    SET NAMES 'UTF8';

For example, with Rose::DB:

    __PACKAGE__->register_db(
        domain => 'development',
        ...
        connect_options => {
            pg_server_prepare => 0,
            pg_enable_utf8    => 1,
        },
        post_connect_sql => "SET CLIENT_ENCODING TO 'UTF8';",
        ...
    );

See Automatic Character Set Conversion Between Server and Client [21].

2. Processing strings
Once all incoming strings have been decoded into UTF-8 internally, you can process your text as normal. Regular expressions will work (if using Perl v5.8 or higher). During decoding, any text/octets found to contain non-ASCII characters will be converted to UTF-8 internal encoding.

Note that with internal UTF-8 encoding, \w represents a much, much larger set of characters, so regex operations will be slower (vs. native encoding). TBD: what is the actual performance degradation? What is the character set for \w with Unicode semantics?

If you create any strings in your source code that contain non-ASCII characters (characters above 0x7f), ensure you upgrade them to internal UTF-8 encoding:

    my $text = "\xE0";                  # 0xE0 = à in ISO-8859-1
    utf8::upgrade($text);
    my $unicode_char = "\x{00f1}";      # U+00F1 = ñ
    utf8::upgrade($unicode_char);

Perl 5 "Unicode Bug"
Without a locale specified, if you have native/N8CS strings with characters in the 0x80-0xFF (128-255) range, then \d, \D, \s, \S, \w, \W (hence regular expressions), lc(), and uc(), etc. may not work as expected, since the non-ASCII part (0x80-0xFF) of the character set is ignored for those operations. Without a locale, Perl can't properly interpret characters in this range, since different encodings use different characters in this range, so it ignores them. This is called ASCII semantics. (This is another reason to try and use UTF-8 everywhere.)

There are two ways to avoid this "Unicode Bug". Both involve getting the natively encoded string to switch to UTF-8 encoding, because when the internal encoding is UTF-8, Unicode semantics are used, which always work as expected.

1. Follow "best practice" and always properly decode all external input text/octets. For example:

    use Encode;
    # suppose $windows1252_octets contains text from an external input,
    # and it contains the character "\xE0" (0xE0 = à).
    # String $windows1252_octets will exhibit the Unicode bug: it won't match /\w/
    my $utf8_string = decode('cp1252', $windows1252_octets);
    # no Unicode bug, $utf8_string matches /\w/

2. Use utf8::upgrade($native_string) to force $native_string to switch to UTF-8 internal encoding. (Even if the string only contains ASCII characters, it is still "upgraded" to UTF-8.)

    my $text = "\xE0";       # will exhibit Unicode bug, won't match /\w/
    utf8::upgrade($text);    # no Unicode bug, matches /\w/

See also Unicode::Semantics [22].

This issue should be fixed in Perl v5.12. 4/19/10 update: v5.12 is now available, and the "case changing component" has been fixed:

"Perl 5.12 now bundles Unicode 5.2. The "feature" pragma now supports the new "unicode_strings" feature:

    use feature "unicode_strings";

This will turn on Unicode semantics for all case changing operations on strings, regardless of how they are currently encoded internally." Read more [23].

3. Encoding and output
Output from a web program includes STDOUT (which is sent to your browser for a CGI program), stderr (which usually goes to the web server's error log), database writes, log file output, etc.

If outgoing text is not encoded, the text will be sent using the bytes in Perl's internal format, which could be a mixture of native/N8CS and UTF-8. This may work, but don't take a chance: "best practice" calls for explicitly encoding all output appropriately.

Perl will warn you if you print a string with a character that has an ordinal value greater than 255:

    $ perl -e 'print "\x{0100}\n"'
    Wide character in print at -e line 1.
    Ä

To avoid this warning, explicitly encode output (as described below).

Output - STDOUT
To ensure all output going back to the web browser (i.e., STDOUT) is UTF8-encoded, add the following near the top of your Perl script:

    binmode STDOUT, ":encoding(utf8)";

If you want to be a little more efficient (but not follow "best practice"), you can opt to only encode the outgoing page if it is flagged as UTF-8:

    if(utf8::is_utf8($page)) {
        utf8::encode($page);
    }
    # else, $page is natively encoded, so skip encoding for output

Here is a snippet [24] that can be used with the CGI::Application [25] framework:

    __PACKAGE__->add_callback('postrun', sub {
        my $self = shift;
        # Make sure the output is utf8 encoded if it needs it
        if($_[0] && ${$_[0]} && utf8::is_utf8(${$_[0]}) ){
            utf8::encode( ${$_[0]} );
            # ${$_[0]} .= 'utf8::encode() called';    # useful for debugging
        }
    });

The above code should be put into CGI::Application base class(es). Optionally, the code can be added to cgiapp_postrun().

Note that all of the above encoding techniques will only work properly if all of the input UTF-8 octets were properly decoded.

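To see what an encoding layer does to wide characters without involving a real browser, the following sketch writes through the same layer to an in-memory filehandle and inspects the octets. The in-memory handle is only a stand-in for STDOUT, and the variable names are illustrative.

```perl
use strict;
use warnings;

my $page = "\x{0100}\n";    # U+0100 has an ordinal value above 255

# An in-memory filehandle stands in for STDOUT so the octets can be inspected.
my $captured = '';
open my $out_fh, '>:encoding(utf8)', \$captured
    or die "Cannot open in-memory handle: $!";
print {$out_fh} $page;      # the layer encodes on write, so no warning is issued
close $out_fh or die "Cannot close in-memory handle: $!";

printf "%d octets written\n", length $captured;
```

With the layer in place the two-character string becomes three octets (0xC4 0x80 for U+0100, then the newline), and Perl has no reason to emit the "Wide character" warning.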
Output - Files, File Handles
If you need to write to files, Perl can automatically encode data as it is written using PerlIO layers:

    open my $out_fh, ">:utf8", $filename
        or die "Cannot open $filename: $!";    # auto UTF-8 encoding on write

If you already have an open filehandle:

    binmode $out2_fh, ':utf8';

Output - Database
As mentioned above, in the "use UTF-8 everywhere" model, configure your database to store values in UTF-8. When writing data to a UTF-8 database (INSERT, UPDATE, etc.), ensure your UTF-8 strings get UTF-8 encoded before being written to the database. Do not encode binary field data.

Output - MySQL
As mentioned above, UTF-8 encoding (and decoding) of string field data is automatic if you use the mysql_enable_utf8 database handle attribute [19]. This means you should not call utf8::encode() (or any other UTF-8 encode function) on your strings when using this attribute; the driver will do that for you. The driver is also smart enough to not encode binary data.

UTF-8 was first available in MySQL v4.1. As of v5.0, it is the system default. Version 4.004 or higher of DBD::mysql is required.

Output - PostgreSQL
As mentioned above, UTF-8 encoding (and decoding) of string field data is automatic if you use the pg_enable_utf8 database handle attribute [20]. This means you should not call utf8::encode() (or any other UTF-8 encode function) on your strings when using this attribute; the DBD::Pg driver will do that for you. The driver is also smart enough to not encode binary data.

You may (TBD: when?) also need to tell PostgreSQL to expect UTF-8 coming into the database:

    SET CLIENT_ENCODING TO 'UTF8';

or

    SET NAMES 'UTF8';

See Automatic Character Set Conversion Between Server and Client [21].

Tell the Browser to use UTF-8
To serve a UTF-8 encoded page to a browser, "best practice" is to specify the UTF-8 charset in an HTTP Content-Type header and inside the HTML file in a content-type <meta> tag.

CGI.pm defaults to sending the following Content-Type header:

    Content-Type: text/html; charset=ISO-8859-1

Add the following to cause UTF-8 to be used instead of ISO-8859-1, where $q is your CGI object:

    $q->charset('UTF-8');

If you are using the CGI::Application framework, put the above line in cgiapp_init().

If you are not using CGI.pm to generate your HTML markup, put the following meta tag as the first meta tag in the <head> section of your HTML markup:

    <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
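For a script that does not use CGI.pm at all, the header and the meta tag can be emitted by hand. The following is only a sketch under that assumption: the page content is made up for illustration, and a real application would normally let its framework build the response.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical page body containing one non-ASCII character (U+263A).
my $body = join "\n",
    '<html><head>',
    '<meta http-equiv="content-type" content="text/html; charset=UTF-8" />',
    '</head><body>',
    "\x{263A}",
    '</body></html>',
    '';

# The header announces UTF-8, and the body is explicitly encoded to match.
my $response = "Content-Type: text/html; charset=UTF-8\r\n\r\n"
             . encode('UTF-8', $body);

print $response;
```

The charset named in the header, the charset named in the meta tag, and the encoding actually applied to the body all have to agree; a mismatch between any two of them produces the symptoms described under Gotchas.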

Perl source code
If you only need to embed a few Unicode characters in a few strings in your source code, you do not need to save your source code/file in UTF-8. Instead, use \x{...} or chr() in your code, followed by utf8::upgrade():

    my $smiley = "\x{263a}";
    # or
    my $smiley = chr(0x263a);
    utf8::upgrade($smiley);    # convert to internal UTF-8 encoding

If you have a lot of Unicode characters, or you prefer to save your source code in UTF-8, then you need to tell Perl that your source code is UTF-8 encoded. Do this by adding the following line to your source code:

    use utf8;    # this script is in UTF-8

This is the only reason your program should ever have the above line (see utf8 [26]).

If your source code is UTF-8 encoded, make sure your editor supports reading, editing, and writing in UTF-8!

Gotchas
Often you may not notice Unicode issues until characters with codepoints above 127 are used. This is because ASCII, ISO-8859-1, Windows-1252 and UTF-8 are all encoded with the same one-byte values for the first 128 Unicode codepoints. To give your application a good Unicode test, try a character in the 0x80 - 0x9F (128-159) range, and a character above 0xFF (255).

Wide character in print at ...
Perl will warn you [27] if you print a string that has a character with an ordinal value greater than 255 (hence it is a "wide" character that requires more than one byte of storage):

    Wide character in print at ... line ...

Explicitly encode your output to avoid this warning.

Cannot decode string with wide characters at ...
If you receive this error, your code is probably trying to decode the same string a second time, which will fail.

Web Server Always Sends an ISO-8859-1 Header
If you followed the steps above, but your pages are not being displayed properly, it could be that your web server is configured to always send a particular character encoding in a header, such as ISO-8859-1. To determine if a Content-Type header is being sent by the web server:

    $ lwp-request -de www.bing.com | grep Content

Apache may be configured with the following:

    AddDefaultCharset ISO-8859-1

If you can, remove that line, or change it to

    AddDefaultCharset UTF-8

if all of the pages served by the server use UTF-8. See also When Apache and UTF-8 Fight [28].
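The double-decode failure described under "Cannot decode string with wide characters" can be reproduced in a few lines. This sketch assumes the core Encode module; the exact wording of the error message may differ between Encode versions, so the code only reports whatever message was raised.

```perl
use strict;
use warnings;
use Encode qw(decode);

# First decode: three UTF-8 octets become the single character U+263A.
my $text = decode('UTF-8', "\xE2\x98\xBA");

# Second decode of the already-decoded string: $text now holds a character
# above 0xFF, which cannot be treated as an octet, so decode() dies.
my $second = eval { decode('UTF-8', $text) };
my $error  = $@;

print "Second decode failed: $error" if $error;
```

A string whose characters are all below 0x100 would not trigger the error on a second decode, so the absence of this message does not prove that a string was decoded only once.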

ISO-8859-1 vs Windows-1252
Since you are learning about character encodings, you need to be aware of the difference between ISO-8859-1 and Windows-1252. From Windows-1252 [29]:

"The Windows-1252 encoding is a superset of ISO-8859-1, but differs from ISO-8859-1 by using displayable characters rather than control characters in the 0x80 to 0x9F (128 - 159) range. ... It also contains all the printable characters that are in ISO-8859-15 (though some are mapped to different code points). ... It is very common to mislabel Windows-1252 text data with the charset label ISO-8859-1. Many web browsers and e-mail clients treat the MIME charset ISO-8859-1 as Windows-1252 characters in order to accommodate such mislabeling. ... the draft HTML 5 specification requires that documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 encoding"

Here's a fun program to try:

    my @undefined_chars_in_windows_1252 = (0x81, 0x8d, 0x8f, 0x90, 0x9d);
    my %h = map { $_ => undef } @undefined_chars_in_windows_1252;
    foreach my $i (0x80 .. 0x9f) {
        next if exists $h{$i};
        printf "%02x:%c ", $i, $i;
    }

What do you see? Do you see the Windows-1252 characters, no characters, square boxes? If you are using PuTTY, select Change Settings, Window, Translation and try selecting ISO-8859-1 or Windows-1252 and run the program again.

Microsoft "Smart" Quotes
MS-Word (TBD: only older versions?) uses those nice left and right fancy/smart quotes. If you copy-paste those characters into a web form that was served with a Windows-1252 charset (or possibly even an ISO-8859-1 charset), the characters may be submitted to the web server using the nebulous 0x80-0x9F (128-159) range. (Recall that Unicode defines control characters in this range, not printable characters like smart quotes.) If your Perl script does not decode the submitted form properly (i.e., according to the same character encoding that the web form used), you will get gibberish.

Decode and encode correctly and you will not have any problems with Microsoft smart quotes or any of the other characters in the nebulous range. Better yet, if you serve all web pages as UTF-8, submitted forms should never contain these nebulous values, since the "paste" operation should automagically convert these characters to valid Unicode characters. Your Perl script will then only receive valid UTF-8 encoded characters.

Strange Characters in my Browser
The problem is likely an encode/decode problem somewhere in the chain.

Strange character: �
This is Unicode's "replacement character" (codepoint U+FFFD), which is used to indicate when a Unicode parser (such as a browser) was not able to decode a stream of Unicode encoded data. Firefox uses the black diamond with the question mark. IE displays the replacement character as the empty square box. If you save the web page and then open it in bvi, you may see EF BF BD. (U+FFFD encodes to EF BF BD in UTF-8.)

Usually, these replacement characters appear because the HTML data is Windows-1252 encoded, but the browser was instructed to use UTF-8 encoding. In your browser, select View->Character Encoding and see if it is set to UTF-8. If so, try selecting Windows-1252 or Western European (Windows) and see if that resolves the problem. If it does, then you know that the web server is serving up the wrong character encoding: there is a mismatch between what is being sent (i.e., how the data is encoded), and what character set the browser is being told to use (i.e., HTTP header and/or meta tag). If it doesn't resolve the problem, it might be that you don't have a Unicode font installed on your computer, or the Unicode font does not have a glyph for that particular character.

Strange characters: â€˜ â€™ â€œ â€ â€¢ â€“ â€”
These are the individual characters that correspond to the multi-byte UTF-8 encodings for the following Windows-1252 characters: ‘ ’ “ ” • – — which are in the nebulous 0x80-0x9F (128-159) range. Usually, these characters appear because the HTML data is UTF-8 encoded, but the browser was instructed to use ISO-8859-1 or Windows-1252. In your browser, try changing the encoding to UTF-8 and see if that resolves the problem. If that doesn't resolve the problem, or if the encoding is already set to UTF-8, there may be a double encoding problem somewhere.

Strange characters: Ã¢â‚¬Ëœ Ã¢â‚¬â„¢ Ã¢â‚¬Å“ Ã¢â‚¬Â Ã¢â‚¬Â¢ Ã¢â‚¬â€œ Ã¢â‚¬â€
These also correspond to some of the characters in the nebulous 0x80-0x9F (128-159) range. If you see the above sequences, it is likely that you forgot to decode incoming UTF-8 data (such as form data submitted from a UTF-8 encoded HTML form) in your Perl program and then you UTF-8 encoded it for output: a natively encoded string was UTF-8 encoded (not good). Fix the problem by calling utf8::decode() on the incoming UTF-8 encoded data.

Double Encoding
If you don't decode UTF-8 text/octets, Perl will assume they are encoded with N8CS (often ISO-8859-1/Latin-1). This means that the individual octets of a multi-byte UTF-8 character are seen as separate characters (not good). If these separate characters are later encoded to UTF-8 for output, a "double encoding" results. This is similar to HTML double encoding, e.g., &amp;gt; instead of &gt;.

I asked for UTF-8 but I Got Something Else!?
If you specifically asked for UTF-8 text, but the octet stream you receive is not valid UTF-8 encoding, in many cases you can probably assume that the incoming text/octets are ISO-8859-1/Latin-1 or Windows-1252. Decode with Windows-1252, since it is a superset of ISO-8859-1.

Strange Characters in my Editor
• ensure your editor supports reading, editing, and writing in UTF-8
• ensure you set your editor to use a Unicode font
• ensure you have a Unicode font installed

Install a Unicode Font on Windows
If you have one of the Microsoft products listed on this page [30], you should have the Arial Unicode MS font. If it is not installed, follow these steps to install it: Add/Remove Programs, select MS-Office, Add or Remove Features, click "Choose advanced", Office Shared Features, International Support, Universal Font. Apply the changes and restart your web browser.
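Returning to the double encoding problem described above, the following sketch reproduces it with the core utf8:: functions and then shows the fix. The helper hex_octets and the variable names are made up for this illustration. The input is the three UTF-8 octets of U+201C, standing in for a value received from a UTF-8 form.

```perl
use strict;
use warnings;

# Illustrative helper: format a byte string as space-separated hex octets.
sub hex_octets { return join ' ', map { sprintf '%02X', ord } split //, $_[0] }

# U+201C (left double quotation mark) as it arrives from a UTF-8 form.
my $incoming = "\xE2\x80\x9C";

# Mistake: encode for output without decoding first. Perl sees three
# separate native 8-bit characters, and each one becomes two octets.
my $double = $incoming;
utf8::encode($double);

# Fix: decode on the way in, encode on the way out.
my $text = $incoming;
utf8::decode($text) or die "input is not valid UTF-8";
my $output = $text;
utf8::encode($output);

print "double encoded: ", hex_octets($double), "\n";   # C3 A2 C2 80 C2 9C
print "decoded first:  ", hex_octets($output), "\n";   # E2 80 9C
print "characters after decoding: ", length($text), "\n";   # 1
```

The six double encoded octets are what a browser ends up displaying as one of the longer strange sequences.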

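The "decode with Windows-1252" fallback described above for input that is not valid UTF-8 could be sketched like this. The function name decode_incoming is hypothetical, and the heuristic (try UTF-8 first, otherwise assume Windows-1252) is only the assumption discussed above, not a guarantee about what the sender really used.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return a character string for incoming octets.
sub decode_incoming {
    my ($octets) = @_;
    my $text = $octets;                     # utf8::decode() works in place
    return $text if utf8::decode($text);    # the octets were valid UTF-8
    return decode('cp1252', $octets);       # otherwise assume Windows-1252
}

my $from_utf8_form   = decode_incoming("\xE2\x80\x9C");   # UTF-8 smart quote
my $from_cp1252_form = decode_incoming("\x93");           # Windows-1252 smart quote

printf "U+%04X U+%04X\n", ord($from_utf8_form), ord($from_cp1252_form);
# prints "U+201C U+201C"
```

Both inputs end up as the same Unicode character, so the rest of the program does not need to care which encoding the form used.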
Automatic Font Substitution
Most modern browsers and word processors perform font substitution [31], which means that if a character is not in the current font, the application will search through all of your fonts until it finds one containing that character and it will then display that character using the glyph in that font. IE6 is not considered a modern browser, and it does not perform font substitution. Sometimes IE7 and IE8 do not seem to perform font substitution correctly. One workaround is to specify a Unicode font as the first font in the CSS font-family property.

Misc
Create Unicode characters
On Windows, you can always use the Character Map application to select, copy, and (after switching to your application) paste a Unicode character. Ensure the "Character set" drop-down box is set to "Unicode". You can also use the application to view fonts, characters, and Unicode codepoint values for each character.

In Perl

    my $utf8_char = "\x{263a}";      # for codepoints above 0xFF
    $utf8_char =~ /\x{263a}/;        # same syntax for regex
    my $cloud_char = chr(0x2601);    # run-time, ord() does the reverse

If your Perl source code file is in UTF-8 format, you can enter the Unicode characters directly:

    use utf8;               # tells Perl this file is UTF-8 encoded
    my $utf8_char = "☺";    # U+263a, "White Smiling Face"

In Web Forms
On Windows:
• To insert a character from the Windows-1252 codepage [29]: set the Num Lock key on, hold down Alt, then using the numeric keypad, type 0 followed by the decimal value of the character you want.
• To insert a character from the current DOS code page (usually CP-437 [32]): follow the same steps as above, but without the initial 0.

But wait, we wanted to insert a Unicode character, not a Windows-1252 or CP-437 character! Well, Windows will convert those characters to Unicode/UTF-8 for us if the application expects UTF-8.

In a web form (textbox or textarea) type Alt-0147 to generate one of those pesky smart quotes from the Windows-1252 character set. If the web page's character encoding is set for UTF-8, Windows should translate the 147 character into the corresponding UTF-8 encoding. (Internally, Windows probably translates the 0147 to UTF-16, which is then translated into the character set in use by the application. In this scenario, the character set is Unicode, and Windows-1252 character 147 is translated to its Unicode codepoint equivalent, U+201C.) When the form is submitted, the character should be sent to the web server UTF-8 encoded as three octets: E2 80 9C (this is what U+201C looks like when encoded with UTF-8).

If the web page's character encoding is instead set to Windows-1252, the character should be sent as a single octet: 0x93 (which is 147 decimal).

If the web page's character encoding is instead set to ISO-8859-1, the character will also be sent as a single octet, but the value may be either 0x93 or 0x22 (0x22 is the ASCII and ISO-8859-1 quote character). If the browser uses the superset Windows-1252 encoding when ISO-8859-1 is specified, 0x93 is sent. Otherwise, the character will be translated to the only quote character officially defined in ISO-8859-1, 0x22.
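The octet values quoted above for the smart quote can be checked from Perl. This is a small sketch with the Encode module. The helper hex_octets is made up for the illustration, and the sketch only shows what the two encodings produce, not what any particular browser actually submits.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative helper: format a byte string as space-separated hex octets.
sub hex_octets { return join ' ', map { sprintf '%02X', ord } split //, $_[0] }

# The character produced by Alt-0147: U+201C, left double quotation mark.
my $left_quote = chr(0x201C);

my $as_utf8   = hex_octets(encode('UTF-8',  $left_quote));
my $as_cp1252 = hex_octets(encode('cp1252', $left_quote));

print "UTF-8 form:        $as_utf8\n";     # E2 80 9C
print "Windows-1252 form: $as_cp1252\n";   # 93
```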

Hopefully you see why it is imperative to know which encoding was used for the incoming form/text, so that it can be decoded properly (as UTF-8 or Windows-1252) in your Perl program. See also How do I enter ... [33] - Yahoo Answers.

UTF-8 vs utf8
As of Perl 5.8.7, UTF-8 is the strict, official UTF-8. The Encode module will complain if you try to encode or decode invalid UTF-8, e.g.,

    encode("UTF-8", "\x{FFFF_FFFF}", 1);    # croaks

In contrast, utf8 is the liberal, lax version, allowing just about any 4-byte values:

    encode("utf8", "\x{FFFF_FFFF}", 1);     # okay
    encode_utf8("\x{FFFF_FFFF}", 1);        # okay

Encode [34] as of version 2.10 knows the difference.

Encode Module vs Built-in/Core utf8::
To decode and encode UTF-8, you can use the Encode [17] module or the functions defined in the utf8:: [35] package by the Perl core. The Encode module is more flexible, allowing different ways of handling malformed data. However, the utf8:: package can do some different tricks, as the table below depicts.

You should be aware of a bug [36] in the Encode module: whenever text is decoded using the Encode module, the UTF8 flag is always turned on. The documentation would lead you to believe that the UTF8 flag is off if the text only contains ASCII characters and you are decoding UTF-8. This is not what happens: the flag is always turned on. There are performance gains to be had if the UTF8 flag can be kept off after decoding (and this is fine if the text only contains ASCII octets). Use utf8::decode() to obtain this efficiency, since it does not turn the flag on if the octet sequence only contains ASCII octets.

Below, see Encode's documentation [17] for CHECK options, which relate to how the module handles malformed UTF-8 data.

Functions
Function | UTF8 flag | Description / Notes
$flag = utf8::is_utf8($string); | N/A | Tests whether $string is internally encoded as UTF-8. Returns false if not, otherwise returns true.
$flag = utf8::decode($utf8_octets); | depends | Attempts to convert in-place the UTF-8 octet sequence into the corresponding N8CS or UTF-8 string, as appropriate. If $utf8_octets contains non-ASCII octets (i.e., multi-byte UTF-8 encoded characters), the UTF8 flag is turned on, and the resulting string is UTF-8. Otherwise, the UTF8 flag remains off, and the resulting string is N8CS. This is the only decode function that may result in an N8CS byte string. Returns false if $utf8_octets is not UTF-8 encoded properly, otherwise returns true. (This is the decode function I normally use.)
$utf8_string = decode('UTF-8', $utf8_octets [, CHECK]) | turned on | Decodes the UTF-8 octet sequence into a UTF-8 character string. Strict, official UTF-8 decoding rules (see previous section for discussion) are followed.
$utf8_string = decode('utf8', $utf8_octets [, CHECK]) | turned on | Decodes the UTF-8 octet sequence into a UTF-8 character string. Lax, liberal decoding rules (see previous section for discussion) are followed.
$utf8_string = decode_utf8($utf8_octets [, CHECK]) | turned on | Decodes the UTF-8 octet sequence into a UTF-8 character string. Equivalent to decode("utf8", $utf8_octets), hence lax decoding is employed.
$octet_count = utf8::upgrade($n8cs_string); | turned on | Converts in-place the N8CS byte string into the corresponding UTF-8 character string. Returns the number of octets now used to represent the string internally as UTF-8. This function should be used to convert N8CS byte strings with characters in the 0x80-0xFF range to UTF-8, thereby avoiding the Perl 5 "Unicode Bug".
$flag = utf8::downgrade($utf8_string [, FAIL_OK]); | turned off | Converts in-place the UTF-8 character string to the equivalent N8CS byte string. Fails if $utf8_string cannot be represented in N8CS encoding. On failure dies, unless FAIL_OK is true, in which case it returns false. Returns true on success.
utf8::encode($string) | turned off | Converts in-place the N8CS or UTF-8 $string into a UTF-8 octet sequence. Since all possible characters have a lax utf8 representation, this function cannot fail.
$utf8_octets = encode('UTF-8', $string [, CHECK]) | turned off | Encodes the N8CS or UTF-8 $string into a UTF-8 octet sequence. Strict, official UTF-8 encoding rules (see previous section for discussion) are followed.
$utf8_octets = encode('utf8', $string) | turned off | Encodes the N8CS or UTF-8 $string into a UTF-8 octet sequence. Lax, liberal UTF-8 encoding rules (see previous section for discussion) are followed. Since all possible characters have a lax utf8 representation, this function cannot fail.
$utf8_octets = encode_utf8($string) | turned off | Encodes the N8CS or UTF-8 $string into a UTF-8 octet sequence. Equivalent to encode("utf8", $string), hence lax encoding is employed.

Perl Character encodings
To determine which character encodings your Perl supports:

    perl -MEncode -le "print for Encode->encodings(':all')"

It is important to remember that Perl only uses two character encodings internally: native/byte and UTF-8/character. Any characters encoded with something other than N8CS, the platform's native 8-bit character set (often ISO-8859-1/Latin-1), must be decoded as they enter Perl.

What does Website "x" Use?
View a page, then in your browser, use View->Character Encoding to see which encoding was selected. Also look at the HTML source and see if the meta tag is present:

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

You can also see what Content-Type header is being returned using:

    $ lwp-request -de www.bing.com | grep Content

This wiki uses UTF-8.

HTML Character Entities
In your UTF-8 travels, you may come across HTML Character Entities. Starting with HTML 4.0, 252 character entities [37] are supported. Each of these has a Unicode codepoint and an entity name. Either can be used in HTML markup. For example, the registered sign can be represented in HTML as either &#174; or &reg;. Many fonts support this set of characters, and if the set is sufficient for your application, UTF-8 may not be required, but your application will need to use the HTML encoding wherever a special character is needed.
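If you go the entity route just described, the numeric form can be generated from the codepoint. Here is a minimal sketch. The function to_numeric_entities is a hypothetical helper written for this illustration, not part of any module, and it deliberately leaves markup characters such as & and < alone.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every non-ASCII character with its
# decimal numeric character reference, e.g. U+00AE becomes &#174;
sub to_numeric_entities {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord($1))/ge;
    return $text;
}

# Registered sign, e with acute accent, white smiling face.
my $notice = "Perl\x{AE} caf\x{E9} \x{263A}";
print to_numeric_entities($notice), "\n";
# prints "Perl&#174; caf&#233; &#9786;"
```

The output is plain ASCII, so it survives whatever charset the page is eventually served with.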

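Going back to the Functions table, this sketch shows how the UTF8 flag behaves for a few of the core utf8:: calls. The helper flag and the variable names are made up for the illustration. The Encode result for ASCII-only input is printed but left uncommented, since the behavior described in the bug report above may depend on your Encode version.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Illustrative helper: report the internal UTF8 flag of a string.
sub flag { return utf8::is_utf8($_[0]) ? 'on' : 'off' }

my $ascii_octets = "plain ASCII";
my $cafe_octets  = "caf\xC3\xA9";      # "cafe" with e-acute, as UTF-8 octets

# utf8::decode() converts in place. The flag is only turned on when
# the input contains multi-byte sequences.
my $core_ascii = $ascii_octets;
utf8::decode($core_ascii);
my $core_cafe = $cafe_octets;
utf8::decode($core_cafe);

# Encode::decode() returns a new string.
my $encode_ascii = decode('UTF-8', $ascii_octets);

# utf8::upgrade() and utf8::downgrade() switch the internal representation
# without changing which characters the string contains.
my $native      = "\xE9";              # one N8CS character
my $octet_count = utf8::upgrade($native);
my $upgraded    = flag($native);
utf8::downgrade($native);
my $downgraded  = flag($native);

print "utf8::decode, ASCII only: flag ", flag($core_ascii), "\n";   # off
print "utf8::decode, multi-byte: flag ", flag($core_cafe), "\n";    # on
print "Encode decode, ASCII only: flag ", flag($encode_ascii), "\n";
print "upgrade: $octet_count octets, flag $upgraded\n";             # 2 octets, on
print "downgrade: flag $downgraded\n";                              # off
```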
Operating Systems and Unicode
It is interesting to note which Unicode encoding popular Operating Systems use. From Wikipedia [38]:

"Windows NT (and its descendants, Windows 2000, Windows XP, Windows Vista and Windows 7), uses UTF-16 as the sole internal character encoding. The Java and .NET bytecode environments, Mac OS X, and KDE also use it for internal representation. UTF-8 has become the main storage encoding on most Unix-like operating systems (though others are also used by some libraries) because it is a relatively easy replacement for traditional extended ASCII character sets."

References
• The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) [39] - by Joel Spolsky
• FMTYEWTK about Characters vs Bytes [40] - Perlmonks
• CGI::Application and UTF-8 Form Processing example [41] - by Mark Rajcok
• Perl Unicode tutorial [42]
• Perl Unicode FAQ [43]
• Perl utf8 pragma [26]
• Perl Encode module [17] - handles all character encoding and decoding
• Unicode [1] - Wikipedia
• Perl Unicode introduction [44]
• Unicode support in Perl [45]
• Unicode::Semantics [22] - work around the Perl 5 Unicode bug
• there are many Unicode:xxx modules [46] on CPAN
• UTF-8 round trip with MySQL [47] - Perlmonks
• Understanding CGI.pm and UTF-8 handling [9] - Perlmonks
• CGI::Application - Which is the proper way of handling and outputting utf8 [48] - Perlmonks
• UTF-8 and Unicode FAQ for Unix/Linux [49]
• Perl Unicode Mailing List <perl-unicode@perl.org>

Footnotes
^ N8CS is a term that was coined for this document. Do not expect to see this term used elsewhere.

References
[1] http://en.wikipedia.org/wiki/Unicode
[2] http://en.wikipedia.org/wiki/ISO_8859
[3] http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Character_properties
[4] http://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF
[5] http://en.wikipedia.org/wiki/UTF-8
[6] http://en.wikipedia.org/wiki/ASCII
[7] http://search.cpan.org/perldoc?perlunicode#Speed
[8] http://en.wikibooks.org/wiki/Perl_programming%2Funicode_utf-8#endnote_N8CS
[9] http://www.perlmonks.org/?node_id=626470
[10] http://search.cpan.org/dist/Template-Toolkit/lib/Template/FAQ.pod#Why_do_I_get_rubbish_for_my_utf-8_templates?
[11] http://en.wikipedia.org/wiki/Byte_Order_Mark
[12] http://search.cpan.org/perldoc?HTML::Template
[13] https://rt.cpan.org/Public/Bug/Display.html?id=30586
[14] http://sourceforge.net/mailarchive/forum.php?thread_name=4607245C.8030702%40netratings.com.au&forum_name=html-template-users
[15] http://search.cpan.org/perldoc?CGI

org/ wiki/ Unicode#Operating_systems http:/ / joelonsoftware. html?id=34259 http:/ / www. postgresql. alanwood. perl. cpan. org/ ?node_id=330567 http:/ / cgi-app. org/ perldoc?perlunifaq http:/ / search. perlmonks. org/ perldoc?perlunicode http:/ / search. org/ perldoc?perlunitut http:/ / search. html http:/ / en. cpan. erlbaum. com/ cgiapp@lists. cpan. perlmonks. psu. html http:/ / cgi-app. html#AEN29751 http:/ / search. cpan. uk/ ~mgk25/ unicode. edu/ ejp10/ blogs/ gotunicode/ 2009/ 02/ when-apache-and-utf-8-fight. cl. wikipedia. cpan. html 18 . html#The-%22Unicode-Bug%22 http:/ / www. org/ perldoc?Encode#UTF-8_vs. _utf8_vs. microsoft. cpan. perlmonks. _UTF8 http:/ / perldoc. org/ ?node_id=651574 http:/ / search. org/ perldoc?Unicode::Semantics http:/ / perldoc. cpan. html https:/ / rt. cpan. cpan. wikipedia. net/ demos/ ent4_frame. org/ perldoc?DBD::mysql#DATABASE_HANDLES http:/ / search. html http:/ / perlmonks. org/ docs/ 8. perlfoundation. org/ perldoc?perluniintro http:/ / search. cpan. org/ perldoc?DBD::Pg#pg_enable_utf8_(boolean) http:/ / www. org/ utf8. org/ ?node_id=651403 http:/ / www. org/ perl5/ index. com/ question/ index?qid=20081226081225AA2NMGi http:/ / search. org/ Ticket/ Display. yahoo. perl. wikipedia. cgi?Utf8Example http:/ / search. cgi http:/ / search. org/ index. mail-archive. aspx?FMID=1081 http:/ / en. html http:/ / en.Perl Programming/Unicode UTF-8 [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43] [44] [45] [46] [47] [48] [49] http:/ / www. org/ index. org/ perldoc?utf8 http:/ / search. cgi?the_utf8_perlio_layer http:/ / search. cpan. ac. cam. cpan. 4/ interactive/ multibyte. wikipedia. org/ wiki/ Code_page_437 http:/ / answers. net/ msg08043. pl?node_id=620803 http:/ / www. org/ search?query=unicode http:/ / www. org/ perlunicode. org/ index. com/ typography/ fonts/ font. cpan. personal. com/ articles/ Unicode. 
org/ wiki/ Font_substitution http:/ / en. org/ wiki/ Windows-1252 http:/ / www. org/ perldoc?Encode http:/ / www. org/ perldoc?perldiag#Wide_character_in_%s http:/ / www.

Article Sources and Contributors
Perl Programming/Unicode UTF-8  Source: http://en.wikibooks.org/w/index.php?oldid=1957841  Contributors: Adrignola, Mrajcok, 5 anonymous edits

License
Creative Commons Attribution-Share Alike 3.0 Unported
http://creativecommons.org/licenses/by-sa/3.0/
