
Perl Programming/Unicode UTF-8

Overview
In the context of web application development, Unicode with UTF-8 encoding is the best way to support multiple languages in your web application. Multiple languages can even be supported on the same web page. Unicode (usually in UTF-8 form) is replacing ASCII and the use of 8-bit "code pages" such as ISO-8859-1 and Windows-1252.

Unicode
Unicode [1] is a standard that specifies all of the characters for most of the world's writing systems. Each character is assigned a unique codepoint, such as U+0030. The first 256 code points are the same as ISO-8859-1 [2] to make it trivial to convert existing Western/Latin-1 text. To view properties [3] for a particular codepoint:

    use Unicode::UCD 'charinfo';
    use Data::Dumper;
    print Dumper(charinfo(0x263a));   # U+263a

If you view the Unicode character reference [4], you will notice that not every codepoint has an assigned character. Also, because of backward compatibility with legacy encodings, some characters have multiple codepoints.

UTF-8
UTF-8 [5] is a specific encoding of Unicode — the most popular encoding. Other encodings include UTF-7, UTF-16, UTF-32, etc. You will probably want to use UTF-8, if you decide to use Unicode. An encoding defines how each Unicode codepoint maps to bits and bytes. In UTF-8 encoding, the first 128 Unicode codepoints use one byte. These byte values are the same as US-ASCII [6], making UTF-8 encoding and ASCII encoding interchangeable if only ASCII characters are used. The next 1,920 codepoints use 2-byte encoding in UTF-8. Three or four bytes are needed to encode the remaining codepoints. Note that although Unicode codepoints 128-255 are the same as ISO-8859-1, UTF-8 encodes each of these codepoints differently. UTF-8 uses two bytes to encode each of these codepoints, whereas ISO-8859-1 only uses one byte for each character in that range. Therefore, ISO-8859-1 and UTF-8 are not interchangeable. (If only ASCII characters are used, then they are all interchangeable, since ASCII, ISO-8859-1, and UTF-8 all share the same encoding for the first 128 Unicode codepoints.) So, to reiterate, with UTF-8, not all characters are encoded into a single byte (unlike ASCII and ISO-8859-1). Think about that for a moment... how might that affect editors (like Notepad), web pages and forms, databases, Perl itself, Perl IO, your Perl source code (if you want to include a character with a multi-byte encoding)? How might that affect passing strings around, if the strings contain characters with multi-byte encodings? Do regular expressions still work?
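The byte counts described above can be checked directly. The following is a minimal sketch, assuming the core Encode module is available; the sample codepoints and the variable names are chosen here only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample codepoints: ASCII, Latin-1 range, a 3-byte character, a 4-byte character.
my @codepoints = (0x41, 0xF1, 0x263A, 0x1F600);

my %utf8_len;      # codepoint => number of octets when UTF-8 encoded
my %latin1_len;    # codepoint => number of octets when ISO-8859-1 encoded

for my $cp (@codepoints) {
    my $char = chr($cp);
    $utf8_len{$cp}   = length encode('UTF-8', $char);
    # ISO-8859-1 can only represent codepoints up to 0xFF.
    $latin1_len{$cp} = length encode('ISO-8859-1', $char) if $cp <= 0xFF;
    printf "U+%04X: %d octet(s) in UTF-8\n", $cp, $utf8_len{$cp};
}
```

Note that U+00F1 takes one octet in ISO-8859-1 but two in UTF-8, which is why the two encodings are not interchangeable outside the ASCII range.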

Character Encoding Comparison

Character Encoding        US-ASCII   ISO-8859-1   UTF-8
# characters              128        256          > 100,000
128 US-ASCII characters   1 byte     1 byte       1 byte
Next 128 characters       N/A        1 byte       2 bytes
Remaining characters      N/A        N/A          2 - 4 bytes

As you can see from the table above, codepoints 128-255 (0x80-0xff) are where you need to be careful. Later, you will find out that codepoints 128-159 (0x80-0x9F) are even trickier, due to the fact that the popular Windows-1252 character set (another one-byte-per-character encoding) is incompatible with ISO-8859-1 in this range.

Do I need UTF-8?
If your web application will only ever need to use one language (or one character encoding), then you may not need UTF-8. However, if maybe someday your application will need to handle multiple languages, it is better to start using UTF-8 now, rather than later. Reason? It is easier to use UTF-8 everywhere, rather than in just a few places (more about this below). And it is more difficult to convert all of the pieces later.

How much does UTF-8 "cost"?
• some functions are slower [7] with UTF-8 encoded strings in Perl
• you have to write some additional Perl code to ensure that data coming into Perl is decoded properly, and that data going out of Perl is encoded properly. However, you have to do this anytime you use a character set other than the native 8-bit character set of the platform (which we'll now refer to as N8CS[8]), which is often ISO-8859-1/Latin-1
• you have to interact with your database appropriately: is it using UTF-8?
• you have to ensure your web pages specify that pages are encoded in UTF-8
• you may need to make a web server adjustment (if it is configured to always serve some particular character set, which is not UTF-8)

How do I use UTF-8?
The "best practice" approach is to use UTF-8 everywhere, if possible. This includes web pages and hence web forms, databases, HTML templates, files, and strings stored internally in Perl. One exception might be your Perl source code itself. If N8CS is sufficient (i.e., if you don't need any UTF-8 characters or strings in your source code), your source code does not have to be encoded as UTF-8. (Okay, another exception might be your HTML templates. If your templates only require/contain N8CS, they do not have to be encoded as UTF-8 either.)

To properly use UTF-8 in a Perl web application, here is a summary of what must be done:
• All text (non-binary) data/octets coming into Perl (hence form data, database data, file reading, HTML templates, etc.) must be properly decoded. If the incoming text/octets are UTF-8 encoded, they must be UTF-8 decoded. If they are N8CS (usually ISO-8859-1) encoded, they should be N8CS decoded. If they are encoded with some other character set, they must be decoded with that character set.
• All text data going out of Perl (hence to the browser, database, files, etc.) must be properly encoded (into an octet stream). STDOUT (which goes to the browser) must be UTF-8 encoded.
• the browser needs to be told that web pages are UTF-8 encoded via an HTTP header and a <meta> tag

Do not use Perl versions prior to 5.8.1. Although support for UTF-8 began with v5.6.0, regular expressions do not work even in the next release, v5.6.1. v5.8.1 added some speed improvements [7]. (By the way, PHP will not have UTF-8 support until v6.0.)

Before we start getting into the finer details about how to use UTF-8, we need to first define some terms, and then talk a bit about Perl's dual personality when it comes to internally storing text.

Terminology
A character is a logical entity. Characters must be encoded (using a character set) in order to be used, written, stored, exchanged between programs, etc. Encoding turns a logical character into something we can use in a program. Depending on which character set is used for encoding, a single character may require one or more bytes to represent it.

An octet is a byte, 8 bits. We'll use the term octets when referring to data passing into or out of a Perl program. Encoded characters make up an octet stream. When an octet stream comes into Perl, the bytes should be decoded (using the correct character set, i.e., the character set they were encoded with) so that Perl can determine which logical characters are contained in the encoded octet stream. Perl can then store these as strings (a sequence of characters).

Binary data also comes in as an octet stream. It should not be decoded using a character set, because it likely either doesn't contain any characters, or it contains information in addition to characters, and hence cannot be decoded with a character set.

Perl strings/text
Internally, Perl stores each string in one of the following encodings:
• native encoding (byte encoding). It uses N8CS[8], the native 8-bit character set of the platform (often ISO-8859-1/Latin-1). This is a one-byte-per-character encoding, and hence a maximum of only 256 characters can be encoded. Strings using this encoding are called byte strings or binary strings. This is the default encoding for all incoming text/octets if Perl is not instructed to decode (bad idea).
• UTF-8 encoding (character encoding). It uses (obviously) UTF-8. Strings using this encoding are called character strings or text strings or Unicode strings.

Perl uses a "UTF8 flag" to keep track of which encoding a string is internally using.

When creating your own strings, Perl uses N8CS when possible (for backwards compatibility and efficiency reasons), otherwise UTF-8 is used. In other words, if all code points in a string are <= 0xFF, N8CS is used. However, if a character can not be represented in N8CS, UTF-8 is used.

    $native_string = "\xf1";
    $native_string = "\x{00f1}";   # still N8CS, since <= 0xff
    $native_string = chr(0xf1);    # still N8CS, since <= 0xff
    $utf8_string   = "\x{0100}";

You can convert a N8CS string to a UTF-8 string using utf8::upgrade():

    $my_string = "\xf1";           # N8CS byte string (one byte is used internally to encode)
    utf8::upgrade($my_string);     # UTF-8 character string now (two bytes are used internally to encode)

Your program can have a mix of strings in both of Perl's internal formats. Thankfully, the format/flag follows the string. Perl keeps a string in N8CS as long as possible. However, when a N8CS/native string is used together with a UTF-8 string, the native string is silently implicitly decoded using N8CS, and upgraded (encoded) to UTF-8. The resulting character string will have the UTF8 flag set. In other words, the native byte string gets decoded with the native character set, and then it gets internally encoded into UTF-8.
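The UTF8 flag can be observed with utf8::is_utf8(). The following is a small sketch for inspection and debugging only; the variable names are illustrative, and ordinary code should rarely need to look at the flag.

```perl
use strict;
use warnings;

my $native      = "\xf1";                          # one character, stored natively
my $flag_before = utf8::is_utf8($native) ? 1 : 0;

utf8::upgrade($native);                            # switch internal storage to UTF-8
my $flag_after  = utf8::is_utf8($native) ? 1 : 0;
my $chars_after = length $native;                  # still one logical character

my $mixed      = 'a' . "\x{0100}";                 # a wide character forces UTF-8 storage
my $flag_mixed = utf8::is_utf8($mixed) ? 1 : 0;

print "before=$flag_before after=$flag_after chars=$chars_after mixed=$flag_mixed\n";
```

Upgrading changes only the internal storage: the string still contains the same single character, so length() and comparisons behave the same.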

A native string is upgraded as soon as a wide character is appended to it:

    $my_string = 'a';              # N8CS byte string
    $my_string .= "\x{0100}";      # UTF-8 character string now

Normally, you should not need to know about how Perl is internally storing/encoding text. An exception to this rule is if you have a natively encoded string with bytes in the 0x80-0xFF range, in other words, a natively encoded string with non-ASCII characters. In this case, some string operations may not work as expected (see Perl 5 "Unicode Bug"). You should upgrade these N8CS byte strings to UTF-8 character strings using utf8::upgrade(). String operations will then work as expected, although sometimes slower.

UTF-8 Flow
Any Perl IO needs to correctly handle decoding and encoding of strings/text. The typical flow of UTF-8 text/octets in to and out of a Perl program is as follows:
1. Receive an external UTF-8 encoded text/octet stream and correctly decode it, i.e., tell Perl which character set the octets are encoded in (in this case, the encoding is UTF-8). Perl may check for malformed data (bad encoding) while decoding, depending on which decoding method you select. Perl stores the string internally as N8CS or UTF-8, depending on which decoding method you select, and what characters are found to be in the octet stream. (Normally, the string will be internally stored as UTF-8.)
2. Process the string as you normally would.
3. Encode the string into a UTF-8 encoded octet stream and output it.

Decoding Text Input
External input includes submitted HTML form data, HTML templates, text files, sockets, other programs, database data (e.g., from SELECT statements), etc. If any of these might contain UTF-8 encoded data/text, you must decode it.

Since there are multiple character encodings in use in the world, Perl can't correctly guess which character encoding was used to encode some particular incoming text/octets, nor can it know which character encoding you want to use for outgoing text/octets. An incoming stream of UTF-8 octets is not the same as, say, an incoming stream of Windows-1252 octets. For example, Unicode character U+201c (left double quotation mark) is encoded in one byte in Windows-1252 (0x93), but UTF-8 encodes it using three octets (0xE2 0x80 0x9C). If you want Perl to interpret your incoming text/octets correctly, you must tell Perl which character set was used to encode them, so they can be decoded properly.

If you don't decode, Perl assumes input text/octets are N8CS encoded, hence each octet is treated as a separate character. Clearly, this is not what you want if you have a multi-byte UTF-8 encoded octet stream/text coming in.

If you are certain that the incoming data/octets only contains N8CS (often this means ISO-8859-1) text, you do not need to explicitly decode it (because Perl's default internal encoding is N8CS, which is a one-byte-per-character encoding). However, "best practice" suggests that all incoming data/octets should be explicitly decoded: you can explicitly decode ISO-8859-1, ASCII, and a number of other character encodings.

UTF-8 decoding in Perl involves two steps:
1. Decoding the text according to UTF-8 format rules. This may generate decoding errors.
2. Encoding the text (this might be a no-op) and storing it internally as N8CS or UTF-8, depending on which decoding method you select. If the incoming text only contains ASCII characters, N8CS is used, otherwise UTF-8 is used. If it is stored as UTF-8, the UTF8 flag is set. Using utf8::decode() may result in N8CS or UTF-8 internal encoding. Using decode() always results in the string being internally stored as UTF-8, with the UTF8 flag set (despite what the documentation for Encode says).

Improper decoding can lead to double encoding, and this can be difficult to locate due to implicit decoding (discussed above).
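The U+201C example above can be reproduced with Encode's decode(). This is a minimal sketch; the variable names are illustrative, and the octets are written as literals here rather than read from real input.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\xE2\x80\x9C";                    # U+201C as three UTF-8 octets

my $undecoded_length = length $octets;          # each octet is seen as a character
my $text             = decode('UTF-8', $octets);
my $decoded_length   = length $text;            # one logical character
my $codepoint        = sprintf 'U+%04X', ord $text;

# The same character arriving as a single Windows-1252 octet.
my $from_cp1252 = decode('cp1252', "\x93");
my $same_char   = $text eq $from_cp1252 ? 1 : 0;

print "$undecoded_length octets, $decoded_length character, $codepoint\n";
```

Once both inputs are decoded with the character set they were actually encoded in, they compare equal, regardless of how many octets each needed on the wire.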

Another important point to make here: you need to know which encoding was used for each input text. Do not guess, do not assume.

Input - Files, File Handles
Perl can automatically decode data as it comes into Perl using PerlIO layers:

    open my $in_fh, "<:encoding(utf8)", $filename or die;   # auto UTF-8 decoding on read

If you already have an open filehandle:

    binmode $in2_fh, ':encoding(utf8)';

Do not use :utf8 since it does not check that your incoming text is valid UTF-8, it simply marks it as UTF-8 (see Perlmonks [9]).

Input - HTML Templates
If you are using a CGI framework or template engine to pull in UTF-8 encoded HTML template files, you may need to inform it about the UTF-8 encoding, so that it can "UTF-8 decode" the template files as they are read in. Basically, the framework or template engine needs to do what we talked about in the previous section, automatically.

For Template::Toolkit [10], if you use an appropriate Byte Order Mark (BOM) [11] in your template files to indicate the encoding, the toolkit will decode them appropriately. If the templates do not use BOMs, use the ENCODING option:

    my $template = Template->new({ ENCODING => 'utf8' });

HTML::Template [12] currently does not support decoding of UTF-8 encoded HTML template files. This is a known limitation/bug [13]. There are a few workarounds:
• A patch [13] is available.
• You can use TMPL_VARs to insert UTF-8 content [14] into an N8CS (or even ASCII) encoded template file. UTF-8 decode your parameters/content before inserting them into an HTML template using TMPL_VARs, and implicit decoding should upgrade the resulting text (i.e., the template and the filled-in variables) to UTF-8 internally. For many applications, this is often sufficient.

Input - Web Forms
By default, CGI.pm [15] does not decode your form parameters. You can use the -utf8 pragma, which will treat (and decode) all parameters as UTF-8 strings, but this will fail if you have any binary file upload fields. A better solution involves overriding the param method:

    package CGI::as_utf8;
    BEGIN {
        use strict;
        use warnings;
        use CGI 3.47;   # earlier versions have a UTF-8 double-decoding bug
        {
            no warnings 'redefine';
            my $param_org = \&CGI::param;
            my $might_decode = sub {
                my $p = shift;
                # make sure upload() filehandles are not modified
                return $p if !$p || ( ref $p && fileno($p) );
                utf8::decode($p);   # may fail, but only logs an error
                $p
            };
            *CGI::param = sub {
                # setting a param goes through the original interface
                goto &$param_org if scalar @_ != 2;
                my ($q, $p) = @_;   # assume object calls always
                return wantarray
                    ? map { $might_decode->($_) } $q->$param_org($p)
                    : $might_decode->( $q->$param_org($p) );
            }
        }
    }
    1;

    # put this line in your app, e.g., in your CGI::Application module(s)
    use CGI::as_utf8;

The above is rhesa's solution [16] with a slight modification: utf8::decode() is used instead of Encode's [17] decode_utf8(), as it is more efficient when only ASCII characters are involved (since the UTF8 flag is not set). Note that the module assumes that web pages and forms are always UTF-8 encoded, and that the OO interface of CGI.pm is always used.

Note, browsers should encode form data in the same character encoding that was used to display the form. So, if you are sending UTF-8 forms, you should get UTF-8 encoded data back for text fields. You should not have to use accept-charset in your HTML markup.

Input - STDIN
When a web form is POSTed, form data comes into Perl via STDIN. If you are using CGI.pm, text form data is available via CGI.pm's param() method, and the previous section describes how to properly handle UTF-8 encoded text form data.

If you don't have any file uploads (i.e., all of your data is text), then instead of the CGI::as_utf8 module, you could add the following line of code to the beginning of your script to cause all data received on STDIN (i.e., all POSTed form data) to be automatically decoded as UTF-8:

    binmode STDIN, ":encoding(utf8)";

The approach in the previous section is preferred, since it will "do the right thing" if there is any binary form data (file uploads).

Do not use

    binmode STDIN, ":utf8";   # do NOT use this!

since it does not check that your incoming text is valid UTF-8, it simply marks it as UTF-8 (see Perl 5 Wiki [18]).

If you are writing some other (non-CGI) program that receives data on STDIN, decode appropriately:

    use Encode qw(decode);
    my $utf8_text    = decode('UTF-8', readline STDIN);
    my $iso8859_text = decode('ISO-8859-1', readline STDIN);
    my $binary_data  = read(...);   # don't decode
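To see what a decoding layer does on read without needing real input, one can attach the layer to an in-memory filehandle. This is only a sketch: the in-memory handle is a hypothetical stand-in for STDIN or a file, and the sample octets are made up for illustration.

```perl
use strict;
use warnings;

my $octets = "caf\xC3\xA9\n";    # "café" plus newline, UTF-8 encoded (6 octets)

# Hypothetical stand-in for STDIN or a file: an in-memory filehandle
# opened with a decoding layer, so reads return character strings.
open my $in_fh, '<:encoding(UTF-8)', \$octets
    or die "Cannot open in-memory handle: $!";
my $line = <$in_fh>;
close $in_fh;
chomp $line;

my $char_count = length $line;             # characters, not octets
my $last_ord   = ord substr($line, -1);    # codepoint of the last character

printf "%d characters, last is U+%04X\n", $char_count, $last_ord;
```

The same layer string can be passed to binmode on an already open handle, as described above.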

Note that decode() always sets Perl's internal UTF8 flag.

Input - Database
In the "use UTF-8 everywhere" model, configure your database to store values in UTF-8. When reading data from a UTF-8 database, ensure incoming UTF-8 encoded string field data is UTF-8 decoded, but do not decode incoming binary field data.

Input - MySQL
With MySQL, UTF-8 decoding (and encoding) of string field data is automatic if you use the mysql_enable_utf8 database handle attribute [19]:

    use DBI();
    my $dbh = DBI->connect('dbi:mysql:test_db', $username, $password, {mysql_enable_utf8 => 1} );

This means you should not call utf8::decode() (or any other UTF-8 decode function) on incoming string field data: the driver will do that for you. The driver is also smart enough to not decode binary data.

Version 4.004 or higher of DBD::mysql is required. UTF-8 was first available in MySQL v4.1. As of v5.0, it is the system default.

Input - PostgreSQL
With PostgreSQL, UTF-8 decoding (and encoding) of string field data is automatic if you use the pg_enable_utf8 database handle attribute [20]:

    use DBI();
    my $dbh = DBI->connect('dbi:Pg:dbname=test_db', $username, $password, {pg_enable_utf8 => 1} );

This means you should not call utf8::decode() (or any other UTF-8 decode function) on incoming string field data: the DBD::Pg driver will do that for you. The driver is also smart enough to not decode binary data. If the incoming data for a field only contains ASCII octets, the UTF8 flag is not set for that field (so it appears to be using utf8::decode()).

You may (TBD: when?) also need to tell PostgreSQL to use UTF-8 when sending data out of the database:

    SET CLIENT_ENCODING TO 'UTF8';
    or
    SET NAMES 'UTF8';

For example, with Rose::DB:

    __PACKAGE__->register_db(
        domain => 'development',
        ...
        connect_options => {
            pg_server_prepare => 0,
            pg_enable_utf8    => 1,
        },
        post_connect_sql => "SET CLIENT_ENCODING TO 'UTF8';",
    );

See Automatic Character Set Conversion Between Server and Client [21]

2. Processing strings
Once all incoming strings have been decoded into UTF-8 internally, you can process your text as normal. Regular expressions will work (if using Perl v5.8 or higher).

If you create any strings in your source code that contain non-ASCII characters (characters above 0x7f), ensure you upgrade them to internal UTF-8 encoding:

    my $text = "\xE0";                # 0xE0 = à in ISO-8859-1
    utf8::upgrade($text);
    my $unicode_char = "\x{00f1}";    # U+00F1 = ñ
    utf8::upgrade($unicode_char);

Note that with internal UTF-8 encoding, \w represents a much, much larger set of characters, so regex operations will be slower (vs. native encoding). TBD: what is the actual performance degradation? What is the character set for \w with Unicode semantics?

Perl 5 "Unicode Bug"
Without a locale specified, if you have native/N8CS strings with characters in the 0x80-0xFF (128-255) range, then \d, \D, \s, \S, \w, \W (hence regular expressions), lc(), uc(), etc. may not work as expected, since the non-ASCII part (0x80-0xFF) of the character set is ignored for those operations. (Without a locale, Perl can't properly interpret characters in this range, since different encodings use different characters in this range, so it ignores them. This is called ASCII semantics.)

There are two ways to avoid this "Unicode Bug". Both involve getting the natively encoded string to switch to UTF-8 encoding, because when the internal encoding is UTF-8, Unicode semantics are used, which always work as expected.
1. Follow "best practice" and always properly decode all external input text/octets. During decoding, any text/octets found to contain non-ASCII characters will be converted to UTF-8 internal encoding. (This is another reason to try and use UTF-8 everywhere.)
2. Use utf8::upgrade($native_string) to force $native_string to switch to UTF-8 internal encoding. (Even if the string only contains ASCII characters, it is still "upgraded" to UTF-8.)

For example:

    use Encode;
    # suppose $windows1252_octets contains text from an external input, and it contains the character
    # "\xE0" (0xE0 = à). String $windows1252_octets will exhibit the Unicode bug: it won't match /\w/
    my $utf8_string = decode('cp1252', $windows1252_octets);   # no Unicode bug, $utf8_string matches /\w/

    my $text = "\xE0";        # will exhibit Unicode bug, won't match /\w/
    utf8::upgrade($text);     # no Unicode bug, matches /\w/

See also Unicode::Semantics [22].

This issue should be fixed in Perl v5.12. 4/19/10 update: v5.12 is now available, and the "case changing component" has been fixed: "Perl 5.12 now bundles Unicode 5.2. The "feature" pragma now supports the new "unicode_strings" feature: use feature "unicode_strings"; This will turn on Unicode semantics for all case changing operations on strings, regardless of how they are currently encoded internally." Read more [23].

3. Encoding and output
Output from a web program includes STDOUT (which is sent to your browser for a CGI program), stderr (which usually goes to the web server's error log), database writes, log file output, etc.

If outgoing text is not encoded, the text will be sent using the bytes in Perl's internal format, which could be a mixture of native/N8CS and UTF-8. This may work, but don't take a chance: "best practice" calls for explicitly encoding all output appropriately.

Perl will warn you if you print a string with a character that has an ordinal value greater than 255:

    $ perl -e 'print "\x{0100}\n"'
    Wide character in print at -e line 1.
    Ā

To avoid this warning, explicitly encode output (as described below).

Output - STDOUT
To ensure all output going back to the web browser (i.e., STDOUT) is UTF8-encoded, add the following near the top of your Perl script:

    binmode STDOUT, ":encoding(utf8)";

If you want to be a little more efficient (but not follow "best practice"), you can opt to only encode the outgoing page if it is flagged as UTF-8:

    if (utf8::is_utf8($page)) {
        utf8::encode($page);
    }
    # else, $page is natively encoded, so skip encoding for output

Here is a snippet [24] that can be used with the CGI::Application [25] framework:

    __PACKAGE__->add_callback('postrun', sub {
        my $self = shift;
        # Make sure the output is utf8 encoded if it needs it
        if ($_[0] && ${$_[0]} && utf8::is_utf8(${$_[0]}) ) {
            utf8::encode( ${$_[0]} );
            # ${$_[0]} .= 'utf8::encode() called';   # useful for debugging
        }
    });

The above code should be put into CGI::Application base class(es). Optionally, the code can be added to cgiapp_postrun().

Note that all of the above encoding techniques will only work properly if all of the input UTF-8 octets were properly decoded.
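The two ways of encoding output, explicitly with Encode or implicitly through an output layer, should produce the same octets. The sketch below checks this; it writes to an in-memory buffer as a hypothetical stand-in for STDOUT, and the sample page content is made up.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $page = "Hello \x{263a}\n";    # character string containing a wide character

# Option 1: encode explicitly; the result is a string of octets.
my $octets = encode('UTF-8', $page);

# Option 2: let an :encoding layer do the same work on write.
# The in-memory buffer is a hypothetical stand-in for STDOUT.
my $buffer = '';
open my $out_fh, '>:encoding(UTF-8)', \$buffer
    or die "Cannot open in-memory handle: $!";
print {$out_fh} $page;
close $out_fh;

my $same_bytes = $octets eq $buffer ? 1 : 0;
my $byte_count = length $octets;

print "same=$same_bytes bytes=$byte_count\n";
```

Because the handle has an encoding layer, printing the wide character through it does not trigger the "Wide character in print" warning.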

Output - Database
As mentioned above, in the "use UTF-8 everywhere" model, configure your database to store values in UTF-8. When writing data to a UTF-8 database (INSERT, UPDATE, etc.), ensure your UTF-8 strings get UTF-8 encoded before being written to the database. Do not encode binary field data.

Output - MySQL
As mentioned above, UTF-8 encoding (and decoding) of string field data is automatic if you use the mysql_enable_utf8 database handle attribute [19]. This means you should not call utf8::encode() (or any other UTF-8 encode function) on your strings when using this attribute: the driver will do that for you. The driver is also smart enough to not encode binary data.

Version 4.004 or higher of DBD::mysql is required. UTF-8 was first available in MySQL v4.1. As of v5.0, it is the system default.

Output - PostgreSQL
As mentioned above, UTF-8 encoding (and decoding) of string field data is automatic if you use the pg_enable_utf8 database handle attribute [20]. This means you should not call utf8::encode() (or any other UTF-8 encode function) on your strings when using this attribute: the DBD::Pg driver will do that for you. The driver is also smart enough to not encode binary data.

You may (TBD: when?) also need to tell PostgreSQL to expect UTF-8 coming into the database:

    SET CLIENT_ENCODING TO 'UTF8';
    or
    SET NAMES 'UTF8';

See Automatic Character Set Conversion Between Server and Client [21]

Output - Files, File Handles
If you need to write to files, Perl can automatically encode data as it is written using PerlIO layers:

    open my $out_fh, ">:utf8", $filename or die;   # auto UTF-8 encoding on write

If you already have an open filehandle:

    binmode $out2_fh, ':utf8';

Tell the Browser to use UTF-8
To serve a UTF-8 encoded page to a browser, "best practice" is to specify the UTF-8 charset in an HTTP Content-Type header and inside the HTML file in a content-type <meta> tag.

CGI.pm defaults to sending the following Content-Type header:

    Content-Type: text/html; charset=ISO-8859-1

Add the following to cause UTF-8 to be used instead of ISO-8859-1, where $q is your CGI object:

    $q->charset('UTF-8');

If you are using the CGI::Application framework, put the above line in cgiapp_init().

If you are not using CGI.pm to generate your HTML markup, put the following meta tag as the first meta tag in the <head> section of your HTML markup:

    <meta http-equiv="content-type" content="text/html; charset=UTF-8" />

Perl source code
If you only need to embed a few Unicode characters in a few strings in your source code, you do not need to save your source code/file in UTF-8. Instead, use \x{...} or chr() in your code, followed by utf8::upgrade():

    my $smiley = "\x{263a}";
    or
    my $smiley = chr(0x263a);

    utf8::upgrade($smiley);   # convert to internal UTF-8 encoding

If you have a lot of Unicode characters, or you prefer to save your source code in UTF-8, then you need to tell Perl that your source code is UTF-8 encoded. Do this by adding the following line to your source code:

    use utf8;   # this script is in UTF-8

This is the only reason your program should ever have the above line (see utf8 [26]). If your source code is UTF-8 encoded, make sure your editor supports reading, editing, and writing in UTF-8!

Gotchas
Often you may not notice Unicode issues until characters with codepoints of 128 or above are used. This is because ASCII, ISO-8859-1, Windows-1252 and UTF-8 are all encoded with the same one-byte values for the first 128 Unicode codepoints. To give your application a good Unicode test, try a character in the 0x80 - 0x9F (128-159) range, and a character above 0xFF (255).

Wide character in print at ...
Perl will warn you [27] if you print a string that has a character with an ordinal value greater than 255 (hence it is a "wide" character that requires more than one byte of storage):

    Wide character in print at ..., line ...

Explicitly encode your output to avoid this warning.

Cannot decode string with wide characters at ...
If you receive this error, your code is probably trying to decode the same string a second time, which will fail.

Web Server Always Sends an ISO-8859-1 Header
If you followed the steps above, but your pages are not being displayed properly, it could be that your web server is configured to always send a particular character encoding in a header, such as ISO-8859-1. To determine if a Content-Type header is being sent by the web server:

    $ lwp-request -de www.bing.com | grep Content

Apache may be configured with the following:

    AddDefaultCharset ISO-8859-1

If you can, remove that line, or change it to AddDefaultCharset UTF-8 if all of the pages served by the server use UTF-8. See also When Apache and UTF-8 Fight [28].

The problem is likely an encode/decode problem somewhere in the chain.

ISO-8859-1 vs Windows-1252
Since you are learning about character encodings, you need to be aware of the difference between ISO-8859-1 and Windows-1252. From Windows-1252 [29]: "The Windows-1252 encoding is a superset of ISO-8859-1, but differs from ISO-8859-1 by using displayable characters rather than control characters in the 0x80 to 0x9F (128 - 159) range. It ... also contains all the printable characters that are in ISO-8859-15 (though some are mapped to different code points). ... It is very common to mislabel Windows-1252 text data with the charset label ISO-8859-1. Many web browsers and e-mail clients treat the MIME charset ISO-8859-1 as Windows-1252 characters in order to accommodate such mislabeling. ... the draft HTML 5 specification requires that documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 encoding", since it is a superset of ISO-8859-1.

Here's a fun program to try:

    # The five byte values that Windows-1252 leaves undefined
    my @undefined_chars_in_windows_1252 = (0x81, 0x8d, 0x8f, 0x90, 0x9d);
    my %h = map { $_ => undef } @undefined_chars_in_windows_1252;
    # Print every defined value in the 0x80-0x9F range as hex:character
    foreach my $i (0x80 .. 0x9f) {
        next if exists $h{$i};
        printf "%02x:%c ", $i, $i;
    }
    print "\n";

What do you see? Do you see the Windows-1252 characters, no characters, square boxes? If you are using PuTTY, select Change Settings..., Window, Translation, try selecting ISO-8859-1 or Windows-1252, and run the program again.

Microsoft "Smart" Quotes
MS-Word (possibly only older versions) uses those nice left and right fancy/smart quotes. If you copy-paste those characters into a web form that was served with a Windows-1252 charset (or possibly even an ISO-8859-1 charset), the characters may be submitted to the web server using the nebulous 0x80-0x9F (128-159) range. (Recall that Unicode defines control characters in this range, not printable characters like smart quotes.) If your Perl script does not decode the submitted form properly (i.e., according to the same character encoding that the web form used), you will get gibberish. Decode and encode correctly and you will not have any problems with Microsoft smart quotes or any of the other characters in the nebulous range.

Better yet, if you serve all web pages as UTF-8, submitted forms should never contain these nebulous values, since the "paste" operation should automagically convert these characters to valid Unicode characters. Your Perl script will then only receive valid UTF-8 encoded characters.

Strange Characters in my Browser
Strange character: �

This is Unicode's "replacement character" (codepoint U+FFFD), which is used to indicate when a Unicode parser (such as a browser) was not able to decode a stream of Unicode encoded data. (IE displays the replacement character as the empty square box. Firefox uses the black diamond with the question mark.) If you save the web page and then open it in bvi, you may see EF BF BD. (U+FFFD encodes to EF BF BD in UTF-8.)

Usually, these replacement characters appear because the HTML data is Windows-1252 encoded, but the browser was instructed to use UTF-8 encoding. In your browser, select View->Character Encoding and see if it is set to UTF-8. If so, try selecting Windows-1252 or Western European (Windows) and see if that resolves the problem. If it does, then you know that the web server is serving up the wrong character encoding: there is a mismatch between what is being sent (i.e., how the data is encoded) and what character set the browser is being told to use (i.e., HTTP header and/or meta tag). If it doesn't resolve the problem, it might be that you don't have a Unicode font installed on your computer, or the Unicode font does not have a glyph for that particular character.

Strange characters: â€˜ â€™ â€œ â€ â€¢ â€“ â€”

These are the individual characters that correspond to the multi-byte UTF-8 encodings for the following Windows-1252 characters: ‘’“”•–— which are in the nebulous 0x80-0x9F (128-159) range. Usually, these characters appear because the HTML data is UTF-8 encoded, but the browser was instructed to use ISO-8859-1 or Windows-1252. In your browser, try changing the encoding to UTF-8 and see if that resolves the problem. If that doesn't resolve the problem, or if the encoding is already set to UTF-8, there may be a double encoding problem somewhere.

Strange characters: Ã¢â‚¬Ëœ Ã¢â‚¬â„¢ Ã¢â‚¬Å“ Ã¢â‚¬Â¢ Ã¢â‚¬â€œ

These also correspond to some of the characters in the nebulous 0x80-0x9F (128-159) range. If you see the above sequences, it is likely that you forgot to decode incoming UTF-8 data (such as form data submitted from a UTF-8 encoded HTML form) in your Perl program and then you UTF-8 encoded it for output: a natively encoded string was UTF-8 encoded (not good). Fix the problem by calling utf8::decode() on the incoming UTF-8 encoded data.

Strange Characters in my Editor
• ensure your editor supports reading, editing, and writing in UTF-8
• ensure you set your editor to use a Unicode font
• ensure you have a Unicode font installed

Install a Unicode Font on Windows
If you have one of the Microsoft products listed on this page [30], you should have the Arial Unicode MS font. If it is not installed, follow these steps to install it: Add/Remove Programs, select MS-Office, Add or Remove Features, click "Choose advanced", Office Shared Features, International Support, Universal Font. Apply the changes and restart your web browser.

Double Encoding
If you don't decode UTF-8 text/octets, Perl will assume they are encoded with N8CS (often ISO-8859-1/Latin-1). This means that the individual octets of a multi-byte UTF-8 character are seen as separate characters (not good). If these separate characters are later encoded to UTF-8 for output, a "double encoding" results. This is similar to HTML double encoding, e.g., &amp;gt; instead of &gt;.

I asked for UTF-8 but I Got Something Else!?
If you specifically asked for UTF-8 text, but the octet stream you receive is not valid UTF-8 encoding, in many cases you can probably assume that the incoming text/octets are ISO-8859-1/Latin-1 or Windows-1252. Decode with Windows-1252, since it is a superset of ISO-8859-1.
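The fallback and the double-encoding effect described above can be sketched with the core Encode module. This is only a minimal sketch under stated assumptions: the helper name decode_incoming() is hypothetical, cp1252 is the name Encode uses for Windows-1252, and treating every invalid UTF-8 stream as Windows-1252 is the guess suggested in the text, not a guarantee.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: try strict UTF-8 first, then fall back to Windows-1252.
sub decode_incoming {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed UTF-8;
    # LEAVE_SRC stops decode() from modifying $octets.
    my $text = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text if defined $text;
    return decode('cp1252', $octets);
}

my $from_utf8   = decode_incoming("\xE2\x80\x9C");   # valid UTF-8 for U+201C
my $from_cp1252 = decode_incoming("\x93");           # invalid UTF-8, a cp1252 left quote

# Double encoding: the octets are never decoded, so Perl treats each octet
# as one character and encodes each of them again.
my $undecoded = "\xE2\x80\x9C";
my $double    = encode('UTF-8', $undecoded);

printf "U+%04X U+%04X, double-encoded length: %d octets\n",
    ord($from_utf8), ord($from_cp1252), length($double);
```

Both inputs end up as the same character, U+201C, while the undecoded three-octet string grows to six octets when it is encoded again.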

Automatic Font Substitution
Most modern browsers and word processors perform font substitution [31], which means that if a character is not in the current font, the application will search through all of your fonts until it finds one containing that character and it will then display that character using the glyph in that font. IE6 is not considered a modern browser, and it does not perform font substitution. Sometimes IE7 and IE8 do not seem to perform font substitution correctly. One workaround is to specify a Unicode font as the first font in the CSS font-family property.

Misc

Create Unicode characters
On Windows, you can always use the Character Map application to select, copy, and (switch to your application then) paste a Unicode character. Ensure the "Character set" drop-down box is set to "Unicode". You can also use the application to view fonts, characters, and Unicode codepoint values for each character.

In Perl

    my $utf8_char = "\x{263a}";      # for codepoints above 0xFF
    $utf8_char =~ /\x{263a}/;        # same syntax for regex
    my $cloud_char = chr(0x2601);    # run-time, ord() does the reverse

If your Perl source code file is in UTF-8 format, you can enter the Unicode characters directly:

    use utf8;                        # tells Perl this file is UTF-8 encoded
    my $utf8_char = "☺";             # U+263a, "White Smiling Face"

In Web Forms
On Windows:
• To insert a character from the Windows-1252 codepage [29]: set the Num Lock key on, hold down Alt, then using the numeric keypad, type 0 followed by the decimal value of the character you want.
• To insert a character from the current DOS code page (usually CP-437 [32]): follow the same steps as above, but without the initial 0.

But wait, we wanted to insert a Unicode character, not a Windows-1252 or CP-437 character! Well, Windows will convert those characters to Unicode/UTF-8 for us if the application expects UTF-8. (Internally, Windows probably translates the 0147 to UTF-16, which is then translated into the character set in use by the application.)

In a web form (textbox or textarea) type Alt-0147 to generate one of those pesky smart quotes from the Windows-1252 character set. If the web page's character encoding is set for UTF-8, Windows should translate the 147 character into the corresponding UTF-8 encoding. In this scenario, the character set is Unicode, and Windows-1252 character 147 is translated to its Unicode codepoint equivalent, U+201C. When the form is submitted, the character should be sent to the web server UTF-8 encoded as three octets: E2 80 9C. This is what U+201C looks like when encoded with UTF-8.

If the web page's character encoding is instead set to Windows-1252, the character should be sent as a single octet: 0x93 (which is 147 decimal).

If the web page's character encoding is instead set to ISO-8859-1, the character will also be sent as a single octet, but the value may be either 0x93 or 0x22 (0x22 is the ASCII and ISO-8859-1 quote character). If the browser uses the superset Windows-1252 encoding when ISO-8859-1 is specified, 0x93 is sent. Otherwise, the character will be translated to the only quote character officially defined in ISO-8859-1, 0x22.
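The octet values quoted above can be reproduced from Perl with the core Encode module. This is a sketch under assumptions: hex_octets() is a hypothetical helper, cp1252 is Encode's name for Windows-1252, and the code only shows what a correctly behaving browser would send, not what any particular browser actually does.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render an octet string as space-separated hex.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
}

my $left_quote = "\x{201C}";    # the codepoint that Alt-0147 maps to

# What the form should submit under each page encoding.
my $sent_as_utf8   = hex_octets(encode('UTF-8',  $left_quote));
my $sent_as_cp1252 = hex_octets(encode('cp1252', $left_quote));

# Decoding with the encoding the form used recovers U+201C in both cases.
my $from_utf8   = decode('UTF-8',  "\xE2\x80\x9C");
my $from_cp1252 = decode('cp1252', "\x93");

# Decoding the UTF-8 octets with the wrong encoding yields three characters.
my $wrong = decode('iso-8859-1', "\xE2\x80\x9C");

print "UTF-8 page sends:        $sent_as_utf8\n";
print "Windows-1252 page sends: $sent_as_cp1252\n";
printf "wrong decode gives %d characters\n", length $wrong;
```

The same smart quote arrives as three octets or as one, depending on the page encoding, so the decoding step has to be told which one was used.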

Hopefully you see why it is imperative to know which encoding was used for the incoming form/text, so that it can be decoded properly (as UTF-8 or Windows-1252) in your Perl program.

See also How do I enter ... - Yahoo Answers [33].

UTF-8 vs utf8
As of Perl 5.8.7, UTF-8 is the strict, official UTF-8, and utf8 is the liberal, lax version. Encode [34] as of version 2.10 knows the difference. The strict form rejects values that are not valid Unicode codepoints:

    use Encode qw(encode encode_utf8);
    encode("UTF-8", "\x{FFFF_FFFF}", 1);    # croaks

In contrast, utf8 is liberal, allowing just about any 4-byte (32-bit) values:

    encode("utf8", "\x{FFFF_FFFF}", 1);     # okay
    encode_utf8("\x{FFFF_FFFF}");           # okay

The core utf8::encode() and utf8::decode() functions work with Perl's lax utf8, not the strict, official UTF-8.

Encode Module vs Built-in/Core utf8::
To decode and encode UTF-8, you can use the Encode [17] module or the functions defined in the utf8:: [35] package by the Perl core. The Encode module is more flexible, allowing different ways of handling malformed data. The Encode module will complain if you try to encode or decode invalid UTF-8. However, the utf8:: package can do some different tricks, as the table below depicts.

You should be aware of a bug [36] in the Encode module: whenever text is decoded using the Encode module, the UTF8 flag is always turned on. The documentation would lead you to believe that the UTF8 flag is off if the text only contains ASCII characters and you are decoding UTF-8. This is not what happens: the flag is always turned on. There are performance gains to be had if the UTF8 flag can be kept off after decoding (and this is fine if the text only contains ASCII octets). Use utf8::decode() to obtain this efficiency, since it does not turn the flag on if the octet sequence only contains ASCII octets.

Functions
Each entry lists the function, the state of the UTF8 flag on the result, and a description with notes.

Function: $flag = utf8::is_utf8($string)
UTF8 flag: N/A
Description / Notes: Tests whether $string is internally encoded as UTF-8. Returns false if not; otherwise returns true.

Function: $flag = utf8::decode($utf8_octets)
UTF8 flag: depends
Description / Notes: Attempts to convert in-place the UTF-8 octet sequence into the corresponding N8CS or UTF-8 string, as appropriate. If $utf8_octets contains non-ASCII octets (i.e., multi-byte UTF-8 encoded characters), the UTF8 flag is turned on, and the resulting string is UTF-8. Otherwise, the UTF8 flag remains off, and the resulting string is N8CS. Returns false if $utf8_octets is not UTF-8 encoded properly; otherwise returns true. This is the only decode function that may result in an N8CS byte string. (This is the decode function I normally use.)

Function: $utf8_string = decode('UTF-8', $utf8_octets [, CHECK])
UTF8 flag: turned on
Description / Notes: Decodes the UTF-8 octet sequence into a UTF-8 character string. Strict, official UTF-8 decoding rules (see previous section for discussion) are followed. See Encode's documentation [17] for CHECK options, which relate to how the module handles malformed UTF-8 data.

Function: $utf8_string = decode('utf8', $utf8_octets [, CHECK])
UTF8 flag: turned on
Description / Notes: Decodes the UTF-8 octet sequence into a UTF-8 character string. Lax, liberal decoding rules (see previous section for discussion) are followed.
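The decode entries above can be checked with a short script. This is a sketch, not a definitive test of every Perl or Encode version: the variable names are illustrative, the out-of-range codepoint 0x110000 is an assumed example of something the strict rules reject, and the flag state after Encode's decode() is printed, not asserted, because the text above describes it as a documentation mismatch.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# utf8::decode() works in place; the flag stays off for ASCII-only input.
my $ascii = 'plain ascii';
utf8::decode($ascii);
my $ascii_flag = utf8::is_utf8($ascii) ? 1 : 0;

# With multi-byte input the flag is turned on.
my $smiley = "\xE2\x98\xBA";            # UTF-8 octets for U+263A
utf8::decode($smiley);
my $smiley_flag = utf8::is_utf8($smiley) ? 1 : 0;

# A truncated sequence is not valid UTF-8, so utf8::decode() returns false.
my $truncated = "\xE2\x98";
my $truncated_ok = utf8::decode($truncated) ? 1 : 0;

# Encode's decode(): the text above reports that the flag is on here too.
my $ascii_octets = 'plain ascii';
my $via_encode   = decode('UTF-8', $ascii_octets);
my $encode_flag  = utf8::is_utf8($via_encode) ? 1 : 0;

# Strict vs lax: a value above U+10FFFF is accepted by utf8, rejected by UTF-8.
my $beyond     = chr(0x110000);
my $lax_octets = encode('utf8', $beyond);
my $strict_croaked = eval {
    my $copy = $beyond;                 # a CHECK value may modify its argument
    encode('UTF-8', $copy, Encode::FB_CROAK());
    1;
} ? 0 : 1;

print "utf8::decode flags: ascii=$ascii_flag smiley=$smiley_flag\n";
print "Encode::decode flag for ASCII input: $encode_flag\n";
print "strict encode croaked: $strict_croaked\n";
```

Keeping the flag off for ASCII-only data is the efficiency argument made above for preferring utf8::decode() in that case.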

Functions (continued)

Function: $utf8_string = decode_utf8($utf8_octets [, CHECK])
UTF8 flag: turned on
Description / Notes: Decodes the UTF-8 octet sequence into a UTF-8 character string. Equivalent to decode("utf8", $utf8_octets), hence lax decoding is employed.

Function: $octet_count = utf8::upgrade($n8cs_string)
UTF8 flag: turned on
Description / Notes: Converts in-place the N8CS byte string into the corresponding UTF-8 character string. Returns the number of octets now used to represent the string internally as UTF-8. This function should be used to convert N8CS byte strings with characters in the 0x80-0xFF range to UTF-8, thereby avoiding the Perl 5 "Unicode Bug".

Function: utf8::encode($string)
UTF8 flag: turned off
Description / Notes: Converts in-place the N8CS or UTF-8 $string into a UTF-8 octet sequence. Since all possible characters have a lax utf8 representation, this function cannot fail.

Function: $utf8_octets = encode('UTF-8', $string [, CHECK])
UTF8 flag: turned off
Description / Notes: Encodes the N8CS or UTF-8 $string into a UTF-8 octet sequence. Strict, official UTF-8 encoding rules (see previous section for discussion) are followed.

Function: $utf8_octets = encode('utf8', $string [, CHECK])
UTF8 flag: turned off
Description / Notes: Encodes the N8CS or UTF-8 $string into a UTF-8 octet sequence. Lax, liberal UTF-8 encoding rules (see previous section for discussion) are followed.

Function: $utf8_octets = encode_utf8($string)
UTF8 flag: turned off
Description / Notes: Encodes the N8CS or UTF-8 $string into a UTF-8 octet sequence. Equivalent to encode("utf8", $string), hence lax encoding is employed. Since all possible characters have a lax utf8 representation, this function cannot fail.

Function: $flag = utf8::downgrade($utf8_string [, FAIL_OK])
UTF8 flag: turned off
Description / Notes: Converts in-place the UTF-8 character string to the equivalent N8CS byte string. Fails if $utf8_string cannot be represented in N8CS encoding. On failure dies, unless FAIL_OK is true, then returns false. Returns true on success.

Perl Character encodings
To determine which character encodings your Perl supports:

    perl -MEncode -le "print for Encode->encodings(':all')"

It is important to remember that Perl only uses two character encodings internally: native/byte and UTF-8/character. Any characters encoded with something other than N8CS, the platform's native 8-bit character set (often ISO-8859-1/Latin-1), must be decoded as they enter Perl.

What does Website "x" Use?
View a page, then in your browser, select View->Character Encoding to see which encoding was selected. Also look at the HTML source and see if the meta tag is present:

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

You can also see what Content-Type header is being returned using:

    $ lwp-request -de www.bing.com | grep Content

This wiki uses UTF-8.

HTML Character Entities
In your UTF-8 travels, you may come across HTML Character Entities. Starting with HTML 4.0, 252 character entities [37] are supported. Each of these has a Unicode codepoint and an entity name. Either can be used in HTML markup. For example, the registered sign can be represented in HTML as either &#174; or &reg;. Many fonts support this set of characters, and if the set is sufficient for your application, UTF-8 may not be required, but your application will need to use the HTML encoding wherever a special character is needed.
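The numeric form of an entity is simply the decimal codepoint, so it can be generated with core Perl alone. A minimal sketch, assuming a hypothetical helper named to_numeric_entities(); it only produces numeric references such as &#174; and makes no attempt to emit named entities such as &reg;.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every non-ASCII character with a
# numeric character reference built from its decimal codepoint.
sub to_numeric_entities {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord($1))/ge;
    return $text;
}

my $registered = to_numeric_entities("Perl\x{AE}");   # registered sign, U+00AE
my $smiley     = to_numeric_entities("\x{263A}");     # White Smiling Face

print "$registered $smiley\n";
```

Because the output contains only ASCII characters, it displays the same way whatever charset the page is served with, which is the trade-off the paragraph above describes.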

Operating Systems and Unicode
It is interesting to note which Unicode encoding popular Operating Systems use. From Wikipedia [38]: "Windows NT (and its descendants, Windows 2000, Windows XP, Windows Vista and Windows 7), uses UTF-16 as the sole internal character encoding. The Java and .NET bytecode environments, Mac OS X, and KDE also use it for internal representation. UTF-8 has become the main storage encoding on most Unix-like operating systems (though others are also used by some libraries) because it is a relatively easy replacement for traditional extended ASCII character sets."

References
• The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) [39] - by Joel Spolsky
• FMTYEWTK about Characters vs Bytes [40] - Perlmonks
• CGI::Application and UTF-8 Form Processing example [41] - by Mark Rajcok
• Perl Unicode tutorial [42]
• Perl Unicode FAQ [43]
• Perl utf8 pragma [26]
• Perl Encode module [17] - handles all character encoding and decoding
• Unicode [1] - Wikipedia
• Perl Unicode introduction [44]
• Unicode support in Perl [45]
• Unicode::Semantics [22] - work around the Perl 5 Unicode bug
• there are many Unicode::xxx modules [46] on CPAN
• UTF-8 round trip with MySQL [47] - Perlmonks
• CGI::Application - Which is the proper way of handling and outputting utf8 [48] - Perlmonks
• Understanding CGI.pm and UTF-8 handling [9] - Perlmonks
• UTF-8 and Unicode FAQ for Unix/Linux [49]
• Perl Unicode Mailing List <perl-unicode@perl.org>

Footnotes
^ - N8CS is a term that was coined for this document. Do not expect to see this term used elsewhere.

References
[1] http://en.wikipedia.org/wiki/Unicode
[2] http://en.wikipedia.org/wiki/ISO_8859
[3] http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Character_properties
[4] http://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF
[5] http://en.wikipedia.org/wiki/UTF-8
[6] http://en.wikipedia.org/wiki/ASCII
[7] http://search.cpan.org/perldoc?perlunicode#Speed
[8] http://en.wikibooks.org/wiki/Perl_programming%2Funicode_utf-8#endnote_N8CS
[9] http://www.perlmonks.org/?node_id=626470
[10] http://search.cpan.org/dist/Template-Toolkit/lib/Template/FAQ.pod#Why_do_I_get_rubbish_for_my_utf-8_templates?
[11] http://en.wikipedia.org/wiki/Byte_Order_Mark
[12] http://search.cpan.org/perldoc?HTML::Template
[13] https://rt.cpan.org/Public/Bug/Display.html?id=30586
[14] http://sourceforge.net/mailarchive/forum.php?thread_name=4607245C.8030702%40netratings.com.au&forum_name=html-template-users
[15] http://search.cpan.org/perldoc?CGI

perlmonks. cpan. cpan. cl. cpan. org/ perldoc?Unicode::Semantics http:/ / perldoc. html http:/ / perlmonks. org/ perldoc?utf8 http:/ / search. cgi?Utf8Example http:/ / search. html https:/ / rt. html http:/ / cgi-app. cpan. cpan. perl. org/ wiki/ Windows-1252 http:/ / www. org/ docs/ 8. wikipedia. edu/ ejp10/ blogs/ gotunicode/ 2009/ 02/ when-apache-and-utf-8-fight. wikipedia. org/ perldoc?perlunicode http:/ / search. cpan. perlmonks. html#The-%22Unicode-Bug%22 http:/ / www. cpan. alanwood. yahoo. org/ ?node_id=651403 http:/ / www. uk/ ~mgk25/ unicode. html?id=34259 http:/ / www. erlbaum. net/ msg08043. wikipedia. cpan. cpan. org/ index. 4/ interactive/ multibyte. com/ articles/ Unicode. pl?node_id=620803 http:/ / www. mail-archive. org/ wiki/ Code_page_437 http:/ / answers. org/ perldoc?DBD::Pg#pg_enable_utf8_(boolean) http:/ / www. org/ perldoc?perlunifaq http:/ / search. org/ search?query=unicode http:/ / www. com/ typography/ fonts/ font. postgresql. psu. cpan. org/ perldoc?perldiag#Wide_character_in_%s http:/ / www. org/ ?node_id=330567 http:/ / cgi-app. org/ index. org/ Ticket/ Display. org/ perldoc?DBD::mysql#DATABASE_HANDLES http:/ / search. org/ index. org/ perldoc?perlunitut http:/ / search. ac. cgi?the_utf8_perlio_layer http:/ / search. cpan. com/ cgiapp@lists. html#AEN29751 http:/ / search. cpan. org/ utf8. org/ wiki/ Unicode#Operating_systems http:/ / joelonsoftware. html 18 . org/ perldoc?Encode#UTF-8_vs. perlfoundation. personal. org/ wiki/ Font_substitution http:/ / en. cgi http:/ / search.Perl Programming/Unicode UTF-8 [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43] [44] [45] [46] [47] [48] [49] http:/ / www. org/ ?node_id=651574 http:/ / search. _utf8_vs. microsoft. perl. perlmonks. org/ perlunicode. html http:/ / en. org/ perldoc?Encode http:/ / www. aspx?FMID=1081 http:/ / en. cpan. org/ perl5/ index. cam. 
com/ question/ index?qid=20081226081225AA2NMGi http:/ / search. wikipedia. net/ demos/ ent4_frame. org/ perldoc?perluniintro http:/ / search. _UTF8 http:/ / perldoc. html http:/ / en.

Article Sources and Contributors
Perl Programming/Unicode UTF-8  Source: http://en.wikibooks.org/w/index.php?oldid=1957841  Contributors: Adrignola, Mrajcok, 5 anonymous edits

License
Creative Commons Attribution-Share Alike 3.0 Unported
http://creativecommons.org/licenses/by-sa/3.0/