Punycode - Wikipedia

Punycode - Wikipedia https://en.wikipedia.org/wiki/Punycode
Punycode
Punycode is a representation of Unicode with the limited ASCII character subset used for Internet
hostnames. Using Punycode, host names containing Unicode characters are transcoded to a subset of ASCII
consisting of letters, digits, and hyphens, which is called the Letter-Digit-Hyphen (LDH) subset. For
example, München (German name for Munich) is encoded as Mnchen-3ya.
While the Domain Name System (DNS) technically supports arbitrary sequences of octets in domain name
labels, the DNS standards recommend the use of the LDH subset of ASCII conventionally used for host
names, and require that string comparisons between DNS domain names be case-insensitive. The
Punycode syntax is a method of encoding strings containing Unicode characters, such as internationalized
domain names (IDNs), into the LDH subset of ASCII favored by DNS. It is specified in IETF Request for
Comments 3492.[1]
Contents
Encoding procedure
Separation of ASCII characters
Encoding of non-ASCII character insertions as code numbers
Re-encoding of code numbers as ASCII sequences
Examples
Internationalized domain names
See also
References
External links
Encoding procedure
As stated in RFC 3492, "Punycode is an instance of a more general algorithm called Bootstring, which
allows strings composed from a small set of 'basic' code points to uniquely represent any string of code
points drawn from a larger set." Punycode defines parameters for the general Bootstring algorithm to match
the characteristics of Unicode text. This section demonstrates the procedure for Punycode encoding, using
the example of the string "bücher" (Bücher is German for books), which is translated into the label
"bcher-kva".
Separation of ASCII characters
First, all ASCII characters in the string are copied from input to output, skipping over any other characters.
For example, "bücher" is copied to "bcher". If any characters were copied, i.e. there was at least one ASCII
character in the input, an ASCII hyphen is added to the output next (e.g., "bücher" → "bcher-", but "ü" →
""). Since the ASCII hyphen is an ASCII character, the hyphen may itself appear in the output before this
additional hyphen. However, the additional hyphen does not cause any ambiguity when reading the output, as
no later part of the encoding process can introduce another ASCII hyphen; if there are one or more ASCII
hyphens in the output, the last one always signifies the end of the ASCII characters.
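The copy step above can be sketched in a few lines of Python. This is an illustrative sketch only; the function name copy_basic is hypothetical and not taken from RFC 3492.

```python
def copy_basic(label: str) -> str:
    """Copy the ASCII characters of `label` and append the delimiter hyphen.

    Hypothetical helper: it covers only the first step of the encoder
    and produces no code numbers.
    """
    # Keep only code points below 128 (the ASCII range), in input order.
    basic = "".join(ch for ch in label if ord(ch) < 128)
    # The delimiter is added only if at least one ASCII character was copied.
    return basic + "-" if basic else basic


print(copy_basic("bücher"))  # bcher-
print(copy_basic("ü") == "")  # True
```

Because nothing after this step emits another hyphen, a decoder can split the encoded label at its last hyphen to recover the copied ASCII part.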
Encoding of non-ASCII character insertions as code numbers
The next part of the encoding process first requires an understanding of the decoder, which is a finite-state
machine with two state variables i and n. i is an index into the string ranging from zero (representing a
potential insertion at the start) to the current length of the extended string (representing a potential insertion
at the end).
i starts at zero, and n starts at 128 (the first non-ASCII code point). The state progression is monotonic:
the pair (n, i) only ever moves forward. A state change either increments i or, if i is at its maximum,
resets i to zero and increments n by 1, then goes back to incrementing i in the following state change.
At each state change, either the code point denoted by n is inserted or it is not inserted.
The code numbers generated by the encoder represent how many possibilities to skip before an insertion is
made. There are six possible places to insert a character in the current string "bcher" (including before the
first character and after the last one). There are 124 code points between the last one considered (127 = 0x7F,
the end of ASCII) and "ü" (code point 252 = 0xFC, see Unicode's Latin-1 Supplement). Also there is one
position to insert a "ü" that needs to be skipped (at position zero before the 'b'). That is why it is necessary to
tell the decoder to skip a total of (6 × 124) + 1 = 745 possible insertions before getting to the one required.
Once the character is inserted there are now seven possible places to insert another character.
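The arithmetic above can be reproduced with a short Python sketch. It handles only the first insertion (the smallest non-ASCII code point in the label), and the function name first_insertion_delta is hypothetical.

```python
def first_insertion_delta(label: str) -> int:
    """Number of decoder states to skip before the first insertion.

    Hypothetical helper covering only the first non-ASCII character;
    it raises ValueError if the label contains no such character.
    """
    basic = [ch for ch in label if ord(ch) < 128]
    non_basic = [ch for ch in label if ord(ch) >= 128]
    target = min(non_basic)      # the lowest code point is inserted first
    index = label.index(target)  # its leftmost occurrence in the final string
    # Only ASCII characters are present in the string at this point, so the
    # insertion position is the number of ASCII characters before `index`.
    position = sum(1 for ch in label[:index] if ord(ch) < 128)
    slots = len(basic) + 1       # possible insertion points, both ends included
    return (ord(target) - 128) * slots + position


print(first_insertion_delta("bücher"))  # (252 - 128) * 6 + 1 = 745
```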
Re-encoding of code numbers as ASCII sequences
Punycode uses generalized variable-length integers to represent these values. For example, this is how "kva"
is used to represent the code number 745:
A number system with little-endian ordering is used which allows variable-length codes without
separate delimiters: a digit lower than a threshold value marks that it is the most-significant
digit, hence the end of the number. The threshold value depends on the position in the number
and also on previous insertions, to increase efficiency. Correspondingly the weights of the digits
vary.
In this case a number system with 36 symbols is used, with the case-insensitive 'a' through 'z'
equal to the decimal numbers 0 through 25, and '0' through '9' equal to the decimal numbers 26
through 35. Thus "kva" corresponds to the decimal digit sequence "10 21 0".
To decode this string of symbols, a sequence of thresholds will be needed, in this case (1, 1, 26). We can
show that the weight of the least-significant digit is always 1, i.e., the first symbol is the units place value; 'k'
(=10) with a weight of 1 equals 10. After this, the weight of the next digit depends on the first threshold. The
second symbol has a place value of 36 minus the previous threshold value, in this case, 35. Therefore, the
sum of the first two symbols 'k' (=10) and 'v' (=21) is 10 × 1 + 21 × 35. Since the second symbol is not less
than the threshold value of 1, there is more to come. However, since the third symbol in this example is 'a'
(=0), we may ignore calculating its weight. Therefore, "kva" represents the decimal number (10 × 1) + (21 ×
35) = 745.
It is easy to show by induction that the weight of the (n+1)-th digit is the weight of the previous one times
(36 - threshold of the n-th digit).
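The decoding of "kva" can be traced with the following Python sketch. The names digit_value and decode_number are hypothetical; the default bias of 72 is the initial bias defined in RFC 3492, which is what yields the thresholds (1, 1, 26) for the first number in a label.

```python
BASE, TMIN, TMAX = 36, 1, 26  # Punycode parameters from RFC 3492


def digit_value(ch: str) -> int:
    """Map 'a'-'z' (case-insensitive) to 0-25 and '0'-'9' to 26-35."""
    ch = ch.lower()
    if "a" <= ch <= "z":
        return ord(ch) - ord("a")
    if "0" <= ch <= "9":
        return ord(ch) - ord("0") + 26
    raise ValueError(f"not a Punycode digit: {ch!r}")


def decode_number(symbols: str, bias: int = 72) -> tuple[int, int]:
    """Decode one variable-length integer; return (value, symbols consumed)."""
    value, weight, k = 0, 1, BASE
    for consumed, ch in enumerate(symbols, start=1):
        digit = digit_value(ch)
        value += digit * weight
        threshold = min(max(k - bias, TMIN), TMAX)
        if digit < threshold:        # a digit below the threshold ends the number
            return value, consumed
        weight *= BASE - threshold   # weight of the next, more significant digit
        k += BASE
    raise ValueError("input ended in the middle of a number")


print(decode_number("kva"))  # (745, 3)
```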
The thresholds themselves are determined for each successive encoded character by an algorithm keeping
them between 1 and 26 inclusive, meaning the last character of an encoding will always be alphabetic. The
case can then be used to provide information about the original case of the string.
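The threshold algorithm is not spelled out in this article. As a sketch, the bias adaptation function from RFC 3492 can be transcribed into Python as follows. The constants are the Punycode parameters given in that RFC; the helper name thresholds is hypothetical.

```python
BASE, TMIN, TMAX, SKEW, DAMP, INITIAL_BIAS = 36, 1, 26, 38, 700, 72


def adapt(delta: int, numpoints: int, first_time: bool) -> int:
    """Bias adaptation as described in RFC 3492, section 6.1."""
    delta = delta // DAMP if first_time else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)


def thresholds(bias: int, count: int = 3) -> list[int]:
    """Thresholds for the first `count` digit positions, clamped to 1..26."""
    return [min(max(k - bias, TMIN), TMAX)
            for k in range(BASE, BASE * (count + 1), BASE)]


print(thresholds(INITIAL_BIAS))  # [1, 1, 26]
# After the first code number (745) of "bücher", with 6 code points handled:
print(thresholds(adapt(745, 6, True)))  # [26, 26, 26]
```

With all thresholds at 26 after the first insertion, any later code number below 26 fits in a single letter, which is why the follow-up insertions in this example each cost only one symbol.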
For the insertion of a second special character in "bücher", the first possibility is "büücher" with code
"bcher-kvaa", the second "bücüher" with code "bcher-kvab", etc. After "bücherü" with code "bcher-kvae" come
codes representing insertion of ý, the character following ü, starting with "ýbücher" with code "bcher-kvaf"
(different from "übücher" coded "bcher-jvab"), etc.
To make the encoding and decoding algorithms simple, no attempt has been made to prevent some encoded
values from encoding inadmissible Unicode values; however, these should be checked for and detected
during decoding.
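Putting the pieces together, a minimal decoder might look like the sketch below. It follows the decoding procedure of RFC 3492 under simplifying assumptions: it does not verify that the part before the last hyphen is pure ASCII, it ignores case flags, and the only validity check is the rejection of values outside the Unicode range or in the surrogate block. The function names are hypothetical.

```python
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128


def digit_value(ch: str) -> int:
    ch = ch.lower()
    if "a" <= ch <= "z":
        return ord(ch) - ord("a")
    if "0" <= ch <= "9":
        return ord(ch) - ord("0") + 26
    raise ValueError(f"not a Punycode digit: {ch!r}")


def adapt(delta: int, numpoints: int, first_time: bool) -> int:
    delta = delta // DAMP if first_time else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)


def punycode_decode(text: str) -> str:
    # Everything before the last hyphen is the copied ASCII part.
    basic, _, extended = text.rpartition("-")
    output = list(basic)
    n, i, bias = INITIAL_N, 0, INITIAL_BIAS
    pos = 0
    while pos < len(extended):
        old_i, weight, k = i, 1, BASE
        while True:  # read one variable-length code number into i
            if pos >= len(extended):
                raise ValueError("input ended in the middle of a number")
            digit = digit_value(extended[pos])
            pos += 1
            i += digit * weight
            threshold = min(max(k - bias, TMIN), TMAX)
            if digit < threshold:
                break
            weight *= BASE - threshold
            k += BASE
        slots = len(output) + 1
        bias = adapt(i - old_i, slots, old_i == 0)
        n += i // slots  # each wrap past the end of the string advances n
        i %= slots       # the remainder is the insertion position
        if n > 0x10FFFF or 0xD800 <= n <= 0xDFFF:
            raise ValueError(f"inadmissible code point: {n:#x}")
        output.insert(i, chr(n))
        i += 1
    return "".join(output)


print(punycode_decode("bcher-kva"))   # bücher
print(punycode_decode("bcher-kvaf"))  # ýbücher
```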
Punycode is designed to work across all scripts, and to be self-optimizing by attempting to adapt to the
character set ranges within the string as it operates. It is optimized for the case where the string is composed
of zero or more ASCII characters and in addition characters from only one other script system, but will cope
with any arbitrary Unicode string. Note that for DNS use, the domain name string is assumed to have been
normalized using Nameprep and (for top-level domains) filtered against an officially registered language
table before being punycoded, and that the DNS protocol sets limits on the acceptable lengths of the output
Punycode string.
Examples
The following table shows examples of Punycode encodings for different types of input.[2]
Input | Punycode of input | Description of input
(empty) | (empty) | The empty string.
a | a- | Only ASCII characters, one, lowercase.
A | A- | Only ASCII characters, one, uppercase.
3 | 3- | Only ASCII characters, one, a digit.
- | -- | Only ASCII characters, one, a hyphen.
-- | --- | Only ASCII characters, two hyphens.
London | London- | Only ASCII characters, more than one, no hyphens.
Lloyd-Atkinson | Lloyd-Atkinson- | Only ASCII characters, one hyphen.
This has spaces | This has spaces- | Only ASCII characters, with spaces.
-> $1.00 <- | -> $1.00 <-- | Only ASCII characters, mixed symbols.
ü | tda | No ASCII characters, one Latin-1 Supplement character.
α | mxa | No ASCII characters, one Greek character.
例 | fsq | No ASCII characters, one CJK character.
😉 | n28h | No ASCII characters, one emoji character.
αβγ | mxacd | No ASCII characters, more than one character.
München | Mnchen-3ya | Mixed string, with one character that is not an ASCII character.
Mnchen-3ya | Mnchen-3ya- | Double-encoded Punycode of "München".
München-Ost | Mnchen-Ost-9db | Mixed string, with one character that is not ASCII, and a hyphen.
Bahnhof München-Ost | Bahnhof Mnchen-Ost-u6b | Mixed string, with one space, one hyphen, and one character that is not ASCII.
abæcdöef | abcdef-qua4k | Mixed string, two non-ASCII characters.
правда | 80aafi6cg | Russian, without ASCII.
ยจฆฟคฏข | 22cdfh1b8fsa | Thai, without ASCII.
도메인 | hq1bm8jm9l | Korean, without ASCII.
ドメイン名例 | eckwd4c7cu47r2wf | Japanese, without ASCII.
MajiでKoiする5秒前 | MajiKoi5-783gue6qz075azm5e | Japanese with ASCII.
「bücher」 | bcher-kva8445foa | Mixed non-ASCII scripts (Latin-1 Supplement and CJK).
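As a sketch of how such rows can be reproduced, the snippet below runs a few of the inputs through Python's built-in "punycode" codec and decodes the result again. The selection of rows and the variable names are arbitrary.

```python
# (input, expected Punycode) pairs taken from the table above.
rows = [
    ("", ""),
    ("a", "a-"),
    ("αβγ", "mxacd"),
    ("München", "Mnchen-3ya"),
    ("правда", "80aafi6cg"),
]

for text, expected in rows:
    encoded = text.encode("punycode").decode("ascii")
    decoded = encoded.encode("ascii").decode("punycode")
    # Each line shows the input, its encoding, and whether both directions agree.
    print(repr(text), repr(encoded), encoded == expected and decoded == text)
```

Note that the codec performs the raw Punycode transformation only. It does not add the "xn--" prefix used in domain names.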
Internationalized domain names

To prevent non-internationalized domain names containing hyphens from being accidentally interpreted as
Punycode, internationalized domain name Punycode sequences have a so-called ASCII Compatible Encoding
(ACE) prefix, "xn--", prepended.[3] Thus the domain name "bücher.tld" would be represented in ASCII as
"xn--bcher-kva.tld".
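The prefixing rule can be sketched per label as follows. This is a simplified illustration, assuming the input is already normalized: it skips Nameprep and all length and validity checks, and the function names are hypothetical. It requires Python 3.7 or later for str.isascii.

```python
ACE_PREFIX = "xn--"


def to_ascii_label(label: str) -> str:
    """Encode one label; pure-ASCII labels are passed through unchanged."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def to_ascii_domain(name: str) -> str:
    """Apply the per-label encoding to a dot-separated domain name."""
    return ".".join(to_ascii_label(label) for label in name.split("."))


print(to_ascii_domain("bücher.tld"))  # xn--bcher-kva.tld
```

For real use, Python's built-in "idna" codec performs the normalization as well, and "bücher.tld".encode("idna") gives the same result for this example.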
See also
Emoji domain
UTF-5
UTF-6
Website spoofing
References
1. RFC 3492, Punycode: A Bootstring encoding of Unicode for Internationalized
Domain Names in Applications (IDNA), A. Costello, The Internet Society (March
2003)
2. The Punycode in this table was created using the builtin codec "punycode" of the
Python programming language version 3.8 (s.encode("punycode")). See talk page.
3. Internet Assigned Numbers Authority (2003-02-14). "Completion of IANA Selection
of IDNA Prefix" (https://web.archive.org/web/20100427154004/http://www.atm.tut.fi/list-archive/ietf-announce/msg13572.html).
www.atm.tut.fi. Archived from the original
(http://www.atm.tut.fi/list-archive/ietf-announce/msg13572.html) on 2010-04-27. Retrieved 2017-09-22.
External links
IETF Punycode standard (https://www.rfc-editor.org/rfc/rfc3492.txt)
ICU IDNA Demonstration (https://web.archive.org/web/20070129082116/http://demo.icu-project.org/icu-bin/idnbrowser)
An online demonstration of how ICU performs IDN operations
List of TLDs considered by the Mozilla developers to have an effective anti-spoofing
policy for name registration (https://www.mozilla.org/projects/security/tld-idn-policy-list.html)
IDN and Punycode in IE7 (http://blogs.msdn.com/ie/archive/2006/07/31/684337.aspx)
Simple Punycode converter (https://www.charset.org/punycode.php)
Online on-the-fly Punycode converter based on the Punycode.js JavaScript library
(https://mothereff.in/punycode)
Online modular converter offering Punycode and Bootstring (https://cryptii.com/pipes/bootstring)
Retrieved from "https://en.wikipedia.org/w/index.php?title=Punycode&oldid=1035294994"
This page was last edited on 24 July 2021, at 20:46 (UTC).
Text is available under the Creative Commons Attribution-ShareAlike License; additional terms may
apply. By using this site, you agree to the Terms of Use and Privacy Policy. Wikipedia® is a registered
trademark of the Wikimedia Foundation, Inc., a non-profit organization.
