Mark Davis
Chief SW Globalization Architect
IBM
Agenda
What is Collation?
Features
Mechanisms
Warnings
ICU 1.8 Collation
di silva | Dickens
di Silva | di silva
Di silva | disilva
Di Silva | di Silva
Dickens | diSilva
disilva | Di silva
diSilva | Disilva
Disilva | Di Silva
DiSilva | DiSilva
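The two columns of names above can be reproduced with a toy key function. This is only a sketch: it assumes the left column treats the space as an ordinary character and the right column ignores the space until a final tie-break. The function `toy_key` and its three-part key are hypothetical and are not ICU's implementation.

```python
NAMES = ["DiSilva", "Dickens", "di silva", "Disilva", "di Silva",
         "diSilva", "Di silva", "disilva", "Di Silva"]

def toy_key(text, ignore_spaces):
    """Hypothetical three-part key: letters, then case, then spaces."""
    if ignore_spaces:
        # Spaces do not take part in the first-level comparison.
        primary = text.replace(" ", "").lower()
    else:
        # Space (U+0020) sorts before any letter in code point order.
        primary = text.lower()
    # Case level: lowercase (False) sorts before uppercase (True).
    tertiary = tuple(ch.isupper() for ch in text if ch != " ")
    # Final tie-break: a space (False) sorts before a letter (True).
    quaternary = tuple(ch != " " for ch in text)
    return (primary, tertiary, quaternary)

left_column = sorted(NAMES, key=lambda s: toy_key(s, ignore_spaces=False))
right_column = sorted(NAMES, key=lambda s: toy_key(s, ignore_spaces=True))
```

With spaces counted, every "di silva" variant sorts ahead of "Dickens"; with spaces ignored, "Dickens" comes first and the space only separates otherwise identical entries.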
Extended Customizations
User-defined Script Order
b < ב < β < б
β < b < б < ב
“&” ≡ “ampersand”
Merging tailorings
Iranian + French
Numbers
A-1 < A-234
A-234 < A-1
Collation also used for:
Searching
ignore case, accent options
Selection
Return all records where
• Jones ≤ name < Smith
Graphemes
What a user considers a “character”
Regular expressions (Level 3)
• UTR #18
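The selection case above (Jones ≤ name < Smith) can be sketched as a range lookup over sorted keys. This is a minimal illustration, not a database implementation: `select_range` is a hypothetical helper, and `str.casefold` stands in for a real collation key.

```python
import bisect

def select_range(names, low, high, key=str.casefold):
    """Return names whose key k satisfies key(low) <= k < key(high)."""
    keyed = sorted((key(n), n) for n in names)   # order records by key
    keys = [k for k, _ in keyed]
    start = bisect.bisect_left(keys, key(low))   # first key >= low
    end = bisect.bisect_left(keys, key(high))    # first key >= high
    return [n for _, n in keyed[start:end]]

records = ["smith", "Jones", "adams", "Miller", "jonas", "Smithson"]
selected = select_range(records, "Jones", "Smith")
```

Because the bounds are compared as keys, the result depends on the collation in use: a different key function can move records into or out of the range.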
UCA
UTS #10: Unicode Collation Algorithm
Levels, Expansions, Contractions, Punctuation,
Canonical Equivalence, etc.
Default ordering: all Unicode code points
Provides for tailoring to given languages
Also see: The Unicode Standard,
§5.17: Sorting and Searching
Aligned with ISO 14651
APIs
String Compare
Sort Keys
String Search
Sort Keys
Transform string into series of bytes that binary-compare in collation order
a: 06 C3 01 20 01 02 00
A: 06 C3 01 20 01 08 00
á: 06 C3 01 20 32 01 02 02 00
ab: 06 C3 06 D7 01 20 20 01 02 02 00
b: 06 D7 01 20 01 02 00
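The byte sequences above follow a visible pattern: primary weights, 01, secondary weights, 01, tertiary weights, 00. The sketch below rebuilds them from a toy table. The weight values are copied from the five examples only; the function `toy_sort_key` and its tables are hypothetical and cover nothing beyond a, b, case, and the acute accent.

```python
import unicodedata

# Toy weights taken from the examples above (assumption: illustration only).
PRIMARY = {"a": b"\x06\xc3", "b": b"\x06\xd7"}
SECONDARY_COMMON = 0x20
SECONDARY_ACCENT = {"\u0301": 0x32}   # combining acute accent
TERTIARY_LOWER = 0x02
TERTIARY_UPPER = 0x08

def toy_sort_key(text):
    """Build a level-separated sort key; raises KeyError on unknown letters."""
    primary, secondary, tertiary = bytearray(), bytearray(), bytearray()
    for ch in unicodedata.normalize("NFD", text):
        if ch in SECONDARY_ACCENT:
            # An accent carries no primary weight, only a secondary one.
            secondary.append(SECONDARY_ACCENT[ch])
            tertiary.append(TERTIARY_LOWER)
        else:
            primary += PRIMARY[ch.lower()]
            secondary.append(SECONDARY_COMMON)
            tertiary.append(TERTIARY_UPPER if ch.isupper() else TERTIARY_LOWER)
    return (bytes(primary) + b"\x01" + bytes(secondary) + b"\x01"
            + bytes(tertiary) + b"\x00")

ordered = sorted(["b", "ab", "\u00e1", "A", "a"], key=toy_sort_key)
```

Since Python compares `bytes` objects byte by byte, sorting on these keys is the binary comparison the slide describes.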
[Diagram: code looked up in FR first; if not found, in UCA; if not found there, synthesized]
Processing Overview
Checks for identical prefixes
Tolerant of most unnormalized text
invokes normalization rarely
Uses “exceptional values”
Compresses sort keys
Incremental length/normalization
Identical Prefixes
Sorting / Searching Databases
Many comparisons to “close” strings
Check initial prefixes with binary
compare
Drop into collation loop at first
difference
Complication…
Initial Prefix Complication
Need to back up if in “bad” position:
Type | Example
Contraction (Spanish) | c h
Normalization | a °
Surrogate Pair | <L> <T>
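The prefix check and the back-up step can be sketched as follows. This is a simplified illustration under stated assumptions: `CONTRACTION_STARTS` is a hypothetical one-entry table for Spanish "ch", combining marks stand in for the normalization case, and surrogate pairs are not modeled because Python strings are indexed by code point.

```python
import unicodedata

CONTRACTION_STARTS = {"c"}   # hypothetical: first letter of Spanish "ch"

def common_prefix_length(a, b):
    """Length of the identical prefix, found by plain binary comparison."""
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return i

def is_bad_position(s, i):
    """True if collation cannot safely start at index i (i > 0) of s."""
    if i < len(s) and unicodedata.combining(s[i]) != 0:
        return True                       # mark belongs to the previous base
    return s[i - 1] in CONTRACTION_STARTS # previous letter may contract

def safe_start(a, b):
    """Index where the collation loop can start for both strings."""
    i = common_prefix_length(a, b)
    while i > 0 and (is_bad_position(a, i) or is_bad_position(b, i)):
        i -= 1                            # back up past the bad position
    return i
```

For "chico" and "cuco" the binary prefix is "c", but the comparison has to restart at index 0 so that "ch" is seen as a unit.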
Fast C or D (FCD)
Accepts all NFD, most NFC,
without normalization
Sequence | FCD | NFC | NFD
A-ring | Y | Y | -
Angstrom | Y | - | -
A + ring | Y | - | Y
A + grave | Y | - | Y
A-ring + grave | Y | - | -
A + cedilla + ring | Y | - | Y
A + ring + cedilla | - | - | -
A-ring + cedilla | - | Y | -
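An FCD test can be sketched with Python's `unicodedata` module. The idea, as assumed here, is that each character's canonical decomposition must begin with a combining class that is zero or not lower than the class that ended the previous character's decomposition. The function names are hypothetical, and this is not ICU's code.

```python
import unicodedata

def lead_trail_classes(ch):
    """Combining classes of the first and last code points of ch's NFD form."""
    decomposed = unicodedata.normalize("NFD", ch)
    return (unicodedata.combining(decomposed[0]),
            unicodedata.combining(decomposed[-1]))

def is_fcd(text):
    """True if text passes the FCD check without being normalized."""
    previous_trail = 0
    for ch in text:
        lead, trail = lead_trail_classes(ch)
        if lead != 0 and previous_trail > lead:
            return False          # marks would need reordering
        previous_trail = trail
    return True

A_RING, ANGSTROM = "\u00c5", "\u212b"
RING, GRAVE, CEDILLA = "\u030a", "\u0300", "\u0327"
```

The ring has combining class 230 and the cedilla 202, so a cedilla that follows a ring, whether the ring is separate or part of A-ring, fails the check.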
Exceptional Values
Normal weight storage
PPPPPPPPPPPPPPPP | SSSSSSSS | C | C | TTTTTT
16b | 8b | 1b | 1b | 6b
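The 32-bit layout above can be sketched as a pair of pack and unpack helpers. The field widths come from the slide; reading P, S, and T as primary, secondary, and tertiary weights is an assumption, the two C bits are treated as an opaque 2-bit field, and the function names are hypothetical.

```python
def pack_weight(primary, secondary, c_bits, tertiary):
    """Pack the fields into one 32-bit word: 16b | 8b | 2 x 1b | 6b."""
    if not (0 <= primary < 1 << 16 and 0 <= secondary < 1 << 8
            and 0 <= c_bits < 1 << 2 and 0 <= tertiary < 1 << 6):
        raise ValueError("field does not fit its bit width")
    return (primary << 16) | (secondary << 8) | (c_bits << 6) | tertiary

def unpack_weight(word):
    """Split a 32-bit word back into (primary, secondary, c_bits, tertiary)."""
    return ((word >> 16) & 0xFFFF,   # top 16 bits
            (word >> 8) & 0xFF,      # next 8 bits
            (word >> 6) & 0x3,       # two 1-bit fields
            word & 0x3F)             # low 6 bits

word = pack_weight(0x06C3, 0x20, 0, 0x02)
```

Keeping all levels in one word lets a single table lookup return every weight for a character.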