From ASCII To UTF-8-RolandSchock

IDUG & IBM Data Tech Summit
Toronto, Canada | September 23 – 24, 2019
From ASCII to UTF-8:
A Primer for a successful codepage conversion
Roland Schock
ARS Computer und Consulting GmbH
Session code: IBM Data Tech Summit, Toronto
24.09.2019 Db2 for Distributed
1
Agenda
• Introduction to character sets, codepages and UTF-8
• Planning your conversion
• Pitfalls (CODEUNITS32, Expansion, Factor 4, …)
• Codepages and Collation Sequences in Db2
• Performance implications when choosing a codepage
• Methods to convert a database to UTF-8
• Pitfalls of using multiple codepages within a Db2 for LUW database
without the system knowing about it
• Tips and tricks for a successful migration
Character Sets
• Basically, a character set is just a collection of entities or graphical
symbols with a meaning.
• Examples of character sets are the Latin alphabet, digits, naval flag
signs or other symbols:
A, B, C, ... ᇹぁゆ㌹㌺
agpx
A b c d ㍻㋿亹怔떟떥
3
Character Encoding
• A character encoding or code page is a mapping of the symbols of a
character set to bit patterns, which are also referred to as code points.
A → 17, B → 23, C → 42, …
• Typical examples of encodings are ASCII, EBCDIC or Unicode.
• Part of the encoding scheme is also the definition of a serialisation
scheme to convert the code point into a sequence of bytes.
(big-endian, little-endian, middle-endian, different CPU architectures)
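The difference between a code point and its serialisation can be sketched in a few lines of Python. This is only an illustration and not part of the original slides; the sample character and the 16-bit unit size are arbitrary choices.

```python
# Code point vs. serialised bytes for a single character.
ch = "A"
code_point = ord(ch)  # the number the encoding assigns to the symbol

# The same code point written as one 16-bit unit in both byte orders.
big_endian = ch.encode("utf-16-be")
little_endian = ch.encode("utf-16-le")

print(code_point, big_endian.hex(), little_endian.hex())
```

The code point is identical in both cases; only the order of the bytes on the wire or on disk differs.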
4
ASCII
• Sample of an encoding scheme:
• First version 1963, standardized 1968
• Ordered mapping to 7-bit numbers
5
Single Byte Char Sets (SBCS)
• Extensions from 7-bit ASCII to 8-bit code pages
• ISO-8859-x: ASCII + special characters for some languages
• ISO-8859-1 (Latin 1): ASCII + West European Chars
• ISO-8859-2 (Latin 2): ASCII + East European Chars
• ISO-8859-15: Modified ISO-8859-1 including Euro-Symbol (€)
• Platform specific charsets: Windows ANSI or MacRoman
• Common practice: Use Codepoint 0 for string termination
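Why the choice of an 8-bit code page matters can be sketched in Python (an illustration only, not taken from the slides): the same byte value decodes to a different symbol depending on which code page is assumed.

```python
# One byte, two ISO-8859 code pages, two different symbols.
raw = bytes([0xA4])
in_latin1 = raw.decode("iso-8859-1")    # generic currency sign
in_latin9 = raw.decode("iso-8859-15")   # Euro symbol

# The Windows code page 1252 puts the Euro symbol somewhere else again.
euro_in_cp1252 = "\u20ac".encode("cp1252")

print(repr(in_latin1), repr(in_latin9), euro_in_cp1252.hex())
```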
6
Double Byte Char Sets (DBCS) and EUC (Extended Unix Code)
• DBCS
• Expansion of SBCS from one byte to two bytes per character
• Mainly for Asian languages with more than 256 characters to encode
• Latin text is expanded to twice the size of SBCS
• Codepoints < 256 insert a zero byte in the stream
• EUC
• Multi Byte Char Set (MBCS): 2 or 4 bytes/char
• Used for Japanese, Korean, Traditional and Simplified Chinese on Unix platforms
• Uses single shift characters to switch to another code group to build a
multi-byte character
7
Unicode Character Set
• Intended to simplify and unify the different definitions of code pages
and hence conversion.
• The first definition contained 65,536 characters (16-bit, 1991, UCS-2).
• Version 2.0 defined the charset with 17 planes (the basic plane plus 16
supplementary planes) for up to 1,114,112 code points (32-bit, 1996, UCS-4).
• Today, in Unicode Version 12.1, 137,929 characters are
assigned to code points.
(Including ㋿ for the new Japanese Reiwa era since May 1, 2019)
8
Unicode Encodings
• UCS-2: two bytes per character
• UCS-4: four bytes per character
• UTF-16: Encoding of UCS-4 into one or two 16-bit words: the first 64k code
points use two bytes per character, all others four bytes
• UTF-8: variable-length encoding of characters in 1..4 bytes
• Possible problems with UCS-2, UCS-4, UTF-16:
• Byte order (big-endian vs. little-endian) of different processor architectures.
• Zero bytes can occur inside a character string ➔ zero is not valid as string termination
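The encodings listed above can be compared side by side with a short Python sketch (illustrative only; the musical G clef, U+1D11E, is picked because it lies outside the first 64k code points).

```python
# One character outside the first 64k code points in three encodings.
clef = "\U0001D11E"  # musical symbol G clef

as_ucs4 = clef.encode("utf-32-be")   # always four bytes
as_utf16 = clef.encode("utf-16-be")  # two 16-bit words (surrogate pair)
as_utf8 = clef.encode("utf-8")       # four bytes in UTF-8

# Plain ASCII text in UTF-16 contains zero bytes, so a zero byte
# cannot be used as a string terminator there.
ascii_as_utf16 = "A".encode("utf-16-le")

print(as_ucs4.hex(), as_utf16.hex(), as_utf8.hex(), ascii_as_utf16.hex())
```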
9
UTF-8
• Encoding in variable length sequence of 1..4 bytes
• Simple recognition of multibyte chars
• Compact storage of text in Latin chars
• Only the shortest encoding allowed
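Two of these properties can be made concrete with a small Python sketch. The helper names are hypothetical and the sample text is arbitrary: continuation bytes are recognisable by their top two bits, and a decoder rejects an encoding that is longer than necessary.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which always look like 10xxxxxx."""
    return (byte & 0b11000000) == 0b10000000


def count_characters(data: bytes) -> int:
    """Count characters by counting every byte that starts a sequence."""
    return sum(1 for b in data if not is_continuation(b))


def is_valid_utf8(data: bytes) -> bool:
    """Let Python's strict decoder decide whether the bytes are valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


encoded = "Äpfel €".encode("utf-8")
overlong_nul = bytes([0xC0, 0x80])  # two-byte form of code point 0, not the shortest

print(len(encoded), count_characters(encoded), is_valid_utf8(overlong_nul))
```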
10
Planning your conversion
• Pitfalls
• Truncation
• CODEUNITS32
• Making an educated guess
• Selecting a collation sequence
• IDENTITY
• SYSTEM and SYSTEM_xxx
• UCA collations
• Making Db2 case-insensitive
11
Character Expansion and Truncation
• Trivial for plain English text: ASCII characters do not expand
• Non-English characters expand to multiple bytes in UTF-8 and eat up
the spare room in a string definition
• By default Db2 counts bytes and not characters in definitions:
CHAR(20) is actually 20 bytes long and not 20 characters!
• Example:
The German word for apples, 'Äpfel', is stored in 6 bytes
db2 "select 'Äpfel' as text, hex('Äpfel') as hex from sysibm.sysdummy1"
TEXT  HEX
Äpfel C3847066656C
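The same expansion can be reproduced outside the database. The following Python sketch (an illustration, not part of the slides) encodes the word once in ISO-8859-1 and once in UTF-8 and compares the sizes.

```python
word = "Äpfel"

in_latin1 = word.encode("iso-8859-1")  # one byte per character
in_utf8 = word.encode("utf-8")         # the umlaut needs two bytes

print(len(word), len(in_latin1), len(in_utf8), in_utf8.hex().upper())
```

Five characters fit into a CHAR(5) in the single-byte database, but need six bytes after conversion.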
12
STRING_UNITS and CODEUNITS32
• Without further specification Db2 uses bytes, aka OCTETS, for all sizes
• With the global variable NLS_STRING_UNITS or the DB CFG value
STRING_UNITS the default string unit, and with it the number of bytes
reserved per character, can be changed
• CODEUNITS16 specifies 2 bytes and CODEUNITS32 4 bytes per character
• With NLS_STRING_UNITS=CODEUNITS32 a CHAR(10) occupies 40
bytes in memory and can store 10 ASCII or 10 umlaut characters
• But as the internal storage of a CHAR uses a length byte, the
maximum number of chars in a string depends on (NLS_)STRING_UNITS
→ with CODEUNITS32 you can use max. CHAR(63)
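The arithmetic behind these numbers, as a Python sketch. The 255-byte ceiling is an assumption derived from the single length byte mentioned above, and the function names are hypothetical.

```python
BYTES_PER_UNIT = {"OCTETS": 1, "CODEUNITS16": 2, "CODEUNITS32": 4}
MAX_CHAR_BYTES = 255  # assumption: the largest value a single length byte can hold


def reserved_bytes(declared_length: int, units: str) -> int:
    """Bytes reserved for CHAR(declared_length <units>)."""
    return declared_length * BYTES_PER_UNIT[units]


def max_declared_length(units: str) -> int:
    """Largest CHAR length that still fits under the byte ceiling."""
    return MAX_CHAR_BYTES // BYTES_PER_UNIT[units]


print(reserved_bytes(10, "CODEUNITS32"), max_declared_length("CODEUNITS32"))
```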
13
Avoid Truncation aka data loss
• If you do not change STRING_UNITS, you have to change your schema to
make room for character expansion
• Do not blindly multiply all CHAR(xx) definitions by 4 (=worst case)
• You might also need to move the table to a tablespace with a larger
page size if the row length grows too much
• Make an educated guess and work from that
• Use the script from Peter Schurr, IBM, to analyze your source db and get a good
guess at which definitions need a change
• Don't resize fields that will never experience character expansion (ID fields)
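One way to arrive at such an educated guess is to measure the UTF-8 size of the values a column actually contains. The Python sketch below shows only the idea; it is not the script mentioned above, and the function names and sample values are hypothetical.

```python
def required_octets(values):
    """Longest UTF-8 representation among the sampled column values."""
    return max((len(v.encode("utf-8")) for v in values), default=0)


def needs_resize(values, declared_octets):
    """True if at least one sampled value would no longer fit after conversion."""
    return required_octets(values) > declared_octets


names = ["Müller", "Smith", "Ångström"]  # column declared as CHAR(8)
ids = ["A0001", "B0002"]                 # pure ASCII ID column, CHAR(5)

print(required_octets(names), needs_resize(names, 8), needs_resize(ids, 5))
```

Columns like the ID field stay as they are, while the name column needs at least 10 bytes for the sampled data.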
14
Selecting a collating sequence
• When creating your new database you also have to choose a collating
sequence which determines the sort order of strings
• By default binary IDENTITY is used. This is the fastest method.
• The collating sequences SYSTEM and SYSTEM_xx (xx=SBCS codepage)
use an 8-bit ordering scheme and binary order for the rest → still fast
• UCA collating sequences can accommodate the most complex
ordering schemes, but can cost performance during sorting/searching
• With UCA collating sequences you can make Db2 case-insensitive ;-)
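What binary ordering means in practice can be sketched in Python. Sorting by the UTF-8 bytes mimics an IDENTITY-style order; `str.casefold` is used here only as a rough stand-in for a case-insensitive comparison and is not an implementation of a UCA collation.

```python
words = ["apple", "Zebra", "Äpfel", "Apple"]

# Binary order: compare the encoded bytes, as an IDENTITY-style collation would.
binary_order = sorted(words, key=lambda w: w.encode("utf-8"))

# Rough case-insensitive stand-in (NOT a UCA collation): compare case-folded text.
folded_order = sorted(words, key=str.casefold)
same_ignoring_case = "apple".casefold() == "Apple".casefold()

print(binary_order, folded_order, same_ignoring_case)
```

In binary order all upper-case ASCII letters sort before the lower-case ones and the umlaut comes last, which is rarely what end users expect.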
15
Methods to convert a database to UTF-8
• Requirements
• "Mudmap" for migration
• Tools
• Tripwires to avoid
• Functions in UTF-8 databases
16
Requirements
• Migrating from an SBCS to UTF-8 requires conversion of the byte code
(on disk) to a symbol and back into the byte code in the new code set
• The only way to change the codepage of a database is a full export and
load of all the data → No shortcut possible!
• Nope! Backup/Restore cannot change the codepage or byte order!
• Tools
• db2move
• Your own Export/Load scripts
• some IBM migration utilities (DMT, etc.)
• CDC Replication, Federation and ADMIN_MOVE_TABLE(), etc.
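The "byte code to symbol and back to byte code" step from the first bullet can be sketched in Python. This only illustrates the principle; the helper name is hypothetical and the source code page is an assumption chosen for the example.

```python
def convert(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Decode bytes of the old code page to symbols, then encode them again."""
    return data.decode(source).encode(target)


old_bytes = bytes.fromhex("C47066656C")  # 'Äpfel' as stored in ISO-8859-1
new_bytes = convert(old_bytes, "iso-8859-1")

print(old_bytes.hex().upper(), new_bytes.hex().upper())
```

Every value has to pass through this step, which is why copying pages as a backup does cannot change the code page.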
17
db2move
• Why don't we just use db2move?
• Free utility from IBM in all Db2 editions
• Export/Load from binary IXF files
• Can create destination tables
• db2move is single-threaded by default in EXPORT/LOAD mode
• Performance of db2move in COPY mode with LOAD_ONLY has not been
tested so far
• The flexibility of home-grown scripts was preferred
18
"Mudmap" for migration (1|3)
• Create new database as needed:
create db NEWDB using codeset UTF-8 TERRITORY DE COLLATE USING
CLDR181_LDE_AN_CX_EX_FX_HX_NX_S2 PAGESIZE 16 K DFT_EXTENT_SZ 8
AUTOCONFIGURE APPLY DB ONLY;
• Extract database structure via db2look (drop explain tables first):
db2 "connect to olddb"
db2 "call sysproc.sysinstallobjects('EXPLAIN','D',CAST(NULL AS VARCHAR(128)),'DB2INST1')"
db2look -d olddb -td@ -e -l -o db2look.out -xd -f
• Start EXPORTing your data with scripts (will take a while anyway)
• Split up the db2look file into tables/views, functions/triggers and grants
19
"Mudmap" for migration (2|3)
• Create Db2 script update_schema.db2 for schema changes and reorgs
ALTER TABLE myschema.mytable ALTER COLUMN name SET DATA TYPE VARCHAR(80);
• Make schema changes if needed by your application
• Generate REORGS out of ALTER TABLE script
for i in `fgrep -i "ALTER TABLE" update_schema.db2 | cut -d' ' -f 3 | sort | uniq` ; do echo "REORG TABLE $i ;" ; done
• Create the new schema in the new database from db2look part 1
• Use your update_schema.db2 script to alter and reorg empty tables
• Increase UTIL_HEAP_SZ, SORTHEAP to improve LOAD performance
• Backup new and still empty database for easier restart of LOAD
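For readers who prefer a script over the shell one-liner, the REORG generation can be sketched in Python. The function name is hypothetical, and unlike the fgrep variant this sketch only considers lines that begin with ALTER TABLE.

```python
def reorg_statements(script_text: str) -> list:
    """Build one REORG statement per table that is altered in the script."""
    tables = set()
    for line in script_text.splitlines():
        words = line.split()
        # Expect: ALTER TABLE <schema.table> ... (keywords in any case)
        if len(words) >= 3 and [w.upper() for w in words[:2]] == ["ALTER", "TABLE"]:
            tables.add(words[2])
    return [f"REORG TABLE {name} ;" for name in sorted(tables)]


sample_script = """\
ALTER TABLE myschema.mytable ALTER COLUMN name SET DATA TYPE VARCHAR(80);
ALTER TABLE myschema.mytable ALTER COLUMN city SET DATA TYPE VARCHAR(60);
alter table myschema.other ALTER COLUMN note SET DATA TYPE VARCHAR(400);
"""

for statement in reorg_statements(sample_script):
    print(statement)
```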
20
"Mudmap" for migration (3|3)
• Create LOAD scripts and don't forget IDENTITYOVERRIDE for tables
with exported identity column values
• Create EXCEPTION_TABLES for LOAD
• LOAD data in parallel in 4-6 threads (with statistics), using screen ;-)
• Check EXCEPTION_TABLES and LOAD messages, fix problems, reload
• Run check integrity, restart sequences
• Drop empty exception tables
• Create functions/triggers, grants from other parts of db2look file
• Backup database
21
Tools for easier migration
• db2move
• IBM Migration Toolkit (generates scripts, unfortunately discontinued)
• homegrown export/loads
• IBM Database Conversion Workbench (DCW)
• HPU*
• InfoSphere CDC*
* extra license required
22
Tripwires to avoid (1|3)
• SQL3125W during LOAD
• Occurs for every truncated value
• Hard to find as the LOAD message is not very informative
• Reasons:
• Character expansion higher than anticipated (or forgotten expansion)
• Full length strings from Export will throw this message on LOAD just in case
• IDENTITYOVERRIDE and Sequences
• Default LOAD options generate new values for identity columns → breaks referential integrity (RI)
• Don't forget to restart sequences: ALTER SEQUENCE x.y RESTART WITH
23
Tripwires to avoid (2|3):
Beware of bad 3rd Party Applications
• Application stored Unicode strings in a non-Unicode database
• Workaround from the application developer: manual conversion in the application,
storing the value as VARCHAR FOR BIT DATA plus a column with the CODEPAGE used
• Db2 Export assumed the base codepage, tried to convert and replaced chars
➔ Exported data is already corrupt
• Import might convert again and replace chars ➔ data corruption
• Solution:
• All tables with a codepage column were exported as VARCHAR FOR BIT DATA
• LOAD was also changed to VARCHAR FOR BIT DATA
• The application had a script to fix those records after the data migration
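The corruption described on this slide can be reproduced in miniature with Python. This is a sketch under the assumption that the database code page is ISO-8859-1; the variable names are hypothetical.

```python
original = "Ä"

# The application writes UTF-8 bytes into a database that believes
# its content is ISO-8859-1.
stored = original.encode("utf-8")

# An export that trusts the database code page converts "again".
converted_export = stored.decode("iso-8859-1").encode("utf-8")

# Treating the column as bit data leaves the bytes untouched.
bit_data_export = stored

print(stored.hex(), converted_export.hex(), bit_data_export.decode("utf-8"))
```

After the unwanted conversion the two original bytes have become four, and decoding them no longer yields the original character.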
24
Tripwires to avoid (3|3)

Application changes in own code
• Choose the right function: bytewise or charwise
• Watch out for defaults: e.g. with CHAR()
In a non-Unicode database, the string units of the result is OCTETS. Otherwise, the string units of the result are
determined by the data type of the first argument.
• OCTETS, if the first argument is character string or a graphic string with string units of OCTETS,
CODEUNITS16, or double bytes.
• CODEUNITS32, if the first argument is character string or a graphic string with string units of CODEUNITS32.
• Determined by the default string unit of the environment, if the first argument is not a character string or a
graphic string.
In a Unicode database, when the output string is truncated part-way through a multiple-byte character:
• If the input was a character string, the partial character is replaced with one or more blanks
• If the input was a graphic string, the partial character is replaced by the empty string
Do not rely on either of these behaviors because they might change in a future release.
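The difference between bytewise and charwise handling shows up as soon as a string is cut at a byte position. The Python sketch below uses hypothetical helper names and illustrates the problem only; it drops a partial character instead of replacing it with blanks, so it does not reproduce the Db2 behaviour quoted above.

```python
def splits_character(text: str, max_octets: int) -> bool:
    """True if a plain byte cut at max_octets ends inside a multi-byte character."""
    try:
        text.encode("utf-8")[:max_octets].decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True


def truncate_octets(text: str, max_octets: int) -> str:
    """Cut to at most max_octets bytes and drop a trailing partial character."""
    cut = text.encode("utf-8")[:max_octets]
    return cut.decode("utf-8", errors="ignore")


word = "Käse"  # 4 characters, 5 bytes in UTF-8
print(splits_character(word, 2), truncate_octets(word, 2), truncate_octets(word, 3))
```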
25
Functions in UTF-8 databases (1|4)
• The datatype CHAR is by default bytewise, but can also be specified
with CODEUNITS16 or CODEUNITS32
For example, with CHAR(5 CODEUNITS32), a stored 'abc' is handled in memory as 'abc' followed by two spaces to meet the logical
definition of the column. However, on disk [in UTF-8] it is stored as 'abc' followed by 17 spaces since the Db2 Data
Management Services layer requires the column to meet the physical definition.
• Beware of NCHAR mapping
• Db2 10.5 and earlier map NCHAR by default to GRAPHIC_CU16
• Db2 11.1 and later map NCHAR by default to CHAR_CU32
• See also https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.admin.config.doc/doc/r0060937.html
• This can cause issues in MQTs, lots of rebinding and return values
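The padding numbers in the CHAR(5 CODEUNITS32) example on this slide follow from simple arithmetic, sketched here in Python. The function name is hypothetical, and the sketch assumes the value contains no characters that need more than one code point.

```python
BYTES_PER_UNIT = {"OCTETS": 1, "CODEUNITS16": 2, "CODEUNITS32": 4}


def padding(value: str, declared_length: int, units: str):
    """Return (logical, physical) padding in spaces for a fixed-length CHAR."""
    logical = declared_length - len(value)  # spaces up to the declared character count
    physical_size = declared_length * BYTES_PER_UNIT[units]
    physical = physical_size - len(value.encode("utf-8"))  # spaces up to the byte size
    return logical, physical


print(padding("abc", 5, "CODEUNITS32"))
```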
26
Functions in UTF-8 databases (2|4)
• Some functions can be augmented with a char size:
• INSTR, LOCATE, LOCATE_IN_STRING can have an additional param to specify
OCTETS, CODEUNITS16 or CODEUNITS32
• INSTRB, INSTR2, INSTR4 work on single bytes, double bytes or 4 bytes
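How the string unit changes the result of a position search can be sketched in Python. `locate_sketch` is a hypothetical helper that only imitates the idea of a 1-based position with 0 for "not found"; it is not the Db2 implementation and covers only two of the three units.

```python
def locate_sketch(search: str, source: str, units: str = "CODEUNITS32") -> int:
    """1-based position of search in source, counted in the given unit; 0 if absent."""
    if units == "OCTETS":
        index = source.encode("utf-8").find(search.encode("utf-8"))
    elif units == "CODEUNITS32":
        index = source.find(search)  # Python strings are indexed by code point
    else:
        raise ValueError("this sketch only supports OCTETS and CODEUNITS32")
    return index + 1  # find() returns -1 when nothing is found, giving 0


print(locate_sketch("p", "Äpfel", "CODEUNITS32"), locate_sketch("p", "Äpfel", "OCTETS"))
```

The letter is the second character but starts at the third byte, because the umlaut in front of it occupies two bytes.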
27
Functions in UTF-8 databases (3|4)
• LENGTH, LENGTH2, LENGTH4, LENGTHB return the length of the string in the
unit of the respective representation (LENGTHB: bytes, LENGTH2: CODEUNITS16,
LENGTH4: CODEUNITS32)
• Example with the Unicode string '&N~AB', where '&' is the musical symbol G clef character,
and '~' is the combining tilde character.
This string is shown in different Unicode encoding forms in the following example:
What is the result of:
SELECT LENGTH(UTF8_VAR, CODEUNITS16), LENGTH(UTF8_VAR, CODEUNITS32),
LENGTH(UTF8_VAR, OCTETS) FROM SYSIBM.SYSDUMMY1
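The three lengths asked for in the query can be worked out with a short Python sketch that builds the same five-character string (G clef, N, combining tilde, A, B) and counts it in each unit. This is an illustration of the counting, not output captured from Db2.

```python
# G clef (U+1D11E), 'N', combining tilde (U+0303), 'A', 'B'
utf8_var = "\U0001D11EN\u0303AB"

length_codeunits16 = len(utf8_var.encode("utf-16-be")) // 2  # 16-bit units
length_codeunits32 = len(utf8_var)                           # code points
length_octets = len(utf8_var.encode("utf-8"))                # bytes

print(length_codeunits16, length_codeunits32, length_octets)
```

The G clef counts as two 16-bit units and four bytes, the combining tilde as one unit and two bytes, which gives 6, 5 and 9.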
28
Functions in UTF-8 databases (4|4)
• CHARACTER_LENGTH in codeunits, OCTET_LENGTH in bytes
• POSITION (byte or char), POSSTR (bytewise only)
• LOWER/LCASE is either bytewise or needs more details for conversion
and result type
• TRIM, LTRIM, RTRIM remove spaces in either code unit
• TO_SINGLE_BYTE projects chars into their single-byte version
This list is not complete! See Knowledge Center for more…

29
Summary
• Code pages are not as complicated as they might look
• Converting a database to UTF-8 is not just an Export/Load, but
requires some planning, dry runs and maybe application awareness
• Parallel Export and Loads can dramatically reduce time needed
• Don't assume a function works char-wise, default is mostly octets.
• When selecting/updating data manually: Be aware of code page
conversion between your database and your shell (see Appendix)
30
Roland Schock
ARS Computer und Consulting GmbH
roland.schock@ars.de
Session code: A1
Please fill out your session evaluation before leaving!
31
Appendix
• Where can I define the code page used?
• What is code page conversion and where does it happen?
• What problems can arise and how can I avoid them?
• Performance considerations
32
Usage of a code page
• Code pages can be specified at different levels:
• At the operating system where the application runs
• At the operating system where the server runs
• At the operating system where the application is prepared/bound
• At the database level
33
Default code page
• By default, DB2 servers and clients use the local settings of the
operating system or user:
• Windows: The server process is using the default region settings of the operating
system.
• Linux/Unix: The codepage is derived from the locale setting for the instance user
(i.e. the user running the database processes).
• Client (LUW): The current locale settings of the user determine the code page
used during CONNECT.
• Programming language: Java is always using Unicode when connecting to a
database via JDBC.
34
Specifying a code page: OS level
• Windows: Control Panel → Regional and Language settings, chcp command
• Linux/Unix: locale command
35
At prepare/bind time
• Special case during development of database software with static,
embedded SQL.
• Embedded SQL needs a prepare phase before compilation of the
source code.
• Later the prepared package needs to be bound to the database with
the bind command.
• Both commands need a database connection, and at connect
time the current setting of the locale is used.
36
Defining a database w/ code page
• Explicitly set the code page at creation time:
CREATE DB test USING CODESET codeset
TERRITORY territory COLLATE USING collatingseq
• Otherwise the current locale is used to determine the database codeset.
• The chosen code page cannot be changed later.
• In DB2 for iSeries and for z/OS you can also define single columns of a
table in a different code set (not detailed here).
37
Overview

38
Code page conversion
• If application and server use different code pages, code page
conversion happens.
• Code page conversion is always done at the receiver's side:
• at the server's side for data sent from client to server
• at the client's side for data sent from server to client
• Exception: Importing IXF files generated on a different system with
another code page
• If conversion tables are missing: SQLCODE -332
39
Client to server conversion
Client                             Server
uses code page X                   uses code page Y
• Send data using code page X      • Receive data
                                   • Convert to code page Y
                                   • Process data
• Receive data in Y                • Return result in code page Y
• Convert to code page X
40
Using DB2 Connect
Client                         Gateway                        Server
uses code page X               uses code page Y               uses code page Z
• Send data using code page X  • Receive data
                               • Convert to code page Y
                               • Send data in Y               • Receive data
                                                              • Convert to code page Z
                               • Receive data in Z            • Return result in code page Z
                               • Convert to Y
• Receive data in Y            • Return result in Y
• Convert to code page X
41
Other considerations
• Mapping of characters (injective):
If a character in the source code page is not contained in the target
code page, it is replaced by a substitution character.
• Round trip conversion (bijective):
If no substitution needs to take place between source and target code
pages, a round trip conversion does not lose information.
• Encoding/Decoding can change the number of bytes needed to store
the data.
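Substitution and round-trip safety can be checked with a small Python sketch. The helper names are hypothetical, and Python's own substitution character "?" stands in for whatever character a real conversion table would use.

```python
def convert(text: str, target: str) -> str:
    """Convert to the target code page and back, substituting unknown characters."""
    return text.encode(target, errors="replace").decode(target)


def round_trips(text: str, target: str) -> bool:
    """True if the text survives the conversion without any substitution."""
    return convert(text, target) == text


print(round_trips("Äpfel", "iso-8859-1"), round_trips("5 €", "iso-8859-1"))
print(convert("5 €", "iso-8859-1"))
print(len("Äpfel".encode("iso-8859-1")), len("Äpfel".encode("utf-8")))
```

The Euro symbol does not exist in ISO-8859-1, so it is substituted and the original text cannot be recovered afterwards.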
42
More considerations
• Using different conversion tables and €-Symbol:
Microsoft ANSI code page and the official code page 850 have a
different code point for the Euro symbol. If needed, code conversion
tables can be replaced (ref. Administration Guide, Planning).
• Unicode support:
DB2 supports the UCS-2 character set with UTF-8 and UCS-2 encoding
for Unicode databases
• For PureXML (V9.x) a UTF-8 database is needed.
43
More considerations
• To change the code page of a database, you have to use db2move
(Export/Import). Backup/Restore cannot be used. So choosing the
right database code page during database creation is crucial.
• Binary data (BLOB, FOR BIT DATA) is internally stored with code page
0, so no character conversion is applied.
44
Overview

45
Troubleshooting
• Identify used code pages:
• Display the codepage of a database:
db2 get db cfg for sample
• Displaying SQLCA area during CONNECT with CLP
When connecting to a database via CLP the option "-a" displays the SQLCA data
area, which shows the code page of the database and the connecting client.
• If connecting to iSeries or zSeries machines from DB2 LUW, check if
conversion tables are available.
46
Pitfalls
• Watch out for unintentional "conversions"
• All database communication partners are configured correctly,
but the DBA is looking at the data via a console window and the
console window (or PuTTY) is using a font with the wrong codepage
to display the data!
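What such a misconfigured console shows can be simulated in Python. This is a sketch that assumes the database delivers UTF-8 and the terminal interprets the bytes as Windows code page 1252; the stored data itself is not changed.

```python
stored = "Äpfel".encode("utf-8")  # correct bytes coming from the database

# A console that assumes cp1252 renders the very same bytes differently.
shown_on_console = stored.decode("cp1252")

# The data is still intact when interpreted with the right code page.
shown_correctly = stored.decode("utf-8")

print(shown_on_console, shown_correctly)
```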
47
db2set DB2CODEPAGE
• Know what you intend to do, if you use the DB2 environment variable
DB2CODEPAGE
• It tells DB2, that you will feed it with the right code points, regardless
of the displayed symbols.
• See Technote "Setting DB2CODEPAGE=1208 may result in incorrect
character data insertion"
SQL0191N Error occurred because of a fragmented MBCS character.
http://www.ibm.com/support/docview.wss?uid=swg21601028
48
db2set DB2CONSOLECP
• Intended to allow DB2 CLI to use different codepages for output
• Multiple APARs for DB2 9.1, 9.5, 9.7:
"DB2CONSOLECP environment variable has no effect on DB2 message
text or is not working"
49
DB2 Special Registers for NLS
• Change message text for DB2 Monreport modules:
db2 "SET CURRENT LOCALE LC_MESSAGES = 'de_DE'"
db2 "call monreport.lockwait"
• Change message names for Time/Dates:
db2 "SET CURRENT LOCALE LC_TIME = 'fr_FR'"
db2 "values monthname(current date)"
(Works with DAYNAME, MONTHNAME, NEXT_DAY, ROUND, ROUND_TIMESTAMP,
TIMESTAMP_FORMAT, TRUNCATE, TRUNC_TIMESTAMP and VARCHAR_FORMAT)
50
Performance considerations
• Try to avoid unnecessary conversions.
• Create databases already with the code page needed for your
applications.
• For international databases prefer UTF-8, especially when used with
Java programs.
• Remember: Conversion takes time.
51
Links
• IBM developerWorks white paper:
http://www.ibm.com/developerworks/db2/library/techarticle/dm-0506chong/index.html
• DB2 Knowledge Center, Db2 11.1, Multicultural Support
https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.admin.nls.doc/doc/c0006848.html
• Unicode
http://www.unicode.org
• UTF-8 article at Wikipedia
http://en.wikipedia.org/wiki/UTF-8
52
