
IDUG & IBM Data Tech Summit

Toronto, Canada | September 23 – 24, 2019

From ASCII to UTF-8:
A Primer for a successful codepage conversion

Roland Schock
ARS Computer und Consulting GmbH
Session code: IBM Data Tech Summit, Toronto
24.09.2019 Db2 for Distributed


Agenda

• Introduction to Character sets, Codepages and UTF-8
• Planning your conversion
• Pitfalls (CODEUNITS32, Expansion, Factor 4, …)
• Codepages and Collation Sequences in Db2
• Performance implications when choosing a codepage
• Methods to convert a database to UTF-8
• Pitfalls of using multiple codepages within a Db2 for LUW database without the system's knowledge
• Tips and tricks for a successful migration

Character Sets

• Basically, a character set is just a collection of entities or graphical symbols with a meaning.
• Examples of character sets are the Latin alphabet, digits, naval signal flags or other symbols:

A, B, C, ... ᇹぁゆ ㌹ ㌺
agpx
A b c d ㍻㋿亹怔떟떥


Character Encoding

• A character encoding or code page is a mapping of the symbols of a character set to bit patterns, which are also referred to as code points.
A → 17, B → 23, C → 42, …
• Typical examples of encodings are ASCII, EBCDIC or Unicode.
• Part of the encoding scheme is also the definition of a serialisation scheme to convert the code point into a sequence of bytes.
(big-endian, little-endian, middle-endian, different CPU architectures)
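
The two steps described above (symbol to code point, code point to bytes) can be sketched in a few lines of Python. This is only an illustration and not part of the original slides; the sample character is arbitrary.

```python
# Step 1: a character maps to a code point (a number).
# Step 2: the code point is serialised into bytes. Byte order matters
# as soon as one code unit spans more than one byte.
ch = "A"
code_point = ord(ch)                 # 65 in ASCII and in Unicode

ascii_bytes = ch.encode("ascii")     # one byte: 41
utf16_be = ch.encode("utf-16-be")    # big-endian:    00 41
utf16_le = ch.encode("utf-16-le")    # little-endian: 41 00

print(code_point, ascii_bytes.hex(), utf16_be.hex(), utf16_le.hex())
```

The same code point yields different byte sequences depending on the serialisation scheme, which is why the scheme has to be part of the encoding definition.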


ASCII

• Sample of an encoding scheme:

• First version 1963, Standardized 1968


• Ordered mapping to 7-bit numbers

Single Byte Char Sets (SBCS)

• Extensions from 7-bit ASCII to 8-bit code pages


• ISO-8859-x: ASCII + special characters for some languages
• ISO-8859-1 (Latin 1): ASCII + West European Chars
• ISO-8859-2 (Latin 2): ASCII + East European Chars
• ISO-8859-15: Modified ISO-8859-1 including Euro-Symbol (€)
• Platform specific charsets: Windows ANSI or MacRoman

• Common practice: Use Codepoint 0 for string termination
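
A small Python sketch (an illustration, not from the slides) shows why the choice of 8-bit code page matters: the same byte value decodes to different symbols, and the Euro symbol sits at different positions. It assumes Windows-1252 as the "Windows ANSI" code page for Western Europe.

```python
# One byte value, interpreted by two different 8-bit code pages.
raw = bytes([0xA4])
latin1 = raw.decode("iso-8859-1")     # generic currency sign
latin9 = raw.decode("iso-8859-15")    # Euro symbol

# Windows-1252 (assumed here as "Windows ANSI") puts the Euro elsewhere.
euro_cp1252 = "\u20ac".encode("cp1252")

print(repr(latin1), repr(latin9), euro_cp1252.hex())
```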


Double Byte Char Sets (DBCS) and EUC (Extended Unix Code)

• DBCS
• Expansion of SBCS from one byte to two bytes per character
• Mainly for Asian languages with more than 256 characters to encode
• Latin text is expanded to twice the size of SBCS
• Codepoints < 256 insert a zero byte in the stream
• EUC
• Multi Byte Char Set (MBCS): 2 or 4 bytes/char
• Used for Japanese, Korean, Traditional and Simplified Chinese on Unix platforms
• Uses single shift characters to switch to another code group to build a multi-byte character


Unicode Character Set

• Intended to simplify and unify the different definitions of code pages and hence conversion.
• The first definition contained 65,536 characters (16-bit, 1991, UCS-2).
• Version 2.0 defined the charset with 17 planes (the basic plane plus 16 supplementary planes) for up to 17 × 65,536 = 1,114,112 characters (32-bit, 1996, UCS-4).
• Today, in Unicode Version 12.1, 137,929 characters are assigned to code points.
(Including ㋿ for the new Japanese Reiwa era, which began on May 1, 2019)


Unicode Encodings

• UCS-2: two bytes per character


• UCS-4: four bytes per character
• UTF-16: Encoding of UCS-4 into one or two 16-bit words: the first 64k code points use two bytes per character, all others four bytes
• UTF-8: dynamic or variable length encoding of characters in 1..4 bytes
• Possible problems with UCS-2, UCS-4, UTF-16:
• Byte order (big-endian vs. little-endian) of different processor architectures.
• Zero bytes can occur inside a character string ➔ a zero byte cannot serve as string terminator
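
To make the size differences concrete, the following Python sketch (an illustration with arbitrarily chosen sample characters) encodes one character from each range and checks for embedded zero bytes.

```python
# One character each from ASCII, Latin-1, the rest of the basic plane,
# and a supplementary plane (musical symbol G clef).
samples = ["A", "\u00c4", "\u20ac", "\U0001D11E"]

sizes = {}
for ch in samples:
    sizes[ch] = (
        len(ch.encode("utf-8")),      # 1 to 4 bytes
        len(ch.encode("utf-16-be")),  # 2 or 4 bytes
        len(ch.encode("utf-32-be")),  # always 4 bytes
    )
    print(f"U+{ord(ch):04X} utf-8/16/32 bytes: {sizes[ch]}")

# The 16-bit form contains a zero byte even for plain ASCII text;
# UTF-8 does not.
has_zero_utf16 = 0 in "A".encode("utf-16-be")
has_zero_utf8 = 0 in "A".encode("utf-8")
```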


UTF-8

• Encoding in variable length sequence of 1..4 bytes


• Simple recognition of multibyte chars
• Compact storage of text in latin chars
• Only the shortest encoding allowed
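
The "simple recognition" and "shortest encoding" properties can be sketched as follows. The helper `utf8_sequence_length` is a hypothetical name introduced for this illustration; it classifies a byte by its leading bits, and the last lines show that Python's decoder rejects an overlong (non-shortest) form.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the sequence length announced by a UTF-8 lead byte.

    Returns 0 for a continuation byte (10xxxxxx) and raises ValueError
    for byte values that can never start a well-formed sequence.
    """
    if lead < 0x80:
        return 1        # 0xxxxxxx: ASCII
    if lead < 0xC0:
        return 0        # 10xxxxxx: continuation byte
    if lead < 0xC2:
        raise ValueError("0xC0/0xC1 would start an overlong 2-byte form")
    if lead < 0xE0:
        return 2        # 110xxxxx
    if lead < 0xF0:
        return 3        # 1110xxxx
    if lead < 0xF5:
        return 4        # 11110xxx
    raise ValueError("byte cannot occur in UTF-8")

# "Ä€" is C3 84 E2 82 AC in UTF-8: a 2-byte and a 3-byte sequence.
lengths = [utf8_sequence_length(b) for b in "\u00c4\u20ac".encode("utf-8")]

# Only the shortest form is valid: C0 80 is an overlong encoding of U+0000.
try:
    b"\xc0\x80".decode("utf-8")
    overlong_rejected = False
except UnicodeDecodeError:
    overlong_rejected = True
```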


Planning your conversion

• Pitfalls
• Truncation
• CODEUNITS32
• Making an educated guess
• Selecting a collation sequence
• IDENTITY
• SYSTEM and SYSTEM_xxx
• UCA collations
• Making Db2 case-insensitive


Character Expansion and Truncation

• Trivial for plain English (ASCII) text, which does not expand
• Non-English characters expand to multiple bytes in UTF-8, eating up the space reserved by the string definition
• By default Db2 counts bytes and not characters in definitions:
CHAR(20) is actually 20 bytes long and not 20 characters!
• Example:
The German word for apples, 'Äpfel', is stored in 6 bytes
db2 "select 'Äpfel' as text, hex('Äpfel') as hex from sysibm.sysdummy1"
TEXT  HEX
Äpfel C3847066656C
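
The same expansion can be reproduced outside the database. This Python sketch is only an illustration; it encodes the word from the example and compares character count, byte count and hex representation.

```python
word = "\u00c4pfel"                      # "Äpfel"
encoded = word.encode("utf-8")

chars = len(word)                        # number of characters
octets = len(encoded)                    # number of bytes in UTF-8
hex_utf8 = encoded.hex().upper()         # same form as Db2's HEX()

# In a single-byte code page such as ISO-8859-1 nothing expands.
latin1_octets = len(word.encode("iso-8859-1"))

print(chars, octets, hex_utf8, latin1_octets)
```

A CHAR(5) column that held this word in an ISO-8859-1 database is therefore one byte too short after the conversion.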

STRING_UNITS and CODEUNITS32

• Without further specification Db2 uses bytes (OCTETS) for all sizes
• The global variable NLS_STRING_UNITS or the DB CFG parameter STRING_UNITS changes the default string unit, i.e. how many bytes are reserved per character of a CHAR
• CODEUNITS16 specifies 2 bytes and CODEUNITS32 4 bytes per character
• With NLS_STRING_UNITS=CODEUNITS32 a CHAR(10) occupies 40 bytes in memory and can store 10 ASCII or 10 umlaut characters
• But as the internal storage of a CHAR uses a length byte, the maximum number of characters in a string depends on (NLS_)STRING_UNITS
→ with CODEUNITS32 you can use at most CHAR(63)
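
The limits above follow from simple arithmetic. The sketch below assumes that a single length byte caps a fixed-length CHAR at 255 bytes, as the slide implies; the function names are hypothetical and only serve this illustration.

```python
MAX_CHAR_BYTES = 255   # assumption: one length byte limits CHAR to 255 bytes
BYTES_PER_UNIT = {"OCTETS": 1, "CODEUNITS32": 4}

def reserved_bytes(length: int, string_units: str) -> int:
    """Bytes reserved for CHAR(length) in the given string units."""
    return length * BYTES_PER_UNIT[string_units]

def max_char_length(string_units: str) -> int:
    """Largest n for which CHAR(n) still fits into MAX_CHAR_BYTES."""
    return MAX_CHAR_BYTES // BYTES_PER_UNIT[string_units]

print(reserved_bytes(10, "CODEUNITS32"))   # bytes for CHAR(10 CODEUNITS32)
print(max_char_length("CODEUNITS32"))      # largest CHAR(n) with CODEUNITS32
```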

Avoid Truncation aka data loss

• If you do not change STRING_UNITS, you have to change your schema to make room for character expansion
• Do not blindly multiply all CHAR(xx) definitions by 4 (= worst case)
• You might also need to move the table to a tablespace with a larger page size if the row length grows too much
• Make an educated guess and work from that
• Use the script from Peter Schurr (IBM) to analyze your source database and get a good guess as to which definitions need a change
• Don't resize fields that will never experience character expansion (ID fields)
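
One way to arrive at such an educated guess is to measure the UTF-8 length of the values actually stored. The following Python sketch is not the script mentioned above; it is a hypothetical illustration that works on sample values already fetched from a column, and all names and data are made up.

```python
def required_octets(values):
    """Longest UTF-8 byte length among the sampled column values."""
    return max((len(v.encode("utf-8")) for v in values), default=0)

def needs_resize(defined_octets, values):
    """True if at least one value no longer fits after conversion."""
    return required_octets(values) > defined_octets

# Hypothetical samples: a name column defined as CHAR(8) and an ID column
# defined as CHAR(5) in the single-byte source database.
names = ["M\u00fcller", "Smith", "\u00d8verg\u00e5rd"]   # Müller, Smith, Øvergård
ids = ["A0001", "A0002"]

print(required_octets(names), needs_resize(8, names))   # name column must grow
print(required_octets(ids), needs_resize(5, ids))       # ID column stays as it is
```

Only the column with non-ASCII content needs a larger definition, which matches the advice not to resize everything by a fixed factor.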


Selecting a collating sequence

• When creating your new database you also have to choose a collating
sequence which determines the sort order of strings
• By default binary IDENTITY is used. This is the fastest method.
• The collating sequences SYSTEM and SYSTEM_xx (xx=SBCS codepage)
use an 8-bit ordering scheme and binary order for the rest → still fast
• UCA collating sequences can accommodate the most complex
ordering schemes, but can cost performance during sorting/searching
• With UCA collating sequences you can make Db2 case-insensitive ;-)
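
The effect of a binary collation can be illustrated in Python by sorting on the UTF-8 bytes. The second sort uses `str.casefold` as a rough stand-in for a case-insensitive collation; this is an assumption made for the sketch, it is not the UCA algorithm, and a real UCA collation would also place 'Äpfel' next to 'apple'.

```python
words = ["apple", "Zebra", "\u00c4pfel"]   # last entry is "Äpfel"

# IDENTITY-like ordering: compare the raw UTF-8 bytes.
binary_order = sorted(words, key=lambda w: w.encode("utf-8"))

# Case-insensitive approximation (not UCA): compare case-folded strings.
caseless_order = sorted(words, key=str.casefold)

print(binary_order)
print(caseless_order)
```

With binary ordering all uppercase ASCII letters sort before lowercase ones, and every non-ASCII character sorts after both.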


Methods to convert a database to UTF-8

• Requirements
• "Mudmap" for migration
• Tools
• Tripwires to avoid
• Functions in UTF-8 databases


Requirements

• Migrating from an SBCS to UTF-8 requires converting the byte code (on disk) to a symbol and back into the byte code of the new code set
• The only way to change the codepage of a database is a full export and load of all the data → No shortcut possible!
• Nope! Backup/Restore cannot change codepage or byte order!
• Tools
• db2move
• Your own Export/Load scripts
• some IBM migration utilities (DMT, etc.)
• CDC Replication, Federation and Admin_Move_Table(), etc.
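
The first requirement above, bytes to symbol and back to bytes in the new code set, is what every export/load tool performs internally. A minimal Python sketch of that step, assuming ISO-8859-1 as the source code page (the function name `convert` is hypothetical):

```python
def convert(raw: bytes, source_codepage: str, target_codepage: str = "utf-8") -> bytes:
    """Re-encode raw bytes from the source code page into the target one."""
    text = raw.decode(source_codepage)      # bytes on disk -> symbols
    return text.encode(target_codepage)     # symbols -> bytes in the new code set

old = bytes.fromhex("C47066656C")           # "Äpfel" as stored in ISO-8859-1
new = convert(old, "iso-8859-1")

print(old.hex().upper(), "->", new.hex().upper())
```

A backup image contains the old bytes unchanged, which is why a restore cannot replace this step.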

db2move

• Why don't we just use db2move?


• free utility from IBM in all Db2 editions
• Export/Load from binary IXF files
• Can create destination tables

• db2move is by default single threaded in EXPORT/LOAD mode


• Performance of db2move in COPY mode with LOAD_ONLY was not
tested so far
• Flexibility with home grown scripts was preferred

"Mudmap" for migration (1|3)

• Create new database as needed:


create db NEWDB using codeset UTF-8 TERRITORY DE COLLATE USING
CLDR181_LDE_AN_CX_EX_FX_HX_NX_S2 PAGESIZE 16 K DFT_EXTENT_SZ 8
AUTOCONFIGURE APPLY DB ONLY;
• Extract database structure via db2look (drop explain tables first):
db2 "connect to olddb"
db2 "call sysproc.sysinstallobjects('EXPLAIN','D',CAST(NULL AS VARCHAR(128)),'DB2INST1')"
db2look -d olddb -td@ -e -l -o db2look.out -xd -f
• Start EXPORTing your data with scripts (will take a while anyway)
• Split up db2look file in tables/views, functions/triggers and grants

"Mudmap" for migration (2|3)

• Create Db2 script update_schema.db2 for schema changes and reorgs


ALTER TABLE myschema.mytable ALTER COLUMN name SET DATA TYPE VARCHAR(80);
• Make schema changes if needed by your application
• Generate REORGS out of ALTER TABLE script
for i in $(grep -Fi "ALTER TABLE" update_schema.db2 | cut -d' ' -f 3 | sort -u); do echo "REORG TABLE $i ;"; done
• Create the new schema in the new database from db2look part 1
• Use your update_schema.db2 script to alter and reorg empty tables
• Increase UTIL_HEAP_SZ, SORTHEAP to improve LOAD performance
• Backup new and still empty database for easier restart of LOAD

"Mudmap" for migration (3|3)

• Create LOAD scripts and don't forget IDENTITYOVERRIDE for tables with exported identity column values
• Create EXCEPTION_TABLES for LOAD
• LOAD data in parallel in 4-6 threads (with statistics), using screen ;-)
• Check EXCEPTION_TABLES and LOAD messages, fix problems, reload
• Run check integrity, restart sequences
• Drop empty exception tables
• Create functions/triggers, grants from other parts of db2look file
• Backup database

Tools for easier migration

• db2move
• IBM Migration Toolkit (generates scripts, unfortunately discontinued)
• homegrown export/loads
• IBM Database Conversion Workbench (DCW)
• HPU*
• InfoSphere CDC*

* extra license required


Tripwires to avoid (1|3)

• SQL3125W during LOAD


• Occurs for every truncated value
• Hard to find as LOAD message is not very informative
• Reasons:
• Character expansion higher than anticipated (or forgotten expansion)
• Full length strings from Export will throw this message on LOAD just in case
• IDENTITYOVERRIDE and Sequences
• Default LOAD options generate new values for identity columns → breaks referential integrity (RI)
• Don't forget to restart sequences: ALTER SEQUENCE x.y RESTART WITH


Tripwires to avoid (2|3): Beware of bad 3rd Party Applications
• The application stored Unicode strings in a non-Unicode database
• Workaround by the application developer: manual conversion in the application, storing the value as VARCHAR FOR BIT DATA plus a column with the CODEPAGE used
• Db2 Export assumed the base codepage, tried to convert and replaced chars
➔ Exported data is already corrupt
• Import might convert again and replace chars ➔ data corruption
• Solution:
• All tables with a codepage column were exported as VARCHAR FOR BIT DATA
• LOAD was also changed to VARCHAR FOR BIT DATA
• The application had a script to fix those records after the data migration
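
Why treating such columns as binary avoids the corruption can be sketched in Python. The example assumes, for illustration only, that the database code page is ISO-8859-1 and that the application had stored UTF-8 bytes in it.

```python
# What the application stored: UTF-8 bytes for "Äpfel".
original = "\u00c4pfel".encode("utf-8")

# Character column: the export believes the bytes are ISO-8859-1 and
# converts them to UTF-8 a second time.
converted = original.decode("iso-8859-1").encode("utf-8")

# FOR BIT DATA column: bytes pass through without any conversion.
binary = bytes(original)

print(original.hex().upper())
print(converted.hex().upper())
```

After the unwanted conversion the two bytes of 'Ä' have become four, and decoding the result as UTF-8 no longer yields the original word.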


Tripwires to avoid (3|3): Application changes in own code
• Choose the right function: bytewise or charwise
• Watch out for defaults: e.g. with CHAR()
In a non-Unicode database, the string units of the result are OCTETS. Otherwise, the string units of the result are
determined by the data type of the first argument.
• OCTETS, if the first argument is character string or a graphic string with string units of OCTETS,
CODEUNITS16, or double bytes.
• CODEUNITS32, if the first argument is character string or a graphic string with string units of CODEUNITS32.
• Determined by the default string unit of the environment, if the first argument is not a character string or a
graphic string.
In a Unicode database, when the output string is truncated part-way through a multiple-byte character:
• If the input was a character string, the partial character is replaced with one or more blanks
• If the input was a graphic string, the partial character is replaced by the empty string
Do not rely on either of these behaviors because they might change in a future release.
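
Since the truncation behaviour is not guaranteed, application code that cuts strings by byte length should avoid splitting a character itself. A minimal Python sketch of such a byte-aware cut (the helper `truncate_octets` is hypothetical and not a Db2 function):

```python
def truncate_octets(text: str, max_octets: int) -> str:
    """Cut text to at most max_octets UTF-8 bytes without splitting a character."""
    cut = text.encode("utf-8")[:max_octets]
    # errors="ignore" drops the incomplete sequence left at the end of the cut.
    return cut.decode("utf-8", errors="ignore")

word = "\u00c4pfel"   # "Äpfel": the first character needs two bytes

# A naive bytewise cut after one byte leaves half a character behind.
try:
    word.encode("utf-8")[:1].decode("utf-8")
    naive_cut_valid = True
except UnicodeDecodeError:
    naive_cut_valid = False

print(repr(truncate_octets(word, 1)), repr(truncate_octets(word, 3)))
```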


Functions in UTF-8 databases (1|4)

• The datatype CHAR is by default bytewise, but can also be specified with CODEUNITS16 or CODEUNITS32
For example, with CHAR(5 CODEUNITS32), the value 'abc' is handled in memory as 'abc' followed by two spaces to meet the logical
definition of the column. However, on disk [in UTF-8] it is stored as 'abc' followed by 17 spaces, since the Db2 Data
Management Services layer requires the column to meet the physical definition.

• Beware of NCHAR mapping


• Db2 10.5 and earlier map NCHAR by default to GRAPHIC_CU16
• Db2 11.1 and later map NCHAR by default to CHAR_CU32
• See also https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.admin.config.doc/doc/r0060937.html
• This can cause issues in MQTs, lots of rebinding and return values
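
The padding figures in the CHAR(5 CODEUNITS32) example above can be checked with a little arithmetic. This Python sketch is an illustration only; the function name is hypothetical.

```python
def physical_padding(value: str, length: int, bytes_per_unit: int = 4):
    """Return (physical size in bytes, padding bytes) for CHAR(length CODEUNITS32)."""
    physical = length * bytes_per_unit          # 5 CODEUNITS32 -> 20 bytes
    used = len(value.encode("utf-8"))           # bytes occupied by the value itself
    return physical, physical - used

logical_padding = 5 - len("abc")                # blanks needed to reach 5 characters
print(logical_padding, physical_padding("abc", 5))
```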


Functions in UTF-8 databases (2|4)

• Some functions can be augmented with a char size:


• INSTR, LOCATE, LOCATE_IN_STRING can have an additional param to specify
OCTETS, CODEUNITS16 or CODEUNITS32
• INSTRB, INSTR2, INSTR4 work on single bytes, double bytes or 4 bytes


Functions in UTF-8 databases (3|4)

• LENGTH, LENGTH2, LENGTH4, LENGTHB return the length of the string in the units of the respective representation (LENGTHB in bytes, LENGTH2 in 16-bit and LENGTH4 in 32-bit code units)
• Example with the Unicode string '&N~AB', where '&' is the musical symbol G clef character,
and '~' is the combining tilde character.
This string is shown in different Unicode encoding forms in the following example:

What is the result of:


SELECT LENGTH(UTF8_VAR, CODEUNITS16), LENGTH(UTF8_VAR, CODEUNITS32),
LENGTH(UTF8_VAR, OCTETS) FROM SYSIBM.SYSDUMMY1
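
The three lengths can be worked out without a database by encoding the same string in each form. The Python sketch below is an illustration; it builds the string from the G clef (U+1D11E), 'N', the combining tilde (U+0303), 'A' and 'B'.

```python
s = "\U0001D11E" + "N" + "\u0303" + "AB"

codeunits16 = len(s.encode("utf-16-be")) // 2   # G clef needs a surrogate pair
codeunits32 = len(s.encode("utf-32-be")) // 4   # one unit per code point
octets = len(s.encode("utf-8"))                 # 4 + 1 + 2 + 1 + 1 bytes

print(codeunits16, codeunits32, octets)
```

Note that the combining tilde counts as its own code point in every form, so none of the three values equals the number of symbols a reader would see.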

Functions in UTF-8 databases (4|4)

• CHARACTER_LENGTH in codeunits, OCTET_LENGTH in bytes


• POSITION (byte or char), POSSTR (bytewise only)
• LOWER/LCASE is either bytewise or needs more details for conversion and result type

• TRIM, LTRIM, RTRIM remove spaces in either codeunits


• TO_SINGLE_BYTE projects chars into single byte version

This list is not complete! See Knowledge Center for more…



Summary

• Code pages are not as complicated as they might look


• Converting a database to UTF-8 is not just an Export/Load, but
requires some planning, dry runs and maybe application awareness
• Parallel Export and Loads can dramatically reduce time needed
• Don't assume a function works char-wise, default is mostly octets.
• When selecting/updating data manually: Be aware of code page
conversion between your database and your shell (see Appendix)


Roland Schock
ARS Computer und Consulting GmbH
roland.schock@ars.de

Session code: A1

Please fill out your session evaluation before leaving!


Appendix

• Where can I define the code page used?


• What is code page conversion and where does it happen?
• What problems can arise and how can I avoid them?
• Performance considerations


Usage of a code page

• Code pages can be specified at different levels:


• At the operating system where the application runs
• At the operating system where the server runs
• At the operating system where the application is prepared/bound
• At the database level


Default code page

• By default, DB2 servers and clients use the local settings of the operating system or user:
• Windows: The server process uses the default region settings of the operating system.
• Linux/Unix: The codepage is derived from the locale setting of the instance user (i.e. the user running the database processes).
• Client (LUW): The current locale settings of the user determine the code page used during CONNECT.
• Programming language: Java always uses Unicode when connecting to a database via JDBC.


Specifying a code page: OS level

• Windows: Control Panel → Regional and Language settings, chcp command
• Linux/Unix: locale command


At prepare/bind time

• Special case during development of database software with static, embedded SQL.
• Embedded SQL needs a prepare phase before compilation of the source code.
• Later the prepared package needs to be bound to the database with the bind command.
• Both commands need a database connection, and at connect time the current locale setting is used.


Defining a database w/ code page

• Explicitly set the code page at creation time:


CREATE DB test USING CODESET codeset
TERRITORY territory COLLATE collatingseq
• Otherwise the current locale is used to determine the database codeset.
• The chosen code page cannot be changed later.
• In DB2 for iSeries and for z/OS you can also define single columns of a
table in a different code set (not detailed here).




Code page conversion

• If application and server use different code pages, code page conversion happens.
• Code page conversion is always done on the receiver's side:
• on the server's side for data sent from client to server
• on the client's side for data sent from server to client
• Exception: Importing IXF files generated on a different system with another code page
• If conversion tables are missing: SQLCODE -332


Client to server conversion

Client (uses code page X)          Server (uses code page Y)

§ Send data using code page X      § Receive data
                                   § Convert to code page Y
                                   § Process data
§ Receive data in Y                § Return result in code page Y
§ Convert to code page X

Using DB2 Connect

Client                     Gateway                    Server
(uses code page X)         (uses code page Y)         (uses code page Z)

§ Send data using          § Receive data
  code page X              § Convert to code page Y
                           § Send data in Y           § Receive data
                                                      § Convert to code page Z
                           § Receive data in Z        § Return result in code page Z
                           § Convert to Y
§ Receive data in Y        § Return result in Y
§ Convert to code page X

Other considerations

• Mapping of characters (injective):
If a character in the source code page is not contained in the target code page, it is replaced by a substitution character.
• Round trip conversion (bijective):
If no substitution needs to take place between source and target code pages, a round trip conversion does not lose information.
• Encoding/decoding can change the number of bytes needed to store the data.
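
All three effects can be reproduced with Python's codecs. The sketch below is an illustration with an arbitrary sample string; Python's substitution character for `errors="replace"` is '?', which may differ from the one a Db2 conversion table uses.

```python
text = "Preis: 5 \u20ac"                     # "Preis: 5 €"

# Substitution: ISO-8859-1 has no Euro symbol, so it is replaced.
lossy = text.encode("iso-8859-1", errors="replace")
after_lossy_round_trip = lossy.decode("iso-8859-1")

# Round trip without substitution: ISO-8859-15 contains every character used.
after_clean_round_trip = text.encode("iso-8859-15").decode("iso-8859-15")

# Byte count depends on the encoding.
sizes = (len(text.encode("iso-8859-15")), len(text.encode("utf-8")))

print(after_lossy_round_trip, after_clean_round_trip, sizes)
```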


More considerations

• Using different conversion tables and the € symbol:
Microsoft ANSI code page and the official code page 850 have a different code point for the Euro symbol. If needed, code conversion tables can be replaced (ref. Administration Guide, Planning).
• Unicode support:
DB2 supports the UCS-2 character set with UTF-8 and UCS-2 encoding
for Unicode databases
• For PureXML (V9.x) a UTF-8 database is needed.


More considerations

• To change the code page of a database, you have to use db2move (Export/Import). Backup/Restore cannot be used. So choosing the right database code page during database creation is crucial.
• Binary data (BLOB, FOR BIT DATA) is internally stored with code page
0, so no character conversion is applied.




Troubleshooting

• Identify used code pages:


• Display the codepage of a database:
db2 get db cfg for sample
• Displaying SQLCA area during CONNECT with CLP
When connecting to a database via CLP the option "-a" displays the SQLCA data
area, which shows the code page of the database and the connecting client.
• If connecting to iSeries or zSeries machines from DB2 LUW, check if
conversion tables are available.


Pitfalls

• Watch out for unintentional "conversions"


• All database communication partners are configured correctly, but the DBA is looking at the data via a console window, and the console window (or PuTTY) is using a font with the wrong codepage to display the data!
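
What such a display-only "conversion" looks like can be simulated in Python. The sketch assumes, purely for illustration, a terminal that interprets UTF-8 output as Windows-1252; the stored bytes are never changed.

```python
stored = "\u00c4pfel".encode("utf-8")      # correct UTF-8 bytes in the database

# The terminal decodes the bytes with the wrong code page.
shown = stored.decode("cp1252")

# The data itself is intact: decoding with the right code page works.
actual = stored.decode("utf-8")

print(shown, "vs.", actual)
```

Before "repairing" data that looks wrong on screen, it is therefore worth checking the bytes, for example with HEX(), rather than trusting the console.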


db2set DB2CODEPAGE

• Know what you intend to do, if you use the DB2 environment variable
DB2CODEPAGE
• It tells DB2, that you will feed it with the right code points, regardless
of the displayed symbols.

• See Technote "Setting DB2CODEPAGE=1208 may result in incorrect character data insertion"
SQL0191N Error occurred because of a fragmented MBCS character.
http://www.ibm.com/support/docview.wss?uid=swg21601028


db2set DB2CONSOLECP

• Intended to allow DB2 CLI to use different codepages for output:

• Multiple APARs for DB2 9.1, 9.5, 9.7:
"DB2CONSOLECP environment variable has no effect on DB2 message text or is not working"

DB2 Special Registers for NLS

• Change message text for DB2 Monreport modules:


db2 "SET CURRENT LOCALE LC_MESSAGES = 'de_DE'"
db2 "call monreport.lockwait"
• Change message names for Time/Dates:
db2 "SET CURRENT LOCALE LC_TIME = 'fr_FR'"
db2 "values monthname(current date)"
(Works with DAYNAME, MONTHNAME, NEXT_DAY, ROUND, ROUND_TIMESTAMP,
TIMESTAMP_FORMAT, TRUNCATE, TRUNC_TIMESTAMP and VARCHAR_FORMAT)


Performance considerations

• Try to avoid unnecessary conversions.


• Create databases already with the code page needed for your
applications.
• For international databases prefer UTF-8, especially when used with
Java programs.
• Remember: Conversion takes time.


Links

• IBM developerWorks white paper:
http://www.ibm.com/developerworks/db2/library/techarticle/dm-0506chong/index.html
• DB2 Knowledge Center, Db2 11.1, Multicultural Support
https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.admin.nls.doc/doc/c0006848.html
• Unicode
http://www.unicode.org
• UTF-8 article at Wikipedia
http://en.wikipedia.org/wiki/UTF-8
