
DB2 UDB Performance Troubleshooting.

Finding CPU bottlenecks.


Albert Grankin, Senior Technical Staff Member,
Support Architect

DB2 UDB Customer Tech Fair, July 2010, Seoul, Korea.


Introduction.

Performance problems are usually among the hardest to diagnose:

– Iterative in nature, like “peeling an onion” ;
– Not easy to characterize ;
– Not easy to collect data for ;

Often the underlying issue is a resource bottleneck:

– CPU usage bottleneck ;
– Memory bottleneck ( exhaustion ) ;
– I/O bottleneck ;
– Lock/latch contentions ;
– Etc.

In this presentation we will focus on CPU contention (bottleneck) ;



2 Finding CPU Bottlenecks


Please ask questions.



Collecting Baseline Data – OS info.

Collecting baseline data is an important step in any performance problem determination (PD):

What to collect?
– Basic OS info:
• vmstat ;
• ps -elf ;
• svmon -G (AIX) ;
• iostat ;

What does a normal CPU profile look like?

How high are the run queue, memory utilization, etc. under normal load?



Collecting Baseline Data – DB2 Data.

Basic DB2 data:

– Database manager (dbm) snapshot ;
– Database snapshot ;
– Dynamic SQL snapshot ;
– db2pd -everything ;
– Application snapshots ;



CPU Usage – terminology.

• CPU time:
– User ;
– System ;
• Time slice ;
• Queues:
– Run ;
– Blocked ;
• Processes ;
• Kernel threads ;
• User threads ;



What is a CPU Bottleneck?

Symptoms of CPU contention:

– Database is running slow ;
– OS commands run slow too ;
– OS/database appears to hang ;

Signs of a CPU bottleneck:

– CPU utilization is consistently at 100% (usr+sys) ;
– Run queue is much higher than normal ;

Which tool can be used to find that?



OS Tools – vmstat.

Good overall tool for CPU and memory usage statistics ;

Shows CPU, I/O, and memory bottleneck information, if present ;

Can be run continuously, e.g. vmstat 1 10 (one sample per second, 10 samples) ;



Vmstat – output example.



OS Tools – vmstat, important columns.

Run queue (1st column, r) – A very high number may indicate a CPU bottleneck/starvation ;

Blocked queue (2nd column, b) – Any nonzero value means that processes are waiting for a resource ;

avm (swpd on Linux) – Virtual memory in use ; on AIX, avm counts active virtual memory pages ;

fre (free list) – How much memory is free ;

CPU
– %usr – percentage of CPU time spent in user code ;
– %sys – percentage of CPU time spent in kernel code ;
– %wio – percentage of CPU time spent waiting on disk/NFS I/O requests ;
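The column rules above can be checked mechanically. The following is a minimal sketch, assuming vmstat text whose column-name row contains `r`, `b`, `us` and `sy`. The sample numbers are made up for illustration, and the thresholds (95% busy, more than 2 runnable threads per CPU) are assumptions to be tuned against your own baseline.

```python
# Illustrative AIX-style vmstat output (made-up numbers, not from the slides).
SAMPLE = """\
kthr    memory              page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 2  0 250000 9000   0   0   0   0    0   0 300 4000 900 35 10 50  5
45  3 260000 1200   0   2   8   0    0   0 900 9000 4000 40 60  0  0
"""


def parse_vmstat(text):
    """Return one dict per data row, keyed by the names in the column row."""
    rows, header = [], None
    for line in text.splitlines():
        fields = line.split()
        if "r" in fields and "b" in fields:
            header = fields  # the column-name row
            continue
        if header is None or len(fields) != len(header):
            continue  # group titles, separators, blank lines
        try:
            values = [float(f) for f in fields]
        except ValueError:
            continue
        # AIX prints "sy" twice (system calls, then CPU %sys); the later
        # column wins in the dict, which is the CPU one.
        rows.append(dict(zip(header, values)))
    return rows


def is_cpu_bound(row, ncpus, busy_pct=95.0, queue_per_cpu=2.0):
    """Assumed rule of thumb: CPU nearly saturated AND a long run queue."""
    busy = row["us"] + row["sy"]
    return busy >= busy_pct and row["r"] > queue_per_cpu * ncpus


flags = [is_cpu_bound(r, ncpus=8) for r in parse_vmstat(SAMPLE)]
```

In the sample, only the second row is flagged: usr+sys is 100% and the run queue (45) is well above two threads per CPU on an assumed 8-CPU machine.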


Vmstat, more columns.

pi and po – Page-in and page-out columns ;

Some other vmstat options:
– On AIX, vmstat -I collects file page I/O info as well ;
– vmstat -P all 1 10 collects paging information for all page sizes ;
– On Solaris, vmstat -p collects the paging info ;
– On Linux, vmstat -a provides more info on the memory status ;
– The vmstat -s option, available on all platforms, provides summary info since startup and can be checked to find out whether paging etc. happened earlier ;



How do we find who is consuming most of the CPU?

ps command.
– Prints the processes currently running on the machine ;
– Shows CPU time accumulated by each process ;
– Useful to determine whether the DB2 engine is the highest CPU consumer ;
– In DPF environments, shows which partition consumes the most CPU ;

db2pd -edus command.
– Prints the threads/EDUs (Engine Dispatchable Units) currently running in the DB2 server ;
– Shows CPU time accumulated by each thread ;
– Useful to determine which thread inside the DB2 engine consumes the most CPU ;



PS Command Output – Example.

ps aux
– Run it several times, at an interval, to see the deltas in accumulated CPU time ;
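Comparing two ps samples by hand is tedious, so here is a minimal sketch of the delta calculation. It assumes the cumulative TIME values have already been pulled out of each sample into a dict keyed by PID, and that they use the `[[DD-]HH:]MM:SS` form; the PIDs and times are made up.

```python
def cpu_seconds(time_field):
    """Convert a ps TIME value such as '12:34', '1:02:03' or '2-01:02:03'
    to seconds. Assumes whole seconds (no fractional part)."""
    days, _, rest = time_field.rpartition("-")
    seconds = 0
    for part in rest.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds + (int(days) * 86400 if days else 0)


def top_consumers(before, after, limit=5):
    """Rank PIDs present in both samples by CPU seconds used in between."""
    deltas = [
        (pid, cpu_seconds(after[pid]) - cpu_seconds(before[pid]))
        for pid in after
        if pid in before
    ]
    return sorted(deltas, key=lambda d: d[1], reverse=True)[:limit]


# Hypothetical samples taken some minutes apart: {pid: TIME column}.
first = {101: "10:00", 202: "1:00:00", 303: "0:05"}
second = {101: "10:30", 202: "1:05:00", 404: "0:01"}
ranking = top_consumers(first, second)
```

Processes that started or ended between the two samples are skipped, since no delta can be computed for them.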



db2pd -edus output.
db2pd -edus, 2 iterations, 7 minutes apart.
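The two iterations are useful mainly as a difference. As a rough sketch, assuming the USR and SYS seconds for each EDU have already been extracted from both db2pd -edus outputs into dicts (the EDU IDs and values below are invented), the per-EDU share of one CPU over the interval can be computed like this:

```python
def edu_cpu_deltas(first, second, interval_s):
    """Return (edu_id, usr_delta, sys_delta, pct_of_one_cpu) tuples,
    busiest first. Inputs map EDU ID -> (usr_seconds, sys_seconds)."""
    result = []
    for edu_id, (usr2, sys2) in second.items():
        if edu_id not in first:
            continue  # EDU was created after the first iteration
        usr1, sys1 = first[edu_id]
        d_usr, d_sys = usr2 - usr1, sys2 - sys1
        pct = 100.0 * (d_usr + d_sys) / interval_s
        result.append((edu_id, d_usr, d_sys, pct))
    return sorted(result, key=lambda r: r[3], reverse=True)


# Hypothetical values for two iterations taken 7 minutes (420 s) apart.
iteration_1 = {10: (100.0, 10.0), 11: (5.0, 1.0)}
iteration_2 = {10: (310.0, 10.0), 11: (6.0, 1.0)}
busiest = edu_cpu_deltas(iteration_1, iteration_2, interval_s=420)
```

An EDU that accumulates close to the full interval in CPU seconds kept one CPU busy almost the whole time and is the first candidate to map back to an application.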



What is the top CPU-consuming EDU executing?

Method 1:
– Take an application snapshot and look for the EDU in the output ;
– This gives the application ID, the currently executing statement, row statistics, etc. ;
– Snapshots can be very slow under CPU bottleneck conditions ;

Method 2:
– Run db2pd -agents and grep for the EDU ID identified from db2pd -edus ;
– This gives the application handle that the agent belongs to ;
– db2pd -applications and db2pd -dynamic then give the currently executing statement ;
OR
– db2pd -apinfo <apphdl> (new option) gives the same info, similar to an application snapshot ;



Mapping EDU ID to Application and SQL it is executing.

EDU ID from 'db2pd -edus' output matches AgentEDUID and CoorEDUID ;

Anchor ID and statement UID are the keys into the dynamic SQL output ;



db2pd -apinfo example



Intermittent CPU Contentions.

Problem statement:
“CPU spikes happen several times a day, at different times and last
only a few minutes…”

Discussion:
How do we collect diagnostic data?



Threshold Based Data Collection – watchCPU.pl.

Perl script developed by the APD team ;

Continuously monitors vmstat output ;

Fires up diagnostic data collection when a specified threshold is violated ;

Can continuously collect some basic data to show the events leading up to the CPU spike ;
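To make the idea of a threshold trigger concrete, here is a minimal sketch of one possible rule. It is not the logic of watchCPU.pl itself, whose internals are not shown here. The function name, the tuple layout and the "N consecutive samples" confirmation rule are all assumptions made for illustration.

```python
def first_trigger(samples, max_run_queue, min_idle_pct, confirm=1):
    """Return the index of the sample at which collection would fire,
    or None if the threshold is never confirmed.

    samples: iterable of (run_queue, idle_pct) pairs, one per vmstat line.
    confirm: number of consecutive breaching samples required, so that a
             single one-second blip does not start data collection.
    """
    streak = 0
    for index, (run_queue, idle_pct) in enumerate(samples):
        if run_queue > max_run_queue and idle_pct < min_idle_pct:
            streak += 1
            if streak >= confirm:
                return index
        else:
            streak = 0  # breach was not sustained; start over
    return None


# Hypothetical one-second samples: a short blip, then a sustained spike.
observed = [(5, 80), (150, 5), (4, 70), (120, 3), (130, 2), (140, 1)]
fired_at = first_trigger(observed, max_run_queue=100, min_idle_pct=10, confirm=3)
```

With `confirm=3` the blip at index 1 is ignored and the trigger fires at index 5, the third consecutive breaching sample.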



WatchCPU.pl

Separate options for different thresholds:
– -s <value> or -u <value> triggers when CPU % goes above the specified user or system CPU value ;
– -qrun <value> triggers when the run queue goes above the specified value ;

-n <script_name> specifies the name of the script that is triggered to collect data ;

Many other options exist:
– watchCPU.pl -h for help



watchCPU.pl – command examples.

watchCPU.pl -i 10 -qrun 100 -t 10 -n collectSpikeData.ksh
– Trigger data collection when the run queue is greater than 100 and %idle is less than 10%. Watch for this trigger only once, and wait at least 10 seconds before confirming that the threshold has been breached.

watchCPU.pl -i 15 -r 3 -n collect.ksh -args sample -fork -fileos cmdfile -d 10
– Trigger data collection when %idle is less than 15%. Call the script collect.ksh and pass the database name (sample) as an argument. Collect data for the trigger 3 times. Also collect some DB2 and OS data on an ongoing basis, every 10 minutes.

A copy of the watchCPU.pl script is provided with this presentation ;
– A template data collection script is included: collectDataTemplateForWatchCPU.ksh



Discussion – what if top CPU consumer can’t be easily
identified?



Finding the most expensive SQL – Dynamic SQL snapshot.

Provides historical info on statements that ran on the system ;

Data is retrieved from the dynamic statement cache ;

Has cumulative statistics on SQL preparation times, user and system CPU time accumulated, etc. ;

Needs the statement monitor switch turned on ;



Dynamic SQL Snapshot Example.



Using Snapshot Table Functions for dynamic SQL snapshots.

Find most executed SQL:

SELECT NUM_EXECUTIONS, TOTAL_EXEC_TIME, SUBSTR(STMT_TEXT,1,80) AS SQL_TEXT
FROM TABLE( SNAPSHOT_DYN_SQL( 'SAMPLE', -1 )) AS DYNSNAP
ORDER BY 1 DESC
FETCH FIRST 20 ROWS ONLY ;

Find longest running queries:

-- Skip statements that were never executed (avoids division by zero)
SELECT TOTAL_EXEC_TIME/NUM_EXECUTIONS AS AVG_EXEC_TIME,
SUBSTR(STMT_TEXT,1,80) AS SQL_TEXT
FROM TABLE( SNAPSHOT_DYN_SQL( 'SAMPLE', -1 )) AS DYNSNAP
WHERE NUM_EXECUTIONS > 0
ORDER BY 1 DESC
FETCH FIRST 20 ROWS ONLY ;

Find most expensive queries:

SELECT TOTAL_USR_CPU_TIME/NUM_EXECUTIONS AS U_CPU,
TOTAL_SYS_CPU_TIME/NUM_EXECUTIONS AS S_CPU,
SUBSTR(STMT_TEXT,1,80) AS SQL_TEXT
FROM TABLE( SNAPSHOT_DYN_SQL( 'SAMPLE', -1 )) AS DYNSNAP
WHERE NUM_EXECUTIONS > 0
ORDER BY 1 DESC, 2 DESC
FETCH FIRST 20 ROWS ONLY ;


Some Best Practices on Reducing CPU Usage by Applications.

Use SQL parameter markers: prepare once, execute many times ;
– SQL compilation may be very CPU intensive ;
– DB2 9.7 has the statement concentrator ;

Use application server connection pooling to maintain persistent connections to the database ;
– Connect/disconnect is very CPU intensive ;

Avoid short-lived statements in a multi-partitioned environment:
– Massively parallel singleton inserts/selects/updates ;

In an MPP environment, make sure that queries are collocated as much as possible ;
– Join on partitioning keys ;
– Have uniform partitioning keys for different tables ;
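The first practice, parameter markers, is easy to show in code. The sketch below uses Python's built-in sqlite3 module purely as a stand-in so that it runs anywhere; with a DB2 driver the idea is the same, but the driver calls will differ. The table and data are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (id INTEGER PRIMARY KEY, symbol TEXT, qty INTEGER)")

# One statement text with markers, executed once per row. The statement
# text never changes, so it only needs to be compiled once.
rows = [(1, "IBM", 100), (2, "IBM", 250), (3, "ORCL", 75)]
conn.executemany("INSERT INTO trades (id, symbol, qty) VALUES (?, ?, ?)", rows)
conn.commit()

# Anti-pattern for comparison (not executed): building a new statement text
# per value, e.g. "... WHERE symbol = 'IBM'", forces a fresh compilation for
# every distinct literal.
QUERY = "SELECT SUM(qty) FROM trades WHERE symbol = ?"
totals = {sym: conn.execute(QUERY, (sym,)).fetchone()[0] for sym in ("IBM", "ORCL")}
```

Because `QUERY` is the same string on every call, a statement cache keyed on statement text can match it each time instead of compiling a new statement per literal value.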


Case Study – CPU Contention During Connection Spike.

Problem Statement
– Large brokerage system at a bank ;
– Application server maintains persistent DB connections ;
– Database performance degrades dramatically during active trading ;
– Problem goes away if the DB2 instance is restarted ;
– Happens 2-3 times a day ;



Case Study - What Changed?

New brokerage application version was rolled out ;

Customer claims nothing is wrong with the application code ;

No configuration changes on the DB server ;

No OS configuration changes ;



Case Study: Planning Data Collection Strategy.

Start collecting data long before the problem happens.

Every 5 minutes collect:
– vmstat 1 20 ;
– ps -elf ;
– Database manager snapshot ;
– Database snapshot ;
– Application snapshot ;
– Dynamic SQL snapshot ;

When the problem starts, collect stacks for all DB2 processes, 2 iterations, 1 minute apart:
– db2pd -stacks all ; sleep 60 ; db2pd -stacks all
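The 5-minute collection can be scripted in many ways. Below is one minimal sketch, not the script used in the case study. It assumes the DB2 instance environment is already set up in the shell so that `db2 get snapshot ...` works, and the function names, file naming scheme and output directory are made up for illustration.

```python
import subprocess
import time
from pathlib import Path


def baseline_commands(dbname):
    """Commands from the list above, keyed by a short output-file label."""
    return {
        "vmstat": "vmstat 1 20",
        "ps": "ps -elf",
        "dbm_snapshot": "db2 get snapshot for database manager",
        "db_snapshot": f"db2 get snapshot for database on {dbname}",
        "app_snapshot": f"db2 get snapshot for applications on {dbname}",
        "dynsql_snapshot": f"db2 get snapshot for dynamic sql on {dbname}",
    }


def collect_once(commands, out_dir, run=subprocess.run):
    """Run every command once, each into its own timestamped file."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for label, command in commands.items():
        path = out_dir / f"{label}.{stamp}.out"
        with open(path, "w") as handle:
            run(command, shell=True, stdout=handle, stderr=subprocess.STDOUT)
        written.append(path)
    return written


def collect_forever(dbname, out_dir, interval_s=300):
    """Collect every interval_s seconds (5 minutes by default) until killed."""
    while True:
        collect_once(baseline_commands(dbname), out_dir)
        time.sleep(interval_s)
```

A call such as `collect_forever("SAMPLE", "/tmp/baseline")` would then run in the background well before the expected problem window. The `run` parameter exists only so the command runner can be swapped out when trying the script without DB2 installed.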



Case Study: Vmstat output.

Question: Is there something wrong in this output?




Case Study: Investigation.

Vmstat output shows a significant run queue size ;

CPU usage is at 100% in total, with %sys far above %usr:
– Processes spend too much time in the kernel ;
– This is usually a sign of some resource contention in DB2 ;



Case Study: Looking at number of connections.

Database snapshot:



Case Study: Investigation.

Number of applications connected to the database steadily climbs to over 1,700, after which the instance becomes unusable ;

db2pd -edus confirms the same: the number of agents grows ;

Application server should be maintaining persistent connections ;

Customer claims that the number of front-end users stays the same, even though the number of DB2 connections is increasing ;

What could be wrong here?



Case Study: Stacks Analysis.



Case Study: Investigation – Application Snapshot Snippet.



Case Study: Investigation

This application appears to have been idle for 9 minutes 41 seconds ;

The agent is waiting to receive more data from the client ;

The transaction is still open (the UOW stop timestamp is empty) ;

There were several hundred connections like this ;

Question: How does the application server decide whether a persistent connection is free to be used for a new request?



Case Study: Resolution.

Customer's new application version added new logic to a stored procedure that was called from all applications ;

Under certain conditions this SP would raise a user-defined exception and return it to the client application ;

Exception handling code on the client side would close the cursor but would skip committing or rolling back the transaction ;

Client application would consider request processing finished, but the transaction was still open, so the application server's connection to the database was not free ;

Application server had to spawn new connections to process new requests ;
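The fix on the client side is to end the transaction on every path, including the error path. The customer's actual code is not shown in this presentation, so the following is only a sketch of the pattern. It uses Python's built-in sqlite3 module as a stand-in for a DB2 driver, and the table, the `BusinessRuleError` class and the 1000 limit are all invented.

```python
import sqlite3


class BusinessRuleError(Exception):
    """Stand-in for the user-defined exception raised by the stored procedure."""


def make_demo_connection():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO accounts VALUES (1, 2000)")
    conn.commit()
    return conn


def process_request(conn, account, amount):
    cursor = conn.cursor()
    try:
        cursor.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, account),
        )
        if amount > 1000:  # stands in for the condition checked inside the SP
            raise BusinessRuleError("limit exceeded")
        conn.commit()
    except Exception:
        # The step that was missing in the case study: end the unit of work
        # before reporting the error, so the pooled connection is free again.
        conn.rollback()
        raise
    finally:
        cursor.close()  # closing the cursor alone does not end the transaction
```

After a failed request the connection has no open transaction, so a pool that checks for an open unit of work can hand it to the next request instead of opening a new connection.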




Any questions?
