
Database Server Health Check

Josh Berkus
PostgreSQL Experts Inc.
pgCon 2010

[Image: booth sign reading "DATABASE SERVER HELP 5¢"]
Program of Treatment
● What is a Healthy Database?
● Know Your Application
● Load Testing
● Doing a database server checkup
● hardware
● OS & FS
● PostgreSQL
● application
● Common Ailments of the Database Server
What is a Healthy Database
Server?

● Response Times
● lower than required
● consistent & predictable
● Capacity for more
● CPU and I/O headroom
● low server load
[Chart: median and max response time vs. number of clients (25-250), with the expected load marked]
What is an Unhealthy Database
Server?
● Slow response times
● Inconsistent response times
● High server load
● No capacity for growth
[Chart: median and max response time vs. number of clients (25-250), with the expected load marked]
A healthy database server is
able to maintain consistent
and acceptable response times
under expected loads with
margin for error.
[Chart: median response time vs. number of clients (25-250)]
Hitting The Wall
CPUs Floored

Average: CPU   %user  %system  %iowait  %idle
Average: all   69.36    0.13    24.87    5.77
Average: 0     88.96    0.09    10.03    1.11
Average: 1     12.09    0.02    86.98    0.00
Average: 2     98.90    0.00     0.00   10.10
Average: 3     77.52    0.44     1.70   20.34

16:38:29 up 13 days, 22:10, 3 users,
load average: 11.05, 9.08, 8.13
IO Saturated

Device:   tps      MB_read/s  MB_wrtn/s
sde       414.33        0.40      38.15
sdf      1452.00       99.14      29.00

Average: CPU   %user  %system  %iowait  %idle
Average: all   34.75    0.13    58.75    6.37
Average: 0      8.96    0.09    90.03    1.11
Average: 1     12.09    0.02    86.98    0.00
Average: 2     91.90    0.00     7.00   10.10
Average: 3     27.52    0.44    51.70   20.34
Out of Connections

FATAL: connection limit exceeded for non-superusers
How close are you
to the wall?
The Checkup
(full physical)
1. Analyze application
2. Analyze platform
3. Correct anything obviously wrong
4. Set up load test
5. Monitor load test
6. Analyze Results
7. Correct issues
The Checkup
(semi-annual)
1. Check response times
2. Check system load
3. Check previous issues
4. Check for Signs of Illness
5. Fix new issues
Know
your
application!
Application database usage
Which of these does your application do?
✔ small reads
✔ large sequential reads
✔ small writes
✔ large writes
✔ long-running procedures/transactions
✔ bulk loads and/or ETL
What Color Is My Application?
W ● Web Application (Web)
● DB smaller than RAM
● 90% or more simple queries
O ● Online Transaction Processing (OLTP)
● DB slightly larger than RAM to 1TB
● 20-40% small data write queries
● Some long transactions and complex read queries
D ● Data Warehousing (DW)
● Large to huge databases (100GB to 100TB)
● Large complex reporting queries
● Large bulk loads of data
● Also called "Decision Support" or "Business Intelligence"
What Color Is My Application?
W ● Web Application (Web)
● CPU-bound
● Ailments: idle connections/transactions, too many queries
O ● Online Transaction Processing (OLTP)
● CPU or I/O bound
● Ailments: locks, database growth, idle transactions,
database bloat
D ● Data Warehousing (DW)
● I/O or RAM bound
● Ailments: database growth, longer running queries,
memory usage growth
Special features required?
● GIS
● heavy CPU for GIS functions
● lots of RAM for GIS indexes
● TSearch
● lots of RAM for indexes
● slow response time on writes
● SSL
● response time lag on connections
Load
Testing
[Chart: requests per second (0-80) over a 24-hour day]
[Chart: requests per second (0-80) over a 24-hour day, with a DOWNTIME period marked]
When preventing downtime, it is not average load that matters, but peak load.
What to load test
● Load should be as similar as possible to your
production traffic
● You should be able to create your target level of
traffic
● better: incremental increases
● Test the whole application as well
● the database server may not be your weak point
How to Load Test
1. Set up a load testing tool
you'll need test servers for this*
2. Turn on PostgreSQL, HW, application
monitoring
all monitoring should start at the same time
3. Run the test for a defined time
1 hour is usually good
4. Collect and analyze data
5. Re-run at higher level of traffic
Test Servers
● Must be as close as reasonable to production
servers
● otherwise you don't know how production will be
different
● there is no predictable multiplier
● Double them up as your development/staging
or failover servers
● If your test server is much smaller, then you
need to do a same-load comparison
Tools for Load Testing
Production Test
1. Determine the peak load hour on the production servers
2. Turn on lots of monitoring during that peak load hour
3. Analyze results

Pretty much your only choice without a test server.
Issues with Production Test
● Not repeatable
−load won't be exactly the same ever again
● Cannot test target load
−just whatever happens to occur during that hour
−can't test incremental increases either
● Monitoring may hurt production performance
● Cannot test experimental changes
The Ad-Hoc Test

● Get 10 to 50 coworkers to open several sessions each
● Have them go crazy using the application
Problems with Ad-Hoc Testing
● Not repeatable
● minor changes in response times may be due to
changes in worker activity
● Labor intensive
● each test run shuts down the office
● Can't reach target levels of load
● unless you have a lot of coworkers
Siege
● HTTP traffic generator
● all test interfaces must be addressable as URLs
● useless for non-web applications
● Simple to use
● create a simple load test in a few hours
● Tests the whole web application
● cannot test database separately
● http://www.joedog.org/index/siege-home
pgReplay
● Replays your activity logs at variable speed
● get exactly the traffic you get in production
● Good for testing just the database server
● Can take time to set up
● need database snapshot, collect activity logs
● must already have production traffic

● http://pgreplay.projects.postgresql.org/
tsung
● Generic load generator in erlang
● a load testing kit rather than a tool
● Generate a tsung file from your activity logs using
pgFouine and test the database
● Generate load for a web application using custom
scripts
● Can be time consuming to set up
● but highly configurable and advanced
● very scalable - cluster of load testing clients

● http://tsung.erlang-projects.org/
pgBench
● Simple micro-benchmark
● not like any real application
● Version 9.0 adds multi-threading, customization
● write custom pgBench scripts
● run against real database
● Fairly ad-hoc compared to other tools
● but easy to set up

● ships with PostgreSQL


Benchmarks
● Many “real” benchmarks available
● DBT2, EAstress, CrashMe, DBT5, DBMonster, etc.
● Useful for testing your hardware
● not useful for testing your application
● Often time-consuming and complex
Platform-specific
● Web framework or platform tests
● Rails: ActionController::PerformanceTest
● J2EE: OpenDemand, Grinder, many more
– JBoss, BEA have their own tools
● Zend Framework Performance Test
● Useful for testing specific application
performance
● such as performance of specific features, modules
● Not all platforms have them
Flight-Check

● Attend the tutorial tomorrow!


monitoring PostgreSQL during
load test
logging_collector = on
log_destination = 'csvlog'
log_filename = 'load_test_1_%h'
log_rotation_age = 60min
log_rotation_size = 1GB

log_min_duration_statement = 0
log_connections = on
log_disconnections = on
log_temp_files = 100kB
log_lock_waits = on
monitoring hardware during
load test
sar -A -o load_test_1.sar 30 240

iostat
or fsstat
or zpool iostat (on ZFS)
monitoring application during
load test
● Collect response times
● with timestamp
● with activity
● Monitor hardware and utilization
● activity
● memory & CPU usage
● Record errors & timeouts
Checking Hardware
Checking Hardware
● CPUs and Cores
● RAM
● I/O & disk support
● Network
CPUs and Cores
● Pretty simple:
● number
● type
● speed
● L1/L2 cache
● Rules of thumb:
● fewer faster CPUs are usually better than more slower ones
● core != cpu
● thread != core
● virtual core != core
CPU calculations
● ½ to 1 core for OS
● ½ to 1 core for software raid or ZFS
● 1 core for postmaster and bgwriter
● 1 core per:
● DW: 1 to 3 concurrent users
● OLTP: 10 to 50 concurrent users
● Web: 100 to 1000 concurrent users
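A worked example with hypothetical numbers: an OLTP application with 100 concurrent users on software RAID needs roughly 1 (OS) + 1 (RAID) + 1 (postmaster & bgwriter) + 4 (100 users ÷ ~25 users/core) ≈ 7 cores, so an 8-core server leaves a little headroom.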
CPU tools
● sar
● mpstat

● pgTop
in praise of sar
● collects data about all aspects of HW usage
● available on most OSes
● but output is slightly different
● easiest tool for collecting basic information
● often enough for server-checking purposes
● BUT: does not report all data on all platforms
sar
CPUs: sar -P ALL and sar -u
Memory: sar -r and sar -R
I/O: sar -b and sar -d
network: sar -n
sar CPU output
Linux
06:05:01 AM CPU %user %nice %system %iowait %steal %idle
06:15:01 AM all 14.26 0.09 6.01 1.32 0.00 78.32
06:15:01 AM 0 14.26 0.09 6.01 1.32 0.00 78.32

Solaris
15:08:56 %usr %sys %wio %idle
15:09:26 10 5 0 85
15:09:56 9 7 0 84
15:10:26 15 6 0 80
15:10:56 14 7 0 79
15:11:26 15 5 0 80
15:11:56 14 5 0 81
Memory
● Only one statistic: how much?
● Not generally an issue on its own
● low memory can cause more I/O
● low memory can cause more CPU time
memory sizing

[Diagram: RAM divided among shared buffers, filesystem cache, work_mem and maintenance_work_mem; any given data page is in buffer, in cache, or on disk]
Figure out Memory Sizing
● What is the active portion of your database?
● i.e. gets queried frequently
● How large is it?
● Where does it fit into the size categories?
● How large is the inactive portion of your
database?
● how frequently does it get hit? (remember backups)
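One rough way to check whether the active portion fits in RAM is the shared-buffers hit ratio (a sketch; "misses" may still be served from the OS cache, so treat it as a lower bound):

select sum(heap_blks_hit)::float /
       nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0)
       as cache_hit_ratio
from pg_statio_user_tables;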
Memory Sizing
● Other needs for RAM – work_mem:
● sorts and aggregates: do you do a lot of big ones?
● GIN/GiST indexes: these can be huge
● hashes: for joins and aggregates
● VACUUM
I/O Considerations
● Throughput
● how fast can you get data off disk?

● Latency
● how long does it take to respond to requests?

● Seek Time
● how long does it take to find random disk pages?
I/O Considerations
● Throughput
● important for large databases
● important for bulk loads
● Latency
● huge effect on small writes & reads
● not so much on large scans
● Seek Time
● important for small writes & reads
● very important for index lookups
I/O Considerations
● Web
● concerned about read latency & seek time
● OLTP
● concerned about write latency & seek time
● DW/BI
● concerned about throughput & seek time
------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-
Size     K/sec %CP   K/sec %CP  K/sec %CP  K/sec %CP   K/sec %CP  /sec %CP
32096M   79553  99  240548  45  50646   5  72471  94  185634  10  1140   1

------Sequential Output------ --Sequential Input-- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Size     K/sec %CP   K/sec %CP  K/sec %CP  K/sec %CP   K/sec %CP  /sec %CP
24G                 260044  33  62110  17              89914  15  1167  25
Latency             6549ms      4882ms                 3395ms     107ms
Common I/O Types
● Software RAID & ZFS
● Hardware RAID Array
● NAS/SAN
● SSD
Hardware RAID Sanity Check
● RAID 1 / 10, not 5
● Battery-backed write cache?
● otherwise, turn write cache off
● SATA < SCSI/SAS
● about ½ real throughput
● Enough drives?
● 4-14 for OLTP application
● 8-48 for DW/BI
Sw RAID / ZFS Sanity Check
● Enough CPUs?
● will need one for the RAID
● Enough disks?
● same as hardware raid
● Extra configuration?
● caching
● block size
NAS/SAN Sanity Check
● Check latency!
● Check real throughput
● drivers often a problem
● Enough network bandwidth?
● multipath or fiber required to get HW RAID
performance
SSD Sanity Check
● 1 SSD = 4 Drives
● relative performance
● Check write cache configuration
● make sure data is safe
● Test real throughput, seek times
● drivers often a problem
● Research durability stats
IO Tools
● I/O Tests:
● dd test
● Bonnie++
● IOZone
● filebench
● EXPLAIN ANALYZE
● Monitoring Tools:
● sar
● mpstat iowait
● iostat
● on ZFS: fsstat, zpool iostat
Network
● Throughput
● not usually an issue, except:
– iSCSI / NAS / SAN
– ELT & Bulk Load Processes
● remember that gigabit is only 100MB/s!
● Latency
● real issue for Web / OLTP
● consider putting app ↔ database on private
network
Checkups for the Cloud
Just like real HW, except ...
● Low ceiling on #cpus, RAM
● Virtual Core < Real Core
● “CPU Stealing”
● last-generation hardware
● calculate 50% more cores
Cloud I/O Hell
● I/O tends to be very slow, erratic
● comparable to a USB thumb drive
● horrible latency, up to ½ second
● erratic, speeds go up and down
● RAID together several volumes on EBS
● use asynchronous commit
– or at least commit_siblings
#1 Cloud Rule

If your database
doesn't fit in RAM,
don't host it
on a public cloud
Checking Operating System
and Filesystem
OS Basics
● Use recent versions
● large performance, scaling improvements in Linux &
Solaris in last 2 years
● Check OS tuning advice for databases
● advice for Oracle is usually good for PostgreSQL
● Keep up with information about issues &
patches
● frequently specific releases have major issues
● especially check HW drivers
OS Basics
● Use Linux, BSD or Solaris!
● Windows has poor performance and weak
diagnostic tools
● OSX is optimized for desktop and has poor
hardware support
● AIX and HPUX require expertise just to install, and
lack tools
Filesystem Layout
● One array / one big pool
● Two arrays / partitions
● OS and transaction log
● Database
● Three arrays
● OS & stats file
● Transaction log
● Database
Linux Tuning
● XFS > Ext3 (but not that much)
● Ext3 Tuning: data=writeback,noatime,nodiratime
● XFS Tuning: noatime,nodiratime
– for transaction log: nobarrier
● “deadline” I/O scheduler
● Increase SHMMAX and SHMALL
● to ½ of RAM
● Cluster filesystems also a possibility
● OCFS, RHCFS
Solaris Tuning
● Use ZFS
● no advantage to UFS anymore
● mixed filesystems causes caching issues
● set recordsize
– 8K small databases
– 128K large databases
– check for throughput/latency issues
Solaris Tuning
● Set OS parameters via “projects”
● For all databases:
● project.max-shm-memory=(priv,12GB,deny)
● For high-connection databases:
● use libumem
● project.max-shm-ids=(priv,32768,deny)
● project.max-sem-ids=(priv,4096,deny)
● project.max-msg-ids=(priv,4096,deny)
FreeBSD Tuning
● ZFS: same as Solaris
● definite win for very large databases
● not so much for small databases
● Other tuning per docs
PostgreSQL Checkup
postgresql.conf: formulae

shared_buffers =
available RAM / 4
postgresql.conf: formulae

max_connections =
web: 100 to 200
OLTP: 50 to 100
DW/BI: 5 to 20

if you need more, use pooling!
postgresql.conf: formulae

Web/OLTP:
work_mem = Av.RAM * 2 / max_connections

DW/BI:
work_mem = Av.RAM / max_connections
postgresql.conf: formulae

Web/OLTP:
maintenance_work_mem = Av.RAM / 16

DW/BI:
maintenance_work_mem = Av.RAM / 8
postgresql.conf: formulae

autovacuum = on
DW/BI & bulk loads: autovacuum = off
autovacuum_max_workers = 1/2
postgresql.conf: formulae

checkpoint_segments =
web: 8 to 16
OLTP: 32 to 64
BI/DW: 128 to 256
postgresql.conf: formulae

wal_buffers = 8MB
effective_cache_size = Av.RAM * 0.75
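Applying the formulae above to a hypothetical 16GB web server (illustrative only, treating all 16GB as available):

shared_buffers = 16GB / 4 = 4GB
max_connections = 200
work_mem = 16GB * 2 / 200 ≈ 160MB
maintenance_work_mem = 16GB / 16 = 1GB
checkpoint_segments = 16
wal_buffers = 8MB
effective_cache_size = 16GB * 0.75 = 12GB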
How much recoverability do
you need?
● None:
● fsync=off
● full_page_writes=off
● consider using ramdrive
● Some Loss OK
● synchronous_commit = off
● wal_buffers = 16MB to 32MB
● Data integrity critical
● keep everything on
File Locations
● Database
● Transaction Log
● Activity Log
● Stats File
● Tablespaces?
Database Checks: Indexes
select relname, seq_scan, seq_tup_read,
pg_size_pretty(pg_relation_size(relid)) as size, coalesce(n_tup_ins,0)
+ coalesce(n_tup_upd,0) + coalesce(n_tup_del,0) as update_activity
from pg_stat_user_tables where seq_scan > 1000 and
pg_relation_size(relid) > 1000000 order by seq_scan desc limit 10;

 relname     | seq_scan | seq_tup_read | size    | update_activity
-------------+----------+--------------+---------+-----------------
 permissions |    12264 |        53703 | 2696 kB |             365
 users       |    11697 |       351635 | 17 MB   |             741
 test_set    |     9150 |  18492353300 | 275 MB  |           27643
 test_pool   |     5143 |   3141630847 | 212 MB  |           77755
Database Checks: Indexes
SELECT indexrelid::regclass as index , relid::regclass as
table FROM pg_stat_user_indexes JOIN pg_index USING
(indexrelid) WHERE idx_scan < 100 AND indisunique IS FALSE;

 index                  | table
------------------------+--------------
 acct_acctdom_idx       | accounts
 hitlist_acct_idx       | hitlist
 hitlist_number_idx     | hitlist
 custom_field_acct_idx  | custom_field
 user_log_accstrt_idx   | user_log
 user_log_idn_idx       | user_log
 user_log_feed_idx      | user_log
 user_log_inbdstart_idx | user_log
 user_log_lead_idx      | user_log
Database Checks: Large Tables

 relname           | total_size | table_size
-------------------+------------+------------
 operations_2008   | 9776 MB    | 3396 MB
 operations_2009   | 9399 MB    | 3855 MB
 request_by_second | 7387 MB    | 5254 MB
 request_archive   | 6975 MB    | 3349 MB
 events            | 92 MB      | 66 MB
 event_edits       | 82 MB      | 68 MB
 2009_ops_eoy      | 33 MB      | 19 MB
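The query behind this slide isn't shown; a minimal sketch that produces matching columns:

select relname,
       pg_size_pretty(pg_total_relation_size(relid)) as total_size,
       pg_size_pretty(pg_relation_size(relid)) as table_size
from pg_stat_user_tables
order by pg_total_relation_size(relid) desc
limit 10;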
Database Checks:
Heavily-Used Tables
select relname, pg_size_pretty(pg_relation_size(relid)) as size,
coalesce(n_tup_ins,0) + coalesce(n_tup_upd,0) +
coalesce(n_tup_del,0) as update_activity from pg_stat_user_tables
order by update_activity desc limit 10;

 relname             | size    | update_activity
---------------------+---------+-----------------
 session_log         | 344 GB  |         4811814
 feature             | 279 MB  |         1012565
 daily_feature       | 28 GB   |          984406
 cache_queue_2010_05 | 2578 MB |          981812
 user_log            | 30 GB   |          796043
 vendor_feed         | 29 GB   |          479392
 vendor_info         | 23 GB   |          348355
 error_log           | 239 MB  |          214376
 test_log            | 945 MB  |          185785
 settings            | 215 MB  |          117480
Database Unit Tests
● You need them!
● you will be changing database objects and rewriting
queries
● find bugs in testing … or in production
● Various tools
● pgTAP
● Framework-level tests
– Rails, Django, Catalyst, JBoss, etc.
Application Stack
Checkup
The Layer Cake

Application:       Queries, Transactions
Middleware:        Drivers, Connections, Caching
PostgreSQL:        Schema, Config
Operating System:  Filesystem, Kernel
Hardware:          Storage, RAM/CPU, Network
The Funnel

Application

Middleware

PostgreSQL

OS

HW
Check PostgreSQL Drivers
● Does the driver version match the PostgreSQL
version?
● Have you applied all updates?
● Are you using the best driver?
● There are several Python, C++ drivers
● Don't use ODBC if you can avoid it.
● Does the driver support cached plans & binary
data?
● If so, are they being used?
Check Caching
● Does the application use data caching?
● what kind?
● could it be used more?
● what is the cache invalidation strategy?
● is there protection from “cache refresh storms”?
● Does the application use HTTP caching?
● could it be used more?
Check Connection Pooling
● Is the application using connection pooling?
● all web applications should, and most OLTP
● external or built into the application server?
● Is it configured correctly?
● max. efficiency: transaction / statement mode
● make sure timeouts match
Check Query Design
● PostgreSQL does better with fewer, bigger
statements
● Check for common query mistakes
● joins in the application layer
● pulling too much data and discarding it
● huge OFFSETs
● unanchored text searches
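For example, a sketch of the huge-OFFSET mistake and a keyset fix (table and column names hypothetical):

-- slow: still reads and throws away the first 100000 rows
select * from items order by id limit 50 offset 100000;

-- better: seek directly past the last id already shown
select * from items where id > 100000 order by id limit 50;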
Check Transaction
Management
● Are transactions being used for loops?
● batches of inserts or updates can be 75% faster if
wrapped in a transaction
● Are transactions aborted properly?
● on error
● on timeout
● transactions being held open while non-database
activity runs
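A minimal illustration of the batching point (hypothetical table; one commit instead of one per row):

begin;
insert into event_log (message) values ('row 1');
insert into event_log (message) values ('row 2');
-- ... thousands more inserts ...
commit;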
Common Ailments
of the
Database Server
Check for them,
monitor for them
● ailments could throw off your response time
targets
● database could even “hit the wall”
● check for them during health check
● and during each checkup
● add daily/continuous monitors for them
● Nagios check_postgres.pl has checks for many of
these things
Database Growth
● Checkup:
● check both total database size and largest table(s)
size daily or weekly
● Symptoms:
● database grows faster than expected
● some tables grow continuously and rapidly
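A sketch of the total-size half of that check (the Large Tables query shown earlier covers the per-table half):

select pg_size_pretty(pg_database_size(current_database())) as db_size;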
Database Growth
● Caused By:
● faster than expected increase in usage
● “append forever” tables
● Database Bloat
● Leads to:
● slower seq scans and index scans
● swapping & temp files
● slower backups
Database Growth
● Treatment:
● check for Bloat
● find largest tables and make them smaller
– expire data
– partitioning
● horizontal scaling (if possible)
● get better storage & more RAM, sooner
Database Bloat
-[ RECORD 1 ]+-----
schemaname | public
tablename | user_log
tbloat | 3.4
wastedpages | 2356903
wastedbytes | 19307749376
wastedsize | 18 GB
iname | user_log_accttime_idx
ituples | 941451584
ipages | 9743581
iotta | 40130146
ibloat | 0.2
wastedipages | 0
wastedibytes | 0
wastedisize | 0 bytes
Database Bloat
● Caused by:
● Autovacuum not keeping up
– or not enough manual vacuum
– often on specific tables only
● FSM set wrong (before 8.4)
● Idle In Transaction
● Leads To:
● slow response times
● unpredictable response times
● heavy I/O
Database Bloat
● Treatment:
● make autovacuum more aggressive
– on specific tables with bloat
● fix FSM_relations/FSM_pages
● check when tables are getting vacuumed
● check for Idle In Transaction
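A sketch of the "when were tables last vacuumed" check; never-vacuumed tables sort first:

select relname, last_vacuum, last_autovacuum
from pg_stat_user_tables
order by greatest(last_vacuum, last_autovacuum) nulls first
limit 10;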
Memory Usage Growth

00:00:01 bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s
01:00:00       0       0     100       0       0     100       0       0
02:00:00       0       0     100       0       0     100       0       0
03:00:00       0       0     100       0       0     100       0       0
04:00:00       0       0     100       0       0     100       0       0

00:00:01 bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s
01:00:00    3788     115      98       0       0     100       0       0
02:00:00   21566     420      78       0       0     100       0       0
03:00:00  455721    1791      59       0       0     100       0       0
04:00:00     908       6      96       0       0     100       0       0
Memory Usage Growth
● Caused by:
● Database Growth or Bloat
● work_mem limit too high
● bad queries
● Leads To:
● database out of cache
– slow response times
● OOM Errors (OOM Killer)
Memory Usage Growth
● Treatment
● Look at ways to shrink queries, DB
– partitioning
– data expiration
● lower work_mem limit
● refactor bad queries
● Or just buy more RAM
Idle Connections

select datname, usename, count(*)
from pg_stat_activity
where current_query = '<IDLE>'
group by datname, usename;

 datname | usename | count
---------+---------+-------
 track   | www     |   318
Idle Connections
● Caused by:
● poor session management in application
● wrong connection pool settings
● Leads to:
● memory usage for connections
● slower response times
● out-of-connections at peak load
Idle Connections
● Treatment:
● refactor application
● reconfigure connection pool
– or add one
Idle In Transaction

select datname, usename,
       max(now() - xact_start) as max_time, count(*)
from pg_stat_activity
where current_query ~* '<IDLE> in transaction'
group by datname, usename;

 datname | usename | max_time      | count
---------+---------+---------------+-------
 track   | admin   | 00:00:00.0217 |     1
 track   | www     | 01:03:06.0709 |     7
Idle In Transaction
● Caused by:
● poor transaction control by application
● abandoned sessions not being terminated fast
enough
● Leads To:
● locking problems
● database bloat
● out of connections
Idle In Transaction
● Treatment
● refactor application
● change driver/ORM settings for transactions
● change session timeouts & keepalives on pool,
driver, database
Longer Running Queries
● Detection:
● log slow queries to PostgreSQL log
● do daily or weekly report (pgfouine)
● Symptoms:
● number of long-running queries in log increasing
● slowest queries getting slower
Longer Running Queries
● Caused by:
● database growth
● poorly-written queries
● wrong indexes
● out-of-date stats
● Leads to:
● out-of-CPU
● out-of-connections
Longer Running Queries
● Treatments:
● refactor queries
● update indexes
● make Autoanalyze more aggressive
● control database growth
Too Many Queries
● Caused By:
● joins in middleware
● not caching
● poll cycles without delays
● other application code issues
● Leads To:
● out-of-CPU
● out-of-connections
Too Many Queries
● Treatment:
● characterize queries using logging
● refactor application
Locking
● Detection:
● log_lock_waits
● scan activity log for deadlock warnings
● query pg_stat_activity and pg_locks
● Symptoms:
● deadlock error messages
● number and time of lock_waits getting larger
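A sketch of the pg_locks query (8.4/9.0-era column names procpid and current_query, matching the other queries in this deck):

select l.pid, l.locktype, l.mode, a.current_query
from pg_locks l
join pg_stat_activity a on a.procpid = l.pid
where not l.granted;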
Locking
● Caused by:
● long-running operations with exclusive locks
● inconsistent foreign key updates
● poorly planned runtime DDL
● Leads to:
● poor response times
● timeouts
● deadlock errors
Locking
● Treatment
● analyze locks
● refactor operations taking locks
– establish a canonical order of updates for long
transactions
– use pessimistic locks with NOWAIT
● rely on cascade for FK updates
– not on middleware code
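A sketch of a pessimistic lock that fails fast instead of queueing (table name hypothetical):

-- errors immediately if another session holds the row lock
select * from accounts where id = 42 for update nowait;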
Temp File Usage
● Detection:
● log_temp_files = 100kB
● scan logs for temp files weekly or daily
● Symptoms:
● temp file usage getting more frequent
● queries using temp files getting longer
Temp File Usage
● Caused by:
● Sorts, hashes & aggregates too big for work_mem
● Leads to:
● slow response times
● timeouts
Temp File Usage
● Treatment
● find swapping queries via logs
● set work_mem higher for that ROLE, or
● refactor them to need less memory, or
● buy more RAM
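A sketch of the per-ROLE work_mem treatment (role name hypothetical):

alter role reporting set work_mem = '256MB';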
All healthy now?

See you in six months!


Q&A
● Josh Berkus
● josh@pgexperts.com
● it.toolbox.com/blogs/database-soup
● PostgreSQL Experts
● www.pgexperts.com
● pgCon Sponsor
● Also see:
● Load Testing tutorial (tomorrow)
● Testing BOF (Friday)

Copyright 2010 Josh Berkus & PostgreSQL Experts Inc. Distributable under the Creative Commons Attribution License, except for 3rd-party images, which are property of their respective owners.
