DB Server Health Check
Program of Treatment
What is a Healthy Database?
Know Your Application
Load Testing
Doing a database server checkup
Response Times
lower than required
consistent & predictable
CPU and I/O headroom
low server load
[Chart: response time stays flat as the number of clients grows toward the expected load]
Slow response times
Inconsistent response times
High server load
No capacity for growth
[Chart: response time climbs steeply with the number of clients, well before the expected load]
A healthy database server is able to maintain consistent and acceptable response times under expected loads with margin for error.
CPUs Floored
Average: CPU    %user  %system  %iowait   %idle
Average: all    69.36     0.13    24.87    5.77
Average: 0      88.96     0.09    10.03    1.11
Average: 1      12.09     0.02    86.98    0.00
Average: 2      98.90     0.00     0.00   10.10
Average: 3      77.52     0.44     1.70   20.34
IO Saturated
Device:          tps   MB_read/s   MB_wrtn/s
sde           414.33        0.40       38.15
sdf          1452.00       99.14       29.00

Average: CPU    %user  %system   %idle
Average: all    34.75     0.13    6.37
Average: 0       8.96     0.09    1.11
Average: 1      12.09     0.02    0.00
Average: 2      91.90     0.00   10.10
Average: 3      27.52     0.44   20.34
Out of Connections
small reads
large sequential reads
small writes
large writes
long-running procedures/transactions
bulk loads and/or ETL
Web: DB smaller than RAM; 90% or more simple queries

OLTP: DB slightly larger than RAM, up to 1TB; 20-40% small data-write queries; some long transactions and complex read queries

DW/BI: large to huge databases (100GB to 100TB); large complex reporting queries; large bulk loads of data; also called "Decision Support" or "Business Intelligence"
CPU-bound. Ailments: idle connections/transactions, too many queries
CPU or I/O bound. Ailments: locks, database growth, idle transactions, database bloat
GIS: heavy CPU for GIS functions; lots of RAM for GIS indexes
TSearch: lots of RAM for indexes; slow response time on writes
SSL: response time lag on connections
Load Testing
[Chart: server load over a 24-hour period; the daily peak, not the average, is what pushes the server into DOWNTIME]
When preventing downtime, it is not the average load that matters; it is the peak load.
Load should be as similar as possible to your production traffic
You should be able to create your target level of traffic
better: incremental increases
the database server may not be your weak point
Test Servers
otherwise you don't know how production will be different
there is no predictable multiplier
Double them up as your development/staging or failover servers
If your test server is much smaller, then you need to do a same-load comparison
Production Test
1. Determine the peak load hour on the production servers
2. Turn on lots of monitoring during that peak load hour
3. Analyze results

Pretty much your only choice without a test server.
Not repeatable
load won't be exactly the same ever again
Get 10 to 50 coworkers to open several sessions each
Have them go crazy on using the application
Not repeatable
minor changes in response times may be due to changes in worker activity
each test run shuts down the office, unless you have a lot of coworkers
Labor intensive
Siege
all test interfaces must be addressable as URLs
useless for non-web applications
create a simple load test in a few hours
cannot test database separately
Simple to use
http://www.joedog.org/index/siege-home
pgReplay
Good for testing just the database server Can take time to set up
need a database snapshot, collect activity logs
must already have production traffic
http://pgreplay.projects.postgresql.org/
tsung
a load testing kit rather than a tool
Generate a tsung file from your activity logs using pgFouine and test the database
Generate load for a web application using custom scripts
but highly configurable and advanced
very scalable: cluster of load testing clients
http://tsung.erlang-projects.org/
pgBench
Simple micro-benchmark
not like any real application
write custom pgBench scripts
run against a real database
but easy to set up
ships with PostgreSQL
Benchmarks
DBT2, EAstress, CrashMe, DBT5, DBMonster, etc.: not useful for testing your application
Platform-specific
Flight-Check
Collect response times, with timestamps, along with activity
Monitor hardware and utilization
Checking Hardware
Pretty simple:
Rules of thumb
fewer faster CPUs is usually better than more slower ones
core != CPU
thread != core
virtual core != core
CPU calculations
up to 1 core for the OS
up to 1 core for software RAID or ZFS
1 core for postmaster and bgwriter
1 core per:
DW: 1 to 3 concurrent users
OLTP: 10 to 50 concurrent users
Web: 100 to 1000 concurrent users
CPU tools
sar mpstat pgTop
in praise of sar
sar
CPUs: sar -P ALL and sar -u
Memory: sar -r and sar -R
I/O: sar -b and sar -d
network: sar -n
Solaris
15:08:56   %usr  %sys  %wio  %idle
15:09:26     10     5     0     85
15:09:56      9     7     0     84
15:10:26     15     6     0     80
15:10:56     14     7     0     79
15:11:26     15     5     0     80
15:11:56     14     5     0     81
Memory
Only one statistic: how much?
Not generally an issue on its own:
low memory can cause more I/O
low memory can cause more CPU time
memory sizing
Shared Buffers
Filesystem Cache
work_mem
maint_mem
[Diagram: each data page is either In Buffer, In Cache, or On Disk]
How large is it?
Where does it fit into the size categories?
How large is the inactive portion of your database?
Memory Sizing
sorts and aggregates: do you do a lot of big ones?
GIN/GiST indexes: these can be huge
hashes: for joins and aggregates
VACUUM
I/O Considerations

Throughput: important for large databases; important for bulk loads
Latency: huge effect on small writes & reads; not so much on large scans
Seek Time: important for small writes & reads; very important for index lookups
I/O Considerations

Web: concerned about read latency & seek time
OLTP: concerned about write latency & seek time
DW/BI: concerned about throughput & seek time
         ------Sequential Output------            --Sequential Input--    --Random--
         -Per Chr-  --Block--  -Rewrite-  -Per Chr-  --Block--   --Seeks--
Size     K/sec %CP  K/sec %CP  K/sec %CP  K/sec %CP  K/sec  %CP   /sec %CP
32096M   79553  99  240548 45  50646   5  72471  94  185634  10   1140   1
         ------Sequential Output------   --Sequential Input--   --Random--
         --Block--       -Rewrite-       --Block--              --Seeks--
Size     K/sec %CP       K/sec %CP       K/sec %CP               /sec %CP
24G      260044 33       62110  17       89914  15               1167  25
Latency  6549ms          4882ms          3395ms                 107ms
otherwise, turn write cache off
ask about real throughput
4-14 drives for an OLTP application
8-48 for DW/BI
Enough drives?
Enough CPUs? will need one for the RAID
Enough disks? same as hardware RAID
Extra configuration? caching, block size
1 SSD = 4 Drives
IO Tools
I/O Tests
Monitoring Tools
sar, mpstat, iostat (watch %iowait)
on ZFS: fsstat, zpool iostat
EXPLAIN ANALYZE
Network
Throughput
remember that gigabit is only 100MB/s!
real issue for Web / OLTP
consider putting the app and database on a private network
Latency
comparable to a USB thumb drive
horrible latency, up to a second
erratic: speeds go up and down
RAID together several volumes on EBS
use asynchronous commit, or at least commit_siblings
#1 Cloud Rule
OS Basics
large performance and scaling improvements in Linux & Solaris in the last 2 years
advice for Oracle is usually good for PostgreSQL
OS Basics
Windows has poor performance and weak diagnostic tools
OSX is optimized for desktop and has poor hardware support
AIX and HP-UX require expertise just to install, and lack tools
Filesystem Layout
Two arrays:
OS and transaction log
Database

Three arrays:
OS & stats file
Transaction log
Database
Linux Tuning
Solaris Tuning
Use ZFS
no advantage to UFS anymore
mixed filesystems cause caching issues
set recordsize
FreeBSD Tuning
definite win for very large databases
not so much for small databases
PostgreSQL Checkup
postgresql.conf: formulae
Web/OLTP:
work_mem = Av.RAM * 2 / max_connections
DW/BI:
work_mem = Av.RAM / max_connections
postgresql.conf: formulae
Web/OLTP:
maintenance_work_mem = Av.RAM / 16
DW/BI:
maintenance_work_mem = Av.RAM / 8
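As a hypothetical worked example (the numbers are mine, not from the slides): a DW/BI server with 32GB of RAM available to PostgreSQL and max_connections = 40 would get roughly:

```
# DW/BI example: 32GB available RAM, max_connections = 40 (hypothetical)
work_mem = 800MB              # 32GB / 40 connections
maintenance_work_mem = 4GB    # 32GB / 8
```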
None:
fsync = off
full_page_writes = off
consider using a ramdrive

Some Loss OK:
synchronous_commit = off
wal_buffers = 16MB to 32MB

No Loss:
keep everything on
File Locations
you will be changing database objects and rewriting queries
find bugs in testing, not in production
pgTAP
Framework-level tests
Various tools
Hardware
The Funnel
Application → Middleware → PostgreSQL → OS → HW
Does the driver version match the PostgreSQL version?
Have you applied all updates?
Are you using the best driver?
There are several Python and C++ drivers
Don't use ODBC if you can avoid it.
Check Caching
what kind?
could it be used more?
what is the cache invalidation strategy?
is there protection from cache refresh storms?
all web applications should use one, and most OLTP
external, or built into the application server?
max. efficiency: transaction / statement mode
make sure timeouts match
Is it configured correctly?
PostgreSQL does better with fewer, bigger statements
Check for common query mistakes
joins in the application layer
pulling too much data and discarding it
huge OFFSETs
unanchored text searches
batches of inserts or updates can be 75% faster if wrapped in a transaction
watch for transactions being held open while non-database activity runs, or left open on error or timeout
the database could even hit the wall
check during each checkup: Nagios check_postgres.pl has checks for many of these things
Database Growth
Checkup:
check both total database size and largest table(s) size, daily or weekly

Symptoms:
database grows faster than expected
some tables grow continuously and rapidly
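One way to run this checkup is with PostgreSQL's built-in size functions; a sketch (the LIMIT is an arbitrary choice of mine):

```sql
-- Total size of the current database
SELECT pg_size_pretty(pg_database_size(current_database()));

-- The ten largest tables, including their indexes and TOAST data
SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 10;
```

Recorded daily or weekly, these numbers give you the growth trend the checkup asks for.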
Database Growth
Caused By:
faster-than-expected increase in usage
append-forever tables
database bloat

Leads to:
slower seq scans and index scans
swapping & temp files
slower backups
Database Growth
Treatment:
check for bloat
find largest tables and make them smaller
horizontal scaling (if possible)
get better storage & more RAM, sooner
Database Bloat
-[ RECORD 1 ]+-----------------------
schemaname   | public
tablename    | user_log
tbloat       | 3.4
wastedpages  | 2356903
wastedbytes  | 19307749376
wastedsize   | 18 GB
iname        | user_log_accttime_idx
ituples      | 941451584
ipages       | 9743581
iotta        | 40130146
ibloat       | 0.2
wastedipages | 0
wastedibytes | 0
wastedisize  | 0 bytes
Database Bloat
Caused by:
FSM set wrong (before 8.4)
Idle In Transaction

Leads To:
slow response times
unpredictable response times
heavy I/O
Database Bloat
Treatment:
fix FSM_relations / FSM_pages
check when tables are getting vacuumed
check for Idle In Transaction
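To check when tables are getting vacuumed, pg_stat_user_tables records it directly; a sketch:

```sql
-- Tables that vacuum (manual or auto) may be neglecting
SELECT relname, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY last_autovacuum NULLS FIRST;
```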
00:00:01   bread/s  lread/s  %rcache  bwrit/s  lwrit/s  %wcache  pread/s  pwrit/s
01:00:00      3788      115       98        0        0      100        0        0
02:00:00     21566      420       78        0        0      100        0        0
03:00:00    455721     1791       59        0        0      100        0        0
04:00:00       908        6       96        0        0      100        0        0
Caused by:
Database Growth or Bloat
work_mem limit too high
bad queries
database out of cache
Leads To:
Treatment
lower the work_mem limit
refactor bad queries
or just buy more RAM
Idle Connections
select datname, usename, count(*)
from pg_stat_activity
where current_query = '<IDLE>'
group by datname, usename;

 datname | usename | count
---------+---------+-------
 track   | www     |   318
Idle Connections
Caused by:
poor session management in application
wrong connection pool settings

Leads to:
memory usage for connections
slower response times
out-of-connections at peak load
Idle Connections
Treatment:
fix connection pool settings, or add a pool if there isn't one
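If idle connections must be culled from the database side, PostgreSQL 8.4+ can terminate them; a hedged sketch (the one-hour cutoff is an arbitrary choice of mine):

```sql
-- Terminate connections that have sat idle for over an hour (superuser only)
SELECT pg_terminate_backend(procpid)
FROM pg_stat_activity
WHERE current_query = '<IDLE>'
  AND now() - query_start > interval '1 hour';
```

This uses the pre-9.2 column names (procpid, current_query), matching the other queries in this deck.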
Idle In Transaction
select datname, usename,
       max(now() - xact_start) as max_time, count(*)
from pg_stat_activity
where current_query ~* '<IDLE> in transaction'
group by datname, usename;

 datname | usename | max_time      | count
---------+---------+---------------+-------
 track   | admin   | 00:00:00.0217 |     1
 track   | www     | 01:03:06.0709 |     7
Idle In Transaction
Caused by:
poor transaction control by application
abandoned sessions not being terminated fast enough

Leads To:
locking problems
database bloat
out of connections
Idle In Transaction
Treatment
refactor the application
change driver/ORM settings for transactions
change session timeouts & keepalives on pool, driver, and database
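On the database side, the keepalive knobs live in postgresql.conf; the values below are illustrative, not recommendations:

```
# Detect and drop dead client connections sooner (illustrative values)
tcp_keepalives_idle = 60        # seconds of inactivity before sending keepalives
tcp_keepalives_interval = 10    # seconds between keepalive probes
tcp_keepalives_count = 6        # failed probes before dropping the connection
```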
Detection:
log slow queries to the PostgreSQL log
do a daily or weekly report (pgFouine)

Symptoms:
number of long-running queries in the log increasing
slowest queries getting slower
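Logging slow queries is a small postgresql.conf change; the threshold below is an illustrative value, and the prefix is one example of a format a log analyzer can parse:

```
log_min_duration_statement = 1000   # log any statement taking over 1 second (ms)
log_line_prefix = '%t [%p]: '       # timestamp and PID on each line (illustrative)
```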
Caused by:
database growth
poorly-written queries
wrong indexes
out-of-date stats

Leads to:
out-of-CPU
out-of-connections
Treatments:
refactor queries
update indexes
make autoanalyze more aggressive
control database growth
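"Make autoanalyze more aggressive" translates to lowering the autovacuum analyze thresholds in postgresql.conf; the values below are illustrative:

```
# analyze after ~2% of a table changes, instead of the default 10%
autovacuum_analyze_scale_factor = 0.02
autovacuum_analyze_threshold = 100
```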
Caused By:
joins in middleware
not caching
poll cycles without delays
other application code issues

Leads To:
out-of-CPU
out-of-connections
Treatment:
Locking
Detection:
log_lock_waits
scan activity log for deadlock warnings
query pg_stat_activity and pg_locks

Symptoms:
deadlock error messages
number and time of lock_waits getting larger
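The pg_stat_activity/pg_locks check can be sketched like this (pre-9.2 column names, matching the other queries in this deck):

```sql
-- Who is waiting on a lock, on what, and what are they running?
SELECT l.pid, l.relation::regclass, l.mode, a.current_query
FROM pg_locks l
JOIN pg_stat_activity a ON a.procpid = l.pid
WHERE NOT l.granted;
```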
Locking
Caused by:
long-running operations with exclusive locks
inconsistent foreign key updates
poorly planned runtime DDL

Leads to:
poor response times
timeouts
deadlock errors
Locking
Treatment
establish a canonical order of updates
for long transactions, use pessimistic locks with NOWAIT
not in middleware code
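A pessimistic lock with NOWAIT fails immediately instead of queueing behind another transaction; a sketch against a hypothetical accounts table:

```sql
BEGIN;
-- Errors out at once if another transaction already holds the row lock
SELECT balance FROM accounts WHERE id = 42 FOR UPDATE NOWAIT;
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
COMMIT;
```

The application then retries or reports the conflict, rather than piling up waiters.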
Detection:
log_temp_files = 100kB
scan logs for temp files weekly or daily

Symptoms:
temp file usage getting more frequent
queries using temp files getting longer
Caused by:
sorts, hashes & aggregates too big for work_mem

Leads to:
slow response times
timeouts
Treatment
find swapping queries via the logs
set work_mem higher for that ROLE, or refactor them to need less memory, or buy more RAM
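Raising work_mem for just one role, rather than globally, is done with ALTER ROLE; the role name here is hypothetical:

```sql
-- Only sessions logging in as "reporting" get the larger setting
ALTER ROLE reporting SET work_mem = '256MB';
```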
Q&A
Josh Berkus
Also see:
PostgreSQL Experts
Copyright 2010 Josh Berkus & PostgreSQL Experts Inc. Distributable under the Creative Commons Attribution License, except for 3rd-party images, which are the property of their respective owners.