DB Server Health Check
Program of Treatment
What is a Healthy Database?
Know Your Application
Load Testing
Doing a database server checkup
Response Times
lower than required
consistent & predictable
CPU and I/O headroom
low server load
[Chart: response time stays flat as the number of clients grows toward the expected load]
Slow response times
Inconsistent response times
High server load
No capacity for growth
[Chart: response time climbs steeply with the number of clients, well before the expected load]
A healthy database server is able to maintain consistent and acceptable response times under expected loads with margin for error.
CPUs Floored
Average: CPU    %user  %system  %iowait   %idle
Average: all    69.36     0.13    24.87    5.77
Average: 0      88.96     0.09    10.03    1.11
Average: 1      12.09     0.02    86.98    0.00
Average: 2      98.90     0.00     0.00   10.10
Average: 3      77.52     0.44     1.70   20.34
IO Saturated
Device:          tps   MB_read/s   MB_wrtn/s
sde           414.33        0.40       38.15
sdf          1452.00       99.14       29.00

Average: CPU    %user  %system   %idle
Average: all    34.75     0.13    6.37
Average: 0       8.96     0.09    1.11
Average: 1      12.09     0.02    0.00
Average: 2      91.90     0.00   10.10
Average: 3      27.52     0.44   20.34
Out of Connections
small reads
large sequential reads
small writes
large writes
long-running procedures/transactions
bulk loads and/or ETL
Web: DB smaller than RAM; 90% or more simple queries

OLTP: DB slightly larger than RAM, up to 1TB; 20-40% small data-write queries; some long transactions and complex read queries

DW/BI: large to huge databases (100GB to 100TB); large complex reporting queries; large bulk loads of data; also called "Decision Support" or "Business Intelligence"
CPU-bound. Ailments: idle connections/transactions, too many queries
CPU or I/O bound. Ailments: locks, database growth, idle transactions, database bloat
GIS: heavy CPU for GIS functions; lots of RAM for GIS indexes
TSearch: lots of RAM for indexes; slow response time on writes
SSL: response time lag on connections
Load Testing
[Chart: server load over a 24-hour period; the daily peak, not the average, is what pushes the server into DOWNTIME]
When preventing downtime, it is not the average load that matters; it is the peak load.
Load should be as similar as possible to your production traffic
You should be able to create your target level of traffic
better: incremental increases
the database server may not be your weak point
Test Servers
otherwise you don't know how production will be different
there is no predictable multiplier
Double them up as your development/staging or failover servers
If your test server is much smaller, then you need to do a same-load comparison
Production Test
1. Determine the peak load hour on the production servers
2. Turn on lots of monitoring during that peak load hour
3. Analyze results

Pretty much your only choice without a test server.
Not repeatable
load won't be exactly the same ever again
Get 10 to 50 coworkers to open several sessions each
Have them go crazy on using the application
Not repeatable
minor changes in response times may be due to changes in worker activity
each test run shuts down the office, unless you have a lot of coworkers
Labor intensive
Siege
all test interfaces must be addressable as URLs
useless for non-web applications
create a simple load test in a few hours
cannot test database separately
Simple to use
http://www.joedog.org/index/siege-home
pgReplay
Good for testing just the database server Can take time to set up
need a database snapshot, collect activity logs
must already have production traffic
http://pgreplay.projects.postgresql.org/
tsung
a load testing kit rather than a tool
Generate a tsung file from your activity logs using pgFouine and test the database
Generate load for a web application using custom scripts
but highly configurable and advanced
very scalable: cluster of load testing clients
http://tsung.erlang-projects.org/
pgBench
Simple micro-benchmark
not like any real application
write custom pgBench scripts
run against a real database
but easy to set up
ships with PostgreSQL
Benchmarks
DBT2, EAstress, CrashMe, DBT5, DBMonster, etc.: not useful for testing your application
Platform-specific
Flight-Check
Collect response times, with timestamps, along with activity
Monitor hardware and utilization
Checking Hardware
Pretty simple:
Rules of thumb
fewer faster CPUs is usually better than more slower ones
core != CPU
thread != core
virtual core != core
CPU calculations
up to 1 core for the OS
up to 1 core for software RAID or ZFS
1 core for postmaster and bgwriter
1 core per:
DW: 1 to 3 concurrent users
OLTP: 10 to 50 concurrent users
Web: 100 to 1000 concurrent users
CPU tools
sar mpstat pgTop
in praise of sar
sar
CPUs: sar -P ALL and sar -u
Memory: sar -r and sar -R
I/O: sar -b and sar -d
network: sar -n
Solaris
15:08:56   %usr  %sys  %wio  %idle
15:09:26     10     5     0     85
15:09:56      9     7     0     84
15:10:26     15     6     0     80
15:10:56     14     7     0     79
15:11:26     15     5     0     80
15:11:56     14     5     0     81
Memory
Only one statistic: how much?
Not generally an issue on its own:
low memory can cause more I/O
low memory can cause more CPU time
memory sizing
Shared Buffers
Filesystem Cache
work_mem
maint_mem
[Diagram: each data page is either In Buffer, In Cache, or On Disk]
How large is it?
Where does it fit into the size categories?
How large is the inactive portion of your database?
Memory Sizing
sorts and aggregates: do you do a lot of big ones?
GIN/GiST indexes: these can be huge
hashes: for joins and aggregates
VACUUM
I/O Considerations

Throughput: important for large databases; important for bulk loads
Latency: huge effect on small writes & reads; not so much on large scans
Seek Time: important for small writes & reads; very important for index lookups
I/O Considerations

Web: concerned about read latency & seek time
OLTP: concerned about write latency & seek time
DW/BI: concerned about throughput & seek time
         ------Sequential Output------            --Sequential Input--    --Random--
         -Per Chr-  --Block--  -Rewrite-  -Per Chr-  --Block--   --Seeks--
Size     K/sec %CP  K/sec %CP  K/sec %CP  K/sec %CP  K/sec  %CP   /sec %CP
32096M   79553  99  240548 45  50646   5  72471  94  185634  10   1140   1
         ------Sequential Output------   --Sequential Input--   --Random--
         --Block--       -Rewrite-       --Block--              --Seeks--
Size     K/sec %CP       K/sec %CP       K/sec %CP               /sec %CP
24G      260044 33       62110  17       89914  15               1167  25
Latency  6549ms          4882ms          3395ms                 107ms
otherwise, turn write cache off
ask about real throughput
4-14 drives for an OLTP application
8-48 for DW/BI
Enough drives?
Enough CPUs? will need one for the RAID
Enough disks? same as hardware RAID
Extra configuration? caching, block size
1 SSD = 4 Drives
IO Tools
I/O Tests
Monitoring Tools
sar, mpstat, iostat (watch %iowait)
on ZFS: fsstat, zpool iostat
EXPLAIN ANALYZE
Network
Throughput
remember that gigabit is only 100MB/s!
real issue for Web / OLTP
consider putting the app and database on a private network
Latency
comparable to a USB thumb drive
horrible latency, up to a second
erratic: speeds go up and down
RAID together several volumes on EBS
use asynchronous commit, or at least commit_siblings
#1 Cloud Rule
OS Basics
large performance and scaling improvements in Linux & Solaris in the last 2 years
advice for Oracle is usually good for PostgreSQL
OS Basics
Windows has poor performance and weak diagnostic tools
OSX is optimized for desktop and has poor hardware support
AIX and HP-UX require expertise just to install, and lack tools
Filesystem Layout
Two arrays:
OS and transaction log
Database

Three arrays:
OS & stats file
Transaction log
Database
Linux Tuning
Solaris Tuning
Use ZFS
no advantage to UFS anymore
mixed filesystems cause caching issues
set recordsize
FreeBSD Tuning
definite win for very large databases
not so much for small databases
PostgreSQL Checkup
postgresql.conf: formulae
Web/OLTP:
work_mem = Av.RAM * 2 / max_connections
DW/BI:
work_mem = Av.RAM / max_connections
postgresql.conf: formulae
Web/OLTP:
maintenance_work_mem = Av.RAM / 16
DW/BI:
maintenance_work_mem = Av.RAM / 8
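As a hypothetical worked example (the numbers are mine, not from the slides): a DW/BI server with 32GB of RAM available to PostgreSQL and max_connections = 40 would get roughly:

```
# DW/BI example: 32GB available RAM, max_connections = 40 (hypothetical)
work_mem = 800MB              # 32GB / 40 connections
maintenance_work_mem = 4GB    # 32GB / 8
```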
None:
fsync = off
full_page_writes = off
consider using a ramdrive

Some Loss OK:
synchronous_commit = off
wal_buffers = 16MB to 32MB

No Loss:
keep everything on
File Locations
you will be changing database objects and rewriting queries
find bugs in testing, not in production
pgTAP
Framework-level tests
Various tools
Hardware
The Funnel
Application → Middleware → PostgreSQL → OS → HW
Does the driver version match the PostgreSQL version?
Have you applied all updates?
Are you using the best driver?
There are several Python and C++ drivers
Don't use ODBC if you can avoid it.
Check Caching
what kind?
could it be used more?
what is the cache invalidation strategy?
is there protection from cache refresh storms?
all web applications should use one, and most OLTP
external, or built into the application server?
max. efficiency: transaction / statement mode
make sure timeouts match
Is it configured correctly?
PostgreSQL does better with fewer, bigger statements
Check for common query mistakes
joins in the application layer
pulling too much data and discarding it
huge OFFSETs
unanchored text searches
batches of inserts or updates can be 75% faster if wrapped in a transaction
watch for transactions being held open while non-database activity runs, or left open on error or timeout
the database could even hit the wall
check during each checkup: Nagios check_postgres.pl has checks for many of these things
Database Growth
Checkup:
check both total database size and largest table(s) size, daily or weekly

Symptoms:
database grows faster than expected
some tables grow continuously and rapidly
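One way to run this checkup is with PostgreSQL's built-in size functions; a sketch (the LIMIT is an arbitrary choice of mine):

```sql
-- Total size of the current database
SELECT pg_size_pretty(pg_database_size(current_database()));

-- The ten largest tables, including their indexes and TOAST data
SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 10;
```

Recorded daily or weekly, these numbers give you the growth trend the checkup asks for.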
Database Growth
Caused By:
faster-than-expected increase in usage
append-forever tables
database bloat

Leads to:
slower seq scans and index scans
swapping & temp files
slower backups
Database Growth
Treatment:
check for bloat
find largest tables and make them smaller
horizontal scaling (if possible)
get better storage & more RAM, sooner
Database Bloat
-[ RECORD 1 ]+-----------------------
schemaname   | public
tablename    | user_log
tbloat       | 3.4
wastedpages  | 2356903
wastedbytes  | 19307749376
wastedsize   | 18 GB
iname        | user_log_accttime_idx
ituples      | 941451584
ipages       | 9743581
iotta        | 40130146
ibloat       | 0.2
wastedipages | 0
wastedibytes | 0
wastedisize  | 0 bytes
Database Bloat
Caused by:
FSM set wrong (before 8.4)
Idle In Transaction

Leads To:
slow response times
unpredictable response times
heavy I/O
Database Bloat
Treatment:
fix FSM_relations / FSM_pages
check when tables are getting vacuumed
check for Idle In Transaction
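To check when tables are getting vacuumed, pg_stat_user_tables records it directly; a sketch:

```sql
-- Tables that vacuum (manual or auto) may be neglecting
SELECT relname, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY last_autovacuum NULLS FIRST;
```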
00:00:01   bread/s  lread/s  %rcache  bwrit/s  lwrit/s  %wcache  pread/s  pwrit/s
01:00:00      3788      115       98        0        0      100        0        0
02:00:00     21566      420       78        0        0      100        0        0
03:00:00    455721     1791       59        0        0      100        0        0
04:00:00       908        6       96        0        0      100        0        0
Caused by:
Database Growth or Bloat
work_mem limit too high
bad queries
database out of cache
Leads To:
Treatment
lower the work_mem limit
refactor bad queries
or just buy more RAM
Idle Connections
select datname, usename, count(*)
from pg_stat_activity
where current_query = '<IDLE>'
group by datname, usename;

 datname | usename | count
---------+---------+-------
 track   | www     |   318
Idle Connections
Caused by:
poor session management in application
wrong connection pool settings

Leads to:
memory usage for connections
slower response times
out-of-connections at peak load
Idle Connections
Treatment:
fix connection pool settings, or add a pool if there isn't one
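If idle connections must be culled from the database side, PostgreSQL 8.4+ can terminate them; a hedged sketch (the one-hour cutoff is an arbitrary choice of mine):

```sql
-- Terminate connections that have sat idle for over an hour (superuser only)
SELECT pg_terminate_backend(procpid)
FROM pg_stat_activity
WHERE current_query = '<IDLE>'
  AND now() - query_start > interval '1 hour';
```

This uses the pre-9.2 column names (procpid, current_query), matching the other queries in this deck.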
Idle In Transaction
select datname, usename,
       max(now() - xact_start) as max_time, count(*)
from pg_stat_activity
where current_query ~* '<IDLE> in transaction'
group by datname, usename;

 datname | usename | max_time      | count
---------+---------+---------------+-------
 track   | admin   | 00:00:00.0217 |     1
 track   | www     | 01:03:06.0709 |     7
Idle In Transaction
Caused by:
poor transaction control by application
abandoned sessions not being terminated fast enough

Leads To:
locking problems
database bloat
out of connections
Idle In Transaction
Treatment
refactor the application
change driver/ORM settings for transactions
change session timeouts & keepalives on pool, driver, and database
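On the database side, the keepalive knobs live in postgresql.conf; the values below are illustrative, not recommendations:

```
# Detect and drop dead client connections sooner (illustrative values)
tcp_keepalives_idle = 60        # seconds of inactivity before sending keepalives
tcp_keepalives_interval = 10    # seconds between keepalive probes
tcp_keepalives_count = 6        # failed probes before dropping the connection
```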
Detection:
log slow queries to the PostgreSQL log
do a daily or weekly report (pgFouine)

Symptoms:
number of long-running queries in the log increasing
slowest queries getting slower
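Logging slow queries is a small postgresql.conf change; the threshold below is an illustrative value, and the prefix is one example of a format a log analyzer can parse:

```
log_min_duration_statement = 1000   # log any statement taking over 1 second (ms)
log_line_prefix = '%t [%p]: '       # timestamp and PID on each line (illustrative)
```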
Caused by:
database growth
poorly-written queries
wrong indexes
out-of-date stats

Leads to:
out-of-CPU
out-of-connections
Treatments:
refactor queries
update indexes
make autoanalyze more aggressive
control database growth
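"Make autoanalyze more aggressive" translates to lowering the autovacuum analyze thresholds in postgresql.conf; the values below are illustrative:

```
# analyze after ~2% of a table changes, instead of the default 10%
autovacuum_analyze_scale_factor = 0.02
autovacuum_analyze_threshold = 100
```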
Caused By:
joins in middleware
not caching
poll cycles without delays
other application code issues

Leads To:
out-of-CPU
out-of-connections
Treatment:
Locking
Detection:
log_lock_waits
scan activity log for deadlock warnings
query pg_stat_activity and pg_locks

Symptoms:
deadlock error messages
number and time of lock_waits getting larger
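The pg_stat_activity/pg_locks check can be sketched like this (pre-9.2 column names, matching the other queries in this deck):

```sql
-- Who is waiting on a lock, on what, and what are they running?
SELECT l.pid, l.relation::regclass, l.mode, a.current_query
FROM pg_locks l
JOIN pg_stat_activity a ON a.procpid = l.pid
WHERE NOT l.granted;
```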
Locking
Caused by:
long-running operations with exclusive locks
inconsistent foreign key updates
poorly planned runtime DDL

Leads to:
poor response times
timeouts
deadlock errors
Locking
Treatment
establish a canonical order of updates
for long transactions, use pessimistic locks with NOWAIT
not in middleware code
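A pessimistic lock with NOWAIT fails immediately instead of queueing behind another transaction; a sketch against a hypothetical accounts table:

```sql
BEGIN;
-- Errors out at once if another transaction already holds the row lock
SELECT balance FROM accounts WHERE id = 42 FOR UPDATE NOWAIT;
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
COMMIT;
```

The application then retries or reports the conflict, rather than piling up waiters.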
Detection:
log_temp_files = 100kB
scan logs for temp files weekly or daily

Symptoms:
temp file usage getting more frequent
queries using temp files getting longer
Caused by:
sorts, hashes & aggregates too big for work_mem

Leads to:
slow response times
timeouts
Treatment
find swapping queries via the logs
set work_mem higher for that ROLE, or refactor them to need less memory, or buy more RAM
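Raising work_mem for just one role, rather than globally, is done with ALTER ROLE; the role name here is hypothetical:

```sql
-- Only sessions logging in as "reporting" get the larger setting
ALTER ROLE reporting SET work_mem = '256MB';
```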
Q&A
Josh Berkus
Also see:
PostgreSQL Experts
Copyright 2010 Josh Berkus & PostgreSQL Experts Inc. Distributable under the Creative Commons Attribution License, except for 3rd-party images, which are the property of their respective owners.