Professional Documents
Culture Documents
1. Information
Research from ebook DevOps Troubleshooting , download from http://cuongquach.com . And ebook is on
www.allitebooks.com
1. System Load
Typically systems seem to be more responsive when under CPU-bound load than when the I/O-bound load. I’ve seen
systems with loads in the hundress that were CPU-bound, and I could still run diagnostic tools on those systems with
pretty good response times. On the other hand (mặtc khác), I’ve seen systems with relatively low I/O-bound loads on
which just logging in took a minute because the disk I/O was completely saturated (bảo hòa). A system that runs out of
RAM resource often appears to have I/O-bound load, since once the system starts using swap storage on the disk, it
can consume (tiêu thụ) disk resource and cause a downward spiral (vòng xoáy đi xuống) as processes slow to a halt.
# top -n 1
And we know services that we need monitor. For examples: we only monitor asterisk, we can use :
■ us: user CPU time : This is the percentage of CPU time spent running users’ processes that aren’t niced. (Nicing is a
process that allows you to change its priority in relation to other processes.)
■ sy: system CPU time : This is the percentage of CPU time spent running the kernel and kernel processes.
■ ni: nice CPU time : If you have user processes that have been niced, this metric will tell you the percentage of CPU
time spent running them.
■ id: CPU idle time : This is one of the metrics that you want to be high. It represents the percentage of CPU time that is
spent idle. If you have a sluggish system but this number is high, you know the cause isn’t high CPU load.
■ wa: I/O wait : This number represents the percentage of CPU time that is spent waiting for I/O. It is a particularly
valuable metric when you are tracking down the cause of a sluggish system, because if this value is low, you can pretty
safely rule out disk or network I/O as the cause.
■ hi: hardware interrupts : This is the percentage of CPU time spent servicing hardware interrupts.
■ si: software interrupts : This is the percentage of CPU time spent servicing software interrupts.
■ st: steal time : If you are running virtual machines, this metric will tell you the percentage of CPU time that was stolen
from you for other tasks.
For examples:
You can see that the system is over 50% idle, which matchs a load of 1.7 on a four-CPU system. When you diagnose a
slow system, ont of the first values you should look at is I/O wait so you can rule out disk I/O. If I/O wait is low, then you
can look at the idle percentage. If I/O wait is high, then the next step is to diagnose what is causing high disk I/O, which I
will cover momentarily. If the I/O wait is low and the idle percentage is high, you then know any sluggishness is not
because of CPU resources, and you will have to start troubleshooting elsewhere. This might mean looking fo r network
problems, or in the case of a web server, looking at slow queries to MySQL, for instance.
In this example, the mysqld process is consuming 53% of the CPU and the nagios2db_status process is consuming 12%.
Note that this is the percentage of a single CPU, so if you have a four-CPU machine, you could possibly see more than
one process consuming 99% CPU
Once Linux loads a file into RAM, it doesn’t necessarily remove it from RAM when a program is done with it. If there is
RAM available, Linux will cache the file in RAM so that if a program accesses the file again, it can do so much more
quickly. If the system does need RAM for active processes, it won’t cache as many files. Because of the file cache, it’s
common for a server that has been running for a fair amount of time to report a small amount of RAM free with the
remainder residing in cach.
To find out how much RAM is really being used by processes, you must subtract the file cache from the used RAM. In the
example code you just looked at, out of the 997,408k RAM that is used, 286,040 is being used by the Linux file cach, so
that means that only 711,368k is actually being used.
NOTE:
- On top command, we can type m to sort follow Memory or type c to sort follow cpu
- Type f to add more column or arrange column
It can sometimes be difficult to figure out exactly which process is using the I/O, but if you have multiple partitions on
your system, you can narrow it down by giuring out which partition most of the I/O is on. To do this, you need the iostat
program, which is provided by the sysstat package is both Redhat and Debian-based systems
■ tps : This lists the transfers per second to the device. “Transfers” is another way to say I/O requests sent to the device.
■ Blk_read/s : This is the number of blocks read from the device per second.
■ Blk_wrtn/s : This is the number of blocks written to the device per second.
■ Blk_read : In this column is the total number of blocks read from the device.
■ Blk_wrtn : In this column is the total number of blocks written to the device.
When you have a system under heavy I/O load (khi bạn có 1 hệ thống chịu tải nặng), the first step is to look at each of
the partitions and identify which partition is getting the heaviest I/O load. Say, for instance, that you have a database
server and database itself is stored on /dev/sda3. If you see that the bulk of the I/O is coming from there (đến từ đó),
you have a good clue (một manh mối tốt) that the database is likely consuming the I/O.
Once you figure that out (một khi bạn tìm ra điều đó), the next step is to identify whether the I/O is mostly from reads or
writes. Let’s say you suspect (nghi ngờ) that a backup job is causing the increase in I/O. Since the backup job is mostly
concerned with (chủ yếu liên quan đến) reading files from the file system and writing them over the network to the
backup server, you could possibly rule that out if you see that the bulk of the I/O is due to writes, not reads (bạn có thể
loại trừ nếu bạn thấy rằng phần lớn I / O là do ghi chứ không phải đọc.)
We’re already discussed how to use the tool iostat included in the sysstat package to troubleshoot high IO,but sysstat
includes tools that can also report on CPU and RAM utilization. Although it’s true can already do this with top, what
makes sysstat even more useful is that is provides a simple mechnism to log system statistics like CPU load, RAM, and
I/O stats. With these statistics, when someone complains that a system was slow around noon yesterday, you can play
back these logs and see what could have caused the problem.
2. Configure sysstat
On a Debian-based system like Ubuntu, sysstat won’t be enabled automatically, so edit /etc/default/sysstat and change
ENABLE=”false” to ENABLE=”true”
On a Red Hat-based system, we may want to edit the /etc/sysconfig/sysstat file and change the HISTORY option so it
logs more than 7 days of statistics.
On both distribution types, statistics will be captured every 10 minutes and a daily summary will be logged
Once enabled, sysstat gathers system stats (thu thập số liệu thống kê hệ thống) every 10 minutes and stores them under
/var/log/sysstat or /var/log/sa. In addition, it will rotate out the statistics (thống kê) file every night before midnight.
Both of these actions are run in the /etc/cron.d/sysstat script, so if you want to change how frequently sysstat gathers
information, you can modify it from that file
From the output you can see many of the same CPU statistics you would view in top output. At the bottom, sar provides
an overall average as well.
4. View RAM Statistics
The sysstat cron job collects much more information than CPU load, though. For instance, to gather RAM statistics
instead, use the -r option
Here you can see how much free and used memory you have as well as view statistics about swap and the file cache
similar to what you would see in either top or free output. The difference here is that you can go back in time.
Here you can see the number of total transactions per second (tps) plus how many of those transactions were reads and
writes (rtps and wtps, respectively). The bread/s column doesn’t measure bread I/O, instead it tells you the average
number of bytes read per second. Similarly, the bwrtn/s tells you average bytes written per second.
There are tons of individual arguments you can pass sar to pull out specific sets of data, but sometimes you just want to
see everything all at once. For that, just use the -A option. That will output all of the statistics from load average, CPU
load, RAM, disk I/O, network I/O, and all sorts of other interesting statistics. This can give you a good idea of what sorts
of statistics sar can output, so you can then read the sar manual (type man sar) to see what flags to pass sar to see
particular statictics.
6. View Statistics from Previous Days
Of course, so far I’ve just listed how to pull all of the statistics for the entire day. Often you want data from only a
portion of the day. To pull out data for a certain time range, use the -s and -e arguments to specify the starting time and
ending time you are interested in, respectively (tương ứng). For instance, if you wanted to pull CPU data just from 8:00
p.m to 8:30 p.m, you would type
If you want to pull data from a day other than today, just use the -f option followed by the full path to the particular
statistics file stored under /var/log/sysstat or /var/log/sa. For instance, to pull data from the statistics on the sixth of the
month you would type
$ sar -f /var/log/sysstat/sa06