You are on page 1of 10

Disk IO Errors: Troubleshooting

on Linux Servers
Disk IO errors  (input/output) issues are a common cause of poor performance
on web hosting servers. Hard drives have speed limits, and if software tries to
read or write too much data too quickly, applications and users are forced to
wait. To put it another way, storage devices can be a bottleneck that stops the
server from reaching its full performance potential. 
Disk IO is not the only cause of slow servers, so in this article, we’ll explain
how to use Linux IO stats to identify disk IO issues and how to diagnose and
fix servers with storage bottlenecks. 

What Are the Symptoms of Disk IO Errors?


Because disk IO is so important to server performance, it can manifest in
many different ways:

 Increased website latency: sites hosted on the server take longer to


load than expected. 
 High server load: excessive IO can cause other components, including
the CPU, to work harder. You can learn more about measuring server
load in Troubleshooting High Server Loads.
 You receive a chkservd notification: cPanel & WHM’s built-
in chkservd monitoring tool will alert you if a service is unavailable. 
 Slow email delivery and webmail interfaces: Email software needs to
read data from and write data to the hard drives. 
 Slow cPanel & WHM interfaces: Many cPanel & WHM features rely on
fast hard drive or database access. 

If you observe any of these, high IO loads might be the culprit, but how do you
know it’s a storage bottleneck rather than a problem with the network or
processor?

Is Disk IO the Cause of Poor Performance?


From the user’s perspective, an IO bottleneck might look just like network
latency, among other problems. It’s prudent to make sure that the server’s
storage is the real culprit so we don’t waste time and money fixing the wrong
problem.
We’ll need to use diagnostic tools on the server’s command line, so log in with
SSH. 
First, let’s see if the CPU is waiting for disk operations to complete. Type “top”
and press enter. This launches the top tool, which shows server statistics and
a list of running processes. The wa metric shows IO-wait, the amount of time
the CPU spends waiting for IO completion represented as a percentage. 

IO-wait is one of a series of processor activity figures in the %CPU row. It also
includes:

 us: time spent on user processes.


 sy: time spent on system processes. 
 id: idle time.

On the single-CPU server in the example images, these are straightforward to


understand. Our server’s CPU spends 59 percent of its time waiting for IO
input instead of processing data.  An IO-wait above 1 may indicate the
server’s hard drives are struggling to supply the processor with data. 
On multi-core and multi-processor servers, it’s a little more complex.
Because top adds the CPU utilization figures for all cores, they can exceed
100 percent. As a rule, if the IO-wait percentage is bigger than 1 when divided
by the number of CPU cores, then the processor must wait before it can
process data. For example, on a 4-core system with a wa of 10 percent, the
IO-wait is around 2.5, so the processors are forced to wait. 
IO-wait times don’t always mean there is an IO bottleneck, but it is a valuable
clue, especially when it correlates with observed performance issues. To
discover the cause, we need to investigate further with vmstat, which shows
statistics for IO, CPU, and memory activity, among others. 

vmstat 1 10

We’re asking vmstat to show us ten readings at one-second intervals. The


first line shows average IO stats since the last reboot, and the subsequent
lines show real-time statistics.

We are interested in the io column, which is divided into input and output. It


shows that large amounts of data were written to a storage device throughout
the test period. Compared to the average loads in the first row, the IO system
is being seriously stressed. 
Next, we want to know which hard drive is under load. To find out, we can
use iostat. 

iostat -md

The -m option tells iostat to display statistics in megabytes per second, and -d
says we’re interested in device utilization. 
The device called vda is writing 730 MB of data each second. Whether that’s
a problem depends on the capabilities of the server and the device, but with
the observed performance degradation and large IO wait times, it’s
reasonable to conclude that excessive disk IO on vda is the cause of our
issues. 
There is one other piece of information that could help us narrow things down:
the mount point of the vda device. The mount point is the directory on the
server’s filesystem the device is connected to. You can find it with
the lsblk command.

We can see that vda has one partition called vda1 and that it’s mounted on


the root of the filesystem (/).  On this server, that information is not particularly
helpful; it only has one mounted device. However, on a server with several
storage devices, lsblk can help you to figure out where the data is being
written and which application is writing it. 

How To Fix Disk IO Issues


Once you have identified the affected drive, there are several approaches you
can take to mitigate disk IO issues.
For example, you may want to try changing a few settings to see if
performance improves before you upgrade hard drives or memory. Three hard
drive configuration settings you should try changing first are:
1. Turn on write caching
2. Turn on direct memory access
3. Upgrade Server Hardware
Here’s how to do that:

Turn on Write Caching


Write caching collects data for multiple writes in a RAM cache before writing
them permanently to the drive. Because it reduces the number of hard drive
writes, it can improve performance in some scenarios. 
Write caching can cause data loss if the server’s power is cut before the
cache is written to the disk. Don’t activate write caching if you want to
minimize the risk of data loss. 
The hdparm utility can turn write caching on and off. It may not be installed by
default on your server, but you can install it from CentOS’s core repository
with:

yum install hdparm

The following command turns write caching on:

hdparm -W1 /dev/sda

To turn write caching off, use:

hdparm -W0 /dev/sda

Turn on Direct Memory Access


Direct Memory Access (DMA) allows the server’s components to access its
RAM directly, without going via the CPU.  It can significantly increase hard
drive performance in some scenarios. 
To enable DMA, run the following command, replacing /dev/hda with your
hard drive:

hdparm -d1 /dev/hda

You can turn DMA off with:


hdparm -d0 /dev/hda

DMA isn’t available on all servers and, with virtual servers in particular, you
may not be able to modify hard drive settings.

Upgrade Server Hardware


If configuration tweaks don’t solve your IO issues, it’s time to think about
upgrading, replacing, or reorganizing the server’s hardware. 

 Upgrade the hard drive: If the drive is slow, consider upgrading to a


faster spinning disk drive or a solid state drive (SSD) that can better
cope with the IO load. 
 Split the application load between hard disks: Add hard drives and
divide the IO load between them. You may also want to consider giving
the operating system’s root filesystem its own drive, so that the
operating system and cPanel & WHM don’t have to compete with user
applications.
 Increase the server’s memory: If applications can cache more data in
RAM, they will not need to read from and write to the filesystem as
often.  For some applications, an in-memory cache such as Memcached
or Varnish may improve performance while reducing disk IO. 
 Check the RAID array hardware: A malfunctioning RAID controller can
degrade IO performance. 

Disk IO bottlenecks can be tricky to diagnose, but the process we’ve outlined
here will help you to quickly determine whether you have an IO problem,
which drives are affected, and what you can do to improve your server’s
performance. 
As always, if you have any feedback or comments, please let us know. We
are here to help in the best ways we can. You’ll find us on Discord, the cPanel
forums, and Reddit.
[junhee.lee:root@* /]# ls -al

ls: cannot access test

[junhee.lee:root@* /test]# ls

ls: cannot open directory .: Input/output error

1. If you enter a command in the / directory, the phrase cannot access test is displayed.

2. If you enter a command in the /test directory, the message ls: cannot open directory .: Input/output
error is displayed.

 - I thought it was a simple disk problem, but it was an error that occurred because the mount settings
were wrong.

3. An error that occurs when a disk failure occurs

 - Disk replacement is required. The possibility of data loss should be considered.

[junhee.lee:root@* ~]# df -h

Filesystem Size Used Avail Use% Mounted on

/dev/sda1 20G 2.7G 17G 14% /

tmpfs 32G 0 32G 0% /dev/shm

/dev/sda3 197G 842M 197G 1% /usr/local/cdnet

/dev/sdb1 1.7T 614G 1.1T 37% /test

There was no problem when outputting the currently mounted information.

[junhee.lee:root@* /]# ls -al

total 129640

dr-xr-xr-x. 26 root root 4096 Sep 25 03:04 .

dr-xr-xr-x. 26 root root 4096 Sep 25 03:04 ..


.

d?????????? ?? ? ? ? test

After entering the ls -al command in the /(root) directory, I checked that the owner and permission
were not set in the test directory.

[junhee.lee:root@* /]# umount /dev/sdb1

[junhee.lee:root@* /]# mount -a

[junhee.lee:root@* /]# df -h

Filesystem Size Used Avail Use% Mounted on

/dev/sda1 20G 2.7G 17G 14% /

tmpfs 32G 0 32G 0% /dev/shm

/dev/sda3 197G 842M 197G 1% /usr/local/cdnet

/dev/sdb1 1.7T 614G 1.1T 37% /test

umount /dev/sdb1

mount -a

df -h

[junhee.lee:root@* /]# ls -al

total 129640

dr-xr-xr-x. 26 root root 4096 Sep 25 03:04 .

dr-xr-xr-x. 26 root root 4096 Sep 25 03:04 ..

.
.

drwxr-xr-x 5 mysql mysql 40 Sep 23 06:37 test

After umount -> mount -a, check that information is output when entering ls -al in the /(root)
directory

[junhee.lee:root@* /teset]# ls

backup data log

[junhee.lee:root@* /test]# cd log/

[junhee.lee:root@* /test/log]# ls

backup * error * * * * *

[junhee.lee:root@* /test/log]# cd error/

[junhee.lee:root@* /test/log/error]# ls

*.err

[junhee.lee:root@* /test/log/error]# cp *.err *.err2

[junhee.lee:root@* /test/log/error]# ls -al

total 72

drwxr-xr-x 2 test test 49 Sep 29 00:10 .

drwxr-xr-x 10 test test 121 Aug 20 07:07 ..

-rw-rw---- 1 test test 36009 Sep 22 02:11 *.err

-rw-r----- 1 root root 36009 Sep 29 00:10 *.err2

1-The error message is as follows


2-Analysis: The above situations are generally caused by hard disk problems leading to file system
damage

3. Analysis: The above situations are generally caused by hard disk problems leading to file system
damage

Solution: 1. If conditions permit, restart the system and try.

2. Use fsck to repair the file system. The repair needs to unmount the disk mount first, and then use:
fsck -y /dev/sdb1; after the repair is completed, remount the directory, and enter the directory again to
view the problem solution.

You might also like