Debugging a Hung Process in Linux
When a batch job or process on Linux doesn’t seem to be doing anything, it can be difficult to figure out
exactly where it is stuck. You can run strace on the process to see whether it is hung on a system call,
but that may not tell you as much as you would like. Many processes have file handles, network sockets,
pipes, or other connections open, so you don’t always know where the bottleneck or blockage is. Here is
an example of a process that was stuck doing nothing for a long period of time, and how we found out
what it was stuck on. The problem was originally noticed because the log file hadn’t been written to in
over an hour. But what was wrong? Had it died? Was it stuck? Was it just failing to write out the log?
A process listing (for example from ps) filtered through grep gives you the PID and the command that
was run, limited to the entries we care about.
Note: the listing sometimes includes the grep command itself; it is left out here because it is irrelevant.
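One way to keep grep out of its own results is the bracket trick. This is a minimal sketch, not the command that was originally run: the pattern garbagecollector is an assumption taken from the log file names that show up in the file descriptor listing of this example.

```shell
# Assumed process name; substitute whatever identifies your job.
# Putting the first character in [] makes the regex match "garbagecollector"
# but not the grep command line, which contains "[g]arbagecollector".
pattern='[g]arbagecollector'

# -e: every process, -f: full format (UID, PID, PPID, start time, command).
ps -ef | grep "$pattern" || echo "no matching process found"
```

The PPID column of ps -ef is also worth a look here, since the stuck job is a child of another job.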
The process is running. It reads from a file, writes to a log file, does filesystem deletes against an
NFS-mounted filer, has a database connection, and is a child process of another job. Any one of those
things could be causing this job to run inordinately slowly.
$ strace -p 21129
Process 21129 attached - interrupt to quit
read(6,
Process 21129 detached
Nothing ever happened after that, so I used Ctrl+C to break out of it. The process is definitely doing a
whole lot of nothing from a system perspective: it is sitting in a read on file descriptor 6.
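As a rough cross-check on what strace shows, the kernel also reports the scheduler state of a process in /proc/<pid>/status. The sketch below is an illustration only: the helper name proc_state is made up, and it inspects the current shell ($$) so that it runs anywhere. On the machine in this example you would pass 21129 instead.

```shell
# Print the one-letter scheduler state of a PID, read from the "State:" line
# of /proc/<pid>/status (for example "State:  S (sleeping)").
#   S = interruptible sleep, typical for a process blocked in read() on a socket
#   D = uninterruptible sleep, often seen with stuck disk or NFS I/O
#   R = running or runnable
proc_state() {
    awk '/^State:/ {print $2}' "/proc/$1/status"
}

# Hypothetical usage against the current shell.
proc_state "$$"
```

A process that stays in D for a long time would point towards the NFS filer rather than a socket, which is why the state is worth checking before chasing file descriptors.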
netstat shows that this machine has a network connection established on port 45950 to db.env.xyz. The
first run did not show more than that because I did not pass enough command-line switches to netstat, so
the next step is to look up more information with different switches.
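The exact switches used originally are not shown, so the following is only a sketch of one reasonable combination, assuming the net-tools version of netstat. The port number is the one from this example.

```shell
port=45950

# -t: TCP only, -a: all sockets, -n: numeric addresses and ports,
# -p: owning PID/program (needs enough privileges for other users' processes),
# -e: extended output, which adds the User and Inode columns.
netstat -tanpe 2>/dev/null | grep ":$port " \
    || echo "no connection on port $port (or netstat is not available)"
```

The Inode column is the useful part: it is the same number that appears in the socket:[...] links under /proc/<pid>/fd.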
That’s a connection to the Oracle database. But is it the stuck read(6 we saw via strace? At this
point we don’t know. In fact, we don’t even know what else the process has open that it could be reading.
$ ls -l /proc/21129/fd
total 0
lr-x------ 1 inv users 64 Sep 13 09:16 0 -> pipe:[4230561338]
l-wx------ 1 inv users 64 Sep 13 09:16 1 -> /web/imageserver/log/garbagecollector.stdout
l-wx------ 1 inv users 64 Sep 13 09:16 2 -> /web/imageserver/log/garbagecollector.stdout
lr-x------ 1 inv users 64 Sep 13 09:16 3 -> /web/imageserver/data/garbagecollector.1.txt
lrwx------ 1 inv users 64 Sep 13 09:16 5 -> socket:[4231677600]   <- socket ID
lrwx------ 1 inv users 64 Sep 13 09:16 6 -> socket:[4231677602]   <- socket ID
The above output tells us that the process has 6 file descriptors open, and that descriptor 6, the one it
is reading from in the strace output, is a socket. But is it the database connection socket or something else?
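To pull the socket ID for a single descriptor without reading the whole listing, the link target can be parsed directly. A small sketch, assuming the socket:[N] link format shown above; the function name socket_inode is made up for illustration.

```shell
# Extract the inode number from a /proc/<pid>/fd link target such as
# "socket:[4231677602]". Prints nothing if the target is not a socket.
socket_inode() {
    printf '%s\n' "$1" | sed -n 's/^socket:\[\([0-9][0-9]*\)\]$/\1/p'
}

# Against the live process in this example (PID 21129, descriptor 6):
#   socket_inode "$(readlink /proc/21129/fd/6)"
socket_inode 'socket:[4231677602]'   # prints 4231677602
```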
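Let’s convert the port number to hex so it can be looked up in the kernel’s socket table (/proc/net/tcp
lists ports in hexadecimal) and correlated with the socket ID.
The entry for that port contains 4231677602. That’s the socket ID, the same one that file descriptor 6
points to. That tells us that the process which is waiting on read(6 is waiting on a read from the database.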
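The lookup can be sketched as follows. This is an approximation of the step, not the original command: the function names are made up, and it assumes the usual /proc/net/tcp layout in which the second and third fields are the local and remote address:port in hex and the tenth field is the socket inode.

```shell
# Convert a decimal port to the 4-digit uppercase hex used in /proc/net/tcp.
port_to_hex() {
    printf '%04X\n' "$1"
}

# Print the inode (10th field) of every table entry whose local or remote
# address ends in the given port. The table is read from stdin, so it can be
# fed /proc/net/tcp or /proc/net/tcp6. The header line is skipped.
inodes_for_port() {
    hex=$(port_to_hex "$1")
    awk -v p=":$hex" '
        NR > 1 && (substr($2, length($2) - 4) == p ||
                   substr($3, length($3) - 4) == p) { print $10 }
    '
}

port_to_hex 45950    # prints B37E
# On the affected machine:
#   inodes_for_port 45950 < /proc/net/tcp
```

If the inode printed for the database port matches the number in the socket:[...] link for descriptor 6, the blocked read is on the database connection.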
That gives us enough information to go talk to the database administrators and figure out why the
database call is taking forever.
This was written on a best-effort basis. New ideas and suggestions for further tweaks and
improvements are welcome.