Dealing with Runaway Processes
There is no official definition of a “runaway process”. Generally, it is a process that ignores its scheduled priority. It can also be a process that enters an infinite loop. Or it can be a process that spawns a large number of new processes, causing system overflow.
A runaway process is not always a process that acts in an unexpected fashion. A more likely cause for such behavior is unexpected input or volume of input. The exact cause can be identified later based on the application log data. The purpose of the procedure described here is to help you catch potential runaway processes.
The major identifying characteristic of a runaway process is very high CPU utilization over an extended period of time. The script below looks at the CPU time of particularly busy processes and compares it to the time elapsed since those processes were started. Preference is given to processes with the highest ratio of CPU time to elapsed time.
A ratio greater than one indicates that the process is using more than one CPU core. Your options may include modifying application settings, adding CPU cores (on a VM), renicing the process, or setting CPU affinity (see below).
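To make the ratio concrete, here is a minimal sketch of the arithmetic, assuming time values in the [[DD-]HH:]MM:SS form that ps prints. The function names to_seconds, cpu_ratio and ratio_for_pid are made up for this illustration and are not part of the runaways script.

```shell
# Convert a ps time value ([[DD-]HH:]MM:SS) to seconds.
to_seconds() {
  echo "$1" | awk -F'[-:]' '
    NF == 2 { print $1*60 + $2 }
    NF == 3 { print ($1*60 + $2)*60 + $3 }
    NF == 4 { print (($1*24 + $2)*60 + $3)*60 + $4 }'
}

# Ratio of CPU seconds to elapsed seconds, two decimals.
# Prints 0.00 when no time has elapsed yet, to avoid dividing by zero.
cpu_ratio() {
  awk -v c="$1" -v e="$2" 'BEGIN { printf "%.2f\n", (e > 0 ? c / e : 0) }'
}

# Ratio for a single PID, using the cputime and etime columns of ps.
ratio_for_pid() {
  local pid=$1 cpu ela
  cpu=$(to_seconds "$(ps -o cputime= -p "${pid}" | tr -d ' ')")
  ela=$(to_seconds "$(ps -o etime= -p "${pid}" | tr -d ' ')")
  cpu_ratio "${cpu}" "${ela}"
}
```

For instance, a process that has accumulated 02:00:00 of CPU time in 01:00:00 of elapsed time has a ratio of 2.00, which means it has kept roughly two cores busy since it started.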
As a starting point, you may want to try “htop” (yum -y install htop). Sort by CPU% and see how long the busiest processes have been running. You can also accomplish this with “ps”:
ps -eLfww | grep -E "\s[0-9]{1,2}-[0-9]{2}:[0-9]{2}:[0-9]{2}\s" | sort -k9,9n
In the end, identifying a “runaway process” is left to your best judgment. There is no rule of thumb: you need a good understanding of your application, and the script below can only serve as a general guide.
Don’t rush to kill a suspected runaway process, unless you already know it is a problem. Try lowering its runtime priority with “renice” or confining it to particular CPU cores using the taskset command.
Example:
taskset -cp 30-31 ${pid}
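This will restrict process ${pid} to cores 30 and 31. Keep in mind that the command changes only the task whose ID you pass: other threads already running under ${pid} are unaffected. To set CPU affinity for them as well, run the taskset command for each thread ID in /proc/${pid}/task/* (or use taskset -a -cp). Child processes that already exist are also unaffected and need their own taskset call; only children started after the change inherit the new affinity.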
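The loop over /proc/${pid}/task/* can be sketched as follows. The helper names thread_ids and pin_all_threads are hypothetical, and the sketch assumes a Linux /proc filesystem and the taskset command from util-linux.

```shell
# List the thread IDs (TIDs) of a process, one per line, from /proc/<pid>/task.
thread_ids() {
  local pid=$1 dir
  for dir in /proc/"${pid}"/task/*
  do
    if [ -d "${dir}" ]
    then
      basename "${dir}"
    fi
  done
}

# Restrict every existing thread of a process to the given CPU list.
pin_all_threads() {
  local cpus=$1 pid=$2 tid
  for tid in $(thread_ids "${pid}")
  do
    taskset -cp "${cpus}" "${tid}"
  done
}

# Usage (12345 is a placeholder PID):
#   pin_all_threads 30-31 12345
```

Threads created after the loop runs inherit the affinity of the thread that creates them, so one pass is normally enough unless the process is starting threads while the loop runs.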
You can obtain the number of logical CPUs by listing the processor entries in the /proc/cpuinfo file:
grep processor /proc/cpuinfo
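If you only need the count, or want to build a CPU list for taskset from it, something along these lines may help. Both function names are hypothetical. The sketch assumes CPUs are numbered from 0 without gaps, which is usual but not guaranteed.

```shell
# Count the logical CPUs listed in a cpuinfo-style file (default: /proc/cpuinfo).
cpu_count() {
  grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Print a taskset-style range covering the highest-numbered <want> CPUs
# out of <total>, assuming CPUs are numbered 0..total-1.
highest_cpus() {
  local want=$1 total=$2
  echo "$(( total - want ))-$(( total - 1 ))"
}

# Usage:
#   taskset -cp "$(highest_cpus 2 "$(cpu_count)")" ${pid}
```

On a 32-CPU machine, highest_cpus 2 32 yields 30-31, the range used in the taskset example above.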
Alternatively, you can set CPU affinity for all processes running under a particular UID. You can also start a process with CPU affinity defined beforehand.
The advantage of this approach is that any child process spawned under the parent PID will respect the parent’s CPU affinity setting. This can be easily added to the /etc/init.d startup script for the particular application. The same is true for setting the processes’ nice level.
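Both approaches can be sketched with taskset, nice and pgrep. The function names start_pinned and pin_user, and the RUN preview switch, are assumptions made for this example. Adapt them to your own startup script.

```shell
# Set RUN=echo to print the commands instead of executing them.
RUN=${RUN:-}

# Start a command pinned to a CPU list and with a given nice level.
# Children of the command inherit both settings.
start_pinned() {
  local cpus=$1 level=$2
  shift 2
  ${RUN} taskset -c "${cpus}" nice -n "${level}" "$@"
}

# Re-pin every running process (and all of its threads) owned by a user.
pin_user() {
  local cpus=$1 user=$2 pid
  for pid in $(pgrep -u "${user}")
  do
    ${RUN} taskset -a -cp "${cpus}" "${pid}"
  done
}

# Usage (the path and username are placeholders):
#   start_pinned 30-31 10 /path/to/application
#   pin_user 30-31 appuser
```

Running with RUN=echo first is a cheap way to check the PID list and CPU range before changing anything on a production host.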
Save the script below as /var/adm/bin/runaways.sh, make it executable by root, and create a soft link:
ln -s /var/adm/bin/runaways.sh /usr/bin/runaways
The basic syntax is:
runaways | sort -rn | more
The options are as follows:
runaways [OPTION]
  -a                         Look at all processes by all users
  -u <username1,username2>   Exclude listed system UIDs
  -l                         Exclude all UIDs below 100 (default)
The output fields are:
1 - CPU time/Elapsed time (the higher the worse)
2 - CPU time in seconds
3 - Elapsed time in seconds
4 - CPU %
5 - Memory %
6 - CPU time
7 - Elapsed time
8 - Start time
9 - PID
10 - State
11 - UID
12 - Username
13 - Command
14 - Command with path and options
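Given the field layout above, the output is easy to post-process. The filter below is only a sketch: flag_runaways is a hypothetical name, and the default threshold of 0.50 is an arbitrary assumption, not a recommendation.

```shell
# Print ratio, PID, username and command for rows whose ratio (field 1)
# meets a threshold. Reads runaways output on stdin.
flag_runaways() {
  awk -v min="${1:-0.50}" '$1 + 0 >= min + 0 { print $1, $9, $12, $13 }'
}

# Usage:
#   runaways -a | sort -rn | flag_runaways 1.00
```

A threshold of 1.00 keeps only processes that have averaged at least one full core since they started.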
And here’s the script:
#!/bin/bash
#
#                       |
#                   ___/"\___
#         __________/ o \__________
#           (I) (G) \___/ (O) (R)
#                  Igor Os
#               krazyworks.com
# ----------------------------------------------------------------------------
# Identify potential runaway processes
#
# There is no official definition of a runaway process. Generally, it is a
# process that ignores its scheduled priority. It can also be a process that
# enters an infinite loop. Or it can be a process that spawns a large number
# of new processes, causing system overflow.
#
# A runaway process is not always a process that acts in an unexpected
# fashion. A more likely cause for such behavior is unexpected input or volume
# of input. The exact cause will be identified later based on the application
# log data. The purpose of this script is to help you catch potential runaway
# processes.
#
# The major identifying characteristic of a runaway process is very high CPU
# utilization over an extended period of time. This script looks at the CPU
# time of a particularly busy process and compares it to the time elapsed
# since the process was started. Preference is given to processes with the
# highest CPU time/Elapsed time ratio.
#
# Don't rush to kill a suspected runaway process, unless you already know it
# is a problem. Try lowering its runtime priority with "renice" or confining
# it to particular CPU cores using the "taskset" command.
#
# Examples:
#
# taskset -cp 30-31 ${pid}
# This will restrict process ${pid} to cores 30 and 31. Keep in mind that
# other threads already running under ${pid} will be unaffected by this
# change. If you want to set CPU affinity for them as well, you will need to
# run the taskset command for each thread ID in /proc/${pid}/task/*
#
# ----------------------------------------------------------------------------

usage() {
cat << EOF
runaways [OPTION]
  -a                         Look at all processes by all users
  -u <username1,username2>   Exclude listed system UIDs
  -l                         Exclude all UIDs below 100 (default)
EOF
exit 0
}

# Take the option letter from the first argument; default to -l
if [ "$1" ]
then
  o=$(echo "$1" | grep -Eo "[a-z]{1}")
else
  o=l
fi

# -u requires a comma-separated list of valid usernames as the second argument
if [ "${o}" == "u" ]
then
  if [ ! "$2" ]
  then
    usage
  else
    userlist="$2"
    for i in $(echo "${userlist}" | sed 's/,/ /g')
    do
      if [ "$(grep -c "^${i}:" /etc/passwd)" -lt 1 ]
      then
        echo "Specified username ${i} must be a valid user. Exiting..."
        exit 1
      fi
    done
  fi
fi

configure() {
  n=10    # number of busiest processes to report
  fields="%cpu,%mem,cputime,etime,stime,pid,state,uid,user,comm,command"
}

# Convert [[DD-]HH:]MM:SS read from stdin to seconds
tosec() {
  awk -F ':' '{
    if (NF == 2) {
      printf "%d ", $1*60 + $2
    } else if (NF == 3) {
      if (split($1, a, "-") == 2) {
        printf "%d ", ((a[1]*24 + a[2])*60 + $2)*60 + $3
      } else {
        printf "%d ", ($1*60 + $2)*60 + $3
      }
    }
  }'
}
export -f tosec

# Prefix each ps line with CPU time and elapsed time in seconds.
# awk's system() runs /bin/sh, so the exported tosec function is visible
# only where /bin/sh is bash.
add_seconds() {
  awk '{system("echo "$3 "| tosec"); system("echo "$4 "| tosec"); printf("%s \n",$0)}' | sort -rn -k1
}

# Prefix each line with the CPU time/Elapsed time ratio
add_ratio() {
  while read -r line
  do
    a=$(echo "${line}" | awk '{print $1}')
    b=$(echo "${line}" | awk '{print $2}')
    if [ "${b}" -gt 0 ] 2>/dev/null
    then
      r=$(echo "scale=2;${a}/${b}" | bc -l | awk '{printf "%.2f", $0}')
    else
      r="0.00"    # no elapsed time yet; avoid dividing by zero
    fi
    echo "${r} ${line}"
  done
}

all_processes() {
  ps -e --no-headers -o "${fields}" --sort -%cpu 2>/dev/null | head -n "${n}" | add_seconds | add_ratio
}

exclude_specific_users() {
  ps -u "${userlist}" --deselect --no-headers -o "${fields}" --sort -%cpu 2>/dev/null | head -n "${n}" | add_seconds | add_ratio
}

exclude_low_uids() {
  lowuids=$(grep -E ":x:[0-9]{1,2}:[0-9]" /etc/passwd | awk -F':' '{printf "%s,", $1}' | sed 's/,$//g')
  ps -u "${lowuids}" --deselect --no-headers -o "${fields}" --sort -%cpu 2>/dev/null | head -n "${n}" | add_seconds | add_ratio
}

# RUNTIME

configure

case ${o} in
  a) all_processes ;;
  u) exclude_specific_users ;;
  l) exclude_low_uids ;;
  *) usage ;;
esac