A Script to Kill top-CPU Processes: chk_cpu.sh
Sandro Tosi, 02 April 2006
As one of my first tasks where I work, I was asked to write
a script that kills the processes using the most CPU.
The idea behind this request is that, on the servers where the
script runs, no process should run for a long time: they provide
reverse-proxy or form services, so there are many processes, but each
should terminate within a few minutes. Sometimes a process goes crazy and
starts using a full CPU for nothing...
As time goes on, the number of such processes grows
and the server becomes unresponsive. So we decided to kill such
processes; the only restriction is never to kill a root process (in
the first versions, the script sometimes tried
to kill kswapd...).
1. How the script works
The idea of this script is really simple, even if its code is not! ;-) Every
time it runs, it takes a snapshot of the running processes through top; then it
compares this snapshot with the previous one and, for each process (matched
by PID) present in both snapshots, calculates the CPU time it has used
since the previous run: if this time is greater than the given threshold,
the script will kill it.
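The comparison step can be sketched in isolation as follows. This is only an illustration, not the script itself: the two snapshot files, their sample content and the helper names to_seconds and over_threshold are made up for the example, and the snapshots are reduced to two columns (PID and cumulative CPU time).

```shell
#!/bin/sh
# Sketch: compare two snapshots and report the PIDs whose CPU time grew
# by more than THRESHOLD seconds. All sample data below is invented.
THRESHOLD=900
WORK_DIR=$(mktemp -d)

# Hypothetical snapshots: PID and cumulative CPU time (MM:SS.hh)
cat > "$WORK_DIR/before.txt" <<'EOF'
101 0:10.00
202 5:00.00
303 1:00.50
EOF
cat > "$WORK_DIR/now.txt" <<'EOF'
101 0:12.00
202 21:30.00
404 0:01.00
EOF

# Convert MM:SS.hh to whole seconds (hundredths are dropped)
to_seconds() {
    echo "$1" | awk -F'[:.]' '{ print 60 * $1 + $2 }'
}

over_threshold() {
    while read -r pid time_now; do
        time_before=$(awk -v p="$pid" '$1 == p { print $2 }' "$WORK_DIR/before.txt")
        # Only PIDs present in both snapshots are compared in this sketch
        [ -n "$time_before" ] || continue
        delta=$(( $(to_seconds "$time_now") - $(to_seconds "$time_before") ))
        if [ "$delta" -gt "$THRESHOLD" ]; then
            echo "$pid $delta"
        fi
    done < "$WORK_DIR/now.txt"
}

RESULT=$(over_threshold)
echo "$RESULT"
rm -rf "$WORK_DIR"
```

With this sample data only PID 202 is reported, with a delta of 990 seconds (1290 - 300); PID 404 is skipped because it is missing from the previous snapshot.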
There are two running modes: with history or without. If HISTORY is
set to 1, the directory START_DIR will
contain one file for every run of the script (this could eventually
fill up the filesystem); if it is set to 0, that directory will contain
only the current and the previous top snapshots.
To be sure not to kill a legitimately CPU-intensive process, we give it
a chance to terminate by working in two passes: the first time we notice that
a process is running over the given threshold, we put it in a PID list (a file)
and stop there; before killing a process, we check whether its PID is already
in that list: if so, we kill it; otherwise we put that PID in the list, and so on.
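The two-pass rule can be sketched like this. It is a simplified illustration under stated assumptions: the function name run_check and the PIDs are hypothetical, the lists hold bare PIDs rather than whole top rows, and nothing is actually killed.

```shell
#!/bin/sh
# Sketch of the two-pass rule: a PID is reported for killing only when it
# was already flagged by the previous run. No process is really killed.
WORK_DIR=$(mktemp -d)
LIST_BFR="$WORK_DIR/proc_list_bfr"
LIST_NEW="$WORK_DIR/proc_list_new"
: > "$LIST_NEW"

# One run of the check; the arguments are the PIDs found over the threshold
run_check() {
    # The list written by the previous run becomes the reference list
    mv "$LIST_NEW" "$LIST_BFR"
    : > "$LIST_NEW"
    for pid in "$@"; do
        if grep -qx "$pid" "$LIST_BFR"; then
            echo "would kill $pid"
        else
            echo "$pid" >> "$LIST_NEW"
            echo "flagged $pid"
        fi
    done
}

FIRST=$(run_check 4242)        # first run: 4242 is only flagged
SECOND=$(run_check 4242 5151)  # second run: 4242 is selected, 5151 is flagged
echo "$FIRST"
echo "$SECOND"
rm -rf "$WORK_DIR"
```

A process that drops back under the threshold for one run disappears from the list, so it has to start over before it can be selected again.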
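We have scheduled the script every 20 minutes, with a threshold of 15
minutes. Given the two-pass rule described above, a process is killed only if it
exceeds the threshold in two consecutive runs, that is, if it uses more than
30 minutes of CPU over 40 minutes.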
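The arithmetic behind these figures can be written down as a tiny sketch. The variable names are made up for the example, and the crontab line in the comment is only an assumption about how such a schedule might be configured (the path is hypothetical).

```shell
#!/bin/sh
# Worked arithmetic for the schedule described above (values from the text).
# A crontab entry for this schedule might look like (hypothetical path):
#   */20 * * * * /path/to/chk_cpu.sh
INTERVAL_MIN=20      # minutes between two runs
THRESHOLD_MIN=15     # CPU minutes allowed within one interval
RUNS_BEFORE_KILL=2   # flagged on one run, killed on the next

MIN_CPU=$(( THRESHOLD_MIN * RUNS_BEFORE_KILL ))
WINDOW=$(( INTERVAL_MIN * RUNS_BEFORE_KILL ))
echo "killed only after more than ${MIN_CPU} CPU minutes within ${WINDOW} minutes"
```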
I tried to write the script to be self-explanatory, but if you have
any questions, just ask ;-)
2. Script code
Below you can find the script code (available for
download here):
#!/bin/bash
#####################################################################
# Check CPU Script (chk_cpu.sh)                                     #
#                                                                   #
# Author:           Sandro Tosi                                     #
# Created on:       2004-07-21                                      #
# Last modified on: 2006-03-16                                      #
#                                                                   #
# Mission: Kill each process (not owned by root) that exceeds a     #
#          given time threshold. For each killed process an email   #
#          is sent to the given recipients.                         #
#          It is intended to be scheduled every x minutes: if in    #
#          this time frame a process takes more than y seconds of   #
#          CPU and it was not started by root, it will be killed.   #
#          Note: y < x, otherwise the script won't work...          #
#####################################################################
########################################################################
# Given a time from a top line, returns the value converted in seconds #
########################################################################
function converttime()
{
    if [ -n "$1" ]
    then
        echo $1 | awk -F "." '{ print $1 }' | awk -F ":" '{ min=$1; sec=$2; print 60*min+sec }'
    else
        echo "0"
    fi
}
###############
# Main script #
###############

# Should solve issue due to SLES 8 upgrade, 2006-03-03
export TERM=vt100
# Configuration:

# Base directory
START_DIR='/logs/chk_cpu'
TSTAMP=`date +"%y%m%d%H%M%S"`

# Name for the file of previous run
AWK_BEFORE=$START_DIR/top_before.txt

# Name for the file of current run
AWK_NOW=$START_DIR/top_now.txt

# Used when HISTORY=1
TOP_CURRENT=$START_DIR/top_$TSTAMP.txt

# top rows to skip (remove n lines from top output)
# to know the exact number, execute top -b -n 1 | head -n 30
# and count the lines before the first process. PAY ATTENTION
# while counting lines: taking 6 instead of 5 will remove the
# top CPU-crunching process...
TOPROWSTOSKIP=5

# Time threshold, in seconds
THRESHOLD=900

# CSV list of recipients of notification emails
RECIPIENTS='<set here the list of notified people>'

# Used to give the process a chance to end... it will be killed
# the second time it is selected from the running processes
PROC_LIST_BFR=$START_DIR/proc_list_bfr
PROC_LIST_NEW=$START_DIR/proc_list_new

# HISTORY=1: START_DIR will hold the whole history of previous top results
# HISTORY=0: START_DIR will hold only the AWK_BEFORE and AWK_NOW files
HISTORY=0

# Here starts the real script code...

# All $TOP_CURRENT point to $AWK_NOW
if [ $HISTORY -eq 0 ] ; then
    TOP_CURRENT=$AWK_NOW
fi

# This is executed if $AWK_BEFORE does not exist (usually the first time the
# script is run)
if [ ! -e $AWK_BEFORE ] ; then
    top -b -n 1 | grep -v top | sed -e "1,$TOPROWSTOSKIP d" | colrm 40 44 > $TOP_CURRENT
    if [ $HISTORY -eq 1 ] ; then
        ln -sf $TOP_CURRENT $AWK_BEFORE
    elif [ $HISTORY -eq 0 ] ; then
        mv $TOP_CURRENT $AWK_BEFORE
    fi
    exit;
fi

# Used during normal execution
top -b -n 1 | grep -v top | sed -e "1,$TOPROWSTOSKIP d" | colrm 40 44 > $TOP_CURRENT

# All $TOP_CURRENT now links to $AWK_NOW
if [ $HISTORY -eq 1 ] ; then
    ln -sf $TOP_CURRENT $AWK_NOW
fi

# The list written by the previous run becomes the reference list
if [ -e $PROC_LIST_NEW ]
then
    mv $PROC_LIST_NEW $PROC_LIST_BFR
    touch $PROC_LIST_NEW
else
    touch $PROC_LIST_NEW
fi

if [ ! -e $PROC_LIST_BFR ]
then
    touch $PROC_LIST_BFR
fi

for process in `awk '{ print $1 }' $AWK_NOW` ; do
    time_now=`grep "^[ ]*$process " $AWK_NOW | awk '{ print $10 }'` ;
    time_before=`grep "^[ ]*$process " $AWK_BEFORE | awk '{ print $10 }'` ;
    sec_tn=`converttime $time_now`;
    sec_tb=`converttime $time_before`;
    delta_t=`echo $sec_tn" - "$sec_tb | bc `;
    if [ $delta_t -gt $THRESHOLD ] ; then
        # Was this process already selected during the previous run?
        IS_IN_LOOP=`grep "^[ ]*$process " $PROC_LIST_BFR`
        if [ -z "$IS_IN_LOOP" ]
        then
            # First time over the threshold: only record it
            grep "^[ ]*$process " $AWK_NOW >> $PROC_LIST_NEW
        else
            CHK_ROOT=`grep "^[ ]*$process " $AWK_NOW | awk '{ print $2 }'`
            if [ "$CHK_ROOT" = "root" ]
            then
                # Never kill a root process, just warn
                MSG="Warning!!"
            else
                kill $process
                MSG="KILLED"
            fi
            proc_name=`grep "^[ ]*$process " $AWK_NOW | awk '{ print $11 }'`
            mail -s "`hostname` CPU USAGE ALERT: $proc_name $MSG " $RECIPIENTS <<EOF
CPU usage of process $proc_name exceeded the threshold.


Last 'top' row for this process was:

 PID USER PR NI VIRT RES SHR U %MEM TIME+ COMMAND
`grep "^[ ]*$process " $AWK_NOW`
EOF
        fi
    fi
done

# The current snapshot becomes the previous one for the next run
if [ $HISTORY -eq 1 ] ; then
    ln -sf $TOP_CURRENT $AWK_BEFORE
elif [ $HISTORY -eq 0 ] ; then
    mv $TOP_CURRENT $AWK_BEFORE
fi
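To check the converttime helper on its own, it can be copied into a small test script like the one below. This is only a sketch: the function is rewritten with POSIX function syntax, and the sample values assume that the time column has the form minutes:seconds.hundredths, which is what the two awk calls imply.

```shell
#!/bin/sh
# converttime as in the script, written with POSIX function syntax
converttime() {
    if [ -n "$1" ]; then
        echo "$1" | awk -F "." '{ print $1 }' | awk -F ":" '{ min=$1; sec=$2; print 60*min+sec }'
    else
        echo "0"
    fi
}

T1=$(converttime "12:34.56")   # 12 minutes and 34 seconds
T2=$(converttime "0:05.10")    # the part after the dot is discarded
T3=$(converttime "")           # a missing value counts as zero
echo "$T1 $T2 $T3"
```

The script prints 754 5 0. The third case is what happens in the main loop when a PID is not found in the previous snapshot: its previous time is taken as zero.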