A Script to Kill top-CPU Processes: chk_cpu.sh


 Sandro Tosi, 02 April 2006


As one of my first tasks at my current job, I was asked to write a script that kills the processes using the most CPU.

The idea behind this request is that no process should run for a long time on the servers where the script runs: they provide reverse proxy or form services, so there are many processes, but each one should terminate within a few minutes. Sometimes, though, a process goes crazy and starts using a full CPU for nothing...

As time goes on, the number of such processes grows and the server becomes unresponsive. So we decided to kill them; the only restriction is never to kill a root process (in the first versions, the script sometimes tried to kill kswapd...).

1. How the script works

The script is really simple, even if its code may not look that way! ;-) Every time it runs, it takes a snapshot of the running processes with top; then it compares this snapshot with the previous one and, for each process (matched by PID) present in both snapshots, calculates the CPU time it used between the two runs: if this time is greater than the given threshold, the script will kill it (after the grace period described below).
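To make the comparison concrete, here is a minimal sketch of the per-PID delta computation, run on two made-up snapshot fragments rather than on real top output. The helper names to_seconds and cpu_delta, and the simplified two-column "PID TIME+" layout, are assumptions for illustration only; the actual script reads the full top rows.

```shell
#!/bin/sh
# Illustrative sketch only: to_seconds and cpu_delta are hypothetical helpers,
# and the snapshots use a simplified "PID TIME+" layout.

# Convert a top-style TIME+ value (MM:SS.hh) to whole seconds; empty input is 0.
to_seconds() {
    if [ -n "$1" ]; then
        echo "$1" | awk -F '[:.]' '{ print 60 * $1 + $2 }'
    else
        echo 0
    fi
}

# Print the CPU seconds a PID gained between two snapshot files.
# Usage: cpu_delta PID BEFORE_FILE NOW_FILE
cpu_delta() {
    before=$(awk -v pid="$1" '$1 == pid { print $2 }' "$2")
    now=$(awk -v pid="$1" '$1 == pid { print $2 }' "$3")
    echo $(( $(to_seconds "$now") - $(to_seconds "$before") ))
}

# Made-up sample data: PID 4242 burned 16 minutes of CPU between the two runs.
SNAP_DIR=$(mktemp -d)
printf '4242 1:05.10\n4300 0:02.00\n' > "$SNAP_DIR/before.txt"
printf '4242 17:05.30\n4300 0:03.50\n' > "$SNAP_DIR/now.txt"

cpu_delta 4242 "$SNAP_DIR/before.txt" "$SNAP_DIR/now.txt"   # 960, above a 900 s threshold
cpu_delta 4300 "$SNAP_DIR/before.txt" "$SNAP_DIR/now.txt"   # 1, far below it

rm -rf "$SNAP_DIR"
```

A PID that is missing from the earlier snapshot yields an empty time, which counts as 0 seconds, just as converttime does in the script.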

There are two running modes, historical or not. If HISTORY is set to 1, a new file is written to the START_DIR directory every time the script runs (this could eventually fill the filesystem); if it is set to 0, that directory contains only the current and the previous top snapshots.

To avoid killing a process that is legitimately CPU-intensive, we give it two runs to terminate: the first time we notice a process running over the given threshold, we put it in a PID list (a file) and stop there; before killing a process we check whether its PID is already in that list: if so we kill it, otherwise we add the PID to the list, and so on.
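The two-run rule can be sketched in isolation as follows. This is a simplified illustration, not the script itself: the function name decide and the one-PID-per-line list files are assumptions, and it only prints the decision instead of killing anything.

```shell
#!/bin/sh
# Illustrative sketch only: decide is a hypothetical helper and the list
# files hold one PID per line (the real script stores whole top rows).

# Usage: decide PID PREVIOUS_LIST NEW_LIST
# Prints "kill" if PID was already flagged in the previous run,
# otherwise records it in NEW_LIST and prints "grace".
decide() {
    if grep -qx "$1" "$2"; then
        echo kill
    else
        echo "$1" >> "$3"
        echo grace
    fi
}

LIST_DIR=$(mktemp -d)
: > "$LIST_DIR/bfr"    # flagged during the previous run (empty at first)
: > "$LIST_DIR/new"    # flagged during this run

decide 4242 "$LIST_DIR/bfr" "$LIST_DIR/new"    # first offence: grace

# Next run: the new list becomes the previous one, as the script does with mv.
mv "$LIST_DIR/new" "$LIST_DIR/bfr"
: > "$LIST_DIR/new"

decide 4242 "$LIST_DIR/bfr" "$LIST_DIR/new"    # second offence in a row: kill

rm -rf "$LIST_DIR"
```

Because the list is rotated on every run, a process that drops back under the threshold for one run starts again from a clean slate.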

We have scheduled the script every 20 minutes, with a threshold of 15 minutes. Given what was said above, a process is killed only if it uses more than 15 minutes of CPU in each of two consecutive runs, that is, more than 30 minutes out of 40.

I tried to write the script to be self-explanatory, but if you have any questions, just ask ;-)
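As a quick check of that arithmetic, the snippet below derives the two figures from the schedule and the threshold. The variable names are made up for this example; only the 20-minute schedule, the 15-minute threshold and the two-run rule come from the setup described here.

```shell
#!/bin/sh
# Illustrative arithmetic only; the variable names are hypothetical.
INTERVAL_MIN=20     # the script is scheduled every 20 minutes
THRESHOLD_MIN=15    # CPU time allowed between two consecutive runs
RUNS_OVER=2         # consecutive runs over the threshold before a kill

min_cpu=$(( THRESHOLD_MIN * RUNS_OVER ))
window=$(( INTERVAL_MIN * RUNS_OVER ))
echo "killed only after more than $min_cpu CPU minutes within $window minutes"
```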

2. Script code

Below you can find the script code (available for download here):

#!/bin/sh
#####################################################################
# Check CPU Script (chk_cpu.sh)                                     #
#                                                                   #
# Author: Sandro Tosi                                               #
# Created on: 2004-07-21                                            #
# Last modified on: 2006-03-16                                      #
#                                                                   #
# Mission: Kill each process (not owned by root) that exceeds a     #
#          given time threshold. For each killed process an email   #
#          is sent to the given recipients.                         #
#          It is intended to be scheduled every x minutes: if in    #
#          this time-frame a process takes more than y seconds of   #
#          CPU and it's not started from root, it will be killed.   #
#          Note: y < x otherwise the script won't work...           #
#####################################################################
 
 
########################################################################
# Given a time from a top line, returns the value converted in seconds #
########################################################################
converttime()
{
    if [ -n "$1" ]
    then
        echo "$1" | awk -F "." '{ print $1 }' | awk -F ":" '{ min=$1; sec=$2; print 60*min+sec }'
    else
        echo "0"
    fi
}
 
 
###############
# Main script #
###############
 
# Should solve issue due to SLES 8 upgrade, 2006-03-03
export TERM=vt100
 
 
# Configuration:
 
#  Base directory
START_DIR='/logs/chk_cpu'
TSTAMP=`date +"%y%m%d%H%M%S"`
#  Name for the file of previous run
AWK_BEFORE=$START_DIR/top_before.txt
#  Name for the file of current run
AWK_NOW=$START_DIR/top_now.txt
#  Used when HISTORY=1
TOP_CURRENT=$START_DIR/top_$TSTAMP.txt
#  top rows to skip (remove n lines from top output)
#  to find the exact number, execute top -b -n 1 | head -n 30
#  and count the lines before the first process. BE CAREFUL
#  while counting lines: skipping 6 instead of 5 will drop the
#  top CPU-crunching process...
TOPROWSTOSKIP=5
#  Time threshold, in seconds
THRESHOLD=900
#  CSV list of recipients of notification emails
RECIPIENTS='<set here the list of notified people>'
#  Used to give the process a chance to end... it will be killed
#  only the second time it is selected from the running processes
PROC_LIST_BFR=$START_DIR/proc_list_bfr
PROC_LIST_NEW=$START_DIR/proc_list_new
#  HISTORY=1: START_DIR keeps the whole history of previous top results
#  HISTORY=0: START_DIR holds only the AWK_BEFORE and AWK_NOW files
HISTORY=0
 
# Here starts the real script code...
 
# Without history, $TOP_CURRENT is simply $AWK_NOW
if [ $HISTORY -eq 0 ] ; then
  TOP_CURRENT=$AWK_NOW
fi
 
 
# This is executed if $AWK_BEFORE does not exist (usually the first time the script is run)
if [ ! -e $AWK_BEFORE ] ; then
  top -b -n 1 | grep -v top | sed -e "1,$TOPROWSTOSKIP d" | colrm 40 44 > $TOP_CURRENT
  if [ $HISTORY -eq 1 ] ; then
    ln -sf $TOP_CURRENT $AWK_BEFORE
  elif [ $HISTORY -eq 0 ] ; then
    mv $TOP_CURRENT $AWK_BEFORE
  fi
  exit;
fi
 
# Used during normal execution
top -b -n 1 | grep -v top | sed -e "1,$TOPROWSTOSKIP d" | colrm 40 44 > $TOP_CURRENT
 
# With history, $AWK_NOW is a symlink to $TOP_CURRENT
if [ $HISTORY -eq 1 ] ; then
  ln -sf $TOP_CURRENT $AWK_NOW
fi
 
 
if [ -e $PROC_LIST_NEW ]
then
    mv $PROC_LIST_NEW $PROC_LIST_BFR
    touch $PROC_LIST_NEW
else
    touch $PROC_LIST_NEW
fi
 
if [ ! -e $PROC_LIST_BFR ]
then
    touch $PROC_LIST_BFR
fi
 
for process in `awk '{ print $1 }' $AWK_NOW` ; do
    # TIME+ is the 10th column of the trimmed top row
    time_now=`grep "^ *$process " $AWK_NOW | awk '{ print $10 }'`
    time_before=`grep "^ *$process " $AWK_BEFORE | awk '{ print $10 }'`
    sec_tn=`converttime $time_now`
    sec_tb=`converttime $time_before`
    delta_t=`echo "$sec_tn - $sec_tb" | bc`
    if [ $delta_t -gt $THRESHOLD ] ; then
        IS_IN_LOOP=`grep "^ *$process " $PROC_LIST_BFR`
        if [ -z "$IS_IN_LOOP" ]
        then
            # First time over the threshold: remember it and give it one more run
            grep "^ *$process " $AWK_NOW >> $PROC_LIST_NEW
        else
            # Second time in a row: kill it, unless it is owned by root
            CHK_ROOT=`grep "^ *$process " $AWK_NOW | awk '{ print $2 }'`
            if [ "$CHK_ROOT" = "root" ]
            then
                MSG="Warning!!"
            else
                kill $process
                MSG="KILLED"
            fi
            proc_name=`grep "^ *$process " $AWK_NOW | awk '{ print $11 }'`
            mail -s "`hostname` CPU USAGE ALERT: $proc_name $MSG " $RECIPIENTS <<EOF
CPU usage of process $proc_name exceeded the threshold.


Last 'top' row for this process was:

  PID USER      PR  NI  VIRT  RES  SHR U %MEM    TIME+  COMMAND
`grep "^ *$process " $AWK_NOW`
EOF
        fi
    fi
done
 
if [ $HISTORY -eq 1 ] ; then
  ln -sf $TOP_CURRENT $AWK_BEFORE
elif [ $HISTORY -eq 0 ] ; then
  mv $TOP_CURRENT $AWK_BEFORE
fi