Rebooting Servers in a Controlled Manner

This page contains a script that handles the rebooting and monitoring of servers on the platforms it recognizes (Solaris, Linux, and Windows). It performs the following:

Sends a reboot command to the target
Waits a specified amount of time for the server to go down
Waits until the RSCD Agent is back up and running, polling at a fixed interval (20 seconds by default) until a maximum wait time (300 seconds by default) is exceeded.

This script can be very useful in Batch Jobs where a series of events needs to be sequenced, and one of those events is a reboot. Simply sending a reboot command to the server is insufficient, since the RSCD Agent needs to be back up and running in order for further jobs to commence against the target.
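
Both wait steps listed above follow the same poll-with-timeout pattern. As a minimal sketch of that pattern in isolation (not the script's own code), the following uses a hypothetical helper named wait_until and a hypothetical stand-in probe named fake_agent_up, so it can be tried without touching a real server:

```shell
#!/bin/sh
# wait_until PROBE INTERVAL MAX_WAIT   (hypothetical helper, for illustration)
# Runs the command or function named by PROBE every INTERVAL seconds until it
# succeeds, or until MAX_WAIT seconds of sleeping have accumulated.
# Returns 0 on success, 1 on timeout.
wait_until ()
{
    probe=$1
    interval=$2
    max_wait=$3
    waited=0

    while ! "$probe"
    do
        if [ "$waited" -ge "$max_wait" ]
        then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Hypothetical stand-in for the agent check: succeeds on the third call.
attempts=0
fake_agent_up ()
{
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]
}

if wait_until fake_agent_up 1 10
then
    echo "agent answered after $attempts checks"
else
    echo "timed out after $attempts checks"
fi
```

In the real script, the probe is the agent check against the target host, and the interval and maximum wait correspond to the REBOOT_INTERVAL and MAX_REBOOT_TIME settings.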

Revision History

(1.0) Originally contributed to the BladeLogic Knowledge Base by Thomas Kraus
(1.1) Updated by Sean Berry to include the sleep statement inside the monitoring loop, timeouts set to 300s (5m), intervals set to 20s. Some debugging statements added.
(1.2) Updated by Bill Robinson to take reboot arguments for Solaris; changed /dev/null for WinNT

Implementing

1. In the Depot, create a new NSH Script using the code in the box below.
   Tip: Double-click within the box to select the contents of the script.
   Ensure you set the script type to "Execute separately against each host".
2. Create an NSH Script Job based on the NSH Script.
3. Practice executing this script in a controlled manner, using only a single host that is ready to be rebooted.

Script

#
# BladeLogic Multi-Platform Reboot And Monitoring Script
#    (1.0) Originally contributed to the BladeLogic Knowledge Base by Thomas Krause
#    (1.1) Updated by Sean Berry to include the sleep statement inside the monitoring loop,
#          timeouts set to 300s (5m), intervals set to 20s. Some debugging statements added.
#    (1.2) Updated by Bill Robinson to take reboot arguments for Solaris, Changed /dev/null for
#          WinNT
#
# Maximum time to wait to have the server
# go down. Not that reliable as we are only
# testing that the agent has gone down and
# not necessarily that the server has gone
# down. Also defined is the interval time
# between checks to see if the server is
# down.
#
MAX_SHUTDOWN_TIME=300
SHUTDOWN_INTERVAL=20

#
# Maximum amount of time we will wait to have
# the server come back up once we have detected
# that it has gone down. Also defined is the
# interval time between checks to see if the
# server is back up.
#
MAX_REBOOT_TIME=300
REBOOT_INTERVAL=20

# Any arguments passed to the script are treated as Solaris boot arguments
if [ $# -ge 1 ]
then
   echo "Accepting boot arguments for Solaris"
   BOOT_ARGS="$*"
   echo "Boot Args: $BOOT_ARGS"
fi

OS=`uname -s`
HOSTNAME=$NSH_RUNCMD_HOST
# The NSH_RUNCMD_HOST environment variable returns the FQDN, which is what we want

if [ "$OS" = "WindowsNT" ]
then
   DEVNULL=NUL
else
   DEVNULL=/dev/null
fi

# DEBUG=echo
DEBUG=false

if test -z "$HOSTNAME"
then
   echo "ERROR: NSH_RUNCMD_HOST is not set, so there is no target host to reboot." 1>&2
   exit 1
fi

# The working directory must be a //host/ style path
pwd | egrep -q '^//'

if [[ $? -ne 0 ]]
then
   print "ERROR: You must run this script using the \"runscript\" option." 1>&2
   exit 1
fi

# Have to be local so the uname -D command works properly
cd //@/

agent_up ()
{
   # Quiet variant, kept for reference:
   # uname -D //$1/ > $DEVNULL 2> $DEVNULL
   echo uname -D //$1/
   uname -D //$1/
   return $?
}

if agent_up $HOSTNAME
then
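   # XXX Sanity check of the sleep command. Only runs when DEBUG=echo, so
   # that a normal run is not delayed by the 10 second test sleep.
   if [ "$DEBUG" = "echo" ]
   then
      echo "testing sleep (`which sleep`) interval - should be 10 second delay"
      date
      sleep 10
      date
   fi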

   echo Rebooting server $HOSTNAME ...

   case "$OS" in
        SunOS)
            # Quote BOOT_ARGS so the test still works with several arguments
            if [ -z "$BOOT_ARGS" ]
            then
                nexec $HOSTNAME shutdown -i6 -y -g 0
            else
                nexec $HOSTNAME reboot -- $BOOT_ARGS
            fi
            ;;

        Linux)
            nexec $HOSTNAME shutdown -r now
            ;;

        WindowsNT)
            nexec $HOSTNAME reboot
            ;;

        *)
           echo "Unknown platform \"$OS\""
           exit 1
            ;;
   esac

   if test $? -ne 0
   then
       echo '***** Warning - Possible error in sending reboot request'
   fi

   #
   # Give the server a certain amount of time to kill the
   # agent and reboot
   #
   count=$SHUTDOWN_INTERVAL
    sleep $SHUTDOWN_INTERVAL

   while agent_up $HOSTNAME
   do
       echo `date` Agent still running ...
       count=`expr $count + $SHUTDOWN_INTERVAL`

       if test $count -gt $MAX_SHUTDOWN_TIME
       then
           echo "Reboot command sent but server not coming down"
           exit 1
       fi

        sleep $SHUTDOWN_INTERVAL
   done

   #
   # Now we know the agent is down and we are waiting for the
   # system to reboot. Give a bunch of time to come back up.
   #
   count=$REBOOT_INTERVAL
    sleep $REBOOT_INTERVAL

   while ! agent_up $HOSTNAME
   do
       echo `date` Agent still not up ...
       count=`expr $count + $REBOOT_INTERVAL`
        sleep $REBOOT_INTERVAL

       if test $count -gt $MAX_REBOOT_TIME
       then
           echo "Server has not come back up after $count seconds ..."
           exit 1
       fi
   done

   echo Server $HOSTNAME back up and running
else
   echo Agent currently not running
   exit 1
fi

exit 0
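
Before practicing against a real host, it can help to see which reboot command the script would choose for a given platform. The following is a minimal dry-run sketch that mirrors the case statement above without sending anything to a server. The function name reboot_command is hypothetical and is not part of the script; AIX is included only as an example of a platform the script does not recognize.

```shell
#!/bin/sh
# reboot_command OS [BOOT_ARGS]   (hypothetical helper, for illustration)
# Prints the command the script above would pass to nexec for the given
# `uname -s` value. Prints nothing and returns 1 for an unknown platform.
reboot_command ()
{
    os=$1
    boot_args=$2

    case "$os" in
        SunOS)
            if [ -z "$boot_args" ]
            then
                echo "shutdown -i6 -y -g 0"
            else
                echo "reboot -- $boot_args"
            fi
            ;;
        Linux)
            echo "shutdown -r now"
            ;;
        WindowsNT)
            echo "reboot"
            ;;
        *)
            return 1
            ;;
    esac
}

# Show the selection for each platform the script knows, plus one it does not.
for platform in SunOS Linux WindowsNT AIX
do
    if cmd=`reboot_command "$platform"`
    then
        echo "$platform: $cmd"
    else
        echo "$platform: unknown platform"
    fi
done
```

Only Solaris uses the boot arguments. On the other platforms, any arguments passed to the script are accepted but ignored.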
