
Rebooting Servers in a Controlled Manner


This page contains a script that is designed to handle the rebooting and monitoring of servers, regardless of OS. It performs the following:

  1. Sends a reboot command to the target
  2. Waits a specified amount of time for the server to go down
  3. Waits until the RSCD Agent is back up and running, checking at a fixed interval (in seconds) until a maximum wait time is exceeded.

This script can be very useful in Batch Jobs where a series of events needs to be sequenced, and one of those events is a reboot. Simply sending a reboot command to the server is insufficient, since the RSCD Agent needs to be back up and running in order for further jobs to commence against the target.
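
The wait-and-check pattern behind steps 2 and 3 can be sketched in isolation. This is a minimal illustration in plain POSIX shell, not the script from this page: the helper name wait_until and the stand-in check fake_agent_up are hypothetical, and a real check would query the RSCD Agent instead.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script): run a check
# command every INTERVAL seconds until it succeeds or MAX seconds pass.
# usage: wait_until MAX INTERVAL command [args...]
wait_until ()
{
    max=$1
    interval=$2
    shift 2
    elapsed=0

    until "$@"
    do
        if [ "$elapsed" -ge "$max" ]
        then
            return 1    # gave up: the check never succeeded in time
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}

# Stand-in for a real agent check: succeeds on the third call.
attempts=0
fake_agent_up ()
{
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]
}

if wait_until 5 1 fake_agent_up
then
    echo "agent up after $attempts checks"
else
    echo "timed out"
fi
```

Separating the timeout bookkeeping from the check itself makes it possible to rehearse the loop against a stub before pointing it at a host that will really be rebooted.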

Revision History

  • (1.0) Originally contributed to the BladeLogic Knowledge Base by Thomas Kraus
  • (1.1) Updated by Sean Berry to include the sleep statement inside the monitoring loop, timeouts set to 300s (5m), intervals set to 20s. Some debugging statements added.
  • (1.2) Updated by Bill Robinson to take reboot arguments for Solaris, Changed /dev/null for WinNT

Implementing

  1. In the Depot, create a new NSH Script using the code in the box below.
    1. Tip: Double-click within the box to select the contents of the script.
    2. Ensure you set the script type to Execute separately against each host
  2. Create an NSH Script Job based on the NSH Script.
  3. Practice executing this script in a controlled manner, using only a single host that is ready to be rebooted.

Script

#
#  BladeLogic Multi-Platform Reboot And Monitoring Script
#    (1.0) Originally contributed to the BladeLogic Knowledge Base by Thomas Krause
#    (1.1) Updated by Sean Berry to include the sleep statement inside the monitoring loop,
#          timeouts set to 300s (5m), intervals set to 20s.  Some debugging statements added.
#    (1.2) Updated by Bill Robinson to take reboot arguments for Solaris, Changed /dev/null for
#          WinNT
#
# Maximum time to wait to have the server
# go down. Not that reliable as we are only
# testing that the agent has gone down and
# not necessarily that the server has gone
# down. Also defined is the interval time
# between checks to see if the server is
# down.
#
MAX_SHUTDOWN_TIME=300
SHUTDOWN_INTERVAL=20

#
# Maximum amount of time we will wait to have
# the server come back up once we have detected
# that it has gone down.  Also defined is the
# interval time between checks to see if the
# server is back up.
#
MAX_REBOOT_TIME=300
REBOOT_INTERVAL=20

# Any script arguments are treated as Solaris boot arguments.
if [ $# -ge 1 ]
then
   echo "Accepting boot Arguments for Solaris"
   BOOT_ARGS="$*"
   echo "Boot Args: $BOOT_ARGS"
fi

OS=`uname -s`
HOSTNAME=$NSH_RUNCMD_HOST
# The NSH_RUNCMD_HOST envar returns the FQDN which is what we want

if [ "$OS" = "WindowsNT" ]
then
   DEVNULL=NUL
else
   DEVNULL=/dev/null
fi

# DEBUG=echo
DEBUG=false

if test -z "$HOSTNAME"
then
   echo Usage $0 hostname
   exit 1
fi

pwd | egrep -q ^//

if [[ $? -ne 0 ]]
then
   print "ERROR: You must run this script using the \"runscript\" option." 1>&2
   exit 1
fi

# Have to be local so the uname -D command works properly
cd //@/

# Returns 0 when the agent on host $1 responds, non-zero otherwise.
agent_up ()
{
#  uname -D //$1/ > $DEVNULL 2> $DEVNULL
   echo uname -D //$1/
   uname -D //$1/
   return $?
}

if agent_up $HOSTNAME
then
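   # Note: the backquoted commands below run even when DEBUG=false, so the
   # 10 second sleep always takes place before the reboot request is sent.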
  $DEBUG "testing sleep (`which sleep`) interval - should be 10 second delay"
  $DEBUG `date`
  $DEBUG `sleep 10`
  $DEBUG `date`

   echo Rebooting server $HOSTNAME ...

   case "$OS" in
        SunOS)
            if [ -z "$BOOT_ARGS" ]
            then
                nexec $HOSTNAME shutdown -i6 -y -g 0
            else
                nexec $HOSTNAME reboot -- $BOOT_ARGS
            fi
            ;;

        Linux)
            nexec $HOSTNAME shutdown -r now
            ;;

        WindowsNT)
            nexec $HOSTNAME reboot
            ;;

        *)
           echo "Unknown platform \"$OS\""
           exit 1
            ;;
   esac

   if test $? -ne 0
   then
       echo '***** Warning - Possible error in sending reboot request'
   fi

   #
   # Give the server a certain amount of time to kill the
   # agent and reboot
   #
   count=$SHUTDOWN_INTERVAL
    sleep $SHUTDOWN_INTERVAL

   while agent_up $HOSTNAME
   do
       echo `date` Agent still running ...
       count=`expr $count + $SHUTDOWN_INTERVAL`

       if test $count -gt $MAX_SHUTDOWN_TIME
       then
           echo "Reboot command sent but server not coming down"
           exit 1
       fi

        sleep $SHUTDOWN_INTERVAL
   done

   #
   # Now we know the agent is down and we are waiting for the
   # system to reboot. Give a bunch of time to come back up.
   #
   count=$REBOOT_INTERVAL
    sleep $REBOOT_INTERVAL

   while ! agent_up $HOSTNAME
   do
       echo `date` Agent still not up ...
       count=`expr $count + $REBOOT_INTERVAL`
        sleep $REBOOT_INTERVAL

       if test $count -gt $MAX_REBOOT_TIME
       then
           echo "Server has not come back up after more than $MAX_REBOOT_TIME seconds ..."
           exit 1
       fi
   done

   echo Server $HOSTNAME back up and running
else
   echo Agent currently not running
   exit 1
fi

exit 0
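
Before running the script against a Solaris target, the choice between the two SunOS commands can be rehearsed without rebooting anything. The following is a hedged dry-run sketch in plain POSIX shell: the function name solaris_reboot_cmd is hypothetical, the boot arguments -r -v are only illustrative, and the function prints the command instead of sending it through nexec.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not part of the original script): print the
# command the SunOS branch would send, based on whether any boot
# arguments were supplied.
solaris_reboot_cmd ()
{
    # Join all arguments into one string, as the script does for BOOT_ARGS.
    boot_args="$*"

    # Quoting keeps the emptiness test well-defined even when the
    # arguments contain several words.
    if [ -z "$boot_args" ]
    then
        printf '%s\n' "shutdown -i6 -y -g 0"
    else
        printf '%s\n' "reboot -- $boot_args"
    fi
}

solaris_reboot_cmd          # no boot arguments
solaris_reboot_cmd -r -v    # illustrative boot arguments
```

With no arguments the sketch prints the shutdown form, and with arguments it prints the reboot form followed by the arguments, which mirrors the branch structure of the SunOS case.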

 
