Rebooting Servers in a Controlled Manner

This topic was edited by a BMC Contributor and has not been approved.

This page contains a Network Shell (NSH) script that reboots a server and monitors it until it is back up. It handles Solaris, Linux, and Windows targets. It performs the following:

  1. Sends a reboot command to the target
  2. Waits a specified amount of time for the server to go down
  3. Waits until the RSCD Agent is back up and running, polling at a fixed interval (20 seconds by default) until a maximum wait time (300 seconds by default) is exceeded.

This script can be very useful in Batch Jobs where a series of events needs to be sequenced, and one of those events is a reboot. Simply sending a reboot command to the server is insufficient, since the RSCD Agent needs to be back up and running in order for further jobs to commence against the target.
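Steps 2 and 3 above are two uses of the same pattern: test a condition at a fixed interval until it holds or a time limit runs out. The following is a minimal sketch of that pattern in plain shell. The function name `poll_until` and its argument order are illustrative assumptions and do not appear in the script on this page.

```shell
# poll_until INTERVAL MAX_TIME COMMAND [ARGS...]
# (hypothetical helper, not part of the original script)
# Runs COMMAND every INTERVAL seconds until it succeeds, or gives up once
# MAX_TIME seconds of waiting have accumulated.
# INTERVAL should be greater than 0 in real use, otherwise the wait
# counter never advances.
# Returns 0 if COMMAND succeeded, 1 on timeout.
poll_until() {
    interval=$1
    max_time=$2
    shift 2
    waited=0
    until "$@"; do
        if [ "$waited" -ge "$max_time" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

With the defaults used on this page, waiting for the agent to return would look roughly like `poll_until 20 300 agent_up "$HOSTNAME"`, where `agent_up` is the check function defined in the script. Waiting for the agent to go down needs the inverse check, for example a small wrapper function that returns success when `agent_up` fails.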

Revision History

  • (1.0) Originally contributed to the BladeLogic Knowledge Base by Thomas Kraus
  • (1.1) Updated by Sean Berry to include the sleep statement inside the monitoring loop, timeouts set to 300s (5m), intervals set to 20s. Some debugging statements added.
  • (1.2) Updated by Bill Robinson to accept reboot arguments for Solaris and to change the /dev/null handling for WinNT

Implementing

  1. In the Depot, create a new NSH Script using the code in the box below.
    1. Tip: Double-click within the box to select the contents of the script.
    2. Ensure you set the script type to "Execute separately against each host".
  2. Create an NSH Script Job based on the NSH Script.
  3. Practice executing this script in a controlled manner, using only a single host that is ready to be rebooted.
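Before a first run against a real host, it can help to see which reboot command would be sent for each platform. The sketch below mirrors the `case` statement in the script on this page but only prints the command instead of sending it. The function name `reboot_command` is a hypothetical helper introduced here for illustration.

```shell
# reboot_command OS [BOOT_ARGS...]
# (hypothetical dry-run helper, not part of the original script)
# Prints the command the script would send for a given `uname -s` value.
# Returns 1 for a platform the script does not handle.
reboot_command() {
    os=$1
    shift
    case "$os" in
        SunOS)
            # Without boot arguments the script uses shutdown;
            # with them it uses reboot and passes them through.
            if [ $# -eq 0 ]; then
                echo "shutdown -i6 -y -g 0"
            else
                echo "reboot -- $*"
            fi
            ;;
        Linux)
            echo "shutdown -r now"
            ;;
        WindowsNT)
            echo "reboot"
            ;;
        *)
            echo "Unknown platform \"$os\"" 1>&2
            return 1
            ;;
    esac
}
```

For example, `reboot_command SunOS -r` prints `reboot -- -r`, while `reboot_command Linux` prints `shutdown -r now`.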

Script


#
#  BladeLogic Multi-Platform Reboot And Monitoring Script
#    (1.0) Originally contributed to the BladeLogic Knowledge Base by Thomas Kraus
#    (1.1) Updated by Sean Berry to include the sleep statement inside the monitoring loop,
#	 timeouts set to 300s (5m), intervals set to 20s.  Some debugging statements added.
#    (1.2) Updated by Bill Robinson to take reboot arguments for Solaris, Changed /dev/null for
#	WinNT
#
# Maximum time to wait to have the server
# go down. Not that reliable as we are only
# testing that the agent has gone down and
# not necessarily that the server has gone
# down. Also defined is the interval time
# between checks to see if the server is
# down.
#
MAX_SHUTDOWN_TIME=300
SHUTDOWN_INTERVAL=20
 
#
# Maximum amount of time we will wait to have
# the server come back up once we have detected
# that it has gone down.  Also defined is the
# interval time between checks to see if the
# server is back up.
#
MAX_REBOOT_TIME=300
REBOOT_INTERVAL=20
 
# Any script arguments are treated as Solaris boot arguments
if [ $# -ge 1 ]
then
    echo "Accepting boot arguments for Solaris"
    BOOT_ARGS="$*"
    echo "Boot Args: $BOOT_ARGS"
fi
 
OS=`uname -s`
HOSTNAME=$NSH_RUNCMD_HOST
# The NSH_RUNCMD_HOST environment variable returns the FQDN, which is what we want
 
if [ "$OS" = "WindowsNT" ]
then
    DEVNULL=NUL
else
    DEVNULL=/dev/null
fi
 
# DEBUG=echo
DEBUG=false
 
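# The target host comes from NSH_RUNCMD_HOST, not from the command line
if test -z "$HOSTNAME"
then
    echo "ERROR: NSH_RUNCMD_HOST is not set; no target host to reboot." 1>&2
    exit 1
fi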
 
pwd | egrep -q ^//
 
if [[ $? -ne 0 ]]
then
	print "ERROR: You must run this script using the \"runscript\" option." 1>&2
	exit 1
fi
 
# Have to be local so the uname -D command works properly
cd //@/
 
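# Returns 0 if "uname -D" against host $1 succeeds, which this script
# treats as "the RSCD Agent is up". Output is left visible for debugging;
# the commented-out line is the quiet variant.
agent_up ()
{
#    uname -D //$1/ > $DEVNULL 2> $DEVNULL
    echo uname -D //$1/
    uname -D //$1/
    return $?
}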
 
if agent_up $HOSTNAME
then
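    # Debug self-test of sleep. The backquoted commands are expanded before
    # $DEBUG runs, so the 10-second sleep happens even when DEBUG=false.
    $DEBUG "testing sleep (`which sleep`) interval - should be 10 second delay"
    $DEBUG `date`
    $DEBUG `sleep 10`
    $DEBUG `date`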
 
    echo Rebooting server $HOSTNAME ...
 
    case "$OS" in
        SunOS)
            if [ -z "$BOOT_ARGS" ]
            then
                nexec $HOSTNAME shutdown -i6 -y -g 0
            else
                nexec $HOSTNAME reboot -- $BOOT_ARGS
            fi
            ;;
 
        Linux)
            nexec $HOSTNAME shutdown -r now
            ;;
 
        WindowsNT)
            nexec $HOSTNAME reboot
            ;;
 
        *)
            echo "Unknown platform \"$OS\""
            exit 1
            ;;
    esac
 
    if test $? -ne 0
    then
        echo '***** Warning - Possible error in sending reboot request'
    fi
 
    #
    # Give the server a certain amount of time to kill the
    # agent and reboot
    #
    count=$SHUTDOWN_INTERVAL
    sleep $SHUTDOWN_INTERVAL
 
    while agent_up $HOSTNAME
    do
        echo `date` Agent still running ...
        count=`expr $count + $SHUTDOWN_INTERVAL`
 
        if test $count -gt $MAX_SHUTDOWN_TIME
        then
            echo "Reboot command sent but server not coming down"
            exit 1
        fi
 
        sleep $SHUTDOWN_INTERVAL
    done
 
    #
    # Now we know the agent is down and we are waiting for the
    # system to reboot. Give a bunch of time to come back up.
    #
    count=$REBOOT_INTERVAL
    sleep $REBOOT_INTERVAL
 
    while ! agent_up $HOSTNAME
    do
        echo `date` Agent still not up ...
        count=`expr $count + $REBOOT_INTERVAL`
        sleep $REBOOT_INTERVAL
 
        if test $count -gt $MAX_REBOOT_TIME
        then
            echo "Server has not come back up after more than $count seconds ..."
            exit 1
        fi
    done
 
    echo Server $HOSTNAME back up and running
else
    echo Agent currently not running
    exit 1
fi
 
exit 0
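
One detail of the script's debugging block is easy to miss: in a line such as `$DEBUG` followed by a backquoted `sleep 10`, the shell runs the backquoted command before it runs `$DEBUG`, so the delay occurs even when `DEBUG=false`. The sketch below shows one way to make such a self-test conditional. The names `DEBUG_ENABLED`, `debug`, and `sleep_self_test` are assumptions introduced for this example and are not part of the script above.

```shell
# Hypothetical debug switch: set to 1 to enable debug output.
DEBUG_ENABLED=0

# Prints its arguments, prefixed with "DEBUG:", only when debugging is on.
debug() {
    if [ "$DEBUG_ENABLED" = "1" ]; then
        echo "DEBUG: $*"
    fi
}

# sleep_self_test SECONDS
# Checks that sleep delays as expected, but only when debugging is on,
# so a normal run is not slowed down.
sleep_self_test() {
    if [ "$DEBUG_ENABLED" = "1" ]; then
        debug "before sleep: $(date)"
        sleep "$1"
        debug "after sleep: $(date)"
    fi
}
```

With `DEBUG_ENABLED=0`, calling `sleep_self_test 10` returns immediately and prints nothing. With `DEBUG_ENABLED=1`, it prints two timestamps that should be about ten seconds apart.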