Rebooting Servers in a Controlled Manner
This page contains a script that reboots a server and monitors it until it is available again. It supports Solaris, Linux, and Windows targets and performs the following steps:
- Sends a reboot command to the target
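- Waits up to a configurable maximum time (300 seconds by default) for the RSCD Agent on the target to go down
- Waits until the RSCD Agent is back up and running, checking at a fixed interval (20 seconds by default) until a maximum wait time (300 seconds by default) is reached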
This script can be very useful in Batch Jobs where a series of events needs to be sequenced, and one of those events is a reboot. Simply sending a reboot command to the server is insufficient, since the RSCD Agent needs to be back up and running in order for further jobs to commence against the target.
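The wait-and-poll pattern described above can be sketched in plain shell. This is only an illustration: `wait_until` and `fake_agent_up` are hypothetical names that do not appear in the script on this page, and the stand-in check simply succeeds on its third call instead of querying a real RSCD Agent.

```shell
#!/bin/sh
# Sketch only: wait_until is a hypothetical helper, not an NSH built-in.
# Usage: wait_until MAX_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND repeatedly until it succeeds or MAX_SECONDS of sleeping have passed.
wait_until () {
    max=$1
    interval=$2
    shift 2
    elapsed=0
    until "$@"
    do
        if [ "$elapsed" -ge "$max" ]
        then
            return 1    # gave up: the condition never became true
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}

# Stand-in for the real agent check (assumption): succeeds on the third call.
attempts=0
fake_agent_up () {
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]
}

if wait_until 10 1 fake_agent_up
then
    result="up"
else
    result="timed out"
fi
echo "Result: $result after $attempts checks"
```

In a Batch Job, the job that follows the reboot should only start once a check like this has succeeded, which is why the script exits non-zero when a timeout is reached.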
Revision History
- (1.0) Originally contributed to the BladeLogic Knowledge Base by Thomas Kraus
- (1.1) Updated by Sean Berry to include the sleep statement inside the monitoring loop, timeouts set to 300s (5m), intervals set to 20s. Some debugging statements added.
- (1.2) Updated by Bill Robinson to take reboot arguments for Solaris, Changed /dev/null for WinNT
Implementing
- In the Depot, create a new NSH Script using the code in the box below.
- Ensure you set the script type to Execute separately against each host.
- Create a NSH Script Job based on the NSH Script.
- Practice executing this script in a controlled manner, using only a single host that is ready to be rebooted.
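Before the first real run, it can help to review which command the script would send for each platform. The following dry-run sketch mirrors the platform dispatch in the script but only prints the command instead of passing it to nexec. The function name `reboot_command` is hypothetical and is not part of the script; the `-r` boot argument is only an example value.

```shell
#!/bin/sh
# Sketch only: reboot_command is a hypothetical helper used for a dry run.
# Usage: reboot_command OS_NAME [BOOT_ARGS...]
# Prints the command that would be sent to the target for the given OS.
reboot_command () {
    os=$1
    shift
    case "$os" in
        SunOS)
            if [ $# -eq 0 ]
            then
                echo "shutdown -i6 -y -g 0"
            else
                # Boot arguments are only honored on Solaris
                echo "reboot -- $*"
            fi
            ;;
        Linux)
            echo "shutdown -r now"
            ;;
        WindowsNT)
            echo "reboot"
            ;;
        *)
            echo "Unknown platform \"$os\"" 1>&2
            return 1
            ;;
    esac
}

for os in SunOS Linux WindowsNT
do
    echo "$os: $(reboot_command "$os")"
done
echo "SunOS with boot arguments: $(reboot_command SunOS -r)"
```

Any value of `uname -s` other than the three listed platforms causes the script to exit with an error before a reboot is sent.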
Script
#
# BladeLogic Multi-Platform Reboot And Monitoring Script
# (1.0) Originally contributed to the BladeLogic Knowledge Base by Thomas Krause
# (1.1) Updated by Sean Berry to include the sleep statement inside the monitoring loop,
# timeouts set to 300s (5m), intervals set to 20s. Some debugging statements added.
# (1.2) Updated by Bill Robinson to take reboot arguments for Solaris, Changed /dev/null for
# WinNT
#
# Maximum time to wait to have the server
# go down. Not that reliable as we are only
# testing that the agent has gone down and
# not necessarily that the server has gone
# down. Also defined is the interval time
# between checks to see if the server is
# down.
#
MAX_SHUTDOWN_TIME=300
SHUTDOWN_INTERVAL=20
#
# Maximum amount of time we will wait to have
# the server come back up once we have detected
# that it has gone down. Also defined is the
# interval time between checks to see if the
# server is back up.
#
MAX_REBOOT_TIME=300
REBOOT_INTERVAL=20
if [ $# -ge 1 ]
then
echo "Accepting boot Arguments for Solaris"
BOOT_ARGS=$@
echo "Boot Args: $BOOT_ARGS"
fi
OS=`uname -s`
HOSTNAME=$NSH_RUNCMD_HOST
# The NSH_RUNCMD_HOST envar returns the FQDN which is what we want
if [ "$OS" = "WindowsNT" ]
then
DEVNULL=NUL
else
DEVNULL=/dev/null
fi
# DEBUG=echo
DEBUG=false
if test -z "$HOSTNAME"
then
echo "ERROR: NSH_RUNCMD_HOST is not set. Run this script against a target host." 1>&2
exit 1
fi
pwd | egrep -q ^//
if [[ $? -ne 0 ]]
then
print "ERROR: You must run this script using the \"runscript\" option." 1>&2
exit 1
fi
# Have to be local so the uname -D command works properly
cd //@/
agent_up ()
{
# uname -D //$1/ > $DEVNULL 2> $DEVNULL
echo uname -D //$1/
uname -D //$1/
return $?
}
if agent_up $HOSTNAME
then
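# Debug-only check of the sleep command. Guarded so that the
# 10 second sleep does not run when DEBUG=false.
if [ "$DEBUG" = "echo" ]
then
echo "testing sleep (`which sleep`) interval - should be 10 second delay"
date
sleep 10
date
fi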
echo Rebooting server $HOSTNAME ...
case "$OS" in
SunOS)
if [ -z "$BOOT_ARGS" ]
then
nexec $HOSTNAME shutdown -i6 -y -g 0
else
nexec $HOSTNAME reboot -- $BOOT_ARGS
fi
;;
Linux)
nexec $HOSTNAME shutdown -r now
;;
WindowsNT)
nexec $HOSTNAME reboot
;;
*)
echo "Unknown platform \"$OS\""
exit 1
;;
esac
if test $? -ne 0
then
echo '***** Warning - Possible error in sending reboot request'
fi
#
# Give the server a certain amount of time to kill the
# agent and reboot
#
count=$SHUTDOWN_INTERVAL
sleep $SHUTDOWN_INTERVAL
while agent_up $HOSTNAME
do
echo `date` Agent still running ...
count=`expr $count + $SHUTDOWN_INTERVAL`
if test $count -gt $MAX_SHUTDOWN_TIME
then
echo "Reboot command sent but server not coming down"
exit 1
fi
sleep $SHUTDOWN_INTERVAL
done
#
# Now we know the agent is down and we are waiting for the
# system to reboot. Give a bunch of time to come back up.
#
count=$REBOOT_INTERVAL
sleep $REBOOT_INTERVAL
while ! agent_up $HOSTNAME
do
echo `date` Agent still not up ...
count=`expr $count + $REBOOT_INTERVAL`
if test $count -gt $MAX_REBOOT_TIME
then
echo "Server has not come back up after $MAX_REBOOT_TIME seconds ..."
exit 1
fi
sleep $REBOOT_INTERVAL
done
echo Server $HOSTNAME back up and running
else
echo Agent currently not running
exit 1
fi
exit 0
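When tuning the timeouts for slower hardware, it helps to estimate how long the script can run before it gives up. The sketch below works this out from the default values in the script. The figures are approximate: they ignore the time each uname -D check takes, which can be noticeable while the agent is unreachable. The variable names `shutdown_checks`, `reboot_checks`, and `worst_case` are introduced here for illustration only.

```shell
#!/bin/sh
# Default values copied from the script
MAX_SHUTDOWN_TIME=300
SHUTDOWN_INTERVAL=20
MAX_REBOOT_TIME=300
REBOOT_INTERVAL=20

# Approximate number of agent checks in each phase before a timeout
shutdown_checks=$((MAX_SHUTDOWN_TIME / SHUTDOWN_INTERVAL))
reboot_checks=$((MAX_REBOOT_TIME / REBOOT_INTERVAL))

# Approximate worst case: both phases run close to their maximum
worst_case=$((MAX_SHUTDOWN_TIME + MAX_REBOOT_TIME))

echo "Shutdown phase: up to $shutdown_checks checks"
echo "Reboot phase: up to $reboot_checks checks"
echo "Approximate worst case: $worst_case seconds ($((worst_case / 60)) minutes)"
```

With the defaults, each phase makes up to 15 checks and the job can run for roughly 10 minutes, so any job timeout set on the NSH Script Job should allow for at least that long.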