Unsupported content: This version of the documentation is no longer supported. However, the documentation is available for your convenience.

Rebooting servers in a predefined order


About

The information on this page was initially contributed by @snowka.

Requirements

This document shows how to use BMC BladeLogic Server Automation (BSA) to reboot a set of servers in a predefined order and to stop and start services on each server individually. Each step of the process handles the service stop/start and the reboot for a defined subset of the overall group of targets and then moves on to the next subset, for example rebooting the web servers first and the database servers second. Each subset may contain 1..n targets.

Example:

All servers to reboot:
 srv1
 srv2

Process of reboot:

worddave72f206eaa9118b1436464fceb209494.png

Solution

BSA's built-in ability to add properties to data classes can be used to give servers additional properties that describe their boot order. Because many reboot processes restart servers that belong to an application, this example uses additional properties to hold the following information:

  1. AppGroup – the name of the application a server runs for
  2. BootOrder – the step within the reboot process at which the server is supposed to be rebooted
  3. SrvRole – the function a server provides for the application (e.g. database, web server)

worddavf20e4a480133198b2dcaf331b68eccac.png

The properties can then be used to group the servers in server smart groups:

worddav55b73d985bf98974bba762f45c50861e.png

worddavadecf9877a83b30364837a853441dfb5.png
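
To make the grouping idea concrete outside of the BSA console, the following is a minimal shell sketch that filters a made-up inventory by one of the three properties. It only mimics what a smart group condition does; the inventory format, the `shop` application name, the property values assigned to srv1 and srv2, and the `smart_group` function are all assumptions for illustration and not part of BSA.

```shell
# Hypothetical inventory, one server per line:
#   <server> <AppGroup> <BootOrder> <SrvRole>
INVENTORY='srv1 shop 1 webserver
srv2 shop 2 database'

# Print the servers whose named property has the given value,
# mimicking a smart group condition such as "BootOrder equals 1".
# Unknown property names match nothing.
smart_group() {
    printf '%s\n' "$INVENTORY" | awk -v prop="$1" -v val="$2" '
        BEGIN { col["AppGroup"] = 2; col["BootOrder"] = 3; col["SrvRole"] = 4 }
        (prop in col) && $(col[prop]) == val { print $1 }'
}

smart_group BootOrder 1
smart_group SrvRole database
```

Keeping the boot order and the server role as separate properties means the same server can appear in one group used for rebooting and in another used for service handling.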

To perform a server reboot, we can leverage BSA's packaging capabilities by creating a BLPackage that contains only a dummy external command and is set to reboot "After item deployment":

worddav8d4638c14474241e2771279384a4b6d2.png

The stopping and starting of services is handled by regular BLPackages that stop or start the required services.
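
The page does not show what the service packages execute. As a hedged sketch only, an external command in such a package might boil down to something like the following, which prints (rather than runs) a platform-specific stop or start command. The `service_command` helper and the service names are hypothetical, and the Linux branch assumes a systemd-based distribution.

```shell
# Print (not run) the command that would stop or start a service.
# Usage: service_command <os> <stop|start> <service>
# <os> uses the same names the reboot script gets from `uname -s`.
service_command() {
    case "$1" in
        Linux)     echo "systemctl $2 $3" ;;
        WindowsNT) echo "net $2 \"$3\"" ;;
        *)         echo "unsupported platform: $1" >&2; return 1 ;;
    esac
}

service_command Linux stop httpd
service_command WindowsNT start "World Wide Web Publishing Service"
```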

The packages can then be used in jobs that target the server smart groups: the groups based on the boot order property for rebooting, and the groups based on the server role for service handling.

worddav14508251eafbaa01d1445b01e166bd81.png

worddavee208104e581343037c57db6118d7319.png

These jobs are then grouped in a batch job that is defined to use the target servers of the individual jobs it contains.
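
The sequencing that the batch job provides can be pictured with a small shell sketch. It runs placeholder "jobs" one after another and stops at the first failure; whether a real batch job stops on failure depends on how it is configured, so that behavior is an assumption here, as are the function and job names.

```shell
# Run the given commands one after another and stop at the first failure.
run_in_order() {
    for job in "$@"
    do
        echo "running $job"
        "$job" || { echo "$job failed, stopping" >&2; return 1; }
    done
}

# Placeholder jobs (hypothetical names); the real members are BSA jobs.
stop_web_services()  { echo "stopping services on web servers"; }
reboot_order_1()     { echo "rebooting servers with boot order 1"; }
start_web_services() { echo "starting services on web servers"; }

run_in_order stop_web_services reboot_order_1 start_web_services
```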

worddave7266588ed091453d5089c0e6d10fe46.png

Alternative

The previous approach runs without any script, but it requires a separate reboot job for each server smart group, so the number of reboot jobs can grow large. An alternative is an NSH script that simply checks whether the boot order of a target server equals the current step in the process. An updated version of the "Reboot server in a controlled manner" NSH script can be used for this:

#
# BladeLogic Multi-Platform Reboot And Monitoring Script
# (1.0) Originally contributed to the BladeLogic Knowledge Base by Thomas Krause
# (1.1) Updated by Sean Berry to include the sleep statement inside the monitoring loop,
# timeouts set to 300s (5m), intervals set to 20s. Some debugging statements added.
# (1.2) Updated by Bill Robinson to take reboot arguments for Solaris, Changed /dev/null for
# WinNT
# (1.3) Updated by Steffen Nowka to include parameterized reboot flow by checking a reboot
# order of the target against the current order to be rebooted
#
# Maximum time to wait to have the server
# go down. Not that reliable as we are only
# testing that the agent has gone down and
# not necessarily that the server has gone
# down. Also defined is the interval time
# between checks to see if the server is
# down.
#

MAX_SHUTDOWN_TIME=300
SHUTDOWN_INTERVAL=20


#
# Number of parameters before optional parameters start
PARA=2


#
# Maximum amount of time we will wait to have
# the server comeback up once we have detected
# that it has gone down. Also defined is the
# interval time between checks to see if the
# server is back up.
#
MAX_REBOOT_TIME=300
REBOOT_INTERVAL=20

#
# The first two parameters are the boot order of the server and the current order that is supposed to be rebooted
#
#
if [ "$1" -eq "$2" ]
then
echo "Boot order match, checking if Agent is up..."

if [ $# -gt $PARA ]
then
echo "Accepting boot Arguments for Solaris"
COUNT=0
BOOT_ARGS=""
for ARG in $*
do
COUNT=`expr $COUNT + 1`
if [ $COUNT -gt $PARA ]
then
BOOT_ARGS=$BOOT_ARGS" "$ARG
fi
done

echo "Boot Args: $BOOT_ARGS"
fi

OS=`uname -s`
HOSTNAME=$NSH_RUNCMD_HOST
# The NSH_RUNCMD_HOST envar returns the FQDN which is what we want
if [ "$OS" = "WindowsNT" ]
then
DEVNULL=NUL
else
DEVNULL=/dev/null
fi

# DEBUG=echo
DEBUG=false

if test -z "$HOSTNAME"
then
echo "Usage: $0 <boot order> <current order> [boot args ...]"
exit 1
fi

pwd | egrep -q ^//

if [[ $? -ne 0 ]]
then
print "ERROR: You must run this script using the \"runscript\" option." 1>&2
exit 1
fi

# Have to be local so the uname -D command works properly
cd //@/
disconnect

agent_up ()
{
# uname -D //$1/ > $DEVNULL 2> $DEVNULL
echo uname -D //$1/
uname -D //$1/
return $?
}

if agent_up $HOSTNAME
then
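# Debug only: check that sleep works (expect a 10 second delay between dates).
# Guarded explicitly because backticks would run even when DEBUG=false.
if [ "$DEBUG" = "echo" ]
then
echo "testing sleep (`which sleep`) interval - should be 10 second delay"
date
sleep 10
date
fi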

echo Rebooting server $HOSTNAME ...

case "$OS" in
SunOS)
if [ -z "$BOOT_ARGS" ]
then
nexec $HOSTNAME shutdown -i6 -y -g 0
else
nexec $HOSTNAME reboot -- $BOOT_ARGS
fi
;;

Linux)
nexec $HOSTNAME shutdown -r now
;;

WindowsNT)
nexec $HOSTNAME reboot
;;

*)
echo "Unknown platform \"$OS\""
exit 1
;;
esac

if test $? -ne 0
then
echo '***** Warning - Possible error in sending reboot request'
fi

#
# Give the server a certain amount of time to kill the
# agent and reboot
#
count=$SHUTDOWN_INTERVAL
sleep $SHUTDOWN_INTERVAL

while agent_up $HOSTNAME
do
echo `date` Agent still running ...
count=`expr $count + $SHUTDOWN_INTERVAL`

if test $count -gt $MAX_SHUTDOWN_TIME
then
echo "Reboot command sent but server not coming down"
exit 1
fi

sleep $SHUTDOWN_INTERVAL
done

#
# Now we know the agent is down and we are waiting for the
# system to reboot. Give a bunch of time to come back up.
#
count=$REBOOT_INTERVAL
sleep $REBOOT_INTERVAL

while ! agent_up $HOSTNAME
do
echo `date` Agent still not up ...
count=`expr $count + $REBOOT_INTERVAL`
sleep $REBOOT_INTERVAL

if test $count -gt $MAX_REBOOT_TIME
then
echo "Server has not come back up after more than $count seconds ..."
exit 1
fi
done

echo Server $HOSTNAME back up and running
else
echo Agent currently not running
exit 1
fi
else
echo "No match in boot order, nothing to do here..."
fi

exit 0
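
The script uses the same polling pattern twice: first until the agent stops responding, then until it responds again. A minimal sketch of that pattern as a reusable helper follows. The `wait_until` name is hypothetical, and `agent_up` here is only a stand-in that checks for a marker file so the example can run anywhere; the real check is the `uname -D //host/` call used in the script.

```shell
# wait_until <max seconds> <interval seconds> <command ...>
# Re-run the command every <interval> seconds until it succeeds.
# Returns 1 once the accumulated wait would exceed <max>.
wait_until() {
    wu_max=$1
    wu_interval=$2
    shift 2
    wu_elapsed=0
    until "$@"
    do
        wu_elapsed=$((wu_elapsed + wu_interval))
        if [ "$wu_elapsed" -gt "$wu_max" ]
        then
            return 1
        fi
        sleep "$wu_interval"
    done
    return 0
}

# Stand-in for the agent check: the "agent" counts as up while a
# marker file exists (an assumption made only for this demo).
AGENT_MARKER=${TMPDIR:-/tmp}/agent_up.$$
agent_up()   { [ -f "$AGENT_MARKER" ]; }
agent_down() { ! agent_up; }

touch "$AGENT_MARKER"
wait_until 4 1 agent_up && echo "agent is up"
rm -f "$AGENT_MARKER"
wait_until 4 1 agent_down && echo "agent is down"
```

With the values from the script, the two phases would correspond to `wait_until 300 20 agent_down` followed by `wait_until 300 20 agent_up`.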


The NSH script can then be used in a job that passes the current boot order as a parameter; the job reboots only the servers whose boot order matches:

worddavc895a6aecb94b1ee58ef6d136622b24b.png

worddav7e282b5ec135f5d5763a2d24b915be4c.png
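
The script expects the target's boot order as its first parameter and the step being processed as its second. In BSA, a script parameter can typically reference a target property with the `??TARGET.<property>??` notation, so the first parameter might be `??TARGET.BootOrder??` and the second a literal step number; treat those exact values as an assumption, since the screenshots are the authoritative source. The comparison itself can be sketched as follows, with hypothetical helper names and an added guard for empty or non-numeric values, which the script does not check for:

```shell
# Succeed only for a non-empty string of digits.
is_number() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *)           return 0 ;;
    esac
}

# Succeed when the target's boot order ($1) equals the current step ($2).
# Empty or non-numeric values count as "no match".
boot_order_matches() {
    is_number "$1" && is_number "$2" && [ "$1" -eq "$2" ]
}

# Example: the job for step 2 evaluated against a target with BootOrder 2.
if boot_order_matches 2 2
then
    echo "Boot order match"
else
    echo "No match in boot order"
fi
```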

The NSH script jobs are then grouped in a batch job that is configured to use its own target list instead of those of the individual jobs. The batch job then runs against all of the target servers, and at each step the script checks whether a target server has the matching boot order and either reboots or skips that target.
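
The effect of running every step against the full target list can be simulated in a few lines of shell. This is an illustration only: the target list, the boot orders assigned to srv1 and srv2, and the `run_step` function are assumptions, and the sketch prints a decision instead of rebooting anything.

```shell
# Hypothetical batch job target list: "<server> <BootOrder>" per line.
TARGETS='srv1 1
srv2 2'

# Simulate one member job: visit every target and decide, per target,
# whether this step reboots it or skips it.
run_step() {
    printf '%s\n' "$TARGETS" | while read -r srv order
    do
        if [ "$order" -eq "$1" ]
        then
            echo "step $1: $srv reboot"
        else
            echo "step $1: $srv skip"
        fi
    done
}

# The batch job runs the steps in ascending order.
for step in 1 2
do
    run_step "$step"
done
```

Every target is visited at every step, so the total number of script runs is the number of steps times the number of targets, in exchange for needing only one target list.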

 
