Rebooting servers in a predefined order

This topic was edited by a BMC Contributor and has not been approved.

About

The information on this page was initially contributed by Steffen Nowka.

Requirements

This document shows how to use BMC BladeLogic Server Automation (BSA) to reboot a set of servers in a predefined order and to handle the stopping and starting of services on each server. Each step of the process stops the services on, reboots, and restarts the services on a defined subset of the overall group of targets, and then moves on to the next subset, e.g. first rebooting the web servers and then the database servers. Each subset may contain 1..n targets.

Example:

All servers to reboot:
srv1
srv2

Process of reboot:

Solution

BSA's built-in capability to add properties to data classes can be used to give servers additional properties that describe their boot order. Because many reboot processes restart servers that are part of an application, this example uses additional properties to hold the following information:

  1. AppGroup – the name of the application a server runs for
  2. BootOrder – the step within the reboot process at which the server is supposed to be rebooted
  3. SrvRole – the function a server provides for the application (e.g. database, webserver, ...)

The properties can then be used to group the servers into server smart groups.
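The effect of such a smart group condition can be illustrated outside of BSA. The following is a minimal sketch, assuming a hypothetical text inventory with one server per line; the application name `webshop`, the order values, and the function name `servers_in_step` are made up for illustration, and in BSA the values would be stored as server properties rather than in a text table.

```shell
# Hypothetical inventory: one server per line as "name AppGroup BootOrder SrvRole".
# These values are illustrative only; in BSA they would be server properties.
inventory='srv1 webshop 1 webserver
srv2 webshop 2 database'

# Print the servers whose AppGroup and BootOrder match the arguments,
# mimicking a condition such as "AppGroup equals X AND BootOrder equals N".
servers_in_step() {
    app=$1
    step=$2
    printf '%s\n' "$inventory" | awk -v app="$app" -v step="$step" \
        '$2 == app && $3 == step { print $1 }'
}

servers_in_step webshop 1
servers_in_step webshop 2
```

A boot order value with no matching server simply yields an empty group, which corresponds to a step with nothing to reboot.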

To perform a server reboot, we can leverage BSA's packaging capabilities by creating a BLPackage that contains only a dummy external command and is set to reboot "After item deployment".

The stopping and starting of services is handled by regular BLPackages that stop or start the required services.

The packages can then be used in jobs that target the server smart groups: groups based on the boot order property for rebooting, and groups based on the server role property for service handling.

Those jobs are then grouped in a batch job that is configured to use the target servers of the individual jobs it contains.
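To get a feeling for how many member jobs such a batch job needs, the layout can be printed as a dry run. This is only a sketch under assumptions: the job names are hypothetical, and the order "stop services, reboot, start services" within each step is one plausible arrangement rather than something prescribed by BSA. Nothing is created or executed.

```shell
# Print a hypothetical batch job layout for a given number of boot order steps.
# Job names are illustrative; this sketch does not create or run any job.
print_batch_plan() {
    steps=$1
    step=1
    while [ "$step" -le "$steps" ]; do
        echo "stop-services-step-$step"
        echo "reboot-step-$step"
        echo "start-services-step-$step"
        step=$((step + 1))
    done
}

print_batch_plan 2
```

With this arrangement the number of member jobs grows by three for every additional boot order step.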

Alternative

The previous approach runs without any script, but it requires a separate reboot job for each server smart group, so the number of reboot jobs can get high. An alternative is an NSH script that simply checks whether the boot order of a target server equals the current step in the process. For this, an updated version of the "Reboot server in a controlled manner" NSH script can be used:

#
#  BladeLogic Multi-Platform Reboot And Monitoring Script
#    (1.0) Originally contributed to the BladeLogic Knowledge Base by Thomas Krause
#    (1.1) Updated by Sean Berry to include the sleep statement inside the monitoring loop,
#    timeouts set to 300s (5m), intervals set to 20s.  Some debugging statements added.
#    (1.2) Updated by Bill Robinson to take reboot arguments for Solaris, Changed /dev/null for
#    WinNT
#    (1.3) Updated by Steffen Nowka to include parameterized reboot flow by checking a reboot
#    order of the target against the current order to be rebooted
#
# Maximum time to wait to have the server
# go down. Not that reliable as we are only
# testing that the agent has gone down and
# not necessarily that the server has gone
# down. Also defined is the interval time
# between checks to see if the server is
# down.
#

MAX_SHUTDOWN_TIME=300
SHUTDOWN_INTERVAL=20


#
# Number of parameters before optional parameters start
PARA=2


#
# Maximum amount of time we will wait to have
# the server come back up once we have detected
# that it has gone down.  Also defined is the
# interval time between checks to see if the
# server is back up.
#
MAX_REBOOT_TIME=300
REBOOT_INTERVAL=20

#
# The first two parameters are the boot order of the server and the current order that is supposed to be rebooted
#
#
if [ $1 -eq $2 ]
   then
   	echo "Boot order match, checking if Agent is up..."

	if [ $# -gt $PARA ]
	then
		echo "Accepting boot Arguments for Solaris"
		COUNT=0
		BOOT_ARGS=""
		# Collect every argument after the first $PARA parameters
		for ARG in $*
		do
			COUNT=`expr $COUNT + 1`
			if [ $COUNT -gt $PARA ]
			then
				BOOT_ARGS=$BOOT_ARGS" "$ARG
			fi
		done

		echo "Boot Args: $BOOT_ARGS"
	fi

	OS=`uname -s`
	HOSTNAME=$NSH_RUNCMD_HOST
	# The NSH_RUNCMD_HOST envar returns the FQDN which is what we want
 	if [ "$OS" = "WindowsNT" ]
	then
    	DEVNULL=NUL
	else
    	DEVNULL=/dev/null
	fi

	# DEBUG=echo
	DEBUG=false

	if test -z "$HOSTNAME"
	then
    	echo Usage $0 hostname
    	exit 1
	fi

	pwd | egrep -q ^//

	if [[ $? -ne 0 ]]
	then
    	print "ERROR: You must run this script using the \"runscript\" option." 1>&2
    	exit 1
	fi

	# Have to be local so the uname -D command works properly
	cd //@/
    disconnect

	agent_up ()
	{
		#    uname -D //$1/ > $DEVNULL 2> $DEVNULL
    	echo uname -D //$1/
    	uname -D //$1/
    	return $?
	}

	if agent_up $HOSTNAME
	then
    	# XXX
   		$DEBUG "testing sleep (`which sleep`) interval - should be 10 second delay"
   		$DEBUG `date`
   		$DEBUG `sleep 10`
   		$DEBUG `date`

    	echo Rebooting server $HOSTNAME ...

        case "$OS" in
            SunOS)
                # Quote BOOT_ARGS so the test also works when it holds several words
                if [ -z "$BOOT_ARGS" ]
                then
                    nexec $HOSTNAME shutdown -i6 -y -g 0
                else
                    nexec $HOSTNAME reboot -- $BOOT_ARGS
                fi
            ;;

            Linux)
                nexec $HOSTNAME shutdown -r now
            ;;

            WindowsNT)
                nexec $HOSTNAME reboot
            ;;

            *)
                echo "Unknown platform \"$OS\""
                exit 1
            ;;
        esac

    	if test $? -ne 0
    	then
        	echo '***** Warning - Possible error in sending reboot request'
	    fi

    	#
    	# Give the server a certain amount of time to kill the
    	# agent and reboot
    	#
    	count=$SHUTDOWN_INTERVAL
    	sleep $SHUTDOWN_INTERVAL

    	while agent_up $HOSTNAME
    	do
        	echo `date` Agent still running ...
        	count=`expr $count + $SHUTDOWN_INTERVAL`

        	if test $count -gt $MAX_SHUTDOWN_TIME
        	then
            	echo "Reboot command sent but server not coming down"
            	exit 1
        	fi

        	sleep $SHUTDOWN_INTERVAL
    	done

    	#
    	# Now we know the agent is down and we are waiting for the
   		# system to reboot. Give a bunch of time to come back up.
   		#
    	count=$REBOOT_INTERVAL
    	sleep $REBOOT_INTERVAL

    	while ! agent_up $HOSTNAME
    	do
        	echo `date` Agent still not up ...
        	count=`expr $count + $REBOOT_INTERVAL`
        	sleep $REBOOT_INTERVAL

        	if test $count -gt $MAX_REBOOT_TIME
        	then
            	echo "Reboot has not yet come up after more than $count seconds ..."
            	exit 1
        	fi
    	done

    	echo Server $HOSTNAME back up and running
	else
    	echo Agent currently not running
    	exit 1
	fi
else
	echo "No match in boot order, nothing to do here..."
fi

exit 0
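
The two monitoring loops in the script follow the same pattern: run a check, sleep for an interval, and give up after a maximum time. A minimal sketch of that pattern as a single function, assuming a hypothetical helper name `wait_until` that is not part of the original script:

```shell
# Poll a check command until it succeeds or the time budget is used up.
# Usage: wait_until MAX_SECONDS INTERVAL_SECONDS command [args...]
# Returns 0 as soon as the command succeeds, 1 on timeout.
wait_until() {
    max=$1
    interval=$2
    shift 2
    waited=0
    until "$@"; do
        if [ "$waited" -ge "$max" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Demonstration with trivial checks instead of a real agent test
wait_until 0 0 true && echo "check passed"
wait_until 0 0 false || echo "check timed out"
```

In the context of the script above, a call such as `wait_until "$MAX_REBOOT_TIME" "$REBOOT_INTERVAL" agent_up "$HOSTNAME"` would presumably replace the second loop; waiting for the agent to go down would need a small wrapper that inverts the result of `agent_up`.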



The NSH script can then be used by a job that passes the boot order of the target and the boot order currently being processed as parameters. The job reboots only the servers whose boot order matches the current one.
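The comparison of the two parameters can be made more defensive, because a server without a boot order value would hand the script an empty first parameter. The following is a minimal sketch, assuming hypothetical helper names `is_number` and `boot_order_matches`; how the property value reaches the script (for example through a property reference such as `??TARGET.BootOrder??` in the job parameters) is likewise an assumption about the job configuration.

```shell
# Succeed only if the argument is a non-empty string of digits.
is_number() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Succeed only if both values are whole numbers and are equal.
# $1 = boot order of the target, $2 = boot order currently being processed
boot_order_matches() {
    is_number "$1" && is_number "$2" && [ "$1" -eq "$2" ]
}

if boot_order_matches 2 2; then
    echo "Boot order match"
fi
if ! boot_order_matches "" 2; then
    echo "No match in boot order, nothing to do here..."
fi
```

Treating a missing or non-numeric value as "no match" means that a server with an unset property is skipped instead of causing a test error.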

The NSH script jobs are then grouped in a batch job that is configured to use its own target list instead of the target lists of the individual jobs. The batch job then runs against all target servers, and at each boot order step the script checks whether a target server has the same order number and either reboots that target or skips it for that step.
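The resulting behavior can be simulated without touching any server. This sketch assumes a hypothetical target list in which srv1 has boot order 1 and srv2 has boot order 2; it only prints what each step would do.

```shell
# Hypothetical targets of the batch job, one per line as "name BootOrder".
targets='srv1 1
srv2 2'

# For every step, visit every target and report whether it would be
# rebooted or skipped. Nothing is actually rebooted.
simulate_batch() {
    steps=$1
    step=1
    while [ "$step" -le "$steps" ]; do
        printf '%s\n' "$targets" | while read -r name order; do
            if [ "$order" -eq "$step" ]; then
                echo "step $step: reboot $name"
            else
                echo "step $step: skip $name"
            fi
        done
        step=$((step + 1))
    done
}

simulate_batch 2
```

Every step visits every target, so the number of script runs is the number of steps multiplied by the number of targets, while each server is rebooted at most once.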


Comments

  1. Stephane Forand

    Caveat: in the WindowsNT) branch, the command "nexec $HOSTNAME reboot" takes no arguments (for example, a wait time in seconds before the reboot, which would let nexec return a valid return code to the central BL application server thread), and this creates unwanted behavior.

    If the reboot command takes effect so quickly on the target that the central BL application server thread cannot capture the return code, the job hangs on the BL application server, still waiting for the command to return something, and then fails 60 minutes later with this error:

          SSL_read (Error on socket: 'Connection reset by peer')

          SSL_read

          nexec: ioctl error: 9

          nexec: ioctl error: 9

     

    In reality, the command works fine and initiates a server reboot that completes successfully. But because of the way the job part timeout behaves:

    a) it takes a full hour before the job finishes

    b) the job notification wakes up a system administrator for no reason, although the server did not need any human interaction.

     

     

     

    Nov 30, 2015 04:41
    1. Yechezkel Schatz

      Thanks, Stephane. I spoke to a team member and these issues sound like a defect. I understand that you submitted a ticket for these issues. I'll stay tuned for developments as the defect is resolved, and will then check again whether I should make any changes in the documentation.

      Dec 01, 2015 07:30
  2. Abhijeet Janwalkar

    How can we use this along with patching? Let's say I have an application created, and all the servers that are part of this application will be patched in a single MW. How can this be achieved? I hope my requirement is clear.

    Jul 18, 2017 03:54