Workflow - ESX Host Down



The ESX Host Down workflow is a Triage and Remediation workflow designed to monitor, diagnose, and remediate the state of a VMware ESX server and its connection with the VMware vCenter Server.

Configuration guidelines

To enable this workflow, you must configure the AutoPilot Credentials Store and the ESX_Host_Down configuration module on the BMC Atrium Orchestrator server. 

Ensure that the credentials of the following systems are added to the AutoPilot Credentials Store:

  • Three ping host systems and the ping router to enable the 360 degree ping verification to work
  • VMware ESX server host that is the subject of the triage and remediation

The ESX_Host_Down configuration module must also contain the definitions required to make the 360 degree verification process work:

  • Ping_Host_1, Ping_Host_2, Ping_Host_3: Devices from which the ping command is triggered when an ESX host down event is received. If the ping command fails, the ping host from which the ping failed launches a traceroute and collects the traceroute information. Check that you can log on to each ping host system from the peer host system.
  • Ping_Router: Default router that launches the ping command when the traceroute does not return any router IPs in the path to the host identified in the host down event.
  • Router_pingCommand: Ping command that is appropriate for the routers in your environment.
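
For illustration only, the items can be pictured as key-value pairs; the values below are hypothetical placeholders, and the real definitions belong in the ESX_Host_Down configuration module and the AutoPilot Credentials Store, not in code:

# Hypothetical values for illustration; define the real ones in the
# ESX_Host_Down configuration module, not in Python.
ESX_HOST_DOWN_CONFIG = {
    "Ping_Host_1": "pinghost1.example.com",
    "Ping_Host_2": "pinghost2.example.com",
    "Ping_Host_3": "pinghost3.example.com",
    "Ping_Router": "10.0.0.1",          # default router for fallback pings
    "Router_pingCommand": "ping -c 4",  # adjust for your router platform
}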

In the TrueSight Infrastructure Management Server, the action definitions for the ESX Host Not Responding workflow are located in the installationDirectory\pw\server\etc\cellName\kb\bin\ao_actions.mrl file.

The following extract from the ao_actions.mrl file shows the action definitions for this workflow, both for the manual, on-demand launch and for the automatic launch via a remote action policy.

action 'Atrium Orchestrator Actions'.'Triage and Remediate ESX Host Not Responding':
{
['Administrator', 'Full Access', 'Data Collection Administrator', 'Event Administrator', 'Event Operator', 'Data Collection Operator', 'Event Operator Supervisor', 'Data Collection Supervisor']
}
[
'Create Change Request':MC_TRUEFALSE($CREATECHANGERQUEST),
'Change Request Type':MC_CHANGEREQUESTTYPE($CHANGEREQUESTTYPE),
'Create/Update Incident' : MC_TRUEFALSE($CREATEINCIDENT),
'Remediate' : MC_TRUEFALSE($REMEDIATE)
]
:EVENT($EV) where [ $EV.status != 'CLOSED' AND $EV.status != 'BLACKOUT']
#If you want this action to be context sensitive to PATROL events,
#comment out the line above and uncomment the line below.
#:PATROL_EV($EV) where [ $EV.mc_parameter within ['Comm_Status'] ]
{
action_requestor($UID,$PWD);
opadd($EV, "Triage and Remediate ESX Host Not Responding manual", $UID);
admin_execute(BEMGW,$EV,"Atrium_Orchestrator_ESX_Host_Not_Responding_Workflow",[$CREATECHANGERQUEST, $CHANGEREQUESTTYPE, $CREATEINCIDENT, $REMEDIATE,$UID],YES);
}
END
END
....................................................................
....................................................................
action 'Atrium Orchestrator Actions-Automatic'.'Triage and Remediate ESX Host Not Responding':
{
['Administrator', 'Full Access', 'Data Collection Administrator', 'Event Administrator']
}
:EVENT($EV) where [ $EV.status != 'CLOSED' AND $EV.status != 'BLACKOUT']
{
opadd($EV, "Triage and Remediate ESX Host Not Responding Auto", "BMC Impact Manager");
admin_execute(BEMGW,$EV,"Atrium_Orchestrator_ESX_Host_Not_Responding_Workflow",["true","normal","true","true","BMC Impact Manager"],YES);
}
END

Launching the workflow

You launch this workflow from the Infrastructure Management interface, as you do the other BMC Atrium Orchestrator workflows. On the Administration Console, you can define a remote action policy that automatically launches the workflow when a trigger event is received.

Define a remote action policy under the Event Management Policies tab, following the procedure in Creating-remote-actions-on-the-administrator-console. Complete the policy definition by choosing the appropriate automatic workflow action for ESX Host Not Responding.

Common framework: event processing

The ESX Host Not Responding workflow is launched when the Infrastructure Management adapter receives an event that indicates that the ESX host is not responding. 

After extracting the configuration data from the event, the common framework determines the logging level for the workflow and sets the level at normal or detailed. 

If the event is a PATROL event, the Get PATROL Annotation Data subworkflow is launched to gather annotation details about the event. The subworkflow updates the event notes with the event information and any annotation details. The workflow then starts the triage process.

As triage input, the ESX workflow retrieves the connection details of the ESX server, the ping hosts, and the ping router from the AutoPilot Credentials Store.

It then launches the first subworkflow: a 360 degree ping to verify that the ESX server is reachable so that the workflow can log on to the system to perform the triage and, if necessary, the remediation tasks.

The 360 degree ping subworkflow

The 360 degree ping subworkflow begins the triage process. Its purpose is to verify that the ESX server is reachable from the system on which the BMC Atrium Orchestrator server resides. 

The subworkflow connects to the three ping hosts and the ping router, all of which are specified in the AutoPilot Credentials Store and in the ESX_Host_Down configuration module. After connecting to the ping hosts and to the router, the subworkflow launches the 360 degree service validation against the ESX server host. 

The three hosts ping the ESX server host. If any one of the ping attempts fails, the system from which the ping failed launches a traceroute command, which traces the network route that the ping command took toward its destination. The subworkflow examines the traceroute output to determine whether the ping failure is the result of intermediate connectivity issues or of the ESX server host being down.
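
The following Python sketch illustrates this decision logic under simplifying assumptions: the real subworkflow runs ping and traceroute remotely on the configured ping hosts and router (using credentials from the AutoPilot Credentials Store), whereas the sketch runs them locally, and the router-IP check is deliberately crude:

import subprocess

def ping_from(ping_host, target):
    # The real subworkflow logs on to ping_host and pings from there;
    # this sketch pings the target locally. Linux ping syntax assumed.
    return subprocess.run(["ping", "-c", "1", target],
                          capture_output=True).returncode == 0

def traceroute(target):
    # Collects the route that packets take toward the target.
    return subprocess.run(["traceroute", target],
                          capture_output=True, text=True).stdout

def path_has_router_ips(trace_output):
    # Crude check: a hop that answered is printed as "name (ip)";
    # hops that time out are printed as "* * *".
    return "(" in trace_output

def verify_360(esx_host, ping_hosts, ping_router):
    # Returns True when the ESX host is reachable from every ping host.
    for ping_host in ping_hosts:
        if ping_from(ping_host, esx_host):
            continue
        trace = traceroute(esx_host)
        if not path_has_router_ips(trace):
            # No router IPs in the path: fall back to the default router.
            return ping_from(ping_router, esx_host)
        # Intermediate hops answered, so the failure is likely the host.
        return False
    return True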

If the ESX server host is down, then the workflow cannot communicate with the host to launch its triage and remediation commands. It updates the event notes with the information and processes any open incident. The workflow is complete. 

If the 360 degree service validation is successful, indicating that the ESX server is reachable, then the workflow launches into the triage process. It initiates a sequence of checks to determine whether the following services are running on the ESX host system:

  • The vmware-hostd ESX Server Management service
  • The vmware-vpxa vCenter Agent service
  • The xinetd daemon service, which launches other network services

Triage processing

The triage commands are defined in the following XML extract. Do not edit these commands.

<commands>
<triage Name="verify management service status" sequence-no="1">
<command>ps -ef | grep hostd</command>
</triage>
<triage Name="verify virtual center service status" sequence-no="2">
<command>ps -ef | grep vpxa</command>
</triage>
<triage Name="verify xinetd service status" sequence-no="3">
<command>service xinetd status</command>
</triage>
<triage Name="check for resource starvation" sequence-no="4">
<command>top -n 1</command>
</triage>
</commands>


The first triage command, ps -ef | grep hostd, checks whether the vmware-hostd ESX Server Management service is running; its status is recorded as 0 (success) or 1 (failure). The next command, ps -ef | grep vpxa, verifies whether the vmware-vpxa vCenter Agent service is running, and the third command, service xinetd status, examines the status of the xinetd daemon. Their statuses are recorded in the same way.
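
This bookkeeping can be pictured with a minimal Python sketch, assuming the commands are executed on the ESX host through some remote-execution helper (represented here by a plain callable; the helper itself is not part of the product):

def triage_status(run):
    # run(cmd) executes a triage command on the ESX host and returns its
    # exit code: 0 if the service was found running, 1 otherwise.
    commands = [
        "ps -ef | grep hostd",    # vmware-hostd ESX Server Management service
        "ps -ef | grep vpxa",     # vmware-vpxa vCenter Agent service
        "service xinetd status",  # xinetd daemon
    ]
    return "".join("0" if run(cmd) == 0 else "1" for cmd in commands)

# Local stand-in for the remote execution, for demonstration only:
import subprocess
print(triage_status(lambda cmd: subprocess.run(cmd, shell=True).returncode))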

The three service statuses are then concatenated into a single result. If all the triage commands succeed, the recorded output is 000, indicating that the ESX services are running. The remediation process is not required, the event notes are updated, and no incident is created.

Instead, the top -n 1 command is launched to check for resource starvation in an effort to determine why the event was initiated. (Resource starvation refers to excessive CPU, memory, and swap usage, all of which can impact system performance and operation of the ESX server.) 

The top -n 1 command generates output similar to that in the following example:

18:36:08 up 23 days, 3:45, 1 user, load average: 0.00, 0.00, 0.00
180 processes: 179 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice    system    irq    softirq    iowait    idle
             total  0.0%    0.0%    0.9%      0.0%   0.0%       0.0%      99.0%
Mem:  268572k av,  214192k used,  54380k free,  0k shrd,  18648k buff
      143964k actv,  36104k in_d,  2432k in_c
Swap: 554200k av,  30468k used,  523732k free            103724k cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM  TIME CPU COMMAND
26985 root      20   0  1220 1220   876 R     0.9  0.4  0:00   0 top
    1 root      15   0   496  484   436 S     0.0  0.1  0:04   0 init
    2 root      15   0     0    0     0 SW    0.0  0.0  0:00   0 keventd
    3 root      34  19     0    0     0 SWN   0.0  0.0  0:00   0 ksoftirqd/0
    6 root      15   0     0    0     0 SW    0.0  0.0  0:00   0 bdflush
    4 root      15   0     0    0     0 SW    0.0  0.0  0:01   0 kswapd
    5 root      15   0     0    0     0 SW    0.0  0.0  0:00   0 kscand
    7 root      15   0     0    0     0 SW    0.0  0.0  0:00   0 kupdated
   19 root      15   0     0    0     0 SW    0.0  0.0  0:02   0 vmkmsgd
   20 root      15   0     0    0     0 SW    0.0  0.0  0:00   0 vmnixhbd
   24 root      25   0     0    0     0 SW    0.0  0.0  0:00   0 vmkdevd
   25 root      25   0     0    0     0 SW    0.0  0.0  0:00   0 scsi_eh_0
  353 root      15   0     0    0     0 SW    0.0  0.0  0:00   0 scsi_eh_1
  354 root      16   0     0    0     0 SW    0.0  0.0  0:00   0 scsi_eh_2
  382 root      15   0     0    0     0 SW    0.0  0.0  0:19   0 kjournald
  433 root      25   0     0    0     0 SW    0.0  0.0  0:00   0 khubd


When reviewing the output statistics, pay attention to the load average and CPU idle values, both of which indicate how busy the server is. A load average of 2.00 suggests that the system is busy, and a load average over 4.00 indicates that the system is so busy that the activity is impacting performance. A high CPU idle percentage indicates that the system is not busy. If the CPU idle percentage is low, look at which CPU states (user, nice, system, and so forth) are consuming the time. If the CPU time is being consumed in the user state, you can list the PIDs to determine which command is consuming the most resources.

Also review the memory and swap statistics to determine how much memory is used and how much swapping is occurring. The more RAM that is available, the less swapping occurs. When the statistics show a low amount of RAM but a high amount of swapping, perform troubleshooting measures, such as disabling third-party services and increasing the amount of RAM.
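
As a rough aid to interpreting the output, this Python sketch pulls the one-minute load average from the first line of top output and applies the thresholds described above (the parsing is a simplification, and real top output varies by platform):

def classify_load(first_line):
    # Expects a line such as:
    # "18:36:08 up 23 days, 3:45, 1 user, load average: 0.00, 0.00, 0.00"
    load_1min = float(first_line.split("load average:")[1].split(",")[0])
    if load_1min > 4.00:
        return "so busy that performance is impacted"
    if load_1min >= 2.00:
        return "busy"
    return "not busy"

print(classify_load(
    "18:36:08 up 23 days, 3:45, 1 user, load average: 0.00, 0.00, 0.00"))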

If one or more of the triage commands indicate that the ESX service is down, then the triage process updates the incident (if enabled) and the event notes. The remediation process is invoked.

Remediation processing

The remediation process takes as input the event ID, the ESX host connection details, and the results of the triage commands indicating which services are down. 

As part of the common framework processing for change management, the remediation workflow determines whether change management is enabled. When it is enabled, the workflow creates the change request and adds the change ID and the related task ID to the event notes. Remediation is then triggered when BMC Atrium Orchestrator receives an alert from BMC Remedy ITSM indicating that the change has been approved and the task status has changed to Assigned.

If the change management process is not enabled, the workflow invokes remediation directly. 
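
This gating can be summarized in a small Python sketch; the flag names are hypothetical, and in the actual workflow the approval is signaled by the ITSM alert rather than by booleans:

def remediation_gate(change_management_enabled, change_approved, task_assigned):
    # Without change management, remediation runs immediately.
    if not change_management_enabled:
        return "remediate"
    # With change management, wait for the approval alert and for the
    # task status to move to Assigned.
    if change_approved and task_assigned:
        return "remediate"
    return "wait"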

During remediation, the workflow connects to the ESX server and runs the remediation commands to restart the services that the triage output identified as down.

The remediation commands are listed in the following XML extract. Do not edit these commands.

<remediation-commands>
<command triage-status="001">service xinetd restart</command>
<command triage-status="010">service vmware-vpxa restart</command>
<command triage-status="011">service vmware-vpxa restart, service xinetd restart</command>
<command triage-status="100">service mgmt-vmware restart</command>
<command triage-status="101">service mgmt-vmware restart,service xinetd restart</command>
<command triage-status="110">service mgmt-vmware restart,service vmware-vpxa restart</command>
<command triage-status="111">service mgmt-vmware restart,service vmware-vpxa restart,service xinetd restart</command>
</remediation-commands>


The following table outlines the relation between the triage status results and the corresponding service restart commands. The three digits represent the Management, vCenter Agent, and xinetd service statuses, in that order; remember that 1 indicates that the service is not active and must be restarted.

Management Service status | vCenter Service status | xinetd Service status | Remediation command
0 | 0 | 1 | service xinetd restart
0 | 1 | 0 | service vmware-vpxa restart
0 | 1 | 1 | service vmware-vpxa restart, service xinetd restart
1 | 0 | 0 | service mgmt-vmware restart
1 | 0 | 1 | service mgmt-vmware restart, service xinetd restart
1 | 1 | 0 | service mgmt-vmware restart, service vmware-vpxa restart
1 | 1 | 1 | service mgmt-vmware restart, service vmware-vpxa restart, service xinetd restart
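
The same mapping can be expressed as a Python sketch; the dictionary mirrors the XML extract above, and run() is a stand-in for whatever remote-execution mechanism the workflow uses to reach the ESX host:

REMEDIATION_COMMANDS = {
    "001": ["service xinetd restart"],
    "010": ["service vmware-vpxa restart"],
    "011": ["service vmware-vpxa restart", "service xinetd restart"],
    "100": ["service mgmt-vmware restart"],
    "101": ["service mgmt-vmware restart", "service xinetd restart"],
    "110": ["service mgmt-vmware restart", "service vmware-vpxa restart"],
    "111": ["service mgmt-vmware restart", "service vmware-vpxa restart",
            "service xinetd restart"],
}

def remediate(triage_status, run):
    # "000" maps to no commands: all three services are running.
    for command in REMEDIATION_COMMANDS.get(triage_status, []):
        run(command)  # run() executes the command on the ESX host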

