Using the Triage and Remediation solution
The Triage and Remediation runbook solution automates triage and remediation actions for high-volume events in the data center. In addition, the solution uses BMC Atrium Orchestrator to link the BMC Remedy ITSM change and incident management applications with the event processing capability of Infrastructure Management.
This topic provides suggested uses for the solution, descriptions of the workflows, configuration tips, and guidelines for customizing the existing workflows.
Overview of workflows
The Triage and Remediation solution and BMC Event and Impact Management solution provide prepackaged workflows to help you manage the following system occurrences:
- Operating system disk space reaches or exceeds its capacity
- Monitored host system is down (triage only)
- VMware ESX server host fails to respond
- Utility workflows for starting or restarting servers and services (each includes a validation phase)
- Database reaches or exceeds a specified tablespace limit
- Scheduled backup for a specified IBM Tivoli Storage Manager (TSM) server has failed
- PATROL Agent goes down because of an error
An event notification triggers each workflow. The source of this event can be a BMC PATROL Agent, a BMC Event and Impact Management event, or any event that complies with the slot mapping standards that BMC Atrium Orchestrator supports. The event, in turn, is triggered when a specified threshold attribute is breached for the particular monitored component.
How to launch workflows from Infrastructure Management
You can launch a manual, on-demand workflow from the Events Console of the operator console. You can also define a remote action policy to automate the workflow launch from the administrator console.
The parameters for each workflow are defined in the ao_actions.mrl file. The chief difference between manual and automated workflows is that you can specify input parameters when you launch a manual workflow, whereas automated workflows use default values. For example, an automated workflow automatically creates incidents and change requests. You can, however, change the default values for automated workflows in the installationDirectory\pw\server\etc\cellName\kb\bin\ao_actions.mrl file. (To recompile the knowledge base and restart the server cell, see Configuring Infrastructure Management for the Triage and Remediation Solution.)
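When several cells are involved, it can help to compute the location of each ao_actions.mrl file rather than typing it by hand. The following is a minimal sketch that only joins the path segments named above. The installation directory and cell name in the usage line are hypothetical placeholders, not values from this topic.

```python
from pathlib import PureWindowsPath


def ao_actions_path(installation_directory: str, cell_name: str) -> PureWindowsPath:
    """Join the path segments described above into the ao_actions.mrl location."""
    return PureWindowsPath(
        installation_directory, "pw", "server", "etc", cell_name, "kb", "bin", "ao_actions.mrl"
    )


# Hypothetical installation directory and cell name, for illustration only.
print(ao_actions_path(r"C:\BMC", "mycell"))
```

PureWindowsPath is used so that the sketch produces Windows-style separators even when it runs on another operating system.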
You have the option of making most of the workflows context sensitive to PATROL events by uncommenting and commenting code in the ao_actions.mrl file. (See ProactiveNet server configuration for examples.)
You also have the option of customizing the ao_actions.mrl file to make any defined action context sensitive. For more information, see Configuring the Management Server for the triage and remediation.
Defining a remote action policy to launch a workflow
When defining a policy, you first define the event selector and then define the policy from the administrator console.
Perform the following steps to define the event selection criteria:
- In the Tree view of the administrator console, open the By Selector folder and highlight the selector you added to the remote action policy to open the Selector panel.
- Highlight this selector in the selector list of the Selector panel.
- Click the Update Event Selector icon in the toolbar to enable the edit function.
- In the Event Selector Criteria list of the Selector panel, highlight the selector and click Edit to open the Edit Criteria dialog box.
- In the Edit Criteria dialog box, specify the slots and values for events that you want the selector to match.
For example, you can specify the matching criteria in the event message slot, such as
$EV.msg contains 'unreachable'.
- Click OK.
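To reason about which events a criterion such as $EV.msg contains 'unreachable' would select, you can model it outside the product. The following is a minimal sketch, assuming events are represented as plain dictionaries of slot names and values. The sample events are invented, and the comparison here is case-sensitive, which may or may not match how the cell evaluates the criterion.

```python
def msg_contains(event: dict, needle: str) -> bool:
    """Approximate a `$EV.msg contains '<needle>'` criterion on a dict of slots."""
    return needle in event.get("msg", "")


# Invented sample events, for illustration only.
events = [
    {"msg": "Host db01 is unreachable", "severity": "CRITICAL"},
    {"msg": "Disk usage at 91%", "severity": "WARNING"},
]

matched = [event for event in events if msg_contains(event, "unreachable")]
for event in matched:
    print(event["msg"])
```

Only the first sample event passes the check, so it is the only one that a policy built on this selector would act on in this model.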
Perform the following steps to define the remote action policy:
- In the administrator console, click the Event Management Policies tab.
- In the tree view under My Production, open the server cell entry.
- Under the Policy Type folder, select Remote Action Policy.
- Click the Add Event Policy icon in the toolbar.
- In the Selector Chooser dialog box, choose the selector to which this policy and the designated workflow apply. Then, click OK.
- In the Remote Action Policy tab, enter the policy name (required) and a description (optional).
- Designate whether the timeframes are enabled.
If enabled, indicate whether policy activation timeframes are always active (default value), or select the option to define the schedule of your timeframes.
- In the Action Name list, select the automatic workflow action to apply to this policy.
List of automatic workflow actions
- Click OK.
The event selection criteria are applied to the remote action policy.
Launching a workflow on demand from the Events Console
By default, each workflow applies to all events. From the operator console, perform the following steps to select and launch a workflow from the Events Console:
- Identify the appropriate event for the workflow, and click the Tools icon to display the menu.
- Choose Remote Actions > Atrium Orchestrator Actions to display a list of workflows.
List of on-demand workflow actions
- Select the workflow.
An Execute Action dialog box opens containing input parameters that are specific to your workflow selection. The following figure shows the Execute Action dialog box for the OS Disk Space Full workflow:
Execute Action dialog box: OS Disk Space Full workflow
For example, you can choose to create an incident ticket and a change request as part of the workflow.
The specific input parameters for each workflow are described in the workflow descriptions later in this topic.
- Make your selections in the Execute Action dialog box, and click Execute to launch the workflow.
An Action Results icon and a Related Events icon are displayed in the event row. An information event is returned indicating the action and the target host.
- Verify the event notes in the Details pane. The event notes describe the stages of the workflow's execution and indicate whether the workflow has been launched successfully. See the following figure:
Event notes in the Details pane
To check the results of the workflow action
Click the Action Results icon to display the Event Remote Action Results dialog box. You can view output, errors, and details associated with the workflow action. The exit code 0 indicates successful execution. Otherwise, the exit code defaults to -1.
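If you collect action results in a script or report, the exit code rule above can be expressed as a small helper. This is a sketch only: the function name and the wording of the returned labels are assumptions, and the only facts taken from this topic are that 0 means success and that other results default to -1.

```python
def describe_exit_code(exit_code: int) -> str:
    """Translate a workflow action exit code into a short status label."""
    if exit_code == 0:
        return "success"
    # Any other value, including the default of -1, is treated as a failure.
    return f"failure (exit code {exit_code})"


print(describe_exit_code(0))
print(describe_exit_code(-1))
```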
To view related events
Click the Related Events icon.
Related Events option
The Event List window opens. In this window, you can view, filter, and perform actions on the subsequent events that are related to the workflow launch.
Event List window
Verifying that the Infrastructure Management Server and the Atrium Orchestrator server can communicate
On the Infrastructure Management Server side, when you run an Atrium Orchestrator workflow on an event, the system populates the mc_operations field with a message. If you do not see a message in the Details window of the console, check the Infrastructure Management Server cell trace file in the installationDirectory\pw\server\log\cellName\trace directory. When the workflow runs, it updates the mc_notes field in the Details window with status information and displays an Action Results icon next to the event. An action exit code of 0 indicates successful execution. A nonzero action exit code indicates unsuccessful execution.
To obtain more debugging information about the Infrastructure Management Server cell, run the following command:
mgetinfo -n cellName connect
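If you run this check from a script, the command above can be wrapped as follows. This is a minimal sketch: it uses only the arguments shown above, assumes that mgetinfo is on the PATH of the account running the script, and returns None when the executable cannot be found. The function names and the cell name in the usage lines are hypothetical.

```python
import shutil
import subprocess
from typing import List, Optional


def mgetinfo_connect_command(cell_name: str) -> List[str]:
    """Build the argument list for `mgetinfo -n cellName connect`."""
    return ["mgetinfo", "-n", cell_name, "connect"]


def run_mgetinfo_connect(cell_name: str) -> Optional[str]:
    """Run the command and return its standard output, or None if mgetinfo is not found."""
    command = mgetinfo_connect_command(cell_name)
    if shutil.which(command[0]) is None:
        return None
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    # "mycell" is a placeholder cell name.
    output = run_mgetinfo_connect("mycell")
    print(output if output is not None else "mgetinfo was not found on the PATH")
```

Passing the arguments as a list avoids shell quoting problems if a cell name contains unusual characters.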
On the Atrium Orchestrator server side, check the grid.log file in the C:\Program Files\BMC\AO\tomcat\logs directory to verify whether the workflow action has been received.
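Searching grid.log by hand can be slow on a busy grid. The following is a minimal sketch that returns the log lines containing a given search term. The format of grid.log is not described in this topic, so the search term in the usage lines is a hypothetical example, and you would substitute whatever text identifies your workflow.

```python
from pathlib import Path
from typing import List


def lines_mentioning(log_path: str, needle: str) -> List[str]:
    """Return the lines of a log file that contain the given text."""
    path = Path(log_path)
    if not path.is_file():
        # A missing log file is reported as "no matches" instead of raising.
        return []
    with path.open(encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in handle if needle in line]


if __name__ == "__main__":
    # The search term below is a placeholder, not a documented log message.
    log_file = r"C:\Program Files\BMC\AO\tomcat\logs\grid.log"
    for line in lines_mentioning(log_file, "Disk_Space"):
        print(line)
```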