Workflow - Failed Backup and Recovery
This triage and remediation workflow is triggered by a BMC PATROL event indicating that a scheduled backup job specified on an IBM Tivoli Storage Manager (TSM) server has failed. The workflow logs on to the specified server, checks the error log for messages indicating backup failure, and then schedules a restart of the backup job.
Configuration guidelines
To enable this workflow, you must configure the AutoPilot Credentials Store and the Failed_Backup_Recovery configuration module under the Triage and Remediation runbook module on the BMC Atrium Orchestrator server side.
Ensure that the following credentials are added to the AutoPilot Credentials Store:
- Host system(s) where the Tivoli Storage Manager instance resides
- The credentials of the hosts to which you want to connect
In addition, ensure that the Lightweight Activity Peer (LAP) is installed on the Tivoli Storage Manager client. The Tivoli Storage Manager actor adapter resides on the Tivoli Storage Manager client node. The Tivoli Storage Manager actor adapter must have the same name as the node name of the client. The Tivoli Storage Manager actor adapter must be enabled on the LAP.
The Failed_Backup_Recovery configuration module contains the following definitions:
Group/Item | Description |
---|---|
AO_Host | BMC AO host where the Configuration Distribution Peer (CDP) server resides. The ping attempts are launched from this server. |
WF_Detailed_Logging_Flag | Enables detailed logging of this specific workflow in the Infrastructure Management Performance Manager Operations Console. The valid values are use default, true, and false. The use default value applies the true or false value specified in the Detailed Logging File item under the Runbook_Defaults configuration folder, which applies to all workflows. You can override that default by specifying the opposite value in the WF_Detailed_Logging_Flag item. |
Extract_log | Contains the different OS forms of the tail command for extracting recent information from log files (Windows_2012). Note: Install the WinTail utility on the system where the Tivoli Storage Manager client is installed. To extract the log details, perform the following tasks: |
UNIX_Dsmerror_Logpath | File path to the Tivoli Storage Manager dsmerror.log file (the client error log file) on UNIX systems: /opt/tivoli/tsm/client/ba/bin/dsmerror.log |
Windows_Dsmerror_Logpath | File path to the Tivoli Storage Manager dsmerror.log file on Windows systems: C:\Program Files\Tivoli\TSM\baclient\dsmerror.log |
Pause_Before_Reschedule_Min | Wait time in minutes before BMC AO reschedules the failed backup job with the Tivoli Storage Manager server. This pause is required if you need to take a manual action, such as starting a service, on the target server before the scheduled backup job runs again. A scheduled job is designed to run at the exact scheduled moment. |
Windows_DsmScheduler_ServiceName | Name of the Tivoli Storage Manager scheduler service that is started after the workflow reschedules the failed job. The default value is TSM Scheduler. If you install the service under a different name, modify the configuration value accordingly. |
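The Extract_log item relies on an OS-specific tail command to read the most recent entries of dsmerror.log. As a rough illustration of what that extraction amounts to, the following Python sketch returns the last lines of a log file. The function name tail_lines and the default of 50 lines are assumptions made for this example; they are not part of the product.

```python
from collections import deque
from pathlib import Path


def tail_lines(path, count=50):
    """Return the last `count` lines of a text file (hypothetical helper).

    `path` would typically be the value of UNIX_Dsmerror_Logpath or
    Windows_Dsmerror_Logpath from the configuration module.
    """
    with Path(path).open("r", encoding="utf-8", errors="replace") as handle:
        # A deque with maxlen keeps only the most recent `count` lines,
        # so the whole file never has to be held in memory.
        recent = deque(handle, maxlen=count)
    return [line.rstrip("\n") for line in recent]
```

Because the file is streamed line by line, this approach stays cheap even when the error log has grown large.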
You can find the action definitions for the Failed Backup and Recovery workflow in the installationDirectory\pw\server\etc\cellName\kb\bin\ao_actions.mrl file.
An extract from the ao_actions.mrl file depicts the action definitions of this workflow for both the manual, on-demand launch and the automatic launch via a remote action policy.
# Backup and Recovery TSM Workflow
action 'Atrium Orchestrator Actions'.'Triage and Remediate Failed Backup and Recovery Workflow':
{
['Administrator', 'Full Access', 'Data Collection Administrator', 'Event Administrator', 'Event Operator', 'Data Collection Operator', 'Event Operator Supervisor', 'Data Collection Supervisor']
}
[
'Create Change Request':MC_TRUEFALSE($CREATECHANGERQUEST),
'Change Request Type':MC_CHANGEREQUESTTYPE($CHANGEREQUESTTYPE),
'Create/Update Incident' : MC_TRUEFALSE($CREATEINCIDENT),
'Remediate' : MC_TRUEFALSE($REMEDIATE)
]
:EVENT($EV) where [ $EV.status != 'CLOSED' AND $EV.status != 'BLACKOUT']
#If you want this action to be context sensitive to PATROL Events monitoring Disk usage.
#Comment above line and uncomment below line.
#:PATROL_EV($EV)
{
action_requestor($UID,$PWD);
opadd($EV, "Triage and Remediate Failed Backup and Recovery manual", $UID);
admin_execute(BEMGW,$EV,"Atrium_Orchestrator_Failed_Backup_Recovery_Workflow",[$CREATECHANGERQUEST, $CHANGEREQUESTTYPE, $CREATEINCIDENT, $REMEDIATE,$UID],YES);
}
END
END
#Backup Recovery Workflow - Automatic.
action 'Atrium Orchestrator Actions-Automatic'.'Triage and Remediate Failed Backup and Recovery':
{
['Administrator', 'Full Access', 'Data Collection Administrator', 'Event Administrator']
}
:EVENT($EV) where [ $EV.status != 'CLOSED' AND $EV.status != 'BLACKOUT']
{
opadd($EV, "Triage and Remediate Failed Backup and Recovery Auto", "BMC Impact Manager");
admin_execute(BEMGW,$EV,"Atrium_Orchestrator_Failed_Backup_Recovery_Workflow",["true","normal","true","true","BMC Impact Manager"],YES);
}
END
END
Refine rules to validate Failed Backup and Recovery events
To help process the incoming events that alert users of a failed backup and recovery condition, a new refine rule has been added to the cell. This rule applies to PATROL events and is especially defined to accommodate the requirements of the PATROL event type.
refine split_tsm_patrol_event: PATROL_EV ($EV)
where [$EV.status != 'CLOSED' AND
$EV.mc_object_class == 'TSM_JOB']
{
strmatch($EV.msg, '%s(domain: <%s>; node: <%s>; schedule: <%s>%s',[$JOBN,$DOMAIN,$NODE,$SCHEDULE,$OTHER]);
$EV.mc_object = $NODE;
$EV.mc_object_owner = $SCHEDULE;
$EV.mc_object_class = $DOMAIN;
strmatch($JOBN, '%s: %s @ %s %s %s',[$JOBN1,$JOBN2,$JOBN3,$JOBN4,$JOBN5]);
$EV.mc_parameter_value = $JOBN2 || ' @ ' || $JOBN3 || $JOBN4;
}
Before these new refine rules can take effect, you must recompile the Knowledge Base (KB) of the cell through the mccomp command and then restart the cell. See Configuring Infrastructure Management for the Triage and Remediation Solution.
For more information about refine rules, see Refine rules.
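To make the effect of the refine rule easier to follow, the sketch below reproduces its slot assignments in Python. It is only an approximation: the regular expressions stand in for the strmatch patterns and may not match exactly the way the cell does, and any sample message passed to it is hypothetical because the exact text of the PATROL event is not shown here.

```python
import re

# Approximates: '%s(domain: <%s>; node: <%s>; schedule: <%s>%s'
EVENT_RE = re.compile(
    r"(?P<job>.*?)\(domain: <(?P<domain>[^>]*)>; node: <(?P<node>[^>]*)>;"
    r"\s*schedule:\s*<(?P<schedule>[^>]*)>"
)
# Approximates: '%s: %s @ %s %s %s'
JOB_RE = re.compile(
    r"(?P<j1>.*?): (?P<j2>.*?) @ (?P<j3>\S+) (?P<j4>\S+) (?P<j5>.*)"
)


def split_tsm_message(msg):
    """Return the event slots the refine rule would set, or None if no match."""
    event = EVENT_RE.match(msg)
    if event is None:
        return None
    slots = {
        "mc_object": event["node"],
        "mc_object_owner": event["schedule"],
        "mc_object_class": event["domain"],
    }
    job = JOB_RE.match(event["job"])
    if job is not None:
        # Like the rule, the third and fourth fields are joined
        # without a separator.
        slots["mc_parameter_value"] = job["j2"] + " @ " + job["j3"] + job["j4"]
    return slots
```

The resulting keys correspond to the event slots that the workflow later reads for the node, schedule, domain, and job name.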
Launching the workflow
Launch the Failed Backup and Recovery workflow by selecting an appropriate PATROL event in the event view of the operator console indicating that a scheduled backup failure has occurred. You can also define a remote action policy for the scheduled backup failure to launch the Failed Backup and Recovery workflow automatically.
From the Events Console of the operator console, select the event, and choose the Tools > Remote Actions > Atrium Orchestrator Actions > Triage and Remediate Backup Recovery workflow entry. Then, fill in the Execute Actions dialog box. See the following table to determine which input value to select for the Backup Recovery workflow.
Input parameter | Description |
---|---|
Create Change Request | Boolean. True/false indicator that shows whether you want to create a change request in BMC Remedy Change Management System. If you choose false, the Change Request Type parameter is ignored. |
Change Request Type | String. Specifies the type of change request (normal/preapproved). |
Create/Update Incident | Boolean. True/false indicator that shows whether you want to create an incident in the incident management system. If an incident already exists, then the workflow updates the existing incident information. |
Remediate | Boolean. True/false indicator that shows whether you want to proceed with the remediation action. |
From the administration console, define a remote action policy under the Event Management Policies tab. Complete the policy definition by choosing the appropriate automatic workflow action for backup recovery.
Common framework: event processing
The PATROL event that triggers the Failed Backup and Recovery workflow contains the following parameters:
- Target (or client) host system where the Tivoli Storage Manager client resides
- Job name
- Node name
- Schedule name
- Domain name
The following table describes the mapping of the main parameters to event definitions:
Event mapping: Failed Backup and Recovery
Parameter | Event slot |
---|---|
Target (or client) host | mc_host |
Job name | mc_parameter_value |
Node name | mc_object |
Schedule name | mc_object_owner |
Domain name | mc_object_class |
After extracting the configuration data from the event, the common framework determines the logging level for the workflow, sets the level at normal or detailed, and updates the event information in the Notes dialog box of the Operations Console accordingly.
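The logging-level decision can be pictured with a short Python sketch. It assumes that the WF_Detailed_Logging_Flag value arrives as one of the strings use default, true, or false, and that the Runbook_Defaults setting is available as a Boolean. The function name is hypothetical and the real framework logic may differ.

```python
def resolve_logging_level(workflow_flag, runbook_default):
    """Return "detailed" or "normal" for a workflow run.

    workflow_flag: value of WF_Detailed_Logging_Flag
                   ("use default", "true", or "false").
    runbook_default: Boolean taken from the Runbook_Defaults folder.
    """
    flag = workflow_flag.strip().lower()
    if flag == "use default":
        detailed = runbook_default      # inherit the global setting
    elif flag in ("true", "false"):
        detailed = flag == "true"       # workflow-level override
    else:
        raise ValueError(f"unsupported flag value: {workflow_flag!r}")
    return "detailed" if detailed else "normal"
```

For example, a workflow flag of false combined with a global default of true yields normal logging for this workflow only.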
Next, the AO server host tries to ping the target host system. If it receives a reply, the workflow launches the Get PATROL Annotation Data subworkflow, because the event is a PATROL event, to gather annotation details about the event. The subworkflow updates the event notes with the event information and with any annotation details. It then proceeds to start the triage process.
Triage processing
Triage processing begins when the BMC Atrium Orchestrator host attempts to ping the node where the Tivoli Storage Manager (TSM) client resides. If the ping is successful, the workflow proceeds to examine the latest status of the TSM backup job schedule for which the event was generated.
If the status of the scheduled job is still "failed," the workflow launches commands to extract the TSM error code that is associated with the failed TSM client job. It extracts the error code from the TSM client error log file.
The workflow is designed to process only a subset of the TSM client error codes. If the triage workflow extracts an error code that is among the subset, it launches the remediation action. If the workflow is unable to extract the error code or if the extracted error code lies outside the subset, the workflow exits by updating the corresponding messages in the event notes.
The Tivoli Storage Manager (TSM) error codes are described in the IBM Tivoli Storage Manager documentation under the section that addresses client messages. You can access this documentation through the IBM Tivoli Storage Manager information center at the IBM support site. Review the error code descriptions on the IBM support site to address the TSM errors on the TSM node.
If any manual intervention is required before the remediation begins, the workflow updates the event notes and pauses for the configured amount of time.
The following table summarizes the TSM error messages that the workflow supports. The workflow pauses for a specified time (in minutes) while some of the TSM error codes are addressed manually by the user.
Tivoli Storage Manager error messages and workflow guidelines
Sr No. | Error code | System action | User actions | Workflow actions |
---|---|---|---|---|
1 | ANS1020E | Processing stopped | | |
2 | ANS1558W | The WebSphere backup fails over to an offline backup | This is a triage-only action (no remediation). Enrich the event with the relevant information by checking the error log to discover the nature of the lock error. The lock operation can fail due to any of the following reasons: | No remediation is performed. |
3 | ANS1939E, ANS5197E, ANS1941E, ANS5209E | Processing stopped | | |
4 | ANS9062E, ANS9065E, ANS9089E, ANS9091E, ANS1003E | The backup fails | Attempt the backup again. | |
5 | ANS1258E | Processing stopped | | |
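The triage decision described in this section can be sketched in Python as follows. The error codes come from the table above. The function names, the choice to use the most recent code found in the log, and the three outcome labels are assumptions made for illustration and do not describe the actual workflow implementation.

```python
import re

# Codes from the table; ANS1558W is triage-only (no remediation).
TRIAGE_ONLY = frozenset({"ANS1558W"})
REMEDIABLE = frozenset({
    "ANS1020E", "ANS1939E", "ANS5197E", "ANS1941E", "ANS5209E",
    "ANS9062E", "ANS9065E", "ANS9089E", "ANS9091E", "ANS1003E",
    "ANS1258E",
})
CODE_RE = re.compile(r"\bANS\d{4}[A-Z]\b")


def latest_error_code(log_lines):
    """Return the most recent ANS code in the log lines, or None."""
    for line in reversed(log_lines):
        match = CODE_RE.search(line)
        if match:
            return match.group(0)
    return None


def triage(log_lines):
    """Return (outcome, code): "remediate", "triage_only", or "exit"."""
    code = latest_error_code(log_lines)
    if code is None:
        return ("exit", None)           # no code could be extracted
    if code in TRIAGE_ONLY:
        return ("triage_only", code)    # enrich the event, no remediation
    if code in REMEDIABLE:
        return ("remediate", code)
    return ("exit", code)               # code lies outside the subset
```

In both exit cases, the workflow would update the event notes with a corresponding message instead of starting remediation.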
Remediation processing
If the triage processing extracts a valid error code within the subset of error codes that the workflow handles, the remediation processing begins. The remediation consists of rescheduling the backup job to run as a newly scheduled job.
The workflow takes the event data and the error code for the job failure and then sends a schedule query to the TSM server. The workflow extracts details such as schedule description, options, and priority.
Using these details, the workflow creates one more schedule in the TSM that runs only once after a few seconds have elapsed. If this new one-time schedule creation is successful, the workflow associates the schedule with the appropriate node.
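As an illustration of the one-time schedule, the following Python sketch copies the details of the failed schedule into a new schedule that starts shortly after the current time. The Schedule class, the _RETRY suffix, and the 30-second delay are hypothetical; the actual workflow issues commands to the TSM server, and its naming and timing may differ.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Schedule:
    """Subset of the schedule details returned by the schedule query."""
    domain: str
    name: str
    description: str
    options: str
    priority: int
    start: datetime
    one_time: bool = False


def one_time_copy(original, now, delay_seconds=30, suffix="_RETRY"):
    """Build a schedule that runs once, `delay_seconds` after `now`.

    Description, options, and priority are carried over unchanged.
    """
    return replace(
        original,
        name=original.name + suffix,
        start=now + timedelta(seconds=delay_seconds),
        one_time=True,
    )
```

Keeping the original schedule untouched means the regular backup window is unaffected by the retry.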
The workflow tries to restart the TSM client scheduler so that the TSM server can obtain the newly created schedule. You can configure this TSM client scheduler name in the workflow's module configuration. The workflow updates the schedule and the job logs in the event notes.
In the BMC Remedy ITSM scenario, the event that triggered the triage and remediation workflow is updated with the corresponding change and task IDs. The triggered event, which originated from a Tivoli Storage Manager client node, remains open in Infrastructure Management because the Tivoli Storage Manager knowledge module has no mechanism for closing the event after the successful execution of the schedule. Consequently, the change and incident remain open in BMC Remedy ITSM.
If the scheduled backup fails again, the Tivoli Storage Manager knowledge module generates a PATROL event and sends it to Infrastructure Management. This event informs you that the remediation process did not work.