Workflow - Failed Backup and Recovery

This triage and remediation workflow is triggered by a BMC PATROL event indicating that a scheduled backup job specified on an IBM Tivoli Storage Manager (TSM) server has failed. The workflow logs on to the specified server, checks the error log for messages indicating backup failure, and then schedules a restart of the backup job.

Configuration guidelines

To enable this workflow, you must configure the AutoPilot Credentials Store and the Failed_Backup_Recovery configuration module under the Triage and Remediation runbook module on the BMC Atrium Orchestrator server side.

Ensure that the following credentials are added to the AutoPilot Credentials Store:

Host system(s) where the Tivoli Storage Manager instance resides
The credentials of the hosts to which you want to connect

In addition, ensure that the Lightweight Activity Peer (LAP) is installed on the Tivoli Storage Manager client. The Tivoli Storage Manager actor adapter resides on the Tivoli Storage Manager client node. The Tivoli Storage Manager actor adapter must have the same name as the node name of the client. The Tivoli Storage Manager actor adapter must be enabled on the LAP.

The Failed_Backup_Recovery configuration module contains the following definitions:

Group/Item	Description
AO_Host	BMC AO host where the Configuration Distribution Peer (CDP) server resides. The ping attempts are launched from this server.
WF_Detailed_Logging_Flag	Enables detailed logging of this specific workflow in the Infrastructure Management Performance Manager Operations Console. The valid values are use default, true, and false. The use default value applies the true or false value specified in the Detailed Logging File item under the Runbook_Defaults configuration folder, which applies to all workflows. You can override this value by specifying its opposite value in the WF_Detailed_Logging_Flag item.
Extract_log	Contains the OS-specific forms of the tail command for extracting recent information from log files: (Windows_2003) C:\winTail.bat; (Oracle OS) tail -10 /opt/tivoli/tsm/client/ba/bin/dsmerror.log; (HP_UX) tail -10 /opt/tivoli/tsm/client/ba/bin/dsmerror.log; (Windows_2008) C:\winTail.bat; (Linux) tail -10 /opt/tivoli/tsm/client/ba/bin/dsmerror.log; (AIX) tail -10 /opt/tivoli/tsm/client/ba/bin/dsmerror.log; (Windows_2012) C:\winTail.bat. Note: Install the WinTail utility on the system where the Tivoli Storage Manager client is installed. To extract the log details, perform the following tasks: (1) Install the WinTail utility on the target system and add it to the system path. (2) Write your own batch file that extracts the last n lines of the required file, similar to what the WinTail utility does.
UNIX_Dsmerror_Logpath	File path to the Tivoli Storage Manager dsmerror.log file on UNIX systems (the client error log file): /opt/tivoli/tsm/client/ba/bin/dsmerror.log
Windows_Dsmerror_Logpath	File path to the Tivoli Storage Manager dsmerror.log file on Windows systems: C:\Program Files\Tivoli\TSM\baclient\dsmerror.log
Pause_Before_Reschedule_Min	Wait time in minutes before BMC AO reschedules the failed backup job with the Tivoli Storage Manager server. This pause is required if you need to take a manual action, such as starting a service, on the target server before the scheduled backup job runs again. A scheduled job is designed to run at the exact scheduled moment.
Windows_DsmScheduler_ServiceName	Name of the Tivoli Storage Manager scheduler service that is started after the workflow reschedules the failed job. The default value is TSM Scheduler. If you install the service under a different name, modify the configuration value accordingly.
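
As an illustration of how the Extract_log item can be read, the following Python sketch looks up the log extraction command by operating system label. The dictionary and the extract_command function are hypothetical and are not part of the product; the sketch also assumes that the tail commands take the log path as a separate argument.

```python
# Hypothetical copy of the Extract_log item: OS label -> log extraction command.
# Assumption: the tail commands take the log path as a separate argument.
UNIX_TAIL = "tail -10 /opt/tivoli/tsm/client/ba/bin/dsmerror.log"
WINDOWS_TAIL = r"C:\winTail.bat"

EXTRACT_LOG = {
    "Windows_2003": WINDOWS_TAIL,
    "Windows_2008": WINDOWS_TAIL,
    "Windows_2012": WINDOWS_TAIL,
    "Oracle OS": UNIX_TAIL,
    "HP_UX": UNIX_TAIL,
    "Linux": UNIX_TAIL,
    "AIX": UNIX_TAIL,
}


def extract_command(os_label: str) -> str:
    """Return the configured extraction command, or raise ValueError."""
    if os_label not in EXTRACT_LOG:
        raise ValueError(f"no Extract_log entry for {os_label!r}")
    return EXTRACT_LOG[os_label]
```

A lookup that fails loudly for an unlisted operating system makes a missing configuration entry visible instead of silently running the wrong command.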

You can find the action definitions for the Failed Backup and Recovery workflow in the installationDirectory\pw\server\etc\cellName\kb\bin\ao_actions.mrl file.

An extract from the ao_actions.mrl file depicts the action definitions of this workflow for both the manual, on-demand launch and the automatic launch via a remote action policy.

# Backup and Recovery TSM Workflow
action 'Atrium Orchestrator Actions'.'Triage and Remediate Failed Backup and Recovery Workflow':
{
['Administrator', 'Full Access', 'Data Collection Administrator', 'Event Administrator', 'Event Operator', 'Data Collection Operator', 'Event Operator Supervisor', 'Data Collection Supervisor']
}
[
'Create Change Request':MC_TRUEFALSE($CREATECHANGERQUEST),
'Change Request Type':MC_CHANGEREQUESTTYPE($CHANGEREQUESTTYPE),
'Create/Update Incident' : MC_TRUEFALSE($CREATEINCIDENT),
'Remediate' : MC_TRUEFALSE($REMEDIATE)
]
:EVENT($EV) where [ $EV.status != 'CLOSED' AND $EV.status != 'BLACKOUT']
#If you want this action to be context sensitive to PATROL Events monitoring Disk usage.
#Comment above line and uncomment below line.
#:PATROL_EV($EV)
{
action_requestor($UID,$PWD);
opadd($EV, "Triage and Remediate Failed Backup and Recovery manual", $UID);
admin_execute(BEMGW,$EV,"Atrium_Orchestrator_Failed_Backup_Recovery_Workflow",[$CREATECHANGERQUEST, $CHANGEREQUESTTYPE, $CREATEINCIDENT, $REMEDIATE,$UID],YES);
}
END
END
#Backup Recovery Workflow - Automatic.
action 'Atrium Orchestrator Actions-Automatic'.'Triage and Remediate Failed Backup and Recovery':
{
['Administrator', 'Full Access', 'Data Collection Administrator', 'Event Administrator']
}
:EVENT($EV) where [ $EV.status != 'CLOSED' AND $EV.status != 'BLACKOUT']
{
opadd($EV, "Triage and Remediate Failed Backup and Recovery Auto", "BMC Impact Manager");
admin_execute(BEMGW,$EV,"Atrium_Orchestrator_Failed_Backup_Recovery_Workflow",["true","normal","true","true","BMC Impact Manager"],YES);
}
END
END

Refine rules to validate Failed Backup and Recovery events

To help process the incoming events that alert users of a failed backup and recovery condition, a new refine rule has been added to the cell. This rule applies to PATROL events and is especially defined to accommodate the requirements of the PATROL event type.

refine split_tsm_patrol_event: PATROL_EV ($EV)
where [$EV.status != 'CLOSED' AND
$EV.mc_object_class == 'TSM_JOB']
{
strmatch($EV.msg, '%s(domain: <%s>; node: <%s>; schedule: <%s>%s',[$JOBN,$DOMAIN,$NODE,$SCHEDULE,$OTHER]);
$EV.mc_object = $NODE;
$EV.mc_object_owner = $SCHEDULE;
$EV.mc_object_class = $DOMAIN;
strmatch($JOBN, '%s: %s @ %s %s %s',[$JOBN1,$JOBN2,$JOBN3,$JOBN4,$JOBN5]);
$EV.mc_parameter_value = $JOBN2 || ' @ ' || $JOBN3 || $JOBN4;
}

Before these new refine rules can take effect, you must recompile the Knowledge Base (KB) of the cell through the mccomp command and then restart the cell. See Configuring Infrastructure Management for the Triage and Remediation Solution.

For more information about refine rules, see Refine-rules.
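
To make the slot assignments of the split_tsm_patrol_event rule easier to follow, the Python sketch below approximates the two strmatch patterns with regular expressions. It is an illustration only: the sample message is invented, the refine function is hypothetical, and the way MRL matches %s may differ from the regular expressions used here.

```python
import re

# Approximates '%s(domain: <%s>; node: <%s>; schedule: <%s>%s'.
MSG_RE = re.compile(
    r"(?P<jobn>.*?)\(domain: <(?P<domain>.*?)>; node: <(?P<node>.*?)>; "
    r"schedule: <(?P<schedule>.*?)>(?P<other>.*)",
    re.S,
)
# Approximates '%s: %s @ %s %s %s' (five captured parts).
JOB_RE = re.compile(r"(\S+): (\S+) @ (\S+) (\S+) (.*)", re.S)


def refine(msg: str) -> dict:
    """Return the event slots that the refine rule would set for msg."""
    m = MSG_RE.match(msg)
    if m is None:
        return {}
    slots = {
        "mc_object": m["node"],
        "mc_object_owner": m["schedule"],
        "mc_object_class": m["domain"],
    }
    j = JOB_RE.match(m["jobn"])
    if j is not None:
        # Like the rule, joins part 3 and part 4 without a separator.
        slots["mc_parameter_value"] = j.group(2) + " @ " + j.group(3) + j.group(4)
    return slots


# Invented sample message, for illustration only.
sample = ("Job1: BACKUP_A @ 2013-02-14 01:00:00 failed "
          "(domain: <STANDARD>; node: <NODE01>; schedule: <DAILY>)")
slots = refine(sample)
```

With the invented sample, mc_object receives the node name, mc_object_owner the schedule name, and mc_object_class the domain name, which replaces the original TSM_JOB class value.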

Launching the workflow

Launch the Failed Backup and Recovery workflow by selecting, in the event view of the operator console, a PATROL event that indicates a scheduled backup failure. You can also define a remote action policy for the scheduled backup failure to launch the Failed Backup and Recovery workflow automatically.

From the Events Console of the operator console, select the event, and choose the Tools > Remote Actions > Atrium Orchestrator Actions > Triage and Remediate Backup Recovery workflow entry. Then, fill in the Execute Actions dialog box. See the following table to determine which input value to select for the Backup Recovery workflow.

Input parameter	Description
Create Change Request	Boolean. True/false indicator that shows whether you want to create a change request in BMC Remedy Change Management System. If you choose false, the Change Request Type parameter is ignored.
Change Request Type	String. Specifies the type of change request (normal/preapproved).
Create/Update Incident	Boolean. True/false indicator that shows whether you want to create an incident in the incident management system. If an incident already exists, then the workflow updates the existing incident information.
Remediate	Boolean. True/false indicator that shows whether you want to proceed with the remediation action.
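
The relationship between the input parameters can be sketched in Python as follows. The WorkflowInputs class and the normalize_inputs function are hypothetical names used for illustration; the sketch only encodes what the table states, namely that Change Request Type is ignored when Create Change Request is false and otherwise must be normal or preapproved.

```python
from dataclasses import dataclass
from typing import Optional

VALID_CHANGE_TYPES = {"normal", "preapproved"}


@dataclass(frozen=True)
class WorkflowInputs:
    create_change_request: bool
    change_request_type: Optional[str]  # None when no change request is created
    create_incident: bool
    remediate: bool


def normalize_inputs(create_change_request: bool,
                     change_request_type: Optional[str],
                     create_incident: bool,
                     remediate: bool) -> WorkflowInputs:
    """Apply the rule that the type is ignored without a change request."""
    if not create_change_request:
        change_request_type = None
    elif change_request_type not in VALID_CHANGE_TYPES:
        raise ValueError(f"unsupported change request type: {change_request_type!r}")
    return WorkflowInputs(create_change_request, change_request_type,
                          create_incident, remediate)
```

For comparison, the automatic action definition in ao_actions.mrl passes the equivalent of true, normal, true, true.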

From the administration console, define a remote action policy under the Event Management Policies tab. Complete the policy definition by choosing the appropriate automatic workflow action for backup recovery.

Common framework: event processing

The PATROL event that triggers the Failed Backup and Recovery workflow contains the following parameters:

Target (or client) host system where the Tivoli Storage Manager client resides
Job name
Node name
Schedule name
Domain name

The following table describes the mapping of the main parameters to event definitions:

Event mapping: Failed Backup and Recovery

Parameter	Event slot
Target (or client) host	mc_host
Job name	mc_parameter_value
Node name	mc_object
Schedule name	mc_object_owner
Domain name	mc_object_class

After extracting the configuration data from the event, the common framework determines the logging level for the workflow, sets the level to normal or detailed, and updates the event information in the Notes dialog box of the Operations Console accordingly.
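
The logging level decision can be pictured with the following Python sketch. It assumes that the level follows the WF_Detailed_Logging_Flag item, with use default falling back to the value from the Runbook_Defaults configuration folder; the function names are hypothetical.

```python
def resolve_detailed_logging(wf_flag: str, runbook_default: bool) -> bool:
    """Resolve WF_Detailed_Logging_Flag against the Runbook_Defaults value."""
    value = wf_flag.strip().lower()
    if value == "use default":
        return runbook_default
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError(f"invalid WF_Detailed_Logging_Flag value: {wf_flag!r}")


def logging_level(wf_flag: str, runbook_default: bool) -> str:
    """Return 'detailed' or 'normal' for the workflow run."""
    return "detailed" if resolve_detailed_logging(wf_flag, runbook_default) else "normal"
```

For example, a workflow flag of false yields normal logging even when the runbook default is true, which is the override described for the configuration item.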

Next, the AO server host tries to ping the target host system. If it receives a reply, then because the event is a PATROL event, the workflow launches the Get PATROL Annotation Data subworkflow to gather annotation details about the event. The subworkflow updates the event notes with the event information and any annotation details, and the workflow then starts the triage process.

Triage processing

Triage processing begins when the BMC Atrium Orchestrator host attempts to ping the node where the Tivoli Storage Manager (TSM) client resides. If the ping is successful, the workflow proceeds to examine the latest status of the TSM backup job schedule for which the event was generated.

If the status of the scheduled job is still "failed," the workflow launches commands to extract the TSM error code that is associated with the failed TSM client job. It extracts the error code from the TSM client error log file.

The workflow is designed to process only a subset of the TSM client error codes. If the triage workflow extracts an error code that is among the subset, it launches the remediation action. If the workflow is unable to extract the error code or if the extracted error code lies outside the subset, the workflow exits by updating the corresponding messages in the event notes.
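
A minimal Python sketch of this decision follows. It assumes that the error code is taken from the most recent matching line in the extracted log tail and that codes have the form ANS, four digits, and a severity letter; the function names and the sample log lines are invented for illustration. The set of handled codes is taken from the error message table in this topic.

```python
import re
from typing import Iterable, Optional

# Codes listed in the error message table of this topic.
SUPPORTED_CODES = frozenset({
    "ANS1020E", "ANS1558W", "ANS1939E", "ANS5197E", "ANS1941E", "ANS5209E",
    "ANS9062E", "ANS9065E", "ANS9089E", "ANS9091E", "ANS1003E", "ANS1258E",
})
CODE_RE = re.compile(r"\bANS\d{4}[A-Z]\b")


def last_error_code(log_lines: Iterable[str]) -> Optional[str]:
    """Return the code on the most recent line that contains one."""
    for line in reversed(list(log_lines)):
        match = CODE_RE.search(line)
        if match:
            return match.group(0)
    return None


def triage(log_lines: Iterable[str]) -> str:
    """Decide whether the workflow continues or exits with a note."""
    code = last_error_code(log_lines)
    if code is None:
        return "exit: no error code found"
    if code not in SUPPORTED_CODES:
        return f"exit: {code} is not handled"
    return f"continue: {code}"
```

Under the stated assumption, a tail whose last coded line carries ANS1020E continues, while a tail with no ANS code or with an unlisted code exits with a note.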

The Tivoli Storage Manager (TSM) error codes are described in the IBM Tivoli Storage Manager documentation under the section that addresses client messages. You can access this documentation through the IBM Tivoli Storage Manager information center at the IBM support site. Review the error code descriptions on the IBM support site to address the TSM errors on the TSM node.

If any manual intervention is required before the remediation begins, the workflow updates the event notes and pauses for the configured amount of time.

The following table summarizes the TSM error messages that the workflow supports. The workflow pauses for a specified time (in minutes) while some of the TSM error codes are addressed manually by the user.

Tivoli Storage Manager error messages and workflow guidelines

Sr No.	Error code	System action	User actions	Workflow actions
1	ANS1020E	Processing stopped	Check the error log. Restart the Windows service associated with the system object indicated in the error log. Retry the backup operation.	Checks the error log and extracts the error code. Enriches the event with the last n lines of the error log. Pauses for the configured amount of time for the user to restart the service. Remediates by rescheduling the job.
2	ANS1558W	The WebSphere backup fails over to an offline backup	This is a triage-only action (no remediation). Enrich the event with the relevant information by checking the error log to discover the nature of the lock error. The lock operation can fail for any of the following reasons: the WAS server is not running; the repository is already locked; security is turned on and there is no WAS user/password file; security is turned on and the information in the WAS user/password file is bad.	Checks the error log and extracts the error code. Enriches the event with the last n lines of the error log. No remediation is performed.
3	ANS1939E, ANS5197E, ANS1941E, ANS5209E	Processing stopped	Examine the Windows File Replication Service Event log in the systemRoot\Debug folder. Ensure that File Replication Service is working properly. If determined to be satisfactory, restart the service, and retry the backup operation.	Checks the error log and extracts the error code. Enriches the event with the last n lines of the error log. Pauses for the configured amount of time for the user to examine the Windows File Replication Service Event log. Remediates by rescheduling the job.
4	ANS9062E, ANS9065E, ANS9089E, ANS9091E, ANS1003E	The backup fails	Attempt the backup again.	Checks the error log and extracts the error code. Enriches the event with the last n lines of the error log. Remediates by rescheduling the job.
5	ANS1258E	Processing stopped	Check the error log for any messages that might indicate a reason for the failure. Enrich the incident. After the problem is corrected, retry the job.	Checks the error log and extracts the error code. Enriches the event with the last n lines of the error log. Pauses for the configured amount of time for the user to take corrective action. Remediates by rescheduling the job.
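
The workflow actions in the table can be summarized as a lookup from error code to a small plan, as in the following Python sketch. The Plan type and the steps function are hypothetical; the pause and remediate flags are read directly from the Workflow actions column.

```python
from typing import Dict, List, NamedTuple


class Plan(NamedTuple):
    pause_for_user: bool  # wait Pause_Before_Reschedule_Min minutes first
    remediate: bool       # reschedule the failed job


PAUSE_THEN_RESCHEDULE = Plan(pause_for_user=True, remediate=True)
RESCHEDULE_ONLY = Plan(pause_for_user=False, remediate=True)
TRIAGE_ONLY = Plan(pause_for_user=False, remediate=False)

PLANS: Dict[str, Plan] = {
    "ANS1020E": PAUSE_THEN_RESCHEDULE,
    "ANS1558W": TRIAGE_ONLY,
    "ANS1939E": PAUSE_THEN_RESCHEDULE,
    "ANS5197E": PAUSE_THEN_RESCHEDULE,
    "ANS1941E": PAUSE_THEN_RESCHEDULE,
    "ANS5209E": PAUSE_THEN_RESCHEDULE,
    "ANS9062E": RESCHEDULE_ONLY,
    "ANS9065E": RESCHEDULE_ONLY,
    "ANS9089E": RESCHEDULE_ONLY,
    "ANS9091E": RESCHEDULE_ONLY,
    "ANS1003E": RESCHEDULE_ONLY,
    "ANS1258E": PAUSE_THEN_RESCHEDULE,
}


def steps(code: str, pause_min: int) -> List[str]:
    """List the workflow actions for a supported code, in order."""
    plan = PLANS[code]  # KeyError for codes outside the supported subset
    actions = ["enrich event with last n lines of the error log"]
    if plan.pause_for_user:
        actions.append(f"pause {pause_min} min for manual action")
    actions.append("reschedule job" if plan.remediate else "no remediation")
    return actions
```

Keeping the per-code behavior in one table makes it easy to see that only ANS1558W ends without remediation.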

Remediation processing

If the triage processing extracts a valid error code within the subset of error codes that the workflow handles, the remediation processing begins. The remediation consists of rescheduling the backup job to run as a newly scheduled job.

The workflow takes the event data and the error code for the job failure and then sends a schedule query to the TSM server. The workflow extracts details such as schedule description, options, and priority.

Using these details, the workflow creates an additional schedule on the TSM server that runs only once, after a few seconds have elapsed. If this new one-time schedule creation is successful, the workflow associates the schedule with the appropriate node.
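
The following Python sketch shows what the query, one-time schedule, and association steps might look like as TSM administrative commands. This topic does not list the exact commands that the workflow issues, so the command strings, the _RERUN suffix, the default priority, and the function names are assumptions for illustration. The sketch does no quoting or escaping of values; verify parameter names against the IBM administrator's reference before use.

```python
from typing import List


def query_schedule_command(domain: str, schedule: str) -> str:
    """Build a detailed query for the schedule that failed."""
    return f"query schedule {domain} {schedule} format=detailed"


def one_time_schedule_commands(domain: str, schedule: str, node: str,
                               description: str = "", options: str = "",
                               priority: int = 5,
                               suffix: str = "_RERUN") -> List[str]:
    """Build commands for a one-time copy of the schedule and its association."""
    new_name = f"{schedule}{suffix}"  # hypothetical naming convention
    parts = ["define schedule", domain, new_name, "type=client"]
    if description:
        parts.append(f'description="{description}"')
    if options:
        parts.append(f'options="{options}"')
    parts += [f"priority={priority}", "starttime=NOW", "perunits=onetime"]
    return [" ".join(parts),
            f"define association {domain} {new_name} {node}"]
```

Copying the description, options, and priority from the queried schedule keeps the one-time run equivalent to the job that failed, while the separate name avoids altering the recurring schedule.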

The workflow tries to restart the TSM client scheduler so that the TSM server can obtain the newly created schedule. You can configure this TSM client scheduler name in the workflow's module configuration. The workflow updates the schedule and the job logs in the event notes.

In the BMC Remedy ITSM scenario, the event that triggered the triage and remediation workflow is updated with the corresponding change and task IDs. The triggered event, which originated from a Tivoli Storage Manager client node, remains open in Infrastructure Management because the Tivoli Storage Manager knowledge module has no mechanism for closing the event after the successful execution of the schedule. Consequently, the change and incident remain open in BMC Remedy ITSM.

If the scheduled backup fails again, the Tivoli Storage Manager knowledge module generates a PATROL event and sends it to Infrastructure Management. This event informs you that the remediation process did not work.