Automating remediation for recurring problems


Automating complex routine tasks and the remediation of recurring problems improves the MTTR, process efficiency, and the quality of work. With automation workflows handling task and problem resolution, the team can focus on high-value work such as innovation.

The following video (1:57) provides an overview of automated remediation:

Watch the YouTube video about the overview of automated remediation.


Automated remediation

When an identified problem occurs, automated remediation is triggered based on the event without requiring manual intervention. This ensures that the issue is addressed promptly.

Examples

The following examples illustrate the automation workflow and its benefits:

  • Creating a change request can fail when the BMC Atrium Impact Simulator plug-in times out, which generates a timeout event in BMC Helix Operations Management. To make sure that the change request is created successfully, an automation sequence restarts the BMC Atrium Impact Simulator plug-in.
  • The AR System becomes overpopulated with RE:Job_Runs reconciliation job history records, which causes performance issues and affects the health of BMC Helix Configuration Management Database. To resolve this issue proactively, an automation sequence clears records older than 24 hours from the RE:Job_Runs job history table.

Remediation workflow

Automated remediation workflow sequence

Typical tasks in the workflow

Before you begin

Depending on your organization and user roles, a tenant administrator, an automation engineer, or an operator performs the following tasks. An operator can be a network operations center (NOC) operator, a major incident management (MIM) operator, or a site reliability engineer (SRE).

  1. A tenant administrator defines event monitoring policies based on the identified routine tasks or recurring problems. For more information, see Defining monitor policies and Enriching the events.

  2. (Optional) A tenant administrator can configure the event policies to work with BMC Helix ITSM for creating incident tickets against the events in BMC Helix Operations Management.

    Recommended BMC Helix ITSM version
    Use BMC Helix ITSM version 20.02 or later.

  3. An automation engineer creates an automation policy in BMC Helix Intelligent Automation that contains the remediation actions for the identified problem. The execution mode of the policy is set to Automatic so that the remediation actions are triggered automatically. For more information, see Creating automation policies.
  4. A tenant administrator or editor creates a dashboard to track and monitor the progress of all automated actions. For more information, see Setting up dashboards.

  5. The operator monitors and tracks the remediation status.

Workflow sequence

  1. An identified problem occurs.
  2. An event is generated in BMC Helix Operations Management and the automation policy is triggered.
  3. (Optional) An incident is created in BMC Helix ITSM.
  4. The event metrics and incident status are sent to BMC Helix Dashboards.
  5. The automation policy runs the remediation action.
  6. The event in BMC Helix Operations Management and incident in BMC Helix ITSM are closed, and the status is sent to BMC Helix Dashboards.
    The following diagram illustrates the workflow sequence:
    [Diagram: automated remediation workflow sequence]
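
The workflow sequence above can be sketched as a small in-memory simulation. This is only an illustration of the order of the steps, not a BMC Helix API: the Event, Incident, and Dashboard classes and the run_workflow function are hypothetical names introduced here, and the remediation action is passed in as a plain Python callable.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Incident:
    """Hypothetical stand-in for an incident ticket."""
    summary: str
    status: str = "Open"


@dataclass
class Event:
    """Hypothetical stand-in for a monitoring event."""
    message: str
    status: str = "Open"
    incident: Optional[Incident] = None


@dataclass
class Dashboard:
    """Hypothetical stand-in for a dashboard that records status updates."""
    updates: List[str] = field(default_factory=list)

    def publish(self, event: Event) -> None:
        incident_status = event.incident.status if event.incident else "none"
        self.updates.append(f"event={event.status} incident={incident_status}")


def run_workflow(
    message: str,
    remediate: Callable[[], bool],
    dashboard: Dashboard,
    create_incident: bool = True,
) -> Event:
    # Steps 1-2: a problem occurs and an event is generated.
    event = Event(message)
    # Step 3 (optional): an incident is created for the event.
    if create_incident:
        event.incident = Incident(summary=message)
    # Step 4: the current event and incident status go to the dashboard.
    dashboard.publish(event)
    # Step 5: the automation policy runs the remediation action.
    if remediate():
        # Step 6: on success, the event and the incident are closed.
        event.status = "Closed"
        if event.incident:
            event.incident.status = "Closed"
    # The final status is reported (unchanged if remediation failed).
    dashboard.publish(event)
    return event
```

For example, run_workflow("plug-in timeout", lambda: True, Dashboard()) returns an event whose status and incident status are both Closed, while a remediation callable that returns False leaves them Open for an operator to follow up.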

Use cases

The following typical use cases list, in order, the problem cause, the workflow sequence, and the benefit:

Use case 1

Problem cause:
The RBE:Messages form in BMC Helix ITSM processes email records sent to the AR System Email Engine.

Occasionally, these email records get stuck due to performance or customization issues.

This causes a delay in response time.

Workflow sequence:
Watch the following video (2:56) to learn how to remediate the stuck records on the RBE:Messages form:

Watch the YouTube video about automated remediation to remediate the stuck records on the RBE:Messages form.

Benefit:
Prevents customers from experiencing delays in email responses.

Use case 2

Problem cause:
The NTE:SYS-NT Process Control form in BMC Helix ITSM processes records during an escalation run and sends notifications to the users.

Occasionally, these records get stuck due to performance or customization issues.

This causes a delay in sending notifications to the users.

Workflow sequence:
Watch the following video (3:11) to learn how to remediate the stuck records in the NTE:SYS-NT Process Control form:

Watch the YouTube video about automated remediation to remediate the stuck records in the NTE:SYS-NT Process Control form.

Benefit:
Releases the stuck records and sends out the notifications on time.

Use case 3

Problem cause:
The SMTP servers hosted in an active-active configuration share a Ceph storage mounted on a file system mount point (/mnt/gv0).

Occasionally, the file system gets unmounted, which disconnects the SMTP servers from the Ceph storage.

As a result, the Ceph storage becomes unavailable for read and write actions, which causes disruption.

Workflow sequence:
Watch the following video (3:16) to learn how to remediate the Ceph storage disconnection:

Watch the YouTube video about automated remediation to remediate the Ceph storage disconnection.

Benefit:
The Ceph storage is mounted back, and the SMTP services are restored in a few minutes.
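
A remediation job for this use case might check the mount point and remount it when needed. The following is a minimal sketch, not the script shown in the video: the ensure_mounted function and its return values are hypothetical, and it assumes that the host has an /etc/fstab entry for /mnt/gv0 so that the mount command can reattach the storage from the mount point alone. The check and the command runner are parameters so that the logic can be exercised without touching a real file system.

```python
import os
import subprocess
from typing import Callable, Sequence

MOUNT_POINT = "/mnt/gv0"  # mount point named in the use case


def _run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(cmd).returncode


def ensure_mounted(
    mount_point: str = MOUNT_POINT,
    is_mounted: Callable[[str], bool] = os.path.ismount,
    run: Callable[[Sequence[str]], int] = _run,
) -> str:
    """Remount mount_point if it is not mounted and report the outcome."""
    if is_mounted(mount_point):
        return "already-mounted"
    # Assumption: /etc/fstab describes this mount point, so `mount <dir>`
    # is sufficient. Otherwise the device and options must be passed too.
    exit_code = run(["mount", mount_point])
    # Verify the result instead of trusting the exit code alone.
    if exit_code == 0 and is_mounted(mount_point):
        return "remounted"
    return "failed"
```

Returning a distinct "failed" outcome lets the calling automation keep the incident open and notify an operator instead of closing the event.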

Use case 4

Problem cause:
The BMC Atrium Impact Simulator plug-in is a tool used to assess the impact of changes to a configuration item (CI) on other CIs critical to business services.

When you create a change request, the plug-in might time out and go into an unresponsive state, resulting in significant delays in creating the request.

This timeout generates an event in BMC Helix Operations Management.

Workflow sequence:
  1. The BMC Atrium Impact Simulator times out and is unresponsive.
  2. A BMC Helix Operations Management event is created and enriched into an incident-type event.
  3. The incident triggers automation through a BMC Helix Intelligent Automation policy run.
  4. The automation job (for example, a Python script) restarts the BMC Atrium Impact Simulator.
  5. The job fetches the pod on which the plug-in is running and restarts the process with the help of BMC Helix ITSM APIs.

Benefit:
Prevents customers from experiencing delays in creating change requests by releasing the BMC Atrium Impact Simulator from an unresponsive state.

Use case 5

Problem cause:
The AR System is overpopulated with an accumulation of RE:Job_Runs reconciliation job history records that cause performance issues and affect the health of BMC Helix Configuration Management Database.

This impact generates an event in BMC Helix Operations Management.

Workflow sequence:
  1. The AR System performance degrades and affects the BMC Helix Configuration Management Database.
  2. A BMC Helix Operations Management event is created and enriched into an incident-type event.
  3. The incident triggers automation through a BMC Helix Intelligent Automation policy run.
  4. The automation job does the following tasks:
    1. Navigates to the RE:Job_Runs form.
      For example, https://globalCI.apex.com/arsys/forms/apex-s/RE:Job_Runs
    2. Selects Started from the Run Status option and searches for records older than 24 hours.
    3. Deletes the RE:Job_Runs records older than 24 hours.

Benefit:
Keeps BMC Helix Configuration Management Database in a healthy state.
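
The selection rule in this use case (Run Status is Started and the record is older than 24 hours) can be sketched as a pure function. This is a hedged illustration only: the record layout with id, run_status, and started_at keys is an assumption made for the example and is not the actual RE:Job_Runs schema, and the sketch deliberately stops at selecting records rather than calling any AR System API to delete them.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Mapping

RETENTION = timedelta(hours=24)  # retention window from the use case


def select_stale_job_runs(
    records: Iterable[Mapping],
    now: datetime,
    run_status: str = "Started",
    retention: timedelta = RETENTION,
) -> List[str]:
    """Return the ids of records with the given status that started before the cutoff."""
    cutoff = now - retention
    return [
        record["id"]
        for record in records
        # A record that is exactly 24 hours old is kept, not selected.
        if record["run_status"] == run_status and record["started_at"] < cutoff
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        {"id": "JR-1", "run_status": "Started", "started_at": now - timedelta(hours=30)},
        {"id": "JR-2", "run_status": "Started", "started_at": now - timedelta(hours=2)},
    ]
    print(select_stale_job_runs(sample, now))
```

Passing the current time as a parameter keeps the cutoff calculation deterministic, which makes it straightforward to verify the rule before any records are deleted.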

Use case 6

Problem cause:
The Postfix server routes and sends emails with an SMTP configuration.

Occasionally, due to performance issues, it may become unresponsive, which generates an event in BMC Helix Operations Management.

Workflow sequence:
  1. The Postfix server is unresponsive.
  2. The PATROL Agent monitoring the Postfix server catches this failure.
  3. A BMC Helix Operations Management event is created and enriched into an incident-type event.
  4. The incident triggers automation through a BMC Helix Intelligent Automation policy run.
  5. The automation job (for example, a Python script) restarts the Postfix service.

Benefit:
The Postfix SMTP services are restored in a few minutes, and the emails are routed as usual.
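
The Python script mentioned in this workflow is not shown in the source, so the following is only a sketch of what such a job could look like. It assumes a systemd-managed host where the service unit is named postfix and SMTP listens on port 25 of localhost; the function names smtp_is_responsive and remediate_postfix are hypothetical.

```python
import socket
import subprocess
from typing import Callable, Sequence


def smtp_is_responsive(host: str = "localhost", port: int = 25, timeout: float = 5.0) -> bool:
    """Return True if the server accepts a connection and sends a 220 greeting."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            return conn.recv(512).startswith(b"220")
    except OSError:  # refused connection, timeout, unreachable host
        return False


def _run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(cmd).returncode


def remediate_postfix(
    check: Callable[[], bool] = smtp_is_responsive,
    run: Callable[[Sequence[str]], int] = _run,
) -> str:
    """Restart Postfix only if it is unresponsive and verify the result."""
    if check():
        return "healthy"
    # Assumption: the service is managed by systemd under the unit name "postfix".
    exit_code = run(["systemctl", "restart", "postfix"])
    if exit_code == 0 and check():
        return "restarted"
    return "failed"
```

Checking the service again after the restart means that the job reports success only when SMTP actually answers, which is the condition under which the event and incident can be closed.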

Best practices

Use these best practice guidelines according to the implementation in your environment. Not all of them apply to every type of automation workflow.

  • Identify repetitive manual tasks in the environment and automate them for reliable and timely responses.
  • Build dashboard visualizations in BMC Helix Dashboards to monitor automation performance metrics. This process helps you track how often automation is triggered and measure success and failure rates.
  • Enrich BMC Helix Operations Management events into incident-type events before triggering automation to improve efficiency.
  • Whenever automation is triggered or completed, include a work log in the incident. This information helps different support teams (NOC, MIM, or SRE) in your organization to comprehend the problem and resolution history.
  • Make sure the automation pipeline fails immediately if a critical step fails.
  • If an automation process fails, notify stakeholders promptly to help speed up issue assessment and correction.
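
The last two guidelines (fail immediately on a critical step and notify stakeholders on failure) can be combined in a small pipeline runner. This is a minimal sketch under stated assumptions: run_pipeline and the (name, action, critical) step format are hypothetical, and notify stands for whatever channel your organization uses, such as an email or an incident work log entry.

```python
from typing import Callable, List, Tuple

# A step is (name, action, critical). The action raises an exception on failure.
Step = Tuple[str, Callable[[], None], bool]


def run_pipeline(steps: List[Step], notify: Callable[[str], None]) -> Tuple[bool, List[str]]:
    """Run steps in order and return (succeeded, names of completed steps)."""
    completed: List[str] = []
    for name, action, critical in steps:
        try:
            action()
        except Exception as exc:
            # Notify stakeholders about every failure, critical or not.
            notify(f"Step '{name}' failed: {exc}")
            if critical:
                # Fail fast: later steps must not run on a broken state.
                return False, completed
            continue
        completed.append(name)
    return True, completed
```

A failed non-critical step is reported but does not stop the run, whereas a failed critical step ends the pipeline before any later step executes.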

Results

Implementing remediation through automation offers several benefits, including:

  • Faster incident resolution, which helps meet customer SLAs.
  • Consistent and reliable performance of critical business services, applications, and infrastructure entities.
  • 24/7 availability of applications and services without human intervention or error-prone manual processes.
  • Early identification and prevention of recurring problems.
  • Proactive automation of routine tasks, which keeps the team available for high-value tasks.

 
