Remediating recurring problems by using push button automation


Automated execution of complex routine tasks and remediation of recurring problems can be implemented systematically, which improves the MTTR, process efficiency, and the quality of work. Automating task and problem resolution workflows frees the team to focus on high-value work such as innovation and other complex tasks.

Push button automation

As an operator with the required permissions, you can manually trigger an automation policy to initiate remediation when needed. This process is appropriate for resolving issues that would take several hours to resolve manually or that are prone to human error.

Examples

Use the following use case examples to understand the automation workflow and benefits:

  • The system performance is impacted due to long-running queries in your environment. Do a quick analysis of the long-running queries and take appropriate remediation measures. 
  • The Mid Tier pod performance is impacted due to a cache issue. Implement a remediation sequence to run a hard cache flush that optimizes the system's reliability and performance.


Remediation workflow

pushbutton-remediation_workflow.png

Typical tasks in the workflow

Before you begin

Based on the nature of your organization and user roles, a tenant administrator, automation engineer, or operator (including a network operations center (NOC) operator, major incident management (MIM) operator, or site reliability engineer (SRE)) can perform the following operations:

  1. The operator identifies the issue and initiates a service request in BMC Helix Digital Workplace Catalog.

    Recommended version
    You can use version 19.02 or later.

  2. (Optional) The operator creates an incident ticket in BMC Helix ITSM with the requested details.

    Recommended BMC Helix ITSM version
    Use BMC Helix ITSM version 20.02 or later.

  3. An automation engineer creates an automation policy in BMC Helix Intelligent Automation that contains the remediation actions for the identified problem, with the execution mode set to Manual so that the actions are triggered on demand. For more information, see Creating automation policies.
  4. A tenant administrator or editor creates a dashboard to track and monitor the progress of all automations. For more information, see Setting up dashboards.

  5. The operator triggers the remediation action in BMC Helix Intelligent Automation and tracks the remediation status in BMC Helix Dashboards.

Workflow sequence

  1. An identified problem occurs.
  2. A service request is initiated in the BMC Helix Digital Workplace Catalog to resolve the problem.
  3. (Optional) An incident is created in BMC Helix ITSM and the incident status is sent to BMC Helix Dashboards.
  4. The operator manually triggers the automation policy to initiate the remediation action in BMC Helix Intelligent Automation.
  5. The problem is remediated, the incident in BMC Helix ITSM is closed, and the status is sent to BMC Helix Dashboards.
    The following diagram elaborates on the workflow sequence:
    SRE_pushbutton_remediation_workflow.png

Examples

The following table describes typical use cases, their remediation workflows, and their benefits:


Problem cause

Workflow sequence

Benefit

1

System performance is impacted due to long-running queries in your environment. 

  1. The long-running queries impact system performance.
  2. A service request is raised by the operator in BMC Helix Digital Workplace Catalog.
  3. The operator performs the following tasks:
    1. Sets the following parameters when initiating the service request, either manually or through an API call in BMC Helix Digital Workplace Catalog:
      1. Namespace (Customer name)
      2. From (Start time)
      3. To (End time)
      4. Email_ID (To send an email with the report attached)
      5. Jira_Ticket (To attach the report to Jira)
    2. Initiates the automation pipeline from the BMC Helix Digital Workplace Catalog to collect the statistics of long-running queries.
    3. Converts the JSON output to Excel format for better readability and analysis.
    4. Sends an email to Email_ID with the Excel report attached.
    5. Attaches the Excel report file to the Jira ticket specified in Jira_Ticket.

Ensures reliable system operation and optimal performance, eliminating the risk of human error associated with manual resolution.
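
The reporting part of this workflow (collecting statistics as JSON, then producing a spreadsheet-friendly report) can be sketched as follows. This is a minimal illustration, not the pipeline that the product runs. The JSON field names (query, duration_ms, started_at) are assumptions, and the sketch writes CSV, which Excel opens directly, instead of a native Excel workbook so that it needs only the Python standard library.

```python
import csv
import io
import json

# Hypothetical column names; the real statistics output may use different fields.
COLUMNS = ["query", "duration_ms", "started_at"]


def long_running_queries_to_csv(json_text: str, min_duration_ms: int = 0) -> str:
    """Convert a JSON array of query statistics into CSV text, slowest first."""
    rows = json.loads(json_text)
    # Keep only queries at or above the duration threshold.
    selected = [r for r in rows if r.get("duration_ms", 0) >= min_duration_ms]
    selected.sort(key=lambda r: r.get("duration_ms", 0), reverse=True)

    buffer = io.StringIO()
    # extrasaction="ignore" drops fields that are not report columns.
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(selected)
    return buffer.getvalue()
```

The returned text can be saved with a .csv extension and attached to the email or ticket in the later steps of the workflow.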

2

The Mid Tier pod performance is impacted due to a cache issue that necessitates a hard cache flush.

  1. The Mid Tier pod performance is impacted.
  2. The operator logs a service request in BMC Helix Digital Workplace Catalog.
  3. The operator triggers the automation workflow that performs the following steps:
    1. Runs a Kubernetes Exec command on each Mid Tier pod.
    2. Shuts down Tomcat securely and runs the ps -ef | grep tomcat command to validate that it has shut down successfully.
    3. Deletes the Cache folder from the Mid Tier, initiating the hard cache flush process.
    4. Starts Tomcat to resume normal operations after the cache flush is completed.

Ensures reliable system operation and optimal performance, eliminating the risk of human error associated with manual resolution.
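
The hard cache flush sequence in use case 2 can be sketched as follows. This is a minimal sketch under stated assumptions, not the automation that the product runs: the Tomcat and cache paths are hypothetical, and the function names are illustrative. The command runner is passed in as a parameter so that the sequence can be dry-run without a cluster.

```python
import subprocess
from typing import Callable, List, Sequence

# Hypothetical paths; adjust to the actual Mid Tier container layout.
TOMCAT_BIN = "/opt/tomcat/bin"
CACHE_DIR = "/opt/midtier/cache"

Runner = Callable[[Sequence[str]], str]


def run_command(args: Sequence[str]) -> str:
    """Run a command, raise on a non-zero exit code, and return its stdout."""
    return subprocess.run(args, check=True, capture_output=True, text=True).stdout


def kubectl_exec(namespace: str, pod: str, shell_command: str) -> List[str]:
    """Build a kubectl exec invocation that runs a shell command in a pod."""
    return ["kubectl", "exec", pod, "-n", namespace, "--", "sh", "-c", shell_command]


def hard_cache_flush(namespace: str, pods: Sequence[str], run: Runner = run_command) -> None:
    """Stop Tomcat, verify it stopped, delete the cache, and restart, pod by pod."""
    for pod in pods:
        run(kubectl_exec(namespace, pod, f"{TOMCAT_BIN}/shutdown.sh"))
        # '[t]omcat' keeps grep from matching its own process entry;
        # '|| true' keeps the exit status zero when nothing matches.
        # A real pipeline would likely poll here, because shutdown is not instant.
        remaining = run(kubectl_exec(namespace, pod, "ps -ef | grep '[t]omcat' || true"))
        if remaining.strip():
            raise RuntimeError(f"Tomcat is still running on {pod}; cache not deleted")
        run(kubectl_exec(namespace, pod, f"rm -rf {CACHE_DIR}"))
        run(kubectl_exec(namespace, pod, f"{TOMCAT_BIN}/startup.sh"))
```

Passing a runner that only records the commands, instead of the default run_command, shows what would be executed on each pod before the sequence is run for real.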

Best practices

Use these best practice guidelines as they apply to the implementation in your environment. Not all of them apply to every type of automation workflow.

  • Identify repetitive manual tasks in the environment and automate them for reliable and timely responses.
  • Build dashboard visualizations in BMC Helix Dashboards to monitor automation performance metrics. This process helps you track how often automation is triggered and measure success and failure rates.
  • Whenever automation is triggered or completed, include a work log in the incident. This information helps different support teams (NOC, MIM, or SRE) in your organization to comprehend the problem and resolution history.
  • Make sure the automation pipeline fails immediately if a critical step fails.
  • If an automation process fails, notify stakeholders promptly to help speed up issue assessment and correction.
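
The last two guidelines (fail immediately on a critical step, and notify stakeholders on failure) can be illustrated with a small, generic sketch. It is not tied to any BMC Helix API: the step tuple layout, the run_pipeline name, and the notify callback are all illustrative assumptions.

```python
from typing import Callable, List, Tuple

# (step name, action to run, whether a failure must abort the pipeline)
Step = Tuple[str, Callable[[], None], bool]


def run_pipeline(steps: List[Step], notify: Callable[[str], None]) -> List[str]:
    """Run steps in order, notify on any failure, and abort on a critical failure."""
    completed: List[str] = []
    for name, action, critical in steps:
        try:
            action()
        except Exception as exc:
            # Notify on every failure so that assessment can start promptly.
            notify(f"Step '{name}' failed: {exc}")
            if critical:
                # Fail fast: later steps must not run after a critical failure.
                raise RuntimeError(f"Pipeline aborted at critical step '{name}'") from exc
            continue
        completed.append(name)
    return completed
```

In practice, the notify callback could send an email or add a work log entry to the incident, which also supports the work log guideline above.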

Result

Implementing remediation through automation offers several benefits, including:

  • Faster incident resolution that leads to meeting customer SLAs effectively.
  • Consistent and reliable performance of critical business services, applications, and infrastructure entities.
  • 24/7 availability of applications and services without human interference and error-prone manual processes.
  • Early identification and prevention of recurring problems.
  • Proactive automation of routine tasks, which keeps the team available for high-value tasks.

 
