Remediating services automatically by using automation policies
By connecting with BMC Helix Intelligent Automation, automation teams can create remediation policies that appear as automation actions for events that match the trigger conditions. Operators can trigger these remediation actions manually, or the policies can be configured to run automatically, which saves NOC teams significant time and manual effort.
By automating remediation, IT infrastructure management teams can achieve the following benefits:
Customer success: An enterprise software and IT consulting company automates remediation for impacted services
The IT infrastructure team at an enterprise software and IT consulting company implemented the automated remediation workflow and achieved the following results:
- Automated remediation of frequently occurring issues, which eliminated the need to manually investigate events, create incidents, and restart processes that stopped running.
- Capability to request automation if automation actions are not available yet.
- Increased system reliability and reduced MTTR from 30-40 minutes to less than five minutes.
- Reports for analyzing results driven by the automated remediation actions.
Workflow
The following diagram illustrates the high-level workflow of automated remediation for events:
Task | Product | Role | Action | Reference |
---|---|---|---|---|
1. | BMC Helix AIOps | Tenant Administrator | Enable the Intelligent Automations feature from the Configurations menu. | |
2. | BMC Helix AIOps | Operator or SRE | (Optional) Request automation for an event from the Services or Situations menu. | |
3. | BMC Helix Intelligent Automation | Automation Engineer | Create an automation policy that contains remediation actions based on an incoming request or for frequently occurring issues. Automation engineers can set the execution mode to Automatic to trigger remediation actions automatically. | |
4. | BMC Helix AIOps | Operator or SRE | View events and run the automation actions available against the event. | |
How does the IT team automate remediation?
When a service or process is down, an IT operator or a site reliability engineer (SRE) typically spends hours investigating the event, creating an incident, and, if needed, restarting the service or process. When a business-critical process is down, the resulting service outage can last a significant amount of time until the problem is investigated and remediated.
The IT team at the enterprise software and IT consulting company uses the Intelligent Automations feature provided by BMC Helix AIOps to automatically remediate process-down events by restarting the affected processes. Automation engineers create automation policies in BMC Helix Intelligent Automation, and these policies appear as automation actions against the matching events in BMC Helix AIOps.
In the following example, a situation in BMC Helix AIOps indicates to the operator or SRE that an important process is down and shows the automations available against each event included in the situation.
The IT team uses automation to restart a process without any manual intervention. After the automation runs, the status and the incident ID are displayed for the event. An operator or SRE who has the appropriate permissions can view the details of the automation in BMC Helix Intelligent Automation by using the cross-launch link.
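The relationship between trigger conditions, execution modes, and automation actions described above can be pictured with a small conceptual sketch. It is not BMC Helix code and does not use any BMC Helix API. All names in it (`Event`, `AutomationPolicy`, `ExecutionMode`, `handle_event`, `restart_process`) are hypothetical and exist only to illustrate the idea that a matching policy either runs automatically or is offered to the operator as an available action.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class ExecutionMode(Enum):
    MANUAL = "manual"        # the operator runs the action from the event
    AUTOMATIC = "automatic"  # the action runs as soon as the event matches


@dataclass
class Event:
    host: str
    message: str


@dataclass
class AutomationPolicy:
    name: str
    trigger: Callable[[Event], bool]  # hypothetical trigger condition
    action: Callable[[Event], str]    # hypothetical remediation action
    mode: ExecutionMode


def handle_event(event: Event, policies: List[AutomationPolicy]) -> Tuple[List[str], List[str]]:
    """Run matching automatic actions and list matching manual ones.

    Returns (results of automatic actions, names of actions left for the operator).
    """
    results, available = [], []
    for policy in policies:
        if not policy.trigger(event):
            continue  # the policy does not apply to this event
        if policy.mode is ExecutionMode.AUTOMATIC:
            results.append(policy.action(event))
        else:
            available.append(policy.name)
    return results, available


def is_process_down(event: Event) -> bool:
    # Assumed trigger condition: the event message reports a stopped process.
    return "process down" in event.message.lower()


def restart_process(event: Event) -> str:
    # Placeholder only: a real action would restart the process on event.host.
    return f"restart requested on {event.host}"


policies = [
    AutomationPolicy("Restart stopped process", is_process_down,
                     restart_process, ExecutionMode.AUTOMATIC),
    AutomationPolicy("Collect diagnostics", is_process_down,
                     lambda e: f"diagnostics collected from {e.host}",
                     ExecutionMode.MANUAL),
]

results, available = handle_event(
    Event("app-server-01", "Process down: payment-service"), policies
)
print(results)    # results of actions that ran automatically
print(available)  # actions the operator can still run manually
```

In this sketch, switching a policy from `ExecutionMode.MANUAL` to `ExecutionMode.AUTOMATIC` is the only change needed to move an action from the operator's list to unattended execution, which mirrors the choice the automation engineer makes when setting the execution mode on a policy.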