Harnessing the power of intelligent event correlation to reduce event noise and improve MTTR
An IT company reduces event noise and improves MTTR by using the intelligent event correlation capability
The event correlation feature in BMC Helix AIOps groups hundreds of events into a single situation, which significantly reduces event noise. Event correlation helped an IT company achieve the following goals:
- Perform a quick causal analysis to understand a problem context at the level of the impacted CO
- Improve the mean time to resolve (MTTR) of issues
Scenario
During major network or data center outages, the operations team grapples with a flood of events, which makes it extremely difficult to determine a problem's root cause. To tackle this challenge, the IT company uses the intelligent event correlation feature. This feature performs real-time analysis to detect false positives, reduces redundant or insignificant alerts, and correlates multiple causal events into a single situation, which reduces event noise. Situations contain valuable problem context and relevant details to enable further analysis, which significantly improves the MTTR.
The company created two event correlation policies by using BMC Helix Operations Management and then used BMC Helix AIOps to monitor and investigate the correlated situations and individual events. By using the situations, the company avoids the laborious task of sifting through and analyzing hundreds of events to identify the problem context.
The first policy, Pune Event Flood, consolidates the event storm that arises from network or hardware issues in the Pune-based entities. However, after the events are correlated into a situation, any new open events that belong to this situation could add noise. Therefore, the second policy, Noise_Reduction_For_Pune_Flood, lowers the severity of all additional open events to Warning, which reduces the event noise.
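The interplay of the two policies can be illustrated with a minimal Python sketch. This is not BMC Helix policy syntax or the product's correlation algorithm; the `Event` and `Situation` classes, their field names, the severity strings, and the location-based matching rule are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Event:
    # Hypothetical event record; field names are assumptions, not product fields.
    host: str
    location: str
    severity: str          # for example "CRITICAL", "MAJOR", "WARNING"
    status: str = "OPEN"


@dataclass
class Situation:
    name: str
    events: List[Event] = field(default_factory=list)


def correlate_flood(events: List[Event], location: str = "Pune") -> Optional[Situation]:
    """First policy (sketch): group open events from one location into a situation."""
    matched = [e for e in events if e.status == "OPEN" and e.location == location]
    if not matched:
        return None
    return Situation(name=f"{location} Event Flood", events=matched)


def reduce_noise(situation: Optional[Situation], new_events: List[Event],
                 location: str = "Pune") -> List[Event]:
    """Second policy (sketch): once a situation exists, lower the severity of
    later open events from the same location to WARNING."""
    if situation is None:
        return []
    lowered = []
    for e in new_events:
        if e.status == "OPEN" and e.location == location:
            e.severity = "WARNING"
            lowered.append(e)
    return lowered


# Usage: one Pune event forms the situation; a later Pune event is downgraded.
initial = [Event("sw-01", "Pune", "CRITICAL"), Event("db-07", "Austin", "MAJOR")]
situation = correlate_flood(initial)
late = [Event("sw-02", "Pune", "CRITICAL")]
reduce_noise(situation, late)
print(len(situation.events), late[0].severity)  # prints "1 WARNING"
```

In this sketch, the second function acts only when a situation already exists, which mirrors the ordering described above: correlate first, then suppress the follow-on events.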
In BMC Helix AIOps, the intelligently correlated events form a single situation, which contains the following information:
- A brief meaningful message that is linked to the problem context
- The total number of events correlated to form the situation
- The severity, assigned priority, current status (open or closed), and associated incident ticket
- An indicator of whether an automation policy was run successfully (requires BMC Helix Intelligent Automation to be enabled)
In this scenario, the rules defined in the correlation policies automatically identified 66 events and grouped them into a single situation. This reduced the volume of event noise (66 events to 1 situation, a 66:1 ratio) and provided an all-in-one view of similar issues or causes. The operators or site reliability engineers (SREs) were able to investigate this single situation and quickly move to resolving the problem without spending additional time probing. They could view individual situation details to investigate the impacting or causal events that formed the situation, and could also drill down to view the individual event details for further analysis.
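One way to express this reduction as a percentage is the share of items an operator no longer has to triage individually. The following sketch uses a hypothetical helper, `noise_reduction`, which is not a product metric or API; it only restates the arithmetic for 66 events collapsed into 1 situation.

```python
def noise_reduction(events: int, situations: int) -> float:
    """Fraction of items removed from the operator's view when `events`
    raw events are collapsed into `situations` situations."""
    if events <= 0:
        raise ValueError("events must be a positive integer")
    return 1 - situations / events


# 66 events grouped into 1 situation, as in the scenario above.
print(f"{noise_reduction(66, 1):.1%}")  # prints "98.5%"
```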
In this example, the brief situation summary clearly shows the required information, such as critical severity, priority P1, closed status, an associated incident ID, and the last modified time of the situation. Additionally, the top 3 probable causal nodes each contributed one event that critically impacted the problematic entity. The summary displays the total number of events correlated from multiple sources to form the situation. The intelligent correlation algorithm grouped 66 events that probably originated from a single cause, even though the events were triggered from multiple devices or hosts.
Workflow for using the event correlation capability
Task | Product | Role | Action | References |
---|---|---|---|---|
1. | BMC Helix Operations Management | Tenant Administrator | Create and enable event correlation policies | |
2. | BMC Helix AIOps | Operator or SRE | Monitor situations | |
3. | BMC Helix AIOps | Operator or SRE | Investigate situation details and analyze individual events | |
Results
By using the event correlation feature, the IT company achieved the following goals:
- Significant event noise reduction and ability to expand and view individual situations
- Effective probable root cause identification with the top 3 causal nodes
- Shorter resolution time and improved MTTR scores
- In-depth insights into individual situation details and drill-down views at the event level