Harnessing the power of intelligent event correlation to reduce event noise and improve MTTR


This use case describes how an IT company used the intelligent event correlation feature in BMC Helix AIOps, which groups hundreds of events into a single situation to reduce event noise. The feature helped the company quickly perform causal analysis and understand the problem context at the level of the impacted configuration item (CI). As a result, the company significantly improved its mean time to resolve (MTTR).

Reduced event noise and improved MTTR by using the intelligent event correlation capability

During major network or data center outages, the operations team grapples with a flood of events, which makes it difficult and frustrating to determine a problem's root cause. To tackle this challenge, the IT company uses the intelligent event correlation feature. This feature performs real-time analysis to detect false positives, reduces redundant or insignificant alerts, and correlates multiple causal events into a single situation, which reduces event noise. Situations contain valuable problem context and relevant details to enable further analysis, which significantly improves MTTR.

The company created two event correlation policies in BMC Helix Operations Management and now uses BMC Helix AIOps to monitor and investigate the correlated situations and individual events. By working with situations, the company avoids the laborious task of sifting through and analyzing hundreds of events to identify the problem context.

The first policy, Pune Event Flood, consolidates the event storm that arises from network or hardware issues on the Pune-based entities. However, after the events are correlated into a situation, any new open events that need to be correlated as part of this situation could become noise. Therefore, the second policy, Noise_Reduction_For_Pune_Flood, lowers the severity of all additional open events to Warning, which reduces the event noise.
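The interplay of the two policies can be illustrated with a minimal Python sketch. This is not the BMC Helix policy engine or its API; the `Event` fields, the function names, the `location` selector, and the situation ID `SIT-1` are all hypothetical and exist only to show the idea of correlating a flood first and then downgrading later arrivals.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    # All fields are hypothetical; they do not mirror the real BMC Helix event schema.
    event_id: str
    location: str
    severity: str
    status: str = "OPEN"
    situation_id: Optional[str] = None


def apply_flood_policy(events: List[Event], location: str = "Pune",
                       situation_id: str = "SIT-1") -> List[Event]:
    """Correlate every open event from the given location into one situation."""
    members = [e for e in events if e.location == location and e.status == "OPEN"]
    for event in members:
        event.situation_id = situation_id
    return members


def apply_noise_reduction(events: List[Event], situation_id: str,
                          location: str = "Pune") -> List[Event]:
    """Lower to WARNING any open event from the location that is not yet in the situation."""
    lowered = []
    for event in events:
        if (event.location == location and event.status == "OPEN"
                and event.situation_id != situation_id):
            event.severity = "WARNING"
            lowered.append(event)
    return lowered


# Initial storm: two Pune events are correlated, the unrelated event is untouched.
storm = [Event("e1", "Pune", "CRITICAL"), Event("e2", "Pune", "MAJOR"),
         Event("e3", "Houston", "CRITICAL")]
situation_members = apply_flood_policy(storm)

# An event that arrives after the situation exists is downgraded to WARNING.
late_events = [Event("e4", "Pune", "CRITICAL")]
downgraded = apply_noise_reduction(late_events, "SIT-1")
```

In the product, both behaviors are configured as policies in BMC Helix Operations Management rather than written as code; the sketch only mirrors the order in which they take effect.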

Event_Correlation_Policies.png

Adv_enrichment_without_info_icon1.png

In BMC Helix AIOps, the intelligently correlated events form a single situation, which contains the following information:

  • A brief meaningful message that is linked to the problem context
  • The total number of events correlated to form the situation
  • The severity, assigned priority, current status (open or closed), and associated incident ticket
  • An indicator of whether an automation policy was run successfully (requires BMC Helix Intelligent Automation to be enabled)
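As a rough mental model, the information listed above can be pictured as a small record. The Python sketch below is illustrative only: the class name, field names, and the sample values (including the message text and the incident ID) are assumptions and do not represent the actual BMC Helix AIOps data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SituationSummary:
    # Hypothetical field names that mirror the items listed above.
    message: str
    event_count: int
    severity: str
    priority: str
    status: str
    incident_id: Optional[str] = None
    # None models the case where BMC Helix Intelligent Automation is not enabled.
    automation_succeeded: Optional[bool] = None

    def one_line(self) -> str:
        """Render the summary as a single line for a list view."""
        parts = [self.severity, self.priority, self.status,
                 f"{self.event_count} events"]
        if self.incident_id:
            parts.append(f"incident {self.incident_id}")
        return f"{self.message} [{', '.join(parts)}]"


# Sample values are made up for illustration.
example = SituationSummary(
    message="Pune network event flood",
    event_count=66,
    severity="Critical",
    priority="P1",
    status="Closed",
    incident_id="INC-EXAMPLE",
)
print(example.one_line())
```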

In this scenario, the rules defined in the correlation policies automatically identified 66 events and grouped them into a single situation. This reduced the volume of event noise by a ratio of 66:1 and provided an all-in-one view of similar issues or causes. Operators and site reliability engineers (SREs) could investigate this single situation and move quickly to resolving the problem without spending additional time probing. They could view the situation details to investigate the impacting or causal events that formed the situation, and could drill down to the individual event details for further analysis.
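To put the reduction in numbers, the short Python sketch below computes the share of items an operator no longer has to triage when 66 events collapse into one situation. The function name is hypothetical, and the calculation is plain arithmetic rather than a metric reported by the product.

```python
def noise_reduction(events: int, situations: int) -> float:
    """Return the fraction of items removed from the operator's view."""
    if events <= 0 or situations < 0 or situations > events:
        raise ValueError("expected 0 <= situations <= events and events > 0")
    return 1 - situations / events


# 66 correlated events shown as 1 situation.
reduction = noise_reduction(66, 1)
print(f"{reduction:.1%}")  # prints 98.5%
```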

Monitor_policy_situations.png

In this example, the brief situation summary clearly shows the required information, such as critical severity, priority P1, closed status, an associated incident ID, and the last modified time of the situation. Additionally, the top 3 probable causal nodes each contributed one event that critically impacted the problematic entity. The summary displays the total number of events correlated from multiple sources to form the situation. The intelligent correlation algorithm grouped 66 events that probably originated from a single cause, even though the events were triggered from multiple devices or hosts.

Investigate_policy_situations.png

Event_details_pane.png


Workflow

policy_based_situations_workflow.png


Task  Product                          Role                  Action                                                       References
1.    BMC Helix Operations Management  Tenant Administrator  Create and enable event correlation policies
2.    BMC Helix AIOps                  Operator or SRE       Monitor situations
3.    BMC Helix AIOps                  Operator or SRE       Investigate situation details and analyze individual events


Results

By using the event correlation feature, the IT company achieved the following benefits:

  • Significant event noise reduction and the ability to expand and view individual situations
  • Effective probable root cause identification with the top 3 causal nodes
  • Shorter resolution time and improved MTTR scores
  • In-depth insights into individual situation details and drill-down views at the event level

 
