This documentation supports an earlier version of BMC Helix Operations Management. To view the documentation for the latest version, select 23.3 from the Product version picker.

Harnessing the power of intelligent event correlation to reduce event noise and improve MTTR


This use case describes how BMC IT used the intelligent event correlation feature, which groups hundreds of events into a single situation, to reduce event noise. This feature helped BMC IT quickly perform causal analysis and understand the problem context at the level of the impacted configuration item (CI). Using this feature also significantly improved BMC IT's mean time to resolve (MTTR) for issues.

BMC IT reduces event noise and improves MTTR by using the intelligent event correlation capability

During major network or data center outages, the BMC IT operations team grapples with a flood of events, which makes it extremely difficult to determine a problem's root cause amid the noise. To tackle this challenge, BMC IT uses the intelligent event correlation feature. This feature performs real-time analysis to detect false positives, reduces redundant or insignificant alerts, and correlates multiple causal events into a single situation, which reduces event noise. Situations contain valuable problem context and relevant details to enable further analysis, which significantly shortens the MTTR.

BMC IT created two event correlation policies and now monitors and investigates the resulting correlated situations and their individual events. By using the situations, BMC IT avoids the laborious task of sifting through and analyzing hundreds of events to identify the problem context.

The first policy, Pune Event Flood, consolidates the event storm that arises from network or hardware issues on the Pune-based entities. However, after the events are correlated into a situation, any new open events that would be correlated into the same situation could themselves become noise. Therefore, the second policy, Noise_Reduction_For_Pune_Flood, lowers the severity of all these additional open events to Warning, which reduces the event noise.
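The interplay of the two policies can be sketched in a few lines of Python. This is an illustrative model only, not the product's policy engine or API: the policies themselves are configured in the product UI, and the `Event` fields, the `location` attribute, the severity labels, the host names, and both function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

WARNING = "WARNING"  # hypothetical severity label for this sketch


@dataclass
class Event:
    host: str
    location: str          # assumed attribute matched by the selection criteria
    severity: str
    status: str = "OPEN"


@dataclass
class Situation:
    name: str
    events: List[Event] = field(default_factory=list)


def correlate_event_flood(events: List[Event], location: str) -> Situation:
    """First policy (sketch): group open events from one location into a situation."""
    situation = Situation(name=f"{location} Event Flood")
    for event in events:
        if event.location == location and event.status == "OPEN":
            situation.events.append(event)
    return situation


def reduce_noise(late_events: List[Event], location: str) -> None:
    """Second policy (sketch): lower later open events from that location to Warning."""
    for event in late_events:
        if event.location == location and event.status == "OPEN":
            event.severity = WARNING


# Hypothetical sample data: two Pune events and one from another site.
flood = [
    Event("host-a", "Pune", "CRITICAL"),
    Event("host-b", "Pune", "MAJOR"),
    Event("host-c", "Elsewhere", "CRITICAL"),
]
situation = correlate_event_flood(flood, "Pune")

# An event that arrives after the situation has formed is downgraded.
late = [Event("host-d", "Pune", "CRITICAL")]
reduce_noise(late, "Pune")
```

In this sketch, the first function stands in for the correlation policy and the second for the noise reduction policy; the real policies use the selection criteria defined when each policy is created.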

Event_Correlation_Policies.png

Event_selection_criteria2.png

The intelligently correlated events form a single situation, which contains the following information:

  • A brief meaningful message that is linked to the problem context
  • The total number of events correlated to form the situation
  • Its severity type, assigned priority, current status (open or closed) indicator, and an associated incident ticket
  • An indicator of whether an automation policy was run successfully (requires the relevant product to be enabled)
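As a rough mental model, the fields listed above can be pictured as a small record. The following Python sketch is an assumption for illustration only: the class name, field names, and sample values (including the incident ID placeholder) are hypothetical and do not reflect the product's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SituationSummary:
    message: str                           # brief message tied to the problem context
    event_count: int                       # total number of correlated events
    severity: str
    priority: str
    status: str                            # "OPEN" or "CLOSED"
    incident_id: Optional[str]             # associated incident ticket, if any
    automation_succeeded: Optional[bool]   # None when automation is not enabled

    def headline(self) -> str:
        """Render a one-line summary similar to what an operator scans first."""
        return (
            f"[{self.severity}/{self.priority}] {self.message} "
            f"({self.event_count} events, {self.status})"
        )


# Hypothetical sample values for illustration.
summary = SituationSummary(
    message="Network outage",
    event_count=66,
    severity="CRITICAL",
    priority="P1",
    status="CLOSED",
    incident_id="INC-PLACEHOLDER",
    automation_succeeded=None,
)
print(summary.headline())
```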

In this scenario, the rules defined in the correlation policies automatically identified 66 events and grouped them into a single situation. This grouping reduced the volume of event noise (66 events to 1 situation) and provided an all-in-one view of similar issues or causes. The operators or site reliability engineers (SREs) were able to investigate this single situation and quickly move to resolving the problem without spending additional time probing. They could view individual situation details to investigate the impacting or causal events that formed the situation, and could also drill down to the individual event details for further analysis.
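To put the reduction in concrete terms, the share of items that an operator no longer has to triage can be computed from the event and situation counts. The helper below is a hypothetical illustration of that arithmetic, not a metric reported by the product.

```python
def noise_reduction(event_count: int, situation_count: int = 1) -> float:
    """Return the fraction of items removed from the operator's triage queue."""
    if event_count <= 0 or situation_count <= 0:
        raise ValueError("counts must be positive")
    return 1 - situation_count / event_count


# 66 events collapsed into a single situation, as in this scenario.
reduction = noise_reduction(66)
print(f"{reduction:.1%}")  # prints 98.5%
```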

Monitor_policy_situations.png

In this example, the brief situation summary clearly shows the required information, such as critical severity, priority P1, closed status, an associated incident ID, and last modified time of the situation. Additionally, the top 3 probable causal nodes each contributed one event that critically impacted the problematic entity. The summary displays the total number of events correlated from multiple sources to form the situation. The intelligent correlation algorithm has sourced and grouped 66 probable events that originated due to a single cause even though the events were triggered from multiple devices or hosts.

Investigate_policy_situations.png

Event_details_pane.png


Workflow

policy_based_situations_workflow.png


Task  Role                  Action
1.    Tenant Administrator  Create and enable event correlation policies
2.    Operator or SRE       Monitor situations
3.    Operator or SRE       Investigate situation details and analyze individual events


Results

By using the event correlation feature, BMC IT achieved the following benefits:

  • Significant event noise reduction and ability to expand and view individual situations
  • Effective probable root cause identification with the top 3 causal nodes
  • Shorter resolution time and improved MTTR scores
  • In-depth insights into individual situation details and drill-down views at the event level
