This documentation supports BMC Helix AIOps till November 2021. To view other versions of the documentation, select a version from the Product version menu.

ML-based root cause isolation

Root cause isolation is the ability to predict the most likely causes of an issue in an infrastructure environment by analyzing the ML-based situational events from the infrastructure nodes and services. The goal is to reduce the mean time to identify or determine (MTTI) and the mean time to resolve (MTTR) issues. To achieve this goal, BMC Helix AIOps does the following:

  • Delivers a ranked list of the most likely events that caused the impact.
  • Uses ML-based Situation analysis to determine the top root causal nodes.
  • Determines the top suspected causes and provides evidence in the form of situational events and change requests.
  • Computes the root cause score by using the weights pre-assigned to the nodes, edges, events, change requests, and metrics.
  • Builds and displays a service health timeline for a selected time range to indicate the health score degradation.
  • Shows the service topology with the impact path flowing from various nodes to the service.
  • Displays the metrics data for the causal nodes.


Essential elements of root cause isolation 

The following elements are used to derive root cause isolation for impacted services. 

Knowledge graph 

Technology experts build a pre-defined knowledge graph for each underlying technology, such as Kubernetes or Victoria Metrics. Nodes that are part of these technologies have a set of metrics for data collection. The relationships between the interacting nodes of a service are known as edges. All edges have pre-defined weights, which can be either a fixed value or a range of values. AIOps users cannot change these values.
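
The structure described above can be pictured with a minimal Python sketch. It is illustrative only and does not reflect the product's internal data model: the class names, node names, metric names, and weight values are all assumptions. Frozen dataclasses stand in for the fact that users cannot change the pre-defined weights.

```python
from dataclasses import dataclass
from typing import List, Mapping, Tuple


@dataclass(frozen=True)
class EdgeWeight:
    """Hypothetical pre-defined weight: fixed when low == high, else a range."""
    low: float
    high: float

    def __post_init__(self) -> None:
        if self.low > self.high:
            raise ValueError("low must not exceed high")

    @property
    def is_fixed(self) -> bool:
        return self.low == self.high


@dataclass(frozen=True)
class Edge:
    """Relationship between two interacting nodes of a service."""
    source: str
    target: str
    weight: EdgeWeight


@dataclass(frozen=True)
class KnowledgeGraph:
    """Expert-defined graph for one technology (all names are illustrative)."""
    technology: str
    node_metrics: Mapping[str, Tuple[str, ...]]  # node type -> collected metrics
    edges: Tuple[Edge, ...]

    def edges_from(self, node: str) -> List[Edge]:
        return [e for e in self.edges if e.source == node]


# Example values are invented for illustration.
graph = KnowledgeGraph(
    technology="Kubernetes",
    node_metrics={
        "pod": ("cpu_usage", "memory_usage"),
        "node": ("disk_pressure",),
    },
    edges=(
        Edge("pod", "node", EdgeWeight(0.8, 0.8)),      # fixed weight
        Edge("node", "cluster", EdgeWeight(0.4, 0.6)),  # range of values
    ),
)
```

Attempting to reassign a weight on this sketch, for example `graph.edges[0].weight.low = 0.1`, raises `dataclasses.FrozenInstanceError`, which mirrors the read-only nature of the weights.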

Service models

Service Designers create suitable business service models to derive the service topology mapping of all nodes.

Nodes with AIOps Situations

The AI/ML algorithm uses the expert-defined knowledge graph, the topology map derived from BMC Helix Discovery, and root cause computation methods to create AIOps Situations for each causal node.

Root cause computation

The AI/ML algorithm analyzes events from the nodes, automatically detects the problem, uses the pre-assigned weights and the knowledge base, applies a probabilistic density technique to build a situation graph, and arrives at the root cause score of each causal event.

Causal events

The causal events of a root cause node are only child events and not primary events.

In addition, a node can have multiple ML situations. A node can be an independent or shared node of a business service. One or more causal events from a shared node can also be part of multiple situations. These additional factors impact the root cause computation and ranking of causal nodes and causal events.


ML-based root cause isolation

To view the root cause isolation information and perform the analysis in BMC Helix AIOps, as shown in this illustration, you must complete the following tasks:

  • Create one or more Service Models from BMC Helix Discovery. For creating service models, see Modeling business services.
  • Enable AIOps Root Cause Isolation and AIOps Situations in BMC Helix AIOps. For the AIOps features, see Enabling the AIOps features.
  • Navigate to Entities and select a listed service to view the service details and root cause isolation.

  1. Tabs to view Root Cause Isolation, Topology, and Metric. You can view the following details: 

    • Root Cause Isolation view displays the ranked list of causal nodes (limited to the top three), the causal events from the top situations within each node, and the changes from those nodes. By default, only the top three events and changes are displayed.
    • Topology view shows the relationship between a service and all its nodes.
    • Metric view displays the time-series data collected from key attributes of the causal events.

  2. Root Causal Nodes displays up to three top causal nodes that impact the service health.

    Top 3 causal nodes

    This is not customizable. If there are more than 3 causal nodes, only the top 3 nodes with the highest impact are displayed.


    In summary, the top root causal nodes are the nodes with the most Situations in a given time window. If multiple nodes have an equal number of Situations, the node with the latest Situation is the top causal node. For each node, a maximum of three top causal events is listed. If there are multiple events on a node, the events with the top three root cause scores (RCA scores) are shown. When there are more than three events in a node, users can click Show all events.

    However, a minimum of two events must exist to form an ML-based Situation. Those two events must have occurred within the correlation and saturation time limits configured on the Manage Situations page. When there are no situations in the selected time window, the root causal nodes are not displayed.

    Example scenarios

    Scenario#1:
    Node#1 has three Situations at 10:00 A.M. All other nodes have fewer than three Situations at 10:00 A.M.
    Result:
    Node#1 is the top Root Causal Node.

    Scenario#2:
    Node#1 and Node#2 have two Situations each at 11:00 A.M. One Situation, S1, in Node#1 has the most recent timestamp.
    Result:
    Node#1 is ranked higher in the order.

  3. Buttons to view top or all causal Events and top or all causal Changes.
    By default, the Events button is selected. You can select the Changes button to view the change requests that impacted the service.



  4. Options to Show all events (displays all events) or Show top causal events (displays only the top three causal events).


  5. Root cause score: The root cause score of an event is based on the contribution made by the causal events to the situation.

    The score helps in detecting the top causal events in ML-based Situations. A node having an event with the highest score within ML-based Situations is identified as a causal node. A node participating in more Situations becomes the top causal node.


  6. Link to view event details and change details. 

    • Click an event link to view the event details, or click a change link to view the change details.
       

    • Click the Situation link in the event details window to view the details of the situation.




    • The Performance View tab displays the time-series data collected from key attributes of the alarm class events.
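
The ranking rules described for Root Causal Nodes and for the top causal events can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the product's implementation: the class names, function names, and the calendar date are hypothetical, and the sketch assumes that the Situations passed in already fall within the selected time window. Nodes are ordered by Situation count, ties are broken by the latest Situation timestamp, and at most three nodes or events are returned.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List

TOP_N = 3  # the page states that this limit is not customizable


@dataclass(frozen=True)
class Situation:
    node: str
    timestamp: datetime


@dataclass(frozen=True)
class CausalEvent:
    node: str
    name: str
    score: float  # root cause score of the event


def rank_causal_nodes(situations: List[Situation]) -> List[str]:
    """Order nodes by Situation count, then by latest Situation; keep the top three."""
    by_node: Dict[str, List[datetime]] = {}
    for s in situations:
        by_node.setdefault(s.node, []).append(s.timestamp)
    ranked = sorted(
        by_node,
        key=lambda n: (len(by_node[n]), max(by_node[n])),
        reverse=True,
    )
    return ranked[:TOP_N]  # empty when there are no Situations in the window


def top_events(events: List[CausalEvent], node: str,
               show_all: bool = False) -> List[CausalEvent]:
    """Return a node's events by descending score; top three unless show_all is set."""
    node_events = sorted(
        (e for e in events if e.node == node),
        key=lambda e: e.score,
        reverse=True,
    )
    return node_events if show_all else node_events[:TOP_N]


# Scenario#2 from this page, with invented timestamps on an arbitrary date.
day = datetime(2021, 11, 1)
situations = [
    Situation("Node#1", day.replace(hour=10, minute=50)),
    Situation("Node#1", day.replace(hour=11, minute=0)),   # most recent Situation
    Situation("Node#2", day.replace(hour=10, minute=40)),
    Situation("Node#2", day.replace(hour=10, minute=55)),
]
print(rank_causal_nodes(situations))  # ['Node#1', 'Node#2']
```

Passing `show_all=True` to `top_events` corresponds to the Show all events option; the default corresponds to Show top causal events.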

