ML-based root cause isolation
Essential elements of root cause isolation
Knowledge graph
Technology experts build a pre-defined knowledge graph for each underlying technology, such as Kubernetes or Victoria Metrics. Nodes that are part of these technologies have a set of metrics for data collection. The relationships between the interacting nodes of a service are known as edges. All edges have pre-defined weights, which can be either a fixed value or a range of values. AIOps users cannot change these values.
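As an illustration only, the structure described above can be sketched as a small data model. The class names, metric names, and weight values below are hypothetical and are not taken from the product; the actual graphs and weights are defined by the technology experts.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class EdgeWeight:
    """A pre-defined weight: fixed when low == high, otherwise a range."""
    low: float
    high: float

    @property
    def is_fixed(self) -> bool:
        return self.low == self.high


@dataclass
class KnowledgeGraph:
    """Hypothetical model of one technology's knowledge graph."""
    technology: str
    # Node name -> metrics collected for that node.
    node_metrics: Dict[str, List[str]] = field(default_factory=dict)
    # (source node, target node) -> pre-defined edge weight.
    edges: Dict[Tuple[str, str], EdgeWeight] = field(default_factory=dict)

    def add_node(self, name: str, metrics: List[str]) -> None:
        self.node_metrics[name] = list(metrics)

    def add_edge(self, source: str, target: str,
                 low: float, high: Optional[float] = None) -> None:
        # An edge relates two interacting nodes, so both must already exist.
        if source not in self.node_metrics or target not in self.node_metrics:
            raise KeyError("both nodes must exist before adding an edge")
        self.edges[(source, target)] = EdgeWeight(low, low if high is None else high)


# Made-up example values for illustration.
kg = KnowledgeGraph("Kubernetes")
kg.add_node("pod", ["cpu_usage", "restart_count"])
kg.add_node("node", ["memory_pressure"])
kg.add_edge("pod", "node", 0.8)        # fixed weight
kg.add_edge("node", "pod", 0.2, 0.5)   # weight given as a range
```

The weight type is frozen to mirror the statement that users cannot change these values.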
Service models
Service Designers create suitable business service models to derive the service topology mapping of all nodes.
Nodes with AIOps Situations
The AI/ML algorithm uses the expert-defined knowledge graph, the topology map derived from BMC Helix Discovery, and root cause computation methods to create AIOps Situations for each causal node.
Root cause computation
The AI/ML algorithm analyzes events from the nodes, automatically detects the problem, uses the pre-assigned weights and the knowledge base, applies a probabilistic density technique to build a situation graph, and arrives at the root cause score of each causal event.
In addition, a node can have multiple ML situations. A node can be an independent or shared node of a business service. One or more causal events from a shared node can also be part of multiple situations. These additional factors impact the root cause computation and ranking of causal nodes and causal events.
ML-based root cause isolation in BMC Helix Service Monitoring
For viewing the root cause isolation information and performing the analysis in BMC Helix Service Monitoring as shown in this illustration, you must do the following tasks:
- Create one or more Service Models from BMC Helix Discovery. For creating service models, see Modeling-business-services.
- Enable AIOps Root Cause Isolation and AIOps Situations in BMC Helix Service Monitoring. For the AIOps features, see Enabling-the-AIOps-features.
- Navigate to Entities and select a listed service to view the service details and root cause isolation.
- Tabs to view Root Cause Isolation, Topology, and Metric. You can view the following details:
- The Root Cause Isolation view displays the ranked list of causal nodes (limited to the top three), causal events from the top situations within each node, and changes from those nodes. By default, only the top three events and changes are displayed.
- Topology view shows the relationship between a service and all its nodes.
- Metric view displays the time-series data collected from key attributes of the causal events.
Root Causal Nodes displays up to the top three causal nodes that impact the service health.
In summary, the top root causal nodes are the nodes with the most Situations in a given time window. If multiple nodes have an equal number of Situations, the node with the latest Situation is the top causal node. For each node, a maximum of three causal events is listed. If there are multiple events on a node, the events with the top three root cause scores (RCA scores) are shown. When a node has more than three events, users can click Show all events.
However, a minimum of two events must exist to form an ML-based Situation. Those two events must have occurred within the correlation and saturation time limits configured on the Manage Situations page. When there are no Situations in the selected time window, the root causal nodes are not displayed.
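The ranking rules above can be sketched as follows. This is a minimal illustration, not the product's implementation: the `Situation` class, the function names, and the sample timestamps and scores are all hypothetical, and the sketch assumes the Situations have already been formed by the ML algorithm.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass
class Situation:
    node: str           # node on which the Situation was raised
    created: datetime   # timestamp of the Situation


def rank_causal_nodes(situations: List[Situation], limit: int = 3) -> List[str]:
    """Rank nodes by Situation count; break ties by the latest Situation."""
    count: Dict[str, int] = {}
    latest: Dict[str, datetime] = {}
    for s in situations:
        count[s.node] = count.get(s.node, 0) + 1
        if s.node not in latest or s.created > latest[s.node]:
            latest[s.node] = s.created
    ranked = sorted(count, key=lambda n: (count[n], latest[n]), reverse=True)
    # With no Situations in the window, the result is empty (nothing to display).
    return ranked[:limit]


def top_events(rca_scores: Dict[str, float], limit: int = 3) -> List[str]:
    """Return the event names with the highest RCA scores on one node."""
    return sorted(rca_scores, key=rca_scores.get, reverse=True)[:limit]


# Two nodes with two Situations each; Node#1 holds the most recent one.
situations = [
    Situation("Node#1", datetime(2024, 1, 1, 10, 50)),
    Situation("Node#1", datetime(2024, 1, 1, 10, 58)),
    Situation("Node#2", datetime(2024, 1, 1, 10, 45)),
    Situation("Node#2", datetime(2024, 1, 1, 10, 55)),
]
print(rank_causal_nodes(situations))  # ['Node#1', 'Node#2']
print(top_events({"e1": 0.2, "e2": 0.9, "e3": 0.5, "e4": 0.7}))  # ['e2', 'e4', 'e3']
```

Sorting on the pair (Situation count, latest timestamp) applies the tie-break only when the counts are equal.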
Example scenarios
Scenario#1:
Node#1 has three Situations at 10:00 A.M. All other nodes have fewer than three Situations at 10:00 A.M.
Result:
Node#1 is the top Root Causal Node.
Scenario#2:
Node#1 and Node#2 have two Situations each at 11:00 A.M. One Situation, S1, in Node#1 has the most recent timestamp.
Result:
Node#1 is ranked higher in the order.
- Buttons to view top or all causal Events and top or all causal Changes.
By default, the Events button is selected. You can select the Changes button to view the change requests that impacted the service.
- Options to Show all events (displays all events) or Show top causal events (displays only the top three causal events).
Root cause score: The root cause score of an event is based on the contribution made by the causal events to the Situation.
The score helps in detecting the top causal events in ML-based Situations. A node that has an event with the highest score within ML-based Situations is identified as a causal node. A node that participates in more Situations becomes the top causal node.
- Link to view event details and change details.
- Click an event link to view the event details, or a change link to view the change details.
The Performance View tab displays the time-series data collected from key attributes of the alarm class events.