Probable cause analysis (PCA)
Probable cause analysis (PCA) is the ability to determine the most likely causes of any issue in an infrastructure environment by correlating millions of monitoring data points and analyzing the relationship between infrastructure nodes and services. The goal is to reduce the mean time to identify or determine (MTTI) and mean time to resolve (MTTR) for issues. To achieve this goal, BMC Helix Service Monitoring does the following:
- Delivers a ranked list of the most likely causes of impact.
- Determines the top suspected causes and provides evidence in the form of events and changes.
- Computes the impact or PCA score by using weightages pre-assigned to the nodes, events, changes, and metrics, together with inbuilt functions.
- Builds and displays a service health timeline for a selected time range to indicate the health score degradation.
- Displays the service topology with the impact path flowing from various nodes to the service.
- Displays the metrics collected for the causal events.
PCA in BMC Helix Service Monitoring
To view the PCA information and perform the analysis in BMC Helix Service Monitoring, you must first define the service models from BMC Helix Discovery. For more information, see Modeling business services.
- Tabs to view Probable Cause, Topology, and Metric. You can view the following details:
- Probable Cause view displays the ranked list of contributing nodes (limited to the top 3), along with the events and changes from those nodes. By default, only the top events and changes are displayed.
- Topology view shows the relationship between a service and all its nodes.
- Metric view displays the time-series data collected from key attributes of the causal events.
Causal Nodes (% Probability) displays up to the top 3 causal nodes that contribute to the probable cause, based on the score calculation.
- The (% Probability) value is drawn from the score of each event or change. The score of the most impactful event or change is taken as the node score. For example, if Event#1 has 68% and Event#2 has 65%, the highest-ranked event is Event#1 and the node score is 68%.
- The causal nodes are ranked based on the score calculation from the events or changes. A node with a topmost event score of 72% is ranked higher than a node with a topmost event score of 71%.
- Click Show all events to view all events or click Show top causal events to return to the top causal events list.
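The ranking rule described above can be illustrated with a minimal Python sketch. It is not the product's implementation: the `Finding` type, the node names, and the `rank_causal_nodes` function are hypothetical, and the scores are made-up sample values. It only shows the stated logic: each node takes the score of its highest-scoring event or change, and at most three nodes are kept.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """One event or change on a node (hypothetical structure for illustration)."""
    node: str    # hypothetical node name
    kind: str    # "event" or "change"
    score: int   # PCA score in percent


def rank_causal_nodes(findings, limit=3):
    """Rank nodes by their highest event or change score and keep at most `limit`."""
    top_score = {}
    for finding in findings:
        # A node's score is the score of its most impactful event or change.
        top_score[finding.node] = max(top_score.get(finding.node, 0), finding.score)
    ranked = sorted(top_score.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


findings = [
    Finding("node-a", "event", 68),
    Finding("node-a", "event", 65),
    Finding("node-b", "event", 72),
    Finding("node-c", "change", 71),
    Finding("node-d", "event", 40),
]
print(rank_causal_nodes(findings))
# [('node-b', 72), ('node-c', 71), ('node-a', 68)]
```

How ties between nodes are broken is not described here, so the sketch simply keeps the input order for equal scores.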
Score calculation: Hover over the score number to view the details. It is the sum of the weightages assigned to the following factors:
Event Severity, KPI Metric, Multiple Services, Node Depth, Node Kind, and Time Proximity. The following table shows the pre-defined weightage details:

| Factor | Type | Weightage |
|---|---|---|
| Event Severity | Event | Warning (4), Minor (6), Major (8), and Critical (10) |
| KPI Metric | Event | 0 or 10. An event with a KPI metric gets 10 and an event without any metric gets 0. |
| Multiple Services | Event | 0 or 10. An event associated with multiple services gets 10 and an event associated with only one service gets 0. |
| Node Depth | Event or Change | 1 to 20. 1 for the node nearest to the service, up to 20 for the node farthest from the service. |
| Node Kind | Event or Change | 20 (fixed value) |
| Time Proximity | Event or Change | Up to 40 for an event and up to 60 for a change. |

- Click Events to view the top causal events or click Changes to view the top causal changes.
By default, the Events button is selected. You can select the Changes button to view the change requests that impacted the service.
- You can view the event details and change details.
- Click an event to view the event details, or click a change to view the change details.
- You can view additional details about the top causal events:
Related Events tab displays correlation event details, such as event message, the impacted host, occurrence, severity, priority, and status.
Note
You can only view this tab if the event is a primary event.
Performance View tab displays the time-series data collected from key attributes of the causal events.
Note
You can only view this tab if the event slot value for Class is Alarm.
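The score calculation described above, a sum of the pre-defined weightages, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the product's code: the function names `event_score` and `change_score` are hypothetical, and because this page does not say how the Node Depth and Time Proximity values are derived, they are passed in as already-computed weightages.

```python
# Weightages taken from the table above.
SEVERITY_WEIGHT = {"warning": 4, "minor": 6, "major": 8, "critical": 10}
NODE_KIND_WEIGHT = 20  # fixed value


def event_score(severity, has_kpi_metric, multiple_services, node_depth, time_proximity):
    """Sum the weightages that apply to an event (hypothetical helper)."""
    if not 1 <= node_depth <= 20:
        raise ValueError("node depth weightage must be between 1 and 20")
    if not 0 <= time_proximity <= 40:
        raise ValueError("time proximity for an event must be between 0 and 40")
    return (
        SEVERITY_WEIGHT[severity.lower()]
        + (10 if has_kpi_metric else 0)
        + (10 if multiple_services else 0)
        + node_depth
        + NODE_KIND_WEIGHT
        + time_proximity
    )


def change_score(node_depth, time_proximity):
    """Sum the weightages that apply to a change (hypothetical helper)."""
    if not 1 <= node_depth <= 20:
        raise ValueError("node depth weightage must be between 1 and 20")
    if not 0 <= time_proximity <= 60:
        raise ValueError("time proximity for a change must be between 0 and 60")
    return node_depth + NODE_KIND_WEIGHT + time_proximity


# Critical event with a KPI metric, on a single service, depth 3, proximity 25:
# 10 + 10 + 0 + 3 + 20 + 25
print(event_score("critical", True, False, 3, 25))  # 68
# Change at depth 2 with proximity 50: 2 + 20 + 50
print(change_score(2, 50))  # 72
```

Only the factors whose Type column lists Change are applied to changes, which is why `change_score` takes fewer inputs.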
On-demand PCA score recalibration
The health timeline and the time slots adjust based on the selected time range. BMC Helix Service Monitoring can recalibrate the PCA score on demand for any given time slot within the selected time range. You can click any time slot in the health timeline to identify the impacted nodes and the events and changes for those nodes. When you click a time slot within the time range, it becomes the current time slot for the recalibration, and all the rules of the PCA scoring method are applied to re-rank the impacted nodes.
Example
Default PCA score computation
- The selected time range is last 24 hours.
- The range is between 12:00 hours (previous day) and 12:00 hours (current time).
- The topmost causal entity displayed is ulx3od with a severity score of 45%.
On-demand PCA score computation
- The health score for the selected time slot (16:30 of the previous day) is displayed and an on-demand PCA computation is triggered.
- The topmost causal entity displayed is ulx3pq with a severity score of 73%.
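To see why selecting a different time slot can change the topmost causal entity, consider the following Python sketch. It rests on several assumptions: this page does not define how Time Proximity is computed, so the linear decay over a 24-hour window in `time_proximity` is purely hypothetical, as are the `rerank` function, the node names, and the sample values. The only behavior taken from the text is that the selected slot replaces the current time as the reference point and the nodes are then re-ranked.

```python
from datetime import datetime, timedelta


def time_proximity(occurred, reference, max_weight, window=timedelta(hours=24)):
    """Hypothetical linear decay: full weight at the reference time, 0 at the window edge."""
    gap = abs(reference - occurred)
    if gap >= window:
        return 0
    return round(max_weight * (1 - gap / window))


def rerank(events, reference):
    """Re-score events against `reference` and sort by total score, highest first.

    Each event is (node, other_weightages, occurred), where other_weightages is the
    sum of all factors except Time Proximity.
    """
    scored = [
        (node, other + time_proximity(occurred, reference, max_weight=40))
        for node, other, occurred in events
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up sample data: one recent event and one from the previous afternoon.
events = [
    ("node-x", 33, datetime(2024, 1, 2, 11, 0)),
    ("node-y", 35, datetime(2024, 1, 1, 16, 0)),
]

current_time = datetime(2024, 1, 2, 12, 0)
selected_slot = datetime(2024, 1, 1, 16, 30)

print(rerank(events, current_time)[0][0])   # topmost node for the default computation
print(rerank(events, selected_slot)[0][0])  # topmost node after selecting the slot
```

With this sample data, the event closest to the chosen reference time gains the most Time Proximity weight, so the topmost node differs between the default and on-demand computations.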