Performing causal analysis of impacted services
The AI/ML-based RCI algorithm correlates data from your services to identify the cause of a problem, provide context for the identified problem, and offer a means to resolve and remediate the situation. BMC Helix AIOps uses this data to help operators and SREs by automatically performing the following functions:
- Identifies and determines the top impacted nodes of a service based on the impacting events, incidents, and changes.
- Computes the impact score of these events, incidents, and changes.
- Ranks the impacting events based on the impact score, which in turn ranks the impacting causal nodes of the service.
- Correlates the events from the causal nodes into Situations to reduce the event noise and MTTR.
- Provides drill-down views to analyze and resolve the problem cause.
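The ranking step in the list above can be illustrated with a minimal sketch. It does not reproduce the product's algorithm: it assumes the impact scores are already computed, and it assumes a node is ranked by its highest-scoring item. The `ImpactingItem` record and the `rank_causal_nodes` function are hypothetical names used only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactingItem:
    """Hypothetical record for an impacting event, incident, or change."""
    node: str            # causal node the item belongs to
    message: str
    impact_score: float  # assumed to be computed already


def rank_causal_nodes(items, top_n=3):
    """Rank nodes by their highest-scoring item and keep the top_n nodes."""
    by_node = defaultdict(list)
    for item in items:
        by_node[item.node].append(item)
    # Within each node, order items from highest to lowest impact score.
    for node_items in by_node.values():
        node_items.sort(key=lambda i: i.impact_score, reverse=True)
    # Assumption: a node's rank follows its top-scoring item.
    ranked = sorted(
        by_node.items(),
        key=lambda pair: pair[1][0].impact_score,
        reverse=True,
    )
    return ranked[:top_n]
```

Calling `rank_causal_nodes(items)` returns a list of `(node, items)` pairs, which mirrors the way the page shows up to three causal nodes with their events ordered by impact score.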
To view the impacted nodes for a service
- On the Services page, click the service name for which you want to analyze the root cause of an impact.
- Scroll down and expand Analyze Root Cause to view the top three root causes (if available), and their causal events, incidents, and changes.
The topmost causal node is selected. By default, details are displayed for the last three hours. You can select a different duration to view event details for the selected period.
- In the Causal Events table, view the following details:
- Message
- Occurred: Date and time when the event occurred.
- Impact Score: Score assigned to the impacted node.
- Severity, Status, and Priority assigned to the event.
- Class: Class of the event.
For more information about event classes and types, see Event classification and formatting.
- Automations: If the Intelligent Automations feature is enabled, automation available for the event is displayed.
- Incident ID: Incidents from BMC Helix IT Service Management are treated as Incident_Info events in BMC Helix Operations Management. The top-impacting incident is displayed as a top-impacting event with an incident ID.
- Actions: Perform event operations.
For instructions, see To investigate the impacting events and perform event actions.
- If there are more than three impacting events, click Show all events to view the list of all events for the causal node.
- (Optional) Click the column selector and clear the columns that you do not want to be displayed.
Only selected columns are displayed. You can also drag and drop the columns to rearrange them.
- (Optional; BMC Helix IT Service Management integration) Select Changes to view the top three changes that impact the service.
If there are more than three impacting changes, click Show all changes to view the list of all changes for the causal node.
- (Optional) View the change details and perform change management tasks.
For instructions, see To investigate the change requests.
To investigate the impacting events and perform event actions
- Select a causal node to view the list of events that caused the impact and to perform event operations for resolving issues.
By default, the top three events along with their impact score, severity, and other relevant details are displayed.
- Analyze the impact as required.
- Click an event message to view the event details such as the event score, severity, priority, and status of the event.
You can also view a summary and any logs and notes associated with the event. All logs and notes for the event are displayed at the bottom of the panel.
- Click the action menu to perform the supported event actions, which are described in the following table:

| Action | Description |
| --- | --- |
| Acknowledge Event | Recognizes the existence of an open event. This operation changes the event status from Open to Acknowledged. |
| Assign Event | Assigns ownership of an open, acknowledged, or assigned event to yourself or another person in the same account. This operation changes the event status from Open or Acknowledged to Assigned, and the event owner is updated with the selected user. If the event status is Assigned, only the ownership changes to the selected user. |
| Close Event | Disables any further event operations on the event. Closed events are not considered for calculating the status of a device. You can close events with statuses Open, Assigned, and Acknowledged only. You cannot close a change request event. |
| Decline Ownership | Removes ownership of an event in the Assigned state. This operation changes the event status to Acknowledged. |
| Set Event Priority | Assigns a priority level to the event. |
| Take Ownership | Assigns ownership of an Open or Acknowledged event to yourself. |
| Unacknowledge Event | Changes a previously Acknowledged event back to the Open state. |
| Add Notes | Displays the Add Notes dialog box. |
| Create Incident | Creates an incident in BMC Helix IT Service Management – SmartIT. The incident ID appears against the impacted nodes. You can click the link to view the incident details in BMC Helix IT Service Management – SmartIT. (You must have permissions to view incidents in BMC Helix IT Service Management.) |
| Create Automation | Launches the Create Automation Policy page in BMC Helix Intelligent Automation to enable tenant administrators to create an automation policy. Available also for closed events. Requires the Intelligent Automations feature to be enabled from the Manage Product Features page under Configurations. |
| Request Automation | Displays the Request Automation dialog box. Available also for closed events. Requires the Intelligent Automations feature to be enabled from the Manage Product Features page under Configurations. |
| Trigger Automation | Displays the Run Automation dialog box that you can use to run automations for remediating the event. Available also for closed events. Requires the Intelligent Automations feature to be enabled from the Manage Product Features page under Configurations. |

For more information about the impact of the actions on the event, see Performing event operations in the BMC Helix Operations Management online documentation.
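The status changes described in the table above can be summarized as a small lookup, shown here as a sketch rather than as the product's implementation. Only transitions that the table states explicitly are included. Take Ownership is left out because the table does not say which status results from it. The names `TRANSITIONS` and `next_status` are hypothetical.

```python
# (action, current status) -> resulting status, as stated in the table.
TRANSITIONS = {
    ("Acknowledge Event", "Open"): "Acknowledged",
    ("Assign Event", "Open"): "Assigned",
    ("Assign Event", "Acknowledged"): "Assigned",
    ("Assign Event", "Assigned"): "Assigned",  # only the owner changes
    ("Close Event", "Open"): "Closed",
    ("Close Event", "Assigned"): "Closed",
    ("Close Event", "Acknowledged"): "Closed",
    ("Decline Ownership", "Assigned"): "Acknowledged",
    ("Unacknowledge Event", "Acknowledged"): "Open",
}


def next_status(action, status):
    """Return the resulting status, or raise if the table does not list it."""
    try:
        return TRANSITIONS[(action, status)]
    except KeyError:
        raise ValueError(
            f"{action!r} is not listed for an event in status {status!r}"
        ) from None
```

For example, `next_status("Decline Ownership", "Assigned")` returns `"Acknowledged"`, while `next_status("Close Event", "Closed")` raises a `ValueError` because closed events accept no further status-changing operations.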
- Click Performance View to see a graphical representation of the monitored metric.
The graph shows the health of the metric, baseline, and any event generated for the metric. You can view the baseline data for the previous 4 hours since the occurrence of any event.
- Add any additional notes related to the event by entering a note in the text box and clicking Add Note.
Any note added for the event is reflected for the event in BMC Helix Operations Management.
- (Optional) Click More Details to analyze the complete cause and context of the event.
The BMC Helix Operations Management Event Details page opens. To understand the various options that you see there, see Viewing event details.
- (Optional) For all other events, repeat steps 4 to 6.
If an incident is created, click the link in the Incident ID column to view the incident details in BMC Helix IT Service Management – SmartIT.
- (Requires the Intelligent Automations option to be enabled) In the Automations column, automations that match the event are displayed.
To run an existing automation, see Running-an-existing-automation.
- (Requires the Intelligent Automations option to be enabled) Click Action and perform any of the available actions for the open events:
- Create Automation: Launches the BMC Helix Intelligent Automation > Create Automation Policy page.
Tenant administrators can create an automation policy; see Creating-automation-policies.
- Request Automation: Displays the Request Automation dialog box.
For instructions about how to raise a request, see Requesting-for-a-new-automation.
- Trigger Automation: Displays the Run Automation dialog box that you can use to run automations for remediating the event.
For instructions about how to run an automation, see Running-an-existing-automation.
To investigate the change requests
BMC Helix AIOps supports connection with BMC Helix IT Service Management and ServiceNow change management applications. For more information about connecting with an IT service management application, see Setting-up-and-going-live.
To view change requests from BMC Helix IT Service Management:
- Select a causal node to view the list of probable changes that caused the impact and resolve the issues.
- Select Changes to view the top three changes along with their impact score, severity, and other relevant details. Analyze the impact as required.
- (Optional) Click the Show all changes link to view the list of all changes.
- Click a change ID link to view the change details summary in BMC Helix IT Service Management and manage the changes.
For more information on managing changes from BMC Helix IT Service Management, see Managing change.
To view change requests from ServiceNow:
By default, you can view change requests for the last five days in the Open, Implement, Review, and Closed states.
- Select a causal node to view the list of probable changes that caused the impact.
- Select Changes to view the top three changes along with their change ID, summary, when the change occurred, impact score, status, priority, and impact.
The Occurred column shows the time when the change request was created in ServiceNow.
- (Optional) Click the Show all changes link to view the list of all changes.
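The default ServiceNow view described above (last five days, four states) can be expressed as a simple filter. This is an illustrative sketch only: the dictionary keys `created` and `state` are hypothetical, and measuring the five-day window from the creation time is an assumption rather than documented behavior.

```python
from datetime import datetime, timedelta, timezone

# Defaults taken from the text: four states and a five-day window.
DEFAULT_STATES = {"Open", "Implement", "Review", "Closed"}
DEFAULT_WINDOW = timedelta(days=5)


def visible_by_default(change, now):
    """Return True if a change request falls inside the default view.

    Assumption: the window is measured from the change's creation time.
    """
    recent = now - change["created"] <= DEFAULT_WINDOW
    return recent and change["state"] in DEFAULT_STATES
```

A change created two days ago in the Implement state passes the filter. A change created six days ago, or one in a state outside the four listed, does not.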
Analyzing the metrics for an impacted service
Metrics are performance and health indicators collected from various nodes of your services. Each metric is a quantifiable measure used to gauge and compare the performance of impacted causal nodes in a given time period. Each metric is analyzed over a time period and the behavioral trend is captured as a baseline. If you have configured an alarm policy for such metrics, an event or alert is generated if any of the vital metrics goes above or below the baseline. In BMC Helix AIOps, you can view the metrics associated with the top three causal events that are associated with the top three impacted nodes for a service. After analyzing the metrics trend graph, you can take follow-up actions to resolve the issue as required.
For example, if your service contains a Linux host, you might want to monitor the following metrics to ensure that the Linux server is performing and available at all times:
- CPU usage and load
- CPU Utilization: Indicates the percentage of CPU utilization. CPU utilization is calculated by adding user time and system time. Utilization is more useful when looked at in combination with the Load parameter.
- Load: Displays the average number of processes in the kernel's run queue during an interval (1 minute in this case). It is more useful when looked at in combination with CPU Utilization.
- Disk and disk usage
- Average requests in queue: Displays the average number of disk I/O requests in the queue and is measured only when the queue is occupied. A high number indicates that system throughput is probably slowing down because of the number of I/O requests for this disk.
- Block transferred per second: Displays the number of blocks read from, or written to, the device per second, and indicates the workload for the device.
- File system
- Available space: Displays the amount of available space for this file system instance. This parameter is critical for the root volume.
- Number of free I-nodes: Displays the number of I-nodes available. An I-node maintains information about each file. Measuring I-nodes is critical because once all of your I-nodes have been used, the file system will not accept any more files regardless of how much disk space is available.
- Memory
- Used memory: Displays the amount of memory used by the partition.
- Free memory: Displays the number of 1 KB pages of memory available.
- Process
- Process CPU Usage: Displays the percentage of CPU used by the selected process. This percentage is calculated on the total number of active CPUs in the system.
- Process Memory Usage: Displays the amount of real memory that the process is using (in MB).
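For any of the metrics listed above, the idea of an event being generated when a value goes above or below the baseline can be sketched as follows. The documentation does not describe how the baseline is computed, so this example assumes a simple band of the mean plus or minus a multiple of the standard deviation of past samples. The function names and the `width` parameter are hypothetical.

```python
from statistics import mean, stdev


def baseline_band(history, width=2.0):
    """Return a (low, high) band around the mean of past samples.

    Assumption: the band is mean +/- width * standard deviation.
    """
    center = mean(history)
    spread = stdev(history)
    return center - width * spread, center + width * spread


def breaches_baseline(value, history, width=2.0):
    """Return True if value falls above or below the band."""
    low, high = baseline_band(history, width)
    return value < low or value > high
```

With CPU utilization samples that hover around 40 percent, a reading of 95 breaches the band and would be a candidate for an event under an alarm policy, while a reading of 41 would not.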
To analyze the metrics for an impacted service
- On the Services page, click the service name for which you want to analyze the metrics.
- Expand Analyze Root Cause to analyze and understand the trend of the impacted metrics for the top causal node.
- If there are more than three impacted metrics, the charts are displayed only for the top three metrics per causal host.
Where to go from here
Based on the health of and impact on a service, you can perform any of the following tasks:
- View CI topology for impacted services; see Identifying-the-impacted-CI-nodes-from-CI-topology-view.
- Investigate impacting events, incidents, and changes for nodes in the service hierarchy; see Investigating-the-service-nodes-from-service-hierarchy-view.
- View and analyze situations for an impacted service; see Analyzing-situations-for-a-service.
- View health indicators for an impacted service; see Monitoring-service-health-indicators.
- Get an insight into the service behavior and its severity pattern over a predefined period; see Monitoring-service-insights.