Performing root cause isolation of impacted services in your environment

This use case describes how you can use BMC Helix AIOps to analyze service health and determine root cause isolation information for the impacted services.

Scenario

Susan is an operator for the APEX Global IT Train Ticketing System. The train ticketing system provides a portal for booking and managing train reservations. The train ticketing system is based on a microservices architecture.

Susan plans to monitor a large number of services of the train ticketing system by using a single console. Susan faces the following challenges in her IT operations environment:

Monitoring the health of a large number of services from different sources is a time-consuming, tedious, and complex task.
Viewing a large number of events from multiple sources results in event noise.
Meeting SLAs in a complex environment requires quick analysis of issues.
Correlating data from a disparate set of solutions is difficult.

Therefore, Susan needs an effective solution to monitor services from all the sources by using a single console to quickly identify the impacted services, determine the probable causes of the impact, and diagnose and resolve the issues in a short time.

Solution

Susan can find answers to most of her challenges by using BMC Helix AIOps. Susan can determine the root cause isolation information for an impacted service by:

Using a ranked list of the most likely events that caused the impact
Visualizing the relationship between a discovered business service and the nodes of the service
Analyzing the events and change requests that are causing the impact
Viewing and understanding the anomalies or abnormalities from the metrics data associated with the causal events

Therefore, Susan decides to use BMC Helix AIOps to achieve her business goals.


Implementation workflow

Prerequisites

Susan asks Tim, the tenant administrator, to complete the following activities so that she can start monitoring the train ticketing application:

Register and activate a BMC Helix AIOps account.
Create an operator role for Susan to log in to BMC Helix AIOps.
Enable the AIOps root cause isolation feature. For more information, see Enabling the AIOps features.

Roles and permissions

Susan and her co-workers have these roles and permissions.

User	Role	Responsibilities	Permissions
Susan	Operator	View services and situations; view the details of a service or a situation	aiops.pca.view, aiops.services.view, aiops.situations.view, aiops.situations.manage
Sam	Service Designer	View services; create and modify services	aiops.services.view, aiops.services.manage
Tim	Tenant Administrator	Set up roles and permissions; configure third-party integrations	All
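
The role-to-permission mapping in the table can be pictured as a simple lookup. The following is a minimal sketch in Python, not a BMC Helix AIOps API: the permission strings come from the table, while the ROLE_PERMISSIONS dictionary and the can() helper are hypothetical names introduced only for illustration. Modeling the Tenant Administrator's "All" as a wildcard is also an assumption.

```python
# Illustrative only: models the roles table as plain data.
# ROLE_PERMISSIONS and can() are hypothetical, not part of BMC Helix AIOps.
ROLE_PERMISSIONS = {
    "Operator": {
        "aiops.pca.view",
        "aiops.services.view",
        "aiops.situations.view",
        "aiops.situations.manage",
    },
    "Service Designer": {
        "aiops.services.view",
        "aiops.services.manage",
    },
    # The table lists "All" for this role; "*" is an assumed wildcard.
    "Tenant Administrator": {"*"},
}


def can(role: str, permission: str) -> bool:
    """Return True if the role holds the permission or the wildcard."""
    granted = ROLE_PERMISSIONS.get(role, set())
    return "*" in granted or permission in granted


if __name__ == "__main__":
    # Susan (Operator) can view services but cannot modify them.
    print(can("Operator", "aiops.services.view"))
    print(can("Operator", "aiops.services.manage"))
```

Under this sketch, Susan can view services and situations, but creating or modifying a service requires Sam's Service Designer role.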

Step 1: To view the health summary

Susan performs the following step to view the health summary:

From the Overview tab, view the KPI metrics summary (such as total events, incidents, anomalies, MTTR, and noise reduction trends), and gauge the overall system availability status by looking at the top services and top situations, as shown in the following example image:

Insights for Susan

The TrainsApp train ticketing system service is listed as an impacted service in the top-services list with a poor health score.
In the last 24 hours, the event trend shows a 14% increase, which needs immediate attention and might be a major contributor to the service impact.
There is an urgent need to address the increasing event trend and poor health score by viewing the service details and looking at the probable causes.
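
As a rough illustration of how a trend figure such as the 14% above can be read, the following Python sketch computes a percentage change between two 24-hour windows. The function name percent_change and the event counts are hypothetical, and the sketch assumes a simple window-over-window comparison; how BMC Helix AIOps actually derives the event trend is not described in this use case.

```python
def percent_change(previous: int, current: int) -> float:
    """Percentage change from the previous window to the current one."""
    if previous <= 0:
        raise ValueError("previous must be a positive count")
    return (current - previous) / previous * 100.0


if __name__ == "__main__":
    # Hypothetical counts: 500 events in the prior 24 hours, 570 in the last 24.
    print(round(percent_change(500, 570), 1))  # 14.0
```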

Step 2: To view the health score, impact score, and health timeline

Susan performs the following steps to view the health parameters of an impacted service (TrainsApp):

Open the service details page by doing one of the following actions:
1. From the Overview tab > Services widget, click the impacted service (TrainsApp) that needs to be analyzed.
2. From the Entities tab > Services page, click the impacted service (TrainsApp) tile that needs to be analyzed.
On the service details page, check the health score, the impact score, and the service health timeline, as shown in the following example image:

Hover over different time slots on the health timeline to view the service health score, events, incidents, and changes.

Insights for Susan

From the Events pie chart, it is evident that the majority of the total events associated with the service are in critical status.
The health timeline for the last 3 hours indicates that the service impact is major and that the problem has persisted for a few hours.
From the health score and events history, analyze the sequence of activities that might have impacted the service.

Step 3: To view the root cause isolation information and metrics

Susan performs the following steps to view the root cause isolation information and the associated metrics for the impacted service:

On the service details page, click Root Cause Isolation. The following example image displays the root cause isolation information.
To view the causal event details by causal nodes or situations, click View By and select one of the following options:
- Situations: Displays the top 3 situations impacting the service. Click a situation to view the associated events. Click an event to view its details.
  Optionally, click the launch icon to open the situation details page on the Situations tab.
- Causal Nodes: Displays the top 3 causal nodes impacting the service. Click a causal node and perform the following actions to view the event and change request details:
  - To view the events and event details:
    1. Click Events to view top causal events.
    2. Hover over the score to view the score calculation details for the event.
    3. Click an event to view event details.
  - To view the change requests and details:
    1. Click Changes to view top change requests.
    2. Hover over the score to view the score calculation details for the change.
    3. Click a change to view change details.
  - To view all events or all changes:
    - Click the Show all events or Show all changes link to view all events or all changes for a particular causal node.
    - To switch back to only the top events or top changes, click the Show top causal events or Show top causal changes link.
Click Metric to view the metric details associated with the causal events as shown in the following example image:

Insights for Susan

- The list of causal nodes indicates the Kubernetes database host, booking software instance, and web server node as the most probable causes that impacted the service.
- The top 3 events associated with the Kubernetes database host indicate that there is a CPU utilization issue.
- The CPU utilization metric is showing aberrations that need to be addressed.
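
The Causal Nodes view described in this step (top causal events by default, with a link to show all events for a node) can be pictured as a sort-and-truncate over scored events. The following Python sketch is an illustration under stated assumptions, not the product's algorithm: the CausalEvent class, the top_causal_events function, the node names, and the scores are all hypothetical, and the actual score calculation is the one shown when you hover over a score in the console.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class CausalEvent:
    node: str       # causal node the event belongs to
    message: str    # short event description
    score: float    # hypothetical score; higher means more likely cause


def top_causal_events(
    events: Iterable[CausalEvent], node: str, limit: Optional[int] = 3
) -> List[CausalEvent]:
    """Return one node's events, highest score first.

    limit=3 mirrors the default "top" view; limit=None mirrors "show all".
    """
    ranked = sorted(
        (e for e in events if e.node == node),
        key=lambda e: e.score,
        reverse=True,
    )
    return ranked if limit is None else ranked[:limit]


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    sample = [
        CausalEvent("db-host", "CPU utilization above threshold", 0.92),
        CausalEvent("db-host", "Disk latency warning", 0.41),
        CausalEvent("db-host", "CPU utilization sustained high", 0.88),
        CausalEvent("db-host", "Process restart", 0.63),
        CausalEvent("web-server", "HTTP 5xx rate increase", 0.70),
    ]
    for event in top_causal_events(sample, "db-host"):
        print(event.score, event.message)
```

Passing limit=None returns every event for the node, which corresponds to the Show all events link; the default corresponds to switching back to the top causal events.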

Conclusion

Susan is happy about her decision to use BMC Helix AIOps to achieve her business goals. With BMC Helix AIOps, Susan effectively used the following information to solve her problem:

The health summary of the IT operations environment
Impacted services and their details
Root cause isolation information impacting the service and associated metric details