Performing probable cause analysis of impacted services in your environment

This use case describes how you can use BMC Helix Service Monitoring to analyze service health and determine probable causes for the impacted services.

Scenario

Susan is an operator for the APEX Global IT Train Ticketing System. The train ticketing system provides a portal for booking and managing train reservations. The train ticketing system is based on a microservices architecture.

Susan plans to monitor a large number of services of the train ticketing system by using a single console. Susan faces the following challenges in her IT operations environment:

Monitoring the health of a large number of services from different sources is a time-consuming, tedious, and complex task.
Viewing a large number of events from multiple sources results in event noise.
Meeting SLAs in a complex environment requires quick analysis of issues.
Correlating data from a disparate set of solutions is difficult.

Therefore, Susan needs an effective solution to monitor services from all the sources by using a single console to quickly identify the impacted services, determine the probable causes of the impact, and diagnose and resolve the issues in a short time.

Solution

Susan can find answers to most of her challenges by using BMC Helix Service Monitoring. With BMC Helix Service Monitoring, Susan can determine the most probable causes of an impact by:

Using correlated data points from integrated event sources
Visualizing the relationship between a discovered business service and nodes of the service
Analyzing the events and change requests that are causing the impact
Viewing and understanding the anomalies or abnormalities from the metrics data associated with the causal events

Therefore, Susan decides to use BMC Helix Service Monitoring to achieve her business goals.

Implementation workflow

Prerequisites

Susan asks Tim, the tenant administrator, to complete the following activities so that she can start monitoring the train ticketing application by using BMC Helix Service Monitoring:

Register and activate a BMC Helix Service Monitoring account.
Create an operator role for Susan to log in to BMC Helix Service Monitoring.

Roles and permissions

Susan and her co-workers have these roles and permissions.

User	Role	Responsibilities	Permissions
Susan	Operator	View services and situations; view the details of a service or a situation	aiops.pca.view, aiops.services.view, aiops.situations.view, aiops.situations.manage
Sam	Service Designer	View services; create and modify services	aiops.services.view, aiops.services.manage
Tim	Tenant Administrator	Set up roles and permissions; configure third-party integrations; view all data in the BMC Helix Service Monitoring console	All
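
The role table above can be read as a simple role-to-permission mapping. The following Python sketch is only an illustration of that mapping and is not part of BMC Helix Service Monitoring: the permission strings are copied from the table, while `ALL_PERMISSIONS`, `ROLE_PERMISSIONS`, and `has_permission` are hypothetical names introduced here.

```python
# Role-to-permission mapping transcribed from the table above.
# ALL_PERMISSIONS is a hypothetical marker standing in for the "All" entry.
ALL_PERMISSIONS = "*"

ROLE_PERMISSIONS = {
    "Operator": {
        "aiops.pca.view",
        "aiops.services.view",
        "aiops.situations.view",
        "aiops.situations.manage",
    },
    "Service Designer": {
        "aiops.services.view",
        "aiops.services.manage",
    },
    "Tenant Administrator": {ALL_PERMISSIONS},
}


def has_permission(role: str, permission: str) -> bool:
    """Return True if the role grants the permission (illustrative check only)."""
    granted = ROLE_PERMISSIONS.get(role, set())
    return ALL_PERMISSIONS in granted or permission in granted
```

For example, `has_permission("Operator", "aiops.pca.view")` is true for Susan, whereas `has_permission("Operator", "aiops.services.manage")` is false because only Sam's and Tim's roles can modify services.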

Step 1: To view the health summary

Susan needs to do the following steps to view the health summary:

Log in to BMC Helix Service Monitoring.
On the Overview tab, view the KPI metrics summary (total events, incidents, anomalies, MTTR, and noise reduction trends), and gauge the overall system availability by looking at the top services and top situations, as shown in the following example image:
Insights for Susan
- The TrainsApp train ticketing system service is listed as an impacted service in the top-services list with a poor health score.
- In the last 24 hours, the event trend shows a 14% increase, which needs immediate attention and might be a major contributor to the service impact.
- There is an urgent need to address the increasing event trend and poor health score by viewing the service details and looking at the probable causes.
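
To make the 14% event trend concrete, here is a minimal sketch of a percent-change calculation between two consecutive 24-hour windows. It assumes the trend is a simple relative change in event counts; how the product actually computes the figure is not stated here, and the counts below are hypothetical.

```python
def percent_change(previous_count: int, current_count: int) -> float:
    """Relative change from the previous window to the current one, in percent."""
    if previous_count <= 0:
        raise ValueError("previous_count must be positive")
    return (current_count - previous_count) * 100 / previous_count


# Hypothetical event counts for the previous and the current 24-hour windows.
previous_24h = 1000
current_24h = 1140

trend = percent_change(previous_24h, current_24h)
print(f"Event trend: {trend:+.0f}%")  # prints "Event trend: +14%"
```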

Step 2: To view the health score, impact score, and health timeline

Susan needs to do the following steps to view the health parameters of an impacted service (TrainsApp):

Open the service details page by doing one of the following actions:
1. From the Overview tab > Services widget, click the impacted service (TrainsApp) that needs to be analyzed.
2. From the Entities tab > Services page, click the impacted service (TrainsApp) tile that needs to be analyzed.
On the service details page, check the health score, the impact score, and the service health timeline, as shown in the following example image:

Hover over different time slots on the health timeline to view the service health score, events, incidents, and changes.

Insights for Susan

From the Events pie chart, it is evident that the majority of the total events associated with the service are in critical status.
The health timeline for the last 3 hours indicates that the service impact is major and the problem has been persisting for a few hours.
From the health score and events history, analyze the sequence of activities that might have impacted the service.

Step 3: To view the probable causes, causal events, and metrics

Susan needs to do the following steps to view the probable causes and the associated metrics for the impacted service:

On the service details page, click Probable Cause. The following example image displays probable causes.
From Causal Nodes (% Probability), select a causal node to view the top causal events and changes for that node.
Do one of the following actions to view the event or change request details:
- To view the events and event details:
  1. Click Events to view top causal events.
  2. Hover over the score to view the score calculation details for the event.
  3. Click an event to view event details.
- To view the change requests and details:
  1. Click Changes to view top change requests.
  2. Hover over the score to view the score calculation details for the change.
  3. Click a change to view change details.
- To view all events or all changes:
  - Click the Show all events or Show all changes link to view all events or all changes for a particular causal node.
  - To switch back to only the top events or top changes, click the Show top causal events or Show top causal changes link.
Click Metric to view the metric details associated with the causal events, as shown in the following example image:
Insights for Susan
- The list of causal nodes indicates the Kubernetes database host, booking software instance, and web server node as the most probable causes that impacted the service.
- The top 3 events associated with the Kubernetes database host indicate that there is a CPU utilization issue.
- The CPU utilization metric is showing aberrations that need to be addressed.
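
The way Susan reads the Probable Cause tab, highest-probability node first and then its top-scored events, can be sketched as two sorts. This is only an illustration of the reading order, not the product's scoring logic; the class names, probabilities, event messages, and scores are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CausalEvent:
    message: str
    score: float  # hypothetical causal score; higher means more likely causal


@dataclass
class CausalNode:
    name: str
    probability: float  # hypothetical probability, in percent
    events: List[CausalEvent] = field(default_factory=list)


def rank_nodes(nodes: List[CausalNode]) -> List[CausalNode]:
    """Order causal nodes from most to least probable."""
    return sorted(nodes, key=lambda n: n.probability, reverse=True)


def top_causal_events(node: CausalNode, limit: int = 3) -> List[CausalEvent]:
    """Return the highest-scored events for one causal node."""
    return sorted(node.events, key=lambda e: e.score, reverse=True)[:limit]


# Hypothetical sample data loosely following the scenario.
nodes = [
    CausalNode("web server node", 13.0),
    CausalNode("Kubernetes database host", 62.0, [
        CausalEvent("CPU utilization above threshold", 0.91),
        CausalEvent("Disk latency warning", 0.40),
        CausalEvent("CPU utilization anomaly detected", 0.85),
        CausalEvent("CPU run queue length high", 0.78),
    ]),
    CausalNode("booking software instance", 25.0),
]

most_probable = rank_nodes(nodes)[0]
for event in top_causal_events(most_probable):
    print(f"{most_probable.name}: {event.message} (score {event.score})")
```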

Step 4: To view the service topology

Susan needs to do the following steps to view the service topology of the impacted service:

On the service details page, click Topology. The following example image displays the service topology of the train ticketing system, with all the impacted nodes highlighted in red:
Click a node to view the details.
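
Conceptually, the topology view is a graph walk from the service down to its nodes, with the impacted ones flagged. The following sketch shows that idea with a breadth-first traversal. It assumes a hypothetical topology: the node names echo the scenario, but the edges and the impacted set are invented for illustration and do not come from the product.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical topology: each key depends on the nodes in its list.
TOPOLOGY: Dict[str, List[str]] = {
    "TrainsApp": ["web-server", "booking-instance"],
    "web-server": ["booking-instance"],
    "booking-instance": ["k8s-db-host"],
    "k8s-db-host": [],
}

# Hypothetical set of nodes that the topology view would highlight in red.
IMPACTED: Set[str] = {"web-server", "booking-instance", "k8s-db-host"}


def impacted_nodes(service: str, topology: Dict[str, List[str]],
                   impacted: Set[str]) -> List[str]:
    """Breadth-first walk from the service, collecting impacted nodes in visit order."""
    seen = {service}
    queue = deque([service])
    result = []
    while queue:
        node = queue.popleft()
        if node in impacted:
            result.append(node)
        for child in topology.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return result


print(impacted_nodes("TrainsApp", TOPOLOGY, IMPACTED))
```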