Performing probable cause analysis of impacted services in your environment
This use case describes how you can analyze service health and determine probable causes for the impacted services.
Scenario
Susan is an operator for the APEX Global IT Train Ticketing System. The train ticketing system provides a portal for booking and managing train reservations. The train ticketing system is based on a microservices architecture.
Susan plans to monitor a large number of services of the train ticketing system by using a single console. Susan faces the following challenges in her IT operations environment:
- Monitoring the health of a large number of services from different sources is a time-consuming, tedious, and complex task.
- Viewing a large number of events from multiple sources results in event noise.
- Meeting SLAs in a complex environment requires quick analysis of issues.
- Correlating data from a disparate set of solutions is difficult.
Therefore, Susan needs a solution that monitors services from all sources in a single console, so that she can quickly identify the impacted services, determine the probable causes of the impact, and diagnose and resolve the issues in a short time.
Solution
Susan can find answers to most of her challenges by using the probable cause analysis capability. Susan can determine the most probable causes of an impact by:
- Using correlated data points from integrated event sources
- Visualizing the relationship between a discovered business service and nodes of the service
- Analyzing the events and change requests that are causing the impact
- Viewing and understanding the anomalies or abnormalities from the metrics data associated with the causal events
Therefore, Susan decides to use this capability to achieve her business goals.
Implementation workflow
Prerequisites
Susan asks Tim, the tenant administrator, to complete the following activities so that she can start monitoring the train ticketing application:
- Register and activate an account.
- Create an operator role for Susan to log on to the BMC Helix AIOps console.
Roles and permissions
Susan and her co-workers have these roles and permissions.
User | Role | Responsibilities | Permissions |
---|---|---|---|
Susan | Operator | | |
Sam | Service Designer | | |
Tim | Tenant Administrator | | All |
Step 1: To view the health summary
Susan completes the following steps to view the health summary:
- Log on to the BMC Helix AIOps console.
- From the Overview tab, view the KPI metrics summary, such as total events, incidents, anomalies, MTTR, and noise reduction trends. Gauge the overall system availability status by looking at the top services and top situations, as shown in the following example image:
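As a rough illustration of one KPI in this summary, the following sketch computes MTTR, read here as mean time to resolve, from a handful of incident timestamps. The records and the function name `mean_time_to_resolve` are hypothetical and only illustrate the arithmetic. They do not describe how the console calculates the value.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (opened, resolved) timestamps.
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 10, 0)),    # 60 minutes
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 30)),  # 30 minutes
    (datetime(2024, 1, 3, 8, 0), datetime(2024, 1, 3, 10, 0)),    # 120 minutes
]


def mean_time_to_resolve(records):
    """Return the average of (resolved - opened) across all records."""
    if not records:
        return timedelta(0)
    total = sum((resolved - opened for opened, resolved in records), timedelta(0))
    return total / len(records)


mttr = mean_time_to_resolve(incidents)
print(f"MTTR: {mttr}")
```

A lower MTTR over time suggests that issues are being diagnosed and resolved faster, which is the outcome Susan is working toward.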
Step 2: To view the health score, impact score, and health timeline
Susan completes the following steps to view the health parameters of an impacted service (TrainsApp):
- Open the service details page by doing one of the following actions:
  - From the Overview tab > Services widget, click the impacted service (TrainsApp) that needs to be analyzed.
  - From the Entities tab > Services page, click the tile of the impacted service (TrainsApp) that needs to be analyzed.
- On the service details page, check the health and impact scores and the service health timeline, as shown in the following example image:
- Hover over different time slots on the health timeline to view the service health score, events, incidents, and changes.
Step 3: To view the probable causes, causal events, and metrics
Susan completes the following steps to view the probable causes and the associated metrics for the impacted service:
- On the service details page, click Probable Cause. The following example image displays probable causes.
- From Causal Nodes (% Probability), select a causal node to view the top causal events and changes for that node.
- Do one of the following actions to view the event or change request details:
  - To view the events and event details:
    - Click Events to view the top causal events.
    - Hover over the score to view the score calculation details for the event.
    - Click an event to view the event details.
  - To view the change requests and details:
    - Click Changes to view the top change requests.
    - Hover over the score to view the score calculation details for the change.
    - Click a change to view the change details.
  - To view all events or all changes:
    - Click the Show all events or Show all changes link to view all events or all changes for a particular causal node.
    - Switch back to view only the top events or top changes by clicking the Show top causal events or Show top causal changes link.
- Click Metric to view the metric details associated with the causal events, as shown in the following example image:
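To make the idea of causal nodes ranked by probability and top causal events ranked by score more concrete, here is a minimal sketch. All names (`CausalNode`, `CausalEvent`, `rank_nodes`, `top_events`), the node names, and the numbers are hypothetical sample data. The sketch only shows the sorting that such a view implies, and it is not the product's scoring algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class CausalEvent:
    message: str
    score: float  # hypothetical relevance score, higher means more likely causal


@dataclass
class CausalNode:
    name: str
    probability: float  # hypothetical probability in percent
    events: list = field(default_factory=list)


def rank_nodes(nodes):
    """Order causal nodes from most to least probable."""
    return sorted(nodes, key=lambda n: n.probability, reverse=True)


def top_events(node, limit=3):
    """Return the highest-scoring events for one causal node."""
    return sorted(node.events, key=lambda e: e.score, reverse=True)[:limit]


# Hypothetical nodes of the TrainsApp service.
nodes = [
    CausalNode("payment-db", 25.0, [CausalEvent("Slow queries", 60.0)]),
    CausalNode("booking-api", 65.0, [
        CausalEvent("High response time", 90.0),
        CausalEvent("Pod restarted", 40.0),
        CausalEvent("Disk usage warning", 20.0),
        CausalEvent("Config change applied", 55.0),
    ]),
]

for node in rank_nodes(nodes):
    print(node.name, f"{node.probability}%")
    for event in top_events(node):
        print("  ", event.score, event.message)
```

Limiting the list to the top few events per node mirrors the default view. Passing a larger `limit` corresponds to choosing to show all events.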
Step 4: To view the service topology
Susan completes the following steps to view the service topology of the impacted service:
- On the service details page, click Topology. The following example image displays the service topology of the train ticketing system, with all impacted nodes highlighted in red:
- Click a node to view the details.
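A service topology can be thought of as a graph in which the service is connected to the nodes it depends on. The following sketch walks such a graph breadth-first and collects the nodes that are flagged as impacted. The topology, the node names, and the `impacted_nodes` function are hypothetical and serve only to illustrate the idea behind the highlighted nodes.

```python
from collections import deque

# Hypothetical topology: each key lists the nodes it depends on.
topology = {
    "TrainsApp": ["booking-api", "search-api"],
    "booking-api": ["payment-db"],
    "search-api": [],
    "payment-db": [],
}

# Hypothetical set of nodes currently flagged as impacted.
impacted = {"booking-api", "payment-db"}


def impacted_nodes(graph, root, flagged):
    """Breadth-first walk from root, returning flagged nodes in visit order."""
    seen = {root}
    queue = deque([root])
    result = []
    while queue:
        node = queue.popleft()
        if node in flagged:
            result.append(node)
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return result


print(impacted_nodes(topology, "TrainsApp", impacted))
```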
Conclusion
Susan is happy about her decision to use service monitoring capabilities to achieve her business goals. Susan effectively used the following information to solve her problem:
- The health summary of the IT operations environment
- Impacted services and their details
- Probable causes impacting the service and associated metric details