Performing root cause isolation of impacted services in your environment
Scenario
Susan is an operator for the APEX Global IT Train Ticketing System. The system provides a portal for booking and managing train reservations and is based on a microservices architecture.
Susan plans to monitor a large number of services of the train ticketing system by using a single console. Susan faces the following challenges in her IT operations environment:
- Monitoring the health of a large number of services from different sources is a time-consuming, tedious, and complex task.
- Viewing a large number of events from multiple sources results in event noise.
- Meeting SLAs in a complex environment requires quick analysis of issues.
- Correlating data from a disparate set of solutions is difficult.
Therefore, Susan needs an effective solution to monitor services from all the sources by using a single console to quickly identify the impacted services, determine the probable causes of the impact, and diagnose and resolve the issues in a short time.
Solution
Susan can find answers to most of her challenges by using BMC Helix AIOps. Susan can determine the root cause isolation information for an impacted service by:
- Using a ranked list of the most likely events that caused the impact
- Visualizing the relationship between a discovered business service and nodes of the service
- Analyzing the events and change requests that are causing the impact
- Viewing and understanding the anomalies or abnormalities from the metrics data associated with the causal events
Therefore, Susan decides to use BMC Helix AIOps to achieve her business goals.
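The idea of a ranked list of likely causal events, and of the top causal nodes derived from it, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CausalEvent` record, its field names, the sample node names, and the numeric scores are all hypothetical, and the ranking rule (sort by score, highest first) is a stand-in that says nothing about how BMC Helix AIOps actually calculates its scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CausalEvent:
    """Hypothetical event record; the field names are illustrative only."""
    event_id: str
    node: str
    score: float  # assumption: a higher score means a more likely cause


def top_causal_events(events, limit=3):
    """Return the `limit` highest-scoring events, best first."""
    return sorted(events, key=lambda e: e.score, reverse=True)[:limit]


def top_causal_nodes(events, limit=3):
    """Rank nodes by their best event score and return the top `limit` names."""
    best = {}
    for e in events:
        best[e.node] = max(best.get(e.node, float("-inf")), e.score)
    return sorted(best, key=best.get, reverse=True)[:limit]


# Made-up sample data for the sketch.
events = [
    CausalEvent("evt-1", "ticket-db", 0.91),
    CausalEvent("evt-2", "payment-api", 0.42),
    CausalEvent("evt-3", "ticket-db", 0.77),
    CausalEvent("evt-4", "seat-cache", 0.63),
    CausalEvent("evt-5", "auth-gateway", 0.15),
]

for e in top_causal_events(events):
    print(e.event_id, e.node, e.score)
print(top_causal_nodes(events))
```

Passing a larger `limit` returns the full ranked list, which loosely mirrors switching between a "top" view and an "all" view of the same data.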
Implementation workflow
Prerequisites
Susan requests the tenant administrator Tim to complete the following activities so that she can start monitoring the train ticketing application:
- Register and activate a BMC Helix AIOps account.
- Create an operator role for Susan to log in to BMC Helix AIOps.
- Enable the AIOps root cause isolation feature. For more information, see Enabling the AIOps features.
Roles and permissions
Susan and her co-workers have these roles and permissions.
User | Role | Responsibilities | Permissions |
---|---|---|---|
Susan | Operator | | |
Sam | Service Designer | | |
Tim | Tenant Administrator | | All |
Step 1: To view the health summary
Susan does the following to view the health summary:
- From the Overview tab, view the KPI metrics summary (such as total events, incidents, anomalies, MTTR, and noise reduction trends), and gauge the overall system availability status by looking at the top services and top situations, as shown in the following example image:
Step 2: To view the health score, impact score, and health timeline
Susan needs to do the following steps to view the health parameters of an impacted service (TrainsApp):
- Open the service details page by doing one of the following actions:
  - From the Overview tab > Services widget, click the impacted service (TrainsApp) that needs to be analyzed.
  - From the Entities tab > Services page, click the impacted service (TrainsApp) tile that needs to be analyzed.
- On the service details page, check the health and impact score, and the service health timeline as shown in the following example image:
- Hover over different time slots on the health timeline to view the service health score, events, incidents, and changes.
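To make the notion of a timeline "slot" concrete, the following Python sketch groups timestamped records into fixed-width slots and counts events, incidents, and changes in each one. It is a rough illustration only: the one-hour slot width, the `(timestamp, kind)` record shape, and the `summarize_slots` helper are assumptions made for this example, and the sketch does not attempt to reproduce the health score or the way the product builds its timeline.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def summarize_slots(records, start, slot=timedelta(hours=1)):
    """Group (timestamp, kind) records into fixed-width slots and count each kind."""
    slots = defaultdict(Counter)
    for ts, kind in records:
        index = (ts - start) // slot  # whole number of slots since `start`
        slots[start + index * slot][kind] += 1
    return dict(sorted(slots.items()))


# Made-up sample data; the kinds mirror what the timeline tooltip lists.
start = datetime(2024, 1, 1, 9, 0)
records = [
    (datetime(2024, 1, 1, 9, 5), "event"),
    (datetime(2024, 1, 1, 9, 40), "event"),
    (datetime(2024, 1, 1, 9, 50), "incident"),
    (datetime(2024, 1, 1, 10, 15), "change"),
    (datetime(2024, 1, 1, 11, 59), "event"),
]

summary = summarize_slots(records, start)
for slot_start, counts in summary.items():
    print(slot_start.strftime("%H:%M"), dict(counts))
```

Each key in `summary` is the start time of a slot, so looking up one key corresponds loosely to hovering over one slot on the timeline.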
Step 3: To view the root cause isolation information and metrics
Susan needs to do the following steps to view the root cause isolation information and the associated metrics for the impacted service:
- On the service details page, click Root Cause Isolation. The following example image displays the root cause isolation information.
- To view the causal event details by causal nodes or situations, click View By and select one of the following options:
  - Situations: Displays the top 3 situations impacting the service. Click a situation to view the associated events. Click an event to view its details.
  - Causal Nodes: Displays the top 3 causal nodes impacting the service. Click a causal node and perform the following actions to view the event and change request details:
    - To view the events and event details:
      - Click Events to view the top causal events.
      - Hover over the score to view the score calculation details for the event.
      - Click an event to view the event details.
    - To view the change requests and details:
      - Click Changes to view the top change requests.
      - Hover over the score to view the score calculation details for the change.
      - Click a change to view the change details.
    - To view all events or all changes:
      - Click the Show all events or Show all changes link to view all events or all changes for a particular causal node.
      - To switch back to only the top events or top changes, click the Show top causal events or Show top causal changes link.
    - To view the metric details:
      - Click Metric to view the metric details associated with the causal events, as shown in the following example image:
Conclusion
Susan is happy about her decision to use BMC Helix AIOps to achieve her business goals. With BMC Helix AIOps, Susan effectively used the following information to solve her problem:
- The health summary of the IT operations environment
- Impacted services and their details
- Root cause isolation information impacting the service and associated metric details