Identifying bottlenecks by using OpenTelemetry
In microservices-based architectures, tracking the flow of requests between services is challenging, yet it is crucial for identifying bottlenecks and issues quickly and minimizing system downtime. BMC Helix AIOps integrates with OpenTelemetry to gather traces from applications, which helps site reliability engineers (SREs) resolve errors quickly and reduce mean time to resolution (MTTR).
OpenTelemetry, an open-source observability framework, enables the collection and analysis of telemetry data such as traces. A trace details a request's path through a system, which helps identify sources of errors and latency. Traces are also used to produce RED metrics (request rate, errors, and duration), which can be used to create events in BMC Helix Operations Management based on specific criteria.
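To make the relationship between traces and RED metrics concrete, the following minimal sketch derives a request rate, an error count, and a 95th-percentile duration from a handful of span records. It is an illustration only, not how BMC Helix or OpenTelemetry computes these values: the `Span` class, the `ads` service name, the sample durations, and the nearest-rank percentile method are all assumptions.

```python
from dataclasses import dataclass
import math


@dataclass
class Span:
    """Hypothetical, simplified span record (not the OpenTelemetry data model)."""
    service: str
    operation: str
    duration_ms: float
    is_error: bool


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def red_metrics(spans, window_seconds):
    """Summarize request rate, error count, and P95 duration for one time window."""
    durations = [s.duration_ms for s in spans]
    return {
        "request_rate_per_s": len(spans) / window_seconds,
        "errors": sum(1 for s in spans if s.is_error),
        "duration_p95_ms": percentile(durations, 95),
    }


# Made-up sample data: five requests in a 60-second window, one marked as an error.
spans = [
    Span("ads", "getRandomAds", d, d > 4.0)
    for d in (1.0, 1.5, 2.0, 2.5, 5.0)
]
metrics = red_metrics(spans, window_seconds=60)
print(metrics)
```

A single slow request dominates the P95 value in such a small sample, which is why percentile-based duration metrics are useful for spotting latency outliers that an average would hide.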
Watch the following video (2:36) to get a quick peek at how the combined power of BMC Helix AIOps and OpenTelemetry helps you to identify application issues quickly:
BMC Helix AIOps connects with OpenTelemetry through the BMC Helix OpenTelemetry service. To enable this service, contact BMC Support.
Scenario
Susan, an SRE for APEX Global, is responsible for maintaining system health across applications. One day, she receives a complaint from end users that it is taking them hours to submit an incident request from the ITSM Service application. As a result, they cannot perform their change management tasks. She manages to identify the issue and fix it manually, but doing so takes her days.
During this exercise, she faces the following challenges in her IT operations management:
- Identifying the erroneous operations or the operations that are taking much longer than expected
- Tracking the complete execution path of requests as they flow from one service to another across the services that constitute the application
In the future, Susan wants to identify the source of an issue and restore the application to normal as quickly as possible. She needs an effective solution to identify all the operations that lead to the slowdown of the applications. She decides to use BMC Helix AIOps for its service monitoring and OpenTelemetry for its tracing capabilities.
Workflow
Susan requests that Jim, the tenant administrator, and Scotty, the service designer, enable BMC Helix applications to collect service traces from instrumented applications. Susan can then monitor the service health and analyze traces.
| Task | Product | Role | Action | Reference |
|---|---|---|---|---|
| 1. Enable BMC Helix applications to collect service traces from instrumented applications via OpenTelemetry. | OpenTelemetry | Tenant administrator | Enable the OpenTelemetry Collector to export trace data to the BMC Helix applications. | |
| | BMC Helix Operations Management | Tenant administrator | Create an alarm policy in BMC Helix Operations Management that generates events if the value of a trace metric (for example, duration or request rate) is greater than or less than a specified value. | |
| | BMC Helix Discovery | Service designer | Make sure that the topology for the application that is instrumented by OpenTelemetry is ingested into BMC Helix Discovery. | |
| | BMC Helix AIOps | Service designer | Create a service model that represents the topology of the application that has been instrumented by OpenTelemetry. You can create a service model by using a service blueprint or static content. To create a service model by using a blueprint, you can either use an out-of-the-box blueprint (version 25.1.01 or later) or create your own blueprint. | |
| 2. Monitor the service health in BMC Helix AIOps and analyze service traces in BMC Helix Dashboards to identify the issue. | BMC Helix AIOps, BMC Helix Dashboards | SRE/Operator | | |
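The alarm policy in the workflow generates an event when a trace metric crosses a specified value. As a rough illustration of that kind of rule, the sketch below checks whether a metric stays above a threshold for a sustained period, using the "more than 3 milliseconds for 1 minute" condition that appears in the event description later in this topic. The function name, the sample values, and the 15-second sampling interval are hypothetical, and the logic is not the BMC Helix Operations Management implementation.

```python
def breaches_for_duration(samples, threshold, min_duration_s):
    """Return True if the metric stays above threshold for at least min_duration_s.

    samples: list of (timestamp_seconds, value) tuples, sorted by time.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach streak
            if ts - breach_start >= min_duration_s:
                return True
        else:
            breach_start = None  # streak broken; start over
    return False


# Hypothetical Duration_P95 samples in milliseconds, one every 15 seconds.
samples = [(0, 2.1), (15, 3.4), (30, 3.9), (45, 4.2), (60, 3.8), (75, 3.6)]
should_raise_event = breaches_for_duration(samples, threshold=3.0, min_duration_s=60)
print(should_raise_event)
```

Requiring the breach to persist for a minimum duration avoids raising events for momentary spikes, at the cost of a short delay before a genuine slowdown is reported.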
After Jim and Scotty enable BMC Helix applications to collect service traces from the ITSM Service application, Susan can start monitoring the service health and analyzing traces. She can monitor the health of the application through the business service model created in BMC Helix AIOps. The health score of a business service indicates the application health.
If an issue is reported in the future, Susan can quickly identify the impacted application service from the business service. She can use dashboards to track and analyze traces. The analysis helps her understand the request flows and their interactions across multiple application services. She can use this information to troubleshoot issues in the system.
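Understanding a request flow from a trace comes down to following parent and child relationships between spans. The sketch below rebuilds that call path as an indented outline. It assumes a simplified span shape (dictionaries with `span_id`, `parent_id`, `service`, and `operation` keys), and the service and operation names other than getRandomAds are invented for illustration.

```python
from collections import defaultdict


def render_trace(spans):
    """Return indented lines showing the parent/child call path of one trace.

    spans: list of dicts with span_id, parent_id (None for the root span),
    service, and operation keys.
    """
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        # Visit each child of parent_id, then recurse into its own children.
        for span in children[parent_id]:
            lines.append("  " * depth + f'{span["service"]}: {span["operation"]}')
            walk(span["span_id"], depth + 1)

    walk(None, 0)
    return lines


# Hypothetical trace: one root request that calls two downstream services.
trace = [
    {"span_id": "a", "parent_id": None, "service": "frontend", "operation": "submitRequest"},
    {"span_id": "b", "parent_id": "a", "service": "ads", "operation": "getRandomAds"},
    {"span_id": "c", "parent_id": "b", "service": "cache", "operation": "lookup"},
    {"span_id": "d", "parent_id": "a", "service": "db", "operation": "insert"},
]
for line in render_trace(trace):
    print(line)
```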
To monitor the service health in BMC Helix AIOps and analyze service traces in BMC Helix Dashboards to identify the issue
- Log on to BMC Helix AIOps.
- On the Services page, click the business service (ITSM Service) representing the ITSM Service application topology.
  In this example, the service health score of the ITSM Service business service is 70, which indicates that the impact of the issue is minor.
- Click a causal node (for example, getRandomAds) responsible for the poor health of the business service.
  The Node Details pane shows the list of impacting events on the Events tab. For information about monitoring the health of a service, see Identifying the impacted CI nodes from CI topology view.
- On the Events tab, click the event to show the Event Details page.
  The event description indicates that the value of the Duration_P95 parameter has been more than 3 milliseconds for 1 minute. To identify the erroneous operation, close the Event Details pane and, depending on the BMC Helix AIOps version you are using, perform one of the following tasks:
  - (Version 25.1.01 and later) Perform the following tasks:
    - Click the View Service Details link.
      The OTel service and the summary of the traces appear in another pane in the BMC Helix AIOps console.
    - In the Traces for <serviceName> section, sort the Status column and click the ERROR link for any trace.
      The trace details appear in the pane.
  - (Version 25.1.00 and earlier) Perform the following tasks:
    - Click the View OTel Dashboard link.
      The OTel Service Overview dashboard opens in the BMC Helix Dashboards console.
    - In the Traces for <serviceName> section, sort the Status column and click the ERROR link for any trace.
      The OTel Trace Details dashboard opens in the BMC Helix Dashboards console.
- In the Details for TraceID section, expand Service & Operation.
  Because the event was generated for a duration metric, Duration_p95, look for the operations and suboperations that are taking a long time.
After Susan finds the operation, she fixes the issue in the application service. As a result, the health of the ITSM Service business service and the application service is restored to normal.
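When looking for slow operations and suboperations in a trace, a parent span's total duration includes the time spent in its children, so the longest span is not always the culprit. One common heuristic is to rank spans by self time, that is, their own duration minus the duration of their direct children. The sketch below applies that heuristic to a hypothetical trace. The span shape, the names other than getRandomAds, and the durations are assumptions, and the subtraction is only accurate when child spans run one after another.

```python
def slowest_operations(spans, top_n=3):
    """Rank operations by self time: own duration minus direct children's durations.

    Assumes child spans run sequentially within their parent; with parallel
    children this subtraction can underestimate a parent's self time.
    """
    child_total = {}
    for span in spans:
        parent = span["parent_id"]
        if parent is not None:
            child_total[parent] = child_total.get(parent, 0.0) + span["duration_ms"]

    ranked = [
        (span["duration_ms"] - child_total.get(span["span_id"], 0.0), span["operation"])
        for span in spans
    ]
    ranked.sort(reverse=True)  # largest self time first
    return [(operation, self_ms) for self_ms, operation in ranked[:top_n]]


# Hypothetical trace with durations in milliseconds.
trace = [
    {"span_id": "a", "parent_id": None, "operation": "submitRequest", "duration_ms": 120.0},
    {"span_id": "b", "parent_id": "a", "operation": "getRandomAds", "duration_ms": 90.0},
    {"span_id": "c", "parent_id": "b", "operation": "cacheLookup", "duration_ms": 10.0},
    {"span_id": "d", "parent_id": "a", "operation": "saveRecord", "duration_ms": 20.0},
]
print(slowest_operations(trace, top_n=2))
```

In this made-up trace, the root operation has the longest total duration, but most of that time is spent inside getRandomAds, which is where the investigation would start.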
Results
By using BMC Helix AIOps and OpenTelemetry, Susan achieved the following results:
- Identified the root cause of the issue by analyzing traces in BMC Helix Dashboards.
- Increased application reliability and reduced MTTR.
- Optimized application performance due to reduced downtime.