Identifying bottlenecks by using OpenTelemetry
In microservices-based architectures, tracking the flow of requests between services is challenging, yet it is crucial for identifying bottlenecks or issues quickly and minimizing system downtime. BMC Helix AIOps integrates with OpenTelemetry to gather traces from applications, which helps site reliability engineers (SREs) resolve errors quickly and reduce mean time to resolution (MTTR).
OpenTelemetry, an open-source observability framework, enables the collection and analysis of telemetry data such as traces. A trace details a request's path through a system, which helps identify sources of errors and latency. These traces also produce RED metrics (Request, Errors, and Duration), which can be used to create events in BMC Helix Operations Management based on specific criteria.
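As a rough illustration of how RED metrics can be derived from trace data, the following Python sketch summarizes a batch of spans into a request rate, an error rate, and a 95th-percentile duration. The `Span` record, the `red_metrics` function, and the sample values are hypothetical and simplified; they do not represent the OpenTelemetry data model or the way BMC Helix computes these metrics.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Span:
    """Hypothetical, simplified span record (not the OpenTelemetry data model)."""
    operation: str
    duration_ms: float
    is_error: bool


def percentile(values, pct):
    """Return the nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def red_metrics(spans, window_seconds):
    """Summarize request rate, error rate, and P95 duration for one time window."""
    if not spans:
        raise ValueError("at least one span is required")
    errors = sum(1 for span in spans if span.is_error)
    return {
        "request_rate_per_s": len(spans) / window_seconds,
        "error_rate": errors / len(spans),
        "duration_p95_ms": percentile([span.duration_ms for span in spans], 95),
    }


# Invented sample: four requests observed in a 60-second window, one failed.
spans = [
    Span("submit_incident", 2.0, False),
    Span("submit_incident", 2.5, False),
    Span("submit_incident", 3.0, False),
    Span("submit_incident", 40.0, True),
]
metrics = red_metrics(spans, window_seconds=60)
print(metrics)
```

The nearest-rank percentile is only one of several common definitions; a production system would typically use histogram-based estimates instead of sorting raw durations.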
Watch the following video (2:36) to get a quick peek at how the combined power of BMC Helix AIOps and OpenTelemetry helps you to identify application issues quickly:
BMC Helix AIOps connects with OpenTelemetry through the BMC Helix OpenTelemetry service. To enable this service, contact BMC Support.
Scenario
Susan, an SRE for APEX Global, is responsible for maintaining system health across applications. One day, end users complain that it takes them hours to submit an incident request from the ITSM Service application. As a result, they cannot perform their change management tasks. She manages to identify and fix the issue manually, but doing so takes her days.
During this exercise, she faces the following challenges in her IT operations management:
- Identifying the erroneous operations or the operations that take much longer than expected
- Tracking the complete execution path of requests as they flow from one service to another within the application
In the future, Susan wants to identify the source of an issue and restore the application to normal as quickly as possible. She needs an effective solution to identify all the operations that lead to the slowdown of the applications. She decides to use BMC Helix AIOps for its service monitoring and OpenTelemetry for its tracing capabilities.
Workflow
Susan asks Jim, the tenant administrator, and Scotty, the service designer, to enable BMC Helix applications to collect service traces from instrumented applications. Susan can then monitor the service health and analyze traces.
Task | Product | Role | Action | Reference |
---|---|---|---|---|
1 | Enable BMC Helix applications to collect service traces from instrumented applications via OpenTelemetry. | | | |
1 | OpenTelemetry | Tenant administrator | Enable the OpenTelemetry Collector to export traces data to the BMC Helix applications. | |
1 | BMC Helix Operations Management | Tenant administrator | Create an alarm policy in BMC Helix Operations Management that generates events if the value of a trace metric (for example, duration or request rate) is greater or less than a specified value. | |
1 | BMC Helix Discovery | Service designer | Make sure that the topology for the application that is instrumented by OpenTelemetry is ingested into BMC Helix Discovery. | |
1 | BMC Helix AIOps | Service designer | | |
2 | Monitor the service health in BMC Helix AIOps and analyze service traces in BMC Helix Dashboards to identify the issue. | | | |
2 | BMC Helix AIOps, BMC Helix Dashboards | SRE/Operator | | |
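The alarm policy in the table above compares a trace metric with a specified value. The following Python sketch shows one way such a check could work when the condition must also hold for a period of time, as in the walkthrough's example of a duration metric staying above 3 milliseconds for 1 minute. The function names, the sample format, and the 15-second sampling interval are assumptions made for illustration; this is not the BMC Helix Operations Management policy engine or its API.

```python
def breaches(value, threshold, comparison):
    """Return True if value violates the threshold for the given comparison."""
    if comparison == "greater":
        return value > threshold
    if comparison == "less":
        return value < threshold
    raise ValueError(f"unsupported comparison: {comparison}")


def should_raise_event(samples, threshold, comparison, sustain_seconds):
    """Decide whether the most recent breach has lasted long enough.

    samples is a time-ordered list of (timestamp_seconds, value) pairs.
    The result is True only if every sample from the start of the current
    breach up to the latest sample violates the threshold and that stretch
    covers at least sustain_seconds.
    """
    breach_start = None
    for timestamp, value in samples:
        if breaches(value, threshold, comparison):
            if breach_start is None:
                breach_start = timestamp
        else:
            breach_start = None  # condition cleared, so start over
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= sustain_seconds


# Invented P95 duration samples in milliseconds, one every 15 seconds.
sustained = [(0, 2.0), (15, 3.5), (30, 3.8), (45, 4.1), (60, 3.9), (75, 4.4)]
interrupted = [(0, 3.5), (30, 2.0), (60, 3.5)]

print(should_raise_event(sustained, 3.0, "greater", 60))
print(should_raise_event(interrupted, 3.0, "greater", 60))
```

Resetting the breach start whenever a sample falls back within the limit keeps a brief spike from generating an event, which is the usual reason for pairing a threshold with a duration.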
To monitor the service health in BMC Helix AIOps and analyze service traces in BMC Helix Dashboards to identify the issue
After Jim and Scotty enable BMC Helix applications to collect service traces from the ITSM Service application, Susan can start monitoring the service health and analyzing traces. She can monitor the health of the application through the business service model created in BMC Helix AIOps. The health score of a business service indicates the application health.
If an issue is reported in the future, Susan can quickly identify the impacted application service from the business service. She can use the out-of-the-box dashboards available in BMC Helix Dashboards to track and analyze traces. The analysis helps her understand the request flows and their interactions across multiple application services. She can use this information to troubleshoot issues in the system.
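To make the idea of request flows across application services concrete, the following Python sketch derives service-to-service calls from the parent and child relationships between spans in a single trace. The span dictionaries, the service names, and the `service_calls` function are hypothetical; the dashboards in BMC Helix Dashboards present this information without requiring any code.

```python
from collections import Counter

# Hypothetical spans from one trace; service and operation names are invented.
SPANS = [
    {"span_id": "s1", "parent_id": None, "service": "web-ui",
     "operation": "POST /incident"},
    {"span_id": "s2", "parent_id": "s1", "service": "incident-api",
     "operation": "create_incident"},
    {"span_id": "s3", "parent_id": "s2", "service": "incident-api",
     "operation": "validate"},
    {"span_id": "s4", "parent_id": "s2", "service": "database",
     "operation": "INSERT incident"},
]


def service_calls(spans):
    """Count calls that cross a service boundary, keyed by (caller, callee)."""
    by_id = {span["span_id"]: span for span in spans}
    calls = Counter()
    for span in spans:
        parent = by_id.get(span["parent_id"])
        # Spans whose parent is in the same service are internal operations.
        if parent is not None and parent["service"] != span["service"]:
            calls[(parent["service"], span["service"])] += 1
    return calls


calls = service_calls(SPANS)
for (caller, callee), count in sorted(calls.items()):
    print(f"{caller} -> {callee}: {count}")
```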
She performs the following steps to identify the issue:
- Log on to BMC Helix AIOps.
- On the Services page, click the business service (ITSM Service) representing the ITSM Service application topology.
  The following example shows that the service health score of the ITSM Service business service is 70, which indicates that the impact of the issue is minor.
- Click an impacted node responsible for the poor health of the business service.
  The Node Details pane shows the list of impacting events on the Events tab. For information about monitoring the health of a service, see Identifying the impacted CI nodes from CI topology view.
- On the Events tab, click the event to show the Event Details page.
  The event description indicates that the value of the Duration_P95 parameter has been more than 3 milliseconds for 1 minute.
- From the Node Details page, click View OTel Dashboard.
  The OTel Trace Details dashboard is displayed. For information about the dashboard, see OTel Trace Details dashboard.
- In the Details for TraceID section, expand Service & Operation.
  Because the event was generated for a duration metric, Duration_P95, look for the operations and sub-operations that are taking a long time.
After Susan finds the operation, she fixes the issue in the application service. As a result, the health of the ITSM Service business service and of the application service is restored to normal.
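The search for long-running operations and sub-operations in the last step can be pictured with a small Python sketch that ranks the spans of one trace by self time, that is, the span's duration minus the time spent in its direct children. The span records, names, and durations are invented, and the calculation assumes that child operations run one after another without overlapping; it is an illustration of the idea, not the logic behind the OTel Trace Details dashboard.

```python
from collections import defaultdict

# Hypothetical spans from one slow trace; all names and durations are invented.
SPANS = [
    {"span_id": "s1", "parent_id": None, "service": "web-ui",
     "operation": "POST /incident", "duration_ms": 120.0},
    {"span_id": "s2", "parent_id": "s1", "service": "incident-api",
     "operation": "create_incident", "duration_ms": 110.0},
    {"span_id": "s3", "parent_id": "s2", "service": "database",
     "operation": "INSERT incident", "duration_ms": 95.0},
]


def rank_by_self_time(spans):
    """Return (service, operation, self_time_ms) rows, slowest first.

    Assumes the direct children of a span do not overlap in time.
    """
    child_total = defaultdict(float)
    for span in spans:
        if span["parent_id"] is not None:
            child_total[span["parent_id"]] += span["duration_ms"]
    ranked = [
        (span["service"], span["operation"],
         span["duration_ms"] - child_total[span["span_id"]])
        for span in spans
    ]
    return sorted(ranked, key=lambda row: row[2], reverse=True)


ranking = rank_by_self_time(SPANS)
for service, operation, self_ms in ranking:
    print(f"{service:14} {operation:18} {self_ms:6.1f} ms")
```

In this invented trace, the top-level request looks slowest by total duration, but most of the time is spent in the database sub-operation, which is why drilling into sub-operations matters.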
Results
By using BMC Helix AIOps and OpenTelemetry, Susan achieved the following results:
- Identified the root cause of the issue by analyzing traces in BMC Helix Dashboards.
- Increased application reliability and improved MTTR.
- Optimized application performance due to reduced downtime.