Identifying bottlenecks with OpenTelemetry
In a microservices-based architecture, an application invokes multiple services, and requests frequently flow from one service to another. In such environments, monitoring the flow of requests among services is a challenge for organizations. When a bottleneck or other issue occurs, finding its root cause is difficult and time-consuming, which leads to unnecessary system downtime.
BMC Helix AIOps connects with OpenTelemetry (also known as OTel) to collect traces that are generated by instrumented applications. These traces help site reliability engineers (SREs) resolve application errors and reduce MTTR. OpenTelemetry is a vendor-neutral, open-source observability framework for instrumenting applications and for generating, collecting, and exporting telemetry data such as traces. A trace describes the journey of a request through a distributed system. By viewing a trace, you can track the complete execution path of a request and identify which part of the application is causing issues such as errors and high latency. RED (rate of requests, errors, and duration) metrics are derived from traces, and you can use them to generate events in BMC Helix Operations Management by defining event generation criteria.
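To make the RED idea concrete, the following minimal Python sketch derives a request count, an error count, and a 95th-percentile duration for each service and operation from a handful of span records. The `Span` record, its field names, the service names, and the nearest-rank percentile are assumptions made for illustration; they are not the BMC Helix data model, and they do not describe how the BMC Helix OpenTelemetry service computes its metrics.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    # Hypothetical, simplified span record (not the OTLP schema).
    service: str
    operation: str
    duration_ms: float
    is_error: bool


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def red_metrics(spans):
    """Group spans by (service, operation) and summarize them as RED metrics."""
    groups = defaultdict(list)
    for span in spans:
        groups[(span.service, span.operation)].append(span)
    summary = {}
    for key, group in groups.items():
        summary[key] = {
            "requests": len(group),
            "errors": sum(1 for s in group if s.is_error),
            "duration_p95_ms": percentile([s.duration_ms for s in group], 95),
        }
    return summary


# Invented sample data for illustration.
spans = [
    Span("itsm-api", "POST /incident", 2.0, False),
    Span("itsm-api", "POST /incident", 9.0, True),
    Span("itsm-api", "POST /incident", 4.0, False),
    Span("itsm-db", "INSERT incident", 1.0, False),
]
metrics = red_metrics(spans)
for (service, operation), values in metrics.items():
    print(service, operation, values)
```

Dividing the request count by the length of the observation window would turn it into a request rate; the sketch leaves that out because the spans here carry no timestamps.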
BMC Helix AIOps connects with OpenTelemetry through the BMC Helix OpenTelemetry service. To enable this service, contact BMC Support.
Scenario
Susan, an SRE for APEX Global, is responsible for maintaining system health across applications. One day, she receives a complaint from end users that submitting an incident request from the ITSM Service application takes hours. As a result, they cannot perform their change management tasks. She manages to identify and fix the issue manually, but it takes her days.
During this exercise, she faces the following challenges in her IT operations management:
- Identifying the erroneous operations or the operations that take much longer than expected
- Tracking the complete execution path of requests that flow among the services that constitute the application
In the future, Susan wants to identify the source of an issue and restore the application to normal as quickly as possible. She needs an effective solution to identify all the operations that lead to the slowdown of the applications. She decides to use BMC Helix AIOps for its service monitoring and OpenTelemetry for its tracing capabilities.
Workflow
Susan asks Jim, the tenant administrator, and Scotty, the service designer, to enable BMC Helix applications to collect service traces from instrumented applications. Susan can then monitor the service health and analyze traces.
Task | Product | Role | Action | Reference |
---|---|---|---|---|
1. Enable BMC Helix applications to collect service traces from instrumented applications via OpenTelemetry. | | | | |
 | OpenTelemetry | Tenant administrator | Enable the OpenTelemetry Collector to export trace data to the BMC Helix applications. | |
 | BMC Helix Operations Management | Tenant administrator | Create an alarm policy in BMC Helix Operations Management that generates events if the value of a trace metric (for example, duration or request rate) is greater than or less than a specified value. | |
 | BMC Helix Discovery | Service designer | Make sure that the topology for the application that is instrumented by OpenTelemetry is ingested into BMC Helix Discovery. | |
 | BMC Helix AIOps | Service designer | | |
2. Monitor the service health in BMC Helix AIOps and analyze service traces in BMC Helix Dashboards to identify the issue. | | | | |
 | BMC Helix AIOps, BMC Helix Dashboards | SRE/Operator | | |
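The alarm policy in task 1 amounts to comparing a trace metric with a threshold over a time window. The following minimal Python sketch shows that kind of check. The `AlarmRule` class, its fields, the 10-second sampling interval, and the rule that every sample in the window must breach the threshold are hypothetical illustrations; they are not the BMC Helix Operations Management policy model.

```python
import operator
from dataclasses import dataclass

# Comparison operators that a hypothetical rule may use.
COMPARATORS = {">": operator.gt, "<": operator.lt}


@dataclass(frozen=True)
class AlarmRule:
    # Hypothetical rule shape, for illustration only.
    metric: str
    comparator: str      # ">" or "<"
    threshold: float
    window_samples: int  # consecutive breaching samples required


def should_raise_event(rule, samples):
    """Return True if the last `window_samples` values all breach the threshold."""
    if len(samples) < rule.window_samples:
        return False
    compare = COMPARATORS[rule.comparator]
    recent = samples[-rule.window_samples:]
    return all(compare(value, rule.threshold) for value in recent)


# Duration P95 above 3 ms for six consecutive samples,
# which covers 1 minute if a sample arrives every 10 seconds (assumed).
rule = AlarmRule("Duration_P95", ">", 3.0, window_samples=6)
breaching = [2.1, 3.4, 3.9, 4.2, 5.0, 4.8, 3.6]
recovering = [3.4, 3.9, 4.2, 5.0, 4.8, 2.9]
print(should_raise_event(rule, breaching))
print(should_raise_event(rule, recovering))
```

Requiring the breach to persist for the whole window, rather than reacting to a single sample, keeps short spikes from generating events.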
To monitor the service health in BMC Helix AIOps and analyze service traces in BMC Helix Dashboards to identify the issue
After Jim and Scotty enable BMC Helix applications to collect service traces from the ITSM Service application, Susan can start monitoring the service health and analyzing traces. She can monitor the health of the application through the business service model created in BMC Helix AIOps. The health score of a business service indicates the application health.
If an issue is reported in the future, Susan can quickly identify the impacted application service from the business service. She can use the out-of-the-box dashboards available in BMC Helix Dashboards to track and analyze traces. The analysis helps her understand the request flows and their interactions across multiple application services. She can use this information to troubleshoot issues in the system.
She performs the following steps to identify the issue:
- Log on to BMC Helix AIOps.
- On the Services page, click the business service (ITSM Service) representing the ITSM Service application topology.
  In this example, the service health score of the ITSM Service business service is 70, which indicates that the impact of the issue is minor.
- Click an impacted node responsible for the poor health of the business service.
  The Node Details pane shows the list of impacting events on the Events tab. For information about monitoring the health of a service, see Identifying the impacted CI nodes from CI topology view.
- On the Events tab, click the event to show the Event Details page.
  The event description indicates that the value of the Duration_P95 parameter has been more than 3 milliseconds for 1 minute.
- From the Node Details page, click View OTel Dashboard.
  The OTel Trace Details dashboard is displayed. For information about the dashboard, see OTel Trace Details dashboard.
- In the Details for TraceID section, expand Service & Operation.
  Because the event was generated for a duration metric, Duration_P95, look for the operations and sub-operations that are taking a long time.
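Looking for operations and sub-operations that take a long time can be sketched as ranking the spans of one trace by self time, that is, a span's duration minus the time spent in its direct children. The span tuples, service names, and durations below are invented for illustration, and the sketch assumes that child spans run one after another. It does not describe how the OTel Trace Details dashboard builds its view.

```python
from collections import defaultdict

# Hypothetical spans of a single trace:
# (span_id, parent_id, "service: operation", duration_ms)
TRACE = [
    ("a", None, "itsm-ui: submit incident", 120.0),
    ("b", "a", "itsm-api: POST /incident", 110.0),
    ("c", "b", "itsm-auth: validate token", 5.0),
    ("d", "b", "itsm-db: INSERT incident", 95.0),
]


def rank_by_self_time(spans):
    """Return (name, self_time_ms) pairs, slowest first.

    Self time = span duration minus the durations of its direct children,
    which assumes that children do not overlap in time.
    """
    child_total = defaultdict(float)
    for _span_id, parent_id, _name, duration in spans:
        if parent_id is not None:
            child_total[parent_id] += duration
    ranked = [
        (name, duration - child_total[span_id])
        for span_id, _parent_id, name, duration in spans
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


ranking = rank_by_self_time(TRACE)
for name, self_time in ranking:
    print(f"{self_time:7.1f} ms  {name}")
```

Ranking by self time instead of total duration points at the operation that actually consumes the time (here the invented database insert) instead of at every parent span that merely waits for it.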
After Susan finds the operation, she fixes the issue in the application service. As a result, the health of the ITSM Service business service and of the application service is restored to normal.
Results
By using BMC Helix AIOps and OpenTelemetry, Susan achieved the following results:
- Identified the root cause of the issue by analyzing traces in BMC Helix Dashboards.
- Increased application reliability and reduced MTTR.
- Optimized application performance due to reduced downtime.