Identifying bottlenecks with OpenTelemetry

In a microservices-based architecture, an application invokes multiple services, and requests frequently flow from one service to another. In such environments, one of the challenges that organizations face is monitoring the flow of requests among services. When a bottleneck or other issue occurs, finding its root cause is difficult and time-consuming, which leads to unnecessary system downtime.

Important

Telemetry data exported by OpenTelemetry is sent to BMC Helix applications through the BMC Helix OpenTelemetry service. To enable this service, contact BMC Helix Support.

BMC Helix AIOps connects with OpenTelemetry (also known as OTel) to collect traces that are generated by instrumented applications. These traces help site reliability engineers (SREs) resolve application errors and reduce the mean time to resolve (MTTR). OpenTelemetry is a vendor-neutral, open-source observability framework for instrumenting applications and for generating, collecting, and exporting telemetry data such as traces. A trace describes the journey of a request through a distributed system. By viewing a trace, you can track the complete execution path of a request and identify which part of the application is causing issues such as errors or high latency. Traces are used to generate RED (request rate, errors, and duration) metrics, which you can use to generate events in BMC Helix Operations Management by defining event generation criteria.
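
To make the relationship between traces and RED metrics concrete, the following Python sketch derives a request rate, an error rate, and a 95th-percentile duration from a handful of spans. It is an illustration only: the Span record, the field names, the service and operation names, and the nearest-rank percentile method are assumptions made for this example and do not describe how BMC Helix computes its metrics.

```python
import math
from dataclasses import dataclass


@dataclass
class Span:
    """A simplified, hypothetical span record (not an OpenTelemetry SDK type)."""
    service: str
    operation: str
    duration_ms: float
    is_error: bool


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def red_metrics(spans, window_seconds):
    """Summarize the spans observed in one time window as RED metrics."""
    durations = [s.duration_ms for s in spans]
    errors = sum(1 for s in spans if s.is_error)
    return {
        "request_rate_per_s": len(spans) / window_seconds,
        "error_rate": errors / len(spans) if spans else 0.0,
        "duration_p95_ms": percentile(durations, 95) if durations else 0.0,
    }


# Hypothetical spans collected during a 60-second window.
spans = [
    Span("itsm-service", "submitIncident", 2.0, False),
    Span("itsm-service", "submitIncident", 2.5, False),
    Span("itsm-service", "submitIncident", 9.0, True),
    Span("itsm-service", "submitIncident", 1.5, False),
]
metrics = red_metrics(spans, window_seconds=60)
print(metrics)
```

A single slow request dominates the 95th-percentile duration in such a small sample, which is why percentile metrics are useful for spotting latency outliers that an average would hide.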

Warning

BMC Helix does not automatically detect or mask sensitive data in OpenTelemetry payloads. All telemetry data sent via OpenTelemetry is ingested and displayed as received. Before exporting any data, configure your OpenTelemetry Collector to filter, remove, or mask sensitive data in accordance with your organizational policies. For recommended approaches and examples of handling sensitive data, see the OpenTelemetry documentation.
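
The Collector is configured through its own configuration file, and the processors that you use depend on your deployment. The following Python sketch only illustrates the kind of rule that you would apply before export: some span attributes are removed entirely and others are replaced with a hash. The attribute keys and the policy sets are hypothetical and are not taken from BMC Helix or from a specific Collector configuration.

```python
import hashlib

# Hypothetical policy: attribute keys to drop entirely and keys to hash.
DROP_KEYS = {"http.request.header.authorization", "db.statement"}
HASH_KEYS = {"user.email"}


def sanitize_attributes(attributes):
    """Return a copy of the span attributes with sensitive values removed or hashed."""
    clean = {}
    for key, value in attributes.items():
        if key in DROP_KEYS:
            continue  # remove the attribute completely
        if key in HASH_KEYS:
            # Replace the value with a SHA-256 digest so that identical values
            # can still be correlated without exposing the original value.
            clean[key] = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        else:
            clean[key] = value
    return clean


raw = {
    "http.method": "POST",
    "user.email": "susan@example.com",
    "http.request.header.authorization": "Bearer secret-token",
}
safe = sanitize_attributes(raw)
print(safe)
```

Dropping is the safer choice for secrets such as tokens. Hashing suits identifiers that you still need to group by, although an unsalted hash of a guessable value can be reversed by brute force, so check it against your organizational policies.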

BMC Helix AIOps connects with OpenTelemetry through the BMC Helix OpenTelemetry service. To enable this service, contact BMC Support.

Scenario

Susan, an SRE for APEX Global, is responsible for maintaining system health across applications. One day, end users complain that it takes them hours to submit an incident request from the ITSM Service application. As a result, they cannot perform their change management tasks. Susan eventually identifies and fixes the issue manually, but it takes her days.

During this exercise, she faces the following IT operations management challenges:

Identifying the operations that fail or that take much longer than expected
Tracking the complete execution path of requests as they flow between the services that constitute the application

In the future, Susan wants to identify the source of an issue and restore the application to normal as quickly as possible. She needs an effective solution to identify all the operations that lead to the slowdown of the applications. She decides to use BMC Helix AIOps for its service monitoring and OpenTelemetry for its tracing capabilities.

Workflow

Susan asks Jim, the tenant administrator, and Scotty, the service designer, to enable BMC Helix applications to collect service traces from instrumented applications. Susan can then monitor the service health and analyze traces.

Task	Product	Role	Action	Reference
1	Enable BMC Helix applications to collect service traces from instrumented applications via OpenTelemetry.
	OpenTelemetry	Tenant administrator	Enable the OpenTelemetry Collector to export traces data to the BMC Helix applications.	Enabling-BMC-Helix-applications-to-collect-service-traces-from-OpenTelemetry
	BMC Helix Operations Management	Tenant administrator	Create an alarm policy in BMC Helix Operations Management that generates events if the value of a trace metric (for example, duration or request rate) is greater than or less than a specified value.
	BMC Helix Discovery	Service designer	Make sure that the topology for the application that is instrumented by OpenTelemetry is ingested into BMC Helix Discovery.
	BMC Helix AIOps	Service designer	(Optional) Create a blueprint that contains the OpenTelemetry namespace name as the variable. Create a service model in BMC Helix AIOps that represents the application topology instrumented by OpenTelemetry.
2	Monitor the service health in BMC Helix AIOps and analyze service traces in BMC Helix Dashboards to identify the issue.
	BMC Helix AIOps, BMC Helix Dashboards	SRE/Operator	Monitor the service health in BMC Helix AIOps. Analyze service traces in BMC Helix Dashboards to identify the issue.	Analyze service traces in BMC Helix Dashboards and identify the issue
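
The alarm policy in task 1 generates an event when a trace metric crosses a specified value. The following Python sketch shows one way to reason about such a condition, using the example of a duration metric that stays above 3 milliseconds for 1 minute. The function name, the sample format, and the evaluation logic are assumptions made for illustration; they do not describe how BMC Helix Operations Management evaluates alarm policies.

```python
def breaches_threshold(samples, threshold, min_duration_s):
    """Return True if the metric stays above the threshold for at least min_duration_s.

    samples: list of (timestamp_seconds, value) tuples in ascending time order.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of a new breach
            if timestamp - breach_start >= min_duration_s:
                return True
        else:
            breach_start = None  # the metric recovered, so reset the breach
    return False


# Hypothetical 95th-percentile duration samples (milliseconds), one every 15 seconds.
samples = [(0, 2.1), (15, 3.4), (30, 3.8), (45, 4.1), (60, 3.9), (75, 5.0)]
print(breaches_threshold(samples, threshold=3.0, min_duration_s=60))
```

Requiring the breach to persist for a minimum duration avoids generating events for short spikes that resolve on their own.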

To monitor the service health in BMC Helix AIOps and analyze service traces in BMC Helix Dashboards to identify the issue

After Jim and Scotty enable BMC Helix applications to collect service traces from the ITSM Service application, Susan can start monitoring the service health and analyzing traces. She can monitor the health of the application through the business service model created in BMC Helix AIOps. The health score of a business service indicates the application health.

If an issue is reported in the future, Susan can quickly identify the impacted application service from the business service. She can use the out-of-the-box dashboards available in BMC Helix Dashboards to track and analyze traces. The analysis helps her understand the request flows and their interactions across multiple application services. She can use this information to troubleshoot issues in the system.

She performs the following steps to identify the issue:

Log on to BMC Helix AIOps.
On the Services page, click the business service (ITSM Service) representing the ITSM Service application topology.
In this example, the service health score of the ITSM Service business service is 70, which indicates that the impact of the issue is minor.
Click an impacted node responsible for the poor health of the business service.
The Node Details pane shows the list of impacting events on the Events tab. For information about monitoring the health of a service, see Identifying-the-impacted-CI-nodes-from-CI-topology-view.
On the Events tab, click the event to show the Event Details page.
The event description indicates that the value of the Duration_P95 parameter has been more than 3 milliseconds for 1 minute.
From the Node Details page, click View OTel Dashboard.
The OTel Trace Details dashboard is displayed. For information about the dashboard, see OTel Trace Details dashboard.
In the Details for TraceID section, expand Service & Operation.
Because the event was generated for a duration metric, Duration_P95, look for the operations and sub-operations that are taking a long time.
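
When you look for slow operations and sub-operations in a trace, it helps to separate the time an operation spends in its own work from the time it spends waiting on its sub-operations. The following Python sketch ranks spans by this self time. The Span record, the service and operation names, and the durations are hypothetical, and the calculation assumes that the sub-operations of a span run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """A simplified, hypothetical span record (not an OpenTelemetry SDK type)."""
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    duration_ms: float


def slowest_operations(spans, top_n=3):
    """Rank spans by self time: own duration minus the time spent in direct children."""
    child_time = {}
    for span in spans:
        if span.parent_id is not None:
            child_time[span.parent_id] = child_time.get(span.parent_id, 0.0) + span.duration_ms
    ranked = sorted(
        ((max(s.duration_ms - child_time.get(s.span_id, 0.0), 0.0), s) for s in spans),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [(s.service, s.operation, self_ms) for self_ms, s in ranked[:top_n]]


# Hypothetical trace for one incident submission.
trace = [
    Span("a", None, "itsm-ui", "submitIncident", 120.0),
    Span("b", "a", "incident-api", "createIncident", 110.0),
    Span("c", "b", "incident-db", "INSERT incident", 95.0),
    Span("d", "b", "notification", "sendEmail", 5.0),
]
result = slowest_operations(trace)
for service, operation, self_ms in result:
    print(f"{service:15} {operation:20} {self_ms:6.1f} ms")
```

In this sample, the root operation has the longest total duration, but almost all of that time is spent in the database sub-operation, which is where the investigation would start.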

After Susan finds the operation, she fixes the issue in the application service. As a result, the health of the ITSM Service business service and the application service is restored to normal.

Results

By using BMC Helix AIOps and OpenTelemetry, Susan achieved the following results:

Identified the root cause of the issue by analyzing traces in BMC Helix Dashboards.
Increased application reliability and improved MTTR.
Optimized application performance due to reduced downtime.

BMC Helix AIOps 24.3
