Identifying bottlenecks in the system by leveraging OpenTelemetry tracing


In a microservices-based architecture, an application invokes multiple services, and requests flow frequently from one service to another. In such environments, one of the challenges that organizations face is monitoring the flow of requests among services. When a bottleneck occurs, finding the root cause is difficult and time-consuming, which leads to unnecessary system downtime.

BMC Helix AIOps connects with OpenTelemetry (also known as OTel) to collect traces that are generated by instrumented applications. These traces help site reliability engineers (SREs) resolve application errors and reduce MTTR. OpenTelemetry is a vendor-neutral, open-source observability framework for instrumenting applications and for generating, collecting, and exporting telemetry data such as traces. A trace describes the journey of a request through a distributed system. By viewing a trace, you can track the complete execution path of a request and identify which part of the application is causing issues such as errors and high latency. RED (request rate, errors, and duration) metrics are derived from traces. You can use these metrics to generate events in BMC Helix Operations Management by defining event generation criteria.
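To make the idea of RED metrics concrete, the following minimal Python sketch derives a request rate, an error count, and a 95th-percentile duration from a handful of spans. It is an illustration only and does not show how the BMC Helix OpenTelemetry service computes these values. The `Span` record, the `red_metrics` function, the metric key names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Span:
    """Hypothetical, simplified record of one request handled by a service."""
    service: str
    duration_ms: float
    is_error: bool


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def red_metrics(spans, window_seconds):
    """Summarize request rate, error count, and p95 duration for one time window."""
    durations = [s.duration_ms for s in spans]
    return {
        "request_rate_per_s": len(spans) / window_seconds,
        "error_count": sum(1 for s in spans if s.is_error),
        "duration_p95_ms": percentile(durations, 95),
    }


# Hypothetical sample: ten requests seen in one minute, one of them slow and failed.
sample_durations = (120, 80, 95, 1500, 110, 70, 90, 100, 85, 105)
spans = [Span("prod-midtier", d, d > 900) for d in sample_durations]
print(red_metrics(spans, window_seconds=60))
```

The single slow request dominates the 95th-percentile duration in this sample, which is why a percentile-based duration metric is a useful signal for latency problems that an average would hide.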

BMC Helix AIOps connects with OpenTelemetry through the BMC Helix OpenTelemetry service. To enable this service, contact BMC Support.

OTel_DataFlow.jpg

Supported languages

BMC Helix AIOps supports monitoring of all the applications that are developed in the languages supported by OpenTelemetry. For the list of languages that are supported by OpenTelemetry, see the OpenTelemetry documentation.

Scenario

Susan, an SRE for APEX Global, is responsible for maintaining the system health across applications. One day, she receives a complaint from the end users that it's taking them hours to submit an incident request from the ITSM Service application. As a result, they are not able to perform the change management tasks. She manages to identify the issue and fix it manually, which takes her days.

During this exercise, she faces the following challenges in her IT operations management:

  • Tracking the complete execution path of the requests that flow between the services that constitute the application
  • Identifying the operations that fail or that take much longer than expected

In the future, Susan wants to identify the source of an issue and restore the application to normal as quickly as possible. She needs an effective solution to identify all the operations that lead to the slowdown of the applications. She decides to use BMC Helix AIOps for its service monitoring and OpenTelemetry for its tracing capabilities.

Workflow

Perform the following actions to enable BMC Helix applications to collect service traces from OpenTelemetry and identify the issues in the system: 

| Task | Product | Role | Action | Reference |
|------|---------|------|--------|-----------|
| 1. Enable BMC Helix applications to collect service traces from instrumented applications via OpenTelemetry. | OpenTelemetry | Tenant Administrator | Enable the OpenTelemetry Collector to export trace data to the BMC Helix applications. | |
| | BMC Helix Operations Management | Tenant Administrator | Create an alarm policy in BMC Helix Operations Management that generates events if the value of a trace metric (for example, duration or request rate) is greater than or less than a specified value. | |
| | BMC Helix Discovery | Service Designer | Make sure that the topology for the application that is instrumented by OpenTelemetry is ingested into BMC Helix Discovery. | |
| | BMC Helix AIOps | Service Designer | • (Optional) Create a blueprint that contains the OpenTelemetry namespace name as the variable. • Create a service model in BMC Helix AIOps that represents the application topology instrumented by OpenTelemetry. | |
| 2. Monitor the service health in BMC Helix AIOps and analyze service traces in BMC Helix Dashboards to identify the issue. | BMC Helix AIOps, BMC Helix Dashboards | SRE/Operator | • Monitor the service health in BMC Helix AIOps. • Analyze service traces in BMC Helix Dashboards to identify the issue. | |
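The alarm policy in task 1 generates an event when a trace metric crosses a specified value. The following Python sketch illustrates one common form of such a condition: a metric that stays above a threshold for a sustained period. It is not the evaluation logic of BMC Helix Operations Management. The `breach_sustained` function, its parameters, and the sample values are hypothetical.

```python
def breach_sustained(samples, threshold, hold_seconds):
    """Return the timestamp at which value > threshold has held for hold_seconds.

    samples: iterable of (timestamp_seconds, value) pairs in ascending time order.
    Returns None if the condition never holds for the full period.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach
            if ts - breach_start >= hold_seconds:
                return ts
        else:
            breach_start = None  # the breach was interrupted, so start over
    return None


# Hypothetical p95 duration samples in milliseconds, one every 15 seconds.
samples = [
    (0, 0.8), (15, 1.4), (30, 1.6), (45, 0.9),
    (60, 1.2), (75, 1.3), (90, 1.8), (105, 2.1), (120, 2.4),
]
fired_at = breach_sustained(samples, threshold=1.0, hold_seconds=60)
print(fired_at)
```

Requiring the breach to persist for a period, rather than reacting to a single sample, avoids generating events for short spikes such as the one between seconds 15 and 30 in the sample data.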

To monitor the service health in BMC Helix AIOps and analyze service traces in BMC Helix Dashboards to identify the issue

Going forward, Susan can monitor the health of an application through the business service model created in BMC Helix AIOps. In BMC Helix AIOps, the health score of the business service indicates the application health.

If Susan is notified of an application issue in the future, she can quickly identify the impacted application service from the business service model. She can use the out-of-the-box dashboards available in BMC Helix Dashboards to track and analyze traces. The analysis helps her understand the request flows and their interactions across multiple application services. She can use this information to troubleshoot bottlenecks in the system.

She needs to perform the following steps:

  1. Log on to BMC Helix AIOps. 
  2. On the Services page, click the business service representing the application topology. 
    The following example shows that the service health score of the ITSM Service business service is 90, which indicates that the impact of the issue is minor.
    OTel_ServiceTopology_24102.png
  3. Click the node for the impacted application service (for example, prod-midtier) that is responsible for the poor health of the business service.
    For information about monitoring the health of a service, see Monitoring service health.
    OTel_NodeDetails_24102.png
  4. On the Events tab, click the event to show the Event Details page. 
    The event description indicates that the value of the Duration_P95 parameter has been more than 1 millisecond for 1 minute. 
    OTel_DurationEventDetails_24102.png

  5. From the Event Details page, click View OTel Dashboard.
    The OTel Service Overview dashboard is displayed in BMC Helix Dashboards. For information about the dashboard, see OTel Service Overview dashboard.
    OTelServiceOverview_Dashboard_24102.png 

  6. Click the link in the Status column for each trace in the Traces for <serviceName> section.
    The OTel Trace Details dashboard is displayed. For information about the dashboard, see OTel Trace Details dashboard.

  7. In the Details for TraceID section, expand Service & Operations.
  8. Because the event was generated for a duration metric, Duration_P95, look for the operations and sub-operations that take the longest time.
    OTel_Trace_Details_24102.png

After Susan finds the operation, she fixes the issue in the application service. As a result, the health of the ITSM Service business service and of the application service is restored to normal.
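The search for slow operations and sub-operations in the last step can be pictured as ranking the spans of a trace by the time spent in each operation itself, excluding the time spent in its child operations. The following Python sketch shows that idea under simplifying assumptions. It is not how the OTel Trace Details dashboard is implemented. The `TraceSpan` record, the function names, and the operation names in the sample trace are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceSpan:
    """Hypothetical, simplified span: one operation within a single trace."""
    span_id: str
    parent_id: Optional[str]
    operation: str
    start_ms: float
    end_ms: float


def self_times(spans):
    """Map each span_id to its duration minus the duration of its direct children.

    Assumes child operations run one after another. Children that run in
    parallel would be subtracted more than once, so treat this as a rough guide.
    """
    child_total = defaultdict(float)
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] += s.end_ms - s.start_ms
    return {s.span_id: (s.end_ms - s.start_ms) - child_total[s.span_id] for s in spans}


def slowest_operations(spans, top_n=3):
    """Return (operation, self_time_ms) pairs, slowest first."""
    by_id = {s.span_id: s for s in spans}
    ranked = sorted(self_times(spans).items(), key=lambda kv: kv[1], reverse=True)
    return [(by_id[span_id].operation, t) for span_id, t in ranked[:top_n]]


# Hypothetical trace of one incident submission request.
trace = [
    TraceSpan("a", None, "POST /incident", 0, 1000),
    TraceSpan("b", "a", "validate-request", 10, 60),
    TraceSpan("c", "a", "save-incident", 70, 980),
    TraceSpan("d", "c", "db.insert", 100, 950),
]
for operation, ms in slowest_operations(trace, top_n=2):
    print(operation, ms)
```

In this sample, the root operation takes the longest overall, but almost all of that time is spent in the nested database operation, which is the one worth investigating first.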

Results

By using BMC Helix AIOps and OpenTelemetry, Susan achieved the following results:

  • Identified the root cause of the issue by analyzing traces in BMC Helix Dashboards.
  • Increased application reliability and improved MTTR.
  • Optimized application performance due to reduced downtime.
