Monitoring services

As an operator or a site reliability engineer (SRE), you must be able to observe the business services in your organization and monitor their overall health. When a service is impacted, you need to view the events that the impact generates, analyze its causes, and quickly remediate those events to restore the health of the service.

BMC Helix AIOps provides a set of comprehensive, service-centric monitoring capabilities. A service is a logical group of applications, middleware, security, storage, networks, and other child services that work together to achieve a business goal.

Additionally, BMC Helix AIOps provides advanced monitoring and analytical capabilities.

BMC Helix AIOps offers a single-pane view of all the business services used by your organization. With all the relevant information in one place, users can respond faster. Operators or SREs can view the following information:

  • Number of impacted services by severity: Critical, Major, Minor, or OK
  • Number of situations (correlated events), events, incidents, and CIs for each impacted service
  • Associations between the child services of each service
  • Details of each service, including its health and the events, situations, incidents, or changes that impact it
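
The summary described above can be pictured as a simple tally over service records. The following Python sketch is illustrative only: the record fields, the service names other than TrainsApp, and the severity assigned to each service are assumptions made for this example, and the code does not call any BMC Helix AIOps API.

```python
from collections import Counter

# Severity levels listed on this page, from most to least severe.
SEVERITY_ORDER = ["Critical", "Major", "Minor", "OK"]

# Hypothetical service records. The field names, the SeatInventory and
# PaymentsApp services, and all severity values are assumptions.
services = [
    {"name": "TrainsApp", "severity": "Critical",
     "events": 73, "situations": 1, "incidents": 1, "impacted_cis": 1},
    {"name": "SeatInventory", "severity": "Major",
     "events": 4, "situations": 0, "incidents": 0, "impacted_cis": 2},
    {"name": "PaymentsApp", "severity": "OK",
     "events": 0, "situations": 0, "incidents": 0, "impacted_cis": 0},
]


def count_by_severity(records):
    """Return the number of services at each severity level."""
    counts = Counter(record["severity"] for record in records)
    return {severity: counts.get(severity, 0) for severity in SEVERITY_ORDER}


def impacted(records):
    """Return the services whose severity is anything other than OK."""
    return [record for record in records if record["severity"] != "OK"]


summary = count_by_severity(services)
for severity in SEVERITY_ORDER:
    print(f"{severity}: {summary[severity]}")

for record in impacted(services):
    print(f"{record['name']}: {record['events']} events, "
          f"{record['situations']} situations, "
          f"{record['incidents']} incidents, "
          f"{record['impacted_cis']} impacted CIs")
```

In this sketch, a service counts as impacted when its severity is anything other than OK, which mirrors the severity grouping in the first bullet.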

Scenario

The APEX Global IT Train Ticketing System is built on a microservices-based architecture. The train ticketing system provides a portal for booking and managing train reservations.

Susan is a site reliability engineer at APEX Global IT. She uses BMC Helix AIOps to monitor the overall health of all the services used by the train ticketing system. The Services page on the console shows the health of each service; healthy services are marked in green. Today, she observes that the TrainsApp service, which travelers typically use to book tickets, is red. Investigating further, she observes that the service health score is 0, which means that the service is 100% impacted, and that there are 73 events, 1 situation, and 1 incident against the service. She can also see that there is only 1 impacted CI (configuration item), which tells her that the problem is probably affecting only 1 node in the service. She clicks the Impacting Events link to view the events, situations, incidents, and changes for the impacted service.

Susan wants to find the root cause of the events, so she views the event details under the Analyze Root Cause section. If automation is available, such as restarting a service or clearing disk space, Susan can try to remediate the issue by invoking it from the list of applicable items. Susan can also delegate the event to another user for further action or change the priority of the event. Finally, she can review the service behavior for the last 15 days and look for a pattern or trend that indicates service degradation. Susan can then take corrective measures to improve the service health.

All these capabilities enable Susan to ensure that the services in her organization:

  • Remain available and healthy at all times
  • Perform at an optimal level
  • Have low downtime and minimal impact on the business

Learn about the advanced monitoring and analytical capabilities by using the topics listed in the following table:  

Action: Learn about the service health, health score, and metrics.
Reference: Service health score and health timeline

Action: View the services in a heatmap view or tile view and monitor the overall health of each service on the Services page.
Reference: Getting started with service monitoring

Action (on the service details page):

  • Monitor the health of a service.
  • View CI topology for the impacted services on the CI Topology tab.
  • Identify and investigate impacting events, incidents, and changes for nodes on the Service Hierarchy tab.
  • Monitor the health indicators configured for a service in the View Service Health Indicators section.
  • Perform causal analysis for impacted services in the Analyze Root Cause section.
  • Analyze situations for impacted services on the Analyze Situations tab.
  • Get more insight into the service behavior and identify patterns for impacted services in the Analyze Service Insights section.


