Monitoring and investigating services and situations


As an operator or a site reliability engineer (SRE), you must be able to observe the business services in your organization and monitor their overall health. When a service is impacted, you need to view the impacting events, analyze their causes, and quickly remediate them to restore the health of the service.

As an administrator, you can monitor situations and services by using BMC Helix Operations Management with the BMC Helix AIOps console.

A service is a logical group of applications, middleware, security, storage, networks, and other child services. The applications and child services work together to achieve a comprehensive, end-to-end business goal. For example, the HR service and the payroll service can be child services of a comprehensive business service.
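The parent-child relationship described above can be pictured as a small tree. The following Python sketch is illustrative only and is not how BMC Helix models or computes service health: the `Service` class, the `SEVERITY_ORDER` list, and the "worst status wins" rollup rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical severity scale, ordered from healthiest to most severe.
SEVERITY_ORDER = ["ok", "minor", "major", "critical"]


@dataclass
class Service:
    """A service with its own status and zero or more child services."""
    name: str
    own_status: str = "ok"
    children: list = field(default_factory=list)

    def rolled_up_status(self) -> str:
        # Assumed rule: a service is as unhealthy as its worst child.
        statuses = [self.own_status]
        statuses += [child.rolled_up_status() for child in self.children]
        return max(statuses, key=SEVERITY_ORDER.index)


hr = Service("HR")
payroll = Service("Payroll", own_status="major")
business = Service("Business service", children=[hr, payroll])
print(business.rolled_up_status())
```

In this sketch, an impact on the payroll child service surfaces on the parent business service, which is the behavior an operator relies on when monitoring many dependent services from one place.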

A situation comprises events that are aggregated based on their occurrence, message, topology, or a combination of all these factors. These events are associated with a single host.
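To make the idea of aggregation concrete, the following Python sketch groups events into situations by host and by closeness in time. It is a simplified illustration, not the correlation logic of BMC Helix Operations Management: the `Event` and `Situation` classes, the `group_into_situations` function, and the 300-second window are hypothetical, and message and topology criteria are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    host: str
    message: str
    timestamp: float  # seconds, on any consistent clock


@dataclass
class Situation:
    host: str
    events: list = field(default_factory=list)


def group_into_situations(events, window_seconds=300.0):
    """Group events per host. A new situation starts when the gap to the
    previous event on the same host exceeds window_seconds."""
    situations = []
    open_by_host = {}  # most recent situation for each host
    for event in sorted(events, key=lambda e: e.timestamp):
        current = open_by_host.get(event.host)
        if (current is None
                or event.timestamp - current.events[-1].timestamp > window_seconds):
            current = Situation(host=event.host)
            situations.append(current)
            open_by_host[event.host] = current
        current.events.append(event)
    return situations


sample_events = [
    Event("db01", "disk full", 0),
    Event("db01", "io latency high", 120),
    Event("web01", "http 500 rate high", 130),
    Event("db01", "disk full", 1000),
]
for situation in group_into_situations(sample_events):
    print(situation.host, [e.message for e in situation.events])
```

The first two `db01` events fall within the window and form one situation, while the later `db01` event starts a new one. Each situation holds events from a single host.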

Example

Susan, an operator for the APEX Global IT Train Ticketing App Service, is responsible for monitoring that service. The train ticketing service provides a portal for booking and managing train reservations.

Susan plans to monitor a large number of dependent services of the train ticketing service by using a single console. Susan faces the following challenges in her IT operations environment:

  • Monitoring the health of a large number of services from different sources is a time-consuming, tedious, and complex task.
  • Viewing a large number of events from multiple sources results in event noise.
  • Meeting SLAs in a complex environment requires quick analysis of issues.
  • Correlating data from a disparate set of solutions is difficult.

Therefore, Susan needs a solution that monitors services from all sources in a single console, so that she can quickly identify the impacted services, determine the probable causes of the impact, and diagnose and resolve the issues.

Susan can address most of these challenges by using BMC Helix Operations Management with BMC Helix AIOps. She can isolate the root cause of an impacted service by:

  • Using a ranked list of the most likely events that caused the impact.
  • Visualizing the relationship between a discovered business service and the nodes of the service.
  • Analyzing the events and change requests that are causing the impact.
  • Viewing and understanding the anomalies in the metrics data associated with the causal events.

The following table identifies the tasks that help you monitor and investigate services and situations:

Action

Reference

Configure situation monitoring

Create event correlation policies to aggregate events, which are grouped into situations.

You can use the BMC Helix AIOps console to monitor and investigate situations after you configure correlation policies in BMC Helix Operations Management to aggregate events.

Access the BMC Helix AIOps console

Log on to BMC Helix Portal, and click BMC Helix AIOps to launch the BMC Helix AIOps console to monitor services and situations as explained in the following rows:

Monitor health summary

As an operator, view key performance indicators and entities from all the integrated products to get a quick-peek summary of the overall system health status through the following widgets:

Monitor services

As an operator, monitor services to assess the service health and perform probable cause analysis using the following options:

Monitor situations

As an operator, monitor and investigate policy-based situations. Monitoring situations provides the ability to:

  • Dynamically aggregate events based on event correlation policy to derive actionable insights.
  • Investigate the aggregated events.
  • Reduce the event noise.
  • Improve the mean time to resolve (MTTR) issues by using the situation-driven workflow.
  • Lower the mean time to detect (MTTD) issues and the time required to investigate tickets.
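As a worked illustration of the two metrics named above, the following Python sketch computes MTTD and MTTR from a few made-up incident records. The record fields, the sample values, and the choice to measure MTTR from detection to resolution are assumptions for this example. Definitions vary between organizations, and this is not how BMC Helix reports these metrics.

```python
from statistics import mean

# Hypothetical incident records. All times are minutes on a shared clock.
incidents = [
    {"occurred": 0, "detected": 5, "resolved": 35},
    {"occurred": 100, "detected": 115, "resolved": 165},
]


def mean_time_to_detect(records):
    """Average delay between an issue occurring and being detected."""
    return mean(r["detected"] - r["occurred"] for r in records)


def mean_time_to_resolve(records):
    """Average delay between detection and resolution (assumed definition)."""
    return mean(r["resolved"] - r["detected"] for r in records)


print("MTTD (minutes):", mean_time_to_detect(incidents))
print("MTTR (minutes):", mean_time_to_resolve(incidents))
```

Tracking both values before and after you introduce situation-based monitoring is one way to check whether event aggregation is shortening detection and resolution times.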


 
