Monitoring and investigating services and situations

As an operator or a site reliability engineer (SRE), it's critical that you are able to observe the business services in your organization to monitor their overall health. When a service is impacted, you need to view the impacting events, analyze the causes, and quickly remediate the causes of those events to restore the health of the service.

As an administrator, you can monitor situations and services by using BMC Helix Operations Management with the BMC Helix AIOps console.

A service is a logical group of applications, middleware, security, storage, networks, and other child services. The applications and child services work together to achieve a comprehensive, end-to-end business goal. For example, the HR service and the payroll service can be child services of a comprehensive business service.

A situation comprises events that are aggregated based on their occurrence, message, topology, or a combination of these factors. These events are associated with a single host.
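To make the idea of aggregating events into situations concrete, the following Python sketch groups events that occur on the same host within a short time of each other. It is only an illustration under stated assumptions and does not represent the correlation logic in BMC Helix Operations Management: the Event class, the group_into_situations function, and the five-minute window are hypothetical, and the sketch considers only occurrence time and host, not message or topology.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    # Hypothetical event shape, assumed for illustration only.
    host: str
    message: str
    timestamp: datetime


def group_into_situations(events, window=timedelta(minutes=5)):
    """Group events per host; start a new group when the gap
    between consecutive events on that host exceeds `window`."""
    by_host = defaultdict(list)
    for event in sorted(events, key=lambda e: e.timestamp):
        by_host[event.host].append(event)

    situations = []
    for host_events in by_host.values():
        current = [host_events[0]]
        for event in host_events[1:]:
            if event.timestamp - current[-1].timestamp <= window:
                current.append(event)
            else:
                # Gap too large: close this group and open a new one.
                situations.append(current)
                current = [event]
        situations.append(current)
    return situations
```

Each returned group contains events from one host only, which mirrors the single-host association described above. A real correlation policy would typically also compare messages and topology relationships.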

Related topic

BMC Helix AIOps documentation

Example

Susan, an operator at APEX Global IT, is responsible for monitoring the company's Train Ticketing App Service, which provides a portal for booking and managing train reservations.

Susan plans to monitor a large number of dependent services of the train ticketing service by using a single console. Susan faces the following challenges in her IT operations environment:

Monitoring the health of a large number of services from different sources is a time-consuming, tedious, and complex task.
Viewing a large number of events from multiple sources results in event noise.
Meeting SLAs in a complex environment requires quick analysis of issues.
Correlating data from a disparate set of solutions is difficult.

Therefore, Susan needs a single console from which she can monitor services from all sources, quickly identify the impacted services, determine the probable causes of the impact, and diagnose and resolve issues in a short time.

Susan can address most of these challenges by using BMC Helix Operations Management with BMC Helix AIOps. She can isolate the root cause of the impact on a service by:

Using a ranked list of the most likely events that caused the impact
Visualizing the relationship between a discovered business service and the nodes of the service
Analyzing the events and change requests that are causing the impact
Viewing and understanding the anomalies or abnormalities from the metrics data associated with the causal events

The following table identifies the tasks that help you monitor and investigate services and situations:

Action	Description	Reference
Configure situation monitoring	Create event correlation policies to aggregate events, which are then grouped into situations. After you configure correlation policies in BMC Helix Operations Management, you can use the BMC Helix AIOps console to monitor and investigate situations.	Event correlation for aggregating related events
Access the BMC Helix AIOps console	Log on to BMC Helix Portal, and click BMC Helix AIOps to launch the BMC Helix AIOps console. Use the console to monitor services and situations as explained in the following rows.
Monitor health summary	As an operator, view key performance indicators and entities from all the integrated products to get a quick summary of the overall system health through the following widgets: total events, anomalies, and incidents from the integrated monitoring systems; average mean time to resolve (MTTR) for incidents; overall event noise reduction score; impact severity and availability of top services; event count and status of top situations.	Monitoring key performance indicators and entities
Monitor services	As an operator, monitor services to assess service health and perform probable cause analysis by using the following options: comprehensive health timeline for predefined time ranges; probable cause analysis with the impact of causal entities; impactful events and change requests; topology maps showing the impacted nodes; service hierarchy view showing the upstream and downstream impact; health indicators showing the impacted metrics.	Monitoring service health
Monitor situations	As an operator, monitor and investigate policy-based situations. Monitoring situations enables you to: dynamically aggregate events based on an event correlation policy to derive actionable insights; investigate the aggregated events; reduce event noise; improve the mean time to resolve (MTTR) issues through the situation-driven workflow; lower the mean time to detect or discover (MTTD) and the time required to investigate tickets.	Monitoring and investigating policy-based situations
