Understanding service health score

Knowing the current and historical health of a service, based on the indicators, abnormalities, and events of the components in the service model, helps operators and site reliability engineers (SREs) identify the root cause of service performance degradation. It also lowers the mean time to resolve (MTTR), which helps avoid service disruptions.

As a service designer, you need to understand how service health is computed in BMC Helix AIOps.

Health score

The health score provides insight into the health of a service over the last 24 hours. The impact severity is assigned based on the health score. By default, a health score of 0 indicates the highest impact and a health score of 100 indicates the best (OK) health. Service designers can customize the health score for a service and determine the range for each severity level.

The health score of a service is displayed on the service details page in the BMC Helix AIOps console. Depending on the health score, a color-coded severity level is assigned to the service.
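The mapping from a score to a severity level can be sketched as a simple range lookup. This is a minimal illustration only: the range boundaries and the severity names other than those mentioned on this page are assumptions, not the product defaults. The actual ranges appear in the console tooltip and can be customized per service.

```python
from typing import List, Tuple

# Hypothetical severity ranges (inclusive lower bound, severity name),
# ordered from healthiest to most impacted. These values are assumptions
# for illustration; they are not the BMC Helix AIOps defaults.
SEVERITY_RANGES: List[Tuple[int, str]] = [
    (90, "OK"),
    (70, "Minor"),
    (40, "Major"),
    (0, "Critical"),
]


def severity_for(score: float) -> str:
    """Return the severity level whose range contains the health score."""
    if not 0 <= score <= 100:
        raise ValueError("health score must be between 0 and 100")
    for lower_bound, severity in SEVERITY_RANGES:
        if score >= lower_bound:
            return severity
    raise AssertionError("unreachable: ranges cover 0 to 100")


print(severity_for(48))
```

With these assumed ranges, a score of 48 falls in the Major band, a score of 100 is OK, and a score of 0 is Critical.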

For example, the health score of the Train Ticket Reservation Systems service is displayed as 48 in the following image. The color code of the health score indicates that it is major. The tooltip shows the health score range for each severity level.

Health score computation

The health score for a service is computed from the causal events associated with each of the service entities and from the significance derived from the service topology. The AI/ML algorithm in BMC Helix AIOps assigns weights to the nodes of a service and to their relationships. Each weight is a number that signifies the importance of a node or relationship when an impact occurs, and the weights are used to compute the health score.

By default, all events are considered for computing the health score of a service and the health score is propagated from the child services to a parent service. However, you can customize how the health score should be computed and whether the impact should be propagated by using the following configuration options:

Health indicators
Event rules
Balancing profiles
Health severity score and status
Health impact propagation

For an overview of the advanced service health score configuration options, watch the YouTube video about the advanced service health score configuration in BMC Helix AIOps.

Health indicators

You can define one or more metrics associated with a service as health indicators that represent the overall health of the service. For example, if you are using synthetic transactions to measure the availability and response time of a web application, those availability and response time metrics are good candidates to be health indicators. When you define health indicators, you associate thresholds with them. When these thresholds are breached, the service health score reflects that the service is no longer completely healthy. For more information, see Adding health indicators.

The thresholds associated with service health indicators are also used for service predictions. For more information, see Predicting and proactively resolving service outages.

If no health indicators are defined for a service, every metric associated with the service that has an alarm threshold defined can impact the health score. Any alarm generated for any CI that is part of the service affects the score. In this scenario, it is not necessary to define health indicators. However, not all metrics are of equal importance. Some metrics, such as those that represent performance and availability, are better indicators of service health than others. If you have metrics like these for a service, consider defining them as health indicators.

Event rules

By default, the health score for an impacted service is computed based on the events generated on all the CIs that are part of the service. However, as a service designer, you can define event rules to consider only specific events based on the impacted CIs, event severities, or messages. For example, you can define a rule to consider only the events with Critical severity for computing the health score.

If you have added both health indicators and event rules for a service, events that are generated due to a threshold breach of these metrics and that match the criteria defined in the event rules are considered for computing the health score. For more information, see Adding event rules.

Combining health indicators and event rules

The following table describes how the health score is computed when either health indicators or event rules or both are defined for a service:

Metrics defined as health indicators?	Event rules defined?	Events considered for health score computation	Example
Yes	No	Events generated only for the metrics that are defined as health indicators are considered.	If you have defined the CPU Utilization and Memory metrics as the health indicators, events generated only for these metrics are considered.
Yes	Yes	The following types of events are considered: (1) events that are generated on the health indicators due to associated policies; (2) events generated on metrics other than the health indicators that satisfy the defined rules.	If you have defined the CPU Utilization and Memory metrics as the health indicators and defined an event rule such that only events with the critical severity are considered, only events with the critical severity for these metrics are considered.
No	Yes	Only the events that satisfy the event rules are considered.	If you have defined an event rule such that events with only the warning severity are considered, all events with the warning severity are considered.
No	No	All events are considered.	If you have not defined health indicators or event rules, events with all severities for all the CIs that are part of a service are considered.
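The four rows of the table can be sketched as a small selection function. This is an illustrative sketch, not the product's implementation: the `Event` fields, the `Rule` type, and the function name are hypothetical. The branch for the case where both health indicators and event rules are defined follows the "Events considered" column as written; if indicator events must also satisfy the rules, that branch would use `and` instead of `or`.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class Event:
    """Hypothetical event shape used only for this sketch."""
    ci: str
    metric: str
    severity: str


# A rule is any predicate over an event (for example, a severity filter).
Rule = Callable[[Event], bool]


def events_for_health_score(
    events: Iterable[Event],
    health_indicators: Set[str],
    rules: List[Rule],
) -> List[Event]:
    """Select the events that count toward the health score."""
    selected = []
    for event in events:
        on_indicator = event.metric in health_indicators
        matches_rule = any(rule(event) for rule in rules)
        if health_indicators and rules:
            keep = on_indicator or matches_rule  # Yes / Yes row
        elif health_indicators:
            keep = on_indicator                  # Yes / No row
        elif rules:
            keep = matches_rule                  # No / Yes row
        else:
            keep = True                          # No / No row
        if keep:
            selected.append(event)
    return selected


sample = [
    Event("web-1", "CPU Utilization", "Critical"),
    Event("web-1", "Disk IO", "Critical"),
    Event("db-1", "Memory", "Warning"),
    Event("db-1", "Disk IO", "Warning"),
]
critical_only: List[Rule] = [lambda e: e.severity == "Critical"]
print(len(events_for_health_score(sample, {"CPU Utilization", "Memory"}, [])))
```

Passing an empty set of indicators and an empty list of rules returns every event, which matches the default behavior described in the last row.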

Balancing profiles

As a service designer, you can use a balancing profile to specify a threshold by selecting a certain number or percentage of CIs to ensure that the service remains healthy as long as these CIs are healthy. The health score is computed based on the events generated from the selected CIs in the balancing profile. If no balancing profiles are defined, all events for all CIs are considered while computing the health score. For more information, see Adding balancing profiles.
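The threshold idea behind a balancing profile can be sketched as a count-or-percentage check over a group of CIs. This is a simplified illustration under stated assumptions: the function names and the boolean "healthy" flag per CI are hypothetical, and the product computes the score from events rather than from a flag like this.

```python
import math
from typing import Dict, Optional


def required_healthy(
    total: int,
    count: Optional[int] = None,
    percent: Optional[float] = None,
) -> int:
    """Number of CIs that must be healthy, from a count or a percentage."""
    if (count is None) == (percent is None):
        raise ValueError("specify exactly one of count or percent")
    if count is not None:
        return min(count, total)
    # Round up so that, for example, 50% of 3 CIs requires 2 healthy CIs.
    return math.ceil(total * percent / 100)


def service_is_healthy(
    ci_healthy: Dict[str, bool],
    *,
    count: Optional[int] = None,
    percent: Optional[float] = None,
) -> bool:
    """True while at least the required number of CIs remain healthy."""
    needed = required_healthy(len(ci_healthy), count, percent)
    return sum(ci_healthy.values()) >= needed


# Hypothetical web tier: three of four nodes are healthy.
web_tier = {"web-1": True, "web-2": True, "web-3": True, "web-4": False}
print(service_is_healthy(web_tier, percent=75))
```

In this example, a 75 percent threshold tolerates the loss of one node out of four, while a 100 percent threshold does not.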

Health severity score and status customization

The health score for a service is computed based on the health configuration defined in BMC Helix AIOps. Service designers can customize the values assigned to the health severity score and status based on their organization's requirements. The maximum health score for any service is 100. For more information, see Customizing health score and health status.

Service impact propagation

By default, the impact on the child services is propagated to the parent service. As a service designer, you can stop propagation of the impact to a parent service based on your organization's needs. For more information, see Customizing health score and health status.
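The effect of stopping propagation can be sketched with a small service tree. This is an illustration only: the `Service` class, the `propagate_to_parent` flag, and the worst-score roll-up are assumptions made for this example. The product computes scores with topology-derived weights, which this sketch does not model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Service:
    """Hypothetical service node used only for this sketch."""
    name: str
    own_score: float                  # score from the service's own events
    propagate_to_parent: bool = True  # default: impact flows upward
    children: List["Service"] = field(default_factory=list)


def effective_score(service: Service) -> float:
    """Roll up scores, assuming the worst propagating score wins."""
    scores = [service.own_score]
    for child in service.children:
        if child.propagate_to_parent:
            scores.append(effective_score(child))
    return min(scores)


database = Service("database", own_score=48)
cache = Service("cache", own_score=20, propagate_to_parent=False)
booking = Service("booking", own_score=100, children=[database, cache])
print(effective_score(booking))
```

Here the parent inherits the impact of the database service (48) but ignores the cache service, whose propagation is turned off.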