Understanding service health score
Service health
The health score provides insight into the health of a service at a given point in time. The impact severity is assigned to the service based on the health score. By default, a health score of 0 indicates the highest impact and a health score of 100 indicates the best health. Service designers can customize the health score for a service and determine the range for each severity level.
In the BMC Helix AIOps console, the health score of a service is displayed on the service details page. Depending on the health score, a color-coded severity level is assigned to the service. For example, the health score of the Retail-Outlet service is displayed as 68 in the following image. The color code of the health score indicates that the impact is minor. The tooltip shows the health score range for each severity level.
Health score computation
The health score of a service depends on the health score of its nodes.
Node health score computation
The maximum health score for any node is 100. The health score of a node depends on the impacted events associated with the node. By default, each event severity is assigned a score as listed in the following table:
Event Severity | Score |
---|---|
Critical | 10 |
Major | 8 |
Minor | 6 |
Warning | 4 |
The following examples illustrate how the health score of a node is computed:
- If a node has one critical event, its health score = 100 - 10 = 90
- If a node has one major event, its health score = 100 - 8 = 92
- If a node has one major and one minor event, its health score = 100 - 8 - 6 = 86
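The arithmetic in these examples can be sketched in a few lines of Python. This is an illustrative sketch only, not the product's implementation: the names `SEVERITY_SCORES` and `node_health_score` are hypothetical, and flooring the result at 0 is an assumption based on 0 being the lowest score the text mentions.

```python
# Default per-severity deductions, taken from the table above.
SEVERITY_SCORES = {"Critical": 10, "Major": 8, "Minor": 6, "Warning": 4}
MAX_NODE_SCORE = 100


def node_health_score(event_severities):
    """Return the maximum score minus the summed deductions for a node's events."""
    deduction = sum(SEVERITY_SCORES[severity] for severity in event_severities)
    # Assumption: the score never drops below 0.
    return max(0, MAX_NODE_SCORE - deduction)


print(node_health_score(["Critical"]))        # 90
print(node_health_score(["Major"]))           # 92
print(node_health_score(["Major", "Minor"]))  # 86
```

The three calls reproduce the three examples listed above.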
Service health score computation
The health score for a service is computed by using causal events that are associated with each of the service nodes and the significance derived from the service topology. The AI/ML algorithm in BMC Helix AIOps assigns weights to the nodes of a service and their relationships. These weights are numbers that signify the importance of a node or relationship when the impact occurs and are used for computing the health score.
A service model can contain multiple nodes of the same or different types or a single node, and one or more nodes can be impacted by events.
Computation when a service contains multiple nodes and multiple nodes are impacted
If a service model contains multiple nodes of different kinds, such as database and host, and multiple nodes are impacted, by default, the node score is computed based on the weightage assigned to the node kind, as shown in the following table:
Node kind | Weightage value |
---|---|
Database | 25% |
Host | 35% |
Virtual machine | 35% |
Other node kinds | 45% |
The following examples illustrate how the health score is computed based on the node weightage if multiple nodes are impacted:
Computation when a service contains multiple nodes and only one node is impacted
If a service model contains multiple nodes and only one node is impacted, the health score of the service is the node score of the impacted node. The node score depends on the severity of the event. For example, if a critical event has impacted the node, the node score, and therefore the service health score, is 90 (100 - 10).
The following example illustrates how the health score is computed if only a single node is impacted:
Computation when a service contains only one node and the node is impacted
If a service model contains only one node and the node is impacted, the service health score depends on the severity of the event that has impacted the node. For example, if a major event has impacted the node, the node score, and therefore the service health score, is 92 (100 - 8).
The following example illustrates how the health score is computed if a service model contains only one node and it is impacted:
Advanced options affecting service health score computation
By default, all events are considered for computing the health score of a service, and the health score is propagated from the child services to a parent service. However, you can customize how the health score should be computed and whether the impact should be propagated by using the following configuration options:
- Health indicators
- Event rules
- Balancing profiles
- Health severity score and status
- Health impact propagation
For an overview of these options, watch the YouTube video about the advanced service health score configuration in BMC Helix AIOps.
Health indicators
You can define one or more metrics associated with a service as health indicators that represent the overall health of the service. For example, if you are using synthetic transactions to measure the availability and response time of a web application, those availability and response time metrics are good candidates to be health indicators. When you define health indicators, you associate thresholds with them. When these thresholds are breached, the service health score reflects that the service is no longer completely healthy. For more information, see Adding health indicators.
The thresholds associated with service health indicators are also used for service predictions. For more information, see Predicting and proactively resolving service outages.
Event rules
By default, the health score for an impacted service is computed based on the events generated on all the CIs that are part of the service. However, as a service designer, you can define event rules to consider only specific events based on the impacted CIs, event severity, or message. For example, you can define a rule to consider only the events with Critical severity for computing the health score.
If you have added both health indicators and event rules for a service, events that are generated due to a threshold breach of these metrics and that match the criteria defined in the event rules are considered for computing the health score. For more information, see Adding event rules.
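The effect of an event rule can be pictured as a filter applied to events before the score is computed. The following Python sketch is illustrative only: the `Event` fields, the `critical_only` rule, and the `apply_event_rules` helper are hypothetical names, and treating an event as eligible when it matches at least one rule is an assumption, not documented product behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    ci: str        # impacted CI (hypothetical field name)
    severity: str  # for example "Critical", "Major", "Minor", "Warning"
    message: str


def critical_only(event):
    """Hypothetical rule: consider only events with Critical severity."""
    return event.severity == "Critical"


def apply_event_rules(events, rules):
    """Keep events matching at least one rule; with no rules, keep all events."""
    if not rules:
        return list(events)
    return [event for event in events if any(rule(event) for rule in rules)]


events = [
    Event("db-01", "Critical", "Disk full"),
    Event("host-07", "Minor", "High latency"),
]
print([e.ci for e in apply_event_rules(events, [critical_only])])  # ['db-01']
```

With an empty rule list, the helper returns every event, which mirrors the default behavior described above.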
Combining health indicators and event rules
The following table describes how the health score is computed when either health indicators or event rules or both are defined for a service:
Metrics defined as health indicators? | Event rules defined? | Events considered for health score computation | Example |
---|---|---|---|
Yes | No | Events generated for the health indicators and events generated for the CIs are considered. By default, the health score reduction caused by an event generated for a health indicator is double the reduction caused by an event generated for a CI. For more information, see Customizing health score and health status. | A service is associated with three metrics: Disk Space Used, CPU Utilization, and Memory Utilization, and you have defined Disk Space Used and CPU Utilization as health indicators. If a critical event is generated for CPU Utilization, the node health score is reduced by 20. If a critical event is generated for any CI, the node health score is reduced by 10. |
Yes | Yes | Events that are generated due to a threshold breach of the health indicator metrics and that match the criteria defined in the event rules are considered. | If you have defined the CPU Utilization and Memory Utilization metrics as health indicators and defined an event rule so that only events with the critical severity are considered, only critical events for these metrics are considered. |
No | Yes | Only the events that satisfy the event rules are considered. | If you have defined an event rule that considers only events with the warning severity, then all events with the warning severity are considered. |
No | No | All events are considered. | If you have not defined health indicators or event rules, events with all severities for all the CIs that are part of a service are considered. |
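The default doubling for health indicator events, described in the first row of the table, can be sketched as follows. This is a minimal illustration: `HEALTH_INDICATOR_MULTIPLIER` and `node_score_with_indicators` are hypothetical names, and the floor at 0 is an assumption.

```python
# Default per-severity deductions for events generated for a CI.
SEVERITY_SCORES = {"Critical": 10, "Major": 8, "Minor": 6, "Warning": 4}
# By default, a health indicator event reduces the score by double the CI value.
HEALTH_INDICATOR_MULTIPLIER = 2


def deduction(severity, from_health_indicator):
    """Return the score reduction caused by a single event."""
    base = SEVERITY_SCORES[severity]
    return base * HEALTH_INDICATOR_MULTIPLIER if from_health_indicator else base


def node_score_with_indicators(events):
    """events: iterable of (severity, from_health_indicator) pairs."""
    total = sum(deduction(severity, is_hi) for severity, is_hi in events)
    # Assumption: the score never drops below 0.
    return max(0, 100 - total)


print(node_score_with_indicators([("Critical", True)]))   # 80
print(node_score_with_indicators([("Critical", False)]))  # 90
```

The two calls correspond to the example in the table: a critical event on a health indicator reduces the node score by 20, and a critical event on a CI reduces it by 10.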
Balancing profiles
As a service designer, you can use a balancing profile to specify a threshold by selecting a certain number or percentage of CIs to ensure that the service remains healthy as long as these CIs are healthy. The health score is computed based on the events generated from the selected CIs in the balancing profile. If no balancing profiles are defined, all events for all CIs are considered while computing the health score.
You can define balancing profiles for an individual service from the Service details page or for multiple services from the Manage Service Health page. For more information, see Adding balancing profiles and Configuring global settings for service health.
Health severity score and status customization
The health score for a service is computed based on the health configuration defined in BMC Helix AIOps. Service designers can customize the values assigned to the health severity score and status based on an organization's requirements. The maximum health score for any service is 100. For more information, see Customizing health score and health status.
Service impact propagation
By default, the impact on the child services is propagated to the parent service, whose health score is determined by the health scores of its child services. For example, if there are three impacted child services with health scores of 30, 40, and 50, the health score of the parent service is the lowest health score across the child services. Therefore, the health score of the parent service is 30.
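The default propagation in this example amounts to taking the minimum of the child scores. A minimal Python sketch, in which `parent_health_score` is a hypothetical name and returning 100 for a parent with no child services is an assumption:

```python
def parent_health_score(child_scores):
    """Default propagation: the parent takes the lowest child health score."""
    # Assumption: with no child services, the parent keeps the maximum score.
    return min(child_scores, default=100)


print(parent_health_score([30, 40, 50]))  # 30
```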
As a service designer, you can stop the propagation of the impact to a parent service based on your organization's needs. For more information, see Customizing health score and health status.