Health score computation by node kind
When BMC Helix AIOps computes the health score of a service by node kind, it considers the node kind (such as database, host, or virtual machine), the node weightage, and indexing.
Service health score computation process
The following process is used to compute the health score of a service:
- Node health score is computed based on the following factors:
  - Events impacting the node
  - Event rules
  - Health indicators
- Service health score is computed:
  - Node health scores are sorted in ascending order.
  - Node kind health score is computed based on the node weightage and indexing.
  - Service health score is the lowest node kind score among all the node kinds.
Because the health score of a service depends on the health scores of its nodes, let's first look at how the health score of a node is calculated.
Node health score computation
The health score for a node is computed by using causal events that impact the node.
Node health score computation without any event rules or health indicator events
By default, the health score of a node is 100. Each event severity is assigned a score as listed in the following table:
Event Severity | Score | Reduction in health score |
---|---|---|
Critical | 10 | If the node is impacted by one critical event, its health score is reduced by 10. |
Major | 8 | If the node is impacted by one major event, its health score is reduced by 8. |
Minor | 6 | If the node is impacted by one minor event, its health score is reduced by 6. |
Warning | 4 | If the node is impacted by one warning event, its health score is reduced by 4. |
The following examples illustrate how the health score of a node is computed when it is impacted by events:
- If the node is impacted by one critical and one major event, its health score = 100 - 10 - 8 = 82
- If the node is impacted by two warning events, its health score = 100 - (2*4) = 92
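The arithmetic in these examples can be sketched in a few lines of Python. This is only an illustration that uses the default severity scores from the table above. The names `SEVERITY_SCORE` and `node_health_score` are hypothetical and are not part of BMC Helix AIOps.

```python
# Default severity scores from the table above (customizable in the product).
SEVERITY_SCORE = {"critical": 10, "major": 8, "minor": 6, "warning": 4}


def node_health_score(event_severities, base_score=100):
    """Subtract the score of each impacting event from the base score.

    Hypothetical helper for illustration only. The source does not say
    whether the result is clamped at 0, so no clamping is applied here.
    """
    return base_score - sum(SEVERITY_SCORE[s.lower()] for s in event_severities)


print(node_health_score(["critical", "major"]))   # 82
print(node_health_score(["warning", "warning"]))  # 92
```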
Service designers can customize the values assigned to the severity score based on their organization's requirements. For more information, see Customizing-health-score-and-health-status.
For an overview of the advanced service health score configuration options, watch the YouTube video about the advanced service health score configuration in BMC Helix AIOps.
Node health score computation with event rules
By default, the health score for an impacted service is computed based on the events generated on all the nodes (CIs) that are part of the service. However, as a service designer, you can define event rules to consider only specific events based on the impacted CIs, event severity, or message. For example, if you have defined an event rule that considers only events with the Major severity, only events with the Major severity are considered. The event rule you define for a service applies to all the nodes that are part of the service.
The following example illustrates how the health score of a node is computed when an event rule is defined to consider only the events with the Major severity. If a node is impacted by Major and Minor events, only the Major events are considered for the health score computation.
- If a node is impacted by three major events and two critical events, its health score = 100 - (3*8) = 76
- If a node is impacted by two minor events, its health score = 100 - 0 = 100
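One way to picture an event rule is as a filter that runs before the subtraction. The sketch below assumes a rule can be modeled as a simple predicate on severity. Real event rules can also match on impacted CIs or message text, which is not modeled here. The names `score_with_rule` and `major_only` are hypothetical.

```python
# Default severity scores for events (customizable in the product).
SEVERITY_SCORE = {"critical": 10, "major": 8, "minor": 6, "warning": 4}


def score_with_rule(event_severities, rule, base_score=100):
    """Count only the events accepted by the rule, then subtract their scores."""
    considered = [s for s in event_severities if rule(s)]
    return base_score - sum(SEVERITY_SCORE[s] for s in considered)


def major_only(severity):
    """Hypothetical event rule: consider only events with the Major severity."""
    return severity == "major"


print(score_with_rule(["major"] * 3 + ["critical"] * 2, major_only))  # 76
print(score_with_rule(["minor", "minor"], major_only))                # 100
```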
Node health score computation with health indicators
You can define one or more metrics associated with a service as health indicators that represent the overall health of the service. For example, if you are using synthetic transactions to measure the availability and response time of a web application, those availability and response time metrics are good candidates to be health indicators. When you define health indicators, you associate thresholds with them. When these thresholds are breached, the service health score reflects that the service is no longer completely healthy. For more information, see Adding-health-indicators.
The thresholds associated with service health indicators are also used for service predictions. For more information, see Predicting-and-proactively-resolving-service-outages.
By default, an event that is generated due to a breach of the health indicator threshold (also called a health indicator event) is assigned a score as listed in the following table:
Health indicator event severity | Score | Reduction in health score |
---|---|---|
Critical | 20 | If the node is impacted by one critical health indicator event, its health score is reduced by 20. |
Major | 16 | If the node is impacted by one major health indicator event, its health score is reduced by 16. |
Minor | 12 | If the node is impacted by one minor health indicator event, its health score is reduced by 12. |
Warning | 8 | If the node is impacted by one warning health indicator event, its health score is reduced by 8. |
Service designers can customize the values assigned to the severity score based on an organization's requirements. For more information, see Customizing-health-score-and-health-status.
The following examples illustrate how the health score of a node is computed when a health indicator, for example, Disk Space Used, is defined for a service. The node is impacted by health indicator events due to a breach of the Disk Space Used threshold.
- If the node is impacted by two Major severity health indicator events, its health score = 100 - (2*16) = 68
- If the node is impacted by three Critical severity health indicator events, its health score = 100 - (3*20) = 40
Node health score computation with both health indicators and event rules
If you have defined both health indicators and event rules for a service, events that are generated due to a threshold breach of these metrics and that match the criteria defined in the event rules are considered for computing the health score of its nodes. The following table describes how the health score of a node is computed when either health indicators or event rules, or both are defined for a service:
Metrics defined as health indicators? | Event rules defined? | Events considered for health score computation | Example |
---|---|---|---|
Yes | No | Events generated for the health indicators and events generated for the CIs are considered. By default, the health score reduction for a health indicator event is double the reduction for an event generated for a CI. For more information, see Customizing health score and health status. | A service is associated with three metrics: Disk Space Used, CPU Utilization, and Memory Utilization, and you have defined Disk Space Used and CPU Utilization as health indicators. If a critical event is generated for CPU Utilization, the node health score is reduced by 20. If a critical event is generated for any CI, the node health score is reduced by 10. |
Yes | Yes | Events that are generated due to a threshold breach of the health indicator metrics and that match the criteria defined in the event rules are considered. | If you have defined the CPU Utilization and Memory metrics as health indicators and defined an event rule so that only events with the critical severity are considered, only critical-severity events for these metrics are considered. |
No | Yes | Only the events that satisfy the event rules are considered. | If you have defined an event rule that considers only events with the warning severity, then all events with the warning severity are considered. |
No | No | All events are considered. | If you have not defined health indicators or event rules, events with all severities for all the CIs that are part of a service are considered. |
The following examples illustrate how the health score of a node is computed when an event rule is defined to consider only the events with Critical severity and a health indicator is defined for the service.
- If the node is impacted by two Major severity health indicator events and two other Critical events, its health score = 100 - (2*16) - (2*10) = 48
- If the node is impacted by one Major severity health indicator event and two other Minor events, its health score = 100 - (1*16) = 84
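The two examples above can be reproduced with a small sketch. It follows the arithmetic of those examples, in which health indicator events are always counted and the event rule filters only the other events. That reading is an assumption drawn from the examples, not a statement of how the product is implemented. The `Event` class and the function name are hypothetical.

```python
from dataclasses import dataclass

# Default scores from the two severity tables in this topic.
CI_EVENT_SCORE = {"critical": 10, "major": 8, "minor": 6, "warning": 4}
HEALTH_INDICATOR_SCORE = {"critical": 20, "major": 16, "minor": 12, "warning": 8}


@dataclass(frozen=True)
class Event:
    severity: str
    is_health_indicator: bool = False


def node_health_score(events, rule_severities=None, base_score=100):
    """Assumption: the rule filters only events that are not health indicator events."""
    score = base_score
    for event in events:
        if event.is_health_indicator:
            score -= HEALTH_INDICATOR_SCORE[event.severity]
        elif rule_severities is None or event.severity in rule_severities:
            score -= CI_EVENT_SCORE[event.severity]
    return score


critical_only = {"critical"}
first = [Event("major", True)] * 2 + [Event("critical")] * 2
second = [Event("major", True), Event("minor"), Event("minor")]
print(node_health_score(first, critical_only))   # 100 - (2*16) - (2*10) = 48
print(node_health_score(second, critical_only))  # 100 - (1*16) = 84
```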
Service health score computation
To compute the health score of a service, the AI/ML algorithm in BMC Helix AIOps assigns weight to the nodes of a service and their relationships. These weights are numbers that signify the importance of a node or relationship when the impact occurs and are used for computing the health score.
A service can contain multiple nodes of the same or different types or a single node, and one or more nodes can be impacted by events. The following examples illustrate how the health score of a service is computed in different scenarios.
Computation when a service contains multiple nodes and multiple nodes are impacted
If a service contains multiple nodes of different kinds, such as database and host, and multiple nodes are impacted, by default, the service health score is computed based on the weightage assigned to the node kind, as shown in the following table:
Node kind | Weightage value |
---|---|
Database | 25% |
Host | 35% |
Virtual machine | 35% |
Other node kinds | 45% |
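The worked examples later in this topic apply these weightage values in a consistent way, which can be summarized in one formula. The notation is introduced here only for illustration: for a node kind $k$, let $n_k$ be the number of nodes considered, $w_k$ the weightage as a fraction, and $s_{k,0} \le s_{k,1} \le \dots$ the node scores sorted in ascending order. Truncating the index to a whole number is inferred from the examples (2.8 becomes 2, and 0.45 becomes 0).

```latex
\[
i_k = \lfloor w_k \cdot n_k \rfloor, \qquad
S_k = s_{k,\,i_k}, \qquad
S_{\text{service}} = \min_k S_k
\]
```

For example, with eight Host nodes and a weightage of 35%, $i_{\text{Host}} = \lfloor 0.35 \cdot 8 \rfloor = 2$, so the Host node kind score is the third-lowest Host node score.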
The examples later in this topic illustrate how the service health score is computed based on the node weightage if multiple nodes are impacted.
Computation when a service contains multiple nodes and only one node is impacted
If a service contains multiple nodes and only one node is impacted, the health score of the service is the node score of the impacted node. The node score depends on the severity of the event. For example, if a critical event has impacted the node, the node score, and therefore the service health score, is 90 (100 - 10).
Computation when a service contains only one node and the node is impacted
If a service contains only one node and the node is impacted, the service health score depends on the severity of the event that has impacted the node. For example, if a major event has impacted the node, the node score, and therefore the service health score, is 92 (100 - 8).
Impact propagation and service health score
By default, the impact on the child services is propagated to the parent service, and the health score of the parent service is determined by the health scores of the child services. However, as a service designer, you can stop the impact propagation based on your organization's needs. For more information, see Customizing-health-score-and-health-status.
The impact propagation example later in this topic illustrates how the health score of a parent service is computed if the impact on the child services is propagated to the parent service.
Balancing profiles and service health score
As a service designer, you can use a balancing profile to specify a threshold by selecting a certain number or percentage of CIs to make sure that the service remains healthy as long as these CIs are healthy. The health score is computed based on the events generated from the selected CIs in the balancing profile. If no balancing profiles are defined, all events for all CIs are considered while computing the health score. For more information, see Adding-balancing-profiles.
Example: Service health score computation without event rules or health indicators
This example illustrates how the health score (10) of the ApexInsurance.live service, which is a child service of the apexbanking.live service, is computed.
The following figure shows the service model containing the apexbanking.live service and its child services. The highlighted service (ApexInsurance.live) indicates that it is 90% Impacted, which means that its health score is 10.
Assumptions
The following assumptions are used to compute the health score of the ApexInsurance.live service:
- Due to customizations in the health score settings, for every Critical event on a node in the ApexInsurance.live service, the health score of the node is reduced by 25. For information about customizations, see Customizing-health-score-and-health-status.
- For every Major event on a node, the health score of the node is reduced by 8.
- For every Critical health indicator event on a node, the health score of the node is reduced by 20.
Service topology
The following figure shows the topology of the ApexInsurance.live service:
The nodes have been grouped by their kinds. The icon indicates that various nodes in that node kind group have been impacted by events.
The topology contains the following node kind groups:
- Host: Consists of ten nodes, two of which belong to the child service of the ApexInsurance.live service. Although the group contains ten nodes, only eight of them are considered for health score computation because the remaining two belong to the child service of the ApexInsurance.live service, not to the service itself.
- Cluster: Consists of one node, which is impacted by events and is considered for health score computation.
- Network Device: Consists of four nodes. The nodes in this group are impacted by events. However, they are not considered for health score computation because they belong to the child service of the ApexInsurance.live service, not to the service itself.
- Network Interface: Consists of two nodes. These nodes are not considered for health score computation because they are not impacted by events.
Health score computation for nodes
Let’s first look at how the node score is calculated.
Health score computation for Host nodes
The following figure shows that the Host node kind contains ten nodes and Node 1 is impacted by two Critical events.
The following table lists the health scores for the nodes in the Host node kind:
Node kind | Node 1 | Node 2 | Node 3 | Node 4 | Node 5 | Node 6 | Node 7 | Node 8 |
---|---|---|---|---|---|---|---|---|
Host | 100 – (2*25) = 50 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
For each critical event on Node 1, its score is reduced by 25. Hence, the health score is 100 – (2*25) = 50. Other nodes have not been impacted by any events. Hence, their score is 100 (default score).
Health score computation for Cluster nodes
The following figure shows that the Cluster node kind contains one node (Node 1), which is impacted by two Critical and five Major events.
The following table lists the health score for the node in the Cluster node kind:
Node kind | Node 1 |
---|---|
Cluster | 100 – (2*25) - (5*8) = 10 |
For each Critical event on Node 1, its score is reduced by 25 and for each Major event, the score is reduced by 8. Hence, the health score is 100 – 50 – 40 = 10.
Health score computation for the service
The following process is used to compute the ApexInsurance.live service health score (10):
- Node scores are sorted in ascending order by health score.
Node kind | Node 1 | Node 2 | Node 3 | Node 4 | Node 5 | Node 6 | Node 7 | Node 8 |
---|---|---|---|---|---|---|---|---|
Host | 50 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
Cluster | 10 | | | | | | | |
Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
- Node kind score is calculated based on the node index. The index for the node is calculated based on the following formula:
Index = Weightage value percentage of the total number of nodes
  - The number of Host nodes is 8, and the weightage associated with a host node is 35. So, the index is 35% of 8 = 2.8, which is converted to the whole number 2. The index points to the third element in the Host row. So, the score for the Host node kind is 100.
  - The number of Cluster nodes is 1, and the weightage associated with a cluster node is 45. So, the index is 45% of 1 = 0.45, which is converted to 0. The index points to the first element in the Cluster row. So, the score for the Cluster node kind is 10.
- The health score for the service is the lowest score among all the node kinds. The lowest node kind score is 10. Therefore, the service health score is 10.
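The sort, index, and minimum steps of this process can be sketched as follows. The weightage values come from the default table in this topic, and Cluster is treated as one of the "other node kinds" (45%), as in the example. Using integer arithmetic to truncate the index is a choice made for this sketch, and all names are hypothetical.

```python
# Default weightage percentages by node kind, taken from the table in this topic.
DEFAULT_WEIGHTAGE_PERCENT = {"database": 25, "host": 35, "virtual machine": 35}
OTHER_KINDS_WEIGHTAGE_PERCENT = 45  # used for kinds not listed, such as cluster


def node_kind_score(kind, node_scores):
    """Sort the node scores in ascending order and pick the one at the weighted index."""
    ordered = sorted(node_scores)
    percent = DEFAULT_WEIGHTAGE_PERCENT.get(kind.lower(), OTHER_KINDS_WEIGHTAGE_PERCENT)
    index = percent * len(ordered) // 100  # for example, 35% of 8 = 2.8, truncated to 2
    return ordered[index]


def service_health_score(scores_by_kind):
    """The service score is the lowest node kind score."""
    return min(node_kind_score(kind, scores) for kind, scores in scores_by_kind.items())


scores = {"host": [50] + [100] * 7, "cluster": [10]}
print(node_kind_score("host", scores["host"]))  # 100
print(service_health_score(scores))             # 10
```

Because the index skips the lowest-scoring nodes of a kind, a single impacted host out of eight does not lower the Host node kind score in this example.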
Example: Service health score computation with event rules
Assume that you have defined an event rule stating that only events with Critical severity can impact the service health, as shown in the following figure. In such a case, only events with Critical severity are considered for health score computation. Events with other severities are ignored.
If the event rule is applied to the Cluster node kind, the health score for Node 1 is calculated as follows:
Node kind | Node 1 |
---|---|
Cluster | 100 – (2*25) = 50 |
In this case, although Node 1 is impacted by both Critical and Major events, only Critical events are considered for computation.
The overall service health score is calculated as follows:
- Node scores are sorted in ascending order by health score.
Node kind | Node 1 | Node 2 | Node 3 | Node 4 | Node 5 | Node 6 | Node 7 | Node 8 |
---|---|---|---|---|---|---|---|---|
Host | 50 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
Cluster | 50 | | | | | | | |
Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
- Node kind score is calculated based on the node index. The index for the node is calculated based on the following formula:
Index = Weightage value percentage of the total number of nodes
  - The number of Host nodes is 8, and the weightage associated with a host node is 35. So, the index is 35% of 8 = 2.8, which is converted to the whole number 2. The index points to the third element in the Host row. So, the node kind score is 100.
  - The number of Cluster nodes is 1, and the weightage associated with a cluster node is 45. So, the index is 45% of 1 = 0.45, which is converted to 0. The index points to the first element in the Cluster row. So, the node kind score is 50.
- The health score for the service is the lowest score among all the node kinds. The lowest node kind score is 50. Therefore, the service health score is 50.
The following figure shows that the health score of the ApexInsurance.live service is updated to 50 when an event rule is defined.
Example: Service health score computation with health indicators
Assume that you have defined the following health indicators for the ApexInsurance.live service (shown in the following figure):
ActualUsed, Free, KernelSlabMemory, UsedPercent, InErrorsInPercent
When the thresholds of these health indicators are breached, events are generated. For example, the following figure shows events generated for a host node in the apexbanking.live service:
The following table shows the health score of the nodes:
Node kind | Node 1 | Node 2 | Node 3 | Node 4 | Node 5 | Node 6 | Node 7 | Node 8 | Node 9 |
---|---|---|---|---|---|---|---|---|---|
Host | 100 – (2*25) = 50 | 100 – (4*20) = 20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
Cluster | 100 – (2*25) – (5*8) = 10 | | | | | | | | |
Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
The following process is used to compute the service health score:
- Node scores are listed in ascending order by health score.
Node kind | Node 1 | Node 2 | Node 3 | Node 4 | Node 5 | Node 6 | Node 7 | Node 8 | Node 9 |
---|---|---|---|---|---|---|---|---|---|
Host | 20 | 50 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
Cluster | 10 | | | | | | | | |
Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
- Node kind score is calculated based on the node index. The index for the node is calculated based on the following formula:
Index = Weightage value percentage of the total number of nodes
  - The number of Host nodes is 9, and the weightage associated with a host node is 35. So, the index is 35% of 9 = 3.15, which is converted to the whole number 3. The index points to the fourth element in the Host row. So, the node kind score is 100.
  - The number of Cluster nodes is 1, and the weightage associated with a cluster node is 45. So, the index is 45% of 1 = 0.45, which is converted to 0. The index points to the first element in the Cluster row. So, the node kind score is 10.
- The health score for the service is the lowest score among all the node kinds. The lowest node kind score is 10. Therefore, the service health score is 10.
Example: Service health score computation with impact propagation
This example illustrates how the health score (displayed as 10) of the apexbanking.live service, which is the parent of multiple child services, is computed.
The parent service, apexbanking.live, is impacted by multiple Critical severity events. As a result, the parent service's own health score is 20.
The health scores of the impacted child services are 70, 50, 10, and 70.
The health score of the parent service is the lowest value among its own score and the scores of its child services. Therefore, the health score of the parent service is 10.
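The propagation step in this example reduces to taking a minimum. The sketch below assumes that impact propagation is enabled for every child service. The function name is hypothetical.

```python
def parent_service_score(own_score, child_scores):
    """Lowest of the parent's own score and the scores propagated from its children."""
    return min([own_score, *child_scores])


print(parent_service_score(20, [70, 50, 10, 70]))  # 10
```

If propagation were stopped for the child service that scores 10, this sketch would be called without that value, and the parent's own score of 20 would become the lowest.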