Health score computation by node kind


BMC Helix AIOps can compute the health score of a service by node kind. This computation considers the node kind (database, host, virtual machine, and others), the node weightage, and indexing.

Service health score computation process

The following process is used to compute the health score of a service:

  1. Node health score is computed based on the following factors:
    • Events that impact the node
    • Event rules 
    • Health indicators 
  2. Service health score is computed:
    1. Node health scores are sorted in ascending order.
    2. Node kind health score is computed based on the node weightage and indexing.
    3. Service health score is the lowest node score among all the node kinds.

Because the health score of a service depends on the health score of its nodes, let's first look at how the health score of a node is calculated.

Node health score computation

The health score for a node is computed by using causal events that impact the node.

Node health score computation without any event rules or health indicator events

By default, the health score of a node is 100. Each event severity is assigned a score as listed in the following table: 

| Event severity | Score | Reduction in health score |
|---|---|---|
| Critical | 10 | If the node is impacted by one critical event, its health score is reduced by 10. |
| Major | 8 | If the node is impacted by one major event, its health score is reduced by 8. |
| Minor | 6 | If the node is impacted by one minor event, its health score is reduced by 6. |
| Warning | 4 | If the node is impacted by one warning event, its health score is reduced by 4. |

The following examples illustrate how the health score of a node is computed when it is impacted by events:

  • If the node is impacted by one critical and one major event, its health score = 100 - 10 - 8 = 82
  • If the node is impacted by two warning events, its health score = 100 - (2*4) = 92
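The default computation can be sketched in a few lines of Python. This is a minimal illustration, not product code: the function name `node_health_score` and the lower bound of 0 are assumptions that do not appear in the documentation, while the severity scores are the defaults listed in the table.

```python
# Default per-event score reductions, taken from the severity table.
DEFAULT_SEVERITY_SCORES = {"critical": 10, "major": 8, "minor": 6, "warning": 4}


def node_health_score(event_severities, severity_scores=DEFAULT_SEVERITY_SCORES):
    """Start at 100 and subtract the score of every event that impacts the node."""
    score = 100
    for severity in event_severities:
        score -= severity_scores[severity.lower()]
    # Assumption: the score is not allowed to drop below 0.
    return max(score, 0)


print(node_health_score(["critical", "major"]))   # 100 - 10 - 8 = 82
print(node_health_score(["warning", "warning"]))  # 100 - (2*4) = 92
```

Passing a different `severity_scores` mapping mimics a customized severity score.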

Service designers can customize the values assigned to the severity score based on their organization's requirements. For more information, see Customizing-health-score-and-health-status.

Watch the following video to get an overview of the advanced service health score configuration options:

Watch the YouTube video about the advanced service health score configuration in BMC Helix AIOps.

Node health score computation with event rules

By default, the health score for an impacted service is computed based on the events generated on all the nodes (CIs) that are part of the service. However, as a service designer, you can define event rules to consider only specific events based on the impacted CIs, event severity, or message. For example, if you define an event rule that considers only events with the Major severity, only Major events are considered and events with other severities are ignored. The event rule you define for a service applies to all the nodes that are part of the service.

The following example illustrates how the health score of a node is computed when an event rule is defined to consider only the events with the Major severity. If a node is impacted by Major and Minor events, only the Major events are considered for the health score computation. 

  • If a node is impacted by three major events and two critical events, its health score = 100 - (3*8) = 76 
  • If a node is impacted by two minor events, its health score = 100 - 0 = 100
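The effect of a severity-based event rule can be sketched as a filter that runs before the subtraction. The function name `score_with_event_rule` and the representation of a rule as a set of accepted severities are assumptions made for this illustration; real event rules can also match on impacted CIs or messages.

```python
# Default per-event score reductions.
SEVERITY_SCORES = {"critical": 10, "major": 8, "minor": 6, "warning": 4}


def score_with_event_rule(event_severities, allowed_severities):
    """Subtract scores only for events whose severity the rule accepts."""
    considered = [s for s in event_severities if s in allowed_severities]
    # Assumption: the score is not allowed to drop below 0.
    return max(100 - sum(SEVERITY_SCORES[s] for s in considered), 0)


# Rule: consider only events with the Major severity.
print(score_with_event_rule(["major"] * 3 + ["critical"] * 2, {"major"}))  # 76
print(score_with_event_rule(["minor"] * 2, {"major"}))                     # 100
```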

Node health score computation with health indicators

You can define one or more metrics associated with a service as health indicators that represent the overall health of the service. For example, if you are using synthetic transactions to measure the availability and response time of a web application, those availability and response time metrics are good candidates to be health indicators. When you define health indicators, you associate thresholds with them. When these thresholds are breached, the service health score reflects that the service is no longer completely healthy. For more information, see Adding-health-indicators.

The thresholds associated with service health indicators are also used for service predictions. For more information, see Predicting-and-proactively-resolving-service-outages.

Important

If no health indicators are defined for a service, all the metrics associated with the service for which alarm thresholds are defined have the potential to impact the health score. Any alarm generated for any CI that is part of the service affects the score. In this scenario, it is not necessary to define health indicators. However, not all metrics are of equal importance. Some metrics, such as those that represent performance and availability, are better indicators of service health than others. If you have metrics like these for a service, consider defining them as health indicators.

By default, an event which is generated due to a breach in the health indicator threshold (also called health indicator event) is assigned a score as listed in the following table:

Health indicator event severity

Score

Reduction in health score

Critical

20

If the node is impacted by one critical event, its health score is reduced by 20.

Major

16

If the node is impacted by one major event, its health score is reduced by 16.

Minor

12

If the node is impacted by one minor event, its health score is reduced by 12.

Warning

8

If the node is impacted by one warning event, its health score is reduced by 8.

Service designers can customize the values assigned to the severity score based on an organization's requirements. For more information, see Customizing-health-score-and-health-status.

The following examples illustrate how the health score of a node is computed when a health indicator (for example, Disk Space Used) is defined for a service. The node is impacted by health indicator events that are generated when the Disk Space Used threshold is breached.

  • If the node is impacted by two Major severity health indicator events, its health score = 100 - (2*16) = 68 
  • If the node is impacted by three Critical severity health indicator events, its health score = 100 - (3*20) = 40
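A short sketch of the same arithmetic, using the default health indicator scores from the table. The function name `score_from_health_indicator_events` and the lower bound of 0 are assumptions for illustration only.

```python
# Default score reductions for health indicator events.
HEALTH_INDICATOR_SCORES = {"critical": 20, "major": 16, "minor": 12, "warning": 8}


def score_from_health_indicator_events(severities):
    """Subtract the health indicator score for each threshold-breach event."""
    reduction = sum(HEALTH_INDICATOR_SCORES[s] for s in severities)
    # Assumption: the score is not allowed to drop below 0.
    return max(100 - reduction, 0)


print(score_from_health_indicator_events(["major", "major"]))  # 100 - (2*16) = 68
print(score_from_health_indicator_events(["critical"] * 3))    # 100 - (3*20) = 40
```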

Node health score computation with both health indicators and event rules

If you have defined both health indicators and event rules for a service, events that are generated due to a threshold breach of these metrics and that match the criteria defined in the event rules are considered for computing the health score of its nodes. The following table describes how the health score of a node is computed when either health indicators or event rules, or both are defined for a service:

| Metrics defined as health indicators? | Event rules defined? | Events considered for health score computation | Example |
|---|---|---|---|
| Yes | No | The following types of events are considered: events generated for the metrics that are defined as health indicators, and all other events generated for any CI that is associated with the service. By default, the health score reduced due to an event generated for a health indicator is double the value of the score reduced due to an event generated for a CI. For more information, see Customizing health score and health status. | A service is associated with three metrics: Disk Space Used, CPU Utilization, and Memory Utilization, and you have defined Disk Space Used and CPU Utilization as health indicators. If a critical event is generated for CPU Utilization, the node health score is reduced by 20. If a critical event is generated for any CI, the node health score is reduced by 10. |
| Yes | Yes | The following types of events are considered: events that are generated on the health indicators due to associated policies, and events generated on metrics other than health indicators that satisfy the defined rules. | If you have defined the CPU Utilization and Memory metrics as the health indicators and defined an event rule so that events with only the critical severity are considered, events with only the critical severity for these metrics are considered. |
| No | Yes | Only the events that satisfy the event rules are considered. | If you have defined an event rule that considers only events with the warning severity, then all events with the warning severity are considered. |
| No | No | All events are considered. | If you have not defined health indicators or event rules, events with all severities for all the CIs that are part of a service are considered. |

The following examples illustrate how the health score of a node is computed when an event rule is defined to consider only the events with Critical severity and a health indicator is defined for the service.  

  • If the node is impacted by two Major severity health indicator events and two other Critical events, its health score = 100 - (2*16) - (2*10) = 48 
  • If the node is impacted by one Major severity health indicator event and two other Minor events, its health score = 100 - (1*16) = 84 
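The combined case can be sketched as follows. The `Event` class, the `is_health_indicator` flag, and the function name are hypothetical. The sketch assumes, following the two examples above, that health indicator events always count and that the event rule filters only the other events.

```python
from dataclasses import dataclass

EVENT_SCORES = {"critical": 10, "major": 8, "minor": 6, "warning": 4}
HEALTH_INDICATOR_SCORES = {"critical": 20, "major": 16, "minor": 12, "warning": 8}


@dataclass(frozen=True)
class Event:
    severity: str
    is_health_indicator: bool = False  # True for threshold-breach events


def node_health_score(events, rule_severities=None):
    """Score a node when health indicators and a severity rule are both defined."""
    score = 100
    for event in events:
        if event.is_health_indicator:
            # Assumption: health indicator events are counted regardless of the rule.
            score -= HEALTH_INDICATOR_SCORES[event.severity]
        elif rule_severities is None or event.severity in rule_severities:
            score -= EVENT_SCORES[event.severity]
    return max(score, 0)  # assumption: lower bound of 0


rule = {"critical"}  # event rule: consider only Critical events
first = [Event("major", True)] * 2 + [Event("critical")] * 2
second = [Event("major", True)] + [Event("minor")] * 2
print(node_health_score(first, rule))   # 100 - (2*16) - (2*10) = 48
print(node_health_score(second, rule))  # 100 - (1*16) = 84
```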

Service health score computation

To compute the health score of a service, the AI/ML algorithm in BMC Helix AIOps assigns weight to the nodes of a service and their relationships. These weights are numbers that signify the importance of a node or relationship when the impact occurs and are used for computing the health score.

A service can contain multiple nodes of the same or different types or a single node, and one or more nodes can be impacted by events. The following examples illustrate how the health score of a service is computed in different scenarios.  

Computation when a service contains multiple nodes and multiple nodes are impacted

If a service contains multiple nodes of different kinds, such as database and host, and multiple nodes are impacted, by default, the service health score is computed based on the weightage assigned to the node kind, as shown in the following table:

| Node kind | Weightage value |
|---|---|
| Database | 25% |
| Host | 35% |
| Virtual machine | 35% |
| Other node kinds | 45% |

The following examples illustrate how the service health score is computed based on the node weightage if multiple nodes are impacted:

A service model with ten nodes (virtual machines) and multiple nodes impacted

The following table shows the health score of the nodes. For information about how the health score of a node is computed, see Node health score computation.  

| Node 1 | Node 2 | Node 3 | Node 4 | Node 5 | Node 6 | Node 7 | Node 8 | Node 9 | Node 10 |
|---|---|---|---|---|---|---|---|---|---|
| 100 | 80 | 74 | 80 | 70 | 92 | 100 | 88 | 84 | 100 |

The following process is used to compute the service health score:

  1. Node scores are sorted in ascending order by health score.

    | Node 5 | Node 3 | Node 2 | Node 4 | Node 9 | Node 8 | Node 6 | Node 1 | Node 7 | Node 10 |
    |---|---|---|---|---|---|---|---|---|---|
    | 70 | 74 | 80 | 80 | 84 | 88 | 92 | 100 | 100 | 100 |
    | Index 0 | Index 1 | Index 2 | Index 3 | Index 4 | Index 5 | Index 6 | Index 7 | Index 8 | Index 9 |

  2. The index is calculated based on the following formula:
    Index = Weightage value percentage of the total number of nodes
    If the index value is a decimal number, the fractional part of the number is not considered in the calculation. For example, if the index is 5.67, only the whole number part, that is, 5, is considered for the calculation. The fractional part, that is, 0.67, is discarded.
    In this example, the number of virtual machine nodes is 10, and the weightage associated with a virtual machine node is 35. So, the index is 35% of 10 = 3.5, which is converted to a whole number 3.

  3. The health score of the service is the score of the node that corresponds to the index. The index position starts at the far left with 0 and moves to the right.
    In this example, index 3 points to the fourth element in the sorted list, Node 4, which has a health score of 80. Therefore, the service health score is 80.
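The three steps can be sketched as a small function. The name `node_kind_score`, the integer-percentage representation of the weightage, and the clamping of the index to the last element are assumptions for this illustration. Integer division is used so that the fractional part of the index is discarded without floating-point rounding surprises.

```python
def node_kind_score(node_scores, weight_percent):
    """Pick the score at index floor(weight% * node count) in the sorted list."""
    if not node_scores:
        raise ValueError("at least one node score is required")
    ordered = sorted(node_scores)                    # step 1: ascending order
    index = (weight_percent * len(ordered)) // 100   # step 2: drop the fraction
    index = min(index, len(ordered) - 1)             # assumption: stay in range
    return ordered[index]                            # step 3: score at the index


# Scores of Node 1 through Node 10 from the example; virtual machine weightage is 35.
vm_scores = [100, 80, 74, 80, 70, 92, 100, 88, 84, 100]
print(node_kind_score(vm_scores, 35))  # 35% of 10 = 3.5 -> index 3 -> 80
```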

A service with five database nodes, three host nodes, ten virtual machine nodes, and eight nodes of other kinds, and multiple nodes impacted

The following table shows the health score of these nodes. For information about how the health score of a node is computed, see Node health score computation.

The following process is used to compute the service health score:

  1. Node scores are sorted in ascending order by health score.

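    Node scores sorted in ascending order, by node kind:

    | Node kind | Index 0 | Index 1 | Index 2 | Index 3 | Index 4 | Index 5 | Index 6 | Index 7 | Index 8 | Index 9 |
    |---|---|---|---|---|---|---|---|---|---|---|
    | Database | 80 | 84 | 90 | 100 | 100 | | | | | |
    | Host | 74 | 88 | 90 | | | | | | | |
    | Virtual machine | 58 | 66 | 66 | 66 | 72 | 78 | 88 | 90 | 100 | 100 |
    | Other node kinds | 40 | 50 | 54 | 58 | 66 | 72 | 88 | 100 | | |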

  2. Node kind score is calculated based on the node index. The index for the node is calculated based on the following formula:
    Index = Weightage value percentage of the total number of nodes
    In this example:

    • The number of database nodes is 5, and the weightage associated with a database node is 25. So, the index is 25% of 5=1.25, which is converted to a whole number 1.
      The index points to the second element in the Database row. So, the node kind score is 84. 

    • The number of host nodes is 3, and the weightage associated with a host node is 35. So the index is 35% of 3=1.05, which is converted to a whole number 1.
      The index points to the second element in the Host row. So, the node kind score is 88.

    • The number of virtual machine nodes is 10, and the weightage associated with a virtual machine node is 35. So, the index is 35% of 10=3.5, which is converted to a whole number 3.
      The index points to the fourth element in the Virtual machine row. So, the node kind score is 66.

    • The number of nodes of other kinds is 8, and the weightage associated with other node kinds is 45. So, the index is 45% of 8=3.6, which is converted to a whole number 3.
      The index points to the fourth element in the Other node kinds row. So, the node kind score is 58.

  3. The health score for the service is the lowest node score among all the node kinds. The lowest node score is 58. Therefore, the service health score is 58.
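The multi-kind computation can be sketched by applying the index rule per node kind and then taking the minimum. The function and dictionary names are hypothetical, and the clamping of the index is an assumption. The weightage values are the defaults from the table above.

```python
DEFAULT_WEIGHTS = {"database": 25, "host": 35, "virtual machine": 35}
OTHER_WEIGHT = 45  # weightage for all other node kinds


def service_health_score(scores_by_kind):
    """Return (service score, per-kind scores) using the weightage index rule."""
    kind_scores = {}
    for kind, scores in scores_by_kind.items():
        ordered = sorted(scores)
        weight = DEFAULT_WEIGHTS.get(kind, OTHER_WEIGHT)
        # Integer division discards the fractional part of the index.
        index = min((weight * len(ordered)) // 100, len(ordered) - 1)
        kind_scores[kind] = ordered[index]
    # The service score is the lowest node kind score.
    return min(kind_scores.values()), kind_scores


service, per_kind = service_health_score({
    "database": [80, 84, 90, 100, 100],
    "host": [74, 88, 90],
    "virtual machine": [58, 66, 66, 66, 72, 78, 88, 90, 100, 100],
    "other": [40, 50, 54, 58, 66, 72, 88, 100],
})
print(per_kind)  # {'database': 84, 'host': 88, 'virtual machine': 66, 'other': 58}
print(service)   # 58
```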

Computation when a service contains multiple nodes and only one node is impacted

If a service contains multiple nodes and only one node is impacted, the health score of the service is the node score of the impacted node. The node score depends on the severity of the event. For example, if a critical event has impacted the node, the node score, and therefore the service health score, is 90 (100 - 10).

Computation when a service contains only one node and the node is impacted

If a service contains only one node and the node is impacted, the service health score depends on the severity of the event that has impacted the node. For example, if a major event has impacted the node, the node score, and therefore the service health score, is 92 (100 - 8).

Impact propagation and service health score

By default, the impact on the child services is propagated to the parent service, and the health score of the parent service is determined by the health score of the child services. However, as a service designer, you can stop the impact propagation based on your organization's needs. For more information, see Customizing-health-score-and-health-status.

The following examples illustrate how the health score of a parent service is computed if the impact on the child services is propagated to the parent service.

The parent service not impacted by any event

The health score of the impacted child services is 30, 40, and 50.

The health score of the parent service is the lowest health score from across the child services. Therefore, the health score of the parent service is 30.

The parent service impacted by events

The parent service has been impacted by four Critical severity events. As a result, the health score of the parent service is 20. The health scores of the impacted child services are 30, 40, and 50.

The health score of the parent service is the lowest of its own score and the scores of its child services. Therefore, the health score of the parent service is 20.
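Both propagation examples reduce to taking a minimum. The following sketch assumes a hypothetical `parent_health_score` function with a `propagate` flag that stands in for the option to stop impact propagation.

```python
def parent_health_score(own_score, child_scores, propagate=True):
    """Return the parent's score; children count only when impact is propagated."""
    if not propagate:
        return own_score
    return min([own_score, *child_scores])


# Parent not impacted by any event: its own score stays at the default of 100.
print(parent_health_score(100, [30, 40, 50]))  # 30
# Parent impacted by events, with an own score of 20.
print(parent_health_score(20, [30, 40, 50]))   # 20
```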

Important

Consider a service model with 3 nodes, where a circular relationship is created, as shown in the example.
cyclic service model.png

When an event is generated for node A, the impact is propagated to the remaining two nodes because of the service model's circular nature.

Here, when Node A receives an event for the first time, the impacting child services count for Node A is 0, Node B has an impacting child services count of 1, and Node C has an impacting child services count of 2.

The count varies for each node the first time. This happens because, when the event occurs at Node A, Node B and Node C are not yet directly impacted. Hence, the impacting child services count for Node A is 0.

Similarly, when the impacted child services count for Node B is calculated, by that time, Node C is not directly impacted. So, the impacting child services count for Node B is 1. 

When the impacted child services count for Node C is calculated, by that time, Nodes A and B are directly impacted. Hence, the impacting child services count for Node C is 2.

When an event occurs for the second time, the impacting child count for all three nodes will be two because all three nodes were already impacted during the first event cycle.

Balancing profiles and service health score

As a service designer, you can use a balancing profile to specify a threshold by selecting a certain number or percentage of CIs to make sure that the service remains healthy as long as these CIs are healthy. The health score is computed based on the events generated from the selected CIs in the balancing profile. If no balancing profiles are defined, all events for all CIs are considered while computing the health score. For more information, see Adding-balancing-profiles.

Example: Service health score computation without event rules or health indicators

This example illustrates how the health score (10) of the ApexInsurance.live service, a child service of the apexbanking.live service, is computed.

The following figure shows the service model containing the apexbanking.live service and its child services. The highlighted service (ApexInsurance.live) indicates that it is 90% Impacted, which means that its health score is 10.

Hierarchy_25102.png

Assumptions

The following assumptions are used to compute the health score of the ApexInsurance.live service:

  • Due to customizations in the health score settings, for every Critical event on a node in the ApexInsurance.live service, the health score of the node is reduced by 25.
    For information about customizations, see Customizing-health-score-and-health-status.
  • For every Major event on a node, the health score of the node is reduced by 8.
  • For every Critical health indicator event on a node, the health score of the node is reduced by 20. 

Service topology

The following figure shows the topology of the ApexInsurance.live service:

ApexInsurance_Topology.png

The nodes have been grouped by their kinds. The impact indicator icon (ImpactIndicator.png) indicates that various nodes in that node kind group have been impacted by events.

The topology contains the following node kind groups:  

  • Host: Consists of ten nodes, two of which belong to the child service of the ApexInsurance.live service. Although the group contains ten nodes, only eight of them are considered for health score computation because the remaining two belong to the child service of the ApexInsurance.live service, not to the service itself.
  • Cluster: Consists of one node, which is impacted by events and is considered for health score computation.  
  • Network Device: Consists of four nodes. The nodes in this group are impacted by events. However, they are not considered for health score computation because they belong to the child service of the ApexInsurance.live service, not to the service itself. 
  • Network Interface: Consists of two nodes. These nodes are not considered for health score computation because they are not impacted by events.

Health score computation for nodes

Let’s first look at how the node score is calculated.

Health score computation for Host nodes

The following figure shows that the Host node kind contains ten nodes and Node 1 is impacted by two Critical events.

HostCriticalEvents_25102.png

The following table lists the health scores for the nodes in the Host node kind:

| Node kind | Node 1 | Node 2 | Node 3 | Node 4 | Node 5 | Node 6 | Node 7 | Node 8 |
|---|---|---|---|---|---|---|---|---|
| Host | 100 – (2*25) = 50 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |

For each critical event on Node 1, its score is reduced by 25. Hence, the health score is 100 – (2*25) = 50. Other nodes have not been impacted by any events. Hence, their score is 100 (default score).

Health score computation for Cluster nodes

The following figure shows that the Cluster node kind contains one node (Node 1), which is impacted by two Critical and five Major events.

ApexService_MajorCriticalEvents.png

The following table lists the health score for the node in the Cluster node kind:

| Node kind | Node 1 |
|---|---|
| Cluster | 100 – (2*25) – (5*8) = 10 |

For each Critical event on Node 1, its score is reduced by 25 and for each Major event, the score is reduced by 8. Hence, the health score is 100 – 50 – 40 = 10. 

Health score computation for the service

The following process is used to compute the ApexInsurance.live service health score (10): 

  1. Node scores are sorted in ascending order by health score.
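    | Node kind | Node 1 | Node 2 | Node 3 | Node 4 | Node 5 | Node 6 | Node 7 | Node 8 |
    |---|---|---|---|---|---|---|---|---|
    | Host | 50 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
    | Cluster | 10 | | | | | | | |
    | | Index 0 | Index 1 | Index 2 | Index 3 | Index 4 | Index 5 | Index 6 | Index 7 |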
  2. Node kind score is calculated based on the node index. The index for the node is calculated based on the following formula: 
    Index = Weightage value percentage of the total number of nodes
    • The number of Host nodes is 8, and the weightage associated with a host node is 35. So, the index is 35% of 8 = 2.8, which is converted to a whole number 2.
      The index points to the third element in the Host row. So, the score for the Host node kind is 100.
    • The number of Cluster nodes is 1, and the weightage associated with a cluster node is 45. So, the index is 45% of 1 = 0.45, which is converted to 0.
      The index points to the first element in the Cluster row. So, the score for the Cluster node kind is 10.
  3. The health score for the service is the lowest node score among all node kinds. The lowest node score is 10. Therefore, the service health score is 10.
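The whole ApexInsurance.live example can be retraced end to end in a short sketch. All function and variable names are hypothetical. The score values come from the assumptions listed for this example, and the Cluster weightage of 45 follows the worked steps above.

```python
# Customized Critical score (25) and default Major score (8) from the assumptions.
CUSTOM_SCORES = {"critical": 25, "major": 8}


def node_score(event_severities):
    """Start at 100 and subtract the score of each event on the node."""
    return max(100 - sum(CUSTOM_SCORES[s] for s in event_severities), 0)


def kind_score(scores, weight_percent):
    """Score at index floor(weight% * node count) in the ascending list."""
    ordered = sorted(scores)
    return ordered[min((weight_percent * len(ordered)) // 100, len(ordered) - 1)]


# Host: eight nodes are considered; Node 1 has two Critical events.
host_scores = [node_score(["critical", "critical"])] + [node_score([])] * 7
# Cluster: one node with two Critical and five Major events.
cluster_scores = [node_score(["critical"] * 2 + ["major"] * 5)]

kind_scores = {
    "host": kind_score(host_scores, 35),        # 35% of 8 = 2.8 -> index 2 -> 100
    "cluster": kind_score(cluster_scores, 45),  # 45% of 1 = 0.45 -> index 0 -> 10
}
service_score = min(kind_scores.values())
print(kind_scores, service_score)  # {'host': 100, 'cluster': 10} 10
```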

Example: Service health score computation with event rules

Assume that you have defined an event rule that states that only events with Critical severity can impact the service health, as shown in the following figure. In such a case, only events with Critical severity are considered for health score computation. Events with other severities are ignored.

EventsRule.png

If the event rule is applied to the Cluster node kind, the health score for Node 1 is calculated as follows:

| Node kind | Node 1 |
|---|---|
| Cluster | 100 – (2*25) = 50 |

In this case, although Node 1 is impacted by both Critical and Major events, only Critical events are considered for computation.

The overall service health score is calculated as follows: 

  1. Node scores are sorted in ascending order by health score. 
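    | Node kind | Node 1 | Node 2 | Node 3 | Node 4 | Node 5 | Node 6 | Node 7 | Node 8 |
    |---|---|---|---|---|---|---|---|---|
    | Host | 50 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
    | Cluster | 50 | | | | | | | |
    | | Index 0 | Index 1 | Index 2 | Index 3 | Index 4 | Index 5 | Index 6 | Index 7 |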
  2. Node kind score is calculated based on the node index. The index for the node is calculated based on the following formula: 
    Index = Weightage value percentage of the total number of nodes 
    • The number of Host nodes is 8, and the weightage associated with a host node is 35. So, the index is 35% of 8 = 2.8, which is converted to a whole number 2.
      The index points to the third element in the Host row. So, the node kind score is 100.
    • The number of Cluster nodes is 1, and the weightage associated with a cluster node is 45. So, the index is 45% of 1 = 0.45, which is converted to 0.
      The index points to the first element in the Cluster row. So, the node kind score is 50.
  3. The health score for the service is the lowest score among all the node kinds. The lowest node kind score is 50. Therefore, the service health score is 50.

The following figure shows that the health score of the ApexInsurance.live service is updated to 50 when an event rule is defined.

ServiceHealthScore_EventRules.png

Example: Service health score computation with health indicators

Assume that you have defined the following health indicators for the ApexInsurance.live service (shown in the following figure):

ActualUsed, Free, KernelSlabMemory, UsedPercent, InErrorsInPercent

HealthIndicators_252.png

When the thresholds of these health indicators are breached, events are generated. For example, the following figure shows events generated for a host node in the apexbanking.live service:

HealthIndicatorEvents_ApexBankingLive_252.png

The following table shows the health score of the nodes:

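| Node kind | Node 1 | Node 2 | Node 3 | Node 4 | Node 5 | Node 6 | Node 7 | Node 8 | Node 9 |
|---|---|---|---|---|---|---|---|---|---|
| Host | 100 – (2*25) = 50 | 100 – (4*20) = 20 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Cluster | 100 – (2*25) – (5*8) = 10 | | | | | | | | |
| | Index 0 | Index 1 | Index 2 | Index 3 | Index 4 | Index 5 | Index 6 | Index 7 | Index 8 |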

The following process is used to compute the service health score:

  1. Node scores are listed in ascending order by health score.
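    | Node kind | Node 1 | Node 2 | Node 3 | Node 4 | Node 5 | Node 6 | Node 7 | Node 8 | Node 9 |
    |---|---|---|---|---|---|---|---|---|---|
    | Host | 20 | 50 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
    | Cluster | 10 | | | | | | | | |
    | | Index 0 | Index 1 | Index 2 | Index 3 | Index 4 | Index 5 | Index 6 | Index 7 | Index 8 |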
  2. Node kind score is calculated based on the node index. The index for the node is calculated based on the following formula:
    Index = Weightage value percentage of the total number of nodes
    • The number of Host nodes is 9, and the weightage associated with a host node is 35. So, the index is 35% of 9 = 3.15, which is converted to a whole number 3.
      The index points to the fourth element in the Host row. So, the node kind score is 100.
    • The number of Cluster nodes is 1, and the weightage associated with a cluster node is 45. So, the index is 45% of 1 = 0.45, which is converted to 0.
      The index points to the first element in the Cluster row. So, the node kind score is 10.
  3. The health score for the service is the lowest node score among all the node kinds. The lowest node score is 10. Therefore, the service health score is 10.


Example: Service health score computation with impact propagation

This example illustrates how the health score (displayed as 10) of the apexbanking.live service, the parent service of multiple child services, is computed.

Hierarchy_banking_score.png

The parent service, apexbanking.live, is impacted by multiple Critical severity events. As a result, the health score of the parent service is 20.

The health scores of the impacted child services are 70, 50, 10, and 70.

The health score of the parent service is the lowest of its own score and the scores of its child services. Therefore, the health score of the parent service is 10.

 
