Understanding service health score


Understanding the current and historical health of a service helps operators or site reliability engineers (SREs) identify the root cause of the service performance degradation and lower the mean time to resolve (MTTR).

As a service designer, you need to understand what a service health score is and how it is computed in BMC Helix AIOps.

Service health

The health score provides insight into the health of a service at a given point in time. The impact severity is assigned to the service based on the health score. By default, a health score of 0 indicates the highest impact and a health score of 100 indicates the best health. Service designers can customize the health score for a service and determine the range for each severity level.

In the BMC Helix AIOps console, the health score of a service is displayed on the service details page. Depending on the health score, a color-coded severity level is assigned to the service. For example, the health score of the Retail-Outlet service is displayed as 68 in the following image. The color code of the health score indicates that the impact is minor. The tooltip shows the health score range for each severity level.

HealhScoreMetric_243.png

Health score computation

The health score of a service depends on the health score of its nodes.

Node health score computation

The maximum health score for any node is 100. The health score of a node depends on the impacted events associated with the node. By default, each event severity is assigned a score as listed in the following table:

Event Severity    Score
Critical          10
Major             8
Minor             6
Warning           4

The following examples illustrate how the health score of a node is computed:

  • If a node has one critical event, its health score = 100 - 10 = 90
  • If a node has one major event, its health score = 100 - 8 = 92
  • If a node has one major and one minor event, its health score = 100 - 8 - 6 = 86
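
The arithmetic in these examples can be expressed as a short Python sketch. The function and constant names are illustrative and are not part of BMC Helix AIOps. The sketch also assumes that the score never drops below 0, which the text above does not state.

```python
# Default score deducted per event severity, taken from the table above.
SEVERITY_SCORE = {"critical": 10, "major": 8, "minor": 6, "warning": 4}
MAX_NODE_SCORE = 100


def node_health_score(event_severities):
    """Return 100 minus the summed severity scores of a node's events."""
    deduction = sum(SEVERITY_SCORE[severity] for severity in event_severities)
    # Assumption: the score is floored at 0. The documentation does not say
    # what happens when the deductions add up to more than 100.
    return max(0, MAX_NODE_SCORE - deduction)


print(node_health_score(["critical"]))        # 90
print(node_health_score(["major"]))           # 92
print(node_health_score(["major", "minor"]))  # 86
```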

Service health score computation

The health score for a service is computed by using causal events that are associated with each of the service nodes and the significance derived from the service topology. The AI/ML algorithm in BMC Helix AIOps assigns weights to the nodes of a service and their relationships. These weights are numbers that signify the importance of a node or relationship when the impact occurs and are used for computing the health score.

A service model can contain multiple nodes of the same or different types or a single node, and one or more nodes can be impacted by events.

Computation when a service contains multiple nodes and multiple nodes are impacted

If a service model contains multiple nodes of different types, such as database and host, and multiple nodes are impacted, by default, the node score is computed based on the weightage assigned to the node kind, as shown in the following table:

Node kind           Weightage value
Database            25%
Host                35%
Virtual machine     35%
Other node kinds    45%

The following examples illustrate how the health score is computed based on the node weightage if multiple nodes are impacted:

Example 1

A service model contains ten nodes (virtual machines) and multiple nodes are impacted. The following table shows the health score of these nodes: 

Node 1   Node 2   Node 3   Node 4   Node 5   Node 6   Node 7   Node 8   Node 9   Node 10
100      80       74       80       70       92       100      88       84       100

The following process is used to compute the service health score:

  1. Node scores are sorted in ascending order by health score.

    Node 5   Node 3   Node 2   Node 4   Node 9   Node 8   Node 6   Node 1   Node 7   Node 10
    70       74       80       80       84       88       92       100      100      100
    Index 0  Index 1  Index 2  Index 3  Index 4  Index 5  Index 6  Index 7  Index 8  Index 9

  2. The index is calculated based on the following formula:
    Index = Weightage value percentage of the total number of nodes
    If the index value is a decimal number, the fractional part is discarded. For example, if the index is 5.67, only the whole number part, 5, is used in the calculation; the fractional part, 0.67, is ignored.
    In this example, the number of virtual machine nodes is 10, and the weightage associated with a virtual machine node is 35%. So, the index is 35% of 10 = 3.5, which is truncated to the whole number 3.
  3. The health score of the service is the score of the node that corresponds to the index. The index position starts at the far left with 0 and moves to the right.
    In this example, index 3 points to Node 4, which has a health score of 80. Therefore, the service health score is 80.
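
The three steps of Example 1 can be reproduced with the following Python sketch. It is only an illustration of the documented arithmetic; the function names are hypothetical and do not correspond to any BMC Helix AIOps API.

```python
def selection_index(weightage_percent, node_count):
    """Weightage percentage of the node count, with the fraction discarded."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return (weightage_percent * node_count) // 100


def score_at_index(node_scores, weightage_percent):
    """Sort the scores in ascending order and return the one at the index."""
    ordered = sorted(node_scores)
    return ordered[selection_index(weightage_percent, len(ordered))]


# Example 1: ten virtual machine nodes (Node 1 to Node 10), weightage 35%.
vm_scores = [100, 80, 74, 80, 70, 92, 100, 88, 84, 100]
print(selection_index(35, 10))        # 3
print(score_at_index(vm_scores, 35))  # 80
```
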
Example 2

A service model contains five database nodes, three host nodes, ten virtual machine nodes, and eight nodes of other kinds, and multiple nodes are impacted.

The following process is used to compute the service health score:

  1. For each node kind, node scores are sorted in ascending order by health score.

    Node kind         Node score sorted in ascending order
    Database          80       84       90       100      100
    Host              74       88       90
    Virtual machine   58       66       66       66       72       78       88       90       100      100
    Other node kinds  40       50       54       58       66       72       88       100
                      Index 0  Index 1  Index 2  Index 3  Index 4  Index 5  Index 6  Index 7  Index 8  Index 9

  2. The index for each node kind is calculated based on the following formula:
    Index = Weightage value percentage of the total number of nodes of that kind
    In this example:
    • The number of database nodes is 5, and the weightage associated with a database node is 25%. So, the index is 25% of 5 = 1.25, which is truncated to the whole number 1.
      The index points to the second element in the Database row. So, the node score is 84.
    • The number of host nodes is 3, and the weightage associated with a host node is 35%. So, the index is 35% of 3 = 1.05, which is truncated to the whole number 1.
      The index points to the second element in the Host row. So, the node score is 88.
    • The number of virtual machine nodes is 10, and the weightage associated with a virtual machine node is 35%. So, the index is 35% of 10 = 3.5, which is truncated to the whole number 3.
      The index points to the fourth element in the Virtual machine row. So, the node score is 66.
    • The number of nodes of other kinds is 8, and the weightage associated with other node kinds is 45%. So, the index is 45% of 8 = 3.6, which is truncated to the whole number 3.
      The index points to the fourth element in the Other node kinds row. So, the node score is 58.
  3. The health score for the service is the lowest node score among all the node kinds. The lowest node score is 58. Therefore, the service health score is 58.
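
Example 2 can be checked with the following Python sketch, which applies the index rule per node kind and then takes the lowest result. The names are illustrative only, and the weightage values are the defaults from the table earlier in this section.

```python
# Default weightage per node kind, in percent, from the weightage table.
DEFAULT_WEIGHTAGE = {"Database": 25, "Host": 35, "Virtual machine": 35}
OTHER_KINDS_WEIGHTAGE = 45


def kind_score(kind, node_scores):
    """Score selected for one node kind by the index rule."""
    weightage = DEFAULT_WEIGHTAGE.get(kind, OTHER_KINDS_WEIGHTAGE)
    ordered = sorted(node_scores)
    index = (weightage * len(ordered)) // 100  # fractional part discarded
    return ordered[index]


def service_health_score(scores_by_kind):
    """Lowest of the scores selected for each node kind."""
    return min(kind_score(kind, scores) for kind, scores in scores_by_kind.items())


example_2 = {
    "Database": [80, 84, 90, 100, 100],
    "Host": [74, 88, 90],
    "Virtual machine": [58, 66, 66, 66, 72, 78, 88, 90, 100, 100],
    "Other node kinds": [40, 50, 54, 58, 66, 72, 88, 100],
}
for kind, scores in example_2.items():
    print(kind, kind_score(kind, scores))  # 84, 88, 66, and 58 respectively
print(service_health_score(example_2))     # 58
```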

Computation when a service contains multiple nodes and only one node is impacted

If a service model contains multiple nodes and only one node is impacted, the health score of the service is the node score of the impacted node. The node score depends on the severity of the event. For example, if a critical event has impacted the node, the node score, and therefore the service health score, is 90 (100 - 10).

The following example illustrates how the health score is computed if only a single node is impacted:

Example 3

A service model contains ten nodes (virtual machines) and only one node is impacted by a critical event. The following table shows the health score of these nodes: 

Node 1   Node 2   Node 3   Node 4   Node 5   Node 6   Node 7   Node 8   Node 9   Node 10
100      90       100      100      100      100      100      100      100      100

Because only one node is impacted, the service health score is 90.
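
A minimal Python sketch of this rule follows. It assumes that a node counts as impacted when its score is below 100; the function name is hypothetical, and the sketch returns None when the rule does not apply.

```python
MAX_NODE_SCORE = 100


def single_impacted_node_score(node_scores):
    """Return the impacted node's score if exactly one node is impacted."""
    # Assumption: "impacted" means a node score below the maximum of 100.
    impacted = [score for score in node_scores if score < MAX_NODE_SCORE]
    if len(impacted) == 1:
        return impacted[0]
    return None  # zero or several impacted nodes: this rule does not apply


example_3 = [100, 90, 100, 100, 100, 100, 100, 100, 100, 100]
print(single_impacted_node_score(example_3))  # 90
```

For comparison, applying the index rule from Example 1 to these scores would select position 3 of the sorted list, which is 100, so the two rules give different results for this input.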

Computation when a service contains only one node and the node is impacted

If a service model contains only one node and the node is impacted, the service health score depends on the severity of the event that has impacted the node. For example, if a major event has impacted the node, the node score, and therefore the service health score, is 92 (100 - 8).

The following example illustrates how the health score is computed if a service model contains only one node and it is impacted:

Example 4

A service model contains a single node and it is impacted by a major event. The following table shows the health score of this node:

Node 1

92

Because only one node is present and it is impacted, the service health score is 92.

Advanced options affecting service health score computation

By default, all events are considered for computing the health score of a service, and the health score is propagated from the child services to a parent service. However, you can customize how the health score should be computed and whether the impact should be propagated by using the following configuration options:

  • Health indicators
  • Event rules
  • Balancing profiles
  • Health severity score and status
  • Health impact propagation

Watch the following video to get an overview of the advanced service health score configuration options:

Watch the YouTube video about the advanced service health score configuration in BMC Helix AIOps.

 

Health indicators

You can define one or more metrics associated with a service as health indicators that represent the overall health of the service. For example, if you are using synthetic transactions to measure the availability and response time of a web application, those availability and response time metrics are good candidates to be health indicators. When you define health indicators, you associate thresholds with them. When these thresholds are breached, the service health score reflects that the service is no longer completely healthy. For more information, see Adding health indicators.

The thresholds associated with service health indicators are also used for service predictions. For more information, see Predicting and proactively resolving service outages.

Important

If no health indicators are defined for a service, all the metrics associated with the service for which alarm thresholds are defined have the potential to impact the health score. Any alarm generated for any CI that is part of the service will affect the score. In this scenario, it is not necessary to define health indicators. However, not all metrics are of equal importance. Some metrics, such as those that represent performance and availability, are better indicators of service health than others. If you have metrics like these for a service, consider defining them as health indicators.

Event rules

By default, the health score for an impacted service is computed based on the events generated on all the CIs that are part of the service. However, as a service designer, you can define event rules to consider only specific events based on the impacted CIs, event severity, or message. For example, you can define a rule to consider only the events with Critical severity for computing the health score.

If you have added both health indicators and event rules for a service, events that are generated due to a threshold breach of these metrics and that match the criteria defined in the event rules are considered for computing the health score. For more information, see Adding event rules.
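
The effect of an event rule can be pictured as a filter over the events of a service. The following Python sketch is purely illustrative: the Event and EventRule classes and their fields are hypothetical, and the way criteria combine (all criteria within a rule must match, and an event is kept if any rule matches) is an assumption that the text above does not confirm.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Event:
    ci: str
    severity: str
    message: str


@dataclass
class EventRule:
    # A criterion left as None is not checked (assumption).
    cis: Optional[Set[str]] = None
    severities: Optional[Set[str]] = None
    message_contains: Optional[str] = None

    def matches(self, event):
        if self.cis is not None and event.ci not in self.cis:
            return False
        if self.severities is not None and event.severity not in self.severities:
            return False
        if self.message_contains is not None and self.message_contains not in event.message:
            return False
        return True


def events_for_health_score(events, rules):
    """Without rules, all events count; otherwise only matching events do."""
    if not rules:
        return list(events)
    return [event for event in events if any(rule.matches(event) for rule in rules)]


events = [
    Event("host-1", "critical", "CPU utilization above threshold"),
    Event("host-2", "warning", "Disk space low"),
]
critical_only = EventRule(severities={"critical"})
print([e.ci for e in events_for_health_score(events, [critical_only])])  # ['host-1']
```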

Combining health indicators and event rules

The following table describes how the health score is computed when either health indicators or event rules or both are defined for a service:

Metrics defined as health indicators: Yes
Event rules defined: No

Events considered for health score computation:

  • Events generated for the metrics that are defined as health indicators
  • All other events generated for any CI that is associated with the service

By default, the score reduction for an event generated for a health indicator is double the score reduction for an event generated for a CI. For more information, see Customizing health score and health status.

Example: A service is associated with three metrics: Disk Space Used, CPU Utilization, and Memory Utilization, and you have defined Disk Space Used and CPU Utilization as health indicators. If a critical event is generated for CPU Utilization, the node health score is reduced by 20. If a critical event is generated for any CI, the node health score is reduced by 10.

Metrics defined as health indicators: Yes
Event rules defined: Yes

Events considered for health score computation:

  • Events that are generated on the health indicators due to associated policies
  • Events generated on metrics other than health indicators that satisfy the defined rules

Example: If you have defined the CPU Utilization and Memory metrics as the health indicators and defined an event rule so that events with only the critical severity are considered, events with only the critical severity for these metrics are considered.

Metrics defined as health indicators: No
Event rules defined: Yes

Events considered for health score computation: Only the events that satisfy the event rules are considered.

Example: If you have defined an event rule that considers only events with the warning severity, then all events with the warning severity are considered.

Metrics defined as health indicators: No
Event rules defined: No

Events considered for health score computation: All events are considered.

Example: If you have not defined health indicators or event rules, events with all severities for all the CIs that are part of a service are considered.
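
For the case in which health indicators are defined without event rules, the doubled score reduction can be sketched as follows. The names are illustrative, and the multiplier of 2 is the default described above, which can be customized.

```python
# Default score deducted per event severity.
SEVERITY_SCORE = {"critical": 10, "major": 8, "minor": 6, "warning": 4}
# By default, a health indicator event reduces the score twice as much.
HEALTH_INDICATOR_MULTIPLIER = 2


def event_deduction(severity, from_health_indicator):
    """Score reduction caused by a single event."""
    base = SEVERITY_SCORE[severity]
    if from_health_indicator:
        return base * HEALTH_INDICATOR_MULTIPLIER
    return base


print(event_deduction("critical", from_health_indicator=True))   # 20
print(event_deduction("critical", from_health_indicator=False))  # 10
```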

Balancing profiles

As a service designer, you can use a balancing profile to specify a threshold by selecting a certain number or percentage of CIs to ensure that the service remains healthy as long as these CIs are healthy. The health score is computed based on the events generated from the selected CIs in the balancing profile. If no balancing profiles are defined, all events for all CIs are considered while computing the health score.

You can define balancing profiles for an individual service from the Service details page or for multiple services from the Manage Service Health page. For more information, see Adding balancing profiles and Configuring global settings for service health.

Health severity score and status customization

The health score for a service is computed based on the health configuration defined in BMC Helix AIOps. Service designers can customize the values assigned to the health severity score and status based on an organization's requirements. The maximum health score for any service is 100. For more information, see Customizing health score and health status.

Service impact propagation

By default, the impact on the child services is propagated to the parent service, whose health score is determined by the health score of its child services. For example, if there are three impacted child services with health scores of 30, 40, and 50, the health score of the parent service is the lowest health score across the child services. Therefore, the health score of the parent service is 30.
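
The default propagation in this example amounts to taking the minimum of the child scores, as in the following Python sketch. The function name and the child service names are hypothetical.

```python
def parent_health_score(child_scores):
    """Return the worst child service and its score.

    child_scores maps a child service name to its health score.
    """
    worst_child = min(child_scores, key=child_scores.get)
    return worst_child, child_scores[worst_child]


children = {"child-a": 30, "child-b": 40, "child-c": 50}
print(parent_health_score(children))  # ('child-a', 30)
```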

As a service designer, you can stop the propagation of the impact to a parent service based on your organization's needs. For more information, see Customizing health score and health status.

Note

Consider a service model with 3 nodes, where a circular relationship is created, as shown in the example.
cyclic service model.png

When an event is generated for Node A, the impact is propagated to the remaining two nodes because of the circular nature of the service model.

When Node A receives an event for the first time, the impacting child services count is 0 for Node A, 1 for Node B, and 2 for Node C.

The count differs for each node the first time because, when the event occurs at Node A, Node B and Node C are not directly impacted. Hence, the impacting child services count for Node A is 0.

Similarly, when the impacting child services count for Node B is calculated, Node C is not directly impacted at that time. So, the impacting child services count for Node B is 1.

When the impacting child services count for Node C is calculated, Nodes A and B are directly impacted at that time. Hence, the impacting child services count for Node C is 2.

When an event occurs for the second time, the impacting child services count for all three nodes is 2 because all three nodes were already impacted during the first event cycle.

 
