Service health score, impact score, and metrics
Service health score
The service health score is used to assess the health of a service. The service health score is computed for the selected time range using the impacted events associated with each of the service entities and the significance derived from service topology. The higher the health score, the healthier the service. The service health score ranges from 0 to 100.
In the BMC Helix AIOps console, the service health score is displayed on the service details page as shown in the following image:
The service health is represented using the color-coded severity values as shown in the following image:
If the service severity is Ok, the service is healthy. Any other severity value, such as Critical, Major, or Minor, indicates that the service is impacted.
Impacted entities and Impacted CIs
Impacted Entities on the service details page displays the top 3 impacted business services that are part of the main impacted service. In the following example, BMC Banking Application and Banking Core Servers are the top impacted business services that are part of the BMC Financial Service business service. Impacting Events shows the number of open events impacting the services.
Impacted CIs on the Services page displays the total count of impacted CIs associated with a service. In the following example, there are 5 impacted CIs associated with the cattle_namespace service and 2 impacted CIs associated with the manual_bs_service service.
Service health timeline
The service health score for the selected time range is represented by using the health timeline on the service details page. Here is an annotated screenshot of the service health timeline:
Time range selector. Click the arrow to change the time range. You can select a relative time range of 3, 6, 12, or 24 hours. By default, the Last 3 hours time range is selected. Depending on the time range selected, the timeline is divided into equal-length time slots as shown in the following table:
Time range | Length of each time slot |
---|---|
3 hours | 5 minutes |
6 hours | 5 minutes |
12 hours | 15 minutes |
24 hours | 20 minutes |
Service health score for a specific time slot on the health timeline. Hover over a time slot to view the health score.
Legends to indicate incidents, events, and change requests on the health timeline. Hover over a legend on the health timeline to view event, incident, or change request details. For more information, see:
- Event-noise-reduction-indicator-for-prioritized-triage-and-remediation
- Total-incident-count-and-mean-time-to-resolve-MTTR-indicators-for-a-reliable-incidence-response-process

The health timeline does not display the INFO and OK events.
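The slot lengths listed for each time range determine how many slots the health timeline contains. The following is a minimal sketch of that arithmetic, not product code; the names `SLOT_MINUTES` and `slot_count` are hypothetical and the values are copied from the time range table in this section.

```python
# Slot length in minutes for each selectable time range in hours.
# Values are taken from the time range table; the names are illustrative only.
SLOT_MINUTES = {3: 5, 6: 5, 12: 15, 24: 20}


def slot_count(range_hours: int) -> int:
    """Return the number of equal-length slots for the given time range."""
    slot_minutes = SLOT_MINUTES[range_hours]
    return (range_hours * 60) // slot_minutes


for hours, minutes in SLOT_MINUTES.items():
    print(f"{hours} hours: {slot_count(hours)} slots of {minutes} minutes")
```

Under these values, the default 3-hour range is divided into 36 slots.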
Service impact score
The service impact score indicates how much the service is impacted because of its entities. The service impact score is inversely related to the health score: the higher the service impact score, the lower the service health.
Impact score = 100 - service health score
The service impact score is displayed on the service details page as shown in the following image:
The service health score is 90 in the example and hence the impact score is 10.
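The relationship between the two scores can be sketched in a few lines. This is only an illustration of the formula above; the function name `impact_score` is hypothetical, and the range check assumes the 0 to 100 health score range stated earlier on this page.

```python
def impact_score(health_score: float) -> float:
    """Impact score = 100 - service health score."""
    # The health score is documented to range from 0 to 100.
    if not 0 <= health_score <= 100:
        raise ValueError("health score must be between 0 and 100")
    return 100 - health_score


print(impact_score(90))  # 10, matching the example above
```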
Health Indicator
The health indicator is a measure that provides relevant insight into the behavior of one or more essential CIs that are part of the Service. For example, Hourly average capacity of the File System, Kubernetes configuration status at different time intervals, and so on.
As a service administrator, you can define a health indicator as part of a service model definition. Each health indicator is a single metric or a collection of metrics from CIs. Each metric contains an alarm trigger condition to generate events when a threshold value is violated. For more information about setting alarm policies, see Configuring-alarm-policies. For information about adding a health indicator to a service model, see Modeling-services. For information about viewing the health indicators as part of service details, see Performing-probable-cause-analysis. You can view the health indicator metrics chart associated with the impacted service as shown in the following image:
Metrics
A metric is an important performance indicator in your environment. For example, if you have a Linux monitoring solution and a CPU monitor type, the following list shows a few example metrics that can be monitored:
- Utilization
- Load
- Idle time
- Context Switches
You can view the metrics associated with the top three events associated with the top 3 impacted nodes for a service. An example metrics chart is shown in the following image:
Service Insights
View and analyze the summary of service health behavior and its severity pattern for a pre-defined period of 15 days, derive insights, and take corrective measures to ensure service continuity. The following section lists use case examples for understanding the health behavior and severity patterns:
- Service health score trend and behavior: For a service, the text summary shows the trend of a service over a period, and the highest percentage degradation. The graph represents the daily average health score trend, for the predefined period. For example, see the following behavior of a service for a period of four consecutive days.
Let's consider an example of a service for which the daily average health scores and their percentage changes are described in the following table. The percentage change compared to the previous day is calculated as [(H2 - H1)/(H1)] x 100, where H1 is the average health score on the previous date and H2 is the average health score on the current date.
Date | Daily Avg. Health Score | Calculation | % change in Daily Avg. Health Score, compared to previous day |
---|---|---|---|
11/13/2022 | 30 | - | - |
11/14/2022 | 0 | [(0 - 30)/30] x 100 | -100% (a 100% degradation) |
The summary text shows only the recent percentage degradation of average service health (for example, 100%) and comparison dates. In the corresponding graph, the same is represented by the highlighted zone.
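The percentage change in this example can be reproduced with a short calculation. This is a sketch only: the function name `percent_change` is hypothetical, and it assumes that H2 in the formula is the average health score on the current date, which is inferred from the worked example rather than stated in the formula itself.

```python
def percent_change(previous_avg: float, current_avg: float) -> float:
    """Compute [(H2 - H1) / H1] x 100.

    H1 is the previous day's average health score.
    H2 is assumed to be the current day's average health score.
    """
    if previous_avg == 0:
        # The formula divides by H1, so it is undefined for a previous average of 0.
        raise ValueError("percentage change is undefined when the previous average is 0")
    return (current_avg - previous_avg) / previous_avg * 100


change = percent_change(30, 0)
print(change)       # -100.0
print(abs(change))  # 100.0, the degradation reported in the summary text
```

A negative result indicates degradation. The summary text reports its magnitude.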
- Service health score pattern for severities (Major/Critical): For a service, the summary text shows the daily occurrence time of Critical or Major severity, and the corresponding graph shows the severity occurrence pattern highlighted for the predefined period. For example, see the following severity pattern with two repetitive durations on three consecutive days.
Consider the following table, which shows the daily occurrences of Critical severity for a service. A pattern can be derived from the periodic repetition of the severity: in this example, only the Critical severity occurs, daily between 01:30 and 05:30 hours.
Date range | Severity | Duration |
---|---|---|
11/13/2022 to 11/27/2022 | Critical | 01:30 hr to 05:30 hr |
BMC Helix AIOps displays the pattern for the occurrences of Critical severity during the period. You can identify the pattern from the highlighted sections of the graph.
- Insights for Major/Critical Events: The summary text shows the trend of Major or Critical events over a period, with the recent increase in events and the comparison dates. Let's consider an example of a service for which the daily occurrences of Critical events are described in the following table:
Date | Critical events |
---|---|
11/12/2022 | 10 |
11/13/2022 | 2 |
11/14/2022 | 4 |
11/15/2022 | 4 |
11/16/2022 | 5 |
11/17/2022 | 5 |
11/18/2022 | 6 |
11/19/2022 | 15 |
11/20/2022 | 16 |
11/21/2022 | 15 |
11/22/2022 | 15 |
11/23/2022 | 10 |
11/24/2022 | 15 |
11/25/2022 | 18 |
11/26/2022 | 18 |

Average events (Critical) = (Sum of Critical events) / (Number of days) = (10+2+4+4+5+5+6+15+16+15+15+10+15+18+18) / 15 = 10.53
Daily average = 11 events (Critical)
The summary text shows only the recent increase in Critical events (from 15 to 18) and the comparison dates. In the corresponding graph, the same is represented by the highlighted zone.
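The daily average in this example can be checked with a short calculation. This is a sketch of the arithmetic only, using the counts from the table above; the names `critical_events` and `daily_average` are hypothetical, and rounding to the nearest whole number is an assumption inferred from the 10.53 and 11 values shown.

```python
# Daily Critical event counts for 11/12/2022 through 11/26/2022, from the table above.
critical_events = [10, 2, 4, 4, 5, 5, 6, 15, 16, 15, 15, 10, 15, 18, 18]


def daily_average(counts: list[int]) -> float:
    """Return (sum of counts) / (number of days)."""
    return sum(counts) / len(counts)


average = daily_average(critical_events)
print(round(average, 2))  # 10.53
print(round(average))     # 11, assuming rounding to the nearest whole event
```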
- Insights for Incidents: The summary text shows the trend of incidents over a period, with the recent increase in incidents and the comparison dates. Let's consider an example of a service for which the daily occurrences of incidents are described in the following table:
Date | Incidents |
---|---|
11/12/2022 | 1 |
11/13/2022 | 2 |
11/14/2022 | 3 |
11/15/2022 | 4 |
11/16/2022 | 5 |
11/17/2022 | 7 |
11/18/2022 | 6 |
11/19/2022 | 7 |
11/20/2022 | 9 |
11/21/2022 | 9 |
11/22/2022 | 10 |
11/23/2022 | 11 |
11/24/2022 | 12 |
11/25/2022 | 13 |
11/26/2022 | 15 |

Average incidents = (Sum of incidents) / (Number of days) = (1+2+3+4+5+7+6+7+9+9+10+11+12+13+15) / 15 = 7.6
Daily average = 8 incidents
The summary text shows only the recent change in incident counts (from 13 to 15) and the comparison dates. In the corresponding graph, the same is represented by the highlighted zone.
Where to go from here