Service health score, impact score, and metrics


The health of a service depends on the health of its entities. For example, a JIRA management service contains entities such as Incident Management, Change Management, a web server, and a database. If the Change Management system is down or its performance is highly impacted, the JIRA management service is also impacted. In a service model, each service entity is referred to as a Configuration Item (CI).

Note

The terms service entity and configuration item (CI) are used interchangeably in BMC Helix AIOps.

The service health score and service impact score are the two most important indicators of service health. Together, they provide quick insight into service health and enable you to take timely action. You can also use the Insights tab to view and analyze a summary of service health behavior and its severity pattern over a predefined period.

Service health score

The service health score is computed for the selected time range using the impacted events associated with each of the service entities and the significance derived from service topology. The higher the health score, the healthier the service. The service health score ranges from 0 to 100.

The service health score is displayed on the service details page as shown in the following image:

health.png

The service health is represented using the color-coded severity values as shown in the following image:

node_scores.png

If the service severity is Ok, the service is healthy. Any other severity value, such as Critical, Major, or Minor, indicates that the service is impacted.

Impacted entities and Impacted CIs

Impacted Entities on the service details page displays the top 3 impacted business services that are part of the main impacted service. In the following example, BMC Banking Application and Banking Core Servers are the top impacted business services that are part of the BMC Financial Service business service. Impacting Events shows the number of open events impacting the services. 

Impacted_entities.png

Impacted CIs on the Services page displays the total count of impacted CIs associated with a service. In the following example, 5 impacted CIs are associated with the cattle_namespace service and 2 impacted CIs are associated with the manual_bs_service service.

impacted_CIs.png

Service health timeline

The service health score for the selected time range is represented by using the health timeline on the service details page. Here is an annotated screenshot of the service health timeline:

health_score_example.png

  1. Time range selector. Click the arrow to change the time range. You can select a relative time range of 3, 6, 12, or 24 hours. By default, the Last 3 hours time range is selected. Depending on the selected time range, the timeline is divided into equal-length time slots as shown in the following table:

    Time range    Length of each time slot
    3 hours       5 minutes
    6 hours       5 minutes
    12 hours      15 minutes
    24 hours      20 minutes

  2. Service health score for a specific time slot on the health timeline. Hover over a time slot to view the health score. 

    What happens when there are different health scores in the same time slot?

    If a time slot contains more than one health score, the latest health score is displayed when you hover over the time slot. For example, if the five-minute time slot between 8:00 PM and 8:05 PM contains three health scores recorded in the order 20, 60, and 50, the latest health score, 50, is displayed when you hover over this time slot.

    The length of each time slot is fixed, but the time window that each slot covers shifts with the current time. For example, assume that the current system time is 8:00 PM and you select the Last 3 hours option on the health timeline. In this case, the slots are plotted between 5:00 PM and 8:00 PM in 5-minute intervals, starting with 5:00 to 5:05 PM and ending with 7:55 to 8:00 PM. If you view the same health timeline at 8:01 PM for the last 3 hours, the slots start with 5:01 to 5:06 PM and end with 7:56 to 8:01 PM.

    Therefore, the slot length stays constant for the selected time range (see the time range table), but the time window of each slot changes dynamically based on the current selection and system time.

  3. Legends to indicate incidents, events, and change requests on the health timeline. Hover over a legend on the health timeline to view event, incident, or change request details. For more information, see:

    How do I derive insights using the event, incident, and change request legends?

    When you hover over a legend on the health timeline, the corresponding details are displayed along with the timestamp. If you find unusual occurrences of events, incidents, or change requests, hover over the legends, collate the details, and analyze them to resolve the issues.

    Example

    derive_insights.png

    In the preceding example:
    The event count is 1 on 15/02/2021 at 12:37, and the event count is 10 on 15/02/2021 at 13:32. You can check the details of the events that occurred during this time slot and analyze the associated information to determine and resolve the error condition. Click the time slot to view the event details and to perform root cause isolation. For more information, see Performing-ML-based-root-cause-isolation-of-an-impacted-service
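The time-slot behavior described in items 1 and 2 of the preceding list can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the product's implementation: the function names `build_slots` and `displayed_score`, and the shape of the `samples` list, are hypothetical. The slot lengths come from the time range table.

```python
from datetime import datetime, timedelta

# Slot lengths from the time range table (hours -> minutes per slot).
SLOT_MINUTES = {3: 5, 6: 5, 12: 15, 24: 20}


def build_slots(now, range_hours):
    """Return (start, end) pairs covering the last `range_hours` hours up to `now`."""
    slot = timedelta(minutes=SLOT_MINUTES[range_hours])
    start = now - timedelta(hours=range_hours)
    slots = []
    while start < now:
        slots.append((start, start + slot))
        start += slot
    return slots


def displayed_score(slot, samples):
    """Return the latest score recorded inside the slot, or None if there is none.

    `samples` is a list of (timestamp, score) pairs (a hypothetical shape).
    """
    start, end = slot
    in_slot = [(ts, score) for ts, score in samples if start <= ts < end]
    if not in_slot:
        return None
    return max(in_slot, key=lambda pair: pair[0])[1]


# Viewed at 8:00 PM, the last 3 hours run from 5:00-5:05 PM to 7:55-8:00 PM.
slots_at_8pm = build_slots(datetime(2021, 2, 15, 20, 0), 3)
# Viewed at 8:01 PM, every slot keeps its length but shifts by one minute.
slots_at_801pm = build_slots(datetime(2021, 2, 15, 20, 1), 3)
```

With three scores of 20, 60, and 50 recorded in that order inside one slot, `displayed_score` returns 50, which matches the hover behavior described in item 2.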

Service impact score

The service impact score indicates how much the service is impacted by its entities. The impact score is the complement of the health score on the 0 to 100 scale: the higher the service impact score, the lower the service health.
Impact score = 100 - service health score

In BMC Helix AIOps, the service impact score is displayed on the service details page as shown in the following image:

health.png

The service health score is 64 in the example; hence, the impact score is 100 - 64 = 36.
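As a minimal sketch, the relationship between the two scores can be written as a single function. The function name `impact_score` and the range check are assumptions made for illustration; only the formula itself comes from this page.

```python
def impact_score(health_score):
    """Impact score = 100 - service health score, for a health score from 0 to 100."""
    if not 0 <= health_score <= 100:
        raise ValueError("health score must be between 0 and 100")
    return 100 - health_score


print(impact_score(64))  # 36
```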

Health Indicator

The health indicator is a measure that provides relevant insight into the behavior of one or more essential CIs that are part of the service. Examples include the hourly average capacity of a file system and the Kubernetes configuration status at different time intervals.

As a service administrator, you can define a health indicator as part of a service model definition. Each health indicator is a single metric or a collection of metrics from CIs. Each metric contains an alarm trigger condition that generates events when a threshold value is violated. For more information about setting alarm policies, see Alarm Policies. For information about adding a health indicator to a service model, see Modeling-services. For information about viewing the health indicators as part of service details, see Performing-ML-based-root-cause-isolation-of-an-impacted-service. You can view the health indicator metrics chart associated with the impacted service as shown in the following image:

Metrics

A metric is an important performance indicator in your environment. For example, if you have a Linux monitoring solution and a CPU monitor type, the following list shows a few example metrics that can be monitored:

  • Utilization
  • Load
  • Idle time
  • Context Switches

You can view the metrics for the top three events that are associated with the top three impacted nodes of a service. An example metrics chart is shown in the following image:

e2e_metrics.png

Service health insights

The following video (1:53) shows a high-level overview of the Service Insights feature in BMC Helix AIOps:

https://youtu.be/NAy2pEMzjF8

View and analyze the summary of service health behavior and its severity pattern for a pre-defined period of 15 days, derive insights, and take corrective measures to ensure service continuity. The following section lists use case examples for understanding the health behavior and severity patterns:

  • Service health behavior: For a service, the summary text shows the highest percentage degradation, and the graph shows the daily average health score trend for the predefined period. For example, see the following behavior pattern for a period of four consecutive days.

Let's consider an example of a Financial Service, for which the daily average health scores and their percentage changes are described in the following table. The percentage change in the daily average health score, compared to the previous day, is calculated as [(H2 - H1)/H1] x 100, where H1 is the average health score of the previous date and H2 is the average health score of the current date.

Date          Daily Avg. Health Score    Calculation                       % change compared to previous day
06/13/2022    62.50                      -                                 -
06/14/2022    61.25                      [(61.25 - 62.50)/62.50] x 100     -2.00%
06/15/2022    60.00                      [(60.00 - 61.25)/61.25] x 100     -2.04%
06/16/2022    60.79                      [(60.79 - 60.00)/60.00] x 100     +1.32%
06/17/2022    59.86                      [(59.86 - 60.79)/60.79] x 100     -1.53%
06/18/2022    60.00                      [(60.00 - 59.86)/59.86] x 100     +0.23%

BMC Helix AIOps displays only the highest percentage degradation of average service health (e.g., 2.04%) in the summary text with the respective comparison dates. From the corresponding daily average health score trend, you can identify the zone of highest percentage degradation.

Insight_Behavior.png
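The day-over-day calculation in the Financial Service example can be reproduced with a short script. This is a sketch for checking the arithmetic only; the function names `percent_changes` and `highest_degradation` are hypothetical and do not describe how BMC Helix AIOps computes the summary internally.

```python
# Daily average health scores from the Financial Service example.
daily_avg = [
    ("06/13/2022", 62.50),
    ("06/14/2022", 61.25),
    ("06/15/2022", 60.00),
    ("06/16/2022", 60.79),
    ("06/17/2022", 59.86),
    ("06/18/2022", 60.00),
]


def percent_changes(series):
    """Return (previous_date, current_date, % change) for each consecutive pair."""
    changes = []
    for (d1, h1), (d2, h2) in zip(series, series[1:]):
        changes.append((d1, d2, (h2 - h1) / h1 * 100))
    return changes


def highest_degradation(series):
    """Return the entry with the most negative % change, or None if nothing degraded."""
    drops = [c for c in percent_changes(series) if c[2] < 0]
    return min(drops, key=lambda c: c[2]) if drops else None


prev_day, day, change = highest_degradation(daily_avg)
print(f"{prev_day} -> {day}: {change:.2f}%")  # 06/14/2022 -> 06/15/2022: -2.04%
```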

  • Service severity pattern: For a service, the summary text shows the daily occurrence time of Critical or Major severity, and the corresponding graph highlights the severity occurrence pattern for the predefined period. For example, see the following severity pattern with two repeating durations on three consecutive days.

Consider the following table, which shows the daily occurrences of Major and Critical severities for a Financial Service. To derive a pattern, look for severities that repeat periodically. Only the Major severity occurs daily between 21:30 and 05:30 hours. The occurrences of Critical severity do not follow a pattern.

Date          Severity    Duration
06/14/2022    Major       21:30 to 05:30 hrs
06/14/2022    Critical    07:00 to 10:00 hrs
06/15/2022    Major       21:30 to 05:30 hrs
06/15/2022    Critical    09:00 to 11:00 hrs
06/16/2022    Major       21:30 to 05:30 hrs
06/16/2022    Critical    07:00 to 10:00 hrs

BMC Helix AIOps displays the pattern for the occurrences of Major severity during the period. You can identify the pattern from the highlighted sections of the graph. In this example, the graph is shown only for the Major severity because it repeats at the same time every day. No graph is shown for the Critical severity because its regularity is broken on 06/15/2022.

Insights_Pattern.png
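The reasoning in the severity example can be illustrated with a simplified check. The sketch assumes that a pattern exists only when a severity occurs in exactly the same time window on every date in the period. The actual detection logic in BMC Helix AIOps is not described on this page, so both the rule and the function name `daily_patterns` are assumptions.

```python
from collections import defaultdict

# Occurrences from the Financial Service example: (date, severity, start, end).
occurrences = [
    ("06/14/2022", "Major", "21:30", "05:30"),
    ("06/14/2022", "Critical", "07:00", "10:00"),
    ("06/15/2022", "Major", "21:30", "05:30"),
    ("06/15/2022", "Critical", "09:00", "11:00"),
    ("06/16/2022", "Major", "21:30", "05:30"),
    ("06/16/2022", "Critical", "07:00", "10:00"),
]


def daily_patterns(rows):
    """Return {severity: (start, end)} for windows that repeat on every date."""
    all_dates = {date for date, _, _, _ in rows}
    # severity -> time window -> set of dates on which that window occurred
    windows = defaultdict(lambda: defaultdict(set))
    for date, severity, start, end in rows:
        windows[severity][(start, end)].add(date)
    patterns = {}
    for severity, by_window in windows.items():
        for window, seen_dates in by_window.items():
            if seen_dates == all_dates:
                patterns[severity] = window
    return patterns


print(daily_patterns(occurrences))  # {'Major': ('21:30', '05:30')}
```

Critical is absent from the result because its window on 06/15/2022 differs from the window on the other two dates.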

Where to go from here

Monitoring-services

 
