Service health score and health timeline


Observability is key to ensuring that the business services are healthy and performing at optimum levels. Understanding the current and historical health of a service based on indicators, abnormalities, and events of the components within the service model helps operators or site reliability engineers (SREs) to identify the root causes of service performance degradation or health impact, lower mean-time-to-resolve (MTTR), and avoid service disruptions.

BMC Helix AIOps achieves the above business goals by using AI/ML-powered algorithms so organizations can understand the health of their services and identify the causes of service problems faster.

Scenario

APEX Global IT monitors a JIRA management service that contains entities, such as the Incident Management application, Change Management application, Web server, and a database. Each of these entities may further have child nodes, such as network interface cards, hosts, namespaces, software instances, and so on. 

Assume that the Change Management application is down for three hours because of a network error, on a day when the ITOps team has scheduled a system upgrade push. The outage severely degrades the application's performance and, if not addressed quickly, causes unnecessary downtime for the JIRA management service. During the three hours when the upgrade is scheduled, many impactful events are generated, which puts the JIRA management service into an impacted state with major severity. Because the upgrade involves a time-critical fix to the JIRA service, all JIRA requests are blacked out, which has a major impact on availability.

During these three hours of high impact, events and alarms flood the monitoring system. These events are plotted on the health timeline slots. The Change Management system and the network nodes are listed as the top impacted nodes. Events from these nodes are listed as the top causal events.

Health score

The health score and health timeline provide insight into the health of a service over the last 24 hours. The impact severity is assigned based on the health score. By default, a health score of 0 indicates the highest impact and a health score of 100 indicates the best (OK) health. Service designers can customize the health score for any service and define the score range for each severity level. For more information, see Understanding-service-health-score.

Important

When a service model is updated or new CIs are added to an existing model, health score computation for the newly added CIs begins 15 minutes after the service is updated. 

The health score of a service is displayed on the service details page in the BMC Helix AIOps console. Depending on the health score, a color-coded severity level is assigned to the service. 

For example, the health score of the Train Ticket Reservation System service is displayed as 48 in the following image. The color code of the health score indicates that it is major. The tooltip shows the health score range for each severity level.

HealhScoreMetric_23201.png


Health timeline

The health of each service is monitored on a 24-hour timeline. This health timeline is made up of time slices, or slots. Events, incidents, and changes are plotted against the time slots for the last 24 hours. Each slot is 5, 15, or 20 minutes long, depending on whether the health timeline is plotted for the last 3 or 6 hours, 12 hours, or 24 hours, as shown in the following image:

HealthScore_Example_23102.png

Item

Description

Time range selector

List for selecting the time range: Last 3 (default), 6, 12, or 24 hours. Based on the selected time range, the timeline is divided into slots as follows:

  • Last 3 hours or 6 hours - 5 minutes per slot
  • Last 12 hours - 15 minutes per slot
  • Last 24 hours - 20 minutes per slot

Health score

Each time slot shows the health score at that time. Hover over a time slot to view its health score.

Incident, Event, Change

Legends to indicate incidents, events, and change requests on the health timeline.

Hover over a legend on the health timeline to view the event, incident, or change request volume for that time slot. On the timeline, a change is plotted according to the Start Date of the change request. The health timeline does not display INFO, OK, or BLACKOUT events.

You can select a different time range to view the events, incidents, and changes for that duration.
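
The slot lengths listed for the time range selector can be expressed as a small lookup. The following Python sketch is illustrative only: `SLOT_MINUTES` and `slot_count` are hypothetical names and are not part of any BMC Helix AIOps API. The slot counts are derived from the documented slot lengths by simple division.

```python
# Slot length in minutes for each selectable time range (in hours),
# as listed for the time range selector. Names here are hypothetical.
SLOT_MINUTES = {3: 5, 6: 5, 12: 15, 24: 20}


def slot_count(range_hours: int) -> int:
    """Return how many slots the timeline has for the given time range."""
    if range_hours not in SLOT_MINUTES:
        raise ValueError(f"unsupported time range: {range_hours} hours")
    return range_hours * 60 // SLOT_MINUTES[range_hours]


for hours in sorted(SLOT_MINUTES):
    print(f"Last {hours} hours: {SLOT_MINUTES[hours]} minutes per slot, "
          f"{slot_count(hours)} slots")
```

Under these assumptions, the Last 3 hours view contains 36 slots, and the Last 6 hours and Last 24 hours views each contain 72 slots.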

What happens when there are different health scores in the same time slot?

If there are different health scores in the same time slot, the latest health score is displayed when you hover over the time slot. For example, consider a five-minute time slot between 8:00 P.M. and 8:05 P.M. If three health scores, such as 20, 60, and 50, are recorded in this time slot in that order, the latest health score (50) is displayed when you hover over the slot.
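
The "latest score wins" rule can be sketched in Python as follows. This is a minimal illustration, not the product's implementation: the function name `latest_score_in_slot`, the reading timestamps, the date, and the choice to treat the slot as start-inclusive and end-exclusive are all assumptions made for the example.

```python
from datetime import datetime


def latest_score_in_slot(readings, slot_start, slot_end):
    """Return the most recent score with slot_start <= timestamp < slot_end.

    readings is a list of (timestamp, score) pairs in any order.
    Returns None if no reading falls inside the slot.
    """
    in_slot = [(ts, score) for ts, score in readings
               if slot_start <= ts < slot_end]
    if not in_slot:
        return None
    # The reading with the greatest timestamp is the latest one.
    return max(in_slot, key=lambda reading: reading[0])[1]


# Hypothetical readings inside the 8:00 P.M. to 8:05 P.M. slot.
readings = [
    (datetime(2024, 1, 1, 20, 1), 20),
    (datetime(2024, 1, 1, 20, 3), 60),
    (datetime(2024, 1, 1, 20, 4), 50),
]
slot_start = datetime(2024, 1, 1, 20, 0)
slot_end = datetime(2024, 1, 1, 20, 5)
print(latest_score_in_slot(readings, slot_start, slot_end))  # 50
```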

The length of each time slot is fixed, but the time window that each slot covers shifts with the current time. For example, if the current system time is 8:00 P.M. and you select the Last 3 Hours option from the Health Timeline, the slots are plotted between 5:00 P.M. and 8:00 P.M. The range is split into 5-minute slots, starting with 5:00 to 5:05 P.M. and ending with 7:55 to 8:00 P.M. If you view the same Health Timeline at 8:01 P.M. for the last 3 hours, the slots start with 5:01 to 5:06 P.M. and end with 7:56 to 8:01 P.M.

Therefore, the slot length stays constant for the selected time range (see Time range selector), but the time window covered by each slot shifts with the current system time.
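
The sliding behavior described above can be reproduced with a short Python sketch. It assumes that slots are laid out back to back, ending at the current system time; `slot_windows` is a hypothetical helper name and the date is arbitrary.

```python
from datetime import datetime, timedelta


def slot_windows(now, range_hours, slot_minutes):
    """Return (start, end) pairs covering the last range_hours before now."""
    step = timedelta(minutes=slot_minutes)
    start = now - timedelta(hours=range_hours)
    windows = []
    while start < now:
        windows.append((start, start + step))
        start += step
    return windows


# Last 3 hours with 5-minute slots, viewed at 8:00 P.M. and at 8:01 P.M.
for now in (datetime(2024, 1, 1, 20, 0), datetime(2024, 1, 1, 20, 1)):
    windows = slot_windows(now, range_hours=3, slot_minutes=5)
    first, last = windows[0], windows[-1]
    print(f"{now:%H:%M}: first slot {first[0]:%H:%M}-{first[1]:%H:%M}, "
          f"last slot {last[0]:%H:%M}-{last[1]:%H:%M}, {len(windows)} slots")
```

In both cases the slot length and the number of slots are the same; only the start and end times of the slots move by one minute.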

How do I derive insights using the event, incident, and change request legends?

When you hover over a legend on the health timeline, the corresponding details are displayed along with the timestamp. If you find unusual occurrences of events, incidents, or change requests, hover over the legends, and analyze them to resolve the issues.

Example

HealthScore_Example_23102.png

In the example, the event count is 1 at 17:17 and 3 at 18:17. You can check the details of the events that occurred during these time slots and analyze the associated information to determine and resolve the error condition. Click a time slot to view the event details and to perform root cause isolation. For more information, see Performing ML-based root cause isolation of an impacted service.

Where to go from here

To view the service health, impact, and health timeline, see Monitoring-service-health.