Service health score and health timeline
Observability is key to ensuring that business services are healthy and performing at optimum levels. Understanding the current and historical health of a service, based on the indicators, abnormalities, and events of the components within the service model, helps operators and site reliability engineers (SREs) identify the root causes of service performance degradation or health impact, lower the mean time to resolve (MTTR), and avoid service disruptions.
BMC Helix AIOps achieves the above business goals by using AI/ML-powered algorithms so organizations can understand the health of their services and identify the causes of service problems faster.
Scenario
APEX Global IT monitors a JIRA management service that contains entities, such as the Incident Management application, Change Management application, Web server, and a database. Each of these entities may further have child nodes, such as network interface cards, hosts, namespaces, software instances, and so on.
Let's assume that the Change Management application is down for three hours because of a network error, on a day when the ITOps team has scheduled a system upgrade push. The outage severely degrades the application's performance and, if not attended to quickly, causes unnecessary downtime for the JIRA management service. During the three hours when the upgrade was scheduled, many impactful events were generated, which put the JIRA management service into an impacted state with major severity. Because the upgrade involved a time-critical fix to the JIRA service, all JIRA requests were blacked out, which resulted in a major impact on availability.
During the past three hours, when the impact was high, events and alarms were flooding the monitoring system. These events are plotted on the health timeline slots. The Change Management system and the network nodes are listed as the top impacted nodes. Events from these nodes are listed as the top causal events.
Health score
The health score and health timeline provide insight into the health of a service over the last 24 hours. The impact severity is assigned based on the health score. By default, a health score of 0 indicates the highest impact and a score of 100 indicates the best (OK) health. Service designers can customize the health score for any service and define the score range for each severity level. For more information, see Understanding-service-health-score.
The health score of a service is displayed on the service details page in the BMC Helix AIOps console. Depending on the health score, a color-coded severity level is assigned to the service.
For example, the health score of the Train Ticket Reservation System service is displayed as 48 in the following image. The color code of the health score indicates that it is major. The tooltip shows the health score range for each severity level.
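The mapping from a health score to a severity level can be sketched as a simple range lookup. The following Python example is only an illustration: the thresholds, the severity labels other than major and OK, and the `severity_for` function name are hypothetical and do not represent the product defaults. The ranges were chosen so that a score of 48 maps to major, as in the example above.

```python
from typing import List, Tuple

# Hypothetical severity ranges: (lowest score in the range, severity label).
# These thresholds are illustrative assumptions, not BMC Helix AIOps defaults.
SEVERITY_RANGES: List[Tuple[int, str]] = [
    (75, "OK"),
    (50, "Minor"),
    (25, "Major"),
    (0, "Critical"),
]


def severity_for(score: int, ranges: List[Tuple[int, str]] = SEVERITY_RANGES) -> str:
    """Return the severity label whose range contains the given health score."""
    if not 0 <= score <= 100:
        raise ValueError("health score must be between 0 and 100")
    # Ranges are ordered from the highest lower bound to the lowest.
    for lower_bound, label in ranges:
        if score >= lower_bound:
            return label
    raise ValueError("ranges must include a lower bound of 0")


print(severity_for(48))  # Major
```

Because service designers can customize the ranges, the sketch takes them as a parameter instead of hard-coding them in the function.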
Health timeline
The health of each service is monitored on a 24-hour timeline. This health timeline is made up of time slices or slots. Events, incidents, and changes are plotted against the time slots for the last 24 hours. Each slot is 5, 15, or 20 minutes long, depending on whether the health timeline is plotted for the last 3 or 6 hours, 12 hours, or 24 hours, as shown in the following image:
Item | Description |
---|---|
Time range selector | List to select the time range: Last 3 (default), 6, 12, or 24 hours. The slot length depends on the selected time range: 5 minutes for 3 or 6 hours, 15 minutes for 12 hours, and 20 minutes for 24 hours. |
Health score | Each time slot shows the health score at that time. Hover over a time slot to view the score. |
Incident, Event, Change | Legends that indicate incidents, events, and change requests on the health timeline. Hover over a legend on the health timeline to view the event, incident, or change request volumes for that time slot. On the timeline, a change is plotted according to the Start Date of the change request. The health timeline does not display INFO, OK, or BLACKOUT events. You can select a different time range to view the events, incidents, and changes for that duration. |
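The slot arithmetic and event filtering described above can be sketched as follows. This Python example is a minimal illustration, not the product's implementation: the names `SLOT_MINUTES`, `slot_count`, and `bucket_events` are hypothetical, events are modeled as simple (timestamp, severity) pairs, and the mapping of 5-minute slots to both the 3-hour and 6-hour ranges is read from the description above.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

# Slot length in minutes for each selectable time range in hours.
SLOT_MINUTES = {3: 5, 6: 5, 12: 15, 24: 20}

# Event severities that the health timeline does not display.
HIDDEN_SEVERITIES = {"INFO", "OK", "BLACKOUT"}


def slot_count(range_hours: int) -> int:
    """Number of slots on the timeline for the selected time range."""
    return range_hours * 60 // SLOT_MINUTES[range_hours]


def bucket_events(
    events: Iterable[Tuple[datetime, str]], now: datetime, range_hours: int
) -> List[int]:
    """Count the displayable events in each slot of the selected time range."""
    slot = timedelta(minutes=SLOT_MINUTES[range_hours])
    start = now - timedelta(hours=range_hours)
    counts = [0] * slot_count(range_hours)
    for timestamp, severity in events:
        if severity in HIDDEN_SEVERITIES:
            continue  # INFO, OK, and BLACKOUT events are not plotted
        if not start <= timestamp < now:
            continue  # outside the selected time range
        counts[(timestamp - start) // slot] += 1
    return counts


for hours in (3, 6, 12, 24):
    print(hours, "hours:", slot_count(hours), "slots")
```

Under these assumptions, the 3-hour view has 36 slots, the 6-hour and 24-hour views have 72 slots each, and the 12-hour view has 48 slots. A change request would be bucketed the same way, using its Start Date as the timestamp.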
Where to go from here
To view the service health, impact, and health timeline, see Monitoring-service-health.