Monitoring service health indicators

Health indicators are important attributes and key metrics that determine the health of all the configuration items (CIs) defined in a service model. The algorithm used by this product automatically discovers the changes in behavior of these metrics and generates alarm events based on the set threshold limits to indicate whether the service is healthy and available or the performance is degrading. This enables you to be automatically alerted to an issue so you can quickly identify the root cause and resolve the problem.

In BMC Helix AIOps, the health indicator metrics provide relevant insights into the behavior of one or more key CI elements that are part of a service. Examples include the hourly average capacity of the File System and the Kubernetes node status at different time intervals. In these examples, Hourly average capacity and Status at different time intervals are the health indicator metrics, and File System and Kubernetes node are the CI elements.

Monitoring service health

As a service designer, you define health indicator metrics and configure alarm conditions for those indicator metrics as part of a service model definition. Each health indicator is a single metric or a collection of metrics from CIs. Each metric contains an alarm trigger condition to generate events when a threshold value is violated.
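The relationship between a health indicator, its metrics, and their alarm trigger conditions can be illustrated with a minimal sketch. The class name, metric names, and threshold values here are hypothetical and chosen only for illustration; they are not the BMC Helix AIOps alarm policy syntax or API.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical comparison operators; not the product's alarm policy syntax.
COMPARATORS: Dict[str, Callable[[float, float], bool]] = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


@dataclass
class AlarmCondition:
    """One metric with the threshold condition that triggers an event."""
    metric: str
    comparator: str
    threshold: float

    def is_violated(self, value: float) -> bool:
        return COMPARATORS[self.comparator](value, self.threshold)


def evaluate(conditions: List[AlarmCondition], samples: Dict[str, float]) -> List[str]:
    """Return one event message for each condition whose threshold is violated."""
    events = []
    for cond in conditions:
        value = samples.get(cond.metric)
        if value is not None and cond.is_violated(value):
            events.append(
                f"{cond.metric} {cond.comparator} {cond.threshold}: observed {value}"
            )
    return events


# A health indicator modeled as a collection of metrics (illustrative names).
health_indicator = [
    AlarmCondition("cpu_utilization_pct", ">", 90.0),
    AlarmCondition("filesystem_free_pct", "<", 10.0),
]
events = evaluate(
    health_indicator,
    {"cpu_utilization_pct": 95.5, "filesystem_free_pct": 42.0},
)
print(events)
```

In this sketch, only the CPU metric violates its threshold, so a single event is generated.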

Operators or site reliability engineering (SRE) teams view and understand the behavior of those selected metrics as health indicators from the Health Indicators view. The Health Indicators view graphically displays the following information to help operators make informed decisions and take appropriate actions:

Baseline graph for the selected service metrics showing normal or expected behavior. Baselines are derived and plotted for key performance indicators (KPIs) such as CPU utilization, network traffic, and Mean Time to Resolve (MTTR) issues based on historical data. BMC Helix AIOps uses the following trends to calculate the baselines:
- Hourly data—Calculates the baseline from the data captured at the same hour each day over 24 hours.
- Weekly data—Calculates the baseline from the data accumulated from the hourly data over six days.
For more information, see Baselines.
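The idea of deriving a baseline from values captured at the same hour on different days can be sketched as follows. This is only an illustration under stated assumptions: the band is computed as the mean plus or minus two standard deviations, which is an assumed technique and not necessarily the calculation that BMC Helix AIOps performs. The function names and sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, Tuple


def hourly_baseline(
    samples: Iterable[Tuple[int, float]], width: float = 2.0
) -> Dict[int, Tuple[float, float]]:
    """Group (hour_of_day, value) samples by hour and return a (low, high) band per hour.

    The band is mean +/- width * standard deviation (an assumption for this sketch).
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)

    bands = {}
    for hour, values in by_hour.items():
        center = mean(values)
        spread = pstdev(values)
        bands[hour] = (center - width * spread, center + width * spread)
    return bands


def is_outside(bands: Dict[int, Tuple[float, float]], hour: int, value: float) -> bool:
    """Return True if the value falls outside the baseline band for that hour."""
    low, high = bands[hour]
    return value < low or value > high


# CPU utilization (%) observed at 09:00 and 14:00 on three different days.
history = [(9, 40.0), (9, 44.0), (9, 42.0), (14, 70.0), (14, 74.0), (14, 72.0)]
bands = hourly_baseline(history)
print(is_outside(bands, 9, 43.0))
print(is_outside(bands, 9, 60.0))
```

A value that is normal at 14:00 can therefore be flagged as abnormal at 09:00, because each hour is compared with its own history.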

Alarm breach pattern for all the selected metrics based on the alarm condition at various specified time intervals. The pattern provides insights into the service availability for the selected time interval.
For example, in a Namespace service with a Kubernetes cluster that has three pods and a deployment node in it, the service designer chooses to configure the following metrics for observing the availability of the node, pod, and container in the Kubernetes cluster:
- Kubernetes Node
  - Status (Ready, Not Ready)
  - Monitoring Status (0: online, 1: offline)
- Kubernetes Pod
  - Ready (Ready or not)
  - Status (Running, Succeeded, Pending, Failed)
- Kubernetes Container
  - Ready (Ready or not)
  - State (Running, Waiting, Terminated)
Event indicators on the baseline graphs, each representing a specific event, such as high memory utilization or a system update. Event indicators include a tooltip that provides more context about the event.
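To illustrate how an alarm breach pattern relates to service availability, the following sketch takes a series of Monitoring Status samples (0: online, 1: offline, as in the Kubernetes Node example above) and derives the offline spans and the share of samples that were online. The sampling interval, function names, and data are hypothetical; this is not how the product computes or renders the pattern.

```python
from typing import List, Tuple


def breach_intervals(samples: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return (start, end) spans during which the status was offline.

    samples: (timestamp_in_minutes, status) pairs sorted by time,
    where status 0 means online and 1 means offline.
    """
    intervals = []
    start = None
    last_ts = None
    for ts, status in samples:
        if status == 1 and start is None:
            start = ts  # a breach begins
        elif status == 0 and start is not None:
            intervals.append((start, ts))  # the breach ends at the first online sample
            start = None
        last_ts = ts
    if start is not None:
        intervals.append((start, last_ts))  # still offline at the end of the window
    return intervals


def availability(samples: List[Tuple[int, int]]) -> float:
    """Return the fraction of samples in which the status was online."""
    if not samples:
        raise ValueError("no samples in the selected time interval")
    online = sum(1 for _, status in samples if status == 0)
    return online / len(samples)


# One sample every five minutes over a 25-minute window.
samples = [(0, 0), (5, 0), (10, 1), (15, 1), (20, 0), (25, 0)]
print(breach_intervals(samples))
print(round(availability(samples), 2))
```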

Before you begin

Ensure that the service model is created by using the BMC Helix AIOps console and is configured with health indicators and alarm conditions.

Service designers are automatically redirected to the BMC Helix Operations Management console to configure the alarm policies for individual metrics. For more information about setting alarm policies, see Alarm policies. For more information about adding health indicators to a service model, see Adding or editing health indicators for a service.

To view and analyze the behavior of health indicators of a service

On the Service page, click a service to view the health and the impact on the service.
Expand View Health Indicators to view and analyze the breach pattern of the chosen health indicators configured for the service.
Understanding the health metric graphic
- The blue bubble indicates the metric name, which is followed by the CI Name (Hostname) and the Entity or object name.
- The x-axis indicates the timeline. In the above image, the timeline selected is Last 3 Hours.
- The graph indicates the behavioral trend of the metric for the last three hours as configured in the alarm policy.
- The gray band indicates the baseline, which is the reference point representing the accepted deviation.
  You can view the data for the last 24 hours in this section.
- The markers seen on the graph indicate a specific event.
The corresponding metric selection when defining the service model is as shown in the following image:
The charts are displayed for the Last 24 Hours (default). You can change the timeline to Last 3, 6, 12, or 24 Hours.
For more information, see Adding or editing health indicators for a service.

Why don't I see health indicators for some services?

When you are creating a service model in BMC Helix AIOps, defining metrics and configuring alarm policies are optional. If metrics are not configured, the health indicators are not displayed.
If the service models are created in BMC Discovery, the View Health Indicators section is not displayed when you view the service details in BMC Helix AIOps.

Where to go from here

Based on the health of and impact on a service, you can perform any of the following tasks:

View CI topology for impacted services. For more information, see Identifying the impacted CI nodes from CI topology view.
Investigate impacting events, incidents, and changes for nodes in service hierarchy. For more information, see Investigating the service nodes from service hierarchy view.
View the causal analysis for the impact. For more information, see Performing causal analysis of impacted services.
