Monitoring service health indicators


Health indicators are the key attributes and metrics that indicate the health of the configuration items (CIs) defined in a service model. The AI/ML algorithm in BMC Helix AIOps automatically detects changes in the behavior of these metrics and generates alarm events based on the configured threshold limits. These events indicate whether the service is healthy and available or whether its performance is degrading. You are alerted to an issue automatically, so you can quickly identify the root cause and resolve the problem.

In BMC Helix AIOps, the health indicator metrics provide relevant insights into the behavior of one or more key CI elements that are part of a service. Examples include the hourly average capacity of the File System and the Kubernetes node status at different time intervals. In these examples, Hourly average capacity and Status at different time intervals are the health indicator metrics, and File System and Kubernetes node are the CI elements.


As a service designer, you define health indicator metrics and configure alarm conditions for those indicator metrics as part of a service model definition. Each health indicator is a single metric or a collection of metrics from CIs. Each metric contains an alarm trigger condition to generate events when a threshold value is violated.
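The relationship described above (a health indicator groups one or more CI metrics, and each metric carries an alarm condition) can be sketched as a small data model. This is an illustrative sketch only, not the BMC Helix AIOps API: the class names, the comparator strings, the CI name `fs-01`, and the 90.0 threshold are all hypothetical.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical comparators for an alarm condition; the product's real
# alarm policy options may differ.
COMPARATORS = {">": operator.gt, ">=": operator.ge,
               "<": operator.lt, "<=": operator.le}


@dataclass
class MetricCondition:
    """One metric from one CI, with the condition that triggers an alarm."""
    ci_name: str
    metric_name: str
    comparator: str
    threshold: float

    def evaluate(self, value: float) -> Optional[str]:
        # Return an event message when the threshold is violated, else None.
        if COMPARATORS[self.comparator](value, self.threshold):
            return (f"{self.metric_name} on {self.ci_name}: "
                    f"{value} {self.comparator} {self.threshold}")
        return None


@dataclass
class HealthIndicator:
    """A single metric or a collection of metrics from CIs."""
    name: str
    metrics: List[MetricCondition]

    def events(self, samples: Dict[Tuple[str, str], float]) -> List[str]:
        # samples maps (ci_name, metric_name) to the latest observed value.
        found = []
        for metric in self.metrics:
            value = samples.get((metric.ci_name, metric.metric_name))
            if value is None:
                continue  # no data for this metric, so nothing to evaluate
            event = metric.evaluate(value)
            if event is not None:
                found.append(event)
        return found


indicator = HealthIndicator(
    "File system capacity",
    [MetricCondition("fs-01", "Hourly average capacity", ">", 90.0)],
)
print(indicator.events({("fs-01", "Hourly average capacity"): 95.5}))
```

In this sketch, an event is produced only for metrics whose latest value violates the condition; metrics without data are skipped rather than treated as breaches.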

Operators and site reliability engineers (SREs) use the Health Indicators view to observe and understand the behavior of the metrics selected as health indicators. The Health Indicators view graphically displays the alarm breach pattern for each selected metric, based on its alarm condition, over the specified time interval.

For example, in a Namespace service with a Kubernetes cluster that has three pods and a deployment node in it, the service designer chooses to configure the following metrics for observing the availability of the node, pod, and container in the Kubernetes cluster:

  • Kubernetes Node
    • Status (Ready, Not Ready)
    • Monitoring Status (0: online, 1: offline)
  • Kubernetes Pod
    • Ready (Ready, Not Ready)
    • Status (Running, Succeeded, Pending, Failed)
  • Kubernetes Container
    • Ready (Ready, Not Ready)
    • State (Running, Waiting, Terminated)
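As a rough illustration of how these state-based metrics could be read as availability signals, the following sketch maps each metric in the example to the values treated as available. The metric names and states come from the list above; which states count as available (for example, treating a Succeeded pod as available) is an assumption made for this sketch, as are the function and variable names.

```python
# Values treated as "available" for each (CI type, metric) pair.
# The groupings are assumptions for illustration, not product behavior.
AVAILABLE_STATES = {
    ("Kubernetes Node", "Status"): {"Ready"},
    ("Kubernetes Node", "Monitoring Status"): {0},  # 0: online, 1: offline
    ("Kubernetes Pod", "Ready"): {"Ready"},
    ("Kubernetes Pod", "Status"): {"Running", "Succeeded"},
    ("Kubernetes Container", "Ready"): {"Ready"},
    ("Kubernetes Container", "State"): {"Running"},
}


def unavailable(observations):
    """Return the (CI type, metric) pairs whose observed value is not available."""
    return [
        key
        for key, value in observations.items()
        if key in AVAILABLE_STATES and value not in AVAILABLE_STATES[key]
    ]


observed = {
    ("Kubernetes Node", "Status"): "Not Ready",
    ("Kubernetes Pod", "Status"): "Running",
    ("Kubernetes Container", "State"): "Waiting",
}
print(unavailable(observed))
```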

Before you begin

Ensure that the service model is created using BMC Helix AIOps and configured with health indicators and alarm conditions.

Service designers are automatically redirected to the BMC Helix Operations Management console to configure the alarm policies for individual metrics. For more information about setting alarm policies, see Alarm Policies. For information about adding health indicators to a service model, see Adding or editing health indicators for a service.

To view and analyze the behavior of health indicators of a service

  1. On the Service page, click a service to view the health and the impact on the service.
  2. Expand View Health Indicators to view and analyze the breach pattern of the chosen health indicators configured for the service.

    Understanding health indicators

    HealthIndicators_232.png

    • The blue bubble indicates the metric name, which is followed by the CI Name (Hostname) and the Entity or object name.
    • The x-axis indicates the timeline. In the preceding image, the selected timeline is Last 3 Hours.
    • The graph indicates the behavioral trend of the metric for the last three hours as configured in the alarm policy.

    The corresponding metric selection when defining the service model is as shown in the following image:

    health_indicator_in_model.png

    The charts are displayed for the Last 24 Hours (default). You can change the timeline to Last 3, 6, 12, or 24 Hours.
    For more information, see Health Indicators and Adding or editing health indicators for a service.
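The idea of a breach pattern over a selectable timeline can be sketched as follows. This is a simplified illustration, not how BMC Helix AIOps computes its charts: the function name, the sample format, and the "value greater than threshold" breach condition are assumptions. Only the window choices (3, 6, 12, or 24 hours, with 24 as the default) come from the text above.

```python
from datetime import datetime, timedelta

# Timeline options named in the documentation; 24 hours is the default.
ALLOWED_WINDOWS_HOURS = (3, 6, 12, 24)


def breach_pattern(samples, threshold, now, window_hours=24):
    """Flag each sample inside the window as breached or not.

    samples is a list of (timestamp, value) tuples. A sample is treated
    as a breach when value > threshold (an assumed alarm condition).
    """
    if window_hours not in ALLOWED_WINDOWS_HOURS:
        raise ValueError(f"window_hours must be one of {ALLOWED_WINDOWS_HOURS}")
    start = now - timedelta(hours=window_hours)
    ordered = sorted(samples, key=lambda sample: sample[0])
    return [(ts, value > threshold) for ts, value in ordered if start <= ts <= now]


now = datetime(2024, 1, 1, 12, 0)
samples = [
    (now - timedelta(hours=5), 95.0),
    (now - timedelta(hours=2), 92.0),
    (now - timedelta(hours=1), 40.0),
]
# With the Last 3 Hours timeline, the sample from five hours ago is excluded.
for timestamp, breached in breach_pattern(samples, 90.0, now, window_hours=3):
    print(timestamp.isoformat(), "breach" if breached else "ok")
```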

Why don't I see health indicators for some services?

  • When you are creating a service model in BMC Helix AIOps, defining metrics and configuring alarm policies are optional. If metrics are not configured, the health indicators are not displayed. 
  • If the service models are created in BMC Helix Discovery, the View Health Indicators section is not displayed when you view the service details in BMC Helix AIOps.

Health indicators and service predictions

The AI-based, service-centric prediction in BMC Helix AIOps consumes the health indicators and metrics available for a service. It can predict service outages and helps you identify and fix issues before they affect the service. As a result, it helps reduce an organization's overall mean time to resolve (MTTR). For more information, see Predicting and proactively resolving service outages.


Where to go from here

Based on the health of and impact on a service, you can perform any of the following tasks: