Monitoring service health indicators


Health indicators are important metrics that indicate the health of all the configuration items (CIs) defined in a service model. The AI/ML algorithm in BMC Helix AIOps automatically discovers the changes in behavior of these metrics and generates alarm events based on the set threshold limits to indicate whether the service is healthy and available or the performance is degrading. This enables you to be automatically alerted to an issue so you can quickly identify the root cause and resolve the problem.

In BMC Helix AIOps, health indicators provide insight into the behavior of one or more key CI elements that are part of a service. Examples include the hourly average capacity of the file system and the Kubernetes node status at different time intervals. In these examples, Hourly average capacity and Status at different time intervals are the health indicator metrics, and File System and Kubernetes node are the CI elements.

As a service designer, you define health indicators and configure alarm conditions for them as part of a service model definition. Each health indicator is a single metric or a collection of metrics from CIs and contains an alarm trigger condition to generate events when a threshold value is violated.
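
The relationship between a health indicator, its metrics, and its alarm trigger condition can be illustrated with a minimal Python sketch. This is not the BMC Helix AIOps data model or API: the HealthIndicator class, its field names, the metric key hourly_average_capacity, and the 90 percent threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class HealthIndicator:
    """Hypothetical model: one or more CI metrics plus an alarm trigger condition."""

    name: str
    ci_name: str
    metric_names: List[str]
    # Returns True when a sampled value violates the threshold (assumed shape).
    is_violation: Callable[[float], bool]

    def evaluate(self, samples: Dict[str, float]) -> List[str]:
        """Return one event message for each configured metric that breaches the condition."""
        events = []
        for metric in self.metric_names:
            value = samples.get(metric)
            if value is not None and self.is_violation(value):
                events.append(f"{self.ci_name}/{metric} breached threshold with value {value}")
        return events


# Illustrative indicator: file system capacity above 90 percent raises an event.
fs_capacity = HealthIndicator(
    name="File system capacity",
    ci_name="File System",
    metric_names=["hourly_average_capacity"],
    is_violation=lambda value: value > 90.0,
)

print(fs_capacity.evaluate({"hourly_average_capacity": 95.0}))  # one event
print(fs_capacity.evaluate({"hourly_average_capacity": 40.0}))  # no events
```

In the product, the threshold and trigger condition are configured in the alarm policy rather than in code; the sketch only shows how a single indicator groups metrics with one condition.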

Operators or site reliability engineering (SRE) teams use the Health Indicators view to understand the behavior of the metrics selected as health indicators. The Health Indicators view displays the following information graphically to help operators make informed decisions and take appropriate actions:

  • Baseline graph for the selected service metrics showing normal or expected behavior. Baselines are derived and plotted for key performance indicators (KPIs) such as CPU utilization, network traffic, and Mean Time to Resolve (MTTR) issues based on historical data. BMC Helix AIOps uses the following trends to calculate the baselines:

    • Hourly - This baseline is calculated by using data from the same hour on the previous day. For example, to determine the baseline for 10:00 today, the calculation projected for yesterday from 10:00 to 11:00 is used.
    • Daily - This baseline is calculated by using data from the previous day. For example, to determine the baseline for today, the calculation projected for yesterday is used. The daily baseline calculation is separate for weekdays and weekends and is calculated internally. 
    • Weekly - This baseline is calculated by using data from the same day in the previous week. For example, to determine the baseline for Monday, the calculation projected for the previous Monday is used.
    • Monthly - This baseline is calculated by using data from the same date in the previous month. For example, to determine the baseline for February 1, the calculation projected for January 1 is used.

    For more information about baselines, see Baselines in Alarm policies and Autoanomalies.

  • Alarm breach pattern for all the selected metrics based on the alarm condition at various specified time intervals. The pattern provides insights into the service availability for the selected time interval.
    For example, in a Namespace service with a Kubernetes cluster that has three pods and a deployment node in it, the service designer chooses to configure the following metrics for observing the availability of the node, pod, and container in the Kubernetes cluster:
    • Kubernetes Node -
      • Status (Ready or not)
      • Monitoring Status (0: Online, 1: Offline)
    • Kubernetes POD -
      • Ready (Ready or not)
      • Status (Running, Succeeded, Pending, Failed)
    • Kubernetes Container -
      • Ready (Ready or not)
      • State (Running, Waiting, Terminated)
  • Event indicators on the baseline graphs, each representing a specific event, such as high memory utilization or a system update. Each event indicator includes a tooltip that provides more context about the event.
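
The hourly, daily, weekly, and monthly baseline trends described in the list above each look back to a different historical period. The following Python sketch shows one way to work out that reference period for a given point in time. It is an illustration only and not the product's internal calculation: the function name baseline_reference is hypothetical, the sketch ignores the separate weekday and weekend handling of the daily baseline, and clamping to the last day of a shorter month is an assumption because the source does not describe that case.

```python
import calendar
from datetime import datetime, timedelta


def baseline_reference(target: datetime, trend: str) -> datetime:
    """Return the start of the historical period a baseline trend looks back to."""
    midnight = target.replace(hour=0, minute=0, second=0, microsecond=0)
    if trend == "hourly":
        # Same hour on the previous day (10:00 today -> 10:00 to 11:00 yesterday).
        top_of_hour = target.replace(minute=0, second=0, microsecond=0)
        return top_of_hour - timedelta(days=1)
    if trend == "daily":
        # Previous day. The weekday/weekend split is not modeled here.
        return midnight - timedelta(days=1)
    if trend == "weekly":
        # Same day in the previous week (Monday -> previous Monday).
        return midnight - timedelta(days=7)
    if trend == "monthly":
        # Same date in the previous month (February 1 -> January 1).
        year = target.year if target.month > 1 else target.year - 1
        month = target.month - 1 if target.month > 1 else 12
        # Assumption: clamp to the last day when the previous month is shorter.
        day = min(target.day, calendar.monthrange(year, month)[1])
        return midnight.replace(year=year, month=month, day=day)
    raise ValueError(f"Unknown baseline trend: {trend}")


when = datetime(2024, 2, 1, 10, 30)
for trend in ("hourly", "daily", "weekly", "monthly"):
    print(trend, baseline_reference(when, trend))
```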

Before you begin

  • Make sure the service model is created using BMC Helix AIOps and configured with health indicators.
    For information about adding health indicators to a service model, see Adding health indicators.
  • Make sure that the alarm policies are configured for the health indicators.
    When adding health indicators, the service designers are automatically redirected to the BMC Helix Operations Management console to configure the alarm policies for the health indicators. For more information about creating or editing an alarm policy, see Configuring alarm policies.

To view and analyze the behavior of health indicators of a service

  1. On the Service page, click a service to view the health and the impact on the service.
  2. Expand View Health Indicators to view and analyze the baseline and breach pattern of the chosen health indicators configured for the service.

    understanding_health_indicators_25_2.png

    • The graph indicates the behavioral trend of the metric for the last 24 hours as configured in the alarm policy.
    • The preceding graph shows hourly, daily, and weekly baselines in different colors with corresponding high and low baseline values.
    • The blue bubble indicates the metric name, which is followed by the CI Name (Hostname) and the entity or object name.
    • The X-axis indicates the timeline. In the preceding graph, the timeline selected is the Last 24 Hours.
    • The markers ( define_health_indicator_252_markers.png ) seen on the graph indicate an anomaly event. The yellow marker indicates an open anomaly event, and the green marker indicates a closed anomaly event.

    The corresponding metric selection for two metrics when defining the service model is shown in the following images:
    define_health_indicator_252_image1.png

    define_health_indicator_252_image2.png

    define_health_indicator_252_image3.png

    define_health_indicator_252_image4.png

By default, the charts display data for the Last 3 Hours. You can change the timeline to the last 3, 6, 12, or 24 hours, or the last 7 days.

FAQ

Why don't I see health indicators for my service?
  • When creating a service model in BMC Helix AIOps, adding health indicators and configuring alarm policies are optional. If health indicators are not defined, they are not displayed in the View Health Indicators section.
  • If the service models are created in BMC Helix Discovery, the View Health Indicators section is not displayed when you view the service details in BMC Helix AIOps.

Where to go from here

Based on the health of and impact on a service, you can perform any of the following tasks: 

 
