Monitoring service health indicators
As a service designer, you define health indicator metrics and configure alarm conditions for them as part of a service model definition. Each health indicator is a single metric or a collection of metrics from configuration items (CIs). Each metric has an alarm trigger condition that generates events when a threshold value is violated.
Operators and site reliability engineers (SREs) use the Health Indicators view to understand the behavior of the metrics selected as health indicators. The view graphically displays the alarm breach pattern for each selected metric, based on its alarm condition, over the specified time intervals.
For example, in a Namespace service with a Kubernetes cluster that has three pods and a deployment node in it, the service designer chooses to configure the following metrics for observing the availability of the node, pod, and container in the Kubernetes cluster:
- Kubernetes Node
  - Status (Ready, Not Ready)
  - Monitoring Status (0: online, 1: offline)
- Kubernetes Pod
  - Ready (Ready or not)
  - Status (Running, Succeeded, Pending, Failed)
- Kubernetes Container
  - Ready (Ready or not)
  - State (Running, Waiting, Terminated)
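To make the relationship between a metric and its alarm condition concrete, the following Python sketch models the example metrics as plain data and checks which ones are in breach. It is illustrative only and is not the BMC Helix API: the `HealthIndicatorMetric` class, the `breached` function, and in particular the choice of which values count as a breach are assumptions made for this example. In the product, the alarm conditions are whatever you configure in the alarm policy.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class HealthIndicatorMetric:
    """Hypothetical model of one metric chosen as a health indicator."""
    ci_type: str                            # for example, "Kubernetes Node"
    metric: str                             # for example, "Monitoring Status"
    is_breached: Callable[[object], bool]   # assumed alarm condition


# The breach conditions below are assumptions for illustration only.
INDICATORS = [
    HealthIndicatorMetric("Kubernetes Node", "Status", lambda v: v != "Ready"),
    HealthIndicatorMetric("Kubernetes Node", "Monitoring Status", lambda v: v == 1),
    HealthIndicatorMetric("Kubernetes Pod", "Ready", lambda v: v is not True),
    HealthIndicatorMetric("Kubernetes Pod", "Status", lambda v: v in {"Pending", "Failed"}),
    HealthIndicatorMetric("Kubernetes Container", "Ready", lambda v: v is not True),
    HealthIndicatorMetric("Kubernetes Container", "State", lambda v: v in {"Waiting", "Terminated"}),
]


def breached(samples):
    """Return (ci_type, metric) pairs whose sampled value violates its condition.

    `samples` maps (ci_type, metric) to the latest observed value.
    Metrics without a sample are skipped.
    """
    result = []
    for indicator in INDICATORS:
        key = (indicator.ci_type, indicator.metric)
        if key in samples and indicator.is_breached(samples[key]):
            result.append(key)
    return result
```

For example, passing a sample in which the node's Monitoring Status is 1 (offline) and a container's State is Waiting returns those two metrics and leaves the healthy ones out.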
Before you begin
Ensure that the service model is created using BMC Helix AIOps and configured with health indicators and alarm conditions.
Service designers are automatically redirected to the BMC Helix Operations Management console to configure the alarm policies for individual metrics. For more information about setting alarm policies, see Alarm Policies. For information about adding health indicators to a service model, see Adding or editing health indicators for a service.
To view and analyze the behavior of health indicators of a service
- On the Service page, click a service to view the health and the impact on the service.
- Expand View Health Indicators to view and analyze the breach pattern of the chosen health indicators configured for the service.
Understanding health indicators
- The blue bubble indicates the metric name, which is followed by the CI Name (Hostname) and the Entity or object name.
- The y-axis indicates the timeline. In the above image, the timeline selected is Last 3 Hours.
- The graph indicates the behavioral trend of the metric for the last three hours as configured in the alarm policy.
The corresponding metric selection when defining the service model is as shown in the following image:
The charts are displayed for the Last 24 Hours (default). You can change the timeline to Last 3, 6, 12, or 24 Hours.
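As a rough mental model of what a breach pattern over a selected timeline means, the following Python sketch counts alarm breaches per hour within a chosen window. It is an assumption-laden illustration and does not describe how the product computes or renders its charts: the `breach_pattern` function and the hourly bucketing are hypothetical. Only the window choices (3, 6, 12, or 24 hours, with 24 as the default) come from this topic.

```python
from collections import Counter
from datetime import datetime, timedelta

ALLOWED_WINDOWS_HOURS = (3, 6, 12, 24)   # timeline choices named in this topic
DEFAULT_WINDOW_HOURS = 24                # Last 24 Hours is the default


def breach_pattern(breach_times, now, window_hours=DEFAULT_WINDOW_HOURS):
    """Count breaches per hour inside the selected window (hypothetical helper).

    Returns a list of length `window_hours`; element h is the number of
    breaches that occurred between h and h + 1 hours before `now`.
    """
    if window_hours not in ALLOWED_WINDOWS_HOURS:
        raise ValueError(f"window_hours must be one of {ALLOWED_WINDOWS_HOURS}")
    start = now - timedelta(hours=window_hours)
    counts = Counter()
    for t in breach_times:
        if start < t <= now:
            hours_ago = int((now - t).total_seconds() // 3600)
            counts[hours_ago] += 1
    return [counts.get(h, 0) for h in range(window_hours)]
```

Switching the window from 24 hours to 3 hours in this sketch simply drops older breaches from the result, which mirrors how narrowing the timeline narrows the period that the charts cover.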
For more information, see Health Indicators and Adding or editing health indicators for a service.
Why don't I see health indicators for some services?
- When you are creating a service model in BMC Helix AIOps, defining metrics and configuring alarm policies are optional. If metrics are not configured, the health indicators are not displayed.
- If the service models are created in BMC Helix Discovery, the View Health Indicators section is not displayed when you view the service details in BMC Helix AIOps.
Health indicators and service predictions
The AI-based, service-centric prediction in BMC Helix AIOps consumes the health indicators and metrics available for a service. It predicts service outages and helps you identify and fix issues before they occur. As a result, it helps reduce an organization's overall mean time to resolve (MTTR). For more information, see Predicting and proactively resolving service outages.
Where to go from here
Based on the health of and impact on a service, you can perform any of the following tasks:
- View the CI topology for impacted services. For more information, see Identifying the impacted CI nodes from CI topology view.
- Investigate impacting events, incidents, and changes for nodes in the service hierarchy. For more information, see Investigating the service nodes from service hierarchy view.
- View the causal analysis for the impact. For more information, see Performing causal analysis of impacted services.
- Get an insight into the service behavior and its severity pattern over a predefined period. For more information, see Monitoring service insights.
- View service predictions. For more information, see Predicting and proactively resolving service outages.