Monitoring service health indicators
As a service designer, you define health indicators and configure alarm conditions for them as part of a service model definition. Each health indicator is a single metric or a collection of metrics from CIs and contains an alarm trigger condition to generate events when a threshold value is violated.
Operators or site reliability engineering (SRE) teams view and analyze the behavior of the selected metrics in the Health Indicators view. The Health Indicators view displays the following information graphically to help operators make informed decisions and take appropriate actions:
- Baseline graph for the selected service metrics showing normal or expected behavior. Baselines are derived from historical data and plotted for key performance indicators (KPIs) such as CPU utilization, network traffic, and Mean Time to Resolve (MTTR). BMC Helix AIOps uses the following trends to calculate the baselines:
  - Hourly - This baseline is calculated by using data from the same hour on the previous day. For example, to determine the baseline for 10:00 today, the calculation projected for yesterday from 10:00 to 11:00 is used.
  - Daily - This baseline is calculated by using data from the previous day. For example, to determine the baseline for today, the calculation projected for yesterday is used. The daily baseline is calculated internally and separately for weekdays and weekends.
  - Weekly - This baseline is calculated by using data from the same day in the previous week. For example, to determine the baseline for Monday, the calculation projected for the previous Monday is used.
  - Monthly - This baseline is calculated by using data from the same date in the previous month. For example, to determine the baseline for February 1, the calculation projected for January 1 is used.
  For more information about baselines, see Baselines in Alarm policies and Autoanomalies.
- Alarm breach pattern for all the selected metrics, based on the alarm condition, at the specified time intervals. The pattern provides insight into the service availability for the selected time interval.
  For example, in a Namespace service with a Kubernetes cluster that has three pods and a deployment node, the service designer configures the following metrics to observe the availability of the node, pod, and container in the Kubernetes cluster:
  - Kubernetes Node
    - Status (Ready or not)
    - Monitoring Status (0: Online, 1: Offline)
  - Kubernetes Pod
    - Ready (Ready or not)
    - Status (Running, Succeeded, Pending, Failed)
  - Kubernetes Container
    - Ready (Ready or not)
    - State (Running, Waiting, Terminated)
- Event indicators on the baseline graphs, each representing a specific event, such as high memory utilization or a system update. Each event indicator includes a tooltip that provides more context about the event.
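The reference periods behind the hourly, daily, weekly, and monthly baselines described above can be sketched as follows. This is only an illustration of the stated rules, not the product's implementation: the function name `reference_window` is hypothetical, the separate weekday and weekend handling of the daily baseline is not modeled, and clamping to the last day of a shorter month is an assumption that the source does not confirm.

```python
import calendar
from datetime import datetime, timedelta


def reference_window(trend: str, target: datetime) -> tuple[datetime, datetime]:
    """Return the (start, end) of the historical period used for `target`.

    Hypothetical helper that mirrors the documented rules only.
    """
    if trend == "hourly":
        # Same hour on the previous day, for example 10:00 to 11:00 yesterday.
        start = target.replace(minute=0, second=0, microsecond=0) - timedelta(days=1)
        return start, start + timedelta(hours=1)

    day = target.replace(hour=0, minute=0, second=0, microsecond=0)
    if trend == "daily":
        # Previous day (weekday/weekend separation is not modeled here).
        start = day - timedelta(days=1)
    elif trend == "weekly":
        # Same day in the previous week.
        start = day - timedelta(weeks=1)
    elif trend == "monthly":
        # Same date in the previous month.
        year, month = (day.year, day.month - 1) if day.month > 1 else (day.year - 1, 12)
        # Assumption: clamp to the last day when the previous month is shorter.
        last_day = calendar.monthrange(year, month)[1]
        start = day.replace(year=year, month=month, day=min(day.day, last_day))
    else:
        raise ValueError(f"unknown trend: {trend}")
    return start, start + timedelta(days=1)


if __name__ == "__main__":
    now = datetime(2024, 3, 4, 10, 30)  # a Monday
    for trend in ("hourly", "daily", "weekly", "monthly"):
        print(trend, reference_window(trend, now))
```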
Before you begin
- Make sure that the service model is created by using BMC Helix AIOps and is configured with health indicators.
  For information about adding health indicators to a service model, see Adding health indicators.
- Make sure that the alarm policies are configured for the health indicators.
  When adding health indicators, service designers are automatically redirected to the BMC Helix Operations Management console to configure the alarm policies for the health indicators. For more information about creating or editing an alarm policy, see Configuring alarm policies.
To view and analyze the behavior of health indicators of a service
- On the Service page, click a service to view the health and the impact on the service.
- Expand View Health Indicators to view and analyze the baseline and breach pattern of the health indicators configured for the service.
  - The graph indicates the behavioral trend of the metric for the last 24 hours as configured in the alarm policy.
  - The graph shows hourly, daily, and weekly baselines in different colors with the corresponding high and low baseline values.
  - The blue bubble indicates the metric name, which is followed by the CI name (host name) and the entity or object name.
  - The X-axis indicates the timeline. In the graph, the selected timeline is Last 24 Hours.
  - The markers on the graph indicate anomaly events. A yellow marker indicates an open anomaly event, and a green marker indicates a closed anomaly event.
The corresponding metric selection for two metrics when defining the service model is as shown in the following image:
By default, the charts are displayed for the last 3 hours. You can change the timeline to the last 3, 6, 12, or 24 hours, or to the last 7 days.
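As a rough illustration of how a breach pattern relates to the selected timeline, the following sketch filters metric samples to the chosen window and groups consecutive threshold violations into intervals. Everything here is hypothetical: the names `WINDOWS` and `breach_intervals`, the sample format, and the simple "value greater than threshold" alarm condition are assumptions for illustration and do not describe how the product evaluates alarm policies.

```python
from datetime import datetime, timedelta

# Timeline choices listed above; the keys are illustrative labels only.
WINDOWS = {
    "3h": timedelta(hours=3),
    "6h": timedelta(hours=6),
    "12h": timedelta(hours=12),
    "24h": timedelta(hours=24),
    "7d": timedelta(days=7),
}


def breach_intervals(samples, threshold, now, window="3h"):
    """Group consecutive samples above `threshold` into (start, end) intervals.

    `samples` is a time-ordered list of (timestamp, value) pairs.
    Assumption: a breach means value > threshold.
    """
    cutoff = now - WINDOWS[window]
    intervals = []
    run_start = None
    last_breach = None
    for ts, value in samples:
        if ts < cutoff or ts > now:
            continue  # outside the selected timeline
        if value > threshold:
            if run_start is None:
                run_start = ts
            last_breach = ts
        elif run_start is not None:
            intervals.append((run_start, last_breach))
            run_start = None
    if run_start is not None:
        intervals.append((run_start, last_breach))
    return intervals


if __name__ == "__main__":
    now = datetime(2024, 3, 5, 12, 0)
    values = [95, 70, 85, 90, 60, 50, 88, 91]  # one sample every 30 minutes
    start = datetime(2024, 3, 5, 8, 0)
    # The 08:30 slot is skipped to keep the example short.
    stamps = [start] + [start + timedelta(minutes=30 * i) for i in range(2, 9)]
    samples = list(zip(stamps, values))
    for begin, end in breach_intervals(samples, threshold=80, now=now, window="3h"):
        print(begin.time(), "to", end.time())
```

Widening the window, for example to `"6h"`, brings earlier samples into scope, so the same data can show additional breach intervals.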
Where to go from here
Based on the health of and impact on a service, you can perform any of the following tasks:
- View the CI topology for impacted services. For more information, see Identifying the impacted CI nodes from CI topology view.
- Investigate impacting events, incidents, and changes for nodes in the service hierarchy. For more information, see Investigating the service nodes from service hierarchy view.
- View the causal analysis for the impact. For more information, see Performing causal analysis of impacted services.
- Get an insight into the service behavior and its severity pattern over a predefined period. For more information, see Monitoring service insights.
- View service predictions. For more information, see Predicting and proactively resolving service outages.