Composite alarm policy

Use composite alarm policies to detect performance issues by correlating multiple metrics and triggering an alarm only when all defined conditions are met for a specified duration. A composite alarm policy evaluates a time‑series expression that combines multiple metrics into a single logical condition, applicable to a single instance or across multiple instances. This correlation reduces alert noise and generates a single, contextual alarm that reflects real performance impact.

Each composite alarm policy supports one expression per severity level, with up to three severity levels per policy. For multi‑metric time‑series expressions, all metrics must share system‑identified common labels.

Use composite alarm policies when a single metric does not accurately indicate a real issue, and multiple conditions must occur together to represent performance impact. Composite alarm policies are useful for correlating metrics across different instances and reducing alert noise caused by isolated or short‑lived metric breaches.

Difference between a standard alarm policy and a composite alarm policy

Standard alarm policies evaluate a single metric against a threshold. Composite alarm policies evaluate multiple metrics together, enabling correlation across related components or instances. The following table compares standard and composite alarm policies.

Standard alarm policy | Composite alarm policy
Uses a single metric | Uses multiple metrics
Threshold‑based alarm conditions using a single metric value | Time‑series expression‑based alarm conditions using single or multiple metric values
Generates multiple separate alarms | Generates one composite alarm
Causes higher alert noise | Results in reduced alert noise

Building composite alarm trigger conditions

Composite alarm conditions are built by combining metrics with logical operators (and, or) and by correlating metrics across instances by using label joins. This enables administrators to generate an alarm only when thresholds are breached across single or multiple instances simultaneously, resulting in a single, meaningful alarm instead of multiple independent alerts. Each policy includes the following details:

  • Severity levels, with one expression and duration per severity:
    • Minor
    • Major
    • Critical
  • A hostname label that identifies the affected entity.
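Conceptually, a composite alarm policy pairs each severity with exactly one expression and one duration. The following sketch shows that shape in plain Python; the field names, threshold values, and durations are illustrative, not the product's schema:

```python
# Illustrative policy structure: one expression and one duration per severity.
# Field names and values are hypothetical, not the product's actual schema.
composite_policy = {
    "Minor":    {"expression": 'Used{entityTypeId="NUK_Memory"} > 70', "duration_min": 10},
    "Major":    {"expression": 'Used{entityTypeId="NUK_Memory"} > 80', "duration_min": 5},
    "Critical": {"expression": 'Used{entityTypeId="NUK_Memory"} > 90', "duration_min": 5},
}
```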

The following image displays the configuration of a composite alarm policy:


While building time-series expressions, use the Show metrics list tab to access predefined metrics. The following video shows how to build a basic time-series expression:

Examples of time-series expressions

Example 1: Single‑host, multi‑metric resource utilization

Purpose

The condition helps identify situations where high memory usage is driven by specific processes, indicating potential memory leaks or inefficient applications that can lead to system instability or performance degradation.

Expression

Used{entityTypeId="NUK_Memory"} > 80
and on (hostname)
(ProcessMemoryUsage{entityTypeId="NUK_Process"} >= 80)

This expression correlates host-level memory usage with process-level memory consumption. It generates an alarm when the following conditions are met:

  • The overall memory utilization on a host exceeds 80%
  • A process on that host consumes ≥ 80% of memory

The and on (hostname) clause makes sure that both conditions are matched for the same host.
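The same-host correlation enforced by and on (hostname) can be modeled with a short sketch in plain Python. The sample values are hypothetical; the product evaluates the time-series expression internally, so this is only a conceptual model:

```python
# Conceptual model of:  Used > 80  and on (hostname)  ProcessMemoryUsage >= 80
# Latest sample per host for each metric (hypothetical values).
host_memory_used = {"host-a": 92.0, "host-b": 75.0, "host-c": 85.0}
process_memory_usage = {"host-a": 88.0, "host-c": 40.0}

def correlate(left, right, left_ok, right_ok):
    """Return hostnames where BOTH conditions hold for the SAME host."""
    return {
        host
        for host, value in left.items()
        if left_ok(value) and host in right and right_ok(right[host])
    }

alarming_hosts = correlate(
    host_memory_used, process_memory_usage,
    left_ok=lambda v: v > 80,    # host memory Used > 80
    right_ok=lambda v: v >= 80,  # ProcessMemoryUsage >= 80
)
# Only host-a satisfies both conditions on the same host.
```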

Example 2: Database connection utilization correlated with idle transactions

Purpose

The condition helps detect potential connection leaks or improper transaction handling, where idle transactions consume connections, reducing available capacity and potentially impacting database performance.

Expression

(DBMaxConnPct{entityTypeId="PGR_DB_METRICS"} > 80)
and on (hostname)
(DBIdleInTxnSessions{entityTypeId="PGR_DB_SESSION_METRICS"} > 10)

This expression correlates database connection utilization with idle-in-transaction sessions. It generates an alarm when the following conditions are met:

  • The database is using more than 80% of its maximum connections
  • The database has more than 10 sessions in an idle-in-transaction state on the same host

The and on (hostname) clause makes sure that both conditions are evaluated for the same database instance.

Example 3: Database performance degradation correlated with high CPU utilization

Purpose

The condition helps identify situations where high CPU load is likely contributing to database slowness, indicating potential resource contention or inefficient query processing.

Expression

(ResponseTime{entityTypeId="PGR_CUSTOM_SQL"} > 10)
and on (hostname)
(Utilization{entityTypeId="NUK_CPU"} > 90)

This expression correlates database performance with system resource usage. It triggers when the following conditions are met:

  • The SQL query response time exceeds 10 seconds
  • CPU utilization on the same host exceeds 90%

The and on (hostname) clause makes sure that both conditions are matched for the same machine, avoiding false positives from unrelated hosts.

Example 4: Auto Scaling Group capacity exhaustion risk

Purpose

The condition identifies that the Auto Scaling Group is operating very close to its maximum capacity, leaving little or no room to scale further. This can be an early warning that the system may run out of scaling headroom, potentially leading to performance issues during traffic spikes.

Expression

(GroupMaxSize{entityTypeId="AWS_AUTO_SCALING_GROUP"}[15m]
-
GroupInServiceInstances{entityTypeId="AWS_AUTO_SCALING_GROUP"}[15m]) <= 1

This expression calculates the difference between the maximum capacity of an Auto Scaling Group (ASG) and the number of instances currently in service in the last 15 minutes. An alert is triggered when the difference is less than or equal to 1.
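The headroom arithmetic can be sketched in plain Python with hypothetical sample values:

```python
# Headroom = maximum ASG capacity minus instances currently in service.
# Sample values are hypothetical.
group_max_size = 10
group_in_service_instances = 9

headroom = group_max_size - group_in_service_instances

# Alarm when the group can scale out by at most one more instance.
capacity_alarm = headroom <= 1
```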

Example 5: Kubernetes pod readiness correlated with lifecycle state mismatch

Purpose

The condition helps detect pods that are in an unhealthy or transitional state, such as Pending, Failed, Unknown, or other non-running states. These pods are not ready to serve traffic. This is useful for identifying pod startup failures, scheduling issues, or crash-related conditions.

Expression

kube_pod_status_ready{condition="false"} == 1
and on (namespace, pod)
kube_pod_status_phase{phase!="Running"} == 1

This expression correlates pod readiness status with pod lifecycle phase. It generates an alarm when the following conditions are met:

  • A Kubernetes pod is not ready (condition=false)
  • The same pod is not in the running phase (pending, failed, succeeded, or unknown)

The and on (namespace, pod) clause ensures that both conditions are evaluated for the same pod within the same namespace, filtering out transient readiness issues and highlighting pods that are genuinely unhealthy or stuck outside the Running state.
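Joining on more than one label works the same way as a single-label join; the join key simply becomes a tuple of label values. A minimal sketch in plain Python, with hypothetical pod states:

```python
# Conceptual model of joining on (namespace, pod): the join key is a tuple.
# Pod states below are hypothetical samples.
not_ready = {("payments", "api-0"), ("payments", "api-1"), ("batch", "job-7")}
not_running = {("payments", "api-1"), ("batch", "job-7"), ("web", "front-2")}

# A pod alarms only when it appears in BOTH sets with the same
# (namespace, pod) key, mirroring "and on (namespace, pod)".
alarming_pods = not_ready & not_running
```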


Validating composite alarm trigger conditions

Before a composite alarm policy is saved, all configured expressions are validated to make sure that the time-series expression is syntactically correct and suitable for evaluation in large environments. During validation, the system performs the following actions:

  • Checks the expression syntax
  • Verifies required label joins for multi‑metric conditions
  • Estimates the number of matching time series
  • Identifies common labels so that a hostname label can be selected for generated alarms

If validation fails or raises warnings, you cannot save the policy until the issues are addressed.
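The common-label step can be pictured as a set intersection over the labels of each metric in the expression. This is a conceptual sketch in plain Python, not the product's validator:

```python
# Hypothetical label sets for two metrics used in one expression.
labels_per_metric = [
    {"hostname", "entityTypeId", "instanceName"},  # metric 1
    {"hostname", "entityTypeId", "processName"},   # metric 2
]

# Labels shared by every metric in the expression are the candidates
# for identifying the affected entity (for example, the hostname label).
common_labels = set.intersection(*labels_per_metric)
```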

Common validation issues

The following table provides some common validation issues with solutions:

Issue | Solution
Missing label join in multi‑metric expressions: validation fails if multiple metrics are combined without an explicit on(...) or ignoring(...) clause. | Add a label join to make sure that metrics are correlated for the same entity or instance.
Too many matching time series: validation warns or fails when an expression matches an excessive number of series. | Refine the expression by narrowing labels, such as environment, region, application, or instance.
No matching series found: validation fails if the expression returns no time series. | Verify metric names, label values, and data availability.
Hostname label not available: a host name is not detected. | If no common host name label is detected, use the Static option to define the host name, or revise the expression so that a consistent entity label can be selected to generate alarms.
Expression modified after validation. | Any change to a validated expression requires revalidation before the policy can be saved.

Limitations in building composite alarm conditions

Composite alarm policies have the following limitations:

  • At least one data point must be available to create a composite alarm policy.
  • A maximum of 10 metrics can be added to a single time-series expression.
  • The number of matching series per expression must be less than 200,000.
  • The alarm closes automatically when the conditions are no longer true.
  • No baseline or custom closure options are supported.
  • The composite alarm policy cannot automatically identify conflicting thresholds defined in time-series expressions.

Where to go from here

Configuring composite alarm policy

Related topics

Composite alarm event class

Composite alarm policy management endpoints in the REST API


BMC Helix Operations Management 26.2