
Detecting anomalies by using static and dynamic thresholds


Anomalies are observations that diverge from a well-structured data pattern, irregular spikes in time-series data, or unclassifiable data points within a specific data set. An anomaly can occur independently or due to a combination of factors. For example, slow response time combined with high memory utilization may impact the expected system behavior.

As an administrator, create alarm policies, composite alarm policies, or variate policies to monitor and manage the health of your system and detect anomalies. These policies also help you detect abnormal behavior in your monitoring data more accurately by reducing:

  • False positives: Scenarios where an alarm is raised even though the system exhibits normal behavior. 
  • False negatives: Scenarios where the product failed to raise an alarm despite the occurrence of an abnormal metric condition.

Tip

You can also configure the automatic detection of anomalies without configuring alarm policies. For more information, see Configuring autoanomaly event generation.

Alarm policies use a combination of static and baseline thresholds for a single metric to generate alerts. Composite alarm policies use multi-metric, condition-based expressions, and variate policies use dynamic thresholds to generate alerts. In the monitoring world, a threshold is a defined value that determines whether a monitored metric, such as CPU utilization or memory usage, is above, below, or within a normal range in your infrastructure environment.

Static thresholds represent an absolute value above or below which an event is generated. In general, a static threshold is specified for attributes that have commonly accepted values, beyond which performance is known to degrade. For example, if the total CPU utilization on a Windows system exceeds 80%, performance degradation can occur.
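As a sketch, the Windows CPU example above could be expressed as a PromQL comparison. The metric and label names here are hypothetical placeholders, not product-defined identifiers:

```promql
# Hypothetical metric name. Returns a result (and can raise an event)
# for any Windows host whose total CPU utilization exceeds the static
# 80% threshold.
system_cpu_utilization_percent{os="windows"} > 80
```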

Composite alarm policy expressions are defined by using Prometheus Query Language (PromQL) and operate on time-series monitoring data over a specified evaluation interval. This enables administrators to define logical relationships between metrics and generate an alarm only when the combined condition is met. For example, an administrator can define a condition where an alarm is generated only when CPU utilization exceeds 90% and memory utilization exceeds 90% for a specified duration.
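The example condition above could be sketched as a PromQL expression. The metric names are illustrative assumptions; in practice, the duration requirement is supplied by the policy's evaluation settings rather than by the expression itself:

```promql
# Hypothetical metric names. The `and` operator intersects the two
# result sets, so the condition holds only when both comparisons are
# true for the same entity at the same evaluation timestamp.
cpu_utilization_percent > 90 and memory_utilization_percent > 90
```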

Dynamic thresholds contain a set of low and high baselines, which are automatically generated. Administrators don’t need to manually set the threshold value. Dynamic thresholds can be used for attributes that degrade over time, such as performance metrics like response time.
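Although dynamic baselines are generated automatically by the product, the underlying idea can be sketched in PromQL as a band derived from recent history. The metric name, one-day window, and two-standard-deviation band are all illustrative assumptions:

```promql
# Hypothetical metric. Flags samples that fall outside a band of two
# standard deviations around the one-day moving average, so the
# effective threshold shifts as the metric's normal behavior shifts.
abs(response_time_seconds - avg_over_time(response_time_seconds[1d]))
  > 2 * stddev_over_time(response_time_seconds[1d])
```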

Alarm policies

Alarm policies use the following threshold combinations to detect anomalies:

  • Static thresholds only
  • Baseline thresholds only
  • Static and baseline thresholds

For more information, see Alarm policy concept and instructions for configuring an alarm policy.

Example scenarios for alarm policies

This section provides a few examples of alarm policies.

Scenario 1 - Static and baseline thresholds

If the CPU utilization crosses the static threshold of 75% for 15 minutes and also violates the baseline determined for that metric, an alarm event must be generated.

Scenario 2 - Static threshold

If the storage server's power supply is degraded for more than 15 minutes, the service will be impacted, and an event alert must be generated.

Scenario 3 - Static threshold

If an Apache Tomcat server is offline or unavailable for more than 15 minutes, the service will be impacted, and an alarm notification must be generated.
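Availability checks like this one are often modeled as a 0/1 metric; as a hypothetical PromQL sketch (the metric name is an assumption, and the 15-minute duration would be enforced by the policy, not the expression):

```promql
# Hypothetical availability metric: 1 when the Tomcat server responds,
# 0 when it is offline. The expression matches only offline servers.
tomcat_server_up == 0
```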

Scenario 4 - Baseline threshold

A Linux server's CPU utilization violates the baseline value, so an event alert of the Alarm class must be generated. This alert can have one of the following severity levels:

  • Information
  • Critical
  • Major
  • Minor
  • Warning

Composite alarm policies

As an administrator, create composite alarm policies to receive alarm events when multiple related metric conditions occur together. Use these policies when a single metric is not sufficient to determine an anomaly and when correlating metrics provides a more accurate indication of system impact. Composite alarm policies are useful for detecting complex anomaly scenarios where individual metrics might fluctuate independently but indicate a problem only when they occur together.

For more information, see <Link placeholder> and instructions for Configuring a composite alarm policy.

Example scenarios for composite alarm policies

This section provides a few examples of composite alarm policies.

Scenario 1

Generate an alarm when CPU utilization exceeds 80% and memory utilization exceeds 90% on a particular server for more than 5 minutes.

Scenario 2

Generate an alarm when CPU utilization exceeds 85% on multiple servers within the same environment for a specified duration.

Scenario 3

Generate an alarm when the average response time exceeds 500 milliseconds and the request rate exceeds 100 requests per second on the same application instance.

Scenario 4

Generate an alarm when container memory utilization is more than 80% across multiple pods and node memory availability drops below 40% for the evaluation interval.

Scenario 5

Generate an alarm when disk space utilization exceeds 85% or inode utilization exceeds 90% on multiple servers for 60 minutes.
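Scenario 5 combines two conditions with a logical OR; as a PromQL sketch with hypothetical metric names, where either condition on a server is sufficient to trigger the alarm:

```promql
# Hypothetical metrics. The `or` operator unions the two result sets,
# so a server exceeding either threshold satisfies the condition.
disk_space_used_percent > 85 or inode_used_percent > 90
```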

Variate policies

As an administrator, create variate policies to receive event notifications. Variate policies use dynamic thresholds. Use these policies when you want to be alerted for metric anomalies where the threshold limits keep changing over time. 

For more information, see Variate policy concept and instructions for configuring a variate policy.

Example scenarios for variate policies

This section provides a few examples of variate policies.

Scenario 1

Under normal circumstances, a busy server might utilize 70% of its CPU without causing any concern. However, on a relatively underutilized server, CPU utilization reaching 50% could indicate an issue because it deviates from that server's own baseline.

Scenario 2

Use variate policies to analyze anomalies for a group of metrics together. All the metrics in a multivariate configuration are analyzed at the same time and contribute to a single anomaly score. A single anomaly event is generated based on that score.
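One way to picture a single score built from several metrics is a sum of per-metric z-scores. This PromQL sketch is purely illustrative of the idea, not how the product computes its anomaly score; the metric names, one-day window, and threshold of 6 are all assumptions:

```promql
# Hypothetical combined anomaly score: the sum of absolute z-scores of
# two metrics over a one-day window, raising a single event when the
# total deviation across both metrics is high.
(
    abs(cpu_utilization_percent - avg_over_time(cpu_utilization_percent[1d]))
      / stddev_over_time(cpu_utilization_percent[1d])
  +
    abs(memory_utilization_percent - avg_over_time(memory_utilization_percent[1d]))
      / stddev_over_time(memory_utilization_percent[1d])
) > 6
```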

 


BMC Helix Operations Management 26.1