Detecting anomalies by using static and dynamic thresholds

Anomalies are observations that diverge from a well-structured data pattern, such as an irregular spike in time-series data or data points that cannot be classified within a specific data set. An anomaly can occur independently or result from a combination of factors. For example, slow response time combined with high memory utilization might affect the expected system behavior.

As an administrator, create alarm and variate policies to help you monitor and manage the health of your system and detect anomalies. These policies can also help you detect abnormal behavior in your monitoring data more accurately by reducing:

  • False positives: Scenarios where an alarm is raised even though the system exhibits normal behavior. 
  • False negatives: Scenarios where the product fails to raise an alarm even though an abnormal metric condition occurs.
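
The two error types can be counted by comparing raised alarms with what actually happened. The following is a minimal sketch only; the function name `count_alarm_errors` and the sample data are hypothetical and are not part of the product.

```python
from typing import Sequence, Tuple


def count_alarm_errors(
    alarm_raised: Sequence[bool], truly_abnormal: Sequence[bool]
) -> Tuple[int, int]:
    """Return (false_positives, false_negatives) for paired observations."""
    pairs = list(zip(alarm_raised, truly_abnormal))
    # False positive: an alarm was raised while the system behaved normally.
    false_positives = sum(1 for raised, abnormal in pairs if raised and not abnormal)
    # False negative: no alarm was raised although the condition was abnormal.
    false_negatives = sum(1 for raised, abnormal in pairs if not raised and abnormal)
    return false_positives, false_negatives


# Hypothetical sample: four evaluation intervals.
raised = [True, True, False, False]
abnormal = [True, False, True, False]
print(count_alarm_errors(raised, abnormal))  # (1, 1)
```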

Alarm policies use static thresholds, baseline thresholds, or a combination of both, and variate policies use dynamic thresholds. In monitoring, a threshold is a defined value that determines whether a monitored metric, such as CPU utilization or memory usage, is above, below, or within a normal range in your infrastructure environment.


Alarm policies

Alarm policies use the following threshold combinations to detect anomalies:

  • Static thresholds only
  • Baseline thresholds only
  • Static and baseline thresholds

The following table lists a few example scenarios for alarm policies:

Type of threshold used

Static thresholds: These are predefined values used to detect abnormalities in monitored data. If a monitored metric value is greater than or less than the predefined threshold value, an alarm event is generated. Users define the threshold values manually, based on past performance data and context.
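
A static threshold check amounts to comparing a metric value with a fixed limit. The following is a minimal sketch, not the product's implementation; the `StaticThreshold` class, its fields, and the sample limits are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaticThreshold:
    """Hypothetical static threshold: a fixed limit and a comparison direction."""

    limit: float
    direction: str  # "above" or "below"

    def is_breached(self, value: float) -> bool:
        """Return True when the value violates the fixed limit."""
        if self.direction == "above":
            return value > self.limit
        if self.direction == "below":
            return value < self.limit
        raise ValueError(f"unknown direction: {self.direction!r}")


# Assumed example limits, chosen manually as a user would.
cpu_high = StaticThreshold(limit=75.0, direction="above")
disk_free_low = StaticThreshold(limit=10.0, direction="below")

print(cpu_high.is_breached(82.0))       # True
print(disk_free_low.is_breached(25.0))  # False
```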

Combination of static and baseline thresholds: IT operators are notified when a metric value breaches the combination of static and baseline threshold conditions. A baseline is the expected normal operating range for a metric or attribute of a monitor. Using the baseline option along with the static threshold values in the trigger condition can generate events based on learned behavior. 

Baseline threshold: IT operators are notified by event alerts that have the following severity levels when a metric value breaches the baseline threshold value:

  • Information
  • Critical
  • Major
  • Minor
  • Warning

Example usage scenarios

Scenario 1

If CPU utilization breaches the 75% threshold for 15 minutes and also violates the baseline determined for that event, an alarm event must be generated.
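
The condition in Scenario 1 can be sketched as follows. This is an illustration under stated assumptions: one sample arrives per minute, the baseline is supplied as a precomputed upper bound (how the product learns it is not described here), and the function name `should_raise_alarm` is hypothetical.

```python
from typing import Sequence


def should_raise_alarm(
    samples: Sequence[float],
    static_limit: float,
    baseline_high: float,
    duration_minutes: int,
) -> bool:
    """Return True if the metric exceeds both the static limit and the
    baseline upper bound for duration_minutes consecutive samples.

    Assumes one sample per minute, in chronological order.
    """
    consecutive = 0
    for value in samples:
        if value > static_limit and value > baseline_high:
            consecutive += 1
            if consecutive >= duration_minutes:
                return True
        else:
            # Any sample within limits resets the duration window.
            consecutive = 0
    return False


# CPU at 90% for 15 minutes: above the 75% static limit and an assumed baseline of 80%.
print(should_raise_alarm([90.0] * 15, 75.0, 80.0, 15))  # True
# CPU at 78%: above the static limit but still within the assumed baseline.
print(should_raise_alarm([78.0] * 15, 75.0, 80.0, 15))  # False
```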

Scenario 2

If the storage server's power supply is degraded for more than 15 minutes, the service is impacted and an event alert must be generated.

Scenario 3

If an Apache Tomcat server is offline or unavailable for more than 15 minutes, the service is impacted and an alarm notification is needed.

Scenario 4

A Linux server's CPU utilization violates the baseline value, so an event alert of the Alarm class must be generated. This alert can have one of the following severity levels:

  • Information
  • Critical
  • Major
  • Minor
  • Warning

Reference

To learn more about alarm policies, see the alarm policy concept and instructions for configuring an alarm policy.


Variate policies

As an administrator, create variate policies to receive event notifications. Variate policies use dynamic thresholds. Use these policies when you want to be alerted to metric anomalies where the threshold limits keep changing over time. The following table provides a few example scenarios for variate policies.

Type of threshold used

Dynamic: Variate policies use dynamic thresholds, which are derived by using unsupervised learning methods and are used to detect anomalies in metrics that change dynamically over time. Examples of such metrics include page load time, active virtual pages, VM utilization, and CPU load.

Example usage scenarios

Scenario 1

Under normal circumstances, a busy server might utilize 70% of its CPU without causing any concern. However, 50% CPU utilization on a server that is normally underutilized could indicate an issue.
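
The idea in this scenario can be illustrated by deriving a separate upper limit from each server's own history. The following sketch uses a simple mean plus three standard deviations as a stand-in; it is not the unsupervised learning method that the product uses, and the function name and sample data are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def dynamic_upper_limit(history: Sequence[float], k: float = 3.0) -> float:
    """Derive an upper limit from a server's own recent values.

    Stand-in technique: mean + k * sample standard deviation.
    Requires at least two values in history.
    """
    return mean(history) + k * stdev(history)


# Hypothetical CPU utilization history (percent) for two servers.
busy_history = [68.0, 72.0, 70.0, 69.0, 71.0, 73.0, 70.0, 67.0]
idle_history = [10.0, 12.0, 9.0, 11.0, 10.0, 13.0, 11.0, 12.0]

# 70% is normal for the busy server.
print(70.0 > dynamic_upper_limit(busy_history))  # False
# 50% is far outside the usual range of the underutilized server.
print(50.0 > dynamic_upper_limit(idle_history))  # True
```

Because the limit is recomputed from each server's history, the same CPU value can be normal on one server and anomalous on another.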


Scenario 2

Use variate policies to analyze anomalies for a group of metrics together. All the metrics in a multivariate configuration are analyzed at the same time and contribute to a single anomaly score. A single anomaly event is generated based on that score.
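
One way to picture a single score for a group of metrics is to combine the per-metric deviations into one number. The following sketch takes the root mean square of per-metric z-scores; this scoring formula, the function name `anomaly_score`, the cutoff of 3.0, and the sample data are all assumptions for illustration and do not describe how the product calculates its score.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Dict, Sequence


def anomaly_score(
    history: Dict[str, Sequence[float]], current: Dict[str, float]
) -> float:
    """Combine all metrics into one score: the root mean square of z-scores.

    Assumes every metric has at least two historical values and a
    nonzero standard deviation.
    """
    squared = []
    for name, values in history.items():
        z = (current[name] - mean(values)) / stdev(values)
        squared.append(z * z)
    return sqrt(sum(squared) / len(squared))


# Hypothetical history for two metrics that are analyzed together.
history = {
    "response_ms": [100.0, 110.0, 90.0, 105.0, 95.0],
    "memory_pct": [50.0, 52.0, 48.0, 51.0, 49.0],
}

score = anomaly_score(history, {"response_ms": 180.0, "memory_pct": 60.0})
# One event for the whole group, based on the single score (assumed cutoff: 3.0).
print(score > 3.0)  # True
```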

Reference

To learn more about variate policies, see the variate policy concept and instructions for configuring a variate policy.
