Threshold Check


Description

BMC Helix Network Management always collects all available statistics from all managed devices and applications (CPU utilization, hardware temperature, SQL deadlocks, and so forth).

Use a threshold check to actively monitor a single statistic and measure it against a set of high or low threshold values. This provides insight into imminent failures and the detection of actual failures.

A threshold check provides two modes of monitoring for a given statistic:

  • Static threshold evaluation
  • Dynamic threshold evaluation (anomaly detection)

Each mode can be used independently of the other, or both can be used simultaneously.

The states of threshold checks are displayed in various locations in BMC Helix Network Management, most commonly in a Tactical Overview dashboard widget.

Threshold checks can be configured to monitor for both high and low values, only high values, or only low values. A threshold check detecting unacceptable values for its monitored statistic displays as failed in BMC Helix Network Management dashboards and generates an alarm. Additionally, a threshold check can be configured to run commands on the device or application it is monitoring when the check fails (such as reboot or restart-service commands). This provides the possibility of resolving some issues automatically without the need to involve personnel.

By default, BMC Helix Network Management automatically adds several preconfigured threshold checks to each managed device to provide basic monitoring and alerting services. However, BMC Helix Network Management collects many other statistics to which you can add a threshold check to suit your specific monitoring needs.

Static Thresholds

The threshold check measures a statistic's current value against a set of static threshold values and determines if the value of the collected statistic has exceeded any of those thresholds. This functionality provides basic performance monitoring of a statistic. For more advanced monitoring, see Dynamic Thresholds (Anomaly Detection).

Dynamic Thresholds (Anomaly Detection)

Anomaly detection is a more advanced function of a threshold check. It uses adaptive, dynamic threshold values to determine if a statistic is experiencing values that are inconsistent with what is normally expected from that statistic at this time, given its history.

As an example, say CPU utilization for server X has been 85% at this time of day, for this day of the week, for the past eight weeks, but is now showing 55%. Given its history, that's not the expected utilization value for this time. This could indicate that an important process on that server has stopped running or that clients are unable to connect and utilize the server resources.

Since neither of the reported values is particularly high or low, static threshold monitoring would not indicate anything unusual happening. However, the dynamic threshold values of anomaly detection look at the collected values in the context of their history and develop an understanding of what is typical for that statistic at any given time.

When used together, static thresholds and anomaly detection are extremely powerful. However, you are free to use them independently of each other in any given threshold check. Each can be configured alone without requiring the other to be configured.

Details

Data Collection

BMC Helix Network Management always collects and stores the retrieved values for each statistic of a managed device and application to which it has access, maintaining a historical record of the performance of that statistic. A threshold check uses a specific subset of that data to evaluate the current state of the statistic. Note, however, that the specific subset of data used is different for static thresholds and anomaly detection, as explained in each relevant section below.

(Note: The performance history of each collected statistic is graphed on the Performance tab of the Device Dashboard for any given managed device and does not require a threshold check for it to be viewable.)

Statistic Pairs

In BMC Helix Network Management, many collected statistics are configured and monitored as pairs (for example, bandwidth utilization pairs inbound traffic with outbound traffic). Threshold checks are also configured in these same pairs, with the opposing statistics identified as VARIABLE ONE and VARIABLE TWO in the check (the actual statistic name is also identified alongside the variable label). The static threshold and anomaly detection settings for each variable are independent, but all other configuration settings for the check are shared between the two.

When configuring static threshold values, the units for the high and low fields will automatically be appropriate for the type of statistic selected (for example, CPU utilization would show percent, while latency would show seconds). A pull-down selector next to the value allows you to specify a multiplier prefix for the entered value. This allows you to configure the check using values with which you are comfortable and have BMC Helix Network Management do the math for you.

Errors Per Second

BMC Helix Network Management always measures errors per second values as milli-errors per second, which allows for err/sec measurements of less than one. See Understanding Errors per Second for more information on calculating err/sec values that can be used in threshold checks. When entering err/sec values, remember to select the milli (m) multiplier prefix.
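As a rough illustration of the unit conversion only, the sketch below turns an error count observed over an interval into milli-errors per second. The function name and its parameters are hypothetical and are not part of the product; see Understanding Errors per Second for the product's own guidance.

```python
def to_milli_errors_per_second(error_count: int, interval_seconds: float) -> float:
    """Convert an error count over an interval into milli-errors per second.

    Hypothetical helper for illustration; not a product API.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    errors_per_second = error_count / interval_seconds
    # 1 err/sec equals 1000 milli-err/sec
    return errors_per_second * 1000.0


# 3 errors in a 300-second interval is 0.01 err/sec, which is about 10 milli-err/sec
print(to_milli_errors_per_second(3, 300))
```

A rate below one error per second, such as 0.01 err/sec, becomes a whole-number value once expressed with the milli (m) prefix.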

Threshold Check States

Threshold checks display one of the following states when viewed in dashboards.

State         Description
OK            (Green) The value of the statistic is within the user-determined acceptable operating range.
WARNING       (Yellow) The value of the statistic is higher than the configured high WARNING value or lower than the configured low WARNING value but has not yet exceeded the configured CRITICAL value for either.
CRITICAL      (Red) The value of the statistic is higher than the configured high CRITICAL value or lower than the configured low CRITICAL value. Generates an alarm.
ACKNOWLEDGED  (Blue) Indicates a threshold check in a CRITICAL state that a user has acknowledged. This state is technically an incident state, not a check state, but its display in the dashboards helps to distinguish between problems that are new and problems that are already being addressed.

Threshold checks always generate an alarm immediately upon reaching the CRITICAL state but normally do not generate an alarm when reaching the WARNING state. To generate an alarm for threshold checks reaching the WARNING state, the "Incidents on warning thresholds exceeded" feature must be activated on BMC Helix Network Management's Feature Toggle administration page (this feature is turned off by default).

A threshold check continues to retrieve and evaluate statistic values according to its configuration even after an alarm has been generated. If the value returns to normal at any time, the check immediately recovers to the OK state, clears its alarm, and signals any opened incident that it has recovered.

The Tactical Overview dashboard widget is useful for displaying threshold check status in dashboards.

Static Thresholds

To activate static threshold monitoring, enter values in the HIGH and/or LOW fields for warning (yellow) and critical (red) states. You can configure thresholds for either variable or both. Leaving a field empty disables monitoring for that condition. Leaving all fields empty turns off static threshold monitoring, which is useful when you want to rely only on anomaly detection.

When the latest value for a statistic is collected, the threshold check calculates an average from the most recent datapoints. The number of datapoints included in the average is defined during configuration. The averaged value is then compared with the configured high and low thresholds to determine whether any thresholds have been exceeded.

The averaged value is calculated in one of two ways, depending on whether the deployment is On‑Premises or SaaS:

On‑Premises

On‑Premises deployments average the two most recent datapoints for threshold evaluation.

The On‑Premises platform uses RRD (Round Robin Database), which stores data in fixed time buckets. Retrieving only the latest datapoint often results in NaN values if the bucket has not yet been finalized. Averaging the last two datapoints reduces NaN occurrences and provides a more stable value.

SaaS

SaaS deployments use a dynamic lookback window enabled by the VictoriaMetrics storage platform:

  • The system looks back over a time window defined as timePeriod × 60 seconds. For example, a 5‑minute period results in a 300‑second lookback window.
  • All datapoints within that window are averaged.
  • If no datapoints are found, the system expands the lookback window by 600 seconds.
  • The system may look back up to 15 minutes to locate datapoints.

VictoriaMetrics supports flexible querying, allowing the system to expand the lookback window when datapoints are sparse. This ensures consistent and reliable threshold evaluations.
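The SaaS lookback behavior can be sketched roughly as follows. This is a minimal illustration, not the product's implementation: the function name, the datapoint representation, and the way the 600-second expansion interacts with the 15-minute limit (a single expansion, capped at 900 seconds) are all assumptions.

```python
import math
from typing import Optional, Sequence, Tuple

Datapoint = Tuple[float, float]  # (unix timestamp in seconds, value); assumed shape


def average_with_lookback(
    points: Sequence[Datapoint],
    now: float,
    time_period_minutes: int,
    expand_seconds: int = 600,        # expansion described in the text
    max_lookback_seconds: int = 900,  # 15-minute limit described in the text
) -> Optional[float]:
    """Average all datapoints in the lookback window, expanding once if empty."""

    def mean_in(window_seconds: int) -> Optional[float]:
        values = [
            value
            for ts, value in points
            if now - window_seconds <= ts <= now and not math.isnan(value)
        ]
        return sum(values) / len(values) if values else None

    window = time_period_minutes * 60  # e.g. 5 minutes -> 300 seconds
    result = mean_in(window)
    if result is None:
        # Assumption: expand once, never looking back further than the cap.
        result = mean_in(min(window + expand_seconds, max_lookback_seconds))
    return result
```

With a 5-minute period, the initial window is 300 seconds and the expanded window is 900 seconds, which matches the 15-minute limit stated above.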

Averaging helps prevent unnecessary alarms caused by short‑duration spikes and provides a more stable representation of the statistic’s current behavior. NaN values are ignored until they age out of the averaging window.

Static thresholds allow independent values to trigger WARNING (yellow) and CRITICAL (red) states for both high- and low-value conditions:

  • If the averaged value rises above a configured high warning threshold or falls below a configured low warning threshold, the check enters the WARNING state. This state appears in dashboards but, by default, does not generate an alarm.
  • If the averaged value rises above a configured high critical threshold or falls below a configured low critical threshold, the check enters the CRITICAL state. This state appears in dashboards and generates an alarm.
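The comparison logic described above can be sketched as a small function. The function and parameter names are hypothetical; a threshold of None stands for a field left empty, and the strict comparisons follow the "higher than" and "lower than" wording of the state descriptions.

```python
from typing import Optional


def static_state(
    averaged_value: float,
    high_warning: Optional[float] = None,
    high_critical: Optional[float] = None,
    low_warning: Optional[float] = None,
    low_critical: Optional[float] = None,
) -> str:
    """Return OK, WARNING, or CRITICAL for an averaged statistic value.

    Illustrative sketch only; None means the field was left empty,
    which disables monitoring for that condition.
    """
    # Critical conditions take precedence over warning conditions.
    if high_critical is not None and averaged_value > high_critical:
        return "CRITICAL"
    if low_critical is not None and averaged_value < low_critical:
        return "CRITICAL"
    if high_warning is not None and averaged_value > high_warning:
        return "WARNING"
    if low_warning is not None and averaged_value < low_warning:
        return "WARNING"
    return "OK"


print(static_state(85.0, high_warning=80.0, high_critical=90.0))
```

Calling the function with no thresholds always yields OK, which mirrors leaving all fields empty to turn off static threshold monitoring.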

Anomaly Detection

An anomaly check compares the most recently polled value of a statistic against a set of dynamic threshold values computed from a data set that samples eight previously polled values.

To activate anomaly detection for a threshold check during configuration, select a value for the Boundary field (in the ANOMALY section) other than None.

The check can be configured to look for upper boundary and/or lower boundary anomalies (similar to high and low static threshold values). These upper and lower boundaries represent dynamic threshold values that are computed as deviations from the mean of the eight sampled values. As each execution of the check drops older samples from the data set and adds newer samples, these boundaries are continuously recomputed to establish what should be considered "normal" for the statistic at the time of polling.

The amount that the upper and lower boundary thresholds deviate from the mean is controlled by the anomaly sensitivity setting of each individual threshold check. A lower sensitivity causes the boundary values to be further from the computed mean of the samples, meaning the more abnormal a polled value must be to be considered an anomaly. A higher sensitivity causes the boundary values to be closer to the mean, meaning the less abnormal a polled value must be to be considered an anomaly.

The boundaries are calculated using the formula (standard deviation of sample set * 4) + (mean of sample set * sensitivity factor).

  • High sensitivity factor = 0.0
  • Medium sensitivity factor = 0.1
  • Low sensitivity factor = 0.5
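To make the formula concrete, here is a minimal sketch that computes upper and lower boundaries from a sample set. It assumes the formula yields the distance of each boundary from the mean (consistent with the boundaries being described as deviations from the mean) and uses the population standard deviation; the source states neither detail, and all names are hypothetical.

```python
import statistics
from typing import Sequence, Tuple

# Sensitivity factors as listed above.
SENSITIVITY_FACTOR = {"high": 0.0, "medium": 0.1, "low": 0.5}


def anomaly_boundaries(samples: Sequence[float], sensitivity: str) -> Tuple[float, float]:
    """Return (lower, upper) boundaries for a sample set.

    Assumptions: the formula gives the deviation from the mean, and the
    standard deviation is the population form (pstdev).
    """
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples)
    deviation = (stdev * 4) + (mean * SENSITIVITY_FACTOR[sensitivity])
    return mean - deviation, mean + deviation


samples = [8.0, 12.0, 8.0, 12.0, 8.0, 12.0, 8.0, 12.0]  # mean 10, std dev 2
for level in ("high", "medium", "low"):
    print(level, anomaly_boundaries(samples, level))
```

Under these assumptions, a statistic whose samples average 85 has a low-sensitivity upper boundary of at least 127.5, a value that a percentage-based statistic can never reach.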

The appropriate sensitivity for anomaly detection is highly subjective, depending on the circumstances. Trial and error will be required for each situation to achieve optimal performance. (A higher sensitivity can detect smaller deviations, while a lower sensitivity can only detect larger deviations.)

Anomaly detection includes independent sensitivity settings to trigger both WARNING and CRITICAL states.

If the current statistic value exceeds the computed warning upper or lower boundary values, it is considered a potential anomaly, and the check enters the WARNING state. This state is displayed in the dashboards, but BMC Helix Network Management does not take any other action.

If the current statistic value exceeds the computed critical upper or lower boundary values, it is considered an anomaly, and the check enters the CRITICAL state. This state is displayed in the dashboards and generates an alarm.

Limitations in Anomaly Detection

It is not recommended to activate anomaly detection for statistics that use percentages for their values, such as CPU utilization or interface bandwidth utilization. Due to the way anomaly boundary thresholds are calculated, it is possible to create a situation in which anomalies are never detected when monitoring percentage-based statistics. To monitor percentage-based statistics for unacceptable performance values, it is recommended to use static thresholds, or at least to use static thresholds in addition to anomaly detection.

Anomaly Detection Samples

Anomaly checks run every 30 minutes, at 10 and 40 minutes past the hour. Each check calculates the mean for the previous 30-minute interval. For example, if the check runs at 1:10 p.m., it evaluates the mean for 12:30–1:00 p.m. After this value is calculated, it is compared with the historical data points defined by the configured season.
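The relationship between a run time and the interval it evaluates can be sketched as follows. The 1:10 p.m. case comes from the example above; treating the run at 40 minutes past the hour the same way (evaluating the 30 minutes that end on the half hour) is an assumption, and the function name is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Tuple


def evaluation_window(run_time: datetime) -> Tuple[datetime, datetime]:
    """Return (start, end) of the 30-minute interval evaluated by a run.

    Assumption: the interval ends at the most recent half-hour boundary
    before the run time.
    """
    boundary_minute = 30 if run_time.minute >= 30 else 0
    end = run_time.replace(minute=boundary_minute, second=0, microsecond=0)
    return end - timedelta(minutes=30), end


start, end = evaluation_window(datetime(2024, 1, 1, 13, 10))
print(start.time(), end.time())  # the 12:30 to 13:00 interval from the example
```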

The anomaly detection process does not use the eight most recent polled values. Instead, it samples eight historical data points separated by a fixed interval determined by the selected season. Each sample is taken from the same relative timestamp as the current poll.

For a season of Hour, if the statistic is polled at 8:05 p.m., the system samples values from:

  • 7:05 p.m.
  • 6:05 p.m.
  • 5:05 p.m.
  • 4:05 p.m.
  • 3:05 p.m.
  • 2:05 p.m.
  • 1:05 p.m.
  • 12:05 p.m.

When the next poll occurs at 8:10 p.m., the system samples values from:

  • 7:10 p.m.
  • 6:10 p.m.
  • 5:10 p.m.
  • 4:10 p.m.
  • 3:10 p.m.
  • 2:10 p.m.
  • 1:10 p.m.
  • 12:10 p.m.

Selecting a different season changes the interval between sampled values. Available options include:

  • Hour
  • Day
  • Week
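The sampling pattern shown in the Hour examples generalizes to the other seasons. The sketch below lists the eight historical timestamps for a given poll time; it assumes the Day and Week seasons step back by exactly one day and one week per sample, and the names used are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List

# Assumed interval between samples for each season.
SEASON_INTERVAL = {
    "Hour": timedelta(hours=1),
    "Day": timedelta(days=1),
    "Week": timedelta(weeks=1),
}


def sample_times(poll_time: datetime, season: str, count: int = 8) -> List[datetime]:
    """Return the historical timestamps sampled for a poll, newest first."""
    step = SEASON_INTERVAL[season]
    return [poll_time - step * i for i in range(1, count + 1)]


# A poll at 8:05 p.m. with the Hour season samples 7:05 p.m. back to 12:05 p.m.
for ts in sample_times(datetime(2024, 1, 1, 20, 5), "Hour"):
    print(ts.strftime("%H:%M"))
```

With the Week season, the same call reaches back eight weeks, which is why the history needed before anomaly detection becomes meaningful grows with the season length.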

An anomaly alarm remains active until one of the following occurs:

  • The offending data point ages out of the sample set.
  • Dynamic threshold values are recalculated so the data point no longer falls outside the sensitivity range.

When the Week season is selected, this clearing period can be significantly longer, so any resulting incidents should be acknowledged.

Previous Anomalous Samples

If any single previous value of the eight data values being sampled was itself an anomaly, that sample is excluded from the detection calculations. However, if more than one of the previous data values were anomalous, those values are used in the detection calculations. This is how the threshold check dynamically adapts to gradual changes in behavior that are, in fact, perfectly normal.
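The exclusion rule can be expressed as a short sketch. It assumes each historical sample carries a flag recording whether it was itself judged anomalous; that representation and the function name are hypothetical.

```python
from typing import List, Sequence, Tuple


def samples_for_detection(samples: Sequence[Tuple[float, bool]]) -> List[float]:
    """Select the values used in the detection calculation.

    Each sample is (value, was_anomaly). A lone anomalous sample is
    excluded; if more than one sample was anomalous, all are kept.
    """
    anomalous_count = sum(1 for _, was_anomaly in samples if was_anomaly)
    if anomalous_count == 1:
        return [value for value, was_anomaly in samples if not was_anomaly]
    return [value for value, _ in samples]


history = [(10.0, False)] * 7 + [(99.0, True)]
print(samples_for_detection(history))  # the single anomalous 99.0 is left out
```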

Minimum Sample Values

Each sampled data point must also have a minimum value to be checked for anomalous behavior (configured in the check using the Min Value field). The anomaly engine will never use values below this setting (they will simply be dropped from the data set). This is to prevent a sequence of extremely low values from producing false positives from only minor deviations. (Changing from an average of 1 to 0.001 is a thousandfold change in relative terms, but still not likely to be any kind of problem.)

However, if a polled value is below the minimum setting after an anomaly is detected and an alarm generated, the current alarm is automatically cleared, and the opened incident is notified. This is for housekeeping purposes to prevent a detected upper boundary anomaly from causing the check to be stuck in an alarm state due to never processing the current data point if the polled values remain below the minimum.

Unfortunately, this same logic also causes a lower boundary anomaly alarm to clear, negating the value of setting a lower boundary check. It is therefore recommended to exercise caution when using a minimum value with a lower boundary check, as polled values below this setting will both prevent a lower boundary anomaly from triggering an alarm and cause any existing alarm to be cleared.

(The minimum value setting is only for anomaly detection and does not interfere with static threshold check operations.)
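The two effects of the Min Value setting described above can be sketched separately: dropping low samples from the data set, and clearing an existing alarm when the polled value falls below the minimum. Both function names are hypothetical, and the sketch is an illustration rather than the product's implementation.

```python
from typing import List, Sequence


def filter_samples(samples: Sequence[float], min_value: float) -> List[float]:
    """Drop samples below the minimum so they are never used for detection."""
    return [value for value in samples if value >= min_value]


def should_clear_alarm(polled_value: float, min_value: float, alarm_active: bool) -> bool:
    """An active anomaly alarm is cleared when the polled value is below the minimum."""
    return alarm_active and polled_value < min_value


print(filter_samples([0.001, 5.0, 7.0, 0.5], min_value=1.0))
print(should_clear_alarm(0.2, min_value=1.0, alarm_active=True))
```

The second function shows why a lower boundary check and a minimum value work against each other: the same low values that would indicate a lower boundary anomaly also clear the alarm.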

Configuration and Management

Add a Threshold Check to a Device

Use Device Templates for Monitoring Checks

In general, it is not recommended to add monitoring checks directly to devices. Instead, use device templates to add these checks. Device templates allow you to automate the process of adding the appropriate monitoring checks to the right devices.

To add a threshold check directly to a managed device, follow the procedure below.

(Note: When a threshold check is added to a device, it cannot be removed. It can only be disabled.)

See also Add a Threshold Check to a Device Template.

  1. Log in to BMC Helix Network Management as a user with the Admin access level or higher.
  2. Locate the device to which you would like to add a threshold check and select it to open its device dashboard.
    • Specific devices can be located in BMC Helix Network Management by either drilling into a Tactical Overview dashboard widget or searching for the device by name using the search feature at the top of the main menu.
  3. Select the gear icon in the top right of the dashboard to open the dashboard administrative view.
  4. Select the Instances tab.
  5. Locate the panel for the statistic type containing the statistic you would like to monitor and select it to open the panel.
    • If the statistic you would like to monitor is in the Network panel, use the pull-down menu at the top right and select Thresholds to display the network interfaces.
  6. Locate the specific statistic to which you would like to add a threshold check and select its add threshold icon (+) in the ACTIONS column.
  7. In the ACTION GROUP field, select the action group(s) to receive alert notifications before escalation.
  8. In the ESCALATION GROUP field, select the action group(s) to receive alert notifications after escalation.
  9. In the RENOTIFICATION INTERVAL field, enter the number of minutes for BMC Helix Network Management to wait before sending another alert notification if the problem is not acknowledged by a user.
    • Alert notifications are sent to the action groups in the ACTION GROUP field.
    • The default value of 1440 minutes (24 hours) is recommended to minimize alert noise.
    • Setting a value of 0 (zero) will disable renotifications.
  10. In the ESCALATE AT field, enter the number of alert notifications after the first for BMC Helix Network Management to wait before sending alert notifications to the action groups in the ESCALATION GROUP field, as well as to the groups in the ACTION GROUP field.
    • The default value of 1 means that a total of 2 alerts must be sent before escalation groups start receiving them.
  11. In the STATISTICAL GROUP field, select the type that has the greatest relevance to the check. This field determines which statistical calculations this check contributes to for reports.
  12. (Optional) In the SUBSTRING field, enter a string or regular expression to include or exclude specific interfaces from this check using a match to the interface name/description. If you leave this field empty, BMC Helix Network Management applies the configured threshold check to every interface on the device. For more information about how the Substring regex rule works, see Regular expressions.
  13. If you would like to configure static threshold monitoring (repeat these steps for each variable if two variables are present):
    1. (Optional) In the HIGH warning field (yellow), enter the exact value at which the check should enter the WARNING state for high values.
      • Next to the value type, select the multiplier prefix.
    2. (Optional) In the HIGH critical field (red), enter the exact value at which the check should enter the CRITICAL state for high values.
      • Next to the value type, select the multiplier prefix.
    3. (Optional) In the LOW warning field (yellow), enter the exact value at which the check should enter the WARNING state for low values.
      • Next to the value type, select the multiplier prefix.
    4. (Optional) In the LOW critical field (red), enter the exact value at which the check should enter the CRITICAL state for low values.
      • Next to the value type, select the multiplier prefix.
    5. In the TIME PERIOD field, select the time period over which data values will be sampled for the calculated average.
      • See Best Practices below for guidance on threshold check time periods.
  14. If you would like to configure anomaly detection (repeat these steps for each variable if two variables are present):
    1. In the Boundary field, select whether to check for upper boundary anomalies, lower boundary anomalies, or both.
    2. In the Sensitivity warning field (yellow), select the desired sensitivity. (This should always be at least one setting higher than the critical sensitivity field so that the warning state occurs first.)
    3. In the Sensitivity critical field (red), select the desired sensitivity. (This should always be at least one setting lower than the warning sensitivity field so that the warning state occurs first.)
    4. In the Season field, select the desired season for the data samples.
    5. (Optional) In the Min Value field, set the minimum value that a polled value must be to qualify for anomaly detection.
      • The value entered in this field should be specified in the same base unit displayed in the static threshold configuration without the prefix (for example, bytes, not megabytes; seconds, not milliseconds). Note: For bandwidth monitoring (only), the value must be specified in bits per second and not as a percentage.
  15. Select Create Threshold.

Disable a Threshold Check on a Single Device

Currently being revised.

Disabling a threshold check prevents that specific check from monitoring its statistic, but the statistic is still polled for values and those values are still recorded by BMC Helix Network Management.

Disable Threshold Checks on Multiple Devices

To disable multiple specific threshold checks on multiple specific managed devices, follow the procedure below.

Disabling a threshold check prevents that specific check from monitoring its statistic, but the statistic is still polled for values and those values are still recorded by BMC Helix Network Management.

  1. Log in to BMC Helix Network Management as a user with the Admin access level or higher.
  2. Go to the main menu and select Administration > Change Devices > Turn On/Off Thresholds to open the Deactivate Thresholds page.
  3. Select a functional group that contains the devices you would like to affect.
  4. Place a check next to the specific devices on which you would like to disable specific threshold checks.
  5. Select Select Device.
  6. Place a check next to the specific threshold checks that you would like to disable.
  7. Select Update Thresholds.

Best Practices

Device Templates

It is highly recommended that threshold checks be added to devices and managed through device templates and not directly on devices. Even in unique device-specific circumstances, threshold checks for that device can still be managed using a device template that includes the desired threshold checks and is assigned directly to the device.

The only circumstance under which a threshold check should ever be added to a device directly is when its device template functionality has been completely turned off.

Time Periods for Static Threshold Checks

BMC Helix Network Management polls and records each statistic every five minutes. When you configure a static threshold check with a TIME PERIOD of 5 minutes, a single poll exceeding the warning or critical threshold triggers a state change.

Selecting a 15‑minute period causes the system to evaluate the average of all samples collected during that interval, and a state change occurs only when that average exceeds the threshold. This field is an important adjustment for reducing false alarms.

Divide the TIME PERIOD value configured in the check by 5 to determine the number of recent samples to average before comparing them to the threshold values.
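The division described above amounts to the following arithmetic. The function name is hypothetical, and the five-minute default reflects the polling interval stated in this section.

```python
def samples_to_average(time_period_minutes: int, poll_interval_minutes: int = 5) -> int:
    """Number of recent samples averaged for a given TIME PERIOD setting."""
    if time_period_minutes < poll_interval_minutes:
        raise ValueError("time period must cover at least one poll interval")
    return time_period_minutes // poll_interval_minutes


print(samples_to_average(5))   # 1 sample: a single poll can trigger a state change
print(samples_to_average(15))  # 3 samples are averaged before comparison
```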

 

 
