Alarm policies

Use alarm policies to monitor and manage the health of your system and safeguard against abnormalities.

Alarms are types of events that are generated when threshold values are violated. You can configure the threshold values. Thresholds define an acceptable value above or below which an alarm is generated. Thresholds can be applied to multiple monitor types on multiple PATROL Agents.

An alarm policy comprises the policy name, description, and alarm generation conditions. An alarm generation condition contains the following details:

- Threshold values
- Violation duration (in minutes) and polls (in number)
- Metric to monitor
- Option to choose between all or multiple instances
- Post trigger actions

Example

As per the condition defined in the following image (Monitoring solution: Linux; Monitor type: CPU; Metric: Utilization), when the CPU utilization for Linux computers whose instance name starts with 'clm' stays above the threshold of 75% for a period of 15 minutes, a Major alarm is generated. After the CPU utilization returns to a normal state of below 75% and 15 minutes elapse, the generated alarm is automatically closed. You can specify either the duration or the number of polls for which a metric must violate the threshold.

[Image: Alarm generation conditions (poll count)]
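The open-and-close behavior in this example can be sketched in Python. This is an illustration only, not the product's implementation: the PollCondition class and the alarm_states function are hypothetical names, and the sketch uses the poll-count form of the condition. Assuming a 5-minute polling interval, three polls correspond roughly to the 15 minutes in the example.

```python
from dataclasses import dataclass


@dataclass
class PollCondition:
    """Hypothetical model of one alarm generation condition (poll-count form)."""
    threshold: float   # for example, 75.0 for 75% CPU utilization
    polls: int         # consecutive polls required to open or to close the alarm
    severity: str      # for example, "Major"


def alarm_states(values, cond):
    """Return the alarm state (a severity or None) after each polled value."""
    states = []
    violating = normal = 0
    active = None
    for value in values:
        if value > cond.threshold:
            violating += 1
            normal = 0
        else:
            normal += 1
            violating = 0
        if active is None and violating >= cond.polls:
            active = cond.severity   # violated for enough polls: open the alarm
        elif active is not None and normal >= cond.polls:
            active = None            # normal for enough polls: close the alarm
        states.append(active)
    return states


cond = PollCondition(threshold=75.0, polls=3, severity="Major")
states = alarm_states([80, 82, 90, 70, 72, 60, 50], cond)
```

In this run the alarm opens on the third violating poll and closes on the third consecutive normal poll.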

Threshold conditions

You can define the following types of threshold conditions. While adding conditions, you can specify a severity of Critical, Major, or Minor.

Based on an absolute value: If a metric's performance is known to degrade in the event of the value going above or below a common acceptable value, you can define a threshold condition containing this absolute value. For example, if the total CPU utilization of a Linux system is above 85% for 10 minutes, it may result in performance issues. In this case, you can define a threshold condition of 85% for a time period of 10 minutes. When this threshold condition is exceeded, an alarm is generated.
Based on a range of values: In scenarios where you want to monitor metric values in terms of a range, you can specify multiple conditions specifying the high and low values. For a single metric, you can add up to three conditions only. For example, if the total CPU utilization of a Linux system is above 75% for 10 minutes, you might want to be notified with a Major alarm. But if the CPU utilization exceeds 85% for 10 minutes, you might want to be notified with a Critical alarm. In this scenario, you can define two threshold conditions for the same metric with a different threshold value and different event severity. When the first condition is breached (for example, when the utilization crosses 85%), an alarm is generated with the specified severity (for example, Critical). When the next condition is breached (for example, when the utilization crosses 75%), the severity of the same alarm is updated; a new alarm is not generated.
Based on a baseline value: If you have enabled a baseline calculation for a metric, you can define a baseline violation condition to generate event alerts. For example, if the CPU utilization of the Linux servers violates the baseline, it might result in performance issues. In this case, you might want to get notified with event alerts so that you can detect anomalies in the system and take corrective actions. The alerts can have one of the following severity levels:
- Information
- Critical
- Major
- Minor
- Warning
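The range-based case above, where several conditions on one metric update the severity of a single alarm, can be sketched as follows. The function names, the dictionary used as an alarm record, and the severity ranking are assumptions made for illustration. The violation duration is left out to keep the sketch short.

```python
# Assumed ordering of the three severities available for threshold conditions.
SEVERITY_RANK = {"Minor": 1, "Major": 2, "Critical": 3}


def matching_severity(value, conditions):
    """Return the highest severity whose threshold the value exceeds, or None.

    conditions is a list of (threshold, severity) pairs for one metric.
    """
    if len(conditions) > 3:
        raise ValueError("a single metric supports up to three conditions")
    breached = [sev for threshold, sev in conditions if value > threshold]
    return max(breached, key=SEVERITY_RANK.get, default=None)


def apply_value(alarm, value, conditions):
    """Update the severity of the existing alarm record instead of creating a new one."""
    severity = matching_severity(value, conditions)
    if severity is not None:
        alarm["severity"] = severity
    return alarm


conditions = [(75, "Major"), (85, "Critical")]
alarm = {"id": 1, "severity": None}
apply_value(alarm, 90, conditions)   # severity becomes "Critical"
apply_value(alarm, 80, conditions)   # same alarm, severity becomes "Major"
```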

Baselines

A baseline is the expected normal operating range for a metric or attribute of a monitor. Using the baseline in the trigger condition for alarm policies allows a threshold to generate events based on learned behavior. While defining threshold conditions in a policy, you can select an additional condition, Violates Baseline. With this condition, an alarm is generated only if the metric violates its baseline in addition to fulfilling the initial threshold conditions. Alternatively, you can select the option, metric value violates baseline, to generate event alerts based on the baseline violation alone. A baseline is available for alarm policies only.

For example, suppose that the CPU utilization of all Linux computers crosses the threshold of 75% for a period of 15 minutes. If you selected the Violates Baseline option, an alarm is generated only if the CPU utilization also violates the baseline determined for that metric. If you chose to ignore the baseline, the threshold violation alone generates the alarm.
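The way the baseline options combine with a threshold can be summarized in a small decision function. This is a sketch under assumptions: the function name and the three option labels ("ignore", "and_baseline", "baseline_only") are hypothetical stand-ins for the choices described above, not identifiers from the product.

```python
def should_generate_alarm(threshold_violated, baseline_violated, baseline_option):
    """Decide whether a condition fires for a given baseline option.

    baseline_option is a hypothetical label:
      "ignore"        - only the threshold violation matters
      "and_baseline"  - the threshold and the baseline must both be violated
      "baseline_only" - only the baseline violation matters
    """
    if baseline_option == "ignore":
        return threshold_violated
    if baseline_option == "and_baseline":
        return threshold_violated and baseline_violated
    if baseline_option == "baseline_only":
        return baseline_violated
    raise ValueError(f"unknown baseline option: {baseline_option}")
```

For instance, a CPU utilization above 75% that stays inside its baseline fires with "ignore" but not with "and_baseline".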

Baseline calculation is done in the following phases:

Bootstrap phase: Initially, hourly baseline calculation begins after six hours of aggregate data is available for a metric. Based on this initial data, the hourly baseline for the metric starts appearing from the seventh hour. This baseline is set for the next 23 hours to get the first 24 hours of baseline data for that metric.
Baseline initialization phase: This phase starts after the first 24 hours from the start of baseline calculation have elapsed and continues for the next 14 days. At the start of each hour, the previous day's hourly aggregate is merged with the hourly baseline for that same period to calculate the current baseline, and the baseline is projected for the same hour, same day, next week.
Baseline ongoing phase: This phase starts from day 15. The hourly aggregate for the same hour and same day of the previous week is merged with the hourly baseline for that same period to get the baseline for the current hour. This baseline is also projected for the same hour, same day, next week.
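The timing of the phases above can be sketched as follows. The function names are hypothetical, and the boundaries are one reading of the description: the sketch assumes that day 15 begins after 14 full days (336 hours) of data. How aggregates are merged into the baseline is not specified here, so the sketch covers only the phase boundaries and the one-week projection.

```python
from datetime import datetime, timedelta


def baseline_available(hours_elapsed):
    """An hourly baseline starts appearing once six hours of aggregate data exist."""
    return hours_elapsed >= 6


def baseline_phase(hours_elapsed):
    """Name the phase for a number of whole hours since data collection began."""
    if hours_elapsed < 24:
        return "bootstrap"
    if hours_elapsed < 14 * 24:      # assumption: day 15 begins at hour 336
        return "initialization"
    return "ongoing"


def projected_slot(hour_start):
    """Return the slot a baseline is projected to: same hour, same day, next week."""
    return hour_start + timedelta(days=7)


slot = projected_slot(datetime(2024, 1, 1, 9))   # a Monday at 09:00
```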

Because baselines are calculated hourly, the interval that you select must include an hourly data point for baseline values to be displayed. If you select a duration or define a period on the graph that has no data point, no baselines are displayed.

In some scenarios, baseline calculation stops if the PATROL Agent stops or there is a communication gap between the microservices that enable baseline calculation. In such a scenario, a baseline gap appears on the graph. After the Agent restarts, the baseline calculation for the metric starts again.

Instances

Conditions can be defined at an instance level. An instance represents the lowest-level unit of a monitor type (or metric). You can define alarm conditions for all or multiple instances.

Defining conditions for all instances: Suppose you want to generate an alarm when the CPU utilization of any of the servers in your environment crosses 75%. If you have a combination of 4-CPU and 16-CPU servers, all the CPU instances are monitored and an alarm is generated if any of the instances crosses the maximum threshold of 75%.
Defining conditions for multiple instances: Suppose you want to generate an alarm when the CPU utilization crosses 75% for "CPU1" and "CPU11" in your environment. In this scenario, you can add a condition to generate a Critical alarm when the CPU utilization percentage crosses 75% for CPU instances whose names end with "1". In this example, "CPU1" and "CPU11" represent multiple instances.
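The difference between the two scopes can be sketched with a simple name filter. The function names and the suffix-based selection are assumptions for illustration; the product may offer other ways to match instance names.

```python
def select_instances(instances, name_suffix=None):
    """Return the instance names that a condition applies to.

    name_suffix=None means all instances; otherwise only names that end with
    the suffix are selected (one hypothetical way to pick multiple instances).
    """
    if name_suffix is None:
        return list(instances)
    return [name for name in instances if name.endswith(name_suffix)]


def violating_instances(utilization, threshold=75.0, name_suffix=None):
    """Return the selected instance names whose utilization is above the threshold."""
    selected = select_instances(utilization, name_suffix)
    return [name for name in selected if utilization[name] > threshold]


utilization = {"CPU0": 50.0, "CPU1": 80.0, "CPU11": 90.0, "CPU12": 95.0}
all_scope = violating_instances(utilization)                   # every instance is checked
multi_scope = violating_instances(utilization, name_suffix="1")  # only names ending with "1"
```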

Post trigger actions

While defining the condition, you can also define the alarm severity: Critical, Major, Minor, Information, or Warning.

Additionally, you can specify one of the following post trigger actions for the alarm event:

- Close immediately
- Close after the metric reaches a normal state and the duration equal to the violation time period has lapsed
- Close after the metric reaches a normal state and the number of consecutive polls that are at a normal level is equal to the violated poll counts
- Do not close
- Close event

Policy precedence

Policy precedence is the priority of a policy and ranges from 0 to 9999. A lower number indicates a higher priority. Policy precedence controls the alarm generation conditions when there are conflicting or overlapping configurations defined between two or more alarm policies.

If you configure two or more policies with the same metric for the same alarm severity, the conflict is resolved with the help of policy precedence.
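The rule that the lower precedence number wins for the same metric and severity can be sketched as follows. The data layout and the function name are hypothetical. The sketch keeps the first policy encountered when two policies share the same precedence, which is an assumption because tie handling is not described here.

```python
def resolve_conflicts(policies):
    """Pick, for each (metric, severity) pair, the condition from the winning policy.

    policies is a list of dicts with "name", "precedence" (0 to 9999, lower wins),
    and "conditions", a list of (metric, severity, threshold) tuples.
    """
    winners = {}
    for policy in policies:
        for metric, severity, threshold in policy["conditions"]:
            key = (metric, severity)
            current = winners.get(key)
            if current is None or policy["precedence"] < current["precedence"]:
                winners[key] = {
                    "policy": policy["name"],
                    "precedence": policy["precedence"],
                    "threshold": threshold,
                }
    return winners


policies = [
    {"name": "Policy A", "precedence": 99,
     "conditions": [("Utilization", "Critical", 80), ("Utilization", "Major", 60)]},
    {"name": "Policy B", "precedence": 70,
     "conditions": [("Utilization", "Critical", 85)]},
]
winners = resolve_conflicts(policies)
```

Here both policies define a Critical condition for Utilization, so Policy B (precedence 70) supplies it, while the Major condition, which only Policy A defines, is unaffected.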

Important

The following feature is under controlled availability for select customers.

The threshold is considered across the policies that contain the same severity.

Examples

Let us look at a few examples.

Example 1: Multiple policies have different precedence and the same metric configuration

Important

This example is for a feature that is under controlled availability for select customers.

Policy A

Precedence: 200
Monitoring solution: Linux
Monitor type: CPU
Metric: Filesystem space utilization
Condition 1: If metric value is above 40% for 1 polls And Ignore baseline
Then Generate Minor Alarm and Close Alarm Immediately
Condition 2: If metric value is above 50% for 1 polls And Ignore baseline
Then Generate Major Alarm and Close Alarm Immediately
Condition 3: If metric value is above 90% for 1 polls And Ignore baseline
Then Generate Critical Alarm and Close Alarm Immediately

Policy B

Precedence: 100
Monitoring solution: Linux
Monitor type: CPU
Metric: Filesystem space utilization
Condition 1: If metric value is above 70% for 1 polls And Ignore baseline
Then Generate Major Alarm and Close Alarm Immediately

Resolution: Only the Major threshold in Policy B is applied, and a Major alarm is generated, for the following reasons:

- The threshold values of both policies are compared across severities.
- The precedence value of Policy B is lower than that of Policy A. Therefore, the threshold of Policy B is applied and a Major alarm is generated.
- The Minor and Critical alarms as specified in Policy A are not generated.
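The outcome of Example 1 can be reproduced with a short sketch. It assumes, based only on this example, that when thresholds are compared across severities the policy with the lowest precedence value supplies all conditions for the metric. The function names and data layout are hypothetical.

```python
def effective_conditions(policies, metric):
    """Return the name and conditions of the lowest-precedence policy for a metric."""
    candidates = [p for p in policies if p["metric"] == metric]
    winner = min(candidates, key=lambda p: p["precedence"])
    return winner["name"], winner["conditions"]


def alarm_for(value, conditions):
    """Return the severity of the highest threshold that the value exceeds, or None."""
    breached = [(threshold, sev) for threshold, sev in conditions if value > threshold]
    return max(breached, key=lambda c: c[0])[1] if breached else None


metric = "Filesystem space utilization"
policies = [
    {"name": "Policy A", "precedence": 200, "metric": metric,
     "conditions": [(40, "Minor"), (50, "Major"), (90, "Critical")]},
    {"name": "Policy B", "precedence": 100, "metric": metric,
     "conditions": [(70, "Major")]},
]
name, conditions = effective_conditions(policies, metric)
```

With these inputs, a value of 95% produces a Major alarm from Policy B, and a value of 45% produces no alarm because the Minor condition of Policy A is not applied.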

Example 2: Multiple policies have the same precedence, different Agent tag names, and the same metric configurations

Policy A

Precedence: 070
Metric: Utilization
Condition: If metric value is Above 80% For time 20 mins And Ignore baseline
Then Generate Critical Alarm And Close alarm After violation time lapses

Policy B

Precedence: 070
Metric: Utilization
Condition: If metric value is Above 80% For time 20 mins And Ignore baseline
Then Generate Critical Alarm And Close alarm After violation time lapses

Resolution: No conflict. Policy A is applied to Agents whose tag name is Test1. Policy B is applied to Agents whose tag name is Test2.

Example 3: Multiple policies have different precedence, different Agent tag names, and the same metric configurations

Policy A

Precedence: 099
Metric: Utilization
Condition: If metric value is Above 80% For time 15 mins And Ignore baseline
Then Generate Critical Alarm And Close alarm After violation time lapses

Policy B

Precedence: 070
Metric: Utilization
Condition: If metric value is Above 80% For time 15 mins And Ignore baseline
Then Generate Critical Alarm And Close alarm After violation time lapses

Resolution: Policy B is applied to the Agents whose tag name is Test1 because the precedence value of Policy B (070) is lower than that of Policy A (099).