
Alarm policies

Use alarm policies to monitor and manage the health of your system and safeguard against abnormalities.

Alarms are events that are generated when threshold values are violated. You can configure the threshold values. A threshold defines an acceptable value above or below which an alarm is generated. Thresholds can be applied to multiple monitor types on multiple PATROL Agents.

An alarm policy comprises the policy name, description, and the alarm generation conditions. An alarm generation condition contains the following details:

  • Threshold values
  • Violation duration (in minutes)
  • Metric to monitor
  • Option to apply the condition to all instances or to selected (multiple) instances
  • Post-trigger actions

Example: Consider a condition defined with the following settings: Monitoring solution: Linux; Monitor type: CPU; Metric: Utilization. When the CPU utilization for Linux computers whose instance name starts with 'clm' crosses the threshold of 75% for a period of 15 minutes, a Major alarm is generated. After the CPU utilization returns to a normal state of below 75% and 15 minutes elapse, the generated alarm is automatically closed.
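
The trigger and auto-close behavior can be pictured as a simple check over recent samples. The following minimal Python sketch is an illustration only, not the product's implementation; the one-sample-per-minute rate and the print statements are assumptions made for clarity.

    # Minimal sketch of threshold evaluation with a violation duration and auto-close.
    # Illustrative only; not the product's implementation. Assumes one sample per minute.

    THRESHOLD = 75.0          # percent CPU utilization
    VIOLATION_MINUTES = 15    # duration for which the threshold must be breached

    def evaluate(samples, alarm_open):
        """samples: most recent utilization values, oldest first, one per minute."""
        recent = samples[-VIOLATION_MINUTES:]
        if len(recent) < VIOLATION_MINUTES:
            return alarm_open  # not enough data yet; keep the current state
        if all(value > THRESHOLD for value in recent):
            if not alarm_open:
                print("Major alarm generated: utilization above 75% for 15 minutes")
            return True
        if all(value <= THRESHOLD for value in recent):
            if alarm_open:
                print("Alarm closed automatically: utilization normal for 15 minutes")
            return False
        return alarm_open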


Threshold conditions

You can define the following types of threshold conditions. While adding conditions, you can specify a severity of Critical, Major, or Minor. 

  • Based on an absolute value: If a metric's performance is known to degrade when its value goes above or below a commonly accepted value, you can define a threshold condition with that absolute value. For example, if the total CPU utilization of a Linux system is above 85% for 10 minutes, it may result in performance issues. In this case, you can define a threshold condition of 85% for a time period of 10 minutes. When this threshold condition is exceeded, an alarm is generated.

  • Based on a range of values: If you want to monitor a metric across a range of values, you can define multiple conditions with different threshold values. For a single metric, you can add up to three conditions only. For example, if the total CPU utilization of a Linux system is above 75% for 10 minutes, you might want to be notified with a Major alarm, but if it exceeds 85% for 10 minutes, you might want a Critical alarm. In this scenario, you define two threshold conditions for the same metric, each with a different threshold value and a different event severity. When one condition is breached (for example, when utilization crosses 75%), an alarm is generated with the specified severity (Major). When the other condition is breached (for example, when utilization crosses 85%), the severity of the same alarm is updated (to Critical); a new alarm is not generated, as illustrated in the sketch after this list.
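
Conceptually, a range-based policy keeps a single alarm per metric and only adjusts its severity as different conditions are breached. The sketch below is an illustration under assumed data structures, not the product's logic, and omits the violation-duration check for brevity.

    # Illustrative sketch: one alarm per metric whose severity follows the
    # highest-severity breached condition. Not the product's implementation.

    # Conditions for the same metric, ordered from highest to lowest severity.
    CONDITIONS = [
        {"severity": "Critical", "threshold": 85.0},
        {"severity": "Major", "threshold": 75.0},
    ]

    def resolve_severity(utilization):
        """Return the severity of the highest-severity breached condition, or None."""
        for condition in CONDITIONS:
            if utilization > condition["threshold"]:
                return condition["severity"]
        return None

    def update_alarm(alarm, utilization):
        """alarm is None or a dict with a 'severity' key; returns the new alarm state."""
        severity = resolve_severity(utilization)
        if severity is None:
            return alarm                      # no threshold breached; leave alarm as is
        if alarm is None:
            return {"severity": severity}     # first breach: generate a new alarm
        alarm["severity"] = severity          # later breach: update the same alarm
        return alarm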


Baselines

A baseline is the expected normal operating range for a metric or attribute of a monitor. Using the baseline in the trigger condition of an alarm policy allows a threshold to generate events based on learned behavior. While defining threshold conditions in a policy, you can select an additional condition, Violates Baseline. When this option is selected, an alarm is generated only if, after the initial threshold conditions are fulfilled, the baseline calculated for the metric is also violated. Baselines are available for alarm policies only.

For example, suppose the CPU utilization of a Linux computer crosses the threshold of 75% for a period of 15 minutes. If you have selected the Violates Baseline option, an alarm is generated only when this threshold violation also violates the baseline determined for that metric.

Baseline calculation is done in the following phases:

  • Bootstrap phase: Hourly baseline calculation begins after six hours of aggregate data is available for a metric. Based on this initial data, the hourly baseline for the metric starts appearing from the seventh hour. This baseline is set for the next 23 hours to provide the first 24 hours of baseline data for that metric.
  • Baseline initialization phase: This phase starts after the first 24 hours of baseline calculation have elapsed and continues for the next 14 days. The previous day's hourly aggregate is merged with the hourly baseline for the same period to calculate the current baseline. This is done at the start of each hour, and the baseline is projected for the same hour, same day, next week.
  • Baseline ongoing phase: Starts from day 15. The hourly aggregate from the same hour, same day of the previous week is merged with the hourly baseline for that same period to get the current baseline (for the current hour). This baseline is also projected for the same hour, same day, next week. A sketch of the merge step follows this list.
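
The merge step in the initialization and ongoing phases can be pictured as combining the stored hourly baseline with the matching hourly aggregate. The exact merge formula is not documented here; the sketch below assumes a simple weighted average purely to show the data flow, and the weight, sample values, and (day, hour) keys are illustrative assumptions.

    # Illustrative sketch of merging an hourly aggregate into an hourly baseline.
    # The real merge formula is not documented here; a weighted average is an
    # assumption used only to show the data flow between phases.

    ASSUMED_BASELINE_WEIGHT = 0.7  # hypothetical weight given to the existing baseline

    def merge_baseline(existing_baseline, hourly_aggregate):
        """Combine the stored baseline with the matching hour's aggregate value."""
        return (ASSUMED_BASELINE_WEIGHT * existing_baseline
                + (1 - ASSUMED_BASELINE_WEIGHT) * hourly_aggregate)

    # Ongoing phase (day 15 onward): the aggregate from the same hour, same day,
    # previous week is merged with the baseline stored for that slot, and the
    # result is also projected for the same hour, same day, next week.
    baselines = {("Mon", 9): 62.0}           # (day, hour) -> baseline value
    previous_week_aggregate = 70.0           # aggregate for Monday 09:00, previous week

    current = merge_baseline(baselines[("Mon", 9)], previous_week_aggregate)
    baselines[("Mon", 9)] = current          # projected for Monday 09:00, next week
    print(f"Current baseline for Monday 09:00: {current:.1f}")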

Since baselines are calculated hourly, to view baseline values, the selected interval must include the hourly data point. If you select a duration or define a period on the graph that has no data point, no baselines are displayed. 

Baseline calculation stops if the PATROL Agent stops or if there is a communication gap between the microservices that enable baseline calculation. In such a scenario, a baseline gap appears on the graph. After the agent restarts, baseline calculation for the metric starts again.


Instances

Conditions can be defined at the instance level. An instance represents the lowest-level unit of a monitor type (or metric). You can define alarm conditions for all instances or for multiple selected instances.

  • Defining conditions for all instances: Suppose you want to generate an alarm when the CPU utilization of any server in your environment crosses 75%. If you have a combination of 4-CPU and 16-CPU servers, all the CPU instances are monitored, and an alarm is generated if any instance crosses the maximum threshold of 75%.

  • Defining conditions for multiple instances: Suppose you want to generate an alarm when the CPU utilization crosses 75% for "CPU1" and "CPU11" in your environment. In this scenario, you can add a condition to generate a Critical alarm when the CPU utilization percentage crosses 75% for CPUs whose names end with "1". In this example, "CPU1" and "CPU11" represent multiple instances (see the sketch after this list).
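
Selecting multiple instances amounts to matching instance names against a pattern. The following short sketch is illustrative only; the instance names are examples and the product's actual pattern-matching syntax may differ.

    # Illustrative sketch of selecting multiple instances by name pattern.
    # The product's actual pattern syntax may differ.

    instances = ["CPU1", "CPU2", "CPU11", "CPU16"]

    # Apply the condition only to instances whose names end with "1".
    selected = [name for name in instances if name.endswith("1")]
    print(selected)  # ['CPU1', 'CPU11']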

How do I identify an instance name?

  1. On the Monitoring > Devices page, click the device to view its details.
  2. From the Monitors tab, expand the monitor name and select the monitor type.
  3. Copy the URL from the browser address bar to identify the instance name. In the following example, NT_MEMORY is the instance name.

    Note that the instance name is case sensitive. Use the instance name as displayed in the browser address bar for filtering and searching.

    #Syntax
    https://<Host FQDN>/monitor/#/monitors/<Instance name>+<PATROL Agent GUID>:<Instance name>:<Instance name>/<Tenant ID>/perfOverviewTab
     
     
    #Example
    https://abc.bmc.com/monitor/#/monitors/NT_MEMORY+<PATROL Agent GUID>:NT_MEMORY:NT_MEMORY/<Tenant ID>/perfOverviewTab
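
If you work with these URLs often, the instance name can also be extracted programmatically. The snippet below is a simple illustration based on the syntax shown above; the agent GUID and tenant ID placeholders are not real values.

    # Illustrative sketch: extract the instance name from a monitor URL that
    # follows the syntax shown above. Placeholder GUID and tenant ID are not real.

    url = ("https://abc.bmc.com/monitor/#/monitors/"
           "NT_MEMORY+AGENT-GUID-PLACEHOLDER:NT_MEMORY:NT_MEMORY/"
           "TENANT-ID-PLACEHOLDER/perfOverviewTab")

    # The segment after "/monitors/" starts with "<Instance name>+<PATROL Agent GUID>".
    monitors_segment = url.split("/monitors/", 1)[1]
    instance_name = monitors_segment.split("+", 1)[0]
    print(instance_name)  # NT_MEMORY (case sensitive)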


Post-trigger actions

While defining the condition, you can also define the alarm severity: Critical, Major, or Minor.

Additionally, you can specify one of the following post-trigger actions for the alarm event:

  • Close immediately
  • Close after the metric returns to a normal state and a duration equal to the violation time period has elapsed
  • Do not close


Policy precedence

Policy precedence is the priority of a policy and ranges from 0 to 9999. A lower number indicates a higher priority. Policy precedence controls the alarm generation conditions when there are conflicting or overlapping configurations defined between two or more alarm policies.

If two or more policies attempt to configure the same metric for the same alarm severity, the conflict is resolved with the help of policy precedence. For example, Monitoring solution: Linux; Monitor type: CPU; Metric: Utilization; Alarm severity: Critical.

The following scenarios explain how the policy precedence resolves the conflict.

Scenario 1

Policy A:

Precedence number: 099

Criteria to select multiple instances: Agent Tag = Test1

Alarm generation condition:

If Utilization > 85% for 15 minutes, generate a Critical alarm


Policy B:

Precedence number: 070

Criteria to select multiple instances: Agent Tag = Test2

Alarm generation condition:

If Utilization > 80% for 20 minutes, generate a Critical alarm


Resolution: No conflict

Policy A is applied to agents whose tag name is Test1.

Policy B is applied to agents whose tag name is Test2.


Scenario 2

Policy A:

Precedence number: 099

Criteria to select multiple instances: Agent Tag = Test1

Alarm generation condition:

If Utilization > 85% for 15 minutes, generate a Critical alarm


Policy B:

Precedence number: 070

Criteria to select multiple instances: Agent Tag = Test1

Alarm generation condition:

If Utilization > 80% for 20 minutes, generate a Critical alarm


Resolution:

Policy B is applied to the agents whose tag name is Test1 because the precedence value of Policy B (070) is lower than that of Policy A (099).

Remember the rule: the lower the number, the higher the precedence.
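
The precedence rule itself is easy to express: among policies that configure the same metric and severity for the same agents, the one with the lowest precedence number is applied. The following minimal sketch uses illustrative policy records, not the product's data model, to show the selection from Scenario 2.

    # Illustrative sketch: resolve overlapping alarm policies by precedence.
    # Lower precedence number means higher priority. Not the product's data model.

    policies = [
        {"name": "Policy A", "precedence": 99, "agent_tag": "Test1",
         "condition": "Utilization > 85% for 15 minutes"},
        {"name": "Policy B", "precedence": 70, "agent_tag": "Test1",
         "condition": "Utilization > 80% for 20 minutes"},
    ]

    # Both policies target agents tagged Test1 and the same metric and severity,
    # so the policy with the lowest precedence number is applied.
    winner = min(policies, key=lambda p: p["precedence"])
    print(f"{winner['name']} is applied: {winner['condition']}")  # Policy B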


Tip: Assigning precedence numbers

Assign precedence numbers starting with the highest number in the range and then continue in descending order. Doing so leaves the lower numbers in the range available for specific use cases.


Out-of-the-box policies

Out-of-the-box policies with predefined threshold conditions are available to help you monitor the health and status of particular monitoring solutions quickly and easily. These policies work only when data related to the monitoring solutions is already being collected. Therefore, ensure that monitoring policies for these solutions are already configured. For more information, see Defining monitor policies.

These policies are set up with the following common alarm generation conditions:

  • Instance type set to All instances
  • Violation duration set to 1 minute
  • Alarm severity set to Major
  • Alarm closed automatically after the metric reaches a normal state and a duration equal to the violation time period elapses

The attached spreadsheet lists the various metrics covered by the out-of-the-box policies. By default, these policies are disabled. To use an out-of-the-box alarm policy, edit the policy and enable it. The out-of-the-box alarm policies can be updated and deleted as required; however, you cannot rename them.

