Alarm policies

This topic explains the concept of alarm policies and how they can be useful. For instructions on creating, editing, and deleting alarm policies, see Configuring alarm policies.

Alarms are events that are generated when user-specified threshold values are violated. A threshold defines an acceptable value for a metric; an alarm is generated when the metric goes above or below that value. Thresholds can be defined as part of an alarm policy that can be applied to multiple monitor types on multiple PATROL Agents. Thus, alarm policies can help you monitor and manage the health of your system and safeguard against abnormalities. Only tenant administrators can view and create alarm policies.

An alarm policy comprises the policy name, description, and the alarm generation conditions. An alarm generation condition contains the following details:

  • Threshold values
  • Violation duration (in minutes)
  • Metric to monitor
  • Option to choose between all or multiple instances
  • Post-trigger actions

Example: As per the condition defined in the following image (Monitoring solution: Linux; Monitor type: CPU; Metric: Utilization), when the CPU utilization for Linux computers whose instance name starts with 'CPU1' crosses the threshold of 75% for a period of 15 minutes, a Major alarm will be generated. After the CPU utilization returns to a normal state of below 75% and 15 minutes lapse, the generated alarm is automatically closed.
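The timing in this example can be sketched in a few lines of Python. This is only an illustration, not the product's evaluation logic: the function name alarm_timeline, the one-sample-per-minute input, and the requirement that the violation be continuous are all assumptions made for the sketch.

```python
from typing import List, Sequence, Tuple


def alarm_timeline(
    samples: Sequence[float], threshold: float, duration_minutes: int
) -> List[Tuple[int, str]]:
    """Return (minute, event) pairs for one metric on one instance.

    Assumptions (not taken from the product): one sample per minute,
    an "above threshold" comparison, and a continuous violation.
    """
    events = []
    alarm_open = False
    above = below = 0  # consecutive minutes above / not above the threshold
    for minute, value in enumerate(samples):
        if value > threshold:
            above, below = above + 1, 0
        else:
            above, below = 0, below + 1
        if not alarm_open and above >= duration_minutes:
            alarm_open = True
            events.append((minute, "generated"))
        elif alarm_open and below >= duration_minutes:
            alarm_open = False
            events.append((minute, "closed"))
    return events


# 15 minutes at 80% utilization, then 15 minutes at 60%.
samples = [80.0] * 15 + [60.0] * 15
events = alarm_timeline(samples, threshold=75.0, duration_minutes=15)
print(events)  # [(14, 'generated'), (29, 'closed')]
```

Minutes are counted from 0, so the alarm is generated on the fifteenth sample above 75% and closed on the fifteenth sample back in the normal range.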

Threshold conditions

You can define the following types of threshold conditions. While adding conditions, you can specify a severity of Critical, Major, or Minor. 

  • Based on an absolute value: If a metric's performance is known to degrade in the event of the value going above or below a common acceptable value, you can define a threshold condition containing this absolute value. For example, if the total CPU utilization of a Linux system is above 85% for 10 minutes, it may result in performance issues. In this case, you can define a threshold condition of 85% for a time period of 10 minutes. When this threshold condition is exceeded, an alarm is generated.

  • Based on a range of values: In scenarios where you want to monitor metric values in terms of a range, you can specify multiple conditions that define the high and low values. You can add up to three conditions for a single metric. For example, if the total CPU utilization of a Linux system is above 75% for 10 minutes, you might want to be notified with a Major alarm. But if the CPU utilization exceeds 85% for 10 minutes, you might want to be notified with a Critical alarm. In this scenario, you can define two threshold conditions for the same metric, each with a different threshold value and a different event severity. When the first condition is breached (for example, when the utilization crosses 75%), an alarm is generated with the specified severity (for example, Major). When the next condition is breached (for example, when the utilization crosses 85%), the severity of the same alarm is updated to Critical; a new alarm is not generated.
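As a rough illustration of the range example, the following Python sketch picks the severity that a single alarm would carry for a given metric value. The function name current_severity and the severity ranking table are assumptions made for the sketch, and the violation duration is left out to keep it short.

```python
from typing import Optional, Sequence, Tuple

# Ranking used only in this illustration (higher number = more severe).
SEVERITY_RANK = {"Critical": 3, "Major": 2, "Minor": 1}


def current_severity(
    value: float, conditions: Sequence[Tuple[float, str]]
) -> Optional[str]:
    """Return the highest severity whose threshold the value exceeds."""
    if len(conditions) > 3:
        # The documentation allows up to three conditions per metric.
        raise ValueError("a single metric can have at most three conditions")
    breached = [severity for threshold, severity in conditions if value > threshold]
    if not breached:
        return None
    return max(breached, key=SEVERITY_RANK.__getitem__)


# The two conditions from the range example: (threshold %, severity).
conditions = [(75.0, "Major"), (85.0, "Critical")]

print(current_severity(70.0, conditions))  # None
print(current_severity(80.0, conditions))  # Major
print(current_severity(90.0, conditions))  # Critical
```

Because the function returns one severity per value, it mirrors the behavior described above: a rising metric changes the severity of the existing alarm instead of producing a second alarm.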

Instances

Conditions can be defined at the instance level. An instance represents a low-level unit of a monitor type (or metric), such as an individual CPU. You can define alarm conditions for all instances or for multiple selected instances.

  • Defining conditions for all instances: Suppose you want to generate an alarm when the CPU utilization of any of the servers in your environment crosses 75%. If you have a combination of 4-CPU and 16-CPU servers, all the CPU instances are monitored and an alarm is generated if any instance crosses the maximum threshold of 75%.

  • Defining conditions for multiple instances: Suppose you want to generate an alarm when the CPU utilization crosses 75% for "CPU1" and "CPU11" in your environment. In this scenario, you can add a condition to generate a Critical alarm when the CPU utilization percentage crosses 75% for CPUs whose names end with "1". In this example, "CPU1" and "CPU11" represent multiple instances.
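The instance selection described above amounts to a name filter. The following Python sketch shows the idea with "starts with" and "ends with" matching; the function name select_instances and the CPU0 to CPU15 naming scheme are assumptions made for illustration, not names taken from the product.

```python
from typing import Iterable, List


def select_instances(
    names: Iterable[str], *, starts_with: str = "", ends_with: str = ""
) -> List[str]:
    """Return the instance names that match both filters (case-sensitive)."""
    return [
        name
        for name in names
        if name.startswith(starts_with) and name.endswith(ends_with)
    ]


# Hypothetical instance names for a 16-CPU server.
instances = [f"CPU{i}" for i in range(16)]

print(select_instances(instances, ends_with="1"))
# ['CPU1', 'CPU11']
print(select_instances(instances, starts_with="CPU1"))
# ['CPU1', 'CPU10', 'CPU11', 'CPU12', 'CPU13', 'CPU14', 'CPU15']
```

Note how a "starts with" filter of CPU1 selects more instances than an "ends with" filter of 1, so it is worth checking the actual instance names before choosing a filter.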

How do I identify an instance name?

  1. On the Monitoring > Devices page, click the device to view its details.
  2. From the Monitors tab, expand the monitor name and select the monitor type.
  3. Copy the URL from the browser address bar to identify the instance name. In the following example, NT_MEMORY is the instance name.

    Note that the instance name is case sensitive. Use the instance name as displayed in the browser address bar for filtering and searching.

    #Syntax
    https://<Host FQDN>/monitor/#/monitors/<Instance name>+<PATROL Agent GUID>:<Instance name>:<Instance name>/<Tenant ID>/perfOverviewTab
     
     
    #Example
    https://abc.bmc.com/monitor/#/monitors/NT_MEMORY+<PATROL Agent GUID>:NT_MEMORY:NT_MEMORY/<Tenant ID>/perfOverviewTab
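If you need to pull the instance name out of many such URLs, a small helper along the following lines may save some copying. It is a sketch that assumes the URL layout shown in the syntax above and an instance name that contains neither "+" nor "/". The function name instance_name_from_url and the GUID and tenant ID values in the sample URL are placeholders invented for the example.

```python
from urllib.parse import urlsplit


def instance_name_from_url(url: str) -> str:
    """Extract the instance name from a monitor details URL.

    Assumes the fragment has the form
    /monitors/<Instance name>+<PATROL Agent GUID>:.../<Tenant ID>/perfOverviewTab
    """
    fragment = urlsplit(url).fragment  # everything after "#"
    parts = fragment.strip("/").split("/")
    if len(parts) < 2 or parts[0] != "monitors":
        raise ValueError("unexpected URL layout")
    name, separator, _rest = parts[1].partition("+")
    if not separator or not name:
        raise ValueError("instance name not found")
    return name  # returned as-is because the name is case-sensitive


# Placeholder GUID and tenant ID; substitute the values from your own URL.
url = (
    "https://abc.bmc.com/monitor/#/monitors/"
    "NT_MEMORY+example-agent-guid:NT_MEMORY:NT_MEMORY/example-tenant/perfOverviewTab"
)
print(instance_name_from_url(url))  # NT_MEMORY
```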


Post-trigger actions

While defining the condition, you can also define the alarm severity: Critical, Major, or Minor.

Additionally, you can specify one of the following post-trigger actions for the alarm event:

  • Close immediately
  • Close after the metric reaches a normal state and a duration equal to the violation time period has lapsed
  • Do not close

Policy precedence

Policy precedence is the priority of a policy and ranges from 0 to 9999. A lower number indicates a higher priority. Policy precedence controls the alarm generation conditions when there are conflicting or overlapping configurations defined between two or more alarm policies.

If two or more policies attempt to configure the same metric for the same alarm severity, the conflict is resolved by policy precedence. For example, two policies might both define a Critical alarm for the same metric (Monitoring solution: Linux; Monitor type: CPU; Metric: Utilization; Alarm severity: Critical).

The following scenarios explain how the policy precedence resolves the conflict.

Scenario 1

Policy A:

Precedence number: 099

Criteria to select multiple instances: Agent Tag = Test1

Alarm generation condition:

If Utilization > 85% for 15 minutes, generate a Critical alarm


Policy B:

Precedence number: 070

Criteria to select multiple instances: Agent Tag = Test2

Alarm generation condition:

If Utilization > 80% for 20 minutes, generate a Critical alarm


Resolution: No conflict

Policy A is applied to agents whose tag name is Test1.

Policy B is applied to agents whose tag name is Test2.


Scenario 2

Policy A:

Precedence number: 099

Criteria to select multiple instances: Agent Tag = Test1

Alarm generation condition:

If Utilization > 85% for 15 minutes, generate a Critical alarm


Policy B:

Precedence number: 070

Criteria to select multiple instances: Agent Tag = Test1

Alarm generation condition:

If Utilization > 80% for 20 minutes, generate a Critical alarm


Resolution:

Policy B is applied to the agents whose tag name is Test1 because the precedence value of Policy B (070) is lower than that of Policy A (099).

Remember the rule: the lower the number, the higher the precedence.
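The two scenarios can be reproduced with a short Python sketch. The Policy class and the effective_policy function are hypothetical names used for illustration. The sketch matches policies only on the agent tag and keeps each condition as text. How the product handles two policies with the same precedence number is not covered here; the sketch simply returns the first one.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Policy:
    name: str
    precedence: int  # 0 to 9999; a lower number means a higher priority
    agent_tag: str   # criterion used to select agents
    condition: str   # alarm generation condition, kept as text here


def effective_policy(agent_tag: str, policies: Sequence[Policy]) -> Optional[Policy]:
    """Return the matching policy with the lowest precedence number."""
    matching = [p for p in policies if p.agent_tag == agent_tag]
    return min(matching, key=lambda p: p.precedence, default=None)


policy_a = Policy("Policy A", 99, "Test1", "Utilization > 85% for 15 minutes: Critical")

# Scenario 1: different agent tags, so there is no conflict.
scenario_1 = [
    policy_a,
    Policy("Policy B", 70, "Test2", "Utilization > 80% for 20 minutes: Critical"),
]
print(effective_policy("Test1", scenario_1).name)  # Policy A
print(effective_policy("Test2", scenario_1).name)  # Policy B

# Scenario 2: same agent tag, so the lower precedence number wins.
scenario_2 = [
    policy_a,
    Policy("Policy B", 70, "Test1", "Utilization > 80% for 20 minutes: Critical"),
]
print(effective_policy("Test1", scenario_2).name)  # Policy B
```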


Tip: Assigning precedence numbers

Assign precedence numbers starting with the highest number in the range and then continue in descending order. By doing so, you leave the lower numbers (which have higher priority) available for specific use cases.


Out-of-the-box policies

Out-of-the-box policies with predefined threshold conditions are available to help you monitor the health and status of particular monitoring solutions quickly and easily. These policies work only when data related to the monitoring solutions is already being collected. Therefore, you need to ensure that the monitor policies for these solutions are already configured. For more information, see Configuring monitor policies.

These policies are set up for all instances, for a violation duration of five minutes, and they are configured to raise an alarm with Critical severity. Also, they are configured to automatically close the alarm after the metric reaches a normal state and a duration equal to the violation time period lapses. Thus, these policies are set up with the following common alarm generation conditions:

  • Instance type set to All instances
  • Violation duration set to 5 minutes
  • Alarm severity set to Critical 
  • Alarm closed after the metric returns to a normal state and a duration equal to the violation duration lapses

The following table lists the various metrics covered by the out-of-the-box policies. By default, these policies are enabled. If you want to disable an out-of-the-box alarm policy, edit the policy and disable it. The out-of-the-box alarm policies can be updated and deleted as required. However, you cannot rename these policies.

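Policy name | Monitoring solution | Monitor type | Metric
----------- | ------------------- | ------------ | ------
PATROL for Linux | Linux | Disks | Collection Status
PATROL for Linux | Linux | File System | FileSystem Mount Status
PATROL for Linux | Linux | File Systems | Collection Status
PATROL for Linux | Linux | Process | Process Count Check
PATROL for Linux | Linux | Process | Process Ownership Check
PATROL for Linux | Linux | Process | Parent PID is 1
PATROL for Microsoft Windows Servers | Microsoft Windows Servers | CPU Performance | Processors online/offline status
PATROL for Microsoft Windows Servers | Microsoft Windows Servers | Health At A Glance | Windows Management Instrumentation Availability
PATROL for Microsoft Windows Servers | Microsoft Windows Servers | Hyper-V partition | Partition state
PATROL for Microsoft Windows Servers | Microsoft Windows Servers | Job Process | Process Memory limit exceed
PATROL for Microsoft Windows Servers | Microsoft Windows Servers | Job Process | Process User time limit exceed
PATROL for Microsoft Windows Servers | Microsoft Windows Servers | Remote Monitoring | Configuration status
PATROL for Microsoft Windows Servers | Microsoft Windows Servers | Windows Process | Process Count Check
PATROL for Microsoft Windows Servers | Microsoft Windows Servers | Windows Process | Indicates if the process id has changed since the last collection cycle
PATROL for Light Weight Protocols | Light Weight Protocols | Ping | Monitor Status
PATROL for Light Weight Protocols | Light Weight Protocols | Ping Device | Availability
PATROL for Light Weight Protocols | Light Weight Protocols | Ping Host | Availability
PATROL for Light Weight Protocols | Light Weight Protocols | Host | Status
PATROL for Light Weight Protocols | Light Weight Protocols | Host Process | Status
PATROL for Light Weight Protocols | Light Weight Protocols | Interface | Admin status
PATROL for Light Weight Protocols | Light Weight Protocols | Interface | Operational status
PATROL for Light Weight Protocols | Light Weight Protocols | OpenVMS Interface | Admin status
PATROL for Light Weight Protocols | Light Weight Protocols | OpenVMS Interface | Operational status
PATROL for Light Weight Protocols | Light Weight Protocols | OpenVMS process | Status
PATROL for Light Weight Protocols | Light Weight Protocols | OpenVMS server | Status
PATROL for Light Weight Protocols | Light Weight Protocols | SNMP | Configuration status
PATROL for Light Weight Protocols | Light Weight Protocols | SNMP device | Status
PATROL for Light Weight Protocols | Light Weight Protocols | SNMP device (DEMO) | Status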

Where to go from here

To create, edit, or delete an alarm policy, see Configuring alarm policies.
