Configuring alarm policies
As a tenant administrator, use alarm policies to safeguard against abnormalities by configuring thresholds. Thresholds define an acceptable value above or below which an alarm is generated.
You can manually configure alarm policies or configure automatic anomaly event generation to detect anomalies. Alarms are a type of event that is generated when threshold values are violated. You can configure the threshold values and apply them to multiple monitor types on multiple PATROL Agents.
You can create, edit, view, and delete alarm policies. You can also view a list of alarm policy conflicts and an audit trail of all updates made to all alarm policies.
Scenario
Sarah, an administrator at Apex Global, wants to create an alarm policy that will send her an alarm event if the CPU utilization on Linux computers rises above 85%. Watch the following video (2:02) to see how Sarah creates the alarm policy.
To create an alarm policy
If you are creating an alarm policy for dynamic monitor types, make sure that monitor instances are available for a device in the Devices > Device Details > Monitors tab.
- Navigate to the Configuration > Alarm Policies page and click Create.
- In the Policy Information area, enter a name and description for the policy.
- In the Precedence field, add a unique precedence number to the policy.
You can add a custom value in this field, or use the arrows to increase or decrease the value. For more information, see Alarm-policies.
- Use the Alarm Generation Conditions area to create the conditions based on which the alarm will be generated.
Refer to the following image:
Perform the following steps to create the conditions:
- In the For field, select a monitor type.
You can select one of the following monitor types:
- BMC Monitor Types
- BMC Dynamic Monitor Types
You must first enable the dynamic monitor types for metric selection. For more information, see Dynamic-monitor-type-registration-management-endpoints-in-the-REST-API.
- Third Party Monitor Types
- In the On field, select one of the following options:
- All Instances
- Multi Instances: When you select multiple instances, you can define multiple conditions using parameters such as agent tag name, hostname, instance name, port, etc. You can also use regular expressions while defining multiple instances by using the Matches operator.
You cannot create multiple alarm policies with the same metrics and conditions. You can create policies with the same metrics and conditions only if the instance type is different. The instance type can be All or Multiple.
If you create multiple alarm policies with the same metrics, same conditions, and different instance types, the policy with the Multiple instance type takes precedence.
- In the next field, add the instance details.
You can also copy the criteria by clicking Copy. The copied criteria can be reused in subsequent policies by pressing Ctrl+V in the selection criteria field.
Important:
- The instance name is case sensitive. Ensure that you use the instance name as displayed in the Device Details page while defining the selection criteria.
- The instance name that you enter in the alarm policy is displayed in the object slot for an alarm event and not in the instancename slot on the Events page.
(Optional) Click Preview to view the devices that satisfy the selection criteria.
Important:
The Preview button is disabled if you specify the Device Host Name and Instance Name in the selection criteria.
- Use the If and Then fields to add the threshold value, violation duration, and details about when the generated alarm must be closed.
For more information, refer to the following table.
Threshold
Specify whether the event must be closed immediately after the metric reaches a normal state or only after a specified duration has also elapsed. You can also specify that the event must not be closed. Alarm events that are not closed remain open until they are closed manually, the policy is deleted, or the PATROL Agent associated with the alarm is deleted. To change any of the values, click them.
If you have more than 20 multi-instance thresholds for a metric, alarm generation might be delayed till the evaluation of the multi-instance thresholds is complete. To avoid this delay, we recommend that you configure at least one threshold for all instances in addition to the multi-instance thresholds.
Baseline
Specifies whether the alarm is generated only when both the static threshold is breached and the baseline calculated for the metric is violated.
For the If condition, the following options are available:
- metric value is: The alarm is triggered when the condition of the metric value is satisfied and the And condition is satisfied. The following options are available for the And condition:
- violates baseline: The alarm is triggered for the metric value when any baseline is violated.
- violates high baseline: The alarm is triggered for the metric value when the high baseline is violated.
- violates low baseline: The alarm is triggered for the metric value when the low baseline is violated.
- ignore baseline: The alarm is triggered for the metric value. The baseline is ignored.
- metric value violates baseline: The alarm is triggered when the metric value violates any baseline for the time period that you specify.
- metric value violates high baseline: The alarm is triggered when only the high baseline is violated for the time period that you specify.
- metric value violates low baseline: The alarm is triggered when only the low baseline is violated for the time period that you specify.
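The baseline options above can be sketched in Python. This is a minimal illustration and not the product's implementation: it assumes a baseline is a simple low/high band, that a violation means the value falls outside the band, and that the "metric value is" comparison is "greater than". The names Baseline, violates, and should_alarm are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Baseline:
    """Hypothetical baseline band for a metric (assumption: low/high bounds)."""
    low: float
    high: float


def violates(value: float, baseline: Baseline, which: str = "any") -> bool:
    """Return True if the value falls outside the selected side of the band."""
    above = value > baseline.high
    below = value < baseline.low
    if which == "high":
        return above
    if which == "low":
        return below
    return above or below


def should_alarm(value: float, threshold: float, baseline: Baseline,
                 and_condition: str = "ignore baseline") -> bool:
    """Evaluate a 'metric value is above threshold' condition plus its And condition."""
    if value <= threshold:
        return False  # static threshold not breached, so no alarm
    if and_condition == "ignore baseline":
        return True
    side = {
        "violates baseline": "any",
        "violates high baseline": "high",
        "violates low baseline": "low",
    }[and_condition]
    return violates(value, baseline, side)
```

For example, with a threshold of 75 and a band of 20 to 90, a value of 80 breaches the static threshold but stays inside the band, so only the ignore baseline option would raise an alarm in this sketch.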
Example 1
As per the condition defined in the following image (Monitoring solution: Linux; Monitor type: CPU; Metric: Utilization), when the CPU Utilization of all Linux computers whose instance name starts with 'CPU1' crosses the threshold of 75% for a period of 15 minutes, a Major alarm is generated. After the Utilization returns to a normal state of below 75% and 15 minutes elapse, the generated alarm is automatically closed.
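The timing in Example 1 can be illustrated with a short Python sketch. It assumes that a violation must hold continuously for the full duration before the alarm opens, and that the metric must stay normal for the full close duration before the alarm closes. The function alarm_events and the (minute, value) sample format are hypothetical, and the product's actual evaluation logic may differ.

```python
def alarm_events(samples, threshold=75.0, violation_minutes=15,
                 close_after_minutes=15):
    """Return (minute, "open" | "close") events for (minute, value) samples."""
    events = []
    is_open = False
    streak_start = None      # minute at which the current streak began
    streak_violating = None  # whether the current streak is above the threshold
    for minute, value in samples:
        violating = value > threshold
        if violating != streak_violating:
            # The state flipped, so start timing a new streak.
            streak_violating = violating
            streak_start = minute
        elapsed = minute - streak_start
        if not is_open and violating and elapsed >= violation_minutes:
            is_open = True
            events.append((minute, "open"))
        elif is_open and not violating and elapsed >= close_after_minutes:
            is_open = False
            events.append((minute, "close"))
    return events


# Utilization sampled every 5 minutes: 15 minutes above 75%, then back to normal.
samples = [(0, 80), (5, 82), (10, 81), (15, 83),
           (20, 60), (25, 60), (30, 60), (35, 60)]
events = alarm_events(samples)
```

In this sketch, a spike that drops below the threshold before 15 minutes have passed resets the timer and does not open an alarm.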
You can add multiple conditions per metric. To add additional conditions, click Add Condition. Click Duplicate Condition to create another condition with the same details that you can modify later. To add conditions for a different metric, click Add Instance Policy.
If you have multiple conditions with varying threshold values and severity levels, an alarm is generated when the first condition is breached. When the next condition is breached, a new alarm is not generated. Instead, the severity of the first alarm changes.
Suppose you added these conditions:
If CPU utilization crosses 75% for a period of 15 minutes, generate a Major alarm. If CPU utilization crosses 85% for a period of 15 minutes, generate a Critical alarm.
In this scenario, when the CPU utilization of a computer crosses 75% for a period of 15 minutes, a Major alarm is generated. When the CPU utilization of the same computer crosses 85%, the earlier alarm severity changes from Major to Critical. If the CPU utilization drops back below 85% but remains above 75%, the alarm severity changes from Critical to Major.
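The severity change described in this scenario can be sketched as follows. The sketch ignores the 15-minute violation duration and alarm closing to keep the focus on severity, and it assumes that the highest breached threshold determines the severity. CONDITIONS, current_severity, and alarm_actions are hypothetical names.

```python
# (threshold, severity) pairs taken from the two conditions in the scenario.
CONDITIONS = [(75.0, "Major"), (85.0, "Critical")]


def current_severity(value, conditions=CONDITIONS):
    """Return the severity of the highest threshold that the value crosses."""
    severity = None
    for threshold, level in sorted(conditions):
        if value > threshold:
            severity = level
    return severity


def alarm_actions(values):
    """Describe what happens to the single alarm as the metric changes."""
    actions = []
    open_severity = None
    for value in values:
        severity = current_severity(value)
        if severity is None or severity == open_severity:
            continue  # nothing changes (closing the alarm is out of scope here)
        if open_severity is None:
            actions.append(f"generate {severity} alarm")
        else:
            actions.append(f"change severity {open_severity} -> {severity}")
        open_severity = severity
    return actions


actions = alarm_actions([70, 80, 90, 80])
```

Only one alarm is ever generated in this sketch. Later breaches update its severity instead of creating a second alarm.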
Example 2
As per the condition defined in the following image (Monitoring solution: Linux; Monitor type: CPU; Metric: Utilization), when the CPU Utilization of all Linux computers whose agent tag name is production and agent host name begins with clm crosses the threshold of 75% for a period of 15 minutes, a Major alarm is generated. After the Utilization returns to a normal state of below 75% and 15 minutes elapse, the generated alarm is automatically closed.
Example 3
The following example uses a regular expression while defining multiple instances. As per the condition defined in the following image (Monitoring solution: Linux; Monitor type: CPU; Metric: Utilization), when the CPU Utilization of all Linux computers whose agent host name matches the regular expression AM-.* and instance name matches the regular expression ^0.* crosses the threshold of 75% for 15 minutes, a Major alarm is generated. After the Utilization returns to a normal state of below 75% and 15 minutes elapse, the generated alarm is automatically closed.
The following lists show how the regular expressions defined in the image help you filter the instances based on agent host names and instance names.
Agent host names
- hostAM-34TEST1
- devAM-PRODSETtest
- M-PSRTESTserver
As per the agent host name regular expression AM-.*, only 2 host names (hostAM-34TEST1 and devAM-PRODSETtest) are selected from the preceding list.
Instance names
- 04578
- 0prodtest
- qa0test
As per the instance name regular expression ^0.*, only 2 instance names (04578 and 0prodtest) are selected from the preceding list.
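The selections in Example 3 can be reproduced with Python's re module. This sketch assumes that the Matches operator finds the pattern anywhere in the name unless the pattern is anchored, which is consistent with hostAM-34TEST1 being selected by AM-.* in the preceding list. The helper name select is hypothetical.

```python
import re


def select(names, pattern):
    """Return the names in which the pattern is found (unanchored search)."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]


hosts = ["hostAM-34TEST1", "devAM-PRODSETtest", "M-PSRTESTserver"]
instances = ["04578", "0prodtest", "qa0test"]

selected_hosts = select(hosts, r"AM-.*")
selected_instances = select(instances, r"^0.*")
```

The ^ anchor is what excludes qa0test: the name contains a 0, but not at the start.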
Example 4
As per the condition defined in the following image (Monitoring solution: Linux; Monitor type: CPU; Metric: Utilization), when the CPU Utilization of all Linux computers whose instance name starts with 'CPU1' crosses the threshold of 75% for a period of 15 minutes, and it violates the baseline determined for that event, a Major alarm is generated. After the Utilization returns to a normal state of below 75% and 15 minutes elapse, the generated alarm is automatically closed.
- (Optional) Select Enable Policy.
You can enable or disable the policy any time from the Alarm Policies page.
- Save the policy.
To edit an alarm policy
- Navigate to the Configuration > Alarm Policies page and do one of the following:
- Select the policy and click Edit.
- From the Actions menu of a policy, select Edit.
- Change the configuration details provided while creating the policy and click Save.
When you edit an alarm policy, the system incrementally loads 25 threshold values at once on a page scroll to improve usability.
If you update the precedence of existing policies, the system automatically closes open alarms associated with the policy and generates new alarms for any subsequent metric threshold violations for that policy.
You cannot search thresholds that are not yet loaded on the page.
To view conflicting alarm policies
To view a list of the alarm policies that have the same severity defined for the same metric, go to Configuration > Alarm Policies and click Check policy conflicts. A PDF file with the list of conflicting alarm policies is downloaded.
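The conflict rule stated above (the same severity defined for the same metric in more than one policy) can be sketched as a grouping check. The policy dictionaries and the function find_conflicts are hypothetical and exist only to illustrate the rule. They do not reflect the product's data model or the contents of the PDF report.

```python
from collections import defaultdict


def find_conflicts(policies):
    """Map each (metric, severity) pair to the policies that share it."""
    by_key = defaultdict(set)
    for policy in policies:
        for metric, severity in policy["conditions"]:
            by_key[(metric, severity)].add(policy["name"])
    # Keep only the pairs that are defined in more than one policy.
    return {key: sorted(names) for key, names in by_key.items() if len(names) > 1}


policies = [
    {"name": "linux-cpu", "conditions": [("CPU Utilization", "Major")]},
    {"name": "prod-cpu", "conditions": [("CPU Utilization", "Major"),
                                        ("CPU Utilization", "Critical")]},
    {"name": "disk", "conditions": [("Disk Usage", "Major")]},
]
conflicts = find_conflicts(policies)
```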
To copy an alarm policy
- Navigate to the Configuration > Alarm Policies page.
- Click the action menu of the policy that you want to copy and select Copy.
The Create Alarm Policy page is displayed with the configurations of the copied policy.
- Modify the configurations according to your requirements to create a new policy quickly.
- (Optional) Select Enable Policy.
You can enable or disable the policy any time from the Alarm Policies page.
- Save the policy.
To view the list of alarm policies
On the Configuration > Alarm Policies page, view the list of alarm policies.
By default, the policies are sorted by Name. To sort on a different column, click the column heading.
To enable or disable an alarm policy
On the Configuration > Alarm Policies page, do one of the following:
- Select the policy and click Enable or Disable.
- From the Actions menu of a policy, select Enable or Disable.
- Edit the policy and select or clear the Enable Policy check box.
To delete an alarm policy
On the Configuration > Alarm Policies page, do one of the following:
- Select one or more policies and click Delete.
- From the Actions menu of a policy, select Delete, and click Yes.
In the policy post-trigger actions, if you have specified that the alarm must not be closed, such alarm events are automatically closed if they are not updated for more than the period specified in the retention policy. All closed events are automatically deleted from the system as per the retention policy. However, if you delete such a policy or the PATROL Agent associated with such an alarm is deleted, the alarm is automatically closed.
For more information about the retention policy, see BMC Helix Operations Management service.
To view the audit trail of alarm policies
As a tenant administrator, you can use the BMC Helix Audit dashboard in BMC Helix Dashboards to view the audit trail of all changes that were made to alarm policies.
Scenario
Tina is a tenant administrator and Sarah is a system administrator at Apex Global. Tina has left on a vacation and she won't be back at work for two more weeks. Sarah has taken up some of Tina's responsibilities during this time. Sarah is looking at some alarm policies in the system and she wants to know when Tina created them and when they were updated. Because Tina is on vacation, how can Sarah obtain this information?
Sarah can log in to BMC Helix Dashboards and use the BMC Helix Audit Dashboard to see a complete audit trail of all alarm policies.
For more information, see Auditing user activities in BMC Helix Dashboards.
The following image displays the audit trail of alarm policies in the BMC Helix Portal Audit dashboard. Note that the selected resource type is Alarm Policy.