Generating alerts from logs


As an administrator, use events to alert yourself about an issue before your users notice it. With these alerts (in the form of events), you can proactively identify a problem, resolve it before your users are adversely affected, and ensure continuous availability for end users.

An event is a notification that is generated when a condition is satisfied in the logs. Based on their severity, events can help you discover health issues across your data center from a single console. Use alert policies to generate these events by configuring the conditions that you want to be notified about. You can also use alert policies to detect anomalies or rare patterns that occur in the logs. Alert policies notify you about these conditions through events that are generated in BMC Helix Operations Management.

Log data also contains anomalies that represent potential system faults, which makes it critical for debugging application performance and errors. BMC Helix Log Analytics provides automated analysis with machine learning (ML)-based detection of abnormal log patterns (anomalies) that indicate a deviation from normal behavior. This analysis helps you find concerns proactively before they become problems and helps you troubleshoot errors when they arise.

Types of alert policies

You can create the following types of alert policies:

  • Static Thresholds: When you are aware of the conditions for which you want to be alerted and you also know where these conditions will occur, use static thresholds. For example, while analyzing logs, you come across a status 401 (authentication failure) for which you want to be notified. Let's say you notice that the status is reported multiple times in a short time period. You want to be notified if it occurs again. So, you create alert policies that generate events when the conditions configured in the policies occur in the logs. Here are a few more examples:

    Examples

    • An exception in the application server log
    • Error log level in the database log
    • Unexpected token in the application log
  • Anomaly Detection: When you want to be alerted if an anomalous log message is generated in a certain type of log, such as database logs, use anomaly detection. For example, you want to be alerted if an anomalous log message is generated in the Kubernetes microservice logs. Here are a few more examples:

    Examples

    • An anomaly in a specific service in a Kubernetes environment
    • An anomaly in a specific service of Amazon Web Services
    • An anomaly in Windows event for a particular host or VM

How anomalous logs are identified

When you enable and save an alert policy for anomaly detection, a machine learning (ML) model is generated. To generate the model, you need a large number of logs in your data store that match the policy selection criteria. If too few logs are present in your data store, the model is not created. However, if enough logs are present, a representative sample of logs is used to generate the model. It takes around 10 minutes to generate the model.

Generating a model requires a minimum of approximately 50,000 logs that match the policy selection criteria.

After the model is generated, it analyzes the incoming logs. When it detects an anomaly, an event is generated. Because the model is created and trained by using the logs present in your data store, it can identify a rare log message and report it as an anomaly. The model assigns a score to the anomalous log message that represents the anomaly strength. The score lies between 0 and 1; a higher score indicates a stronger anomaly. This score is reported in the Anomaly_Score field. Messages of type Info are not reported as anomalous.

To keep the model up to date with the latest logs generated in your environment, the model is regenerated every 6 hours. It is also regenerated under the following conditions:

  • You edit an alert policy that is already enabled.
  • You enable a disabled alert policy.

Here are a few examples of anomalous logs:

Example of anomalous Apache logs

Non-anomalous logs

[Fri Mar 24 04:47:44 2023] [notice] workerEnv.init() ok /etc/httpd/conf/workers2.properties
[Fri Mar 24 04:51:08 2023] [notice] jk2_init() Found child 6725 in scoreboard slot 10

Anomalous logs

[Fri Mar 24 01:04:31 2023] [error] [client 218.62.18.218] Directory index forbidden by rule: /var/www/html/
[Fri Mar 24 20:47:17 2023] [error] jk2_init() Can't find child 2087 in scoreboard

Example of anomalous Linux logs

Non-anomalous logs

Mar 24 06:06:21 combo kernel: usbcore: registered new driver hub
Mar 24 06:06:23 combo kernel: ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
Mar 24 06:06:28 combo kernel: PCI: Found IRQ 11 for device 0000:00:1f.2
Mar 24 06:06:29 combo kernel: uhci_hcd 0000:00:1f.2: new USB bus registered, assigned bus number 1

Anomalous logs

Mar 24 06:06:29 combo kernel: audit(1138278101.749:164014): avc:  denied  { ioctl } for  pid=594 exe=/usr/lib/vte/gnome-pty-helper path=/dev/pts/0 dev= ino=2 scontext=system_u:system_r:kernel_t tcontext=system_u:object_r:devpts_t tclass=chr_file
Mar 24 06:06:29 combo kernel: audit(1138278101.766:164029): avc:  denied  { setattr } for  pid=594 exe=/usr/lib/vte/gnome-pty-helper name=0 dev= ino=2 scontext=system_u:system_r:kernel_t tcontext=system_u:object_r:devpts_t tclass=chr_file

Example of anomalous Windows logs

Non-anomalous logs

2022-07-28 04:30:31, Info  CBS  SQM: Initializing online with Windows opt-in: False
2022-07-28 04:30:31, Info  CBS  SQM: Cleaning up report files older than 10 days.
2022-07-28 04:30:31, Info  CBS  SQM: Requesting upload of all unsent reports.
2022-07-29 00:00:46, Info  CBS  Startup processing thread terminated normally

Anomalous logs

2022-07-28 04:30:31, Warn CBS No startup processing required, TrustedInstaller service was not set as autostart, or else a reboot is still pending.
2022-07-29 00:00:47, Error CBS Failed to create backup log cab. [HRESULT = 0x80070001 - ERROR_INVALID_FUNCTION]

Alert policy details

An alert policy consists of the following details:

  • Name, description, and precedence.
  • Policy selection criteria or the conditions that generate an event. Configure the policy selection criteria based on the fields available in the logs. The operators that you can use are Equals, Not Equal to, and Contains. Combine these conditions with the AND and OR logical operators. Optionally, group these conditions on a particular field, such as when status Equals 401 for a particular host. In this case, you group the condition on the host field. Next, define the time period for these conditions to be true. As an example, generate an event if the status Equals 401 at least 5 times in the past 10 minutes.


    Important

    While using the Contains operator in the policy selection criteria, ensure that you use the complete word present in the log string of a field. For example, if the value of the Country field is "United States of America", set the criteria as Country Contains United or Country Contains America. Do not set the criteria for partial words, such as Country Contains Unite or Country Contains Amer. 

  • Host name, which can be either a static value that you type or a field in the logs that you select. If you select a log field, ensure that you select the same log field in the Group by field. 
  • Additional Details are the values from the logs that are added to the fields of the generated event. These values can be either static values that you type or a field in the logs that you select. The additional details that you can add to the event are described as slots on this page: Log Alert event class. Fields of type Enum accept only preconfigured values. If you enter a value that is not preconfigured, the default value is added to the slot in the event. 
    To add custom fields to an event, see Event management endpoints in the REST API.
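The operator rules above can be illustrated with a minimal Python sketch. This is not BMC Helix code: the `check` and `contains_word` functions and the record layout are hypothetical, and the sketch assumes that Contains matches only complete words, as the Important note implies.

```python
import re

def contains_word(value, term):
    """Return True if term appears in value as one or more complete words."""
    # Assumption: a partial word such as "Unite" must not match "United".
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.search(pattern, value) is not None

def check(record, field, operator, expected):
    """Evaluate a single condition against one log record."""
    actual = str(record.get(field, ""))
    if operator == "Equals":
        return actual == expected
    if operator == "Not Equal to":
        return actual != expected
    if operator == "Contains":
        return contains_word(actual, expected)
    raise ValueError(f"Unsupported operator: {operator}")

record = {"status": 401, "Country": "United States of America"}

# Conditions combined with AND / OR map onto Python's `and` / `or`.
full_word = (check(record, "status", "Equals", "401")
             and check(record, "Country", "Contains", "United"))
partial_word = check(record, "Country", "Contains", "Unite")
print(full_word, partial_word)  # True False
```

In this sketch, Country Contains America also matches, while Country Contains Amer does not.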


To create an alert policy

The following video (3:02) illustrates the steps to create an alert policy for static thresholds.


Watch the YouTube video about the steps to create an alert policy for static thresholds in BMC Helix Log Analytics.

  1. Click the Alerts menu and select Alert Policies.
  2. On the Alert Policies page, click Create.
  3. Enter a unique name such as Authentication Failure, and an optional description.
  4. In the Precedence field, set a precedence for the policy.
    The precedence number defines the priority for executing the policy. Note that a policy with a lower precedence number is executed first. 
  5. In the Policy Selection Criteria field, configure the condition for which the event will be generated.
    For example, enter status Equals 401 AND filename Equals BMC_Apache_SantaClara.log. When you click in the box, you are prompted to make a selection. Each time you make a selection, you are progressively prompted to make another selection.
    The selection criteria consist of an opening parenthesis, followed by the slot name, the operator, the slot value (which can be a string based on the type of slot selected), and the closing parenthesis. You can optionally select the logical operator AND or OR to add additional conditions. Specifying the opening and closing parentheses is optional.

    Important

    The values that you enter for a field in the selection criteria are case-sensitive. For example, if the host name is WebServer.example.com, add the selection criteria as ( host_name Equals WebServer.example.com ). If you enter ( host_name Equals webserver.example.com ), the event is not generated.

  6. To group occurrences of a condition, perform one of the following actions:
    • In the Group by field, enter the values by which you want to group occurrences of a condition.
      For example, to group all occurrences of status 401 on a particular host name, enter the host name. You can enter a maximum of three values, but one must be the host name.
    • Click in the Group by field and select an appropriate option.
  7. Select whether you are creating a static threshold or an anomaly detection alert.
  8. For a static threshold, perform the following actions:
    1. In the Alert Condition field, decide how many times the condition must occur in a time period to generate the event.
    2. Enter the status of the event.
    3. Enter and select the values in the Minutes and Minimum count is fields.
      For example, when status 401 is reported a minimum of 50 times within a 5-minute period, a critical event is generated.
  9. For anomaly detection, perform the following actions:
    1. From the Log Attribute list, select the field that contains the log message.
    2. Select the type of event that you want to create. 
      If it is not Message or Log, select Custom and in the Log Attribute Value field, enter the field that contains the log message. 
  10. To add a host name to the event, which helps you correlate events in BMC Helix AIOps, perform one of the following actions in the Alert Parameters section:
    • In the Hostname field, enter a host name.
    • Click in the Hostname field and select the appropriate option.
  11. In the Message field, change the default message, if required.
    To use a log field value in the message, put double curly brackets around the field name, such as {{ $.location }}.
  12. In Additional Details, configure additional event parameters like source identifier.
    These values are set for the generated event.
  13. Select Enable Policy.
  14. Save the policy.
    View all your policies on the Alert Policies page.
  15. To edit, enable, disable, or delete a policy, use the Actions menu.
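The Message field step describes placeholders such as {{ $.location }}. As a rough illustration of how such a placeholder could resolve to a log field value, here is a minimal Python sketch. It is not the product's implementation: the `render_message` function is hypothetical, and leaving an unknown placeholder unchanged is an assumption, because the behavior for missing fields is not described here.

```python
import re

# Matches placeholders such as {{ $.location }} and captures the field name.
PLACEHOLDER = re.compile(r"\{\{\s*\$\.(\w+)\s*\}\}")

def render_message(template, record):
    """Replace each {{ $.field }} placeholder with the value from the log record."""
    def substitute(match):
        field = match.group(1)
        # Assumption: keep the placeholder as-is if the record lacks the field.
        return str(record.get(field, match.group(0)))
    return PLACEHOLDER.sub(substitute, template)

record = {"status": 401, "location": "SantaClara"}
message = render_message("Status {{ $.status }} reported at {{ $.location }}", record)
print(message)  # Status 401 reported at SantaClara
```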

To understand the number of events generated

For an alert policy that detects anomalies, one event is generated. For one minute, the event is not changed, regardless of whether more anomalies are reported for the alert policy. After the minute is over, if another anomaly is identified for the same alert policy, the Repeated count value of the same event is updated. This value is updated only once per minute.
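One way to read this rule is sketched below in Python. The `repeated_count` function is hypothetical, timestamps are given in seconds, and the sketch assumes that the one-minute window restarts each time the event is created or updated; the actual timing logic in the product may differ.

```python
from typing import Sequence

def repeated_count(anomaly_times: Sequence[float], interval: float = 60.0) -> int:
    """Return the Repeated count for one policy's anomalies (timestamps in seconds)."""
    if not anomaly_times:
        raise ValueError("at least one anomaly is needed to generate the event")
    times = sorted(anomaly_times)
    last_change = times[0]  # the first anomaly generates the event
    repeats = 0
    for t in times[1:]:
        # Anomalies within a minute of the last change leave the event untouched.
        if t - last_change >= interval:
            repeats += 1
            last_change = t
    return repeats

# Anomalies at 0 s, 10 s, 30 s, 65 s, 70 s, and 130 s for the same policy.
print(repeated_count([0, 10, 30, 65, 70, 130]))  # 2
```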

Let's consider the following examples to understand how many events are generated for an alert policy for static thresholds.

Example 1

  • Configurations in an alert policy:
    • Policy selection criteria: status Equals 401
    • Group by: blank
    • Hostname: blank or static value
    • For last: 5 minutes; When minimum count is: 10
  • Incoming logs: The condition is satisfied 22 times in the last 5 minutes.
  • Number of events generated: 1
  • Details: The event is generated after the criteria are satisfied the first 10 times in the logs. When they are satisfied another 10 times, the Repeated count field of the same event is updated to 1.

Example 2

  • Configurations in an alert policy:
    • Policy selection criteria: status Equals 401
    • Group by: hostname
    • Hostname: $.hostname
    • For last: 5 minutes; When minimum count is: 10
  • Incoming logs: The condition is satisfied 11 times for host 1 and 20 times for host 2 in the last 5 minutes.
  • Number of events generated: 2
  • Details:
    • One event is generated for host 1 because the criteria for it are satisfied 11 times.
    • One event is generated for host 2 after the criteria are satisfied the first 10 times in the logs. When they are satisfied another 10 times, the Repeated count field of the same event is updated to 1.

Example 3

  • Configurations in an alert policy:
    • Policy selection criteria: status Equals 401
    • Group by: blank
    • Hostname: host 1 (static value)
    • For last: 5 minutes; When minimum count is: 10
  • Incoming logs: The condition is satisfied 11 times for host 1 in the last 5 minutes.
  • Number of events generated: 1
  • Details: The event is generated for host 1 because the criteria for it are satisfied 11 times.

Example 4

  • Configurations in an alert policy:
    • Policy selection criteria: status Equals 401
    • Group by: city
    • Hostname: host 1 (static value)
    • For last: 5 minutes; When minimum count is: 10
  • Incoming logs: The condition is satisfied 11 times for host 1 and city 1 and 20 times for host 1 and city 2 in the last 5 minutes.
  • Number of events generated: 2
  • Details:
    • One event is generated for host 1 and city 1 because the criteria for it are satisfied 11 times.
    • One event is generated for host 1 and city 2 after the criteria are satisfied 10 times in the logs. When they are satisfied another 10 times, the Repeated count field of the same event is updated to 1.

Example 5

  • Configurations in an alert policy:
    • Policy selection criteria: status Equals 401
    • Group by: hostname, city, and country
    • Hostname: $.hostname
    • For last: 5 minutes; When minimum count is: 10
  • Incoming logs: The condition is satisfied 11 times for host 1, city 1, and country 1 and 20 times for host 2, city 2, and country 2 in the last 5 minutes.
  • Number of events generated: 2
  • Details:
    • One event is generated for host 1 because the criteria for host 1, city 1, and country 1 are satisfied 11 times.
    • One event is generated for host 2 after the criteria are satisfied 10 times for host 2, city 2, and country 2. When they are satisfied another 10 times, the Repeated count field of the same event is updated to 1.
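The arithmetic behind the examples above can be sketched in Python. The formula is inferred from those examples and is not a documented one: each group that reaches the minimum count produces one event, and every further full batch of matches increments Repeated count. The `count_events` function and the record layout are hypothetical, and the records are assumed to fall inside the policy's time window already.

```python
from collections import Counter

def count_events(records, field, value, group_by, minimum_count):
    """Return {group key: Repeated count} for every group that generates an event."""
    matches = Counter(
        tuple(r.get(g) for g in group_by)
        for r in records
        if r.get(field) == value
    )
    events = {}
    for group, hits in matches.items():
        if hits >= minimum_count:
            # The first `minimum_count` hits create the event; each further
            # full batch of hits increments Repeated count on that same event.
            events[group] = hits // minimum_count - 1
    return events

# 11 matches for host 1 and 20 matches for host 2, grouped by hostname.
logs = (
    [{"status": 401, "hostname": "host1"}] * 11
    + [{"status": 401, "hostname": "host2"}] * 20
)
events = count_events(logs, "status", 401, ["hostname"], minimum_count=10)
print(events)  # {('host1',): 0, ('host2',): 1}
```

With an empty group-by list, all matches fall into a single group, so 22 matches with a minimum count of 10 yield one event with a Repeated count of 1.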

To view the generated events

  1. Click the Alerts menu.
  2. Select Events.
    The Events page in BMC Helix Operations Management is displayed. The class of these events is Log Event. Use it to filter events generated by using alert policies. For more information about events, see Monitoring and managing events.
    To view these events in BMC Helix Dashboards, navigate to Dashboards > Manage Dashboards > Log Analytics, and click the Self Monitoring dashboard.

Using the events to analyze logs

When you configure alert policies and the condition configured in a policy is satisfied in the logs, events are generated in BMC Helix Operations Management. The class of these events is Log Event. To continuously track such events, use the Self Monitoring dashboard available in the Log Analytics folder in BMC Helix Dashboards.

In the Search Parameters field of the event, under Others, there is a link to launch BMC Helix Log Analytics. When you click this link, the Explorer in BMC Helix Log Analytics opens and shows the associated logs. These logs are filtered based on the criteria specified in Policy Selection Criteria and the fields selected in the Group by field of the alert policy.

If the host name is present as a configuration item (CI) for a service in BMC Helix AIOps, you can monitor the generated events in BMC Helix AIOps. For a CI of a service or the host name, these events are correlated in BMC Helix AIOps.


Closing the events automatically

The generated events are not closed automatically. Use event policies in BMC Helix Operations Management to close the events automatically. For example, create an event policy with time-based configurations to close the events that have not been modified in the last two hours. For more information, see Closing events automatically.
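Event policies are configured in BMC Helix Operations Management, not in code. The following Python sketch only illustrates the time-based rule from the example, selecting events that have not been modified in the last two hours. The `events_to_close` function and the `id` and `last_modified` fields are hypothetical.

```python
from datetime import datetime, timedelta

def events_to_close(events, now, max_idle=timedelta(hours=2)):
    """Return the IDs of events that were not modified within max_idle."""
    return [e["id"] for e in events if now - e["last_modified"] > max_idle]

now = datetime(2023, 3, 24, 12, 0)
events = [
    {"id": "evt-1", "last_modified": datetime(2023, 3, 24, 9, 30)},   # idle for 2 h 30 min
    {"id": "evt-2", "last_modified": datetime(2023, 3, 24, 11, 15)},  # idle for 45 min
]
print(events_to_close(events, now))  # ['evt-1']
```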

 
