Operators must analyze, prioritize, and triage a large number of events to resolve problems. They need a way to manage event storms quickly so that they can detect problems before the business is impacted.
A correlation policy can help reduce the event storm by combining multiple matching events into a single aggregated event.
This policy correlates and then aggregates incoming events based on the following:
- Event selection criteria: Acts as the first filter for selecting events.
- Correlation conditions: Conditions specified while creating the correlation policy determine which events must be matched.
The selected events are aggregated under a single aggregated event on the Monitoring > Events page.
Use cases for correlating and aggregating events
Use case 1: Suppose that a host goes down and you receive numerous events related to the various applications running on that host.
In this scenario, you can create a correlation policy to aggregate all the events with the same host name.
Use case 2: Suppose you received various events related to a failed login. A failed login can indicate unauthorized users trying to access the system.
In this scenario, you can create a correlation policy to aggregate all the events:
- Originating from the particular host or source address
- With the message containing the string "failed login"
You can generate a new aggregated event with critical severity and high priority to prevent a security breach.
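The logic of use case 2 can be sketched in a few lines of Python. This is only an illustration of the idea, not the product's implementation or policy syntax: the event dictionaries, the field names (`source`, `msg`), and the helper functions are all hypothetical.

```python
# Hypothetical event records; field names are illustrative, not product slots.
events = [
    {"source": "10.0.0.7", "msg": "Failed login for user admin"},
    {"source": "10.0.0.7", "msg": "failed login for user root"},
    {"source": "10.0.0.9", "msg": "Failed login for user guest"},
    {"source": "10.0.0.7", "msg": "Session opened for user admin"},
]


def is_failed_login(event, source_address):
    """Select events from one source whose message contains 'failed login'."""
    return (
        event["source"] == source_address
        and "failed login" in event["msg"].lower()
    )


def aggregate_failed_logins(events, source_address):
    """Combine the matching events into one aggregated event, or return None."""
    matches = [e for e in events if is_failed_login(e, source_address)]
    if not matches:
        return None
    return {
        "msg": f"Failed login attempts from {source_address}",
        "count": len(matches),
        # Severity and priority are set on the new aggregated event.
        "severity": "CRITICAL",
        "priority": "HIGH",
    }


aggregated = aggregate_failed_logins(events, "10.0.0.7")
print(aggregated)
```

Both conditions must hold for an event to be selected, so the failed login from the other source address and the unrelated session message are left out of the aggregated event.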
Benefits of correlation and aggregation
Event correlation and aggregation help reduce event noise. They also reduce the operator's mean time to detect or discover (MTTD) and the time required to investigate tickets.
The following images show how aggregation reduces event noise for a specific use case.
The following image shows a large number of events received from various event sources in a time window of 15 minutes on a certain day. An operator would need to analyze and prioritize each of these events.
Suppose you suspect a problem pattern related to the source. You can create a correlation policy to aggregate events based on the host name.
The following image shows a reduced number of events from various event sources in a time window of 15 minutes after the correlation policy is created. The aggregated events reduce event noise and help the operator focus only on those events that really matter.
Incoming events are correlated based on the following conditions:
- Matching criteria: The process of building the condition for matching criteria is similar to building the event selection criteria. You can add a condition to find new incoming events that match existing events of interest based on slot values. For example, you can find incoming events with a host name that is the same as the host name of one or more existing events. Thus, all incoming events with the matching host name can be correlated and aggregated into a single event.
While building the condition, you can specify slots prefixed with $NEW and $OLD. Slots prefixed with $NEW refer to slots of incoming events, and slots prefixed with $OLD refer to slots of existing events.
- Correlation time (in minutes): Correlation happens for the time duration specified in minutes. The time calculation begins after the correlation policy is created. After the time window passes, correlation stops.
- Minimum event count: When the minimum count specified for matching events is met, correlation begins.
These conditions can be defined while configuring an event policy of the type Correlation. For more information, see Configuring event policies.
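To see how the three conditions interact, consider the following minimal Python sketch. It is a simplified model under stated assumptions, not the product's engine: the class `CorrelationSketch`, its methods, and the event fields are hypothetical; the `match(new, old)` function stands in for a condition that compares a $NEW slot with an $OLD slot; and the sketch assumes that an event with no match starts a new candidate group and that a group counts as aggregated only after it reaches the minimum event count.

```python
from datetime import datetime, timedelta


class CorrelationSketch:
    """Toy model of matching criteria, correlation time, and minimum count."""

    def __init__(self, match, window_minutes, min_count, created_at):
        self.match = match  # match(new, old) -> bool
        # Assumption: the window starts when the policy is created.
        self.deadline = created_at + timedelta(minutes=window_minutes)
        self.min_count = min_count
        self.groups = []  # each group is a list of correlated events

    def offer(self, new, now):
        """Add an incoming event to a group; return None after the window."""
        if now > self.deadline:
            return None
        for group in self.groups:
            if any(self.match(new, old) for old in group):
                group.append(new)
                return group
        self.groups.append([new])
        return self.groups[-1]

    def aggregated(self):
        """Groups that have reached the minimum event count."""
        return [g for g in self.groups if len(g) >= self.min_count]


created = datetime(2024, 1, 1, 12, 0)
policy = CorrelationSketch(
    match=lambda new, old: new["host"] == old["host"],  # same host name
    window_minutes=15,
    min_count=2,
    created_at=created,
)

policy.offer({"host": "web01", "msg": "App A down"}, created + timedelta(minutes=1))
policy.offer({"host": "web01", "msg": "App B down"}, created + timedelta(minutes=5))
policy.offer({"host": "db02", "msg": "Disk full"}, created + timedelta(minutes=6))
policy.offer({"host": "web01", "msg": "App C down"}, created + timedelta(minutes=10))
late = policy.offer({"host": "web01", "msg": "App D down"}, created + timedelta(minutes=20))

result = policy.aggregated()
print(len(result), len(result[0]), late)
```

In this run, the three web01 events that arrive inside the 15-minute window form one aggregated group, the single db02 event stays below the minimum count, and the event that arrives at minute 20 is not correlated.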
How aggregated events are displayed
The new aggregated event (primary event) is generated with the event details specified while creating the correlation policy.
The following image displays an example of how an aggregated event appears on the Events page.
The individual matching events used for aggregation are no longer displayed on the Events page. Instead, these events are displayed as related events (secondary events) in the event details of the primary event.
You can access the related events by clicking the number displayed in the aggregated event message. The number represents the count of events that were combined to form the new aggregated event. This count can increase as the matching events increase, until the correlation time window lapses.
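The relationship between the primary event, its related events, and the displayed count can be modeled roughly as follows. The `AggregatedEvent` class, its fields, and the message format are hypothetical and exist only to illustrate that the count is derived from the related events and stops growing after the correlation time window lapses.

```python
from dataclasses import dataclass, field


@dataclass
class AggregatedEvent:
    """Hypothetical model of a primary event and its related events."""

    summary: str
    window_open: bool = True
    related: list = field(default_factory=list)

    def add_related(self, event):
        """Attach a matching event; rejected once the window has lapsed."""
        if not self.window_open:
            return False
        self.related.append(event)
        return True

    @property
    def message(self):
        # The count in the message is derived from the related events.
        return f"{self.summary} ({len(self.related)} related events)"


primary = AggregatedEvent("Multiple events on host web01")
primary.add_related({"msg": "App A down"})
primary.add_related({"msg": "App B down"})

primary.window_open = False  # the correlation time window lapses
accepted = primary.add_related({"msg": "App C down"})

print(primary.message)
```

Because the count is computed from the list of related events instead of being stored separately, the number in the message always agrees with the events listed under the primary event.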
The aggregated events are also displayed as situations on the BMC Helix AIOps console. Also, the event status is synced into BMC Helix AIOps. For more information, see .