Learning about incident correlation
An incident is any event that is not part of the standard operation of a service and that causes an interruption to or a reduction in the quality of that service. The mission of the incident management process is to resolve incident requests as quickly as possible in a prioritized fashion.
Why incident correlation?
When a major incident happens, the Service Desk is flooded with tickets. It is important for Service Desk managers to correlate the tickets and help reduce the noise. In this way, it is possible to find the probable cause and restore the service as soon as possible.
Challenges
Without incident correlation, Service Desk managers face the following challenges during a major incident:
- They rely on word of mouth or informal channels to detect emerging situations.
- They must manually manage the flood of incidents about the same underlying issue.
- They must identify major incidents as quickly as possible, yet there is no efficient way to manage the flood of incidents about the same underlying issue. Multiple agents might also start working on similar incidents, which duplicates effort.
How incident correlation can help in faster service restoration
The real-time incident correlation functionality helps in quicker restoration of the service by providing the following benefits:
- Enabling Service Desk managers to respond faster to emerging situations
- Reducing the effort to formally register incidents as duplicates
- Avoiding unnecessary work due to undetected duplicates
Incident correlation process
- Incidents are created or updated in the PWA screens of Smart IT.
- These incidents are compared with existing incidents. Similarity is determined by user-configurable grouping fields, text fields, and a sliding time window.
- Webhooks are registered in Action Request System. These webhooks are used to send real-time event notifications for incident correlation.
- As incidents flow in, the machine learning algorithm continuously updates the clusters of incidents. The algorithm uses the AI Foundation services in BMC Helix Portal for finding similar incidents and stores the results in AI Foundation Cluster Storage as shown in the following diagram:
- The topic modeling algorithm analyzes the most important and most frequently used words in the text of all incidents in each cluster.
- The topic modeling algorithm then uses these words to derive a name for each cluster, such as “VPN-password-reset”.
- The cluster is closed when any of the following conditions is met:
  - The cluster has no further updates within the configured time interval (default value is 7 days).
  - All incidents in the cluster are resolved.
  - The cluster has been open for 30 days. Thirty days after its creation, a cluster is closed irrespective of its status and last update time.
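
The cluster closure rules above can be illustrated with a short Python sketch. This is not the product's implementation. The `Cluster` class, its field names, and the `should_close` function are hypothetical, introduced only for illustration. The sketch also assumes that any one of the three conditions is enough to close a cluster, and that a cluster closes as soon as a limit is reached (`>=`), neither of which is stated explicitly in the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# Values taken from the text above. The inactivity interval is configurable
# in the product; 7 days is the documented default.
DEFAULT_INACTIVITY_LIMIT = timedelta(days=7)
MAX_CLUSTER_AGE = timedelta(days=30)


@dataclass
class Cluster:
    """Hypothetical stand-in for a cluster record (not the product's data model)."""
    created_at: datetime
    last_updated_at: datetime
    incident_resolved: List[bool]  # one flag per incident in the cluster


def should_close(cluster: Cluster, now: datetime,
                 inactivity_limit: timedelta = DEFAULT_INACTIVITY_LIMIT) -> bool:
    """Return True if any of the three documented closure conditions holds."""
    # Condition 1: no updates within the configured time interval.
    inactive = now - cluster.last_updated_at >= inactivity_limit
    # Condition 2: every incident in the cluster is resolved.
    # An empty cluster is not treated as "all resolved" (an assumption).
    all_resolved = bool(cluster.incident_resolved) and all(cluster.incident_resolved)
    # Condition 3: the cluster has been open for 30 days, regardless of status.
    too_old = now - cluster.created_at >= MAX_CLUSTER_AGE
    return inactive or all_resolved or too_old
```

For example, a cluster created 30 days ago is reported as closable even if it was updated yesterday and still contains unresolved incidents, which mirrors the "irrespective of its status and last update time" rule.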