Investigating ML-based situations that impact multiple services

A multi-service situation is created by correlating events across multiple services that are topologically connected and have a cross-service impact. It includes events that occur from the defined service context, as well as from external nodes that are topologically related and contribute to the service impact. Multi-service situations help SREs and operators get a complete view of issues across multiple services, improve investigation accuracy, and reduce the time required to identify the root cause of complex issues.

As an operator or a site reliability engineer (SRE), use multi-service situations to:

Identify issues that spread across multiple services
Understand the service impact scope
Drill down into related events and nodes and pinpoint the exact CI causing the issue
Detect cross-site network issues where services are impacted due to failures in BGP-connected infrastructure
Accelerate root cause isolation by leveraging cross-service correlation and topology mapping
Reduce mean time to resolution (MTTR) by avoiding separate troubleshooting of each service

Example 1

A retail company has several services, including order management service, inventory service, and payment service. All three services are dependent on a shared network router to connect to backend systems and external services.

Suddenly, the shared router experiences intermittent packet loss and high latency, causing network disruptions. As a result:

The order management service fails to reach the pricing engine, causing incorrect totals and missing discounts during checkout.
The inventory service is unable to update warehouse stock levels, leading to discrepancies in product availability across channels.
The payment service does not receive payment confirmations, leaving transactions incomplete and unverified.

Each of these services begins to generate events related to connectivity issues and degraded functionality.

Because the affected services are topologically connected through the shared router, BMC Helix AIOps automatically correlates all related events into a single multi-service situation.

The shared router is identified as the top-impacted CI, enabling the operator or SRE to quickly trace the problem across services to a single root cause, avoiding isolated investigations and reducing Mean Time to Resolution (MTTR).

Example 2

A BGP connection between data centers in New York and London becomes unstable due to a misconfiguration. Services in both locations, such as customer profile lookup and order processing, start failing because of synchronization issues.

BMC Helix AIOps detects the BGP-connected routers as topologically linked and forms a single multi-service situation. The BGP issue is identified as the root cause, helping the SRE quickly resolve the cross-site network problem without investigating each service separately.

The BGP-based situation correlation feature is available under controlled availability. To use this capability, contact BMC Support.

Naming convention for multi-service situations

The name of a multi-service situation is derived from the root cause of the situation.

The situation name is displayed in the following format: <Multi-Service> - <root cause of situation>.

For example, Multi-Service - Memory utilization is > 80% for 5 mins.
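
The naming format can be handled programmatically, for example when filtering an exported list of situation names. The following is a minimal sketch that assumes the literal prefix `Multi-Service` and the ` - ` separator shown above. The function names are hypothetical and are not part of any BMC Helix API.

```python
PREFIX = "Multi-Service"
SEPARATOR = " - "


def format_situation_name(root_cause: str) -> str:
    """Build a name in the '<Multi-Service> - <root cause of situation>' format."""
    return f"{PREFIX}{SEPARATOR}{root_cause.strip()}"


def split_situation_name(name: str) -> str:
    """Return the root-cause part of a multi-service situation name."""
    head = PREFIX + SEPARATOR
    if not name.startswith(head):
        raise ValueError(f"not a multi-service situation name: {name!r}")
    return name[len(head):]


name = format_situation_name("Memory utilization is > 80% for 5 mins")
print(name)
print(split_situation_name(name))
```

Splitting on the fixed prefix rather than on the first hyphen keeps root-cause text that itself contains hyphens intact.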

To investigate a situation impacting multiple services

On the BMC Helix AIOps console, click Situations.
All situations that occurred in the last 24 hours are displayed in a hierarchical view. Multi-service situations are indicated by the icon.

To learn more about multi-service situations, see Situations-overview.
Click the multi-service situation and view the following details on the situation details page:
- Situation name, severity, priority, type of situation, last modified date, and status.
- Incident ID: Click to open the incident in BMC Helix IT Service Management.
  If an incident is not created, a Create Incident link is displayed. Click the link to create an incident in BMC Helix IT Service Management (requires a subscription to BMC Helix IT Service Management).
  When Intelligent Automation Proactive Service Resolution is configured with on-premises ITSM, incidents are created in on-premises ITSM, and the incident ID is displayed in the situation details in BMC Helix AIOps. However, the incident ID cross-launch link from BMC Helix AIOps does not open the corresponding on-premises ITSM incident.
- Names of the impacted services: Click +1 to view all the impacted services. Click the service name to open the service details in a new tab.
- Changes: Change requests associated with the situation.
- Show Notes: Opens the Logs and Notes panel to display the notes added to a situation.
- Situation explanation. For more information, see To view the situation explanation.
- (Optional) Click the action menu to perform actions on a situation.
  For more information, see Performing situation actions.
- If BMC HelixGPT is enabled, the following information is displayed:
  - A human-readable AI-generated summary of the situation.
  - Best action recommendations, a list of suggested steps that can be used to remediate the situation. Additionally, a BMC HelixGPT-driven wizard offers sample code to accomplish individual steps in different languages such as Ansible, Python, and Bash.
  - Log insights, derived from logs generated in BMC Helix Log Analytics, that help you identify the root cause of the problem accurately.
  - An integrated virtual agent, Ask HelixGPT, which uses the BMC HelixGPT generative AI capabilities and lets you ask questions to investigate and remediate the situation. To learn more about BMC HelixGPT capabilities, see Situations-overview.
    Information
    To enable BMC HelixGPT, contact BMC Support.
Continue with To view best action recommendations.

To view best action recommendations

BMC Helix AIOps with BMC HelixGPT enables you to connect to the following ITSM data sources to generate best action recommendations:

BMC Helix IT Service Management
ServiceNow (Controlled availability customers only)
Jira (Controlled availability customers only)

Administrators must configure third-party data sources in BMC HelixGPT to generate recommendations. To configure third-party data sources, see Adding data sources in BMC HelixGPT in BMC HelixGPT documentation.

On the situation details page, review an AI-generated summary (short problem statement, brief summary, and detailed problem context).
Click Show remediation steps.
The recommended steps are displayed for a situation.
For example, for a High CPU Utilization issue, the following steps are suggested:
(If available) Click Code wizard.
The code that can be used to run the recommended step is displayed. For some manual steps, the code wizard might not be displayed.
1. Select your preferred language (Ansible, Python, Bash), and the code is displayed based on the selected language.
2. Click Copy to clipboard and use the code in your existing script to run the recommended remediation step.
3. Close the code wizard.
Continue with To view log insights.

To view log insights

BMC Helix AIOps with BMC HelixGPT enables you to connect to the following log data sources to generate log insights:

BMC Helix Log Analytics (no configuration required)
Splunk Enterprise
ElasticSearch

Administrators must configure third-party data sources in BMC HelixGPT Manager to generate insights. To configure third-party data sources, see Adding data sources in BMC HelixGPT in BMC HelixGPT documentation.

Click Ask HelixGPT and then click Log Insights.
The first time that you view log insights for a situation, a progress bar shows the progress of the log summary generation. If you view log insights for the same situation again, the summary loads without delay.
Depending on the log source configured in BMC HelixGPT, actionable insights from the logs related to the configuration item are displayed, which helps in identifying the root cause of the situation.
Use the cross-launch link to view the log details in BMC Helix Log Analytics.

Important

If a situation has multiple root causes, BMC HelixGPT Log Insights retrieves logs only for the CI with the highest impact score. As a result, log data for other contributing CIs might not appear in Log Insights.

Can I configure multiple data sources to generate log insights?

Yes. If you are using third-party log applications as data sources, you can select more than one data source, and BMC Helix AIOps displays insights from all relevant logs for an impacted configuration item.

Can I use the cross-launch link to view the log data even if I am not using BMC Helix Log Analytics?

Yes, the cross-launch link opens the logs page in the log application configured in BMC HelixGPT Manager.

What other log data sources are supported by BMC Helix AIOps?

For a list of supported third-party data sources, see Data-sources-for-BMC-Helix-AIOps.

To use the Ask HelixGPT virtual agent to get more information about the situation

The Ask HelixGPT virtual agent is available if BMC HelixGPT is enabled. To enable BMC HelixGPT, contact BMC Support.

Use the Ask HelixGPT virtual agent to ask questions within the context of an open or past situation. Using the BMC HelixGPT capabilities, operators can get information about diverse topics regarding infrastructure, service health, and near real-time predictions.

Click Ask HelixGPT.
The interactive virtual agent dialog box displays the following predefined questions:
- What is the impact of the issue?
- Which team can solve this issue?
- Has this situation happened in the past?
- Are there any change windows active during this situation?
Click any question to get additional information about the situation.
BMC HelixGPT generates the answer by evaluating information from incidents created for similar situations in BMC Helix IT Service Management, analyzing time stamps and patterns of similar situations that have occurred in the past, analyzing the service health score of the impacted service of the situation, and reviewing the change requests associated with the situation.
For example, if you click What is the impact of the issue?, the following answer is displayed.
(Optional) Click any other question to obtain more details about the situation.
Continue with To view the situation explanation.

To view the situation explanation

In the Situation Explanation section, use the Root Cause View to analyze the root cause events associated with the situation.
- Root Cause View: Shows the impact flow of events in a situation in a graphical format. Based on the temporal and topological relationships between various causal events in the situation, the ML algorithm determines the root cause event and consequent events. Each event in the graph is aligned against the corresponding CI kind. The direction in the graph indicates the impact flow from the root cause event. You can see the impact score percentage displayed with the event. The total impact score from all the events adds up to 100 percent.
  - Show root cause candidates: Select this option to display all configuration items (CIs) that are identified as contributing to the impact of the situation. When enabled, multiple root causes are highlighted that contribute to the issue. If disabled, only the most probable root cause is shown. Use this option when dealing with complex issues where multiple components might be failing or affecting each other, leading to service issues. This option is visible only when more than one potential root cause is detected.
  - Hover over an event to view the impacted node details and the corresponding CI or CI kind highlighted in the CI topology and analysis section.
  - Click an event to view additional details on the Situation Details pane.
- Events: Displays all causal events and details such as the event messages, impacted host, occurrence time, impact score, severity, priority, status, and incident ID. Perform actions on a situation by clicking the action menu.
  If Proactive Service Resolution is enabled, the same incident ID is displayed against the situation and the events.
  Information
  Automated remediation action
  The Automations column displays the matching automation actions for the event. To run automation, see Running-an-existing-automation.
  (Optional) You can send a request to create automation or create automation for events yourself if you have the necessary permissions. For more information, see Requesting-for-a-new-automation and Creating-automation-policies.
- Changes: Displays the change requests that are most likely contributing to the situation, based on correlation with impacted service nodes. Change requests from BMC Helix ITSM are displayed in situations. Change requests from ServiceNow are not supported on the Situations page.
  - For each change, the change ID, summary, impacted host or CI, occurrence time (start date/time), status (e.g., Scheduled, In Progress, Completed), priority, and impact level are shown.
  - Click a change entry to view more details in BMC Helix ITSM, where you can review the entire change history, implementation notes, approvals, and associated incidents.
    While change events do not contribute to service health score calculations, displaying them in context helps operators investigate more effectively.
Click an event message to view the following details in the Event Details pane:
- Event name, event score, severity, priority, status, and the More Details link to view the additional event details in BMC Helix Operations Management
- Event assignee details
- Date when the event first occurred or was last modified
- Event summary showing the Class, Incident ID, Object Class, Object, and Host. Clicking the Incident ID link opens the incident in BMC Helix IT Service Management.
  For more information about event classes and objects, see EVENT base event class.
- Logs and notes history: All logs and notes for an event are displayed. Type a note in the text box and click Add Note to add any additional notes related to the event. Any note added for the event is reflected in the event in BMC Helix Operations Management.
- Performance view: If the slot value for the event class is Alarm, the time-series data collected from the key attributes of the causal events of ML-based situations is displayed.
Click the action menu to perform event actions.
Continue with To view the CI topology and analysis.

To view the CI topology and analysis

In the CI topology and analysis map, view the topology map of the situation, the impacted CIs, and the probability of the impact on the connected CIs.
Use the following options to view the map based on your requirements:
1. Views: Switch between the Organic and Hierarchic view to view the impact flow.
  In the organic view, nodes are placed close to their adjacent nodes, which saves space. In the hierarchic view, nodes are distributed into layers, which makes dependencies and relationships among the nodes easier to identify.
  The topology view shows how the impacted services are connected through a shared router, helping you visualize the cross-service relationships and pinpoint the root cause.
2. Grouping: Click Enable Grouping by CI Kind to view the topology map grouped by the CI kind.
3. Search: If there are many CIs in the hierarchy, use the search box to locate a particular CI.
4. Legend: Click to view the legends used for the topological map.
5. Use the other tools to zoom in, zoom out, or view the map on a full screen.
6. The right-hand pane provides a summary of the configuration items (CIs) related to the situation. It includes the following fields:
  - Name: The unique name or identifier of the CI. This helps you recognize the specific component involved in the situation.
  - Kind: Indicates the type of CI, such as business service, host, software pod, or other infrastructure or application components. This classification helps you understand the CI’s function within the service topology.
  - Location: Expand the CI name to see its physical or logical location, such as data center, region, or availability zone. This information helps identify location-based issues or regional dependencies.
Continue with To view the probable impact in the CI topology and analysis.

To view the probable impact in the CI topology and analysis

When a multi-service situation is formed, the topology view displays indirectly connected services that are not part of the situation, but are topologically linked to nodes in the situation, and have active events during the same timeframe.

Relationships with such services are marked in the topology by using the icon to indicate they may be impacted as a result of the ongoing issue, even though they are not directly contributing to or included in the situation.

For example, in the preceding topology map, a multi-service situation is detected and formed with events from the virtual machine (selufw99_bbdab) and host (selvm02_bbdab). During the same time window, service Apex-Cloud_Infra, which is indirectly connected to the router (isp_router) through shared infrastructure, generates a critical event.

Although Apex-Cloud_Infra is not included in the situation, the topology view highlights it as a probable impact, helping the operator assess whether the issue in the virtual machine or host is contributing to problems in dependent services like Apex-Cloud_Infra.
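
The selection rule described above (not part of the situation, topologically linked to a situation node, and with an active event in the same timeframe) can be sketched as follows. This is only an illustration of the rule, not the product's implementation. The adjacency map, the `apex_infra_node` CI, the `Service-X` name, the hop limit, and all function names are assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceEvent:
    service: str
    node: str
    timestamp: float  # seconds; any consistent time unit works


def reachable(topology, start_nodes, max_hops):
    """Breadth-first search: nodes within max_hops of any start node."""
    seen = set(start_nodes)
    queue = deque((node, 0) for node in start_nodes)
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue
        for neighbor in topology.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return seen


def probable_impact(topology, situation_nodes, situation_services,
                    service_nodes, events, window, max_hops=2):
    """Services outside the situation that are linked and active in the window."""
    start, end = window
    near = reachable(topology, situation_nodes, max_hops)
    found = set()
    for event in events:
        if event.service in situation_services:
            continue  # already part of the situation
        if not (start <= event.timestamp <= end):
            continue  # outside the situation timeframe
        if service_nodes.get(event.service, set()) & near:
            found.add(event.service)
    return sorted(found)


# Hypothetical undirected topology (each edge is listed in both directions).
topology = {
    "selufw99_bbdab": {"selvm02_bbdab"},
    "selvm02_bbdab": {"selufw99_bbdab", "isp_router"},
    "isp_router": {"selvm02_bbdab", "apex_infra_node"},
    "apex_infra_node": {"isp_router"},
}
result = probable_impact(
    topology,
    situation_nodes={"selufw99_bbdab", "selvm02_bbdab"},
    situation_services={"Service-X"},
    service_nodes={"Apex-Cloud_Infra": {"apex_infra_node"}},
    events=[ServiceEvent("Apex-Cloud_Infra", "apex_infra_node", 150.0)],
    window=(100.0, 200.0),
)
print(result)
```

The hop limit is a design choice of this sketch: it keeps "topologically linked" from growing to the whole graph in a densely connected environment.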

To prioritize the root cause on the same node

When multiple events occur on the same node within a situation, the first detected event might not always represent the true causal event. This can happen when the polling intervals vary across monitoring solutions, causing later events to better represent the actual cause.

To explicitly indicate which event should be considered the priority causal event, apply prioritization by using an enrichment or refinement policy in BMC Helix Operations Management. For information about enrichment policies, see Advanced, time-based, and dynamic enrichment policies.

Prioritization can be applied only to events on the same node within a service. This prioritization applies only to future situations, not to situations that have already been created.

To indicate priority causal events:

Create an event enrichment or refinement policy in BMC Helix Operations Management.
In the policy, set the tags slot of the event to include the priority-causal-root tag.

When an event is marked with this tag and occurs on the causal node, BMC Helix AIOps prioritizes it over other events on the same node.

Example

Consider a node where both application and network monitoring are enabled through different monitoring solutions.

The application monitoring detects a slowdown first and raises an event.
A few seconds later, the network monitoring raises an event indicating that a critical network port on the node is down, which actually represents the cause of the slowdown.

Without prioritization, the application event may be incorrectly treated as the root cause. By tagging the network event with the priority-causal-root tag through a refinement policy, the port issue detected by network monitoring is correctly prioritized as the causal event.
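
The effect of the tag in this example can be illustrated with a small sketch. It assumes that, without the tag, the earliest event on the node is treated as causal, and that a tagged event takes precedence when present. The `NodeEvent` type and `pick_causal_event` function are hypothetical and do not reflect the actual ML logic in BMC Helix AIOps.

```python
from dataclasses import dataclass, field

PRIORITY_TAG = "priority-causal-root"


@dataclass
class NodeEvent:
    message: str
    node: str
    detected_at: float  # seconds since an arbitrary reference point
    tags: set = field(default_factory=set)


def pick_causal_event(events_on_node):
    """Prefer tagged events; fall back to the earliest event on the node."""
    if not events_on_node:
        raise ValueError("no events on node")
    tagged = [e for e in events_on_node if PRIORITY_TAG in e.tags]
    candidates = tagged or events_on_node
    return min(candidates, key=lambda e: e.detected_at)


app_event = NodeEvent("Application slowdown", "node-1", detected_at=0.0)
net_event = NodeEvent("Critical network port down", "node-1", detected_at=5.0)

# Without the tag, the earlier application event is selected.
print(pick_causal_event([app_event, net_event]).message)

# With the tag applied by a policy, the network event is selected.
net_event.tags.add(PRIORITY_TAG)
print(pick_causal_event([app_event, net_event]).message)
```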

Health score behavior for topologically connected CIs in a multi-service context

If a CI is part of a service model, it is considered internal to that service, even if it has topological connections to other services. As a result, any events on that CI are excluded from the health score calculation of other services.

Example:

(Figure: Health score for CIs in multiple service contexts)

Service A and Service B are two defined service models in BMC Helix AIOps. Both are topologically connected through different CIs.

Host CI-01 is explicitly included only in Service A’s model, and the Software pod CI-01 is explicitly included only in Service B’s model.

When an event occurs on Host CI-01:

Because Host CI-01 is part of Service A, the event contributes to the health score of Service A.
Although Host CI-01 is topologically connected to Service B, the event is not considered in Service B’s health score because Host CI-01 is not part of Service B’s model.

This ensures that the health score of each service is calculated only from the CIs it owns, avoiding double counting or cross-service noise.
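
The ownership rule can be expressed as a simple filter. The following sketch shows only which events are eligible to contribute to a service's health score. It does not reproduce the health score calculation itself, and the dictionary shapes and function name are assumptions made for illustration.

```python
def events_for_health_score(service, service_models, events):
    """Return the events that occur on CIs owned by the given service model."""
    owned_cis = service_models[service]
    return [event for event in events if event["ci"] in owned_cis]


# Service models from the example: each CI belongs to exactly one model.
service_models = {
    "Service A": {"Host CI-01"},
    "Service B": {"Software pod CI-01"},
}
events = [{"ci": "Host CI-01", "message": "Host unreachable"}]

# The event counts toward Service A only, even though Host CI-01 is
# topologically connected to Service B.
print(events_for_health_score("Service A", service_models, events))
print(events_for_health_score("Service B", service_models, events))
```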

FAQs

How can I exclude specific events from a Situation?

To exclude specific events from ML-based situations, use event enrichment policies to tag them appropriately.

Create an event enrichment or refinement policy in BMC Helix Operations Management.
In the policy, set the tags slot of the event to include the ExcludeFromSituation tag.
When this tag is applied, BMC Helix AIOps automatically excludes the event from being correlated into any ML-based situation.
Verify the exclusion by checking the event's tags slot and confirming that it does not appear in any active situation.

You can refine event inclusion logic by using the advanced enrichment or refinement policies. For information about enrichment policies, see Advanced, time-based, and dynamic enrichment policies.

For example, exclude events from known noisy CIs or event sources.
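
The verification step can also be scripted against exported data. The following is a minimal sketch, assuming events and situations have been exported as dictionaries with the hypothetical fields `id`, `tags`, `name`, and `event_ids`. These field names are illustrative and are not a documented export format.

```python
EXCLUDE_TAG = "ExcludeFromSituation"


def find_exclusion_violations(events, situations):
    """Map situation name -> IDs of tagged events that still appear in it."""
    tagged_ids = {e["id"] for e in events if EXCLUDE_TAG in e.get("tags", [])}
    violations = {}
    for situation in situations:
        hits = tagged_ids & set(situation["event_ids"])
        if hits:
            violations[situation["name"]] = sorted(hits)
    return violations


events = [
    {"id": "evt-1", "tags": ["ExcludeFromSituation"]},
    {"id": "evt-2", "tags": []},
]
situations = [
    {"name": "Multi-Service - High latency", "event_ids": ["evt-2"]},
]

# An empty result means no tagged event was correlated into a situation.
print(find_exclusion_violations(events, situations))
```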

Why must a situation causal event have a single unique node ID?

When a situation is created and noise consolidation is enabled, the causal event must be associated with only one unique node ID. If multiple node IDs exist, it might not be possible to determine which node to associate with the incident, and as a result, the incident might not be created in BMC Helix IT Service Management.

Why is the Anomaly class not available in event policy selection criteria?

The Anomaly class is not available because single anomaly events cannot be defined through a policy and do not generate incidents on their own. Incidents are created when anomaly events are correlated into a situation, at which point a consolidated incident is generated, and the contributing anomaly events are updated with the corresponding incident ID.

Why is a situation not created when events come from a node that is out of service?

When a node is out of service, events from that node are not correlated to create a situation. In such cases, the event payload might show a list of services in the impacted_service_key field, but the service_key field remains empty. To create a situation, at least one additional event from a topologically connected node must occur within the correlation time window.
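
When triaging why no situation was created, the payload pattern described above can be checked with a short script. This sketch assumes the event payload is available as a dictionary (for example, parsed from JSON). Only the `impacted_service_key` and `service_key` field names come from the text above; the function name and sample values are hypothetical. A match indicates a possible out-of-service node and is not conclusive.

```python
def service_context_missing(payload: dict) -> bool:
    """True if impacted services are listed but service_key is empty."""
    impacted = payload.get("impacted_service_key") or []
    service_key = payload.get("service_key") or ""
    return bool(impacted) and not service_key


sample_payload = {
    "impacted_service_key": ["svc-orders", "svc-payments"],
    "service_key": "",
}
print(service_context_missing(sample_payload))
```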

Where to go from here

To perform additional actions on a situation or on the events included in the situation, see Performing situation actions.