Investigating ML-based situations that impact multiple services

A multi-service situation is created by correlating events across multiple services that are topologically connected and have a cross-service impact. It includes events that occur from the defined service context, as well as from external nodes that are topologically related and contribute to the service impact.

As an operator or a site reliability engineer (SRE), use multi-service situations to:

Identify issues that spread across multiple services, and understand the service impact scope
Pinpoint the exact configuration item (CI) causing the issue
Reduce MTTR by avoiding the troubleshooting of individual services

The name of a multi-service situation is derived from the root cause of the situation. The situation name is displayed in the following format: <Multi-Service> - <root cause of situation>. For example, Multi-Service - Memory utilization is > 80% for 5 mins.

Scenarios

As an operator or site reliability engineer (SRE), you may encounter complex issues that affect more than one service at a time. The following scenarios illustrate how multi-service situations help identify cross-service dependencies, isolate shared root causes, and streamline troubleshooting.

Cross-service impact from a shared router failure
Cross-location service impact due to BGP instability

Cross-service impact from a shared router failure

A retail company offers several services, including order management, inventory management, and payment services. All three services are dependent on a shared network router to connect to backend systems and external services.

Suddenly, the shared router experiences intermittent packet loss and high latency, causing network disruptions. As a result:

The order management service fails to communicate with the pricing engine, resulting in incorrect totals and missing discounts during the checkout process.
The inventory service is unable to update warehouse stock levels, leading to discrepancies in product availability across channels.
The payment service does not receive payment confirmations, resulting in incomplete and unverified transactions.

Each of these services begins to generate events related to connectivity issues and degraded functionality.

Because the affected services are topologically connected through the shared router, BMC Helix AIOps automatically correlates all related events into a single multi-service situation.

The shared router is identified as the top-impacted CI, enabling the operator or SRE to quickly trace the problem across services to a single root cause, avoiding isolated investigations and reducing Mean Time to Resolution (MTTR).

Cross-location service impact due to BGP instability

A Border Gateway Protocol (BGP) connection between data centers in New York and London becomes unstable due to a misconfiguration. Services in both locations, such as customer profile lookup and order processing, start failing due to synchronization issues.

BMC Helix AIOps detects the BGP-connected routers as topologically linked and forms a single multi-service situation. The BGP issue is identified as the root cause, enabling the SRE to quickly resolve the cross-site network problem without needing to investigate each service separately.

Important

The BGP-based situation correlation feature is available under controlled availability. To use this capability, contact BMC Helix Support.

To investigate a situation impacting multiple services

On the BMC Helix AIOps console, click Situations.
All situations that occurred in the last 24 hours are displayed in a hierarchical view. Multi-service situations are indicated by the icon.

To learn more about multi-service situations, see Situations overview.
Click the multi-service situation to view the situation details.

The following table describes key situation UI options that require additional explanation or have a conditional behavior. The UI options that are self-explanatory and visible directly in the UI are not described here:

UI option	Description
Severity and priority	BMC Helix AIOps evaluates all events correlated into the situation, both causal and child events, and assigns the highest severity found among those events to the situation. If multiple events have the highest severity, the situation’s priority is set to the highest priority among those events. The causal event determines the root cause assignment, but it might or might not determine the final severity and priority, because these values are determined by evaluating all events that form the situation. This behavior applies only when the situation-severity parameter is set to the priority-from-highest-severity value in the situations_configurations/search API and the Multi-service Situations feature is enabled. For more information, see How situation severity and priority are determined, Configuring ML-based situations, and Managing situations by using REST APIs.
Incident ID	Click to open the incident in BMC Helix IT Service Management. If an incident is not created, a Create Incident link is displayed. Click the link to create an incident in BMC Helix IT Service Management (requires a subscription to BMC Helix IT Service Management). When Intelligent Automation Proactive Service Resolution is configured with on-premises ITSM, incidents are created in on-premises ITSM, and the incident ID is displayed in the situation details in BMC Helix AIOps. However, the incident ID cross-launch link from BMC Helix AIOps does not open the corresponding on-premises ITSM incident. An incident created from a situation inherits the situation's severity and priority values.
Names of the impacted services	Click +1 to view all the impacted services. Click the service name to open the service details in a new tab.
Show Notes	Opens the Logs and Notes panel to display the notes added to a situation.
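
The severity and priority rule described in the table above can be sketched in a few lines of Python. This is only an illustration of the described logic, not the product's implementation: the Event class, the function name, and the numeric rankings of severity and priority values are assumptions made for this example.

```python
from dataclasses import dataclass

# Assumed rankings, for illustration only; a higher number means more severe or more urgent.
SEVERITY_RANK = {"INFO": 1, "WARNING": 2, "MINOR": 3, "MAJOR": 4, "CRITICAL": 5}
PRIORITY_RANK = {"PRIORITY_5": 1, "PRIORITY_4": 2, "PRIORITY_3": 3, "PRIORITY_2": 4, "PRIORITY_1": 5}


@dataclass(frozen=True)
class Event:
    """Hypothetical, simplified event record."""
    message: str
    severity: str
    priority: str


def situation_severity_and_priority(events):
    """Return (severity, priority) for a situation from all of its events."""
    if not events:
        raise ValueError("a situation needs at least one event")
    # Step 1: the situation takes the highest severity among all events.
    top_severity = max((e.severity for e in events), key=SEVERITY_RANK.__getitem__)
    # Step 2: among the events that share that severity, take the highest priority.
    tied = [e for e in events if e.severity == top_severity]
    top_priority = max((e.priority for e in tied), key=PRIORITY_RANK.__getitem__)
    return top_severity, top_priority


events = [
    Event("Router packet loss", "MAJOR", "PRIORITY_1"),
    Event("Payment confirmation timeout", "CRITICAL", "PRIORITY_3"),
    Event("Inventory sync failure", "CRITICAL", "PRIORITY_2"),
]
print(situation_severity_and_priority(events))  # ('CRITICAL', 'PRIORITY_2')
```

In this sketch, the MAJOR event does not influence the result even though it has the highest priority, because priority is taken only from the events that share the highest severity.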

If BMC HelixGPT is enabled, the following information is displayed:
- A human-readable AI-generated summary of the situation.
- Best action recommendations, a list of suggested steps that can be used to remediate the situation. Additionally, a BMC HelixGPT-driven wizard offers sample code to accomplish individual steps in different languages such as Ansible, Python, and Bash.
- Log insights, derived from logs in BMC Helix Log Analytics, that help identify an accurate root cause of the problem.
- An integrated virtual agent, Ask HelixGPT, which leverages the BMC HelixGPT generative AI capabilities and lets you ask questions to better investigate and remediate the situation.

To identify the root cause by using Deep RCA

Important

Deep Root Cause Analysis (RCA) is under controlled availability to select customers.

Deep RCA is the advanced capability for analyzing the root cause of the situation. If you have access to Deep RCA, the best action recommendations and log insights are not available for ML-based situations.

When you deploy the fine-tuned large language model provided for BMC Helix AIOps, agentic AI-based capabilities are available to automatically run agents in the background to identify the root cause by analyzing change requests and logs from supported data sources.

When you investigate a situation, the Log Tool and Change Tool in Deep RCA run in parallel, and the detailed output, hypothesis, and reasoning used by these tools are displayed.

For more information, see Identifying the root cause of the situation by using Deep RCA.

To view best action recommendations and log insights

BMC Helix AIOps connects with BMC HelixGPT to display a natural language summary of the situation with its root cause. It also provides a step-by-step action plan for remediating the situation and generates actionable insights by analyzing logs from BMC Helix Log Analytics or any supported external log sources.

For more information about using best action recommendations and log insights to investigate a situation, see Viewing best action recommendations and log insights.

BAR_IncreaseCPU_HelixGPT_243 (1).png

To use the Ask HelixGPT virtual agent to get more information about the situation

Use the Ask HelixGPT virtual agent to ask questions within the context of an open or past situation. Using BMC HelixGPT capabilities, operators can get information about diverse topics regarding infrastructure, service health, and near real-time predictions.

Click Ask HelixGPT.
The interactive virtual agent dialog box displays the following predefined questions:
- Reoccurring?
- Change Windows?
- Impact?
- Team to solve?
Click any question to get additional information about the situation.
BMC HelixGPT generates the answer by evaluating information from incidents created for similar situations in BMC Helix IT Service Management, analyzing time stamps and patterns of similar situations that have occurred in the past, analyzing the service health score of the impacted service of the situation, and the change requests associated with the situation.
For example, if you click Impact?, the following answer is displayed.
(Optional) Click any other question to obtain more details about the situation.
The response time for a query varies with the number of events involved. At times, it may take longer than a few seconds to receive a response.

To view the situation explanation

The Situation Explanation section helps in understanding how BMC Helix AIOps identifies and analyzes the root cause of an ML-based situation. It provides a graphical root cause view, event impact analysis, correlated changes, and predictive insights to explain how events across services contribute to the situation and how the impact propagates through configuration items (CIs).

Root Cause View: Shows the impact flow of events in a situation in a graphical format. Based on the temporal and topological relationships between various causal events in the situation, the ML algorithm determines the root cause event and consequent events. Each event in the graph is aligned against the corresponding CI kind. The direction in the graph indicates the impact flow from the root cause event. You can see the impact score percentage displayed with the event. The total impact score from all the events adds up to 100 percent.

multi_service_situation_explanation_root_cause_candidates_007.png

Feature	Description
Show root cause candidates	Select this option to display all configuration items (CIs) that are identified as contributing to the impact of the situation. When enabled, multiple root causes are highlighted that contribute to the issue. If disabled, only the most probable root cause is shown. Use this option when dealing with complex issues where multiple components might be failing or affecting each other, leading to service issues. This option is visible only when more than one potential root cause is detected.
Impacted node details	Hover over an event to view the impacted node details and the corresponding CI or CI kind highlighted in the CI topology and analysis section.
Additional details	Click an event to view additional details on the Situation Details pane.

Events: Displays all causal events and details such as the event messages, impacted host, occurrence time, impact score, severity, priority, status, prediction ID, and incident ID. Perform actions on a situation by clicking the action menu.

Feature	Description
Prediction ID	(The ability to link prediction events with alarm events is under controlled availability to select customers. To use this capability, contact BMC Support.) If a prediction is linked to a real event (alarm or anomaly), the prediction ID field in the events tab displays the corresponding prediction details. Click the prediction ID to view more details in BMC Helix Operations Management. This tab also displays closed predictions that are linked to real alarms or anomalies. When an alarm event exists for a prediction, that prediction is automatically closed. These linked prediction details help you track how many predictions accurately matched real events, allowing you to assess the accuracy and effectiveness of the predictive insights.
Incident ID	If Proactive Service Resolution is enabled, the same incident ID is displayed against the situation and the events.

Important

The Automations column displays the matching automation actions for the event. To run an automation, see Running an existing automation.
(Optional) You can send a request to create an automation or, if you have the necessary permissions, create automation for events yourself. For more information, see Requesting for a new automation and Creating automation policies.

Changes: Displays the change requests that are most likely contributing to the situation, based on correlation with impacted service nodes. Change requests from BMC Helix ITSM are displayed in situations. Change requests from ServiceNow are not supported on the Situations page.

Feature	Description
Change details	For each change, the change ID, summary, impacted host or CI, occurrence time (start date/time), status (for example, Scheduled, In Progress, Completed), priority, and impact level are shown. Click a change entry to view more details in BMC Helix ITSM, where you can review the entire change history, implementation notes, approvals, and associated incidents. While change events do not contribute to service health score calculations, displaying them in context helps operators investigate more effectively.

Predictions: Lists all open prediction events associated with the impacted services that are part of the situation. Predictions are displayed for events that are expected to occur in the next 24 hours, based on metric thresholds or baseline violation in case of anomalies. When multiple services are impacted, all the impacted service names are listed in this tab.

Feature	Description
Prediction details	(The ability to link prediction events with alarm events is under controlled availability to select customers. To use this capability, contact BMC Helix Support.) If an actual alarm is generated for the same metric, the related prediction event is automatically closed and removed from this tab. The alarm entry is updated with the corresponding prediction ID to maintain traceability and help users correlate predicted events with real alarms. These prediction events enable users to review forecasted trends and take proactive action before service impact occurs.
Prediction chart	On the prediction chart, the black vertical line represents the actual metric data, and the prediction icon indicates the forecasted event point.

To view the event details

Click an event message to view the following details in the Event Details pane:

UI option	Description
Event details	Event name, event score, severity, priority, status, event assignee details, and the More Details link to view the additional event details in BMC Helix Operations Management.
Date	Date when the event first occurred or was last modified
Event summary	Event summary showing the Class, Incident ID, Object Class, Object, and Host. Clicking the Incident ID link opens the incident in BMC Helix IT Service Management. For more information, see Event classification and formatting.
Logs and notes history	All logs and notes for an event are displayed. Type a note in the text box and click Add Note to add any additional notes related to the event. Any note added for the event is reflected in the event in BMC Helix Operations Management.
Performance view	If the slot value for the event class is Alarm, the time-series data collected from the key attributes of the causal events of ML-based situations is displayed.

Click the action menu to perform event actions.

To view the CI topology and analysis

In the CI topology and analysis map, view the topology map of the situation, the impacted CIs, and the probability of the impact on the connected CIs.

Use the following options to view the topology map based on your requirements:

UI element	Description
Views	Switch between the Organic and Hierarchic views to see the impact flow. In the organic view, nodes are placed close to their adjacent nodes, which saves space. In the hierarchic view, nodes are distributed into layers, which makes it easier to identify dependencies and relationships among the nodes. The topology view displays how the impacted services are connected through a shared router, enabling you to visualize cross-service relationships and pinpoint the root cause.
Grouping	Click Enable Grouping by CI Kind to view the topology map grouped by the CI kind.
Search	If there are many CIs in the hierarchy, use the search box to locate a particular CI.
Legend	Click to view the legends used for the topological map.
Other tools	Use the other tools to zoom in, zoom out, or view the map on a full screen.
Name (displayed in the right-hand pane)	The unique name or identifier of the CI. This helps you recognize the specific component involved in the situation. Location: Expand the CI name to see its physical or logical location, such as data center, region, or availability zone. This information helps identify location-based issues or regional dependencies.
Kind (displayed in the right-hand pane)	Indicates the type of CI, such as business service, host, software pod, or other infrastructure or application components. This classification helps you understand the CI’s function within the service topology.

multi_service_situation_CI_topology_root_cause_candidates_008.png

How is the probable impact indicated in the CI topology and analysis?

When a multi-service situation is formed, the topology view displays indirectly connected services that are not part of the situation but are topologically linked to nodes in the situation, and have active events during the same timeframe.

Relationships with such services are marked in the topology by using the icon to indicate that they may be impacted as a result of the ongoing issue, even though they are not directly contributing to or included in the situation.

Example

In the following topology map, a multi-service situation is detected and formed with events from the virtual machine (selufw99_bbdab) and host (selvm02_bbdab). During the same time window, service Apex-Cloud_Infra, which is indirectly connected to the router (isp_router) through shared infrastructure, generates a critical event.

Although Apex-Cloud_Infra is not included in the situation, the topology view highlights it as a probable impact, helping the operator assess whether the issue in the virtual machine or host is contributing to problems in dependent services, such as Apex-Cloud_Infra.
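
The probable-impact rule described above (a service that is outside the situation, is topologically linked to nodes in the situation, and has active events in the same timeframe) can be sketched as a graph traversal. This is a minimal sketch under stated assumptions, not the product's algorithm: the function names, the max_hops limit, the adjacency map, the Idle-Service entry, and the timestamps are all hypothetical. Only the node and service names from the example are taken from the text.

```python
from collections import deque
from datetime import datetime


def reachable_nodes(adjacency, start_nodes, max_hops):
    """Breadth-first search: every node within max_hops of any start node."""
    seen = set(start_nodes)
    queue = deque((node, 0) for node in start_nodes)
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return seen


def probable_impact(adjacency, situation_nodes, services, event_times, window, max_hops=2):
    """Services outside the situation that are linked to it and have events in the window."""
    start, end = window
    linked = reachable_nodes(adjacency, situation_nodes, max_hops) - set(situation_nodes)
    return sorted(
        name for name in services
        if name in linked and any(start <= t <= end for t in event_times.get(name, ()))
    )


# Hypothetical undirected topology built around the names used in the example.
adjacency = {
    "selufw99_bbdab": {"selvm02_bbdab"},
    "selvm02_bbdab": {"selufw99_bbdab", "isp_router"},
    "isp_router": {"selvm02_bbdab", "Apex-Cloud_Infra", "Idle-Service"},
    "Apex-Cloud_Infra": {"isp_router"},
    "Idle-Service": {"isp_router"},
}
window = (datetime(2025, 1, 1, 12, 0), datetime(2025, 1, 1, 12, 30))
event_times = {"Apex-Cloud_Infra": [datetime(2025, 1, 1, 12, 10)]}

print(probable_impact(
    adjacency,
    situation_nodes={"selufw99_bbdab", "selvm02_bbdab"},
    services={"Apex-Cloud_Infra", "Idle-Service"},
    event_times=event_times,
    window=window,
))  # ['Apex-Cloud_Infra']
```

Idle-Service is linked to the same router but has no events in the window, so the sketch does not flag it.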

Prioritizing the root cause on the same node

When multiple events occur on the same node within a situation, the first detected event might not always represent the true causal event. This can occur when the polling intervals vary across monitoring solutions, causing later events to better represent the actual cause.

To explicitly indicate which event should be considered the priority causal event, you can apply prioritization by using a refinement policy in BMC Helix Operations Management. For information about enrichment policies, see Advanced time based and dynamic enrichment policies.

Prioritization can be applied only to events on the same node within a service. This prioritization applies only to future situations, not to situations that have already been created.

To indicate priority causal events

Create an event enrichment or refinement policy in BMC Helix Operations Management.
In the policy, set the tags slot of the event to include the priority-causal-root tag.

When an event is marked with this tag and occurs on the causal node, BMC Helix AIOps prioritizes it over other events on the same node.

Example

Consider a node where both application and network monitoring are enabled through different monitoring solutions.

The application monitoring detects a slowdown first and raises an event.
A few seconds later, the network monitoring raises an event indicating that a critical network port on the node is down, which actually represents the cause of the slowdown.

Without prioritization, the application event may be incorrectly treated as the root cause. By tagging the network event with the priority-causal-root tag through a refinement policy, the port issue detected by network monitoring is correctly prioritized as the root cause of the situation.
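
The effect of the priority-causal-root tag in this example can be illustrated with a short sketch. It assumes that, without the tag, the earliest event on the node is treated as causal, and that a tagged event on the same node takes precedence. The Event class and the function name are hypothetical. Only the tag name comes from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

PRIORITY_TAG = "priority-causal-root"


@dataclass(frozen=True)
class Event:
    """Hypothetical, simplified event record."""
    message: str
    node: str
    detected_at: datetime
    tags: frozenset = frozenset()


def causal_event_for_node(events, node):
    """Pick the causal event on a node, preferring events that carry the priority tag."""
    on_node = [e for e in events if e.node == node]
    if not on_node:
        return None
    tagged = [e for e in on_node if PRIORITY_TAG in e.tags]
    # Assumption: fall back to the earliest event when no event on the node is tagged.
    pool = tagged or on_node
    return min(pool, key=lambda e: e.detected_at)


t0 = datetime(2025, 1, 1, 12, 0, 0)
app_event = Event("Application slowdown", "node-1", t0)
port_event = Event("Critical network port down", "node-1",
                   t0 + timedelta(seconds=5), frozenset({PRIORITY_TAG}))

print(causal_event_for_node([app_event, port_event], "node-1").message)
# Critical network port down
```

If the tag is removed from the port event, the same function returns the application slowdown event, which mirrors the behavior without prioritization.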

Important

The root cause of a situation is determined based on the availability of the participating routers (ISP and Meraki router) when a direct topological relationship is not defined between them:

When both the ISP router and the Meraki router are present, and no direct topological relationship exists between them, the ISP router is identified as the root cause.
When only the Meraki router is present, it is identified as the root cause.
When only the ISP router is present, it is identified as the root cause.
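
The three cases above reduce to a simple precedence rule, sketched here for clarity. The function name and return values are hypothetical, and the sketch applies only to the case described in this note, where no direct topological relationship is defined between the two routers.

```python
from typing import Optional


def root_cause_router(isp_present: bool, meraki_present: bool) -> Optional[str]:
    """Root cause when no direct topological relationship links the two routers."""
    if isp_present:
        # The ISP router wins whether or not the Meraki router is also present.
        return "ISP router"
    if meraki_present:
        return "Meraki router"
    return None  # neither router participates in the situation


for isp, meraki in [(True, True), (False, True), (True, False)]:
    print(isp, meraki, "->", root_cause_router(isp, meraki))
```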

Health score behavior for topologically connected CIs in a multi-service context

If a CI is part of a service model, it is considered internal to that service, even if it has topological connections to other services. As a result, any events on that CI are excluded from the health score calculation of other services.

Example

health_score_for_CIs_in_multiple_service_contexts_06.png

Service A and Service B are two defined service models in BMC Helix AIOps. Both are topologically connected through different CIs.

Host CI-01 is explicitly included only in Service A’s model, and the Software pod CI-01 is explicitly included only in Service B’s model.

When an event occurs on Host CI-01:

Because Host CI-01 is part of Service A, the event contributes to the health score of Service A.
Although Host CI-01 is topologically connected to Service B, the event is not considered in Service B’s health score because Host CI-01 is not directly part of Service B’s model and is considered part of Service A.

This ensures that the health score of each service is calculated only from the CIs it owns, avoiding double-counting or cross-service noise.
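
The ownership rule in this example can be expressed as a membership check on the service models, with topological connections deliberately ignored. This is a minimal sketch. The data structures and the function name are assumptions made for illustration, and the actual health score calculation is not shown.

```python
def services_scored_for_event(service_models, event_ci):
    """Return the services whose health score an event on event_ci contributes to.

    Only explicit membership in a service model counts; topological links
    between CIs of different services are not consulted.
    """
    return sorted(name for name, cis in service_models.items() if event_ci in cis)


# Hypothetical service models matching the example above.
service_models = {
    "Service A": {"Host CI-01"},
    "Service B": {"Software pod CI-01"},
}
# Host CI-01 is topologically connected to Service B, but that link is not used here.
topology_links = {("Host CI-01", "Software pod CI-01")}

print(services_scored_for_event(service_models, "Host CI-01"))  # ['Service A']
```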

FAQs

How can I exclude specific events from a situation?

To exclude specific events from being included in the ML-based situations, you can use event enrichment policies to tag them appropriately.

Create an event enrichment or refinement policy in BMC Helix Operations Management.
In the policy, set the tags slot of the event to include the ExcludeFromSituation tag.
When this tag is applied, BMC Helix AIOps automatically excludes the event from being correlated into any ML-based situation.
Verify the exclusion by checking the event's tags slot and confirming that it does not appear in any active situation.

You can refine event inclusion logic by using the advanced enrichment or refinement policies. For information about enrichment policies, see Advanced time based and dynamic enrichment policies.

For example, exclude events from known noisy CIs or event sources.
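
The effect of the ExcludeFromSituation tag can be pictured as a filter that runs before correlation. The following sketch is illustrative only: the dictionary-based event shape and the function name are assumptions, and the tag itself is applied through a policy in BMC Helix Operations Management, not through code like this.

```python
EXCLUDE_TAG = "ExcludeFromSituation"


def events_eligible_for_correlation(events):
    """Drop events whose tags include the exclusion tag."""
    return [e for e in events if EXCLUDE_TAG not in e.get("tags", ())]


# Hypothetical events; the second one comes from a known noisy CI and is tagged by a policy.
events = [
    {"message": "Port down", "host": "router-01", "tags": []},
    {"message": "Heartbeat flap", "host": "noisy-host", "tags": [EXCLUDE_TAG]},
    {"message": "High latency", "host": "app-01"},
]

for event in events_eligible_for_correlation(events):
    print(event["message"])
```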

Why must a situation causal event have a single unique node ID?

When a situation is created and noise consolidation is enabled, the causal event must be associated with only one unique node ID. If multiple node IDs exist, it might not be possible to determine which node to associate with the incident, and as a result, the incident might not be created in BMC Helix IT Service Management.

Why is the Anomaly class not available in event policy selection criteria?

The Anomaly class is not available because single anomaly events cannot be defined through a policy and do not generate incidents on their own. Incidents are created when anomaly events are correlated into a situation, at which point a consolidated incident is generated, and the contributing anomaly events are updated with the corresponding incident ID.

Why is a situation not created when events come from a node that is out of service?

When a node is out of service, events from that node are not correlated to create a situation. In such cases, the event payload might show a list of services in the impacted_service_key field, but the service_key field remains empty. To create a situation, at least one additional event from a topologically connected node must occur within the correlation time window.

Can I use the cross-launch link to view the log data even if I am not using BMC Helix Log Analytics?

Yes, the cross-launch link opens the logs page in the log application configured in BMC HelixGPT Manager.

Can I configure multiple data sources to generate log insights?

Yes. If you are using third-party log applications as data sources, you can select more than one data source, and BMC Helix AIOps displays insights from all relevant logs for an impacted configuration item.

What other log data sources are supported by BMC Helix AIOps?

For a list of supported third-party data sources, see Adding agents for BMC Helix AIOps.

Why are situation explanation and CI topology not visible for older situations?

The causal graph for a situation is retained for only 33 days. As a result, additional root-cause-related details, such as situation explanation and CI topology, are no longer available after this period.

Situation events, however, follow a different retention policy and are purged according to the default interval of 90 days.
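
The two retention periods can be compared with simple date arithmetic. This sketch assumes that retention is counted in whole days from the situation creation date and that the 90-day default has not been changed. The function name and the result keys are hypothetical.

```python
from datetime import date, timedelta

CAUSAL_GRAPH_RETENTION = timedelta(days=33)  # situation explanation and CI topology
EVENT_RETENTION = timedelta(days=90)         # default purge interval for situation events


def available_details(created_on: date, today: date) -> dict:
    """Report which situation details are still expected to be available."""
    age = today - created_on
    return {
        "explanation_and_topology": age <= CAUSAL_GRAPH_RETENTION,
        "situation_events": age <= EVENT_RETENTION,
    }


# A situation created 40 days ago: events remain, but the causal graph is gone.
print(available_details(date(2025, 1, 1), date(2025, 2, 10)))
```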

Where to go from here

To perform additional actions on a situation or on the events included in the situation, see Performing situation actions.

BMC Helix AIOps 26.1
