Investigating ML-based situations that impact multiple services


A multi-service situation is created by correlating events across multiple services that are topologically connected and have a cross-service impact. It includes events that occur from the defined service context, as well as from external nodes that are topologically related and contribute to the service impact. 

As an operator or a site reliability engineer (SRE), use multi-service situations to:

  • Identify issues that spread across multiple services, and understand the service impact scope
  • Pinpoint the exact configuration item (CI) causing the issue
  • Reduce MTTR by avoiding the troubleshooting of individual services

The name of a multi-service situation is derived from the root cause of the situation. The situation name is displayed in the following format: Multi-Service - <root cause of situation>. For example, Multi-Service - Memory utilization is > 80% for 5 mins.

Scenarios

As an operator or site reliability engineer (SRE), you may encounter complex issues that affect more than one service at a time. The following scenarios illustrate how multi-service situations help identify cross-service dependencies, isolate shared root causes, and streamline troubleshooting.

Cross-service impact from a shared router failure

A retail company offers several services, including order management, inventory management, and payment services. All three services are dependent on a shared network router to connect to backend systems and external services.

Suddenly, the shared router experiences intermittent packet loss and high latency, causing network disruptions. As a result:

  • The order management service fails to communicate with the pricing engine, resulting in incorrect totals and missing discounts during the checkout process.

  • The inventory service is unable to update warehouse stock levels, leading to discrepancies in product availability across channels.

  • The payment service does not receive payment confirmations, resulting in incomplete and unverified transactions.

Each of these services begins to generate events related to connectivity issues and degraded functionality.

Because the affected services are topologically connected through the shared router, BMC Helix AIOps automatically correlates all related events into a single multi-service situation.

The shared router is identified as the top-impacted CI, enabling the operator or SRE to quickly trace the problem across services to a single root cause, avoiding isolated investigations and reducing Mean Time to Resolution (MTTR).

Cross-location service impact due to BGP instability

A Border Gateway Protocol (BGP) connection between data centers in New York and London becomes unstable due to a misconfiguration. Services in both locations, such as customer profile lookup and order processing, start failing because of synchronization issues.

BMC Helix AIOps detects the BGP-connected routers as topologically linked and forms a single multi-service situation. The BGP issue is identified as the root cause, enabling the SRE to quickly resolve the cross-site network problem without needing to investigate each service separately.

Important

The BGP-based situation correlation feature is available under controlled availability. To use this capability, contact BMC Support.

To investigate a situation impacting multiple services

  1. On the BMC Helix AIOps console, click Situations.
    All situations that occurred in the last 24 hours are displayed in a hierarchical view. Multi-service situations are indicated by the multiservice_situations_icon1.png icon.
    multi_service_situations_001.png

    To learn more about multi-service situations, see Situations overview.
     
  2. Click the multi-service situation to view the situation details.
    multi_service_situation__details_rcc_009.png
     
  3. The following table describes key situation UI options that require additional explanation or have a conditional behavior. The UI options that are self-explanatory and visible directly in the UI are not described here:
    UI option | Description
    Severity and priority

    BMC Helix AIOps evaluates all events, causal and child events, correlated into the situation, and assigns the highest severity found among those events to the situation. If multiple events have the highest severity, the situation’s priority is set to the highest priority among those events. For more information, see How situation severity and priority are determined.

    The original causal event might or might not determine the final severity and priority. The causal event determines the root cause assignment, but the severity and priority of the situation are determined by evaluating all events that form the situation. Situation severity and priority are determined from the highest-severity event only when the situation-severity parameter is set to the priority-from-highest-severity value in the situations_configurations/search API and the Multi-service Situations feature is enabled.

    For more information, see Configuring ML-based situations and Managing situations by using REST APIs.

    Incident ID

    Click to open the incident in BMC Helix IT Service Management.

    If an incident is not created, a Create Incident link is displayed. Click the link to create an incident in BMC Helix IT Service Management (requires a subscription to BMC Helix IT Service Management).

    When Intelligent Automation Proactive Service Resolution is configured with on-premises ITSM, incidents are created in on-premises ITSM, and the incident ID is displayed in the situation details in BMC Helix AIOps. However, the incident ID cross-launch link from BMC Helix AIOps does not open the corresponding on-premises ITSM incident. An incident created from a situation inherits the situation's severity and priority values.

    Names of the impacted services | Click +1 to view all the impacted services. Click the service name to open the service details in a new tab.
    Show Notes | Opens the Logs and Notes panel to display the notes added to a situation.
  4. If BMC HelixGPT is enabled, the following information is displayed:
    • A human-readable AI-generated summary of the situation.
    • Best action recommendations, a list of suggested steps that can be used to remediate the situation. Additionally, a BMC HelixGPT-driven wizard offers sample code to accomplish individual steps in different languages such as Ansible, Python, and Bash.
    • Log insights derived from logs in BMC Helix Log Analytics, which help identify the root cause of the problem accurately.
    • An integrated virtual agent, Ask HelixGPT, which uses the generative AI capabilities of BMC HelixGPT so that you can ask questions to investigate and remediate the situation.
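
The severity and priority rule described in the table above can be sketched in a few lines of Python. This is only an illustration of the documented rule, not the product's implementation: the Event class, the severity ranking, and the convention that 1 is the highest priority are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed severity ordering for illustration; a higher rank is more severe.
SEVERITY_RANK = {"INFO": 1, "WARNING": 2, "MINOR": 3, "MAJOR": 4, "CRITICAL": 5}


@dataclass(frozen=True)
class Event:
    message: str
    severity: str  # one of the SEVERITY_RANK keys
    priority: int  # assumption: 1 is the highest priority, 5 the lowest


def situation_severity_and_priority(events):
    """Return (severity, priority) for a situation from all of its events."""
    if not events:
        raise ValueError("a situation needs at least one event")
    # Step 1: the situation takes the highest severity found among its events.
    top_rank = max(SEVERITY_RANK[e.severity] for e in events)
    top_events = [e for e in events if SEVERITY_RANK[e.severity] == top_rank]
    # Step 2: among the events with that severity, take the highest priority.
    return top_events[0].severity, min(e.priority for e in top_events)


events = [
    Event("causal: router packet loss", "MAJOR", 1),
    Event("child: payment confirmations missing", "CRITICAL", 3),
    Event("child: checkout totals incorrect", "CRITICAL", 2),
]
print(situation_severity_and_priority(events))  # ('CRITICAL', 2)
```

In this example the causal event (MAJOR, priority 1) does not set the result; the two CRITICAL child events do, which mirrors the note that the causal event might not determine the final severity and priority.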

To identify the root cause by using Deep RCA

Important

Deep Root Cause Analysis (RCA) is under controlled availability to select customers.

Deep RCA is an advanced capability for analyzing the root cause of a situation. If you have access to Deep RCA, the best action recommendations and log insights are not available for ML-based situations.

When you deploy the fine-tuned large language model provided for BMC Helix AIOps, agentic AI-based capabilities are available to automatically run agents in the background to identify the root cause by analyzing change requests and logs from supported data sources. 

When you investigate a situation, the Log Tool and Change Tool in Deep RCA run in parallel, and the detailed output, hypothesis, and reasoning used by these tools are displayed.  

For more information, see Identifying the root cause of the situation by using Deep RCA.

Situation_DeepRCAChangeRquest_Complete_261.jpg

To view best action recommendations and log insights

BMC Helix AIOps connects with BMC HelixGPT to display a natural language summary of the situation with its root cause. It also provides a step-by-step action plan for remediating the situation and generates actionable insights by analyzing logs from BMC Helix Log Analytics or any supported external log sources. 

For more information about using best action recommendations and log insights to investigate a situation, see Viewing best action recommendations and log insights.

BAR_IncreaseCPU_HelixGPT_243 (1).png

To use the Ask HelixGPT virtual agent to get more information about the situation

Use the Ask HelixGPT virtual agent to ask questions within the context of an open or past situation. Using BMC HelixGPT capabilities, operators can get information about diverse topics regarding infrastructure, service health, and near real-time predictions.

  1. Click Ask HelixGPT.
    The interactive virtual agent dialog box displays the following predefined questions:
    • Reoccuring?
    • Change Windows?
    • Impact?
    • Team to solve?
  2. Click any question to get additional information about the situation.

    BMC HelixGPT generates the answer by evaluating incidents created for similar situations in BMC Helix IT Service Management, the time stamps and patterns of similar past situations, the service health score of the impacted service, and the change requests associated with the situation.
    For example, if you click Impact?, the following answer is displayed. 
    AskHelixGPT Questions_Impact_261.jpg

  3. (Optional) Click any other question to obtain more details about the situation. 
    The response time for a query varies with the number of events involved. At times, it may take longer than a few seconds to receive a response. 

To view the situation explanation

The Situation Explanation section helps in understanding how BMC Helix AIOps identifies and analyzes the root cause of an ML-based situation. It provides a graphical root cause view, event impact analysis, correlated changes, and predictive insights to explain how events across services contribute to the situation and how the impact propagates through configuration items (CIs).

  • Root Cause View: Shows the impact flow of events in a situation in a graphical format. Based on the temporal and topological relationships between various causal events in the situation, the ML algorithm determines the root cause event and consequent events. Each event in the graph is aligned against the corresponding CI kind. The direction in the graph indicates the impact flow from the root cause event. You can see the impact score percentage displayed with the event. The total impact score from all the events adds up to 100 percent.
    multi_service_situation_explanation_root_cause_candidates_007.png
    Feature | Description
    Show root cause candidates

    Select this option to display all configuration items (CIs) that are identified as contributing to the impact of the situation. When the option is enabled, all root causes that contribute to the issue are highlighted. When it is disabled, only the most probable root cause is shown. Use this option for complex issues where multiple components might be failing or affecting each other and causing service issues. This option is visible only when more than one potential root cause is detected.

    Impacted node details | Hover over an event to view the impacted node details and the corresponding CI or CI kind highlighted in the CI topology and analysis section.
    Additional details | Click an event to view additional details on the Situation Details pane.
  • Events: Displays all causal events and details such as the event messages, impacted host, occurrence time, impact score, severity, priority, status, prediction ID, and incident ID. Perform actions on a situation by clicking the action menu.
    Feature | Description
    Prediction ID

    (The ability to link prediction events with alarm events is under controlled availability to select customers. To use this capability, contact BMC Support.) If a prediction is linked to a real event (alarm or anomaly), the prediction ID field in the events tab displays the corresponding prediction details. Click the prediction ID to view more details in BMC Helix Operations Management. This tab also displays closed predictions that are linked to real alarms or anomalies. When an alarm event exists for a prediction, that prediction is automatically closed. These linked prediction details help you track how many predictions accurately matched real events, allowing you to assess the accuracy and effectiveness of the predictive insights.

    Incident ID

    If Proactive Service Resolution is enabled, the same incident ID is displayed against the situation and the events.

  • Changes: Displays the change requests that are most likely contributing to the situation, based on correlation with impacted service nodes. Change requests from BMC Helix ITSM are displayed in situations. Change requests from ServiceNow are not supported on the Situations page.
    Feature | Description
    Change details | For each change, the change ID, summary, impacted host or CI, occurrence time (start date/time), status (for example, Scheduled, In Progress, Completed), priority, and impact level are shown. Click a change entry to view more details in BMC Helix ITSM, where you can review the entire change history, implementation notes, approvals, and associated incidents. While change events do not contribute to service health score calculations, displaying them in context helps operators investigate more effectively.

    change_event_details_253.png
     

  • Predictions: Lists all open prediction events associated with the impacted services that are part of the situation. Predictions are displayed for events that are expected to occur in the next 24 hours, based on metric thresholds or baseline violation in case of anomalies. When multiple services are impacted, all the impacted service names are listed in this tab.
    Feature | Description
    Prediction details | (The ability to link prediction events with alarm events is under controlled availability to select customers. To use this capability, contact BMC Support.) If an actual alarm is generated for the same metric, the related prediction event is automatically closed and removed from this tab. The alarm entry is updated with the corresponding prediction ID to maintain traceability and help users correlate predicted events with real alarms. These prediction events enable users to review forecasted trends and take proactive action before service impact occurs.
    Prediction chart | On the prediction chart, the black vertical line represents the actual metric data, and the prediction icon indicates the forecasted event point.

    predictions_in_situations_254.png

To view the event details

  1. Click an event message to view the following details in the Event Details pane:
    UI option | Description
    Event details

    Event name, event score, severity, priority, status, event assignee details, and the More Details link to view the additional event details in BMC Helix Operations Management.

    Date | Date when the event first occurred or was last modified.
    Event summary

    Event summary showing the Class, Incident ID, Object Class, Object, and Host. Clicking the Incident ID link opens the incident in BMC Helix IT Service Management. 

    Event classification and formatting

    Logs and notes history

    All logs and notes for an event are displayed. Type a note in the text box and click Add Note to add any additional notes related to the event. Any note added for the event is reflected in the event in BMC Helix Operations Management.

    Performance view | If the slot value for the event class is Alarm, the time-series data collected from the key attributes of the causal events of ML-based situations is displayed.
  2. Click the action menu kebab_three_dots_menu_01.png to perform event actions.

    multi_service_situation_event_details_rcc_003.png

To view the CI topology and analysis

In the CI topology and analysis map, view the topology map of the situation, the impacted CIs, and the probability of the impact on the connected CIs. 

Use the following options to view the topology map based on your requirements:

UI element | Description
Views

Switch between the Organic and Hierarchic views to view the impact flow.
In the organic view, nodes are placed close to their adjacent nodes, which saves space. In the hierarchic view, nodes are distributed into layers, which makes it easier to identify dependencies and relationships among the nodes.

The topology view displays how the impacted services are connected through a shared router, enabling you to visualize cross-service relationships and pinpoint the root cause.

Grouping

Click Enable Grouping by CI Kind grouping_by_ci_kind_01.png to view the topology map grouped by the CI kind. 

Search | If there are many CIs in the hierarchy, use the search box to locate a particular CI.
Legend | Click to view the legends used for the topological map.
Other tools | Use the other tools to zoom in, zoom out, or view the map on a full screen.
Name (displayed in the right-hand pane)

The unique name or identifier of the CI. This helps you recognize the specific component involved in the situation.

  • Location: Expand the CI name to see its physical or logical location, such as data center, region, or availability zone. This information helps identify location-based issues or regional dependencies.
Kind (displayed in the right-hand pane) | Indicates the type of CI, such as business service, host, software pod, or other infrastructure or application components. This classification helps you understand the CI’s function within the service topology.

multi_service_situation_CI_topology_root_cause_candidates_008.png

How is the probable impact indicated in the CI topology and analysis?

When a multi-service situation is formed, the topology view displays indirectly connected services that are not part of the situation but are topologically linked to nodes in the situation, and have active events during the same timeframe.

Relationships with such services are marked in the topology by using the indirect_relationship_impact01.png icon to indicate that they may be impacted as a result of the ongoing issue, even though they are not directly contributing to or included in the situation.

Example

In the following topology map, a multi-service situation is detected and formed with events from the virtual machine (selufw99_bbdab) and host (selvm02_bbdab). During the same time window, service Apex-Cloud_Infra, which is indirectly connected to the router (isp_router) through shared infrastructure, generates a critical event.

Although Apex-Cloud_Infra is not included in the situation, the topology view highlights it as a probable impact, helping the operator assess whether the issue in the virtual machine or host is contributing to problems in dependent services, such as Apex-Cloud_Infra.

impact_relationship_mss.png
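
The three conditions for marking a service as a probable impact (not part of the situation, topologically linked to a node in the situation, and active events in the same timeframe) can be read as a simple filter. The following Python sketch is an illustration under stated assumptions: the function, its parameters, and the service and node names are hypothetical and do not correspond to a BMC Helix AIOps API.

```python
def probable_impact_services(situation_services, situation_nodes,
                             service_links, services_with_active_events):
    """Return the services to flag as probable impact, sorted by name.

    service_links maps a service name to the set of nodes it is linked to
    (hypothetical representation of the topology).
    """
    flagged = []
    for service, linked_nodes in service_links.items():
        if service in situation_services:
            continue  # already part of the situation
        if not (linked_nodes & situation_nodes):
            continue  # no topological link to a node in the situation
        if service not in services_with_active_events:
            continue  # no active events in the same timeframe
        flagged.append(service)
    return sorted(flagged)


# Hypothetical data for illustration only.
service_links = {
    "svc-orders": {"node-a"},       # part of the situation
    "svc-reporting": {"node-b"},    # linked to a situation node, has events
    "svc-archive": {"node-b"},      # linked, but no active events
    "svc-hr": {"node-z"},           # not linked to the situation
}
result = probable_impact_services(
    situation_services={"svc-orders"},
    situation_nodes={"node-a", "node-b"},
    service_links=service_links,
    services_with_active_events={"svc-reporting", "svc-hr"},
)
print(result)  # ['svc-reporting']
```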

Prioritizing the root cause on the same node

When multiple events occur on the same node within a situation, the first detected event might not always represent the true causal event. This can occur when the polling intervals vary across monitoring solutions, causing later events to better represent the actual cause.

To explicitly indicate which event should be considered the priority causal event, apply prioritization by using a refinement policy in BMC Helix Operations Management. For information about enrichment policies, see Advanced time based and dynamic enrichment policies.

Prioritization can be applied only to events on the same node within a service. This prioritization applies only to future situations, not to situations that have already been created.

To indicate priority causal events

  1. Create an event enrichment or refinement policy in BMC Helix Operations Management.
  2. In the policy, set the tags slot of the event to include the priority-causal-root tag.

When an event is marked with this tag and occurs on the causal node, BMC Helix AIOps prioritizes it over other events on the same node.
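
The effect of the priority-causal-root tag can be sketched as a selection rule over the events on the causal node. The sketch below is a simplification for illustration: the Event class and its fields are hypothetical, and treating the earliest event as the default choice when no event is tagged is an assumption based on the description above, not a statement of how the ML algorithm works internally.

```python
from dataclasses import dataclass, field

PRIORITY_TAG = "priority-causal-root"  # tag name taken from the procedure above


@dataclass
class Event:
    message: str
    node_id: str
    detected_at: float  # seconds; hypothetical field for detection time
    tags: set = field(default_factory=set)


def pick_causal_event(events, causal_node):
    """Pick the causal event on a node, preferring tagged events."""
    on_node = [e for e in events if e.node_id == causal_node]
    if not on_node:
        return None
    tagged = [e for e in on_node if PRIORITY_TAG in e.tags]
    # Assumption: fall back to the earliest event when nothing is tagged.
    candidates = tagged or on_node
    return min(candidates, key=lambda e: e.detected_at)


app = Event("application slowdown", "node-1", detected_at=100.0)
net = Event("critical network port down", "node-1", detected_at=104.0,
            tags={PRIORITY_TAG})
print(pick_causal_event([app, net], "node-1").message)
# critical network port down
```

Without the tag on the network event, the same call would return the application event, because it was detected first.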

Example

Consider a node where both application and network monitoring are enabled through different monitoring solutions.

  • The application monitoring detects a slowdown first and raises an event.

  • A few seconds later, the network monitoring raises an event indicating that a critical network port on the node is down, which actually represents the cause of the slowdown.

Without prioritization, the application event may be incorrectly treated as the root cause. By tagging the network event with the priority-causal-root tag through a refinement policy, the port issue detected by network monitoring is correctly prioritized as the root cause of the situation.

Important

The root cause of a situation is determined based on the availability of the participating routers (ISP and Meraki router) when a direct topological relationship is not defined between them:

  • When both the ISP router and the Meraki router are present, and no direct topological relationship exists between them, the ISP router is identified as the root cause.
  • When only the Meraki router is present, it is identified as the root cause.
  • When only the ISP router is present, it is identified as the root cause.
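
The three cases above reduce to a small decision rule. The following Python sketch restates it for the case where no direct topological relationship is defined between the routers; the function name and return values are hypothetical and chosen only for illustration.

```python
def router_root_cause(isp_present, meraki_present):
    """Return the router identified as root cause, or None if neither is present.

    Applies only when no direct topological relationship is defined
    between the ISP router and the Meraki router.
    """
    if isp_present:
        # Covers both "ISP and Meraki present" and "only ISP present".
        return "ISP router"
    if meraki_present:
        return "Meraki router"
    return None


print(router_root_cause(True, True))   # ISP router
print(router_root_cause(False, True))  # Meraki router
```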

Health score behavior for topologically connected CIs in a multi-service context

If a CI is part of a service model, it is considered internal to that service, even if it has topological connections to other services. As a result, any events on that CI are excluded from the health score calculation of other services. 

Example

health_score_for_CIs_in_multiple_service_contexts_06.png

Service A and Service B are two defined service models in BMC Helix AIOps. Both are topologically connected through different CIs.

Host CI-01 is explicitly included only in Service A’s model, and the Software pod CI-01 is explicitly included only in Service B’s model.

When an event occurs on Host CI-01:

  • Because Host CI-01 is part of Service A, the event contributes to the health score of Service A.

  • Although Host CI-01 is topologically connected to Service B, the event is not considered in Service B’s health score because Host CI-01 is not part of Service B’s model.

This ensures that the health score of each service is calculated only from the CIs it owns, avoiding double-counting or cross-service noise.
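
The ownership rule in this example can be sketched as a lookup from each CI to the service model that contains it. The sketch does not compute a health score; it only shows which service an event counts toward. The function and the data layout are assumptions for illustration.

```python
from collections import defaultdict


def events_by_owning_service(ci_owner, events):
    """Group event messages under the service whose model contains the CI.

    ci_owner maps a CI name to its owning service (hypothetical layout).
    events is a list of (ci_name, message) pairs.
    """
    grouped = defaultdict(list)
    for ci, message in events:
        owner = ci_owner.get(ci)
        if owner is not None:
            # The event counts toward the owning service only, even if the
            # CI is topologically connected to other services.
            grouped[owner].append(message)
    return dict(grouped)


ci_owner = {"Host CI-01": "Service A", "Software pod CI-01": "Service B"}
grouped = events_by_owning_service(ci_owner, [("Host CI-01", "CPU utilization high")])
print(grouped)  # {'Service A': ['CPU utilization high']}
```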

FAQs

How can I exclude specific events from a Situation?

To exclude specific events from being included in the ML-based situations, you can use event enrichment policies to tag them appropriately.

  1. Create an event enrichment or refinement policy in BMC Helix Operations Management.

  2. In the policy, set the tags slot of the event to include the ExcludeFromSituation tag.

  3. When this tag is applied, BMC Helix AIOps automatically excludes the event from being correlated into any ML-based situation.

  4. Verify the exclusion by checking the event's tags slot and confirming that the event does not appear in any active situation.

You can refine event inclusion logic by using the advanced enrichment or refinement policies. For information about enrichment policies, see Advanced time based and dynamic enrichment policies.

For example, exclude events from known noisy CIs or event sources.
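
Conceptually, the ExcludeFromSituation tag acts as a filter applied before correlation. The following Python sketch shows that idea on a list of events; representing an event as a dictionary with a tags list is an assumption for illustration and is not the actual event format.

```python
EXCLUDE_TAG = "ExcludeFromSituation"  # tag name taken from the steps above


def eligible_for_correlation(events):
    """Return the events that do not carry the exclusion tag."""
    return [e for e in events if EXCLUDE_TAG not in e.get("tags", [])]


# Hypothetical events for illustration only.
events = [
    {"msg": "disk latency high", "tags": []},
    {"msg": "heartbeat flap on noisy CI", "tags": [EXCLUDE_TAG]},
    {"msg": "memory utilization > 80%"},
]
kept = eligible_for_correlation(events)
print([e["msg"] for e in kept])
# ['disk latency high', 'memory utilization > 80%']
```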

Why must a situation causal event have a single unique node ID?

When a situation is created and noise consolidation is enabled, the causal event must be associated with only one unique node ID. If multiple node IDs exist, it might not be possible to determine which node to associate with the incident, and as a result, the incident might not be created in BMC Helix IT Service Management.

Why is the Anomaly class not available in event policy selection criteria?

The Anomaly class is not available because single anomaly events cannot be defined through a policy and do not generate incidents on their own. Incidents are created when anomaly events are correlated into a situation, at which point a consolidated incident is generated, and the contributing anomaly events are updated with the corresponding incident ID.

Why is a situation not created when events come from a node that is out of service?

When a node is out of service, events from that node are not correlated to create a situation. In such cases, the event payload might show a list of services in the impacted_service_key field, but the service_key field remains empty. To create a situation, at least one additional event from a topologically connected node must occur within the correlation time window.
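
When troubleshooting, the payload pattern described above can be checked with a small helper. This is a hedged sketch: the field names impacted_service_key and service_key come from the description, but their types (a list and a string) and the dictionary representation of the payload are assumptions. The check only spots the symptom; it does not confirm that the node is out of service.

```python
def has_impacted_services_but_no_service_key(payload):
    """True when impacted services are listed but service_key is empty."""
    impacted = payload.get("impacted_service_key") or []
    service_key = payload.get("service_key") or ""
    return bool(impacted) and not service_key


# Hypothetical payloads for illustration only.
suspect = {"impacted_service_key": ["svc-orders", "svc-payments"], "service_key": ""}
normal = {"impacted_service_key": ["svc-orders"], "service_key": "svc-orders"}
print(has_impacted_services_but_no_service_key(suspect))  # True
print(has_impacted_services_but_no_service_key(normal))   # False
```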

Can I use the cross-launch link to view the log data even if I am not using BMC Helix Log Analytics?

Yes, the cross-launch link opens the logs page in the log application configured in BMC HelixGPT Manager. 

Can I configure multiple data sources to generate log insights?

Yes. If you are using third-party log applications as data sources, you can select more than one data source, and BMC Helix AIOps displays insights from all relevant logs for an impacted configuration item.

What other log data sources are supported by BMC Helix AIOps?

For a list of supported third-party data sources, see Adding agents for BMC Helix AIOps.

Where to go from here

To perform additional actions on a situation or on the events included in the situation, see Performing situation actions.

 

BMC Helix AIOps 26.1