Investigating ML-based situations that impact multiple services
Related topics
Investigating ML-based primary situations
As an operator or a site reliability engineer (SRE), use multi-service situations to:
Identify issues that spread across multiple services
Understand the service impact scope
Drill down into related events and nodes and pinpoint the exact CI causing the issue
Detect cross-site network issues where services are impacted due to failures in BGP-connected infrastructure
Accelerate root cause isolation by leveraging cross-service correlation and topology mapping
Reduce MTTR by avoiding the troubleshooting of individual services
A multi-service situation is created by correlating events across multiple services that are topologically connected and have a cross-service impact. It includes events that occur from the defined service context, as well as from external nodes that are topologically related and contribute to the service impact. Multi-service situations help SREs and operators get a complete view of issues across multiple services, improve investigation accuracy, and reduce the time required to identify the root cause of complex issues.
Example 1
A retail company has several services, including order management service, inventory service, and payment service. All three services are dependent on a shared network router to connect to backend systems and external services.
Suddenly, the shared router experiences intermittent packet loss and high latency, causing network disruptions. As a result:
The order management service fails to reach the pricing engine, causing incorrect totals and missing discounts during checkout.
The inventory service is unable to update warehouse stock levels, leading to discrepancies in product availability across channels.
The payment service does not receive payment confirmations, leaving transactions incomplete and unverified.
Each of these services begins to generate events related to connectivity issues and degraded functionality.
Because the affected services are topologically connected through the shared router, BMC Helix AIOps automatically correlates all related events into a single multi-service situation.
The shared router is identified as the top-impacted CI, enabling the operator or SRE to quickly trace the problem across services to a single root cause, avoiding isolated investigations and reducing Mean Time to Resolution (MTTR).
Example 2
A BGP connection between data centers in New York and London becomes unstable due to a misconfiguration. Services in both locations, like customer profile lookup and order processing, start failing due to sync issues.
BMC Helix AIOps detects the BGP-connected routers as topologically linked and forms a single multi-service situation. The BGP issue is identified as the root cause, helping the SRE quickly resolve the cross-site network problem without investigating each service separately.
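The correlation idea in Example 1 can be illustrated with a minimal Python sketch. It is not the BMC Helix AIOps algorithm; it only assumes that events are grouped when their services depend on a common node. The topology, event messages, the extra hr-portal service, and the correlate function are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical topology: each service mapped to the nodes it depends on.
TOPOLOGY = {
    "order-management": {"shared-router", "pricing-engine"},
    "inventory": {"shared-router", "warehouse-db"},
    "payment": {"shared-router", "payment-gateway"},
    "hr-portal": {"hr-db"},
}

# Hypothetical events as (service, message) pairs.
EVENTS = [
    ("order-management", "pricing engine unreachable"),
    ("inventory", "stock level update timed out"),
    ("payment", "payment confirmation not received"),
    ("hr-portal", "disk usage high"),
]


def correlate(events, topology):
    """Group events from services that share a node; return None if no node is shared."""
    services_by_node = defaultdict(set)
    for service, _message in events:
        for node in topology.get(service, ()):
            services_by_node[node].add(service)

    # Keep only nodes that connect at least two services with active events.
    shared = {n: s for n, s in services_by_node.items() if len(s) > 1}
    if not shared:
        return None

    # Pick the node touching the most services (the name breaks ties deterministically).
    top_node = max(shared, key=lambda n: (len(shared[n]), n))
    members = shared[top_node]
    return {
        "top_impacted_node": top_node,
        "services": sorted(members),
        "events": [e for e in events if e[0] in members],
    }


situation = correlate(EVENTS, TOPOLOGY)
print(situation["top_impacted_node"], situation["services"])
```

In this sketch, the three retail services are grouped around shared-router, while the unrelated hr-portal event stays outside the group because its service shares no node with the others.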
To investigate a situation impacting multiple services
- On the BMC Helix AIOps console, click Situations.
All situations that occurred in the last 24 hours are displayed in a hierarchical view. Multi-service situations are indicated by a dedicated icon.
To learn more about multi-service situations, see Situations-overview.
- Click the multi-service situation and view the following details on the situation details page:
- Situation name, severity, priority, type of situation, last modified date, and status.
- Incident ID: Click to open the incident in BMC Helix IT Service Management.
If an incident is not created, a Create Incident link is displayed. Click the link to create an incident in BMC Helix IT Service Management (requires a subscription to BMC Helix IT Service Management).
- Names of the impacted services: Click +1 to view all the impacted services. Click the service name to open the service details in a new tab.
- Changes: Change requests associated with the situation.
- Show Notes: Opens the Logs and Notes panel to display the notes added to a situation.
- Situation explanation. For more information, see To view the situation explanation.
- (Optional) Click the action menu to perform actions on a situation.
For more information, see Performing situation actions.
- If BMC HelixGPT is enabled, the following information is displayed:
- A human-readable AI-generated summary of the situation.
- Best action recommendations, a list of suggested steps that can be used to remediate the situation. Additionally, a BMC HelixGPT-driven wizard offers sample code to accomplish individual steps in different languages such as Ansible, Python, and Bash.
- Log insights, collected from logs generated in BMC Helix Log Analytics, that help identify an accurate root cause of the problem.
- An integrated virtual agent, Ask HelixGPT, which uses the BMC HelixGPT generative AI capabilities to answer questions that help you investigate and remediate the situation. To learn more about BMC HelixGPT capabilities, see Situations-overview.
Continue with To view best action recommendations.
To view best action recommendations
BMC Helix AIOps with BMC HelixGPT enables you to connect to the following ITSM data sources to generate best action recommendations:
- BMC Helix IT Service Management
- ServiceNow BMC Helix IT Service Management (Controlled availability customers only)
- Jira (Controlled availability customers only)
Administrators must configure third-party data sources in BMC HelixGPT to generate recommendations. To configure third-party data sources, see Adding data sources in BMC HelixGPT in BMC HelixGPT documentation.
On the situation details page, review an AI-generated summary (short problem statement, brief summary, and detailed problem context).
Click Show remediation steps.
The recommended steps for the situation are displayed, for example, the steps suggested for a High CPU Utilization issue.
(If available) Click Code wizard.
The code that can be used to run the recommended step is displayed. For some manual steps, the code wizard might not be displayed.
Select your preferred language (Ansible, Python, or Bash). The code is displayed in the selected language.
Click Copy to clipboard and use the code in your existing script to run the recommended remediation step.
Close the code wizard.
Continue with To view log insights.
To view log insights
BMC Helix AIOps with BMC HelixGPT enables you to connect to the following log data sources to generate log insights:
- BMC Helix Log Analytics (no configuration required)
- Splunk Enterprise
- ElasticSearch
Administrators must configure third-party data sources in BMC HelixGPT Manager to generate insights. To configure third-party data sources, see Adding data sources in BMC HelixGPT in BMC HelixGPT documentation.
Click Ask HelixGPT and then click Log Insights.
The first time that you view log insights for a situation, a progress bar shows the progress of the log summary generation. If you view log insights for the same situation again, the summary loads without delay. Depending on the log source configured in BMC HelixGPT, actionable insights from the logs related to the configuration item are displayed, which helps in identifying the root cause of the situation.
Use the cross-launch link to view the log details in BMC Helix Log Analytics.
To use the Ask HelixGPT virtual agent to get more information about the situation
The Ask HelixGPT virtual agent is available if BMC HelixGPT is enabled. To enable BMC HelixGPT, contact BMC Support.
Use the Ask HelixGPT virtual agent to ask questions within the context of an open or past situation. Using the BMC HelixGPT capabilities, operators can get information about diverse topics regarding infrastructure, service health, and near real-time predictions.
Click Ask HelixGPT.
The interactive virtual agent dialog box displays the following predefined questions:
What is the impact of the issue?
Which team can solve this issue?
Has this situation happened in the past?
Are there any change windows active during this situation?
Click any question to get additional information about the situation.
BMC HelixGPT generates the answer by evaluating information from incidents created for similar situations in BMC Helix IT Service Management, analyzing time stamps and patterns of similar situations that have occurred in the past, analyzing the service health score of the impacted service of the situation, and the change requests associated with the situation.
For example, if you click What is the impact of the issue?, the corresponding answer is displayed.
(Optional) Click any other question to obtain more details about the situation.
Continue with To view the situation explanation.
To view the situation explanation
In the Situation Explanation section, use the Root Cause View to analyze the root cause events associated with the situation.
Root Cause View: Shows the impact flow of events in a situation in a graphical format. Based on the temporal and topological relationships between various causal events in the situation, the ML algorithm determines the root cause event and consequent events. Each event in the graph is aligned against the corresponding CI kind. The direction in the graph indicates the impact flow from the root cause event. You can see the impact score percentage displayed with the event. The total impact score from all the events adds up to 100 percent.
Show root cause candidates: Select this option to display all configuration items (CIs) that are identified as contributing to the impact of the situation. When enabled, multiple root causes are highlighted that contribute to the issue. If disabled, only the most probable root cause is shown. Use this option when dealing with complex issues where multiple components might be failing or affecting each other, leading to service issues. This option is visible only when more than one potential root cause is detected.
Hover over an event to view the impacted node details and the corresponding CI or CI kind highlighted in the CI topology and analysis section.
Click an event to view additional details on the Situation Details pane.
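The impact score percentages and the Show root cause candidates option can be pictured with a short Python sketch. How the ML algorithm actually computes scores is not described here, so the raw weights, the 20 percent candidate threshold, and both function names are assumptions used only to show scores that add up to 100 percent and a toggle between one root cause and several candidates.

```python
def impact_percentages(raw_scores):
    """Scale non-negative raw scores so that they sum to 100."""
    total = sum(raw_scores.values())
    if total <= 0:
        raise ValueError("at least one raw score must be positive")
    return {event: 100.0 * score / total for event, score in raw_scores.items()}


def root_causes(percentages, show_candidates=False, min_share=20.0):
    """Return the top event, or every event at or above min_share when candidates are shown."""
    ranked = sorted(percentages, key=percentages.get, reverse=True)
    if not show_candidates:
        return ranked[:1]
    # min_share is a hypothetical cutoff; fall back to the top event if nothing passes it.
    return [event for event in ranked if percentages[event] >= min_share] or ranked[:1]


# Hypothetical raw weights for three causal events.
raw = {"router packet loss": 6.0, "VM CPU saturation": 3.0, "pod restart": 1.0}
shares = impact_percentages(raw)
print(shares)
print(root_causes(shares))
print(root_causes(shares, show_candidates=True))
```

With these sample weights, only the router event is returned by default, and the router and VM events are returned when candidates are shown.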
Events: Displays all causal events and details such as the event messages, impacted host, occurrence time, impact score, severity, priority, status, and incident ID. Perform actions on a situation by clicking the action menu.
If Proactive Service Resolution is enabled, the same incident ID is displayed against the situation and the events.
Changes: Displays the top three changes and details such as the change ID, summary, impacted host, occurrence time, status, priority, and impact.
Click an event message to view the following details in the Event Details pane:
Event name, event score, severity, priority, status, and the More Details link to view the additional event details in BMC Helix Operations Management
Event assignee details
Date when the event first occurred or was last modified
Event summary showing the Class, Incident ID, Object Class, Object, and Host. Clicking the Incident ID link opens the incident in BMC Helix IT Service Management.
For more information about event classes and objects, see EVENT base event class.
Logs and notes history: All logs and notes for an event are displayed. Type a note in the text box and click Add Note to add any additional notes related to the event. Any note added for the event is reflected in the event in BMC Helix Operations Management.
Performance view: If the slot value for the event class is Alarm, the time-series data collected from the key attributes of the causal events of ML-based situations is displayed.
Click the action menu to perform event actions.
Continue with To view CI topology and analysis.
To view the CI topology and analysis
In the CI topology and analysis map, view the topology map of the situation, the impacted CIs, and the probability of the impact on the connected CIs.
Use the following options to view the map based on your requirements:
Views: Switch between the Organic and Hierarchic views to view the impact flow.
In the organic view, nodes are placed close to their adjacent nodes, which saves space. In the hierarchic view, the nodes are distributed into layers, which makes it easier to identify dependencies and relationships among the nodes.
The topology view shows how the impacted services are connected, for example, through a shared router, helping you visualize the cross-service relationships and pinpoint the root cause.
Grouping: Click Enable Grouping by CI Kind to view the topology map grouped by the CI kind.
Search: If there are many CIs in the hierarchy, use the search box to locate a particular CI.
Legend: Click to view the legends used for the topological map.
Use the other tools to zoom in, zoom out, or view the map on a full screen.
The right-hand pane provides a summary of the configuration items (CIs) related to the situation. It includes the following fields:
Name: The unique name or identifier of the CI. This helps you recognize the specific component involved in the situation.
Kind: Indicates the type of CI, such as business service, host, software pod, or other infrastructure or application components. This classification helps you understand the CI’s function within the service topology.
Location: Expand the CI name to see its physical or logical location, such as data center, region, or availability zone. This information helps identify location-based issues or regional dependencies.
Continue with To view the probable impact in the CI topology and analysis.
To view the probable impact in the CI topology and analysis
When a multi-service situation is formed, the topology view displays indirectly connected services that are not part of the situation, but are topologically linked to nodes in the situation, and have active events during the same timeframe.
Relationships with such services are marked in the topology with an icon to indicate that they might be impacted as a result of the ongoing issue, even though they are not directly contributing to or included in the situation.
For example, in the preceding topology map, a multi-service situation is detected and formed with events from the virtual machine (selufw99_bbdab) and host (selvm02_bbdab). During the same time window, service Apex-Cloud_Infra, which is indirectly connected to the router (isp_router) through shared infrastructure, generates a critical event.
Although Apex-Cloud_Infra is not included in the situation, the topology view highlights it as a probable impact, helping the operator assess whether the issue in the virtual machine or host is contributing to problems in dependent services like Apex-Cloud_Infra.
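The three conditions for a probable impact (not part of the situation, topologically linked to a node in the situation, and active events in the same timeframe) can be sketched in Python as follows. This is a simplified illustration, not the product logic: the link list, the extra Billing-Service entry, the timestamps, and the function names are hypothetical, and the sketch treats any reachable service as linked, whereas a real implementation would likely bound how far it traverses.

```python
from collections import deque
from datetime import datetime

# Hypothetical undirected topology links between nodes and services.
LINKS = [
    ("selufw99_bbdab", "selvm02_bbdab"),
    ("selvm02_bbdab", "isp_router"),
    ("isp_router", "Apex-Cloud_Infra"),
    ("isp_router", "Billing-Service"),
]
SERVICES = {"Apex-Cloud_Infra", "Billing-Service"}
SITUATION_NODES = {"selufw99_bbdab", "selvm02_bbdab"}
WINDOW = (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30))
# Hypothetical events outside the situation: (service, timestamp).
OUTSIDE_EVENTS = [
    ("Apex-Cloud_Infra", datetime(2024, 1, 1, 10, 12)),
    ("Billing-Service", datetime(2024, 1, 1, 8, 0)),
]


def reachable(start_nodes, links):
    """Breadth-first search over the undirected links, starting from the situation nodes."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, queue = set(start_nodes), deque(start_nodes)
    while queue:
        for neighbor in adjacency.get(queue.popleft(), ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def probable_impacts(situation_nodes, services, links, events, window):
    """Services outside the situation that are linked to it and have an event in the window."""
    start, end = window
    linked = reachable(situation_nodes, links) & services
    active = {svc for svc, ts in events if start <= ts <= end}
    return sorted((linked - set(situation_nodes)) & active)


print(probable_impacts(SITUATION_NODES, SERVICES, LINKS, OUTSIDE_EVENTS, WINDOW))
```

In this sketch, Apex-Cloud_Infra is flagged because its event falls inside the situation window, while the hypothetical Billing-Service is linked to the same router but is not flagged because its event is outside the window.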
Health score behavior for topologically connected CIs in a multi-service context
If a CI is part of a service model, it is considered internal to that service, even if it has topological connections to other services. As a result, any events on that CI are excluded from the health score calculation of other services.
Example:
Service A and Service B are two defined service models in BMC Helix AIOps. Both are topologically connected through different CIs.
Host CI-01 is explicitly included only in Service A’s model, and the Software pod CI-01 is explicitly included only in Service B’s model.
When an event occurs on Host CI-01:
Because Host CI-01 is part of Service A, the event contributes to the health score of Service A.
Although Host CI-01 is topologically connected to Service B, the event is not considered in Service B’s health score because Host CI-01 is not directly part of Service B’s model and is considered part of Service A.
This ensures that the health score of each service is calculated only from the CIs it owns, avoiding double counting or cross-service noise.
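The attribution rule in this example can be expressed as a minimal Python sketch. The health score formula itself is not described here, so the sketch only counts events per service as a stand-in for the input to that calculation; the data structures and function names are hypothetical.

```python
# Service models from the example: each CI is explicitly included in one service.
SERVICE_MODELS = {
    "Service A": {"Host CI-01"},
    "Service B": {"Software pod CI-01"},
}
# Topological link between the two CIs; deliberately not used for attribution.
TOPOLOGY_LINKS = [("Host CI-01", "Software pod CI-01")]


def owning_services(ci, service_models):
    """Return the services whose model explicitly includes the CI."""
    return sorted(name for name, cis in service_models.items() if ci in cis)


def events_per_service(event_cis, service_models):
    """Count events per service, attributing each event only to the owning service."""
    counts = {name: 0 for name in service_models}
    for ci in event_cis:
        for service in owning_services(ci, service_models):
            counts[service] += 1
    return counts


print(events_per_service(["Host CI-01"], SERVICE_MODELS))
```

An event on Host CI-01 is counted for Service A only, even though TOPOLOGY_LINKS connects that host to a CI in Service B.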
Where to go from here
To perform additional actions on a situation or on the events included in the situation, see Performing situation actions.