Performing causal analysis of impacted services

While monitoring business services for health and performance, operators and site reliability engineering teams (SREs) need to quickly view performance and health issues, their impact severity, and their root causes. To ensure speed and accuracy, they can use BMC Helix AIOps, which is powered by an AI/ML-enabled root cause isolation (RCI) algorithm. The algorithm helps eliminate inaccuracy and speculation in locating problem areas, reduces the time spent waiting to accumulate a large amount of observable data, automatically points to the source of a problem, and reduces the mean time to resolve (MTTR) for incidents or issues in large and complex services.


The AI/ML-based RCI algorithm correlates data from your services to identify the cause of a problem, provide context for the problem, and offer a means to resolve and remediate the situation. BMC Helix AIOps uses this data to help operators and SREs by automatically performing the following functions:

  • Identifies and determines the top impacted nodes of a service based on the impacting events, incidents, and changes.
  • Computes the impact score of these events, incidents, and changes.
  • Ranks the impacting events based on the impact score, which in turn ranks the impacting causal nodes of the service.
  • Correlates the events from the causal nodes into Situations to reduce the event noise and MTTR.
  • Provides drill-down views to analyze and resolve the problem cause.
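
The ranking functions listed above can be illustrated with a minimal Python sketch. It is not the RCI algorithm itself: the impact scores are assumed to be already computed, the `ImpactingEvent` type and the node names are hypothetical, and ranking a node by its highest-scoring event is an assumed rule chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactingEvent:
    node: str            # causal node that raised the event (hypothetical names)
    message: str
    impact_score: float  # assumed to be precomputed; real scoring is not shown


def rank_events(events):
    """Return the events ordered from highest to lowest impact score."""
    return sorted(events, key=lambda e: e.impact_score, reverse=True)


def rank_causal_nodes(events, top_n=3):
    """Rank nodes by the best-scoring event on each node (an assumed rule)."""
    best = {}
    for event in events:
        current = best.get(event.node, float("-inf"))
        best[event.node] = max(current, event.impact_score)
    ranked = sorted(best, key=best.get, reverse=True)
    return ranked[:top_n]


events = [
    ImpactingEvent("db-01", "Disk queue length high", 82.0),
    ImpactingEvent("web-02", "CPU utilization high", 64.5),
    ImpactingEvent("db-01", "Free memory low", 40.0),
    ImpactingEvent("cache-01", "Memory usage high", 71.0),
    ImpactingEvent("lb-01", "Connection resets", 12.0),
]

top_nodes = rank_causal_nodes(events)
print(top_nodes)  # ['db-01', 'cache-01', 'web-02']
```

Keeping only the top three nodes mirrors the default view described in the procedures that follow.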


To view the impacted nodes for a service

  1. On the Services page, click the service name for which you want to analyze the root cause of an impact.
  2. Scroll down and expand Analyze Root Cause to view the top three root causes (if available), and their causal events, incidents, and changes.
    The top-most causal node is selected by default.
    By default, details are displayed for the last three hours. You can select a different duration to view service details for the selected time period. 


  3. Select a different causal node from the list to view the impacting events for the node.
    Because incidents from BMC Helix IT Service Management are treated as Incident_Info events in BMC Helix Operations Management, a top impacting incident is displayed as a top impacting event with an incident ID.
    If there are more than three impacting events, click Show all events to view the list of all events for the causal node. 
  4. (Optional) View the event details and perform event operations.
    For instructions, see To investigate the impacting events and perform event actions.
  5. (Optional; BMC Helix IT Service Management integration) Select Changes to view the top three changes that impact the service.
    If there are more than three impacting changes, click Show all changes to view the list of all changes for the causal node.
  6. (Optional) View the change list of all change requests and their details.
    For instructions, see To investigate the change requests.
  7. (Optional) Click the column selector and clear the columns that you do not want to be displayed. 
    Only selected columns are displayed. You can also drag and drop the columns to rearrange them.  


To investigate the impacting events and perform event actions

  1. Select a causal node to view the list of events that caused the impact and to perform event operations for resolving issues.
    By default, the top three events are displayed along with their impact score, severity, and other relevant details.
  2. Analyze the impact as required. 
  3. Click an event message to view the event details.
  4. Click the action menu to perform any of the following supported event actions:
    • Acknowledge Event: Recognizes the existence of an open event. This operation changes the event status from Open to Acknowledged.
    • Assign Event: Assigns ownership of an open, acknowledged, or assigned event to yourself or another person in the same account. This operation changes the event status from Open or Acknowledged to Assigned, and the event owner is updated with the selected user. If the event status is Assigned, only the ownership changes to the selected user.
    • Close Event: Disables any further event operations on the event. Closed events are not considered for calculating the status of a device. You can close only events with the status Open, Assigned, or Acknowledged.
    • Decline Ownership: Removes ownership of an event in the Assigned state. This operation changes the event status to Acknowledged.
    • Set Event Priority: Assigns a priority level to the event.
    • Take Ownership: Assigns ownership of an Open or Acknowledged event to yourself.
    • Unacknowledge Event: Changes a previously Acknowledged event back to the Open state.
    • Add Notes: Displays the Add Notes dialog box.
    • Create Incident: Creates an incident in BMC Helix IT Service Management – SmartIT. The incident ID appears against the impacted nodes. You can click the link to view the incident details in BMC Helix IT Service Management – SmartIT. (You must have permissions to view incidents in BMC Helix IT Service Management.)

    All logs and notes for the event are displayed at the bottom of the panel.

    For more information about the impact of the actions on the event, see Performing event operations in the BMC Helix Operations Management online documentation.

  5. Add any additional notes related to the event by entering a note in the text box and clicking Add Note.
    Any note added for the event is also reflected for the event in BMC Helix Operations Management.

  6. (Optional) Click More Details to analyze the complete cause and context of the event.
    The BMC Helix Operations Management Event Details page opens. To understand the options on that page, see Viewing event details.
  7. (Optional) For all other events, repeat steps 4 to 6.
  8. If an incident is created, click the link in the Incident ID column to view the incident details in BMC Helix IT Service Management – SmartIT.

    Important

    To launch the incident details page, you must have permissions to view incidents in BMC Helix IT Service Management.


  9. (Requires the Intelligent Automations option to be enabled) In the Automations column, automations that match the event are displayed.
    For running existing automations, see Running an existing automation.
  10. (Requires the Intelligent Automations option to be enabled) Click Action and perform any of the available actions for the open events: 
    • Create Automation: Launches the BMC Helix Intelligent Automation > Create Automation Policy page.
      Tenant administrators can create an automation policy; see Creating automation policies.
    • Request Automation: Displays the Request Automation dialog box.
      For instructions about how to raise a request, see Requesting for a new automation.
    • Trigger Automation: Displays the Run Automation dialog box that you can use to run automations for remediating the event. 
      For instructions about how to run an automation, see Running an existing automation.
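
The status changes produced by the event actions described earlier in this procedure can be pictured as a small state machine. The following Python sketch is illustrative only and is not product code: the `Event` class, the action keys, and the `apply_action` function are hypothetical. It also assumes that Take Ownership results in the Assigned status, which the action descriptions do not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional

# action -> (statuses the action is allowed from, resulting status)
TRANSITIONS = {
    "acknowledge": ({"Open"}, "Acknowledged"),
    "assign": ({"Open", "Acknowledged", "Assigned"}, "Assigned"),
    # Resulting status for take_ownership is an assumption.
    "take_ownership": ({"Open", "Acknowledged"}, "Assigned"),
    "decline_ownership": ({"Assigned"}, "Acknowledged"),
    "unacknowledge": ({"Acknowledged"}, "Open"),
    "close": ({"Open", "Acknowledged", "Assigned"}, "Closed"),
}


@dataclass
class Event:
    status: str = "Open"
    owner: Optional[str] = None


def apply_action(event, action, user=None):
    """Apply an action to the event, or raise ValueError if it is not allowed."""
    allowed_from, new_status = TRANSITIONS[action]
    if event.status not in allowed_from:
        raise ValueError(f"{action} is not allowed from status {event.status}")
    if action in ("assign", "take_ownership"):
        if user is None:
            raise ValueError(f"{action} requires a user")
        event.owner = user
    elif action == "decline_ownership":
        event.owner = None
    event.status = new_status
    return event


event = Event()
apply_action(event, "acknowledge")
apply_action(event, "assign", user="operator1")
apply_action(event, "decline_ownership")
print(event.status, event.owner)  # Acknowledged None
```

Because Closed does not appear among the allowed source statuses, any action on a closed event raises an error, which matches the statement that closing an event disables further event operations.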


To investigate the change requests 

For the changes to be listed in BMC Helix AIOps, you must have BMC Helix IT Service Management enabled to work with BMC Helix AIOps. For more information about managing changes from BMC Helix IT Service Management, see Managing change.

  1. Select a causal node to view the list of probable changes that caused the impact and to resolve the issues.
  2. Select Changes to view the top three changes along with their impact score, severity, and other relevant details. Analyze the impact as required.
  3. (Optional) Click the Show all changes link to view the list of all changes.
  4. Click a change message to view the change details summary.


Analyzing the metrics for an impacted service

Metrics are performance and health indicators collected from various nodes of your services. Each metric is a quantifiable measure used to gauge and compare the performance of impacted causal nodes in a given time period. If you have configured an alarm policy for such metrics, an event or alert is generated when any of the vital metrics goes above or below the expected range of normal behavior. Each metric is analyzed over a time period and the behavioral trend is captured. In BMC Helix AIOps, you can view the metrics associated with the top three causal events for the top three impacted nodes of a service. After analyzing the metrics trend graph, you can take follow-up actions to resolve the issue as required.
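
As a rough illustration of the range check described above, the following Python sketch flags metric samples that fall outside an expected range. The function name, the thresholds, and the sample values are hypothetical, and an actual alarm policy may use different logic (for example, learned baselines rather than fixed limits).

```python
def out_of_range(samples, low, high):
    """Return the (timestamp, value) pairs that fall outside [low, high]."""
    return [(ts, value) for ts, value in samples if value < low or value > high]


# Hypothetical CPU utilization samples: (minute offset, percent).
cpu_samples = [(0, 41.0), (1, 55.5), (2, 93.2), (3, 97.8), (4, 60.1)]

breaches = out_of_range(cpu_samples, low=5.0, high=90.0)
print(breaches)  # [(2, 93.2), (3, 97.8)]
```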

For example, if your service contains a Linux host, you might want to monitor the following metrics to ensure that the Linux server is performing and available at all times:

  • CPU usage and load
    • CPU Utilization: Indicates the percentage of CPU utilization. CPU utilization is calculated by adding user time and system time. Utilization is more useful when looked at in combination with the Load parameter.
    • Load: Displays the average number of processes in the kernel's run queue during an interval (1 minute in this case). It is more useful when looked at in combination with CPU Utilization.
  • Disk and disk usage
    • Average requests in queue: Displays the average number of disk I/O requests in the queue and is measured only when the queue is occupied. A high number indicates that system throughput is probably slowing down because of the number of I/O requests for this disk.
    • Block transferred per second: Displays the number of blocks read from, or written to, the device per second, and indicates the workload for the device.
  • File system
    • Available space: Displays the amount of available space for this file system instance. This parameter is critical on the root volume.
    • Number of free I-nodes: Displays the number of I-nodes available. An I-node maintains information about each file. Measuring I-nodes is critical because once all of your I-nodes have been used, the file system will not accept any more files regardless of how much disk space is available.
  • Memory
    • Used memory: Displays the amount of memory used by the partition. 
    • Free memory: Displays the number of 1 KB pages of memory available.
  • Process
    • Process CPU Usage: Displays the percentage of CPU used by the selected process. This percentage is calculated on the total number of active CPUs in the system. 
    • Process Memory Usage: Displays the amount of real memory that the process is using (in MB).
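
To make a few of these metrics concrete, the following Python sketch shows one way to derive them on a Linux host by using only the standard library. It is an illustration under stated assumptions, not how BMC Helix collects data: the function names are hypothetical, and the CPU samples are a simplified view of the cumulative counters in /proc/stat that ignores states such as iowait.

```python
import os


def cpu_utilization(prev, curr):
    """Percentage of CPU time spent in user plus system mode between samples.

    Each sample is a dict of cumulative tick counters with the keys
    'user', 'system', and 'idle' (a simplified view of /proc/stat).
    """
    busy = (curr["user"] - prev["user"]) + (curr["system"] - prev["system"])
    total = busy + (curr["idle"] - prev["idle"])
    return 100.0 * busy / total if total else 0.0


def filesystem_headroom(path="/"):
    """Return (available bytes, free inodes) for the file system holding path."""
    stats = os.statvfs(path)
    return stats.f_bavail * stats.f_frsize, stats.f_favail


# Hypothetical counter samples taken some interval apart.
prev = {"user": 1000, "system": 500, "idle": 8500}
curr = {"user": 1300, "system": 600, "idle": 9100}
print(cpu_utilization(prev, curr))  # 40.0
```

On a Unix host, `filesystem_headroom("/")` reports the available space and free I-nodes for the root volume, and `os.getloadavg()[0]` returns the 1-minute load average, which is best read together with CPU utilization as noted in the list above.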


To analyze the metrics for an impacted service

  1. On the Services page, click the service name for which you want to analyze the metrics.
  2. Expand Analyze Root Cause to analyze and understand the impacted metrics trend for the top causal node. 
  3. Review the metric charts. If there are more than three impacted metrics, charts are displayed only for the top three metrics per causal host.


Where to go from here

Based on the health of and impact on a service, you can perform any of the following tasks:
