Monitoring service health


As an operator or a site reliability engineer (SRE), use BMC Helix AIOps to monitor the services in your organization for their health, performance, and availability. This ensures that users are able to access the system and perform their tasks quickly and without interruption.

Each service can contain one or more child services and a combination of nodes, applications, and devices.

Service health is determined by the events generated for a service. Typically, a service is considered healthy if there is no impact, such as open events, for that service. However, depending on how services are designed in your organization, a lower impact that doesn't affect service performance can also be considered healthy. For example, a single transaction failure might raise multiple false alarms that turn a service red, or a key high-availability application might have some nodes down but continue to perform at optimal health. Therefore, it is important to have an algorithm that continuously learns how to determine the most impacted entities for a service.

BMC Helix AIOps uses AI/ML algorithms to compute the health score and impact score and displays the impacted nodes and events for a service.

BMC Helix AIOps displays services from the following components:

  • Service models created and managed in BMC Helix AIOps
  • Groups published as services in BMC Helix Operations Management
  • All business services from BMC Helix Discovery
  • Service models or topologies ingested from third-party applications through BMC Helix Intelligent Integrations connectors


To get started with service monitoring

In the BMC Helix AIOps console, all services are displayed on the Services page. 

  1. Click Services to view the following information: 
    • All services color-coded by severity in a heatmap or tile view
    • Child services associated with the parent services
    • Search, basic filters, and advanced filters for services
    • Option to create new services (available for the Service Designer role only)

      About the heatmap box sizes

      The impact score and the number of services to be displayed on this page determine the size of a heatmap box: the higher the impact score, the larger the box. The box size is dynamic and relative to other boxes.
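
One way to picture relative box sizing is to give each service a share of the heatmap area in proportion to its impact score. The sketch below is only an illustration under that assumption. The actual sizing formula is not documented here, and the function name `box_area_shares` is hypothetical.

```python
from typing import Dict, Mapping


def box_area_shares(impact_scores: Mapping[str, float]) -> Dict[str, float]:
    """Return each service's fraction of the heatmap area (assumed model).

    Shares are relative: they always sum to 1, so adding or removing a
    service changes the size of every other box.
    """
    if not impact_scores:
        return {}
    total = sum(impact_scores.values())
    if total == 0:
        # No service is impacted: fall back to equally sized boxes.
        equal = 1.0 / len(impact_scores)
        return {name: equal for name in impact_scores}
    # Assumption: area is directly proportional to the impact score.
    return {name: score / total for name, score in impact_scores.items()}
```

In this model, a service with an impact score of 60 shown next to services scoring 30 and 10 would occupy 60% of the area. A real layout would probably also enforce a minimum box size so that services with a zero impact score remain visible.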

  2. Hover over a service to view a quick summary of the impact. 
    The impact score, situations, events, and incidents associated with a service are displayed.
  1. (Optional) If there is a child service, click the service box to view the next level of child services associated with the service.
    If a child service is impacted, the impact and the health score are propagated to all the services that the impacted service depends upon. If the parent service is not impacted and the health score is propagated from a dependent service, a label (Propagated) is displayed next to the health score.
  1. (Optional) Choose how to view services on the heatmap view:
    • Basic search: Enter a service name in the search box and click Search.
    • Basic filters: Select or clear the severity filter checkboxes to view services based on the selected severity. If the parent service matches the selected filter, the child services are also displayed. You can also click Select all to view all services. Filter selection is retained even if you access other pages and navigate back to the Services page. 
    • Advanced filter: Click to view services by Service Kind (Business Service, Technical Service, Business Application) or services with specific label-value pairs. By default, business services and technical services are displayed. Search and filter options are retained even if you switch between the heatmap and tile views.
    • Number of services per page: Click to select the number of services to be displayed on the page.
      The service count displayed on the heatmap view shows only the parent services filtered by Severity and Service Kind.
    • Refresh page: Click Refresh to refresh the page.
      By default, the Services page is automatically refreshed every five minutes.
      The automatic refresh interval applies to active browser sessions only. If you navigate to another browser tab, the session is no longer active and the UI pages are not automatically refreshed. To change the refresh interval duration, see Configuring-your-personal-settings.
  1. (Optional) Click Tile View to view services in a tile view.
    Each tile represents a service and displays the service name, the service impact score, the count of situations, events, and incidents, and the total impacted CI count associated with the service. Search results or filters are retained across both the tile view and the heatmap view. If the Show Policy-based Situations option is disabled, the policy-based situations are hidden and only the ML-based situation count is displayed in both the heatmap and tile views. It might take up to 15 minutes for the situation count to update after the Show Policy-based Situations option is enabled or disabled.

    Important

    The service count displayed on the tile view shows all the services filtered by Severity and Service Kind.

  2. (Optional) Save a set of preferences in a preset: severity filters, advanced filters, and the page view (Heatmap view or Tile view). These preferences are saved across sessions:
    1. Click Add Preset.
    2. Enter a name for the preset.
    3. (Optional) Set the preset as the default preset.
    4. Click Create.
      The presets you create on the Services page are applied to the Situations page as well. If you select a preset on the Services page, the situations are displayed according to the services filtered by the preset.
      Also, if you select a preset on the Situations page, the Services page displays services according to that preset.
  1. (Optional) Click Save Preferences to save your page preferences.
    Your selected severity filters, advanced filters, and the page view (Heatmap view or Tile view) are saved until you change your page preferences again. 
Why don't I see any services on the Services page?

Services start appearing on the Services page as soon as service models are created, or services are discovered by BMC Helix Discovery. For more information, see Creating-service-models.

To monitor service health

  1. Click Services and click a service name to view the following details:
    • Service name and severity level
    • Health score
      If a dependent service is impacted, the health score is propagated to all services that the impacted service depends upon.

      About the propagated health score
      • If more than one child service or a parent service is impacted, the lowest health score is displayed.
      • The service health score is propagated to all the services that the impacted service depends upon.
      • If the parent service is not impacted, the propagated health score is denoted by a label (Propagated) both on the tile view and heatmap view.
      • If the parent service is impacted, the Analyze Root Cause section shows the list of all impacted child services and CIs.
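
The display rules above can be sketched as follows. This is a minimal illustration, not the product's algorithm: the names `displayed_health` and `DisplayedHealth` are hypothetical, and the sketch assumes that a lower health score means worse health and that the scores of the impacted child services are already known.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DisplayedHealth:
    score: float       # health score shown for the parent service
    propagated: bool   # True when the (Propagated) label would be shown


def displayed_health(
    parent_score: float,
    parent_impacted: bool,
    impacted_child_scores: Sequence[float],
) -> DisplayedHealth:
    """Pick the score to display for a parent service (assumed rules).

    - With no impacted children, the parent's own score is shown.
    - Otherwise the lowest score among the impacted services is shown.
    - The (Propagated) label appears only when the parent itself is
      not impacted and the score comes from a child service.
    """
    if not impacted_child_scores:
        return DisplayedHealth(parent_score, propagated=False)

    candidates = list(impacted_child_scores)
    if parent_impacted:
        # An impacted parent competes with its children for the lowest score.
        candidates.append(parent_score)

    return DisplayedHealth(min(candidates), propagated=not parent_impacted)
```

For example, under these assumptions `displayed_health(100.0, False, [72.0, 45.0])` returns a score of 45.0 with the Propagated label, whereas `displayed_health(60.0, True, [72.0])` returns 60.0 without it.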
    • Incidents: Click to view incident details. 
      The incident message cross-launch link opens the incident details in BMC Helix ITSM (if you have permission to access the application).
    • Total Events: Number of events generated for the service. This number includes events with all severity types (Critical, Major, Minor, Warning, Information, Ok, and Unknown).
    • Impacting Events: Number of events used to compute health score for a service.
      A maximum of 10,000 impacting events are listed on the node details page. However, the number of impacting events on the node can be more than 10,000.

      What is the difference between Total Events and Impacting Events?

      Total events include all the events generated for a service. Impacting events are a subset of total events and are used to calculate the health of a service. For example, the Retail-outlet service has the following events:

      • Critical=3
      • Major=2
      • Minor=2
      • Warning=3
      • Information=2
      • Ok=2

      The total number of events is 14. If no event rule is defined, the impacting events count will be the same as total events; that is, 14. If an event rule is defined, for example, to consider Critical events to calculate the health score, the impacting events count will be updated to 3. If the same event rule is modified to consider both Critical and Major events to calculate the health score, the impacting events count will be updated to 5.
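
The arithmetic in this example can be reproduced with a short sketch. The function names and the representation of an event rule as a set of severities are illustrative assumptions, not part of BMC Helix AIOps. Actual event rules may match on other event attributes.

```python
from typing import Mapping, Optional, Set

# Event counts per severity for the Retail-outlet example above.
RETAIL_OUTLET_EVENTS = {
    "Critical": 3,
    "Major": 2,
    "Minor": 2,
    "Warning": 3,
    "Information": 2,
    "Ok": 2,
}


def total_event_count(events_by_severity: Mapping[str, int]) -> int:
    """Total events: every event generated for the service, any severity."""
    return sum(events_by_severity.values())


def impacting_event_count(
    events_by_severity: Mapping[str, int],
    rule_severities: Optional[Set[str]] = None,
) -> int:
    """Impacting events: the subset used to compute the health score.

    rule_severities is a hypothetical stand-in for an event rule. None
    means no rule is defined, so every event counts as impacting.
    """
    if rule_severities is None:
        return total_event_count(events_by_severity)
    return sum(
        count
        for severity, count in events_by_severity.items()
        if severity in rule_severities
    )


print(total_event_count(RETAIL_OUTLET_EVENTS))                             # 14
print(impacting_event_count(RETAIL_OUTLET_EVENTS))                         # 14
print(impacting_event_count(RETAIL_OUTLET_EVENTS, {"Critical"}))           # 3
print(impacting_event_count(RETAIL_OUTLET_EVENTS, {"Critical", "Major"}))  # 5
```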

    • Impacting Child Services: Number of direct and propagated child services impacting the parent service.
      Click the Information icon to see the child services that are impacting the parent service in the following order:
      • Total Impacted Child Services
      • Propagated Child Services
      • Direct Impacting Child Services
      • Top 10 Direct Impacting Child Services
        At most, the top ten directly impacting child services are listed.

        Click the Impacting Child Services link to view the details

        The detailed view displays the count of the total impacted child services, propagated child services, and directly impacted child services.

        Click the link to open the child service in a new tab. 

    • Refresh page: Click Refresh to refresh the page.
      By default, the service details page is not automatically refreshed, and the Auto Refresh Interval option is set to Off. To change the refresh interval duration, see Configuring-your-personal-settings.


    • Health timeline
    • CI topology
    • Service hierarchy
    • Health indicators (in the View Health Indicators section)
    • Situations (in the Analyze Situations section)
    • Root cause (in the Analyze Root Cause section)
    • Service insights (in the Analyze Service Insights section)
  1. Click the Impacting Events link to view the impacting events, situations, incidents, and changes for a service, and perform the following optional steps:
      1. Click any event, situation, incident, or change to view details, related events (for situations only), logs, and notes, and perform additional operations. 
        You can also use the More Details cross-launch link to view the selected event, situation, incident, or change in BMC Helix Operations Management.

        Important

        By default, BMC Helix AIOps shows up to 10,000 events for a service in the Impacting Events > Events list. If an impacted service has more than 10,000 events, the total count in Impacting Events displays the actual number of events; however, you can view only 10,000 events.


      2. Click Incidents.
        The incident information events are displayed with details such as the incident ID (cross-launch link to the incident in BMC Helix ITSM), status, priority, and the date and time the incident occurred.
        Click Details to open the Incident Information Event Details pane. You can use the More Details cross-launch link to view the information event in BMC Helix Operations Management. To view the original event, click View Details against the Original Event field.

         
      3. (If integrated with ServiceNow Change Management): Change requests from ServiceNow are displayed.

  1. Select a time range to view events, incidents, or changes that occurred in the selected period.
    By default, data is displayed for the last three hours. You can select a time range of 6 hours, 12 hours, 24 hours, or 7 days.
  2. Hover over a time slot to view the exact health score.
  3. Hover over the event, incident, or change on the health timeline to view details.
    To learn more about the health timeline, see Service-health-score-and-health-timeline.
  4. (Optional) Hover over the move icon for a section to rearrange the section on the service details page. After the icon changes to a hand pointer, drag and drop the section as needed.


Where to go from here

Based on the health of and impact on a service, you can perform any of the following tasks:

 
