Monitoring service health


As an operator or a site reliability engineer (SRE), use BMC Helix AIOps to monitor the services in your organization for their health, performance, and availability. This ensures that users are able to access the system and perform their tasks quickly and without interruption.

Each service can contain one or more child services, and a combination of nodes, applications, and devices.

Service health is determined by the events generated for a service. Typically, a service is considered healthy if nothing is impacting it, for example, if it has no open events. However, depending on how services are designed in your organization, a service with a low impact that doesn't affect its performance can also be considered healthy. For example, a single transaction failure might raise multiple false alarms that turn a service red, or a key high-availability application might have some nodes down but continue to perform at optimal health. Therefore, it is important to have an algorithm that continuously learns how to determine the most impacted entities for a service.

BMC Helix AIOps uses AI/ML algorithms to compute the health score and impact score and displays the impacted nodes and events for a service.

BMC Helix AIOps displays services from the following components:

  • Service models created and managed in BMC Helix AIOps
  • Groups published as services in BMC Helix Operations Management
  • All business services from BMC Helix Discovery
  • Service models or topologies ingested from third-party applications through BMC Helix Intelligent Integrations connectors


To get started with service monitoring

In the BMC Helix AIOps console, all services are displayed on the Services page. 

  1. Click Services to view the following information: 
    • All services color-coded by severity in a heatmap or tile view
    • Child services associated with the parent services
    • Search, basic filters, and advanced filters for services
    • Option to create new services (available for the Service Designer role only)

      About the heatmap box sizes

      The impact score and the number of services displayed on this page determine the size of a heatmap box: the higher the impact score, the larger the box. The box size is dynamic and relative to the other boxes.

  2. Hover over a service to view a quick summary of the impact. 
    The impact score, situations, events, and incidents associated with a service are displayed.
    Services_impact_241.png
  3. (Optional) If there is a child service, click the service box to view the next level of child services associated with the service.
    If a child service is impacted, the impact and the health score are propagated to all the services that the impacted service depends upon. If the parent service is not impacted and the health score is propagated from a dependent service, a label (Propagated) is displayed next to the health score.
    ServicesPage_HeatMap_Propagated_241.png
  4. (Optional) Choose how to view services on the heatmap view:
    • Basic search: Enter a service name in the search box and click Search Search button.png.
    • Basic filters: Select or clear the severity filter checkboxes to view services based on the selected severity. If the parent service matches the selected filter, the child services are also displayed. You can also click Select all to view all services. Filter selection is retained even if you access other pages and navigate back to the Services page. 
    • Advanced filter: Click to view services by Service Kind (Business Service, Technical Service, Business Application) or services with specific label-value pairs. By default, business services and technical services are displayed. Search and filter options are retained even if you switch between the heatmap and tile views.
    • Number of services per page: Click to select the number of services to be displayed on the page.
      The service count displayed on the heatmap view shows only the parent services filtered by Severity and Service Kind.
    • Refresh page: Click Refresh Refresh.png to refresh the page.
      By default, the Services page is automatically refreshed every five minutes.
      The automatic refresh interval applies to active browser sessions only. If you switch to another browser tab, the session is no longer active and the UI pages are not refreshed automatically. To change the refresh interval, see Configuring-general-settings.
  5. (Optional) Click Tile View tile_view_icon.png to view services in a tile view.
    Each tile represents a service and displays the service name, the service impact score, the counts of situations, events, and incidents, and the total count of impacted CIs associated with the service. Search results and filters are retained across both the tile view and the heatmap view. If the Show Policy-based Situations option is disabled, the policy-based situations are hidden and only the ML-based situation count is displayed in both the heatmap and tile views. It might take up to 15 minutes for the situation count to update after the Show Policy-based Situations option is enabled or disabled.

    Important

    The service count displayed on the tile view shows all the services filtered by Severity and Service Kind.

  6. (Optional) Save a set of preferences (severity filters, advanced filters, and the page view, either Heatmap view or Tile view) in a preset. These preferences are saved across sessions:
    1. Click Add Preset.
    2. Enter a name for the preset.
    3. (Optional) Set the preset as the default preset.
    4. Click Create.
      The presets you create on the Services page are applied to the Situations page as well. If you select a preset on the Services page, the situations are displayed according to the services filtered by the preset.
      Also, if you select a preset on the Situations page, the Services page displays services according to that preset.
  7. (Optional) Click Save Preferences to save your page preferences.
    Your selected severity filters, advanced filters, and the page view (Heatmap view or Tile view) are saved until you change your page preferences again. 
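
The propagation behavior described in the procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the way BMC Helix AIOps actually computes scores: the `Service` class, its fields, the numeric score values, and the `displayed_health` function are hypothetical names introduced for this example. The sketch assumes that a higher score means a healthier service, that a service displays the lowest score found among itself and its child services, and that the (Propagated) label applies only when the service itself is not impacted.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical service node; not a BMC Helix AIOps data structure."""
    name: str
    own_score: int                      # score from the service's own events (assumed scale)
    impacted: bool = False              # True if the service itself has impacting events
    children: list[Service] = field(default_factory=list)


def displayed_health(service: Service) -> tuple[int, bool]:
    """Return (score to display, whether it carries the Propagated label)."""
    lowest = service.own_score
    for child in service.children:
        child_score, _ = displayed_health(child)
        lowest = min(lowest, child_score)   # the lowest score wins
    # The label is shown only when the score came from below
    # and the service itself is not impacted.
    propagated = (not service.impacted) and lowest < service.own_score
    return lowest, propagated


db = Service("database", own_score=40, impacted=True)
api = Service("api", own_score=100, children=[db])
shop = Service("web-shop", own_score=100, children=[api])

print(displayed_health(shop))  # (40, True): score propagated from the database
print(displayed_health(db))    # (40, False): the database itself is impacted
```

In this sketch, the impacted database lowers the displayed score of every service above it in the hierarchy, and only the services that are not impacted themselves carry the label.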
Why don't I see any services on the Services page?

Services start appearing on the Services page as soon as service models are created, or services are discovered by BMC Helix Discovery. For more information, see Creating-service-models.

To monitor service health

  1. Click Services and click a service name to view the following details:
    • Service name and severity level
    • Health score
      If a dependent service is impacted, the health score is propagated to all services that the impacted service depends upon.

      Click here to learn more about the propagated health score
      • If more than one child service or a parent service is impacted, the lowest health score is displayed.
      • The service health score is propagated to all the services that the impacted service depends upon.
      • If the parent service is not impacted, the propagated health score is denoted by a label (Propagated) both on the tile view and heatmap view.
      • If the parent service is impacted, the Analyze Root Cause section shows the list of all impacted child services and CIs.
    • Incidents: Click to view incident details. 
      The incident message cross-launch link opens the incident details in BMC Helix ITSM (if you have permission to access the application). 
      Incident tab_23.4.01.png
    • Total Events: Number of events generated for the service. This number includes events of all severity types (Critical, Major, Minor, Warning, Information, Ok, and Unknown).
    • Impacting Events: Number of events used to compute the health score for a service.

      What is the difference between Total Events and Impacting Events?

      The health score of a service is calculated by using the impacting events count. You can define event rules to consider only specific events based on the impacted CIs, event severities, or messages. 

      For example, define an event rule to consider only the events with Critical severity for computing the health score. To do this, edit the service and select the Events option. In the Define Service Events processing pane, select the Add Event Rule and the Add filter option. In the filtering options, select Severity as the attribute and Critical as the value and add the event rule. After successfully saving the service, you can see that only the critical events are listed in the impacting events category. 

      Additionally, you can edit the same event rule to include events with Critical and Major severity for computing the health score. You can also add more event rules to include specific objects (like the database or printers) so that only those events will be considered in the health score calculation.

      Examples of event rules:

      • Message Matches alarm.*
      • Message Matches .*alarm
      • Message Matches alarm.*memory
      • Severity Critical, Major
      • Object Equals NUK_Memory
      • Object Class Matches NUK.*
    • Impacting Child Services: Number of direct and propagated child services impacting the parent service.
      Click the Information icon Info_icon.png to see the child services that are impacting the parent service in the following order:
      • Total Impacted Child Services
      • Propagated Child Services
      • Direct Impacting Child Services
      • Top 10 Direct Impacting Child Services
        Up to ten of the top directly impacting child services are listed.
        24_1_Impacting_Child_Services.png

        Click the Impacting Child Services link to view the details

        The detailed view displays the count of the total impacted child services, propagated child services, and directly impacted child services.

        Click the link to open the child service in a new tab. 

        24_1_Impacting_Child_Services_Details.png

    • Refresh page: Click Refresh Refresh.png to refresh the page. 

      By default, the service details page is not automatically refreshed, and the Auto Refresh Interval option is set to Off. To change the refresh interval, see Configuring-general-settings.


    • Health timeline
    • CI topology
    • Service hierarchy
    • Health indicators (in the View Health Indicators section)
    • Situations (in the Analyze Situations section)
    • Root cause (in the Analyze Root Cause section)
    • Service insights (in the Analyze Service Insights section)
    Incident Count_@3301.png
  2. Click the Impacting Events link to view the impacting events, situations, incidents, and changes for a service, and perform the following optional steps:
    1. Click any event, situation, incident, or change to view details, related events (for situations only), logs, and notes, and perform additional operations. 
      You can also use the More Details cross-launch link to view the selected event, situation, incident, or change in BMC Helix Operations Management.

      Important

      By default, BMC Helix AIOps shows up to 10,000 events for a service in the Impacting Events > Events list. If an impacted service has more than 10,000 events, the total count in Impacting Events shows the actual number of events; however, you can view only 10,000 of them.

      Service_Impact_23102.png

    2. (If integrated with ServiceNow Change Management) Change requests from ServiceNow are displayed.
      Impacting Events_Changes_SNOW_241.png

  3. Select a time range to view events, incidents, or changes that occurred in the selected period.
    By default, data is displayed for the last three hours. You can select a time range of 6 hours, 12 hours, 24 hours, or 7 days.
  4. Hover over a time slot to view the exact health score.
    HealthScore_Example_23102.png
  5. Hover over the event, incident, or change on the health timeline to view details.
    To learn more about the health timeline, see Service-health-score-and-health-timeline.
  6. (Optional) Hover over the move icon Icon_MoveAccordions.png for a section to rearrange the section on the service details page. After the icon changes to a hand pointer, drag and drop the section as needed.
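
The distinction between Total Events and Impacting Events described above can be illustrated with a small Python sketch that applies event rules to a list of events. It is a simplified model, not the product's implementation: the event dictionaries, the attribute keys (`message`, `severity`, `object`), and the helper functions `matches`, `severity_in`, and `impacting_events` are hypothetical. Two behaviors are assumptions and might differ from the product: a Matches rule is treated as a regular expression that must match the whole value, and an event counts as impacting if it satisfies at least one rule.

```python
import re


def matches(attribute, pattern):
    """Rule: the attribute value must match the whole regular expression (assumed)."""
    compiled = re.compile(pattern)
    return lambda event: compiled.fullmatch(event.get(attribute, "")) is not None


def severity_in(*levels):
    """Rule: the event severity must be one of the listed levels."""
    return lambda event: event.get("severity") in levels


def impacting_events(events, rules):
    """Keep the events that satisfy at least one rule (assumed OR semantics)."""
    return [event for event in events if any(rule(event) for rule in rules)]


# Hypothetical events; the keys are illustrative only.
events = [
    {"message": "alarm: memory low", "severity": "Minor", "object": "NUK_Memory"},
    {"message": "disk check ok", "severity": "Ok", "object": "disk0"},
    {"message": "link down", "severity": "Critical", "object": "eth0"},
]

# Two of the example rules: "Message Matches alarm.*" and "Severity Critical, Major".
rules = [matches("message", r"alarm.*"), severity_in("Critical", "Major")]

impacting = impacting_events(events, rules)
print(len(events), len(impacting))  # 3 2
```

Here all three events count toward Total Events, but only the two events selected by the rules would feed the health score calculation.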


Where to go from here

Based on the health of and impact on a service, you can perform any of the following tasks:

 
