Monitoring service health

As an operator or a site reliability engineer (SRE), use BMC Helix AIOps to monitor the services in your organization for their health, performance, and availability. This ensures that users are able to access the system and perform their tasks quickly and without interruption.

Each service can contain one or more child services and a combination of nodes, applications, and devices.

Service health is determined by the events generated for a service. Typically, a service is considered healthy if nothing, such as open events, impacts that service. However, depending on how services are designed in your organization, a low impact that doesn't affect service performance can also be considered healthy. For example, a single transaction failure might raise multiple false alarms that turn a service red, or a key high-availability application might have some nodes down but continue to perform at optimal health. Therefore, it is important to have an algorithm that continuously learns how to determine the most impacted entities for a service.

BMC Helix AIOps uses AI/ML algorithms to compute the health score and impact score and displays the impacted nodes and events for a service.

BMC Helix AIOps displays services from the following components:

  • Service models created and managed in BMC Helix AIOps
  • Groups published as services in BMC Helix Operations Management
  • All business services from BMC Helix Discovery
  • Service models or topologies ingested from third-party applications through BMC Helix Intelligent Integrations connectors


To get started with service monitoring

In the BMC Helix AIOps console, all services are displayed on the Services page. 

  1. Click Services to view the following information: 
    • All services color-coded by severity in a heatmap or tile view
    • Child services associated with the parent services
    • Search, basic filters, and advanced filters for services
    • Option to create new services (available for the Service Designer role only)

      About the heatmap box sizes

      The impact score and the number of services displayed on this page determine the size of a heatmap box: the higher the impact score, the larger the box. The box size is dynamic and relative to the other boxes.

  2. Hover over a service to view a quick summary of the impact.
    The impact score, situations, events, incidents, and impacted entities (child services and configuration items, or CIs) associated with a service are displayed.

  3. (Optional) If a service has child services, click the service box to view the next level of child services associated with the service.
    If a child service is impacted, the impact and the health score are propagated to all the services that the impacted service depends upon. If the parent service is not impacted and the health score is propagated from a dependent service, a label (Propagated) is displayed next to the health score. If you hover over the service, Impacted Entities shows the count of impacted child services and CIs.
  4. (Optional) Choose how to view services on the heatmap view:

    • Basic search: Enter a service name in the search box and click search.
    • Basic filters: Select or clear the severity filter check boxes to view services based on the selected severity. If the parent service matches the selected filter, the child services are also displayed. You can also click Select all to view all services. Filter selection is retained even if you access other pages and navigate back to the Services page. 
    • Advanced filter: Click to view services by Service Kind (Business Service, Technical Service, Business Application) or services with specific label-value pairs.
      Search and filter options are retained even if you switch between the heatmap and tile views.

      Important

      The heatmap view doesn't list a child service if its parent service kind differs from the selected service kind filter, even though the total count of services includes the child service. For example, the view doesn't list a child technical service if its parent is of the business service kind and you have selected Technical Service in the filter.


    • Number of services per page: Click to select the number of services to be displayed on the page. 
    • Refresh page: Click Refresh to refresh the page. 
  5. (Optional) Click Tile View to view services in a tile view.
    Each tile represents a service and displays the service name, the service impact score, and the counts of situations, events, incidents, and impacted CIs associated with the service. Search results and filters are retained across both the tile view and the heatmap view.
  6. (Optional) Click Save Preferences to save your page preferences.
    This option is enabled only when you select a severity filter, select an advanced filter, or change the view. Page preferences remain saved for your user account until you save new preferences.
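
The heatmap box sizing described in step 1 can be pictured with a small sketch. The following Python snippet is only an illustration, not the layout logic that BMC Helix AIOps uses. It assumes that each box gets a share of the heatmap area proportional to its impact score. The function name heatmap_area_shares, the min_weight floor (which keeps services with a zero impact score visible), and the service names are all hypothetical.

```python
def heatmap_area_shares(impact_scores: dict[str, float],
                        min_weight: float = 1.0) -> dict[str, float]:
    """Map each service name to its fraction of the total heatmap area.

    Assumption for illustration only: area is proportional to impact score.
    """
    # Floor every score so that services with no impact still get a small box.
    weights = {name: max(score, min_weight) for name, score in impact_scores.items()}
    total = sum(weights.values())
    # Normalize so that all shares add up to 1 (sizes are relative to each other).
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical services and impact scores.
shares = heatmap_area_shares({"payments": 60, "search": 30, "wiki": 0})
for name, share in shares.items():
    print(f"{name}: {share:.0%} of the heatmap area")
```

Because each share is computed against the total, changing the number of services on the page changes every box size, which matches the description of the box size as dynamic and relative to other boxes.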

Why don't I see any services on the Services page?

Services start appearing on the Services page as soon as service models are created, or services are discovered by BMC Helix Discovery. For more information, see Creating service models.

To monitor service health

  1. Click Services and click a service name to view the following details:
    • Service name and severity level
    • Health score
      If a dependent service is impacted, the health score is propagated to all services that the impacted service depends upon.

      • If more than one child service or a parent service is impacted, the lowest health score is displayed.
      • If the parent service is not impacted, the propagated health score is denoted by a label (Propagated) on both the tile view and the heatmap view.
      • If the parent service is impacted, the Analyze Root Cause section lists all impacted child services and CIs.
    • Incidents: Click to view incident details.
      The incident message cross-launch link opens the incident details in BMC Helix ITSM (if you have permissions to access the application). 
    • Total Events: Number of events generated for the service. 
    • Impacting Events: Number of events used to compute the health score for a service.
      (Currently, Total Events and Impacting Events include self-monitoring events. To exclude these events, contact BMC Customer Support.)

    • Impacting Child Services: Number of child services impacting the parent service.  
    • Health timeline

    • CI topology: Topology displaying service CIs and relationships between them

      Tip

      If you have a large number of CIs in the service topology, use the search box to locate a particular CI. When the CI is located, it is highlighted. If a filter is already applied and the CI is not present in the filtered view, the search covers the entire CI topology. When the CI is located, it is highlighted and the filter is cleared.

    • Service hierarchy
    • Health indicators (in the View Health Indicators section)

    • Situations (in the Analyze Situations section)

    • Root cause (in the Analyze Root Cause section)

    • Service insights (in the Analyze Service Insights section)

  2. Click the Impacting Events link to view the impacting events, situations, incidents, and changes for a service, and perform the following optional steps:
    1. Click any event, situation, incident, or change to view details, related events (for situations only), logs and notes, and perform additional operations. 
      You can also use the More Details cross-launch link to view the selected event, situation, incident, or change in BMC Helix Operations Management.

      Important

      By default, BMC Helix AIOps shows up to 10,000 events for a service in the Impacting Events > Events list. If an impacted service has more than 10,000 events, the total count in Impacting Events displays the actual number of events; however, you can view only 10,000 events.

  3. Select a time range to view the events, incidents, or changes that occurred in the selected time period.
    By default, data is displayed for the last three hours.
  4. Hover over a time slot to view the exact health score at that point in time.
  5. Hover over the event, incident, or change on the health timeline to view details.
    To learn more about the health timeline, see Service health score and health timeline.
  6. (Optional) Hover over the move icon for a section to rearrange the section on the service details page. After the icon changes to a hand pointer, drag and drop the section as needed.
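
The propagation rules listed under Health score in step 1 can be sketched as follows. This is a minimal illustration of those rules only, not how BMC Helix AIOps computes health scores. The Service class, the own_score and impacted fields, and the displayed_health function are hypothetical names, and the sketch assumes that a lower score means a less healthy service.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    own_score: float           # hypothetical: score from the service's own events
    impacted: bool = False     # hypothetical: True if the service itself is impacted
    children: list["Service"] = field(default_factory=list)


def displayed_health(service: Service) -> tuple[float, bool]:
    """Return (score to display, whether the (Propagated) label applies)."""
    # Collect scores of children that are impacted or carry a propagated score.
    child_scores = []
    for child in service.children:
        score, propagated = displayed_health(child)
        if child.impacted or propagated:
            child_scores.append(score)

    if service.impacted:
        # The service itself is impacted: show the lowest score, without the label.
        return min([service.own_score, *child_scores]), False
    if child_scores:
        # Only services below it are impacted: show their lowest score with the label.
        return min(child_scores), True
    # Nothing is impacted: show the service's own score.
    return service.own_score, False


# Hypothetical hierarchy: two impacted child services under a healthy parent.
db = Service("db", own_score=40, impacted=True)
cache = Service("cache", own_score=70, impacted=True)
api = Service("api", own_score=100, children=[db, cache])
print(displayed_health(api))  # (40, True)
```

In this sketch, the parent shows the lowest child score with the (Propagated) label because it has no impact of its own. If the parent were impacted as well, the label would not apply.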


Where to go from here

Based on the health of and impact on a service, you can perform any of the following tasks:
