Monitoring service health


As an operator or a site reliability engineer (SRE), use BMC Helix AIOps to monitor the services in your organization for their health, performance, and availability. This ensures that users are able to access the system and perform their tasks quickly and without interruption.

Each service can contain one or more child services, and a combination of nodes, applications, and devices. 

Service health is determined by the events generated for a service. Typically, a service is considered healthy if nothing impacts it, for example, if it has no open events. However, depending on how services are designed in your organization, a lower impact that doesn't affect service performance can also be considered healthy. For example, a single transaction failure might raise multiple false alarms that turn a service red, or a key high-availability application might have some nodes down but continue to perform at optimal health. Therefore, it is important to have an algorithm that continuously learns how to determine the most impacted entities for a service.

BMC Helix AIOps uses AI/ML algorithms to compute the health score and impact score and displays the impacted nodes and events for a service.

BMC Helix AIOps displays services from the following components:

  • Service models created and managed in BMC Helix AIOps
  • Groups published as services in BMC Helix Operations Management
  • All business services from BMC Helix Discovery
  • Service models or topologies ingested from third-party applications through BMC Helix Intelligent Integrations connectors


To get started with service monitoring

In the BMC Helix AIOps console, all services are displayed on the Services page. 

  1. Click Services to view the following information: 
    • All services color-coded by severity in a heatmap or tile view
    • Child services associated with the parent services
    • Search, basic filters, and advanced filters for services
    • Option to create new services (available for the Service Designer role only)

      About the heatmap box sizes

      The impact score and the number of services displayed on this page determine the size of a heatmap box: the higher the impact score, the larger the box. The box size is dynamic and relative to other boxes.

  2. Hover over a service to view a quick summary of the impact. 
    The impact score, situations, events, incidents, and impacted entities (includes child services and configuration items) associated with a service are displayed.  
    Services_impact_23102.png
  3. (Optional) If a service has child services, click the service box to view the next level of child services associated with the service.
  4. If a child service is impacted, the impact and the health score are propagated to all the services that the impacted service depends upon. If the parent service is not impacted and the health score is propagated from a dependent service, a label (Propagated) is displayed next to the health score. If you hover over the service, Impacted Entities shows the count of the impacted child services and CIs.

    ServicesPage_HeatMap_Propagated_23102.png

  5. (Optional) Choose how to view services on the heatmap view:
    • Basic search: Enter a service name in the search box and click Search Search button.png.
    • Basic filters: Select or clear the severity filter check boxes to view services based on the selected severity. If the parent service matches the selected filter, the child services are also displayed. You can also click Select all to view all services.
    • Advanced filter: Click to view only services that are labeled with specific label-value pairs.
      Search and filter options are retained even if you switch between the heatmap and tile views.
    • Number of services per page: Click to select the number of services to be displayed on the page.
    • Refresh page: Click Refresh Refresh.png to refresh the page.
  6. (Optional) Click Tile View tile_view_icon.png to view services in a tile view.
    Each tile represents a service and displays the service name, the service impact score, and the counts of situations, events, incidents, and impacted CIs associated with the service. Search results and filters are retained across both the tile view and the heatmap view.
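
The relative box sizing described under "About the heatmap box sizes" can be pictured with a small sketch. This is an illustration only, not the product's sizing logic: the function name relative_box_areas, the floor_weight parameter, and the assumption that impact scores are non-negative numbers are all hypothetical. It shows two properties stated above: a higher impact score yields a larger box, and each box shrinks as more services share the page.

```python
def relative_box_areas(impact_scores, floor_weight=1.0):
    """Return each service's share of the heatmap area (shares sum to 1.0).

    impact_scores: dict of service name -> impact score (assumed non-negative).
    floor_weight: hypothetical constant so zero-impact services stay visible.
    """
    if not impact_scores:
        return {}
    # Weight each service by its impact score plus a small floor.
    weights = {
        name: max(score, 0) + floor_weight
        for name, score in impact_scores.items()
    }
    total = sum(weights.values())
    # Normalize so that every box is sized relative to the other boxes.
    return {name: weight / total for name, weight in weights.items()}


if __name__ == "__main__":
    shares = relative_box_areas({"payments": 80, "search": 20, "login": 0})
    for name, share in shares.items():
        print(f"{name}: {share:.1%} of the heatmap area")
```

Because the shares are normalized, adding a service to the page reduces every other box's share, which matches the "dynamic and relative" behavior described above.
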

Why don't I see any services on the Services page?

Services start appearing on the Services page as soon as service models are created, or services are discovered by BMC Helix Discovery. For more information, see Creating service models.

To monitor service health

  1. Click Services and click a service name to view the following details:
    • Service name and severity level
    • Health score
      If a dependent service is impacted, the health score is propagated to all services that the impacted service depends upon.

      About the propagated health score
      • If more than one child service or a parent service is impacted, the lowest health score is displayed.
      • The service health score is propagated to all the services that the impacted service depends upon.
      • If the parent service is not impacted, the propagated health score is denoted by a label (Propagated) both on the tile view and heatmap view.
      • If the parent service is impacted, the Analyze Root Cause section shows the list of all impacted child services and CIs.
    • Total events: Total events generated for the service. 
    • Impacting events: Events used to compute health score for a service.
    • Impacting sub-services: If there are no impacting events for a service, or if the impact is higher on the child services, the count of impacted child services is displayed.  
    • Health timeline
    • CI topology
    • Service hierarchy
    • Health indicators (in the View Health Indicators section)
    • Situations (in the Analyze Situations section)
    • Root cause (in the Analyze Root Cause section)
    • Service insights (in the Analyze Service Insights section)
      ServiceDetails_23201.png
  2. Click the Impacting Events link to view the impacting events, situations, incidents, and changes for a service, and perform the following optional steps:
    1. Click any event, situation, incident, or change to view details, related events (for situations only), logs and notes, and perform additional operations. 
      You can also use the More Details cross-launch link to view the selected event, situation, incident, or change in BMC Helix Operations Management.

      Important

      By default, BMC Helix AIOps shows up to 10,000 events for a service in the Impacting Events > Events list. If an impacted service has more than 10,000 events, the total count in Impacted Events displays the actual number of events; however, you can view only 10,000 events.

      Service_Impact_23102.png

  3. Select a time range to view events, incidents, or changes that occurred in the selected time period.
    By default, data is displayed for the last three hours.
  4. Hover over a time slot to view the exact health score at that point in time.
    HealthScore_Example_23102.png
  5. Hover over the event, incident, or change on the health timeline to view details.
    To learn more about the health timeline, see Service health score and health timeline.
  6. (Optional) Hover over the move icon Icon_MoveAccordions.png for a section to rearrange the section on the service details page. After the icon changes to a hand pointer, drag and drop the section as needed.
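
The propagation rules listed under Health score in step 1 can be sketched as follows. This is a minimal model under stated assumptions, not how BMC Helix AIOps computes scores: the Service class, its field names, and the 0-100 scale where a higher score means a healthier service are hypothetical. The sketch shows only the two documented behaviors: the lowest health score is displayed, and the (Propagated) label appears when the displayed score comes from a child while the parent itself is not impacted.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Service:
    name: str
    own_health: float        # hypothetical 0-100 scale, higher is healthier
    impacted: bool = False   # True if the service has its own impacting events
    children: List["Service"] = field(default_factory=list)


def displayed_health(service: Service) -> Tuple[float, bool]:
    """Return (score, propagated) as it might be shown for a service."""
    # Lowest score shown by any child, after propagation inside that child.
    lowest_child = min(
        (displayed_health(child)[0] for child in service.children),
        default=None,
    )
    if lowest_child is None or service.own_health <= lowest_child:
        # The service's own score is already the lowest one.
        return service.own_health, False
    # A child has the lowest score; label it as propagated only when
    # the parent itself is not impacted.
    return lowest_child, not service.impacted


if __name__ == "__main__":
    database = Service("database", own_health=40, impacted=True)
    api = Service("api", own_health=100, children=[database])
    storefront = Service("storefront", own_health=100, children=[api])
    for svc in (database, api, storefront):
        score, propagated = displayed_health(svc)
        label = " (Propagated)" if propagated else ""
        print(f"{svc.name}: {score}{label}")
```

In this example, only the hypothetical database service has its own impacting events, so the two services above it in the hierarchy show its score together with the label.
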


Where to go from here

Based on the health of and impact on a service, you can perform any of the following tasks:

 
