Monitoring service health

As an operator or a site reliability engineer (SRE), use BMC Helix AIOps to monitor the services in your organization for their health, performance, and availability. This ensures that users are able to access the system and perform their tasks quickly and without interruption.

Each service can contain one or more child services, and a combination of nodes, applications, and devices.

Service health is determined by the events generated for a service. Typically, a service is considered healthy if there is no impact, such as open events, for that service. However, depending on how services are designed in your organization, a lower impact that doesn't affect service performance can also be considered healthy. For example, a single transaction failure might raise multiple false alarms that turn a service red, or a key high-availability application might have some nodes down but continue to perform at optimal health. Therefore, it is important to have an algorithm that continuously learns how to determine the most impacted entities for a service.

BMC Helix AIOps uses AI/ML algorithms to compute the health score and impact score and displays the impacted nodes and events for a service.

Related topics

Service health score and health timeline

Service dashboard

BMC Helix AIOps displays services from the following components:

  • Service models created and managed in BMC Helix AIOps
  • Groups published as services in BMC Helix Operations Management
  • All business services from BMC Helix Discovery
  • Service models or topologies ingested from third-party applications through BMC Helix Intelligent Integrations connectors


To get started with service monitoring

In the BMC Helix AIOps console, all services are displayed on the Services page. 

  1. Click Services to view the following information: 
    • All services color-coded by severity in a heatmap or tile view
    • Child services associated with the parent services
    • Search, basic filters, and advanced filters for services
    • Option to create new services (available for the Service Designer role only)

      About the heatmap box sizes

      The impact score and the number of services to be displayed on this page determine the size of a heatmap box. The higher the impact score, the larger the box. The box size is dynamic and relative to other boxes.

  2. Hover over a service to view a quick summary of the impact.
    The impact score, situations, events, and incidents associated with a service are displayed.

  3. (Optional) If there is a child service, click the service box to view the next level of child services associated with the service.
    If a child service is impacted, the impact and the health score are propagated to all the services that the impacted service depends upon. If the parent service is not impacted and the health score is propagated from a dependent service, a label (Propagated) is displayed next to the health score.

  4. ( Optional ) Choose how to view services on the heatmap view:

    • Basic search: Enter a service name in the search box and click search.
    • Basic filters: Select or clear the severity filter checkboxes to view services based on the selected severity. If the parent service matches the selected filter, the child services are also displayed. You can also click Select all to view all services. Filter selection is retained even if you access other pages and navigate back to the Services page. 
    • Advanced filter: Click to view services by Service Kind (Business Service, Technical Service, Business Application) or services with specific label-value pairs. By default, business services and technical services are displayed. Search and filter options are retained even if you switch between the heatmap and tile views.
    • Number of services per page: Click to select the number of services to be displayed on the page.
      Important:
      The service count displayed on the heatmap view shows only the parent services filtered by Severity and Service Kind.
    • Refresh page: Click Refresh to refresh the page. 
      By default, the Services page is automatically refreshed every five minutes. To change the refresh interval duration, see Configuring general settings.
  5. (Optional) Click Tile View to view services in a tile view.
    Each tile represents a service and displays the service name, the service impact score, the count of situations, events, and incidents, and the total impacted CI count associated with the service. Search results or filters are retained across both the tile view and the heatmap view.
    Important: The service count displayed on the tile view shows all the services filtered by Severity and Service Kind.
  6. (Optional) Click Save Preferences to save your page preferences.
    Your selected severity filters, advanced filters, and the page view (Heatmap view or Tile view) are saved until you change your page preferences again.
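
As a rough illustration of the relative box sizing described in step 1, the following Python sketch assumes that each heatmap box takes a share of the available area equal to its impact score divided by the total impact score of the displayed services. This proportional rule, the equal-share fallback, and the function name heatmap_area_shares are assumptions made for illustration only; they are not the layout algorithm that BMC Helix AIOps uses.

```python
def heatmap_area_shares(impact_scores: dict[str, float]) -> dict[str, float]:
    """Return each service's share of the heatmap area.

    Hypothetical rule: share = impact score / sum of all displayed impact scores,
    so a box grows with its own score and shrinks as more services are displayed.
    """
    if not impact_scores:
        return {}
    total = sum(impact_scores.values())
    if total == 0:
        # Assumed fallback: with no impact anywhere, all boxes are the same size.
        equal = 1 / len(impact_scores)
        return {name: equal for name in impact_scores}
    return {name: score / total for name, score in impact_scores.items()}


# Example service names and scores are made up.
shares = heatmap_area_shares({"payments": 60, "checkout": 30, "search": 10})
```

In this sketch, adding another service to the page reduces every existing share, which mirrors the statement that the box size is dynamic and relative to other boxes.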

 

Why don't I see any services on the Services page?

Services start appearing on the Services page as soon as service models are created, or services are discovered by BMC Helix Discovery. For more information, see Creating service models.

To monitor service health

  1. Click Services and click a service name to view the following details:
    • Service name and severity level
    • Health score
      If a dependent service is impacted, the health score is propagated to all services that the impacted service depends upon.

      • If more than one child service or a parent service is impacted, the lowest health score is displayed.
      • The service health score is propagated to all the services that the impacted service depends upon.
      • If the parent service is not impacted, the propagated health score is denoted by a label (Propagated) both on the tile view and heatmap view.
      • If the parent service is impacted, the Analyze Root Cause section shows the list of all impacted child services and CIs.
    • Incidents: Click to view incident details.
      The incident message cross-launch link opens the incident details in BMC Helix ITSM (if you have permission to access the application). 
    • Total Events: Number of events generated for the service. 
    • Impacting Events: Number of events used to compute health score for a service. 

    • Impacting Child Services: Number of child services impacting the parent service.
    • Refresh page: Click Refresh to refresh the page.
      By default, the service details page is not automatically refreshed, and the Auto Refresh Interval option is set to Off. To change the refresh interval duration, see Configuring general settings.
    • Health timeline

    • CI topology
    • Service hierarchy
    • Health indicators (in the View Health Indicators section)

    • Situations (in the Analyze Situations section)

    • Root cause (in the Analyze Root Cause section)

    • Service insights (in the Analyze Service Insights section)

  2. Click the Impacting Events link to view the impacting events, situations, incidents, and changes for a service, and perform the following optional steps:
    1. Click any event, situation, incident, or change to view details, related events (for situations only), logs and notes, and perform additional operations. 
      You can also use the More Details cross-launch link to view the selected event, situation, incident, or change in BMC Helix Operations Management.

      Important

      By default, BMC Helix AIOps shows up to 10,000 events for a service in the Impacting Events > Events list. If an impacted service has more than 10,000 events, the total count in Impacting Events displays the actual number of events. However, you can view only 10,000 events.

  3. Select a time range to view events, incidents, or changes that occurred in the selected time period.
    By default, data is displayed for the last three hours.
  4. Hover over a time slot to view the exact health score at that point in time.
  5. Hover over the event, incident, or change on the health timeline to view details.
    To learn more about health timeline, see Service health score and health timeline.
  6. (Optional) Hover over the move icon for a section to rearrange the section on the service details page. After the icon changes to a hand pointer, drag and drop the pointer as needed.
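
The health score propagation rules listed in step 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the product's implementation: it assumes health scores range from 0 to 100, that a score of 100 means "not impacted", and that a parent displays the lowest score among itself and its child services. The Service class and the displayed_health function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    own_score: float  # score from the service's own events (assumed range 0-100)
    children: list["Service"] = field(default_factory=list)


def displayed_health(service: Service, healthy: float = 100.0) -> tuple[float, bool]:
    """Return (displayed score, propagated flag) for a service.

    The displayed score is the lowest score among the service and its children.
    The flag is True only when the service itself is not impacted and the
    lower score comes from a child, which is when (Propagated) would be shown.
    """
    scores = [service.own_score]
    scores += [displayed_health(child, healthy)[0] for child in service.children]
    lowest = min(scores)
    propagated = service.own_score >= healthy and lowest < healthy
    return lowest, propagated


# Example hierarchy with made-up names: a healthy parent with one impacted child.
database = Service("database", own_score=40.0)
storefront = Service("storefront", own_score=100.0, children=[database])
score, propagated = displayed_health(storefront)
```

Here the storefront service would show the child's lower score with the (Propagated) label, while the database service shows its own score without the label.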


Where to go from here

Based on the health of and impact on a service, you can perform any of the following tasks:
