Monitoring service insights


Service insights are analytics, such as trends, patterns, behavior, and daily averages of events and metric data, that describe the performance of a service. Operators or site reliability engineers (SREs) can use service insights to detect service degradation, investigate and identify the root cause, and restore system availability as quickly as possible.

BMC Helix AIOps uses AI/ML algorithms to analyze events and metric data collected from the service environment over a period. Insights are presented as textual summaries with graphs. Insights are available only for services that have been available for at least two days.

As operators or SREs, you can view the following information:

  • Health insights to identify the precise time of service degradation in a day or week.
  • Health insights into events and incidents, shown as trends and behavior over the last 15 days.


The following video (1:53) shows a high-level overview of the service insights feature in BMC Helix AIOps:

Watch the YouTube video about Overview of service insights in BMC Helix AIOps.

Insights available for monitoring services

The following overview lists the available insights and what you can infer from each insight summary:

Health score

Description: This insight is derived from the health score of a service. The health score is calculated by using AI/ML technology on the events associated with various service entities, such as nodes, clusters, applications, devices, and child services. The health score ranges from 0 to 100 and correlates directly with service health: the higher the score, the healthier the service. Take corrective measures if the health score degrades.

You can see this insight only for a decreasing trend in the health score. This insight is not displayed if there is an increasing trend or no trend in the health score for the given period.

For more information, see Insight into health score.

Summary:

  • Overall trend of your service health
  • Latest percentage degradation of the health score between two subsequent dates

Severity pattern

Description: This insight shows the pattern derived from daily occurrences of Critical or Major severity. If a service is repeatedly affected by Critical or Major severities every day during specific times, the service is not healthy. Take corrective measures if you see a pattern in severity occurrences over a period.

For more information, see Insight into severity pattern.

Summary:

  • Duration of repeated occurrences of a Critical or Major severity

Events (Major, Critical)

Description: This insight is derived from Major and Critical events; that is, notifications about any change in the state of an application or device that you are monitoring. Event occurrences correlate inversely with service health: the fewer the events, the healthier the service. Take corrective measures if Major and Critical events are increasing with an alarming daily average over a period.

You can see this insight only for an increasing trend in Major or Critical events. This insight is not displayed if there is a decreasing trend or no trend in Major or Critical event occurrences for the given period.

For more information, see Insight into Major or Critical events.

Summary:

  • Trend of Critical or Major events over a period
  • Number of average events
  • Latest percentage increase in Critical or Major events between two subsequent dates

Incidents

Description: This insight is derived from incidents; that is, events that are not part of the standard operation of a service and that cause interruption or quality degradation of a service. Incident-related insights in BMC Helix AIOps are derived from incidents that are generated from events, either through a notification policy or by right-clicking the event, and that result in an Incident Info (or INCIDENT_INFO) class event. Take corrective measures if incident occurrences are increasing with an alarming daily average over a period.

You can see this insight only for an increasing trend in incidents. This insight is not displayed if there is a decreasing trend or no trend in incident occurrences for the given period.

For more information, see Insight into incidents.

Summary:

  • Trend of incidents over a period
  • Number of average incidents
  • Latest percentage increase in incidents between two subsequent dates

To view insights for a service

  1. On the Services page, click the service name.
  2. Scroll down and expand Analyze Service Insights.
    Insights depend on the availability of metric data collected from the IT network. They are displayed as soon as they become available, for up to the last 15 days.
  3. Click the summary text to view the corresponding graph.
    Insights are available for health score, severity pattern, Major or Critical events, and incidents.

Are insights available if the service is impacted for less than 15 days?

Yes. Insights are displayed as soon as they are available. For example, if the service health score has degraded for the last 5 days, the trend and pattern for those 5 days are displayed on the graph.

Insight into health score

The insight summary shows the degradation of a service health score over a period, and the latest degradation in terms of a percentage change in daily average health score between the last two subsequent dates. In the graph, the daily average health score is measured vertically (along the Y-axis) and date-time is measured horizontally (along the X-axis). The highlighted zone shows the latest percentage degradation of the health score. 

Example: 

Consider an example of insights into health score in which the summary shows a decreasing trend and a 100% degradation of the average service health. The highlighted zone in the graph represents the recent decrease in the average health score.

Insights_HealthScore_243.png

Now, let us understand how the insights are derived using the average health score, as described in the table given below. The recent decrease in the average health score and the corresponding dates are highlighted in the table. You can correlate these values with the highlighted zone in the graph.

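| Date | Avg. health score | % change in average health score |
|---|---|---|
| 12/29/2022 | 16 | - |
| 12/30/2022 | 0 | [(0 - 16)/16] x 100 = -100% (a 100% degradation) |

% change in average health score = [(H2 - H1)/H1] x 100, where H2 = average score on a date and H1 = average score on the previous date.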
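The percentage-change formula from the table can be sketched in Python as follows. This is a minimal illustration, not the product's implementation: the function name `percent_change` and the idea of reporting the magnitude of a negative change as "degradation" are assumptions made for this example.

```python
def percent_change(previous_avg: float, current_avg: float) -> float:
    """Return [(H2 - H1) / H1] x 100, where H1 is the previous day's average."""
    if previous_avg == 0:
        # The formula divides by H1, so it is undefined for a previous average of 0.
        raise ValueError("percentage change is undefined when the previous average is 0")
    return (current_avg - previous_avg) / previous_avg * 100


# Daily average health scores from the table.
daily_avg_health = {"12/29/2022": 16, "12/30/2022": 0}

change = percent_change(daily_avg_health["12/29/2022"], daily_avg_health["12/30/2022"])

# Assumption: a negative change is reported as a positive degradation percentage.
degradation = -change if change < 0 else 0.0

print(f"change = {change:.0f}%, degradation = {degradation:.0f}%")
# prints: change = -100%, degradation = 100%
```

The formula itself produces a negative value when the score drops; the summary's "100% degradation" corresponds to the magnitude of that value.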

 

Insight into severity pattern

The insight summary shows a pattern for Critical or Major severity based on daily occurrences during specific times. In the corresponding graph, the daily average health score is measured vertically (along the Y-axis) and date-time is measured horizontally (along the X-axis). Multiple highlighted zones show the pattern of daily occurrences of Critical or Major severities. The summary is generated from the date range and duration of the severity in the server's time zone, and can be viewed in the user's local time zone.

Example:

Consider an example of insights into severity pattern for a service. In the graph, the highlighted line represents the pattern for Critical severity.

Insights_Pattern_243.png

Now, let us understand how the severity pattern is derived using the daily occurrences of Critical severity, as described in the table given below. You can correlate the severity duration and the corresponding date range with the highlighted line in the graph.

| Date range | Severity | Duration of severity |
|---|---|---|
| 12/29/2022 to 01/12/2023 | Critical | 05:30 hr to 05:30 hr |
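To illustrate what "repeated occurrences during specific times" can mean, the following Python sketch finds the hours of day during which a severity was present on every day of a date range. It is an assumption-laden illustration only: the source does not describe how BMC Helix AIOps detects patterns, and the function name `recurring_hours`, the hour-level granularity, and the sample windows are all hypothetical.

```python
from datetime import date


def recurring_hours(daily_windows):
    """Return the hours of day (0-23) covered by a severity window on every date.

    daily_windows maps a date to a list of (start_hour, end_hour) pairs,
    where end_hour is exclusive.
    """
    common = None
    for windows in daily_windows.values():
        hours = {h for start, end in windows for h in range(start, end)}
        common = hours if common is None else common & hours
    return sorted(common or [])


# Hypothetical Critical-severity windows for three days (not data from the product).
critical_windows = {
    date(2022, 12, 29): [(5, 7)],
    date(2022, 12, 30): [(5, 6), (14, 15)],
    date(2022, 12, 31): [(4, 6)],
}

print(recurring_hours(critical_windows))
# prints: [5]
```

Only hour 5 is covered on all three days, so it is the only hour reported as a recurring pattern in this sample.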


Insight into Major or Critical events

The insight summary shows an increasing trend of Critical or Major events with the daily average of event occurrences over a period, and the latest percentage increase in those events with the comparison dates. In the corresponding graph, the number of Major or Critical events is measured vertically (along the Y-axis) and date-time is measured horizontally (along the X-axis). The highlighted zone shows the recent percentage increase in events.

Example:

Let us take an example of insights into Critical events for which the summary shows an increasing trend of Critical events with the daily average of 11 Critical events, and the recent increase in Critical events from 15 to 18, with the comparison dates. The highlighted zone in the graph represents the recent increase in the occurrences of Critical events.

Insights_CriticalEvents_243.png

Now, let us understand how the insights are derived using the daily occurrences of Critical events data, as described in the table given below. The recent increase in Critical events and their dates are highlighted in the table. You can correlate these values with the highlighted zone in the graph.

| Date | Critical events |
|---|---|
| 12/28/2022 | 10 |
| 12/29/2022 | 2 |
| 12/30/2022 | 4 |
| 12/31/2022 | 4 |
| 01/01/2023 | 5 |
| 01/02/2023 | 5 |
| 01/03/2023 | 6 |
| 01/04/2023 | 15 |
| 01/05/2023 | 16 |
| 01/06/2023 | 15 |
| 01/07/2023 | 15 |
| 01/08/2023 | 10 |
| 01/09/2023 | 15 |
| 01/10/2023 | 18 |
| 01/11/2023 | 18 |

Average Critical events = (Sum of Critical events)/(Number of days) = (10+2+4+4+5+5+6+15+16+15+15+10+15+18+18)/15 = 10.53

Daily average = 11 events (Critical)
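The daily average and the recent increase in this example can be reproduced with a short Python sketch. The helper `latest_increase` is a hypothetical name, and treating "the latest increase" as the most recent day-over-day rise is an assumption made for illustration; the source does not specify how the product selects the comparison dates.

```python
# Daily Critical event counts from the example (12/28/2022 to 01/11/2023).
critical_events = [10, 2, 4, 4, 5, 5, 6, 15, 16, 15, 15, 10, 15, 18, 18]


def latest_increase(counts):
    """Return (previous, current) for the most recent day-over-day rise, or None."""
    for i in range(len(counts) - 1, 0, -1):
        if counts[i] > counts[i - 1]:
            return counts[i - 1], counts[i]
    return None


# (Sum of Critical events) / (Number of days), rounded to a whole number of events.
average = sum(critical_events) / len(critical_events)
daily_average = round(average)

previous, current = latest_increase(critical_events)
percent_increase = (current - previous) * 100 / previous

print(f"average = {average:.2f}, daily average = {daily_average}")
print(f"latest increase: {previous} -> {current} ({percent_increase:.0f}%)")
# prints: average = 10.53, daily average = 11
# prints: latest increase: 15 -> 18 (20%)
```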


Insight into incidents

The insight summary shows an increasing trend of incidents with the daily average of incident occurrences over a period, and the latest percentage increase in incidents with the comparison dates. In the corresponding graph, the number of incidents is measured vertically (along the Y-axis) and date-time is measured horizontally (along the X-axis). The highlighted zone shows the latest increase in incident count. Whenever an incident is created against a service, an information event is logged in BMC Helix Operations Management. BMC Helix AIOps then uses these information events to derive incident-related insights for the respective service.

Example:

Let us take an example of insights into incidents for which the summary shows an increasing trend with the daily average of 8 incidents, and the recent increase in incidents from 13 to 15, with the comparison dates. The highlighted zone in the graph represents the recent increase in the occurrences of incidents.

Insights_Incidents_243.png

Now, let us understand how the insights are derived using the daily occurrences of incidents, as described in the table given below. The recent increase in incidents and their dates are highlighted in the table. You can correlate these values with the highlighted zone in the graph.

| Date | Incidents |
|---|---|
| 12/28/2022 | 1 |
| 12/29/2022 | 2 |
| 12/30/2022 | 3 |
| 12/31/2022 | 4 |
| 01/01/2023 | 5 |
| 01/02/2023 | 7 |
| 01/03/2023 | 6 |
| 01/04/2023 | 7 |
| 01/05/2023 | 9 |
| 01/06/2023 | 9 |
| 01/07/2023 | 10 |
| 01/08/2023 | 11 |
| 01/09/2023 | 12 |
| 01/10/2023 | 13 |
| 01/11/2023 | 15 |

Average incidents = (Sum of incidents)/(Number of days) = (1+2+3+4+5+7+6+7+9+9+10+11+12+13+15)/15 = 7.6

Daily average = 8 incidents
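Because incident insights are derived from Incident Info (INCIDENT_INFO) class events, the daily incident counts can be pictured as a per-day tally of such events. The Python sketch below is illustrative only: the record layout, the field names `event_class` and `day`, and the `ALARM` class value are hypothetical and do not represent the BMC Helix event schema or API.

```python
from collections import Counter

# Hypothetical event records; field names and the "ALARM" class are illustrative.
events = [
    {"event_class": "INCIDENT_INFO", "day": "01/10/2023"},
    {"event_class": "ALARM", "day": "01/10/2023"},
    {"event_class": "INCIDENT_INFO", "day": "01/11/2023"},
    {"event_class": "INCIDENT_INFO", "day": "01/11/2023"},
]


def incidents_per_day(records):
    """Count only INCIDENT_INFO class events, grouped by day."""
    return Counter(r["day"] for r in records if r["event_class"] == "INCIDENT_INFO")


counts = incidents_per_day(events)
print(counts["01/10/2023"], counts["01/11/2023"])
# prints: 1 2

# Daily incident counts from the example (12/28/2022 to 01/11/2023).
table_counts = [1, 2, 3, 4, 5, 7, 6, 7, 9, 9, 10, 11, 12, 13, 15]
average = sum(table_counts) / len(table_counts)
print(f"average = {average:.1f}, daily average = {round(average)}")
# prints: average = 7.6, daily average = 8
```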

Where to go from here

Based on the health of and impact on a service, you can perform any of the following tasks: 


 
