Variate policies use unsupervised learning methods for real-time anomaly detection of various metrics that change dynamically with time. For example, page load time, active virtual pages, VM utilization, and CPU load are some of the metrics. With this feature, you can:
- Detect anomalies for multivariate metrics.
- Understand which metrics underlie an event and perform probable-cause analysis.
- Control event sensitivity using anomalous score settings.
- Visualize anomaly events in a stacked graphical format.
Anomalies are observations that diverge from a well-structured data pattern. BMC Helix Operations Management displays the anomaly event plotted in a graph to easily distinguish the anomaly data point from the regular data point. An anomaly event is generated based on the anomaly score computation for the metrics configured in variate policies.
You can define multivariate metrics and configure the anomaly event and score settings in variate policies.
Each variate policy consists of basic policy information such as the name, description, policy type, and one or more metrics. You can configure the following anomaly settings:
- Event settings: The anomaly event is generated only if the anomaly persists for the duration specified in this setting.
- Score settings: The standard deviation value for computing the variability range (Sigma) score for each event severity type.
Is there an out-of-the-box variate policy?
There are no out-of-the-box variate policies. However, when you create a variate policy, you can either use the default anomaly scores or change them.
KPIs, metrics, variate policy types, and scores
KPIs and Metrics: The key attributes of interest that you want to monitor or analyze to derive insights. Some of the essential infrastructure and application parameters can be marked as KPIs, and other attributes can be set as metrics. The values of these KPIs and metrics are analyzed by the anomaly detection algorithm to indicate normal or abnormal behavior. In BMC Helix Operations Management, each metric is input as streamed time-series data and is analyzed for anomaly patterns for a specific duration.
Anomaly detection and score: BMC Helix Operations Management uses Random Cut Forest (RCF) algorithms to analyze the time-series data to detect anomalies. The algorithm associates an anomaly score with each input data point. A low score indicates that the data point is normal, and a high score indicates the presence of an anomaly. The definitions of low and high depend on the application that provides the data. However, in common practice, anomaly scores beyond three standard deviations from the mean score are considered anomalous.
Scoring method: From the anomaly scores, the mean value and the standard deviation are calculated. A threshold score is then calculated as the mean value + (plus) a multiple of the standard deviation. If an anomaly score for a metric or KPI is above the threshold score, the data point is considered an anomalous data point.
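The scoring method above can be illustrated with a minimal Python sketch. It is not the product's implementation: the function names are hypothetical, and using the population standard deviation is an assumption, because the text does not say which variant is applied.

```python
from statistics import mean, pstdev


def threshold_score(scores, factor):
    """Return mean + factor * standard deviation of the given anomaly scores.

    Assumption: population standard deviation (pstdev) is used.
    """
    return mean(scores) + factor * pstdev(scores)


def is_anomalous(score, scores, factor):
    """A data point is anomalous when its score is above the threshold score."""
    return score > threshold_score(scores, factor)


# Illustrative anomaly scores only; mean is 1.5 and standard deviation is 0.5.
history = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0]
limit = threshold_score(history, factor=2)
```

With these illustrative scores, a factor of 2 gives a threshold of 1.5 + 2 x 0.5 = 2.5, so a score of 2.6 would be treated as anomalous and a score of 2.4 would not.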
How do I control the number of anomaly events?
To avoid getting too few or too many anomaly events generated by BMC Helix Operations Management, you can configure the anomaly event and score settings in a way that best suits your organizational needs. For more information, see Configuring variate policies.
- Severity and anomaly events: BMC Helix Operations Management supports three anomaly event severity types (Critical, Major, and Minor). You can configure the anomaly score to define the severity type of each anomaly event. Here is the statistical formula for calculating the score of each severity type:
- Minor: The anomaly score is greater than (Mean value + Standard Deviation)
- Major: The anomaly score is greater than (Mean value + 2 times the Standard Deviation)
- Critical: The anomaly score is greater than (Mean value + 3.5 times the Standard Deviation)
Note: The anomaly score settings are user-configurable. BMC recommends that you do not change the default standard deviation settings unless you are familiar with this concept and know your organizational requirements. For more information, see Configuring variate policies.
- Anomaly event duration: BMC Helix Operations Management supports four out-of-the-box anomaly durations (0, 5, 10, and 15 minutes). An anomaly event is generated only if the anomalous data point persists for the selected duration. For example, if the duration is set to 10 minutes, the policy generates an anomaly event only if the anomalous data point persists for a minimum of 10 minutes. Every successive duration of 10 minutes can be considered a moving window. If you have configured a 10-minute duration, 0 to 10 minutes, 1 to 11 minutes, 2 to 12 minutes, and so on are successive moving windows of that duration.
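As a rough illustration of the severity formulas above, the following Python sketch maps an anomaly score to a severity type. The function and constant names are hypothetical, and the factors are the defaults listed above. It checks the highest severity first so that a score above the Critical threshold is not reported as Minor.

```python
# Default standard deviation factors from the list above, highest severity first.
DEFAULT_FACTORS = (("Critical", 3.5), ("Major", 2.0), ("Minor", 1.0))


def classify(score, mean_score, std_dev, factors=DEFAULT_FACTORS):
    """Return the severity whose threshold the score exceeds, or None.

    Hypothetical helper: each threshold is mean_score + factor * std_dev.
    """
    for severity, factor in factors:
        if score > mean_score + factor * std_dev:
            return severity
    return None  # Score is within the normal range; no event severity applies.


# Illustrative values only: mean score 1.5, standard deviation 0.5.
# Thresholds are then Minor 2.0, Major 2.5, and Critical 3.25.
minor = classify(2.1, 1.5, 0.5)
major = classify(2.6, 1.5, 0.5)
critical = classify(3.3, 1.5, 0.5)
normal = classify(2.0, 1.5, 0.5)
```

Passing a different `factors` tuple mimics changing the score settings in a policy.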
- Band of normality: A grey pattern drawn on the anomaly event-progression graph. This is merely a visual indicator to show the path of normal data progression and does not have any other significance. For any given point on the graph, this is calculated for both the upper and the lower cut lines as follows:
- The upper-cut line is drawn based on the Standard Deviation score of all data points for the last 1 hour + (plus) the data point value at that point.
- The lower-cut line is drawn based on the data point value at that point - (minus) the Standard Deviation score of all data points for the last 1 hour.
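The two cut lines can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the population standard deviation is used, and the one-hour window is treated as inclusive of its boundary, none of which is specified in the text.

```python
from datetime import datetime, timedelta
from statistics import pstdev


def normality_band(points, index, window=timedelta(hours=1)):
    """Return (lower, upper) cut-line values for the data point at `index`.

    points: list of (timestamp, value) tuples sorted by timestamp.
    Assumption: the window covers the last hour up to and including the point.
    """
    at, value = points[index]
    recent = [v for t, v in points[: index + 1] if at - t <= window]
    spread = pstdev(recent)  # Standard deviation of the last hour of values.
    return value - spread, value + spread


# Illustrative data: one point every 20 minutes.
base = datetime(2024, 1, 1, 12, 0)
values = [10.0, 2.0, 4.0, 2.0, 4.0]
points = [(base + timedelta(minutes=20 * i), v) for i, v in enumerate(values)]

# The first point (10.0) is 80 minutes old and falls outside the window.
lower, upper = normality_band(points, index=4)
```

For the last point in this illustrative data, the standard deviation of the last hour is 1.0 and the data point value is 4.0, so the band runs from 3.0 to 5.0.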
- Variate policy details and learning phase:
- Multivariate: All the metrics in a multivariate configuration are analyzed at the same time. The anomaly event is generated based on the combination of metrics analyzed. All the metrics analyzed contribute to the anomaly score, a single event is generated for the combination, and a single view containing a stacked graph for each metric is created. However, if there are more than 10 metrics in one policy, only the metrics with scores in the top 10 ranks are considered. All other metrics with scores outside of the top 10 ranks are ignored. For more information, see Viewing anomaly events.
- Learning Phase: The time taken to collect the required data points to start detecting anomalies in the data set. The RCF algorithm requires 320 data points for detecting and generating anomaly events.
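The anomaly event duration rule described earlier in this list can be sketched in Python as a check over a moving window. This is a simplified illustration, not the product's logic: the function name is hypothetical, and it assumes one data point per minute.

```python
def first_event_minute(flags, duration):
    """Return the first minute at which an anomaly event would be generated.

    flags[i] is True when the data point at minute i is anomalous
    (assumption: one data point per minute). An event is generated once
    an unbroken run of anomalous points has lasted `duration` minutes.
    Returns None if no run lasts long enough.
    """
    streak_start = None
    for minute, anomalous in enumerate(flags):
        if not anomalous:
            streak_start = None  # The run is broken; the window starts over.
            continue
        if streak_start is None:
            streak_start = minute
        if minute - streak_start >= duration:
            return minute
    return None


# Illustrative flags: anomalous from minute 3 to 10, normal at minute 11,
# then anomalous again from minute 12 to 22.
flags = [False] * 3 + [True] * 8 + [False] + [True] * 11

with_10_minutes = first_event_minute(flags, duration=10)
with_5_minutes = first_event_minute(flags, duration=5)
with_0_minutes = first_event_minute(flags, duration=0)
```

In this illustration, the first anomalous run is too short for a 10-minute duration, so the event is generated only at minute 22. A 5-minute duration generates it at minute 8, and a duration of 0 generates it at the first anomalous point, minute 3. A longer duration therefore reduces the number of events.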
When do I start seeing the anomaly events?
Only after the RCF algorithm completes its learning phase. However, BMC Helix Operations Management is built with a robust mechanism to look for existing historical data points for the selected metrics immediately after you create a variate policy.
- If there are enough data points, you can see the anomalies in a short time.
- If there are not enough data points, wait for BMC Helix Operations Management to collect the minimum required data points mentioned above before you can see the anomalies.
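To estimate how long the learning phase might take for a new policy, a back-of-the-envelope calculation such as the following Python sketch can help. The function names and the collection interval are hypothetical, and the numbers in the example are illustrative only. Substitute the minimum data point count and the collection interval that apply to your environment.

```python
def points_still_needed(historical_points, required_points):
    """Return how many more data points must be collected."""
    return max(0, required_points - historical_points)


def estimated_wait_minutes(historical_points, required_points, interval_minutes):
    """Estimate the remaining learning time in minutes.

    Assumption: one data point arrives every `interval_minutes`.
    """
    missing = points_still_needed(historical_points, required_points)
    return missing * interval_minutes


# Illustrative numbers only: 100 points required, 40 already available,
# and one new data point every 5 minutes.
wait = estimated_wait_minutes(40, 100, 5)

# Enough history already exists, so no waiting is needed.
no_wait = estimated_wait_minutes(150, 100, 5)
```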