Candidate Monitoring Threshold Calculations
Once basis data has been collected from ten monitored executions of a job-step program, the thresholds that define abnormal behavior in that program are calculated for each of the three monitored conditions (elapsed time, TCB time, and I/O activity). For elapsed time and TCB time, the threshold is the maximum number of minutes that an instance of the program can use before its usage of the resource is considered abnormal.
Minimum Run Time for Basis Data Collection
When a program execution is monitored to collect basis data, its elapsed time must exceed a minimum value for the data from that execution to be stored as basis data. This prevents brief executions of a program (when it finds no work to do) from artificially lowering the thresholds that will be used on subsequent executions where the program does do work and therefore runs longer. The default for the minimum elapsed time required to retain basis data is two minutes. This parameter can be set to any value from 1 to 999. Refer to the “AutoStrobe Parameters” section of the Advanced-Configuration-Guide for more information about adjusting minimum time values for basis data collection.
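The minimum-run-time rule can be pictured with a short sketch. This is an illustration only, not Strobe code: the function name `retain_as_basis` and the sample run times are hypothetical, and the comparison assumes that "exceed" means strictly greater than the minimum.

```python
DEFAULT_MIN_ELAPSED_MINUTES = 2  # default stated in the text


def retain_as_basis(elapsed_minutes, min_elapsed=DEFAULT_MIN_ELAPSED_MINUTES):
    """Return True if an execution ran long enough to be kept as basis data.

    Hypothetical helper: assumes the elapsed time must be strictly
    greater than the configured minimum.
    """
    if not 1 <= min_elapsed <= 999:
        raise ValueError("minimum elapsed time must be between 1 and 999")
    return elapsed_minutes > min_elapsed


# Made-up elapsed times (minutes); the two brief runs found no work to do.
runs = [0.4, 35.0, 41.5, 1.9, 38.2]
basis = [t for t in runs if retain_as_basis(t)]
print(basis)
```

Only the three longer runs are kept, so the short no-work executions cannot pull the calculated thresholds down.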
Threshold for Elapsed Time
The threshold for elapsed time, which AutoStrobe uses to initiate a program measurement, is calculated as follows:
- Collect the elapsed time for each of ten executions of the program.
- Discard the highest and lowest values from the ten executions.
- Add a percentage of the average of the eight remaining executions to the highest value among those eight.
The default value for this percentage is 50, and the range of valid values is from 1 through 100. This percentage can be changed using the AUTOELAPPERCNT installation parameter. For more information, refer to the “AutoStrobe Parameters” section of the Advanced-Configuration-Guide.
For example, suppose the average elapsed time of the eight program runs is 40 minutes and the longest of the eight executions ran for 53 minutes. If the percentage value is 50, then the threshold that triggers measurement of this program is 73 minutes: 53 plus 50 percent of 40 (20) equals 73.
If you lowered the percentage value to 10, then the threshold for elapsed time would drop to 57 minutes: 53 plus 10 percent of 40 (4) equals 57.
If you instead raised the percentage value to 98, then the threshold for elapsed time would rise to 92 minutes: 53 plus 98 percent of 40 (approximately 39) equals 92.
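The arithmetic above can be reproduced with a minimal sketch. The function name `elapsed_threshold` and the ten sample run times are hypothetical, chosen so that the eight retained values average 40 minutes with a maximum of 53. Whether the product rounds the result is not stated in the text, so the sketch returns the unrounded value.

```python
def elapsed_threshold(elapsed_minutes, percent=50):
    """Sketch of the threshold formula described in the text.

    Drops the highest and lowest of ten values, then adds `percent`
    percent of the trimmed average to the trimmed maximum.
    """
    if len(elapsed_minutes) != 10:
        raise ValueError("expected elapsed times for exactly ten executions")
    if not 1 <= percent <= 100:
        raise ValueError("percent must be from 1 through 100")
    trimmed = sorted(elapsed_minutes)[1:-1]  # discard lowest and highest
    average = sum(trimmed) / len(trimmed)
    return max(trimmed) + average * percent / 100


# Made-up data: after trimming 25 and 60, the average is 40 and the high is 53.
runs = [25, 30, 32, 35, 38, 40, 44, 48, 53, 60]
print(elapsed_threshold(runs))             # 53 + 20
print(elapsed_threshold(runs, 10))         # 53 + 4
print(round(elapsed_threshold(runs, 98)))  # 53 + 39.2, rounded
```

With these inputs the three calls correspond to the 73, 57, and 92 minute thresholds in the examples.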
Threshold for TCB Time
The TCB time threshold that triggers a measurement is calculated with the same formula as for elapsed time. As an example, a program uses on average 32 TCB minutes after the high and low TCB figures are discarded, and the highest of the eight remaining runs was 40 TCB minutes. With the default value of 50 percent, measurement of the program is initiated if it uses 56 TCB minutes: 40 plus 50 percent of 32 (16) equals 56. Acceptable values for this parameter are 1 through 100. Refer to the explanation of the AUTOTCBPERCNT parameter in the Advanced-Configuration-Guide for more information about adjusting the TCB time percentage.
Threshold for I/O Activity
Basis data is collected and a norm for I/O activity is calculated so that a measurement can be initiated when the program’s activity deviates significantly from that norm. Since either high or low I/O activity indicates a problem at the point in time when the activity occurs, the calculations are significantly different from those for elapsed time or TCB time.
You can control how to determine what should be identified as abnormal elapsed time, TCB time, or I/O activity by specifying what performance evaluation parameters are used to calculate the thresholds that define abnormal activity. These percentages are set in the Strobe installation PARMLIB that contains parameter value definitions required by Strobe.
Establishing a Norm for I/O Activity
To establish a norm, the first ten executions of the program are monitored to determine how many I/O and CPU service units the program uses in each execution. This basis data is collected as follows: for each of the ten executions, a matrix is built that divides the CPU service units of the program into 25 intervals. This matrix is a time-line of the I/O activity in each of the 25 CPU intervals. When each basis-data execution completes, the time-line data collected from it is combined with the data collected from previous executions to create a new basis-data time-line that is a composite of the program’s I/O activity.
After the tenth execution, the following takes place:
- The average of the total CPU service units for the ten executions is calculated and divided into 25 intervals.
- The average I/O service unit usage across the executions is calculated for each interval.
- This average I/O usage is applied to each interval as the composite average for the corresponding composite interval calculated in the first step.
This is the basis data that represents normal I/O activity during the program’s execution.
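One way to picture the composite time-line is the sketch below. It is an illustration under stated assumptions, not the product's algorithm: the name `build_norm` is hypothetical, each execution is assumed to be already reduced to a total CPU service-unit count plus a list of 25 per-interval I/O service-unit counts, and the sample figures are invented.

```python
INTERVALS = 25  # number of CPU intervals stated in the text


def build_norm(executions):
    """Combine basis executions into a composite I/O time-line.

    `executions` is a list of (total_cpu_su, io_su_per_interval) pairs,
    where io_su_per_interval has one entry per CPU interval.
    Returns (cpu_su_per_interval, average_io_su_per_interval).
    """
    count = len(executions)
    for _, io in executions:
        if len(io) != INTERVALS:
            raise ValueError("each execution needs 25 interval values")
    # Step 1: average total CPU service units, split into 25 intervals.
    average_cpu = sum(cpu for cpu, _ in executions) / count
    interval_width = average_cpu / INTERVALS
    # Steps 2 and 3: average I/O service units for each interval.
    average_io = [
        sum(io[i] for _, io in executions) / count for i in range(INTERVALS)
    ]
    return interval_width, average_io


# Ten invented executions: CPU totals average 275, I/O per interval is 20..29.
executions = [((270 if k % 2 == 0 else 280), [20 + k] * INTERVALS)
              for k in range(10)]
width, norm = build_norm(executions)
print(width, norm[0])
```

Here each composite interval spans 11 CPU service units, and the norm for every interval is the mean of the ten observed I/O values for that interval.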
Detecting Abnormal I/O Activity
Once this composite time-line of I/O usage is established, subsequent executions of the program are monitored to detect deviations from this matrix as the execution proceeds along the time-line. Each deviation is specified as an event. An event is a combination of two things: a percentage by which the detected I/O activity is greater than or less than the mean value for the corresponding interval of the composite time-line, and the number of consecutive intervals in which the deviation occurs. The following figure shows an example of a composite time-line for I/O activity in a monitored program and how abnormal activity is detected in a subsequent execution of that program.
Example of How AutoStrobe Detects Abnormal I/O Activity
This example assumes that the average CPU service units used by the program in its first ten monitored executions is 275. This means that for this example each interval in the time-line for the composite of these ten executions represents 11 CPU service units.
The solid line in the graph shows the average number of I/O service units used in the first ten monitored instances of the program relative to the CPU service units consumed as their executions proceeded. In this example, the first ten monitored executions indicate that the normal I/O activity for this program is:
- Relatively heavy (between 30 and 40 I/O service units in each CPU interval) in the intervals at the points in the program’s execution where it has consumed approximately 20 - 50 CPU service units.
- Lower I/O activity (between 15 and 30 service units per interval) for most of the rest of the program’s execution (as measured by CPU service units).
- Very high I/O activity (over 40 I/O service units per interval) in the intervals at the points in the program’s execution where it has consumed between 250 and 275 CPU service units.
The broken line on the graph represents the I/O activity for an execution of the program after its basis data has been collected. This execution is monitored for abnormal behavior. Assume that the standard (that is, the event) for abnormal consumption of I/O service units is either 75% higher or 50% lower than the norm established for those intervals, for three consecutive intervals.
In this example, the I/O activity in the execution being monitored was significantly lower in the second and third intervals but moved closer to the norm in the fourth interval, so this usage was not marked as abnormal. However, in the middle of the program’s execution (at the points where it has consumed between 66 and 99 CPU service units), the consumption of I/O service units was significantly higher than the norm for three consecutive intervals. When this abnormal behavior appears at the third of these intervals, a Strobe measurement of the monitored program is initiated.
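The scenario above can be sketched as a scan for consecutive deviating intervals. The function name `first_abnormal_interval` and the flat norm of 20 I/O service units are hypothetical, and the sketch assumes that high and low deviations both count toward the same consecutive run, which the text does not specify.

```python
def first_abnormal_interval(norm, observed, high_pct, low_pct, consecutive):
    """Return the 0-based interval index at which an event is recognized,
    or None if the execution never deviates long enough.

    Assumption: a run may mix intervals that are too high and too low.
    """
    run = 0
    for index, (expected, actual) in enumerate(zip(norm, observed)):
        too_high = actual > expected * (1 + high_pct / 100)
        too_low = actual < expected * (1 - low_pct / 100)
        if too_high or too_low:
            run += 1
            if run >= consecutive:
                return index
        else:
            run = 0  # deviation ended before the event was satisfied
    return None


# Invented data: 25 intervals of 11 CPU service units, norm of 20 I/O SUs.
norm = [20] * 25
observed = list(norm)
observed[1] = observed[2] = 5   # low in the 2nd and 3rd intervals only
observed[6:9] = [40, 40, 40]    # high while consuming 66-99 CPU SUs

print(first_abnormal_interval(norm, observed, 75, 50, 3))
```

The two low intervals reset before reaching three in a row, so the event is recognized only at index 8, the third of the high intervals (the ninth interval counting from one).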
There is one default value defining abnormal I/O activity for an event (greater than 100%, or less than 75% for 3 consecutive intervals) when I/O activity is compared to the “norm” calculated over the first ten program executions. If you want to define additional events (high and low percentages and a number of consecutive intervals) for abnormal I/O activity, you can set the values using the AUTOIOEVENTx parameters in the Strobe parameter library. If multiple I/O events are defined, only one needs to be exceeded to be recognized as abnormal behavior.
You can define as many as five sets of high and low percentages and a number of intervals to use as benchmarks of abnormal I/O activity. For example, in addition to the default event of 100% over or 75% under for three intervals, you could define the following two events:
- I/O Event A (30,50,9): This combination of values would initiate a measurement if a long period of somewhat minor abnormal I/O activity (30% higher or 50% lower than average for 9 consecutive intervals) was observed.
- I/O Event B (300,99,1): This combination of values would initiate a measurement anytime a huge swing in I/O activity (300% higher or 99% lower than average for one interval) was observed.
Refer to the explanation of the five AUTOIOEVENTx parameters in the Advanced-Configuration-Guide for more information about setting up the events that define abnormal I/O activity.
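As a rough illustration of how several events combine, the sketch below checks an execution against the default event and the two example events, and reports abnormal behavior as soon as any one of them is satisfied. The `IOEvent` type and the function names are hypothetical, the tuple order (high percent, low percent, intervals) simply mirrors the examples in the text, and nothing here reflects the actual AUTOIOEVENTx parameter syntax.

```python
from typing import NamedTuple


class IOEvent(NamedTuple):
    high_pct: float  # percent above the norm
    low_pct: float   # percent below the norm
    intervals: int   # consecutive intervals required


DEFAULT_EVENT = IOEvent(100, 75, 3)
EVENT_A = IOEvent(30, 50, 9)
EVENT_B = IOEvent(300, 99, 1)


def event_triggered(norm, observed, event):
    """True if `observed` deviates from `norm` as the event describes."""
    run = 0
    for expected, actual in zip(norm, observed):
        too_high = actual > expected * (1 + event.high_pct / 100)
        too_low = actual < expected * (1 - event.low_pct / 100)
        run = run + 1 if (too_high or too_low) else 0
        if run >= event.intervals:
            return True
    return False


def is_abnormal(norm, observed, events):
    """Only one of the defined events needs to be satisfied."""
    return any(event_triggered(norm, observed, e) for e in events)


# Invented data: a single interval with a very large spike in I/O activity.
norm = [20] * 25
observed = list(norm)
observed[10] = 90  # more than 300% above the norm of 20

print(is_abnormal(norm, observed, [DEFAULT_EVENT]))
print(is_abnormal(norm, observed, [DEFAULT_EVENT, EVENT_A, EVENT_B]))
```

With only the default event defined, the single spike is too brief to count. Adding Event B makes the same execution abnormal.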