Strobe performance profile interpretation steps
The Performance Profile begins with the Measurement Session Data report and is structured hierarchically so that subsequent reports provide detail for the higher-level reports. This section discusses the steps you take to interpret the Performance Profile. The process is illustrated in the Strobe Performance Profile Interpretation Flowchart. This flowchart details the interpretation path for batch applications. For online applications, follow the same path, but also refer to the subsystem-specific reports that are described in the applicable section of Using-Strobe-options.
The flowchart indicates:
- Major decision points with diamonds
- Questions with boxes
- Related reports to the right of the decision point or question.
The flowchart indicates which reports in the Performance Profile help you to make a decision or answer a question. For example, the first decision is to determine whether the Performance Profile is valid, and subsequently whether to pursue a reduction in CPU time expenditure or wait time. Use the information in the Measurement Session Data report to make both of these decisions. Continue following the flowchart until you have determined the cause of high wait time or high CPU time.
Strobe Performance Profile Interpretation Flowchart
Samples of each of the reports are in Using-Strobe-performance-profile, which provides a reference for the columns and fields that appear on all the reports. The reports in the Performance Profile print in the order found in Using-Strobe-performance-profile, not in the order in which this section discusses them.
Is this a valid profile?
The first tasks in interpreting a Performance Profile are to verify that the reports in the Profile reflect a measurement of the correct program on the correct system and that the measurement is statistically valid.
To determine validity, see the following Measurement Session Data Report. First, verify these fields to ensure that the values reflect the application job step that you intended to measure:
In the JOB ENVIRONMENT section, check:
- PROGRAM MEASURED
- JOB NAME
- JOB NUMBER
- STEP NAME
- CONDITION CODE or COMPLETION CODE (these appear only if the step ended while Strobe was measuring)
- SYSTEM
In the MEASUREMENT PARAMETERS section, check the options that you specified:
- ESTIMATED SESSION TIME
- TARGET SAMPLE SIZE
- REQUEST NUMBER, preceded by either (Q) or (A). (Q) indicates that you submitted a queued request, and (A) indicates that you submitted an active request
Measurement Session Data Report
Check the margin of error fields in the MEASUREMENT STATISTICS section against the criteria outlined below. Because Strobe intermittently samples application execution rather than continuously monitoring, there is always some inaccuracy inherent in the measurement. Normally, this slight inaccuracy should not affect the reliability of the reports. Strobe quantifies this amount for reports that detail CPU time and for reports that detail run time.
- RUN MARGIN OF ERROR PCT, the margin of error for the percentages in the reports that detail run time. A value less than or equal to 2.00 indicates that the information in the Performance Profile is statistically valid. The percentages in reports that detail run time are reliable within a range of plus or minus this percentage.
- CPU MARGIN OF ERROR PCT, the margin of error for the percentages in the reports that detail CPU time. This value shows the margin of error for CPU execution samples. A high value (over 10) in this field indicates that the information in the reports that show program execution is most likely not valid and that you probably have not collected enough execution samples. If your goal is to reduce wait time, however, a high CPU margin of error does not affect the validity of the reports that show run time. (The sketch after this list illustrates the sampling arithmetic behind these margins.)
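For context on where these figures come from, the following Python sketch applies the standard sampling-error formula for a proportion. It illustrates the general statistical idea only, not necessarily the exact calculation Strobe performs, and the sample counts are hypothetical.

```python
import math

def margin_of_error_pct(n_samples, confidence_z=1.96):
    """Worst-case (p = 0.5) margin of error, in percentage points, for a
    percentage estimated from n_samples independent samples."""
    p = 0.5
    return confidence_z * math.sqrt(p * (1.0 - p) / n_samples) * 100.0

total_samples = 10_000   # hypothetical: samples taken across the whole run
cpu_samples = 80         # hypothetical: samples that caught CPU execution

run_moe = margin_of_error_pct(total_samples)  # accuracy of run-time percentages
cpu_moe = margin_of_error_pct(cpu_samples)    # accuracy of CPU-time percentages

print(f"RUN MARGIN OF ERROR PCT ~ {run_moe:.2f}")  # ~0.98: <= 2.00, run-time reports usable
print(f"CPU MARGIN OF ERROR PCT ~ {cpu_moe:.2f}")  # ~10.96: > 10, too few execution samples
```

In this hypothetical case the run-time reports are statistically sound, but the CPU-oriented reports are not, which matters only if CPU time reduction is your goal.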
After you have determined the validity of the Performance Profile, pursue the steps for CPU time reduction, wait time reduction, or both. Many of the reports include histograms in which a spike (the longest row of plus signs) clearly indicates observed high CPU time or run time.
Wait or CPU?—Determining performance improvement opportunities
Once you know that the Performance Profile is representative and valid for the job step that you intended to measure, determine whether reducing wait time or reducing CPU time presents the best opportunity for improving performance. The Measurement Session Data report quantifies the time the application is using CPU resources and experiencing wait. Note the following on the Measurement Session Data report:
- SESSION TIME—the amount of time the session ran.
- EXEC TIME PERCENT—the percentage of time that Strobe observed the application consuming CPU time. This value is calculated from the percentage of samples in which Strobe observed CPU execution (a small sketch after this list illustrates the calculation).
- WAIT TIME PERCENT—the percentage of time Strobe observed the job step waiting for an event to complete.
- CPU TIME—the amount of time the application was using CPU resources.
- WAIT TIME—the amount of time Strobe observed that the central processing system was available but was not in use by application tasks executing within the measured job step.
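A minimal sketch of the sample-based percentages, using hypothetical counts, is shown below; it illustrates the idea described above rather than Strobe's exact bookkeeping.

```python
# Hypothetical sample counts for one measurement session.
total_samples = 10_000   # all samples Strobe took during the session
cpu_samples = 3_100      # samples in which CPU execution was observed
wait_samples = 6_900     # samples in which the step was waiting

exec_time_percent = 100.0 * cpu_samples / total_samples    # 31.00
wait_time_percent = 100.0 * wait_samples / total_samples   # 69.00

print(f"EXEC TIME PERCENT ~ {exec_time_percent:.2f}")
print(f"WAIT TIME PERCENT ~ {wait_time_percent:.2f}")

# With wait dominating, the wait-time branch of the flowchart is the natural
# place to start; if the split were reversed, CPU time would be.
```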
Depending upon which quantity is larger and your goal in implementing an APM program, pursue an opportunity to reduce CPU time or wait time. For example, if your immediate concern is a shrinking batch window, the first objective is to reduce wait time. If, however, costs related to CPU time are your main concern, investigate CPU time first. The following topics trace the path to follow when looking for an opportunity to reduce wait time.
What is causing wait?
If wait time seems high, the Resource Demand Distribution report helps you determine whether files, tasks, or file management activity (.FILEMGT on the flowchart) are causing the wait. When Strobe detects file-processing activity that it does not relate to a specific ddname or to a specific IOS driver, it groups all the activity under the name .FILEMGT.
Resource Demand Distribution report
The Resource Demand Distribution report identifies, in the CAUSING CPU WAIT column under PERCENT OF RUN TIME SPENT, the percent of run time that the tasks, .FILEMGT, or ddnames were waiting for an event (usually I/O) to complete.
Locate high values in this column and then follow the appropriate branch or branches of the flowchart. Often, the Resource Demand Distribution report shows some file wait, some task wait, and some wait related to .FILEMGT. You may need to follow more than one path on the flowchart to explore all the possible opportunities for improvement.
Are files causing wait time?
If the Resource Demand Distribution report shows a high value for files in the CAUSING CPU WAIT column, determine why the files are causing wait. Explore whether the Performance Profile relates this wait to high physical access or internal contention, or whether it suggests the possibility of external contention.
Is high physical access causing wait time?
The Resource Demand Distribution report indicates that files are causing wait when high percentages appear for ddnames in the CAUSING CPU WAIT column under the PERCENT OF RUN TIME SPENT heading. If files are causing wait, read the Data Set Characteristics report to identify the physical characteristics of the files. Here you may find that options such as the default values for block sizes and buffering are creating undue wait. For files with access methods such as VSAM, the Data Set Characteristics Supplement report and the VSAM LSR Pool Statistics report, which detail buffering, provide important information.
VSAM and QSAM offer various options in blocking and buffering. While a comprehensive explanation of the tuning options is beyond the scope of this manual, the following paragraphs show how Strobe can indicate the blocking and buffering options you should consider modifying. For more detailed information relating to your specific Performance Profile, refer to the IBM Data Facility Storage Management Subsystem documentation.
QSAM performance improvement opportunities
One possible opportunity to improve performance of QSAM data sets is to increase the block size of the file to the maximum possible for the DASD you are using without wasting DASD space. Increasing the block size minimizes the number of blocks transferred, represented on the Data Set Characteristics report as EXCP COUNTS (direct invocations of execute channel programs), and can significantly reduce wait time. Each time MVS issues an EXCP, the address space for the application program typically waits until the EXCP completes. Larger block sizes reduce the number of EXCPs and, consequently, the wait time.
Data Set Characteristics report
Because different devices have different blocking optimizations, research the correct block size for the device you are using. In general, for sequential files, half-track blocking optimizes I/O performance. Half-track blocking sets the block size, in bytes, to approximately half the track capacity; the exact value depends upon record format, device type, and record length.
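As a rough illustration of the EXCP arithmetic, the following Python sketch compares the number of blocks (and therefore EXCPs) needed to read the same file at two block sizes. The record length, record count, and block sizes are hypothetical, loosely modeled on commonly cited 3390 half-track figures; consult the DFSMS documentation for your actual device geometry.

```python
import math

lrecl = 800                  # hypothetical logical record length, in bytes
record_count = 1_000_000     # hypothetical number of records in the file

def excps_to_read(block_size):
    """Approximate EXCPs as one per block transferred."""
    records_per_block = block_size // lrecl
    return math.ceil(record_count / records_per_block)

unblocked = 800              # one record per block
near_half_track = 27_200     # near half-track, rounded down to a multiple of LRECL

print(excps_to_read(unblocked))        # 1,000,000 blocks, and roughly as many waits
print(excps_to_read(near_half_track))  # about 29,412 blocks, a ~97% reduction
```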
VSAM performance improvement opportunities
A possible opportunity for performance improvement of key-sequenced VSAM files with data and index clusters relates to file buffering. For a key-sequenced VSAM file that is directly accessed, buffering of the index component of the file may need to be increased. It is most efficient to set the number of index buffers equal to the number of index set records plus the request parameter list’s string number (RPL STRNO on the Data Set Characteristics report). The number of index set records for an index with two index levels is always one. By allocating more buffers, you place the entire index in storage. This placement reduces the number of physical I/Os (the EXCP COUNT on the Data Set Characteristics report) for the index component of the VSAM file and significantly reduces wait time.
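The buffer arithmetic described above can be expressed as a one-line rule, sketched below with hypothetical values. RPL STRNO appears on the Data Set Characteristics report, and the index set record count can be estimated from catalog information for the index component.

```python
def recommended_index_buffers(index_set_records, rpl_strno):
    """Index buffers = index set records + the RPL string number (STRNO)."""
    return index_set_records + rpl_strno

# Hypothetical examples.
print(recommended_index_buffers(index_set_records=1, rpl_strno=2))   # two-level index: 3 buffers
print(recommended_index_buffers(index_set_records=25, rpl_strno=4))  # deeper index: 29 buffers
```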
As a general rule, if you always plan to access the file directly (DIR in the OPEN INTENT field on the Data Set Characteristics Supplement report), you do not need to adjust the number of buffers for the data portion of the cluster. If, however, you change the application program so that this cluster is accessed sequentially (SEQ on the Data Set Characteristics Supplement report), increasing the number of buffers for the data portion will reduce wait time. If the cluster is accessed dynamically (DYN on the Data Set Characteristics Supplement report), allocating extra buffers to the cluster’s index and data portions can improve performance.
For VSAM KSDS data sets, you may benefit from the use of the BMC IAM product.
Data Set Characteristics Supplement Report
If files experience control interval (CI) or control area (CA) splits, the section of the Data Set Characteristics Supplement report that details %CA FREE and %CI FREE becomes relevant. CA splits (and, to a lesser extent, CI splits) incur wait. You can minimize CA splits by specifying FREESPACE when the cluster is defined, depending upon the function of the application.
If the VSAM file uses local shared resources (LSR), Strobe produces the VSAM LSR Pool Statistics report. A directly accessed VSAM file that uses LSR and has efficient buffers performs fewer I/Os than a file with the same number of records that uses non-shared resources (NSR), thereby incurring less wait time. This efficiency occurs because the index component of files that take advantage of LSR can maintain sequence set records in storage, but an NSR file cannot. The VSAM LSR Pool Statistics report can help you evaluate the efficiency of the way LSR files are buffered by reporting the number of retrieves with I/O and the number of retrieves without I/O. If the number of retrieves with I/O is high compared to the number of retrieves without I/O, then investigating file buffering may offer an opportunity for performance improvement.
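One way to make that comparison concrete is to express it as a hit ratio, as in the following sketch. The retrieve counts are hypothetical, and the 80 percent threshold is an illustrative assumption rather than a Strobe recommendation.

```python
# Hypothetical counts from the VSAM LSR Pool Statistics report.
retrieves_with_io = 42_000      # retrieves that required a physical read
retrieves_without_io = 18_000   # retrieves satisfied from the buffer pool

hit_ratio = retrieves_without_io / (retrieves_with_io + retrieves_without_io)
print(f"LSR buffer hit ratio: {hit_ratio:.1%}")   # 30.0% in this example

if hit_ratio < 0.80:   # illustrative threshold, not a Strobe-defined rule
    print("Many retrieves need physical I/O; investigate LSR pool buffering.")
```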
VSAM LSR Pool Statistics Report
In addition to evaluating file characteristics on the Data Set Characteristics report and the Data Set Characteristics Supplement report, consider the possibility of internal contention.
Is internal contention causing wait time?
Internal contention occurs when I/O is being performed on two files in the same job step on the same device at the same time.
To check for internal contention, examine the I/O Facility Utilization Summary report to see whether there are other files on the same device. This report lists the device and volume on which each data set resides and the amount of run time spent servicing each volume. If this report shows access to other files on the same device a high percentage of the time, refer to the Time Distribution of Activity Level report to determine whether the files are accessed at the same time.
I/O Facility Utilization Summary Report
The Time Distribution of Activity Level report shows, in vertical slices, the level of file access activity over time. Each vertical segment represents 1% of the measurement session. From this report, determine which tasks or ddnames are being accessed at times that overlap.
Time Distribution of Activity Level Report
If the Time Distribution of Activity Level report shows high access to multiple files concurrently, and the I/O Facility Utilization Summary report indicates that the files that are being accessed concurrently reside on the same volume, separating them may improve performance.
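To make the cross-check concrete, the following minimal Python sketch (with entirely hypothetical ddnames, volumes, and intervals) flags ddnames that reside on the same volume and are active in the same 1% time slice, which is the combination of facts the two reports provide.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical placement from the I/O Facility Utilization Summary report.
volume_of = {"INFILE1": "PROD01", "INFILE2": "PROD01", "OUTFILE": "PROD02"}

# Hypothetical activity from the Time Distribution of Activity Level report:
# interval number (each a 1% slice of the session) -> ddnames active in it.
active_in_interval = {
    12: {"INFILE1", "INFILE2"},
    13: {"INFILE1", "OUTFILE"},
}

overlaps = defaultdict(set)
for interval, ddnames in active_in_interval.items():
    for a, b in combinations(sorted(ddnames), 2):
        if volume_of[a] == volume_of[b]:
            overlaps[(a, b)].add(interval)

for (a, b), intervals in overlaps.items():
    # Candidates for separating onto different volumes.
    print(f"{a} and {b} share volume {volume_of[a]} in intervals {sorted(intervals)}")
```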
Another form of internal contention for a resource involves tape devices. If your application accesses a file on tape that spans multiple volumes, you will see the time associated with the tape mount and dismount attributed to .FILEMGT on the Resource Demand Distribution report. A period of total inactivity for the file in the Time Distribution of Activity Level report indicates this problem. Unless multiple tape units are allocated to service a multivolume sequential data set, access to the data set is held up during each tape change.
External contention
External contention occurs when more than one application is trying at the same time to access the same files or DASD volume. Although the Performance Profile does not specifically indicate external contention, the Resource Demand Distribution report may suggest this condition when the value in the SERVICED BY I/O column for a file is much smaller than the value in the CAUSING CPU WAIT column. If you suspect this condition, use the I/O Facility Utilization Summary report to determine which DASD your application programs are accessing, and then investigate applications that execute at the same time.
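The comparison can be expressed as a simple screening rule, sketched below with hypothetical ddnames and percentages; the three-to-one factor is an illustrative assumption, not a Strobe-defined threshold.

```python
# Hypothetical rows from the Resource Demand Distribution report:
# (ddname, SERVICED BY I/O %, CAUSING CPU WAIT %).
resource_demand = [
    ("MASTERIN", 2.1, 18.4),
    ("TRANSIN",  7.5,  8.0),
]

FACTOR = 3  # illustrative ratio for "much smaller than"

for ddname, serviced, waiting in resource_demand:
    if waiting > FACTOR * serviced:
        print(f"{ddname}: wait {waiting}% far exceeds I/O service {serviced}%; "
              "check the I/O Facility Utilization Summary for DASD shared "
              "with other applications.")
```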
Are file management activities (.FILEMGT) causing wait time?
Sometimes Strobe cannot determine the specific files that were responsible for wait, but can determine that it is related to system file management. .FILEMGT represents an aggregate of all file management activities such as file open and close, enqueues, tape mounts, catalog management, and similar overhead routines or file access activities performed by IOS drivers that Strobe does not recognize. Strobe can determine which module was waiting and where in your source code these overhead routines are invoked. You can then change your source code so that it invokes these overhead routines more economically.
If the Resource Demand Distribution report indicates that wait is related to .FILEMGT, see the Wait Time by Module report to identify the system services that are causing wait. The Wait Time by Module report tracks wait associated with .FILEMGT, ddnames, or tasks. Under the MODULE NAME and SECTION NAME columns, this report identifies all the modules and control sections in which the target program was found to be in the wait state. A control section is the smallest unit of execution produced by a compiler or assembler that can be linked or replaced in a load module. See the Attribution of CPU Wait Time report to identify where the service routines are invoked, as described in more detail below.
Wait Time by Module Report
If the Wait Time by Module report identifies system service routines (such as SVCs reported under .SYSTEM) or language routines, check the Attribution of CPU Wait Time report. This report identifies where selected service routines are invoked. For the system service routine identified in the header line, it shows under WAS INVOKED BY the hexadecimal offset from which the service routine was directly or indirectly invoked. If the report is indexed, it also shows the line number and the procedure name from which the system service routine was invoked. With this information, examine the source code to find ways to invoke the overhead routines so that they do not incur as much wait time.
Is task wait causing wait?
Task wait is all wait time not associated with file I/O or .FILEMGT. The Resource Demand Distribution report shows that tasks are causing wait when the RESOURCE column shows CPU as the type of resource used. See the Wait Time By Module report. This report lists all the modules, control sections, and operating system supervisory and service components in which the target program was found to be in the wait state.
If task wait is occurring in a system module, see the Attribution of CPU Wait Time report to identify program locations that called service routines. The Attribution of CPU Wait Time report shows the return address and, for indexed source Performance Profiles, the lines of code and procedure names where the program invoked selected system service routines. Although you can rarely improve the performance of system service routines, you can examine the Attribution of CPU Wait Time report to see where they are invoked and then restructure the code so that they are executed more economically. If subsystem routines are causing wait, see the Strobe subsystem wait reports. These reports are explained in the corresponding section for the option in Using-Strobe-options.
Attribution of CPU Wait Time Report
What’s using CPU time?
If your objective is to reduce CPU time and the Measurement Session Data report indicates that CPU time offers an opportunity for performance improvement, see the Program Section Usage Summary report. Under the CPU TIME PERCENT TOTAL and SOLO columns, this report identifies which modules or control sections are the greatest consumers of this resource. The modules and control sections that use the greatest amount of CPU time are distinguished by a spike on the histogram to the right of the percentages. In addition, Strobe reports offer further detail of CPU consumption by applications that use subsystems such as Adabas/Natural, CICS, Db2, CA-IDMS, and IMS. For more information about Strobe options, refer to Using-Strobe-options.
Is overhead using excessive CPU time?
Overhead represents activities done on behalf of the application program by MVS system service routines, language routines, or subsystem routines. All of these are grouped under the label .SYSTEM on the Program Section Usage Summary report, where each type of system service has a detail line.
Program Section Usage Summary Report
When the Program Section Usage Summary report shows high CPU time for .SYSTEM, distinguished by a spike on the histogram to the right of the entry, go to the Program Usage by Procedure report to identify the users of this resource in more detail.
The first part of the Program Usage by Procedure report shows services reported under .SYSTEM. On this report, each control section of a system service has a detail line. Strobe identifies the function of the system services that the application program source code invoked, and the percent of CPU time used.
Program Usage by Procedure Report
For many system services that cause high CPU use, Strobe creates an Attribution of CPU Execution Time report. This report identifies CPU execution time spent in certain service routines and identifies where they are invoked. Strobe attributes subsystem and language routines if your site has installed the Strobe options that support them.
The Attribution of CPU Execution Time report pinpoints opportunities to avoid overhead by identifying the hexadecimal offsets and, if the Profile has indexed source, the lines of code that invoke the system routines. Again, although normally you cannot improve the performance of system code itself, you may be able to restructure the invoking code so that it calls the overhead routines less frequently. Look at the invoking application code to see if the system service routines can be accessed more economically.
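The kind of restructuring involved is illustrated below in a short, language-agnostic sketch (written in Python for brevity). The lookup_service routine and the record layout are hypothetical stand-ins for whatever routine the report attributes time to; the point is simply that the caller invokes the routine once per distinct key rather than once per record, leaving the routine itself unchanged.

```python
from collections import namedtuple

Record = namedtuple("Record", "currency amount")

def lookup_service(key):
    # Hypothetical stand-in for an expensive system or subsystem routine.
    return 1.0

def process_per_record(records):
    # Before: the service routine is invoked once for every record.
    return [r.amount * lookup_service(r.currency) for r in records]

def process_per_key(records):
    # After: the service routine is invoked once per distinct key only.
    rates = {c: lookup_service(c) for c in {r.currency for r in records}}
    return [r.amount * rates[r.currency] for r in records]

records = [Record("USD", 10.0), Record("EUR", 20.0), Record("USD", 5.0)]
print(process_per_key(records))   # same result, two service calls instead of three
```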
Attribution of CPU Execution Time Report
Is the source code causing excessive CPU time consumption?
On the Program Section Usage Summary report, each control section of an application has a detail line. If the Program Section Usage Summary report showed that application code rather than overhead routines was responsible for CPU consumption, read the Program Usage by Procedure report for the indicated user program. The Program Usage by Procedure report displays which procedure names and lines of code were using CPU.
On the Program Usage by Procedure report, each codeblock has a detail line. A codeblock is a division of a control section whose size in bytes equals the specified report resolution. The detail lines on the Program Usage by Procedure report relate CPU use to hexadecimal offsets within the program and, for an indexed source Performance Profile, to source code lines and procedure names. Again, spikes in the histogram indicate the significant users. (For more information, see Indexed Source Performance Profiles.)
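The following minimal sketch illustrates the bucketing idea behind codeblocks: offsets are grouped into ranges whose size equals the report resolution. It shows the general idea only, not Strobe's internal logic, and the offsets and 32-byte resolution are hypothetical.

```python
resolution = 32  # hypothetical report resolution, in bytes

def codeblock_range(offset, resolution=resolution):
    """Return the start and end offsets of the codeblock containing offset."""
    start = (offset // resolution) * resolution
    return start, start + resolution - 1

for sample_offset in (0x1A4, 0x1B0, 0x3F7):
    start, end = codeblock_range(sample_offset)
    print(f"offset {sample_offset:05X} falls in codeblock {start:05X}-{end:05X}")
# The first two samples land in the same codeblock, so they are reported
# on the same detail line.
```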
Most Intensively Executed Procedures Report
Incorporate changes
Strobe can pinpoint the location of inefficiencies within the source code. After you interpret the Performance Profile, decide where to economize on the resources that your application uses. Examine the application, make corrections, test them, and measure again with Strobe. Remeasuring enables you both to see the results of the changes and to identify other possible opportunities for improvement that may have been masked by the first performance problem.
Summary of reports to analyze
The following Performance Profile reports reflect the interpretation process explained in this section.
The following reports are related to wait time:
| Report name | Type of information |
| --- | --- |
| Resource Demand Distribution | Identifies what is causing wait |
| Data Set Characteristics | Identifies files, buffers, blocksize, etc. |
| I/O Facility Utilization Summary | Identifies run time by ddname, volume, and unit |
| DASD Usage by Cylinder | Identifies run time by cylinder, ddname, and volume |
| Data Set Characteristics Supplement | Identifies intent, logical operations, etc. |
| VSAM LSR Pool Statistics | Provides resource information for data sets allocated to local shared resources |
| Time Distribution of Activity Level | Gives timeline of resource activity |
| Attribution of CPU Wait Time | Identifies wait caused by callers of service routines |
| Wait Time by Module | Identifies in which module the job is waiting |
| Subsystem-specific reports | Provide CPU wait time information |
The following reports are related to CPU time:
| Report name | Type of information |
| --- | --- |
| Program Section Usage Summary | Identifies CPU consumption at the module level or by type of service routine |
| Program Usage by Procedure | Breaks down CPU consumption within modules or a specific service routine |
| Attribution of CPU Execution Time | Identifies CPU used by the callers of service routines |
| Most Intensively Executed Procedures | Identifies the top ten CPU consumers |
| Subsystem-specific reports | Provide CPU consumption information |
The following reports provide additional reference information:
| Report name | Type of information |
| --- | --- |
| Coupling Facility Activity report | Identifies system-wide coupling facility activity |
| Token - Cross Reference report | Reconciles all tokens with their original long names |
Additional Strobe options generate reports that provide support for specific operating environments such as Adabas/Natural, CA-IDMS, CICS, Db2, CA Gen, MQSeries, Java, UNIX System Services, and IMS. Refer to the Options Guide for more information.