FDR/DSF/CPK/ABR I/O Error Recovery
Tape I/O errors
FDR may encounter tape I/O errors. FDR always writes out highly blocked information to the tape. The records on tape are undefined format (RECFM=U) with lengths that vary from 20 bytes to 56K.
Most error recovery in cartridge drives is internal to the tape control unit, so most errors that are reported back to z/OS are considered permanent. However, because FDR uses normal access methods (usually BSAM) to read and write backup data sets, z/OS's Error Recovery Procedures (ERPs) may be used to perform additional recovery. If this recovery is successful, FDR is not even aware that the error occurred. If the error is permanent, FDR may take additional actions that complement and extend the ERPs.
These notes mostly apply to backup data sets on DASD as well, but DASD I/O errors are extremely rare on RAID DASD subsystems.
Tape errors during dump
If a tape I/O error occurs during a backup to a tape cartridge, FDR issues message FDR200 to identify the error reported by BSAM, and immediately terminates the backup (see the notes on Tape Swapping in FDR/DSF/CPK/ABR I/O Error Recovery). Since the cartridge control unit and z/OS ERPs cannot recover from the error, there is nothing more that FDR can do. The backup needs to be rerun with new output tapes. If the error repeats on the next run, you may need to ask your tape drive vendor to look at the error records and run diagnostics on the drives; since FDR is using IBM access methods (BSAM), there is little chance that FDR caused the error.
Tape errors during restore
If a permanent I/O error is detected during a RESTORE or COMPAKT-from-backup operation, the program issues message FDR200 or CPK502E to identify the error reported by BSAM. If the block in error is an FDR control record the program immediately terminates. If the block in error is a data block, FDR continues processing with the next block of data. A maximum of 19 blocks are bypassed before terminating the restore unless MAXERR= is coded. The block in error may contain one or more tracks of data. These tracks are not restored.
Even if a tape block is successfully read, its length is compared to an internal length field, in order to detect blocks that may have had undetected I/O errors, or which have been shortened by some other program. A block with a length error is bypassed, just like a block with an I/O error. Tapes created by the FDR system should be copied only by the FDRTCOPY utility (see FDR-Tape-Copy-FDRTCOPY) or FATSCOPY, a separate program product from BMC Corporation. Because they contain blocks over 32K in length, utilities such as IEBGENER do not copy them correctly, resulting in block length errors.
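The bypass rules described above can be illustrated with a small model. This is only a sketch of the documented behavior, not FDR's code: the Block class, its fields, and the restore function are hypothetical names, and it assumes that a length error on a control record is treated like an I/O error on one and that the limit of 19 means the 20th bad data block ends the restore.

```python
from dataclasses import dataclass

DEFAULT_MAXERR = 19  # default limit on bypassed blocks given in the text


@dataclass
class Block:
    """Hypothetical model of one backup block as seen by a restore."""
    is_control: bool    # True for an FDR control record
    io_error: bool      # True if a permanent I/O error was reported
    expected_len: int   # length taken from the internal length field
    actual_len: int     # length actually read


def restore(blocks, maxerr=DEFAULT_MAXERR):
    """Return (status, restored_indexes, bypassed_indexes)."""
    restored, bypassed = [], []
    for i, blk in enumerate(blocks):
        # A length mismatch is handled just like an I/O error.
        bad = blk.io_error or blk.actual_len != blk.expected_len
        if not bad:
            restored.append(i)
            continue
        if blk.is_control:
            # Assumption: any bad control record ends the restore at once.
            return "terminated: control record in error", restored, bypassed
        if len(bypassed) >= maxerr:
            # Assumption: the error after the last allowed bypass terminates.
            return "terminated: error limit reached", restored, bypassed
        bypassed.append(i)  # the tracks in this block are not restored
    status = "completed with lost blocks" if bypassed else "completed"
    return status, restored, bypassed
```

Passing a different maxerr value to the function stands in for coding MAXERR= on the restore.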
An FDR366 or CPK582E message is issued at the end of the restore specifying the tracks, if any, that were lost. On an ABR data set restore, message FDR155 specifies the data set to which the missing tracks belong. FDR, DSF, and CPK issue a U0888 abend and ABR sets a completion code 12 at the end of the restore to draw attention to the error, whether or not tracks needed for the restore were lost. If you code TAPERRCD=NO on the RESTORE TYPE= statement, the error terminations occur only if required tracks are really lost.
Tape swapping
z/OS includes a facility, called SWAP, that attempts to recover from an error on a tape by swapping it to another drive and retrying the operation. It is possible to turn swapping off globally via the z/OS console command SWAP OFF. Even if swapping is enabled, the operator is asked for permission to do the swap and can designate the tape drive to which the tape is swapped (IBM message IGF500D).
When a swap occurs, z/OS must reposition the tape to the position it had when the error occurred and repeat the operation. For cartridge drives, which have buffers, all buffered WRITE data not yet written to the tape must be recovered and written to the new device.
For cartridge drives, each installation should decide if it wants to allow swaps during backups. The repositioning on cartridges uses a hardware block ID that is reliable. However, you must be aware that BMC Corporation cannot guarantee that all the data we wrote was actually written to the tape. For restores, at least one swap can be attempted; more than one is probably futile.
Be aware that if a swap is apparently successful in recovering from the error, FDR is not informed that the error occurred and does not report it. You cannot easily tell from a job listing that a swap occurred, because SWAP messages IGF500I, IGF502E, and IGF505I are not printed with the console messages in the job log at the beginning of the SYSOUT. Only the fact that the tape was mounted on one drive but dismounted from another gives you a clue about swapping.
DASD I/O errors
The FDR system uses its own CCW chains to read and write DASD tracks. In many cases, z/OS Error Recovery Procedures (ERPs) are allowed to recover from DASD I/O errors. However, our own ERPs are often used in place of the system ERPs because of the unique nature of the I/Os we issue. Many errors are retried many times or in various ways in order to read or write the data if at all possible. If all recovery fails, the error is reported in a message so that you can take appropriate action.
There are various I/O error messages, depending on where the error is detected and what type of I/O FDR is doing, but they are usually followed by diagnostic messages including a number of control blocks and other information, such as the IOB, DEB, DCB, UCB, and CCWs. The format of these can be found in various IBM manuals, but there are several significant pieces of information that may help you decipher the error condition:
- The IOB contains two important fields: bytes 2 and 3 (last two bytes of the first word) contain sense data from the DASD volume, which can be found in the appropriate control unit (such as 3880 or 3990) hardware manual; bytes 8-15 (the 3rd and 4th words) contain the CSW (channel status word), which is defined in the Principles of Operation manual for your system.
- In the CCW printout, the CCWs for the current I/O are printed either in a vertical table or up to four on a line.
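As an aid to reading the IOB printout, the following sketch pulls out the two fields named above from a hex string. The function name parse_iob and the sample IOB contents are hypothetical; the only layout it relies on is the one stated in the list (sense in bytes 2 and 3, CSW in bytes 8-15, counting from 0).

```python
def parse_iob(iob_hex):
    """Split a hex dump of an IOB into the fields described in the text.

    iob_hex may contain blanks between words, as in a printed dump.
    """
    raw = bytes.fromhex("".join(iob_hex.split()))
    if len(raw) < 16:
        raise ValueError("need at least the first 16 bytes of the IOB")
    return {
        "sense": raw[2:4].hex().upper(),  # bytes 2-3: last half of word 1
        "csw": raw[8:16].hex().upper(),   # bytes 8-15: words 3 and 4
    }


# Made-up sample data, for illustration only.
fields = parse_iob("42000040 00000000 0012F4A8 0E000000")
print(fields["sense"], fields["csw"])
```

The extracted values still have to be interpreted with the control unit hardware manual and the Principles of Operation manual.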
Contact BMC Support for assistance in diagnosing a DASD I/O error.
DASD errors during dump
Permanent I/O errors on DASD while dumping usually result in DASD tracks not being dumped, since the FDR system could not read the track having the error. Even if some data could be read from a track, a partial track is never dumped. Most I/O errors result in one DASD track being bypassed, but some errors affect more than one track. When a track is bypassed because of an I/O error, a “dummy” entry is written to the backup tape in its place, so that a restore from that tape knows why it is missing (a warning message FDR150 is issued during a restore when such a missing track is encountered).
A maximum of 19 I/O errors can occur before the dump is terminated, unless the MAXERR= keyword is specified. If the dump completes without reaching that maximum, FDR, DSF, and CPK still terminate with a U0888 abend, and ABR with a completion code of 12, to call attention to the errors; the backups are definitely missing some tracks. The I/O error messages contain a cylinder and track number that you can use to determine which data sets are affected, by comparing it to a map of the volume.
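The comparison against a volume map can be sketched as a simple extent lookup. The extent list format, the data set names, and the function find_data_set are hypothetical; in practice the extents would be transcribed from whatever volume map report you use.

```python
def find_data_set(extents, cyl, head):
    """Return the name of the data set whose extent contains (cyl, head).

    extents is an iterable of (dsname, (start_cyl, start_head),
    (end_cyl, end_head)) tuples, with inclusive start and end tracks.
    Returns None if no listed extent contains the track.
    """
    target = (cyl, head)
    for dsname, start, end in extents:
        # Tuples compare cylinder first, then head, which matches the
        # order of tracks on the volume.
        if start <= target <= end:
            return dsname
    return None


# Made-up volume map, for illustration only.
volume_map = [
    ("PROD.PAYROLL.DATA", (0, 1), (9, 14)),
    ("PROD.LOADLIB", (10, 0), (24, 14)),
]
print(find_data_set(volume_map, 12, 3))
```

A result of None means the track is outside every listed extent, for example in free space or in an area the map does not show.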
I/O errors reading the VTOC or VVDS usually make the backup unusable.
DASD errors during restore
Permanent I/O errors writing to DASD usually affect just one DASD track. The track involved (identified by cylinder and track number in the I/O error message) may have partial data written to it, or none at all. Use that track ID to determine the data set affected.
Invalid track format
“Invalid Track Format” is an often-reported I/O error that is not really an I/O error. A DASD track has a fixed maximum capacity, which varies by device type; the actual maximum data for a given track is determined by a formula based on the number and size of the records written to it. If an application or access method erroneously tries to write more data on a track than it holds, the last record on the track is only partially present. During that write, the DASD indicates the “invalid track format” I/O error, but that partial last record may be left on the track. When that track is read by an FDR backup, the same “invalid track format” is reported by the DASD. It is really reporting a logical data error, caused by a programming error, or sometimes a z/OS bug.
“Invalid Track Format” can be recognized when the sense from the DASD (the last 4 digits of the first word of the IOB in the FDR diagnostics) contains x’0040’. This error most frequently occurs in Partitioned Data Sets (PDSs). If so, it is possible that the error is not in any real member, but in dead space between members. The easiest way to tell is to copy the PDS with IEBCOPY; if no errors occur, the backup of the PDS is clean.
If you can be sure that the track getting the “invalid track format” does not contain any useful data, you can make the error go away with the IBM utility ICKDSF with the command:
INSPECT TRACKS(X'cccc',X'hhhh') NOPRESERVE
where “cccc” and “hhhh” are the cylinder and head of the bad track in hex.
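If the cylinder and head are at hand as decimal numbers, the hex operands can be generated rather than converted by hand. This is a minimal sketch: the function name inspect_command is hypothetical, it assumes each value fits in the four hex digits shown above, and it reproduces only the keywords given in the text, not any other parameters the utility may need.

```python
def inspect_command(cyl, head):
    """Build the INSPECT statement shown in the text from decimal values."""
    if not (0 <= cyl <= 0xFFFF and 0 <= head <= 0xFFFF):
        raise ValueError("cylinder and head must each fit in 4 hex digits")
    # 04X pads to four uppercase hex digits, matching cccc and hhhh.
    return f"INSPECT TRACKS(X'{cyl:04X}',X'{head:04X}') NOPRESERVE"


print(inspect_command(291, 7))
# prints: INSPECT TRACKS(X'0123',X'0007') NOPRESERVE
```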