How DBS works
This topic reveals some of the internal algorithms and methods used by DBS.
DBS Job Flow Management
DBS adds intelligence to the job selection and step initiation/termination process; however, it does not decide which job is “next.” That choice is left to either the traditional JES2 initiators or WLM initiators. BMC ThruPut Manager might assist these initiators, but the fundamental job selection process is not altered.
When DBS is running under the control of SLM, it is SLM’s decision which job is “next.” What DBS does is to determine whether or not the job should be allowed to proceed to initiation based on the availability of tape drives. DBS either accepts the job or rejects it. If the job is rejected, selecting “next” is again done by the facilities mentioned above, not by DBS.
A job must be analyzed in order to be under DBS management. That is where the job’s requirements for tape drives are determined. As a result of the analysis process, a Drive Pool mask and a Work Group mask are created and associated with the job. DBS is aware of all the defined Drive Pools and Work Group Pools and has assigned a bit for each one to be used when constructing the masks. The job masks are identical in structure to the JESplex/Member masks. In addition to the masks, all the “counts” and the high watermark for the job’s requirements are also calculated and associated with the job.
The Job Selection Process
The DBS job selection process is not based on “counts.” The concept is different. Each defined Pool is either OPENED for additional jobs or it is CLOSED. If any of the Pools required by a job are CLOSED, then the job is rejected. This decision is a very efficient process using the masks. OPENing and CLOSing Pools is a separate process that involves several considerations, such as the sum total of all the high watermarks of jobs in execution, overbooking factors, steps waiting for DBS “permission” to enter allocation, and historical trends.
DBS maintains a JESplex mask and a mask for each participating JES2 member. The member masks, as you would expect, are each a subset of the JESplex mask. Individual member masks are needed to allow for asymmetries, which can occur at the Configuration level or at the Policy level. Configuration asymmetries tend to be “hard” ones. Policy asymmetries are usually logical ones.
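The OPENED/CLOSED test described above can be sketched with ordinary integer bit masks. This is only an illustration, not the DBS implementation: the function names, the 8-bit width, and the convention that Pool 1 is the leftmost bit are assumptions made for the example.

```python
MASK_BITS = 8  # assumption: simplified mask width, for illustration only


def pool_bit(pool_number: int) -> int:
    """Return a mask with only the bit for the given 1-based Pool set.

    Assumes Pool 1 is the leftmost bit of the mask.
    """
    if not 1 <= pool_number <= MASK_BITS:
        raise ValueError("pool number out of range")
    return 1 << (MASK_BITS - pool_number)


def pools_open_for_job(job_mask: int, jesplex_mask: int, member_mask: int) -> bool:
    """True if every Pool the job needs is OPENED for the JESplex and the member."""
    open_here = jesplex_mask & member_mask
    # Any needed bit that is not set in open_here means a required Pool is CLOSED.
    return job_mask & ~open_here == 0


jesplex = 0b1111_1100
member = 0b1011_1100  # Pool 2 is CLOSED on this member only
job = pool_bit(2) | pool_bit(3)

print(pools_open_for_job(job, jesplex, jesplex))  # True
print(pools_open_for_job(job, jesplex, member))   # False
```

The decision is a single AND and a comparison, no matter how many Pools are defined, which is why a mask test is cheaper than comparing counts Pool by Pool.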
Execution Time Actions
Once a job is selected, DBS tracks the job step by step until it terminates. DBS is aware of all the requirements for each step. It also keeps track of the initial high watermark and subsequently reduced high watermark. The drive requirements for a step were calculated during the simulation/analysis process. DBS refers to these values as “best effort” determination. The term is intended to highlight the possibility that actual allocation might produce different results. For obvious reasons, allocation is always right. So when discrepancies are encountered, the “best effort” value is changed to reflect the actual value.
The actual value is determined immediately after allocation. All steps are scrutinized, even when the DBS “best effort” did not find any tape drive requirement. The actual results are compared to the “best effort” values and, if necessary, adjustments are made.
If drive allocation occurs for a job for which DBS did not find any requirements, DBS is informed. Since the job was not expected to need any drives, it was not registered with DBS at job initiation. When actual drive requirements are found, the job is registered with DBS at that point. As a result, any analyzed job that actually uses drives is registered with DBS while it executes, regardless of the “best effort” determination.
For steps for which drive allocation is expected, a call is made to DBS to determine if the requirements can be satisfied before it is allowed to proceed to actual allocation. This approach prevents steps from entering allocation recovery; however, the more important reason for jobs to wait prior to actual allocation is to allow DBS priority management, not only for a system but also across the JESplex. Once a step enters allocation recovery, there is only a limited amount of control that can be exercised. The HOLD or NOHOLD facility is system oriented and in general is first-in-first-out.
The step call to DBS may be viewed as a request for “permission” to enter allocation. If DBS determines that the requirements can be satisfied, it then:
- Adjusts the counts to reflect the needs as determined by “best effort.”
- Posts the caller so it can proceed to allocation.
If DBS determines that the requirement cannot be satisfied, it then:
- Records the “best effort” requirements for a job/step. It also records the DBS priority.
- Denies the step permission to enter allocation.
When drives become available (normally at step termination of other jobs), DBS asks:
- Is any step in the JESplex waiting for “permission”?
- Can the newly available drives satisfy the requirements of the highest priority step?
If the answer is YES, DBS posts the appropriate step or steps. This is the mechanism that honors DBS priority. Steps with equal priority are handled first-come-first-served.
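The permission and posting logic above can be modeled as a priority queue for a single Pool. The sketch below is an assumption-laden illustration, not DBS code: the class and method names are hypothetical, a larger number is assumed to mean a higher priority, and only one Pool is modeled. It also applies the rule, stated under DBS Job Status, that no drives are handed out until the top-priority waiting step can be satisfied.

```python
import heapq
import itertools


class PermissionQueue:
    """Single-Pool model of steps asking for permission to enter allocation."""

    def __init__(self, available: int):
        self.available = available
        self._waiting = []                 # heap of (-priority, arrival, name, drives)
        self._arrival = itertools.count()  # tie-breaker: first come, first served

    def _post_ready(self) -> list[str]:
        """Post waiting steps in priority order until the top one does not fit."""
        posted = []
        # No best fit: stop as soon as the top-priority step cannot be satisfied.
        while self._waiting and self._waiting[0][3] <= self.available:
            _, _, name, needed = heapq.heappop(self._waiting)
            self.available -= needed
            posted.append(name)
        return posted

    def request(self, name: str, priority: int, drives: int) -> bool:
        """Record the request; return True if permission is granted immediately."""
        heapq.heappush(self._waiting, (-priority, next(self._arrival), name, drives))
        return name in self._post_ready()

    def release(self, drives: int) -> list[str]:
        """Return drives at step termination and post any steps that now fit."""
        self.available += drives
        return self._post_ready()


q = PermissionQueue(available=3)
print(q.request("STEP_A", priority=5, drives=2))  # True
print(q.request("STEP_D", priority=5, drives=1))  # True
print(q.request("STEP_B", priority=1, drives=1))  # False: no drives left
print(q.request("STEP_C", priority=9, drives=2))  # False: no drives left
print(q.release(1))  # []: STEP_C needs 2, and STEP_B is not served ahead of it
print(q.release(2))  # ['STEP_C', 'STEP_B']
```

Keeping the arrival counter in the heap key is what gives equal-priority steps first-come-first-served ordering.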
After static allocation is done, a call is made again to inform DBS of the actual drive counts that have been allocated. DBS then makes the necessary adjustments by comparing “best effort” to actual allocations. Certain types of discrepancies are logged. This allows a review of what might be causing the differences so that “best effort” calculations can be improved.
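As a rough illustration of the adjustment step, the following sketch corrects one Pool's available count once the actual allocation is known. The function name and the single-Pool, single-count model are assumptions for the example. Which discrepancies DBS logs is not specified here, so the sketch simply returns the difference.

```python
def reconcile(available: int, best_effort: int, actual: int) -> tuple[int, int]:
    """Correct a Pool's available count after actual allocation.

    `available` is assumed to already reflect the best-effort count, which was
    subtracted when permission was granted. Returns (new_available, discrepancy).
    """
    discrepancy = actual - best_effort
    return available - discrepancy, discrepancy


print(reconcile(available=4, best_effort=2, actual=3))  # (3, 1): under-counted by one
print(reconcile(available=4, best_effort=2, actual=2))  # (4, 0): no adjustment needed
print(reconcile(available=4, best_effort=2, actual=1))  # (5, -1): over-counted by one
```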
It should be noted that one of the triggers for DBS to CLOSE a Pool or Pools is the denial of “permission” for any step to enter allocation. The Pool or Pools causing the delay are immediately CLOSED.
DBS is notified when a step has completed execution. It then:
- Returns the drives to available status.
- Determines if the high watermark for the corresponding job has to be adjusted.
- Determines if any step is waiting for the type and number of drives that are now available. If that is the case, it posts the step or steps.
DBS is also notified when a job terminates. At this point the only action required is to de-register the job.
DBS Job Status
A DBS managed job can be in one of the following states:
- In the JES2 execution queue waiting to be selected. Its turn has not arrived yet.
- In the JES2 execution queue waiting for a DBS Pool or Pools to be OPENED:
  - If the Pool is CLOSED at the JESplex level, it is the equivalent of the job being in HOLD status.
  - If the Pool is CLOSED for a particular system, it is equivalent to the job having affinity to the systems that have the needed Pools OPENED.
- In the JES2 execution queue with no DBS system affinity to any system because the job exceeds the JESplex drive resources. This situation is Policy dependent. It is a “conceptual” HOLD since the job is not in a hard HOLD. It is similar to the status of a job when no scheduling environment is active that can satisfy the job requirements.
- The job could be executing a step with no drive requirements. In this state there is no direct relationship between the step and availability of drives. DBS knows about future requirements for the job and that information is used by the overallocation algorithm, but it has no direct bearing on the values for Pools.
- The job could be executing a step that requires tape drives. In that case there is a direct relationship between the step and the count values of the Pools the step needs. Any drives the step is using were moved from available status to in-use status and will stay that way until the step terminates.
- The job could be waiting in a step for DBS “permission” to enter allocation. DBS is aware of the step requirements and its priority. When resolving the situation it does not do best fit. The highest priority/earliest arrival step is served first. No drives are given from that Pool to any step request in the JESplex until the top priority waiting step can be satisfied. Steps can jump ahead of other steps if they have higher priority; however, this will not cause the lower priority steps to wait “forever.” Since the Pool or Pools needed are CLOSED there is only a finite number of steps running in the JESplex. Before any other job is allowed to initiate, they will all be served as other steps terminate and drives are returned to the available pool.
- Under unusual circumstances a job could be in allocation recovery. This could occur under the following conditions:
  - DBS only controls jobs that are DBS-managed. It does not serialize other tasks. It is possible for DBS to grant “permission” to a step to enter allocation, and for a poacher to act between the time permission is granted and the time allocation begins for the step. There is a window of opportunity for a poacher to “steal the drives.” The more poachers you have running in high priority service classes, the more likely this is to occur.
  - The other situation is when “best effort” under-counts the number of drives needed. There might not have been any additional available drives when permission was granted, but if the actual requirement had been known the step would have been placed in a wait status.
Once the job terminates DBS does not keep any historical information about that particular job. It accumulates historical trends to calculate overallocation factors, but no job-specific details. If you need downstream data about the jobs managed by DBS or about tape drive utilization, you can request DBS to generate SMF records.
The DBS System Affinity Mask
There is an additional consideration in the job selection process: if you have asymmetric systems (as a result of a Policy or Configuration) the following situation could occur:
- A particular system has a significantly reduced drive count. (If it is zero then there is no problem because the Pool will always be CLOSED in that system). As a result there might be jobs that exceed the available resources for that system.
- These jobs cannot be run in that system (or systems).
- When there are drives available (it could be as few as 1) the Pool will be OPENED.
- Unless something additional is done to prevent the jobs that exceed requirements from being selected, the “is this Pool OPENED?” job selection logic will allow them to continue to job initiation. This is not a very good idea.
To handle the situation described above, DBS constructs a DBS System Affinity mask in addition to the Drive Pool masks. Jobs that exceed the resources of a particular system will not have DBS affinity to that system. The first thing that DBS does during the selection process is to test the affinity.
In addition to asymmetric systems, you can have a situation in which a Policy is loaded with a significantly reduced number of drives for the JESplex. This could be the result of hardware out of service, or rearrangement of availability of drives across JESplexes at different times of the day. Any job that exceeds the resources made available by the Policy for the JESplex will have all its DBS System Affinity removed so it will not be selected by any system.
Note that if the Configuration for the JESplex cannot satisfy the requirements for a particular job, this is a “will never be able to run” situation regardless of the Policy, since a Policy cannot exceed the resources defined in a Configuration. This situation usually calls for an informative message and a JCL error, but if desired, you can use the JAL statement DBS HOLD to request that the job be placed in HOLD in the MHS_TM category.
The DBS System Affinity mask should not be confused with the OPEN/CLOSE mechanism used for job selection:
- The DBS System Affinity mask addresses the problem of jobs exceeding existing resources as described by the Policy (either at the system level or JESplex level).
- OPEN/CLOSE manages the normal job selection to maximize drive usage without overallocation problems.
Other Considerations
DBS is also called during step execution in two situations:
- A dynamic allocation has taken place that required tape drives.
- A FREE=CLOSE has caused a drive to be unallocated.
Normally, the job analysis process cannot detect dynamic allocation requirements. If the installation takes no action to let DBS know about the dynamic allocation of drives by a particular job/step, the following takes place:
- Dynamic allocation will enter system allocation without DBS having any knowledge:
  - If it can be satisfied, DBS is invoked after allocation has taken place. As a result of this call the in-use counts for the job will be updated and the availability counts will be reduced. As with step allocation, the actual job might have to be registered if there were not any apparent drive requirements at the time of analysis.
  - If it cannot be satisfied, the caller receives (in most cases) an “unable to allocate” return code from dynamic allocation. It is then up to the caller to decide what to do. In a number of cases that can result in a step failure. In other cases, the executing code might have a wait-and-try-again mechanism based on some reasonable time interval. The behavior of a particular step is situational and normally only known to the installation. No generalizations can be made, other than “if drives are always available things are always OK.”
For situations where the unavailability of a tape transport might result in a step failure, DBS provides a facility to prevent that occurrence. As you might expect the need has to be externalized. A job/step can request, either in DAL or JECL, that a number of drives of a particular type be RESERVED. This is in addition to the ones needed and externalized in JCL. The step is not given “permission” to proceed to allocation until the sum total of the “best effort” count and the number requested to be RESERVED are available. The RESERVED count will continue to be in effect until the step terminates.
If and when dynamic allocation occurs, and the actual drives are of a different type from the ones that were RESERVED, they are treated as additional requests and added to the total.
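The RESERVED rule for granting “permission” can be sketched as a per-type comparison. This is a minimal sketch under stated assumptions: the function name and the drive type names `TYPE_A` and `TYPE_B` are hypothetical, and counts are kept per drive type in a `Counter`. The handling of dynamic allocations of a different type is not modeled.

```python
from collections import Counter


def may_enter_allocation(best_effort: Counter, reserved: Counter, available: Counter) -> bool:
    """True only if best-effort plus RESERVED drives are available for every type."""
    required = best_effort + reserved
    # A Counter returns 0 for a drive type it has never seen.
    return all(available[drive_type] >= count for drive_type, count in required.items())


best_effort = Counter({"TYPE_A": 2})             # found by job analysis (from JCL)
reserved = Counter({"TYPE_A": 1, "TYPE_B": 1})   # requested as RESERVED

print(may_enter_allocation(best_effort, reserved, Counter({"TYPE_A": 3, "TYPE_B": 1})))  # True
print(may_enter_allocation(best_effort, reserved, Counter({"TYPE_A": 2, "TYPE_B": 1})))  # False
```

In the second call the step waits even though its JCL requirement of two drives could be met, because the reserved drive could not also be guaranteed.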
The case of FREE=CLOSE is a simple one. The appropriate number of drives is moved from in-use to available. The particular step has its in-use count reduced.
A Schematic Description of DBS Mask Management
This section is intended for readers who are more oriented towards schematic descriptions rather than long explanations. It reiterates what the previous sections have described. It focuses on the role played by the DBS System Affinity masks during job selection.
For the purpose of illustration, let’s assume the following:
- A JESplex with two participant LPARs (member A and member B).
- A DBS Configuration with 6 Pools defined.
- To simplify the description, the size of the Drive Pool mask allows for up to 8 Pools. The actual mask allows for up to 128 Pools.
- Similarly, the JESplex mask is restricted to 8 systems. The actual mask has room for 32 systems.
When a Policy is activated in a JESplex with no DBS-managed jobs running, all the Pools are OPENED since there is no usage. DBS creates the masks as follows:
JESplex | 1111 | 1100 |
MEMBER A | 1111 | 1100 |
MEMBER B | 1111 | 1100 |
This represents the 6 Drive Pools that are defined in this installation. All the bits are “ON” so all the Pools are OPENed.
The equivalent Work Group masks are more complex. There are 6 Work Groups, and each Work Group needs its own mask. Three sets of these masks are needed: one for the JESplex, one for member A, and one for member B. That makes a total of 18 masks. Each set looks like this:
WORK GROUP 1 | 1111 | 1100 |
WORK GROUP 2 | 1111 | 1100 |
WORK GROUP 3 | 1111 | 1100 |
WORK GROUP 4 | 1111 | 1100 |
WORK GROUP 5 | 1111 | 1100 |
WORK GROUP 6 | 1111 | 1100 |
If the systems are symmetric, as in the case of the example Policy, the JESplex masks and the individual system masks will be the same. Only when there are system asymmetries would the JESplex masks and individual system masks be different.
When a job arrives (let’s call it JOBA), it is analyzed. If it needs tape drives, the DBS System Affinity mask, the Drive Pool mask, and an index to the Work Group masks (which one to use of the possible 6) are created. Let’s say that JOBA needs drives from only one Pool, the second one. The job is assigned to the 4th Work Group. Normally, the installation names the Work Groups. Internally, they are assigned an index value from 00 to 05.
The DBS masks that are constructed for the job are as follows:
DBS System Affinity mask | 1111 | 1111 |
Drive Pool Mask | 0100 | 0000 |
WORK GROUP Index (value 03 for the fourth WORK GROUP) | 0000 | 0011 |
- The DBS System Affinity mask indicates that under the active Policy, the job can run on any system. It does not exceed the available resources.
- The Drive Pool Mask indicates that it needs drives from Pool 2.
- For the Work Group, there is no need for a Work Group Pool mask since the requirements are identical to the Drive Pool mask. What is needed is a pointer to the Work Group mask to be used for job selection. For JOBA the value is 3 to represent the 4th Work Group. (Only in the computer world does counting start at zero!)
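The Drive Pool mask and Work Group index for JOBA can be reproduced with a few lines of code. This is an illustrative sketch only: the helper names are hypothetical, and the 8-bit width and leftmost-bit-first numbering follow the simplified example rather than the real 128-Pool mask.

```python
MASK_BITS = 8  # simplified width used in this example


def drive_pool_mask(pools_needed: list[int]) -> int:
    """Build a mask from 1-based Pool numbers, with Pool 1 as the leftmost bit."""
    mask = 0
    for pool in pools_needed:
        mask |= 1 << (MASK_BITS - pool)
    return mask


def show(value: int) -> str:
    """Format a value as two 4-bit groups, as in the rows above."""
    bits = format(value, f"0{MASK_BITS}b")
    return f"{bits[:4]} | {bits[4:]}"


joba_pool_mask = drive_pool_mask([2])  # JOBA needs drives from the second Pool only
joba_work_group_index = 4 - 1          # 4th Work Group, counted from zero

print(show(joba_pool_mask))         # 0100 | 0000
print(show(joba_work_group_index))  # 0000 | 0011
```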
As jobs arrive and are analyzed, the DBS masks are constructed (among other functions) and then the jobs are placed in the execution queue.
At job selection time, DBS is asked to verify if the selected job should be allowed to proceed to initiation. (Other functions are also called for that purpose.) Let’s suppose that the job that has been selected is the one that was initially described, that is, JOBA. This occurs in member B (the second JESplex system). Let’s further assume that all Pools are OPENED. DBS will do the following:
- Determine if the job can run in this system. Since the DBS System Affinity mask is like this (shown before):
DBS System Affinity mask | 1111 | 1111 |
JOBA can be selected in any system, so the first requirement is satisfied.
- The next verification is related to Work Groups. Here we have a set of 6 masks for the JESplex and a set of 6 masks for member B. From the Work Group index, the correct mask for Work Groups is the 4th one. This is true for the JESplex and for member B. DBS then:
- Extracts the two masks.
- ANDs the two masks. A bit must be “ON” in both masks, indicating that the Pool is OPENED at the JESplex level and for member B.
- The JOBA Drive Pool mask is used for the determination of Work Groups and Drive Pools. In this case the Pool masks will look like this:
Job Mask | 0100 | 0000 |
JESplex and member B mask (ANDed) | 1111 | 1100 |
- The Pool needed by JOBA is OPENED for its Work Group, so DBS proceeds to the next and final verification.
- The Drive Pool verification does not require any index since there is only one mask for the JESplex and one mask for member B. The process is identical to the one illustrated above. JOBA, from a DBS point of view, can be allowed to proceed to initiation.
For JOBA to be affected the 2nd Drive Pool or the 4th Work Group Pool must be CLOSED. Any other activity for other Pools or Work Groups does not affect the eligibility for JOBA’s selection.
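The three verifications walked through above (affinity, Work Group, Drive Pool) can be put together in one short sketch. All class, field, and function names are hypothetical, and the masks use the simplified 8-bit layout of the example with position 1 as the leftmost bit. It illustrates the order of the checks and is not the DBS implementation.

```python
from dataclasses import dataclass

MASK_BITS = 8  # simplified width used in this example


def bit(position: int) -> int:
    """Mask with only the given 1-based position set, counting from the left."""
    return 1 << (MASK_BITS - position)


@dataclass
class PoolMasks:
    drive_pools: int        # one Drive Pool mask
    work_groups: list[int]  # one mask per Work Group, indexed from 0


@dataclass
class Job:
    affinity: int
    drive_pool_mask: int
    work_group_index: int


def verify(job: Job, system_number: int, jesplex: PoolMasks, member: PoolMasks) -> str:
    # 1. Can the job run on this system at all?
    if not job.affinity & bit(system_number):
        return "rejected: no DBS System Affinity"
    # 2. Are the needed Pools OPENED for the job's Work Group?
    index = job.work_group_index
    if job.drive_pool_mask & ~(jesplex.work_groups[index] & member.work_groups[index]):
        return "rejected: Work Group Pool CLOSED"
    # 3. Are the needed Drive Pools OPENED?
    if job.drive_pool_mask & ~(jesplex.drive_pools & member.drive_pools):
        return "rejected: Drive Pool CLOSED"
    return "accepted"


all_open = PoolMasks(0b1111_1100, [0b1111_1100] * 6)
joba = Job(affinity=0b1111_1111, drive_pool_mask=bit(2), work_group_index=3)
print(verify(joba, 2, all_open, all_open))  # accepted

# CLOSE Pool 2 for the 4th Work Group on member B only.
member_b = PoolMasks(0b1111_1100, [0b1111_1100] * 3 + [0b1011_1100] + [0b1111_1100] * 2)
print(verify(joba, 2, all_open, member_b))  # rejected: Work Group Pool CLOSED
```

CLOSing any other Pool, or Pool 2 for a different Work Group, leaves the result for JOBA unchanged.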
To show the role of the DBS System Affinity mask, let’s assume that a new Policy has been activated. This Policy, for whatever reason, drastically reduces the number of drives available to member A. At Policy activation time DBS reviews the jobs in the execution queue to see if any one of them exceeds the reduced number of drives available to member A. Any job found to exceed resources will have the member A (system 1) affinity removed, so the mask will look like this:
Job Mask | 0111 | 1111 |
None of these jobs will pass the first step in the DBS verification process in member A, so they will not be selected.
The mechanism described above is a highly efficient process that can handle the complexity of the many combinations and permutations resulting from the interaction of Drive Pools, Work Group Pools, JESplex resources, and asymmetric systems.
So the job selection process depends on two things:
- The DBS System Affinity of the job.
- The status of the Pools it needs.
The next two sections give a brief explanation about the processes of setting DBS System Affinity masks and the OPENing and CLOSing of Pools.
Managing the DBS System Affinity Assignment
The initial DBS System Affinity mask for the job is constructed at job analysis time. It is based on the Policy that is active at the time of analysis. If a job’s requirements exceed the resources defined at the JESplex level in the Configuration, the job is failed with an informative message and a JCL error, because its requirements can never be satisfied regardless of the Policy.
If a job exceeds the resources available to one or more systems with the active Policy, the DBS System Affinity mask is constructed to reflect the situation. The job is allowed to proceed to the execution queue.
When a new Policy is activated, the activation process reviews the Policy resources and the jobs in the queue and modifies their DBS System Affinity masks if necessary. If the new Policy is richer in resources, affinities are added. If its resources are more limited, affinities are removed.
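Recomputing a job's DBS System Affinity mask at Policy activation can be sketched as follows. This is a hedged illustration: the function names, the Pool name `POOL2`, and the per-Pool dictionaries of drive counts are assumptions made for the example. As in the illustration earlier, the mask is 8 bits wide, system 1 is the leftmost bit, and bits start “ON.”

```python
MASK_BITS = 8  # simplified width used in this example


def exceeds(job_needs: dict[str, int], drives: dict[str, int]) -> bool:
    """True if the job needs more drives from any Pool than are offered."""
    return any(count > drives.get(pool, 0) for pool, count in job_needs.items())


def affinity_mask(job_needs: dict[str, int],
                  jesplex_drives: dict[str, int],
                  system_drives: list[dict[str, int]]) -> int:
    """Compute the affinity mask for one job under one Policy."""
    if exceeds(job_needs, jesplex_drives):
        return 0  # exceeds the JESplex resources: no affinity to any system
    mask = (1 << MASK_BITS) - 1
    for index, drives in enumerate(system_drives):
        if exceeds(job_needs, drives):
            mask &= ~(1 << (MASK_BITS - 1 - index))  # remove affinity to this system
    return mask


job = {"POOL2": 4}
old_policy = [{"POOL2": 6}, {"POOL2": 6}]  # member A, member B
new_policy = [{"POOL2": 2}, {"POOL2": 6}]  # member A drastically reduced

print(format(affinity_mask(job, {"POOL2": 6}, old_policy), "08b"))  # 11111111
print(format(affinity_mask(job, {"POOL2": 6}, new_policy), "08b"))  # 01111111
print(format(affinity_mask(job, {"POOL2": 3}, new_policy), "08b"))  # 00000000
```

Running the same function again when a richer Policy is activated restores the removed bits, which is how affinities are added back.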
OPENing and CLOSing Pools
The decision to OPEN/CLOSE Pools is made by two separate processes:
- The DBS allocation algorithm. This is the more important and more complex of the two processes.
- The step initiation process.
The DBS allocation algorithm is invoked at regular intervals. It can execute in any of the participating systems in a given JESplex. When it runs, it performs a JESplex wide evaluation. From this evaluation it determines what the new masks should be. These new masks are broadcast to all the systems using XCF. Each system thus has the most current JESplex/system mask in memory. The job selection process always uses the masks reflecting the current conditions.
At step initiation, if any request cannot be satisfied because there are not sufficient drives available, a signal is sent to have the “deficient” Pool CLOSED. It should be noted here that the step initiation process is not aware of whether a particular Pool is OPENED or CLOSED. The step authorization process is based on the number of drives needed by the step and the number of drives that are available. The fact that a Pool is CLOSED does not mean that there are no drives available. The over-allocation algorithm might decide that there are enough jobs running with high watermarks that could result in over-commitment. In that case the Pool is CLOSED; however, the running steps will have their requests satisfied. It must be remembered that the point of the exercise is to run at the edge. That is, when a step request is made, there are just about enough available drives to satisfy the request.
So to repeat:
- The fact that a Pool is CLOSED does not necessarily mean that there are no available drives. The DBS allocation algorithm might have determined that initiating another job that needs drives from that Pool will exceed the over-allocation factors. That is, the statistical possibility of causing an allocation recovery to occur is too high.
A Pool is always CLOSED when DBS has to make a step wait, however. It indicates that DBS has gone beyond the edge for the Pool so no more load can be accepted until the situation returns to normal.
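The two triggers for CLOSing a Pool can be illustrated with a deliberately simplified rule. The actual DBS allocation algorithm is not described in detail here, so everything in this sketch is an assumption except the rule that a waiting step always CLOSEs the Pool: the function name, the single overbooking factor, and the comparison of summed high watermarks against a scaled drive count are hypothetical, and historical trends are not modeled.

```python
def pool_should_be_open(total_drives: int,
                        high_watermarks: list[int],
                        overbooking_factor: float,
                        steps_waiting: int) -> bool:
    """Hypothetical OPEN/CLOSE rule for one Pool."""
    # From the text: a Pool is always CLOSED while a step has to wait.
    if steps_waiting > 0:
        return False
    # Assumption: CLOSE once the committed high watermarks of executing jobs
    # reach the drive count scaled by the overbooking factor.
    return sum(high_watermarks) < total_drives * overbooking_factor


print(pool_should_be_open(10, [4, 4, 4], 1.5, steps_waiting=0))     # True: 12 < 15
print(pool_should_be_open(10, [4, 4, 4, 4], 1.5, steps_waiting=0))  # False: 16 >= 15
print(pool_should_be_open(10, [4], 1.5, steps_waiting=1))           # False: a step is waiting
```

In the second case the Pool is CLOSED even though drives may still be free, which is the distinction the text draws between a CLOSED Pool and an exhausted one.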
DBS Unavailable Devices
Devices are unavailable (at a system or JESplex level) when they can no longer be considered by DBS to be available for allocation to a job. When a device is no longer available it will no longer contribute to the pool’s available count. The maximum pool counts may need to be logically altered since they cannot be higher than the total number of actual devices in the pool.
This may affect the ability of jobs to run. There may be fewer or no systems where a job can run as the result of devices becoming unavailable. The jobs will not be able to run because there will be no system where the job has DBS affinity. The affected jobs will remain on the queue but will show a status of ‘DBS affinity’. Examining the job or jobs with the DBS dialogue will show the details of ‘why the job is not running’. By drilling down to the device level the displays will also indicate the reason the device(s) are unavailable.
- If jobs do not have DBS Affinity to any system, the DBS application will handle the jobs depending on the number of devices in the required pools. As long as there are enough devices to allow the possibility of the job running, the job will just remain on the queue but will not execute. Even though some of the devices going into the count may be unavailable, they are still counted for the purpose of determining how to handle the job.
- If there are insufficient devices in the current configuration to satisfy the job’s requirements even when unavailable devices are added into the count, then the job will be handled differently. If the job is already on the queue when this occurs, it will be put in MHS_HOLD for DBS. If a job hits this situation when it is analyzed, it will be put in MHS_HOLD and then requeued for analysis.
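The two cases above amount to a three-way decision, sketched here for a single pool on a single system. The function name, its parameters, and the returned strings are hypothetical labels chosen for the example; only the decision order comes from the text.

```python
def job_disposition(needed: int,
                    available_devices: int,
                    unavailable_devices: int,
                    at_analysis: bool = False) -> str:
    """Classify a job after devices in its pool become unavailable."""
    if needed <= available_devices:
        return "selectable"
    # Unavailable devices still count toward "could this job ever run?"
    if needed <= available_devices + unavailable_devices:
        return "stays on queue without DBS affinity"
    if at_analysis:
        return "MHS_HOLD, then requeued for analysis"
    return "MHS_HOLD"


print(job_disposition(needed=3, available_devices=4, unavailable_devices=2))  # selectable
print(job_disposition(needed=5, available_devices=4, unavailable_devices=2))  # stays on queue without DBS affinity
print(job_disposition(needed=7, available_devices=4, unavailable_devices=2))  # MHS_HOLD
```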
When device(s) in a pool are made unavailable with the command DBS SET DEV device#UNAVAILABLE, some existing DBS jobs may be affected. A job that was selectable may no longer be selectable because it required a count that included the device that has been made UNAVAILABLE. These jobs will not be selected because the DBS affinity will now reflect the change in the available devices. They will NOT be held.
When enough devices are made available again the DBS affinity will be altered to make them selectable.