10.1 Introduction


The duplicate file facility maintains a database of files that can be used as a source for merge backups (both full and incremental) rather than having to read the files from the original system.  It is a very simple and very powerful form of deduplication.

What makes them so powerful is that once files are in the duplicate database, they do not have to be opened or read to be available for restores when they appear on a system.  The system automatically puts the files, or references to them, into the backup. This results in blazingly fast backups. If you use the reference method, it makes very small backups as well.

When requested to use duplicate files, the client system examines the modification date of each new file it detects (through the archive bit, modification date, incremental database, or whatever method it uses).  If the file is older than a specified number of days (30 by default), the client considers it a potential duplicate and sends a placeholder record to the Reservoir, even for an incremental backup.

The Reservoir looks the file up in the duplicate database, ignoring the path information but using the file's unqualified name, modification date and time, and exact size. If all of these match, the file is considered a potential duplicate.
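The matching rule described above can be sketched as follows. This is an illustrative sketch only, not the actual Reservoir implementation: the function names, the key layout, and the age-threshold logic are assumptions made for clarity.

```python
from datetime import datetime, timedelta
from pathlib import PurePath

# Default age threshold: files modified within this many days
# are not considered duplicate candidates (assumption: 30, per the text).
DUPLICATE_AGE_DAYS = 30

def is_duplicate_candidate(mod_time: datetime, now: datetime) -> bool:
    """Client side: a file old enough is a potential duplicate."""
    return now - mod_time > timedelta(days=DUPLICATE_AGE_DAYS)

def duplicate_key(path: str, mod_time: datetime, size: int):
    """Reservoir side: the path is ignored; only the unqualified name,
    modification date/time, and exact size form the match key."""
    return (PurePath(path).name, mod_time, size)
```

Note that two files with different paths but the same name, timestamp, and size produce the same key, which is exactly why path information is ignored in the lookup.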

This system has been available in UPSTREAM z/OS for over 15 years, but its implementation in the Reservoir is somewhat different.

(Image: full merge backup to disk with data copied from the duplicate database)

The example above shows a full merge backup to disk where the data is copied from the duplicate database into the new backup. Note that many of the files come from the duplicate repository. Copying the data is the safest method because it does not require the duplicate repository to be available the next time you use the backup (for a restore, a vault/migrate/copy, or any other operation).

(Image: full merge backup to disk with files pointing to the duplicate database)

The example above is again a full merge backup to disk. However, because the files point to the duplicate database rather than being copied into the backup, the resulting backup is substantially smaller. On some systems it can be a fraction of the original size.

This is the fastest and most efficient method.  However, the duplicate database must be available at restore or vault/migrate/copy time for the operation to work.  Thus, if you are using your backups for disaster recovery, the first step would be to restore the duplicate database from backups that do not use it.

The Reservoir can have any number of duplicate databases, identified by simple numbers starting at one (similar to vaulting). A configuration utility in the Profile tab of the Director lets you define each duplicate database, including whether files are copied from the database or point back to it.

Finally, once you have defined a duplicate database, you set your profile to use it. From that point forward it is both a source and a target of duplicate files.

At maintenance time, if so configured, all disk backups for the profiles in a duplicate repository are examined for duplicate files. If the number of occurrences of a file reaches the value you set (the default is 3), and the file meets the size and other criteria you configure, the file is added to the database.

(Image: duplicate file count example with the count set to two)

In the example above, the count is set to two: when a file is found twice in the disk backups for a group of profiles configured to source a given duplicate database, it is copied into the database at maintenance time.

Once a file is in the database any instance of the file can come from the database rather than having to be read from the system. 

Also during maintenance, the backups are examined to determine whether any file in the duplicate repository is no longer present on the system. Such files are marked for deletion. When the space marked for deletion reaches a certain percentage of the database (the default is 50%), the database is compacted and that space is reclaimed.
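The compaction trigger described above amounts to a simple percentage check. The sketch below is an assumption-laden illustration (the function and the byte-based accounting are not from the product documentation); only the 50% default comes from the text.

```python
# Default compaction threshold: compact when this percentage of the
# database is marked for deletion (per the text, the default is 50%).
COMPACT_THRESHOLD_PCT = 50

def needs_compaction(total_bytes: int, deleted_bytes: int,
                     threshold_pct: int = COMPACT_THRESHOLD_PCT) -> bool:
    """Return True when the space marked for deletion has reached
    the configured percentage of the database's total space."""
    if total_bytes == 0:
        return False  # empty database: nothing to compact
    return (deleted_bytes * 100) / total_bytes >= threshold_pct
```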

When a backup is run through the uscopy utility for a copy/migrate/vault operation, each duplicate file is copied to the output storage as a safety measure in case the database is not available. A file can be represented by a pointer only in the original backup or in subsequent merge backups.

Duplicate files are stored with their metadata removed; UNIX security information and other directory information come from the original file. Special files, device files, sparse files, encrypted files, and PlugIn files are not eligible for the duplicate database (with the exception of WinSS files that are not part of a component or writer).

You can also exclude files from the duplicate database by specifying an exclude file at configuration time. The exclude file lists files that you know are not good candidates for the duplicate database (for example, because their modification dates change).

The idea for duplicate files is that it is the data itself that is duplicated. 

This section describes how to set up duplicate files on your system. The facility is designed to be as trouble-free as possible and to offer the highest level of safety and performance.  Feel free to contact Innovation Technical Support if you have any questions about the facility.


BMC Compuware Upstream Reservoir 3.09