This topic provides recommendations for backing up and restoring the BMC Database Automation Manager to help you implement solutions to recovery scenarios.
A multi-site disaster recovery scenario is used to further illustrate how these solutions can be implemented.
Disaster recovery enables an administrator to recover the BMC Database Automation Manager from an existing backup to:
While the previous list does not address all possible scenarios, it should be an adequate foundation upon which other recovery strategies can be built. This topic also describes different methods to back up the operational BMC Database Automation instance in line with the existing business requirements, and provides guidance on recovery from the following classes of failures:
The architecture illustrated below can be used to demonstrate multiple recovery scenarios, covering both intra- and inter-datacenter recovery, to the same or a different host. The application data (on the 'Data' spindle) can be offloaded to the other datacenter using the administrator's preferred protocol (ssh, ftp, and so forth). As illustrated, the backup is generated by a script, and both generating and transferring it can be configured to run using a system scheduler. The 'Backup' storage does not need to be directly connected to the Red Hat Enterprise Linux (RHEL) host that represents the target (although direct connectivity simplifies the restore). It is also strongly encouraged that a backup to tape be executed at the recovery site.
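A minimal sketch of the generate-and-transfer flow described above is shown below. The directory names, host name, wrapper script path, and schedule are examples only, and the exact clarity_backup.pl arguments are covered in Creating a backup.

#!/bin/sh
# Example wrapper: generate the backup, then offload it to the recovery site.
cd /data/bda_backups || exit 1
# Run the backup here; see Creating a backup for the exact clarity_backup.pl syntax.
# /app/clarity/manager_scripts/bin/clarity_backup.pl <arguments>
# Copy the newest tarball to the recovery-site host over ssh.
LATEST=$(ls -t clarity_backup-*.tar.gz | head -1)
scp "$LATEST" backupuser@dr-manager.example.com:/backup/bda/

A crontab entry such as the following (time of day is an example) would run the wrapper nightly:

0 2 * * * /usr/local/bin/bda_backup_offload.sh >> /var/log/bda_backup.log 2>&1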
This section provides an overview of the process for archiving, backing up, and restoring BMC Database Automation using the backup script.
The clarity_backup.pl script is the primary backup-to-disk tool provided by BMC to back up the BMC Database Automation Manager solution. It can be found here:
/app/clarity/manager_scripts/bin/clarity_backup.pl
For syntax and commands used with the script, see Creating a backup.
If executed without any arguments, the script displays usage examples for both backup and restore operations. A backup run creates a gzipped tarball in the working directory, which can be sizable depending on your installation, so proceed with caution.
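Because the tarball is written to the working directory, it is worth confirming that enough free space is available before running a backup. The following quick check is only a rough, conservative sketch (the compressed tarball is normally smaller than CLARITY_ROOT):

# Approximate size of the data to be backed up (CLARITY_ROOT, default /app/clarity)
du -sh /app/clarity
# Free space on the file system holding the working directory
df -h .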
A recursive backup of CLARITY_ROOT (default is /app/clarity) will be performed to the working directory (in GZIP format). This includes:
BMC recommends that you have several copies of this backup to facilitate the recovery scenarios outlined above. Recommended locations are:
Also, you should leverage existing organizational procedures for Oracle disaster recovery. Consult the Oracle documentation if you need additional information (see included links in Related topics on this page). For more information on the data warehouse, see Data warehouse.
The clarity_backup.pl script is also responsible for restore activities, and the basic syntax is detailed below. It should be assumed that all data backed up will be restored into or over the existing installation.
For syntax and commands used with the script, see Archiving, restoring jobs, and creating backup files.
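For reference, the restore invocation used later in this topic follows this basic pattern (the tarball name is an example):

/app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz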
Note
This tool takes the Manager offline for the duration of the script run. The downtime incurred depends on the amount of data to be backed up and the performance of the system, but a good estimate is 15-30 minutes. In addition to the file-level backup, the Postgres database is shut down for the offline backup or restore procedure, so there is a service disruption while the backup runs. You must plan accordingly.
Note

Before recovering managers in a Multi-Manager environment, ensure that you have performed a backup in the following order:

If a Content Manager is lost, the entire mesh must be rebuilt. The new server must have the same exact hostname and IP addresses as the server being replaced.

If a Satellite Manager is lost, the entire mesh does not need to be rebuilt; instead, a replacement satellite can be brought into the mesh. The new server must have the same exact hostname and IP address as the satellite being replaced.

Recovering the Content Manager
1. Run the deconfigure_megamesh command:
/app/clarity/manager_scripts/bin/deconfigure_megamesh
2. Run the following command on the Satellite Manager:
psql -h localhost -U tcrimi "GridApp" -c "select * from _megamesh.sl_node"
If the output of this command displays the registered node, go to the next step; otherwise, skip the next two steps.
3. Run the following command to drop the megamesh schema from the Satellite Manager:
psql -h localhost -U tcrimi "GridApp" -c "DROP SCHEMA _megamesh CASCADE"
4. Run the following command again; it should now throw an error stating that schema _megamesh does not exist:
psql -h localhost -U tcrimi "GridApp" -c "select * from _megamesh.sl_node"
5. Run the configure_megamesh command on the Satellite Manager:
/app/clarity/manager_scripts/bin/configure_megamesh
6. Run the deconfigure_megamesh command:
/app/clarity/manager_scripts/bin/deconfigure_megamesh
7. Restore the backup (-k option):
/app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
8. Run the configure_megamesh command:
/app/clarity/manager_scripts/bin/configure_megamesh
Recovering the Satellite Manager

Note the "pa_conninfo" for the content manager and the lost satellite:
su postgres
psql -U GridApp
GridApp=# select * from _megamesh.sl_path;
 pa_server | pa_client | pa_conninfo                                                                                  | pa_connretry
-----------+-----------+----------------------------------------------------------------------------------------------+--------------
         3 |         1 | host=rh5-mm009-03.gridapp-dev.com dbname=GridApp user=megamesh port=5432 password=password  |           10
         1 |         3 | host=rh5-mm009-01.gridapp-dev.com dbname=GridApp user=megamesh port=5432 password=password  |           10
(2 rows)

GridApp=# select * from _megamesh.sl_node;
 no_id | no_active | no_comment                                    | no_spool
-------+-----------+-----------------------------------------------+----------
     1 | t         | Node 1 - GridApp@rh5-mm009-01.gridapp-dev.com | f
     3 | t         | Node 2 - GridApp@rh5-mm009-03.gridapp-dev.com | f
(2 rows)
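If you prefer to capture only the connection information non-interactively, a single psql command such as the following (using the same local connection shown later in this procedure) returns just the sl_path rows:

psql -h localhost -U postgres GridApp -c "select pa_server, pa_client, pa_conninfo from _megamesh.sl_path"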
Create a slony script to remove this node from the cluster. Create a file named slonik_script.slonik in /tmp with the following contents:

cluster name = megamesh;
node <satellite manager no_id> admin conninfo='<satellite manager pa_conninfo> user=megamesh_config port=5432 password=<megamesh_config user's password>';
node <content manager no_id> admin conninfo='<content manager pa_conninfo> user=megamesh_config port=5432 password=<megamesh_config user's password>';
try {
    drop node (id = <satellite manager no_id>, event node = 1);
} on error {
    echo 'Failed to drop node from cluster';
    exit 1;
}
For example, to remove rh5-mm009-03 from this mesh:

cluster name = megamesh;
node 3 admin conninfo='host=rh5-mm009-03.gridapp-dev.com dbname=GridApp user=megamesh_config port=5432 password=password';
node 1 admin conninfo='host=rh5-mm009-01.gridapp-dev.com dbname=GridApp user=megamesh_config port=5432 password=password';
try {
    drop node (id = 3, event node = 1);
} on error {
    echo 'Failed to drop node from cluster';
    exit 1;
}
Run the script:

slonik < /tmp/slonik_script.slonik
Verify that the satellite node has been removed from the cluster:

psql -h localhost -U postgres GridApp
GridApp=# select * from _megamesh.sl_node;
 no_id | no_active | no_comment                                    | no_spool
-------+-----------+-----------------------------------------------+----------
     1 | t         | Node 1 - GridApp@rh5-mm009-01.gridapp-dev.com | f
(1 row)
Run the deconfigure_megamesh command:

/app/clarity/manager_scripts/bin/deconfigure_megamesh

Restore the backup:

/app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz

Run the configure_megamesh command on the Satellite Manager:

/app/clarity/manager_scripts/bin/configure_megamesh
To validate that you have restored the archived backup correctly, first ensure the following:
To validate the restore process:
As root, run (replacing the tarball with the filename on your server):
clarity_backup.pl -r clarity_backup-2011-10-18-24920.tar.gz -v
Review the output and ensure that only 'INFO' and 'WARN' messages are present. The output should terminate with:
WARN Re-starting services:
WARN backup Complete
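If you want a scripted check that no errors occurred, you can capture the same validation run to a log file and search it; for example (the log path is arbitrary):

/app/clarity/manager_scripts/bin/clarity_backup.pl -r clarity_backup-2011-10-18-24920.tar.gz -v 2>&1 | tee /tmp/clarity_restore.log
grep -c ERROR /tmp/clarity_restore.log
# Expect a count of 0; only INFO and WARN lines should be present.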
You must validate the successful restoration of both the Postgres and file-level components. First, checking Users/RBAC confirms that the Postgres data has been loaded successfully. See below for an example of Role validation; perform the same checks for Groups and Users. Also ensure that authentication with one of the enumerated users succeeds.
Validate restoration of Job logs. First ensure that the selection on the jobs page looks accurate (see below), then drill down into an individual job to confirm that the job logs can be retrieved.
For content restoration, review the patch (screenshot below), Action, and Template repositories.
There are multiple Agent constraints that must be taken into consideration in order to have the Agents re-establish connection with the new Manager host.
To optimize MTTR and minimize the Agent issues outlined above, it is strongly recommended that the DNS host record for the failed Manager be updated to point to the IP address of the Manager in the secondary datacenter after the restore is completed. The following diagram illustrates the primary Manager going offline, and the consequences of updating the host record.
Re-homing Agents
In the previous scenario there is one DNS server in each datacenter, both of which are authoritative for the Manager's domain. When the failure in the Primary Datacenter occurs, the host record is updated to point to the new IP address.
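The mechanics of the record update depend on your DNS platform. As one possible sketch, on a BIND server that accepts dynamic updates, nsupdate can repoint the Manager's A record; the host names, key file, TTL, and IP address below are placeholders:

nsupdate -k /etc/rndc.key <<'EOF'
server dns01.example.com
update delete manager.example.com A
update add manager.example.com 300 A 10.20.30.40
send
EOF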
Validating that the Agents are back online is straightforward: a green host icon indicates connectivity, while a red icon does not. See the screenshot for a view of an online node. To complete basic connectivity testing, validate that a log bundle can be downloaded. Completing these two activities effectively tests both UDP and TCP connectivity.