
This topic provides recommendations for backing up and restoring the BMC Database Automation Manager to help you implement solutions to recovery scenarios.

A multi-site disaster recovery scenario is used to illustrate how these recommendations can be put into practice.

Overview

Disaster recovery enables an administrator to recover the BMC Database Automation Manager from an existing backup to:

  • New physical/virtual hardware
  • A new physical/virtual instance in a new datacenter
  • Existing physical/virtual hardware

While the previous list does not address all possible scenarios, it should be an adequate foundation upon which other recovery strategies can be built. This topic also describes different methods to back up the operational BMC Database Automation instance in line with the existing business requirements, and provides guidance on recovery from the following classes of failures:

  • Physical hardware failure (new host, same datacenter recovery).
  • Data corruption (same host, same datacenter recovery).
  • Natural disaster (different host, different datacenter recovery).

Assumptions

  • Backup physical or virtual hardware is available that meets the requirements of the multiple restore strategies listed. For example, to recover from a localized hardware failure, there must be a pre-provisioned virtual/physical host available as the restore target.
  • Backup hardware must have the same operating system and architecture as the failed host. An exact specification match is not required, but performance will vary according to any differences.
  • Multi-Manager configuration is not in use.
  • Where the data warehouse is in use, Oracle Data Guard is pre-configured in a primary/standby relationship.
  • Firewall rules are in place for Agent connectivity for the secondary datacenter. See Network requirements.

Example scenario

The following illustrated architecture can be used to demonstrate multiple recovery scenarios, covering both intra- and inter-datacenter recovery to the same or a different host. The application data (on the 'Data' spindle) can be offloaded to the other datacenter using the administrator's preferred protocol (ssh, ftp, and so forth). As illustrated, the backup is generated by a script, and both generating and transferring it can be configured to run using a system scheduler. The 'Backup' storage does not need to be directly connected to the Red Hat Enterprise Linux (RHEL) host that represents the target, although direct attachment simplifies the restore. A backup to tape at the recovery site is also strongly encouraged.

Reference Architecture (BMC Database Automation)
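
As noted above, both generating the backup and copying it offsite can be driven by the system scheduler. The following cron sketch is illustrative only: the schedule, staging directory, remote host, and account are assumptions, and the exact clarity_backup.pl backup options are documented in Creating a backup.

  # /etc/cron.d/bda_dr_backup -- illustrative schedule; paths, host, and user are assumptions
  # 01:30 - generate the nightly backup (substitute the options described in Creating a backup)
  30 1 * * * root /app/clarity/manager_scripts/bin/clarity_backup.pl <backup options> > /var/log/bda_backup.log 2>&1
  # 03:00 - copy the newest tarball to the standby Manager in the secondary datacenter
  0 3 * * * root scp /backup/bda/clarity_backup-*.tar.gz backup@dr-manager.example.com:/backup/bda/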

Performing a backup and restore for disaster recovery

This section provides an overview of the process for archiving, backing up, and restoring BMC Database Automation using the backup script.

The clarity_backup.pl script is the primary backup-to-disk tool provided by BMC to back up the BMC Database Automation Manager solution. It can be found here:

/app/clarity/manager_scripts/bin/clarity_backup.pl

For syntax and commands used with the script, see Creating a backup.

Prerequisites

  • Installation of BMC Database Automation at the primary site.
  • Physical/Virtual hardware at the secondary datacenter matching the OS/architecture of the dedicated primary BMC Database Automation Manager.
  • A copy of the BMC Database Automation installer media at the backup site, matching the version and architecture of the primary.
  • 'root' access, without which the clarity_backup.pl script will not work.
  • Knowledge of how to configure the system scheduler (crond is acceptable).
  • Adequate disk space to satisfy the business's retention requirements (for both disk and tape/virtual backup copies). The clarity_backup.pl tool performs only full backups; the advantage is that required disk space can be easily forecast from the retention period and the existing backup size, while the downside is the extra disk space each full copy consumes.
  • A WAN link that can sustain a reasonable transfer rate to copy the daily backup to a different datacenter. If the backup cannot be copied offsite daily, the customer must decide how much data loss is acceptable (the Recovery Point Objective, or RPO) in the event of a datacenter-impacting issue.

If executed without any arguments, the script displays both a backup and a restore example. A backup creates a gzipped tarball inside the working directory, which can be sizable depending on your installation, so proceed with caution.

A recursive backup of CLARITY_ROOT (default is /app/clarity) will be performed to the working directory (in GZIP format). This includes:

  • Provisioning, patching, and upgrade templates
  • Clarity Actions
  • Patch Packages
  • MSSQL Media
  • Custom discovery modules
  • Job Logs
  • User accounts, RBAC configuration
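
Before copying a backup offsite, it can be useful to confirm that the tarball is readable by listing its contents. This is a minimal sketch; the filename shown is only an example of the naming pattern used later on this page.

  # List the first entries of the gzipped backup to confirm the archive is intact
  tar -tzf clarity_backup-2011-10-18-24920.tar.gz | head -20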

BMC recommends that you have several copies of this backup to facilitate the recovery scenarios outlined above. Recommended locations are:

  • Same datacenter, same server (disk).
  • Same datacenter, different server (shared storage).
  • Different datacenter.
  • Offload to tape or virtual tape device (multiple datacenters recommended).

Also, you should leverage existing organizational procedures for Oracle disaster recovery. Consult the Oracle documentation if you need additional information (see included links in Related topics on this page). For more information on the data warehouse, see Data warehouse.

The clarity_backup.pl script also performs restore activities. Be aware that all backed-up data is restored into, or over, the existing installation.

For syntax and commands used with the script, see Archiving, restoring jobs, and creating backup files.

Note

This tool takes the Management server offline during script runtime. The amount of downtime incurred depends on the amount of data to be backed up and the performance of the system; a good estimate is 15-30 minutes. In addition to the file-level backup, the Postgres database is shut down for the offline backup or restore procedure, so there is a service disruption while the script runs. Plan accordingly.

Performing disaster recovery for Multi-Manager environments

Note

Before recovering managers in a Multi-Manager environment, ensure that you have performed a backup in the following order:

  1. Back up all Satellite Managers.
  2. After the backup has completed on the Satellite Managers, back up the Content Manager.

Recovering the Content Manager

In the event that the Content Manager is lost, the entire mesh must be rebuilt. The new servers must have exactly the same hostnames and IP addresses as the servers they replace.

  1. Install BDA and configure the Multi-Manager prerequisites on the new servers. For more information, see Installing the Manager software in a Multi-Manager configuration.
  2. Deconfigure the mesh.
    1. On the Satellite Manager(s):
      1. /app/clarity/manager_scripts/bin/deconfigure_megamesh
      2. Remove lines containing the term capability_blacklist from the /app/clarity/dmanager/etc/d2500_config file on the Satellite Manager.
      3. Run the psql -h localhost -U tcrimi "GridApp" -c "select * from _megamesh.sl_node" command on the Satellite Manager. If the output displays the registered node, continue with the next step; otherwise, skip steps 4 and 5.
      4. Run the psql -h localhost -U tcrimi "GridApp" -c "DROP SCHEMA _megamesh CASCADE" command to drop the megamesh schema from the Satellite Manager.
      5. Verify that the node is no longer registered on the Satellite Manager by running the psql -h localhost -U tcrimi "GridApp" -c "select * from _megamesh.sl_node" command again. This command should return an error:
        schema _megamesh does not exist.  
      6. Configure the mesh on the Satellite Manager(s). Note that the mesh must be configured on the Content Manager before executing the configure_megamesh command on the Satellite Manager(s):
        /app/clarity/manager_scripts/bin/configure_megamesh  
      7. Restart all BDA services including the megamesh service on the Satellite Manager.
    2. On the Content Manager:
      1. /app/clarity/manager_scripts/bin/deconfigure_megamesh
      2. Restore the backup on the Content Manager (do not use the -k option):
        /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
      3. Configure the mesh on the Content Manager
        /app/clarity/manager_scripts/bin/configure_megamesh
      4. Restart all BDA services including the megamesh service on the Content Manager.
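
To confirm that the mesh has been rebuilt, you can re-run the node query used earlier in this procedure on the Content Manager. This is a sketch only; node IDs and hostnames will differ in your environment.

  psql -h localhost -U postgres GridApp -c "select * from _megamesh.sl_node"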

Recovering the Satellite Manager

In the event that a Satellite Manager is lost, the entire mesh does not need to be rebuilt. Instead, a replacement satellite can be brought into the mesh. The new server must have exactly the same hostname and IP address as the satellite being replaced.

Note

BMC Software recommends that you contact Customer Support in the event a Satellite Manager is lost to assist with the following procedure.
  1. Remove the lost satellite from the Slony cluster (run these commands from the Content Manager):
    1. Log in to Postgres:
      su postgres
      psql -U GridApp
    2. Note the "pa_conninfo" for the Content Manager and the lost satellite:
      GridApp=# select * from _megamesh.sl_path;
       pa_server | pa_client |                          pa_conninfo                           | pa_connretry
      -----------+-----------+-----------------------------------------------------------------+--------------
               3 |         1 | host=rh5-mm009-03.gridapp-dev.com dbname=GridApp user=megamesh port=5432 password=password |           10
               1 |         3 | host=rh5-mm009-01.gridapp-dev.com dbname=GridApp user=megamesh port=5432 password=password |           10
      (2 rows)

    3. Note the "no_id" for the Content Manager and the lost satellite:
      GridApp=# select * from _megamesh.sl_node;
       no_id | no_active |                  no_comment                   | no_spool
      -------+-----------+-----------------------------------------------+----------
           1 | t         | Node 1 - GridApp@rh5-mm009-01.gridapp-dev.com | f
           3 | t         | Node 2 - GridApp@rh5-mm009-03.gridapp-dev.com | f
      (2 rows)
    4. Create a Slonik script to remove this node from the cluster. Create a file named "slonik_script.slonik" in /tmp with the following contents, substituting the values noted in the previous steps:
      cluster name = megamesh;

      node <satellite manager no_id> admin conninfo='host=<satellite manager host> dbname=GridApp user=megamesh_config port=5432 password=<megamesh_config user's password>';

      node <content manager no_id> admin conninfo='host=<content manager host> dbname=GridApp user=megamesh_config port=5432 password=<megamesh_config user's password>';

      try {
          drop node (id = <satellite manager no_id>, event node = <content manager no_id>);
      } on error {
          echo 'Failed to drop node from cluster';
          exit 1;
      }

      For example, to remove rh5-mm009-03 from this mesh:

      cluster name = megamesh;

      node 3 admin conninfo='host=rh5-mm009-03.gridapp-dev.com dbname=GridApp user=megamesh_config port=5432 password=password';

      node 1 admin conninfo='host=rh5-mm009-01.gridapp-dev.com dbname=GridApp user=megamesh_config port=5432 password=password';

      try {
          drop node (id = 3, event node = 1);
      } on error {
          echo 'Failed to drop node from cluster';
          exit 1;
      }

    5. Run the Slonik script created in the previous step:
      slonik < /tmp/slonik_script.slonik
    6. Verify that the lost Satellite is no longer in the cluster:
      psql -h localhost -U postgres GridApp

      GridApp=# select * from _megamesh.sl_node;
       no_id | no_active |                  no_comment                   | no_spool
      -------+-----------+-----------------------------------------------+----------
           1 | t         | Node 1 - GridApp@rh5-mm009-01.gridapp-dev.com | f
      (1 row)
  2. Install BDA and configure the Multi-Manager prerequisites on the new servers. For more information, see Installing the Manager software in a Multi-Manager configuration.
  3. Deconfigure the mesh:
    /app/clarity/manager_scripts/bin/deconfigure_megamesh
  4. Restore the backup of the Satellite Manager (do not use the -k option):
    /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
  5. Configure the mesh on the Satellite Manager(s). The Content Manager must be available in the mesh before executing the configure_megamesh command on the Satellite Manager.
    /app/clarity/manager_scripts/bin/configure_megamesh  
  6. Restart all BDA services including the megamesh service on the Satellite Manager.
  7. Verify that the backup of the Satellite Manager contains all packages, templates, and actions that were available before recovery.
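
After the satellite services restart, you can confirm from the Satellite Manager that the node is registered in the mesh again by re-running the query used during deconfiguration. This is a sketch only; node IDs and hostnames will differ in your environment.

  psql -h localhost -U tcrimi "GridApp" -c "select * from _megamesh.sl_node"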

Validating the restored archive

To validate that you have restored the archived backup correctly, first ensure the following:

  • The backup tarball already exists on the standby physical/virtual hardware.
  • The BMC Database Automation Manager is installed using the exact same version/architecture on the standby server. See the Installing section for the standard installation procedure.
  • No previous restores have been executed against this target. It is extremely difficult to validate the restoration of data if there is an existing Postgres dataset already loaded.
  • You are logged in as 'root'.
  • You have access to the BMC Database Automation Manager as a user with sysadmin privileges.

To validate the restore process:

  1. As root, run (replacing the tarball with the filename on your server):

    clarity_backup.pl -r clarity_backup-2011-10-18-24920.tar.gz -v
    
  2. Review the output and ensure that only 'INFO' and 'WARN' messages are present. The output should terminate with:

    WARN      Re-starting services:
    WARN      backup Complete
    
  3. Validate that the complete service stack is running by checking the process list (see the sketch after this procedure).
  4. After the previous step has been successfully completed, the rest of the restore can be validated at the GUI level. Log on as the 'sysadmin' user or equivalent to begin.
  5. Validate the successful restoration of both Postgres and file-level components. Checking Users/RBAC first confirms that the Postgres data has been loaded successfully. See below for an example of Role validation; perform the same checks for Groups and Users, and ensure that authentication with one of the enumerated users succeeds.

    Custom Role - Restore Validation

       

  6. Validate restoration of Job logs. First ensure that the selection on the jobs page looks accurate (see below), then drill down into an individual job to confirm that the job logs can be retrieved.

    Job Logs - Restore Validation
  7. For content restoration, review the patch (screenshot below), Action, and Template repositories.
     

    Patch Packages - Restore Validation
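
For step 3 above, a process listing is a quick way to confirm that the service stack is back up after the restore. This is a minimal sketch; exact process names vary by release, so treat the grep pattern as an assumption.

    # Confirm that the Manager processes and the Postgres instance are running
    ps -ef | grep -Ei 'postgres|clarity' | grep -v grep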

Agents

Several Agent constraints must be taken into consideration in order for the Agents to re-establish their connection with the new Manager host.

  • The hostname of the Manager that the Agent reports to is hard-coded into /app/clarity/dagent/etc/dagent.conf on each Agent.
  • Manager/Agent communication is done over TLS, and if the common name in the Manager certificate does not match the hostname in the dagent.conf file, the connection will fail.
  • Agent connection timeouts while the Manager is unavailable must be taken into account.
  • Correct firewall rules must be in place for clients to connect to the Manager in the secondary datacenter.

To minimize mean time to recovery (MTTR) and avoid the Agent issues outlined above, it is strongly recommended that the DNS host record for the failed Manager be updated to point to the IP address of the Manager in the secondary datacenter after the restore is completed. The following diagram illustrates the primary Manager going offline and the consequences of updating the host record.

 

 

Re-homing Agents

In the previous scenario there is one DNS server in each datacenter, both of which are authoritative for the Manager's domain. When the failure in the Primary Datacenter occurs, the host record is updated to point to the new IP address.
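
Once the host record is updated, you can confirm from an Agent host that the Agent still points at the expected Manager name and that the name now resolves to the secondary datacenter address. This is a minimal sketch; the Manager hostname shown is an example only.

  # On an Agent host: confirm which Manager hostname the Agent reports to
  cat /app/clarity/dagent/etc/dagent.conf
  # Confirm that the Manager hostname now resolves to the secondary datacenter IP
  nslookup bda-manager.example.com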

Validating that the Agents are back online is straightforward: a green host icon indicates connectivity, while a red icon does not. See the screenshot for a view of an online node. To complete basic connectivity testing, validate that a log bundle can be downloaded. Completing these two activities effectively tests both UDP and TCP connectivity.

Related topics

Data Guard Concepts and Administration

Oracle Database Backup and Recovery User's Guide
