Disaster Recovery

This topic provides recommendations for backing up and restoring the BMC Database Automation (BDA) Manager to help you implement both standalone and Multi-Manager environment recovery scenarios.

Performing disaster recovery for standalone environments

Disaster recovery enables an administrator to recover the BDA Manager from an existing backup to:

  • New physical or virtual hardware
  • A new physical or virtual instance in a new datacenter
  • Existing physical or virtual hardware

Example scenario

The following example scenario describes different methods for backing up the operational BDA instance in line with existing business requirements, and provides guidance on recovering from the following classes of failure:

  • Physical hardware failure (new host, same datacenter recovery)
  • Data corruption (same host, same datacenter recovery)
  • Natural disaster (different host, different datacenter recovery)

Overview

The following illustration demonstrates multiple recovery scenarios, covering both intra- and inter-datacenter recovery (same or different host). The application data (on the 'Data' spindle) can be offloaded to the other datacenter using the administrator's preferred protocol (SSH, FTP, and so forth). As illustrated, the backup is generated by a script, and both generating and transferring it can be scheduled with a system scheduler. The 'Backup' storage does not need to be directly connected to the Red Hat Enterprise Linux (RHEL) host that represents the target (although a direct connection simplifies the restore). Running a backup to tape at the recovery site is also strongly encouraged.
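As a minimal sketch of the transfer step, the following function picks the most recently modified backup tarball in a directory so that a scheduled job can copy it offsite. It assumes tarballs follow the clarity_backup-<date>-<id>.tar.gz naming seen in the restore example later in this topic; the directory, user, and host in the usage comment are hypothetical placeholders.

```shell
# Print the path of the most recently modified backup tarball in a directory.
# Assumption: tarballs are named clarity_backup-<date>-<id>.tar.gz.
latest_backup() {
    backup_dir=$1
    # ls -t sorts by modification time, newest first
    ls -t "$backup_dir"/clarity_backup-*.tar.gz 2>/dev/null | head -n 1
}

# Hypothetical usage (paths, user, and host are placeholders):
#   scp "$(latest_backup /backup)" backupuser@dr-site.example.com:/backup/
```

Selecting by modification time rather than by file name avoids relying on how the numeric suffix of the file name is generated.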

Assumptions

  • Backup physical or virtual hardware is available that meets the requirements of the multiple restore strategies listed. For example, to recover from a localized hardware failure, there must be a pre-provisioned virtual/physical host available as the restore target.
  • Backup hardware must have the same operating system and architecture as the failed host. An exact specification match is not necessary, as long as it is understood that performance will be impacted in accordance with the change.
  • Where the data warehouse is in use, Oracle Data Guard is pre-configured in a primary/standby relationship.
  • Firewall rules are in place for Agent connectivity for the secondary datacenter. See Network requirements.

Performing a backup and restore for disaster recovery

This section provides an overview of the process for archiving, backing up, and restoring BMC Database Automation using the backup script.

The clarity_backup.pl script is the primary backup-to-disk tool provided by BMC to back up the BMC Database Automation Manager solution. It can be found here:

/app/clarity/manager_scripts/bin/clarity_backup.pl

For syntax and commands used with the script, see Creating a backup.

Prerequisites

  • BDA is installed at the primary site.
  • Physical/Virtual hardware that matches the OS/architecture of the dedicated primary BDA Manager is installed at the secondary datacenter.
  • A copy of the BDA installer media that matches the version and architecture of the primary BDA Manager is available at the backup site.
  • 'root' access, without which the clarity_backup.pl script will not work.
  • Knowledge to configure the system scheduler (crond is acceptable).
  • Adequate disk space to satisfy the retention requirements of the business (for both disk and tape/virtual backup copies). The clarity_backup.pl tool performs only full backups. The main advantage is that the required disk space can be easily forecast from the required retention and the existing backup size. The downside is the extra disk space used.
  • A WAN link that can sustain a reasonable transfer rate to copy the daily backup to a different datacenter. If the backup can't be copied offsite daily, the customer must decide how much data loss is acceptable (known as the Recovery Point Objective, or RPO) in the event of a datacenter-impacting issue.
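Because only full backups are taken, the disk space forecast reduces to a multiplication. The sketch below illustrates that arithmetic; the backup size and retention count are hypothetical inputs, and the result ignores growth of the installation over time.

```shell
# Estimate the disk space (in GB) needed to retain a number of full backups.
# Both arguments are whole numbers supplied by the administrator.
forecast_backup_space_gb() {
    backup_size_gb=$1
    retained_copies=$2
    echo $(( $backup_size_gb * $retained_copies ))
}

# Example: a 12 GB backup retained for 14 daily copies
forecast_backup_space_gb 12 14   # prints 168
```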

If executed without any arguments, the script provides both a backup and a restore example. A backup creates a gzipped tarball in the working directory, which can be sizable depending on your installation, so proceed with caution.

The backup is a recursive copy of CLARITY_ROOT (default is /app/clarity) written to the working directory in gzip format. It includes:

  • Provisioning, patching, and upgrade templates
  • Actions
  • Patch Packages
  • MSSQL Media
  • Custom discovery modules
  • Job Logs
  • User accounts, RBAC configuration

Tip

BMC Software recommends that you have several copies of this backup to facilitate the recovery scenarios outlined below. Recommended locations are:

  • Same datacenter, same server (disk).
  • Same datacenter, different server (shared storage).
  • Different datacenter.
  • Offload to tape or virtual tape device (multiple datacenters recommended).

Also, you should leverage existing organizational procedures for Oracle disaster recovery. Consult the Oracle documentation if you need additional information (see included links in Related topics on this page). For more information on the data warehouse, see Data warehouse.

The clarity_backup.pl script is also responsible for restore activities, and the basic syntax is detailed below. It should be assumed that all data backed up will be restored into or over the existing installation.

For syntax and commands used with the script, see Archiving, restoring jobs, and creating backup files.

Note

The clarity_backup.pl script brings the Management server offline during script runtime. The downtime incurred depends on the amount of data to be backed up and the performance of the system, but a good estimate is 15-30 minutes. In addition to the file-level backup, the Postgres database is shut down for an offline backup or restore procedure, so service is disrupted while the backup or restore runs. You must plan accordingly.

Validating the restored archive

To validate that you have restored the archived backup correctly, first ensure the following:

  • The backup tarball already exists on the standby physical/virtual hardware.
  • The BMC Database Automation Manager is installed using the exact same version/architecture on the standby server. See the Installing section for the standard installation procedure.
  • No previous restores have been executed against this target. It is extremely difficult to validate the restoration of data if there is an existing Postgres dataset already loaded.
  • You are logged in as 'root'.
  • You have access to the BMC Database Automation Manager as a user with sysadmin privileges.

To validate the restore process:

  1. As root, run (replacing the tarball with the filename on your server):

    clarity_backup.pl -r clarity_backup-2011-10-18-24920.tar.gz -v
    
  2. Review the output and ensure that only 'INFO' and 'WARN' messages are present. The output should terminate with:

    WARN      Re-starting services:
    WARN      backup Complete
    
  3. Validate that the complete service stack is running (check the process list).

    The remainder of the restore operation can be validated at the GUI level.
  4. Log on as the 'sysadmin' user or equivalent to begin.
  5. Validate the successful restoration of Postgres and file-level components.
    1. Check Users/RBAC to ensure that Postgres data has successfully been loaded.
    2. Ensure that authentication with one of the enumerated users succeeds.
  6. Validate restoration of Job logs.
    1. Ensure that the selection on the jobs page looks accurate.
    2. Drill down into an individual job to confirm that the job logs can be retrieved.
  7. For content restoration, review the patch, Action, and Template repositories.
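The review in step 2 can be partly automated. The following sketch assumes that you saved the restore output to a file and that each message begins with its severity, as in the sample output above; both points are assumptions rather than documented behavior of clarity_backup.pl.

```shell
# Return 0 if every non-empty line of the saved restore output starts with
# INFO or WARN; otherwise print the offending lines and return 1.
# Returns 2 if the file cannot be read.
check_restore_log() {
    log_file=$1
    [ -r "$log_file" ] || return 2
    # -v inverts the match, so grep prints (and succeeds on) unexpected lines
    if grep -Ev '^[[:space:]]*(INFO|WARN)([[:space:]]|$)|^[[:space:]]*$' "$log_file"; then
        return 1
    fi
    return 0
}
```

A non-zero result only flags lines for manual review; it does not replace the GUI-level checks in steps 4 through 7.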

Agents

There are multiple Agent constraints that must be taken into consideration in order to have the Agents re-establish connection with the new Manager host.

  • The hostname of the Manager that the Agent reports to is hard-coded into /app/clarity/dagent/etc/dagent.conf on each Agent.
  • Manager/Agent communication is done over TLS, and if the canonical name in the Manager certificate does not match the hostname in the dagent.conf file, then the connection will fail.
  • Potential issues with Agent timeouts.
  • Correct firewall rules must be in place for clients to connect to the Manager in the secondary datacenter.

To optimize MTTR and minimize these Agent issues, BMC strongly recommends that the DNS host record for the failed Manager be updated to point to the IP address of the Manager in the secondary datacenter after the restore is completed. The following diagram illustrates the primary Manager going offline, and the consequences of updating the host record.

In the previous scenario, there is one DNS server in each datacenter, both of which are authoritative for the Manager's domain. When the failure in the primary datacenter occurs, the host record is updated to point to the new IP address.

Validating that the new Agents are online is straightforward: a green host icon indicates connectivity, while a red icon indicates no connectivity. See the screenshot for a view of an online node. To complete basic connectivity testing, validate that a log bundle can be downloaded. Completing these two activities effectively tests both UDP and TCP connectivity.

Performing disaster recovery for Multi-Manager environments

Note

In order to recover one or more Managers in a Multi-Manager environment, you must have a backup of each Manager created using the BMC Database Automation Backup Utility.

Example scenarios 

This section describes scenarios for recovering from the loss of a Content Manager, the loss of one or more Satellite Managers, or a complete loss of the mesh. Each scenario describes the recovery of a lost Manager on a new server for the following two cases:

  • A new server has the same hostname and IP address as the one being replaced.
  • A new server has a different hostname and IP address than the one being replaced.

Assumptions

These scenarios are written based on the following assumptions.

  • The Satellite Manager(s) have nodes and databases approved into domains.
  • Some nodes are configured for agent failover while some are not. 
  • The Satellite Manager(s) have some nodes that are in the Pending Approval state. 
  • In the cases where the new server has the same hostname and IP address as the one being replaced, the Satellite Manager is referred to as S1 and the Content Manager is referred to as C1.
  • In the cases where the new server has a different hostname and IP address than the one being replaced, the new Satellite Manager is referred to as NS1 and the new Content Manager is referred to as NC1.

Scenario 1:

Event: Satellite Manager (S1) has crashed and is not responding to the Agents.

After some time, the nodes configured for failover are visible on the failover Satellite Manager (S2).

Recovering the Satellite Manager when the hostname and IP address are the same

  1. Remove the lost Satellite Manager from the slony cluster (these commands should be run from the Content Manager):
    1. Log in to Postgres:
      psql -h localhost -U tcrimi "GridApp"
    2. Note the no_id for the Content Manager and the lost Satellite Manager.

      GridApp=# select * from _megamesh.sl_node;
      no_id | no_active |                no_comment                     |no_spool
      -------+-----------+-----------------------------------------------+----------
           1 | t         | Node 1 - GridApp@rh5-mm009-01.gridapp-dev.com | f
           3 | t         | Node 2 - GridApp@rh5-mm009-03.gridapp-dev.com | f
      (2 rows)

      Note

      Ensure that you add the IP address and mask of the Content Manager to the pg_hba.conf file, and then restart the postgresql service. For example:

      host all megamesh 172.19.23.174 255.255.255.252 password
      host all megamesh_config 172.19.23.174 255.255.255.252 password

    3. Remove the entry of the Satellite Manager (S1) from the cluster by executing the attached script.

    4. Verify that the lost Satellite Manager is no longer in the cluster:
      psql -h localhost -U postgres GridApp

      GridApp=# select * from _megamesh.sl_node;
      no_id | no_active |                 no_comment                    |no_spool
      -------+-----------+-----------------------------------------------+----------
           1 | t         |Node 1 - GridApp@rh5-mm009-01.gridapp-dev.com  | f
      (1 row)
    5. Confirm the entry of the lost Satellite Manager (S1) in the mesh:
      GridApp=# select * from aaa.mesh_setup;
      no_id | id | manager_name               
      -------+-----------+-----------------------------------------------+----------
           1 | 335bafc889015a5b       | Node 1 - GridApp@rh5-mm009-01.gridapp-dev.com | f
           3 | 9d8daf01f373c3a0       | Node 2 - GridApp@rh5-mm009-03.gridapp-dev.com | f
      (2 rows)
    6. Remove the entry of the Satellite Manager (S1) from the mesh:
      GridApp=# delete from aaa.mesh_setup where no_id = '<no_id>';
      For example,
      GridApp=# delete from aaa.mesh_setup where no_id = '3';
    7. Verify that the lost Satellite Manager is no longer in the mesh:
      GridApp=# select * from aaa.mesh_setup;
      no_id | id | manager_name               
      -------+-----------+-----------------------------------------------+----------
           1 | 335bafc889015a5b       | Node 1 - GridApp@rh5-mm009-01.gridapp-dev.com | f    
      (1 row)
  2. Install the Manager software in a Multi-Manager configuration on the new server(s) (Satellite Manager(s)).
  3. Deconfigure the mesh on the Satellite Manager first and then on the Content Manager.
    /app/clarity/manager_scripts/bin/deconfigure_megamesh
  4. Restore the backup of the Satellite Manager (do not use the -k option).
    /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
  5. Remove lines containing the term capability_blacklist from the /app/clarity/dmanager/etc/d2500_config file on the Satellite Manager.
  6. Configure the mesh on the Content Manager.
    /app/clarity/manager_scripts/bin/configure_megamesh
  7. Configure the mesh on the Satellite Manager.
    /app/clarity/manager_scripts/bin/configure_megamesh
  8. Restart all BDA services including the megamesh service on the Satellite Manager.
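The capability_blacklist cleanup in step 5 can be scripted. The following is a minimal sketch, not a BMC-supplied tool: it keeps a .bak copy of the file (an added precaution, not part of the documented procedure) and rewrites the original in place so that its ownership and permissions are unchanged.

```shell
# Delete every line containing capability_blacklist from a config file,
# keeping the original content as <file>.bak.
remove_capability_blacklist() {
    config_file=$1
    cp -p "$config_file" "$config_file.bak" || return 1
    # grep -v prints only the lines that do NOT contain the term
    grep -v 'capability_blacklist' "$config_file.bak" > "$config_file"
    # grep exits 1 when it prints nothing (every line removed); that is still success
    [ $? -le 1 ]
}

# Usage on the Satellite Manager, as root:
#   remove_capability_blacklist /app/clarity/dmanager/etc/d2500_config
```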

During the process of recovery, all actions, templates, and packages available in the restored backup of the Satellite Manager(s) are visible on the restored Satellite Manager.
After configuring the Satellite Manager(s) in the mesh, all restored actions, templates, packages, and standards are overwritten and synced with the Content Manager.

After restarting all BDA services, perform the Switch back to primary manager operation on the failover Satellite Manager (S2). All nodes are available again on the Primary Satellite Manager (S1), and all unapproved nodes are visible in the unapproved hosts list on the Satellite Manager (S1).

Recovering the Satellite Manager when the hostname and IP address are different 

  1. Remove the lost Satellite Manager from the slony cluster (these commands should be run from the Content Manager):
    1. Log in to Postgres:
      psql -h localhost -U tcrimi "GridApp"
    2. Note the no_id for the Content Manager and the lost Satellite Manager.

      GridApp=# select * from _megamesh.sl_node;
      no_id | no_active |                no_comment                     |no_spool
      -------+-----------+-----------------------------------------------+----------
           1 | t         | Node 1 - GridApp@rh5-mm009-01.gridapp-dev.com | f
           3 | t         | Node 2 - GridApp@rh5-mm009-03.gridapp-dev.com | f
      (2 rows)

    3. Remove the entry of the Satellite Manager (S1) from the Content Manager by executing the attached script.

    4. Verify that the lost Satellite Manager is no longer in the cluster:

      psql -h localhost -U postgres GridApp
      GridApp=# select * from _megamesh.sl_node;
      no_id | no_active |                 no_comment                    |no_spool
      -------+-----------+-----------------------------------------------+----------
           1 | t         |Node 1 - GridApp@rh5-mm009-01.gridapp-dev.com  | f
      (1 row)

  2. Install the Manager software in a Multi-Manager configuration on the new server(s) (Satellite Manager (NS1)).
  3. Start the dmanager service on the Satellite Manager (NS1).
  4. Restore the backup of the Satellite Manager (do not use the -k option).
    /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
  5. On the Content Manager and on the other Satellite Managers, replace the IP address and mask of the lost Satellite Manager with the IP address and mask of the new Satellite Manager (NS1) in the following file:
    /var/lib/pgsql/data/pg_hba.conf
  6. Restart all BDA services including postgresql, httpd, and megamesh on the Content Manager.
  7. Remove id_rsa.pub, id_rsa, and ssh_known_hosts files from the /app/clarity/dmanager/etc directory on the Satellite Manager (NS1).
  8. Remove lines containing the term capability_blacklist from the /app/clarity/dmanager/etc/d2500_config file on the Satellite Manager (NS1).

  9. Configure the mesh on the Satellite Manager(s). The Content Manager must be available in the mesh before executing the configure_megamesh command on the Satellite Manager.
    /app/clarity/manager_scripts/bin/configure_megamesh

    Note

    While configuring the mesh, you get the following message if the public key was not created while installing the Manager software in step 2.

    New ssh public key created at /app/clarity/dmanager/etc/id_rsa.pub.
    Please append this file to /home/megamesh/.ssh/authorized_keys on
    your megamanager and rerun this script.

    If this message appears, follow these steps:

    Append the contents of the /app/clarity/dmanager/etc/id_rsa.pub file to the /home/megamesh/.ssh/authorized_keys file.

    Deconfigure the mesh on the Content Manager.
    /app/clarity/manager_scripts/bin/deconfigure_megamesh

    Configure the mesh on the Content Manager.
    /app/clarity/manager_scripts/bin/configure_megamesh

    Configure the mesh on the Satellite Manager.
    /app/clarity/manager_scripts/bin/configure_megamesh

  10. Restart all BDA services including the megamesh service on the Satellite Manager (NS1).
  11. Run the following script on NS1 to replace the entry for S1 with NS1 in the dstate file of the Content Manager:
    /app/clarity/manager_scripts/bin/megamesh_content_refresh_node_info
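
For the pg_hba.conf change in step 5, it can help to generate the replacement lines and review them before editing the file. This sketch assumes the entries use the same layout as the pg_hba.conf example shown earlier in this topic (host all <user> <IP> <mask> password); the address in the usage comment is a hypothetical placeholder.

```shell
# Print pg_hba.conf entries for the megamesh and megamesh_config users,
# in the layout used by the example in this topic.
pg_hba_mesh_entries() {
    ip_address=$1
    netmask=$2
    for db_user in megamesh megamesh_config; do
        printf 'host all %s %s %s password\n' "$db_user" "$ip_address" "$netmask"
    done
}

# Hypothetical usage for NS1 (address is a placeholder):
#   pg_hba_mesh_entries 192.0.2.10 255.255.255.252
```

The function only prints the lines; applying them to /var/lib/pgsql/data/pg_hba.conf and restarting the services remains a manual step.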

During the process of recovery, all actions, templates, and packages available in the restored backup of the Satellite Manager(s) are visible on NS1.
After configuring the Satellite Manager(s) in the mesh, all restored actions, templates, packages, and standards are overwritten and synced with the Content Manager.

After restarting all BDA services, perform the Switch back to primary manager operation on the failover Satellite Manager (S2). All nodes are available again on the Primary Satellite Manager (NS1), and all unapproved nodes are visible in the unapproved hosts list on the Satellite Manager (NS1).

Event: Content Manager (C1) has crashed.

Satellite Managers S1 and S2 are intact.

Recovering the Content Manager when the hostname and IP address are the same

  1. Install the Manager software in a Multi-Manager configuration on the new server (Content Manager).
  2. On all the Satellite Managers:
    1. Deconfigure the mesh.
      /app/clarity/manager_scripts/bin/deconfigure_megamesh
      You can ignore errors that might appear during deconfiguration on the Satellite Manager.
    2. Remove lines containing the term capability_blacklist from the /app/clarity/dmanager/etc/d2500_config file on the Satellite Manager.
    3. Run the following command on the Satellite Manager:
      psql -h localhost -U tcrimi "GridApp" -c "select * from _megamesh.sl_node"
      If the output of this command displays the registered node, go to the next step; otherwise, skip steps 4 and 5.
    4. Run the following command to drop the megamesh schema from the Satellite Manager:
      psql -h localhost -U tcrimi "GridApp" -c "DROP SCHEMA _megamesh CASCADE"
    5. Verify the registered node on the Satellite Manager using the following command:
      psql -h localhost -U tcrimi "GridApp" -c "select * from _megamesh.sl_node"
      This command should throw an error:
      schema "_megamesh" does not exist  
  3. On the Content Manager:
    1. Deconfigure the mesh.
      /app/clarity/manager_scripts/bin/deconfigure_megamesh
    2. Restore the backup on the Content Manager (do not use the -k option).
      /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz

      Note

      After you have restored the backup, follow these steps to check the owner and permissions on the log directory and the files inside it:

      1. Check the owner and permission on the /app/clarity/dmanager/var/log directory, and ensure that it is clarity:root for the log directory. If permission or owner is changed, then run the following command as a root user: chown clarity:root /app/clarity/dmanager/var/log
      2. Check the owner and permission on the files in the /app/clarity/dmanager/var/log directory and ensure that it is clarity:root for the files in the log directory. If permission or owner is changed, then run the following command as a root user: chown clarity:root /app/clarity/dmanager/var/log/*

    3. Run the following command on the Content Manager:
      psql -h localhost -U tcrimi "GridApp" -c "select * from aaa.mesh_setup"
      If there are rows in the table, then run the following command:
      psql -h localhost -U tcrimi "GridApp" -c "truncate table aaa.mesh_setup"

    4. Ensure that the Slony initialization script is linked from the /etc/init.d directory. If not, run the following command:
      ln -s /app/clarity/manager_scripts/bin/slony.init /etc/init.d/slony
    5. Configure the mesh on the Content Manager. 
      /app/clarity/manager_scripts/bin/configure_megamesh
    6. Restart all BDA services including the megamesh service on the Content Manager.
  4. On all the Satellite Managers:
    1. Configure the mesh on the Satellite Manager(s).
      /app/clarity/manager_scripts/bin/configure_megamesh

      Note

      If the above command fails, perform step 3 on the Content Manager and then run the command again on the Satellite Manager.

    2. Restart all BDA services including the megamesh service on the Satellite Manager.
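
The conditional schema cleanup on each Satellite Manager (substeps 3 through 5 of step 2) can be sketched as a small shell function. This is an illustration only, with two simplifications that you should review: it treats a successful query as "the registered node is displayed", and it cannot distinguish a missing schema from a connection failure. It also assumes psql can authenticate without prompting.

```shell
# Run one SQL statement with the connection settings used in this topic.
# ON_ERROR_STOP is set so that a failed statement gives a non-zero exit status.
run_sql() {
    psql -h localhost -U tcrimi "GridApp" -v ON_ERROR_STOP=1 -c "$1"
}

# Drop the _megamesh schema only if it can still be queried, then confirm
# that querying it fails afterwards.
drop_megamesh_if_present() {
    if ! run_sql 'select * from _megamesh.sl_node' >/dev/null 2>&1; then
        echo "_megamesh not found; nothing to drop"
        return 0
    fi
    run_sql 'DROP SCHEMA _megamesh CASCADE' >/dev/null || return 1
    if run_sql 'select * from _megamesh.sl_node' >/dev/null 2>&1; then
        echo "_megamesh still present after drop" >&2
        return 1
    fi
    echo "_megamesh dropped"
}
```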

During the process of recovery, all actions, templates, and packages available in the restored backup of the Content Manager are visible on the recovered Content Manager.   
After configuring the Satellite Manager(s) in the mesh, all actions, templates, and packages are overwritten and synced with the Content Manager on the Satellite Manager(s). 

Recovering the Content Manager when the hostname and IP address are different

  1. Build a new machine with a different hostname and IP address for the Content Manager (NC1).
  2. Install the Manager software on the new Content Manager (NC1). See Installing the Manager software in a Multi-Manager configuration.
  3. Start the dmanager service on the new manager host.
  4. On the Content Manager:
    1. Deconfigure the mesh on the Satellite Manager and then on the Content Manager. You can ignore errors that might appear during deconfiguration on the Satellite Manager.
      /app/clarity/manager_scripts/bin/deconfigure_megamesh
    2. Restore the backup of the Content Manager on the new Content Manager (do not use the -k option).
      /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
    3. Configure the mesh on the new Content Manager. See Installing the Manager software in a Multi-Manager configuration.
    4. For BDA versions 8.9.00 and later, run the psql -h localhost -U tcrimi "GridApp" -c "truncate table aaa.mesh_setup;" command on the new Content Manager.
    5. Configure the megamesh on the new Content Manager (NC1).
      /app/clarity/manager_scripts/bin/configure_megamesh
    6. Restart all BDA services including the megamesh service on the Content Manager.
  5. On the Satellite Manager(s):
    1. Deconfigure the mesh. You can ignore errors that might appear during deconfiguration on the Satellite Manager.
      /app/clarity/manager_scripts/bin/deconfigure_megamesh
    2. Remove lines containing the term capability_blacklist from the /app/clarity/dmanager/etc/d2500_config file on the Satellite Manager.
    3. Run the psql -h localhost -U tcrimi "GridApp" -c "select * from _megamesh.sl_node" command on the Satellite Manager. If the output of this command displays the registered node, go to the next step; otherwise, skip steps 4 and 5.
    4. Run the psql -h localhost -U tcrimi "GridApp" -c "DROP SCHEMA _megamesh CASCADE" command to drop the megamesh schema from the Satellite Manager.
    5. Verify the registered node on the Satellite Manager using the psql -h localhost -U tcrimi "GridApp" -c "select * from _megamesh.sl_node" command.
      This command should throw an error:
      schema "_megamesh" does not exist  
    6. Replace the IP address and mask of the Content Manager with the IP address and mask of the new Content Manager (NC1) in the following file:
      /var/lib/pgsql/data/pg_hba.conf
    7. Restart postgresql, httpd, and mtd services.
    8. Replace the hostname of the Content Manager with the hostname of the new Content Manager (NC1) in the /app/clarity/dmanager/etc/mesh.conf and /app/clarity/dmanager/etc/dmanager.conf files on the Satellite Manager(s).

    9. For BDA versions 8.9.00 and later, run the psql -h localhost -U tcrimi "GridApp" -c "truncate table aaa.mesh_setup;" command on all new Satellite Manager(s).
    10. Add the Satellite Manager’s key from the /app/clarity/dmanager/etc/id_rsa.pub file to the /home/megamesh/.ssh/authorized_keys file on the new Content Manager.

    11. Validate passwordless connectivity from the Satellite Manager(s) using the following command:
      su clarity -c 'ssh -i /app/clarity/dmanager/etc/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/app/clarity/dmanager/etc/ssh_known_hosts -l megamesh <Content_manager_name> date'

    12. Configure the mesh on the Satellite Manager(s). The Content Manager must be available in the mesh before executing the configure_megamesh command on the Satellite Manager.
      /app/clarity/manager_scripts/bin/configure_megamesh
    13. Restart all BDA services including the megamesh service on the Satellite Manager.
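
The hostname replacement in substep 8 of step 5 can be sketched as follows. The function performs a plain text substitution and does not rely on any particular file syntax; the hostnames in the usage comment are hypothetical. Because the match is textual, review the result if the old hostname could appear as part of a longer name.

```shell
# Replace every occurrence of one hostname with another in the given files,
# keeping the original content of each file as <file>.bak.
replace_manager_hostname() {
    old_host=$1
    new_host=$2
    shift 2
    # Escape dots so that they match literally in the sed pattern
    old_pattern=$(printf '%s' "$old_host" | sed 's/\./\\./g')
    for conf_file in "$@"; do
        cp -p "$conf_file" "$conf_file.bak" || return 1
        sed "s/$old_pattern/$new_host/g" "$conf_file.bak" > "$conf_file" || return 1
    done
}

# Hypothetical usage on a Satellite Manager (hostnames are placeholders):
#   replace_manager_hostname c1.example.com nc1.example.com \
#       /app/clarity/dmanager/etc/mesh.conf /app/clarity/dmanager/etc/dmanager.conf
```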

During the process of recovery, all actions, templates, and packages available in the restored backup of the Content Manager are visible on the recovered Content Manager.   
After configuring the Satellite Manager(s) in the mesh, all actions, templates, and packages are overwritten and synced with the Content Manager on the Satellite Manager(s). 

Event: The Content Manager (C1) and both Satellite Managers (S1 and S2) are down.

The Content Manager and Satellite Manager(s) are not responding, and there is a complete loss of the mesh.

Recovering the Content Manager when the hostname and IP address are the same

  1. Install the Manager software in a Multi-Manager configuration on the new servers.
  2. Deconfigure the mesh on the Content Manager.
    /app/clarity/manager_scripts/bin/deconfigure_megamesh
  3. Restore the backup on the Content Manager (do not use the -k option).
    /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
  4. Configure the mesh on the Content Manager. 
    /app/clarity/manager_scripts/bin/configure_megamesh
  5. Restart all BDA services including the megamesh service on the Content Manager.

During the process of recovery, all actions, templates, and packages available in the restored backup of the Content Manager are visible on the recovered Content Manager.
After configuring the Satellite Manager(s) in the mesh, all actions, templates, and packages are overwritten and synced with the Content Manager on the Satellite Manager(s). 

Recovering the Satellite Manager when the hostname and IP address are the same

  1. Install the Manager software in a Multi-Manager configuration on the new servers.
  2. Deconfigure the mesh on the Satellite Manager first and then on the Content Manager.
    /app/clarity/manager_scripts/bin/deconfigure_megamesh
  3. Restore the backup of the Satellite Manager (do not use the -k option).
    /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
  4. Remove lines containing the term capability_blacklist from the /app/clarity/dmanager/etc/d2500_config file on the Satellite Manager.
  5. Configure the mesh on the Content Manager.
    /app/clarity/manager_scripts/bin/configure_megamesh
  6. Configure the mesh on the Satellite Manager(s).
    /app/clarity/manager_scripts/bin/configure_megamesh  
  7. Restart all BDA services including the megamesh service on the Satellite Manager.

During the process of recovery, all actions, templates, and packages available in the restored backup of the Satellite Manager(s) are visible on the restored Satellite Manager.
After configuring the Satellite Manager(s) in the mesh, all restored actions, templates, packages, and standards are overwritten and synced with the Content Manager.

Recovering the Content Manager and Satellite Manager(s) when the hostname and IP address are different 

  1. Build new machines with different hostnames and IP addresses for the Content Manager (NC1) and the Satellite Manager(s) (NS1 and NS2).
  2. Install the Manager software on all new machines: NC1, NS1, and NS2.
  3. Restore the backup of the Satellite Manager(s) and the Content Manager (do not use the -k option).
    /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
  4. On the Satellite Manager(s), remove id_rsa.pub, id_rsa, and ssh_known_hosts files from the /app/clarity/dmanager/etc directory.
  5. Remove lines containing the term capability_blacklist from the /app/clarity/dmanager/etc/d2500_config file on the Satellite Manager.
  6. Configure the mesh on the new Content Manager and the new Satellite Manager(s). See Installing the Manager software in a Multi-Manager configuration and Adding Satellite Managers to the mesh.
  7. Deconfigure the megamesh on the Satellite Manager and then on the Content Manager. You can ignore errors that might appear during deconfiguration on the Satellite Manager.
    /app/clarity/manager_scripts/bin/deconfigure_megamesh
  8. Configure the mesh on the Content Manager and the Satellite Manager(s).
    /app/clarity/manager_scripts/bin/configure_megamesh
  9. Validate passwordless connectivity from the Satellite Manager(s) using the following command:
    su clarity -c 'ssh -i /app/clarity/dmanager/etc/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/app/clarity/dmanager/etc/ssh_known_hosts -l megamesh <Content_manager_name> date'
  10. Add NS1 and the restored Content Manager to the mesh. See Adding Satellite Managers to the mesh.
  11. Restart all BDA services including the megamesh service on the Satellite Manager.
  12. Run the following script on NS1 to replace the entry for S1 with NS1 in the dstate file of the Content Manager:
    /app/clarity/manager_scripts/bin/megamesh_content_refresh_node_info

During the process of recovery, all actions, templates, and packages are visible on the Content Manager, and all actions, templates and packages available on the restored backup of the Satellite Manager(s) are visible on NS1. 

After configuring the Satellite Manager(s) in the mesh, all actions, templates, and packages are overwritten and synced with the Content Manager on the Satellite Manager(s).

All templates, actions, packages, agents, and unapproved nodes synced from the Content Manager are visible on the Satellite Manager(s).

Scenario 2:

The backups of the Content Manager and the Satellite Manager(s) are taken at different times. For example, the backup of the Content Manager is taken at time T1 and the backup of the Satellite Manager is taken at time T2. Some objects are added on the Content Manager after the backup and these objects are already present in the backup of the Satellite Manager(s). 

Event: Satellite Manager (S1) has crashed and is not responding to the Agents.

After some time, the nodes configured for failover are visible on the failover Satellite Manager (S2).

Recovering the Satellite Manager when the hostname and IP address are the same

  1. Remove the lost Satellite Manager from the Slony cluster (run these commands on the Content Manager):
    1. Log in to PostgreSQL:
      psql -h localhost -U tcrimi "GridApp"
    2. Note the no_id for the Content Manager and the lost Satellite Manager.

      GridApp=# select * from _megamesh.sl_node;
      no_id | no_active |                no_comment                     |no_spool
      -------+-----------+-----------------------------------------------+----------
           1 | t         | Node 1 - GridApp@rh5-mm009-01.gridapp-dev.com | f
           3 | t         | Node 2 - GridApp@rh5-mm009-03.gridapp-dev.com | f
      (2 rows)

      Ensure that you add the IP address and mask of the Content Manager to the pg_hba.conf file, and then restart the postgresql service. For example:

      host all megamesh 172.19.23.174 255.255.255.252 password
      host all megamesh_config 172.19.23.174 255.255.255.252 password


    3. Remove the entry of the Satellite Manager (S1) from the Content Manager by executing the attached script.

    4. Verify that the lost Satellite is no longer in the cluster:
      psql -h localhost -U postgres GridApp

      GridApp=# select * from _megamesh.sl_node;
      no_id | no_active |                 no_comment                    |no_spool
      -------+-----------+-----------------------------------------------+----------
           1 | t         |Node 1 - GridApp@rh5-mm009-01.gridapp-dev.com  | f
      (1 row)
    5. Confirm the entry of the lost Satellite Manager (S1) in the mesh:
      GridApp=# select * from aaa.mesh_setup;
      no_id | id | manager_name               
      -------+-----------+-----------------------------------------------+----------
           1 | 335bafc889015a5b       | Node 1 - GridApp@rh5-mm009-01.gridapp-dev.com | f
           3 | 9d8daf01f373c3a0       | Node 2 - GridApp@rh5-mm009-03.gridapp-dev.com | f
      (2 rows)
    6. Remove the entry of the Satellite Manager (S1) from the mesh:
      GridApp=# delete from aaa.mesh_setup where no_id = '<no_id>';
      For example,
      GridApp=# delete from aaa.mesh_setup where no_id = '3';
    7. Verify that the lost Satellite Manager is no longer in the mesh:
      GridApp=# select * from aaa.mesh_setup;
      no_id | id | manager_name               
      -------+-----------+-----------------------------------------------+----------
           1 | 335bafc889015a5b       | Node 1 - GridApp@rh5-mm009-01.gridapp-dev.com | f    
      (1 row)
  2. Install the Manager software in a Multi-Manager configuration on a new server.
  3. Deconfigure the mesh on the Satellite Manager first and then on the Content Manager.
    /app/clarity/manager_scripts/bin/deconfigure_megamesh
  4. Restore the backup of the Satellite Manager (do not use the -k option).
    /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
  5. Remove lines containing the term capability_blacklist from the /app/clarity/dmanager/etc/d2500_config file on the Satellite Manager.
  6. Configure the mesh on the Content Manager.
    /app/clarity/manager_scripts/bin/configure_megamesh
  7. Configure the mesh on the Satellite Manager(s).
    /app/clarity/manager_scripts/bin/configure_megamesh  
  8. Restart all BDA services including the megamesh service on the Satellite Manager.

During the process of recovery, all actions, templates, and packages available in the restored backup of the Satellite Manager(s) are visible on the restored Satellite Manager. 
After configuring the Satellite Manager(s) in the mesh, all restored actions, templates, packages, and standards are overwritten and synced with the Content Manager.

After restarting all BDA services, perform the Switch back to primary manager operation on the failover Satellite Manager (S2). All nodes are available again on the Primary Satellite Manager (S1), and all unapproved nodes are visible in the unapproved hosts list on the Satellite Manager (S1). 
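
The capability_blacklist cleanup in step 5 can be scripted. The following is a minimal sketch and not part of the BDA tooling: the function name strip_capability_blacklist and the .bak backup suffix are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: remove every line containing "capability_blacklist"
# from a config file, keeping a copy of the original as <file>.bak.
strip_capability_blacklist() {
    cfg="$1"
    [ -f "$cfg" ] || { echo "not found: $cfg" >&2; return 1; }
    cp -p "$cfg" "$cfg.bak" || return 1
    # grep -v exits with status 1 when every line was filtered out.
    # That is not an error here, so only other non-zero statuses fail.
    grep -v 'capability_blacklist' "$cfg.bak" > "$cfg" || [ $? -eq 1 ]
}

# Example (run on the Satellite Manager):
#   strip_capability_blacklist /app/clarity/dmanager/etc/d2500_config
```

Writing the filtered output back into the existing file, rather than replacing it with a new one, keeps the ownership and permissions of the original file.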

Recovering the Satellite Manager when the hostname and IP address are different 

  1. Remove the lost Satellite Manager from the Slony cluster (run these commands on the Content Manager):
    1. Log in to PostgreSQL:
      psql -h localhost -U tcrimi "GridApp"
    2. Note the no_id for the Content Manager and the lost Satellite Manager.

      GridApp=# select * from _megamesh.sl_node;
      no_id | no_active |                no_comment                     |no_spool
      -------+-----------+-----------------------------------------------+----------
           1 | t         | Node 1 - GridApp@rh5-mm009-01.gridapp-dev.com | f
           3 | t         | Node 2 - GridApp@rh5-mm009-03.gridapp-dev.com | f
      (2 rows)

    3. Remove the entry of the Satellite Manager (S1) from the Content Manager by executing the attached script.

    4. Verify that the lost Satellite is no longer in the cluster:
      psql -h localhost -U postgres GridApp
      GridApp=# select * from _megamesh.sl_node;
      no_id | no_active |                 no_comment                    |no_spool
      -------+-----------+-----------------------------------------------+----------
           1 | t         |Node 1 - GridApp@rh5-mm009-01.gridapp-dev.com  | f
      (1 row)

  2. Install the Manager software in a Multi-Manager configuration on NS1.
  3. Start the dmanager service on the new Satellite Manager (NS1). 
  4. Restore the backup of the Satellite Manager (do not use the -k option).
    /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
  5. On the Content Manager and on the other Satellite Managers, replace the IP address and mask of the lost Satellite Manager with those of the new Satellite Manager in the following file:
    /var/lib/pgsql/data/pg_hba.conf
  6. Remove id_rsa.pub, id_rsa, and ssh_known_hosts files from the /app/clarity/dmanager/etc directory on the Satellite Manager.
  7. Remove lines containing the term capability_blacklist from the /app/clarity/dmanager/etc/d2500_config file on the Satellite Manager.

  8. Configure the mesh on the Satellite Manager(s). The Content Manager must be available in the mesh before executing the configure_megamesh command on the Satellite Manager.
    /app/clarity/manager_scripts/bin/configure_megamesh  
  9. Restart all BDA services including the megamesh service on the Satellite Manager.
  10. Run the following script on NS1 to replace the S1 entry with NS1 in the dstate file of the Content Manager:
    /app/clarity/manager_scripts/bin/megamesh_content_refresh_node_info

During the process of recovery, all actions, templates, and packages available in the restored backup of the Satellite Manager(s) are visible on NS1. 
After configuring the Satellite Manager(s) in the mesh, all restored actions, templates, packages, and standards are overwritten and synced with the Content Manager.

After restarting all BDA services, perform the Switch back to primary manager operation on the failover Satellite Manager (S2). All nodes are available again on the Primary Satellite Manager (NS1), and all unapproved nodes are visible in the unapproved hosts list on the Satellite Manager (NS1).
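
The pg_hba.conf edit in step 5 amounts to swapping one address for another on every line that mentions it. A minimal sketch follows, assuming GNU or BSD sed with the -E option; the function name replace_hba_ip and the .bak suffix are hypothetical. It changes only the IP address, so a different mask still has to be edited by hand.

```shell
#!/bin/sh
# Hypothetical helper: swap one IPv4 address for another in a
# pg_hba.conf-style file, keeping a copy of the original as <file>.bak.
replace_hba_ip() {
    hba="$1"; old_ip="$2"; new_ip="$3"
    [ -f "$hba" ] || { echo "not found: $hba" >&2; return 1; }
    # Escape the dots so that sed matches them literally.
    old_re=$(printf '%s' "$old_ip" | sed 's/\./\\./g')
    cp -p "$hba" "$hba.bak" || return 1
    # The groups on both sides stop 172.19.23.17 from matching
    # inside a longer address such as 172.19.23.174.
    sed -E "s/(^|[^0-9.])$old_re([^0-9.]|\$)/\1$new_ip\2/g" "$hba.bak" > "$hba"
}

# Example (the new address is a placeholder):
#   replace_hba_ip /var/lib/pgsql/data/pg_hba.conf 172.19.23.174 <new_satellite_ip>
```

Comparing the result with the .bak copy (for example, with diff) before restarting the postgresql service shows exactly which lines changed.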

Event: Content Manager (C1) has crashed.

Satellite Managers S1 and S2 are intact.

Recovering the Content Manager when the hostname and IP address are the same

  1. Install BDA and configure the Multi-Manager prerequisites on the new servers. For more information, see Installing the Manager software in a Multi-Manager configuration.
  2. On the Content Manager:
    1. Deconfigure the mesh.
      /app/clarity/manager_scripts/bin/deconfigure_megamesh
    2. Restore the backup on the Content Manager (do not use the -k option).
      /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
    3. Configure the mesh on the Content Manager. 
      /app/clarity/manager_scripts/bin/configure_megamesh
    4. Restart all BDA services including the megamesh service on the Content Manager.
  3. On the Satellite Manager(s):
    1. Deconfigure the mesh. You can ignore errors that might appear during deconfiguration on the Satellite Manager.
      /app/clarity/manager_scripts/bin/deconfigure_megamesh
    2. Remove lines containing the term capability_blacklist from the /app/clarity/dmanager/etc/d2500_config file on the Satellite Manager.
    3. Run the psql -h localhost -U tcrimi "GridApp" -c "select * from _megamesh.sl_node" command on the Satellite Manager. If the output of this command displays the registered node, go to the next step; otherwise, skip steps 4 and 5.
    4. Run the psql -h localhost -U tcrimi "GridApp" -c "DROP SCHEMA _megamesh CASCADE" command to drop the megamesh schema from the Satellite Manager.
    5. Verify that the schema was dropped by running the psql -h localhost -U tcrimi "GridApp" -c "select * from _megamesh.sl_node" command again. The command should fail with the following error:
      schema "_megamesh" does not exist  
    6. Configure the mesh on the Satellite Manager(s). The Content Manager must be available in the mesh before executing the configure_megamesh command on the Satellite Manager.
      /app/clarity/manager_scripts/bin/configure_megamesh
    7. Restart all BDA services including the megamesh service on the Satellite Manager.

During the process of recovery, all actions, templates, and packages available in the restored backup of the Content Manager are visible on the recovered Content Manager.   
After configuring the Satellite Manager(s) in the mesh, all actions, templates, and packages are overwritten and synced with the Content Manager on the Satellite Manager(s). 
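
The check, drop, and verify sequence on the Satellite Manager can be wrapped in one function. The sketch below reuses the psql commands quoted in the steps and adds the standard psql options -t and -A so that only rows are printed. The function names are hypothetical, and the sketch assumes that psql can authenticate without prompting.

```shell
#!/bin/sh
MEGAMESH_QUERY='select * from _megamesh.sl_node'

# Prints the rows of _megamesh.sl_node, or nothing if the query fails.
megamesh_nodes() {
    psql -h localhost -U tcrimi "GridApp" -t -A -c "$MEGAMESH_QUERY" 2>/dev/null
}

# Hypothetical helper: drop the _megamesh schema only when a registered
# node is still listed, then confirm that the query no longer returns rows.
drop_megamesh_if_present() {
    if [ -z "$(megamesh_nodes)" ]; then
        echo "no registered node found; nothing to drop"
        return 0
    fi
    psql -h localhost -U tcrimi "GridApp" -c "DROP SCHEMA _megamesh CASCADE" || return 1
    if [ -n "$(megamesh_nodes)" ]; then
        echo "_megamesh schema is still present" >&2
        return 1
    fi
    echo "_megamesh schema dropped"
}

# Example (run on the Satellite Manager):
#   drop_megamesh_if_present
```

Because errors from the query are discarded, a failed database connection also produces the "nothing to drop" message. Run the query by hand first if connectivity is in doubt.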

Recovering the Content Manager when the hostname and IP address are different 

  1. Build a new machine with a different hostname and IP address for the Content Manager (NC1).
  2. Install the Manager software on the new content host. See Installing the Manager software in a Multi-Manager configuration.
  3. Start the dmanager service on the new manager host.
  4. On the Content Manager:
    1. Deconfigure the mesh on the Satellite Manager and then on the Content Manager. You can ignore errors that might appear during deconfiguration on the Satellite Manager.
      /app/clarity/manager_scripts/bin/deconfigure_megamesh
    2. Restore the backup of the Content Manager on the new Content Manager (do not use the -k option).
      /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
    3. Install the Manager software in a Multi-Manager configuration on the new Content Manager (NC1).
    4. For BDA versions 8.9.00 and later, run the psql -h localhost -U tcrimi "GridApp" -c "truncate table aaa.mesh_setup;" command on the Content Manager.
    5. Configure the megamesh on the new Content Manager (NC1).
      /app/clarity/manager_scripts/bin/configure_megamesh
    6. Restart all BDA services including the megamesh service on the Content Manager.
  5. On the Satellite Manager(s):
    1. Deconfigure the mesh. You can ignore errors that might appear during deconfiguration on the Satellite Manager.
      /app/clarity/manager_scripts/bin/deconfigure_megamesh
    2. Remove lines containing the term capability_blacklist from the /app/clarity/dmanager/etc/d2500_config file on the Satellite Manager.
    3. Run the psql -h localhost -U tcrimi "GridApp" -c "select * from _megamesh.sl_node" command on the Satellite Manager. If the output of this command displays the registered node, go to the next step; otherwise, skip steps 4 and 5.
    4. Run the psql -h localhost -U tcrimi "GridApp" -c "DROP SCHEMA _megamesh CASCADE" command to drop the megamesh schema from the Satellite Manager.
    5. Verify that the schema was dropped by running the psql -h localhost -U tcrimi "GridApp" -c "select * from _megamesh.sl_node" command again.
      The command should fail with the following error:
      schema "_megamesh" does not exist  
    6. Replace the IP address and mask of the old Content Manager with those of the new Content Manager (NC1) in the following file:
      /var/lib/pgsql/data/pg_hba.conf
    7. Restart the postgresql, httpd, and mtd services.
    8. Replace the hostname of the Content Manager with the hostname of the new Content Manager (NC1) in the /app/clarity/dmanager/etc/mesh.conf and /app/clarity/dmanager/etc/dmanager.conf files on the Satellite Manager(s).
    9. For BDA versions 8.9.00 and later, run the psql -h localhost -U tcrimi "GridApp" -c "truncate table aaa.mesh_setup;" command on all Satellite Manager(s).
    10. Add the Satellite Manager’s public key (from the /app/clarity/dmanager/etc/id_rsa.pub file on the Satellite Manager) to the /home/megamesh/.ssh/authorized_keys file on the new Content Manager.

    11. Validate passwordless connectivity from the Satellite Manager(s) by using the following command:
      su clarity -c 'ssh -i /app/clarity/dmanager/etc/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/app/clarity/dmanager/etc/ssh_known_hosts -l megamesh <Content_manager_name> date'

    12. Configure the mesh on the Satellite Manager(s). The Content Manager must be available in the mesh before executing the configure_megamesh command on the Satellite Manager.
      /app/clarity/manager_scripts/bin/configure_megamesh
    13. Restart all BDA services including the megamesh service on the Satellite Manager.

During the process of recovery, all actions, templates, and packages available in the restored backup of the Content Manager are visible on the recovered Content Manager.   
After configuring the Satellite Manager(s) in the mesh, all actions, templates, and packages are overwritten and synced with the Content Manager on the Satellite Manager(s). 
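
Adding the key in step 10 is safest when it is idempotent, so that repeating the recovery does not accumulate duplicate entries. The sketch below assumes that the public key originates on the Satellite Manager and that the authorized_keys file belongs to the megamesh user on the new Content Manager, which matches the ssh check in step 11. It also assumes the key file was already copied to the Content Manager; the function name and the /tmp path in the example are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: append a public key to an authorized_keys file
# unless an identical line is already present.
add_authorized_key() {
    pubkey_file="$1"; auth_file="$2"
    [ -s "$pubkey_file" ] || { echo "empty or missing: $pubkey_file" >&2; return 1; }
    key=$(cat "$pubkey_file")
    # -x matches whole lines only, -F treats the key as a fixed string.
    if [ -f "$auth_file" ] && grep -qxF -- "$key" "$auth_file"; then
        echo "key already present"
        return 0
    fi
    printf '%s\n' "$key" >> "$auth_file"
    echo "key added"
}

# Example (run on the new Content Manager; the source path is a placeholder):
#   add_authorized_key /tmp/ns1_id_rsa.pub /home/megamesh/.ssh/authorized_keys
```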

Event: The Content Manager (C1) and both Satellite Managers (S1 and S2) are down.

The Content Manager and Satellite Manager(s) are not responding, and there is a complete loss of the mesh.

Recovering the Content Manager when the hostname and IP address are the same

  1. Install the Manager software in a Multi-Manager configuration on the new servers.
  2. Deconfigure the mesh on the Satellite Manager first and then on the Content Manager.
    /app/clarity/manager_scripts/bin/deconfigure_megamesh
  3. Restore the backup on the new Content Manager (do not use the -k option).
    /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
  4. Configure the mesh on the Content Manager.
    /app/clarity/manager_scripts/bin/configure_megamesh
  5. Configure the mesh on the Satellite Manager(s).
    /app/clarity/manager_scripts/bin/configure_megamesh
  6. Restart all BDA services including the megamesh service on the Content Manager.

During the process of recovery, all actions, templates, and packages available in the restored backup of the Content Manager are visible on the recovered Content Manager.   
After configuring the Satellite Manager in the mesh, all actions, templates, and packages are overwritten and synced with the Content Manager on the Satellite Manager(s). The objects added after the backup of the Content Manager are removed from the Satellite Manager. 
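
Because a complete loss of the mesh leaves the backup as the only source of data, it can be worth confirming that the archive is readable before restoring it. The sketch below assumes, from the backup.tar.gz file name alone, that the backup is a gzip-compressed tar archive. The function name is hypothetical, and the check says nothing about whether the contents are a valid BDA backup.

```shell
#!/bin/sh
# Hypothetical pre-restore check: confirm that the backup file exists and
# is a readable gzip-compressed tar archive.
check_backup_archive() {
    archive="$1"
    [ -s "$archive" ] || { echo "missing or empty: $archive" >&2; return 1; }
    gzip -t "$archive" 2>/dev/null || {
        echo "gzip integrity check failed: $archive" >&2; return 1; }
    tar -tzf "$archive" >/dev/null 2>&1 || {
        echo "not a readable tar archive: $archive" >&2; return 1; }
    echo "archive looks readable: $archive"
}

# Example:
#   check_backup_archive backup.tar.gz &&
#       /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
```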

Recovering the Satellite Manager when the hostname and IP address are the same

  1. Install the Manager software in a Multi-Manager configuration on new servers.
  2. Deconfigure the mesh on the Satellite Manager and then on the Content Manager.
    /app/clarity/manager_scripts/bin/deconfigure_megamesh
  3. Restore the backup of the Satellite Manager (do not use the -k option).
    /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
  4. Remove lines containing the term capability_blacklist from the /app/clarity/dmanager/etc/d2500_config file on the Satellite Manager.
  5. Configure the mesh on the Content Manager.
    /app/clarity/manager_scripts/bin/configure_megamesh 
  6. Configure the mesh on the Satellite Manager(s).
    /app/clarity/manager_scripts/bin/configure_megamesh  
  7. Restart all BDA services including the megamesh service on the Satellite Manager.

During the process of recovery, all actions, templates, and packages available in the restored backup of the Satellite Manager(s) are visible on the restored Satellite Manager. 

After configuring the Satellite Manager(s) in the mesh, all restored actions, templates, packages, and standards are overwritten and synced with the Content Manager.

Recovering the Content Manager and Satellite Manager(s) when the hostname and IP address are different 

  1. Build new machines with different hostnames and IP addresses for the Content Manager (NC1) and the Satellite Managers (NS1 and NS2).
  2. Install the Manager software on all new machines, NC1, NS1, and NS2. For more information, see Installing the Manager software in a Multi-Manager configuration.
  3. Restore the backup of the Satellite Manager(s) and the Content Manager (do not use the -k option).
    /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
  4. On the Satellite Manager(s), remove id_rsa.pub, id_rsa, and ssh_known_hosts files from the /app/clarity/dmanager/etc directory.
  5. Configure the mesh on the new Content Manager and the new Satellite Manager(s). See Installing the Manager software in a Multi-Manager configuration and Adding Satellite Managers to the mesh.
  6. Deconfigure the megamesh on the Satellite Manager and then on the Content Manager. You can ignore errors that might appear during deconfiguration on the Satellite Manager.
    /app/clarity/manager_scripts/bin/deconfigure_megamesh
  7. Configure the mesh on the Content Manager first and then on the Satellite Manager(s). The Content Manager must be available in the mesh before executing the configure_megamesh command on the Satellite Manager.
    /app/clarity/manager_scripts/bin/configure_megamesh
  8. Validate passwordless connectivity from the Satellite Manager(s) by using the following command:
    su clarity -c 'ssh -i /app/clarity/dmanager/etc/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/app/clarity/dmanager/etc/ssh_known_hosts -l megamesh <Content_manager_name> date'
  9. Add NS1 and the restored Content Manager to the mesh. See Adding Satellite Managers to the mesh.
  10. Restart all BDA services including the megamesh service on the Satellite Manager.

During the process of recovery, all actions, templates, and packages are visible on the restored Content Manager, and all actions, templates, and packages available in the restored backup of the Satellite Manager(s) are visible on the restored Satellite Manager(s). The objects added after the backup of the Content Manager are also visible on the Satellite Manager.

After configuring the Satellite Manager(s) in the mesh, all actions, templates, and packages are overwritten and synced with the Content Manager on the Satellite Manager. The objects added after the backup of the Content Manager are removed from the Satellite Manager.

All templates, actions, packages, agents, and unapproved nodes synced from the Content Manager are visible on the Satellite Manager(s). 
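
The connectivity check in step 8 can be wrapped so that it reports a clear result on each Satellite Manager. The sketch below uses the documented command with one addition that is not in the original: the OpenSSH option BatchMode=yes, which makes ssh fail instead of prompting for a password. The function name check_mesh_ssh is hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper around the documented connectivity check.
# BatchMode=yes is an addition: without it, ssh would fall back to a
# password prompt rather than report that key-based login failed.
check_mesh_ssh() {
    cm="$1"
    [ -n "$cm" ] || { echo "usage: check_mesh_ssh <Content_manager_name>" >&2; return 2; }
    if su clarity -c "ssh -i /app/clarity/dmanager/etc/id_rsa -o BatchMode=yes -o StrictHostKeyChecking=no -o UserKnownHostsFile=/app/clarity/dmanager/etc/ssh_known_hosts -l megamesh $cm date"; then
        echo "passwordless SSH to $cm: OK"
    else
        echo "passwordless SSH to $cm: FAILED" >&2
        return 1
    fi
}

# Example (run as root on each Satellite Manager):
#   check_mesh_ssh <Content_manager_name>
```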

Scenario 3: 

The backups of the Content Manager and the Satellite Manager(s) are taken at different times. For example, the backup of the Content Manager is taken at time T2 and the backup of the Satellite Manager(s) is taken at time T1. Some objects are added after the backup of the Satellite Manager and these objects are already present in the backup of the Content Manager. 

Event: Satellite Manager (S1) has crashed and is not responding to the Agents.

After some time, the nodes configured for failover are visible on the failover Satellite Manager (S2).

Recovering the Satellite Manager when the hostname and IP address are the same

  1. Remove the lost Satellite Manager from the Slony cluster (run these commands on the Content Manager):
    1. Log in to PostgreSQL:
      psql -h localhost -U tcrimi "GridApp"
    2. Note the no_id for the Content Manager and the lost Satellite Manager.

      GridApp=# select * from _megamesh.sl_node;
      no_id | no_active |                no_comment                     |no_spool
      -------+-----------+-----------------------------------------------+----------
           1 | t         | Node 1 - GridApp@rh5-mm009-01.gridapp-dev.com | f
           3 | t         | Node 2 - GridApp@rh5-mm009-03.gridapp-dev.com | f
      (2 rows)

      Note

      Ensure that you add the IP address and mask of the Content Manager to the pg_hba.conf file, and then restart the postgresql service. For example:

      host all megamesh 172.19.23.174 255.255.255.252 password
      host all megamesh_config 172.19.23.174 255.255.255.252 password

    3. Remove the entry of the Satellite Manager (S1) from the Content Manager by executing the attached script.

    4. Verify that the lost Satellite is no longer in the cluster:
      psql -h localhost -U postgres GridApp

      GridApp=# select * from _megamesh.sl_node;
      no_id | no_active |                 no_comment                    |no_spool
      -------+-----------+-----------------------------------------------+----------
           1 | t         |Node 1 - GridApp@rh5-mm009-01.gridapp-dev.com  | f
      (1 row)
    5. Confirm the entry of the lost Satellite Manager (S1) in the mesh:
      GridApp=# select * from aaa.mesh_setup;
      no_id | id | manager_name               
      -------+-----------+-----------------------------------------------+----------
           1 | 335bafc889015a5b       | Node 1 - GridApp@rh5-mm009-01.gridapp-dev.com | f
           3 | 9d8daf01f373c3a0       | Node 2 - GridApp@rh5-mm009-03.gridapp-dev.com | f
      (2 rows)
    6. Remove the entry of the Satellite Manager (S1) from the mesh:
      GridApp=# delete from aaa.mesh_setup where no_id = '<no_id>';
      For example,
      GridApp=# delete from aaa.mesh_setup where no_id = '3';
    7. Verify that the lost Satellite Manager is no longer in the mesh:
      GridApp=# select * from aaa.mesh_setup;
      no_id | id | manager_name               
      -------+-----------+-----------------------------------------------+----------
           1 | 335bafc889015a5b       | Node 1 - GridApp@rh5-mm009-01.gridapp-dev.com | f    
      (1 row)
  2. Install the Manager software in a Multi-Manager configuration on new servers.
  3. Deconfigure the mesh on the Satellite Manager first and then on the Content Manager.
    /app/clarity/manager_scripts/bin/deconfigure_megamesh
  4. Restore the backup of the Satellite Manager (do not use the -k option).
    /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
  5. Remove lines containing the term capability_blacklist from the /app/clarity/dmanager/etc/d2500_config file on the Satellite Manager.
  6. Configure the mesh on the Content Manager.
    /app/clarity/manager_scripts/bin/configure_megamesh
  7. Configure the mesh on the Satellite Manager(s).
    /app/clarity/manager_scripts/bin/configure_megamesh  
  8. Restart all BDA services including the megamesh service on the Satellite Manager.

During the process of recovery, all actions, templates, and packages available in the restored backup of the Satellite Manager(s) are visible on the restored Satellite Manager. 
After configuring the Satellite Manager(s) in the mesh, all restored actions, templates, packages, and standards are overwritten and synced with the Content Manager.

After restarting all BDA services, perform the Switch back to primary manager operation on the failover Satellite Manager (S2). All nodes are available again on the Primary Satellite Manager (S1), and all unapproved nodes are visible in the unapproved hosts list on the Satellite Manager (S1). 
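
When the cluster lists several nodes, the no_id of the lost Satellite Manager can be picked out of the sl_node output by host name instead of by eye. The sketch below assumes the output layout shown in the procedure, where no_comment ends with @ followed by the host name. The function name find_no_id is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "select * from _megamesh.sl_node" output on
# stdin and print the no_id of the node whose no_comment ends in "@<host>".
find_no_id() {
    awk -F'|' -v host="$1" '
        NF >= 3 {
            comment = $3
            gsub(/^[ \t]+|[ \t]+$/, "", comment)   # trim the no_comment column
            n = length(comment) - length(host)
            # "@<host>" must be the tail of the comment for the row to match
            if (n > 0 && substr(comment, n) == "@" host) {
                id = $1
                gsub(/[ \t]/, "", id)
                print id
            }
        }'
}

# Example (run on the Content Manager):
#   psql -h localhost -U tcrimi "GridApp" -c "select * from _megamesh.sl_node" |
#       find_no_id rh5-mm009-03.gridapp-dev.com
```

With the sample output from the procedure, the example prints 3, which is the value to use in the delete from aaa.mesh_setup statement.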

Recovering the Satellite Manager when the hostname and IP address are different 

  1. Remove the lost Satellite Manager from the Slony cluster (run these commands on the Content Manager):
    1. Log in to PostgreSQL:
      psql -h localhost -U tcrimi "GridApp"
    2. Note the no_id for the Content Manager and the lost Satellite Manager.

      GridApp=# select * from _megamesh.sl_node;
      no_id | no_active |                no_comment                     |no_spool
      -------+-----------+-----------------------------------------------+----------
           1 | t         | Node 1 - GridApp@rh5-mm009-01.gridapp-dev.com | f
           3 | t         | Node 2 - GridApp@rh5-mm009-03.gridapp-dev.com | f
      (2 rows)

    3. Remove the entry of the Satellite Manager (S1) from the Content Manager by executing the attached script.

    4. Verify that the lost Satellite Manager is no longer in the cluster:
      psql -h localhost -U postgres GridApp
      GridApp=# select * from _megamesh.sl_node;
      no_id | no_active |                 no_comment                    |no_spool
      -------+-----------+-----------------------------------------------+----------
           1 | t         |Node 1 - GridApp@rh5-mm009-01.gridapp-dev.com  | f
      (1 row)

  2. Install the Manager software in a Multi-Manager configuration on NS1. 
  3. Start the dmanager service on the new Satellite Manager (NS1). 
  4. Restore the backup of the Satellite Manager (do not use the -k option).
    /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
  5. On the Content Manager and on the other Satellite Managers, replace the IP address and mask of the lost Satellite Manager with those of the new Satellite Manager in the following file:
    /var/lib/pgsql/data/pg_hba.conf
  6. Remove id_rsa.pub, id_rsa, and ssh_known_hosts files from the /app/clarity/dmanager/etc directory on the Satellite Manager.
  7. Remove lines containing the term capability_blacklist from the /app/clarity/dmanager/etc/d2500_config file on the Satellite Manager.

  8. Configure the mesh on the Satellite Manager(s). The Content Manager must be available in the mesh before executing the configure_megamesh command on the Satellite Manager.
    /app/clarity/manager_scripts/bin/configure_megamesh  
  9. Restart all BDA services including the megamesh service on the Satellite Manager.
  10. Run the following script on NS1 to replace the S1 entry with NS1 in the dstate file of the Content Manager:
    /app/clarity/manager_scripts/bin/megamesh_content_refresh_node_info

During the process of recovery, all actions, templates, and packages available in the restored backup of the Satellite Manager(s) are visible on NS1. 
After configuring the Satellite Manager(s) in the mesh, all restored actions, templates, packages, and standards are overwritten and synced with the Content Manager.

After restarting all BDA services, perform the Switch back to primary manager operation on the failover Satellite Manager (S2). All nodes are available again on the Primary Satellite Manager (NS1), and all unapproved nodes are visible in the unapproved hosts list on the Satellite Manager (NS1).
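
Step 6 removes the SSH files that the restore brought over from the old host. The sketch below moves them to a separate directory instead of deleting them, which is a deliberate departure from the step so that the files can be recovered if needed. The function name and the backup location in the example are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: clear the restored SSH files out of the dmanager etc
# directory by moving them into a backup directory (not deleting them).
remove_mesh_keys() {
    etc_dir="$1"; backup_dir="$2"
    [ -d "$etc_dir" ] || { echo "not a directory: $etc_dir" >&2; return 1; }
    mkdir -p "$backup_dir" || return 1
    for f in id_rsa.pub id_rsa ssh_known_hosts; do
        if [ -e "$etc_dir/$f" ]; then
            mv "$etc_dir/$f" "$backup_dir/$f" || return 1
            echo "moved $f"
        fi
    done
}

# Example (run on the new Satellite Manager; the backup path is a placeholder):
#   remove_mesh_keys /app/clarity/dmanager/etc /root/old_mesh_keys
```

The backup directory holds a private key, so restrict its permissions or delete it once the mesh is working again.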

Event: Content Manager (C1) has crashed.

Recovering the Content Manager when the hostname and IP address are the same

  1. Install the Manager software in a Multi-Manager configuration on the new servers.
  2. On the Content Manager:
    1. Deconfigure the mesh. 
      /app/clarity/manager_scripts/bin/deconfigure_megamesh
    2. Restore the backup on the Content Manager (do not use the -k option).
      /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
    3. Configure the mesh.
      /app/clarity/manager_scripts/bin/configure_megamesh
    4. Restart all BDA services including the megamesh service on the Content Manager.
  3. On the Satellite Manager(s):
    1. Deconfigure the mesh. You can ignore errors that might appear during deconfiguration on the Satellite Manager.
      /app/clarity/manager_scripts/bin/deconfigure_megamesh
    2. Remove lines containing the term capability_blacklist from the /app/clarity/dmanager/etc/d2500_config file on the Satellite Manager.
    3. Run the psql -h localhost -U tcrimi "GridApp" -c "select * from _megamesh.sl_node" command on the Satellite Manager. If the output of this command displays the registered node, go to the next step; otherwise, skip steps 4 and 5.
    4. Run the psql -h localhost -U tcrimi "GridApp" -c "DROP SCHEMA _megamesh CASCADE" command to drop the megamesh schema from the Satellite Manager.
    5. Verify that the schema was dropped by running the psql -h localhost -U tcrimi "GridApp" -c "select * from _megamesh.sl_node" command again. The command should fail with the following error:
      schema "_megamesh" does not exist  
    6. Configure the mesh on the Satellite Manager(s). The Content Manager must be available in the mesh before executing the configure_megamesh command on the Satellite Manager.
      /app/clarity/manager_scripts/bin/configure_megamesh
    7. Restart all BDA services including the megamesh service on the Satellite Manager.

During the process of recovery, all actions, templates, and packages available in the restored backup of the Content Manager are visible on the recovered Content Manager.   
After configuring the Satellite Manager(s) in the mesh, all actions, templates, and packages are overwritten and synced with the Content Manager on the Satellite Manager(s). 

Recovering the Content Manager when the hostname and IP address are different 

  1. Build a new machine with a different hostname and IP address for the Content Manager (NC1).
  2. Install the Manager software on the new Content Manager host. See Installing the Manager software in a Multi-Manager configuration.
  3. Start the dmanager service on the new manager host.
  4. On the Content Manager:
    1. Deconfigure the mesh on the Satellite Manager and then on the Content Manager. You can ignore errors that might appear during deconfiguration on the Satellite Manager.
      /app/clarity/manager_scripts/bin/deconfigure_megamesh
    2. Restore the backup of the Content Manager on the new Content Manager (do not use the -k option).
      /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
    3. Configure the mesh on the new Content Manager. See Installing the Manager software in a Multi-Manager configuration.
    4. For BDA versions 8.9.00 and later, run the psql -h localhost -U tcrimi "GridApp" -c "truncate table aaa.mesh_setup;" command on the Content Manager.
    5. Configure the megamesh on the new Content Manager (NC1).
      /app/clarity/manager_scripts/bin/configure_megamesh
    6. Restart all BDA services including the megamesh service on the Content Manager.
  5. On the Satellite Manager(s):
    1. Deconfigure the mesh. You can ignore errors that might appear during deconfiguration on the Satellite Manager.
      /app/clarity/manager_scripts/bin/deconfigure_megamesh
    2. Remove lines containing the term capability_blacklist from the /app/clarity/dmanager/etc/d2500_config file on the Satellite Manager.
    3. Run the psql -h localhost -U tcrimi "GridApp" -c "select * from _megamesh.sl_node" command on the Satellite Manager. If the output of this command displays the registered node, go to the next step; otherwise, skip steps 4 and 5.
    4. Run the psql -h localhost -U tcrimi "GridApp" -c "DROP SCHEMA _megamesh CASCADE" command to drop the megamesh schema from the Satellite Manager.
    5. Verify that the schema was dropped by running the psql -h localhost -U tcrimi "GridApp" -c "select * from _megamesh.sl_node" command again. The command should fail with the following error:
      schema "_megamesh" does not exist  
    6. Replace the IP address and mask of the old Content Manager with those of the new Content Manager (NC1) in the following file:
      /var/lib/pgsql/data/pg_hba.conf
    7. Restart the postgresql, httpd, and mtd services.
    8. Replace the hostname of the Content Manager with the hostname of the new Content Manager (NC1) in the /app/clarity/dmanager/etc/mesh.conf and /app/clarity/dmanager/etc/dmanager.conf files on the Satellite Manager(s).
    9. For BDA versions 8.9.00 and later, run the psql -h localhost -U tcrimi "GridApp" -c "truncate table aaa.mesh_setup;" command on all Satellite Manager(s).
    10. Add the Satellite Manager’s public key (from the /app/clarity/dmanager/etc/id_rsa.pub file on the Satellite Manager) to the /home/megamesh/.ssh/authorized_keys file on the new Content Manager.

    11. Validate passwordless connectivity from the Satellite Manager(s) using the following command:
      su clarity -c 'ssh -i /app/clarity/dmanager/etc/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/app/clarity/dmanager/etc/ssh_known_hosts -l megamesh <Content_manager_name> date'

    12. Configure the mesh on the Satellite Manager(s). The Content Manager must be available in the mesh before executing the configure_megamesh command on the Satellite Manager.
      /app/clarity/manager_scripts/bin/configure_megamesh
    13. Restart all BDA services including the megamesh service on the Satellite Manager.
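
Steps 3 through 5 of the Satellite Manager procedure amount to a conditional drop of the _megamesh schema. The following is a minimal sketch of that logic, not an official BDA script. The function name drop_megamesh_schema_if_present is hypothetical, and the -t and -A options (tuples only, unaligned output) are an assumption added to the psql query from the steps so that an empty result is easy to detect.

```shell
#!/bin/sh
# Sketch only: the psql commands come from steps 3 to 5 above. The function
# name and the added -t -A options are assumptions made for this example.

drop_megamesh_schema_if_present() {
    # Step 3: list the registered nodes. If the schema is missing, psql fails
    # and the captured output stays empty.
    nodes=$(psql -h localhost -U tcrimi "GridApp" -t -A \
        -c "select * from _megamesh.sl_node" 2>/dev/null) || nodes=""

    if [ -z "$nodes" ]; then
        # No registered node displayed: steps 4 and 5 are skipped.
        echo "skipped"
        return 0
    fi

    # Step 4: drop the megamesh schema.
    psql -h localhost -U tcrimi "GridApp" \
        -c "DROP SCHEMA _megamesh CASCADE" >/dev/null || return 1

    # Step 5: the same query must now fail because the schema is gone.
    if psql -h localhost -U tcrimi "GridApp" \
        -c "select * from _megamesh.sl_node" >/dev/null 2>&1; then
        echo "schema _megamesh still exists" >&2
        return 1
    fi
    echo "dropped"
}
```

Calling drop_megamesh_schema_if_present on the Satellite Manager prints either "dropped" or "skipped" and returns a non-zero status if the schema is still present after the drop.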

During recovery, all actions, templates, and packages available in the restored backup of the Content Manager are visible on the recovered Content Manager.
After the Satellite Manager(s) are configured in the mesh, all actions, templates, and packages on the Satellite Manager(s) are overwritten and synced with the Content Manager.
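
The hostname replacement in step 8 of the Satellite Manager procedure above can be done in a text editor or scripted. The following is a minimal sketch, assuming hostnames contain only letters, digits, hyphens, and dots. The function name, the .bak backup suffix, and the sample hostnames in the usage comment are hypothetical.

```shell
#!/bin/sh
# Sketch only: replaces every occurrence of one hostname with another in a
# file, keeping a backup copy. Not an official BDA utility.

replace_hostname_in_file() {
    old_host=$1
    new_host=$2
    file=$3

    # Escape dots so that they match literally in the sed pattern.
    old_pattern=$(printf '%s' "$old_host" | sed 's/\./\\./g')

    # Keep a copy of the original file before changing it.
    cp -p "$file" "$file.bak" || return 1

    # Caveat: this is a plain substring match, so review the result if the
    # old hostname can appear inside a longer name.
    sed "s/$old_pattern/$new_host/g" "$file.bak" > "$file"
}

# Example usage on the Satellite Manager (hostnames are placeholders):
# for f in /app/clarity/dmanager/etc/mesh.conf /app/clarity/dmanager/etc/dmanager.conf; do
#     replace_hostname_in_file c1.example.com nc1.example.com "$f"
# done
```

Writing the result back through a redirect, rather than moving a new file into place, leaves the ownership and permissions of the original file unchanged.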

Event: The Content Manager (C1) and both Satellite Managers (S1 and S2) are down.

The Content Manager and Satellite Manager(s) are not responding, and there is a complete loss of the mesh.

Recovering the Content Manager when the hostname and IP address are the same

  1. Install the Manager software in a Multi-Manager configuration on new servers.
  2. Deconfigure the mesh on the Satellite Manager first and then on the Content Manager.
    /app/clarity/manager_scripts/bin/deconfigure_megamesh
  3. Restore the backup on the Content Manager (do not use the -k option).
    /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
  4. Configure the mesh on the Content Manager. 
    /app/clarity/manager_scripts/bin/configure_megamesh
  5. Configure the mesh on the Satellite Manager. 
    /app/clarity/manager_scripts/bin/configure_megamesh
  6. Restart all BDA services including the megamesh service on the Content Manager.

During recovery, all actions, templates, and packages available in the restored backup of the Content Manager are visible on the recovered Content Manager.
After the Satellite Manager is configured in the mesh, all actions, templates, and packages on the Satellite Manager(s) are overwritten and synced with the Content Manager. Objects added after the backup of the Content Manager are removed from the Satellite Manager.

Recovering the Satellite Manager when the hostname and IP address are the same

  1. Install the Manager software in a Multi-Manager configuration on new servers.
  2. Deconfigure the mesh on the Satellite Manager first and then on the Content Manager.
    /app/clarity/manager_scripts/bin/deconfigure_megamesh
  3. Restore the backup of the Satellite Manager (do not use the -k option).
    /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
  4. Remove lines containing the term capability_blacklist from the /app/clarity/dmanager/etc/d2500_config file on the Satellite Manager.
  5. Configure the mesh on the Content Manager.
    /app/clarity/manager_scripts/bin/configure_megamesh  
  6. Configure the mesh on the Satellite Manager.
    /app/clarity/manager_scripts/bin/configure_megamesh  
  7. Restart all BDA services including the megamesh service on the Satellite Manager.
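
Removing the capability_blacklist lines (step 4) can be scripted. The following is a minimal sketch, assuming a backup copy of the file is wanted before it is changed. The function name and the .bak suffix are hypothetical and are not part of BDA.

```shell
#!/bin/sh
# Sketch only: removes every line containing the term capability_blacklist
# from a configuration file, after saving a backup copy.

remove_capability_blacklist() {
    config_file=$1

    # Keep a copy of the original file before changing it.
    cp -p "$config_file" "$config_file.bak" || return 1

    # grep -v prints the lines that do NOT contain the term. It exits with
    # status 1 when it prints nothing, which is not an error here; only a
    # status of 2 or higher indicates a real failure.
    grep -v 'capability_blacklist' "$config_file.bak" > "$config_file" || [ $? -eq 1 ]
}

# Example usage on the Satellite Manager:
# remove_capability_blacklist /app/clarity/dmanager/etc/d2500_config
```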

During recovery, all actions, templates, and packages available in the restored backup of the Satellite Manager(s) are visible on the restored Satellite Manager.
After the Satellite Manager(s) are configured in the mesh, all restored actions, templates, packages, and standards are overwritten and synced with the Content Manager.

Recovering the Content Manager and Satellite Manager(s) when the hostname and IP address are different

  1. Build new machines with different hostnames and IP addresses for the Content Manager (NC1) and the Satellite Managers (NS1 and NS2).
  2. Install the Manager software on all new machines, NC1, NS1, and NS2.
  3. Restore the backup of the Satellite Manager(s) and the Content Manager (do not use the -k option).
    /app/clarity/manager_scripts/bin/clarity_backup.pl -r backup.tar.gz
  4. On the Satellite Manager(s), remove id_rsa.pub, id_rsa, and ssh_known_hosts files from the /app/clarity/dmanager/etc directory.
  5. Configure the mesh on the new Content Manager and the new Satellite Manager(s). For details, see Installing the Manager software in a Multi-Manager configuration and Adding Satellite Managers to the mesh.
  6. Deconfigure the megamesh on the Satellite Manager first and then on the Content Manager. You can ignore errors that might appear during deconfiguration on the Satellite Manager.
    /app/clarity/manager_scripts/bin/deconfigure_megamesh
  7. Configure the mesh on the Content Manager first and then the Satellite Manager(s).
    /app/clarity/manager_scripts/bin/configure_megamesh
  8. Validate passwordless connectivity from the Satellite Manager(s) using the following command:
    su clarity -c 'ssh -i /app/clarity/dmanager/etc/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/app/clarity/dmanager/etc/ssh_known_hosts -l megamesh <Content_manager_name> date'
  9. Add NS1, NS2, and the restored Content Manager to the mesh. See Adding Satellite Managers to the mesh.
  10. Restart all BDA services including the megamesh service on the Satellite Manager.
  11. Run the following script on NS1 to replace the entry of S1 by NS1 in the dstate file of the Content Manager:
    /app/clarity/manager_scripts/bin/megamesh_content_refresh_node_info
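
The connectivity check in step 8 can be wrapped so that it fails instead of prompting when key-based login is not working. The following is a minimal sketch, assuming it is run as root on a Satellite Manager. The function name is hypothetical, and the -o BatchMode=yes option is an addition to the command from step 8: it tells ssh not to ask for a password, so a missing key shows up as a failure.

```shell
#!/bin/sh
# Sketch only: runs the step 8 connectivity check non-interactively and
# reports the result. Not an official BDA utility.

check_passwordless_ssh() {
    content_manager=$1
    etc_dir=/app/clarity/dmanager/etc

    # Same command as in step 8, plus BatchMode=yes (an assumption added
    # here) so that ssh exits with an error rather than prompting.
    if su clarity -c "ssh -i $etc_dir/id_rsa -o BatchMode=yes -o StrictHostKeyChecking=no -o UserKnownHostsFile=$etc_dir/ssh_known_hosts -l megamesh $content_manager date"; then
        echo "passwordless login to $content_manager OK"
    else
        echo "passwordless login to $content_manager FAILED" >&2
        return 1
    fi
}

# Example usage (replace the placeholder with the Content Manager name):
# check_passwordless_ssh <Content_manager_name>
```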

During recovery, all actions, templates, and packages are visible on the restored Content Manager, and all actions, templates, and packages available in the restored backup of the Satellite Manager(s) are visible on the Satellite Manager(s). The objects added after the backup of the Content Manager are also visible on the Satellite Manager.

After the Satellite Manager(s) are configured in the mesh, all actions, templates, and packages on the Satellite Manager are overwritten and synced with the Content Manager. The objects added after the backup of the Content Manager are removed from the Satellite Manager.

All templates, actions, packages, agents, and unapproved nodes synced from the Content Manager are visible on the Satellite Manager(s).
