

Important

This documentation space contains information about the on-premises version of BMC Helix Discovery. If you are using the SaaS version of BMC Helix Discovery, see BMC Helix Discovery (SaaS).

Creating a disaster recovery cluster


The BMC Discovery Disaster Recovery (DR) solution is a lightweight utility that provides disaster recovery capabilities across clusters of supported versions of BMC Discovery. It runs on a cluster of appliances, makes incremental backups at midnight, and synchronizes those backups and a zip archive of configuration files to a standby cluster of BMC Discovery appliances. Once the transfer has started, the system checks every 15 minutes to make sure that the transferred data is still up to date. 

If the source cluster experiences a loss of service, an operator can switch to the standby cluster using the synchronized backups and restore services. 

Setting up the DR solution

The BMC Discovery DR solution was introduced with Technology Knowledge Update TKU 2025-Aug-1. The DR solution utility automates the use of external tools and consists of the following files (an upload example follows the list):

  • dr_discovery.py
  • tw_dr_discovery
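
The utility must be present on each appliance where you run it. The following is a minimal upload sketch using sftp; it assumes the destination paths used later in this topic (/usr/tideway/bin/ for tw_dr_discovery and /usr/tideway/utils/ for dr_discovery.py) and that both files are in your current directory:

# Upload the DR utility files to an appliance (repeat for each appliance, for example a1, a2, and a3).
echo 'put tw_dr_discovery' | sftp tideway@a1.company.com:/usr/tideway/bin/
echo 'put dr_discovery.py' | sftp tideway@a1.company.com:/usr/tideway/utils/
# If the utility is not executable after the upload, set the permission on the appliance:
#   chmod +x /usr/tideway/bin/tw_dr_discovery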

Setting up a DR user

The DR solution requires a dedicated BMC Discovery user with a small set of privileges.

To create it, log in to the BMC Discovery UI and create a new group with the following permissions:

  • model/datastore/admin
  • system/settings/read
  • system/settings/write

After that, create a user and assign it to the new group.

The following images show the group and user pages:

[Image: DR-user-1.png – the DR group configuration]

[Image: DR-user-2.png – the DR user configuration]

You need to log in as that user while configuring DR.

Clusters

The following diagram shows the cluster pairing that this procedure sets up: each source cluster appliance (a1.company.com, a2.company.com, and a3.company.com) is paired with a standby cluster appliance (b1.company.com, b2.company.com, and b3.company.com respectively).

[Image: DR-clusters.png – source and standby cluster pairing]

BMC Discovery Outposts

All BMC Discovery Outposts must be registered with both source and destination clusters. 

If you already use BMC Discovery Outposts, they are already connected to your source cluster, so you only need to register them with the destination cluster.

Having all BMC Discovery Outposts registered on both clusters allows them to reconnect to the destination cluster after restore and switch over.

Important

If you skip this step, the BMC Discovery Outposts cannot reconnect to the destination cluster after a restore, and all of the credentials stored on them are lost; you must re-enter them manually.

Time synchronization

Time synchronization settings and status are not included in the backup and restore process. To maintain consistency across clusters, ensure that the secondary cluster is manually configured to match the time synchronization settings of the primary cluster.
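
As a quick check, the following standard Linux commands, run on an appliance in each cluster, show whether the clock is synchronized and which time sources are configured. This is only a sketch; the availability of chrony depends on the appliance OS, and time settings are normally managed through the appliance administration UI:

timedatectl status      # shows whether NTP is enabled and the system clock is synchronized
chronyc sources -v      # lists the configured time sources, if chrony is in use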

CyberArk integration

BMC Discovery provides a user interface to assist with the deployment of the CyberArk AAM service. However, CyberArk AAM is not included in the backup and restore process.

To ensure continued functionality across clusters, make sure that CyberArk AAM is installed and configured on the secondary cluster to match the setup of the primary cluster.

Run the setup

We recommend that you use the screen utility when using any long-running terminal application on a remote host.
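
For example, you can start a named screen session, run the long-running command inside it, and reattach later if your SSH connection drops:

screen -S dr_setup      # start a named session and run the long-running command inside it
# Detach with Ctrl-a d; reattach later with:
screen -r dr_setup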
 
When the upload of the utility files is complete on all of the source cluster appliances, connect to the first appliance, a1.company.com in this example, using SSH and run the setup command:

tw_dr_discovery --setup

Edit the configuration through the setup prompts. The first time you run the setup, the elements of interest are:

  • Email SMTP
  • Email sender
  • Email recipient
  • BMC Discovery username: do not use the system user; use the user that you created in the BMC Discovery UI
  • Secondary cluster member: the hostname of your secondary cluster appliance, b1.company.com in this example
  • Save the configuration

Continue through the setup and answer "y" to each of the following prompts:

  • Test the SSH connection to the secondary member
  • Create the configuration archive
  • Display the generations
    • Change the datastore mode to generational
    • Create a new generation (that is, an incremental backup)
  • Test the email address
  • Register the cron job for the backup
  • Register the cron job for the sync

Your setup should be complete and working at this point.

If not, run tw_dr_discovery --setup again to test and adjust settings until the DR solution works as expected.

Repeat the setup on the other source cluster appliances

The previous section set up the appliance a1.company.com; now repeat the steps for a2.company.com and a3.company.com.

To do this:

  • Download the etc/dr_discovery.json file generated by the setup on a1.company.com.
  • Edit the hostname, changing it from b1.company.com to b2.company.com, then upload the JSON file to the a2.company.com appliance.
  • Change the hostname from b2.company.com to b3.company.com, and upload the JSON file to the a3.company.com appliance.

The following PowerShell example downloads the dr_discovery.json file from a1.company.com, replaces the hostname, and uploads the edited file to the remaining appliances of the source cluster.

# Download the configuration file generated by the setup on a1.company.com.
sftp tideway@a1.company.com:/usr/tideway/etc/dr_discovery.json

# Point the copy at b2.company.com and upload it to a2.company.com.
((Get-Content -Path .\dr_discovery.json) -replace 'b1.company.com','b2.company.com') | Set-Content -Path .\dr_discovery.json
echo 'put dr_discovery.json' | sftp tideway@a2.company.com:/usr/tideway/etc/

# Point the copy at b3.company.com and upload it to a3.company.com.
((Get-Content -Path .\dr_discovery.json) -replace 'b2.company.com','b3.company.com') | Set-Content -Path .\dr_discovery.json
echo 'put dr_discovery.json' | sftp tideway@a3.company.com:/usr/tideway/etc/
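
If you prefer to perform the same steps from a Linux shell, the following sketch is equivalent; it assumes the same hostnames used in this topic and that dr_discovery.json is in your current directory:

# Download the configuration file generated by the setup on a1.company.com.
sftp tideway@a1.company.com:/usr/tideway/etc/dr_discovery.json

# Point the copy at b2.company.com and upload it to a2.company.com.
sed -i 's/b1\.company\.com/b2.company.com/' dr_discovery.json
echo 'put dr_discovery.json' | sftp tideway@a2.company.com:/usr/tideway/etc/

# Point the copy at b3.company.com and upload it to a3.company.com.
sed -i 's/b2\.company\.com/b3.company.com/' dr_discovery.json
echo 'put dr_discovery.json' | sftp tideway@a3.company.com:/usr/tideway/etc/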

Now connect to a2.company.com using SSH and run the setup:

tw_dr_discovery --setup

This time, you are only interested in the following elements of the setup:

  • Test the SSH connection to the secondary member
  • Create the configuration archive
  • Register the cron job for the backup
  • Register the cron job for the sync

Now connect to a3.company.com using SSH and run the setup:

tw_dr_discovery --setup

This time, you are only interested in the following elements of the setup:

  • Test the SSH connection to the secondary member
  • Create the configuration archive
  • Register the cron job for the backup
  • Register the cron job for the sync

The setup phase is complete. The solution creates an incremental backup every night at midnight and runs a sync every 15 minutes.
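
You can confirm that the cron jobs are registered by listing the tideway user's crontab on each source cluster appliance; the setup registers one nightly backup job and one sync job that runs every 15 minutes (the exact entries are written by the setup utility):

crontab -l | grep tw_dr_discovery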

Switch over

The BMC Discovery DR solution supports a manual switch over to the secondary cluster, using the most recent data synchronized from the primary cluster, if the primary becomes unresponsive.
  

If the primary cluster is still operational

Before initiating the switch over, it is crucial to stop data synchronization from the primary to the secondary cluster to prevent conflicts or data inconsistencies.

  • On each member of the primary cluster, run the following command:
    tw_dr_discovery --setup

  • When prompted:
     Do you want to disable the sync cron? [y/N]:
    Respond with y to disable the synchronization cron job.

Restore on the secondary cluster

We recommend that you use the screen utility when using any long-running terminal application on a remote host. 

  • Connect to any appliance in the secondary cluster (e.g., b1.company.com) via SSH.
  • Run the restore command:
    tw_dr_discovery --restore

  • Follow the interactive prompts:

    • The utility will connect to other secondary appliances via SSH.
    • SSH passwords will be requested (not stored); SSH keys will be used afterward.
    • A list of available backups will be displayed (most recent at the top).
    • Confirm the backup to restore

Automated restore steps

Once the backup is selected, the following steps will be executed automatically:

  • Stop services
  • Back up the current configuration and data
  • Deploy configuration files
  • Deploy the selected backup
  • Repeat the same steps on all other appliances in the secondary cluster

Restart services

After the restore process is complete, services are restarted across the secondary cluster.

DNS update

To complete the switch over, you must update the DNS records to point the addresses from the source cluster to the newly restored cluster. This is necessary to:

  • Ensure that BMC Discovery Outposts reconnect to the new active cluster.
  • Redirect user and system traffic to the restored environment.
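
Once the DNS change has propagated, a quick lookup confirms that the source cluster hostnames now resolve to the secondary cluster appliances; the hostnames below follow the example used in this topic:

# The source cluster name should now resolve to the address of the corresponding secondary appliance.
dig +short a1.company.com
dig +short b1.company.com   # compare the two results; they should match after the DNS update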

When you have completed the switch over and confirmed that the cluster is working correctly and the restored data is satisfactory, we recommend that you clean up the /usr/tideway/var/tideway.db/data/datadir-previous directory (delete it or move it to an off-appliance archive) to reclaim disk space.
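
A minimal cleanup sketch, assuming you have verified the restored data and either want to archive the previous datastore contents off the appliance or simply remove them (choose an archive location with enough free space):

# Check how much space the previous datastore contents are using.
du -sh /usr/tideway/var/tideway.db/data/datadir-previous

# Either archive the directory before removing it (the /tmp path is only an example) ...
tar czf /tmp/datadir-previous.tar.gz -C /usr/tideway/var/tideway.db/data datadir-previous
# ... copy the archive off the appliance, then remove the directory to reclaim the disk space.
rm -rf /usr/tideway/var/tideway.db/data/datadir-previous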

Switch back procedure

After a successful switch over, the previously designated secondary cluster assumes the role of the primary cluster and begins ingesting all new data. At this point, the system treats this cluster as the active primary.

To maintain DR readiness, it is essential to re-establish a secondary cluster. This ensures continued protection and high availability.

Steps to reconfigure DR

  1. Designate a New Secondary Cluster
    Identify a cluster to serve as the new secondary. This may be:

    • The original primary cluster (prior to switch over), or
    • A newly provisioned cluster.
  2. Reconfigure the New Primary Cluster
    On each node of the new primary cluster:

    • Execute the DR setup procedure.
    • Update the node’s role to "Primary".
    • Configure the DR target by specifying the address of a node in the new secondary cluster.

This process re-establishes the DR topology and ensures that the new primary cluster is protected by a designated secondary cluster.

Limitations

This section describes known limitations.

Data integrity

The incremental backups are automatically generated every 24 hours.

The sync runs every 15 minutes.

The system only transfers read-only data, which means that the data currently being written on the source cluster is not transferred until it becomes part of the next day's incremental backup.

In a typical scenario, the data available on the secondary cluster is one day old.

The list of available backups displayed during the restore procedure might be older than the previous day, especially if an appliance from the source cluster fails to sync. Backups must be successfully synchronized to all members before they are available for restore.

Manual testing

The backup and sync commands are intended to be run by cron. However, you can run them manually for troubleshooting, or to shorten the interval between backups so that more recent data is available to restore.

tw_dr_discovery --backup

Important

A 35-minute flush period is enforced after a backup. Any backup or sync operation triggered during this window is ignored.

tw_dr_discovery --sync

If you see messages during a restore stating that expected backup generations are missing, it is likely that the sync ran before the flush completed and that a file transfer was skipped to avoid corruption.

This issue is resolved at the next sync, and the restore can run again.

Rollback

If the restore procedure encounters an error, it rolls back the data modifications to a state from which you can rerun the restore.

Important

The services do not restart if a rollback occurs. You must manually verify and rerun the restore process.

Troubleshooting

The following section describes issues that you might encounter while configuring and using the DR solution.

Log file

The utility writes a log file, /usr/tideway/log/tw_dr_discovery.log, that you can use for troubleshooting.
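
For example, to follow the log while a backup, sync, or restore is running:

tail -f /usr/tideway/log/tw_dr_discovery.log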

bad interpreter

If you uploaded the files from a Windows host and see the following error:

$ tw_dr_discovery
-bash: /usr/tideway/bin/tw_dr_discovery: /bin/sh^M: bad interpreter: No such file or directory

The files probably have Windows (CRLF) line endings. To fix them, run the following command on the appliance as the tideway user:

$ dos2unix ~/bin/tw_dr_discovery ~/utils/dr_discovery.py
dos2unix: converting file /usr/tideway/bin/tw_dr_discovery to Unix format...
dos2unix: converting file /usr/tideway/utils/dr_discovery.py to Unix format...

Then tw_dr_discovery will run correctly.

The system user is blocked

The system user becomes blocked after too many incorrect password attempts.

To check whether the system user is blocked, connect to the appliance as the tideway user using SSH, and then run the following:

$ tw_listusers
...
system:
    ...
    fullname: System User
    ...
    auth failures: 12
    user state: USER_STATE_BLOCKED
        reason: "Too many authentication failures (12 attempts) at [snip]"
    ...

The system user is blocked. To unblock it, run:

$ tw_upduser --active system
Set User State USER_STATE_ACTIVE

This action unblocks the system user, and you can log in again.

 

 
