Disaster recovery deployment

A disaster is any event that interrupts the normal operations of an organization, such as a power outage, a system failure or breakdown, or a natural calamity.

Disaster recovery prepares an organization for a disaster by creating and maintaining a standby site. A standby site has an infrastructure identical to that of the primary site, so that the organization can withstand a disaster and continue its operations with minimal or no impact.

The process of disaster recovery involves backing up data on the primary site at regular intervals and restoring the backed-up data on the standby site. This approach is a form of active-passive replication.

BMC Helix IT Operations Management components backed up

Disaster recovery is accomplished by backing up the following components and restoring them on a standby cluster:

- Elasticsearch
- Victoria Metrics
- PostgreSQL
- Kafka
- Zookeeper
- Object Storage data
- Knowledge Module (KM) repository
- AIOps
- Kubernetes Config Maps and Secrets

Important

Disaster recovery is supported for all BMC Helix IT Operations Management applications including BMC Discovery.

Disaster recovery overview

Component	Description
Primary site	The primary site is your data center, which hosts all your live systems.
Standby site	The standby site is the disaster recovery system. You must create and maintain this site. It must have an infrastructure identical to your primary site and must seamlessly take over the functionality if the primary site fails.
Failover	Failover is the process of switching to the standby site when a disaster occurs in the primary site.
Failback	Failback is the process of transferring control back to the primary site to take over the functionality from the standby site.

To configure disaster recovery, you need MinIO and the disaster recovery utility, which is provided with the installer and located at helix-on-prem-deployment-manager/utilities/disaster-recovery.

You must run the disaster recovery utility with the backup option to back up data and with the restore option to restore data. You can configure the backup frequency and the data retention period.

After you configure disaster recovery, data is backed up, saved to MinIO on the primary site, and then replicated to MinIO on the standby site at scheduled intervals.
For more information about MinIO bucket replication, see Bucket Replication in the MinIO documentation.

Options to fail back to the primary site

Failback is the process of transferring control back to the primary site to take over the functionality from the standby site. You can choose one of the following failback options:

1. Bring up the primary site and fail back to it
2. Make the standby site your new primary site
3. Set up a new primary site and fail back to it

The following image depicts the options to fail back:

To fail back to the primary site

Perform the following steps:

1. Configure data backup on Site B.
   Important
   Site B becomes your active or new primary site at this stage.
   For more information about configuring data backup, see "To configure data backup on the primary site" in the Configuring-disaster-recovery topic.
2. Perform one of the following actions:
   - If you chose option 1 or 2, bring up Site A.
   - If you chose option 3, create a new site (Site C).
3. Perform one of the following actions:
   - If you chose option 1 or 2, replicate data from Site B to Site A.
   - If you chose option 3, replicate data from Site B to Site C.
   For more information, see "To configure MinIO data replication from the primary site to the standby site" in the Configuring-disaster-recovery topic.
4. Perform one of the following actions:
   - If you chose option 1, restore data on Site A.
   - If you chose option 3, restore data on Site C.
   Important
   Skip this step if you chose option 2 to fail back (Make the standby site your new primary site).
   For more information, see Restoring-data-on-the-standby-site.
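
The way the steps branch on the chosen failback option can be summarized in a short Python sketch. This is only an illustration of the procedure above, not part of the disaster recovery utility: the names FailbackOption and failback_steps are hypothetical, and the sketch assumes that Site A is the original primary site, Site B is the standby site that took over, and Site C is a newly created site.

```python
from enum import Enum


class FailbackOption(Enum):
    """Hypothetical labels for the three failback options in this topic."""
    BRING_UP_PRIMARY = 1   # Option 1: bring up the primary site and fail back to it
    PROMOTE_STANDBY = 2    # Option 2: make the standby site your new primary site
    NEW_PRIMARY = 3        # Option 3: set up a new primary site and fail back to it


def failback_steps(option: FailbackOption) -> list[str]:
    """Return the ordered failback tasks for the chosen option."""
    # Options 1 and 2 reuse Site A; option 3 targets a newly created Site C.
    target = "Site C" if option is FailbackOption.NEW_PRIMARY else "Site A"

    steps = ["Configure data backup on Site B"]
    if option is FailbackOption.NEW_PRIMARY:
        steps.append("Create a new site (Site C)")
    else:
        steps.append("Bring up Site A")
    steps.append(f"Replicate data from Site B to {target}")

    # The restore task is skipped for option 2, because Site B stays primary.
    if option is not FailbackOption.PROMOTE_STANDBY:
        steps.append(f"Restore data on {target}")
    return steps


if __name__ == "__main__":
    for opt in FailbackOption:
        print(opt.name)
        for number, step in enumerate(failback_steps(opt), start=1):
            print(f"  {number}. {step}")
```

Running the sketch prints one task list per option, which can serve as a checklist when you plan a failback.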

RPO and RTO measurements

Recovery Point Objective (RPO) is the time-based measurement of tolerated data loss. Recovery Time Objective (RTO) is the targeted duration between a failure event and the point at which operations resume.
With the default configuration, the expected RPO is 2 hours.

Important

Disaster recovery is a new feature, and general RTO expectations are still being measured.
We recommend that you perform a trial run of your disaster recovery operation to determine the RPO and RTO that you can expect from your own setup and environment.
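
When you perform a trial run, you can derive the achieved RPO and RTO from three timestamps that you record yourself. The following Python sketch shows one way to do the arithmetic. The function name measure_rpo_rto and its parameters are hypothetical and not part of the product, and the 2-hour target is taken from the default RPO expectation stated above.

```python
from datetime import datetime, timedelta


def measure_rpo_rto(
    last_replicated_backup: datetime,
    failure_time: datetime,
    operations_resumed: datetime,
    rpo_target: timedelta = timedelta(hours=2),  # default RPO expectation
) -> dict:
    """Compute the RPO and RTO observed in a trial run.

    All three timestamps are values you record manually during the trial;
    nothing here queries the disaster recovery utility or MinIO.
    """
    if not last_replicated_backup <= failure_time <= operations_resumed:
        raise ValueError("Timestamps must be in chronological order")

    # Achieved RPO: data written after the last replicated backup is lost.
    rpo = failure_time - last_replicated_backup
    # Achieved RTO: time from the failure until operations resume.
    rto = operations_resumed - failure_time
    return {"rpo": rpo, "rto": rto, "rpo_within_target": rpo <= rpo_target}


if __name__ == "__main__":
    result = measure_rpo_rto(
        last_replicated_backup=datetime(2024, 1, 1, 10, 0),
        failure_time=datetime(2024, 1, 1, 11, 30),
        operations_resumed=datetime(2024, 1, 1, 12, 15),
    )
    print(result)
```

In this example, the last replicated backup is 1 hour 30 minutes older than the simulated failure, so the achieved RPO is within the 2-hour expectation, and the achieved RTO is 45 minutes.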

Where to go from here

Configuring-disaster-recovery