Disaster recovery deployment

A disaster is any event that interrupts the normal operations of an organization, such as a power outage, a system failure or breakdown, or a natural calamity.

Disaster recovery prepares an organization for a disaster by creating and maintaining a standby site. A standby site has an infrastructure that is identical to the primary site, so that the organization can withstand a disaster and continue its operations with minimal or no impact.

The disaster recovery process involves backing up data on the primary site at regular intervals and restoring the backed-up data on the standby site. This is a form of active-passive replication: the primary site handles live operations, and the standby site takes over only when a failover occurs.

BMC Helix IT Operations Management components backed up 

Disaster recovery is accomplished by backing up the following components and restoring them on a standby cluster: 

    • Elasticsearch
    • VictoriaMetrics
    • PostgreSQL
    • Kafka
    • ZooKeeper
    • Object Storage data
    • Knowledge Module (KM) repository
    • AIOps
    • Kubernetes ConfigMaps and Secrets

Important

Disaster recovery is supported for all BMC Helix IT Operations Management applications, including BMC Discovery.

Disaster recovery overview 


Component      Description

Primary site   The primary site is your data center, which hosts all your live systems.

Standby site   The standby site is the disaster recovery system. You must create and maintain this site. It must have an infrastructure identical to your primary site and must seamlessly take over the functionality if the primary site fails.

Failover       Failover is the process of switching to the standby site when a disaster occurs at the primary site.

Failback       Failback is the process of transferring control back to the primary site so that it takes over the functionality from the standby site.


To configure disaster recovery, you need MinIO and the disaster recovery utility. The utility is provided with the installer and is located at helix-on-prem-deployment-manager/utilities/disaster-recovery.

You must run the disaster recovery utility with the backup option to back up data and with the restore option to restore data. You can configure the backup frequency and the data retention period.  
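
The backup frequency and the retention period together determine how many backups are held at any time, which is useful when sizing MinIO storage. The following is a minimal planning sketch, not part of the disaster recovery utility: the function names and the sample values are hypothetical, and the utility's actual pruning behavior is not described on this page.

```python
import math
from datetime import datetime, timedelta


def backups_retained(frequency: timedelta, retention: timedelta) -> int:
    """Approximate number of backups held at steady state (assumed model)."""
    return math.ceil(retention / frequency)


def expired(backup_times, now, retention):
    """Return the backup timestamps that are older than the retention period."""
    cutoff = now - retention
    return [t for t in backup_times if t < cutoff]


# Hypothetical settings: one backup per hour, kept for seven days.
count = backups_retained(timedelta(hours=1), timedelta(days=7))

# Hypothetical backup history checked against a two-day retention period.
now = datetime(2024, 1, 10, 12, 0)
history = [now - timedelta(days=d) for d in (0, 1, 3, 5)]
old = expired(history, now, timedelta(days=2))
print(count, len(old))
```

Multiplying the retained count by the typical size of one backup gives a rough storage estimate for each site.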

After you configure disaster recovery, data is backed up, saved in MinIO on the primary site, and then replicated to MinIO on the standby site at scheduled intervals.
For more information about MinIO bucket replication, see Bucket Replication in the MinIO documentation.

RPO and RTO measurements 

Recovery Point Objective (RPO) is the time-based measurement of tolerated data loss. Recovery Time Objective (RTO) is the targeted duration between a failure event and the point where operations resume.
With the default configuration, the expected RPO is 2 hours.
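
To check whether a planned schedule fits the 2-hour RPO expectation, the worst-case data loss can be estimated from the backup interval, the time a backup takes, and the delay before the backup is replicated to the standby site. The sketch below assumes that each backup captures the data as of the moment it starts and becomes usable on the standby site only after it completes and is replicated. The function names and sample figures are hypothetical and are not taken from the product.

```python
from datetime import timedelta

# Default RPO expectation stated on this page.
RPO_TARGET = timedelta(hours=2)


def worst_case_data_loss(backup_interval, backup_duration, replication_lag):
    """Upper bound on the age of the newest backup usable on the standby site.

    If a disaster strikes just before the next backup becomes usable, the
    standby still holds the previous one, whose data is one interval older.
    """
    return backup_interval + backup_duration + replication_lag


def meets_rpo(backup_interval, backup_duration, replication_lag, target=RPO_TARGET):
    """True if the estimated worst-case loss stays within the RPO target."""
    loss = worst_case_data_loss(backup_interval, backup_duration, replication_lag)
    return loss <= target


# Hypothetical figures, for illustration only.
estimate = worst_case_data_loss(
    backup_interval=timedelta(minutes=60),
    backup_duration=timedelta(minutes=20),
    replication_lag=timedelta(minutes=15),
)
print(estimate, estimate <= RPO_TARGET)
```

In this model, shortening the backup interval is usually the most direct way to reduce the estimate, because the interval is typically the largest of the three terms.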

Important

Disaster recovery is a new feature, and the RTO is still being measured to establish general expectations.
We recommend that you perform a trial run of your disaster recovery operation to learn what RPO and RTO your own setup and environment can achieve.
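
One way to record the outcome of a trial run is to note three timestamps and derive the achieved RPO and RTO from them. The following sketch is illustrative only: the TrialRun class, its field names, and the sample timestamps are hypothetical and are not produced by the disaster recovery utility.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TrialRun:
    last_backup_snapshot: datetime  # data state captured by the restored backup
    failure_time: datetime          # moment the primary site was taken offline
    service_restored: datetime      # moment the standby site resumed operations

    @property
    def achieved_rpo(self) -> timedelta:
        """Data loss window: failure time minus the restored backup's snapshot."""
        return self.failure_time - self.last_backup_snapshot

    @property
    def achieved_rto(self) -> timedelta:
        """Downtime: time from the failure until operations resumed."""
        return self.service_restored - self.failure_time


# Hypothetical timestamps from a single trial run.
run = TrialRun(
    last_backup_snapshot=datetime(2024, 1, 10, 8, 30),
    failure_time=datetime(2024, 1, 10, 9, 45),
    service_restored=datetime(2024, 1, 10, 11, 5),
)
print("RPO:", run.achieved_rpo, "RTO:", run.achieved_rto)
```

Repeating the trial at different points in the backup cycle gives a range rather than a single value, because the achieved RPO depends on how long before the failure the last backup was taken.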
