Configuring disaster recovery
Before you begin
- Set up a standby site identical to your primary site:
- BMC Helix IT Operations Management (BMC Helix ITOM) version 24.2.00 or later must be installed.
- The primary and standby sites must have the same product version, namespace, patches, and hotfixes and must be deployed with the same URLs, except the MinIO URLs (MINIO_API_LB_HOST and MINIO_LB_HOST).
For more information, see Configuration-file-settings.
- Primary and standby should have the same cluster resources.
For more information, see Sizing-and-scalability-considerations.
- Adjust the reserve capacity of the MinIO PVC to accommodate the extra space that disaster recovery requires.
The sizing requirements are available in the Sizing considerations topic. You can use the same values for a compact deployment.
- Configure disaster recovery for BMC Discovery.
To configure disaster recovery for BMC Discovery, contact BMC Support.
To configure data backup on the primary site
Always configure the first data backup during a period of low load and low activity.
Perform the following steps:
- Go to /helix-on-prem-deployment-manager/utilities/disaster-recovery/dr-configs.
- In the disaster-recovery.config file, specify the values of the following parameters:
Parameter
Description
Example
BUCKET_NAME
Specify the name of the bucket used to back up data on MinIO.
Important:
Use the following conventions while assigning a bucket name:
- It can have 3 to 63 characters.
- It can have lowercase letters, numbers, and hyphens (-).
- It must begin and end with a letter or a number.
- It must not start with xn--, because it might be interpreted as Punycode; for example, xn--bucketname.
- It must not have uppercase letters, periods (.), or underscores (_).
- It must not have hyphens next to periods (.); for example, my-.bucket.com or my.-bucket.
- It must not end with a hyphen or -s3alias (-s3alias is reserved for the MinIO bucket access point alias name).
BUCKET_NAME=helixdr-backup
SITE_NAME
Specify a name to identify the site from where you want to back up data.
Important:
- Use only lowercase letters and numbers.
- Do not add any blank spaces.
- Do not use any special character except a hyphen.
SITE_NAME=india
NAMESPACE
Specify the namespace where you have installed BMC Helix ITOM.
NAMESPACE=helix-cluster3
DR_BACKUP_INTERVAL_IN_HOUR
Specify the backup interval in hours. You can set a value from 1 to 24.
We recommend that you set the value of this parameter based on the size of your data.
The value that you specify defines the interval for backing up your data.
For example, if you set the value of this parameter to 1 hour, data backup is performed at the start of every hour as per the cron schedule (0 */1 * * *).
If your current cluster time is 2:15 P.M. on November 2:
- The first backup will occur at 3:00 P.M. This will be a complete data backup.
- Subsequent backups will occur at 4:00 P.M., 5:00 P.M., 6:00 P.M., and so on. These will be incremental backups.
- At 3:00 P.M. on November 3, a complete data backup will occur.
For example, if you set the value of this parameter to 3 hours, data backup is performed at the start of every third hour as per the cron schedule (0 */3 * * *).
If your current cluster time is 2:15 P.M. on November 2:
- The first backup will occur at 3:00 P.M. This will be a complete data backup.
- Subsequent backups will occur at 6:00 P.M., 9:00 P.M., and so on. These will be incremental backups.
- At 3:00 P.M. on November 3, a complete data backup will occur.
Important: For Victoria Metrics, the default interval is one hour. You cannot modify the default interval.
DR_BACKUP_INTERVAL_IN_HOUR=1
DR_MAX_BACKUP_TO_RETAIN
Specify the number of days for which you want to retain the backed-up data.
DR_MAX_BACKUP_TO_RETAIN=5
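Before you run the backup, it can help to check the values against the conventions in the table. The following is a minimal sketch, not part of the BMC tooling: the function names (valid_bucket_name, valid_site_name, valid_interval, backup_hours) are hypothetical, and backup_hours assumes that the interval maps to a cron schedule of the form 0 */N * * *, as in the examples above.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight checks for disaster-recovery.config values.
# These helpers are illustrative only; they are not shipped with the product.

# Return 0 if the bucket name follows the naming conventions in the table.
valid_bucket_name() {
  local LC_ALL=C   # keep [a-z] limited to ASCII lowercase letters
  local name="$1"
  # 3 to 63 characters; lowercase letters, digits, hyphens;
  # must begin and end with a letter or digit.
  local re='^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$'
  [[ "$name" =~ $re ]] || return 1
  # Reserved prefix and suffix.
  [[ "$name" == xn--* ]] && return 1
  [[ "$name" == *-s3alias ]] && return 1
  return 0
}

# Return 0 if the site name has only lowercase letters, digits, and hyphens.
valid_site_name() {
  local LC_ALL=C
  local re='^[a-z0-9-]+$'
  [[ "$1" =~ $re ]]
}

# Return 0 if the interval is a whole number from 1 to 24.
valid_interval() {
  local re='^[0-9]+$'
  [[ "$1" =~ $re ]] || return 1
  (( 10#$1 >= 1 && 10#$1 <= 24 ))
}

# Print the hours of the day (0-23) at which "0 */N * * *" fires.
backup_hours() {
  local n="$1" h out=""
  valid_interval "$n" || return 1
  for (( h = 0; h < 24; h += n )); do
    out+="${out:+ }$h"
  done
  echo "$out"
}

# Example with the sample values from the table.
valid_bucket_name "helixdr-backup" && echo "BUCKET_NAME ok"
echo "Backup hours for a 3-hour interval: $(backup_hours 3)"
```

For a 3-hour interval, backup_hours prints 0 3 6 9 12 15 18 21, which matches the 3:00 P.M., 6:00 P.M., and 9:00 P.M. backups in the example.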
- To configure a data backup, run the following command:
./disaster-recovery.sh backup
Logs are saved in the helix-on-prem-deployment-manager/logs directory. You can check the logs to monitor the progress, success, or failure of the data backup process.
To configure MinIO data replication from the primary site to the standby site
Data is always replicated from the MinIO on the primary site to the MinIO on the standby site. Typically, port 443 must be open on the external load balancer of the standby site, unless you configure a different port for MinIO on the standby site.
- Log in to the controller machine on the standby site:
- Set the value of the NAMESPACE parameter in the disaster-recovery.config file.
- Run the following command:
./disaster-recovery.sh configureRestore
- To create a restore bucket and enable versioning, perform the following steps:
- On the standby site MinIO console, under Administrator, select Buckets.
- At the top-right corner of the console, click Create Bucket.
- In the Bucket Name box, type a name to identify the restore bucket; for example, helixdr-restore.
- To enable versioning, turn on the Versioning toggle.
- Click Create Bucket.
- Log in to the primary site MinIO console by using the URL that you set for the MINIO_LB_HOST parameter in the infra.config file.
For more information, see Configuration-file-settings.
Use the credentials that you set during the deployment of BMC Helix ITOM.
- On the MinIO console, under Administrator, select Buckets.
- From the list of buckets, select the backup bucket.
In the disaster-recovery.config file, you must have set a name for the backup bucket by using the BUCKET_NAME parameter; for example, helixdr-backup.
- To enable versioning, perform the following steps:
- Click the pencil icon.
- In the Versioning on Bucket dialog box, click Enable.
Current Status changes from Unversioned to Versioned.
- Add a replication rule to replicate data from the backup bucket (on the primary site MinIO) to the restore bucket (on the standby site MinIO):
- On the primary site MinIO console, under Administrator, select Buckets.
- From the list of buckets, select the backup bucket; for example, helixdr-backup.
- Go to the Replication tab and click the Add Replication Rule button.
- In the Set Bucket Replication dialog box, enter the following values:
- Target URL - The API end-point of the MinIO on the standby site.
This is the URL that you set for the parameter MINIO_API_LB_HOST in the infra.config file.
- Access Key - User name to access the standby site MinIO.
- Secret Key - Password to access the standby site MinIO.
- Target Bucket - Name of the restore bucket; for example, helixdr-restore.
Leave the other values at their defaults and click Save to set the replication rule.
After you set the replication rule, data from the primary site MinIO gets replicated onto the standby site MinIO.
- To verify that the replication is successful:
- Log in to the MinIO on the standby site and check if the data is available in the restore bucket.
- Check if the data in the backup bucket on the primary site is of the same size as that on the standby site.
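As an alternative to comparing the buckets in the console, the check above can be sketched with the MinIO client (mc). This is an illustration under assumptions, not a documented part of the procedure: it assumes that mc is installed and that aliases for both sites were already created with mc alias set. The alias names primary and standby and the function name buckets_in_sync are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: compare the backup bucket with the restore bucket by using mc diff.
# Assumed prerequisites (placeholders, adjust to your environment):
#   mc alias set primary https://<primary MINIO_API_LB_HOST> <access key> <secret key>
#   mc alias set standby https://<standby MINIO_API_LB_HOST> <access key> <secret key>

buckets_in_sync() {
  local src="$1" dst="$2" differences
  # mc diff lists objects that differ between the two buckets.
  # Any output, including an error message, is treated as "not in sync".
  differences="$(mc diff "$src" "$dst" 2>&1)" || true
  if [[ -z "$differences" ]]; then
    echo "in sync: $src $dst"
    return 0
  fi
  echo "differences found between $src and $dst:"
  echo "$differences"
  return 1
}
```

For example, buckets_in_sync primary/helixdr-backup standby/helixdr-restore. Differences reported shortly after a backup might only reflect replication that is still in progress, so repeat the check before you investigate further.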
To validate the data backup
- Log in to the controller.
- To confirm that the cron jobs were successfully configured, run the following command:
kubectl -n <itom namespace> get cronjob | grep dr-
Verify that all the backup cron jobs are running according to your configured schedule.
- After the first backup job is complete, log in to the MinIO web console and go to the Object Browser.
- Go to the bucket that is configured to back up data; for example, helixdr-backup.
- Open the site folder (<SiteName>; for example, india) where you are backing up your data, and then open the backupStatus folder.
Open the backup.log file and validate that there are no errors.
Download and open the last_backup.json file and validate that there are no errors.
Sample output for a successful backup:
{"EVENTES": "{\"helixdr-backupeventes-1696766450\":\"fd02977e-032c-4898-a602-fac5684a64af\"}", "LOGES": "{\"helixdr-backuploges-1696766451\":\"9a3f0ee0-4b7e-4884-84a3-e7c3ebddec95\"}", "KAFKA": "20231009-040016", "VM": "hourly/2023-10-09:03", "VMAGG": "hourly/2023-10-09:03", "PG": "20231008-080003F_20231009-040004I", "ZOOKEEPER": "20231009-040014", "K8OBJS": "20231009-040005", "DRSREPO": "20231009-040005", "MINIO": "20231009-040004"}
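After you download last_backup.json, a quick way to spot a component that was not backed up is to check that every expected key is present. The following sketch assumes that the key list in the sample output above is complete for your deployment, which might not hold in every environment. The function name missing_backup_keys is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list the component keys that are missing from last_backup.json.
# The expected keys are copied from the sample output; treat the list as an assumption.

missing_backup_keys() {
  local file="$1" key missing=""
  for key in EVENTES LOGES KAFKA VM VMAGG PG ZOOKEEPER K8OBJS DRSREPO MINIO; do
    # Match the quoted key followed by a colon, so that "VM" does not match "VMAGG".
    grep -q "\"$key\"[[:space:]]*:" "$file" || missing+="${missing:+ }$key"
  done
  # Prints an empty line when every key is present.
  echo "$missing"
}
```

For example, missing_backup_keys last_backup.json prints nothing but an empty line for the sample output above. The check confirms only that the keys exist; you still need to review backup.log for errors.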
(Optional) To scale down the standby site
To save resources, after configuring disaster recovery, you can scale down the application pods and keep only the data lake components running on the standby site.
Perform the following steps:
- Go to helix-on-prem-deployment-manager/utilities/disaster-recovery/dr-scale
- Run the following command:
./product_scale.sh down
- Data from the primary site MinIO continues to be replicated to the standby site MinIO, but the applications do not run.
- To scale down the data lake components also, use DOWN-ALL or down-all; for example, ./product_scale.sh down-all. This stops all services except MinIO, Postgres, and the external Elasticsearch, Fluentd, and Kibana (EFK) stacks.
- Do not run the DOWN-ALL or down-all option more than once; repeating it can cause issues during scale up.
(Optional) To stop the data backup process on the primary site
Expect some downtime while you stop the data backup.
- Go to /helix-on-prem-deployment-manager/utilities/disaster-recovery.
- Run the following command:
./disaster-recovery.sh disable
Any data backup process that is in progress is completed, and the subsequent backup process is stopped.
The data backed up in MinIO is not deleted.
Backup of all BMC Helix ITOM applications, except BMC Discovery, is stopped.
To configure disaster recovery again, you must perform all the steps listed in this topic.
Where to go from here