Configuring disaster recovery


As an administrator, use the instructions in this topic to configure disaster recovery.

Before you begin 

  • Set up a standby site identical to your primary site:
    • BMC Helix IT Operations Management (BMC Helix ITOM) version 23.4.00 or later must be installed. 
    • The primary and standby sites must have the same product version, namespace, patches, and hotfixes and must be deployed with the same URLs, except the MinIO URLs (MINIO_API_LB_HOST and MINIO_LB_HOST).
      For more information, see Configuration-file-settings.
    • The primary and standby sites should have the same cluster resources.
      For more information, see Sizing-and-scalability-considerations.
  • Adjust the reserve capacity of the MinIO PVC to accommodate the extra space that the disaster recovery will require.
    Currently, the sizing requirements for a small deployment are available in the Sizing considerations topic. You can use the same values for a compact deployment. 
  • Configure disaster recovery for BMC Discovery.
    To configure disaster recovery for BMC Discovery, contact BMC Support.
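
The requirement that both sites use the same settings except for the MinIO URLs can be checked with a small script. The following is a minimal sketch, not part of the product: it assumes that you have a copy of the configuration file from each site (for example, infra.config) and that the file uses KEY=VALUE lines. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes both files use KEY=VALUE lines.

# Print the settings of one config file, sorted, without comments, blank
# lines, or the two MinIO URL parameters that are allowed to differ.
comparable_settings() {
  grep -Ev '^[[:space:]]*(#|$)' "$1" \
    | grep -Ev '^[[:space:]]*(MINIO_API_LB_HOST|MINIO_LB_HOST)[[:space:]]*=' \
    | sort
}

# Return 0 when the two files agree on every remaining setting.
site_configs_match() {
  primary_settings=$(comparable_settings "$1")
  standby_settings=$(comparable_settings "$2")
  [ "$primary_settings" = "$standby_settings" ]
}
```

For example, site_configs_match primary/infra.config standby/infra.config returns a nonzero status if any setting other than the two MinIO URLs differs. The two file paths are placeholders.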

Important

Depending on the size of your deployment, configuring disaster recovery causes a one-time downtime of about 20 to 30 minutes on the primary site because the following services are restarted:

  • victoria-metrics-cluster-vmstorage
  • victoria-metrics-aggregate-vmstorage-agg
  • minio
  • postgres-bmc-pg-ha
  • logml-model-store
  • ml-model-store

The subsequent data backups will have no downtime.  


To apply the BMC Helix OnPrem Disaster Recovery v24.1.00.004 hotfix

The BMC Helix OnPrem Disaster Recovery v24.1.00.004 hotfix enhances the performance of the disaster recovery functionality. 
Perform the following steps:

  1. If you have already enabled disaster recovery, disable it:
    1. Go to /helix-on-prem-deployment-manager/utilities/disaster-recovery.
    2. Run the following command:

      ./disaster-recovery.sh disable 
  2. Make sure you have downloaded the BMC Helix OnPrem Disaster Recovery v24.1.00.004 hotfix from Electronic Product Distribution (EPD).

  3. To extract the hotfix-24.1.00.004-12-Post.tar.gz file, run the following command:

    tar -xvf hotfix-24.1.00.004-12-Post.tar.gz

    The new-image-list.txt file present in the extracted hotfix folder contains the following container images:

    • 2410004-v1-victoriametrics-vmbackupmanager-v1.100.1   
    • 2410004-v1-bitnami-minio-2024.03.21-rockylinux-9   
    • 2410004-v1-victoriametrics-vmrestore-v1.99.0-enterprise
    • ade-on-prem-dr-de0a5b0-533
  4. Synchronize your local repository with BMC Docker Trusted Registry (DTR).
    For more information, see Setting-up-a-Harbor-registry-in-a-local-network-and-synchronizing-it-with-BMC-DTR or Setting-up-a-Harbor-registry-in-an-air-gapped-environment-and-synchronizing-it-with-BMC-DTR.
  5. To run the hotfix script (hf_script.sh), run the following command:

    bash hf_script.sh /<path to 24.1 deployment manager directory>/helix-on-prem-deployment-manager

    Replace <path to 24.1 deployment manager directory> with the full path of the directory where you have saved the 24.1 deployment manager.
    Example:

    bash hf_script.sh /data/24.1.00/helix-on-prem-deployment-manager

A copy of the helix-on-prem-deployment-manager directory is created alongside the directory specified in the command.

In the example, a new directory helix-on-prem-deployment-manager_HF1 gets created at /data/24.1.00.
No changes are made to the original directory specified in the command.

Important

To enable disaster recovery, use the newly created installer folder.
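
Because later procedures must run from the new installer folder, it can help to resolve that folder once and confirm that it exists. The following is a minimal sketch with hypothetical function names. It assumes that the copy is named with the _HF1 suffix shown in the example and that it contains the same utilities/disaster-recovery directory as the original. Both assumptions might not hold for other hotfixes.

```shell
#!/usr/bin/env bash
# Sketch only: the _HF1 suffix is taken from the example above and may differ
# for other hotfixes.

# Print the expected path of the installer folder created by the hotfix script.
hotfix_installer_dir() {
  # ${1%/} drops one trailing slash, if present, before appending the suffix.
  printf '%s_HF1\n' "${1%/}"
}

# Print the disaster recovery utilities directory inside the hotfix folder,
# or return 1 if that directory does not exist.
dr_utilities_dir() {
  installer_dir=$(hotfix_installer_dir "$1")
  if [ -d "$installer_dir/utilities/disaster-recovery" ]; then
    printf '%s\n' "$installer_dir/utilities/disaster-recovery"
  else
    echo "Hotfix installer folder not found: $installer_dir" >&2
    return 1
  fi
}
```

For example, hotfix_installer_dir /data/24.1.00/helix-on-prem-deployment-manager prints /data/24.1.00/helix-on-prem-deployment-manager_HF1.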

To optimize the storage allocation for disaster recovery backups

Because extra resources have been provisioned for disaster recovery backups, you must optimize the storage allocation by updating the value of the MINIO_DR_STORAGE_SIZE parameter.
Based on your deployment size, update the value of the MINIO_DR_STORAGE_SIZE parameter in the helix-on-prem-deployment-manager/configs/<deployment size>.config file. The recommended values are:

  • Compact - 750Gi
  • Small - 1375Gi
  • Medium - 2250Gi
  • Large - 10625Gi
  • Extra-large - 12080Gi

For example, if the size of your deployment is small, set the value of the MINIO_DR_STORAGE_SIZE parameter as 1375Gi in the helix-on-prem-deployment-manager/configs/small.config file.
MINIO_DR_STORAGE_SIZE = 1375Gi
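
The lookup and update described above can be scripted. The following is a minimal sketch with hypothetical function names. It assumes that the config file already contains a MINIO_DR_STORAGE_SIZE line in KEY = VALUE form, and it rewrites only the value on that line. Back up the config file before you try it.

```shell
#!/usr/bin/env bash
# Sketch only: recommended values are copied from the list above.

# Print the recommended MINIO_DR_STORAGE_SIZE for a deployment size.
recommended_dr_storage_size() {
  case "$1" in
    compact)     echo "750Gi" ;;
    small)       echo "1375Gi" ;;
    medium)      echo "2250Gi" ;;
    large)       echo "10625Gi" ;;
    extra-large) echo "12080Gi" ;;
    *) echo "Unknown deployment size: $1" >&2; return 1 ;;
  esac
}

# Rewrite the MINIO_DR_STORAGE_SIZE value in a config file.
# $1 = deployment size, $2 = path to the matching config file.
set_dr_storage_size() {
  size_value=$(recommended_dr_storage_size "$1") || return 1
  tmp_file=$(mktemp) || return 1
  # Keep everything up to and including "=" and its trailing blanks,
  # then append the new value; all other lines pass through unchanged.
  awk -v value="$size_value" '
    /^MINIO_DR_STORAGE_SIZE[ \t]*=/ {
      match($0, /=[ \t]*/)
      print substr($0, 1, RSTART + RLENGTH - 1) value
      next
    }
    { print }
  ' "$2" > "$tmp_file" && cat "$tmp_file" > "$2"
  status=$?
  rm -f "$tmp_file"
  return "$status"
}
```

For example, set_dr_storage_size small helix-on-prem-deployment-manager/configs/small.config sets the value to 1375Gi.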


To configure data backup on the primary site 

Configure the first data backup during a period of low load and low activity.

Important

There might be a downtime of about 30 minutes when you back up data for the first time. Subsequent data backups will have no downtime.  

Perform the following steps:

  1. Go to /helix-on-prem-deployment-manager/utilities/disaster-recovery/dr-configs

    Important

    If you applied the BMC Helix OnPrem Disaster Recovery v24.1.00.004 hotfix, make sure you run the command from the new hotfix directory; for example, /data/24.1.00/helix-on-prem-deployment-manager_HF1.

  2. In the disaster-recovery.config file, specify the values of the following parameters:

    Parameter 

    Description 

    Example 

    BUCKET_NAME 

    Specify the name of the bucket used to back up data on MinIO. 

    Important:

    Use the following conventions while assigning a bucket name:

    • It can have 3 to 63 characters.
    • It can have lowercase letters, numbers, and hyphens (-).
    • It must begin and end with a letter or a number.
    • It must not start with xn-- as it might get interpreted as a Punycode format; for example, xn--bucketname.
    • It must not have uppercase letters, periods (.), and underscores (_).
    • It must not have hyphens next to periods (.); for example, my-.bucket.com or my.-bucket.
    • It must not end with a hyphen or -s3alias (-s3alias is reserved for the MinIO bucket access point alias name).

    BUCKET_NAME=helixdr-backup 

    SITE_NAME 

    Specify a name to identify the site from where you want to back up data. 

    Important:

    • Use only lower case letters and numbers.
    • Do not add any blank spaces.
    • Do not use any special character except a hyphen.

    SITE_NAME=india

    NAMESPACE 

    Specify the namespace where you have installed BMC Helix ITOM.  

    NAMESPACE=helix-cluster3  

    DR_BACKUP_INTERVAL_IN_HOUR 

    Specify the backup interval in hours. You can set a value from 1 to 24.
    We recommend that you set the value of this parameter based on the size of your data.
    The value that you specify defines the interval for backing up your data.

    For example, if you set the value of this parameter as 1 hour, data backup is performed at the start of every hour as per the cron schedule (0 */1 * * *).
    If your current cluster time is 2:15 P.M. on November 2:

    • The first backup will occur at 3:00 P.M. This will be a complete data backup.
    • Subsequent backups will occur at 4:00 P.M., 5:00 P.M., 6:00 P.M., and so on. These will be incremental backups. 
    • At 3:00 P.M. on November 3, a complete data backup will occur.

    For example, if you set the value of this parameter as 3 hours, data backup is performed at the start of every third hour as per the cron schedule (0 */3 * * *). 
    If your current cluster time is 2:15 P.M. on November 2:

    • The first backup will occur at 3:00 P.M. This will be a complete data backup.
    • Subsequent backups will occur at 6:00 P.M., 9:00 P.M., and so on. These will be incremental backups. 
    • At 3:00 P.M. on November 3, a complete data backup will occur.

    Important: For Victoria Metrics, the default interval is one hour. You cannot modify the default interval.

    DR_BACKUP_INTERVAL_IN_HOUR=1 

     

    DR_MAX_BACKUP_TO_RETAIN 

    Specify the number of days for which you want to retain the backed-up data.  

    DR_MAX_BACKUP_TO_RETAIN=5 

    Important

    If you decide to configure disaster recovery for both BMC Helix ITOM and BMC Helix IT Service Management, make sure the backup schedules and retention periods are in sync.

  3. To configure a data backup, run the following command: 

    ./disaster-recovery.sh backup 

    Logs get saved in the helix-on-prem-deployment-manager/logs directory. You can check the logs to monitor the progress, success, or failure of the data backup process. 

    Important

    If you applied the BMC Helix OnPrem Disaster Recovery v24.1.00.004 hotfix, logs get saved in a new directory; for example, /data/24.1.00/helix-on-prem-deployment-manager_HF1/logs.
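
Before you run the backup command, you can check the values that you plan to set in the disaster-recovery.config file against the conventions described in the parameter descriptions. The following is a minimal sketch with hypothetical function names; it is not part of the product. It checks only the rules stated in this topic, and next_backup_hour assumes that the current time is past the top of the hour and that the hour is passed without a leading zero.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical pre-flight checks for disaster-recovery.config values.
LC_ALL=C   # make the a-z range below match ASCII lowercase letters only

is_valid_bucket_name() {
  name=$1
  # 3 to 63 characters.
  [ "${#name}" -ge 3 ] && [ "${#name}" -le 63 ] || return 1
  case "$name" in
    *[!a-z0-9-]*) return 1 ;;  # only lowercase letters, numbers, and hyphens
    -*|*-)        return 1 ;;  # must begin and end with a letter or a number
    xn--*)        return 1 ;;  # prefix that might be read as Punycode
    *-s3alias)    return 1 ;;  # reserved suffix
  esac
  return 0
}

is_valid_site_name() {
  case "$1" in
    ''|*[!a-z0-9-]*) return 1 ;;  # lowercase letters, numbers, hyphen; no spaces
  esac
  return 0
}

is_valid_backup_interval() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;      # must be a whole number
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 24 ]
}

# Cron schedule described above for an interval of N hours.
backup_cron_schedule() {
  printf '0 */%s * * *\n' "$1"
}

# Hour (0-23) at which the schedule next fires after the given current hour.
# The schedule fires at minute 0 of every hour that is a multiple of N.
next_backup_hour() {
  interval=$1
  current_hour=$2
  next_hour=$(( (current_hour / interval + 1) * interval ))
  if [ "$next_hour" -gt 23 ]; then
    next_hour=0   # the schedule wraps to midnight of the next day
  fi
  echo "$next_hour"
}
```

For example, with a cluster time of 2:15 P.M., next_backup_hour 3 14 prints 15, which matches the 3:00 P.M. first backup in the examples above.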


To configure MinIO data replication from the primary site to the standby site 

Data is always replicated from the MinIO on the primary site to the MinIO on the standby site. Typically, port 443 must be open on the standby site (external load balancer) unless you configure a different port for MinIO on the standby site.

Best practice

Make sure the MinIO on the standby site is reachable from the primary site.

The MinIO on the standby site should have sufficient network bandwidth for efficient and quick data replication.


  1. Log in to the controller machine on the standby site and run the following commands:

    kubectl -n <ITOM NAMESPACE> annotate ingress minio-api --overwrite=true nginx.ingress.kubernetes.io/proxy-body-size=0


    kubectl -n <ITOM NAMESPACE> annotate ingress minio --overwrite=true nginx.ingress.kubernetes.io/proxy-body-size=0
  2. To create a restore bucket and enable versioning, perform the following steps:
    1. On the standby site MinIO console, under Administrator, select Buckets.
    2. At the top-right corner of the console, click Create Bucket.
    3. In the Bucket Name box, type a name to identify the restore bucket; for example, helixdr-restore.
    4. To enable versioning, turn on the Versioning toggle.
    5. Click Create Bucket.
  3. Log on to the primary site MinIO console by using the URL you set for the parameter MINIO_LB_HOST in the infra.config file.
    For more information, see Configuration-file-settings.
    Use the credentials you set during the deployment of BMC Helix ITOM.
  4. On the MinIO console, under Administrator, select Buckets.
  5. From the list of buckets, select the backup bucket.
    This is the bucket that you named by using the BUCKET_NAME parameter in the disaster-recovery.config file; for example, helixdr-backup.
  6. To enable versioning, perform the following steps:
    1. Click the pencil icon.
    2. In the Versioning on Bucket dialog box, click Enable.
      Current Status changes from Unversioned to Versioned.
  7. Add a replication rule to replicate data from the backup bucket (on the primary site MinIO) to the restore bucket (on the standby site MinIO):
    1. On the primary site MinIO console, under Administrator, select Buckets.
    2. From the list of buckets, select the backup bucket; for example, helixdr-backup.
    3. Go to the Replication tab and click the Add Replication Rule button.
  8. In the Set Bucket Replication dialog box, enter the following values:
    1. Target URL - The API endpoint of the MinIO on the standby site.
      This is the URL that you set for the parameter MINIO_API_LB_HOST in the infra.config file.
    2. Access Key - User name to access the standby site MinIO.
    3. Secret Key - Password to access the standby site MinIO.
    4. Target Bucket - Name of the restore bucket; for example, helixdr-restore.
    5. Leave the other values at their defaults and click Save to set the replication rule.
      After you set the replication rule, data from the primary site MinIO gets replicated onto the standby site MinIO.

      Important

      In the Replication Options area, make sure the Existing Objects toggle is turned on.

  9. To verify that the replication is successful:
    1. Log in to the MinIO on the standby site and check if the data is available in the restore bucket.
    2. Check if the data in the backup bucket on the primary site is of the same size as that in the restore bucket on the standby site.

Important

If the data backup process fails, an email is sent to the tenant email ID defined in the infra.config file. 


To validate the data backup

  1. Log in to the controller.
  2. To confirm that the cron jobs were configured successfully, run the following command:

    kubectl -n <itom namespace> get cronjob | grep dr-


    Verify that all the backup cron jobs are running according to your configured schedule.

  3. After completing the first backup job, log in to the MinIO web console and go to the Object Browser.
  4. Go to the bucket that is configured to back up data; for example, helixdr-backup.
  5. Open the site folder (<SiteName>; for example, india) where you are backing up your data, and then open the backupStatus folder.
  6. Open the backup.log file and validate that there are no errors.

    Sample output of a successful backup
    Backup start time    : 2024-4-10 8:00
    Backup end time      : 2024-04-10 08:09:25.382721
    Status               : Backup Completed successfully

    Backup succeeded for : ['drs-repo', 'events-es', 'log-es', 'kafka', 'minio', 'zookeeper', 'Victoria Metrics', 'Postgres DB', 'K8S Objects']
    Backup failed for    : []
    Backup timed out for : []

    Backup details data  : {'EVENTES': '{"helixdr-backupeventes-1712736045":"7c14b0c3-5399-423e-8771-8a9285ff3254"}', 'LOGES': '{"helixdr-backuploges-1712736047":"1a9c134f-13cf-4073-b717-3fc923a00f43"}', 'KAFKA': '20240410-080056', 'VM': 'daily/2024-04-10', 'VMAGG': 'daily/2024-04-10', 'PG': '20240410-080003F', 'ZOOKEEPER': '20240410-080014', 'K8OBJS': '20240410-080046', 'DRSREPO': '20240410:eab5bf55', 'MINIO': '20240410-080006-F'}
    MinioUsage Data : 5.7 GiB Used, 7 Buckets, 205 Objects

    Sample output of a failed backup
    Backup start time    : 2024-5-2 9:00
    Backup end time      : 2024-05-02 09:56:03.069877
    Status               : Failure

    Backup succeeded for : ['drs-repo', 'events-es', 'log-es', 'kafka', 'minio', 'zookeeper', 'Victoria Metrics', 'Postgres DB', 'K8S Objects']
    Backup failed for    : ['Victoria Metrics Aggregate Cluster']
    Backup timed out for : []

    Backup details data  : {'EVENTES': '{"itsmdr-backupeventes-1714640446":"d4b140d4-af53-4f42-b934-06cde2048e62"}', 'LOGES': '{"itsmdr-backuploges-1714640445":"b1f1fffb-7da0-4ec7-b6f1-74b32812e5dd"}', 'KAFKA': '20240502-090018', 'VM': 'daily/2024-05-02', 'VMAGG': '', 'PG': '20240502-090004F', 'ZOOKEEPER': '20240502-090013', 'K8OBJS': '20240502-090006', 'DRSREPO': '20240502:81575db4', 'MINIO': '20240502-090004-F'}
    MinioUsage Data : 11 GiB Used, 8 Buckets, 9,340 Objects

  7. Download and open the last_backup.json file and validate that there are no errors.
    Sample output for a successful backup:

    {"EVENTES": "{\"helixdr-backupeventes-1696766450\":\"fd02977e-032c-4898-a602-fac5684a64af\"}", "LOGES": "{\"helixdr-backuploges-1696766451\":\"9a3f0ee0-4b7e-4884-84a3-e7c3ebddec95\"}", "KAFKA": "20231009-040016", "VM": "hourly/2023-10-09:03", "VMAGG": "hourly/2023-10-09:03", "PG": "20231008-080003F_20231009-040004I", "ZOOKEEPER": "20231009-040014", "K8OBJS": "20231009-040005", "DRSREPO": "20231009-040005", "MINIO": "20231009-040004"}

    This confirms that the standby site configuration is completed successfully. 
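
The manual check of the backup.log file can also be scripted after you download the file. The following is a minimal sketch with a hypothetical function name. It assumes that the log uses the field labels shown in the sample outputs above, and it treats a backup as successful only if the status line reports completion and both the failed and timed-out lists are empty.

```shell
#!/usr/bin/env bash
# Sketch only: relies on the field labels shown in the sample backup.log output.

# Return 0 when the downloaded backup.log reports a fully successful backup.
backup_log_ok() {
  log_file=$1
  if ! grep -Eq '^[[:space:]]*Status[[:space:]]*:[[:space:]]*Backup Completed successfully' "$log_file"; then
    echo "Status line does not report a completed backup" >&2
    return 1
  fi
  # "[]" means that no component is listed as failed.
  if ! grep -Eq '^[[:space:]]*Backup failed for[[:space:]]*:[[:space:]]*\[\][[:space:]]*$' "$log_file"; then
    echo "At least one component failed to back up" >&2
    return 1
  fi
  # "[]" means that no component is listed as timed out.
  if ! grep -Eq '^[[:space:]]*Backup timed out for[[:space:]]*:[[:space:]]*\[\][[:space:]]*$' "$log_file"; then
    echo "At least one component timed out" >&2
    return 1
  fi
  return 0
}
```

For example, backup_log_ok ./backup.log returns 0 for the successful sample above and 1 for the failed sample, where Victoria Metrics Aggregate Cluster is listed as failed.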


(Optional) To scale down the standby site

To save resources, after configuring disaster recovery, you can scale down the application pods and keep only the data lake components running on the standby site. 

Perform the following steps:

  1. Go to helix-on-prem-deployment-manager/utilities/disaster-recovery/dr-scale
  2. Run the following command:

    ./product_scale.sh down

    Data from the primary site MinIO continues to be replicated onto the standby site MinIO but the applications will not run.

Important

  • If you upgrade the primary site to a new version of BMC Helix ITOM, you must upgrade the standby site to the same version.
  • If you apply a hotfix on the primary site, you must apply the same hotfix on the standby site.

(Optional) To stop the data backup process on the primary site 

Expect some downtime while you stop the data backup. 

  1. Go to /helix-on-prem-deployment-manager/utilities/disaster-recovery.

    If you applied the hotfix-24.1.00.004-12-Post.tar.gz, make sure you run the command from the new hotfix directory; for example, /data/24.1.00/helix-on-prem-deployment-manager_HF1.

  2. Run the following command:
    ./disaster-recovery.sh disable 

Any data backup process that is in progress is completed, and the subsequent backup process is stopped.

The data backed up in MinIO is not deleted.  

Backup of all BMC Helix ITOM applications is stopped, except for BMC Discovery.

To configure disaster recovery again, you must perform all the steps listed in this topic. 


 
