Configuring disaster recovery


As an administrator, use the instructions in this topic to configure disaster recovery.

Before you begin 

  • Set up a standby site identical to your primary site:
    • BMC Helix IT Operations Management (BMC Helix ITOM) version 24.2.00 or later must be installed. 
    • The primary and standby sites must have the same product version, namespace, patches, and hotfixes and must be deployed with the same URLs, except the MinIO URLs (MINIO_API_LB_HOST and MINIO_LB_HOST).
      For more information, see Configuration-file-settings.
    • The primary and standby sites should have the same cluster resources.
      For more information, see Sizing-and-scalability-considerations.
  • Adjust the reserve capacity of the MinIO PVC to accommodate the extra space that the disaster recovery will require.
    The sizing requirements are available in the Sizing considerations topic. You can use the same values for a compact deployment. 
  • Configure disaster recovery for BMC Discovery.
    To configure disaster recovery for BMC Discovery, contact BMC Support.


To configure data backup on the primary site 

Configure the first data backup during a period of low load and activity.

Important

Depending on the size of your deployment, configuring disaster recovery causes a one-time downtime of about 20 to 30 minutes on the primary site because the following services restart:

  • victoria-metrics-cluster-vmstorage
  • victoria-metrics-aggregate-vmstorage-agg
  • minio
  • postgres-bmc-pg-ha
  • logml-model-store
  • ml-model-store

Subsequent data backups cause no downtime.


Perform the following steps:

  1. Go to /helix-on-prem-deployment-manager/utilities/disaster-recovery/dr-configs
  2. In the disaster-recovery.config file, specify the values of the following parameters:

    Parameter: BUCKET_NAME

    Description: Specify the name of the bucket used to back up data on MinIO.

    Important: Use the following conventions when you assign a bucket name:

    • It can have 3 to 63 characters.
    • It can have lowercase letters, numbers, and hyphens (-).
    • It must begin and end with a letter or a number.
    • It must not start with xn-- because that prefix might be interpreted as Punycode; for example, xn--bucketname.
    • It must not have uppercase letters, periods (.), or underscores (_).
    • It must not have hyphens next to periods (.); for example, my-.bucket.com or my.-bucket.
    • It must not end with a hyphen or -s3alias (-s3alias is reserved for the MinIO bucket access point alias name).

    Example: BUCKET_NAME=helixdr-backup

    Parameter: SITE_NAME

    Description: Specify a name to identify the site from which you want to back up data.

    Important:

    • Use only lowercase letters and numbers.
    • Do not add any blank spaces.
    • Do not use any special character except a hyphen.

    Example: SITE_NAME=India

    Parameter: NAMESPACE

    Description: Specify the namespace where you have installed BMC Helix ITOM.

    Example: NAMESPACE=helix-cluster3

    Parameter: DR_BACKUP_INTERVAL_IN_HOUR

    Description: Specify the backup interval in hours. You can set a value from 1 to 24.
    We recommend that you set the value of this parameter based on the size of your data.
    The value that you specify defines the interval for backing up your data.

    For example, if you set the value of this parameter to 1, data backup is performed at the start of every hour as per the cron schedule (0 */1 * * *).
    If your current cluster time is 2:15 P.M. on November 2:

    • The first backup will occur at 3:00 P.M. This will be a complete data backup.
    • Subsequent backups will occur at 4:00 P.M., 5:00 P.M., 6:00 P.M., and so on. These will be incremental backups.
    • At 3:00 P.M. on November 3, a complete data backup will occur.

    As another example, if you set the value of this parameter to 3, data backup is performed at the start of every third hour as per the cron schedule (0 */3 * * *).
    If your current cluster time is 2:15 P.M. on November 2:

    • The first backup will occur at 3:00 P.M. This will be a complete data backup.
    • Subsequent backups will occur at 6:00 P.M., 9:00 P.M., and so on. These will be incremental backups.
    • At 3:00 P.M. on November 3, a complete data backup will occur.

    Important: For Victoria Metrics, the default interval is one hour. You cannot modify the default interval.

    Example: DR_BACKUP_INTERVAL_IN_HOUR=1

    Parameter: DR_MAX_BACKUP_TO_RETAIN

    Description: Specify the number of days for which you want to retain the backed-up data.

    Example: DR_MAX_BACKUP_TO_RETAIN=5

    Important

    If you decide to configure disaster recovery for both BMC Helix ITOM and BMC Helix IT Service Management, make sure the backup schedules and retention periods are in sync.

  3. To configure a data backup, run the following command: 

    ./disaster-recovery.sh backup 

    Logs are saved in the helix-on-prem-deployment-manager/logs directory. You can check the logs to monitor the progress, success, or failure of the data backup process.
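
As a rough illustration of the parameter conventions in step 2, the following Python sketch checks a few values before the backup command is run and previews when a cron schedule of the form 0 */N * * * would next fire. It is not part of the product. The function names are hypothetical, the KEY=VALUE parsing (including # comments) is an assumption about the file format, the rules are applied as literally stated in the parameter descriptions, and the sample values and the year 2024 are arbitrary.

```python
import re
from datetime import datetime, timedelta

# Bucket rules as listed above: 3 to 63 characters, lowercase letters,
# digits, and hyphens only, beginning and ending with a letter or digit.
BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")
# Site rule as listed above: lowercase letters, digits, hyphens, no spaces.
SITE_RE = re.compile(r"[a-z0-9-]+")


def parse_config(text):
    """Read KEY=VALUE lines; blank lines and '#' comments are skipped (assumed format)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values


def check_config(values):
    """Return a list of problems; an empty list means no listed rule was violated."""
    problems = []
    bucket = values.get("BUCKET_NAME", "")
    if not BUCKET_RE.fullmatch(bucket):
        problems.append("BUCKET_NAME breaks the length or character rules")
    elif bucket.startswith("xn--") or bucket.endswith("-s3alias"):
        problems.append("BUCKET_NAME uses a reserved prefix or suffix")
    if not SITE_RE.fullmatch(values.get("SITE_NAME", "")):
        problems.append("SITE_NAME must use lowercase letters, numbers, or hyphens")
    interval = values.get("DR_BACKUP_INTERVAL_IN_HOUR", "")
    if not (re.fullmatch(r"[0-9]+", interval) and 1 <= int(interval) <= 24):
        problems.append("DR_BACKUP_INTERVAL_IN_HOUR must be a whole number from 1 to 24")
    return problems


def next_backups(now, interval_hours, count=3):
    """Next firing times of the cron schedule '0 */N * * *' strictly after `now`."""
    firing_hours = set(range(0, 24, interval_hours))  # */N matches hours 0, N, 2N, ...
    runs = []
    t = now.replace(minute=0, second=0, microsecond=0)
    while len(runs) < count:
        t += timedelta(hours=1)
        if t.hour in firing_hours:
            runs.append(t)
    return runs


# Hypothetical sample values, not taken from a real deployment.
sample = """BUCKET_NAME=helixdr-backup
SITE_NAME=site-1
DR_BACKUP_INTERVAL_IN_HOUR=3
"""
print(check_config(parse_config(sample)))  # prints []
for run in next_backups(datetime(2024, 11, 2, 14, 15), 3):
    print(run.strftime("%H:%M"))  # prints 15:00, 18:00, 21:00 on separate lines
```

Assuming standard cron step semantics, an interval that does not divide 24 evenly (for example, 5) fires at hours 0, 5, 10, 15, and 20 and then again at midnight, so the last gap of the day is shorter than the configured interval.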


To configure MinIO data replication from the primary site to the standby site 

Data is always replicated from MinIO on the primary site to MinIO on the standby site. Typically, port 443 must be open on the standby site (external load balancer), unless you configure a different port for MinIO on the standby site.

Best practice

Make sure the MinIO on the standby site is reachable from the primary site.

The MinIO on the standby site should have sufficient network bandwidth for efficient and quick data replication.


  1. Log in to the controller machine on the standby site, and perform the following steps:
    1. Set the value of the namespace in the disaster-recovery.config file.
    2. Run the following command:

       ./disaster-recovery.sh configureRestore

  2. To create a restore bucket and enable versioning, perform the following steps:
    1. On the standby site MinIO console, under Administrator, select Buckets.
    2. At the top-right corner of the console, click Create Bucket.
    3. In the Bucket Name box, type a name to identify the restore bucket; for example, helixdr-restore.
    4. To enable versioning, turn on the Versioning toggle.
    5. Click Create Bucket.
  3. Log in to the primary site MinIO console by using the URL that you set for the MINIO_LB_HOST parameter in the infra.config file.
    For more information, see Configuration-file-settings.
    Use the credentials that you set during the deployment of BMC Helix ITOM.
  4. On the MinIO console, under Administrator, select Buckets.
  5. From the list of buckets, select the backup bucket.
    This is the bucket that you named by using the BUCKET_NAME parameter in the disaster-recovery.config file; for example, helixdr-backup.
  6. To enable versioning, perform the following steps:
    1. Click the pencil icon.
    2. In the Versioning on Bucket dialog box, click Enable.
      Current Status changes from Unversioned to Versioned.
  7. Add a replication rule to replicate data from the backup bucket (on the primary site MinIO) to the restore bucket (on the standby site MinIO):
    1. On the primary site MinIO console, under Administrator, select Buckets.
    2. From the list of buckets, select the backup bucket; for example, helixdr-backup.
    3. Go to the Replication tab and click Add Replication Rule.
  8. In the Set Bucket Replication dialog box, enter the following values:
    1. Target URL - The API endpoint of the MinIO on the standby site.
      This is the URL that you set for the MINIO_API_LB_HOST parameter in the infra.config file.
    2. Access Key - User name to access the standby site MinIO.
    3. Secret Key - Password to access the standby site MinIO.
    4. Target Bucket - Name of the restore bucket; for example, helixdr-restore.
    5. Leave the other values at their defaults and click Save to set the replication rule.
      After you set the replication rule, data from the primary site MinIO is replicated to the standby site MinIO.

      Important

      In the Replication Options area, make sure the Existing Objects toggle is turned on.

  9. To verify that the replication is successful:
    1. Log in to the MinIO on the standby site and check that the data is available in the restore bucket.
    2. Check that the data in the backup bucket on the primary site is the same size as the data in the restore bucket on the standby site.
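
The size comparison in the last step can also be scripted. The following Python sketch compares two object listings and reports objects that are missing on the standby site or differ in size. It is only an illustration. It assumes each listing is available as JSON lines with key and size fields, which is the shape that the MinIO client command mc ls --recursive --json is assumed to produce here. The function names and the object keys in the example are hypothetical.

```python
import json


def load_listing(json_lines):
    """Map object key -> size in bytes from JSON lines (assumed fields: key, size)."""
    listing = {}
    for line in json_lines.splitlines():
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if "key" in entry:
            listing[entry["key"]] = int(entry.get("size", 0))
    return listing


def compare_listings(primary, standby):
    """Report objects missing on the standby site or present with a different size."""
    missing = sorted(k for k in primary if k not in standby)
    size_mismatch = sorted(
        k for k in primary if k in standby and primary[k] != standby[k]
    )
    return {
        "missing": missing,
        "size_mismatch": size_mismatch,
        "primary_bytes": sum(primary.values()),
        "standby_bytes": sum(standby.values()),
    }


# Hypothetical listings of the backup bucket (primary) and restore bucket (standby).
primary = load_listing(
    '{"key": "India/backupStatus/backup.log", "size": 120}\n'
    '{"key": "India/example-object", "size": 10}\n'
)
standby = load_listing('{"key": "India/backupStatus/backup.log", "size": 120}\n')
print(compare_listings(primary, standby)["missing"])  # prints ['India/example-object']
```

A difference reported shortly after a backup run does not necessarily indicate a failure, because replication may still be catching up; repeating the comparison later gives a more reliable picture.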

Important

If the data backup process fails, an email is sent to the tenant email ID defined in the infra.config file. 

To validate the data backup

  1. Log in to the controller.
  2. To confirm that the cron jobs were configured successfully, run the following command:

    kubectl -n <itom namespace> get cronjob | grep dr-


    Verify that all the backup cron jobs are running according to your configured schedule.

  3. After the first backup job is complete, log in to the MinIO web console and go to the Object Browser.
  4. Go to the bucket that is configured to back up data; for example, helixdr-backup.
  5. Open the site folder (<SiteName>; for example, India) where you are backing up your data, and then open the backupStatus folder.
  6. Open the backup.log file and validate that there are no errors.

    Sample output of a successful backup:
    Backup start time    : 2024-4-10 8:00
    Backup end time      : 2024-04-10 08:09:25.382721
    Status               : Backup Completed successfully

    Backup succeeded for : ['drs-repo', 'events-es', 'log-es', 'kafka', 'minio', 'zookeeper', 'Victoria Metrics', 'Postgres DB', 'K8S Objects']
    Backup failed for    : []
    Backup timed out for : []

    Backup details data  : {'EVENTES': '{"helixdr-backupeventes-1712736045":"7c14b0c3-5399-423e-8771-8a9285ff3254"}', 'LOGES': '{"helixdr-backuploges-1712736047":"1a9c134f-13cf-4073-b717-3fc923a00f43"}', 'KAFKA': '20240410-080056', 'VM': 'daily/2024-04-10', 'VMAGG': 'daily/2024-04-10', 'PG': '20240410-080003F', 'ZOOKEEPER': '20240410-080014', 'K8OBJS': '20240410-080046', 'DRSREPO': '20240410:eab5bf55', 'MINIO': '20240410-080006-F'}
    MinioUsage Data : 5.7 GiB Used, 7 Buckets, 205 Objects

    Sample output of a failed backup:
    Backup start time    : 2024-5-2 9:00
    Backup end time      : 2024-05-02 09:56:03.069877
    Status               : Failure

    Backup succeeded for : ['drs-repo', 'events-es', 'log-es', 'kafka', 'minio', 'zookeeper', 'Victoria Metrics', 'Postgres DB', 'K8S Objects']
    Backup failed for    : ['Victoria Metrics Aggregate Cluster']
    Backup timed out for : []

    Backup details data  : {'EVENTES': '{"itsmdr-backupeventes-1714640446":"d4b140d4-af53-4f42-b934-06cde2048e62"}', 'LOGES': '{"itsmdr-backuploges-1714640445":"b1f1fffb-7da0-4ec7-b6f1-74b32812e5dd"}', 'KAFKA': '20240502-090018', 'VM': 'daily/2024-05-02', 'VMAGG': '', 'PG': '20240502-090004F', 'ZOOKEEPER': '20240502-090013', 'K8OBJS': '20240502-090006', 'DRSREPO': '20240502:81575db4', 'MINIO': '20240502-090004-F'}
    MinioUsage Data : 11 GiB Used, 8 Buckets, 9,340 Objects

  7. Download and open the last_backup.json file and validate that there are no errors.
    Sample output for a successful backup:

    {"EVENTES": "{\"helixdr-backupeventes-1696766450\":\"fd02977e-032c-4898-a602-fac5684a64af\"}", "LOGES": "{\"helixdr-backuploges-1696766451\":\"9a3f0ee0-4b7e-4884-84a3-e7c3ebddec95\"}", "KAFKA": "20231009-040016", "VM": "hourly/2023-10-09:03", "VMAGG": "hourly/2023-10-09:03", "PG": "20231008-080003F_20231009-040004I", "ZOOKEEPER": "20231009-040014", "K8OBJS": "20231009-040005", "DRSREPO": "20231009-040005", "MINIO": "20231009-040004"}
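
If you check backup.log regularly, the manual inspection can be approximated with a short script. The following Python sketch reads the "Label : value" lines shown in the sample outputs and reports whether any component failed or timed out. It is a minimal sketch, not a product utility. It assumes the log keeps the layout of the samples above, and the function names are hypothetical.

```python
import ast


def parse_backup_log(text):
    """Split 'Label : value' lines into a dict keyed by the label."""
    fields = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")  # split at the first colon only
        if sep:
            fields[label.strip()] = value.strip()
    return fields


def backup_result(fields):
    """Summarize the status line and the failed and timed-out component lists."""
    # The lists are printed as Python-style literals, for example ['minio'] or [].
    failed = ast.literal_eval(fields.get("Backup failed for", "[]"))
    timed_out = ast.literal_eval(fields.get("Backup timed out for", "[]"))
    status_ok = fields.get("Status", "").lower().startswith(
        "backup completed successfully"
    )
    return {
        "ok": status_ok and not failed and not timed_out,
        "failed": failed,
        "timed_out": timed_out,
    }


# Lines copied from the failed-backup sample above.
log_text = """Status               : Failure
Backup failed for    : ['Victoria Metrics Aggregate Cluster']
Backup timed out for : []
"""
result = backup_result(parse_backup_log(log_text))
print(result["ok"], result["failed"])  # prints False ['Victoria Metrics Aggregate Cluster']
```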

(Optional) To scale down the standby site

To save resources, after configuring disaster recovery, you can scale down the application pods and keep only the data lake components running on the standby site. 

Perform the following steps:

  1. Go to helix-on-prem-deployment-manager/utilities/disaster-recovery/dr-scale
  2. Run the following command:

    ./product_scale.sh down

    Data from the primary site MinIO continues to be replicated onto the standby site MinIO but the applications will not run.

Important

  • If you upgrade the primary site to a new version of BMC Helix ITOM, you must upgrade the standby site to the same version.
  • If you apply a hotfix on the primary site, you must apply the same hotfix on the standby site.


(Optional) To stop the data backup process on the primary site

Expect some downtime while you stop the data backup. 

  1. Go to /helix-on-prem-deployment-manager/utilities/disaster-recovery.
  2. Run the following command:
    ./disaster-recovery.sh disable 

Any data backup process that is in progress is completed, and the subsequent backup process is stopped.

The data backed up in MinIO is not deleted.  

Backup of all BMC Helix ITOM applications is stopped, except BMC Discovery.

To configure disaster recovery again, you must perform all the steps listed in this topic. 

 
