Backing up and replicating components for disaster recovery


As an administrator, use the instructions in this topic to configure disaster recovery.

 

Before you begin 

  • Set up a standby site identical to your primary site:
    • BMC Helix IT Operations Management (BMC Helix ITOM) version 24.2.00 or later must be installed.
    • The primary and standby sites must have the same product version, namespace, patches, and hotfixes and must be deployed with the same URLs, except the MinIO URLs (MINIO_API_LB_HOST and MINIO_LB_HOST). 
      For more information, see Configuration file settings.
    • The primary and standby sites must have the same cluster resources.
      For more information, see Sizing and scalability considerations.
  • Adjust the reserve capacity of the MinIO PVC to accommodate the extra space that the disaster recovery will require. 
    The sizing requirements are available in the Sizing considerations topic. You can use the same values for a compact deployment. 
  • Configure disaster recovery for BMC Discovery. 
    To configure disaster recovery for BMC Discovery, refer to Creating a disaster recovery cluster.

 

To configure data backup on the primary site 

Always configure the first data backup during a period of low load and low activity.

Important

Depending on the size of your deployment, configuring disaster recovery causes a one-time downtime of about 20 to 30 minutes on the primary site because the following services are restarted:

  • victoria-metrics-cluster-vmstorage
  • victoria-metrics-aggregate-vmstorage-agg
  • minio
  • postgres-bmc-pg-ha
  • logml-model-store
  • ml-model-store

Subsequent data backups do not cause any downtime.

Perform the following steps:

  1. Go to /helix-on-prem-deployment-manager/utilities/disaster-recovery/dr-configs
  2. In the disaster-recovery.config file, specify the values of the following parameters:

    BUCKET_NAME
      Description: Specify the name of the bucket used to back up data on MinIO.
      Important: Use the following conventions when you assign a bucket name:
        • It can have 3 to 63 characters.
        • It can have lowercase letters, numbers, and hyphens (-).
        • It must begin and end with a letter or a number.
        • It must not start with xn-- because it might be interpreted as Punycode; for example, xn--bucketname.
        • It must not have uppercase letters, periods (.), or underscores (_).
        • It must not have hyphens next to periods (.); for example, my-.bucket.com or my.-bucket.
        • It must not end with a hyphen or -s3alias (-s3alias is reserved for the MinIO bucket access point alias name).
      Example: BUCKET_NAME=helixdr-backup

    SITE_NAME
      Description: Specify a name to identify the site from which you want to back up data.
      Important:
        • Use only lowercase letters and numbers.
        • Do not add any blank spaces.
        • Do not use any special character except a hyphen.
      Example: SITE_NAME=India

    NAMESPACE
      Description: Specify the namespace where you have installed BMC Helix ITOM.
      Example: NAMESPACE=helix-cluster3

    DR_BACKUP_INTERVAL_IN_HOUR
      Description: Specify the interval, in hours, at which your data is backed up. You can set a value from 1 to 24.
      We recommend that you set the value of this parameter based on the size of your data.

      For example, if you set the value of this parameter to 1, data backup is performed at the start of every hour as per the cron schedule (0 */1 * * *).
      If your current cluster time is 2:15 P.M. on November 2:
        • The first backup will occur at 3:00 P.M. This will be a complete data backup.
        • Subsequent backups will occur at 4:00 P.M., 5:00 P.M., 6:00 P.M., and so on. These will be incremental backups.
        • At 3:00 P.M. on November 3, a complete data backup will occur.

      For example, if you set the value of this parameter to 3, data backup is performed at the start of every third hour as per the cron schedule (0 */3 * * *).
      If your current cluster time is 2:15 P.M. on November 2:
        • The first backup will occur at 3:00 P.M. This will be a complete data backup.
        • Subsequent backups will occur at 6:00 P.M., 9:00 P.M., and so on. These will be incremental backups.
        • At 3:00 P.M. on November 3, a complete data backup will occur.

      Important: For Victoria Metrics, the default interval is one hour. You cannot modify the default interval.
      Example: DR_BACKUP_INTERVAL_IN_HOUR=1

    DR_MAX_BACKUP_TO_RETAIN
      Description: Specify the number of days for which you want to retain the backed-up data.
      Example: DR_MAX_BACKUP_TO_RETAIN=5

    Important

    If you decide to configure disaster recovery for both BMC Helix ITOM and BMC Helix IT Service Management, make sure the backup schedules and retention periods are in sync.

  3. To configure a data backup, run the following command: 

    ./disaster-recovery.sh backup 

    Backup configuration logs are saved in the helix-on-prem-deployment-manager/logs directory. 
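
Before running the backup command, it can help to sanity-check the values you entered in disaster-recovery.config. The following is a minimal sketch, not part of disaster-recovery.sh: the function names are hypothetical, the checks only restate the BUCKET_NAME conventions and the 1 to 24 range described in step 2, and the cron expression is an assumption extrapolated from the two schedules quoted there (0 */1 * * * and 0 */3 * * *).

```shell
# Hypothetical helpers; they only restate the rules documented for
# BUCKET_NAME and DR_BACKUP_INTERVAL_IN_HOUR in disaster-recovery.config.

# Returns 0 if $1 follows the documented bucket naming conventions.
valid_bucket_name() {
  # 3 to 63 characters
  [ "${#1}" -ge 3 ] && [ "${#1}" -le 63 ] || return 1
  # Only lowercase letters, digits, and hyphens; must begin and end with a
  # letter or digit. This also rules out periods, underscores, and uppercase.
  printf '%s' "$1" | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9-]*[a-z0-9]$' || return 1
  # Must not start with xn-- or end with the reserved -s3alias suffix.
  case $1 in
    xn--*|*-s3alias) return 1 ;;
  esac
  return 0
}

# Returns 0 if $1 is a whole number from 1 to 24.
valid_backup_interval() {
  case $1 in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 24 ]
}

# Prints the cron expression that corresponds to an interval of $1 hours.
# Assumption: other values follow the same pattern as the documented
# examples for 1 and 3 hours.
cron_for_interval() {
  valid_backup_interval "$1" || return 1
  printf '0 */%s * * *\n' "$1"
}
```

For example, valid_bucket_name helixdr-backup succeeds, valid_bucket_name My_Bucket fails, and cron_for_interval 3 prints the schedule quoted in the parameter description for a 3-hour interval.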

 

To configure MinIO data replication from the primary site to the standby site 

Data is always replicated from MinIO on the primary site to MinIO on the standby site. Typically, port 443 must be open on the standby site (external load balancer) unless you decide to configure a different port on the standby site for MinIO.

Best practice

Make sure the MinIO on the standby site is reachable from the primary site.

The MinIO on the standby site should have sufficient network bandwidth for efficient and quick data replication.

  1. Log in to the controller machine on the standby site:
    1. Set the value of the namespace, for example, ITOM Namespace, in the disaster-recovery.config file by using the following command:
       ./disaster-recovery.sh configureRestore  
  2. To create a restore bucket and enable versioning, perform the following steps:
    1. On the standby site MinIO console, under Administrator, select Buckets.
    2. At the top-right corner of the console, click Create Bucket.
    3. In the Bucket Name box, type a name to identify the restore bucket; for example, helixdr-restore.
    4. To enable versioning, turn on the Versioning toggle.
    5. Click Create Bucket.
  3. Log in to the primary site MinIO console by using the URL you set for the parameter MINIO_LB_HOST in the infra.config file.
    For more information, see Configuration file settings.

    Use the credentials you set during the deployment of BMC Helix ITOM.
  4. On the MinIO console, under Administrator, select Buckets.
  5. From the list of buckets, select the backup bucket.
    In the disaster-recovery.config file, you must have set a name for the backup bucket by using the BUCKET_NAME parameter; for example, helixdr-backup.
  6. To enable versioning, perform the following steps:
    1. Click the pencil icon.
    2. In the Versioning on Bucket dialog box, click Enable.
      The current status changes from Unversioned to Versioned.
    3. Install the MinIO AIStor Client (mc).
      See AIStor Client.
    4. Set up an alias for the MinIO host by using the following command:

      mc alias set <ALIAS-NAME> <MINIO-API-END-POINT> <USERNAME> <PASSWORD> --insecure

      For example, to set up an alias for the primary site, use the following command:

      /opt/minio/mc alias set drpr https://campus-minio-api.adeonprem.xyz.com/ admin bmcAdm1n --insecure

      Similarly, set up an alias for the standby site.

    5. Enable versioning on the MinIO bucket to support MinIO bucket replication by using the following command:

      mc version enable <MINIO-ALIAS>/<BUCKET> --insecure

      For example, to enable versioning for the primary site, use the following command:

      /opt/minio/mc version enable drpr/helixdr-backup --insecure

      Similarly, enable versioning for the standby site.

    6. Create a replication rule to configure MinIO bucket replication from the source bucket to the destination by using the following command:

      mc replicate add <SOURCE-ALIAS>/<SOURCE-BUCKET> \
        --remote-bucket <DESTINATION-ALIAS>/<DESTINATION-BUCKET> \
        --replicate "delete,delete-marker,metadata-sync,existing-objects" \
        --priority 1 \
        --insecure

      For example:

      /opt/minio/mc replicate add drpr/helixdr-backup \
        --remote-bucket drsr/bucket-replication \
        --replicate "delete,delete-marker,metadata-sync,existing-objects" \
        --priority 1 \
        --insecure

    7. Use the following command to verify the MinIO bucket replication:

      mc replicate ls <MINIO-ALIAS>/<BUCKET> --insecure

      For example, 

      /opt/minio/mc replicate ls drpr/helixdr-backup --insecure

      Note
      The time required for MinIO bucket replication depends on the bandwidth of your network and the size of the data.

    8. To verify MinIO bucket replication is functional:

      1. Generate a list of source MinIO bucket objects by using the following command: 

        mc ls --recursive <SOURCE-MINIO>/<BUCKET> --json --insecure | jq -r '.key + " " + (.size|tostring)' | sort > /tmp/src_objects.txt

        For example,

        /opt/minio/mc ls --recursive drpr/helixdr-backup --json --insecure | jq -r '.key + " " + (.size|tostring)' | sort > /tmp/src_objects.txt

      2. Generate a list of destination MinIO bucket objects by using the following command:

        mc ls --recursive <DESTINATION-MINIO>/<BUCKET> --json --insecure | jq -r '.key + " " + (.size|tostring)' | sort > /tmp/dest_objects.txt

        For example,

        /opt/minio/mc ls --recursive drpr/bucket-replication --json --insecure | jq -r '.key + " " + (.size|tostring)' | sort > /tmp/dest_objects.txt

      3. Compare the source and destination MinIO bucket lists by using the following command:

        diff /tmp/src_objects.txt /tmp/dest_objects.txt

        Note

        • Wait for some time after running the diff command.
        • If the output of the diff command is empty, all the objects were replicated successfully.
        • If the output of the diff command is not empty, the listed objects are either missing or have a size mismatch.
    9. To verify if the replication is successful:

      1. Log in to MinIO on the standby site and verify that the data is available in the replication bucket.
      2. Check if the data in the backup bucket on the primary site is of the same size as that on the standby site. 
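
The raw diff output in step 8 can be hard to read for large buckets. As a minimal sketch, the following hypothetical helper compares the two listing files produced in that step (one "key size" pair per line) and labels each source object that is missing or has a different size on the destination. The function name and the MISSING and SIZE_MISMATCH labels are assumptions made for illustration; unlike diff, the helper does not report objects that exist only on the destination.

```shell
# Hypothetical helper: compare two "key size" listings such as
# /tmp/src_objects.txt and /tmp/dest_objects.txt.
# Returns 0 if every source object exists on the destination with the same
# size, 1 if differences were found, and 2 if a file cannot be read.
compare_listings() {
  [ -r "$1" ] && [ -r "$2" ] || return 2
  awk '
    NF == 0 { next }                  # skip empty lines
    {
      key = $0
      sub(/ [^ ]*$/, "", key)         # strip the trailing " <size>"
      size = $NF                      # the last field is the object size
    }
    which == "src"  { src[key] = size; order[++n] = key }
    which == "dest" { dest[key] = size }
    END {
      bad = 0
      for (i = 1; i <= n; i++) {      # report in source listing order
        k = order[i]
        if (!(k in dest))           { print "MISSING " k; bad = 1 }
        else if (src[k] != dest[k]) { print "SIZE_MISMATCH " k; bad = 1 }
      }
      exit bad
    }
  ' which=src "$1" which=dest "$2"
}
```

For example, compare_listings /tmp/src_objects.txt /tmp/dest_objects.txt prints nothing and returns 0 when replication has caught up. The which=src and which=dest arguments are standard awk command-line assignments that tell the program which file it is currently reading.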
Important

If the data backup process fails, an email is sent to the tenant email ID defined in the infra.config file. 

To validate the data backup

  1. Log in to the controller.
  2. To confirm the cronjobs were successfully configured, run the following command: 

    kubectl -n <itom namespace> get cronjob | grep dr-

    Verify that all the backup cron jobs are running according to your configured schedule.

    1. After the first backup job is complete, log in to the MinIO web console and go to the Object Browser.
    2. Go to the bucket that is configured to back up data; for example, helixdr-backup.
    3. Open the site folder (<SiteName>; for example, India) where you are backing up your data, and then open the backupStatus folder.
    4. Open the backup.log file and validate that there are no errors.

      Sample output of a successful backup
      Backup start time    : 2024-4-10 8:00
      Backup end time      : 2024-04-10 08:09:25.382721
      Status               : Backup Completed successfully

      Backup succeeded for : ['drs-repo', 'events-es', 'log-es', 'kafka', 'minio', 'zookeeper', 'Victoria Metrics', 'Postgres DB', 'K8S Objects']
      Backup failed for    : []
      Backup timed out for : []

      Backup details data  : {'EVENTES': '{"helixdr-backupeventes-1712736045":"7c14b0c3-5399-423e-8771-8a9285ff3254"}', 'LOGES': '{"helixdr-backuploges-1712736047":"1a9c134f-13cf-4073-b717-3fc923a00f43"}', 'KAFKA': '20240410-080056', 'VM': 'daily/2024-04-10', 'VMAGG': 'daily/2024-04-10', 'PG': '20240410-080003F', 'ZOOKEEPER': '20240410-080014', 'K8OBJS': '20240410-080046', 'DRSREPO': '20240410:eab5bf55', 'MINIO': '20240410-080006-F'}
      MinioUsage Data : 5.7 GiB Used, 7 Buckets, 205 Objects

      Sample output of a failed backup
      Backup start time    : 2024-5-2 9:00
      Backup end time      : 2024-05-02 09:56:03.069877
      Status               : Failure

      Backup succeeded for : ['drs-repo', 'events-es', 'log-es', 'kafka', 'minio', 'zookeeper', 'Victoria Metrics', 'Postgres DB', 'K8S Objects']
      Backup failed for    : ['Victoria Metrics Aggregate Cluster']
      Backup timed out for : []

      Backup details data  : {'EVENTES': '{"itsmdr-backupeventes-1714640446":"d4b140d4-af53-4f42-b934-06cde2048e62"}', 'LOGES': '{"itsmdr-backuploges-1714640445":"b1f1fffb-7da0-4ec7-b6f1-74b32812e5dd"}', 'KAFKA': '20240502-090018', 'VM': 'daily/2024-05-02', 'VMAGG': '', 'PG': '20240502-090004F', 'ZOOKEEPER': '20240502-090013', 'K8OBJS': '20240502-090006', 'DRSREPO': '20240502:81575db4', 'MINIO': '20240502-090004-F'}
      MinioUsage Data : 11 GiB Used, 8 Buckets, 9,340 Objects

    5. Download and open the last_backup.json file and validate that there are no errors.
      Sample output for a successful backup:

      {"EVENTES": "{\"helixdr-backupeventes-1696766450\":\"fd02977e-032c-4898-a602-fac5684a64af\"}", "LOGES": "{\"helixdr-backuploges-1696766451\":\"9a3f0ee0-4b7e-4884-84a3-e7c3ebddec95\"}", "KAFKA": "20231009-040016", "VM": "hourly/2023-10-09:03", "VMAGG": "hourly/2023-10-09:03", "PG": "20231008-080003F_20231009-040004I", "ZOOKEEPER": "20231009-040014", "K8OBJS": "20231009-040005", "DRSREPO": "20231009-040005", "MINIO": "20231009-040004"}
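
If you download backup.log regularly, the manual check in step 4 can be scripted. The following is a minimal sketch that assumes the log keeps the "label : value" layout shown in the sample outputs above; the function names are hypothetical and the actual log format may differ between versions.

```shell
# Hypothetical helpers for a downloaded copy of backup.log.

# Prints the value of the first "label : value" line in file $2 whose
# label is $1, with surrounding whitespace removed.
log_field() {
  sed -n "s/^[[:space:]]*$1[[:space:]]*:[[:space:]]*//p" "$2" |
    sed 's/[[:space:]]*$//' |
    head -n 1
}

# Returns 0 only if the log reports a completed backup with empty
# "failed" and "timed out" lists; returns 2 if the file cannot be read.
check_backup_log() {
  [ -r "$1" ] || return 2
  bk_status=$(log_field 'Status' "$1")
  bk_failed=$(log_field 'Backup failed for' "$1")
  bk_timed_out=$(log_field 'Backup timed out for' "$1")
  printf 'status=%s\nfailed=%s\ntimed_out=%s\n' \
    "$bk_status" "$bk_failed" "$bk_timed_out"
  [ "$bk_status" = "Backup Completed successfully" ] &&
    [ "$bk_failed" = "[]" ] &&
    [ "$bk_timed_out" = "[]" ]
}
```

For example, check_backup_log ./backup.log prints the three extracted values and returns a nonzero status for a log like the failed sample above, where the status is Failure and the failed list is not empty.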

(Optional) To scale down the standby site

To save resources, after configuring disaster recovery, you can scale down the application pods and keep only the data lake components running on the standby site. 

Perform the following steps:

  1. Go to helix-on-prem-deployment-manager/utilities/disaster-recovery/dr-scale
  2. Run the following command:

    ./product_scale.sh down

    • Data from the primary site MinIO continues to be replicated to the standby site MinIO, but the applications do not run.
    • To scale down the data lake components also, use DOWN-ALL or down-all; for example, ./product_scale.sh down-all. This stops all services except MinIO, Postgres, and external Elasticsearch, Fluentd, and Kibana (EFK) stacks.
    • Do not run the DOWN-ALL or down-all option more than once; repeating it can cause issues during scale up.
Important

  • If you upgrade the primary site to a new version of BMC Helix ITOM, you must upgrade the standby site to the same version.
  • If you apply a hotfix on the primary site, you must apply the same hotfix on the standby site.


(Optional) To stop the data backup process on the primary site

Expect some downtime while you stop the data backup. 

  1. Go to /helix-on-prem-deployment-manager/utilities/disaster-recovery.
  2. Run the following command:
    ./disaster-recovery.sh disable 

Any data backup process that is in progress is completed, and the subsequent backup process is stopped.

The data backed up in MinIO is not deleted.  

Backup of all BMC Helix ITOM applications, except BMC Discovery, is stopped.

To configure disaster recovery again, you must perform all the steps listed in this topic. 

 

 


BMC Helix IT Operations Management deployment 25.4