Preparing for a disaster


Deploy BMC Helix Service Management in a secondary (disaster recovery) cluster so that if the primary cluster fails, the BMC Helix Service Management components (BMC Helix Innovation Suite and the Service Management applications) deployed in the secondary cluster can take over the workload.
 

Disaster recovery deployment process

The following image shows the tasks to deploy BMC Helix Service Management in a secondary cluster for disaster recovery:

DR deployment.png

Before you begin

Make sure that the following requirements are met:

  • The production and standby sites are identical. They must have the same product version and namespace, and must be deployed with the same URLs, except the MinIO URLs (MINIO_API_LB_HOST and MINIO_LB_HOST).
  • The production and standby sites have the same resources.
  • The namespaces in your secondary cluster are the same as in your primary cluster.
  • BMC Helix Service Management is installed in your primary cluster.
  • A cluster for disaster recovery (your secondary cluster) is set up.
  • BMC Helix Platform Common Services is installed in both your primary and secondary clusters.
  • If your primary and secondary clusters use different kubeconfig files, separate kubeconfig credentials are added in Jenkins for the primary and secondary clusters.
  • The database alias name is the same for the database server in the primary and secondary clusters.
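
The requirement that both sites match except for the MinIO URLs can be checked mechanically. The following is a minimal sketch, assuming each site's settings are available as a KEY=VALUE file (for example, a copy of each site's infra.config); the function names are hypothetical and not part of the product.

```shell
#!/bin/sh
# Sketch: report settings that differ between two KEY=VALUE config files,
# ignoring the two MinIO host parameters that are allowed to differ.
# Function names and the file layout are assumptions for illustration.

# Print the settings of one file, sorted, without comments, blank lines,
# or the MinIO host parameters.
normalized_settings() {
  grep -v -E '^[[:space:]]*(#|$)' "$1" |
    grep -v -E '^(MINIO_API_LB_HOST|MINIO_LB_HOST)=' |
    sort
}

# Succeeds when both files agree on every remaining setting; otherwise
# prints the differing lines (diff output) and returns a non-zero status.
compare_site_configs() (
  primary_tmp=$(mktemp) || exit 2
  standby_tmp=$(mktemp) || exit 2
  normalized_settings "$1" > "$primary_tmp"
  normalized_settings "$2" > "$standby_tmp"
  status=0
  diff "$primary_tmp" "$standby_tmp" || status=$?
  rm -f "$primary_tmp" "$standby_tmp"
  exit "$status"
)
```

Calling compare_site_configs with the primary file and then the standby file prints nothing when the sites match; any output points at a setting to review.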

Helix disaster recovery pipeline modes

Use the disaster recovery deployment pipeline, HELIX_DR, to deploy BMC Helix Service Management in a secondary cluster. The HELIX_ONPREM_DEPLOYMENT pipeline runs the HELIX_DR pipeline during deployment.

The disaster recovery deployment pipeline, HELIX_DR, provides the following modes:

  • Scale down: In this mode, the HELIX_DR pipeline deploys BMC Helix Innovation Suite and Service Management application components with zero replicas. Use this mode to synchronize the components in the primary and secondary clusters.
  • Scale up: In this mode, the HELIX_DR pipeline deploys BMC Helix Innovation Suite and Service Management application components with the replicas specified in the deployment input configuration file. Use this mode when a disaster occurs in the primary cluster.

Important

The HELIX_DR pipeline does not support BMC Helix ITSM: Smart Reporting deployment.

Task 1: To add the HELIX_DR pipeline to the Jenkins server

  1. Log in to the Jenkins server by using the following URL:
    http://<Jenkins server host name>:8080
  2. On the Jenkins home page, click New Item.
  3. In the Enter an item name field, enter HELIX_DR.
  4. Select Pipeline and click OK.
  5. Click the Pipeline tab.
  6. Enter the following information:
    1. From the Definition list, select Pipeline script from SCM.
    2. From the SCM list, select Git.
    3. Enter the Repository URL as the path of your local Git repository in the format ssh://git@<jenkins_server>/<path to itsm-on-premise-installer.git>.
      Example: ssh://git@<Jenkins server host name>/home/git/Git_Repo/ITSM_REPO/itsm-on-premise-installer.git.
    4. Enter the Git server credentials.
    5. Specify the script path as pipeline/jenkinsfile/HELIX_DR.jenkinsfile.
  7. Click Apply and then Save.
    After the pipeline is created, select it on the Jenkins home page.
  8. Click Build Now.
    The first build job fails. This is expected, because the first run only loads the parameters of the pipeline script.
  9. After the build job fails, select the pipeline name again from the Jenkins home page.
    The Build Now option changes to Build With Parameters.

Task 2: To set up BMC Helix Platform Common Services on the secondary cluster

To set up BMC Helix Platform Common Services on the secondary cluster, perform the following steps:

  1. Configure the data backup on the primary cluster.
  2. Configure MinIO data replication.
  3. Validate the data backup.

Important

Do not perform this task if you have already configured disaster recovery for BMC Helix IT Operations Management.

To configure data backup on the primary cluster

Always configure the first data backup during a period of low load and activity.

Important

There might be a downtime of about 30 minutes when you back up data for the first time. Subsequent data backups will have no downtime.

  1. Log in to the controller from where the Kubernetes cluster is accessible.
  2. Go to /helix-on-prem-deployment-manager/utilities/disaster-recovery/dr-configs.
  3. In the disaster-recovery.config file, specify the values of the following parameters:

    Parameter: BUCKET_NAME
    Description: Specify the name of the bucket used to back up data on MinIO.

    Important:

    Use the following conventions when assigning a bucket name:

    • It can have 3 to 63 characters.
    • It can have lowercase letters, numbers, and hyphens (-).
    • It must begin and end with a letter or a number.
    • It must not start with xn-- as it might get interpreted as a Punycode format; for example, xn--bucketname.
    • It must not have uppercase letters, periods (.), or underscores (_).
    • It must not have hyphens next to periods (.); for example, my-.bucket.com or my.-bucket.
    • It must not end with a hyphen or -s3alias (-s3alias is reserved for the MinIO bucket access point alias name).

    Example: BUCKET_NAME=helixdr-backup

    Parameter: SITE_NAME
    Description: Specify a name to identify the site from where you want to back up data.
    Example: SITE_NAME=India

    Parameter: NAMESPACE
    Description: Specify the namespace where you have installed BMC Helix Platform Common Services.
    Example: NAMESPACE=helix-cluster3

    Parameter: DR_BACKUP_INTERVAL_IN_HOUR
    Description: Specify the backup interval in hours. You can set a value from 1 to 24.
    We recommend that you set the value of this parameter based on the size of your data.
    The value that you specify defines the interval for backing up your data.

    For example, if you set the value of this parameter to 1 hour, data backup is performed at the start of every hour as per the cron schedule (0 */1 * * *).
    If your current cluster time is 2:15 P.M. on November 2:

    • The first backup will occur at 3:00 P.M. This will be a complete data backup.
    • Subsequent backups will occur at 4:00 P.M., 5:00 P.M., 6:00 P.M., and so on. These will be incremental backups.
    • At 3:00 P.M. on November 3, a complete data backup will occur.

    For example, if you set the value of this parameter to 3 hours, data backup is performed at the start of every third hour as per the cron schedule (0 */3 * * *).
    If your current cluster time is 2:15 P.M. on November 2:

    • The first backup will occur at 3:00 P.M. This will be a complete data backup.
    • Subsequent backups will occur at 6:00 P.M., 9:00 P.M., and so on. These will be incremental backups.
    • At 3:00 P.M. on November 3, a complete data backup will occur.

    Important:

    • For Victoria Metrics, the default interval is one hour. You cannot modify the default interval.
    • The following application data and configurations will be lost during the backup intervals:
      • BMC Helix ITSM Insights: PPM jobs created and real-time incident correlations.
      • BMC Helix Dashboards: Reports created or modified and report schedules.
      • BMC Helix Single Sign-On: Data or configuration changes.

    Example: DR_BACKUP_INTERVAL_IN_HOUR=1

    Parameter: DR_MAX_BACKUP_TO_RETAIN
    Description: Specify the number of days for which you want to retain the backed-up data.
    Example: DR_MAX_BACKUP_TO_RETAIN=5

  4. To back up data, run the following command:

    ./disaster-recovery.sh backup 
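
Before running the backup, it can help to sanity-check the values entered in disaster-recovery.config. The following is a minimal sketch, not part of the product: the function names are hypothetical, the bucket check covers the naming conventions listed for BUCKET_NAME, and the cron expression simply generalizes the two documented examples (0 */1 * * * and 0 */3 * * *), which is an assumption for other interval values.

```shell
#!/bin/sh
# Sketch: sanity-check two disaster-recovery.config values before a backup.
# Function names are hypothetical; they are not provided by the product.

# Succeeds when the name follows the bucket naming conventions:
# 3 to 63 characters, only lowercase letters, digits, and hyphens,
# beginning and ending with a letter or digit.
is_valid_bucket_name() {
  printf '%s\n' "$1" |
    LC_ALL=C grep -q -E '^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$' || return 1
  case $1 in
    xn--*) return 1 ;;      # might be interpreted as Punycode
    *-s3alias) return 1 ;;  # reserved suffix
  esac
  return 0
}

# Prints the cron schedule for an interval of 1 to 24 hours, following the
# pattern of the documented examples; fails for any other value.
backup_cron_schedule() {
  case $1 in
    ''|*[!0-9]*) return 1 ;;  # reject empty or non-numeric input
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 24 ] || return 1
  printf '0 */%s * * *\n' "$1"
}
```

For instance, is_valid_bucket_name helixdr-backup succeeds, while a name that contains an underscore or uppercase letters fails.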

To configure MinIO data replication from the primary cluster to the secondary cluster

Data replication is always from the MinIO on the primary site to the MinIO on the standby site. Typically, port 443 must be open on the standby site unless you decide to configure a different port on the standby site for MinIO.  

Best practice
Make sure the MinIO on the standby site is reachable from the primary site.

The MinIO on the standby site should have sufficient network bandwidth for efficient and quick data replication.
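
One way to check reachability from the primary site is to call the MinIO liveness endpoint (/minio/health/live) on the standby API host. The following is a minimal sketch under stated assumptions: the standby MinIO API is served over HTTPS, the port defaults to 443 as described above, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: check from the primary site that the standby MinIO API answers.
# Assumes HTTPS; function names are hypothetical.

# Builds the liveness URL for a MinIO API host. The port defaults to 443.
minio_liveness_url() {
  printf 'https://%s:%s/minio/health/live\n' "$1" "${2:-443}"
}

# Succeeds when the endpoint returns a successful HTTP status
# within 10 seconds; prints curl's error message otherwise.
standby_minio_is_reachable() {
  curl --fail --silent --show-error --max-time 10 \
    --output /dev/null "$(minio_liveness_url "$1" "${2:-}")"
}
```

Run standby_minio_is_reachable with the host that you set for MINIO_API_LB_HOST on the standby site, and pass a second argument only if you configured a port other than 443.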

  1. Log on to MinIO by using the URL that you set for the parameter MINIO_LB_HOST in the infra.config file.
    Use the credentials you set during the deployment of BMC Helix Platform Common Services.
  2. On the production site MinIO console, under Administrator, select Buckets.
  3. From the list of buckets, select the backup bucket.
    This is the bucket name that you set in the disaster-recovery.config file by using the BUCKET_NAME parameter; for example, helixdr-backup.
    23.4_DR_MinIO3.png
  4. To enable versioning, perform the following steps:
    1. Click the pencil icon.
      23.4_DR_MinIO8.png
    2. In the Versioning on Bucket dialog box, click Enable.
      Current Status changes from Unversioned to Versioned.
  5. To create a restore bucket and enable versioning, perform the following steps:
    1. On the standby site MinIO console, under Administrator, select Buckets.
    2. At the top-right corner of the console, click Create Bucket.
    3. In the Bucket Name box, type a name to identify the restore bucket; for example, helixdr-restore.
    4. To enable versioning, turn on the Versioning toggle.
    5. Click Create Bucket.
      23.4_DR_MinIO5.png
  6. Add a replication rule to replicate data from the backup bucket (on the production site MinIO) to the restore bucket (on the standby site MinIO):
    1. On the production site MinIO console, under Administrator, select Buckets.
    2. From the list of buckets, select the backup bucket; for example, helixdr-backup.
    3. Go to the Replication tab and click the Add Replication Rule button.
      23.4_DR_MinIO6.png
  7. In the Set Bucket Replication dialog box, enter the following values:
    1. Target URL: The API endpoint of the MinIO on the standby site.
      This is the URL that you set for the parameter MINIO_API_LB_HOST in the infra.config file.
    2. Access Key: User name to access the standby site MinIO.
    3. Secret Key: Password to access the standby site MinIO.
    4. Target Bucket: Name of the restore bucket; for example, helixdr-restore.
    5. Leave the other values at their defaults and click Save to set the replication rule.
      23.4_DR_MinIO7A.png

After you set the replication rule, data from the production site MinIO gets replicated onto the standby site MinIO.

Logs are saved in the helix-on-prem-deployment-manager/logs directory. You can check the logs to monitor the progress, success, or failure of the data backup process.
 

Important

If the data backup process fails, an email is sent to the tenant email ID defined in the infra.config file. 

To validate the data backup

  1. Log in to the controller.
  2. To confirm the cronjobs were successfully configured, run the following command: 
    kubectl -n <itom namespace> get cronjob | grep dr-
  3. After the backup jobs are completed on the controller machine, log in to the MinIO web console and go to the Object Browser.
  4. Go to the bucket that is configured to back up data; for example, helixdr-backup.
  5. Open the site folder (<SiteName>; for example, India) where you are backing up your data, and then open the backupStatus folder.
  6. Open the backup.log file and validate that there are no errors.
    Sample output for a successful backup:

    Backup Start Time : 2023-10-9 4:00.
    Backup End Time : 2023-10-09 04:08:30.012419
  7. Download and open the last_backup.json file and validate that there are no errors.
    Sample output for a successful backup:

    {"EVENTES": "{\"helixdr-backupeventes-1696766450\":\"fd02977e-032c-4898-a602-fac5684a64af\"}", "LOGES": "{\"helixdr-backuploges-1696766451\":\"9a3f0ee0-4b7e-4884-84a3-e7c3ebddec95\"}", "KAFKA": "20231009-040016", "VM": "hourly/2023-10-09:03", "VMAGG": "hourly/2023-10-09:03", "PG": "20231008-080003F_20231009-040004I", "ZOOKEEPER": "20231009-040014", "K8OBJS": "20231009-040005", "DRSREPO": "20231009-040005", "MINIO": "20231009-040004"}
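
The check of last_backup.json can be partly automated after you download the file. The following is a minimal sketch: it only verifies that each component key seen in the sample output is present. Treating a missing key as a problem is an assumption of this sketch, not documented product behavior, and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: check a downloaded last_backup.json for the component keys that
# appear in the sample output. The key list is copied from that sample;
# whether every key is mandatory is an assumption.

EXPECTED_COMPONENTS="EVENTES LOGES KAFKA VM VMAGG PG ZOOKEEPER K8OBJS DRSREPO MINIO"

# Prints the name of each component key missing from the file given as
# the first argument; returns non-zero if at least one key is missing.
missing_backup_components() (
  status=0
  for component in $EXPECTED_COMPONENTS; do
    # Match the quoted key followed by a colon, so "VM" does not match "VMAGG".
    if ! grep -q "\"$component\":" "$1"; then
      printf '%s\n' "$component"
      status=1
    fi
  done
  exit "$status"
)
```

An empty result from missing_backup_components last_backup.json means every key from the sample is present; you still need to review backup.log for errors.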

(Optional) To scale down the standby site

To save resources, after configuring disaster recovery, you can scale down the application pods and keep only the data lake components running on the standby site. 

Perform the following steps:

  1. Go to helix-on-prem-deployment-manager/utilities/disaster-recovery/dr-scale.
  2. Run the following command:

    ./product_scale.sh down

    Data from the production site MinIO continues to be replicated onto the standby site MinIO but the applications will not run.

    Important

    To stop the data backup process on the production site, perform the following steps. Expect some downtime while the data backup is stopped.

    1. Go to /helix-on-prem-deployment-manager/utilities/disaster-recovery.
    2. Run the following command:
      ./disaster-recovery.sh disable

Task 3: To deploy BMC Helix Innovation Suite on the secondary cluster

Deploy BMC Helix Service Management on the secondary cluster by using the HELIX_DR pipeline scale down mode.

  1. Log in to the BMC Deployment Engine (the Jenkins server).
  2. In your secondary cluster, copy the kubeconfig file located at HELM/<Jenkins node> to the ~/.kube folder.
  3. Log in to the Jenkins server by using the following URL:
    http://<Jenkins server host name>:8080
  4. On the Jenkins server, select the HELIX_ONPREM_DEPLOYMENT pipeline.
  5. In the Build History, select the latest build and click Rebuild.
  6. In the INFRASTRUCTURE section, in the KUBECONFIG_CREDENTIAL parameter, specify the Jenkins credential ID that contains the kubeconfig file for the secondary cluster.
    To find the kubeconfig credential ID, go to http://<jenkinsurl>:8080/credentials.
  7. In the CUSTOMER-INFO section, in the CLUSTER parameter, specify the name of your secondary cluster.
    To find the cluster name, open the kubeconfig file. The current-context value in the kubeconfig file is the cluster name.
  8. In the PRODUCTS section, clear the HELIX_CLAMAV check box.
  9. In the INFRA-DEPLOY section, clear the SUPPORT_ASSISTANT_TOOL check box.
  10. In the PRODUCT-DEPLOY section, select the HELIX_GENERATE_CONFIG, HELIX_DR, and SCALE_DOWN options.

    Important

    Make sure that you keep the other parameter values identical to the primary site.

  11. Click Rebuild.
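
Because scale down mode deploys the components with zero replicas, one way to confirm the result is to look for workloads that still report replicas. The following is a minimal sketch with a hypothetical function name; it assumes you pipe in the output of kubectl -n <namespace> get deployments --no-headers, where the second column is READY (for example, 0/0).

```shell
#!/bin/sh
# Sketch: list workloads that still have replicas after a scale down run.
# Reads `kubectl get deployments --no-headers` output on standard input;
# the second column of each line is READY, such as 0/0 or 1/1.

# Prints each workload whose READY column is not 0/0 and returns
# non-zero if at least one such workload exists.
workloads_not_scaled_down() {
  awk 'NF >= 2 && $2 != "0/0" { print $1; found = 1 } END { exit found }'
}
```

An empty result suggests that all listed deployments are at zero replicas; any name printed is worth checking before you rely on the standby site.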

Where to go from here

Recovering after a disaster

 

 
