Disaster recovery deployment

A disaster is any event that interrupts the normal operations of an organization, such as a power outage, a system failure or breakdown, or a natural calamity. Disaster recovery prepares an organization for a disaster by creating and maintaining a secondary cluster. The secondary cluster has infrastructure identical to that of the primary cluster, so that the organization can withstand a disaster and continue its operations with minimal or no impact.

You can achieve disaster recovery for BMC Helix Service Management by moving the workload across data centers on demand and by replicating data between the primary and secondary cluster databases.

Disaster recovery overview

Figure: Disaster recovery overview (DR overview.png)

Component	Description
Primary cluster	The primary cluster is your data center or your primary site, which hosts all your live systems.
Secondary cluster	The secondary cluster is the disaster recovery system. You must create and maintain this site.
Failover	Failover is the process of switching to the secondary cluster when a disaster occurs at the primary site.
Failback	Failback is the process of transferring control back to the primary cluster so that it takes over the functionality from the secondary cluster.

Disaster recovery deployment requirements

Consider the following requirements while planning disaster recovery for BMC Helix Innovation Suite and Service Management applications:

The disaster recovery deployment must have the same product version and patch level as the production deployment.
Use your database vendor's technology or any suitable third-party solution for database replication. Perform only one-way replication of the database.
After a failure event, each component needs to be ready and available to start functioning based on your Recovery Time Objective (RTO).
The amount of data loss during a failure event depends on your Recovery Point Objective (RPO).

BMC Helix Innovation Suite database requirements

The dataset in the BMC Helix Innovation Suite database is very large, and you must have a replicated instance of the data.

Consider the following requirements while planning the data replication:

Set up a database server on the secondary cluster and configure the data replication between the primary and secondary clusters.
Make sure that the database in the secondary cluster is synchronized with the database in the primary cluster and that data is consistent between the two databases.
Make sure that you use a common alias name for the database server that points to the primary or secondary host, whichever is active.
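
The common alias requirement can be illustrated with a minimal Python sketch. It is not part of the product and does not change any real name resolution. The class name, alias, and host names are hypothetical; in a real deployment, the alias might be, for example, a DNS record that you repoint during failover and failback.

```python
from dataclasses import dataclass


@dataclass
class DatabaseAlias:
    """Models a common alias that points to whichever database host is active.

    All names are hypothetical and used for illustration only.
    """
    alias: str
    primary_host: str
    secondary_host: str
    active_site: str = "primary"  # either "primary" or "secondary"

    def target(self) -> str:
        """Return the host that the alias currently points to."""
        if self.active_site == "primary":
            return self.primary_host
        if self.active_site == "secondary":
            return self.secondary_host
        raise ValueError(f"Unknown site: {self.active_site!r}")

    def failover(self) -> None:
        """Repoint the alias to the secondary cluster database."""
        self.active_site = "secondary"

    def failback(self) -> None:
        """Repoint the alias back to the primary cluster database."""
        self.active_site = "primary"


if __name__ == "__main__":
    db = DatabaseAlias(
        alias="helix-db.example.com",          # hypothetical alias
        primary_host="db-primary.example.com",  # hypothetical host
        secondary_host="db-secondary.example.com",
    )
    print(db.alias, "->", db.target())
    db.failover()
    print(db.alias, "->", db.target())
```

Because the applications connect only to the alias, their database connection settings stay the same regardless of which cluster is active.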

You can accomplish data replication in various ways based on your database vendor:

Oracle Database: Oracle provides the GoldenGate software, which manages the replication.
For information about GoldenGate, see Oracle GoldenGate in Oracle documentation.
Microsoft SQL Server: SQL Server provides a built-in replication feature that you can configure by using SQL Server Management Studio (SSMS).
See SQL Server Replication in Microsoft documentation.
PostgreSQL: Accomplish data replication by setting up a Hot Standby instance.
See 27.4. Hot Standby in PostgreSQL documentation and Setting up PostgreSQL Hot Standby in Google Cloud Platform community.

RPO and RTO measurements

RPO is a time-based measurement of the data loss that can be tolerated. RTO is the targeted duration between a failure event and the point at which operations resume.
With the default configuration, the expected RPO is 2 hours.
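
As a rough illustration of how these two measurements relate to timestamps collected during a trial run, consider the following Python sketch. The function names, the sample timestamps, and the 4-hour RTO target are assumptions made for this example only; the 2-hour RPO value is the default expectation stated above.

```python
from datetime import datetime, timedelta

RPO_TARGET = timedelta(hours=2)  # default RPO expectation stated in this topic
RTO_TARGET = timedelta(hours=4)  # hypothetical value; substitute your own target


def data_loss_window(last_replicated_at: datetime, failure_at: datetime) -> timedelta:
    """Span of changes not yet replicated when the failure occurred."""
    return max(failure_at - last_replicated_at, timedelta(0))


def recovery_time(failure_at: datetime, resumed_at: datetime) -> timedelta:
    """Duration between the failure event and the point where operations resume."""
    return resumed_at - failure_at


def within_target(measured: timedelta, target: timedelta) -> bool:
    """Return True if the measured duration does not exceed the target."""
    return measured <= target


if __name__ == "__main__":
    # Sample timestamps, invented for illustration.
    last_replicated_at = datetime(2024, 1, 1, 9, 15)
    failure_at = datetime(2024, 1, 1, 10, 0)
    resumed_at = datetime(2024, 1, 1, 11, 30)

    loss = data_loss_window(last_replicated_at, failure_at)
    downtime = recovery_time(failure_at, resumed_at)

    print(f"Data loss window: {loss}, within RPO: {within_target(loss, RPO_TARGET)}")
    print(f"Recovery time: {downtime}, within RTO: {within_target(downtime, RTO_TARGET)}")
```

With these sample values, the data loss window is 45 minutes and the recovery time is 1 hour 30 minutes, so both fall within the targets used in the sketch.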

Important

Disaster recovery is a new feature, and general RTO expectations are still being measured.

We recommend that you perform a trial run of your disaster recovery operation to learn how your setup and environments measure against your RPO and RTO targets.

Workflow

The following table lists the tasks to set up disaster recovery for BMC Helix Service Management:

Task	Description	Reference
1	Deploy BMC Helix Service Management in a disaster recovery cluster to synchronize the platform and application components in the primary and secondary clusters and to prepare the cluster for a disaster.	Preparing-for-a-disaster
2	Perform the data restore tasks on the secondary cluster so that data is restored when a disaster occurs.	Recovering-after-a-disaster