Unsupported content

 

This version of the product is no longer supported. However, the documentation is available for your convenience. You will not be able to leave comments.

Fault tolerance

Clusters are used to improve discovery, search, and reporting performance. Clusters work by sharing data, and the work of processing that data, across the machines in the cluster. As cluster size increases, so does the likelihood that one or more machines in the cluster will experience some sort of hardware failure. To prevent data loss in the event of hardware failure, fault tolerance is built into BMC Discovery.

To prevent data loss when a machine fails, you can store copies of your data on more than one machine. If every bit of data is stored on two different machines, then any one machine can fail with no resultant data loss. If every bit of data is stored on three machines, then two machines can fail without data loss. The number of copies of the data governs the number of machines that can fail before data loss occurs. The number of copies, or the replication factor, is configured automatically, as follows:

Cluster size    Replication factor        Number of failures tolerated
1               1 (not fault tolerant)    0 (not fault tolerant)
2               2                         0 (not fault tolerant: no data is lost, but the cluster cannot continue to operate; see the following text)
3 to 8          2                         1
9 to 15         3                         2
16 and over     4                         3
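The mapping in the table above can be sketched in code. This is an illustrative example only, not BMC product code; the function names are hypothetical.

```python
def replication_factor(cluster_size):
    """Replication factor that BMC Discovery configures automatically,
    as described in the table above."""
    if cluster_size < 1:
        raise ValueError("a cluster must have at least one machine")
    if cluster_size == 1:
        return 1          # not fault tolerant
    if cluster_size == 2:
        return 2          # data survives, but the cluster cannot operate
    if cluster_size <= 8:
        return 2
    if cluster_size <= 15:
        return 3
    return 4              # 16 machines and over

def failures_tolerated(cluster_size):
    """A cluster tolerates one fewer failure than it holds copies,
    except for the special two-machine case, where operation halts."""
    if cluster_size <= 2:
        return 0
    return replication_factor(cluster_size) - 1
```

For example, a 10-machine cluster gets a replication factor of 3 and tolerates 2 failures.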

In the case of the fault tolerant cluster of two, the cluster can experience the failure of one machine without losing data. However, the cluster cannot continue to operate, because the coordinator can no longer duplicate data the required number of times (the replication factor, in this case two) when only one working machine remains. To make the cluster operational again, you must remove the failed machine from the cluster by using tw_cluster_control.

If you enable fault tolerance, the cluster survives the loss of a machine without data loss or interruption to discovery, and the UI enables you to control the cluster just as before the failure. If the failure was transient, or the machine is repaired, the machine starts to process data, perform discovery, and catch up with the rest of the cluster when it returns. If the failure of the machine is permanent, you can use the cluster manager to remove the machine from the cluster, and the cluster redistributes (rebalances) the data over the remaining members. If you replace the machine, a new copy of the data that was held on the failed machine is copied to the new machine.

Fault tolerance works by storing data on multiple machines. As a consequence of having multiple copies, the total storage capacity of the cluster is reduced. Additionally, the overhead of writing, tracking, and searching through multiple copies reduces the overall performance of the cluster relative to the same cluster with fault tolerance disabled.
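The storage cost of replication can be illustrated with simple arithmetic: with a replication factor of r, each piece of data is stored r times, so usable capacity is roughly the raw total divided by r. This is an illustrative sketch with hypothetical names, not a product sizing tool.

```python
def effective_capacity_gb(machines, per_machine_gb, replication_factor):
    """Rough usable capacity of a fault tolerant cluster: raw capacity
    divided by the replication factor, since each bit of data is stored
    replication_factor times."""
    return machines * per_machine_gb / replication_factor

# Example: a 10-machine cluster (replication factor 3, per the table)
# with 500 GB of datastore per machine has 5000 GB raw capacity but
# only about a third of that is usable for unique data.
```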

Enabling and disabling fault tolerance

You can enable fault tolerance when creating a cluster by selecting the fault tolerance check box in the Create Cluster dialog.

To enable fault tolerance in an existing cluster

From the Fault Tolerance section of the cluster manager, click the Enable Fault Tolerance button.

The summary shows the progress of the rebalancing operation. For a large system containing a large amount of data, this may take some time. However, the system remains fully usable throughout.

To disable fault tolerance in an existing cluster

From the Fault Tolerance section of the cluster manager, click the Disable Fault Tolerance button.

The summary shows the progress of the rebalancing operation. For a large system containing a large amount of data, this may take some time. However, the system remains fully usable throughout.


Comments

  1. Edoardo Spelta

    Article states: " If every bit of data is stored on two different machines, then any one machine can fail with no resultant data loss"...

    The table in this page however shows that with 2 cluster nodes the number of tolerated failures is zero (no fault tolerance).

    Which one of the two is true ?



    Apr 03, 2019 01:25
    1. Duncan Tweed

      Hi Edoardo. 

      Both are true. I have changed the text to try and make this clearer.

      When a fault tolerant cluster of two experiences the loss of a machine, no data is lost, but the cluster cannot operate until the failed member is removed, fixed, or replaced. When you do that, the cluster rebalances, and operation continues without data loss.

      Thanks for the comment. 

      Duncan.

      Apr 09, 2019 04:55
  2. Edoardo Spelta

    Hello,

    what do you mean by "the cluster cannot operate" ?

    Do you mean that, since there's only one node left, it can't copy/receive data to/from the dead node but will go on doing its discoveries according to its schedule ? 

    Apr 09, 2019 05:09
  3. Duncan Tweed

    It will most likely be unresponsive. The cluster will not continue to perform scheduled discovery runs. You might be able to add a new member to get it going again, or you might have to use tw_cluster_control to remove the failed member, and revert the surviving machine to a standalone machine, before rebuilding the cluster.

    This is only for a two member fault tolerant cluster, which is not a typical configuration.

    Where you have a three member cluster and experience a single failure, the cluster continues to operate: it performs scans, CMDB syncs, and normal operations.

    I hope that helps,

    Duncan.

    Apr 09, 2019 06:30
  4. Edoardo Spelta

    Thank you very much for the explanation!!

    Apr 09, 2019 06:46