Unsupported content

 

This version of the product is no longer supported. However, the documentation is available for your convenience. You will not be able to leave comments.

Fault tolerance

Clusters are used to improve discovery, search, and reporting performance. Clusters work by sharing data, and the work of processing that data, across the machines in the cluster. As cluster size increases, so does the likelihood that one or more machines in the cluster will experience some sort of hardware failure. To prevent data loss in the event of hardware failure, fault tolerance is built into BMC Discovery.

To prevent data loss when a machine fails, you can store copies of your data on more than one machine. If every bit of data is stored on two different machines, then any one machine can fail with no resultant data loss. If every bit of data is stored on three machines, then two machines can fail without data loss. The number of copies of the data governs the number of machines that can fail before data loss occurs. The number of copies, or the replication factor, is configured automatically, as follows:

Cluster size    Replication factor        Number of failures tolerated
1               1 (not fault tolerant)    0 (not fault tolerant)
2               2                         0 (not fault tolerant: no data is lost, but the cluster cannot continue to operate; see the following text)
3 to 8          2                         1
9 to 15         3                         2
16 and over     4                         3
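The mapping in the table above can be sketched in code. This is an illustrative example only, not BMC product code; the function names are hypothetical.

```python
def replication_factor(cluster_size):
    """Replication factor that BMC Discovery configures automatically,
    as described in the table above."""
    if cluster_size < 1:
        raise ValueError("a cluster must have at least one machine")
    if cluster_size == 1:
        return 1          # not fault tolerant
    if cluster_size == 2:
        return 2          # data survives, but the cluster cannot operate
    if cluster_size <= 8:
        return 2
    if cluster_size <= 15:
        return 3
    return 4              # 16 machines and over

def failures_tolerated(cluster_size):
    """A cluster tolerates one fewer failure than it holds copies,
    except for the special two-machine case, where operation halts."""
    if cluster_size <= 2:
        return 0
    return replication_factor(cluster_size) - 1
```

For example, a 10-machine cluster gets a replication factor of 3 and tolerates 2 failures.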

In the case of the fault tolerant cluster of two, the cluster can experience the failure of one machine without losing data. However, the cluster cannot continue to operate, because the coordinator can no longer duplicate data the required number of times (the replication factor, in this case two) when only one working machine remains. To make the cluster operational again, you must remove the failed machine from the cluster by using tw_cluster_control.

If you enable fault tolerance, the cluster survives the loss of a machine without data loss or interruption to discovery, and the UI enables you to control the cluster just as before the failure. If the failure was transient, or the machine is repaired, the machine starts to process data, perform discovery, and catch up with the rest of the cluster when it returns. If the failure of the machine is permanent, you can use the cluster manager to remove the machine from the cluster, and the cluster redistributes (rebalances) the data over the remaining members. If you replace the machine, a new copy of the data that was held on the failed machine is copied to the new machine.

Fault tolerance works by storing data on multiple machines. As a consequence of having multiple copies, the total storage capacity of the cluster is reduced. Additionally, the overhead of writing, tracking, and searching through multiple copies reduces the overall performance of the cluster relative to the same cluster with fault tolerance disabled.
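The storage cost of replication can be illustrated with simple arithmetic: with a replication factor of r, each piece of data is stored r times, so usable capacity is roughly the raw total divided by r. This is an illustrative sketch with hypothetical names, not a product sizing tool.

```python
def effective_capacity_gb(machines, per_machine_gb, replication_factor):
    """Rough usable capacity of a fault tolerant cluster: raw capacity
    divided by the replication factor, since each bit of data is stored
    replication_factor times."""
    return machines * per_machine_gb / replication_factor

# Example: a 10-machine cluster (replication factor 3, per the table)
# with 500 GB of datastore per machine has 5000 GB raw capacity but
# only about a third of that is usable for unique data.
```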

Enabling and disabling fault tolerance

You can enable fault tolerance when creating a cluster by selecting the fault tolerance check box in the Create Cluster dialog.

To enable fault tolerance in an existing cluster

From the Fault Tolerance section of the cluster manager, click the Enable Fault Tolerance button.

The summary shows the progress of the rebalancing operation. For a large system containing a large amount of data, this may take some time. However, the system remains fully usable throughout.

To disable fault tolerance in an existing cluster

From the Fault Tolerance section of the cluster manager, click the Disable Fault Tolerance button.

The summary shows the progress of the rebalancing operation. For a large system containing a large amount of data, this may take some time. However, the system remains fully usable throughout.


Comments

  1. Edoardo Spelta

    Article states: " If every bit of data is stored on two different machines, then any one machine can fail with no resultant data loss"...

    The table in this page however shows that with 2 cluster nodes the number of tolerated failures is zero (no fault tolerance).

    Which one of the two is true ?



    Apr 03, 2019 01:25
    1. Duncan Tweed

      Hi Edoardo. 

      Both are true. I have changed the text to try and make this clearer.

      When a fault tolerant cluster of two experiences the loss of a machine, no data is lost, but the cluster cannot operate until the failed member is removed, fixed, or replaced. When you do that, the cluster rebalances, and operation continues without data loss.

      Thanks for the comment. 

      Duncan.

      Apr 09, 2019 04:55
  2. Edoardo Spelta

    Hello,

    what do you mean by "the cluster cannot operate" ?

    Do you mean that, since there's only one node left, it can't copy/receive data to/from the dead node but will go on doing its discoveries according to its schedule ? 

    Apr 09, 2019 05:09
  3. Duncan Tweed

    It will most likely be unresponsive. The cluster will not continue to perform scheduled discovery runs. You might be able to add a new member to get it going again, or you might have to use tw_cluster_control to remove the failed member, and revert the surviving machine to a standalone machine, before rebuilding the cluster.

    This is only for a two member fault tolerant cluster, which is not a typical configuration.

    Where you have a three member cluster and experience a single failure, the cluster continues to operate: it performs scans, CMDB syncs, and normal operations.

    I hope that helps,

    Duncan.

    Apr 09, 2019 06:30
  4. Edoardo Spelta

    Thank you very much for the explanation!!

    Apr 09, 2019 06:46