Troubleshooting clusters
My cluster will not start!
When a cluster starts, one of the checks it performs is to ensure that each member is the correct machine, rather than a different machine at the same IP address. For virtual machines, it checks the VM UUID. If the VM UUID has changed, the machine is assumed to be a different machine; the cluster cannot start and logs the following critical error in tw_svc_cluster_manager.log:
Replace expected VM UUID by running: tw_cluster_control --replace-vm-uuid
The critical error is also displayed on the console if you are attempting to start tw_svc_cluster_manager manually.
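If you want to confirm that this check is what is blocking startup, you can search the cluster manager log for the error before making any changes. A minimal sketch, assuming the log is in the standard appliance log directory /usr/tideway/log (the path is an assumption; adjust it for your appliance):

# Search the cluster manager log for the VM UUID critical error
# (log path assumed; check where tw_svc_cluster_manager.log lives on your appliance)
[tideway@wilkoapp1 ~]$ grep -i "vm uuid" /usr/tideway/log/tw_svc_cluster_manager.log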
To replace the VM UUID:
You can do this by running tw_cluster_control from the machine on which the cluster manager has failed.
- Log in to the appliance command line as the tideway user.
- Use tw_cluster_control to replace the expected VM UUID with the current value and enable the cluster service to start. Enter:
[tideway@wilkoapp1 ~]$ tw_cluster_control --replace-vm-uuid
[tideway@wilkoapp1 ~]$
- Start the cluster manager. Enter:
[tideway@wilkoapp1 ~]$ sudo /sbin/service appliance start
[tideway@wilkoapp1 ~]$
- Start the remaining services. Enter:
[tideway@wilkoapp1 ~]$ tw_cluster_control --cluster-start-services
[tideway@wilkoapp1 ~]$
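If you need to perform this recovery more than once, the three commands can be wrapped in a short script. A minimal sketch using only the commands shown above; the script name is illustrative, and it assumes the tideway user can run the appliance service command via sudo:

#!/bin/bash
# recover_vm_uuid.sh -- illustrative wrapper for the steps above.
# Run as the tideway user on the machine whose cluster manager failed.
set -e                                        # stop at the first failing step
tw_cluster_control --replace-vm-uuid          # accept the current VM UUID
sudo /sbin/service appliance start            # start the cluster manager
tw_cluster_control --cluster-start-services   # start the remaining services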
To determine the health of a cluster:
You can do this from any running member of the cluster.
- Log in to the appliance command line as the tideway user.
- Query the status of the cluster. Enter:
[tideway@wilkoapp1 ~]$ tw_cluster_control --show-members
This is an example of the cluster health check results for a three-machine cluster with fault tolerance enabled, where a non-coordinator machine cannot be contacted.
Cluster UUID : 508a243177476a01543289485ecb04e5
Cluster Name : Harness Cluster
Cluster Alias :
Fault Tolerance : Enabled
Replication Factor : 2
Number of Members : 3
UUID : 508a243177476a01383089485ecb04e5
Name : Harness Cluster-01
Address : wilkoapp1.tideway.com
Cluster Manager Health : MEMBER_HEALTH_OK
Overall Health : MEMBER_HEALTH_OK
Services : SERVICES_RUNNING
State : MEMBER_STATE_NORMAL
Coordinator : Yes
Last Contact : Fri Feb 14 16:47:33 2014
CPU Type : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Processors : 4
Memory : 3833M
Swap : 8191M
Free Space : /usr 33014M/38458M (15%)
UUID : 9013913177476a525d8289485ecd04e2
Name : Harness Cluster-02
Address : 137.72.94.205
Cluster Manager Health : MEMBER_HEALTH_OK
Overall Health : MEMBER_HEALTH_OK
Services : SERVICES_RUNNING
State : MEMBER_STATE_NORMAL
Coordinator : No
Last Contact : Fri Feb 14 16:47:33 2014
CPU Type : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Processors : 4
Memory : 3833M
Swap : 8191M
Free Space : /usr 33077M/38458M (14%)
UUID : 20972c31774768b3ad7889485ece04e8
Name : Harness Cluster-03
Address : wilkoapp3.tideway.com
Cluster Manager Health : MEMBER_HEALTH_ERROR Communication failure
Overall Health : MEMBER_HEALTH_ERROR Communication failure
Services : SERVICES_UNKNOWN
State : MEMBER_STATE_NORMAL
Coordinator : No
Last Contact : Fri Feb 14 16:46:45 2014
CPU Type : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Processors : 4
Memory : 3833M
Swap : 8191M
Free Space : /usr 32671M/38458M (16%)
[tideway@wilkoapp1 ~]$
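For routine monitoring you may not want to read the whole listing. Because the output uses fixed health strings, a simple filter can flag members in trouble. A minimal sketch that matches on the strings shown above (the exact labels may vary between versions, so verify them against your own output):

# List only members reporting a health error, with the three
# preceding lines (UUID, Name, Address) for context
[tideway@wilkoapp1 ~]$ tw_cluster_control --show-members | grep -B 3 "MEMBER_HEALTH_ERROR"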
Removing all failed machines from the cluster
You can do this by running tw_cluster_control from any running member of the cluster. The order of these steps is particularly important. See the note below.
- Log in to the appliance command line as the tideway user.
- Use tw_cluster_control to remove all failed machines from the cluster. You need to supply the password of the system user. Enter:
[tideway@wilkoapp1 ~]$ tw_cluster_control --remove-broken
Password:
Found 1 broken member:
Harness Cluster-03 (wilkoapp3.tideway.com)
Are you sure you want to make these changes? (y/n) y
1 member is being removed
[tideway@wilkoapp1 ~]$
Note: Before you can reuse any machine removed from the cluster, you must revert it to a standalone configuration. All data is lost in this process. On the command line of the machine you want to revert, enter:
[tideway@wilkoapp3 ~]$ tw_cluster_control --revert-to-standalone
Removing this machine from the current cluster
WARNING: This will delete all data on this cluster member!
Do you wish to proceed? (yes/no) yes
Stopping Cluster Manager service:
Stopping Cluster Manager service: [ FAILED ]
Removing cluster configuration
Starting Cluster Manager service:
Starting Cluster Manager service: [ OK ]
Removing appliance configuration
Deleting datastore
Starting Security service: [ OK ]
...
...
Starting Reasoning service: [ OK ]
Importing TKU packages: [ Unknown ]
Writing TKU-Core-2014-02-1-ADDM-10.0+: [ OK ]
Writing TKU-Extended-DB-Discovery-2014-02-1-ADDM-10.0+: [ Unknown ]
Writing TKU-Extended-Middleware-Discovery-2014-02-1-ADDM-10.0+: [ Unknown ]
Writing TKU-System-2014-02-1-ADDM-10.0+: [ Unknown ]
Activating patterns: [ OK ]
Starting Tomcat: [ OK ]
Starting Reports service: [ OK ]
Starting Application Server service: [ OK ]
Updating cron tables: [ OK ]
Updating baseline: [ OK ]
Finished removing cluster configuration
[tideway@wilkoapp3 ~]$
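To recap the order, which matters here: run --remove-broken on a running member first, and only then run --revert-to-standalone on each machine that was removed. A minimal sketch of the sequence, using the host names from the example above (both commands prompt interactively, so run them by hand rather than unattended):

# Step 1: on any running member, remove the broken machines
# (prompts for the system user's password and for confirmation)
[tideway@wilkoapp1 ~]$ tw_cluster_control --remove-broken

# Step 2: on each removed machine, revert it to standalone
# (prompts for confirmation; deletes all data on that machine)
[tideway@wilkoapp3 ~]$ tw_cluster_control --revert-to-standalone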
A machine has failed in my non-fault-tolerant cluster
A cluster that has no fault tolerance cannot survive when a machine fails. All data is lost and the only remaining option is to revert all of the machines to a standalone state and then create a new cluster. If you have backed up your cluster, you can restore the backup onto a new cluster of the same size.
To revert all of the machines to a standalone state:
- Log in to the appliance command line as the tideway user.
- Use tw_cluster_control to revert the machine to a standalone state. Enter:
[tideway@wilkoapp1 ~]$ tw_cluster_control --revert-to-standalone
Removing this machine from the current cluster
WARNING: This will delete all data on this cluster member!
Do you wish to proceed? (yes/no) yes
Stopping Cluster Manager service:
...
- Repeat for all members of the cluster (see the sketch below).
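Because the same command must run on every member, you can drive it over SSH from one machine. A minimal sketch, assuming the host names wilkoapp1 to wilkoapp3 from the examples above and SSH access as the tideway user; each run still prompts for confirmation, so keep the session interactive:

#!/bin/bash
# revert_all.sh -- illustrative helper; host names are examples.
# Each run prompts "Do you wish to proceed? (yes/no)" and deletes
# all data on that member, so confirm each machine deliberately.
for host in wilkoapp1 wilkoapp2 wilkoapp3; do
    ssh -t tideway@"$host" tw_cluster_control --revert-to-standalone
done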
Restoring a failed and forcibly removed coordinator
If a coordinator fails and is forcibly removed from the cluster, it must be reverted to a standalone state before it can be used again. To do this:
- Log in to the appliance command line as the tideway user.
- Use tw_cluster_control to revert the machine to a standalone state. Enter:
[tideway@wilkoapp1 ~]$ tw_cluster_control --revert-to-standalone
Removing this machine from the current cluster
WARNING: This will delete all data on this cluster member!
Do you wish to proceed? (yes/no) yes
Stopping Cluster Manager service:
...
The machine can then be added back into the cluster.
Making an existing cluster member the coordinator
Where you have removed a failed coordinator and the cluster has not restarted, you must make another machine become the coordinator. To do this:
- Log in to the appliance command line for the machine that you want to make the coordinator. Log in as the tideway user.
- Use tw_cluster_control to make that machine become the coordinator. Enter:
[tideway@wilkoapp1 ~]$ tw_cluster_control --become-coordinator
Password for BMC Discovery UI user system:
Making this machine the cluster coordinator
Becoming coordinator... (see UI for progress)
[tideway@wilkoapp1 ~]$
The machine then becomes the coordinator.
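You can confirm the change by checking the member listing again. A minimal sketch that filters for the labels shown in the health-check output earlier (label text may vary between versions):

# Show the name and coordinator flag of each member
# (also matches the "Cluster Name" header line, which is harmless here)
[tideway@wilkoapp1 ~]$ tw_cluster_control --show-members | grep -E "Name|Coordinator"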