Troubleshooting clusters
My cluster will not start!
When a cluster starts, one of the checks it performs is to ensure that each member is the correct machine, rather than a different machine at the same IP address. For virtual machines, it checks the VM UUID. If the VM UUID has changed, the machine is assumed to be a different machine; the cluster cannot start and logs the following critical error in tw_svc_cluster_manager.log:
Replace expected VM UUID by running: tw_cluster_control --replace-vm-uuid
The critical error is also displayed on the console if you are attempting to start tw_svc_cluster_manager manually.
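If you want to confirm that this check is what is blocking startup, you can search the cluster manager log for the error before making any changes. A minimal sketch, assuming the log is in the standard appliance log directory /usr/tideway/log (the path is an assumption; adjust it for your appliance):

# Search the cluster manager log for the VM UUID critical error
# (log path assumed; check where tw_svc_cluster_manager.log lives on your appliance)
[tideway@wilkoapp1 ~]$ grep -i "vm uuid" /usr/tideway/log/tw_svc_cluster_manager.log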
To replace the VM UUID:
You can do this by running tw_cluster_control from the machine on which the cluster manager has failed.
- Log in to the appliance command line as the tideway user.
- Use tw_cluster_control to replace the expected VM UUID with the current value and enable the cluster service to start. Enter:
[tideway@wilkoapp1 ~]$ tw_cluster_control --replace-vm-uuid
[tideway@wilkoapp1 ~]$
- Start the cluster manager. Enter:
[tideway@wilkoapp1 ~]$ sudo /sbin/service appliance start
[tideway@wilkoapp1 ~]$
- Start the remaining services. Enter:
[tideway@wilkoapp1 ~]$ tw_cluster_control --cluster-start-services
[tideway@wilkoapp1 ~]$
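If you need to perform this recovery more than once, the three commands can be wrapped in a short script. A minimal sketch using only the commands shown above; the script name is illustrative, and it assumes the tideway user can run the appliance service command via sudo:

#!/bin/bash
# recover_vm_uuid.sh -- illustrative wrapper for the steps above.
# Run as the tideway user on the machine whose cluster manager failed.
set -e                                        # stop at the first failing step
tw_cluster_control --replace-vm-uuid          # accept the current VM UUID
sudo /sbin/service appliance start            # start the cluster manager
tw_cluster_control --cluster-start-services   # start the remaining services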
To determine the health of a cluster:
You can do this from any running member of the cluster.
- Log in to the appliance command line as the tideway user.
- Query the status of the cluster. Enter:
[tideway@wilkoapp1 ~]$ tw_cluster_control --show-members
This is an example of the cluster health check results for a three-machine cluster with fault tolerance enabled, where a non-coordinator machine cannot be contacted.
Cluster UUID : 508a243177476a01543289485ecb04e5
Cluster Name : Harness Cluster
Cluster Alias :
Fault Tolerance : Enabled
Replication Factor : 2
Number of Members : 3
UUID : 508a243177476a01383089485ecb04e5
Name : Harness Cluster-01
Address : wilkoapp1.tideway.com
Cluster Manager Health : MEMBER_HEALTH_OK
Overall Health : MEMBER_HEALTH_OK
Services : SERVICES_RUNNING
State : MEMBER_STATE_NORMAL
Coordinator : Yes
Last Contact : Fri Feb 14 16:47:33 2014
CPU Type : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Processors : 4
Memory : 3833M
Swap : 8191M
Free Space : /usr 33014M/38458M (15%)
UUID : 9013913177476a525d8289485ecd04e2
Name : Harness Cluster-02
Address : 137.72.94.205
Cluster Manager Health : MEMBER_HEALTH_OK
Overall Health : MEMBER_HEALTH_OK
Services : SERVICES_RUNNING
State : MEMBER_STATE_NORMAL
Coordinator : No
Last Contact : Fri Feb 14 16:47:33 2014
CPU Type : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Processors : 4
Memory : 3833M
Swap : 8191M
Free Space : /usr 33077M/38458M (14%)
UUID : 20972c31774768b3ad7889485ece04e8
Name : Harness Cluster-03
Address : wilkoapp3.tideway.com
Cluster Manager Health : MEMBER_HEALTH_ERROR Communication failure
Overall Health : MEMBER_HEALTH_ERROR Communication failure
Services : SERVICES_UNKNOWN
State : MEMBER_STATE_NORMAL
Coordinator : No
Last Contact : Fri Feb 14 16:46:45 2014
CPU Type : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Processors : 4
Memory : 3833M
Swap : 8191M
Free Space : /usr 32671M/38458M (16%)
[tideway@wilkoapp1 ~]$
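For routine monitoring you may not want to read the whole listing. Because the output uses fixed health strings, a simple filter can flag members in trouble. A minimal sketch that matches on the strings shown above (the exact labels may vary between versions, so verify them against your own output):

# List only members reporting a health error, with the three
# preceding lines (UUID, Name, Address) for context
[tideway@wilkoapp1 ~]$ tw_cluster_control --show-members | grep -B 3 "MEMBER_HEALTH_ERROR"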
Removing all failed machines from the cluster
You can do this by running tw_cluster_control from any running member of the cluster. The order of these steps is particularly important. See the note below.
- Log in to the appliance command line as the tideway user.
- Use tw_cluster_control to remove all failed machines from the cluster. You need to supply the password of the system user. Enter:
[tideway@wilkoapp1 ~]$ tw_cluster_control --remove-broken
Password:
Found 1 broken member:
Harness Cluster-03 (wilkoapp3.tideway.com)
Are you sure you want to make these changes? (y/n) y
1 member is being removed
[tideway@wilkoapp1 ~]$
Note: Before you can reuse any machine removed from the cluster, you must revert it to a standalone configuration. All data is lost in this process. On the command line of the machine you want to revert, enter:
[tideway@wilkoapp3 ~]$ tw_cluster_control --revert-to-standalone
Removing this machine from the current cluster
WARNING: This will delete all data on this cluster member!
Do you wish to proceed? (yes/no) yes
Stopping Cluster Manager service:
Stopping Cluster Manager service: [ FAILED ]
Removing cluster configuration
Starting Cluster Manager service:
Starting Cluster Manager service: [ OK ]
Removing appliance configuration
Deleting datastore
Starting Security service: [ OK ]
...
...
Starting Reasoning service: [ OK ]
Importing TKU packages: [ Unknown ]
Writing TKU-Core-2014-02-1-ADDM-10.0+: [ OK ]
Writing TKU-Extended-DB-Discovery-2014-02-1-ADDM-10.0+: [ Unknown ]
Writing TKU-Extended-Middleware-Discovery-2014-02-1-ADDM-10.0+: [ Unknown ]
Writing TKU-System-2014-02-1-ADDM-10.0+: [ Unknown ]
Activating patterns: [ OK ]
Starting Tomcat: [ OK ]
Starting Reports service: [ OK ]
Starting Application Server service: [ OK ]
Updating cron tables: [ OK ]
Updating baseline: [ OK ]
Finished removing cluster configuration
[tideway@wilkoapp3 ~]$
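To recap the order, which matters here: run --remove-broken on a running member first, and only then run --revert-to-standalone on each machine that was removed. A minimal sketch of the sequence, using the host names from the example above (both commands prompt interactively, so run them by hand rather than unattended):

# Step 1: on any running member, remove the broken machines
# (prompts for the system user's password and for confirmation)
[tideway@wilkoapp1 ~]$ tw_cluster_control --remove-broken

# Step 2: on each removed machine, revert it to standalone
# (prompts for confirmation; deletes all data on that machine)
[tideway@wilkoapp3 ~]$ tw_cluster_control --revert-to-standalone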
A machine has failed in my non-fault-tolerant cluster
A cluster that has no fault tolerance cannot survive when a machine fails. All data is lost and the only remaining option is to revert all of the machines to a standalone state and then create a new cluster. If you have backed up your cluster, you can restore the backup onto a new cluster of the same size.
To revert all of the machines to a standalone state:
- Log in to the appliance command line as the tideway user.
- Use tw_cluster_control to revert the machine to a standalone state. Enter:
[tideway@wilkoapp1 ~]$ tw_cluster_control --revert-to-standalone
Removing this machine from the current cluster
WARNING: This will delete all data on this cluster member!
Do you wish to proceed? (yes/no) yes
Stopping Cluster Manager service:
...
- Repeat for all members of the cluster (see the sketch below).
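Because the same command must run on every member, you can drive it over SSH from one machine. A minimal sketch, assuming the host names wilkoapp1 to wilkoapp3 from the examples above and SSH access as the tideway user; each run still prompts for confirmation, so keep the session interactive:

#!/bin/bash
# revert_all.sh -- illustrative helper; host names are examples.
# Each run prompts "Do you wish to proceed? (yes/no)" and deletes
# all data on that member, so confirm each machine deliberately.
for host in wilkoapp1 wilkoapp2 wilkoapp3; do
    ssh -t tideway@"$host" tw_cluster_control --revert-to-standalone
done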
Restoring a failed and forcibly removed coordinator
If a coordinator fails and is forcibly removed from the cluster, it must be reverted to a standalone state before it can be used again. To do this:
- Log in to the appliance command line as the tideway user.
- Use tw_cluster_control to revert the machine to a standalone state. Enter:
[tideway@wilkoapp1 ~]$ tw_cluster_control --revert-to-standalone
Removing this machine from the current cluster
WARNING: This will delete all data on this cluster member!
Do you wish to proceed? (yes/no) yes
Stopping Cluster Manager service:
...
The machine can then be added back into the cluster.
Making an existing cluster member the coordinator
Where you have removed a failed coordinator and the cluster has not restarted, you must make another machine become the coordinator. To do this:
- Log in to the appliance command line for the machine that you want to make the coordinator. Log in as the tideway user.
- Use tw_cluster_control to make that machine become the coordinator. Enter:
[tideway@wilkoapp1 ~]$ tw_cluster_control --become-coordinator
Password for BMC Discovery UI user system:
Making this machine the cluster coordinator
Becoming coordinator... (see UI for progress)
[tideway@wilkoapp1 ~]$
The machine then becomes the coordinator.
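You can confirm the change by checking the member listing again. A minimal sketch that filters for the labels shown in the health-check output earlier (label text may vary between versions):

# Show the name and coordinator flag of each member
# (also matches the "Cluster Name" header line, which is harmless here)
[tideway@wilkoapp1 ~]$ tw_cluster_control --show-members | grep -E "Name|Coordinator"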