Troubleshooting clusters
When you encounter problems with a cluster, the first sign may be errors in the Cluster Manager UI. If you cannot recover using the UI, you may be able to fix the problem with the tw_cluster_control utility.
My cluster will not start!
When a cluster starts, one of the checks it performs is to ensure that the members are all the correct machines, rather than different machines on the same IP addresses. For virtual machines, it checks the VM UUID. If the VM UUID has changed, the machine is assumed to be a different machine; the cluster cannot start and logs the following critical error in tw_svc_cluster_manager.log:

    Clustered machine: VM UUID has changed
    Replace expected VM UUID by running: tw_cluster_control --replace-vm-uuid
The critical error is also displayed on the console if you are attempting to start tw_svc_cluster_manager manually.
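The check described above amounts to comparing a previously recorded VM UUID with the one the hypervisor currently reports. A minimal sketch of that logic, purely for illustration (the function and the fixed example values below are not the appliance's actual implementation):

```shell
#!/bin/sh
# Illustrative sketch of the startup check described above; this is
# NOT the appliance's real code. Compare a recorded VM UUID with the
# currently reported one, printing the critical error on a mismatch.
check_vm_uuid() {
    recorded="$1"
    current="$2"
    if [ "$recorded" = "$current" ]; then
        return 0
    fi
    echo "Clustered machine: VM UUID has changed"
    echo "Replace expected VM UUID by running: tw_cluster_control --replace-vm-uuid"
    return 1
}

# On a real VM the current UUID could be read with
# 'dmidecode -s system-uuid' (requires root); fixed example values here.
check_vm_uuid "420c0a12-0000-0000-0000-000000000001" \
              "420c0a12-0000-0000-0000-000000000001" && echo "VM UUID check passed"
```

Running the sketch with matching values prints "VM UUID check passed"; a mismatch prints the same critical error shown above.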
To replace the VM UUID:

You can do this by running tw_cluster_control from the machine on which the cluster manager has failed.

- Log in to the appliance command line as the tideway user.
- Use tw_cluster_control to replace the expected VM UUID with the current value and enable the cluster service to start. Enter:

    [tideway@wilkoapp1 ~]$ tw_cluster_control --replace-vm-uuid
    [tideway@wilkoapp1 ~]$

- Start the cluster manager. Enter:

    [tideway@wilkoapp1 ~]$ sudo /sbin/service appliance start
    [tideway@wilkoapp1 ~]$

- Start the remaining services. Enter:

    [tideway@wilkoapp1 ~]$ tw_cluster_control --cluster-start-services
    [tideway@wilkoapp1 ~]$
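The three commands above must run in that order on the failed cluster manager. They can be wrapped in a small convenience script; this is a hypothetical sketch, not a supplied tool, and it defaults to a dry run that only prints each command:

```shell
#!/bin/sh
# Hypothetical wrapper for the recovery steps above (not a supplied tool).
# With DRY_RUN=1 (the default) it only prints each command; set DRY_RUN=0
# on the failed cluster manager to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@" || { echo "failed: $*" >&2; exit 1; }
    fi
}

run tw_cluster_control --replace-vm-uuid        # accept the new VM UUID
run sudo /sbin/service appliance start          # start the cluster manager
run tw_cluster_control --cluster-start-services # start the remaining services
```

Stopping at the first failed command matters here: starting services before the VM UUID is replaced would simply reproduce the critical error.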
To determine the health of a cluster:

You can do this from any running member of the cluster.

- Log in to the appliance command line as the tideway user.
- Query the status of the cluster. Enter:

    [tideway@wilkoapp1 ~]$ tw_cluster_control --show-members

This is an example of the cluster health check results for a three-machine cluster with fault tolerance enabled, where a non-coordinator machine cannot be contacted.

    Cluster UUID       : 508a243177476a01543289485ecb04e5
    Cluster Name       : Harness Cluster
    Cluster Alias      :
    Fault Tolerance    : Enabled
    Replication Factor : 2
    Number of Members  : 3

    UUID                   : 508a243177476a01383089485ecb04e5
    Name                   : Harness Cluster-01
    Address                : wilkoapp1.tideway.com
    Cluster Manager Health : MEMBER_HEALTH_OK
    Overall Health         : MEMBER_HEALTH_OK
    Services               : SERVICES_RUNNING
    State                  : MEMBER_STATE_NORMAL
    Coordinator            : Yes
    Last Contact           : Fri Feb 14 16:47:33 2014
    CPU Type               : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
    Processors             : 4
    Memory                 : 3833M
    Swap                   : 8191M
    Free Space             : /usr 33014M/38458M (15%)

    UUID                   : 9013913177476a525d8289485ecd04e2
    Name                   : Harness Cluster-02
    Address                : 137.72.94.205
    Cluster Manager Health : MEMBER_HEALTH_OK
    Overall Health         : MEMBER_HEALTH_OK
    Services               : SERVICES_RUNNING
    State                  : MEMBER_STATE_NORMAL
    Coordinator            : No
    Last Contact           : Fri Feb 14 16:47:33 2014
    CPU Type               : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
    Processors             : 4
    Memory                 : 3833M
    Swap                   : 8191M
    Free Space             : /usr 33077M/38458M (14%)

    UUID                   : 20972c31774768b3ad7889485ece04e8
    Name                   : Harness Cluster-03
    Address                : wilkoapp3.tideway.com
    Cluster Manager Health : MEMBER_HEALTH_ERROR Communication failure
    Overall Health         : MEMBER_HEALTH_ERROR Communication failure
    Services               : SERVICES_UNKNOWN
    State                  : MEMBER_STATE_NORMAL
    Coordinator            : No
    Last Contact           : Fri Feb 14 16:46:45 2014
    CPU Type               : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
    Processors             : 4
    Memory                 : 3833M
    Swap                   : 8191M
    Free Space             : /usr 32671M/38458M (16%)
    [tideway@wilkoapp1 ~]$
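On a large cluster it can help to filter the --show-members output for anything unhealthy rather than reading every record. A hypothetical helper (not part of the product) that reads the text output on stdin and prints the name and overall health of any member that is not MEMBER_HEALTH_OK:

```shell
#!/bin/sh
# Hypothetical helper (not part of the product): scan the text output of
# 'tw_cluster_control --show-members' for members whose overall health is
# anything other than MEMBER_HEALTH_OK, printing their name and status.
flag_unhealthy() {
    awk '
        index($0, ":") {
            pos = index($0, ":")
            key = substr($0, 1, pos - 1); gsub(/^ +| +$/, "", key)
            val = substr($0, pos + 1);    gsub(/^ +| +$/, "", val)
            if (key == "Name") name = val
            if (key == "Overall Health" && val != "MEMBER_HEALTH_OK")
                print name ": " val
        }
    '
}

# Example with two abbreviated member records:
flag_unhealthy <<'EOF'
Name           : Harness Cluster-01
Overall Health : MEMBER_HEALTH_OK
Name           : Harness Cluster-03
Overall Health : MEMBER_HEALTH_ERROR Communication failure
EOF
# prints: Harness Cluster-03: MEMBER_HEALTH_ERROR Communication failure
```

In practice you would pipe the live output through it, e.g. `tw_cluster_control --show-members | flag_unhealthy`.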
Removing all failed machines from the cluster

You can do this by running tw_cluster_control from any running member of the cluster.
- Log in to the appliance command line as the tideway user.
- Use tw_cluster_control to remove all failed machines from the cluster. You need to supply the password of the system user. Enter:

    [tideway@wilkoapp1 ~]$ tw_cluster_control --remove-broken
    Password:
    Found 1 broken member:
      Harness Cluster-03 (wilkoapp3.tideway.com)
    Are you sure you want to make these changes? (y/n) y
    1 member is being removed
    [tideway@wilkoapp1 ~]$

Before you can reuse any machine removed from the cluster, you must revert it to a standalone configuration. All data is lost in this process. On the command line of the machine you want to revert, enter:

    [tideway@wilkoapp3 ~]$ tw_cluster_control --revert-to-standalone
    Removing this machine from the current cluster
    WARNING: This will delete all data on this cluster member!
    Do you wish to proceed? (yes/no) yes
    Stopping Cluster Manager service:                        [ FAILED ]
    Removing cluster configuration
    Starting Cluster Manager service:                        [  OK  ]
    Removing appliance configuration
    Deleting datastore
    Starting Security service:                               [  OK  ]
    ...
    Starting Reasoning service:                              [  OK  ]
    Importing TKU packages:                                  [ Unknown ]
    Writing TKU-Core-2064-02-1-ADDM-10.0+:                   [  OK  ]
    TKU-Extended-DB-Discovery-2064-02-1-ADDM-10.0+:          [ Unknown ]
    TKU-Extended-Middleware-Discovery-2064-02-1-ADDM-10.0+:  [ Unknown ]
    TKU-System-2064-02-1-ADDM-10.0+:                         [ Unknown ]
    Activating patterns:                                     [  OK  ]
    Starting Tomcat:                                         [  OK  ]
    Starting Reports service:                                [  OK  ]
    Starting Application Server service:                     [  OK  ]
    Updating cron tables:                                    [  OK  ]
    Updating baseline:                                       [  OK  ]
    Finished removing cluster configuration
    [tideway@wilkoapp3 ~]$
A machine has failed in my non-fault tolerant cluster
A cluster that has no fault tolerance cannot survive when a machine fails. All data is lost and the only remaining option is to revert all of the machines to a standalone state and then create a new cluster. If you have backed up your cluster, you can restore the backup onto a new cluster of the same size.
To revert all of the machines to a standalone state:

- Log in to the appliance command line as the tideway user.
- Use tw_cluster_control to revert the machine to a standalone state. Enter:

    [tideway@wilkoapp1 ~]$ tw_cluster_control --revert-to-standalone
    Removing this machine from the current cluster
    WARNING: This will delete all data on this cluster member!
    Do you wish to proceed? (yes/no) yes
    Stopping Cluster Manager service:
    ...

- Repeat for all members of the cluster.
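Because --revert-to-standalone must be run on each machine itself and prompts for confirmation, reverting a whole cluster means repeating the command once per member. A hypothetical dry-run sketch of that loop (the host names are placeholders, not real cluster members):

```shell
#!/bin/sh
# Hypothetical sketch (host names are placeholders): the revert command
# must be run on each member itself, for example from its own console or
# over ssh. This dry run only prints the plan; it changes nothing.
MEMBERS="wilkoapp1 wilkoapp2 wilkoapp3"
for host in $MEMBERS; do
    # A real run would execute tw_cluster_control --revert-to-standalone
    # on $host and answer its interactive confirmation prompt.
    echo "would revert $host to standalone"
done
```

Because each revert deletes all data on that member, an automated loop like this should only ever be run once backups have been confirmed.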
Restoring a failed and forcibly removed coordinator
If a coordinator fails and is forcibly removed from the cluster, it must be reverted to a standalone state before it can be used again. To do this:

- Log in to the appliance command line as the tideway user.
- Use tw_cluster_control to revert the machine to a standalone state. Enter:

    [tideway@wilkoapp1 ~]$ tw_cluster_control --revert-to-standalone
    Removing this machine from the current cluster
    WARNING: This will delete all data on this cluster member!
    Do you wish to proceed? (yes/no) yes
    Stopping Cluster Manager service:
    ...
The machine can be added back into the cluster.