Troubleshooting clusters

    When you encounter problems with a cluster, the first thing that you see may be errors in the Cluster Manager UI. If you are unable to recover using the UI, you may be able to use the tw_cluster_control utility to fix the problem.

    My cluster will not start!

    When a cluster starts, one of the checks it performs is to ensure that the members are all the correct machines, rather than different machines on the same IP addresses. For virtual machines, it checks the VM UUID. If the VM UUID has changed, the machine is assumed to be a different machine; the cluster cannot start, and it logs the following critical error in tw_svc_cluster_manager.log:

    Clustered machine: VM UUID has changed
    Replace expected VM UUID by running: tw_cluster_control --replace-vm-uuid

    The critical error is also displayed on the console if you are attempting to start tw_svc_cluster_manager manually.
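
    If you want to confirm that this is the cause before making any changes, you can search the cluster manager log for the message. This is only a sketch: it assumes the log is in the standard tideway log directory, which may differ on your appliance.

      [tideway@wilkoapp1 ~]$ grep "VM UUID has changed" /usr/tideway/log/tw_svc_cluster_manager.log
      Clustered machine: VM UUID has changed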

    To replace the VM UUID:
    You can do this by running tw_cluster_control from the machine on which the cluster manager has failed.

    1. Log in to the appliance command line as the tideway user.
    2. Use tw_cluster_control to replace the expected VM UUID with the current value and enable the cluster service to start. Enter:

      [tideway@wilkoapp1 ~]$ tw_cluster_control --replace-vm-uuid
      [tideway@wilkoapp1 ~]$
    3. Start the cluster manager. Enter:

      [tideway@wilkoapp1 ~]$ sudo /sbin/service appliance start
      [tideway@wilkoapp1 ~]$
    4. Start the remaining services. Enter:

      [tideway@wilkoapp1 ~]$ tw_cluster_control --cluster-start-services
      [tideway@wilkoapp1 ~]$
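
    After the services start, you can confirm that the member reports a healthy state. For example, the following filters the --show-members listing (described in the next section) for each member's overall health; the field name is taken from the example output shown below.

      [tideway@wilkoapp1 ~]$ tw_cluster_control --show-members | grep "Overall Health"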

    To determine the health of a cluster

    You can do this from any running member of the cluster.

    1. Log in to the appliance command line as the tideway user.
    2. Query the status of the cluster. Enter:

      [tideway@wilkoapp1 ~]$ tw_cluster_control --show-members
      

    This is an example of the cluster health check results for a three-machine cluster with fault tolerance enabled, where a non-coordinator machine cannot be contacted.

      
          Cluster UUID : 508a243177476a01543289485ecb04e5
          Cluster Name : Harness Cluster
         Cluster Alias :
       Fault Tolerance : Enabled
    Replication Factor : 2
     Number of Members : 3
    
                            UUID : 508a243177476a01383089485ecb04e5
                            Name : Harness Cluster-01
                         Address : wilkoapp1.tideway.com
          Cluster Manager Health : MEMBER_HEALTH_OK
                  Overall Health : MEMBER_HEALTH_OK
                        Services : SERVICES_RUNNING
                           State : MEMBER_STATE_NORMAL
                     Coordinator : Yes
                    Last Contact : Fri Feb 14 16:47:33 2014
                        CPU Type : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
                      Processors : 4
                          Memory : 3833M
                            Swap : 8191M
                      Free Space : /usr 33014M/38458M (15%)
    
                            UUID : 9013913177476a525d8289485ecd04e2
                            Name : Harness Cluster-02
                         Address : 137.72.94.205
          Cluster Manager Health : MEMBER_HEALTH_OK
                  Overall Health : MEMBER_HEALTH_OK
                        Services : SERVICES_RUNNING
                           State : MEMBER_STATE_NORMAL
                     Coordinator : No
                    Last Contact : Fri Feb 14 16:47:33 2014
                        CPU Type : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
                      Processors : 4
                          Memory : 3833M
                            Swap : 8191M
                      Free Space : /usr 33077M/38458M (14%)
    
                            UUID : 20972c31774768b3ad7889485ece04e8
                            Name : Harness Cluster-03
                         Address : wilkoapp3.tideway.com
          Cluster Manager Health : MEMBER_HEALTH_ERROR Communication failure
                  Overall Health : MEMBER_HEALTH_ERROR Communication failure
                        Services : SERVICES_UNKNOWN
                           State : MEMBER_STATE_NORMAL
                     Coordinator : No
                    Last Contact : Fri Feb 14 16:46:45 2014
                        CPU Type : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
                      Processors : 4
                          Memory : 3833M
                            Swap : 8191M
                      Free Space : /usr 32671M/38458M (16%)
    
    [tideway@wilkoapp1 ~]$
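
    If you only need a quick check rather than the full listing, you can filter the same output for error states. This is a sketch based on the field values in the example above: it counts the members whose overall health is reported as MEMBER_HEALTH_ERROR (one in this case); the exact field wording may vary between versions.

      [tideway@wilkoapp1 ~]$ tw_cluster_control --show-members | grep -c "Overall Health : MEMBER_HEALTH_ERROR"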

    Removing all failed machines from the cluster 

    You can do this by running tw_cluster_control from any running member of the cluster. The order of these steps is particularly important. See the note below.

    1. Log in to the appliance command line as the tideway user.
    2. Use tw_cluster_control to remove all failed machines from the cluster. You need to supply the password of the system user. Enter:

      [tideway@wilkoapp1 ~]$ tw_cluster_control --remove-broken
      Password:
      
      Found 1 broken member:
          Harness Cluster-03 (wilkoapp3.tideway.com)
      
      Are you sure you want to make these changes? (y/n) y
      
      1 member is being removed
      [tideway@wilkoapp1 ~]$
    3. Before you can reuse any machine removed from the cluster, you must revert it to a standalone configuration. All data is lost in this process. On the command line of the machine you want to revert, enter:

      [tideway@wilkoapp3 ~]$ tw_cluster_control --revert-to-standalone
      Removing this machine from the current cluster
      
      WARNING: This will delete all data on this cluster member!
      
        Do you wish to proceed? (yes/no) yes
      Stopping Cluster Manager service:
      Stopping Cluster Manager service:                           [  FAILED  ]
      Removing cluster configuration
      Starting Cluster Manager service:
      Starting Cluster Manager service:                           [  OK  ]
      Removing appliance configuration
      Deleting datastore
      Starting Security service:                                  [  OK  ]
      ...
      
      ...
      Starting Reasoning service:                                 [  OK  ]
      Importing TKU packages:                                     [  Unknown  ]
      Writing TKU-Core-2064-02-1-ADDM-10.0+:                      [  OK  ]
      TKU-Extended-DB-Discovery-2064-02-1-ADDM-10.0+:             [  Unknown  ]
      TKU-Extended-Middleware-Discovery-2064-02-1-ADDM-10.0+:     [  Unknown  ]
      TKU-System-2064-02-1-ADDM-10.0+:                            [  Unknown  ]
      Activating patterns:                                        [  OK  ]
      Starting Tomcat:                                            [  OK  ]
      Starting Reports service:                                   [  OK  ]
      Starting Application Server service:                        [  OK  ]
      Updating cron tables:                                       [  OK  ]
      Updating baseline:                                          [  OK  ]
      Finished removing cluster configuration
      [tideway@wilkoapp3 ~]$ 
      

    If a machine in the cluster has failed and you run --revert-to-standalone on it before removing it from the cluster, you must shut down the reverted machine before issuing the --remove-broken command.
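
    After removing the broken member, you can confirm from any running member that the cluster no longer includes it. For example, the member count should have decreased; the field name here is taken from the health check example above.

      [tideway@wilkoapp1 ~]$ tw_cluster_control --show-members | grep "Number of Members"
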

    A machine has failed in my non-fault-tolerant cluster

    A cluster that has no fault tolerance cannot survive when a machine fails. All data is lost and the only remaining option is to revert all of the machines to a standalone state and then create a new cluster. If you have backed up your cluster, you can restore the backup onto a new cluster of the same size.

    To revert all of the machines to a standalone state:

    1. Log in to the appliance command line as the tideway user.
    2. Use tw_cluster_control to revert the machine to a standalone state. Enter:

      [tideway@wilkoapp1 ~]$ tw_cluster_control --revert-to-standalone
      Removing this machine from the current cluster
      
      WARNING: This will delete all data on this cluster member!
      
        Do you wish to proceed? (yes/no) yes
      Stopping Cluster Manager service:
      ...
      
    3. Repeat for all members of the cluster.







    Restoring a failed and forcibly removed coordinator

    If a coordinator fails and is forcibly removed from the cluster, it must be reverted to a standalone state before it can be used again. To do this:

    1. Log in to the appliance command line as the tideway user.
    2. Use tw_cluster_control to revert the machine to a standalone state. Enter:

      [tideway@wilkoapp1 ~]$ tw_cluster_control --revert-to-standalone
      Removing this machine from the current cluster
      
      WARNING: This will delete all data on this cluster member!
      
        Do you wish to proceed? (yes/no) yes
      Stopping Cluster Manager service:
      ...
      

    The machine can then be added back into the cluster.
