Troubleshooting cluster creation failures
When your attempt to create a cluster fails, use the following troubleshooting steps to identify and resolve the problem or create a BMC Support case.
Issue symptom
An attempt to create a cluster fails.
Issue scope
This issue can affect BMC Discovery versions 11.x and 12.x.
Resolution
Perform the following steps to troubleshoot and resolve the failure in creating a cluster.
Step 1: Verify that the prerequisites are fulfilled
Make sure that you have fulfilled the following prerequisites for clusters:
- All machines in the cluster have the same major and minor versions.
- The individual machines that make up a cluster are in the same location.
- Clusters are created from machines of similar specifications and performance.
If any of these prerequisites is not fulfilled, that is a likely cause of the failure. Proceed further only if cluster creation still fails after you fulfill all the prerequisites.
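As a rough sketch, the version prerequisite can be checked by comparing the version strings reported by each member. The values below are stand-ins for illustration; gather the real versions from each member's UI or command line (the collection method is not shown here).

```shell
# Sketch: check that all members report the same version string.
# "12.1 12.1 12.1" is placeholder data, one entry per cluster member.
versions="12.1 12.1 12.1"
unique=$(printf '%s\n' $versions | sort -u | wc -l)
if [ "$unique" -eq 1 ]; then
  echo "All members run the same version"
else
  echo "Version mismatch across members"
fi
```

The same pattern can be reused to compare other per-member values, such as hardware specifications.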
Step 2: Isolate the error type and apply its solution
You may experience the following cluster-related errors. Identify the error that applies to your environment and apply its solution:
- Error 1: When trying to add a new member to an existing cluster, error messages are displayed in the UI
- Error 2: After you change the IP addresses for cluster members, the services do not start
- Error 3: A new cluster shuts down because a cluster member runs out of disk space during rebalancing of a cluster
- Error 4: After you add new members to a cluster and perform a reboot, the cluster does not start
Error 1: When you try to add a new member to an existing cluster, error messages are displayed in the UI
The UI displays any of the following error messages when you try to add a new member to an existing cluster:
- Error: No machines added to pending changes
- Failed to add the candidate machine xxx.xxx.xxx.xxx for one of the following reasons:
- The candidate does not have a default configuration
- The version of the candidate is not compatible with other machines in the cluster
- The candidate has a more recent TKU version than the cluster
Solution 1 for error 1
On the Administration page of the candidate member, go to Appliance > Cluster Management.
The UI displays messages that list the reasons why the appliance cannot join an existing cluster. For example, the following message may be displayed:
This machine cannot join an existing cluster for the following reasons:
1 credential found
3 DiscoveryRun nodes found
33 Host nodes found
LDAP has been enabled
- Reset the configuration to the default values.
For example, if the vault passphrase is set on the candidate member, clear the vault passphrase from the Administration > Vault Management UI. After you reset the configuration to the default values, the appliance should be able to join the cluster.
However, be aware that resetting the configuration to default values will delete the existing configuration.
Solution 2 for error 1
If you do not know the specific reason for the failure in cluster creation, then in the Administration > Vault Management UI, click Reset Configuration. This deletes all existing configuration settings, after which the appliance should be able to join the cluster.
Error 2: After you change the IP addresses for cluster members, the services do not start
Perform the following steps:
- Verify that you have changed the IP addresses according to the instructions in Changing the appliance IP address. If you have followed the instructions correctly and still face issues, proceed with the remaining steps.
- If you created the cluster member VMs by using VM cloning, and the IPs are now changed, then do the following:
- Run the command tw_cluster_control --replace-vm-uuid on each cluster member to fix the UUID mismatch caused by cloning.
- Log in as the tideway user and perform the following steps on each cluster member:
Back up the /usr/tideway/etc/cluster.conf and /usr/tideway/etc/ds_cluster.conf files by running the following commands:
cp /usr/tideway/etc/cluster.conf /usr/tideway/etc/cluster.conf.backup
cp /usr/tideway/etc/ds_cluster.conf /usr/tideway/etc/ds_cluster.conf.backup
- Edit the cluster.conf and ds_cluster.conf files to specify the new IP address.
Run the following commands to stop and restart the services:
On CentOS 6:
sudo /sbin/service tideway stop
sudo /sbin/service cluster stop
sudo /sbin/service omniNames stop
sudo /sbin/service appliance stop
sudo /sbin/service appliance start
sudo /sbin/service omniNames start
sudo /sbin/service cluster start
sudo /sbin/service tideway start
On CentOS 7:
tw_service_control --stop
sudo systemctl stop cluster
sudo systemctl stop omniNames
sudo systemctl stop appliance
sudo systemctl start appliance
sudo systemctl start omniNames
sudo systemctl start cluster
tw_service_control --start
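As a rough sketch, the backup-and-edit step above can be scripted with cp and sed. The IP addresses below are placeholders, and the sketch operates on a temporary copy of the file rather than the real /usr/tideway/etc/cluster.conf, so you can adapt it safely; the conf file format shown is illustrative, not the exact BMC Discovery format.

```shell
# Sketch: back up a conf file, then rewrite an old IP with the new one.
# The file content and both IPs are placeholders for illustration.
conf=$(mktemp)
printf 'member=10.0.0.5\n' > "$conf"

cp "$conf" "$conf.backup"                    # keep a backup before editing
sed -i 's/10\.0\.0\.5/10.0.0.50/' "$conf"    # replace old IP with new IP

cat "$conf"
# → member=10.0.0.50
```

Repeat the same edit for ds_cluster.conf, then stop and restart the services in the order shown above.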
Error 3: A new cluster shuts down because a cluster member runs out of disk space during rebalancing of the cluster
If a newly created cluster shuts down because a cluster member runs out of disk space during rebalancing of the cluster, perform the following steps to recover the cluster from this state:
- On the problematic cluster member, add a new local disk with enough disk space.
- Move the datastore to the newly added disk by running the command tw_disk_utils.
If the datastore transaction logs also require additional space, add a second local disk of appropriate size and move the logs to that disk.
For more information, see the article KA 000082640 on the BMC Community page.
- After the data movement is complete, reboot all the cluster members.
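While the data movement runs, you can watch free space on the disk that holds the datastore with df. The mount point and threshold below are example values, not BMC-prescribed settings; substitute the filesystem that holds /usr/tideway on your appliance.

```shell
# Sketch: warn when free space drops below an example threshold.
mount_point=/              # placeholder; use the filesystem holding /usr/tideway
threshold_kb=1048576       # 1 GB, an arbitrary example threshold

free_kb=$(df -P "$mount_point" | awk 'NR==2 {print $4}')
if [ "$free_kb" -lt "$threshold_kb" ]; then
  echo "WARNING: only ${free_kb} KB free on ${mount_point}"
else
  echo "OK: ${free_kb} KB free on ${mount_point}"
fi
```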
Error 4: After you add new members to a cluster and perform a reboot, the cluster does not start
After you successfully add new members to a cluster and then perform a reboot, the cluster does not start. Instead, the following message is repeatedly written to the model log: INFO: Still waiting to contact 2 cluster members.
To resolve this issue, edit the /etc/hosts file on each cluster member so that it tells the member how to reach the other members. This method is a fallback in case the DNS server does not provide the needed information.
The following text is an example of a desired /etc/hosts on a cluster member:
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.1.1 <member1 fqdn> <member1 hostname>
<member1 IP> <member1 fqdn> <member1 hostname> # DO NOT modify this line. Generated by Atrium Discovery.
#### all other cluster members below
<member2 IP> <member2 fqdn> <member2 hostname>
<member3 IP> <member3 fqdn> <member3 hostname>
After you edit the /etc/hosts files to include the required information for each member, you should be able to start the cluster successfully.
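Before starting the cluster, you can sanity-check that every expected member name appears in the hosts file. The member names below are placeholders, and the sketch writes a sample file instead of reading the real /etc/hosts so it is safe to experiment with.

```shell
# Sketch: confirm every expected member name appears in a hosts file.
# The names and the sample file are placeholders; point hosts_file at
# /etc/hosts on a real appliance.
hosts_file=$(mktemp)
printf '10.0.0.2 member2.example.com member2\n10.0.0.3 member3.example.com member3\n' > "$hosts_file"

for name in member2.example.com member3.example.com; do
  if grep -qw "$name" "$hosts_file"; then
    echo "$name: found"
  else
    echo "$name: MISSING"
  fi
done
```

Run the check on each member; any name reported MISSING must be added to that member's /etc/hosts before the cluster will start.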
If you are still unable to resolve your issue after trying the listed solutions, create a BMC Support case. Provide details of the issue and the exact steps you attempted to resolve it.