Troubleshooting cluster creation failures
When your attempt to create a cluster fails, use the following troubleshooting steps to identify and resolve the problem or create a BMC Support case.
Issue symptom
An attempt to create a cluster fails.
Issue scope
This issue can affect BMC Discovery versions 11.x and 12.x.
Resolution
Perform the following steps to troubleshoot and resolve the failure in creating a cluster.
Step 1: Verify that the prerequisites are fulfilled
Make sure that you have fulfilled the following prerequisites for clusters:
- All machines in the cluster have the same major and minor versions.
- The individual machines that make up a cluster are in the same location.
- Clusters are created from machines of similar specifications and performance.
If any of these prerequisites is not fulfilled, that is a likely cause of the failure. Proceed further only if cluster creation still fails after you fulfill all the prerequisites.
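As a rough sketch, the version prerequisite can be checked by comparing the version strings reported by each member. The values below are stand-ins for illustration; gather the real versions from each member's UI or command line (the collection method is not shown here).

```shell
# Sketch: check that all members report the same version string.
# "12.1 12.1 12.1" is placeholder data, one entry per cluster member.
versions="12.1 12.1 12.1"
unique=$(printf '%s\n' $versions | sort -u | wc -l)
if [ "$unique" -eq 1 ]; then
  echo "All members run the same version"
else
  echo "Version mismatch across members"
fi
```

The same pattern can be reused to compare other per-member values, such as hardware specifications.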
Step 2: Isolate the error type and apply its solution
You may experience the following cluster-related errors. Identify the error that applies to your environment and apply its solution:
- Error 1: When trying to add a new member to an existing cluster, error messages are displayed in the UI
- Error 2: After you change the IP addresses for cluster members, the services do not start
- Error 3: A new cluster shuts down because a cluster member runs out of disk space during rebalancing of a cluster
- Error 4: After you add new members to a cluster and perform a reboot, the cluster does not start
Error 1: When you try to add a new member to an existing cluster, error messages are displayed in the UI
The UI displays any of the following error messages when you try to add a new member to an existing cluster:
- Error: No machines added to pending changes
- Failed to add the candidate machine xxx.xxx.xxx.xxx for one of the following reasons:
- The candidate does not have a default configuration
- The version of the candidate is not compatible with other machines in the cluster
- The candidate has a more recent TKU version than the cluster
Solution 1 for error 1
On the Administration page of the candidate member, go to Appliance > Cluster Management.
The UI displays messages that list the reasons why the appliance cannot join an existing cluster. For example, the following message may be displayed:
This machine cannot join an existing cluster for the following reasons:
1 credential found
3 DiscoveryRun nodes found
33 Host nodes found
LDAP has been enabled
- Reset the configuration to the default values.
For example, if the vault passphrase is set on the candidate member, clear the vault passphrase from the Administration > Vault Management UI. After you reset the configuration to the default values, the appliance should be able to join the cluster.
However, be aware that resetting the configuration to default values will delete the existing configuration.
Solution 2 for error 1
If you do not know the specific reason for the failure in cluster creation, then in the Administration > Vault Management UI, click Reset Configuration. This deletes all existing configuration settings, after which the appliance should be able to join the cluster.
Error 2: After you change the IP addresses for cluster members, the services do not start
Perform the following steps:
- Verify that you have changed the IP addresses according to the instructions in Changing the appliance IP address. If you have followed the instructions correctly and still face issues, proceed with the remaining steps.
- If you created the cluster member VMs by using VM cloning, and the IPs are now changed, then do the following:
- Run the command tw_cluster_control --replace-vm-uuid on each cluster member to fix the UUID mismatch caused by cloning.
- Log in as the tideway user and perform the following steps on each cluster member:
Back up the /usr/tideway/etc/cluster.conf and /usr/tideway/etc/ds_cluster.conf files by running the following commands:
cp /usr/tideway/etc/cluster.conf /usr/tideway/etc/cluster.conf.backup
cp /usr/tideway/etc/ds_cluster.conf /usr/tideway/etc/ds_cluster.conf.backup
- Edit the cluster.conf and ds_cluster.conf files to specify the new IP address.
Run the following commands to stop and restart the services:
On CentOS 6:
sudo /sbin/service tideway stop
sudo /sbin/service cluster stop
sudo /sbin/service omniNames stop
sudo /sbin/service appliance stop
sudo /sbin/service appliance start
sudo /sbin/service omniNames start
sudo /sbin/service cluster start
sudo /sbin/service tideway start
On CentOS 7:
tw_service_control --stop
sudo systemctl stop cluster
sudo systemctl stop omniNames
sudo systemctl stop appliance
sudo systemctl start appliance
sudo systemctl start omniNames
sudo systemctl start cluster
tw_service_control --start
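As a rough sketch, the backup-and-edit step above can be scripted with cp and sed. The IP addresses below are placeholders, and the sketch operates on a temporary copy of the file rather than the real /usr/tideway/etc/cluster.conf, so you can adapt it safely; the conf file format shown is illustrative, not the exact BMC Discovery format.

```shell
# Sketch: back up a conf file, then rewrite an old IP with the new one.
# The file content and both IPs are placeholders for illustration.
conf=$(mktemp)
printf 'member=10.0.0.5\n' > "$conf"

cp "$conf" "$conf.backup"                    # keep a backup before editing
sed -i 's/10\.0\.0\.5/10.0.0.50/' "$conf"    # replace old IP with new IP

cat "$conf"
# → member=10.0.0.50
```

Repeat the same edit for ds_cluster.conf, then stop and restart the services in the order shown above.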
Error 3: A new cluster shuts down because a cluster member runs out of disk space during rebalancing of the cluster
If a newly created cluster shuts down because a cluster member runs out of disk space during rebalancing of the cluster, perform the following steps to recover the cluster from this state:
- On the problematic cluster member, add a new local disk with enough disk space.
- Move the datastore to the newly added disk by running the command tw_disk_utils.
If the datastore transaction logs also require additional space, add a second local disk of appropriate size and move the logs to that disk.
For more information, see the article KA 000082640 on the BMC Community page.
- After the data movement is complete, reboot all the cluster members.
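While the data movement runs, you can watch free space on the disk that holds the datastore with df. The mount point and threshold below are example values, not BMC-prescribed settings; substitute the filesystem that holds /usr/tideway on your appliance.

```shell
# Sketch: warn when free space drops below an example threshold.
mount_point=/              # placeholder; use the filesystem holding /usr/tideway
threshold_kb=1048576       # 1 GB, an arbitrary example threshold

free_kb=$(df -P "$mount_point" | awk 'NR==2 {print $4}')
if [ "$free_kb" -lt "$threshold_kb" ]; then
  echo "WARNING: only ${free_kb} KB free on ${mount_point}"
else
  echo "OK: ${free_kb} KB free on ${mount_point}"
fi
```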
Error 4: After you add new members to a cluster and perform a reboot, the cluster does not start
After you successfully add new members to a cluster and then perform a reboot, the cluster does not start. Instead, the following message is repeatedly written to the model log: INFO: Still waiting to contact 2 cluster members.
To resolve this issue, edit the /etc/hosts file on each cluster member so that it tells the member how to reach the other members. This method is a fallback in case the DNS server does not provide the needed information.
The following text is an example of a desired /etc/hosts on a cluster member:
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.1.1 <member1 fqdn> <member1 hostname>
<member1 IP> <member1 fqdn> <member1 hostname> # DO NOT modify this line. Generated by Atrium Discovery.
#### all other cluster members below
<member2 IP> <member2 fqdn> <member2 hostname>
<member3 IP> <member3 fqdn> <member3 hostname>
After you edit the /etc/hosts files to include the required information for each member, you should be able to start the cluster successfully.
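Before starting the cluster, you can sanity-check that every expected member name appears in the hosts file. The member names below are placeholders, and the sketch writes a sample file instead of reading the real /etc/hosts so it is safe to experiment with.

```shell
# Sketch: confirm every expected member name appears in a hosts file.
# The names and the sample file are placeholders; point hosts_file at
# /etc/hosts on a real appliance.
hosts_file=$(mktemp)
printf '10.0.0.2 member2.example.com member2\n10.0.0.3 member3.example.com member3\n' > "$hosts_file"

for name in member2.example.com member3.example.com; do
  if grep -qw "$name" "$hosts_file"; then
    echo "$name: found"
  else
    echo "$name: MISSING"
  fi
done
```

Run the check on each member; any name reported MISSING must be added to that member's /etc/hosts before the cluster will start.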
If you are still unable to resolve your issue after trying the listed solutions, create a BMC Support case. Provide details of the issue and the exact steps you attempted to resolve it.