Troubleshooting an Infrastructure Management high-availability deployment
Consult this topic for troubleshooting information related to a high-availability deployment of Infrastructure Management.
Frequent fail over and fail back
In an application high-availability deployment of TrueSight Infrastructure Management, the servers might fail over and fail back frequently when the cell goes down on either the primary or the secondary server.
A likely cause is that the cell database (mcdb) on the two servers is out of sync.
Ensure that the secondary server is down and perform the following steps:
- Delete all content in the installedDirectory\TrueSight\pw\server\var\<secondary_cell_name> folder of the secondary server.
- Copy the installedDirectory\TrueSight\pw\server\var\<primary_cell_name>\mcdb file of the primary server to the above location on the secondary server.
- Restart the secondary server.
- Ensure that the primary server is running in active mode and the secondary server is running in standby mode.
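The two file operations in the steps above can be sketched in Python. This is a minimal illustration, not a supported tool. It assumes that both cell folders are reachable from the machine running the script (for example, through a mounted share) and that the secondary server is already stopped. The function name resync_secondary_mcdb and its parameters are hypothetical.

```python
import shutil
from pathlib import Path


def resync_secondary_mcdb(primary_cell_dir, secondary_cell_dir):
    """Empty the secondary cell folder, then copy the primary mcdb file into it."""
    primary_mcdb = Path(primary_cell_dir) / "mcdb"
    secondary = Path(secondary_cell_dir)

    # Fail early, before deleting anything, if either path is wrong.
    if not primary_mcdb.is_file():
        raise FileNotFoundError(f"primary mcdb not found: {primary_mcdb}")
    if not secondary.is_dir():
        raise NotADirectoryError(f"secondary cell folder not found: {secondary}")

    # Delete all content in the secondary cell folder.
    for entry in secondary.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()

    # Copy the primary mcdb file to the secondary cell folder.
    return Path(shutil.copy2(primary_mcdb, secondary / "mcdb"))
```

Pass the ...\var\<primary_cell_name> and ...\var\<secondary_cell_name> folders as the two arguments. Restarting the secondary server and checking the active and standby modes remain manual steps.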
To reconfigure the secondary server
If you need to deploy and use a different secondary server instead of the original one, perform the following steps:
- Install Infrastructure Management on the new secondary server.
- Stop the primary server and the new secondary server.
- On the primary server, edit the following files and replace the original secondary server host name with the new secondary server host name:
- Copy the updated installedDirectory/TrueSight/pw/pronto/tmp/ha-generated.conf file from the primary server to the new secondary server.
- On the new secondary server, run the following command:
pw ha enable standby file=<location of ha-generated.conf file>
- Copy the following folder from the primary server to the new secondary server:
- Rename the copied folder on the new secondary server to pncell_<hostname>#2
- Copy the installedDirectory/TrueSight/pw/pronto/tmp/addremoteagent file from the primary server to the new secondary server.
- Start the primary server.
- On the primary server, run the addremoteagent file.
- After the primary server is up and running, start the new secondary server.
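The host name replacement and the folder rename in the steps above can be sketched in Python. This is an illustration under stated assumptions: the list of files to edit is not given here, so the caller must pass the file paths explicitly, and the files are assumed to be UTF-8 text. The function names replace_host_name and standby_cell_folder_name are hypothetical.

```python
import re
import shutil
from pathlib import Path


def replace_host_name(config_files, old_host, new_host):
    """Replace old_host with new_host in each file; return {path: replacement count}."""
    # Match the host name only when it is not part of a longer name
    # (for example, "host1" must not match inside "host10").
    pattern = re.compile(r"(?<![\w-])" + re.escape(old_host) + r"(?![\w-])")
    counts = {}
    for name in config_files:
        path = Path(name)
        text = path.read_text(encoding="utf-8")
        new_text, count = pattern.subn(lambda m: new_host, text)
        if count:
            # Keep a backup copy next to the original before overwriting it.
            shutil.copy2(path, path.with_name(path.name + ".bak"))
            path.write_text(new_text, encoding="utf-8")
        counts[str(path)] = count
    return counts


def standby_cell_folder_name(hostname):
    """Build the folder name used in the rename step: pncell_<hostname>#2."""
    return f"pncell_{hostname}#2"
```

A file that reports a count of 0 did not contain the original host name and is left untouched, which is a quick way to spot a wrong path or a misspelled host name before restarting the servers.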
The HA nodes experience a split-brain condition
A split-brain condition occurs when the HA nodes can no longer communicate with each other. Each node assumes that the other node is non-functional and tries to take over the active node role.
The following Infrastructure Management server properties help to prevent and recover from a split-brain condition.
To set the properties, edit the installationDirectory/pw/custom/conf/pronet.conf file on the secondary server only and make the required changes.
Split-brain condition prevention
Controls the split-brain condition prevention support. If set to
The number of retries to check for the remote node status. If the remote node status is not identified as active within the number of retries, the current node becomes active. If the remote node status is identified as active, the current node continues as the standby node.
The value must be less than 20.
Any positive integer
The frequency (in seconds) at which the monitoring service checks the remote node status.
Any positive integer
Split-brain condition recovery
Controls the split-brain condition recovery support. If set to
Controls notifications (Self-health monitoring event and email) when a split-brain condition is identified.
The frequency (in seconds) at which the monitoring service checks for a split-brain condition.
Any positive integer
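The prevention behavior described above (check the remote node status a limited number of times, then either stay on standby or become active) can be sketched in Python. This only illustrates the decision logic and is not the product's implementation. The function name decide_standby_role, its parameters, and the remote_is_active callback are hypothetical and do not correspond to actual property names.

```python
import time


def decide_standby_role(remote_is_active, retries, interval_seconds, sleep=time.sleep):
    """Return "standby" if the remote node is seen active within `retries` checks, else "active"."""
    if not 1 <= retries < 20:
        raise ValueError("retries must be a positive integer less than 20")
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")

    for attempt in range(1, retries + 1):
        if remote_is_active():
            return "standby"  # remote node is active: continue as the standby node
        if attempt < retries:
            sleep(interval_seconds)  # wait before the next status check
    return "active"  # remote node was never identified as active: take over
```

With this logic, the longest time the standby node waits before taking over is roughly (retries - 1) multiplied by the check interval, plus the time spent in the status checks themselves.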
To enable TLS mode
The switchToTLS.pl script does not work as expected when Infrastructure Management is configured for application high-availability. To resolve this issue, perform the following steps:
- Stop the Infrastructure Management server.
- Create a backup of the activemq-rar.rar file from the location installedDirectory\TrueSight\pw\wildfly\standalone\deployments.
- Copy the ssl.activemq-rar.rar file from installedDirectory\TrueSight\pw\wildfly\store to the following locations and rename the file to activemq-rar.rar:
- Extract the amq-broker-config.xml file in each folder and ensure that the following content exists. If not, update the file:
<amq:transportConnector name="bppm-jms-broker-tcp-conn" uri="ssl://0.0.0.0:8093?transport.enabledProtocols=TLSv1.2&amp;transport.enabledCipherSuites=TLS_DHE_RSA_WITH_AES_128_GCM_SHA256,TLS_DHE_RSA_WITH_AES_128_CBC_SHA&amp;allowLinkStealing=true" allowLinkStealing="true" />
- Save the amq-broker-config.xml file if you made any changes to it and add it to the activemq-rar.rar file.
- Restart the Infrastructure Management server.
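Before you add the amq-broker-config.xml file back to the archive, you can check the transport connector with a short Python script. This is a minimal sketch that assumes the extracted file is well-formed XML and that its text has been read into a string. The function name check_tls_connectors is hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlsplit


def check_tls_connectors(xml_text, protocol="TLSv1.2"):
    """Return {connector name: True/False} for every transportConnector element."""
    results = {}
    for elem in ET.fromstring(xml_text).iter():
        # Compare local names so the check works whatever namespace prefix is used.
        if elem.tag.rsplit("}", 1)[-1] != "transportConnector":
            continue
        uri = urlsplit(elem.get("uri", ""))
        query = parse_qs(uri.query)
        protocols = query.get("transport.enabledProtocols", [""])[0].split(",")
        # A connector passes if it uses the ssl scheme and enables the protocol.
        results[elem.get("name", "")] = uri.scheme == "ssl" and protocol in protocols
    return results
```

A result of False for bppm-jms-broker-tcp-conn means the connector entry still needs to be updated as described in the steps above. The sketch does not check the cipher suites or the allowLinkStealing setting.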
The Rate process does not start on any of the HA nodes
Because of cache replication issues, the Rate process might not start on any of the HA nodes. A null-pointer exception is also logged in the TrueSight.log file.
As a workaround, perform the following steps:
- Stop the standby node first and then stop the active node.
- Back up the federatedcacheserver-transport-tcp.xml and clustermanager.xml files in the installedDirectory\pw\pronto\conf folder on both nodes.
- Edit the installedDirectory\pw\pronto\conf\federatedcacheserver-transport-tcp.xml file and update the timeout value of the FD and VERIFY_SUSPECT elements to 30000:
<FD max_tries="9" timeout="30000"/> <VERIFY_SUSPECT timeout="30000"/>
- Edit the installedDirectory\pw\pronto\conf\clustermanager.xml file and update the timeout value of the FD element to 30000:
<FD max_tries="9" timeout="30000"/> <VERIFY_SUSPECT timeout="30000" num_msgs="5"/>
- Ensure that you perform steps 3 and 4 on both nodes.
- Start the previously active node first.
- After the node is up and running and you are able to access the operator console, restart the other node.
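The timeout changes in the two XML files can be sketched in Python. This is a hedged illustration that works on the file text with a regular expression, so that comments and formatting in the file are preserved. It assumes each element keeps its timeout attribute inside a single tag. The function name set_timeout is hypothetical, and the backup step described above is still required.

```python
import re


def set_timeout(xml_text, element_names=("FD", "VERIFY_SUSPECT"), timeout="30000"):
    """Set the timeout attribute of the named elements; return (new_text, change count)."""
    total = 0
    for name in element_names:
        # \b keeps "FD" from matching longer element names such as FD_SOCK.
        pattern = re.compile(r'(<%s\b[^>]*?\btimeout\s*=\s*")[^"]*(")' % re.escape(name))
        xml_text, count = pattern.subn(
            lambda m: m.group(1) + timeout + m.group(2), xml_text
        )
        total += count
    return xml_text, total
```

Only the timeout attribute is changed. Other attributes, such as max_tries and num_msgs, are left as they are and must be compared against the values shown in the steps above. A change count of 0 means the expected elements were not found in the file.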