Troubleshooting problems with Infrastructure Management high availability

Consult this topic for troubleshooting information related to a high-availability deployment of Infrastructure Management.

Cell database out of sync

In an application high availability deployment of TrueSight Infrastructure Management, the cell database (MCDB) might go out of sync in the following scenarios:

  • When the nodes recover from a split-brain condition
  • When the secondary server is down for a long time and the Infrastructure Management processes on the primary server have been restarted multiple times

Resolution

  1. Stop both the primary and secondary servers.
  2. Delete all content in the installedDirectory\TrueSight\pw\server\var\<secondary_cell_name> folder of the secondary server.
  3. Copy the mcdb and .xact files from the installedDirectory\TrueSight\pw\server\var\<primary_cell_name> folder on the primary server to the secondary server folder that you emptied in step 2.
  4. Start the primary server.
  5. Ensure that the primary server processes are fully up and then start the secondary server.
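Steps 2 and 3 of the resolution can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name resync_secondary_cell is hypothetical, both cell folders are assumed to be reachable as local paths (for example through a mounted share), and both servers are assumed to be stopped already. It is not a BMC-supplied utility.

```python
import shutil
from pathlib import Path


def resync_secondary_cell(primary_cell_dir, secondary_cell_dir):
    """Empty the secondary cell folder, then copy mcdb and .xact files into it.

    Hypothetical helper: assumes both servers are stopped (step 1) and that
    both folders are accessible as local paths.
    Returns the names of the files that were copied.
    """
    primary_cell_dir = Path(primary_cell_dir)
    secondary_cell_dir = Path(secondary_cell_dir)

    # Step 2: delete all content in the secondary cell folder.
    for entry in secondary_cell_dir.iterdir():
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()

    # Step 3: copy the mcdb file and any file whose name ends in .xact.
    copied = []
    for entry in primary_cell_dir.iterdir():
        if entry.is_file() and (entry.name == "mcdb" or entry.name.endswith(".xact")):
            shutil.copy2(entry, secondary_cell_dir / entry.name)
            copied.append(entry.name)
    return sorted(copied)
```

After the copy completes, continue with steps 4 and 5: start the primary server, wait until its processes are fully up, and then start the secondary server.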

To reconfigure the secondary server

If you need to deploy and use a different secondary server instead of the original one, perform the following steps:

  1. Install Infrastructure Management on the new secondary server.
  2. Stop the primary server and the new secondary server.
  3. On the primary server, edit the following files and replace the original secondary server host name with the new secondary server host name:
    • installedDirectory\pw\pronto\tmp\ha-generated.conf
    • installedDirectory\pw\server\etc\mcell.dir
    • installedDirectory\pw\pronto\data\admin\admin.dir
    • installedDirectory\pw\pronto\conf\federatedcacheserver-transport-tcp.xml
    • installedDirectory\pw\pronto\conf\clustermanager.xml
    • installedDirectory\pw\pronto\conf\cell_info.list
    • installedDirectory\pw\integrations\ibrsd\conf\IBRSD.dir
    • installedDirectory\pw\custom\conf\ha.conf

  4. Copy the updated installedDirectory/TrueSight/pw/pronto/tmp/ha-generated.conf file from the primary server to the new secondary server.
  5. On the new secondary server, run the following command:
    pw ha enable standby file=<location of ha-generated.conf file>
  6. Copy the following folder from the primary server to the new secondary server:
    installedDirectory/TrueSight/pw/server/var/pncell_<hostname>#1
  7. Rename the copied folder on the new secondary server to pncell_<hostname>#2
  8. Copy the installedDirectory/TrueSight/pw/pronto/tmp/addremoteagent file from the primary server to the new secondary server.
  9. Start the primary server.
  10. On the primary server, run the addremoteagent file.
  11. After the primary server is up and running, start the secondary server.
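The host name replacement in step 3 touches eight files, so it can help to script it. The following Python sketch is one possible approach, not a BMC tool: the function name replace_secondary_host and the .bak backup suffix are assumptions made for illustration. It performs a plain substring replacement, so review the changed files if the original host name could appear as part of another name.

```python
import shutil
from pathlib import Path

# Files listed in step 3, relative to installedDirectory.
HA_CONFIG_FILES = [
    "pw/pronto/tmp/ha-generated.conf",
    "pw/server/etc/mcell.dir",
    "pw/pronto/data/admin/admin.dir",
    "pw/pronto/conf/federatedcacheserver-transport-tcp.xml",
    "pw/pronto/conf/clustermanager.xml",
    "pw/pronto/conf/cell_info.list",
    "pw/integrations/ibrsd/conf/IBRSD.dir",
    "pw/custom/conf/ha.conf",
]


def replace_secondary_host(install_dir, old_host, new_host):
    """Replace old_host with new_host in each listed file that contains it.

    Hypothetical helper: works on raw bytes so that line endings and
    encoding are left untouched, and keeps a .bak copy of every file it
    changes. Returns the relative paths of the changed files.
    """
    old_bytes = old_host.encode("utf-8")
    new_bytes = new_host.encode("utf-8")
    changed = []
    for rel in HA_CONFIG_FILES:
        path = Path(install_dir) / rel
        if not path.is_file():
            continue  # file not present on this installation
        data = path.read_bytes()
        if old_bytes not in data:
            continue  # nothing to replace in this file
        shutil.copy2(path, path.with_name(path.name + ".bak"))
        path.write_bytes(data.replace(old_bytes, new_bytes))
        changed.append(rel)
    return changed
```

Run a helper like this only while the primary server is stopped (step 2), and compare its return value against the list in step 3 to confirm that every expected file was updated.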

The HA nodes experience a split-brain condition

A split-brain condition occurs when the HA nodes can no longer communicate with each other. Each node assumes that the other node is non-functional and tries to take over the active node role.

The following Infrastructure Management server properties help to prevent and recover from a split-brain condition.

To set the properties, edit the installedDirectory/pw/custom/conf/pronet.conf file on the secondary server only.

Split-brain condition prevention

pronet.ha.split.brain.prevention.support
  • Property details: Controls the split-brain condition prevention support. If set to false, the Active-Active node prevention operation does not occur.
  • Default value: true
  • Valid values: true, false

pronet.ha.split.brain.prevention.max.retry.count
  • Property details: The number of retries to check for the remote node status. If the remote node status is not identified as active within the number of retries, the current node becomes active. If the remote node status is identified as active, the current node continues as the standby node.
  • Default value: 5
  • Valid values: Any positive integer

pronet.ha.split.brain.prevention.scan.frequency.in.secs
  • Property details: The frequency (in seconds) at which the monitoring service checks the remote node status.
  • Default value: 30
  • Valid values: Any positive integer
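As a rough worked example, the retry count and scan frequency together bound how long a node waits before it becomes active. The Python sketch below assumes that pronet.conf holds one key=value pair per line and that the monitoring service makes one status check per scan interval; neither assumption is confirmed by this topic, and the function names are hypothetical. Under those assumptions, the default values give approximately 5 × 30 = 150 seconds.

```python
# Defaults taken from the split-brain prevention properties above.
PREVENTION_DEFAULTS = {
    "pronet.ha.split.brain.prevention.support": "true",
    "pronet.ha.split.brain.prevention.max.retry.count": "5",
    "pronet.ha.split.brain.prevention.scan.frequency.in.secs": "30",
}


def parse_properties(text):
    """Parse key=value lines; blank lines and lines starting with # are skipped.

    The key=value layout is an assumption about the pronet.conf format.
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def approx_takeover_delay_secs(conf_text=""):
    """Estimate the wait before a node becomes active: retries x frequency.

    Assumes one remote-status check per scan interval (not stated in the
    source), so treat the result as an approximation.
    """
    props = {**PREVENTION_DEFAULTS, **parse_properties(conf_text)}
    retries = int(props["pronet.ha.split.brain.prevention.max.retry.count"])
    frequency = int(props["pronet.ha.split.brain.prevention.scan.frequency.in.secs"])
    return retries * frequency
```

Calling approx_takeover_delay_secs() with no overrides returns 150. Lowering either value shortens the wait, at the cost of giving the remote node less time to report that it is active.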

Split-brain condition recovery

pronet.ha.split.brain.recovery.support
  • Property details: Controls the split-brain condition recovery support. If set to false, the Active-Active node recovery operation does not occur and the HA deployment stays in the split-brain condition.
  • Default value: true
  • Valid values: true, false

pronet.ha.split.brain.recovery.notification.support
  • Property details: Controls notifications (Self-health monitoring event and email) when a split-brain condition is identified.
  • Default value: true
  • Valid values: true, false

pronet.ha.split.brain.recovery.scan.frequency.in.secs
  • Property details: The frequency (in seconds) at which the monitoring service checks for a split-brain condition.
  • Default value: 30
  • Valid values: Any positive integer
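Before saving changes to pronet.conf, the values can be checked against the valid values documented for the prevention and recovery properties. The following Python sketch shows one way to do that. The function name validate_split_brain_props is hypothetical, and the check covers only the six split-brain properties named in this topic; it takes an already parsed mapping of property names to string values.

```python
# Properties documented in this topic with valid values "true" or "false".
BOOLEAN_PROPS = {
    "pronet.ha.split.brain.prevention.support",
    "pronet.ha.split.brain.recovery.support",
    "pronet.ha.split.brain.recovery.notification.support",
}

# Properties documented in this topic with valid values "any positive integer".
POSITIVE_INT_PROPS = {
    "pronet.ha.split.brain.prevention.max.retry.count",
    "pronet.ha.split.brain.prevention.scan.frequency.in.secs",
    "pronet.ha.split.brain.recovery.scan.frequency.in.secs",
}


def validate_split_brain_props(props):
    """Return a sorted list of error strings for invalid split-brain values.

    props maps property names to string values. Properties that are not
    listed in this topic are ignored.
    """
    errors = []
    for name, value in props.items():
        if name in BOOLEAN_PROPS:
            if value not in ("true", "false"):
                errors.append(f"{name}: expected true or false, got {value!r}")
        elif name in POSITIVE_INT_PROPS:
            is_positive_int = value.isascii() and value.isdigit() and int(value) > 0
            if not is_positive_int:
                errors.append(f"{name}: expected a positive integer, got {value!r}")
    return sorted(errors)
```

An empty list means every split-brain property in the mapping has a documented valid value.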

Errors and warnings are displayed when failback occurs

When a failback occurs, errors and warnings related to ActiveMQ are displayed in the TrueSight.log file.

As a workaround, perform the following steps:

  1. Back up the following files:
    installedDirectory\wildfly\standalone\deployments\activemq-rar.rar.deployed
    installedDirectory\wildfly\standalone\deployments\activemq-rar.rar

  2. Depending on the node status (active or standby), copy the matching file from one of the following locations to the installedDirectory\wildfly\standalone\deployments folder:
    installedDirectory\Active\activemq-rar.rar
    installedDirectory\StandBy\activemq-rar.rar

  3. Restart the Infrastructure Management server.
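Steps 1 and 2 of the workaround can be sketched in Python as shown below. The function name apply_activemq_workaround and the backup_dir parameter are hypothetical, and the sketch assumes that activemq-rar.rar and activemq-rar.rar.deployed are regular files, as the steps describe them. Restarting the server (step 3) is left as a manual action.

```python
import shutil
from pathlib import Path

# Folder under installedDirectory that holds the file for each node status.
SOURCE_FOLDER_BY_STATUS = {"active": "Active", "standby": "StandBy"}


def apply_activemq_workaround(install_dir, node_status, backup_dir):
    """Back up the deployed ActiveMQ files, then copy in the matching one.

    Hypothetical helper: node_status must be "active" or "standby".
    Returns the names of the files that were backed up.
    """
    if node_status not in SOURCE_FOLDER_BY_STATUS:
        raise ValueError("node_status must be 'active' or 'standby'")

    install_dir = Path(install_dir)
    backup_dir = Path(backup_dir)
    deployments = install_dir / "wildfly" / "standalone" / "deployments"

    # Step 1: back up the two files if they exist.
    backup_dir.mkdir(parents=True, exist_ok=True)
    backed_up = []
    for name in ("activemq-rar.rar.deployed", "activemq-rar.rar"):
        source = deployments / name
        if source.is_file():
            shutil.copy2(source, backup_dir / name)
            backed_up.append(name)

    # Step 2: copy the file that matches the node status into deployments.
    replacement = install_dir / SOURCE_FOLDER_BY_STATUS[node_status] / "activemq-rar.rar"
    shutil.copy2(replacement, deployments / "activemq-rar.rar")
    return backed_up
```

Keeping the backup outside the deployments folder is a design choice of this sketch, made so that the backup copies cannot be mistaken for deployable content.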

Agent controller error after a restart

After you restart the Infrastructure Management server, the following error message is displayed in the TrueSight.log file:

ERROR status_change_msg_rt [ProcessStatusChangeMessageHandler_Rate] Proactive check for FedCacheRuntime status failed. [AgentController] Missing Resource String
java.rmi.NotBoundException: AgentController

You can ignore this error message if it occurs immediately after a restart.

If the error message still occurs more than an hour after the restart, restart the Infrastructure Management server again. If the restart does not resolve the issue, contact BMC Customer Support.
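The guidance above amounts to a simple time-based rule, sketched here in Python. The function name and the returned labels are hypothetical. The timestamps are assumed to have been read from the TrueSight.log file by other means, because this topic does not describe the log timestamp format.

```python
from datetime import datetime, timedelta


def agent_controller_action(restart_time, error_times, grace=timedelta(hours=1)):
    """Decide how to treat AgentController NotBoundException occurrences.

    restart_time: when the Infrastructure Management server was restarted.
    error_times: datetimes at which the error appeared in the log.
    Returns "none" when there are no errors, "ignore" when all errors fall
    within the grace period, and "restart-again" when any error is later.
    """
    if not error_times:
        return "none"
    if any(when - restart_time > grace for when in error_times):
        # Still failing after an hour: restart again and, if that does
        # not help, contact BMC Customer Support.
        return "restart-again"
    return "ignore"
```

For example, an error logged five minutes after the restart yields "ignore", whereas an error logged ninety minutes after the restart yields "restart-again".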

Related topics

Infrastructure Management Server high-availability architecture

Installing

