Troubleshooting an Infrastructure Management high-availability deployment


Consult this topic for troubleshooting information related to a high-availability deployment of Infrastructure Management.

Cell database out of sync

In an application high-availability deployment of TrueSight Infrastructure Management, the cell database (MCDB) might go out of sync in the following scenarios:

  • When the nodes recover from a split-brain condition
  • When the secondary server is down for a long time and the Infrastructure Management processes on the primary server have been restarted multiple times

Resolution

  1. Stop both the primary and secondary servers.
  2. Delete all content in the installedDirectory\pw\server\var\<secondary_cell_name> folder on the secondary server.
  3. Copy all content from the installedDirectory\pw\server\var\<primary_cell_name> folder on the primary server to the same location on the secondary server (a command-line sketch follows these steps).
  4. Start the primary server.
  5. Ensure that the primary server processes are fully up and then start the secondary server.
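
The following is a minimal Windows command-line sketch of this resolution. The primary host name, share path, and cell folder names (pncell_host1#1 and pncell_host2#2) are placeholders for your environment:

    rem Run on both servers to stop the Infrastructure Management processes
    pw system stop

    rem On the secondary server: clear the secondary cell folder
    rd /s /q "installedDirectory\pw\server\var\pncell_host2#2"
    md "installedDirectory\pw\server\var\pncell_host2#2"

    rem Copy the primary cell data from the primary server (for example, over an administrative share)
    robocopy "\\primaryHost\c$\TrueSight\pw\server\var\pncell_host1#1" "installedDirectory\pw\server\var\pncell_host2#2" /E

    rem Start the primary server first; start the secondary server only after the primary is fully up
    pw system start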

To reconfigure the secondary server

If you need to deploy and use a different secondary server instead of the original one, perform the following steps:

  1. Install Infrastructure Management on the new secondary server.
  2. Stop the primary server and the new secondary server.
  3. On the primary server, edit the following files and replace the original secondary server host name with the new secondary server host name:
    • installedDirectory\pw\pronto\tmp\ha-generated.conf
    • installedDirectory\pw\server\etc\mcell.dir
    • installedDirectory\pw\pronto\data\admin\admin.dir
    • installedDirectory\pw\pronto\conf\federatedcacheserver-transport-tcp.xml
    • installedDirectory\pw\pronto\conf\clustermanager.xml
    • installedDirectory\pw\pronto\conf\cell_info.list
    • installedDirectory\pw\integrations\ibrsd\conf\IBRSD.dir
    • installedDirectory\pw\custom\conf\ha.conf

  4. Copy the updated installedDirectory\pw\pronto\tmp\ha-generated.conf file from the primary server to the new secondary server.
  5. On the new secondary server, run the following command (see the example after this list):
    pw ha enable standby file=<location of ha-generated.conf file>
  6. Copy the following folder from the primary server to the new secondary server:
    installedDirectory\pw\server\var\pncell_<hostname>#1
  7. Rename the copied folder on the new secondary server to pncell_<hostname>#2.
  8. Copy the installedDirectory\pw\pronto\tmp\addremoteagent file from the primary server to the new secondary server.
  9. Start the primary server.
  10. On the primary server, run the addremoteagent file.
  11. After the primary server is up and running, start the secondary server.
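
For illustration, a sketch of steps 3 and 5; oldHost is a placeholder for the original secondary server's host name:

    rem On the primary server: confirm that no edited file still references the old secondary host
    findstr /i "oldHost" "installedDirectory\pw\server\etc\mcell.dir" "installedDirectory\pw\pronto\conf\cell_info.list"

    rem On the new secondary server: enable standby mode using the copied configuration file
    pw ha enable standby file=installedDirectory\pw\pronto\tmp\ha-generated.conf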

TrueSight Infrastructure Management performs a failover as the primary cell stops responding

In a TrueSight Infrastructure Management high-availability deployment, if the primary cell fails or stops responding, the secondary cell becomes active. After a while, the Infrastructure Management server performs the failover. This may cause the cell database (MCDB) to go out of sync.

You can disable automatic failover for the cell by setting an mcell configuration parameter. If you set the CellDuplicateAutoFailOver parameter to NO and the primary cell fails, automatic cell failover does not happen; instead, the server waits for the agent controller process to perform the failover.

To disable the automatic cell failover, do the following on both primary and secondary Infrastructure Management servers:

  1. Go to the <Infrastructure Management Installation Directory>\pw\server\etc directory.
  2. Set the following configuration parameter to NO:
    CellDuplicateAutoFailOver=NO
  3. Restart the cell process.
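
A minimal sketch of the resulting configuration line, assuming the parameter lives in the cell's mcell.conf file (the standard file for mcell configuration parameters):

    # In the cell's mcell.conf file (location assumed; under <Infrastructure Management Installation Directory>\pw\server\etc)
    CellDuplicateAutoFailOver=NO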

The HA nodes experience a split-brain condition

A split-brain condition occurs when the HA nodes can no longer communicate with each other. Each node assumes that the other node is non-functional and tries to take over the active node role.

The following Infrastructure Management server properties help you prevent and recover from a split-brain condition.

To set the properties, edit the installedDirectory\pw\custom\conf\pronet.conf file on the secondary server only and make the required changes (an example fragment follows the property lists):

Split-brain condition prevention

  • pronet.ha.split.brain.prevention.support
    Details: Controls split-brain condition prevention support. If set to false, the Active-Active node prevention operation does not occur.
    Default value: true
    Valid values: true, false

  • pronet.ha.split.brain.prevention.max.retry.count
    Details: The number of retries to check for the remote node status. If the remote node status is not identified as active within this number of retries, the current node becomes active. If the remote node status is identified as active, the current node continues as the standby node. The value must be less than 20.
    Default value: 6
    Valid values: Any positive integer less than 20

  • pronet.ha.split.brain.prevention.scan.frequency.in.secs
    Details: The frequency (in seconds) at which the monitoring service checks the remote node status.
    Default value: 30
    Valid values: Any positive integer

Split-brain condition recovery

  • pronet.ha.split.brain.recovery.support
    Details: Controls split-brain condition recovery support. If set to false, the Active-Active node recovery operation does not occur and the HA deployment stays in the split-brain condition.
    Default value: true
    Valid values: true, false

  • pronet.ha.split.brain.recovery.notification.support
    Details: Controls notifications (self-health monitoring event and email) when a split-brain condition is identified.
    Default value: true
    Valid values: true, false

  • pronet.ha.split.brain.recovery.scan.frequency.in.secs
    Details: The frequency (in seconds) at which the monitoring service checks for a split-brain condition.
    Default value: 30
    Valid values: Any positive integer
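
For example, the following pronet.conf fragment keeps prevention and recovery enabled but doubles the retry count and both scan intervals; the values shown are illustrative, not recommendations:

    # Split-brain prevention
    pronet.ha.split.brain.prevention.support=true
    pronet.ha.split.brain.prevention.max.retry.count=12
    pronet.ha.split.brain.prevention.scan.frequency.in.secs=60

    # Split-brain recovery
    pronet.ha.split.brain.recovery.support=true
    pronet.ha.split.brain.recovery.notification.support=true
    pronet.ha.split.brain.recovery.scan.frequency.in.secs=60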

Errors and warnings are displayed when failback occurs

When a failback occurs, errors and warnings related to ActiveMQ are displayed in the TrueSight.log file.

As a workaround, perform the following steps:

  1. Back up the following files:
    installedDirectory\wildfly\standalone\deployments\activemq-rar.rar.deployed
    installedDirectory\wildfly\standalone\deployments\activemq-rar.rar

  2. Depending on the node status (active or standby), copy the appropriate file to the installedDirectory\wildfly\standalone\deployments folder (see the sketch after these steps):
    installedDirectory\Active\activemq-rar.rar
    installedDirectory\StandBy\activemq-rar.rar

  3. Restart the Infrastructure Management server.
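
A minimal Windows sketch of step 2 for an active node; on a standby node, copy the StandBy file instead:

    rem On the active node: replace the deployed resource adapter with the Active copy
    copy /Y "installedDirectory\Active\activemq-rar.rar" "installedDirectory\wildfly\standalone\deployments\activemq-rar.rar"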

The Rate process does not start on any of the HA nodes

Due to cache replication issues, the Rate process may not start on any of the HA nodes. A null-pointer exception is also displayed in the installedDirectory\pw\pronto\logs\TrueSight.log file.

As a workaround, perform the following steps:

  1. Stop the standby node first and then stop the active node.
  2. Back up the federatedcacheserver-transport-tcp.xml and clustermanager.xml files in the installedDirectory\pw\pronto\conf folder on both nodes.
  3. Edit the installedDirectory\pw\pronto\conf\federatedcacheserver-transport-tcp.xml file, update the timeout value of the FD and VERIFY_SUSPECT elements to 30000, and set the max_tries value of the FD element to 9:

    <FD max_tries="9" timeout="30000"/>
    <VERIFY_SUSPECT timeout="30000"/>


  4. Edit the installedDirectory\pw\pronto\conf\clustermanager.xml file, update the timeout value of the FD and VERIFY_SUSPECT elements to 30000, set the FD max_tries value to 9, and set the VERIFY_SUSPECT num_msgs value to 5:

    <FD max_tries="9" timeout="30000"/>
    <VERIFY_SUSPECT timeout="30000" num_msgs="5"/>


  5. Ensure that you perform steps 3 and 4 on both nodes.
  6. Start the previously active node first.
  7. After the node is up and running and you can access the operator console, restart the other node (see the check below).
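
A quick way to confirm that all server processes are running before you restart the other node, assuming the standard pw CLI is available:

    rem Verify that all Infrastructure Management processes report as running
    pw system status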

The Rate process crashes

The Rate process crashes during a failback. Check whether the installedDirectory\pw\pronto\logs\Rate.log file contains the following error message:

Caused by: org.infinispan.client.hotrod.exceptions.HotRodClientException:Request for messageId=<message ID #> returned server error (status=0x86): org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request <request #> 

If the message is present, perform the following steps:

  1. Edit the installedDirectory\pw\pronto\conf\federatedcacheserver.xml file and update the acquire-timeout value to 90000 (in Infinispan configurations, this timeout is specified in milliseconds, so 90000 corresponds to 90 seconds). The value appears as an attribute, for example on the locking element:

    <locking acquire-timeout="90000"/>
  2. Restart the Infrastructure Management server processes.

Failback fails because the server did not initialize completely

Failback fails and the JServer and Rate processes are not running. The TrueSight.log file contains the following:

Naming Exception while getting Topic Session.
javax.naming.NameNotFoundException: jboss/exported/ConnectionFactory – service jboss.naming.context.java.jboss.exported.jboss.exported.ConnectionFactory
at org.jboss.as.naming.ServiceBasedNamingStore.lookup(ServiceBasedNamingStore.java:106)
at org.jboss.as.naming.NamingContext.lookup(NamingContext.java:207)
at org.jboss.as.naming.NamingContext.lookup(NamingContext.java:184)
at org.jboss.naming.remote.protocol.v1.Protocol$1.handleServerMessage(Protocol.java:127)
at org.jboss.naming.remote.protocol.v1.RemoteNamingServerV1$MessageReciever$1.run(RemoteNamingServerV1.java:73)

As a workaround, perform the following steps:

  1. Update the hosts file to make sure that localhost resolves to an IP address.
  2. If step 1 does not fix the issue, edit the installedDirectory\pw\pronto\conf\pronet.conf and installedDirectory\pw\custom\conf\pronet.conf files and check the value of the java.naming.provider.url property in each file. If the value contains localhost, change it to 127.0.0.1 (see the example after these steps).
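
An illustrative before-and-after for the property; the protocol and port shown here are placeholders that vary by installation:

    # Before (illustrative value)
    java.naming.provider.url=remote://localhost:4447

    # After
    java.naming.provider.url=remote://127.0.0.1:4447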

mcell.dir is blank after a failover when the Infrastructure Management disk is full

Note

This problem is observed only when the Presentation Server is in HA mode.

Perform the following steps to recover the mcell.dir file from the backup folder:

  1. Navigate to pw/pronto/tmp and open the mcell_dir_org_backup file.
  2. Navigate to pw/server/etc and copy the information from the mcell_dir_org_backup file to mcell.dir.
  3. Update the current active node of the Presentation Server in the mcell.dir file.
  4. Stop and start the Infrastructure Management server (see the sketch after these steps).
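
A minimal Windows sketch of the recovery, run from the Infrastructure Management installation directory; the pw commands assume the standard CLI:

    rem Restore mcell.dir from the backup copy
    copy /Y "pw\pronto\tmp\mcell_dir_org_backup" "pw\server\etc\mcell.dir"

    rem After updating the active Presentation Server node in mcell.dir, restart the server
    pw system stop
    pw system start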


Automatic restart of Infrastructure Management Server processes (Supported only with 11.3.02 and later) 

When a process is unavailable or fails on any node in a high-availability deployment, by default, all the server processes are shut down.

Set the pronet.ha.availability.self.shutdown.mechanism property in the <Infrastructure Management Install directory>\pw\custom\conf\pronet.conf file to automatically restart all the processes on the failed node.

Examples

# To automatically restart the processes on a failed node in a high-availability deployment of Infrastructure Management Server
pronet.ha.availability.self.shutdown.mechanism=restart

Notes

  • It is recommended to set this configuration property only on the secondary server.
  • If you have set the property to restart all the processes and you later want the processes to shut down instead, delete the following entry from the <Infrastructure Management Install directory>\pw\custom\conf\pronet.conf file.

    pronet.ha.availability.self.shutdown.mechanism=restart
  • After you set or delete the pronet.ha.availability.self.shutdown.mechanism property in the <Infrastructure Management Install directory>\pw\custom\conf\pronet.conf file, restart the Infrastructure Management Server by running the following command:

    pw system restart force

The changes take effect only after you restart the Infrastructure Management Server.


Rate process is unable to obtain monitor object data from the cache (applicable to TrueSight Infrastructure Management version 11.3.02 and later) 

During a failover, the Rate process is unable to fetch monitor object data from the cache and crashes. The following error messages are displayed:

INFO  03/13 16:28:31 Rate                 [RateInit-1] 600002 Failed to execute initialization task. [ mPlatformClass=class com.proactivenet.api.sla.SLAPlatformImpl, mObject=null, mStaticClass=null, mMethodName=null mArgClasses=null, mArgs=null]

ERROR 03/13 16:28:31 Rate                 [JavaRate] 300073 Unable to initialize local MO Cache for Rate

Error Id: 300471

As a workaround, perform the following steps:

  1. Verify database connectivity by running the pw dbconfig list command.
  2. If you are using dual network interface cards (NICs), ensure that the adapter priorities are set correctly.
  3. Add the required IP address details to the hosts file (see the example after this list), or modify the setting in the ESX host file to assign a different label to the network adapter. Contact your administrator for more details.
  4. Disable the NICs that are not required.
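
A sample hosts file entry for step 3; the IP address and host names are placeholders for your environment:

    # Pin the node's host name to the NIC that the HA nodes should use
    10.20.30.40    tsim-node1.example.com    tsim-node1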

If the above workaround doesn't resolve the issue, do the following:

  1. Stop the primary Infrastructure Management server.
  2. Stop the secondary Infrastructure Management server.
  3. Perform the following first on the primary and then on the secondary server:
    1. Go to the <TrueSight Infrastructure Management Install Directory>\pw\pronto\conf directory.
    2. Take a backup of the federatedcacheserver-transport-tcp.xml file.
    3. Edit the federatedcacheserver-transport-tcp.xml file, and replace the host names with their IP addresses in the following code lines:

      # Original code
      initial_hosts="server1.bmc.com[10590],server2.bmc.com[10590]"
      jgroups.tcp.address:server1.bmc.com


      # Modified code
      initial_hosts="<IP address of server1.bmc.com>[10590],<IP address of server2.bmc.com>[10590]"
      jgroups.tcp.address:<IP address of server1.bmc.com>
    4. Save the federatedcacheserver-transport-tcp.xml file.
  4. Start the primary Infrastructure Management server. 
  5. Start the secondary Infrastructure Management server. 
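
Before starting the servers, you can optionally confirm that each node resolves the peer addresses used in the configuration (nslookup is a standard OS utility; the host names come from the example above):

    rem Run on each node; the host names are the ones you replaced with IP addresses
    nslookup server1.bmc.com
    nslookup server2.bmc.com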

Related topics

Troubleshooting

Troubleshooting a Presentation Server high-availability deployment

