Troubleshooting an Infrastructure Management high-availability deployment
Consult this topic for troubleshooting information related to a high-availability deployment of Infrastructure Management.
Database does not come up after applying the service pack or fix pack to the primary cluster node (applicable to 11.3.03 and later)
- Immediately after applying the service pack or fix pack to the primary cluster node, stop the Infrastructure Management server on the primary cluster node.
- On the secondary cluster node, run the <Install_dir>\pw\pronto\bin\updateRegistryForSybase17.bat file to update the Sybase 17 registry settings for Windows.
- Restart the Infrastructure Management server on the secondary cluster node.
Automatic restart of Infrastructure Management server processes (Supported only with 11.3.02 and later)
For 11.3.03
In an application high-availability deployment of TrueSight Infrastructure Management, when any of the critical server processes becomes unavailable, a recovery action (restart or shutdown) is performed.
If the Infrastructure Management server is unable to establish connectivity with the database for a specified amount of time, both nodes are shut down.
If required, configure the following parameters in the installedDirectory\pw\custom\conf\pronet.conf file:
Property name | Property details | Default value |
---|---|---|
pronet.component.unavailability.recovery.action | The recovery action that is performed if a critical process is unavailable. Note: This property does not affect the database connectivity check. | shutdown (primary node); restart (secondary node) |
pronet.ha.availability.scan.frequency.in.secs | The polling interval (in seconds) of the critical-process availability check. | 60 |
pronet.ha.availability.max.retry.count | The number of retries at the frequency specified by the pronet.ha.availability.scan.frequency.in.secs property. If a critical process remains unavailable for this number of retries, the recovery action is performed. | 6 |
pronet.availability.db.connection.max.retry.minutes | The amount of time (in minutes) for which database connectivity is checked. | 15 |
pronet.availability.db.connection.max.retry.count | The number of retries for the database connectivity check. | 15 |
pronet.component.unavailability.attempts.email.interval | The interval (in minutes) at which an email is generated and sent to the configured administrator email address. To configure email settings, see Configuring e-mail settings to receive alerts. | 2 |
Note
An upgrade to version 11.3.03 deletes the pronet.ha.availability.self.shutdown.mechanism property, which was previously used to control the recovery action.
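With the default values above, a critical process must be seen as unavailable for six consecutive 60-second polls before the recovery action runs. A minimal sketch of that arithmetic (illustrative only, not product code):

```python
# Illustrative sketch: the worst-case time before the recovery action fires,
# derived from the two pronet.conf properties described above.
scan_frequency_secs = 60    # pronet.ha.availability.scan.frequency.in.secs
max_retry_count = 6         # pronet.ha.availability.max.retry.count

def detection_window_secs(frequency_secs: int, retries: int) -> int:
    """A critical process must fail every poll for `retries` consecutive
    checks before the recovery action (restart or shutdown) is performed."""
    return frequency_secs * retries

print(detection_window_secs(scan_frequency_secs, max_retry_count))  # 360
```

With the defaults, that is about six minutes between the first missed check and the recovery action.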
For 11.3.02
When a process is unavailable or fails on any node in a high-availability deployment, by default, all the server processes are shut down.
Set the pronet.ha.availability.self.shutdown.mechanism property in the <Infrastructure Management Install directory>\pw\custom\conf\pronet.conf file to automatically restart all the processes on the failed node.
Examples
# To automatically restart the processes on a failed node in a high-availability deployment of Infrastructure Management Server
pronet.ha.availability.self.shutdown.mechanism=restart
Notes
- It is recommended that you set this configuration property only on the secondary server.
- If you have set the property to restart all the processes and later want the processes to shut down instead, delete the following entry from the <Infrastructure Management Install directory>\pw\custom\conf\pronet.conf file:
pronet.ha.availability.self.shutdown.mechanism=restart
- After you set or delete the pronet.ha.availability.self.shutdown.mechanism property in the <Infrastructure Management Install directory>\pw\custom\conf\pronet.conf file, restart the Infrastructure Management Server by running the following command: pw system restart force
The changes take effect after the restart.
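Deleting the entry can also be scripted. The following is a minimal sketch, assuming a simple key=value pronet.conf layout; remove_property is a hypothetical helper, not a utility shipped with the product:

```python
# Hypothetical helper (not shipped with the product): drop the
# pronet.ha.availability.self.shutdown.mechanism entry from pronet.conf-style
# text so that the default shutdown behavior applies again.
def remove_property(conf_text: str, name: str) -> str:
    kept = [line for line in conf_text.splitlines()
            if not line.strip().startswith(name + "=")]
    return "\n".join(kept)

conf = ("some.other.property=value\n"
        "pronet.ha.availability.self.shutdown.mechanism=restart\n")
print(remove_property(conf, "pronet.ha.availability.self.shutdown.mechanism"))
```

Remember that a pw system restart force is still required for the change to take effect.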
Rate process is unable to obtain monitor object data from the cache (applicable to version 11.3.02 and later)
During a failover, the Rate process is unable to fetch monitor object data from the cache and crashes. The following error message is displayed:
INFO 03/13 16:28:31 Rate [RateInit-1] 600002 Failed to execute initialization task. [ mPlatformClass=class com.proactivenet.api.sla.SLAPlatformImpl, mObject=null, mStaticClass=null, mMethodName=null mArgClasses=null, mArgs=null]
ERROR 03/13 16:28:31 Rate [JavaRate] 300073 Unable to initialize local MO Cache for Rate
Error Id: 300471
As a workaround, perform the following steps:
- Verify database connectivity by running the pw dbconfig list command.
- Ensure that the firewall/iptables rules are not blocking TCP communication between the HA nodes. For detailed port information, see Network ports for a high-availability deployment of Infrastructure Management.
- Ensure that the adapter priorities are correctly set if you are using dual Network Interface Cards (NICs). Add the required IP address details to the hosts file, or modify the setting in the ESX host file to set a different label for the network adapter. Contact your administrator for more details.
- Disable the NICs that are not required.
If the above workaround does not resolve the issue, do the following:
- Stop the primary Infrastructure Management server.
- Stop the secondary Infrastructure Management server.
- Perform the following first on the primary and then on the secondary server:
- Go to the <TrueSight Infrastructure Management Install Directory>\pw\pronto\conf directory.
- Take a backup of the federatedcacheserver-transport-tcp.xml file.
- Edit the federatedcacheserver-transport-tcp.xml file, and replace the host names with their IP addresses in the following code lines:
#Original code
initial_hosts="server1.bmc.com[10590],server2.bmc.com[10590]"
jgroups.tcp.address:server1.bmc.com
#Modified code
initial_hosts="<IP address of server1.bmc.com>[10590],<IP address of server2.bmc.com>[10590]"
jgroups.tcp.address:<IP address of server1.bmc.com>
- Save the federatedcacheserver-transport-tcp.xml file.
- Start the primary Infrastructure Management server.
- Start the secondary Infrastructure Management server.
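The host-name-to-IP substitution in the edit step can be sketched as follows. This is illustrative only: the mapping stands in for real name resolution (for example, socket.gethostbyname), and the host names and addresses are placeholders, not values from a live system:

```python
# Illustrative sketch of the host-name-to-IP substitution described above.
def replace_hosts(text: str, lookup: dict) -> str:
    # Replace each known host name with its resolved IP address.
    for host, ip in lookup.items():
        text = text.replace(host, ip)
    return text

fragment = 'initial_hosts="server1.bmc.com[10590],server2.bmc.com[10590]"'
mapping = {"server1.bmc.com": "10.0.0.1", "server2.bmc.com": "10.0.0.2"}
print(replace_hosts(fragment, mapping))
# initial_hosts="10.0.0.1[10590],10.0.0.2[10590]"
```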
Cell databases are out of sync
In an application high-availability deployment of TrueSight Infrastructure Management, a Critical event is generated when the cell databases (MCDBs) are out of sync.
The MCDBs may go out of sync in the following scenarios:
- When the nodes recover from a split-brain condition.
- When the secondary server is down for a long time, and the Infrastructure Management processes on the primary server have been restarted multiple times.
Error ID: 101387
As a workaround, perform the following steps:
- Stop the primary and the secondary Infrastructure Management servers.
- On the secondary server, delete the content from the <Installation Directory>\TrueSight\pw\server\var\<secondary_cell_name> directory.
- Copy the content from the <Installation Directory>\TrueSight\pw\server\var\<primary_cell_name> directory of the primary server to the <Installation Directory>\TrueSight\pw\server\var\<secondary_cell_name> directory of the secondary server.
- Start the primary server.
- Ensure that the primary server processes are up and running.
- Start the secondary server.
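The delete-and-copy in steps 2 and 3 can be sketched as below, assuming both servers are stopped and both var directories are reachable from one place (for example, over a network share); the paths are placeholders:

```python
import os
import shutil

# Illustrative sketch of the MCDB resync described above: remove the secondary
# cell's var directory content and replace it with a copy of the primary's.
def resync_cell(primary_var: str, secondary_var: str) -> None:
    if os.path.isdir(secondary_var):
        shutil.rmtree(secondary_var)             # delete secondary cell content
    shutil.copytree(primary_var, secondary_var)  # copy primary cell content
```

Only after the copy completes should the primary server, and then the secondary server, be started again.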
Cell database status cannot be determined
In an application high-availability deployment of TrueSight Infrastructure Management, the cell database (MCDB) sync status cannot be determined to conclude whether the databases are in sync or out of sync. The following message is displayed in the <TrueSight Infrastructure Management Install Directory>\pw\pronto\logs\MCDBMismatch.log file:
INFO 12/17 14:37:11 MCDBMismatchDetect [Thread-72] 600002 tsimCellName=pncell_clm-pun-tjd61f valPNCELL=pncell_clm-pun-tjd61f
INFO 12/17 14:37:32 MCDBMismatchDetect [Thread-72] 600002 getEventCount attempt=1 Not able to execute command:-20191217 143732.186000 mquery: MCLI: BMC_TS-IMC300011E: Could not connect to Cell pncell_clm-pun-tjd61f#2
INFO 12/17 14:37:32 MCDBMismatchDetect [Thread-72] 600002 isEventCountMatchingCould not confirm if there is event count mismatch. primaryEventCount=161 secondaryEventCount=-1
INFO 12/17 14:37:52 MCDBMismatchDetect [Thread-72] 600002 getDataCount attempt=1 Not able to execute command:-20191217 143752.396000 mquery: MCLI: BMC_TS-IMC300011E: Could not connect to Cell pncell_clm-pun-tjd61f#2
INFO 12/17 14:37:52 MCDBMismatchDetect [Thread-72] 600002 isDataCountMatching Could not confirm if there is Data count mismatch. primaryDataCount=1568 secondaryDataCount=-1
INFO 12/17 14:37:52 MCDBMismatchDetect [Thread-72] 600002 executeDetectionMCDBMismatch primaryDataCount= 1568 secondaryDataCount=-1 primaryEventCount=161 secondaryEventCount=-1 Event count and data count could not be determined
As a workaround, perform the following steps:
Check the status of the primary and secondary cells, and ensure that they are up and running. The MCDB sync status is computed again after an interval of 70 minutes.
To reconfigure the secondary server
If you need to deploy and use a different secondary server instead of the original one, perform the following steps:
- Install Infrastructure Management on the new secondary server.
- Stop the primary server and the new secondary server.
- On the primary server, edit the following files and replace the original secondary server host name with the new secondary server host name:
installedDirectory\pw\pronto\tmp\ha-generated.conf
installedDirectory\pw\server\etc\mcell.dir
installedDirectory\pw\pronto\data\admin\admin.dir
installedDirectory\pw\pronto\conf\federatedcacheserver-transport-tcp.xml
installedDirectory\pw\pronto\conf\clustermanager.xml
installedDirectory\pw\pronto\conf\cell_info.list
installedDirectory\pw\integrations\ibrsd\conf\IBRSD.dir
installedDirectory\pw\custom\conf\ha.conf
- Copy the updated installedDirectory/TrueSight/pw/pronto/tmp/ha-generated.conf file from the primary server to the new secondary server.
- On the new secondary server, run the following command: pw ha enable standby file=<location of ha-generated.conf file>
- Copy the following folder from the primary server to the new secondary server: installedDirectory/TrueSight/pw/server/var/pncell_<hostname>#1
- Rename the copied folder on the new secondary server to pncell_<hostname>#2
- Copy the installedDirectory/TrueSight/pw/pronto/tmp/addremoteagent file from the primary server to the new secondary server.
- Start the primary server.
- On the primary server, run the addremoteagent file.
- After the primary server is up and running, start the secondary server.
TrueSight Infrastructure Management performs a failover as the primary cell stops responding
In a TrueSight Infrastructure Management high-availability deployment, if the primary cell fails or stops responding, the secondary cell becomes active. After a while, the Infrastructure Management server performs the failover. This may cause the cell database (MCDB) to go out of sync.
You can disable the automatic failover feature for the cell by setting an mcell configuration parameter. If you set the CellDuplicateAutoFailOver parameter to NO and the primary cell fails, automatic cell failover does not happen; instead, the cell waits for the agent controller process to perform the failover.
To disable the automatic cell failover, do the following on both primary and secondary Infrastructure Management servers:
- Edit the <Infrastructure Management Installation Directory>\pw\server\etc\<cellname>\mcell.conf file.
- Set the following configuration parameter to NO: CellDuplicateAutoFailOver=NO
- Restart the cell process.
The HA nodes experience a split-brain condition
A split-brain condition occurs when the HA nodes can no longer communicate with each other. Each node assumes that the other node is non-functional and tries to take over the active node role.
The following Infrastructure Management server properties help to prevent and recover from a split-brain condition.
To set the properties, edit the installedDirectory\pw\custom\conf\pronet.conf file on the secondary server only, and make the required changes:
Split-brain condition prevention
Property name | Property details | Default value | Valid values |
---|---|---|---|
| Enables or disables support for preventing a split-brain condition. | true | true, false |
| The number of retries to check the remote node status. If the remote node is not identified as active within this number of retries, the current node becomes active. If the remote node is identified as active, the current node continues as the standby node. The value must be less than 20. | 6 | Any positive integer |
| The frequency (in seconds) at which the monitoring service checks the remote node status. | 30 | Any positive integer |
Split-brain condition recovery
Property name | Property details | Default value | Valid values |
---|---|---|---|
| Enables or disables support for recovering from a split-brain condition. | true | true, false |
| Controls notifications (a self-health monitoring event and an email) when a split-brain condition is identified. | true | true, false |
| The frequency (in seconds) at which the monitoring service checks for a split-brain condition. | 30 | Any positive integer |
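The prevention behavior in the first table amounts to a bounded retry loop on the standby node. A sketch of that logic, with probe standing in for the real remote-status check (the real service also waits the scan frequency between polls, which is omitted here):

```python
# Illustrative sketch (not product code) of split-brain prevention: poll the
# remote node up to max_retries times; if it is never seen as active, the
# current node takes over, otherwise it stays standby.
def decide_role(probe, max_retries: int = 6) -> str:
    for _ in range(max_retries):
        if probe():          # remote node reported as active
            return "standby"
    return "active"          # remote node never seen active: take over

print(decide_role(lambda: False))  # active
print(decide_role(lambda: True))   # standby
```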
Errors and warnings are displayed when failback occurs
When a failback occurs, errors and warnings related to ActiveMQ are displayed in the TrueSight.log file.
As a workaround, perform the following steps:
- Back up the following files:
installedDirectory\wildfly\standalone\deployments\activemq-rar.rar.deployed
installedDirectory\wildfly\standalone\deployments\activemq-rar.rar
- Depending on the node status (active or standby), copy the appropriate file to the installedDirectory\wildfly\standalone\deployments folder:
installedDirectory\Active\activemq-rar.rar
installedDirectory\StandBy\activemq-rar.rar
- Restart the Infrastructure Management server.
The Rate process does not start on any of the HA nodes
Due to cache replication issues, the Rate process may not start on any of the HA nodes. A null-pointer exception is also displayed in the installedDirectory\pw\pronto\logs\TrueSight.log file.
As a workaround, perform the following steps:
- Stop the standby node first, and then stop the active node.
- Back up the federatedcacheserver-transport-tcp.xml and clustermanager.xml files in the installedDirectory\pw\pronto\conf folder on both nodes.
- Edit the installedDirectory\pw\pronto\conf\federatedcacheserver-transport-tcp.xml file, and update the timeout value of the FD and VERIFY_SUSPECT elements to 30000 and the max_tries value of the FD element to 9:
<FD max_tries="9" timeout="30000"/> <VERIFY_SUSPECT timeout="30000"/>
- Edit the installedDirectory\pw\pronto\conf\clustermanager.xml file, and update the FD element timeout value to 30000 and max_tries to 9, and the VERIFY_SUSPECT element timeout to 30000 and num_msgs to 5:
<FD max_tries="9" timeout="30000"/> <VERIFY_SUSPECT timeout="30000" num_msgs="5"/>
- Ensure that you perform steps 3 and 4 on both nodes.
- Start the previously active node first.
- After the node is up and running and you are able to access the operator console, restart the other node.
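The attribute edits in steps 3 and 4 can also be applied programmatically. A sketch against a minimal stand-in fragment (the real files contain a full jgroups configuration, so in practice you would load the actual file rather than this placeholder):

```python
import xml.etree.ElementTree as ET

# Illustrative sketch: raise the FD and VERIFY_SUSPECT timeouts in a
# jgroups-style fragment, as the manual steps above describe.
fragment = ('<config><FD max_tries="5" timeout="5000"/>'
            '<VERIFY_SUSPECT timeout="1500"/></config>')
root = ET.fromstring(fragment)
root.find("FD").set("timeout", "30000")
root.find("FD").set("max_tries", "9")
root.find("VERIFY_SUSPECT").set("timeout", "30000")
print(ET.tostring(root, encoding="unicode"))
```

Whatever method you use, the same edits must be made on both nodes before restarting them.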
The Rate process crashes
The Rate process crashes during a failback. Check whether the installedDirectory\pw\pronto\logs\Rate.log file contains the following error message:
Caused by: org.infinispan.client.hotrod.exceptions.HotRodClientException:Request for messageId=<message ID #> returned server error (status=0x86): org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request <request #>
- Edit the installedDirectory\pw\pronto\conf\federatedcacheserver.xml file, and update the acquire-timeout value to 90000 milliseconds:
<acquire-timeout="90000"/>
- Restart the Infrastructure Management server processes.
Failback fails because the server is not completely initialized
Failback fails, and the JServer and Rate processes are not running. The TrueSight.log file contains the following:
Naming Exception while getting Topic Session.
javax.naming.NameNotFoundException: jboss/exported/ConnectionFactory – service jboss.naming.context.java.jboss.exported.jboss.exported.ConnectionFactory
at org.jboss.as.naming.ServiceBasedNamingStore.lookup(ServiceBasedNamingStore.java:106)
at org.jboss.as.naming.NamingContext.lookup(NamingContext.java:207)
at org.jboss.as.naming.NamingContext.lookup(NamingContext.java:184)
at org.jboss.naming.remote.protocol.v1.Protocol$1.handleServerMessage(Protocol.java:127)
at org.jboss.naming.remote.protocol.v1.RemoteNamingServerV1$MessageReciever$1.run(RemoteNamingServerV1.java:73)
As a workaround, perform the following steps:
- Update the hosts file to make sure that localhost resolves to the IP address.
- If step 1 does not fix the issue, edit the installedDirectory\pw\pronto\conf\pronet.conf and installedDirectory\pw\custom\conf\pronet.conf files, and check the value of the java.naming.provider.url property. If it is localhost, change it to 127.0.0.1.
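The check in step 2 can be sketched as follows; the property value shown is an example, not a value captured from a live system:

```python
# Illustrative sketch: if the java.naming.provider.url property points at
# localhost, rewrite it to 127.0.0.1, as step 2 above describes.
def fix_provider_url(line: str) -> str:
    key, sep, value = line.partition("=")
    if key.strip() == "java.naming.provider.url" and "localhost" in value:
        return key + sep + value.replace("localhost", "127.0.0.1")
    return line

print(fix_provider_url("java.naming.provider.url=localhost"))
# java.naming.provider.url=127.0.0.1
print(fix_provider_url("some.other.property=localhost"))  # left unchanged
```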
mcell.dir is blank after a failover when the Infrastructure Management disk is full
Note
This problem is observed only when the Presentation Server is in HA mode.
Perform the following steps to recover the mcell.dir
file from the backup folder:
- Navigate to pw/pronto/tmp and open the mcell_dir_org_backup file.
- Navigate to pw/server/etc/ and copy the information from the mcell_dir_org_backup file to mcell.dir.
- Update the current active node of the Presentation Server in the mcell.dir file.
- Stop and start the Infrastructure Management server.
Related topics
Troubleshooting a Presentation Server high-availability deployment