Troubleshooting an Infrastructure Management high-availability deployment
Consult this topic for troubleshooting information related to a high-availability deployment of Infrastructure Management.
Database does not come up after applying the service pack or fix pack to the primary cluster node (applicable to 11.3.03 and later)
- Immediately after applying the service pack or fix pack to the primary cluster node, stop the Infrastructure Management server on the primary cluster node.
- On the secondary cluster node, run the <Install_dir>\pw\pronto\bin\updateRegistryForSybase17.bat file to update the Sybase 17 registry settings for Windows.
- Restart the Infrastructure Management server on the secondary cluster node.
Automatic restart of Infrastructure Management server processes (Supported only with 11.3.02 and later)
For 11.3.03
In an application high-availability deployment of TrueSight Infrastructure Management, when any of the critical server processes becomes unavailable, a recovery action (restart or shutdown) is performed.
If the Infrastructure Management server is unable to establish connectivity with the database for a specified amount of time, both nodes are shut down.
If required, configure the following parameters in the installedDirectory\pw\custom\conf\pronet.conf file:
Property name | Property details | Default value |
---|---|---|
pronet.component.unavailability.recovery.action | The recovery action that is performed if a critical process is unavailable. Note: This property does not affect the database connectivity check. | shutdown (primary node); restart (secondary node) |
pronet.ha.availability.scan.frequency.in.secs | The polling interval (in seconds) of the critical-process availability check. | 60 |
pronet.ha.availability.max.retry.count | The number of retries at the frequency specified by the pronet.ha.availability.scan.frequency.in.secs property. If a critical process remains unavailable for this number of retries, the recovery action is performed. | 6 |
pronet.availability.db.connection.max.retry.minutes | The amount of time (in minutes) for which database connectivity is checked. | 15 |
pronet.availability.db.connection.max.retry.count | The number of retries for the database connectivity check. | 15 |
pronet.component.unavailability.attempts.email.interval | The interval (in minutes) at which an email is generated and sent to the configured administrator email address. To configure email settings, see Configuring e-mail settings to receive alerts. | 2 |
Note
An upgrade to version 11.3.03 deletes the pronet.ha.availability.self.shutdown.mechanism property, which was previously used to control the recovery action.
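With the default values above, a critical process must be seen as unavailable for six consecutive 60-second polls before the recovery action runs. A minimal sketch of that arithmetic (illustrative only, not product code):

```python
# Illustrative sketch: the worst-case time before the recovery action fires,
# derived from the two pronet.conf properties described above.
scan_frequency_secs = 60    # pronet.ha.availability.scan.frequency.in.secs
max_retry_count = 6         # pronet.ha.availability.max.retry.count

def detection_window_secs(frequency_secs: int, retries: int) -> int:
    """A critical process must fail every poll for `retries` consecutive
    checks before the recovery action (restart or shutdown) is performed."""
    return frequency_secs * retries

print(detection_window_secs(scan_frequency_secs, max_retry_count))  # 360
```

With the defaults, that is about six minutes between the first missed check and the recovery action.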
For 11.3.02
When a process is unavailable or fails on any node in a high-availability deployment, by default, all the server processes are shut down.
Set the pronet.ha.availability.self.shutdown.mechanism property in the <Infrastructure Management Install directory>\pw\custom\conf\pronet.conf file to automatically restart all the processes on the failed node.
Examples
# To automatically restart the processes on a failed node in a high-availability deployment of Infrastructure Management Server
pronet.ha.availability.self.shutdown.mechanism=restart
Notes
- It is recommended that you set this configuration property only on the secondary server.
- If you have set the property to restart all the processes and later want the processes to shut down instead, delete the following entry from the <Infrastructure Management Install directory>\pw\custom\conf\pronet.conf file:
pronet.ha.availability.self.shutdown.mechanism=restart
- After you set or delete the pronet.ha.availability.self.shutdown.mechanism property in the <Infrastructure Management Install directory>\pw\custom\conf\pronet.conf file, restart the Infrastructure Management Server by running the following command: pw system restart force
The changes take effect after the restart.
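Deleting the entry can also be scripted. The following is a minimal sketch, assuming a simple key=value pronet.conf layout; remove_property is a hypothetical helper, not a utility shipped with the product:

```python
# Hypothetical helper (not shipped with the product): drop the
# pronet.ha.availability.self.shutdown.mechanism entry from pronet.conf-style
# text so that the default shutdown behavior applies again.
def remove_property(conf_text: str, name: str) -> str:
    kept = [line for line in conf_text.splitlines()
            if not line.strip().startswith(name + "=")]
    return "\n".join(kept)

conf = ("some.other.property=value\n"
        "pronet.ha.availability.self.shutdown.mechanism=restart\n")
print(remove_property(conf, "pronet.ha.availability.self.shutdown.mechanism"))
```

Remember that a pw system restart force is still required for the change to take effect.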
Rate process is unable to obtain monitor object data from the cache (applicable to version 11.3.02 and later)
During a failover, the Rate process is unable to fetch monitor object data from the cache and crashes. The following error message is displayed:
INFO 03/13 16:28:31 Rate [RateInit-1] 600002 Failed to execute initialization task. [ mPlatformClass=class com.proactivenet.api.sla.SLAPlatformImpl, mObject=null, mStaticClass=null, mMethodName=null mArgClasses=null, mArgs=null]
ERROR 03/13 16:28:31 Rate [JavaRate] 300073 Unable to initialize local MO Cache for Rate
Error Id: 300471
As a workaround, perform the following steps:
- Verify database connectivity by running the pw dbconfig list command.
- Ensure that the firewall/iptables rules are not blocking TCP communication between the HA nodes. For detailed port information, see Network ports for a high-availability deployment of Infrastructure Management.
- Ensure that the adapter priorities are correctly set if you are using dual Network Interface Cards (NICs). Add the required IP address details to the hosts file, or modify the setting in the ESX host file to set a different label for the network adapter. Contact your administrator for more details.
- Disable the NICs that are not required.
If the above workaround does not resolve the issue, do the following:
- Stop the primary Infrastructure Management server.
- Stop the secondary Infrastructure Management server.
- Perform the following first on the primary and then on the secondary server:
- Go to the <TrueSight Infrastructure Management Install Directory>\pw\pronto\conf directory.
- Take a backup of the federatedcacheserver-transport-tcp.xml file.
- Edit the federatedcacheserver-transport-tcp.xml file, and replace the host names with their IP addresses in the following code lines:
#Original code
initial_hosts="server1.bmc.com[10590],server2.bmc.com[10590]"
jgroups.tcp.address:server1.bmc.com
#Modified code
initial_hosts="<IP address of server1.bmc.com>[10590],<IP address of server2.bmc.com>[10590]"
jgroups.tcp.address:<IP address of server1.bmc.com>
- Save the federatedcacheserver-transport-tcp.xml file.
- Start the primary Infrastructure Management server.
- Start the secondary Infrastructure Management server.
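The host-name-to-IP substitution in the edit step can be sketched as follows. This is illustrative only: the mapping stands in for real name resolution (for example, socket.gethostbyname), and the host names and addresses are placeholders, not values from a live system:

```python
# Illustrative sketch of the host-name-to-IP substitution described above.
def replace_hosts(text: str, lookup: dict) -> str:
    # Replace each known host name with its resolved IP address.
    for host, ip in lookup.items():
        text = text.replace(host, ip)
    return text

fragment = 'initial_hosts="server1.bmc.com[10590],server2.bmc.com[10590]"'
mapping = {"server1.bmc.com": "10.0.0.1", "server2.bmc.com": "10.0.0.2"}
print(replace_hosts(fragment, mapping))
# initial_hosts="10.0.0.1[10590],10.0.0.2[10590]"
```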
Cell databases are out of sync
In an application high-availability deployment of TrueSight Infrastructure Management, a Critical event is generated when the cell databases (MCDBs) are out of sync.
The MCDBs may go out of sync in the following scenarios:
- When the nodes recover from a split-brain condition.
- When the secondary server is down for a long time, and the Infrastructure Management processes on the primary server have been restarted multiple times.
Error ID: 101387
As a workaround, perform the following steps:
- Stop the primary and the secondary Infrastructure Management servers.
- On the secondary server, delete the content from the <Installation Directory>\TrueSight\pw\server\var\<secondary_cell_name> directory.
- Copy the content from the <Installation Directory>\TrueSight\pw\server\var\<primary_cell_name> directory of the primary server to the <Installation Directory>\TrueSight\pw\server\var\<secondary_cell_name> directory of the secondary server.
- Start the primary server.
- Ensure that the primary server processes are up and running.
- Start the secondary server.
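The delete-and-copy in steps 2 and 3 can be sketched as below, assuming both servers are stopped and both var directories are reachable from one place (for example, over a network share); the paths are placeholders:

```python
import os
import shutil

# Illustrative sketch of the MCDB resync described above: remove the secondary
# cell's var directory content and replace it with a copy of the primary's.
def resync_cell(primary_var: str, secondary_var: str) -> None:
    if os.path.isdir(secondary_var):
        shutil.rmtree(secondary_var)             # delete secondary cell content
    shutil.copytree(primary_var, secondary_var)  # copy primary cell content
```

Only after the copy completes should the primary server, and then the secondary server, be started again.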
Cell database status cannot be determined
In an application high-availability deployment of TrueSight Infrastructure Management, the cell database (MCDB) sync status cannot be determined to conclude whether the databases are in sync or out of sync. The following message is displayed in the <TrueSight Infrastructure Management Install Directory>\pw\pronto\logs\MCDBMismatch.log file:
INFO 12/17 14:37:11 MCDBMismatchDetect [Thread-72] 600002 tsimCellName=pncell_clm-pun-tjd61f valPNCELL=pncell_clm-pun-tjd61f
INFO 12/17 14:37:32 MCDBMismatchDetect [Thread-72] 600002 getEventCount attempt=1 Not able to execute command:-20191217 143732.186000 mquery: MCLI: BMC_TS-IMC300011E: Could not connect to Cell pncell_clm-pun-tjd61f#2
INFO 12/17 14:37:32 MCDBMismatchDetect [Thread-72] 600002 isEventCountMatchingCould not confirm if there is event count mismatch. primaryEventCount=161 secondaryEventCount=-1
INFO 12/17 14:37:52 MCDBMismatchDetect [Thread-72] 600002 getDataCount attempt=1 Not able to execute command:-20191217 143752.396000 mquery: MCLI: BMC_TS-IMC300011E: Could not connect to Cell pncell_clm-pun-tjd61f#2
INFO 12/17 14:37:52 MCDBMismatchDetect [Thread-72] 600002 isDataCountMatching Could not confirm if there is Data count mismatch. primaryDataCount=1568 secondaryDataCount=-1
INFO 12/17 14:37:52 MCDBMismatchDetect [Thread-72] 600002 executeDetectionMCDBMismatch primaryDataCount= 1568 secondaryDataCount=-1 primaryEventCount=161 secondaryEventCount=-1 Event count and data count could not be determined
As a workaround, perform the following steps:
Check the status of the primary and secondary cells, and ensure that they are up and running. The MCDB sync status is computed again after an interval of 70 minutes.
To reconfigure the secondary server
If you need to deploy and use a different secondary server instead of the original one, perform the following steps:
- Install Infrastructure Management on the new secondary server.
- Stop the primary server and the new secondary server.
- On the primary server, edit the following files and replace the original secondary server host name with the new secondary server host name:
installedDirectory\pw\pronto\tmp\ha-generated.conf
installedDirectory\pw\server\etc\mcell.dir
installedDirectory\pw\pronto\data\admin\admin.dir
installedDirectory\pw\pronto\conf\federatedcacheserver-transport-tcp.xml
installedDirectory\pw\pronto\conf\clustermanager.xml
installedDirectory\pw\pronto\conf\cell_info.list
installedDirectory\pw\integrations\ibrsd\conf\IBRSD.dir
installedDirectory\pw\custom\conf\ha.conf
- Copy the updated installedDirectory/TrueSight/pw/pronto/tmp/ha-generated.conf file from the primary server to the new secondary server.
- On the new secondary server, run the following command: pw ha enable standby file=<location of ha-generated.conf file>
- Copy the following folder from the primary server to the new secondary server: installedDirectory/TrueSight/pw/server/var/pncell_<hostname>#1
- Rename the copied folder on the new secondary server to pncell_<hostname>#2
- Copy the installedDirectory/TrueSight/pw/pronto/tmp/addremoteagent file from the primary server to the new secondary server.
- Start the primary server.
- On the primary server, run the addremoteagent file.
- After the primary server is up and running, start the secondary server.
TrueSight Infrastructure Management performs a failover as the primary cell stops responding
In a TrueSight Infrastructure Management high-availability deployment, if the primary cell fails or stops responding, the secondary cell becomes active. After a while, the Infrastructure Management server performs the failover. This may cause the cell database (MCDB) to go out of sync.
You can disable the automatic failover feature for the cell by setting an mcell configuration parameter. If you set the CellDuplicateAutoFailOver parameter to NO and the primary cell fails, automatic cell failover does not happen; instead, the cell waits for the agent controller process to perform the failover.
To disable the automatic cell failover, do the following on both primary and secondary Infrastructure Management servers:
- Edit the <Infrastructure Management Installation Directory>\pw\server\etc\<cellname>\mcell.conf file.
- Set the following configuration parameter to NO: CellDuplicateAutoFailOver=NO
- Restart the cell process.
The HA nodes experience a split-brain condition
A split-brain condition occurs when the HA nodes can no longer communicate with each other. Each node assumes that the other node is non-functional and tries to take over the active node role.
The following Infrastructure Management server properties help to prevent and recover from a split-brain condition.
To set the properties, edit the installedDirectory\pw\custom\conf\pronet.conf file on the secondary server only, and make the required changes:
Split-brain condition prevention
Property name | Property details | Default value | Valid values |
---|---|---|---|
| Enables or disables support for preventing a split-brain condition. | true | true, false |
| The number of retries to check the remote node status. If the remote node is not identified as active within this number of retries, the current node becomes active. If the remote node is identified as active, the current node continues as the standby node. The value must be less than 20. | 6 | Any positive integer |
| The frequency (in seconds) at which the monitoring service checks the remote node status. | 30 | Any positive integer |
Split-brain condition recovery
Property name | Property details | Default value | Valid values |
---|---|---|---|
| Enables or disables support for recovering from a split-brain condition. | true | true, false |
| Controls notifications (a self-health monitoring event and an email) when a split-brain condition is identified. | true | true, false |
| The frequency (in seconds) at which the monitoring service checks for a split-brain condition. | 30 | Any positive integer |
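The prevention behavior in the first table amounts to a bounded retry loop on the standby node. A sketch of that logic, with probe standing in for the real remote-status check (the real service also waits the scan frequency between polls, which is omitted here):

```python
# Illustrative sketch (not product code) of split-brain prevention: poll the
# remote node up to max_retries times; if it is never seen as active, the
# current node takes over, otherwise it stays standby.
def decide_role(probe, max_retries: int = 6) -> str:
    for _ in range(max_retries):
        if probe():          # remote node reported as active
            return "standby"
    return "active"          # remote node never seen active: take over

print(decide_role(lambda: False))  # active
print(decide_role(lambda: True))   # standby
```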
Errors and warnings are displayed when failback occurs
When a failback occurs, errors and warnings related to ActiveMQ are displayed in the TrueSight.log file.
As a workaround, perform the following steps:
- Back up the following files:
installedDirectory\wildfly\standalone\deployments\activemq-rar.rar.deployed
installedDirectory\wildfly\standalone\deployments\activemq-rar.rar
- Depending on the node status (active or standby), copy the appropriate file to the installedDirectory\wildfly\standalone\deployments folder:
installedDirectory\Active\activemq-rar.rar
installedDirectory\StandBy\activemq-rar.rar
- Restart the Infrastructure Management server.
The Rate process does not start on any of the HA nodes
Due to cache replication issues, the Rate process may not start on any of the HA nodes. A null-pointer exception is also displayed in the installedDirectory\pw\pronto\logs\TrueSight.log file.
As a workaround, perform the following steps:
- Stop the standby node first, and then stop the active node.
- Back up the federatedcacheserver-transport-tcp.xml and clustermanager.xml files in the installedDirectory\pw\pronto\conf folder on both nodes.
- Edit the installedDirectory\pw\pronto\conf\federatedcacheserver-transport-tcp.xml file, and update the timeout value of the FD and VERIFY_SUSPECT elements to 30000 and the max_tries value of the FD element to 9:
<FD max_tries="9" timeout="30000"/> <VERIFY_SUSPECT timeout="30000"/>
- Edit the installedDirectory\pw\pronto\conf\clustermanager.xml file, and update the FD element timeout value to 30000 and max_tries to 9, and the VERIFY_SUSPECT element timeout to 30000 and num_msgs to 5:
<FD max_tries="9" timeout="30000"/> <VERIFY_SUSPECT timeout="30000" num_msgs="5"/>
- Ensure that you perform steps 3 and 4 on both nodes.
- Start the previously active node first.
- After the node is up and running and you are able to access the operator console, restart the other node.
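The attribute edits in steps 3 and 4 can also be applied programmatically. A sketch against a minimal stand-in fragment (the real files contain a full jgroups configuration, so in practice you would load the actual file rather than this placeholder):

```python
import xml.etree.ElementTree as ET

# Illustrative sketch: raise the FD and VERIFY_SUSPECT timeouts in a
# jgroups-style fragment, as the manual steps above describe.
fragment = ('<config><FD max_tries="5" timeout="5000"/>'
            '<VERIFY_SUSPECT timeout="1500"/></config>')
root = ET.fromstring(fragment)
root.find("FD").set("timeout", "30000")
root.find("FD").set("max_tries", "9")
root.find("VERIFY_SUSPECT").set("timeout", "30000")
print(ET.tostring(root, encoding="unicode"))
```

Whatever method you use, the same edits must be made on both nodes before restarting them.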
The Rate process crashes
The Rate process crashes during a failback. Check whether the installedDirectory\pw\pronto\logs\Rate.log file contains the following error message:
Caused by: org.infinispan.client.hotrod.exceptions.HotRodClientException:Request for messageId=<message ID #> returned server error (status=0x86): org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request <request #>
- Edit the installedDirectory\pw\pronto\conf\federatedcacheserver.xml file, and update the acquire-timeout value to 90000 milliseconds:
<acquire-timeout="90000"/>
- Restart the Infrastructure Management server processes.
Failback fails because the server is not completely initialized
Failback fails, and the JServer and Rate processes are not running. The TrueSight.log file contains the following:
Naming Exception while getting Topic Session.
javax.naming.NameNotFoundException: jboss/exported/ConnectionFactory – service jboss.naming.context.java.jboss.exported.jboss.exported.ConnectionFactory
at org.jboss.as.naming.ServiceBasedNamingStore.lookup(ServiceBasedNamingStore.java:106)
at org.jboss.as.naming.NamingContext.lookup(NamingContext.java:207)
at org.jboss.as.naming.NamingContext.lookup(NamingContext.java:184)
at org.jboss.naming.remote.protocol.v1.Protocol$1.handleServerMessage(Protocol.java:127)
at org.jboss.naming.remote.protocol.v1.RemoteNamingServerV1$MessageReciever$1.run(RemoteNamingServerV1.java:73)
As a workaround, perform the following steps:
- Update the hosts file to make sure that localhost resolves to the IP address.
- If step 1 does not fix the issue, edit the installedDirectory\pw\pronto\conf\pronet.conf and installedDirectory\pw\custom\conf\pronet.conf files, and check the value of the java.naming.provider.url property. If it is localhost, change it to 127.0.0.1.
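The check in step 2 can be sketched as follows; the property value shown is an example, not a value captured from a live system:

```python
# Illustrative sketch: if the java.naming.provider.url property points at
# localhost, rewrite it to 127.0.0.1, as step 2 above describes.
def fix_provider_url(line: str) -> str:
    key, sep, value = line.partition("=")
    if key.strip() == "java.naming.provider.url" and "localhost" in value:
        return key + sep + value.replace("localhost", "127.0.0.1")
    return line

print(fix_provider_url("java.naming.provider.url=localhost"))
# java.naming.provider.url=127.0.0.1
print(fix_provider_url("some.other.property=localhost"))  # left unchanged
```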
mcell.dir is blank after a failover when the Infrastructure Management disk is full
Note
This problem is observed only when the Presentation Server is in HA mode.
Perform the following steps to recover the mcell.dir
file from the backup folder:
- Navigate to pw/pronto/tmp and open the mcell_dir_org_backup file.
- Navigate to pw/server/etc/ and copy the information from the mcell_dir_org_backup file to mcell.dir.
- Update the current active node of the Presentation Server in the mcell.dir file.
- Stop and start the Infrastructure Management server.
Related topics
Troubleshooting a Presentation Server high-availability deployment