Troubleshooting an Infrastructure Management high-availability deployment


The switchToTLS.pl utility failed to update the TLS entries in the configuration files

Issue

In a TrueSight Infrastructure Management high-availability deployment, when you run the switchToTLS.pl utility to configure TLS for communication between the Infrastructure Management server and the Oracle database, the TLS configuration fails.

Probable cause

The switchToTLS.pl utility failed to update the TLS encryption key entries in the following files:

  • mcell.dir
  • cell_info.list
  • IBRSD.dir

Resolution

Do the following to resolve the issue:

  1. Log in to the primary Infrastructure Management server.
  2. Stop the primary server by running the pw system stop command.
  3. Go to the <Infrastructure Management server install directory>\pw\server\etc directory.
  4. Using a text editor, open the mcell.dir file, search for the cell entry, and update the encryption key to *TLS, as shown in the following example: 

    #Type  <name> <encryption key> <host>/<port>


    #Comment the line containing the mc encryption key entry
    #cell  pncell_tsim1  mc  pncell_tsim1:1828 pncell_tsim2:1828


    #Update the encryption key entry as *TLS
    cell  pncell_tsim1  *TLS  pncell_tsim1:1828 pncell_tsim2:1828
  5. Go to the <Infrastructure Management server install directory>\pw\pronto\conf directory.
  6. Using a text editor, open the cell_info.list file, search for the cell.SIM entry, and update the encryption key to *TLS, as shown in the following example: 

    #Type  <name> <encryption key> <host>/<port>

    #Comment the line containing the mc encryption key entry
    #cell.SIM  pncell_tsim1   mc  pncell_tsim1:1828 pncell_tsim2:1828  Production *

    #Update the encryption key entry as *TLS
    cell.SIM   pncell_tsim1  *TLS  pncell_tsim1:1828 pncell_tsim2:1828  Production *
  7. Go to the <Infrastructure Management server install directory>\integrations\ibrsd\conf directory.
  8. Using a text editor, open the IBRSD.dir file, search for the cell entry, and update the encryption key to *TLS, as shown in the following example: 

    #Type  <name> <encryption key> <host>/<port>

    #Comment the line containing the mc encryption key entry
    #cell  pncell_tsim1  mc  pncell_tsim1:1828 pncell_tsim2:1828

    #Update the encryption key entry as *TLS
    cell  pncell_tsim1  *TLS  pncell_tsim1:1828 pncell_tsim2:1828
  9. Restart the primary server by running the pw system start command.
  10. Log in to the secondary Infrastructure Management server.
  11. Stop the secondary server by running the pw system stop command.
  12. Perform steps 3 through 8 on the secondary server.
  13. Restart the secondary server by running the pw system start command.
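
After updating both servers, you can spot-check that each file now contains an active *TLS entry. The following commands are a minimal sketch using the Windows findstr utility, assuming the default directory layout shown in the steps above:

    REM List the lines that reference the *TLS encryption key (literal match)
    findstr /l /n /c:"*TLS" "<Infrastructure Management server install directory>\pw\server\etc\mcell.dir"
    findstr /l /n /c:"*TLS" "<Infrastructure Management server install directory>\pw\pronto\conf\cell_info.list"
    findstr /l /n /c:"*TLS" "<Infrastructure Management server install directory>\integrations\ibrsd\conf\IBRSD.dir"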

Database is not coming up after applying the service pack or fix pack to the primary cluster node

To resolve the issue, perform the following steps:

  1. Immediately after applying the service pack or fix pack to the primary cluster node, stop the Infrastructure Management server on the primary cluster node.
  2. On the secondary cluster node, run the <Install_dir>\pw\pronto\bin\updateRegistryForSybase17.bat file to update the Sybase 17 registry settings for Windows.
  3. Restart the Infrastructure Management server on the secondary cluster node.
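
Step 2 modifies the Windows registry and typically requires an elevated Command Prompt; the following is a minimal sketch, assuming the <Install_dir> placeholder used above:

    REM Run on the secondary cluster node from an elevated Command Prompt
    cd /d "<Install_dir>\pw\pronto\bin"
    updateRegistryForSybase17.bat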

Automatic restart of Infrastructure Management server processes

In an application high-availability deployment of TrueSight Infrastructure Management, when any of the critical server processes becomes unavailable, a recovery action (restart or shutdown) is performed.

If the Infrastructure Management server is unable to establish connectivity with the database for a specific amount of time, both nodes are shut down. 

If required, configure the following parameters in the installedDirectory\pw\custom\conf\pronet.conf file: 

  • pronet.component.unavailability.recovery.action
    The recovery action that is performed if a critical process is unavailable.
    Note: This property does not affect the database connectivity check.
    Default value: shutdown (primary node); restart (secondary node)

  • pronet.ha.availability.scan.frequency.in.secs
    The polling interval (in seconds) of the critical process availability check.
    Default value: 60

  • pronet.ha.availability.max.retry.count
    The number of retries at the frequency specified by the pronet.ha.availability.scan.frequency.in.secs property. If a critical process is unavailable for this number of retries, the recovery action is performed.
    Default value: 6

  • pronet.availability.db.connection.max.retry.minutes
    The amount of time (in minutes) for which database connectivity is checked.
    Default value: 15

  • pronet.availability.db.connection.max.retry.count
    The number of retries for the database connectivity check.
    Default value: 15

  • pronet.component.unavailability.attempts.email.interval
    The interval (in minutes) at which an email is generated and sent to the configured administrator email address. To configure email settings, see Configuring e-mail settings to receive alerts.
    Default value: 2

Note

An upgrade to version 11.3.04 deletes the pronet.ha.availability.self.shutdown.mechanism property, which was previously used to control the recovery action.
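
For example, to change the recovery behavior and slow down the availability check, you could override the defaults in the custom configuration file; the values below are illustrative assumptions, not recommendations:

    # installedDirectory\pw\custom\conf\pronet.conf; illustrative overrides
    pronet.component.unavailability.recovery.action=restart
    pronet.ha.availability.scan.frequency.in.secs=120
    pronet.ha.availability.max.retry.count=10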

Rate process is unable to obtain monitor object data from the cache

During a failover, the Rate process is unable to fetch monitor object data from the cache and crashes. The following error message is displayed:

INFO  03/13 16:28:31 Rate                 [RateInit-1] 600002 Failed to execute initialization task. [ mPlatformClass=class com.proactivenet.api.sla.SLAPlatformImpl, mObject=null, mStaticClass=null, mMethodName=null mArgClasses=null, mArgs=null]
ERROR 03/13 16:28:31 Rate                 [JavaRate] 300073 Unable to initialize local MO Cache for Rate

Error Id: 300471

As a workaround, perform the following steps:

  1. Verify database connectivity by running the pw dbconfig list command.
  2. Ensure that the firewall/IPtables rules are not blocking the TCP communication between the HA nodes. For detailed port information, see Network-ports-for-a-high-availability-deployment-of-Infrastructure-Management.
  3. Ensure that the adapter priorities are correctly set if you are using a dual Network Interface Card (NIC).
  4. Add the desired IP address details in the hosts file (see the example entries after this list), or modify the setting in the ESX host file to assign a different label to the network adapter. Contact your administrator for more details.
  5. Disable the NICs that are not required.
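
The following is a minimal sketch of the hosts file entries from step 4; the IP addresses and host names are illustrative assumptions:

    # Windows: C:\Windows\System32\drivers\etc\hosts (Linux: /etc/hosts)
    # Map each HA node name to the IP address it should resolve to
    10.0.0.11    tsim-primary.example.com    tsim-primary
    10.0.0.12    tsim-secondary.example.com  tsim-secondary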

If the above workaround does not resolve the issue, do the following:

  1. Stop the primary Infrastructure Management server.
  2. Stop the secondary Infrastructure Management server.
  3. Perform the following first on the primary and then on the secondary server:
    1. Go to the <TrueSight Infrastructure Management Install Directory>\pw\pronto\conf directory.
    2. Take a backup of the federatedcacheserver-transport-tcp.xml file.
    3. Edit the federatedcacheserver-transport-tcp.xml file, and replace the host names with their IP addresses in the following code lines:

      #Original code
      initial_hosts="server1.bmc.com[10590],server2.bmc.com[10590]"
      jgroups.tcp.address:server1.bmc.com

      #Modified code
      initial_hosts="<IP address of server1.bmc.com>[10590],<IP address of server2.bmc.com>[10590]"
      jgroups.tcp.address:<IP address of server1.bmc.com>
    4. Save the federatedcacheserver-transport-tcp.xml file.
  4. Start the primary Infrastructure Management server. 
  5. Start the secondary Infrastructure Management server. 

Cell databases are out of sync

In an application high-availability deployment of TrueSight Infrastructure Management, a Critical event is generated when the cell databases (MCDBs) are out of sync.

The MCDBs may go out of sync in the following scenarios:

  • When the nodes recover from a split-brain condition.
  • When the secondary server is down for a long time, and the Infrastructure Management processes on the primary server have been restarted multiple times.

Error ID: 101387

As a workaround, perform the following steps:

  1. Stop the primary and the secondary Infrastructure Management servers.
  2. On the secondary server, delete the content from the <Installation Directory>\TrueSight\pw\server\var\<secondary_cell_name> directory.
  3. Copy the content from the <Installation Directory>\TrueSight\pw\server\var\<primary_cell_name> directory of the primary server to the <Installation Directory>\TrueSight\pw\server\var\<secondary_cell_name> directory of the secondary server.
  4. Start the primary server.
  5. Ensure that the primary server processes are up and running.
  6. Start the secondary server.
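
A minimal sketch of steps 2 and 3 using robocopy, assuming the primary server's var directory is reachable from the secondary server over an administrative share (the share path is a placeholder):

    REM Run on the secondary server; /MIR mirrors the source, removing stale content first
    robocopy "\\<primary_host>\<share>\TrueSight\pw\server\var\<primary_cell_name>" "<Installation Directory>\TrueSight\pw\server\var\<secondary_cell_name>" /MIR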

Cell database status cannot be determined

In an application high-availability deployment of TrueSight Infrastructure Management, the cell database (MCDB) sync status cannot be determined, so it is not possible to conclude whether the MCDBs are in sync or out of sync. The following message is displayed in the <TrueSight Infrastructure Management Install Directory>\pw\pronto\logs\MCDBMismatch.log file:

INFO 12/17 14:37:11 MCDBMismatchDetect [Thread-72] 600002 tsimCellName=pncell_clm-pun-tjd61f valPNCELL=pncell_clm-pun-tjd61f
 INFO 12/17 14:37:32 MCDBMismatchDetect [Thread-72] 600002 getEventCount attempt=1 Not able to execute command:-20191217 143732.186000 mquery: MCLI: BMC_TS-IMC300011E: Could not connect to Cell pncell_clm-pun-tjd61f#2
 INFO 12/17 14:37:32 MCDBMismatchDetect [Thread-72] 600002 isEventCountMatchingCould not confirm if there is event count mismatch. primaryEventCount=161 secondaryEventCount=-1
 INFO 12/17 14:37:52 MCDBMismatchDetect [Thread-72] 600002 getDataCount attempt=1 Not able to execute command:-20191217 143752.396000 mquery: MCLI: BMC_TS-IMC300011E: Could not connect to Cell pncell_clm-pun-tjd61f#2
 INFO 12/17 14:37:52 MCDBMismatchDetect [Thread-72] 600002 isDataCountMatching Could not confirm if there is Data count mismatch. primaryDataCount=1568 secondaryDataCount=-1
 INFO 12/17 14:37:52 MCDBMismatchDetect [Thread-72] 600002 executeDetectionMCDBMismatch primaryDataCount= 1568 secondaryDataCount=-1 primaryEventCount=161 secondaryEventCount=-1 Event count and data count could not be determined

As a workaround, check the primary and secondary cell status and ensure that both cells are up and running. The MCDB sync status is computed again after an interval of 70 minutes.
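
To check the cell status, you can use the server and cell command-line utilities; the following is a minimal sketch that assumes the mquery utility shown in the log above is available and uses the instance names from that log (exact flags can vary by version):

    REM Check that the server processes, including the cells, are running
    pw system status

    REM Query each cell instance; a response confirms that the cell is reachable
    mquery -n pncell_<hostname> -a EVENT -s mc_ueid
    mquery -n "pncell_<hostname>#2" -a EVENT -s mc_ueid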

To reconfigure the secondary server

If you need to deploy and use a different secondary server instead of the original one, perform the following steps:

  1. Install Infrastructure Management on the new secondary server.
  2. Stop the primary server and the new secondary server.
  3. On the primary server, edit the following files and replace the original secondary server host name with the new secondary server host name (a verification sketch follows these steps):
    • installedDirectory\pw\pronto\tmp\ha-generated.conf
    • installedDirectory\pw\server\etc\mcell.dir
    • installedDirectory\pw\pronto\data\admin\admin.dir
    • installedDirectory\pw\pronto\conf\federatedcacheserver-transport-tcp.xml
    • installedDirectory\pw\pronto\conf\clustermanager.xml
    • installedDirectory\pw\pronto\conf\cell_info.list
    • installedDirectory\pw\integrations\ibrsd\conf\IBRSD.dir
    • installedDirectory\pw\custom\conf\ha.conf
  4. Copy the updated installedDirectory/TrueSight/pw/pronto/tmp/ha-generated.conf file from the primary server to the new secondary server.
  5. On the new secondary server, run the following command:
    pw ha enable standby file=<location of ha-generated.conf file>
  6. Copy the following folder from the primary server to the new secondary server:
    installedDirectory/TrueSight/pw/server/var/pncell_<hostname>#1
  7. Rename the copied folder on the new secondary server to pncell_<hostname>#2
  8. Copy the installedDirectory/TrueSight/pw/pronto/tmp/addremoteagent file from the primary server to the new secondary server.
  9. Start the primary server.
  10. On the primary server, run the addremoteagent file.
  11. Start the secondary server, after the primary server is up and running.
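
After completing step 3, you can verify that none of the edited files still reference the original secondary server; a quick sketch using the Windows findstr utility, with <old_secondary_host> as a placeholder for the original host name:

    REM List any files under pw that still mention the old secondary host name
    findstr /s /i /m /c:"<old_secondary_host>" "installedDirectory\pw\*"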

TrueSight Infrastructure Management performs a failover as the primary cell stops responding

In a TrueSight Infrastructure Management high-availability deployment, if the primary cell fails or stops responding, the secondary cell becomes active. After a while, the Infrastructure Management server performs the failover. This may cause the cell database (MCDB) to go out of sync.

You can disable automatic failover for the cell by setting an mcell configuration parameter. If you set the CellDuplicateAutoFailOver parameter to NO and the primary cell fails, automatic cell failover does not happen. Instead, the cell waits for the agent controller process to perform the failover.

To disable the automatic cell failover, do the following on both primary and secondary Infrastructure Management servers:

  1. Edit the <Infrastructure Management Installation Directory>\pw\server\etc\<cellname>\mcell.conf file.
  2. Set the following configuration parameter to NO:
    CellDuplicateAutoFailOver=NO
  3. Restart the cell process.

The HA nodes experience a split-brain condition

A split-brain condition occurs when the HA nodes can no longer communicate with each other. Each node assumes that the other node is non-functional and tries to take over the active node role.

The following Infrastructure Management server properties help to prevent and recover from a split-brain condition.

To set the properties, edit the installedDirectory\pw\custom\conf\pronet.conf file on the secondary server only and make the required changes:

Split-brain condition prevention

  • pronet.ha.split.brain.prevention.support
    Controls split-brain condition prevention support. If set to false, the Active-Active node prevention operation does not occur.
    Default value: true
    Valid values: true, false

  • pronet.ha.split.brain.prevention.max.retry.count
    The number of retries to check the remote node status. If the remote node status is not identified as active within this number of retries, the current node becomes active. If the remote node status is identified as active, the current node continues as the standby node.
    Default value: 6
    Valid values: any positive integer less than 20

  • pronet.ha.split.brain.prevention.scan.frequency.in.secs
    The frequency (in seconds) at which the monitoring service checks the remote node status.
    Default value: 30
    Valid values: any positive integer

Split-brain condition recovery

  • pronet.ha.split.brain.recovery.support
    Controls split-brain condition recovery support. If set to false, the Active-Active node recovery operation does not occur and the HA deployment stays in the split-brain condition.
    Default value: true
    Valid values: true, false

  • pronet.ha.split.brain.recovery.notification.support
    Controls notifications (Self-health monitoring event and email) when a split-brain condition is identified.
    Default value: true
    Valid values: true, false

  • pronet.ha.split.brain.recovery.scan.frequency.in.secs
    The frequency (in seconds) at which the monitoring service checks for a split-brain condition.
    Default value: 30
    Valid values: any positive integer
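
For example, to make the monitoring service check the remote node status more frequently, you could override these properties on the secondary server; the values below are illustrative assumptions, not recommendations:

    # installedDirectory\pw\custom\conf\pronet.conf (secondary server only); illustrative values
    pronet.ha.split.brain.prevention.support=true
    pronet.ha.split.brain.prevention.max.retry.count=10
    pronet.ha.split.brain.prevention.scan.frequency.in.secs=15
    pronet.ha.split.brain.recovery.scan.frequency.in.secs=15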

Errors and warnings are displayed when failback occurs

When a failback occurs, errors and warnings related to ActiveMQ are displayed in the TrueSight.log file.

As a workaround, perform the following steps:

  1. Back up the following files:
    installedDirectory\wildfly\standalone\deployments\activemq-rar.rar.deployed
    installedDirectory\wildfly\standalone\deployments\activemq-rar.rar
  2. As per the node status (active or standby), copy the appropriate file to the installedDirectory\wildfly\standalone\deployments folder:
    installedDirectory\Active\activemq-rar.rar
    installedDirectory\StandBy\activemq-rar.rar
  3. Restart the Infrastructure Management server.
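
For example, on the node that is currently active, step 2 might look like the following sketch; the paths assume the folder layout shown above:

    REM On the active node, deploy the Active variant of the RAR file
    copy /y "installedDirectory\Active\activemq-rar.rar" "installedDirectory\wildfly\standalone\deployments\activemq-rar.rar"

On the standby node, use installedDirectory\StandBy\activemq-rar.rar as the source instead.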

The Rate process does not start on any of the HA nodes

Due to cache replication issues, the Rate process may not start on any of the HA nodes. A null-pointer exception is also displayed in the installedDirectory\pw\pronto\logs\TrueSight.log file.

As a workaround, perform the following steps:

  1. Stop the standby node first and then stop the active node.
  2. Back up the federatedcacheserver-transport-tcp.xml and clustermanager.xml files in the installedDirectory\pw\pronto\conf folder on both nodes.
  3. Edit the installedDirectory\pw\pronto\conf\federatedcacheserver-transport-tcp.xml file, and update the timeout value of the FD and VERIFY_SUSPECT elements to 30000 and the max_tries value of the FD element to 9:

    <FD max_tries="9" timeout="30000"/>
    <VERIFY_SUSPECT timeout="30000"/>

  4. Edit the installedDirectory\pw\pronto\conf\clustermanager.xml file, and update the timeout value of the FD and VERIFY_SUSPECT elements to 30000, the max_tries value of the FD element to 9, and the num_msgs value of the VERIFY_SUSPECT element to 5:

    <FD max_tries="9" timeout="30000"/>
    <VERIFY_SUSPECT timeout="30000" num_msgs="5"/>

  5. Ensure that you perform steps 3 and 4 on both nodes.
  6. Start the previously active node first.
  7. After the node is up and running and you are able to access the operator console, restart the other node. 

The Rate process crashes

The Rate process crashes during a failback. Check whether the installedDirectory\pw\pronto\logs\Rate.log file contains the following error message:

Caused by: org.infinispan.client.hotrod.exceptions.HotRodClientException:Request for messageId=<message ID #> returned server error (status=0x86): org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request <request #> 

As a workaround, perform the following steps:

  1. Edit the installedDirectory\pw\pronto\conf\federatedcacheserver.xml file and update the acquire-timeout value to 90000 milliseconds (90 seconds). 

    <acquire-timeout="90000"/>
  2. Restart the Infrastructure Management server processes.

Failback fails because the server is not completely initialized

Failback fails and the JServer and Rate processes are not running. The TrueSight.log file contains the following:

Naming Exception while getting Topic Session.
javax.naming.NameNotFoundException: jboss/exported/ConnectionFactory – service jboss.naming.context.java.jboss.exported.jboss.exported.ConnectionFactory
at org.jboss.as.naming.ServiceBasedNamingStore.lookup(ServiceBasedNamingStore.java:106)
at org.jboss.as.naming.NamingContext.lookup(NamingContext.java:207)
at org.jboss.as.naming.NamingContext.lookup(NamingContext.java:184)
at org.jboss.naming.remote.protocol.v1.Protocol$1.handleServerMessage(Protocol.java:127)
at org.jboss.naming.remote.protocol.v1.RemoteNamingServerV1$MessageReciever$1.run(RemoteNamingServerV1.java:73)

As a workaround, perform the following steps:

  1. Update the hosts file to ensure that localhost resolves to the IP address 127.0.0.1.
  2. If step 1 does not fix the issue, edit the installedDirectory\pw\pronto\conf\pronet.conf and installedDirectory\pw\custom\conf\pronet.conf files and check the value of the java.naming.provider.url property.
    If it is set to localhost, change it to 127.0.0.1.
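
A minimal sketch of the hosts file entry from step 1; on Windows the file is typically C:\Windows\System32\drivers\etc\hosts, and on Linux it is /etc/hosts:

    # Ensure that localhost resolves to the loopback address
    127.0.0.1    localhost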

mcell.dir is blank after a failover when the Infrastructure Management disk is full

Note

This problem is observed only when the Presentation Server is in HA mode.

Perform the following steps to recover the mcell.dir file from the backup folder:

  1. Navigate to the pw/pronto/tmp directory and open the mcell_dir_org_backup file.
  2. Navigate to pw/server/etc/ and copy the information from the mcell_dir_org_backup file to mcell.dir.
  3. Update the current active node of the Presentation Server in the mcell.dir file.
  4. Stop and start the Infrastructure Management server.
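
A minimal sketch of the restore in steps 1 and 2 as a single command, assuming the relative paths above live under the Infrastructure Management install directory:

    REM Restore mcell.dir from the backup file kept in pw\pronto\tmp
    copy /y "installedDirectory\pw\pronto\tmp\mcell_dir_org_backup" "installedDirectory\pw\server\etc\mcell.dir"

You still need to update the current active Presentation Server node in mcell.dir (step 3) before restarting the server.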

Related topics

Troubleshooting

Troubleshooting a Presentation Server high-availability deployment
