Troubleshooting a high-availability deployment

After installation, the Secondary server login UI is not displayed

In a high availability deployment, when you install or upgrade the Secondary server, if the Remedy Single Sign-On server is stopped, the login UI for the Secondary server is not displayed after installation. 

To work around this issue, after you install or upgrade the Secondary server, restart the Remedy Single Sign-On server.

TrueSight Presentation Server process does not restart

Whenever a TrueSight Presentation Server process stops, it does not restart successfully.

As a workaround, stop and restart the TrueSight Presentation Server.

Run the following commands from the installedDirectory\truesightpserver\bin directory:

tssh server stop
tssh server start

The secondary server fails to start after installation


The secondary server might fail to start if the primary and secondary nodes cannot communicate over the database port (default 5432).

Check whether the <installedDirectory>\truesightpserver\logs\db-remote-copy.log file contains the following error:
pg_basebackup: could not connect to server: FATAL: no pg_hba.conf entry for replication connection from host "<IP Address>", user "proact", SSL off

If it does, perform the following steps:

  1. Check the hosts file entries and ensure that the FQDN is listed before the short hostname.
    For example: 10.0.2.8 hostname.bmc.com hostname.localdomain hostname
  2. Check the <installedDirectory>\truesightpserver\data\pgsql\pg_hba.conf file on the primary Presentation Server.
    Check for entries like the following examples and verify that the hostnames are correct.
    hostnossl replication proact abc.bmc.com trust
    hostnossl replication proact abcd.bmc.com trust
  3. Test that the secondary server can connect to the primary server on port 5432. You can do this by using telnet.
  4. Ensure that the entries in the pg_hba.conf file resolve to the preferred IP address and that DNS lookup works from each server to the other.
  5. If the issue is not resolved, stop the Presentation Server services on the secondary server and run the following command:
    "%TRUESIGHTPSERVER_HOME%\truesightpserver\modules\pgsql\bin"\pg_basebackup.exe -h PrimaryServerFQDN -p 5432 -D "%TRUESIGHTPSERVER_HOME%\data-test" -U proact -v -P -w -x
    Review the error output to identify the cause.
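
The hosts file check in step 1 and the port test in step 3 can also be scripted. The following Python sketch is one possible approach and is not part of the product: the function names fqdn_listed_first and can_connect are illustrative. It uses only the Python standard library.

```python
import socket


def fqdn_listed_first(hosts_line: str, short_name: str) -> bool:
    """Return True if an FQDN for short_name appears before the short name.

    hosts_line is a single hosts file entry: an IP address followed by names.
    """
    # Drop any trailing comment, then drop the leading IP address.
    names = hosts_line.split("#", 1)[0].split()[1:]
    if short_name not in names:
        return False
    before = names[: names.index(short_name)]
    # Treat any name of the form "<short_name>.<domain>" as an FQDN.
    return any(name.startswith(short_name + ".") for name in before)


def can_connect(host: str, port: int = 5432, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Run can_connect on the secondary server with the FQDN of the primary server as the host argument. A False result only shows that the TCP connection failed. It does not distinguish between a firewall rule, a stopped database, or a name resolution problem.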

After upgrading, the primary server does not come up

After a Presentation Server high-availability upgrade, the primary server might not come up.

This can happen if you stop the primary server before it has come up completely and then start the secondary server upgrade.

Perform the following steps:

  1. On the primary server, delete the nodeStatusPersistence.conf file in the installedDirectory\truesightpserver\conf\ha folder.
  2. Start the primary server.
    tssh server start

  3. Verify that the primary server is up completely. It should be the active node.
    tssh ha status

  4. Start the secondary server.
    tssh server start

After a failover or failback, the database service does not come up automatically

After a failover or failback occurs, the database service does not come up automatically.

As a workaround, perform the following steps:

  1. Increase the value of the wrapper.startup.timeout property in the installedDirectory\truesightpserver\conf\services\svc.conf file.
    The default value is 100.
  2. Run the following commands from the installedDirectory\truesightpserver\bin directory to restart the TrueSight Presentation Server:
    tssh server stop
    tssh server start
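
The edit in step 1 can be made by hand or scripted. The following Python sketch assumes that svc.conf uses one key=value entry per line, as in the wrapper.* entries quoted elsewhere on this page. The function name set_property and the value 300 are illustrative only. Choose a value that suits your environment.

```python
import re


def set_property(conf_text: str, key: str, value: str) -> str:
    """Set key=value in properties-style text, appending the key if absent."""
    # Match an uncommented "key=..." entry at the start of a line.
    pattern = re.compile(rf"^([ \t]*{re.escape(key)}[ \t]*=).*$", re.MULTILINE)
    if pattern.search(conf_text):
        # Keep the "key=" part and replace only the value.
        return pattern.sub(lambda m: m.group(1) + value, conf_text, count=1)
    separator = "" if conf_text.endswith("\n") or not conf_text else "\n"
    return f"{conf_text}{separator}{key}={value}\n"


sample = "wrapper.app.parameter.1=start\nwrapper.startup.timeout=100\n"
updated = set_property(sample, "wrapper.startup.timeout", "300")
```

Back up the svc.conf file before you write the updated text to it.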

The HA nodes experience a split-brain condition

A split-brain condition occurs when the HA nodes can no longer communicate with each other. Each node assumes that the other node is non-functional and tries to take over the active node role.

The following TrueSight Presentation Server tssh properties help to prevent and recover from a split-brain condition.

Split-brain condition prevention

| Property name | Property details | Default value | Valid values |
|---|---|---|---|
| ha.split.brain.prevention.support | Controls the split-brain condition prevention support. If set to false, the Active-Active node prevention operation does not occur. | true | true, false |
| ha.split.brain.prevention.max.retry.count | The number of retries to check for the remote node status. If the remote node status is not identified as active within the number of retries, the current node becomes active. If the remote node status is identified as active, the current node continues as the standby node. | 5 | Any positive integer |
| ha.split.brain.prevention.scan.frequency.in.secs | The frequency (in seconds) at which the monitoring service checks the remote node status. | 30 | Any positive integer |

Split-brain condition recovery

| Property name | Property details | Default value | Valid values |
|---|---|---|---|
| ha.split.brain.recovery.support | Controls the split-brain condition recovery support. If set to false, the Active-Active node recovery operation does not occur and the HA deployment stays in the split-brain condition. | true | true, false |
| ha.split.brain.recovery.action | Controls the auto-recovery actions after a split-brain condition is identified. restart: Restarts the most recent active node. stop: Stops the most recent active node. | restart | restart, stop |
| ha.split.brain.recovery.notification.support | Controls notifications (Self-health monitoring event and email) when a split-brain condition is identified. To configure the email settings, see Monitoring the Presentation Server environment. | true | true, false |
| ha.split.brain.recovery.max.retry.count | The number of retries to decide if a split-brain condition has occurred. | 3 | Any positive integer |
| ha.split.brain.recovery.scan.frequency.in.secs | The frequency (in seconds) at which the monitoring service checks for a split-brain condition. | 30 | Any positive integer |

To set the properties, perform the following steps from the installationDirectory\truesightpserver\bin folder:

  1. Run the tssh properties set <property name> <property value> command to set the properties.
  2. Run the tssh properties reload command to reload the properties.
  3. If you updated the ha.split.brain.prevention.support or ha.split.brain.recovery.support properties, restart the TrueSight Presentation Server services:
    tssh server stop
    tssh server start

Note

If the standby node is unable to ping the active node, it takes approximately 5 to 12 minutes to become active.
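
Before you run the commands, you can check the new values against the valid values listed for the split-brain properties. The following Python sketch builds the command sequence from the procedure above. The names RULES and build_commands are illustrative and are not part of tssh. The sketch only produces the command strings and does not run them.

```python
BOOLEAN = frozenset({"true", "false"})
POSITIVE_INT = "positive integer"

# Valid values as listed for the split-brain properties on this page.
RULES = {
    "ha.split.brain.prevention.support": BOOLEAN,
    "ha.split.brain.prevention.max.retry.count": POSITIVE_INT,
    "ha.split.brain.prevention.scan.frequency.in.secs": POSITIVE_INT,
    "ha.split.brain.recovery.support": BOOLEAN,
    "ha.split.brain.recovery.action": frozenset({"restart", "stop"}),
    "ha.split.brain.recovery.notification.support": BOOLEAN,
    "ha.split.brain.recovery.max.retry.count": POSITIVE_INT,
    "ha.split.brain.recovery.scan.frequency.in.secs": POSITIVE_INT,
}

# Changing either of these properties requires a server restart (step 3).
RESTART_REQUIRED = {
    "ha.split.brain.prevention.support",
    "ha.split.brain.recovery.support",
}


def build_commands(settings: dict) -> list:
    """Validate the settings and return the tssh commands to run, in order."""
    commands = []
    for name, value in settings.items():
        if name not in RULES:
            raise KeyError(f"unknown property: {name}")
        rule = RULES[name]
        if rule == POSITIVE_INT:
            valid = value.isascii() and value.isdigit() and int(value) > 0
        else:
            valid = value in rule
        if not valid:
            raise ValueError(f"invalid value {value!r} for {name}")
        commands.append(f"tssh properties set {name} {value}")
    commands.append("tssh properties reload")
    if RESTART_REQUIRED.intersection(settings):
        commands += ["tssh server stop", "tssh server start"]
    return commands
```

For example, build_commands({"ha.split.brain.recovery.action": "stop"}) returns the set command followed by the reload command, and raises ValueError for a value such as "pause".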

Startup sequence of the HA nodes

If both HA nodes are down and then restarted, automatic checks for the restart sequence ensure that there is no data loss. The following properties help configure this functionality:

| Property name | Property details | Default value | Valid values |
|---|---|---|---|
| ha.nodes.status.persistence.support | Controls the node persistence support. If set to false, the last active node check does not occur. See To disable the node persistence support. | true | true, false |
| ha.nodes.status.persistence.scan.frequency.in.minutes | The update interval (in minutes) for the nodeStatusPersistence.conf file that stores the node status. | 10 | Any positive integer |
| ha.node.max.downtime.minutes | The last active node startup time check (in minutes) | 45 | Any positive integer |
| ha.nodes.status.persistence.email.support | Generates repeat email reminders when the standby node is down. | true | true, false |
| ha.nodes.status.persistence.email.frequency.in.minutes | Time interval (in minutes) for the repeat email reminders. | 60 | Any positive integer |

Note: All properties except ha.nodes.status.persistence.email.support require a restart of the Presentation Server to take effect.

To set the properties, perform the following steps from the installationDirectory\truesightpserver\bin folder:

  1. Stop the TrueSight Presentation Server services on the standby node:
    tssh server stop
  2. From the installationDirectory\truesightpserver\bin folder on the active node, run the following command to set the properties:
    tssh properties set <property name> <property value>
  3. Restart the TrueSight Presentation Server services on the active node:
    tssh server start
  4. Start the TrueSight Presentation Server services on the standby node:
    tssh server start


The following startup sequences are handled:

Sequence-1: The standby node is started first. The following message is displayed:

This node was in Standby mode earlier. Currently, the last active node <node details> down. Start the <node details> node first and then start this node.

Use the skipHAcheck option to start this node as active node.

Usage: tssh server start skipHAcheck.

This prevents accidentally starting the standby node before the active node. If you want to start the standby node, use the skipHAcheck option.


Sequence-2: The active node is started first but it has been down for longer than the time configured in the ha.node.max.downtime.minutes property. The following message is displayed:

This node is down for more than 45 minutes. Ensure that this node is the last active node by checking the file C:\Program Files\BMC Software\TrueSightPServer\truesightpserver\conf\ha\nodeStatusPersistence.conf on both nodes <node details>. Use the skipHAcheck option on the last active node.

Usage: tssh server start skipHAcheck.

On both nodes, check the nodeStatusPersistence.conf file for the property ha.current.node.status. Start the node that has the value Active. If both nodes show as Active, compare the ha.current.node.time.stamp field. The node that has the most recent time is the last active node and needs to be restarted first.

After you identify the last active node, restart it by using the skipHAcheck option.
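
The comparison described above can be sketched in Python as follows. This page does not document the layout of the nodeStatusPersistence.conf file, so the sketch rests on two assumptions: the file contains one key=value entry per line, and the ha.current.node.time.stamp value is a number that increases over time. If your files differ, adjust the parsing accordingly. The function names are illustrative.

```python
def parse_status(conf_text: str) -> dict:
    """Parse key=value lines, ignoring blank lines and # comments.

    Assumption: nodeStatusPersistence.conf uses this properties-style layout.
    """
    props = {}
    for line in conf_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()
    return props


def last_active_node(nodes: dict) -> str:
    """Return the label of the last active node.

    nodes maps a node label to the text of that node's
    nodeStatusPersistence.conf file.
    """
    parsed = {label: parse_status(text) for label, text in nodes.items()}
    active = [
        label
        for label, props in parsed.items()
        if props.get("ha.current.node.status", "").lower() == "active"
    ]
    if not active:
        raise ValueError("no node reports the Active status")
    if len(active) == 1:
        return active[0]
    # Assumption: the time stamp is numeric, so the largest value is the latest.
    return max(active, key=lambda n: float(parsed[n]["ha.current.node.time.stamp"]))
```

Start the node that the function returns by using the skipHAcheck option.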

To disable the node persistence support

  1. Stop the TrueSight Presentation Server services on the standby node:
    tssh server stop
  2. From the installationDirectory\truesightpserver\bin folder on the active node, run the following command to disable node persistence support:
    tssh properties set ha.nodes.status.persistence.support false
    tssh properties reload
  3. Restart the TrueSight Presentation Server services on the active node:
    tssh server stop
    tssh server start
  4. On the standby node, delete the following file:
    installationDirectory\truesightpserver\conf\ha\nodeStatusPersistence.conf
  5. Start the TrueSight Presentation Server services on the standby node:
    tssh server start

To start the server using the skipHAcheck option

  • From the command line: Run the tssh server start skipHAcheck command.
  • Using services:
    • (Linux) Run the following command from /etc/init.d
      service BMCTSPSSvc start skipHAcheck
    • (Windows) Add the following parameter in the installedDirectory\truesightpserver\conf\services\svc.conf file and start the service:
      ### Application properties ###
      wrapper.java.app.mainclass=com.bmc.truesight.api.install.services.ServicesWrapper
      wrapper.app.parameter.1=start
      wrapper.app.parameter.2=skipHAcheck

Out of memory issue in a Linux high-availability deployment

In a Linux high-availability deployment, the following error message appears in the TrueSight.log and indexserver.log files:

java.lang.OutOfMemoryError: unable to create new native thread

This might be caused by the limit on the number of user processes. Perform the following steps:

  1. Run the following command to obtain the CSR and Index Server process ID:
    tssh server status
  2. Run the following command to determine the thread usage:
    ps -p processID -lfT | wc -l
  3. Run the following command to check the max user processes (default is 2048):
    ulimit -a
  4. As the root user, edit the /etc/security/limits.conf file and update the nproc value to 4096.
    If the entry does not exist, add the following entries:

    <TSPS_OWNER_NAME> soft nproc 4096
    <TSPS_OWNER_NAME> hard nproc 4096

  5. Restart the Presentation Server.
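
To confirm the edit in step 4, you can check the limits.conf entries for the Presentation Server owner. The following Python sketch reports which nproc limit types are missing or lower than the required value. The function name nproc_gaps is illustrative. The sketch reads only entries that name the owner directly and ignores wildcard entries, group entries, and files under /etc/security/limits.d.

```python
# Values that limits.conf accepts to mean "no limit".
UNLIMITED = {"unlimited", "infinity", "-1"}


def nproc_gaps(limits_text: str, owner: str, wanted: int = 4096) -> list:
    """Return the limit types ("soft", "hard") that are missing or below wanted."""
    values = {}
    for line in limits_text.splitlines():
        # Entries have four fields: <domain> <type> <item> <value>
        fields = line.split("#", 1)[0].split()
        if len(fields) == 4 and fields[0] == owner and fields[2] == "nproc":
            # A type of "-" sets both the soft and the hard limit.
            types = ("soft", "hard") if fields[1] == "-" else (fields[1],)
            for limit_type in types:
                values[limit_type] = fields[3]
    gaps = []
    for limit_type in ("soft", "hard"):
        value = values.get(limit_type)
        if value is None or (value not in UNLIMITED and int(value) < wanted):
            gaps.append(limit_type)
    return gaps
```

An empty list means that both the soft and hard nproc entries for the owner are at least 4096.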

System sluggish or crashes in a Linux high-availability deployment

In a Linux high-availability deployment, the system might become sluggish or crash, and the following errors might appear frequently in the log files even if both nodes are up.

indexserver.log

java.util.concurrent.ExecutionException: RemoteTransportException[[4lVQt-D][127.0.0.1:9300][cluster:monitor/health]]; nested: MasterNotDiscoveredException;

at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:262)

embedcomponents.log

org.infinispan.util.concurrent.TimeoutException: Timed out waiting for

Solution

  1. On each node, edit the /etc/sysctl.conf file and add the following lines to the end of the file:
    net.ipv4.tcp_keepalive_time=600
    net.ipv4.tcp_keepalive_intvl=60
    net.ipv4.tcp_keepalive_probes=20
  2. Stop the secondary server and then stop the primary server.

  3. Restart the primary server.

  4. After you are able to log in to the TrueSight console on the primary server, restart the secondary server.
  5. Access the TrueSight console on the secondary server and ensure that the standby message is displayed.
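
The effect of the three keepalive settings in step 1 can be worked out with simple arithmetic. On Linux, the first keepalive probe is sent after a connection has been idle for tcp_keepalive_time seconds. Further probes follow every tcp_keepalive_intvl seconds, and the connection is dropped after tcp_keepalive_probes unanswered probes. The following Python sketch computes the resulting detection time. The function name is illustrative.

```python
def keepalive_detection_seconds(time_s: int, intvl_s: int, probes: int) -> int:
    """Seconds until an idle, unresponsive TCP peer is declared dead."""
    return time_s + intvl_s * probes


# Values from step 1: 600 + 60 * 20
print(keepalive_detection_seconds(600, 60, 20))   # 1800 seconds (30 minutes)

# Common Linux defaults (7200, 75, 9), shown for comparison
print(keepalive_detection_seconds(7200, 75, 9))   # 7875 seconds
```

With the values from step 1, an unresponsive connection between the nodes is detected after 30 minutes of idle time instead of more than two hours.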

Issues with replication

Replication error occurs

After recovering from a split-brain condition, or after frequent restarts of nodes, a replication error might occur.

  1. Check whether the <installedDirectory>\truesightpserver\logs\indexserver.log file contains an entry like the following:
    Cluster health status changed from [GREEN] to [YELLOW] 
  2. If it does, restart the standby node.
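
The log check in step 1 can be scripted. The following Python sketch searches log text for the entry quoted above and returns the matching lines. The function name find_yellow_entries is illustrative, and the sketch matches only the exact wording shown in step 1.

```python
import re

YELLOW_ENTRY = re.compile(
    r"Cluster health status changed from \[GREEN\] to \[YELLOW\]"
)


def find_yellow_entries(log_text: str) -> list:
    """Return the log lines that report a GREEN to YELLOW health change."""
    return [line for line in log_text.splitlines() if YELLOW_ENTRY.search(line)]
```

Pass the contents of the indexserver.log file to the function. A non-empty result indicates that the standby node should be restarted.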

Elasticsearch replication takes a long time

When one of the nodes is down for a long time and is restarted, the Elasticsearch replication might take a long time. The time taken depends on the Elasticsearch data load and network bandwidth.

Increase the values of the following parameters in the <installedDirectory>/truesightpserver/modules/elasticsearch/config/elasticsearch.yml file. The default values are provided below:

indices.recovery.max_bytes_per_sec: 200mb
cluster.routing.allocation.node_concurrent_recoveries: 5
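
To judge whether the recovery rate limit matters in your environment, you can estimate a lower bound for the replication time. The following Python sketch divides the data size by the configured rate. It assumes that the rate limit is the only constraint and that Elasticsearch size units are binary (1 mb = 1024 * 1024 bytes). The 50 GB data size is a hypothetical example. Actual replication takes longer when network bandwidth or disk speed is the bottleneck.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def min_recovery_seconds(data_bytes: int, max_bytes_per_sec: int) -> float:
    """Lower bound for recovery time when the rate limit is the only constraint."""
    return data_bytes / max_bytes_per_sec


# Hypothetical 50 GB of index data at the default limit of 200mb
print(min_recovery_seconds(50 * GB, 200 * MB))  # 256.0
```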

The standby node continuously acquires and releases the leadership lock

Even though the active node is running, the standby node continuously acquires and releases the leadership lock. The following log messages are displayed on the standby node:

INFO [HA-LockAcquiringThread] c.b.t.p.h.LeaderElection BMC_TS-PL000185I: tryAcquiringLeaderLock :: The current status of the remote node [host name] is :Active in attempt #1.

To resolve this issue, perform the following steps:

  1. Stop the secondary server by running the tssh server stop command.
  2. Stop the primary server by running the tssh server stop command.
  3. Start the primary server by running the tssh server start command.
  4. Run the following commands on the primary server:
    tssh properties set ha.split.brain.override.jgroups.lock.check true
    tssh properties reload
  5. Start the secondary server by running the tssh server start command.


Server processes are shut down when the Presentation Server is in the high-availability mode (applicable to TrueSight Presentation Server version 11.3.02 and later)

When the Presentation Server is running in the high-availability mode, the server processes shut down and the following error message is displayed:

ERROR 03/05 00:44:50.142 [Timer-2] c.b.t.p.c.p.e.ESClientConnection BMC_TS-PL000134E Exception while checking the Indexserver status
java.util.concurrent.ExecutionException: RemoteTransportException[[tNVsgXW][127.0.0.1:9300][cluster:monitor/health]]; nested: MasterNotDiscoveredException;
at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:262)
at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:249)
at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:91)


Caused by: org.infinispan.client.hotrod.exceptions.HotRodClientException:Request for messageId=<message ID

Error Id: BMC_TS-PL000134E

As a workaround, perform the following steps:

  1. Verify database connectivity between the Presentation Server nodes.
  2. Verify that the indexserver.log file contains valid IP addresses.
  3. Ensure that the adapter priorities are correctly set if you are using a dual Network Interface Card (NIC).

  4. Add the desired IP address details in the hosts file or modify the setting in the ESX host file to set a different label to the network adapter. Contact your administrator for more details.

  5. Disable the NICs that are not required.

If the above workaround doesn't resolve the issue, do the following:

  1. Stop the primary Presentation Server.
  2. Stop the secondary Presentation Server.
  3. Perform the following first on the primary and then on the secondary server:
    1. Go to the <TrueSight Presentation Server Install Directory>\truesightpserver\modules\elasticsearch\config directory.
    2. Take a backup of the elasticsearch.yml file.
    3. Edit the elasticsearch.yml file, and replace the host names with their IP addresses in the following code lines:

      #Original code
      discovery.zen.ping.unicast.hosts="server1.bmc.com[10590],server2.bmc.com[10590]
      network.publish_host:server1.bmc.com
       
       
      #Modified code
      discovery.zen.ping.unicast.hosts="<IP address of server1.bmc.com>[10590],<IP address of server2.bmc.com>[10590]
      network.publish_host:<IP address of server1.bmc.com>

    4. Save the elasticsearch.yml file.
  4. Restart the primary Presentation Server. 
  5. Restart the secondary Presentation Server. 
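
The host name replacement in the elasticsearch.yml file can be sketched in Python as follows. The sketch changes only the two settings named in the procedure and leaves all other lines untouched. The function name replace_hosts is illustrative. By default, the sketch resolves names with socket.gethostbyname, which returns a single IPv4 address. On a server with more than one NIC, confirm that the returned address is the one you intend to use.

```python
import socket
from typing import Callable, Iterable

# The two settings that the procedure above changes.
HOST_KEYS = ("discovery.zen.ping.unicast.hosts", "network.publish_host")


def replace_hosts(
    yml_text: str,
    hosts: Iterable[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> str:
    """Replace each host name with its IP address on the HOST_KEYS lines."""
    # Replace longer names first so that one name cannot match inside another.
    ordered = sorted(hosts, key=len, reverse=True)
    result = []
    for line in yml_text.splitlines(keepends=True):
        if line.lstrip().startswith(HOST_KEYS):
            for host in ordered:
                if host in line:
                    line = line.replace(host, resolve(host))
        result.append(line)
    return "".join(result)
```

Pass the contents of the backed-up elasticsearch.yml file and the FQDNs of both servers, review the returned text, and then save it to the file. To use fixed addresses instead of DNS lookups, pass a resolve function of your own.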

