Troubleshooting a high-availability deployment


After installation, the Secondary server login UI is not displayed

In a high-availability deployment, if the Remedy Single Sign-On server is stopped when you install or upgrade the Secondary server, the login UI for the Secondary server is not displayed after installation.

To work around this issue, after you install or upgrade the Secondary server, restart the Remedy Single Sign-On server.

Presentation Server process does not restart

Whenever a Presentation Server process stops, it does not restart successfully.

As a workaround, stop and restart the Presentation Server.

Run the following commands from the installedDirectory\truesightpserver\bin directory:

tssh server stop
tssh server start

After a failover or failback, the database service does not come up automatically

After a failover or failback occurs, the database service does not come up automatically.

As a workaround, perform the following steps:

  1. Increase the value of the wrapper.startup.timeout property in the installedDirectory\truesightpserver\conf\services\svc.conf file.
    The default value is 100.
  2. Run the following commands from the installedDirectory\truesightpserver\bin directory to restart the Presentation Server:
    tssh server stop
    tssh server start
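
The edit in step 1 can be sketched in Python as follows. This is a minimal illustration that assumes svc.conf stores properties as key=value lines; the helper name set_property, the sample text, and the value 300 are hypothetical, so choose a timeout that suits your environment and back up the file before changing it.

```python
import re


def set_property(conf_text: str, key: str, value: str) -> str:
    """Return conf_text with `key` set to `value`; append the line if it is absent."""
    # Match an uncommented "key = old_value" line and keep everything up to the value.
    pattern = re.compile(
        r"^([ \t]*" + re.escape(key) + r"[ \t]*=[ \t]*).*$", re.MULTILINE
    )
    if pattern.search(conf_text):
        return pattern.sub(lambda m: m.group(1) + value, conf_text, count=1)
    # Key not present: append it, adding a newline first if the text lacks one.
    separator = "" if conf_text.endswith("\n") or not conf_text else "\n"
    return f"{conf_text}{separator}{key}={value}\n"


# Hypothetical file content; only wrapper.startup.timeout comes from the steps above.
sample = "example.other.property=1\nwrapper.startup.timeout=100\n"
print(set_property(sample, "wrapper.startup.timeout", "300"))
```

To apply it to a real file, read the file contents into a string, pass them through set_property, and write the result back before restarting the Presentation Server.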

The HA nodes experience a split-brain condition

A split-brain condition occurs when the HA nodes can no longer communicate with each other. Each node assumes that the other node is non-functional and tries to take over the active node role.

The following TrueSight Presentation Server tssh properties help to prevent and recover from a split-brain condition.

Split-brain condition prevention

| Property name | Property details | Default value | Valid values |
|---|---|---|---|
| ha.split.brain.prevention.support | Controls the split-brain condition prevention support. If set to false, the Active-Active node prevention operation does not occur. | true | true, false |
| ha.split.brain.prevention.max.retry.count | The number of retries to check for the remote node status. If the remote node status is not identified as active within the number of retries, the current node becomes active. If the remote node status is identified as active, the current node continues as the standby node. | 5 | Any positive integer |
|  | The frequency (in seconds) at which the monitoring service checks the remote node status. | 30 | Any positive integer |
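
The retry behavior described for ha.split.brain.prevention.max.retry.count can be modeled with a short Python sketch. It illustrates the documented decision rule only and is not the product's implementation; decide_role and remote_is_active are hypothetical names, and the defaults mirror the default retry count (5) and check frequency (30 seconds) listed above.

```python
import time
from typing import Callable


def decide_role(
    remote_is_active: Callable[[], bool],
    max_retry_count: int = 5,
    interval_seconds: float = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return "standby" if the remote node is seen as active, else "active"."""
    for attempt in range(max_retry_count):
        if remote_is_active():
            # The remote node answered as active: keep the standby role.
            return "standby"
        if attempt < max_retry_count - 1:
            # Wait before the next status check.
            sleep(interval_seconds)
    # The remote node was never identified as active within the retries.
    return "active"


# Simulated run: the remote node never responds, and sleeping is skipped.
print(decide_role(lambda: False, sleep=lambda _seconds: None))
```

Passing the status check and the sleep function as parameters keeps the sketch testable without real waiting or network access.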

Split-brain condition recovery

| Property name | Property details | Default value | Valid values |
|---|---|---|---|
| ha.split.brain.recovery.support | Controls the split-brain condition recovery support. If set to false, the Active-Active node recovery operation does not occur and the HA deployment stays in the split-brain condition. | true | true, false |
| ha.split.brain.recovery.action | Controls the auto-recovery actions after a split-brain condition is identified. restart: Restarts the most recent active node. stop: Stops the most recent active node. | restart | restart, stop |
| ha.split.brain.recovery.notification.support | Controls notifications (Self-health monitoring event and email) when a split-brain condition is identified. To configure the email settings, see Monitoring the Presentation Server environment. | true | true, false |
| ha.split.brain.recovery.max.retry.count | The number of retries to decide if a split-brain condition has occurred. | 3 | Any positive integer |
|  | The frequency (in seconds) at which the monitoring service checks for a split-brain condition. | 30 | Any positive integer |
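
Before changing any of these properties, it can help to check a proposed value against the valid values listed above. The following Python sketch encodes only the properties that are named in this section; the names VALID_VALUES, POSITIVE_INTEGER, and is_valid are hypothetical and are not part of tssh.

```python
# Sentinel that marks properties accepting any positive integer.
POSITIVE_INTEGER = object()

# Valid values as listed in the prevention and recovery tables.
VALID_VALUES = {
    "ha.split.brain.prevention.support": {"true", "false"},
    "ha.split.brain.prevention.max.retry.count": POSITIVE_INTEGER,
    "ha.split.brain.recovery.support": {"true", "false"},
    "ha.split.brain.recovery.action": {"restart", "stop"},
    "ha.split.brain.recovery.notification.support": {"true", "false"},
    "ha.split.brain.recovery.max.retry.count": POSITIVE_INTEGER,
}


def is_valid(name: str, value: str) -> bool:
    """Return True if `value` is listed as valid for the property `name`."""
    rule = VALID_VALUES.get(name)
    if rule is None:
        # Unknown property: this sketch cannot vouch for it.
        return False
    if rule is POSITIVE_INTEGER:
        return value.isascii() and value.isdigit() and int(value) > 0
    return value in rule


print(is_valid("ha.split.brain.recovery.action", "stop"))
print(is_valid("ha.split.brain.recovery.max.retry.count", "0"))
```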

To set the properties, perform the following steps from the installedDirectory\truesightpserver\bin directory:

  1. Run the tssh properties set <property name> <property value> command to set the properties.
  2. Run the tssh properties reload command to reload the properties.
  3. If you updated the ha.split.brain.prevention.support or ha.split.brain.recovery.support properties, restart the TrueSight Presentation Server services:
    tssh server stop
    tssh server start
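
The three steps above can be expressed as a small Python sketch that builds the command sequence for a set of property changes. It only constructs the command lists and does not run them; plan_commands and RESTART_REQUIRED are hypothetical names, and the tssh subcommands are the ones shown in the steps.

```python
# Properties that, per step 3, require a Presentation Server restart when changed.
RESTART_REQUIRED = {
    "ha.split.brain.prevention.support",
    "ha.split.brain.recovery.support",
}


def plan_commands(changes: dict) -> list:
    """Return the tssh commands, in order, for the given property changes."""
    # Step 1: one "properties set" command per changed property.
    commands = [
        ["tssh", "properties", "set", name, value]
        for name, value in changes.items()
    ]
    # Step 2: reload the properties once after all changes.
    commands.append(["tssh", "properties", "reload"])
    # Step 3: restart only if a *.support property was updated.
    if any(name in RESTART_REQUIRED for name in changes):
        commands.append(["tssh", "server", "stop"])
        commands.append(["tssh", "server", "start"])
    return commands


for command in plan_commands({"ha.split.brain.recovery.support": "false"}):
    print(" ".join(command))
```

Each command list could be passed to subprocess.run from the bin directory, but review the planned sequence first, because a restart interrupts the Presentation Server.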

Note

If the standby node cannot ping the active node, it takes approximately 5 to 12 minutes to become active.

System is sluggish or crashes in a Linux high-availability deployment

In a Linux high-availability deployment, the system might become sluggish or crash, and the following errors might appear frequently in the log files, even if both nodes are up.

indexserver.log

java.util.concurrent.ExecutionException: RemoteTransportException[[4lVQt-D][127.0.0.1:9300][cluster:monitor/health]]; nested: MasterNotDiscoveredException;
at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:262)

embedcomponents.log

org.infinispan.util.concurrent.TimeoutException: Timed out waiting for

Solution

  1. On each node, edit the /etc/sysctl.conf file and add the following lines to the end of the file:
    net.ipv4.tcp_keepalive_time=600
    net.ipv4.tcp_keepalive_intvl=60
    net.ipv4.tcp_keepalive_probes=20
  2. Stop the secondary server and then stop the primary server.
  3. Restart the primary server.
  4. After you can log in to the TrueSight console on the primary server, restart the secondary server.
  5. Access the TrueSight console on the secondary server and ensure that the standby message is displayed.
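
To see what the keepalive settings in step 1 imply, the following Python sketch computes how long an idle TCP connection to an unresponsive peer stays open before the kernel drops it: the idle time before the first probe, plus one probe interval for each unanswered probe. The function names are hypothetical, and the result is an approximation of standard Linux keepalive behavior, not a statement about how the product uses these connections.

```python
def parse_sysctl(text: str) -> dict:
    """Parse "key=value" lines into a dict of integers, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = int(value.strip())
    return settings


def dead_peer_detection_seconds(settings: dict) -> int:
    """Idle time before probing plus the time spent sending unanswered probes."""
    return (
        settings["net.ipv4.tcp_keepalive_time"]
        + settings["net.ipv4.tcp_keepalive_intvl"]
        * settings["net.ipv4.tcp_keepalive_probes"]
    )


# The values from step 1 of the solution.
SOLUTION = """
net.ipv4.tcp_keepalive_time=600
net.ipv4.tcp_keepalive_intvl=60
net.ipv4.tcp_keepalive_probes=20
"""

print(dead_peer_detection_seconds(parse_sysctl(SOLUTION)))  # 600 + 60 * 20 = 1800
```

With these values, an idle connection to a dead peer is dropped after roughly 1800 seconds (30 minutes). Lines added to /etc/sysctl.conf are typically applied at boot or by running sysctl -p as root.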