Troubleshooting a high-availability deployment
After installation, the Secondary server login UI is not displayed
In a high availability deployment, when you install or upgrade the Secondary server, if the Remedy Single Sign-On server is stopped, the login UI for the Secondary server is not displayed after installation.
To work around this issue, after you install or upgrade the Secondary server, restart the Remedy Single Sign-On server.
Presentation Server process does not restart
Whenever a Presentation Server process stops, it does not restart successfully.
As a workaround, stop and restart the Presentation Server.
Run the following commands from the installedDirectory\truesightpserver\bin directory:
tssh server stop
tssh server start
After a failover or failback, the database service does not come up automatically
After a failover or failback occurs, the database service does not come up automatically.
As a workaround, perform the following steps:
- Increase the value of the wrapper.startup.timeout property in the installedDirectory\truesightpserver\conf\services\svc.conf file. The default value is 100.
- Run the following commands from the installedDirectory\truesightpserver\bin directory to restart the Presentation Server:
tssh server stop
tssh server start
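The first step above can be scripted. The following is a minimal Python sketch, assuming that svc.conf stores the setting as a `wrapper.startup.timeout=<value>` line (the usual Java Service Wrapper form, which this page does not confirm). The function name `set_startup_timeout` and the value 300 are illustrative only.

```python
import re

# Assumption: the property appears as "wrapper.startup.timeout=<digits>" on its own line.
_TIMEOUT_LINE = re.compile(
    r"^([ \t]*wrapper\.startup\.timeout[ \t]*=[ \t]*)\d+[ \t]*$",
    re.MULTILINE,
)


def set_startup_timeout(conf_text: str, value: int) -> str:
    """Return conf_text with wrapper.startup.timeout set to value.

    If the property is not present, it is appended as a new line.
    """
    if value <= 0:
        raise ValueError("timeout must be a positive integer")
    new_text, count = _TIMEOUT_LINE.subn(
        lambda m: f"{m.group(1)}{value}", conf_text
    )
    if count == 0:
        # Property not found: append it, adding a newline first if needed.
        sep = "" if conf_text.endswith("\n") or not conf_text else "\n"
        new_text = f"{conf_text}{sep}wrapper.startup.timeout={value}\n"
    return new_text


if __name__ == "__main__":
    sample = "wrapper.name=example\nwrapper.startup.timeout=100\n"
    print(set_startup_timeout(sample, 300))
```

To use it on a real file, read svc.conf into a string, pass it through the function, and write the result back after making a backup copy. Then restart the Presentation Server with the commands shown above.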
The HA nodes experience a split-brain condition
A split-brain condition occurs when the HA nodes can no longer communicate with each other. Each node assumes that the other node is non-functional and tries to take over the active node role.
The following TrueSight Presentation Server tssh properties help to prevent and recover from a split-brain condition.
Split-brain condition prevention | |||
---|---|---|---|
Property name | Property details | Default value | Valid values |
ha.split.brain.prevention.support | Controls the split-brain condition prevention support. If set to false, the Active-Active node prevention operation does not occur. | true | true; false |
ha.split.brain.prevention.max.retry.count | The number of retries to check for the remote node status. If the remote node status is not identified as active within the number of retries, the current node becomes active. If the remote node status is identified as active, the current node continues as the standby node. | 5 | Any positive integer |
 | The frequency (in seconds) at which the monitoring service checks the remote node status. | 30 | Any positive integer |
Split-brain condition recovery | |||
---|---|---|---|
Property name | Property details | Default value | Valid values |
ha.split.brain.recovery.support | Controls the split-brain condition recovery support. If set to false, the Active-Active node recovery operation does not occur and the HA deployment stays in the split-brain condition. | true | true; false |
ha.split.brain.recovery.action | Controls the auto-recovery actions after a split-brain condition is identified. | restart | restart; |
ha.split.brain.recovery.notification.support | Controls notifications (Self-health monitoring event and email) when a split-brain condition is identified. To configure the email settings, see Monitoring the Presentation Server environment | true | true; |
ha.split.brain.recovery.max.retry.count | The number of retries to decide if a split-brain condition has occurred. | 3 | Any positive integer |
 | The frequency (in seconds) at which the monitoring service checks for a split-brain condition. | 30 | Any positive integer |
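As a rough guide to how the retry counts and check frequencies in the tables above interact, the following Python sketch multiplies the two values. It assumes that each retry is spaced one check interval apart, which this page does not state explicitly, so treat the results as estimates rather than guaranteed timings. The function name is illustrative.

```python
def approx_window_seconds(max_retry_count: int, check_interval_seconds: int) -> int:
    """Estimate elapsed time, assuming one status check per interval."""
    return max_retry_count * check_interval_seconds


# Default values from the tables above.
prevention_window = approx_window_seconds(5, 30)  # prevention: 5 retries, 30 s apart
recovery_window = approx_window_seconds(3, 30)    # recovery: 3 retries, 30 s apart

print(prevention_window, recovery_window)  # prints "150 90"
```

Under this assumption, the defaults correspond to about 150 seconds before a standby node promotes itself and about 90 seconds before a split-brain condition is declared.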
To set the properties, perform the following steps from the installationDirectory\truesightpserver\bin folder:
- Run the tssh properties set <property name> <property value> command to set the properties.
- Run the tssh properties reload command to reload the properties.
- If you updated the ha.split.brain.prevention.support or ha.split.brain.recovery.support properties, restart the TrueSight Presentation Server services:
tssh server stop
tssh server start
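The procedure above can be expressed as a small planning helper. The following Python sketch only builds the command sequence and does not run anything. It uses the tssh subcommands shown on this page; the function name `build_tssh_commands` and the example property value are illustrative.

```python
from __future__ import annotations

# Properties that, per the steps above, require a server restart after a change.
RESTART_REQUIRED = {
    "ha.split.brain.prevention.support",
    "ha.split.brain.recovery.support",
}


def build_tssh_commands(properties: dict[str, str]) -> list[list[str]]:
    """Return the tssh commands to run, in order, for the given property changes."""
    commands = [
        ["tssh", "properties", "set", name, value]
        for name, value in properties.items()
    ]
    commands.append(["tssh", "properties", "reload"])
    if any(name in RESTART_REQUIRED for name in properties):
        commands.append(["tssh", "server", "stop"])
        commands.append(["tssh", "server", "start"])
    return commands


if __name__ == "__main__":
    plan = build_tssh_commands({"ha.split.brain.recovery.support": "false"})
    for command in plan:
        print(" ".join(command))
```

Run the printed commands from the bin folder named in the steps above. Keeping the planning separate from execution makes it easy to review the sequence before changing a live HA node.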
System sluggish or crashes in a Linux high-availability deployment
In a Linux high-availability deployment, the system might become sluggish or crash, and errors might appear frequently in the following log files, even when both nodes are up:
indexserver.log
embedcomponents.log
Solution
- On each node, edit the /etc/sysctl.conf file and add the following lines to the end of the file:
net.ipv4.tcp_keepalive_time=600
net.ipv4.tcp_keepalive_intvl=60
net.ipv4.tcp_keepalive_probes=20
- Stop the secondary server and then stop the primary server.
- Restart the primary server.
- After you are able to log in to the TrueSight console on the primary server, restart the secondary server.
- Access the TrueSight console on the secondary server and ensure that the standby message is displayed.
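The first step of the solution can be checked before you edit the file. The following Python sketch reports which of the three keepalive lines still need to be appended to the contents of /etc/sysctl.conf. It assumes the standard `key = value` format, with comment lines starting with `#` or `;`. The function name is illustrative.

```python
from __future__ import annotations

# Values taken from the solution steps above.
KEEPALIVE_SETTINGS = {
    "net.ipv4.tcp_keepalive_time": "600",
    "net.ipv4.tcp_keepalive_intvl": "60",
    "net.ipv4.tcp_keepalive_probes": "20",
}


def missing_keepalive_lines(sysctl_text: str) -> list[str]:
    """Return the lines to append so that all three settings match."""
    present: dict[str, str] = {}
    for raw_line in sysctl_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep:
            present[key.strip()] = value.strip()  # a later entry overrides an earlier one
    return [
        f"{key}={value}"
        for key, value in KEEPALIVE_SETTINGS.items()
        if present.get(key) != value
    ]


if __name__ == "__main__":
    existing = "net.ipv4.tcp_keepalive_time = 7200\n"
    for line in missing_keepalive_lines(existing):
        print(line)
```

With these values, an idle connection is probed after 600 seconds and dropped after 20 unanswered probes sent 60 seconds apart, which is 600 + 20 × 60 = 1800 seconds in total. This page does not say how the new settings are loaded. On most Linux systems, `sysctl -p` or a reboot applies changes made to /etc/sysctl.conf.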