Troubleshooting a high-availability deployment
After installation, the Secondary server login UI is not displayed
In a high-availability deployment, if the Remedy Single Sign-On server is stopped when you install or upgrade the secondary server, the login UI for the secondary server is not displayed after installation.
To work around this issue, after you install or upgrade the Secondary server, restart the Remedy Single Sign-On server.
TrueSight Presentation Server process does not restart
Whenever a TrueSight Presentation Server process stops, it does not restart successfully.
As a workaround, stop and restart the TrueSight Presentation Server.
Run the following commands from the installedDirectory\truesightpserver\bin directory:
tssh server stop
tssh server start
The secondary server fails to start after installation
The secondary server can fail to start if communication between the primary and secondary nodes does not occur over the database port (default 5432).
Check whether the <installedDirectory>\truesightpserver\logs\db-remote-copy.log file contains the following error:
pg_basebackup: could not connect to server: FATAL: no pg_hba.conf entry for replication connection from host "<IP Address>", user "proact", SSL off
If yes, perform the following steps:
- Check the hosts file entries and ensure that the FQDN has higher priority than the host name.
For example:
10.0.2.8 hostname.bmc.com hostname.localdomain hostname
- Check the <installedDirectory>\truesightpserver\data\pgsql\pg_hba.conf file on the primary Presentation Server.
Check for entries like the following examples and verify that the host names are correct:
hostnossl replication proact abc.bmc.com trust
hostnossl replication proact abcd.bmc.com trust
- Test that the secondary server can connect to the primary server over port 5432. You can do this by using telnet.
- Entries in the pg_hba.conf file should resolve to the preferred IP address, and DNS lookup from one server to the other should work.
- If the issue is not resolved, stop the Presentation Server services on the secondary server and run the following command:
"%TRUESIGHTPSERVER_HOME%\truesightpserver\modules\pgsql\bin"\pg_basebackup.exe -h PrimaryServerFQDN -p 5432 -D "%TRUESIGHTPSERVER_HOME%\data-test" -U proact -v -P -w -x
Check the error message in the command output for the cause of the failure.
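The telnet connectivity check from the steps above can also be scripted. The following is a minimal Python sketch, not a product tool; "PrimaryServerFQDN" is a placeholder for your primary Presentation Server host name:

```python
import socket

def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example usage from the secondary server (replace the placeholder host):
#   port_open("PrimaryServerFQDN", 5432)
```

If the check returns False, fix network connectivity or firewall rules before revisiting the pg_hba.conf entries.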
After upgrading, the primary server does not come up
After a Presentation Server high-availability upgrade, the primary server might not come up.
This can happen if you stop the primary server before it has come up completely and then start the secondary server upgrade.
Perform the following steps:
- On the primary server, delete the nodeStatusPersistence.conf file in the installedDirectory\truesightpserver\conf\ha folder.
- Start the primary server:
tssh server start
- Verify that the primary server is up completely. It should be the active node:
tssh ha status
- Start the secondary server:
tssh server start
After a failover or failback, the database service does not come up automatically
After a failover or failback occurs, the database service does not come up automatically.
As a workaround, perform the following steps:
- Increase the value of the wrapper.startup.timeout property in the installedDirectory\truesightpserver\conf\services\svc.conf file. The default value is 100.
- Run the following commands from the installedDirectory\truesightpserver\bin directory to restart the TrueSight Presentation Server:
tssh server stop
tssh server start
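Properties in svc.conf use key=value syntax. For example, to raise the startup timeout to 300 (an illustrative value, not a documented recommendation), the entry would look like:

```
wrapper.startup.timeout=300
```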
The HA nodes experience a split-brain condition
A split-brain condition occurs when the HA nodes can no longer communicate with each other. Each node assumes that the other node is non-functional and tries to take over the active node role.
The following TrueSight Presentation Server tssh properties help to prevent and recover from a split-brain condition.
Split-brain condition prevention

Property name | Property details | Default value | Valid values |
---|---|---|---|
ha.split.brain.prevention.support | Controls the split-brain condition prevention support. If set to false, split-brain prevention is disabled. | true | true, false |
ha.split.brain.prevention.max.retry.count | The number of retries to check for the remote node status. If the remote node status is not identified as active within the number of retries, the current node becomes active. If the remote node status is identified as active, the current node continues as the standby node. | 5 | Any positive integer |
ha.split.brain.prevention.scan.frequency.in.secs | The frequency (in seconds) at which the monitoring service checks the remote node status. | 30 | Any positive integer |
Split-brain condition recovery

Property name | Property details | Default value | Valid values |
---|---|---|---|
ha.split.brain.recovery.support | Controls the split-brain condition recovery support. If set to false, split-brain recovery is disabled. | true | true, false |
ha.split.brain.recovery.action | Controls the auto-recovery actions after a split-brain condition is identified. | restart | restart |
ha.split.brain.recovery.notification.support | Controls notifications (Self-health monitoring event and email) when a split-brain condition is identified. To configure the email settings, see Monitoring the Presentation Server environment. | true | true, false |
ha.split.brain.recovery.max.retry.count | The number of retries to decide if a split-brain condition has occurred. | 3 | Any positive integer |
ha.split.brain.recovery.scan.frequency.in.secs | The frequency (in seconds) at which the monitoring service checks for a split-brain condition. | 30 | Any positive integer |
To set the properties, perform the following steps from the installationDirectory\truesightpserver\bin folder:
- Run the tssh properties set <property name> <property value> command to set the properties.
- Run the tssh properties reload command to reload the properties.
- If you updated the ha.split.brain.prevention.support or ha.split.brain.recovery.support properties, restart the TrueSight Presentation Server services:
tssh server stop
tssh server start
Note
If the standby node is unable to ping the active node, it takes approximately 5 to 12 minutes to become active.
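The default property values imply a lower bound on how long the standby node polls before taking over. The following is an illustrative calculation only, not the product's exact algorithm; service startup and network timeouts add to it, which is why the observed window is roughly 5 to 12 minutes:

```python
# Default tssh property values from the tables above.
prevention_retries = 5     # ha.split.brain.prevention.max.retry.count
prevention_scan_secs = 30  # ha.split.brain.prevention.scan.frequency.in.secs
recovery_retries = 3       # ha.split.brain.recovery.max.retry.count
recovery_scan_secs = 30    # ha.split.brain.recovery.scan.frequency.in.secs

# Minimum time spent polling the remote node before the standby
# concludes that the active node is gone.
prevention_wait = prevention_retries * prevention_scan_secs  # 150 seconds
recovery_wait = recovery_retries * recovery_scan_secs        # 90 seconds
```

Raising the retry count or scan frequency lengthens this window proportionally; lowering it makes takeover faster but increases the risk of a false positive.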
Startup sequence of the HA nodes
If both HA nodes are down and then restarted, automatic checks for the restart sequence ensure that there is no data loss. The following properties help configure this functionality:
Property name | Property details | Default value | Valid values |
---|---|---|---|
ha.nodes.status.persistence.support | Controls the node persistence support. If set to false, node persistence is disabled. | true | true, false |
ha.nodes.status.persistence.scan.frequency.in.minutes | The update interval (in minutes) for the nodeStatusPersistence.conf file that stores the node status. | 10 | Any positive integer |
ha.node.max.downtime.minutes | The maximum downtime (in minutes) allowed for the last active node before the startup check requires confirmation. | 45 | Any positive integer |
ha.nodes.status.persistence.email.support | Generates repeat email reminders when the standby node is down. | true | true, false |
ha.nodes.status.persistence.email.frequency.in.minutes | Time interval (in minutes) for the repeat email reminders. | 60 | Any positive integer |

Note: All properties except ha.nodes.status.persistence.email.support require a restart of the Presentation Server to take effect.
To set the properties, perform the following steps from the installationDirectory\truesightpserver\bin folder:
- Stop the TrueSight Presentation Server services on the standby node:
tssh server stop
- From the installationDirectory\truesightpserver\bin folder on the active node, run the following command to set the properties:
tssh properties set <property name> <property value>
- Restart the TrueSight Presentation Server services on the active node:
tssh server start
- Start the TrueSight Presentation Server services on the standby node:
tssh server start
The following startup sequences are handled:
Sequence-1: The standby node is started first. The following message is displayed:
This node was in Standby mode earlier. Currently, the last active node <node details> down. Start the <node details> node first and then start this node.
Use the skipHAcheck option to start this node as active node.
Usage: tssh server start skipHAcheck.
This prevents accidentally starting the standby node before the active node. If you want to start the standby node, use the skipHAcheck option.
Sequence-2: The active node is started first but it has been down for longer than the time configured in the ha.node.max.downtime.minutes
property. The following message is displayed:
This node is down for more than 45 minutes. Ensure that this node is the last active node by checking the file C:\Program Files\BMC Software\TrueSightPServer\truesightpserver\conf\ha\nodeStatusPersistence.conf on both nodes <node details>. Use the skipHAcheck option on the last active node.
Usage: tssh server start skipHAcheck.
On both nodes, check the nodeStatusPersistence.conf file for the ha.current.node.status property. Start the node that has the value Active. If both nodes show as Active, compare the ha.current.node.time.stamp field. The node that has the most recent time is the last active node and needs to be restarted first.
After you identify the last active node, restart it by using the skipHAcheck option.
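The comparison described above can be sketched programmatically. This is a hypothetical helper, not a product tool: it assumes nodeStatusPersistence.conf uses key=value lines and that ha.current.node.time.stamp values compare chronologically as numbers (for example, epoch time); verify the actual file format on your system before relying on it.

```python
def parse_node_status(text: str) -> dict:
    """Parse key=value lines from a nodeStatusPersistence.conf snapshot."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props

def last_active_node(node_a: dict, node_b: dict) -> dict:
    """Pick the node to restart first: the one marked Active, or on a
    tie the node with the most recent ha.current.node.time.stamp."""
    a_active = node_a.get("ha.current.node.status") == "Active"
    b_active = node_b.get("ha.current.node.status") == "Active"
    if a_active != b_active:
        return node_a if a_active else node_b
    # Both (or neither) Active: compare timestamps, assumed numeric.
    key = "ha.current.node.time.stamp"
    return max((node_a, node_b), key=lambda n: float(n.get(key, 0)))
```

The chosen node is the one to start with the skipHAcheck option.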
To disable the node persistence support
- Stop the TrueSight Presentation Server services on the standby node:
tssh server stop
- From the installationDirectory\truesightpserver\bin folder on the active node, run the following command to disable node persistence support:
tssh properties set ha.nodes.status.persistence.support false
tssh properties reload
- Restart the TrueSight Presentation Server services on the active node:
tssh server stop
tssh server start
- On the standby node, delete the following file:
installationDirectory\truesightpserver\conf\ha\nodeStatusPersistence.conf
- Start the TrueSight Presentation Server services on the standby node:
tssh server start
To start the server using the skipHAcheck option
- From the command line: run the tssh server start skipHAcheck command.
- Using services:
- (Linux) Run the following command from /etc/init.d:
service BMCTSPSSvc start skipHAcheck
- (Windows) Add the following parameter in the installedDirectory\truesightpserver\conf\services\svc.conf file and start the service:
### Application properties ###
wrapper.java.app.mainclass=com.bmc.truesight.api.install.services.ServicesWrapper
wrapper.app.parameter.1=start
wrapper.app.parameter.2=skipHAcheck
Out of memory issue in a Linux high-availability deployment
In a Linux high-availability deployment, you encounter the following error message:
java.lang.OutOfMemoryError: unable to create new native thread in TrueSight.log and Indexserver log
This might be caused by limits on the number of user processes. Perform the following steps:
- Run the following command to obtain the CSR and Index Server process ID:
tssh server status
- Run the following command to determine the threads usage:
ps -p processID -lfT | wc -l
- Run the following command to check the max user processes (default is 2048):
ulimit -a
As the root user, update the nproc value to 4096 in the /etc/security/limits.conf file.
If the entries do not exist, add the following entries:
<TSPS_OWNER_NAME> soft nproc 4096
<TSPS_OWNER_NAME> hard nproc 4096
- Restart the Presentation Server.
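On Linux, the same process limit that `ulimit -a` reports can be read from Python via the standard resource module, which can be handy when checking the limit from within the server's own environment (a sketch, assuming a Linux host where RLIMIT_NPROC is available):

```python
import resource

# Soft and hard limits on the number of user processes/threads, i.e. the
# values controlled by the "nproc" entries in /etc/security/limits.conf.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("soft nproc:", soft)
print("hard nproc:", hard)
```

If the soft limit is below the thread count reported by the `ps ... | wc -l` step, the OutOfMemoryError is consistent with the process limit being exhausted.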
System sluggish or crashes in a Linux high-availability deployment
In a Linux high-availability deployment, you might notice system sluggishness or crashes, and the following errors appear in the log files frequently, even when both nodes are up.
indexserver.log
java.util.concurrent.ExecutionException: RemoteTransportException[[4lVQt-D][127.0.0.1:9300][cluster:monitor/health]]; nested: MasterNotDiscoveredException;
at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:262)
embedcomponents.log
org.infinispan.util.concurrent.TimeoutException: Timed out waiting for
Solution
- On each node, edit the /etc/sysctl.conf file and add the following lines to the end of the file:
net.ipv4.tcp_keepalive_time=600
net.ipv4.tcp_keepalive_intvl=60
net.ipv4.tcp_keepalive_probes=20
- Stop the secondary server, and then stop the primary server.
- Restart the primary server.
- After you are able to log in to the TrueSight console on the primary server, restart the secondary server.
- Access the TrueSight console on the secondary server and ensure that the standby message is displayed.
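With these kernel settings, a dead peer on an idle connection is detected after tcp_keepalive_time plus tcp_keepalive_intvl times tcp_keepalive_probes seconds. A quick check of what the recommended values imply:

```python
# Values from the /etc/sysctl.conf lines above.
tcp_keepalive_time = 600    # seconds of idle time before the first probe
tcp_keepalive_intvl = 60    # seconds between successive probes
tcp_keepalive_probes = 20   # unanswered probes before the connection drops

# Worst-case time to detect a dead peer on an idle connection.
detect_secs = tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes
print(detect_secs, "seconds =", detect_secs // 60, "minutes")
```

So stale connections are reclaimed within about 30 minutes instead of lingering at the kernel defaults (7200-second idle time), which is what relieves the sluggishness.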
Issues with replication
Replication error occurs
After recovering from a split-brain condition, or after frequent restarts of nodes, a replication error might occur.
- Check whether the <installedDirectory>\truesightpserver\logs\indexserver.log file contains an entry like the following:
Cluster health status changed from [GREEN] to [YELLOW]
- If it does, restart the standby node.
Elasticsearch replication takes a long time
When one of the nodes is down for a long time and is restarted, the Elasticsearch replication might take a long time. The time taken depends on the Elasticsearch data load and network bandwidth.
Increase the values of the following parameters in the <installedDirectory>/truesightpserver/modules/elasticsearch/config/elasticsearch.yml file. The default values are provided below:
indices.recovery.max_bytes_per_sec: 200mb
cluster.routing.allocation.node_concurrent_recoveries: 5
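To decide how far to raise indices.recovery.max_bytes_per_sec, you can estimate the transfer time from the amount of index data to replicate. The numbers below are illustrative examples, not measurements, and the estimate ignores concurrency and network overhead:

```python
# Rough estimate of Elasticsearch recovery time for a restarted node.
data_gb = 50           # example: 50 GB of index data to replicate
rate_mb_per_sec = 200  # indices.recovery.max_bytes_per_sec: 200mb

est_secs = data_gb * 1024 / rate_mb_per_sec
print(round(est_secs / 60, 1), "minutes at", rate_mb_per_sec, "MB/s")
```

If the estimate exceeds your maintenance window, raise the rate limit, keeping in mind that the network bandwidth between the nodes is the real ceiling.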
The standby node continuously acquires and releases the leadership lock
Even though the active node is running, the standby node continuously acquires and releases the leadership lock. The following log messages are displayed in the standby node:
INFO [HA-LockAcquiringThread] c.b.t.p.h.LeaderElection BMC_TS-PL000185I: tryAcquiringLeaderLock :: The current status of the remote node [host name] is :Active in attempt #1.
To resolve this issue, perform the following steps:
- Stop the secondary server by running the tssh server stop command.
- Stop the primary server by running the tssh server stop command.
- Start the primary server by running the tssh server start command.
- Run the following commands on the primary server:
tssh properties set ha.split.brain.override.jgroups.lock.check true
tssh properties reload
- Start the secondary server by running the tssh server start command.
Server processes are shut down when the Presentation Server is in the high-availability mode (applicable to TrueSight Presentation Server version 11.3.02 and later)
When the Presentation Server is running in the high-availability mode, the server processes shut down with the following error message:
ERROR 03/05 00:44:50.142 [Timer-2] c.b.t.p.c.p.e.ESClientConnection BMC_TS-PL000134E Exception while checking the Indexserver status
java.util.concurrent.ExecutionException: RemoteTransportException[[tNVsgXW][127.0.0.1:9300][cluster:monitor/health]]; nested: MasterNotDiscoveredException;
at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:262)
at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:249)
at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:91)
Caused by: org.infinispan.client.hotrod.exceptions.HotRodClientException:Request for messageId=<message ID
Error Id: BMC_TS-PL000134E
As a workaround, perform the following steps:
- Verify database connectivity between the Presentation Server nodes.
- Verify that the indexserver.log file contains valid IP addresses.
If you are using a dual Network Interface Card (NIC) setup, ensure that the adapter priorities are correctly set.
Add the desired IP address details in the hosts file, or modify the setting in the ESX host file to set a different label for the network adapter. Contact your administrator for more details.
- Disable the NICs that are not required.
If the above workaround does not resolve the issue, do the following:
- Stop the primary Presentation Server.
- Stop the secondary Presentation Server.
- Perform the following first on the primary and then on the secondary server:
- Go to the <TrueSight Presentation Server Install Directory>\truesightpserver\modules\elasticsearch\config directory.
- Take a backup of the elasticsearch.yml file.
- Edit the elasticsearch.yml file and replace the host names with their IP addresses in the following code lines:
#Original code
discovery.zen.ping.unicast.hosts=server1.bmc.com[10590],server2.bmc.com[10590]
network.publish_host: server1.bmc.com
#Modified code
discovery.zen.ping.unicast.hosts=<IP address of server1.bmc.com>[10590],<IP address of server2.bmc.com>[10590]
network.publish_host: <IP address of server1.bmc.com>
- Save the elasticsearch.yml file.
- Restart the primary Presentation Server.
- Restart the secondary Presentation Server.