Troubleshooting TrueSight Infrastructure Management component failure

This topic describes common issues and troubleshooting steps related to a TrueSight Infrastructure Management (TSIM) server crash. The goal is to help you identify the likely cause and either resolve it or continue troubleshooting.

Issue symptoms:

The TSIM component shows "Disconnected" in the TSPS console.
All devices show as disconnected and data collection has stopped.
On the TSIM server, the output of the "pw p l" command shows "not running" for some or all processes.
On the TSIM server, the output of the "pw lic list" command shows "Failed to connect to server".
In TSIM HA mode, an unexpected failover occurred.

Basic checks:

Run a Health Check on the TSIM server(s):
ftp://ftp.bmc.com/pub/TSOM/HealthCheck/
Check the Health Check HTML report for any ERROR, CRITICAL, or WARNING issues, address them, and then restart TSIM.
Most crash issues are resolved by addressing the issues reported in the Health Check Tool (HCT) report.
Important logs to check for error messages:
TSIM in High Availability mode: <TSIM install dir>/pw/pronto/logs/ServerComponentAvailability.log and <TSIM install dir>/pw/pronto/logs/TrueSight.log
TSIM in standalone mode: <TSIM install dir>/pw/pronto/logs/TrueSight.log
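
As a rough aid for this log check, the following Python sketch counts how often known crash signatures appear in a log file. The signature strings are copied from the log excerpts under "Resolutions for common issues" in this topic; the short labels, the function names, and the choice of Python are assumptions made for illustration and are not part of any BMC tooling.

```python
from collections import Counter
from pathlib import Path

# Signature substrings copied from the log excerpts in this topic.
# The labels on the left are hypothetical names used only by this sketch.
SIGNATURES = {
    "jms_memory_limit": "Usage Manager Memory Usage limit reached",
    "rate_reader_queue_full": "Reader queue is full. Dropping AppsMsg",
    "native_thread_oom": "unable to create new native thread",
    "database_unavailable": "Database is unavailable for the duration",
    "duplicate_message_dropped": "This message is dropped because of duplicate message id",
    "oracle_not_responding": "Oracle DB Server not responding",
}


def count_signatures(lines):
    """Count how many lines contain each known signature substring."""
    counts = Counter()
    for line in lines:
        for label, needle in SIGNATURES.items():
            if needle in line:
                counts[label] += 1
    return counts


def scan_log(path):
    """Scan one log file line by line, tolerating undecodable bytes."""
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        return count_signatures(handle)
```

For example, calling scan_log on the TrueSight.log path listed above returns a Counter keyed by label. A non-zero count only points to the matching row in the resolutions table; it does not confirm the root cause.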

Reference:

If a *.hprof file was created under <TSIM install dir>/pw/pronto/logs when the crash happened, also see the following topic for further troubleshooting:
Troubleshooting Java memory management

If a cell crash caused the whole TSIM server to crash (the mcell process went down unexpectedly), see the following topic for the data to collect for further troubleshooting:

Which data is required to investigate a TrueSight Infrastructure Management cell crash?

Resolutions for common issues

Symptom	Action	Reference
The TrueSight Infrastructure Management (TSIM) server crashes with "javax.jms.ResourceAllocationException: Usage Manager Memory Usage limit reached" errors. Error in TrueSight.log: ERROR 06/07 23:57:50 Msg_Svr [ActiveMQ Connection Executor: tcp://localhost:8093?useInactivityMonitor=false] 344018 Exception message received from JMS API is javax.jms.ResourceAllocationException: Usage Manager Memory Usage limit reached. Stopping producer (ID:45496-1623088408327-1:1:1:1) to prevent flooding topic://topic.ProcessStatusChangeTopic. See http://activemq.apache.org/producer-flow-control.html for more info at org.apache.activemq.broker.region.BaseDestination.waitForSpace(BaseDestination.java:669) at org.apache.activemq.broker.region.BaseDestination.waitForSpace(BaseDestination.java:658) at org.apache.activemq.broker.region.Topic.send(Topic.java:417) at org.apache.activemq.broker.region.DestinationFilter.send(DestinationFilter.java:132)	Perform the following steps on the affected TSIM server: 1. Log in to the TSIM server and back up <TSIM install dir>/pw/wildfly/bin/standalone.conf. 2. Open standalone.conf and search for "JAVA_OPTS" (usually near line 53). 3. Increase the JBoss maximum heap size from -Xmx3072m to -Xmx5120m. Before: JAVA_OPTS="-Dservices=DUMMY -Xms256m -Xmx3072m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Djava.net.preferIPv4Stack=false -Djava.net.preferIPv6Addresses=true" After: JAVA_OPTS="-Dservices=DUMMY -Xms256m -Xmx5120m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Djava.net.preferIPv4Stack=false -Djava.net.preferIPv6Addresses=true" 4. Save the file and restart the TSIM server by using "pw sys stop" and then "pw sys start". 5. Validate the environment. Note: In an HA environment, make the change on both nodes.
The rate process crashed. TrueSight.log contains many error messages like the following: "Reader queue is full. Dropping AppsMsg"	1) In pw\custom\conf\pronet.conf, change the property #pronet.ipc.socket.recvbuffersize=1024000 to pronet.ipc.socket.recvbuffersize=2048000 2) Restart TSIM.
The Health Check Tool report contains an entry indicating that the MFD count is very high (over 15 K).	Contact BMC Support for help with manually removing the large number of MFD instances from the database.
ServerComponentAvailability.log contains the following message: ServerComponentAvailability [HA-ServerComponentsAvailability-Monitor] 600002 TrueSight server is not been able to establish connectivity with the database. Details: Database is unavailable for the duration:20 min If using an Oracle database, contact your Oracle Database Administrator immediately to rectify your database connectivity issue. For TrueSight Infrastructure Management using SAP SQL Anywhere database, contact your TrueSight Administrator. Recovery action: Shutting down the TrueSight Infrastructure Management application.	Work with your Oracle DBA to verify the connection between TSIM and the Oracle database.
TSIM crashed, and after the restart jserver fails to initialize even after several hours. TrueSight.log contains many messages like the following: ACMessageProcessor [AgentConnector-10002] Dropping message ...ID:>[PA-0-10357-1614689340-178252] Msg-TS:>[1614689340000]. This message is dropped because of duplicate message id or message with higher timestamp is already present in existing queue. Missing Resource String In addition, the Health Check Tool report shows that InstanceCount has reached 250K.	Reduce the number of monitor instances. See the section "Configure filters to include or exclude data and events" in the BMC documentation: https://docs.bmc.com/docs/TSInfrastructure/113/defining-a-monitoring-policy-774797086.html#monpolicy-1122605386
The Agent Controller generated an HPROF file, some ISNs disconnected, and CPU usage is at 100%. The following messages appear in TrueSight.log: "UsageDataManager [Thread-9,USAGE_PERFORMANCE_DB_UPDATE_POOL] 600002 StoreStreamAttDBUpdateTask Patrol Adapter stream Attribute usage collection DB update .... Started." "UsageDataManager [Thread-9,USAGE_PERFORMANCE_DB_UPDATE_POOL] 600002 StoreStreamAttDBUpdateTask Patrol Adapter stream Attribute usage collection DB update .... Finished."	Extend the Consumption Based Licensing (CBL) interval to 2 months: 1. Back up the file %BMC_PROACTIVENET_HOME%\custom\conf\pronet.conf. 2. Edit %BMC_PROACTIVENET_HOME%\custom\conf\pronet.conf as follows: a) Comment out the following parameters: #usage.data.collection.delay.in.sec=60 #pronet.jserver.licensereport.eventsync.sleep.minutes=1 b) Add the following parameters: #2 months usage.data.collection.delay.in.sec=5184000 pronet.jserver.licensereport.eventsync.sleep.minutes=86400 c) Increase the scheduler interval for the CBL summarization code to 2 months: #2 months #usage.summarization.delay.in.sec=86400 usage.summarization.delay.in.sec=5184000 d) If the following properties are not already present in custom/conf/pronet.conf, add them: pronet.cbl.attr.count.task.interval.hours=24000 pronet.cbl.attr.count.task.first.interval.hours=24000
TrueSight.log contains the following error: ERROR 05/24 09:04:00 Stderr [ActiveMQ Transport: tcp://localhost/127.0.0.1:8093@45083] 700100 Exception in thread "ActiveMQ Transport: tcp://localhost/127.0.0.1:8093@45083" ERROR 05/24 09:04:00 Stderr [ActiveMQ Transport: tcp://localhost/127.0.0.1:8093@45083] 700100 java.lang.OutOfMemoryError: unable to create new native thread ERROR 05/24 09:04:00 Stderr [ActiveMQ Transport: tcp://localhost/127.0.0.1:8093@45083] 700100 at java.lang.Thread.start0(Native Method)	Increase the limits for the Linux user that runs TSIM. Root access may be required to make these changes. 1. Open /etc/security/limits.conf for editing. 2. Add or edit the following entries so that nofile is 65536 and nproc is 32000: <tsim_user> soft nofile 65536 <tsim_user> hard nofile 65536 <tsim_user> soft nproc 32000 <tsim_user> hard nproc 32000 3. Save the file. 4. Log off all current sessions (for example, PuTTY sessions) of the TSIM user and log in again. 5. Run "ulimit -a" to verify that the "max user processes" and "max open files" values have been updated. Next, update the TSIM configuration files so that the jserver and wildfly processes run with the ulimit settings of the TSIM user: 6. Back up <Install_Dir>/TrueSight/pw/pronto/bin/pw to a different location. 7. Open <Install_Dir>/TrueSight/pw/pronto/bin/pw for editing. 8. Find the following line: my $set_ulimit = "ulimit -n 16096; "; 9. Change the value to 65536: my $set_ulimit = "ulimit -n 65536; "; 10. Save and close the file. 11. Back up <Install_Dir>/TrueSight/pw/pronto/lib/perl/PWProcess.pm to a different location. 12. Open <Install_Dir>/TrueSight/pw/pronto/lib/perl/PWProcess.pm for editing. 13. Find the following line: my $set_ulimit = "ulimit -n 4096; "; 14. Change the value to 65536: my $set_ulimit = "ulimit -n 65536; "; 15. Save and close the file. 16. Restart all TSIM services. 17. After a successful restart, run "pw p l" and note the PIDs of the "services" and "jserver" processes. 18. Go to /proc/<services_PID> and run "cat limits". Verify that the following two values are shown: Max processes 32000 32000 processes Max open files 65536 65536 files 19. Go to /proc/<jserver_PID> and run "cat limits". Verify that the same two values are shown: Max processes 32000 32000 processes Max open files 65536 65536 files If the updated values are shown, the changes were successful.	TrueSight Infrastructure Management (TSIM) crashes intermittently with "java.lang.OutOfMemoryError: unable to create new native thread" in TrueSight.log
TrueSight.log contains errors like the following: ERROR 03/13 15:39:24 OracleMon [SerialPollEngine-Worker#2-18] 2500100 Oracle DB Server not responding at the IP/Port/Protocol specified or Connect to DB Timedout ERROR 03/13 15:39:24 OracleMon [SerialPollEngine-Worker#2-18] 2500100 [18]DB Connection timed out disconnecting with the database to free up the connection ERROR 03/13 15:40:16 OracleMon [TimedRetryExecutor-Worker#3] 2500100 Could not execute SQL.....Reason: Closed Statement: next ,SQLState: 08003 ,Vendorcode: 17009 ,SQL: SELECT SUM ( BYTES ) FROM dba_free_space ERROR 03/13 15:40:16 OracleMon [TimedRetryExecutor-Worker#3] 2500100 Closed Statement: next ERROR 03/13 15:40:16 OracleMon [TimedRetryExecutor-Worker#3] 2500100 Could not execute SQL.....Reason: Closed Connection ,SQLState: 08003 ,Vendorcode: 17008 ,SQL: SELECT SUM ( a.phyrds ) FROM v$filestat a , v$datafile b WHERE a.file# = b.file# ERROR 03/13 15:40:16 OracleMon [TimedRetryExecutor-Worker#3] 2500100 [18]Execute on tablecu dba_datafiles failed for instance id :18 ERROR 03/13 15:40:16 OracleMon [TimedRetryExecutor-Worker#3] 2500100 [18]Error Message0 for instance id :18 ERROR 03/13 15:40:16 OracleMon [TimedRetryExecutor-Worker#3] 354118 Exception: After discussing this internally, the cause was found to be the Oracle recycle bin; this is an exact match with	Check whether the recycle bin feature is enabled: show parameter recyclebin If it is, check how many rows are in the recycle bin: SELECT COUNT(*) FROM recyclebin; Then identify which owner contributes the most to the recycle bin size: SELECT owner, COUNT(*) FROM dba_recyclebin GROUP BY owner; On Oracle 10.x, thousands of rows in the recycle bin (on Oracle 11.x/12.x, only a few rows) can severely slow the dba_free_space query and therefore have a knock-on effect on performance across the rest of the database. Clear the recycle bin by using the following SQL: purge recyclebin; Or, as SYSDBA, purge system-wide: purge dba_recyclebin; Then observe how the database and BPPM/TSIM perform afterward. Notes: For more details, search the web for dba_free_space and recyclebin; Oracle Metalink Article Doc ID 271169.1 'Queries on DBA_FREE_SPACE are Slow' is relevant here. If this is the cause, it is unrelated to BPPM/TSIM as a product and is instead a matter of Oracle configuration.	Oracle database used with TrueSight Infrastructure Management (TSIM)/ BMC ProactiveNet Performance Management (BPPM) is performing really poorly
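
The "cat limits" verification in the native-thread row of the table above can be sketched in Python as follows. This is a minimal illustration, assuming a Linux host where /proc/<PID>/limits uses the standard column layout (limit name, soft limit, hard limit, units); the function names are hypothetical, and the expected values are the ones given in that row.

```python
import re

# Expected soft and hard values, taken from the verification step in the table.
EXPECTED = {"Max processes": "32000", "Max open files": "65536"}


def parse_limits(text):
    """Parse the text of /proc/<PID>/limits into {name: (soft, hard)}."""
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        # Columns are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3:
            limits[fields[0]] = (fields[1], fields[2])
    return limits


def mismatches(text, expected=EXPECTED):
    """Return {name: (soft, hard)} for limits that differ from the expected values."""
    limits = parse_limits(text)
    return {
        name: limits.get(name)
        for name, value in expected.items()
        if limits.get(name) != (value, value)
    }


def check_pid(pid):
    """Read /proc/<pid>/limits (Linux only) and report any mismatches."""
    with open(f"/proc/{pid}/limits", encoding="utf-8") as handle:
        return mismatches(handle.read())
```

Calling check_pid with the PID of the "services" or "jserver" process, as reported by "pw p l", returns an empty dictionary when both limits match. Any entry in the result shows the soft and hard values that are currently in effect for that process.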

Diagnosing and reporting an issue:

If you have completed the basic checks above and found no obvious error or hint, collect the following and open a support case with BMC:

1) The output reports folder generated by the Health Check Tool

2) The output of "pw dump 1" from the TSIM server
