Troubleshooting TrueSight Infrastructure Management component failure

This topic was edited by a BMC Contributor and has not been approved.

This topic describes common issues and troubleshooting steps for a TSIM server crash. The goal is to help you identify the likely cause and either resolve it or continue troubleshooting.

Issue symptoms:

  • The TSIM component shows "Disconnected" in the TSPS console
  • All devices show as disconnected and data collection has stopped
  • On the TSIM server, the output of the command "pw p l" shows "not running" for some or all processes
  • On the TSIM server, the output of the command "pw lic list" shows "Failed to connect to server"
  • In TSIM HA mode, an unexpected failover occurred


Basic checks:

  • Run a Health Check on the TSIM server(s):

    ftp://ftp.bmc.com/pub/TSOM/HealthCheck/

    Check the Health Check HTML report for any ERROR, CRITICAL, or WARNING issues, address them, and then restart TSIM.

    Most crash issues are resolved once the issues reported in the Health Check Tool (HCT) report are addressed.

  • Important logs to check for error messages:
    TSIM in High Availability mode: <TSIM install dir>/pw/pronto/logs/ServerComponentAvailability.log and <TSIM install dir>/pw/pronto/logs/TrueSight.log
    TSIM in standalone mode: <TSIM install dir>/pw/pronto/logs/TrueSight.log
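
A quick way to see which components log the most errors is to tally the ERROR lines by component. The following is a minimal Python sketch, not a BMC tool. The line layout it assumes (level, date, time, component, bracketed thread name, numeric code, message) is inferred from the TrueSight.log excerpts on this page, and the name summarize_errors is hypothetical.

```python
import re
from collections import Counter

# Assumed layout, inferred from the excerpts on this page:
# ERROR MM/DD HH:MM:SS Component   [thread name] code message
ERROR_LINE = re.compile(
    r"^ERROR\s+(?P<date>\d{2}/\d{2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<component>\S+)\s+\[(?P<thread>.*?)\]\s+(?P<code>\d+)\s*(?P<message>.*)$"
)


def summarize_errors(lines):
    """Count ERROR lines per component; stack-trace and other lines are ignored."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.match(line.rstrip("\n"))
        if match:
            counts[match.group("component")] += 1
    return counts


# Sample lines copied from the log excerpts on this page.
SAMPLE = [
    'ERROR 05/24 09:04:00 Stderr               [ActiveMQ Transport: tcp://localhost/127.0.0.1:8093@45083] 700100 java.lang.OutOfMemoryError: unable to create new native thread',
    '    at java.lang.Thread.start0(Native Method)',
    'ERROR 03/13 15:40:16 OracleMon            [TimedRetryExecutor-Worker#3] 2500100 Closed Statement: next',
    'ERROR 03/13 15:40:16 OracleMon            [TimedRetryExecutor-Worker#3] 2500100 [18]Error Message0 for instance id :18',
]

if __name__ == "__main__":
    for component, count in summarize_errors(SAMPLE).most_common():
        print(component, count)
```

To run it against a real log, open TrueSight.log from the logs directory listed above (for example with open(path, errors="replace")) and pass the file object to summarize_errors in place of SAMPLE.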

Reference:

  • If a *.hprof file was created under <TSIM install dir>/pw/pronto/logs when the crash happened, also see the following topic for further troubleshooting:
    Troubleshooting Java memory management
  • If a cell crash caused the whole TSIM to crash (the mcell process went down unexpectedly), see the following document for the data to collect for further troubleshooting:
    Which data is required to investigate a TrueSight Infrastructure Management cell crash?



Resolutions for common issues

Each issue below is described by its symptom, the recommended action, and a reference where one is available.

TrueSight Infrastructure Management (TSIM) server crashes with "javax.jms.ResourceAllocationException: Usage Manager Memory Usage limit reached" errors

Error in TrueSight.log:

ERROR 06/07 23:57:50 Msg_Svr              [ActiveMQ Connection Executor: tcp://localhost:8093?useInactivityMonitor=false] 344018 Exception message received from JMS API is 
javax.jms.ResourceAllocationException: Usage Manager Memory Usage limit reached. Stopping producer (ID:45496-1623088408327-1:1:1:1) to prevent flooding topic://topic.ProcessStatusChangeTopic. See http://activemq.apache.org/producer-flow-control.html for more info
    at org.apache.activemq.broker.region.BaseDestination.waitForSpace(BaseDestination.java:669)
    at org.apache.activemq.broker.region.BaseDestination.waitForSpace(BaseDestination.java:658)
    at org.apache.activemq.broker.region.Topic.send(Topic.java:417)
    at org.apache.activemq.broker.region.DestinationFilter.send(DestinationFilter.java:132)

Perform the following steps on the affected TSIM server:

1. Log in to the TSIM server and back up <TSIM install dir>/pw/wildfly/bin/standalone.conf.
2. Open standalone.conf and search for "JAVA_OPTS" (usually at around line 53).
3. Increase the JBoss maximum heap size from -Xmx3072m to -Xmx5120m in the configuration file.

Before:
JAVA_OPTS="-Dservices=DUMMY -Xms256m -Xmx3072m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Djava.net.preferIPv4Stack=false -Djava.net.preferIPv6Addresses=true"
After:
JAVA_OPTS="-Dservices=DUMMY -Xms256m -Xmx5120m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -Djava.net.preferIPv4Stack=false -Djava.net.preferIPv6Addresses=true"

4. Save the file and restart the TSIM server using "pw sys stop" followed by "pw sys start".
5. Validate the environment.

Note: In an HA environment, make the change on both nodes.
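
The heap-size edit can also be scripted so that it is applied identically on both HA nodes. The following Python sketch is illustrative only: the function name set_max_heap is hypothetical, and it assumes the active setting is a non-commented line beginning with JAVA_OPTS= that already contains an -Xmx option, as in the "Before" line above.

```python
import re

# Matches an existing maximum-heap option such as -Xmx3072m or -Xmx4g.
XMX_OPTION = re.compile(r"-Xmx\d+[kKmMgG]?")


def set_max_heap(conf_text, new_value="5120m"):
    """Return conf_text with -Xmx replaced on active JAVA_OPTS= lines only."""
    updated = []
    for line in conf_text.splitlines(keepends=True):
        # Commented lines start with '#', so they never match this prefix test.
        if line.lstrip().startswith("JAVA_OPTS="):
            line = XMX_OPTION.sub("-Xmx" + new_value, line)
        updated.append(line)
    return "".join(updated)


BEFORE = (
    'JAVA_OPTS="-Dservices=DUMMY -Xms256m -Xmx3072m -XX:MetaspaceSize=96M '
    '-XX:MaxMetaspaceSize=256m -Djava.net.preferIPv4Stack=false '
    '-Djava.net.preferIPv6Addresses=true"\n'
)

if __name__ == "__main__":
    print(set_max_heap(BEFORE), end="")
```

As in step 1, keep a backup of standalone.conf before writing the returned text back to the file.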


The rate process crashed. TrueSight.log contains many of the following error messages:

"Reader queue is full. Dropping AppsMsg"

1) In pw\custom\conf\pronet.conf, change the following property value from

#pronet.ipc.socket.recvbuffersize=1024000
to
pronet.ipc.socket.recvbuffersize=2048000


2) Restart TSIM. 
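
A property change like this one can be sketched as a small text transformation. The Python below is a hedged illustration, not a BMC utility: set_property is a hypothetical helper that replaces the first line defining the key (commented out or not) with key=value, removes any later duplicates of that key, and appends the property if it is missing.

```python
def set_property(conf_text, key, value):
    """Return conf_text with `key` set to `value` exactly once."""
    result = []
    written = False
    for line in conf_text.splitlines():
        # Treat "#key=old" and "key=old" alike when looking for the key.
        candidate = line.lstrip("# \t")
        if "=" in candidate and candidate.split("=", 1)[0].strip() == key:
            if not written:
                result.append(f"{key}={value}")
                written = True
            continue  # skip the old definition and any duplicates
        result.append(line)
    if not written:
        result.append(f"{key}={value}")
    return "\n".join(result) + "\n"


if __name__ == "__main__":
    original = "#pronet.ipc.socket.recvbuffersize=1024000\n"
    print(set_property(original, "pronet.ipc.socket.recvbuffersize", "2048000"), end="")
```

Back up pronet.conf before writing the result to disk, and restart TSIM afterwards as described in step 2.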


The health check tool report contains an entry indicating that the MFD count is very high (over 15,000).

Contact BMC Support for help with manually removing the large number of MFD instances from the database.



ServerComponentAvailability.log contains the following message:

ServerComponentAvailability [HA-ServerComponentsAvailability-Monitor] 600002  TrueSight server is not been able to establish connectivity with the database.  Details: Database is unavailable for the duration:20 min If using an Oracle database, contact your Oracle Database Administrator immediately to rectify your database connectivity issue.  For TrueSight Infrastructure Management using SAP SQL Anywhere database, contact your TrueSight Administrator.  Recovery action: Shutting down the TrueSight Infrastructure Management application.

Work with the Oracle DBA to verify that the connection between TSIM and the Oracle database is working.

TSIM crashed, and after the restart jserver fails to initialize even after several hours. TrueSight.log contains many messages like the following:

ACMessageProcessor [AgentConnector-10002] Dropping message ...ID:>[PA-0-10357-1614689340-178252] Msg-TS:>[1614689340000]. This message is dropped because of duplicate message id or message with higher timestamp is already present in existing queue. Missing Resource String

The health check tool report also shows that InstanceCount has reached 250K.

Reduce the number of monitor instances. See the section "Configure filters to include or exclude data and events" in the BMC documentation:
https://docs.bmc.com/docs/TSInfrastructure/113/defining-a-monitoring-policy-774797086.html#monpolicy-1122605386

The Agent Controller generated an HPROF file, some ISNs disconnected, and CPU usage was at 100%

The following messages appear in TrueSight.log:
"UsageDataManager [Thread-9,USAGE_PERFORMANCE_DB_UPDATE_POOL] 600002 StoreStreamAttDBUpdateTask Patrol Adapter stream Attribute usage collection DB update .... Started."
"UsageDataManager [Thread-9,USAGE_PERFORMANCE_DB_UPDATE_POOL] 600002 StoreStreamAttDBUpdateTask Patrol Adapter stream Attribute usage collection DB update .... Finished."

Extend the Consumption Based Licensing (CBL) interval to 2 months:
1 - Back up the file %BMC_PROACTIVENET_HOME%\custom\conf\pronet.conf
2 - Edit %BMC_PROACTIVENET_HOME%\custom\conf\pronet.conf
a) Comment out the following parameters:
#usage.data.collection.delay.in.sec=60
#pronet.jserver.licensereport.eventsync.sleep.minutes=1
b) Add the following parameters:
#2 months
usage.data.collection.delay.in.sec=5184000
pronet.jserver.licensereport.eventsync.sleep.minutes=86400
c) Increase the scheduler interval for the CBL summarization code to 2 months:
#2 months
#usage.summarization.delay.in.sec=86400
usage.summarization.delay.in.sec=5184000

Also add the following properties if they are not already present in custom/conf/pronet.conf:

pronet.cbl.attr.count.task.interval.hours=24000
pronet.cbl.attr.count.task.first.interval.hours=24000
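
As a sanity check on the units, the values above can be converted to days. This is a small Python sketch that assumes each property is expressed in the unit named by its suffix (sec, minutes, hours) and treats "2 months" as 60 days; the dictionary names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60
HOURS_PER_DAY = 24

# (configured value, how many of that unit make one day)
settings = {
    "usage.data.collection.delay.in.sec": (5184000, SECONDS_PER_DAY),
    "pronet.jserver.licensereport.eventsync.sleep.minutes": (86400, MINUTES_PER_DAY),
    "usage.summarization.delay.in.sec": (5184000, SECONDS_PER_DAY),
    "pronet.cbl.attr.count.task.interval.hours": (24000, HOURS_PER_DAY),
    "pronet.cbl.attr.count.task.first.interval.hours": (24000, HOURS_PER_DAY),
}

intervals_in_days = {
    name: value / per_day for name, (value, per_day) in settings.items()
}

if __name__ == "__main__":
    for name, days in intervals_in_days.items():
        print(f"{name}: {days:g} days")
```

Under that assumption, the first three settings work out to 60 days each, while the two *.interval.hours properties work out to 1000 days.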

TrueSight.log contains the following error:

ERROR 05/24 09:04:00 Stderr               [ActiveMQ Transport: tcp://localhost/127.0.0.1:8093@45083] 700100 Exception in thread "ActiveMQ Transport: tcp://localhost/127.0.0.1:8093@45083
ERROR 05/24 09:04:00 Stderr               [ActiveMQ Transport: tcp://localhost/127.0.0.1:8093@45083] 700100 java.lang.OutOfMemoryError: unable to create new native thread
ERROR 05/24 09:04:00 Stderr               [ActiveMQ Transport: tcp://localhost/127.0.0.1:8093@45083] 700100     at java.lang.Thread.start0(Native Method)

Use the following steps to increase the limits for the Linux user that runs TSIM. Root access may be required to make these changes:

  1. Open the file /etc/security/limits.conf for editing
  2. Add or edit the following entries so that nofile is 65536 and nproc is 32000:
    • <tsim_user> soft nofile 65536
    • <tsim_user> hard nofile 65536
    • <tsim_user> soft nproc  32000
    • <tsim_user> hard nproc  32000
  3. Save the file
  4. Log off all current sessions (for example, PuTTY) of the TSIM user and log in again
  5. Run the "ulimit -a" command to verify that the "max user processes" and "max open files" values have been updated
After these changes, some TSIM configuration files must also be updated so that the jserver and wildfly processes run with the ulimit settings of the TSIM user:
  1. Back up <Install_Dir>/TrueSight/pw/pronto/bin/pw and place the backup file in a different location
  2. Open <Install_Dir>/TrueSight/pw/pronto/bin/pw for editing
  3. Search for the following line in the file:
    • my $set_ulimit = "ulimit -n 16096; ";
  4. Change the value to 65536 as shown below:
    • my $set_ulimit = "ulimit -n 65536; ";
  5. Save and close the file
  6. Back up <Install_Dir>/TrueSight/pw/pronto/lib/perl/PWProcess.pm and place the backup file in a different location
  7. Open <Install_Dir>/TrueSight/pw/pronto/lib/perl/PWProcess.pm for editing
  8. Search for the following line in the file:
    • my $set_ulimit = "ulimit -n 4096; ";
  9. Change the value to 65536 as shown below:
    • my $set_ulimit = "ulimit -n 65536; ";
  10. Save and close the file
  11. Restart all TSIM services
  12. After a successful restart, run the "pw p l" command and note the PIDs of the "services" and "jserver" processes
  13. Navigate to "/proc/<services_PID>" and run the command "cat limits"
  14. Verify that the following two values are updated as shown below:
    • Max processes             32000                32000                processes
    • Max open files            65536                65536                files
  15. Navigate to "/proc/<jserver_PID>" and run the command "cat limits"
  16. Verify that the following two values are updated as shown below:
    • Max processes             32000                32000                processes
    • Max open files            65536                65536                files
  17. If the updated values are reflected, the changes were successful
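
The verification in steps 13 to 16 can be automated by parsing the text of /proc/<PID>/limits. The following Python sketch is one possible approach, assuming the usual Linux layout of that file (limit name, soft limit, hard limit, units, separated by runs of spaces); parse_limits, mismatches, and EXPECTED are hypothetical names.

```python
import re

# One row of /proc/<PID>/limits: name, soft limit, hard limit, optional units.
LIMIT_ROW = re.compile(
    r"^(?P<name>.+?)\s{2,}(?P<soft>\S+)\s+(?P<hard>\S+)(?:\s+(?P<units>\S+))?\s*$"
)

# Target values from the steps above.
EXPECTED = {"Max processes": "32000", "Max open files": "65536"}


def parse_limits(text):
    """Map each limit name to its (soft, hard) values as strings."""
    limits = {}
    for line in text.splitlines():
        match = LIMIT_ROW.match(line)
        if match:
            limits[match.group("name")] = (match.group("soft"), match.group("hard"))
    return limits


def mismatches(text, expected=EXPECTED):
    """Return the limits whose soft or hard value differs from the expected one."""
    limits = parse_limits(text)
    return {
        name: limits.get(name)
        for name, want in expected.items()
        if limits.get(name) != (want, want)
    }


SAMPLE = (
    "Limit                     Soft Limit           Hard Limit           Units     \n"
    "Max processes             32000                32000                processes \n"
    "Max open files            65536                65536                files     \n"
)

if __name__ == "__main__":
    print(mismatches(SAMPLE) or "limits match the expected values")
```

On the TSIM server, read the file for each PID noted in step 12 (for example, open("/proc/<jserver_PID>/limits").read() with the real PID substituted) and pass the text to mismatches; an empty result means both limits are in effect for that process.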


Reference: TrueSight Infrastructure Management (TSIM) crashes intermittently with "java.lang.OutOfMemoryError: unable to create new native thread" in TrueSight.log

In TrueSight.log:

ERROR 03/13 15:39:24 OracleMon            [SerialPollEngine-Worker#2-18] 2500100 Oracle DB Server not responding at the IP/Port/Protocol specified or Connect to DB Timedout
ERROR 03/13 15:39:24 OracleMon            [SerialPollEngine-Worker#2-18] 2500100 [18]DB Connection timed out disconnecting with the database to free up the connection 
ERROR 03/13 15:40:16 OracleMon            [TimedRetryExecutor-Worker#3] 2500100  Could not execute SQL.....Reason: Closed Statement: next ,SQLState: 08003 ,Vendorcode: 17009 ,SQL: SELECT  SUM ( BYTES )  FROM dba_free_space
ERROR 03/13 15:40:16 OracleMon            [TimedRetryExecutor-Worker#3] 2500100 Closed Statement: next
ERROR 03/13 15:40:16 OracleMon            [TimedRetryExecutor-Worker#3] 2500100  Could not execute SQL.....Reason: Closed Connection ,SQLState: 08003 ,Vendorcode: 17008 ,SQL: SELECT  SUM ( a.phyrds )  FROM v$filestat a  , v$datafile b WHERE a.file# = b.file#
ERROR 03/13 15:40:16 OracleMon            [TimedRetryExecutor-Worker#3] 2500100 [18]Execute on tablecu dba_datafiles failed for instance id :18
ERROR 03/13 15:40:16 OracleMon            [TimedRetryExecutor-Worker#3] 2500100 [18]Error Message0 for instance id :18
ERROR 03/13 15:40:16 OracleMon            [TimedRetryExecutor-Worker#3] 354118 Exception:
Internal investigation traced these errors to the recycle bin on Oracle.

Check whether this feature is enabled:
show parameter recyclebin

If so, check how many rows are in the recyclebin:
SELECT COUNT(*) FROM recyclebin;

Then verify who is contributing the most to the recyclebin size:
SELECT owner, COUNT(*) FROM dba_recyclebin GROUP BY owner;


On Oracle 10.x, a recyclebin with thousands of rows (on Oracle 11.x/12.x, only a few rows are needed) can massively slow the dba_free_space query and therefore have a knock-on effect on performance across the rest of the database.

The recyclebin can be cleared using the following SQL:
purge recyclebin;
Or, as SYSDBA, for a system-wide purge:
purge dba_recyclebin;
Then check how the database and BPPM/TSIM perform afterwards.

Note: For more details, search the web for dba_free_space and recyclebin; many pages cover this. Oracle Metalink Article Doc ID 271169.1, 'Queries on DBA_FREE_SPACE are Slow', is relevant in this instance.

Note: If this is the cause, the issue is unrelated to BPPM/TSIM as a product and lies in the Oracle configuration.

Reference: Oracle database used with TrueSight Infrastructure Management (TSIM) / BMC ProactiveNet Performance Management (BPPM) is performing really poorly


Diagnosing and reporting an issue:

If the basic checks above have been completed and no obvious error or hint was found, collect the following and submit a support case to BMC:

1) The output reports folder generated by the health check tool

2) The output of "pw dump 1" from the TSIM server
