Troubleshooting TrueSight Infrastructure Management component failure
This topic describes common issues and troubleshooting steps for a TrueSight Infrastructure Management (TSIM) server crash. The goal is to help you identify the likely cause and either resolve it or continue troubleshooting.
Issue symptoms:
- The TSIM component shows "Disconnected" in the TSPS console
- All devices show as disconnected and data collection has stopped
- On the TSIM server, the output of the command "pw p l" shows "not running" for some or all processes
- On the TSIM server, the output of the command "pw lic list" shows "Failed to connect to server"
- In TSIM HA mode, an unexpected failover occurred
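The two command-line symptoms above can be checked with a small script. The following is a minimal sketch, not a BMC-provided tool: it assumes the `pw` command is on the PATH of the user running it, and it assumes only that the command output contains the strings quoted above ("not running" and "Failed to connect to server"). The exact column layout of the `pw p l` output is not assumed, and the function names are illustrative.

```python
import subprocess


def run_pw(args):
    """Run a pw CLI command, for example ["p", "l"], and return its text output.

    Assumes the pw executable is on the PATH of the current user.
    """
    result = subprocess.run(["pw", *args], capture_output=True, text=True)
    return result.stdout + result.stderr


def stopped_processes(process_list_output):
    """Return the lines of `pw p l` output that report a process as not running."""
    return [
        line.strip()
        for line in process_list_output.splitlines()
        if "not running" in line.lower()
    ]


def license_server_unreachable(license_output):
    """Return True if `pw lic list` output reports a failed server connection."""
    return "failed to connect to server" in license_output.lower()
```

On a TSIM server, `stopped_processes(run_pw(["p", "l"]))` would list the affected processes, and `license_server_unreachable(run_pw(["lic", "list"]))` would flag the licensing symptom.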
Basic checks:
Run a Health Check on the TSIM server(s)
ftp://ftp.bmc.com/pub/TSOM/HealthCheck/
Check the Health Check HTML report for any ERROR, CRITICAL, or WARNING issues, address them, and then restart TSIM.
Most crash issues are resolved once the issues reported in the Health Check Tool (HCT) report have been addressed.
- Important logs to check for error messages:
TSIM in High Availability mode: <TSIM install dir>/pw/pronto/logs/ServerComponentAvailability.log and TrueSight.log
TSIM in standalone mode: <TSIM install dir>/pw/pronto/logs/TrueSight.log
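When the logs are large, it can help to filter them programmatically. The sketch below assumes that entries follow the layout of the ERROR lines quoted later in this topic (level, MM/DD date, time, component, thread name in square brackets, numeric code, message). Lines in other layouts are skipped, and the field names are labels chosen for this example, not official BMC terminology.

```python
import re
from collections import Counter

# Layout inferred from the sample ERROR lines quoted in this topic.
LOG_LINE = re.compile(
    r"^(?P<level>[A-Z]+)\s+(?P<date>\d{2}/\d{2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<component>\S+)\s+\[(?P<thread>.*?)\]\s+(?P<code>\d+)\s+(?P<message>.*)$"
)


def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def error_entries(lines, levels=("ERROR",)):
    """Return parsed entries whose level is in `levels`."""
    entries = []
    for line in lines:
        entry = parse_log_line(line)
        if entry and entry["level"] in levels:
            entries.append(entry)
    return entries


def errors_by_component(lines):
    """Count ERROR entries per component to show where failures cluster."""
    return Counter(entry["component"] for entry in error_entries(lines))
```

Opening TrueSight.log with `open(path, errors="replace")` and passing the file object to `errors_by_component` gives a quick overview of which components are logging errors.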
Reference:
- If a *.hprof file was created under <TSIM install dir>/pw/pronto/logs when the crash happened, also see the following link for further troubleshooting:
Troubleshooting Java memory management
If the issue is a cell crash that brought down the whole TSIM server (the mcell process went down unexpectedly), see the following document for the data to collect for further troubleshooting:
Which data is required to investigate a TrueSight Infrastructure Management cell crash?
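The check for *.hprof files mentioned above can be scripted. This is a minimal sketch that lists heap dumps in a given logs directory, newest first. The directory is passed in by the caller because the install location varies, and the function name is illustrative.

```python
from pathlib import Path


def find_heap_dumps(logs_dir):
    """Return (file name, size in bytes) for each *.hprof file, newest first."""
    dumps = sorted(
        Path(logs_dir).glob("*.hprof"),
        key=lambda path: path.stat().st_mtime,
        reverse=True,
    )
    return [(path.name, path.stat().st_size) for path in dumps]
```

Comparing the modification time of the newest dump with the time of the crash helps confirm whether the dump belongs to the incident under investigation.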
Resolutions for common issues
Symptom | Action | Reference |
---|---|---|
TrueSight Infrastructure Management (TSIM) server crashes with "javax.jms.ResourceAllocationException: Usage Manager Memory Usage limit reached" errors | Perform the following steps on the problematic TSIM server. Note: In an HA environment, make the changes on both nodes. | |
The rate process crashed, and TrueSight.log contains many error messages like the following: "Reader queue is full. Dropping AppsMsg" | 1) Set the following property value in pw\custom\conf\pronet.conf: #pronet.ipc.socket.recvbuffersize=1024000 2) Restart TSIM. | |
The health check tool report contains an entry indicating that the MFD count is very high (over 15K) | Contact BMC Support for help with manually removing the excess MFD instances from the database | |
ServerComponentAvailability.log contains the following message: ServerComponentAvailability [HA-ServerComponentsAvailability-Monitor] 600002 TrueSight server is not been able to establish connectivity with the database. Details: Database is unavailable for the duration:20 min If using an Oracle database, contact your Oracle Database Administrator immediately to rectify your database connectivity issue. For TrueSight Infrastructure Management using SAP SQL Anywhere database, contact your TrueSight Administrator. Recovery action: Shutting down the TrueSight Infrastructure Management application. | Work with your Oracle DBA to verify that the connection between TSIM and the Oracle database is working correctly | |
TSIM crashed, and after the restart jserver cannot be initialized even after several hours. TrueSight.log contains many messages like the following: ACMessageProcessor [AgentConnector-10002] Dropping message ...ID:>[PA-0-10357-1614689340-178252] Msg-TS:>[1614689340000]. This message is dropped because of duplicate message id or message with higher timestamp is already present in existing queue. Missing Resource String In addition, the health check tool report shows that InstanceCount has reached 250K | Reduce the number of monitor instances. See the section "Configure filters to include or exclude data and events" in the BMC documentation: https://docs.bmc.com/docs/TSInfrastructure/113/defining-a-monitoring-policy-774797086.html#monpolicy-1122605386 | |
The Agent Controller generated an HPROF file, some ISNs disconnected, and CPU usage is at 100%. Below are the messages seen in TrueSight.log: | Extend the Consumption Based Licensing (CBL) interval to 2 months: 1) Back up the file %BMC_PROACTIVENET_HOME%\custom\conf\pronet.conf. 2) Edit %BMC_PROACTIVENET_HOME%\custom\conf\pronet.conf: a) Comment out the following parameters: #usage.data.collection.delay.in.sec=60 #pronet.jserver.licensereport.eventsync.sleep.minutes=1 b) Add the following parameters: #2 months usage.data.collection.delay.in.sec=5184000 pronet.jserver.licensereport.eventsync.sleep.minutes=86400 c) Increase the scheduler interval for the CBL summarization code to 2 months: #2 months #usage.summarization.delay.in.sec=86400 usage.summarization.delay.in.sec=5184000 Also add the following properties if they are not already present in custom/conf/pronet.conf: pronet.cbl.attr.count.task.interval.hours=24000 pronet.cbl.attr.count.task.first.interval.hours=24000 | |
TrueSight.log contains the following error: ERROR 05/24 09:04:00 Stderr [ActiveMQ Transport: tcp://localhost/127.0.0.1:8093@45083] 700100 Exception in thread "ActiveMQ Transport: tcp://localhost/127.0.0.1:8093@45083" | The following steps increase this limit for the Linux user that runs TSIM. Root access may be required to make these changes: | TrueSight Infrastructure Management (TSIM) crashes intermittently with "java.lang.OutOfMemoryError: unable to create new native thread" in TrueSight.log |
TrueSight.log contains the following error: ERROR 03/13 15:39:24 OracleMon [SerialPollEngine-Worker#2-18] 2500100 Oracle DB Server not responding at the IP/Port/Protocol specified or Connect to DB Timedout | Check whether this feature is enabled: If so, check how many rows are in the recyclebin: Then verify who is contributing the most to the recyclebin size with: | Oracle database used with TrueSight Infrastructure Management (TSIM)/ BMC ProactiveNet Performance Management (BPPM) is performing really poorly |
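The log signatures listed in the table above can be searched for in one pass. The following is a rough sketch: the signature strings are copied from the table, while the short labels on the right are shorthand chosen for this example and are not official BMC issue names. A match only points to the relevant table row and does not replace the health check.

```python
# Signature text (from the table above) -> shorthand label for the table row.
KNOWN_SIGNATURES = {
    "Usage Manager Memory Usage limit reached": "JMS memory usage limit reached",
    "Reader queue is full. Dropping AppsMsg": "rate process reader queue full",
    "Database is unavailable for the duration": "database connectivity lost",
    "This message is dropped because of duplicate message id": "agent messages dropped",
    "unable to create new native thread": "native thread limit reached",
    "Oracle DB Server not responding": "Oracle database not responding",
}


def match_signatures(log_text):
    """Return {label: occurrence count} for each known signature found in log_text."""
    counts = {}
    for signature, label in KNOWN_SIGNATURES.items():
        hits = log_text.count(signature)
        if hits:
            counts[label] = hits
    return counts
```

Running `match_signatures` over the contents of TrueSight.log and, in HA mode, ServerComponentAvailability.log shows which rows of the table are worth reading first.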
Diagnosing and reporting an issue:
If the basic checks above do not reveal an obvious error or hint, collect the following and open a support case with BMC:
1) The output reports folder generated by the health check tool
2) The output of pw dump 1 from the TSIM server
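Bundling the collected items into a single archive makes them easier to attach to a support case. The sketch below is one possible way to do this with the Python standard library. The paths are supplied by the caller, since the location of the health check reports folder and of the pw dump 1 output depends on the environment, and the function name is hypothetical.

```python
import tarfile
from pathlib import Path


def bundle_for_support(paths, archive_path):
    """Pack existing files and folders into one .tar.gz; return the names added."""
    added = []
    with tarfile.open(str(archive_path), "w:gz") as archive:
        for path in map(Path, paths):
            if path.exists():
                # Store each item under its own name, without the parent directories.
                archive.add(str(path), arcname=path.name)
                added.append(path.name)
    return added
```

Paths that do not exist are skipped without raising an error, so comparing the returned list with the input shows whether anything is missing from the bundle.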