Troubleshooting the TrueSight Infrastructure Management Cell crash
The TrueSight Infrastructure Management Cells are event-processing engines that store all events and data in memory as well as on disk almost in real-time. A Cell runs either as a Windows service or as a UNIX daemon on supported platforms.
The following types of Cells are used in TrueSight Infrastructure Management:
- TrueSight Infrastructure Management Server Cell—functions as part of the TrueSight Infrastructure Management server to provide local event and service impact management.
- Remote Cell—installed separately from the TrueSight Infrastructure Management server, the Remote Cell functions as part of a larger distributed network of Cells that propagate events to the TrueSight Infrastructure Management server Cell.
If you observe a crash in either type of Cell, this guide helps you collect the appropriate logs and follow the troubleshooting steps to either resolve the problem or create a BMC Support case.
Issue symptoms
The TrueSight Infrastructure Management Cell crashes or frequently goes down.
Issue scope
The interval between Cell crashes can vary from a few minutes to several months.
- Either one Cell in a high-availability pair crashes, or both Cells crash
- The TrueSight Infrastructure Management Cell in a standalone deployment crashes
Diagnosing Cell crash
If a Cell crash occurs, use the core file (UNIX) or the crash dump file (Windows) for debugging. Make sure that the systems are set up properly to capture the core dump. Usually, the Cell must be running a debug build of the Cell binary (from development) for the core file to be useful for diagnosis.
Check whether you are running the latest hotfix, because hotfixes often correct multiple crash conditions. You can check the build by running the mCell -z command.
For example, if you are running TrueSight Infrastructure Management version 11.3.02, the latest Cell hotfix (PAPAN.11.3.02.00x) can be found at the following FTP location: ftp://ftp.bmc.com/pub/TSIM/PATCHES/11.3.02/Cell/
Prepare Linux OS to create crash dump for the Cell process
- Make sure that the ulimit is set correctly for the user by performing the following steps:
- Log in as the Cell installation user, for example, srvemstsom.
If you switch to this user with su, use su - srvemstsom so that the login environment of the user is loaded.
Run the following command to see the ulimit settings:
ulimit -a
Verify that the core file size parameter is set to unlimited.
- Make sure that the ulimit is set correctly for the process by performing the following steps:
Run the following command to get the process ID:
ps -elf |grep "mCell -n worker" |grep -v grep
Run the following command, replacing <PID> with the process ID collected from the previous command:
cat /proc/<PID>/limits
In the displayed list of settings, verify that the value of the Max core file size parameter is unlimited for both the soft limit and the hard limit.
- Make sure that the current working directory (cwd) of the process is set to a location to which the user srvemstsom can write, and change it to a writable directory if necessary, by performing the following steps:
- Log in as the installation user: srvemstsom.
If you switch to this user with su, use su - srvemstsom.
Run the following command to get the cwd:
ls -l /proc/<PID>/cwd
The output typically shows that the cwd is /
- Change the cwd to /tmp or another directory to which srvemstsom can write.
Run the following commands (attach, call, detach, and quit are entered at the gdb prompt):
gdb -q
attach <PID>
call (int) chdir("/tmp/")
detach
quit
Run the following command to verify that the cwd is updated:
ls -l /proc/<PID>/cwd
- Crash the mCell to verify that it creates a core dump file. (You can skip this step if the OS administrator is confident that the OS is set to create a dump file.)
Run the following command to crash the Cell (worker number 2 in this example), replacing <PID> with its process ID:
kill -ABRT <PID>
Verify that the core file was created in the /tmp directory. Run the following command to look for the core file:
ls -altr /tmp/ |grep core
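The manual checks above can also be scripted. The following is a minimal sketch rather than a BMC-provided tool: the function names core_limits, core_unlimited, and check_core_setup are hypothetical, and the sketch assumes a Linux /proc file system in which the limits file contains a line beginning with "Max core file size", as the steps above describe.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not part of the product.

# Print the soft and hard "Max core file size" values from a limits file.
# $1 = path to a limits file, normally /proc/<PID>/limits
core_limits() {
    awk '/^Max core file size/ { print $5, $6 }' "$1"
}

# Succeed only if both the soft and the hard limit are "unlimited".
core_unlimited() {
    [ "$(core_limits "$1")" = "unlimited unlimited" ]
}

# Report the core limit and the cwd writability for one process.
# $1 = process ID of the Cell
check_core_setup() {
    pid=$1
    if core_unlimited "/proc/$pid/limits"; then
        echo "core limit: unlimited"
    else
        echo "core limit: $(core_limits "/proc/$pid/limits") (soft hard), expected unlimited"
    fi
    # -w follows the cwd link and tests whether the calling user can write there.
    if [ -w "/proc/$pid/cwd" ]; then
        echo "cwd: writable ($(readlink "/proc/$pid/cwd"))"
    else
        echo "cwd: not writable ($(readlink "/proc/$pid/cwd"))"
    fi
}
```

Run check_core_setup <PID> as the Cell installation user, so that the writability test reflects the permissions of that user.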
Prepare Microsoft Windows OS to create crash dump for the Cell process
Download and install Microsoft's Debug Diagnostic Tool from https://www.microsoft.com/en-us/download/details.aspx?id=58210.
- Launch DebugDiag and configure the Debug Diagnostic Tool to create a crash dump against the mCell.exe process.
- In the Rule tab, right-click anywhere and select Add rule.
- Select Crash and click Next.
- Select A Specific process and click Next.
- Select mCell.exe and click Next.
- Select Full Userdump as the action type for the first chance exception.
- Click Exceptions, click Add Exception, and then select Access Violation in the Configure Exception dialog box.
- Click Yes to confirm the action limit.
- Click Activate the rule now and then click Finish.
Reporting an issue
- Ensure that the OS platform is configured to create a crash dump for the Cell process, as described in the preceding sections.
- Stop Cell(s) on the server.
- Take a backup of the mCell and mCell.dat files in the $MCell_HOME/bin directory, and then copy the files of the same name from the $MCell_HOME/bin/debug directory into the $MCell_HOME/bin directory.
- Run the mCell -z command and verify that the output contains D after the version number. For example:
BMC TrueSight Impact Manager 11.3.04D (Build 241511537 - 2-Sep-2020) [l6]
This confirms the debug binary is in place.
- In the $MCell_HOME/etc/<Cell> directory, create or edit the mCell.trace file and add the following content:
ALL ALL stderr
- In the $MCell_HOME/etc/<Cell> directory, edit the mCell.conf file and add the following entries:
Trace=Yes
TraceRuleLevel=2
TraceFileSize=25M
- Start the Cell(s).
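The check for the D suffix in the version output can be automated. The sketch below is illustrative only: is_debug_banner is a hypothetical helper, and the pattern assumes that the banner always has the shape of the single example line shown above (a dotted version number immediately followed by D).

```shell
#!/bin/sh
# Sketch only: assumes the banner format of the example line above.

# Succeed if the text contains a dotted version number followed directly by D.
# $1 = version output, for example the saved output of the version command
is_debug_banner() {
    printf '%s\n' "$1" | grep -Eq '[0-9]+(\.[0-9]+)+D( |$)'
}
```

For example, is_debug_banner "$(mCell -z)" && echo "debug build in place" prints the message only when the suffix is present.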
The next time that the Cell crashes, zip and return the following to BMC Customer Support:
- $MCell_HOME/etc/<Cell>
- $MCell_HOME/log/<Cell>
- $MCell_HOME/var/<Cell>
- Output of mCell -z
- Core dump file (For Linux, check with OS administrator for the location. For Windows, the location is shown during the configuration of Debug Diagnostic Tool.)
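On Linux, the three Cell directories can be bundled in one step. The following is a minimal sketch under stated assumptions: collect_cell_data is a hypothetical helper, it uses tar with gzip compression instead of zip, and it expects the Cell home directory, the Cell name, and the archive path as arguments. The output of mCell -z and the core dump file still need to be added separately.

```shell
#!/bin/sh
# Sketch only: bundles etc/<Cell>, log/<Cell>, and var/<Cell> into one archive.

# $1 = Cell home directory
# $2 = Cell name
# $3 = archive to create, for example /tmp/cell_support.tar.gz
collect_cell_data() {
    home=$1
    cell=$2
    out=$3
    # -C makes the paths inside the archive relative to the Cell home directory.
    tar -czf "$out" -C "$home" "etc/$cell" "log/$cell" "var/$cell"
}
```

For example, collect_cell_data "$MCell_HOME" <Cell> /tmp/cell_support.tar.gz creates the archive in /tmp. Choose a target location with enough free space, because the var directory can be large.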