Additional RSCD agent troubleshooting information
The following topics apply to troubleshooting issues with the agent:
- The primary source for agent debugging is the rscd.log file (see Logging for more information).
- The logging level can be adjusted in the log4crc.txt file (see Controlling agent logging with the log4crc.txt file for more information).
- On Microsoft Windows, the rscdsvc.log file is also available.
Two utilities are available for administering the RSCD agent in the <BMCServerAutomationInstallation>/RSC directory:
- agentctl: Starts, stops, restarts, kills, or pauses the agent. The restart option is used during the upgrade process to bring down the agent, perform a command, and start the agent back up. For more detailed information about the agentctl command, see the related man page.
- secadmin: Creates the secure file from the command line. For more detailed information about the secadmin utility, see Configuring the secure file.
For the agent to work properly, the agent host must be able to resolve all hostnames specified in the configuration files under the rsc directory. If DNS is not configured, modify the hosts file or use the agent's IP address.
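As a quick way to check name resolution, the following minimal sketch reports which hostnames fail to resolve on the local machine. It is not a BMC utility: the function name `unresolved_hosts` and the injectable `resolver` parameter are illustrative assumptions, and you must supply the list of hostnames from your own configuration files.

```python
import socket


def unresolved_hosts(hostnames, resolver=socket.gethostbyname):
    """Return the hostnames that the resolver cannot resolve.

    `resolver` defaults to the system lookup (DNS or hosts file); it is a
    parameter only so that the check can be exercised without a network.
    """
    failed = []
    for name in hostnames:
        try:
            resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(name)
    return failed
```

For example, calling `unresolved_hosts(["myserver"])` on the agent host returns an empty list when `myserver` resolves and `["myserver"]` when it does not.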
The patch analysis function of BMC Server Automation requires the Microsoft XML (MSXML) parser version 6.0 SP2 or later to be installed on the server on which the RSCD agent is installed. You can install the RSCD agent on a computer on which MSXML is not installed, but patch analysis does not function correctly until MSXML 6.0 SP2 or later is installed. Run a live audit on target agents to determine whether MSXML is present, then download the appropriate SP from the Microsoft site and deploy it using BMC Server Automation.
The following processes run on each agent, depending on the operating system. Additional processes might be running if jobs are running on the agent or if jobs have not exited properly. You can kill processes that did not exit properly to ensure that the agent can be restarted.
|Operating system||Processes||Additional details|
|Windows||RSCDSvc and the Agent process||RSCDSvc is in charge of restarting the Agent process whenever it shuts down. RSCDSvc attempts to restart the Agent process up to 50 times. After that, if the service does not succeed in restarting the Agent, it stops attempting to restart the Agent and shuts itself down.|
|Linux or AIX||rscw (watcher), rscd (listener), rscd (logger)||The first process to start is the Agent watcher process, rscw (on Linux and AIX) or rscd (on HP-UX and Solaris). The watcher process spawns and monitors the other two processes, the Agent listener and the Agent logger, both named rscd. Whenever a listener or logger process shuts down, the watcher process tries to restart it. Unlike on Windows, there is no fixed count after which the Agent watcher process shuts itself down. It remains functional until it is killed by the superuser or by the operating system as part of a system shutdown or restart.|
|HP-UX or Solaris||rscd (watcher), rscd (listener), rscd (logger)||Same behavior as on Linux or AIX, except that the watcher process is named rscd.|
On all OS platforms, an agent log records all transactions between the application server and the agent. This does not mean that everything that happens on the agent (such as script output that is written to stdout) appears in the logs. However, every command issued from the application server is logged with the role:user who executed it, a date/timestamp, and the actual command.
The log file can be viewed using the logman command from the NSH command line.
On Windows, the rscd.log file can be found in <BMCServerAutomationInstallation>/RSC. On UNIX, the file is located in the .../RSC/log directory. In these same directories, you might see files named rscd.log1, rscd.log2, ..., rscd.logn. Each time an agent goes down and reboots, the log file is saved off, numbered, and a new one takes its place.
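When a directory contains many rotated files, plain alphabetical sorting puts rscd.log10 before rscd.log2. The sketch below orders the file names by their numeric suffix instead. The helper name `order_agent_logs` is hypothetical, and the code only sorts names; it makes no assumption about which suffix holds the most recent entries.

```python
import re

# Matches rscd.log and rscd.log<number>; anything else is ignored.
_ROTATED = re.compile(r"^rscd\.log(\d*)$")


def order_agent_logs(filenames):
    """Return rscd.log first, then rscd.log1, rscd.log2, ... in numeric order."""
    matched = []
    for name in filenames:
        m = _ROTATED.match(name)
        if m:
            suffix = int(m.group(1)) if m.group(1) else 0
            matched.append((suffix, name))
    return [name for _, name in sorted(matched)]
```

You could pass it the result of `os.listdir()` on the log directory to get a predictable reading order.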
For rollback information and job logs, view the logs in the .../RSC/transactions directory.
For more information see Controlling agent logging with the log4crc.txt file.
Stopping and starting the agent
BMC recommends that you start and stop the RSCD Agent using the procedures in the table below. The UNIX rscd script executes the ps command on the agent and greps for the rscd process. If the process is found, the script executes the kill command on it. The UNIX script does not use the agentctl command; however, you can alter the script to do so.
In addition to the information below, you can also use the agentctl command on both platforms to manage the RSCD Agent.
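The ps and grep step described above can be sketched as follows. This is an illustration of the idea only, not the contents of the shipped rscd script, which is not shown here. The function name `find_rscd_pids` is hypothetical, and the parser assumes two-column output such as that produced by `ps -e -o pid=,comm=`.

```python
def find_rscd_pids(ps_output, names=("rscd",)):
    """Extract PIDs whose command base name is in `names`.

    Expects one process per line in the form "<pid> <command>".
    Lines that do not fit that shape are skipped.
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        pid, command = parts
        # Compare only the executable's base name, not its full path.
        base_name = command.strip().rsplit("/", 1)[-1]
        if pid.isdigit() and base_name in names:
            pids.append(int(pid))
    return pids
```

Each returned PID could then be passed to `os.kill(pid, signal.SIGTERM)`, which corresponds to the kill step. On Linux and AIX you might also include `rscw` in `names` to cover the watcher process.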
Stop BladeLogic RSCD Agent Service
Start BladeLogic RSCD Agent Service
Due to a bug in the Windows agent, it will not always stop cleanly when shutting down.
Various executables (in particular rscdsvc.exe) can be locked by system monitoring tools.
Agent does not start
You have just installed an RSCD Agent and it does not start. Also, no logs are showing up in the rscd.log file, or perhaps it is not being created.
To troubleshoot this problem, validate your hosts file. Assuming that your server is called myserver and it has an IP address of 192.168.0.9, you will want to see something like this:
127.0.0.1 localhost
192.168.0.9 myserver
If you don't have an entry for myserver (or if you have a typo), your RSCD agent might not start.
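To check a hosts file for a missing or mistyped entry, a minimal sketch such as the following can help. The function names `hosts_entries` and `has_entry` are hypothetical; the code parses hosts-file text that you read yourself (for example, from /etc/hosts on UNIX) and keeps the first address listed for each name.

```python
def hosts_entries(hosts_text):
    """Map each hostname or alias in hosts-file text to its IP address."""
    mapping = {}
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            # Keep the first address seen for a name.
            mapping.setdefault(name.lower(), ip)
    return mapping


def has_entry(hosts_text, hostname, expected_ip=None):
    """Return True if hostname is listed (and matches expected_ip, if given)."""
    ip = hosts_entries(hosts_text).get(hostname.lower())
    if ip is None:
        return False
    return expected_ip is None or ip == expected_ip
```

With the example values above, `has_entry(text, "myserver", "192.168.0.9")` returns True only when the file maps myserver to that address.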
Windows agent does not shut down cleanly
If the Windows agent is in the process of handling any type of command when you stop it, the agent might not stop cleanly. If the agent does not stop cleanly, when the upgrade finishes and tries to restart the agent, the restart will fail.
How can I tell if the agent did not stop cleanly?
After stopping the agent, execute the netstat command.
If you see a line similar to the one below, then the agent did not stop cleanly.
TCP 0.0.0.0:4750 0.0.0.0:0 LISTENING
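If you capture the netstat output as text, the check for a lingering listener can be automated. The sketch below is an assumption-laden illustration: the function name `is_listening` is hypothetical, and the parser expects Windows-style lines in the form shown above (protocol, local address, foreign address, state).

```python
def is_listening(netstat_output, port=4750):
    """Return True if any TCP line shows a LISTENING socket on `port`."""
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue
        local_address, state = fields[1], fields[3]
        # The port is whatever follows the last colon of the local address.
        if state == "LISTENING" and local_address.rsplit(":", 1)[-1] == str(port):
            return True
    return False
```

A True result after you stop the agent suggests that the agent did not stop cleanly.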
Is there any way I can get the agent to fully stop?
Try performing the following actions:
1. Start taskmgr and see if any processes are owned by BladeLogicRSCD. If you see such processes, stop them. After stopping the processes, check netstat again to see if the agent is still listening.
2. If the agent is still listening, check the netstat output to see which client is connected to the agent. If you can identify the relevant client, go to that client and try to kill the connection.
3. If the agent is still listening on port 4750 after you perform steps 1 and 2, wait another 10 to 15 minutes.
4. If the agent is still listening after the wait, you will probably have to reboot.
After running the installer, you can check for the log file in C:\<WINDIR>\Temp\RSCD-Install.log.
Agent executables are locked
In this problem scenario, the agent's executable files are locked, and the agent cannot start.
How can I determine what has locked those files?
A good tool to use is Handle, which can be found at http://www.sysinternals.com. You can run Handle against the locked executables, and it should return the process that has them locked.
What if Handle does not tell me anything?
If Handle does not tell you anything, check the following possible sources of the lock:
- Improved logging for WMI (this has been a major culprit in locked rscdsvc.exe files).
- Other monitoring software on the box
What else could be wrong if the agent is not starting?
Prior to 8.1, the BladeLogicRSCD user had a hard-coded default password. Since 8.1, this password is randomly generated and stored in the registry key HKEY_LOCAL_MACHINE\SECURITY\SAM\BladeLogic\Operations Manager\RSCD as the 'E' and 'S' values. Also, in prior versions, if the password for this user was changed using chapw, the same location has a value 'p' with the changed password.
If the user exists with either a random or a changed password, but the registry keys that store the password were deleted or were never created, the agent does not start and reports an error. If the user does not exist, the user and the registry entries are recreated, which explains why the agent starts successfully once the user is deleted.
Check the registry key and values after the agent starts up to ensure that the registry keys were created. Also, if you see BladeLogicRSCD@BL-WINWWW in the logs, the machine BL-WINWWW might be on a domain. If it is a domain controller in a multi-master environment, that might be related to the error.
Upgrade of an agent is not an issue with these changes, though, as the newer versions of the agent are backward-compatible.
When a deploy job runs on an agent, the job is executed, and content and XML instruction files for rollback are pushed to the agent and saved. The content and instruction files are saved in the transactions directory along with the job log files (see Logging for more information).
Because the content and instruction files must reside on the agent in the transactions directory in order for the user to roll back the deploy job, files that are associated with an installation that may need to be rolled back should NOT be deleted. For those jobs that do not require rollback, either configure the job not to allow rollback, or delete the appropriate content and instruction files from the transactions directory.
Files left behind in the temporary or staging directory can be deleted at any time. The staging directory for each server is set in a server's property list. You can view the property via the Configuration Manager console.
(Windows 2000) Multiple jobs cannot map to SMB server
For agent mounting with Windows 2000 agents, only one job at a time can attempt to map to an SMB server. There are two issues for Windows 2000:
- The drive letters available are a single set from A-Z, while a Deploy Job only allows access to 15 drive letters.
- The drive mappings cannot be shared between different logon sessions for the same user account. The deployment cache mapping attempts to access and clear mapped drives from a different session, which prevents the creating session from properly freeing the drive mapping. When this condition occurs, restart that Windows 2000 computer to remove the defunct drive mappings.
To prevent this issue, in version 7.4 a change was made to force all deployment jobs into single-job mode whenever an agent mapping exists for a Windows 2000 server.
(Red Hat Linux) Agent installation experiences problem
Before installing the agent on a Red Hat Enterprise Linux 5.0 platform, make sure that the SELINUX setting in /etc/selinux/config is set to disabled. If it is not, change the value to disabled and reboot the server prior to the installation.
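The SELINUX setting can be read programmatically before you start the installation. The following minimal sketch parses the text of /etc/selinux/config and returns the value of the SELINUX line. The function name `selinux_mode` is hypothetical, and the code only reads the configured value; it does not query the mode that is currently in effect.

```python
def selinux_mode(config_text):
    """Return the value of the SELINUX= setting, or None if it is absent."""
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        # Match the key exactly so that SELINUXTYPE= is not picked up.
        if sep and key.strip() == "SELINUX":
            return value.strip().lower()
    return None
```

For example, read the file with `open("/etc/selinux/config").read()` and proceed with the installation only if the function returns `"disabled"`.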
VMFS file system mount point is not correct in the console
If a VMFS file system has a space in its mount point name, then that mount point may not appear properly under the File System node in the BMC Server Automation Console.
Secure logs do not receive signatures
Setting the agent log rolling size limit to a low value (such as 100KB) can cause problems with the secure logs, including some not getting signed.
Hardware information object jobs run for a long time
Hardware Information Snapshot and Audit Jobs run for a longer time when multiple objects in the hierarchy are selected with the recursive option.
Workaround: For better performance, do one of the following:
- Select the top node only and click Recurse subfolders to enable child nodes.
- Select one or more of the leaf nodes needed for the snapshot and clear the Recurse subfolders option if it was selected. (This option is disabled by default for leaf nodes.)
UNIX hardware information objects produce messages
Hardware Information Object system commands invoked by UNIX objects (for example, UnixUsers and UnixGroups) might produce messages that seem like errors but do not prevent the UNIX object from doing its job. Umbrella facilities such as SELinux can occasionally inject into system commands errors or warnings that are associated with their own corrupted configurations rather than with the system command. To determine whether a message represents a real problem (assuming that the UNIX object's action was successful), use the following methods:
- Increase the agent or BLDeploy logging level to DEBUG or DEBUG2.
- Check the appropriate log to see if a system command invocation preceded the error message.
- Run the same command from the command line. If the same error or warning appears but the command succeeds, the issue is not likely to be related to the UNIX object.
Console freezes when running PowerShell command
The console can freeze if you run a Network Shell script that uses an nexec command to run a PowerShell command on an agent, or if you browse an extended object that is defined to run a PowerShell command on an agent using remote execution. The freeze always occurs when using PowerShell version 1 and sometimes occurs with PowerShell version 2.
Workaround: Run the PowerShell command directly on the agent using a command line interface, or upgrade to PowerShell version 2 and invoke the PowerShell command using the "-InputFormat none" flag.