Frequently asked questions (FAQ) about performance and scalability
This topic provides answers to frequently asked questions about performance and scalability.
If a high-performance SAN, or equivalent, is used for data storage, you should experience acceptable performance even if you double the default data retention (raw or condensed). If you need more than double the default retention, BMC Software recommends that you increase retention only for the condensed data, because the performance impact is lower. BMC Software recommends that you retain raw data for no more than 8 days and condensed data for no more than 1 year (365 days).
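The retention limits above can be expressed as a simple check. The following is a minimal sketch, not a product tool: the function name `retention_warnings` and the constant names are hypothetical, and only the 8-day and 365-day limits come from the recommendation above.

```python
from typing import List

# Upper bounds quoted in the recommendation above.
MAX_RAW_DAYS = 8
MAX_CONDENSED_DAYS = 365


def retention_warnings(raw_days: int, condensed_days: int) -> List[str]:
    """Return one warning per retention setting that exceeds the recommended maximum."""
    warnings: List[str] = []
    if raw_days > MAX_RAW_DAYS:
        warnings.append(
            f"raw retention of {raw_days} days exceeds the recommended {MAX_RAW_DAYS} days"
        )
    if condensed_days > MAX_CONDENSED_DAYS:
        warnings.append(
            f"condensed retention of {condensed_days} days exceeds "
            f"the recommended {MAX_CONDENSED_DAYS} days"
        )
    return warnings
```

An empty list means that both values are within the recommended limits.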
For data collection, we used 30 days for stats data and 90 days for rate and baseline data.
You cannot connect to more than 10 remote cells at one time. Disconnect some cells and connect to the cells that you want to monitor.
On average, a BMC PATROL Agent collects 850 parameters. The number of parameters might differ, depending on the specific monitoring solutions that are loaded on the agent. These figures were tested on Windows and UNIX computers.
The most common symptom of the load on the system exceeding capacity is that the analytical engine runs out of memory. If this occurs, OutOfMemory exceptions are logged in one of the files that are located in the <installationDirectory>\pw\pronto\logs directory:
- If the OutOfMemory exception is caused by Rate or JServer processes, the log is in the ProactiveNet.log file.
- If the OutOfMemory exception is caused by the agent controller process, the log is in the pronet_cntl.out file.
You can increase the MaxHeap value in one of the following files, located in the <installationDirectory>\pw\custom\conf directory:
- For the Rate process, pnrate.conf
- For the JServer process, pnjserver.conf
- For the agent controller, pnagentcntl.conf
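The relationship between the log files and the configuration files can be sketched as a lookup. This is an illustrative example only: the helper names are hypothetical, the check is a plain substring match on "OutOfMemory", and the exact format of the log lines is an assumption. The file names are the ones listed above.

```python
from typing import Dict, Iterable, List

# Log file -> configuration files whose MaxHeap value is a candidate to raise.
# ProactiveNet.log covers both the Rate and the JServer processes.
LOG_TO_CONF: Dict[str, List[str]] = {
    "ProactiveNet.log": ["pnrate.conf", "pnjserver.conf"],
    "pronet_cntl.out": ["pnagentcntl.conf"],
}


def count_oom_lines(lines: Iterable[str]) -> int:
    """Count the log lines that mention an OutOfMemory exception."""
    return sum(1 for line in lines if "OutOfMemory" in line)


def conf_files_to_review(logs: Dict[str, Iterable[str]]) -> List[str]:
    """Return the configuration files to review for each known log
    that contains at least one OutOfMemory line."""
    result: List[str] = []
    for log_name, lines in logs.items():
        if log_name in LOG_TO_CONF and count_oom_lines(lines) > 0:
            result.extend(LOG_TO_CONF[log_name])
    return result
```

In practice, the lines for each log would be read from the corresponding file in the logs directory. For ProactiveNet.log, the sketch returns both pnrate.conf and pnjserver.conf, because it does not try to determine which of the two processes raised the exception.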
The following list presents additional common issues that can arise when capacity is exceeded:
- Gaps sporadically appear in data collection across all monitors, or artificial alarm delays occur after a threshold condition has been violated. You might also see gaps in the data shown in graphs. When this occurs, you might see pending, cache size limit exceeded, and dropping messages logged in the Infrastructure Management log file.
- Gaps in data could result from memory issues or I/O bottlenecks. If you do not see OutOfMemory exceptions logged, the system probably has an I/O issue. Check the I/O status by looking at the percentage of time that the disk is busy (for example, by using the iostat command on Solaris). If the disk is consistently more than 30% busy, the system probably has an I/O issue.
- User response becomes much slower when using the web interface. If user response is slow for all interactions, the system has a problem with memory or CPU. If user response is only slow when graphing, the system probably has an I/O issue.
- Check the out-of-the-box monitors that Infrastructure Management creates on the BMC TrueSight Infrastructure Management Server. If you see sustained (several minutes) CPU spikes above 80%, this usually indicates some resource issue, although not necessarily a lack of CPU.
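The triage rules in the list above can be summarized in a short sketch. The function names are hypothetical, and "consistently" is interpreted here as "every collected sample exceeds the threshold", which is an assumption. The 30% disk-busy and 80% CPU thresholds come from the list above.

```python
from typing import Sequence

DISK_BUSY_THRESHOLD = 30.0  # percent of time the disk is busy
CPU_SPIKE_THRESHOLD = 80.0  # percent CPU utilization


def consistently_above(samples: Sequence[float], threshold: float) -> bool:
    """True when there is at least one sample and every sample exceeds the threshold."""
    return len(samples) > 0 and all(sample > threshold for sample in samples)


def likely_cause_of_gaps(oom_logged: bool, disk_busy_pct: Sequence[float]) -> str:
    """Suggest a probable cause for gaps in collected data."""
    if oom_logged:
        return "memory"
    if consistently_above(disk_busy_pct, DISK_BUSY_THRESHOLD):
        return "io"
    return "undetermined"


def sustained_cpu_spike(cpu_pct: Sequence[float]) -> bool:
    """True when every CPU sample in the observed window is above 80%."""
    return consistently_above(cpu_pct, CPU_SPIKE_THRESHOLD)
```

The samples would typically cover several minutes of readings, for example the disk-busy column reported by iostat. A result of "undetermined" means only that neither rule matched, not that the system is healthy.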
In general, it depends on what you are trying to accomplish. For best performance, install the agent that runs the adapter in the same network as the server from which it pulls data. This ensures that the adapter calls (for example, web service calls and database queries) do not have latency issues. The connection between the BMC TrueSight Infrastructure Management Server and the Agent bundles the data more efficiently, which results in more efficient cross-network traffic.
It is possible to use one large adapter to pull in data from an Infrastructure Management application, but there are times when it is better to use multiple adapters across one (or many) local agents. If you use separate adapters for different application types, you can set different polling intervals for the high-priority applications, which allows more scaling in the end. You can also do this for different application instances, not just for application types.
For the analytical engine to perform all of the tasks quickly, the majority of the data it operates on is cached in memory. Keep in mind that baseline and abnormality generation requires resources beyond basic monitoring. To perform baselining and abnormality generation more quickly, more memory is consumed.
Clustering provides continued operation in the event of a failure. It does not address performance. The provided scale estimates represent the minimum number of computers required to support a given workload.
A clustered environment requires additional computers so that, in the case of a failure, functionality continues to operate. Use the scale estimates to ensure that the failover environment will operate as expected.
No. In a controlled lab for a large environment, BMC can scale up to only 200,000 metrics.