Product monitoring recommendation
MainView Middleware Monitor (MVMM) includes a degree of self-monitoring. However, when a significant problem originates outside the MVMM processes themselves, the product might not detect it, or might be affected so severely that it cannot process any events and therefore cannot alert users to the problem.
For this reason, it is best practice to monitor the operating system where the services run, the service processes themselves, and the health and status of the database instance.
The following lists are a good starting point. Each customer organization will add to and customize them based on its environment and system monitoring experience.
MVMM service process items to be monitored:
- Monitor the following processes: QPTS, QPES, QPHS, qpcgateway, and QPAS. Note that QPAS appears only as java and wrapper processes; on Linux, the PID of the currently running QPAS process is in the qpas.pid file. On Microsoft Windows, check that the corresponding Windows Service entries are running.
- Check that all of these processes are running.
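On Linux, the qpas.pid check can be scripted. The following is a minimal Python sketch, not part of the product: the function names are hypothetical, and the location of qpas.pid is passed in as a parameter because it depends on your installation. It reads the PID from the file and asks the operating system whether a process with that PID exists.

```python
import os
from pathlib import Path


def read_pid(pid_file):
    """Return the PID stored in pid_file, or None if it is missing or malformed."""
    try:
        return int(Path(pid_file).read_text().strip())
    except (OSError, ValueError):
        return None


def pid_is_running(pid):
    """Return True if a process with this PID exists (POSIX systems only)."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence; nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def qpas_is_running(pid_file):
    """Hypothetical helper: True if the PID recorded in qpas.pid is alive."""
    return pid_is_running(read_pid(pid_file))
```

A PID can be reused by an unrelated process after a crash, so a check like this is best combined with the log file check described in this section.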
If you have a tool that monitors the age or content of log files, check that a line like the one below is written to qpts.log, qpes.log, and qphs.log about every 5 minutes. Alternatively, check that the last update time of each of those log files is no more than 15 minutes old.
Time:2020-03-26 14:36:43.739646, Category=com.mqsoftware.StatsUtils.StatsUtils_Collection, File=d:\bmcprojects\dev\trunk\mqsdev\statsutils\statsutils_collection.cpp, Line=417, Process=91832, Thread=90276, Session=, Level=QP_LM_STATS) QPPS: T(K=0.3437500, U=0.6093750) M(PFC=196, WS=207454208, PWS=366141440, PU=196149248, PPU=352845824, MnWS=204800, MxWS=1413120) IO(RC=195, WC=16773, OC=400, RTC=319423, WTC=5799570, OTC=8230)
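If no log monitoring tool is available, both checks can be approximated with a short script. The following Python sketch is an illustration only: the function names are hypothetical, the log file paths must be supplied for your installation, and the timestamp pattern is inferred from the single sample line above, so it may need adjusting.

```python
import os
import re
import time
from datetime import datetime

MAX_AGE_SECONDS = 15 * 60  # the 15-minute limit suggested above

# Pattern inferred from the sample line: a leading "Time:" stamp and a
# "Level=QP_LM_STATS" marker later on the same line.
STATS_LINE = re.compile(
    r"^Time:(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{1,6}),.*Level=QP_LM_STATS"
)


def stale_logs(paths, max_age=MAX_AGE_SECONDS, now=None):
    """Return the paths that are missing or were last modified too long ago."""
    now = time.time() if now is None else now
    stale = []
    for path in paths:
        try:
            if now - os.path.getmtime(path) > max_age:
                stale.append(path)
        except OSError:
            stale.append(path)  # treat a missing or unreadable log as stale
    return stale


def last_stats_time(lines):
    """Return the timestamp of the last statistics line, or None if none is found."""
    latest = None
    for line in lines:
        match = STATS_LINE.match(line)
        if match:
            latest = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
    return latest
```

For example, stale_logs(["qpts.log", "qpes.log", "qphs.log"]) returns the subset of those files that should raise an alert, assuming the script runs in the directory that contains them.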
Operating system metrics to be monitored:
- Swap/paging space in use and paging rate.
- CPU usage should stay between a lower and an upper threshold. The actual threshold values differ for every installation. However, if CPU usage stays above 90% for long stretches of time, something is probably wrong, or the monitoring load has grown to the point where the server needs more capacity. Similarly, if the services show, for example, less than 2% CPU utilization in a production environment that normally runs in the 30-50% range, something might be wrong. Adjust these thresholds as your monitoring footprint changes.
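The two-sided CPU check can be expressed as a small rule over recent samples. The following Python sketch is illustrative only: the function name and the min_run parameter (how many consecutive samples count as a "long stretch") are assumptions, and the 2% and 90% defaults are the example figures from the text, not recommended values for any particular installation. How the samples are collected is left to your monitoring tool.

```python
def classify_cpu(samples, low=2.0, high=90.0, min_run=5):
    """Classify a series of CPU-usage percentages, oldest first.

    Returns "high" if the last min_run samples are all above `high`,
    "low" if they are all below `low`, and "ok" otherwise.
    """
    if len(samples) < min_run:
        return "ok"  # not enough data to call it a sustained condition
    recent = samples[-min_run:]
    if all(sample > high for sample in recent):
        return "high"
    if all(sample < low for sample in recent):
        return "low"
    return "ok"
```

Requiring several consecutive samples avoids alerting on short spikes or brief idle periods, which matches the "long stretches of time" wording above.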
Database Instance items to be monitored:
- Database Instance is up.
- Database Listener is running and can be connected to.
- Buffer Cache Hit Ratio is over 95%. This number should trend toward 99%, but it may briefly drop lower, especially when the services are restarted.
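A first-level check that the database listener "can be connected to" is a plain TCP connection attempt. The following Python sketch assumes you supply the listener's host and port for your environment; the function name is hypothetical. It confirms only that something accepts connections on that port. It does not prove that the listener can hand sessions to the database instance, so a real login test from your database tooling is still worthwhile.

```python
import socket


def listener_accepts_connections(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False
```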