Designing a Knowledge Module
What should a KM do
The following sections discuss items that help make a productive KM:
The goal of a KM is to present information rather than raw data, in a form that customers can understand, interpret, and act on comfortably.
Get all info about application
Link knowledge about the application with knowledge of how users use it, and give all of that knowledge back to the PATROL users.
Talk to the DBA, the end users, and the developers, and always talk to the support organization to find out which management and administrative features matter most. Support organizations are constantly bombarded with customer requests and "How do I" questions.
Ask them what the main problems are and what most people call support about.
If support says 30% of the calls are because of configuration issues, you might want to create a parameter that checks the configuration of the application (you might even check the configuration of the KM, because the KM will probably require some support as well).
Sometimes a simple check can significantly decrease the number of support calls. It might be even simpler to implement than you expect.
Definition of a good KM
So how do you define a good KM versus a bad one? Developers might think they wrote the best KM possible, because it lists every metric one can possibly capture from the application. This is definitely not the definition of a good KM.
You might even be on a bad track if your KM just displays a ton of information or even worse, a ton of raw data. If the intelligence is limited to displaying every possible value you can retrieve then you should consider the KM to be quite dumb, because the added value of the KM will be limited.
Some KMs leave the intelligence up to the end user and require so much user interaction in the form of response() functions that the KM's usability and scalability are compromised.
Understanding the definition of what a good KM is impacts your design in many ways. The amount of functionality to be provided, the way in which you handle instance creation, the number and types of parameters as well as the user interaction you provide are all impacted by this definition. So let's look at what a good definition might be.
Proactive availability alarms
The KM should certainly sound the klaxons when the application has gone down. But it should also predict downtime before it occurs, giving you a chance to fix the issue before locked-out users start calling you. Proactive detection of problems takes many forms. For example, degradation in performance is a warning symptom that the KM can measure. Similarly, space problems or an upward trend in space usage can be measured, so that application failure can be predicted and brought to the attention of the administrators before it happens.
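As a rough illustration of predicting a space problem from an upward trend, the sketch below fits a straight line to recent usage samples and estimates how long it will take to reach capacity. It is written in Python only to show the arithmetic, not as PSL. The function names, the (hour, used) sample format, and the 48-hour warning threshold are assumptions made for this example and are not part of PATROL.

```python
def hours_until_full(samples, capacity):
    """Estimate hours until usage reaches capacity.

    samples: list of (hour, used) pairs, oldest first.
    Returns None when there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    # Least-squares slope: growth in usage per hour.
    num = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        return None
    slope = num / den
    if slope <= 0:
        return None
    # Extrapolate from the most recent sample at the fitted growth rate.
    _, last_used = samples[-1]
    return max(0.0, (capacity - last_used) / slope)


def space_state(samples, capacity, warn_hours=48):
    """Return "WARN" when the projected fill time is within warn_hours."""
    remaining = hours_until_full(samples, capacity)
    if remaining is not None and remaining <= warn_hours:
        return "WARN"
    return "OK"
```

A parameter built on this idea would go into warning while the file system still has free space, which is the point of a proactive alarm.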
The measurement of the performance of an application is valuable both as an indicator of availability problems and as a tool for tuning the application, and thereby improving the operational productivity. Measuring the real performance of the application with the KM can take the guesswork out of tuning the overall performance of the application.
When the KM detects an issue, can it fix it? Your KM can fix it via what PATROL calls a "recovery action." It can also prompt the operator before performing the action. Alternatively, you might just want notification via pager, e-mail, or sending a trap or ticket to your help desk system.
Availability and performance metrics can be categorized as "monitoring" metrics. The other side of that coin is "management" or "administration" of the application. This refers to the user taking interactive action on the application.
In some cases it is desirable to add administrative features to the KM. It is possible to build a KM that is entirely administrative in function. Usually an administrative KM will be highly interactive and require extensive use of the PSL response function. A different form of administration KM might be "silent" and be comprised of automated actions (Recovery Actions) to automatically respond to agent-detected problems.
Should your KM design be of the administrative type, you might want to include some of these traditional administrative functions:
- Application startup and shutdown
- Configuration changes
- Performance tuning
- User administration
- Security changes
- Archiving or restoring data
Capacity planning is also called "the gentle art of convincing your boss you need more." Although not as critical for short-term operational stability, another important issue for long-term productivity is the measurement of the overall performance of the system. This involves metrics such as performance time and resource usage to determine when increased usage trends will require new hardware or other big-ticket items.
The following are some general guidelines for KM development:
- Enhance the application's functionality
- Try to be non-intrusive
- Beware of development effort
- Use common data sources
- Command line access
- Define a management API
Non-intrusive Application Management
The definition of non-intrusive application management is that it does not require application modification and uses existing interfaces. Most applications have some interfaces to collect information about them. Good KM design suggests that you use these interfaces where possible. The alternative is to modify the application to provide management information through SNMP or build a command line tool using the application's API. This is far more complicated and in many cases simply not possible.
Finding Data Sources
To find data sources, determine how your customers already manage their applications. You can often reuse their existing scripts or automate their existing actions and escalation routines.
There are numerous sources of data from which to collect management information in a non-intrusive fashion. Listed below are some of the most commonly used interfaces:
- Scripts - If you already have scripts written in shell, Perl, SQL, C, or any other language, you can reuse them.
- Processes - Is a system process up, zombied, or down? How much CPU and memory has it used?
- Sub-processes - If the application has a batch or an interactive command line executable for accessing the application, you can use the executable to get information or to send commands to the application.
- Files - The presence or absence of a file can have significance. The size of a database file can be a parameter. Archiving or removing a file can be a useful administration command.
- Log Files - Plain-text error or history files can reveal alert situations or provide historical perspective.
- SNMP MIB(s) - Get the data from the management information base (MIB) for the application. There are PSL functions specifically for accessing values via SNMP.
- Environment variables - The environment can contain useful information such as installation directories. In some cases an application will work without environment variables being set. You have to know if this is the case.
- Ports - Some applications that initiate network connections will be visible in netstat output.
- APIs - If the application offers a published API, such as C or C++, you can access the application by this method. Typically, you would use the API to build an executable that is launched by the KM as a sub-process.
- Performance Counters Registry (Windows NT only) - The Windows NT registry and performance counters often contain application information.
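To make the log-file source from the list above concrete, the following sketch reads only the text appended since the previous poll and returns the lines that match an error pattern. It is a Python illustration of the logic, for example as a helper script that a KM could launch, and is not PSL. The function name, the byte-offset bookkeeping, and the UTF-8 decoding are assumptions made for this example.

```python
import re


def scan_log(path, offset, pattern):
    """Return (new_offset, matching_lines) for text appended since offset."""
    regex = re.compile(pattern)
    with open(path, "rb") as handle:
        handle.seek(0, 2)          # jump to the end to learn the current size
        size = handle.tell()
        if size < offset:          # file was truncated or rotated: start over
            offset = 0
        handle.seek(offset)
        chunk = handle.read()
    # Consume complete lines only; a partial last line waits for the next poll.
    end = chunk.rfind(b"\n") + 1
    text = chunk[:end].decode("utf-8", errors="replace")
    matches = [line for line in text.splitlines() if regex.search(line)]
    return offset + end, matches
```

The caller keeps the returned offset between polls, so each collection cycle reports only new alert lines instead of rescanning the whole file.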
Extending Existing KMs
Another source of useful data is other PATROL KMs that are running on the same agent. If your application is ORACLE-based you can get data out of the ORACLE KM. However, there is a fine line between adding value to your KM and creating a dependency on another KM. Nevertheless, if there is a consistent dependency between your application and another application already monitored by a PATROL KM, then why not save yourself time?
Data available from other KMs can take a few forms. First, the discovery phase might benefit from the already discovered icons for the database. Rather than discover Oracle servers yourself, you can use PSL within the agent to get the ORACLE KM instance list through a PSL get on the internal instances variable such as:
inst_list = get("/ORACLE7/instances");
The values of measured parameters from other KMs are easily available within PSL in the agent via the PSL get() function. These values could be included in arithmetic to get a more application-specific performance metric or some other measurement.
If you use existing KMs, beware that the KM might change with later releases or that the data returned might change.
Also you have to degrade gracefully. If the data isn't there you need to run without it or fail discovery and tell the user that you are dependent on the other KM.
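The graceful-degradation rule can be sketched as a small decision function. This is a Python illustration of the logic, not PSL. A plain dictionary stands in for the agent's symbol table, and the function name, the return shape, and the newline-separated list format are assumptions made for this example.

```python
def discover_app(namespace, required_path="/ORACLE7/instances"):
    """Decide how discovery proceeds given another KM's instance list.

    namespace: dict standing in for the agent's symbol table (assumption).
    Returns (ok, instances, message).
    """
    raw = namespace.get(required_path, "")
    # The instance list is treated here as newline-separated names.
    instances = [name.strip() for name in raw.splitlines() if name.strip()]
    if not instances:
        message = ("No data found at %s; this KM depends on the other KM "
                   "being loaded" % required_path)
        return (False, [], message)
    return (True, instances, "")
```

Returning an explicit message lets the KM tell the user why discovery failed instead of silently creating no instances.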
Portable KM Design
An important advantage of a KM as a form of application monitoring and management is the ease with which the KM can be ported to a new hardware or software platform.
This advantage is achieved through the portable infrastructure layer created by the agent, KM, and PSL features of PATROL. However, PATROL cannot fully insulate the KM developer from the idiosyncrasies of the various platforms. Although all base PATROL functionality is portable, the boundary between PATROL and the outside world is the point at which PATROL loses control; and as a KM developer, you will need to know what to deal with. Two main areas can affect a KM:
- Operating system: Differences between the external world of Windows NT, UNIX, Linux, etc.
- Application: Differences in how the application runs in different contexts
Both of these areas have their own issues.
Portable agent functionality
The PATROL Agent is a virtual layer of abstraction above the operating system on which it runs. The KM executes within the agent's space rather than directly on the operating system. The agent insulates the KM from a number of portability issues. All the purely agent-related functionality within the agent is portable across all the various agent platforms. For example, the internal PSL syntax, operators, and control structures are portable.
Portability Issues at the Interface with the OS
Areas of non-portability arise at the interface with the operating system. The agent must gather information from the operating system, and this information is necessarily different on different agent platforms. Thus, issues such as files, processes, and spawned executables exhibit different behavior. Although PSL insulates against some of the porting issues, it is not possible to hide all of the differences, nor would that be desirable.
Portable Areas of PSL
The PSL language is portable for all the specific areas of functionality that occur fully within the PATROL space. Unless the PSL function has to go to the external operating system, all the functionality is identical across all operating system ports of the PATROL Agent. Some of the specific areas where portability is ensured are as follows:
- Control Flow - Basic statements such as if, else, while, foreach, and switch
- Operators - mathematical, string, logical, relational, conditional, and ternary
- String functions - includes nthline and other string functions, list/set functions, and sorting
- Agent symbol table - variables controlled via PSL get, set, and other PSL functions
- KM objects - fully portable instances, parameters, recovery
- Inter-PSL actions - PSL locks, condition variables, and other synchronization primitives
Non-Portable Areas of PSL
The areas where PSL has to access the operating system for information are the causes of non-portable functionality. Although the functionality is generally the same, the results of these operations depend completely on the external environment. Some of these areas are as follows:
- File access functions: file, cat, fopen, fseek, ftell, read, readln, write, and close
- Process table analysis functions: process, proc_exists
- Child process launching: system, execute, popen, read, readln, write, close, and the exit_status special variable
- Environment variables: getenv
- Special areas: PSL internal function
- SNMP functions: snmp_get, snmp_set, and snmp_walk
Strictly speaking, the functions themselves are portable, because they behave exactly the same way on all platforms. However, the arguments you have to supply to make them work can differ significantly from one platform to another.
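One common way to contain platform-specific arguments is to keep them in a single lookup table, so that the rest of the logic stays identical everywhere. The sketch below shows the idea in Python, not PSL. The table contents and the function name are assumptions made for this example, and the commands listed are only typical choices for each platform.

```python
import platform

# Hypothetical table: the same "list processes" step needs different
# command text on each platform.
PROCESS_LIST_COMMANDS = {
    "Windows": "tasklist",
    "Linux": "ps -ef",
    "SunOS": "ps -ef",
}


def process_list_command(os_name=None):
    """Return the process-listing command for os_name (default: this host)."""
    os_name = os_name or platform.system()
    try:
        return PROCESS_LIST_COMMANDS[os_name]
    except KeyError:
        # Fail loudly rather than run a command that means something else.
        raise ValueError("no process-list command known for %r" % os_name)
```

Porting to a new platform then means adding one table entry rather than editing every place that launches a command.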
Options to Avoid in KM Development
Just because a feature exists does not mean you must use it. For example, using a response function in discovery is possible, but it could result in your KM never instantiating because no user was at the console to provide the input the script needs to continue.
The PATROL development console has many options to select from, and sometimes it can be difficult to know what to do with them all.
Some of the options you should try to avoid using in your KM include the following:
- State Change Actions: These are console-centric actions and therefore are limited to OS commands. They are not very powerful and generally are left alone.
- Setup Commands: Ignore setup commands. Because these commands run once at the start of the agent, they are not useful in a KM. You can get the same "once only" execution through prediscovery by doing the once-only action just before converting prediscovery to discovery.
- Self-Polling Parameters: Avoid the button in the scheduling dialog box for parameters.
- Interactive Tasks: All the tasks have a "Task Is Interactive" button. Generally, it must be left alone.
- History Level: History level is best left as inherited.
- In_transition: An old PSL function that no longer serves a purpose. When first called, a transition timer is started for <timeout> seconds and the value 1 is returned. While this timer is running, subsequent calls also return 1. After the timer expires, calls to in_transition() continue to return 1 until the next full discovery cycle, after which the function finally returns 0 and resets the timer. The timer is also reset by calls to change_state().
- Change_state: A function with many side effects. If you call change_state() on an instance that has parameters, execution of all those parameters is automatically suspended. It is better to let one of the parameters under the instance go into alarm, because this also allows you to add recovery actions if needed.
- Simple Discovery: This may be useful for entry-level KMs, but it lacks the flexibility you need if you really want to be in control. You can achieve the same functionality as simple discovery by writing some simple PSL scripts. Because of the lack of control and functionality, we will not spend any time on Simple Discovery in this book.
- Menu Command %%{...} macro: This macro asks for user input and substitutes the result into the menu command text before sending it to the agent. Because this is a literal replacement, improper input by the operator can result in compiler errors.
- Statically linked libraries: It is possible to statically link libraries. This feature has not been used at all and is there for historical reasons.
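The risk of literal substitution noted for the menu command macro in the list above can be demonstrated with any compiled language. The sketch below uses Python, not PSL: operator input is pasted into source text, and a stray quote makes the result fail to compile. The placeholder string, the template, and the whitelist in is_safe() are assumptions made for this example and do not reflect the console's actual macro handling.

```python
import re

PLACEHOLDER = "<INPUT>"   # hypothetical stand-in for the console macro


def substitute(template, value):
    """Literal text replacement, with no quoting or escaping."""
    return template.replace(PLACEHOLDER, value)


def compiles(source):
    """Report whether the substituted text is still valid source code."""
    try:
        compile(source, "<menu command>", "exec")
        return True
    except SyntaxError:
        return False


def is_safe(value):
    """Accept only characters that cannot end the quoted string early."""
    return re.fullmatch(r"[A-Za-z0-9_.-]+", value) is not None


template = 'target = "<INPUT>"'
good = substitute(template, "db01")    # target = "db01"
bad = substitute(template, 'db"x')     # target = "db"x"  (broken quoting)
```

Validating the input before it is used, as is_safe() does here, is one way to avoid handing the compiler malformed text.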
KM Tracing and Logging
If you develop a KM you want to be sure that you can remotely diagnose problems with the KM. Therefore it is a good idea to think about debugging and tracing early in the development process.
There are a lot of options to turn on debugging (usually via an administration response() function). The output can be written to a file or to the system output window.
If you write the output to a file, do not write it to the agent's error log. The agent error log is exactly that, the agent's error log, and not a KM error log.
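A minimal sketch of this advice, in Python rather than PSL: trace output is switched on by a flag and appended to a file that belongs to the KM, not to the agent. The class name, the timestamp format, and the idea of passing the file path in are assumptions made for this example.

```python
import time


class KmTrace:
    """Write trace lines to a KM-specific file when tracing is enabled."""

    def __init__(self, path, enabled=False):
        self.path = path          # KM's own trace file, not the agent log
        self.enabled = enabled    # toggled by an administrator

    def trace(self, message):
        """Append one timestamped line; return True if it was written."""
        if not self.enabled:
            return False
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        # Append so earlier trace output survives across collection cycles.
        with open(self.path, "a", encoding="utf-8") as handle:
            handle.write("%s %s\n" % (stamp, message))
        return True
```

Because tracing is off by default, the KM pays almost nothing for the trace calls until someone needs to diagnose a problem remotely.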