High-availability deployment and best practices for Infrastructure Management

Consult the following topics for information and recommendations on how to deploy and configure Infrastructure Management components to achieve high availability (HA):

TrueSight Infrastructure Management Server HA

HA for the TrueSight Infrastructure Management Server is supported through application-level HA and also through operating system clustering.

Application-level HA

You can configure application-level HA only if you are using an Oracle database. For more information, see Considerations for a high-availability deployment of Infrastructure Management.

Operating system clustering

You can configure operating system clustering if you are using the embedded SAP SQL Anywhere database. The two servers in the cluster must be configured with shared storage between the nodes. See Installing Infrastructure Management in high availability cluster mode. BMC recommends leveraging a high-speed SAN for storage.

Data collection Integration Services HA

The Integration Service is stateless, which allows the BMC PATROL Agent to automatically send performance data and events to another Integration Service if the primary instance is not available. Because no monitoring-related configuration exists at the Integration Service instances, there is none to maintain there. Additionally, there is no association between Integration Service instances and specific PATROL Agents for administrators to maintain or otherwise manage at the Integration Service nodes.

Infrastructure Management enables you to cluster Integration Service nodes. These cluster configurations are simple software settings referenced in policies, and the configuration settings for a cluster are stored as a cluster in Central Monitoring Administration. The cluster configurations contain connectivity information in the form of PATROL Agent variables that instruct the agents how to connect to the first, second, third, and fourth Integration Service nodes grouped in the cluster. There is no built-in load balancing with these cluster configurations; however, all Integration Service instances support active/active HA.

You can include up to four Integration Service nodes in a single cluster. BMC recommends referencing clusters in staging policies only. 

PATROL Agents attempt to connect to the Integration Services in the cluster in the order that they are listed. When an agent loses its connection to the first Integration Service instance, it automatically connects to the second instance in the list. When the first Integration Service becomes available again, the agent does not automatically connect back to it; it remains connected to the instance to which it is currently and successfully connected.
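The ordered failover list that a cluster resolves to can be illustrated with a PATROL Agent pconfig change fragment. The variable shown is the standard Integration Service connection variable; the host names below are placeholders for illustration, and 3183 is the default Integration Service listening port:

```
PATROL_CONFIG
"/AgentSetup/integration/integrationServices" = { REPLACE = "tcp:is-node1:3183,tcp:is-node2:3183,tcp:is-node3:3183,tcp:is-node4:3183" }
```

The agent attempts is-node1 first and walks down the list only on connection failure, which matches the failover behavior described above.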

Multiple Integration Service instances can run behind a load balancer. A third-party load balancer placed between the PATROL Agents and the Integration Services supports full active/active HA fault tolerance and true load balancing of event and performance data across multiple Integration Service processes running on different hosts. In large environments, BMC recommends leveraging load balancers as a best practice; this is a recommendation, not a requirement. Load balancing ensures that the Integration Service tier is not overloaded when an event storm occurs, or when an interruption in communication between the agents and the Integration Service nodes causes a flood of cached data to be sent to the Infrastructure Management Servers through the Integration Service nodes.

Best practice

Consider high availability (HA) as part of the Integration Service node deployment.

If you plan to deploy the Integration Service on a VMware virtual machine (VM), you can utilize VMware HA. Utilizing VMware HA simplifies administration because it is transparent to the PATROL Agents (the connections for both the performance metrics and events automatically reconnect when the VM is restarted). For further information, see Considerations for deploying Infrastructure Management on a VM.

Staging Integration Service HA

The staging Integration Service in the preceding diagram is not shown in a cluster, and it is not included in the cluster configuration within the product. However, you can configure staging Integration Service nodes for redundancy. You can do this by setting up multiple staging Integration Service nodes and designating their connectivity information in a comma-separated list for the PATROL Agent Integration Service configuration variable. 
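As a sketch of the redundant staging setup described above (the staging host names are placeholders), the connectivity information is the same comma-separated list in the PATROL Agent Integration Service configuration variable, applied through the agent installation package or a pconfig change:

```
PATROL_CONFIG
"/AgentSetup/integration/integrationServices" = { REPLACE = "tcp:staging-is1:3183,tcp:staging-is2:3183" }
```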

An agent installation package or a single policy must never contain configuration for multiple staging Integration Service nodes that are associated with different Infrastructure Management Servers.

Event management cell HA

HA for the event management cells is provided through a built-in primary/secondary configuration as an active and hot-standby cell pair. Event sources such as Integration Services are configured to send events first to the primary cell. If the primary cell is not available, the event source sends events to the secondary cell. The cells automatically synchronize live event data so that events are kept in sync between the two cells. The secondary cell is configured and operates as a "hot standby" cell.

The primary and secondary cells monitor each other. During a failover, the secondary cell detects that the primary cell is not available and it takes over the event processing functionality. When the secondary cell detects that the primary cell has become available, it synchronizes events with the primary cell and switches back to standby mode. The primary cell then continues the event processing and synchronization with the secondary cell.

For information on setting up cell HA, see the setup instructions for Windows or Linux. To apply feature packs or fix packs on remote HA cells, see Applying feature packs or fix packs on remote cells in HA mode.

Best practices

  • It is critical that you set up the primary and secondary cells with the same knowledge base configuration. The synchronization process synchronizes only event data and dynamic data in data classes, including the custom data classes you create; events, dynamic data, and any updates or deletions to either are synchronized both ways. It does not synchronize configuration data in the knowledge base flat files.
  • The synchronization of knowledge base configuration flat files must be manually managed or automated with custom scripts or other methods.
  • Never set up event propagation so that events only propagate to the primary or secondary cell. Always leverage a multiple host definition (primary/secondary) for the destination configuration of the HA cell pair in the mcell.dir and other cell configuration files.
  • Do not configure cell HA manually. Use the cell CLI for setting up or configuring cell HA.
  • Use the same cell name for the primary and secondary cells.
  • If cell HA is monitored to detect conditions in which the primary and secondary cells are no longer synchronized, automatic failback is acceptable. However, if it is not monitored, automatic failback must be disabled. Synchronization failures can cause the primary and secondary cells to drift far out of sync, creating a scenario where you might fail over to a cell whose data tables and event information are out of date. Failing back then writes that out-of-sync state back to the primary cell, causing an irretrievable loss of information. Additionally, if the network connection between the primary and secondary cells is lost, both systems might detect that the other host is down and become the active cell, which can also lead to inconsistent data and events. As a best practice, avoid automatic switchback except under the circumstances previously indicated.
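The multiple-host destination recommended for the HA cell pair can be sketched as a single mcell.dir entry. The cell name, encryption key, and host names below are placeholders for illustration, and 1828 is the default cell port:

```
# One entry for the HA pair: one cell name, primary host listed first
cell  pncell_prod  mc  cell-primary:1828 cell-secondary:1828
```

Because both hosts appear under the same cell name, event sources and propagation rules that reference pncell_prod automatically try the secondary host when the primary is unavailable.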

PATROL Agent HA

PATROL Agents that run on the managed node that they monitor generally do not require HA. However, PATROL Agents that monitor large domain sources, such as VMware vSphere, or that perform remote operating system monitoring, require HA configurations in most environments. HA for the PATROL Agent is supported through operating system clustering or other third-party solutions such as VMware HA.

Best practice

Consider whether high availability (HA) is needed for the PATROL Agents used for collection. If the agent is performing local collection on a host that provides some service, the agent might already be part of the host-level HA setup for the service or application on that host. However, if the PATROL Agent is performing remote collection, that agent must be configured for HA.

Tip: If the agent is running on a VMware virtual machine (VM), VMware HA is a recommended option.

SAP SQL Anywhere HA

The SAP SQL Anywhere database is embedded and installed with the BMC TrueSight Infrastructure Management Server. If you use the out-of-the-box embedded SAP SQL Anywhere database, HA for the database is supported as part of the file system replication on a shared storage disk for the BMC TrueSight Infrastructure Management Server. For more information, see Installing the Infrastructure Management Server in HA OS cluster mode on Windows.

Oracle HA

HA for the Oracle database is supported through a third-party database availability management solution, and is best supported using Oracle RAC. For more information, see Installing the Infrastructure Management Server on Microsoft Windows with Oracle and the Oracle database documentation at www.oracle.com.

Related topics

Clustering best practices

Installing the Infrastructure Management Server in HA OS cluster mode on Windows

Upgrading the Infrastructure Management components in high availability cluster mode

Configuring Infrastructure Management Server in a cluster
