Best practices for designing a Kubernetes cluster for BMC Helix Service Management and BMC Helix IT Operations Management
Review these best practices when designing a Kubernetes cluster for BMC Helix Service Management and BMC Helix IT Operations Management to ensure that you meet the infrastructure requirements.
Managing cluster sizing
Designing the configuration of cluster worker nodes is a dynamic process, and there is no single optimal design. Instead, multiple viable configurations often exist, depending on the specific needs of the environment.
While the requirements of the applications are a primary consideration, you must also consider the following factors:
- The specific Kubernetes distribution being used
- Resource demands of other co-located application workloads
- System-level resource usage outside of Kubernetes, such as operating system services and monitoring agents
- The variety and availability of hardware within the organization
Although worker nodes may vary in capacity, standardizing their specifications, such as CPU, memory, and storage, can simplify cluster management, capacity planning, and scaling strategies.
To evaluate cluster requirements
Use the following steps to determine an acceptable cluster shape:
- Determine the application requirements for the cluster.
Gather the application requirements along with any other resource needs expected to run on the cluster.
- Determine operational excellence standards to calculate additional headroom in the cluster.
- Operations typically require an additional 20% in resources, though this figure may vary across different organizations. The extra resources provide Kubernetes with the capacity to address scheduling inefficiencies and prevent worker nodes from being pushed beyond their maximum capacity.
- When this maximum limit is exceeded, cluster alerts are triggered, indicating the need to monitor the health of a worker node.
- Assess the landscape of large pods.
- Identify the minimum size of a worker node. The minimum size of a worker node depends on the largest pod resource requirements. A large pod is defined as one that consumes a significant portion of a node's resources.
- Sort the Service Management pod specifications spreadsheet in Sizing and scalability considerations by memory in descending order to find the highest memory requirement, and again by CPU to identify the maximum CPU requirement.
- Establish vertical worker node requirements (maximum capacity)
- Make sure that worker nodes can handle the largest pod with at least double the resources.
This approach provides a baseline for the minimum resources needed for each worker node.
- Be aware that DaemonSets in the cluster reduce available resources, effectively decreasing the capacity of all worker nodes.
Take this point into account in the final assessment of minimum node capacities.
- Additionally, Kubernetes administrators should review the cluster kube-reserved and system-reserved values.
- Establish horizontal worker node requirements (minimum number of worker nodes)
- Determine the minimum number of worker nodes required for the successful scheduling of all large pods.
Add all the required large pod replicas, and then divide the total by the number of large pods that each worker node is anticipated to handle.
- Make sure that the cluster has a minimum number of worker nodes in accordance with high availability design requirements.
See Sizing and scalability considerations in BMC Helix Service Management documentation and Sizing and scalability considerations in BMC Helix IT Operations Management documentation.
- Use the requirement that necessitates the most worker nodes as the guiding factor.
- Perform the following steps to conduct an initial cluster evaluation:
- Divide the required CPU and memory determined in step 2 by the number of worker nodes determined in step 5.
- Check whether the resulting per-node CPU and memory meet the baseline requirements established in step 4.
- If the worker nodes do not meet the requirements set in step 4, adjust those requirements and evaluate again. Consider increasing the overhead requirements from step 2 to facilitate the scheduling of additional worker nodes within the current minimum.
- Adjust worker node size or count.
Make any necessary adjustments to the size or number of worker nodes to optimize cost or performance metrics while maintaining the established constraints.
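The steps above can be sketched as a quick calculation. All input figures below are illustrative assumptions, not published BMC requirements; substitute the totals from the pod specification spreadsheets and your own candidate node size.

```python
import math

# Illustrative inputs -- these numbers are assumptions for the sketch,
# not BMC-published values.
app_cpu, app_mem_gb = 60.0, 240.0                # step 1: total application requirements
overhead = 0.20                                  # step 2: operational headroom (~20%)
largest_pod_cpu, largest_pod_mem_gb = 4.0, 16.0  # step 3: largest pod

# Step 2: total cluster requirement including headroom.
total_cpu = app_cpu * (1 + overhead)
total_mem = app_mem_gb * (1 + overhead)

# Step 4: vertical requirement -- each node should fit the largest pod
# with at least double its resources.
min_node_cpu = 2 * largest_pod_cpu       # 8 cores
min_node_mem = 2 * largest_pod_mem_gb    # 32 GB

# Step 5: horizontal requirement -- pick a candidate node size and
# compute how many nodes the totals demand.
node_cpu, node_mem = 16.0, 64.0          # candidate worker node size
nodes_needed = max(math.ceil(total_cpu / node_cpu),
                   math.ceil(total_mem / node_mem))

# Step 6: the per-node share of the total load must still fit within
# the candidate node size and exceed the step 4 baseline.
per_node_cpu = total_cpu / nodes_needed
per_node_mem = total_mem / nodes_needed
print(nodes_needed, per_node_cpu, per_node_mem)
```

If the per-node figures fall below the step 4 baseline or above the candidate node size, adjust the node size or overhead and re-evaluate, as described in the final step above.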
Implementing redundancy strategies
After you determine the sizing requirements for a cluster, consider additional redundancy planning. A cluster will inevitably experience the loss of a worker node, whether that loss is planned, such as for maintenance, or unplanned, such as a hardware failure. To manage this risk, you must establish a tolerance for loss. This tolerance refers to the number of nodes that can be offline before the cluster operates below its resource requirements.
To prevent resource loss in the cluster, add extra worker nodes of the largest size, matching the number of nodes you want to tolerate losing. For example, a cluster with 12 worker nodes that only requires 10 can tolerate the loss of two nodes before performance is affected.
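The tolerance rule can be expressed as a minimal sketch, using the example figures from the paragraph above:

```python
# N+k redundancy check: a cluster that requires `required` worker nodes
# and provisions `provisioned` worker nodes can tolerate losing
# (provisioned - required) nodes before it runs below its requirements.
def loss_tolerance(provisioned: int, required: int) -> int:
    return max(provisioned - required, 0)

# Example from the text: 12 worker nodes provisioned, only 10 required.
print(loss_tolerance(12, 10))  # 2 nodes can be lost before performance is affected
```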
When a cluster operates below its resource requirements, it can lead to degraded performance and potentially result in application outages. BMC Helix applications will attempt to avoid outages in such scenarios; however, depending on the system load, performance might be reduced, and continued loss of nodes may ultimately lead to an application outage.
Using larger worker nodes can simplify scheduling and maintenance for cluster owners. On the other hand, smaller worker nodes may make redundancy planning for loss more cost-effective.
Choosing the right disaster recovery model
Disaster recovery can be Active-Active or Active-Passive. BMC Helix supports only Active-Passive due to limitations in its data service components.
Achieving a 15-minute Recovery Time Objective (RTO) requires a Kubernetes stretch cluster, which involves high complexity and strict operational demands.
Stretch clusters (highly available Kubernetes)
The best solution for disaster recovery is a Kubernetes stretch cluster, which is particularly effective when built across multiple low-latency failure zones. These zones are usually separate data centers, allowing for redundancy and failover capabilities. The number of zones matters: in a two-data-center setup, if one data center fails, the cluster can become inoperable because a majority of control plane nodes go offline.
Stretch clusters redistribute containers to unaffected zones during a failure, helping to minimize downtime for applications, as long as the remaining zones have sufficient resources. To function properly, stretch clusters must maintain a quorum, meaning a majority of control plane nodes must be operational. For instance, a three-zone stretch cluster can tolerate the loss of one zone, while a five-zone cluster can tolerate the loss of two.
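The quorum arithmetic is simple enough to sketch; the zone counts below are the examples from this section:

```python
# Quorum rule: a majority of control plane zones must stay online,
# so an n-zone stretch cluster tolerates floor((n - 1) / 2) zone failures.
def zone_loss_tolerance(zones: int) -> int:
    return (zones - 1) // 2

print(zone_loss_tolerance(3))  # 1 -- three zones tolerate one zone loss
print(zone_loss_tolerance(5))  # 2 -- five zones tolerate two zone losses
print(zone_loss_tolerance(2))  # 0 -- two zones cannot keep a majority after a loss
```

This is why the two-data-center setup described above fails: losing one of two zones leaves no majority.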
Implementing stretch clusters can be complex and requires specialized internal expertise. The Kubernetes community maintains the Kubespray project to aid in setting up such an architecture.
Embedded disaster recovery
BMC Helix Service Management and BMC Helix IT Operations Management come with support for an embedded disaster recovery process, which you can replace with a bring-your-own solution from another vendor. It is an Active-Passive solution suitable for clusters that cannot meet the low-latency criteria of a highly available stretch cluster.
Embedded disaster recovery works by keeping snapshot data of all the data service components at a given interval. This data can also serve backup and restore needs in the event of a total loss of data. The snapshots can be replicated to another cluster that is running a mirror of the Helix application with embedded disaster recovery enabled. These replicated snapshots can then be restored to that cluster in the event of a disaster requiring application recovery. After a successful failover and recovery of the original cluster, data is replicated back to the original data center, enabling failover back to the original cluster.
Where to go from here
Sizing and scalability considerations in BMC Helix Service Management documentation
Sizing and scalability considerations in BMC Helix IT Operations Management documentation