Capacity pools
Overview
Capacity pools are logical groupings of homogeneous infrastructure resources. They aggregate the capacity data coming from different entities in the IT infrastructure and can be organized, for example, by business importance, customer, owner, or geography.
The Capacity Pools view displays three key indicators for each capacity pool: overall capacity usage, efficiency, and risk. From these key indicators, you can navigate to more infrastructure-oriented views. As a system administrator, you can use the Capacity Pools view to rank all capacity pools available in the workspace by their Usage, Risk, and Efficiency scores.
Capacity goals: Usage, Risk, and Efficiency
A capacity manager who is responsible for a large number of servers, of any technology, works toward the following daily goals:
- Keep track of Capacity Usage: Keep track of which parts of your infrastructure are highly used and which ones have spare resources. This is the most basic activity and is also called capacity visibility. It lets you take ownership of all IT resources so that you can redeploy an old server for a development project, or create a subnet by using spare network ports that are not being used.
- Keep Capacity Risk low at all times: Capacity Risk is measured in terms of resource saturation, either actual or impending. Measuring saturation is not just about overall utilization exceeding resources or trending in an uncomfortable direction, but also about detecting resource contention where preconfigured limits are being exceeded, or where one workload is crowding out others. You should flag any capacity risk at any level, and map it back to the business workloads that are affected before the risk is realized.
- Keep Efficiency high: The Capacity Pools view enables you to maintain an efficient environment, regardless of the kind of capacity pool (production, test, QA) it represents. Depending on the business requirements, SLAs, and how cautious you want to be, you can set an appropriate expectation for efficiency and set aside some headroom appropriate to the business need. If the headroom is exceeded, you want to know. The Capacity Pools view helps you avoid wasting spare capacity, whether that is expensive disk storage or high-memory servers sitting idle. A periodic check for typical areas of waste, such as snapshot policies that are no longer necessary, virtual machines that are idle while occupying precious disk space, and servers that are no longer being used, can help you maximize the efficiency of your infrastructure.
BMC Helix Continuous Optimization helps you achieve these goals simultaneously and lets you communicate these three indicators (usage, risk, and efficiency) to business owners. Your management might not be familiar with the finer points of memory ballooning, but they understand usage, risk, and efficiency indicators if you can relate them directly to the business applications. BMC Helix Continuous Optimization provides the tools needed to report these high-level indicators in a business-aware manner, and at the same time lets you drill down into the root causes so you can understand and influence the indicators.
Defining Capacity Pools
You organize your infrastructure into silos or segments in a way that aligns with your business purpose. These segments are called capacity pools. For example, a set of VMware clusters that are supporting production usage for XYZ projects in Toronto could be grouped together into a capacity pool called “XYZ – Toronto.” Once you define a capacity pool, BMC Helix Continuous Optimization automatically analyzes the right metrics on a daily basis and creates the high-level indicators, along with a structure of formulas and lower-level elements (clusters, frames) into which you can drill down.
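As a rough illustration of the grouping idea, the following Python sketch assigns clusters to capacity pools by business attributes. The cluster names, the `project` and `site` fields, and the `group_into_pools` helper are hypothetical and are not part of the product; in BMC Helix Continuous Optimization you define capacity pools in the product itself.

```python
from collections import defaultdict

# Hypothetical inventory: each cluster is tagged with the business
# attributes that decide which capacity pool it belongs to.
clusters = [
    {"name": "vmw-clu-01", "project": "XYZ", "site": "Toronto"},
    {"name": "vmw-clu-02", "project": "XYZ", "site": "Toronto"},
    {"name": "vmw-clu-03", "project": "ABC", "site": "Boston"},
]


def group_into_pools(items):
    """Group clusters into pools named '<project> - <site>' (assumed naming)."""
    pools = defaultdict(list)
    for item in items:
        pool_name = f"{item['project']} - {item['site']}"
        pools[pool_name].append(item["name"])
    return dict(pools)


pools = group_into_pools(clusters)
for name, members in pools.items():
    print(name, members)
```

Because capacity pools are technology-specific, a real grouping would also keep clusters of different technologies in separate pools.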
Usage, Risk, and Efficiency indicators
The capacity pools you define are technology-specific. Make sure that they are related to business use.
- Technology specific: The analysis of usage, risk, and efficiency necessarily depends on the exact behavior of the Operating System, hypervisor, and other layers. BMC Helix Continuous Optimization contains out-of-the-box knowledge on many technology platforms, such as VMware vSphere, IBM AIX on PowerVM, and KVM.
- Related to business use: It is common for multiple infrastructure silos, or capacity pools, to support a single business purpose. In such a case, create a separate capacity pool for each technology involved. Similarly, a single ‘technical service’ such as a web back-end server farm may support multiple applications. In this case, it might be more appropriate to create one separate capacity pool for the web server farm, and one for other parts of an application. The key factor to consider here is whether you want to report on the usage, risk, and efficiency of this capacity pool.
BMC Helix Continuous Optimization produces indicators on Usage, Risk, and Efficiency, for each capacity pool that you have defined:
- A Usage indicator measures the level of usage per capacity pool so that you can track which capacity pools have excess capacity available.
- A Risk indicator counts any indication that CPU, memory, or disk is either already saturated or will soon be, or that workloads are experiencing capacity contention.
- A Capacity Efficiency indicator counts resources that are being wasted so that efficiency could be improved after a detailed analysis.
For a given technology (for example, VMware vSphere or IBM Power series), specific methods are used for gathering, summarizing, and analyzing specific metrics to find such risk or efficiency conditions. BMC Helix Continuous Optimization automates this computation, using the metrics that are already being collected to produce these indicators for a large number of servers. The result of this automated process is saved in the database and presented in the Capacity Pools View.
Capacity Pool Indicators: Capacity Pools View
The Usage, Risk, and Efficiency indicators for each capacity pool are displayed in a graphical view called a Capacity Pools View.
Figure 1: Capacity Pools View
The capacity risk indicator is a score value ranging from 1 through 100, higher being worse. The score is computed by adding risks based on CPU, memory, and disk storage risk scores for the capacity pool. These three components of the Risk score are computed as follows:
- The CPU risk score is the worst (maximum) CPU risk score of the contained elements in the capacity pool. For example, if the capacity pool contains 10 VMware clusters, then the highest CPU risk score among these 10 clusters is used as the capacity pool’s CPU risk score.
- Similarly, the memory and disk risk scores are the maximum corresponding risk scores of the contained elements.
- For each contained element, technology-specific criteria for impending and current saturation are used to compute each of these three risk scores.
- For example, for VMware clusters, the CPU days to saturation indicator is evaluated. If the CPU days to saturation indicator shows that the cluster CPU is projected to exceed its threshold within a certain time horizon (both the threshold and the horizon are user-configurable), then that is one criterion for impending saturation risk.
- Another criterion for VMware clusters is the number of times the CPU ready time metric has exceeded a threshold within the past thirty days. This criterion is for current saturation risk.
- For each such criterion, depending on its severity, a penalty is added to the CPU risk score of the cluster.
- Similarly, for each container, technology-specific criteria for impending and current memory and disk saturation are evaluated and used to add a penalty for the memory or disk risk score, respectively, for the container.
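The roll-up described in the list above can be sketched in Python as follows. This is a minimal illustration and not the product's implementation: the `ElementRisk` class, the penalty values, and the cap of 100 on a per-resource score are assumptions made for the example. How the three per-resource scores are combined into the overall Risk score is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class ElementRisk:
    """Per-resource risk scores for one contained element, such as a cluster."""
    name: str
    cpu: int
    memory: int
    disk: int


def resource_risk(penalties, cap=100):
    """Add the penalties of all triggered criteria (the cap is an assumption)."""
    return min(cap, sum(penalties))


def pool_risk(elements):
    """Roll up to the pool: the worst (maximum) score per resource."""
    return {
        "cpu": max(e.cpu for e in elements),
        "memory": max(e.memory for e in elements),
        "disk": max(e.disk for e in elements),
    }


# Hypothetical penalties: cluster-a triggers two CPU criteria.
clusters = [
    ElementRisk("cluster-a", cpu=resource_risk([60, 50]), memory=10, disk=0),
    ElementRisk("cluster-b", cpu=20, memory=35, disk=5),
]
print(pool_risk(clusters))
```

In this example the pool's CPU risk comes entirely from cluster-a, which is the kind of information the drill-down views expose.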
The above computations are performed for all capacity pools in a batch once a day, based on the data in the database. The Capacity Pools View displays the capacity risk score, color-coding it appropriately.
When a capacity pool’s risk score is high, drill down to see which resources (CPU, memory, or disk) are showing a high risk. In addition, drill down to the individual containers (for example, VMware clusters) to see which ones are causing the high-risk score. For each container, see the detailed breakdown of each criterion contributing to the high score.
A similar breakdown into CPU, memory and disk space is made for the Efficiency score and the Usage score.
In the Capacity Pools View, interacting with any of these indicators can provide hints on which kinds of quantities are counted (Figure 2), and details on the three components of the score: CPU, memory, and storage (Figure 3).
Figure 2: Tooltip explaining the Efficiency formula
Figure 3: Detail window breaking down Usage indicator into CPU, Memory, and Disk
In the example shown in Figure 3, the Usage score of 34 is shown to be composed of three separate Usage numbers for CPU, Memory, and Disk, respectively. This allows you to investigate and explain where these numbers are coming from.
You can also drill down to a separate detail page, where all the contained elements in the capacity pool are listed, and each element’s contribution to the scores can be seen. For example, Figure 4 shows a capacity pool with a Risk score of 100.
Figure 4: Detail page of a capacity pool
Figure 4 shows the detail page for the capacity pool with the Risk score of 100. The Contained Elements table shows all the clusters in the capacity pool with their individual Usage, Risk, and Efficiency scores, categorized by CPU, Memory, and Storage. Thus, for each cluster row in the table, you see a total of 12 individual scores, including the totals for Usage, Risk, and Efficiency.
In Figure 4, you can quickly see that the high-risk score is caused by a high CPU risk score in two clusters, the first two in the table. This immediately narrows down the investigation to these two clusters, and in particular to their CPU-related metrics.
The top of the detail page also shows aggregated configuration information for the capacity pool (number of CPUs) and a thirty-day time chart of a few consolidated metrics. These pieces of information can reveal at a glance whether there have been significant recent changes to this capacity pool.
In the table of Contained Elements, each cluster name can be used to drill down to another view showing details for that cluster. This drill-down could be used to examine any given cluster in detail. But in this example, there is more than one cluster with a high-risk score.
Figure 5: Risk details for all clusters in a Capacity Pool
The risk details page shows the components of the risk indicator for each cluster, down to the individual metric-based indicators (CPU pressure index, days to saturation) that contribute to it. You can see that the same two clusters that have the high risk scores are showing two metrics as high risk:
- CPU Days to Saturation shows as Saturated. The CPU Days to Saturation is not a raw metric, but a calculation over the past 30 days, which uses a linear trend to estimate how far in the future a given resource will breach a preset threshold. In this case, the two clusters have already breached the threshold.
- CPU Pressure Index shows as 100. The CPU Pressure Index is also not a raw metric, but a calculation that combines two metrics: CPU utilization and CPU ready time. CPU ready time is ordinarily used by system admins to detect CPU contention in the cluster. BMC Helix Continuous Optimization starts with the CPU utilization value and adds penalty points for any indications of CPU contention seen by counting how many times CPU ready time has exceeded a threshold within the recent past. This resulting number is weighted and scaled to remain within the 0-100 range.
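The Days to Saturation idea described above can be sketched in Python. This is a minimal illustration that assumes an ordinary least-squares line fitted to one value per day; the product's exact trend method, thresholds, and handling of edge cases are not documented here, and the `days_to_saturation` function and its return conventions are hypothetical.

```python
def days_to_saturation(daily_values, threshold):
    """Project when a linear trend through daily values reaches the threshold.

    Returns 0.0 if the latest value is already at or above the threshold,
    None if the trend is flat or falling, and otherwise the projected
    number of days counted from the last sample.
    """
    n = len(daily_values)
    if n < 2:
        raise ValueError("need at least two daily samples")
    if daily_values[-1] >= threshold:
        return 0.0  # already saturated

    # Least-squares fit of value = intercept + slope * day_index.
    mean_x = (n - 1) / 2
    mean_y = sum(daily_values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_values))
    slope = sxy / sxx
    if slope <= 0:
        return None  # no projected saturation

    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - (n - 1))


# Hypothetical 30 days of CPU utilization rising by one point per day.
history = [50 + day for day in range(30)]
print(days_to_saturation(history, threshold=90))
```

A result of `0.0` corresponds to the Saturated state described above, and `None` means that the trend never reaches the threshold.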
In Figure 5, you also see similar metrics that contribute to the risk scores for memory and storage. For example, the Memory Pressure Index is a derived score based on the following metrics reported by VMware: consumed memory, active memory, ballooning, and host swapping.
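The CPU Pressure Index calculation described above (CPU utilization plus penalty points for CPU ready time breaches) can be sketched as follows. The penalty size, the ready time threshold, and the use of simple clamping in place of the product's weighting and scaling are all assumptions made for illustration.

```python
def cpu_pressure_index(cpu_util_pct, ready_time_samples, ready_threshold,
                       penalty_per_breach=5.0):
    """Start from CPU utilization and add a penalty per ready time breach.

    penalty_per_breach and the clamping to the 0-100 range are assumptions;
    the product applies its own weighting and scaling.
    """
    breaches = sum(1 for sample in ready_time_samples if sample > ready_threshold)
    raw_score = cpu_util_pct + penalty_per_breach * breaches
    return max(0.0, min(100.0, raw_score))


# Hypothetical samples: two of the four ready time values exceed the threshold.
print(cpu_pressure_index(60.0, [1.0, 6.0, 7.0, 2.0], ready_threshold=5.0))
```

With these inputs, a cluster at 60 percent utilization and two breaches scores 70, while a heavily contended cluster is clamped at 100.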