Capacity pools
This topic gives an overview of the Capacity Pools view in BMC Helix Capacity Optimization. Refer to the following sections for details:
Overview
Capacity Pools are logical groupings of homogeneous resources of infrastructure capacity. They aggregate the capacity data coming from different entities available in the IT infrastructure; for example, by business importance, customer, owner, geography, and so on.
The Capacity Pools view displays three key indicators representing the overall capacity and the level of efficiency and risk for each group, or capacity pool. From these key indicators you can navigate to more infrastructure-oriented views. As a system administrator, you can rank all capacity pools available in the workspace in the Capacity Pools view. The view displays the ranking based on the Usage, Risk, and Efficiency scores.
Why should you use Capacity Pools view?
Use the Capacity Pools view for a quick look at the IT resources that you are responsible for, either because you fund and manage their budget or because you manage their capacity. The Capacity Pools view enables you to easily see the capacity utilization of aggregated entities and effectively manage the capacity of your entire IT infrastructure.
Who should use Capacity Pools view?
Executives with budget responsibility, or technical owners of the infrastructure. The Capacity Pools view gives you the ability to see the capacity that you are responsible for. You can get a quick view of all the capacity pools in one place, and drill down to specific areas, as required.
When should you use Capacity Pools view?
- When you want to review the risk, usage, and efficiency of the IT resources that you are responsible for, either as a budget owner or as a capacity manager
- When the key indicator values in some areas show that action needs to be taken
- During recurring service/capacity review meetings with the leadership team
Capacity goals: Usage, Risk, and Efficiency
Three daily goals drive a capacity manager who manages a large number of servers of any technology:
Keep track of Capacity Usage
You should keep track of which parts of your infrastructure are highly used and which ones have spare resources. This is the most basic activity, and is also called capacity visibility. It lets you take ownership of all IT resources, so that you can redeploy an old server for a development project, or create a subnet by using spare network ports that are not being used.
Keep Capacity Risk low at all times
Capacity Risk is measured in terms of resource saturation, either actual or impending. Measuring saturation is not just about overall utilization exceeding resources or trending in an uncomfortable direction, but also about detecting resource contention where preconfigured limits are being exceeded, or where one workload is crowding out others. You should flag any capacity risk at any level, and map it back to the business workloads that are affected before the risk is realized.
Keep Efficiency high
The Capacity Pools view enables you to maintain an efficient environment, regardless of the kind of capacity pool (production, test, QA, and so on) it represents. Depending on the business requirements, SLAs, and how cautious you want to be, you can set an appropriate expectation for efficiency and set aside some headroom appropriate to the business need. If the headroom is exceeded, you want to know.
With the Capacity Pools view, you can avoid wasting spare capacity, whether it is expensive disk storage or high-memory servers sitting idle. A periodic check for typical areas of waste, such as snapshot policies that are no longer necessary, virtual machines that are idle while occupying precious disk space, and servers that are no longer being used, can help you maximize the efficiency of your infrastructure.
BMC Helix Capacity Optimization helps you achieve all three of these goals simultaneously, and most importantly, it lets you communicate these three indicators (usage, risk, and efficiency) to business owners. Your management may not be familiar with the finer points of memory ballooning, but they understand usage, risk, and efficiency indicators if you can relate them directly to the business applications. BMC Helix Capacity Optimization provides the tools needed to report these high-level indicators in a business-aware manner, and at the same time lets you drill down into the root causes so you can understand and influence the indicators. It makes the conversation about real data, and yet is grounded in terms of business importance.
Using Usage, Risk, and Efficiency indicators
Defining Capacity Pools
You organize your infrastructure into silos or segments in a way that aligns with your business purpose. We call these segments capacity pools. For example, a set of VMware clusters that are supporting production usage for XYZ projects in Toronto could be grouped together into a capacity pool called “XYZ – Toronto.” Once you define a capacity pool, BMC Helix Capacity Optimization automatically analyzes the right metrics on a daily basis and creates the high-level indicators, along with a structure of formulas and lower-level elements (clusters, frames) into which you can drill down.
The capacity pools you define are technology specific. You must ensure that they are related to business use.
Technology specific
The analysis of usage, risk, and efficiency necessarily depends on the exact behavior of the operating system, hypervisor, and other layers. BMC Helix Capacity Optimization contains out-of-the-box knowledge on many technology platforms, namely VMware vSphere, IBM AIX on PowerVM, and KVM.
Related to business use
It is common for multiple infrastructure silos, or capacity pools, to support a single business purpose. In such a case, you can create separate capacity pools for each technology in the pool. Similarly, a single ‘technical service’ such as a web back-end server farm may support multiple applications. In this case, it might be more appropriate to create one separate capacity pool for the web server farm, and one for other parts of an application. Business communities are familiar with this level of dependency. The key factor to consider here is whether you want to report on the usage, risk, and efficiency of this capacity pool.
BMC Helix Capacity Optimization produces indicators on Usage, Risk, and Efficiency, for each capacity pool that you have defined:
- A Usage indicator counts the level of usage per capacity pool, so that you can track which capacity pools have excess capacity available.
- A Risk indicator counts any indication that CPU, memory, or disk is either already saturated or will soon be, or that workloads are experiencing capacity contention.
- A Capacity Efficiency indicator counts resources that are being wasted, so that efficiency could be improved after a detailed analysis.
For a given technology (for example, VMware vSphere or IBM Power series), specific methods are used for gathering, summarizing, and analyzing specific metrics to find such risk or efficiency conditions. BMC Helix Capacity Optimization automates the computation, using the metrics already being collected to find these indicators for a large number of servers. The result of this automated process is saved in the database and presented in the Capacity Pools view. This feature lets you manage large infrastructures by quickly focusing attention on servers that are flagged with these conditions. These computed indicators and recommendations let you concentrate your efforts on investigating the portions of your infrastructure with the highest probability of capacity risk or inefficiency.
Viewing Capacity Pool Indicators: Capacity Pools View
On a daily basis, BMC Helix Capacity Optimization produces indicators on Usage, Risk, and Efficiency, for each capacity pool that you have defined. These indicators are displayed in a graphical view called a Capacity Pools View.
A Capacity Risk indicator, as mentioned above, counts any indication that a resource like CPU or memory is getting saturated on any server in a capacity pool, or that the capacity allocation needs tuning to balance demand among competing workloads. If you see such an indication for a capacity pool, you should examine the affected server to find out whether there is a risk to be addressed.
Figure 1: Capacity Pools View
The capacity risk indicator is a score value ranging from 1 through 100, higher being worse. The score is computed by adding risks based on CPU, memory, and disk storage risk scores for the capacity pool. These three components of the Risk score are computed as follows:
- The CPU risk score is the worst (maximum) CPU risk score of the contained elements in the capacity pool. For example, if the capacity pool contains 10 VMware clusters, then the highest CPU risk score out of these ten is used for the capacity pool’s CPU risk score.
- Similarly, the memory and disk risk scores are the maximum corresponding risk scores of the contained elements.
- For each contained element, technology-specific criteria for impending and current saturation are used to compute each of these three risk scores.
- For example, for VMware clusters, the CPU days to saturation indicator is evaluated. If the CPU days to saturation indicator shows that the cluster CPU is projected to exceed its threshold within a certain time horizon (both the threshold and the horizon are user-configurable), then that is one criterion for impending saturation risk.
- Another criterion for VMware clusters is the number of times the CPU ready time metric has exceeded a threshold within the past thirty days. This criterion is for current saturation risk.
- For each such criterion, depending on its severity, a penalty is added to the CPU risk score of the cluster.
- Similarly, for each container, technology-specific criteria for impending and current memory and disk saturation are evaluated and used to add a penalty for the memory or disk risk score, respectively, for the container.
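The roll-up described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the product's implementation: the criterion names, the penalty values, and the cap at 100 are hypothetical, and the sketch stops at the per-resource pool scores because the text does not specify how the CPU, memory, and disk scores are combined into the final risk score.

```python
from dataclasses import dataclass

RESOURCES = ("cpu", "memory", "disk")


@dataclass
class Criterion:
    """One technology-specific saturation criterion (names are hypothetical)."""
    name: str
    triggered: bool
    penalty: int  # hypothetical penalty points; real values depend on severity


def element_risk(criteria):
    """Add the penalties of all triggered criteria for one resource of one element.

    The cap at 100 is an assumption made to keep the score in range.
    """
    return min(100, sum(c.penalty for c in criteria if c.triggered))


def pool_resource_risk(elements, resource):
    """Pool-level score for one resource: the worst (maximum) element score."""
    return max(scores[resource] for scores in elements.values())


# Two hypothetical VMware clusters with illustrative CPU criteria.
cluster_a_cpu = [
    Criterion("cpu_days_to_saturation_within_horizon", True, 40),
    Criterion("cpu_ready_time_exceedances_last_30_days", True, 30),
]
cluster_b_cpu = [
    Criterion("cpu_days_to_saturation_within_horizon", False, 40),
    Criterion("cpu_ready_time_exceedances_last_30_days", True, 30),
]

elements = {
    "cluster-a": {"cpu": element_risk(cluster_a_cpu), "memory": 10, "disk": 0},
    "cluster-b": {"cpu": element_risk(cluster_b_cpu), "memory": 25, "disk": 5},
}

pool_scores = {r: pool_resource_risk(elements, r) for r in RESOURCES}
print(pool_scores)  # {'cpu': 70, 'memory': 25, 'disk': 5}
```

In this example the pool's CPU risk comes entirely from cluster-a, which is the element you would drill down into first.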
The above computations are performed for all capacity pools in a batch once a day, based on collected data in the database. The Capacity Pools View displays the capacity risk score, color-coding it appropriately.
When a capacity pool’s risk score is high, you should drill down to see which resources (CPU, memory, or disk) are showing high risk. In addition, you can drill down to the individual containers (for example, VMware clusters) to see which ones are causing the high risk score. For each container, you can see the detailed breakdown of each criterion contributing to the high score.
A similar breakdown into CPU, memory, and disk space is made for the Efficiency score and the Usage score.
In the Capacity Pools View, interacting with any of these indicators can provide hints on which kinds of quantities are counted (Figure 2), and details on the three components of the score: CPU, memory, and storage (Figure 3).
Figure 2: Pop-up explanation of Efficiency formula
Figure 3: Detail window breaking down Usage indicator into CPU, Memory, and Disk
In the example shown in Figure 3, the Usage score of 34 is shown to be composed of three separate Usage numbers for CPU, Memory, and Disk, respectively. This lets you both investigate and explain where these numbers come from.
Furthermore, the console also lets you drill down into an entire new detail page, where all the contained elements in the capacity pool are listed, and each element’s contribution to the scores can be seen. For example, a capacity pool that shows a Risk score of 100 is shown below.
Figure 4: Detail page of a capacity pool
The detail page for the capacity pool with the Risk score of 100 is shown above in Figure 4. The Contained Elements table shows all the clusters in the capacity pool with their individual Usage, Risk, and Efficiency scores, categorized by CPU, Memory, and Storage. Thus, for each cluster row in the table, you see a total of 12 individual scores: a CPU, Memory, and Storage score plus a total for each of Usage, Risk, and Efficiency.
In Figure 4 you can quickly see that the high risk score is caused by a high CPU risk score in two clusters, the first two in the table. This immediately narrows down the investigation to these two clusters, and in particular to their CPU-related metrics.
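The same narrowing-down can be pictured as a simple filter over the Contained Elements table. The following Python sketch is illustrative only: the cluster names, the score values, the dictionary layout, and the threshold of 80 are all hypothetical, and the totals are given as data rather than computed because the text does not state how they are derived.

```python
COMPONENTS = ("cpu", "memory", "storage")

# Hypothetical rows of a Contained Elements table: for each cluster,
# three indicators with a total and three component scores (12 values per row).
contained_elements = {
    "cluster-1": {
        "usage": {"total": 60, "cpu": 70, "memory": 55, "storage": 40},
        "risk": {"total": 100, "cpu": 100, "memory": 20, "storage": 10},
        "efficiency": {"total": 50, "cpu": 45, "memory": 60, "storage": 50},
    },
    "cluster-2": {
        "usage": {"total": 55, "cpu": 65, "memory": 50, "storage": 35},
        "risk": {"total": 90, "cpu": 90, "memory": 15, "storage": 5},
        "efficiency": {"total": 40, "cpu": 35, "memory": 50, "storage": 45},
    },
    "cluster-3": {
        "usage": {"total": 30, "cpu": 25, "memory": 35, "storage": 30},
        "risk": {"total": 15, "cpu": 15, "memory": 10, "storage": 5},
        "efficiency": {"total": 70, "cpu": 75, "memory": 65, "storage": 70},
    },
}


def scores_per_row(row):
    """Count the individual scores shown for one cluster row."""
    return sum(len(scores) for scores in row.values())


def top_contributors(elements, indicator, threshold):
    """List (cluster, component, score) entries at or above the threshold, worst first."""
    hits = [
        (name, component, row[indicator][component])
        for name, row in elements.items()
        for component in COMPONENTS
        if row[indicator][component] >= threshold
    ]
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


print(scores_per_row(contained_elements["cluster-1"]))  # 12
print(top_contributors(contained_elements, "risk", 80))
# [('cluster-1', 'cpu', 100), ('cluster-2', 'cpu', 90)]
```

The filter returns only CPU entries for the first two clusters, mirroring the conclusion drawn from the table by eye.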
The top of the detail page also shows aggregated configuration information for the capacity pool (number of CPUs, and so on) and a thirty-day time chart of a few consolidated metrics. These pieces of information can reveal at a glance whether there have been significant recent changes to this capacity pool.
In the table of Contained Elements, each cluster name is a hyperlink that can be used to drill down to yet another view showing details for that cluster. This drill-down could be used to examine any given cluster in detail. But in our example, where there is more than one cluster with a high risk score, there is yet another, more focused avenue available within the Detail view that makes sense to pursue. At the bottom is a link to “Risk details”, which leads to a risk details page (Figure 5 below).
Figure 5: Risk details for all clusters in a Capacity Pool
The risk details page above shows the components of the risk indicator for each cluster, down to the individual metric-based indicators (CPU pressure index, days to saturation) for the cluster that contribute to it. We can see that the same two clusters that have the high risk scores are showing two metrics as high risk:
- CPU Days to Saturation shows as Saturated. The CPU Days to Saturation is not a raw metric, but a calculation over the past 30 days, which uses a linear trend to estimate how long it will be before a given resource breaches a preset threshold. In our case, these two clusters have already breached the threshold.
- CPU Pressure Index shows as 100. The CPU Pressure Index is also not a raw metric, but a calculation that combines two metrics: CPU utilization and CPU ready time. CPU ready time is ordinarily used by system admins to detect CPU contention in the cluster. BMC Helix Capacity Optimization starts with the CPU utilization value and adds penalty points for any indications of CPU contention seen by counting how many times CPU ready time has exceeded a threshold within the recent past. This resulting number is weighted and scaled to remain within the 0-100 range.
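As a rough illustration of the two derived indicators above, the following Python sketch fits a least-squares line to daily utilization samples and projects when it crosses a threshold, then builds a pressure index from utilization plus contention penalties. It is a sketch under stated assumptions, not the product's formulas: the function names, the penalty of 5 points per exceedance, the use of clamping as the scaling step, and the choice to compare the fitted (rather than the last observed) value against the threshold are all hypothetical.

```python
import math


def linear_fit(values):
    """Ordinary least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_to_saturation(daily_util, threshold):
    """Project the days until the trend reaches the threshold.

    Returns 0 when the trend is already at or above the threshold ("Saturated"),
    None when the trend is flat or falling, otherwise a whole number of days.
    """
    if len(daily_util) < 2:
        raise ValueError("need at least two daily samples to fit a trend")
    slope, intercept = linear_fit(daily_util)
    current = intercept + slope * (len(daily_util) - 1)  # fitted value for the latest day
    if current >= threshold:
        return 0
    if slope <= 0:
        return None
    return math.ceil((threshold - current) / slope)


def cpu_pressure_index(cpu_util_pct, ready_time_samples, ready_threshold,
                       penalty_per_exceedance=5.0):
    """Start from CPU utilization and add penalty points per ready-time exceedance.

    The penalty weight and the clamp to the 0-100 range are assumptions.
    """
    exceedances = sum(1 for sample in ready_time_samples if sample > ready_threshold)
    raw = cpu_util_pct + penalty_per_exceedance * exceedances
    return max(0.0, min(100.0, raw))


# Thirty hypothetical daily samples rising one point per day from 50% to 79%.
utilization = [50 + day for day in range(30)]
print(days_to_saturation(utilization, threshold=90))  # 11
print(days_to_saturation(utilization, threshold=75))  # 0, already saturated
print(cpu_pressure_index(70, [2, 12, 15, 3, 11], ready_threshold=10))  # 85.0
```

A result of 0 from `days_to_saturation` corresponds to the Saturated state shown for the two clusters in the example, and a pressure index clamped at 100.0 corresponds to the value of 100 shown for them.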
Thus, we know not only which clusters are the ones causing high risk in the capacity pool, and that their CPU capacity is at risk, but also exactly why BMC Helix Capacity Optimization believes the CPU capacity is at risk.
The logic for determining risk would ordinarily entail time-based trending as well as a judgment of an expert who has been monitoring CPU contention. BMC Helix Capacity Optimization instead automates this logic, and stores and presents all of these intermediate calculations in a logical manner so that you can follow along by clicking through the Capacity Pools View.
In Figure 5 you can also see similar metrics that would contribute to high risk for memory and storage. For example, the Memory Pressure Index is a derived score based on the following metrics reported by VMware: consumed memory, active memory, ballooning, and host swapping.
Once again, the names of the clusters are links to their individual pages, where the underlying metrics can be seen in as much detail as needed by linking to yet another view of individual metrics at the individual element level of a single cluster.
Similar links and pages, with pointers to the exact metrics, are provided for Efficiency and Usage scores, as well.
Summary: Managing Large Infrastructures with the Capacity Pools View
The Capacity Pools View facilitates a conversation between different stakeholders, primarily between the executive with budget responsibility and the technical owner of the infrastructure, the capacity manager.
At the top level, the Capacity Pools view starts with a business-level grouping. The conversation is therefore grounded in the business reasons and context for each investigation of usage, risk, and efficiency, instead of getting lost in technical details.
Next, the Capacity Pools View allows multiple levels of drill-down. This capability not only allows the capacity manager to investigate the capacity questions down to the individual elements, but also lets the executive participate in the process by simplifying and focusing the questions and answers at each level into a small number of indicators.
Finally, at the element level, the Capacity Pools view links into the same element-level views that the capacity manager is already using to view detailed technical metrics. Thus, the capacity manager can navigate back and forth between communicating effectively and building an understanding of the underlying root causes of risks, efficiency, and usage. The formulas that BMC Helix Capacity Optimization uses to compute the indicators are based on conditions specific to each type of platform: VMware vSphere, IBM AIX, KVM, etc. These computed indicators are used consistently in the BMC Helix Capacity Optimization user interface.
The indicators and component scores are computed from a large number of metrics collected on the individual vSphere clusters. These metrics measure the CPU, memory, and disk capacity available and used in the capacity pool, and formulas combine these metrics using weights to compute the indicator values. Correspondingly, behind this view is a large analytical structure tracing the dependencies from the top-level indicators, through the various contributing formulas, down to individual clusters, and within each cluster down to CPU, memory, and storage capacity.
Thus, BMC Helix Capacity Optimization allows you, the capacity manager, to manage usage, risk, and efficiency for large infrastructures. Not only does it provide you with a powerful, automated facility to drill down into the root causes, but it also provides simple, business-aligned scores that facilitate the conversation with the business stakeholders.