Three day-to-day goals drive a capacity manager responsible for a large number of servers of any technology: managing capacity risk, maintaining efficiency, and tracking usage.
Capacity Risk is measured in terms of resource saturation, either actual or impending. Measuring saturation is not just about overall utilization exceeding resources or trending in an uncomfortable direction, but also about detecting resource contention where pre-configured limits are being exceeded, or where one workload is crowding out others. You should flag any capacity risk at any level, and map it back to the business workloads that are affected before the risk is realized.
Depending on the business requirements, SLAs, and how cautious you want to be, you can set an appropriate expectation for efficiency. No one wants to waste spare capacity, whether it is expensive disk storage or high-memory servers sitting idle. And yet, safety requires setting aside some headroom appropriate to the business need. If the headroom is exceeded, you want to know. In addition, you should periodically check for typical areas of waste: snapshot policies that are no longer necessary, virtual machines sitting idle while occupying precious disk space, and servers that are no longer being used.
You should keep track of which parts of your infrastructure are highly used and which ones have spare resources. This is the most basic activity, and is also called capacity visibility. It lets you take ownership of all IT resources, so that you can redeploy an old server for a development project, or create a subnet by using spare network ports that are not being used.
TrueSight Capacity Optimization helps you achieve all three of these goals simultaneously and, most importantly, lets you communicate the three indicators (risk, efficiency, usage) to business owners. Your management may not be familiar with the finer points of memory ballooning, but they understand risk, efficiency, and usage indicators if you can relate them directly to the business applications. TrueSight Capacity Optimization provides the tools needed to report these high-level indicators in a business-aware manner, and at the same time lets you drill down into the root causes so you can understand and influence the indicators. It grounds the conversation in real data while keeping it framed in terms of business importance.
You organize your infrastructure into silos or segments in a way that aligns with your business purpose. We call these segments “capacity pools”. For example, a set of VMware clusters that are supporting production usage for XYZ projects in Toronto could be grouped together into a capacity pool called “XYZ – Toronto”. Once you define a capacity pool, TrueSight Capacity Optimization automatically analyzes the right metrics on a daily basis and creates the high-level indicators, along with a structure of formulas and lower-level elements (clusters, frames) into which you can drill down.
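As a minimal sketch of this grouping idea, a capacity pool can be pictured as a named set of clusters on one platform. The class name, field names, and cluster names below are illustrative assumptions only; they do not reflect the product's data model or any API.

```python
from dataclasses import dataclass, field


@dataclass
class CapacityPool:
    """Hypothetical model of a capacity pool: a business-aligned group of clusters."""
    name: str
    platform: str  # for example "VMware vSphere" or "IBM AIX on PowerVM"
    clusters: list[str] = field(default_factory=list)


# Group the production clusters for the XYZ projects in Toronto into one pool.
pool = CapacityPool(
    name="XYZ – Toronto",
    platform="VMware vSphere",
    clusters=["xyz-prod-cluster-01", "xyz-prod-cluster-02"],  # made-up names
)
print(f"{pool.name}: {len(pool.clusters)} clusters on {pool.platform}")
```

The useful property of such a grouping is that every indicator computed later can be reported against the pool's business name rather than against individual cluster identifiers.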
The analysis of risk, efficiency, and usage necessarily depends on the exact behavior of the operating system, hypervisor, and other layers. TrueSight Capacity Optimization contains out-of-the-box knowledge of many technology platforms. In version 10.0, capacity pools can be defined for two commonly used platforms: VMware vSphere and IBM AIX on PowerVM. Additional platforms will be added in the future.
You don’t have to define exactly one capacity pool per business purpose. It is common for multiple infrastructure silos to support a single business purpose; in this case, it is easier to create separate capacity pools for each technology in such a silo. Similarly, a single 'technical service' such as a web back-end server farm might support multiple applications. In this case, it might be more appropriate to create one separate capacity pool for the web server farm, and one for other parts of an application. Business communities are familiar with this level of dependency. The key question to ask yourself is: Would it make sense to report on the risk, efficiency, and usage of this capacity pool?
A Risk indicator counts any indication that CPU, memory, or disk is either already saturated or will soon be, or that workloads are experiencing capacity contention.
A Capacity Efficiency indicator counts resources that are being wasted, so that efficiency could be improved after a detailed analysis.
A Usage indicator counts the level of usage per capacity pool, so that you can track which capacity pools have excess capacity available.
For a given technology (for example, VMware vSphere or IBM Power series), there are specific methods used for gathering, summarizing, and analyzing specific metrics, to find such risk conditions or efficiency conditions. A capacity planner who knows which metrics are important and what they mean could do this analysis manually using the TrueSight Capacity Optimization console.
We have incorporated some of this knowledge into the product, and automated the computation to use the metrics already being collected to find these indicators for a large number of servers. The result of this automated process is saved in the database and presented in a new TrueSight Capacity Optimization console called the Capacity Pools View. This feature lets you manage large infrastructures by quickly focusing attention on servers that are flagged with these conditions. The computed indicators and recommendations are not foolproof, but they do let you concentrate your efforts on the portions of your infrastructure with the highest probability of capacity risk or inefficiency.
On a daily basis, TrueSight Capacity Optimization produces Risk, Efficiency, and Usage indicators for each capacity pool that you have defined. These indicators are displayed in a graphical view called the Capacity Pools View.
A Capacity Risk indicator, as mentioned above, counts any indication that a resource such as CPU or memory is becoming saturated on any server in a capacity pool, or that the capacity allocation needs tuning to balance demand among competing workloads. If you see such an indication for a capacity pool, you should examine the affected server to find out whether there is a risk to be addressed.
Figure 1: Capacity Pools View
The capacity risk indicator is a “score” value from 1 through 100, higher being worse. The score is computed by adding risks based on CPU, memory, and disk storage risk scores for the capacity pool. These three components of the Risk score are computed as follows:
The above computations are performed for all capacity pools in a batch once a day, based on collected data in the database. The Capacity Pools View displays the capacity risk score, color-coding it appropriately.
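The additive combination and the color-coding described above can be sketched as follows. This is an illustration only: the text does not say how totals outside the score range are handled or which thresholds drive the colors, so the clamping to the range 1 through 100 and the threshold values (warn, critical) are assumptions, and the function names are hypothetical.

```python
def pool_risk_score(cpu: float, memory: float, disk: float) -> int:
    """Add the three component risk scores and clamp the total to 1..100 (assumed)."""
    total = round(cpu + memory + disk)
    return max(1, min(100, total))


def risk_color(score: int, warn: int = 40, critical: int = 70) -> str:
    """Map a score to a display color; thresholds are assumed, not the product's."""
    if score >= critical:
        return "red"
    if score >= warn:
        return "yellow"
    return "green"


score = pool_risk_score(cpu=55, memory=10, disk=5)
print(score, risk_color(score))
```

Keeping the three component scores separate until the final addition is what makes the later drill-down possible: the total can always be traced back to the resource that contributed most.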
When a capacity pool’s risk score is high, you should drill down to see which resources (CPU, memory, or disk) are showing high risk. In addition, you can drill down to the individual containers (for example, VMware clusters) to see which ones are causing the high risk score. For each container, you can see the detailed break-down of each criterion contributing to the high score.
A similar breakdown into CPU, memory, and disk space is made for the Efficiency score and the Usage score.
In the Capacity Pools View, interacting with any of these indicators can get you hints on which kinds of quantities are counted (Figure 2), and details on the three components of the score: CPU, memory, and storage (Figure 3).
Figure 2: Pop-up explanation of Efficiency formula
Figure 3: Detail window breaking down Usage indicator into CPU, Memory, and Disk
In the example shown in Figure 3, the Usage score of 34 is shown to be composed of three separate Usage numbers for CPU, Memory, and Disk, respectively. This allows you both to investigate and to explain where these numbers come from.
Furthermore, the console lets you drill down into an entirely new detail page, where all the contained elements in the capacity pool are listed and each element’s contribution to the scores can be seen. For example, a capacity pool that shows a Risk score of 100 is shown below.
Figure 4: Detail page of a Capacity Pool
The detail page for the capacity pool with the Risk score of 100 is shown above in Figure 4. The Contained Elements table shows all the clusters in the capacity pool with their individual Usage, Risk, and Efficiency scores, broken out by CPU, Memory, and Storage. Thus, for each cluster row in the table, you see a total of 12 individual scores: three component scores plus a total for each of Usage, Risk, and Efficiency.
In Figure 4 you can see easily that the high risk score is caused by a high CPU risk score in two clusters, the first two in the table. This immediately narrows down the investigation to these two clusters, and in particular to their CPU related metrics.
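The same narrowing-down can be expressed as a simple filter over the rows of a Contained Elements table. The sketch below uses made-up cluster names and scores, and the threshold of 70 is an assumption chosen for illustration, not a value taken from the product.

```python
# Hypothetical rows from a Contained Elements table: per-cluster risk by resource.
contained_elements = [
    {"cluster": "cluster-a", "cpu": 90, "memory": 5, "storage": 5},
    {"cluster": "cluster-b", "cpu": 85, "memory": 10, "storage": 0},
    {"cluster": "cluster-c", "cpu": 10, "memory": 15, "storage": 5},
]


def high_risk_findings(rows, threshold=70):
    """Return (cluster, resource, score) for each component score >= threshold."""
    findings = []
    for row in rows:
        for resource in ("cpu", "memory", "storage"):
            if row[resource] >= threshold:
                findings.append((row["cluster"], resource, row[resource]))
    # Worst offenders first.
    return sorted(findings, key=lambda f: f[2], reverse=True)


for cluster, resource, score in high_risk_findings(contained_elements):
    print(f"{cluster}: {resource} risk {score}")
```

With this sample data, only the CPU scores of the first two clusters pass the filter, which mirrors the situation described for Figure 4.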
Also shown at the top of the detail page are aggregated configuration information for the capacity pool (number of CPUs, etc.) and a thirty-day time chart of a few consolidated metrics. These pieces of information can reveal at a glance whether there have been significant recent changes to this capacity pool.
In the table of Contained Elements, each cluster name is a hyperlink that leads to another view showing details for that cluster. This drill-down can be used to examine any given cluster in detail. In our example, however, more than one cluster has a high risk score, so a more focused path within the detail page makes sense. At the bottom is a link to “Risk details”, which leads to a risk details page (Figure 5 below).
Figure 5: Risk details for all clusters in a Capacity Pool
The risk details page above shows the components of the risk indicator for each cluster, down to the individual metric-based indicators (CPU pressure index, days to saturation) that contribute to it. We can see that the same two clusters that have the high risk scores are showing two metrics as high risk:
Thus, we know not only which clusters are the ones causing high risk in the capacity pool, and that their CPU capacity is at risk, but also exactly why TrueSight Capacity Optimization believes the CPU capacity is at risk.
The logic for determining risk would ordinarily entail time-based trending as well as a judgment of an expert who has been monitoring CPU contention. TrueSight Capacity Optimization instead automates this logic, and stores and presents all of these intermediate calculations in a logical manner so that you can follow along by clicking through the Capacity Pools View.
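To make the idea of time-based trending concrete, the sketch below estimates a "days to saturation" value by fitting a straight line through daily utilization samples and extrapolating to full capacity. The least-squares linear fit is an assumption chosen for illustration; the text does not state which trending method the product uses, and the function name and sample data are hypothetical.

```python
def days_to_saturation(daily_utilization, capacity=100.0):
    """Estimate days until a linear trend through the samples reaches capacity.

    daily_utilization: one utilization sample per day, oldest first.
    Returns 0.0 if already saturated, None if the trend is flat or falling.
    """
    n = len(daily_utilization)
    if n < 2:
        return None
    if daily_utilization[-1] >= capacity:
        return 0.0

    # Ordinary least-squares slope and intercept over day indices 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(daily_utilization) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_utilization))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x

    # Day index where the fitted line crosses capacity, relative to the last sample.
    crossing_day = (capacity - intercept) / slope
    return max(0.0, crossing_day - (n - 1))


cpu_history = [60, 62, 64, 66, 68]  # made-up samples, percent CPU utilization
print(days_to_saturation(cpu_history))
```

A straight-line fit ignores seasonality and step changes, which is one reason an automated estimate of this kind is best treated as a pointer for investigation rather than a forecast.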
In Figure 5 you can also see similar metrics that would contribute to high risk for memory and storage. For example, the “Memory Pressure Index” is a derived score based on four metrics reported by VMware: consumed memory, active memory, ballooning, and host swapping.
Once again, the names of the clusters are links to their individual pages. From there, you can follow a further link to a view of the individual metrics for a single cluster and examine them in as much detail as needed.
Similar links and pages, with pointers to the exact metrics, are provided for Efficiency and Usage scores, as well.
The Capacity Pools View facilitates a conversation between different stakeholders, primarily between the executive with budget responsibility and the technical owner of the infrastructure, the capacity manager.
At the top level, the Capacity Pools View starts with a business-level grouping. The conversation is therefore grounded in the business reasons and context for each investigation of usage, risk, and efficiency, instead of getting lost in technical details.
Next, the Capacity Pools View allows multiple levels of drill-down. This capability not only allows the capacity manager to investigate the capacity questions down to the individual elements, but also lets the executive participate in the process by simplifying and focusing the questions and answers at each level into a small number of indicators.
Finally, at the element level, the Capacity Pools View links into the same element-level views that the capacity manager is already using to view detailed technical metrics. Thus, the capacity manager can navigate back and forth between communicating effectively and building an understanding of the underlying root causes of risks, efficiency, and usage.
The formulas that BMC TrueSight Capacity Optimization uses to compute the indicators are based on conditions specific to each type of platform: VMware vSphere, IBM AIX, etc. These computed indicators are used consistently in the BMC TrueSight Capacity Optimization UI.
The indicators and component scores are computed from a large number of metrics collected on the individual vSphere clusters. These metrics measure the CPU, memory, and disk capacity available and used in the capacity pool, and formulas combine these metrics using weights to compute the indicator values. Correspondingly, behind this view is a large analytical structure tracing the dependencies from the top-level indicators, through the various contributing formulas, down to individual clusters, and within each cluster down to CPU, memory, and storage capacity.
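One simple way to picture how weights combine several metrics into a single value is a weighted average. The sketch below is a generic illustration under that assumption; the metric names, values, and weights are made up, and the product's actual formulas are not reproduced here.

```python
def weighted_indicator(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Combine normalized metric values (0..100) into one score by weighted average."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(metrics[name] * weight for name, weight in weights.items())
    return weighted_sum / total_weight


# Made-up, already normalized CPU metrics for one cluster, with assumed weights.
cpu_metrics = {"cpu_utilization": 80.0, "cpu_contention": 40.0}
cpu_weights = {"cpu_utilization": 3.0, "cpu_contention": 1.0}
print(weighted_indicator(cpu_metrics, cpu_weights))
```

Dividing by the total weight keeps the result on the same 0 to 100 scale as the inputs, so component values computed this way remain comparable across clusters.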
Thus, TrueSight Capacity Optimization allows you, the capacity manager, to manage risk, efficiency, and usage for large infrastructures. It not only provides a powerful, automated facility for drilling down into root causes, but also provides simple, business-aligned scores that facilitate the conversation with business stakeholders.