
This version of the documentation is no longer supported. However, the documentation is available for your convenience.


Capacity goals: Risk, Efficiency, and Usage

Three day-to-day goals drive the work of a capacity manager who is responsible for a large number of servers, whatever the technology:

Keep Capacity Risk low at all times

Capacity Risk is measured in terms of resource saturation, either actual or impending. Measuring saturation is not just about overall utilization exceeding resources or trending in an uncomfortable direction, but also about detecting resource contention where pre-configured limits are being exceeded, or where one workload is crowding out others. You should flag any capacity risk at any level, and map it back to the business workloads that are affected before the risk is realized.

Keep Efficiency high, once in production

Depending on the business requirements, SLAs, and how cautious you want to be, you can set an appropriate expectation for efficiency. No one wants to waste spare capacity, whether it is expensive disk storage or high-memory servers sitting idle. And yet, safety requires setting aside some headroom appropriate to the business need. If the headroom is exceeded, you want to know. You should also check periodically for typical areas of waste: snapshot policies that are no longer necessary, virtual machines sitting idle while occupying precious disk space, and servers that are no longer being used.

Keep track of Capacity Usage

You should keep track of which parts of your infrastructure are highly used and which ones have spare resources. This is the most basic activity, and is also called capacity visibility. It lets you take ownership of all IT resources, so that you can redeploy an old server for a development project, or create a subnet by using spare network ports that are not being used.

TrueSight Capacity Optimization helps you achieve all three of these goals simultaneously, and most importantly, it lets you communicate these three indicators (risk, efficiency, usage) to business owners. Your management may not be familiar with the finer points of memory ballooning, but they understand risk, efficiency, and usage indicators if you can relate them directly to the business applications. TrueSight Capacity Optimization provides the tools needed to report these high-level indicators in a business-aware manner, and at the same time lets you drill down into the root causes so you can understand and influence the indicators. It makes the conversation about real data, and yet grounded in terms of business importance.

Using Risk, Efficiency, and Usage indicators

Defining Capacity Pools

You organize your infrastructure into silos or segments in a way that aligns with your business purpose. We call these segments “capacity pools”. For example, a set of VMware clusters that support production usage for XYZ projects in Toronto could be grouped together into a capacity pool called “XYZ – Toronto”. After you define a capacity pool, TrueSight Capacity Optimization automatically analyzes the relevant metrics on a daily basis and creates the high-level indicators, along with a structure of formulas and lower-level elements (clusters, frames) into which you can drill down.

The capacity pools you define are technology specific. You must ensure that they are related to business use.

Technology specific

The analysis of risk, efficiency, and usage necessarily depends on the exact behavior of the Operating System, hypervisor, and other layers. TrueSight Capacity Optimization contains out-of-the-box knowledge on many technology platforms. In 10.0, capacity pools can be defined for two commonly-used platforms: VMware vSphere and IBM AIX on PowerVM. Additional platforms will be added in the future.

Related to business use

You don’t have to define exactly one capacity pool per business purpose. It is common for multiple infrastructure silos to support a single business purpose; in this case, it is easier to create separate capacity pools for each technology in such a silo. Similarly, a single 'technical service' such as a web back-end server farm might support multiple applications. In this case, it might be more appropriate to create one separate capacity pool for the web server farm, and one for other parts of an application. Business communities are familiar with this level of dependency. The key question to ask yourself is: Would it make sense to report on the risk, efficiency, and usage of this capacity pool?

A Risk indicator counts any indication that CPU, memory, or disk is either already saturated or will soon be, or that workloads are experiencing capacity contention.

A Capacity Efficiency indicator counts resources that are being wasted, so that efficiency could be improved after a detailed analysis.

A Usage indicator counts the level of usage per capacity pool, so that you can track which capacity pools have excess capacity available.

For a given technology (for example, VMware vSphere or IBM Power series), there are specific methods used for gathering, summarizing, and analyzing specific metrics, to find such risk conditions or efficiency conditions. A capacity planner who knows which metrics are important and what they mean could do this analysis manually using the TrueSight Capacity Optimization console.

We have incorporated some of this knowledge into the product, and automated the computation to use the metrics already being collected to find these indicators for a large number of servers. The result of this automated process is saved in the database and presented in a new TrueSight Capacity Optimization console called the Capacity Pools View. This feature lets you manage large infrastructures by quickly focusing attention on servers that are flagged with these conditions. These computed indicators and recommendations cannot be a hundred percent foolproof, but they do let you concentrate your efforts on investigating portions of your infrastructure with the highest probability of capacity risks or inefficiency.

Viewing Capacity Pool Indicators: Capacity Pools View

On a daily basis, TrueSight Capacity Optimization produces indicators on Risk, Efficiency, and Usage, for each capacity pool that you have defined. These indicators are displayed in a graphical view called a Capacity Pools View.

A Capacity Risk indicator, as mentioned above, counts any indication that a resource like CPU or memory is getting saturated on any server in a capacity pool, or that the capacity allocation needs tuning to balance demand among competing workloads. If you see such an indication for a capacity pool, you should examine the affected server to find out whether there is a risk to be addressed.

Figure 1: Capacity Pools View

The capacity risk indicator is a “score” value from 1 through 100, where a higher score is worse. The score is computed by adding the CPU, memory, and disk storage risk scores for the capacity pool. These three components of the Risk score are computed as follows:

  • The CPU risk score is the worst (maximum) CPU risk score of the contained elements in the capacity pool. For example, if the capacity pool contains ten VMware clusters, then the highest CPU risk score out of these ten is used for the capacity pool’s CPU risk score.
  • The memory and disk risk scores are similarly the maximum corresponding risk scores of the contained elements.
  • For each contained element, technology-specific criteria for impending and current saturation are used to compute each of these three risk scores.
    • For example, for VMware clusters, the CPU days to saturation indicator is evaluated. If the CPU days to saturation indicator shows that the cluster CPU is projected to exceed its threshold within a certain time horizon (both the threshold and the horizon are user configurable), then that is one criterion for impending saturation risk.
    • Another criterion for VMware clusters is the number of times the CPU ready time metric has exceeded a threshold within the past thirty days. This criterion is for current saturation risk.
    • For each such criterion, depending on its severity, a penalty is added to the CPU risk score of the cluster.
    • Similarly, for each container, technology-specific criteria for impending and current memory and disk saturation are evaluated and used to add a penalty for the memory or disk risk score, respectively, for the container.
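The roll-up described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the product's actual formula: the names `ElementRisk`, `element_resource_score`, `pool_resource_scores`, and `pool_risk_score` are hypothetical, the penalty values are made up, and capping the sums at 100 is an assumption chosen to keep the scores within the documented range.

```python
from dataclasses import dataclass

RESOURCES = ("cpu", "memory", "disk")


@dataclass
class ElementRisk:
    """Risk scores for one contained element, for example a VMware cluster."""
    name: str
    cpu: int
    memory: int
    disk: int


def element_resource_score(penalties, cap=100):
    """Add the penalties of all triggered criteria for one resource.

    ASSUMPTION: the sum is capped at `cap`; the real scaling is not documented here.
    """
    return min(cap, sum(penalties))


def pool_resource_scores(elements):
    """Per resource, the pool score is the worst (maximum) element score."""
    return {r: max(getattr(e, r) for e in elements) for r in RESOURCES}


def pool_risk_score(elements, cap=100):
    """ASSUMPTION: add the three resource scores and cap the result at 100."""
    return min(cap, sum(pool_resource_scores(elements).values()))


# Hypothetical pool with two clusters. cluster-a has two triggered CPU criteria:
# an impending-saturation penalty (30) and a CPU-ready penalty (20).
clusters = [
    ElementRisk("cluster-a", cpu=element_resource_score([30, 20]), memory=10, disk=0),
    ElementRisk("cluster-b", cpu=20, memory=25, disk=5),
]

print(pool_resource_scores(clusters))  # {'cpu': 50, 'memory': 25, 'disk': 5}
print(pool_risk_score(clusters))       # 80
```

Taking the maximum per resource means a single badly saturated cluster is enough to raise the pool's score, which matches the intent of flagging risk at any level.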

The above computations are performed for all capacity pools in a batch once a day, based on collected data in the database. The Capacity Pools View displays the capacity risk score, color-coding it appropriately.

When a capacity pool’s risk score is high, you should drill down to see which resources (CPU, memory, or disk) are showing high risk. In addition, you can drill down to the individual containers (for example, VMware clusters) to see which ones are causing the high risk score. For each container, you can see the detailed break-down of each criterion contributing to the high score.

A similar breakdown into CPU, memory, and disk space is made for the Efficiency score and the Usage score.

In the Capacity Pools View, interacting with any of these indicators can get you hints on which kinds of quantities are counted (Figure 2), and details on the three components of the score: CPU, memory, and storage (Figure 3).

Figure 2: Pop-up explanation of Efficiency formula

Figure 3: Detail window breaking down Usage indicator into CPU, Memory, and Disk

In the example shown in Figure 3, the Usage score of 34 is shown to be composed of three separate Usage numbers for CPU, Memory, and Disk, respectively. This allows you both to investigate and to explain where these numbers come from.

Furthermore, the console also lets you drill down into an entire new detail page, where all the contained elements in the Capacity Pool are listed, and each element’s contribution to the scores can be seen. For example, a capacity pool that shows a Risk score of 100 is shown below.

Figure 4: Detail page of a Capacity Pool

The detail page for the capacity pool with the Risk score of 100 is shown above in Figure 4. The Contained Elements table shows all the clusters in the capacity pool with their individual Usage, Risk, and Efficiency scores, broken out by CPU, Memory, and Storage. Thus, for each cluster row in the table, you see a total of 12 individual scores: a total plus three component scores for each of Usage, Risk, and Efficiency.

In Figure 4 you can easily see that the high risk score is caused by a high CPU risk score in two clusters, the first two in the table. This immediately narrows down the investigation to these two clusters, and in particular to their CPU-related metrics.
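The way the Contained Elements table narrows an investigation can be illustrated with a small Python sketch. The row layout, the helper names (`make_row`, `score_count`, `top_contributors`), and all score values below are hypothetical; they only mirror the table structure described above (three indicators, each with a total and three components).

```python
INDICATORS = ("usage", "risk", "efficiency")
COMPONENTS = ("total", "cpu", "memory", "storage")


def make_row(name, usage, risk, efficiency):
    """Build one table row; each argument is a (total, cpu, memory, storage) tuple."""
    scores = {"usage": usage, "risk": risk, "efficiency": efficiency}
    row = {"name": name}
    for indicator, values in scores.items():
        row[indicator] = dict(zip(COMPONENTS, values))
    return row


def score_count(row):
    """Number of individual scores shown for one cluster row."""
    return sum(len(row[indicator]) for indicator in INDICATORS)


def top_contributors(rows, indicator, component):
    """Names of the elements whose score equals the worst score in the pool."""
    worst = max(row[indicator][component] for row in rows)
    return [row["name"] for row in rows if row[indicator][component] == worst]


# Made-up values for three clusters in one capacity pool.
rows = [
    make_row("cluster-01", (40, 35, 50, 30), (100, 100, 20, 10), (60, 55, 70, 50)),
    make_row("cluster-02", (45, 50, 40, 35), (100, 100, 15, 5), (55, 50, 60, 55)),
    make_row("cluster-03", (20, 15, 25, 20), (30, 25, 30, 10), (80, 85, 75, 80)),
]

print(score_count(rows[0]))                   # 12
print(top_contributors(rows, "risk", "cpu"))  # ['cluster-01', 'cluster-02']
```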

The top of the detail page also shows aggregated configuration information for the Capacity Pool (number of CPUs, etc.) and a thirty-day time chart of a few consolidated metrics. These pieces of information can reveal at a glance whether there have been significant changes recently to this capacity pool.

In the table of Contained Elements, each cluster name is a hyperlink that you can use to drill down to yet another view showing details for that cluster. You could use this drill-down to examine any given cluster in detail. But in our example, where more than one cluster has a high risk score, the Detail view offers a more focused path. At the bottom is a link to “Risk details”, which leads to a risk details page (Figure 5 below).

Figure 5: Risk details for all clusters in a Capacity Pool

The risk details page above shows the components of the risk indicator for each cluster, down to the individual metric-based indicators (CPU pressure index, days to saturation) that contribute to it. We can see that the same two clusters that have the high risk scores are showing two metrics as high risk:

  • “CPU Days to Saturation” shows as “saturated”. The “days to saturation” is not a raw metric, but a calculation over the past 30 days, which uses a linear trend to estimate how far in the future a given resource will breach a preset threshold. In our case, these two clusters have already breached the threshold.
  • “CPU Pressure Index” shows as 100. The CPU Pressure Index is also not a raw metric, but a calculation that combines two metrics: CPU utilization and CPU ready time. System administrators ordinarily use CPU ready time to detect CPU contention in the cluster. TrueSight Capacity Optimization starts with the CPU utilization value and adds penalty points for indications of CPU contention, which it finds by counting how many times CPU ready time has exceeded a threshold within the recent past. The resulting number is weighted and scaled to remain within the 0-100 range.
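The two derived indicators described above can be sketched as follows. This is a minimal illustration, not the product's implementation: the function names and parameters are hypothetical, the trend is an ordinary least-squares line over daily samples, and the pressure index simply clamps the result to 0-100 in place of the weighting and scaling that the product applies, which is not specified here.

```python
def linear_fit(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ...

    Returns (slope, intercept). Requires at least two samples.
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_to_saturation(daily_util, threshold):
    """Days until the linear trend reaches `threshold`.

    Returns 0.0 if the trend is already at or above the threshold ("saturated"),
    or None if the trend is flat or falling and never reaches it.
    """
    slope, intercept = linear_fit(daily_util)
    current = intercept + slope * (len(daily_util) - 1)  # fitted value for today
    if current >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - current) / slope


def cpu_pressure_index(cpu_util, ready_samples, ready_threshold, penalty_per_breach):
    """Start from CPU utilization and add a penalty per CPU-ready breach.

    ASSUMPTION: a fixed penalty per breach and a simple clamp to 0-100.
    """
    breaches = sum(1 for r in ready_samples if r > ready_threshold)
    return max(0, min(100, cpu_util + penalty_per_breach * breaches))


# Hypothetical 30 days of utilization rising one point per day from 50%.
rising = [50 + day for day in range(30)]
print(days_to_saturation(rising, threshold=90))     # 11.0
print(days_to_saturation([95] * 30, threshold=90))  # 0.0, already saturated
print(cpu_pressure_index(70, [2, 12, 15, 3], ready_threshold=10, penalty_per_breach=5))  # 80
```

A value of 0.0 from `days_to_saturation` corresponds to the “saturated” state shown for the two clusters in the example.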

Thus, we know not only which clusters are the ones causing high risk in the capacity pool, and that their CPU capacity is at risk, but also exactly why TrueSight Capacity Optimization believes the CPU capacity is at risk.

The logic for determining risk would ordinarily entail time-based trending as well as a judgment of an expert who has been monitoring CPU contention. TrueSight Capacity Optimization instead automates this logic, and stores and presents all of these intermediate calculations in a logical manner so that you can follow along by clicking through the Capacity Pools View.

In Figure 5 you can also see similar metrics that would contribute to high risk for memory and storage. For example, the “Memory Pressure Index” is a derived score based on four metrics reported by VMware: consumed memory, active memory, ballooning, and host swapping.

Once again, the names of the clusters are links to their individual pages. From there, you can follow a link to yet another view that shows the underlying metrics for a single cluster in as much detail as you need.

Similar links and pages, with pointers to the exact metrics, are provided for Efficiency and Usage scores, as well.

Summary: Managing Large Infrastructures with the Capacity Pools View

The Capacity Pools View facilitates a conversation between different stakeholders, primarily between the executive with budget responsibility and the technical owner of the infrastructure, the capacity manager.

At the top level, the Capacity Pools View starts with a business-level grouping. The conversation is therefore grounded in the business reasons and context for each investigation of usage, risk, and efficiency, instead of getting lost in technical details.

Next, the Capacity Pools View allows multiple levels of drill-down. This capability not only allows the capacity manager to investigate the capacity questions down to the individual elements, but also lets the executive participate in the process by simplifying and focusing the questions and answers at each level into a small number of indicators.

Finally, at the element level, the Capacity Pools View links into the same element-level views that the capacity manager is already using to view detailed technical metrics. Thus, the capacity manager can navigate back and forth between communicating effectively and building an understanding of the underlying root causes of risks, efficiency, and usage. The formulas that BMC TrueSight Capacity Optimization uses to compute the indicators are based on conditions specific to each type of platform: VMware vSphere, IBM AIX, etc. These computed indicators are used consistently in the BMC TrueSight Capacity Optimization UI.

The indicators and component scores are computed from a large number of metrics collected on the individual vSphere clusters. These metrics measure the CPU, memory, and disk capacity available and used in the capacity pool, and formulas combine these metrics using weights to compute the indicator values. Correspondingly, behind this view is a large analytical structure tracing the dependencies from the top-level indicators, through the various contributing formulas, down to individual clusters, and within each cluster down to CPU, memory, and storage capacity.

Thus, TrueSight Capacity Optimization allows you, the capacity manager, to manage risk, efficiency, and usage for large infrastructures. It not only provides you with a powerful, automated facility to drill down into the root causes, but also provides simple, business-aligned scores that facilitate the conversation with the business stakeholders.
