View GPU health and capacity by using the Nvidia GPU dashboard

BMC Helix Dashboards provides an example dashboard for BMC Helix Continuous Optimization to monitor GPU performance and health across GPU-enabled servers. Use this dashboard to gain near real-time visibility into GPU utilization, capacity, and operational health.

You can use this example dashboard as a template and customize it to meet your monitoring and capacity-planning requirements.

The dashboard displays the following information about GPU health and performance:

GPU details, such as GPU ID, GPU model, driver version, and BIOS version
GPU utilization and GPU memory utilization
GPU temperature and GPU memory temperature
GPU power utilization
SM clock speed and memory clock speed
Forecast trends for GPU utilization, memory usage, temperature, and power consumption

The GPU dashboard retrieves metrics in near real time from the backend time-series data source and does not rely on Capacity Optimization datamarts.

Scenario:

Alan is a capacity administrator at ABC Company, managing GPU resources. As demand for GPU usage increases, he monitors GPU usage to make sure the current infrastructure can meet performance and capacity requirements.

Alan uses the GPU dashboard in BMC Helix Dashboards to assess GPU behavior and inform capacity decisions based on real-time metrics and usage trends.

Alan uses the dashboard to do the following:

Identify GPUs with sustained high utilization or memory pressure
Detect thermal or power-related risks that could affect performance
Analyze usage trends to anticipate future capacity requirements
Decide when to rebalance workloads or plan for additional GPU capacity
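As an illustration of what "sustained high utilization" can mean in practice, the following Python sketch flags GPUs whose utilization stays at or above a threshold for several consecutive samples. It is a minimal sketch and not part of the product: the function name, the 90% threshold, the run length of six samples, and the sample data are all hypothetical values chosen for the example.

```python
from typing import Sequence


def has_sustained_high_utilization(
    samples: Sequence[float],
    threshold: float = 90.0,   # hypothetical threshold, in percent
    min_consecutive: int = 6,  # hypothetical run length, in samples
) -> bool:
    """Return True if samples contains a run of at least min_consecutive
    consecutive values at or above threshold."""
    run = 0
    for value in samples:
        # Extend the current run, or reset it when a value drops below the threshold.
        run = run + 1 if value >= threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Illustrative data only: utilization percentages per GPU, oldest sample first.
utilization_by_gpu = {
    "server-a/gpu-0": [95, 97, 96, 99, 98, 94, 93],
    "server-a/gpu-1": [95, 20, 96, 30, 98, 10, 93],
}

flagged = [
    gpu
    for gpu, samples in utilization_by_gpu.items()
    if has_sustained_high_utilization(samples)
]
print(flagged)  # ['server-a/gpu-0']
```

Requiring a run of consecutive samples, rather than a single high value, keeps short spikes from being treated as capacity pressure.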

Before you begin

The servers are onboarded to BMC Helix Continuous Optimization.
GPU metrics are collected by supported ETL processes.
Servers are active and sending data within the selected Time range.
Forecast panels will not render if the minimum historical data requirement is not met.

To install the Nvidia GPU dashboard package

The Nvidia GPU dashboard is delivered as an out-of-the-box add-on package.

To install the dashboard package:

Log in to BMC Helix Portal.
Open the BMC Helix Continuous Optimization application.
Navigate to Administration > Maintenance.
In Additional Packages, locate Nvidia GPU dashboard and click Install.
Verify that the package status changes to Installed.

Note: If the package is not installed, the GPU dashboards are not visible on BMC Helix Dashboards.

To view the Nvidia GPU dashboard

On the homepage, select the BMC Helix Dashboards application.
Navigate to Dashboards, and open the Helix Continuous Optimization folder.

Under the Dashboards tab, the following dashboards are available:

Nvidia GPU Dashboard
Nvidia GPU Per Server Dashboard

Nvidia GPU Dashboard

The Nvidia GPU Dashboard provides near real-time visibility into GPU performance and health across monitored servers. The dashboard visualizes GPU metrics collected by ETLs and retrieved directly from backend metric stores, without materializing data into datamarts.

By aggregating GPU utilization, memory usage, temperature, and power consumption into ranked tables and time-series charts, the dashboard enables operators to monitor current GPU usage, detect resource contention, identify anomalies, and analyze performance trends over time.

These visualizations allow users to compare GPU behavior across servers and identify potential capacity or performance issues. Panels update based on the selected Server, GPU, and Time range.

The Nvidia GPU Dashboard is organized into the following sections:

Global filters at the top of the page that allow drilling down into the data by Server, GPU, and Time range.
Ranked summary tables that highlight GPUs with the highest compute and memory utilization.
Time-series charts that display utilization, temperature, and power trends over time.

Dashboard controls

Use the following controls to filter data on the page. All filters apply globally: any change to a filter or to the time range updates all panels on the page.

Server: Filter the data by server. Select a specific server or All.
- Selecting a specific server displays GPU data only for that server.
- Selecting All displays GPU data across all available servers.
GPU: Filter by GPU within the selected servers. Select a specific GPU or All.
- Selecting a specific GPU displays data only for that GPU.
- Selecting All displays data for all GPUs within the selected servers.
- The GPU selection applies globally to all tables and charts.
Time range selection: Controls the interval for all panels.
- Quick ranges: Choose a predefined interval or search for one.
- Absolute time range: Set From and To, and then click Apply time range.
  - From: Start date and time for the data range
  - To: End date and time for the data range
Time zone: By default, the time picker uses your browser time zone. In the time picker, select Change time settings to select a different time zone.

Important: If your data is collected in UTC (or another time zone), set the time picker’s Time zone to match your data when you interpret chart timestamps.
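The following Python sketch shows why the time zone setting matters when you read chart timestamps: the same sample appears at a different clock time depending on the zone used for display. The timestamp and the UTC-05:00 offset are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sample collected at 14:00 UTC.
sample_utc = datetime(2025, 1, 15, 14, 0, tzinfo=timezone.utc)

# A fixed UTC-05:00 offset standing in for the browser time zone.
browser_zone = timezone(timedelta(hours=-5))

# Same instant, rendered in the browser's zone.
as_seen_in_browser = sample_utc.astimezone(browser_zone)

print(sample_utc.isoformat())          # 2025-01-15T14:00:00+00:00
print(as_seen_in_browser.isoformat())  # 2025-01-15T09:00:00-05:00
```

Both values describe the same instant. A spike logged at 14:00 UTC shows at 09:00 on a chart rendered in the UTC-05:00 zone, which is easy to misread when you correlate the chart with logs recorded in UTC.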

Panels in the Nvidia GPU Dashboard

The summary dashboard presents the following information:

Panels	Description
Top 10 with Highest GPU Utilization	Identifies the GPUs with the highest utilization across servers, helping administrators quickly spot GPUs under heavy compute load.
Top 10 with Highest GPU Memory Utilization	Shows GPUs consuming the most GPU memory.
Time series metrics panels
GPU Utilization	Shows the percentage of time the GPU processes workloads over the selected time range.
GPU Memory Utilization	Shows the percentage of GPU memory in use over time.
GPU Temperature	Shows the GPU’s operating temperature over time.
GPU Memory Temperature	Shows the operating temperature of GPU memory over time.
GPU Power Utilization	Shows the GPU’s power consumption over time.
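
The ranked summary tables can be thought of as a top-N selection over per-GPU values. The following Python sketch illustrates the idea. It assumes that GPUs are ranked by their average utilization over the selected time range, which this documentation does not state (the dashboard might use the latest or the maximum value instead). The function name, GPU labels, and sample values are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def top_n_by_average(
    samples_by_gpu: Dict[str, Sequence[float]], n: int = 10
) -> List[Tuple[str, float]]:
    """Rank GPUs by average sample value, highest first, and keep the top n."""
    # Skip GPUs that have no samples in the selected range.
    averages = {
        gpu: mean(values) for gpu, values in samples_by_gpu.items() if values
    }
    ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Illustrative utilization percentages only.
utilization = {
    "srv1/gpu0": [80, 90],
    "srv1/gpu1": [20, 40],
    "srv2/gpu0": [95, 99],
}

top_two = top_n_by_average(utilization, n=2)
print(top_two)
```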

Hover over a data point to view a tooltip with the time stamp, server name, GPU ID, and metric value.

If multiple GPUs are displayed in the chart, the tooltip lists values for each GPU at the selected time stamp. Each entry is color-coded to match the corresponding line in the chart, making it easy to identify which value belongs to which GPU.

Nvidia GPU Per Server Dashboard

The Nvidia GPU Per Server Dashboard provides a focused, server-scoped view of GPU performance metrics. It enables detailed analysis of GPU utilization, memory usage, temperature, and power consumption for GPUs hosted on a selected server.

This dashboard is primarily used for deep-dive analysis and troubleshooting after identifying potential issues in the Nvidia GPU Dashboard. Isolating GPU metrics to a single server helps operators understand intra-server GPU behavior and resource distribution.

All tables and charts on this dashboard dynamically update based on the selected server, GPU, and time range. Drill-down navigation enables users to move between server-level and GPU-level views for further investigation.

You can access the Nvidia GPU Per Server Dashboard in the following ways:

Navigate to Dashboards > Helix Continuous Optimization, and open the Nvidia GPU Per Server Dashboard.
From the Nvidia GPU Dashboard, click the Server, GPU ID, or GPU Name link to open the Nvidia GPU Per Server Dashboard for that server or GPU.

You can also open the Nvidia GPU Per Server Dashboard directly and then select the required Server or GPU ID from the available options.

The Nvidia GPU Per Server Dashboard provides detailed GPU metrics for a specific server.

To view metrics for one server (all GPUs)

When you select one Server and All in the GPU list, charts show one line per GPU on that server.


In this view:

All GPU-related charts are displayed, and each chart shows one line per GPU on the server.
At the bottom of the page, the dashboard displays individual sections for each GPU ID.
At the bottom of the page, click the expand control next to a GPU ID to view charts for that GPU.
You can expand or collapse individual GPU sections to compare metrics across multiple GPUs within the same server.

Use this view when you want to compare GPU performance within a single server.

To view metrics for one server (one GPU)

When you select one Server and one GPU in the GPU list, charts show a single line for the selected GPU.


In this view:

All charts display data exclusively for the selected GPU.
Each chart shows a single data line representing the selected GPU.
No additional or expandable per-GPU sections are shown at the bottom of the page.

Use this view when you want to perform focused analysis of an individual GPU or investigate a specific performance or health issue.

Panels in the Nvidia GPU Per Server Dashboard

Panels	Description
GPU Utilization	Shows the percentage of time the GPU is actively processing workloads.
GPU Memory Utilization	Shows the percentage of GPU memory currently in use.
GPU Temperature	Shows the GPU's current operating temperature.
GPU Memory Temperature	Shows the current operating temperature of the GPU memory.
GPU Power Utilization	Shows the GPU's power consumption during operation.
GPU model level metrics panels
GPU Model, Driver Version, and VBios Version	Provides hardware and firmware details for the selected GPU.
GPU Utilization	Indicates how actively the selected GPU processes workloads.
GPU Temperature	Reports the operating temperature of the selected GPU.
GPU Memory Utilization	Indicates the level of GPU memory usage for the selected GPU.
GPU Memory Temperature	Reports the operating temperature of GPU memory for the selected GPU.
GPU Power Utilization	Reflects the power consumed by the selected GPU during operation.
GPU Utilization Forecast (3 months)	Shows the projected GPU utilization for the next three months based on historical usage trends.
GPU Memory Utilization Forecast	Shows the projected GPU memory utilization based on historical usage trends.
GPU Forecast (3 months)	Shows the projected overall GPU usage trend for the next three months.
GPU Memory Temperature Forecast	Shows the projected GPU memory temperature based on historical temperature trends.
GPU Power Utilization Forecast	Shows the projected GPU power consumption based on historical usage patterns.
SM Clock Speed	Shows the operating speed of the GPU processing units, measured in MHz.
Memory Clock Speed	Shows the GPU memory's operating speed, measured in MHz.

Benefits

Near real-time visibility into GPU utilization
Comprehensive monitoring of overall GPU health
Unified operational view across platforms
Improved capacity planning for GPU resources
Faster and easier issue diagnosis

Limitations and notes

Forecast charts require at least 7 days of historical data for the selected Server or GPU.
Only active servers appear in the Server filter. If a server is inactive, reactivate data collection and refresh the page.
These dashboards read directly from the time-series data source configured for BMC Helix Dashboards; they do not use Capacity Optimization datamarts.
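
To reason about the seven-day requirement for forecast charts, the following Python sketch checks whether a set of sample timestamps spans at least seven days. It assumes that the requirement refers to the time span between the oldest and newest samples; this documentation does not state how the product measures it. The function name and timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Sequence

MIN_HISTORY = timedelta(days=7)  # minimum history stated for forecast charts


def has_enough_history(
    timestamps: Sequence[datetime], minimum: timedelta = MIN_HISTORY
) -> bool:
    """Return True if the samples span at least the minimum duration."""
    if not timestamps:
        return False
    return max(timestamps) - min(timestamps) >= minimum


# Illustrative timestamps: one sample per day for a given number of days.
start = datetime(2025, 1, 1)
six_days = [start + timedelta(days=d) for d in range(6)]    # spans 5 days
eight_days = [start + timedelta(days=d) for d in range(8)]  # spans 7 days

print(has_enough_history(six_days))    # False
print(has_enough_history(eight_days))  # True
```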

BMC Helix Continuous Optimization 26.1
