View GPU health and capacity by using the Nvidia GPU dashboard
BMC Helix Dashboards provides an example dashboard for BMC Helix Continuous Optimization to monitor GPU performance and health across GPU-enabled servers. Use this dashboard to gain near real-time visibility into GPU utilization, capacity, and operational health.
You can use this example dashboard as a template and customize it to meet your monitoring and capacity-planning requirements.
The dashboard displays the following information about GPU health and performance:
- GPU details, such as GPU ID, GPU model, driver version, and vBIOS version
- GPU utilization and GPU memory utilization
- GPU temperature and GPU memory temperature
- GPU power utilization
- SM clock speed and memory clock speed
- Forecast trends for GPU utilization, memory usage, temperature, and power consumption
The GPU dashboard retrieves metrics in near real time from the backend time-series data source and does not rely on Capacity Optimization datamarts.
Before you begin
The servers are onboarded to BMC Helix Continuous Optimization.
GPU metrics are collected by supported ETL processes.
Servers are active and sending data within the selected Time range.
Forecast panels do not render unless the minimum historical data requirement is met (at least 7 days of data for the selected Server or GPU).
To install the Nvidia GPU dashboard
The Nvidia GPU dashboard is delivered as an out-of-the-box add-on package.
To install the dashboard package:
- Log in to BMC Helix Portal.
- Select the BMC Helix Continuous Optimization application.
- Navigate to Administration > Maintenance.
- Open the Additional Packages tab, find Nvidia GPU dashboard, and click Install.

- Verify that the status is Installed.
To view the Nvidia GPU dashboard
- On the homepage, select the BMC Helix Dashboards application.
- Navigate to Dashboards > Helix Continuous Optimization folder.
Under the Dashboards tab, the following dashboards are available:
Nvidia GPU Dashboard
The Nvidia GPU Dashboard provides near real-time visibility into GPU performance and health across monitored servers. The dashboard visualizes GPU metrics collected by ETLs and retrieved directly from backend metric stores, without materializing data into datamarts.
By aggregating GPU utilization, memory usage, temperature, and power consumption into ranked tables and time-series charts, the dashboard enables operators to monitor current GPU usage, detect resource contention, identify anomalies, and analyze performance trends over time.
These visualizations allow users to compare GPU behavior across servers and identify potential capacity or performance issues. All panels update based on the selected Server, GPU, and Time range.

The Nvidia GPU Dashboard is organized into the following sections:
- Global filters at the top of the page that allow drilling down into the data by Server, GPU, and Time range.
- Ranked summary tables that highlight GPUs with the highest compute and memory utilization.
- Time-series charts that display utilization, temperature, and power trends over time.
Dashboard controls
Use the following controls to filter data on the page:
- Server: Filter the data by Server. Select a specific Server or All.

- Selecting a specific server displays GPU data only for that server.
- Selecting All displays GPU data across all available servers.
- All filters apply globally to the dashboard. Any change to a filter or time range updates all panels on the page.
- GPU: Filter by GPU within the selected server(s). Select a specific GPU or All.

- Selecting a specific GPU displays data only for that GPU.
- Selecting All displays data for all GPUs within the selected server(s).
- The GPU selection applies globally to all tables and charts.
- Time range selection: Controls the interval for all panels.

- Quick ranges: Choose a predefined interval, or search for one.
- Absolute time range: Set From and To, then click Apply time range.
- From: Start date and time for the data range
- To: End date and time for the data range
Time zone: By default, the time picker uses your browser time zone. In the time picker, select Change time settings to select a different time zone.
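The effect of the Server, GPU, and Time range controls can be pictured as a single filter applied to every sample before any panel is drawn. The following Python snippet is a minimal sketch of that idea only. The `GpuSample` record, its field names, and the `apply_filters` function are hypothetical and are not part of any BMC Helix API. Passing `None` stands in for selecting All.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class GpuSample:
    """One hypothetical metric sample; field names are illustrative only."""
    server: str
    gpu_id: str
    timestamp: datetime
    gpu_util_pct: float


def apply_filters(
    samples: Iterable[GpuSample],
    start: datetime,
    end: datetime,
    server: Optional[str] = None,  # None plays the role of "All"
    gpu_id: Optional[str] = None,  # None plays the role of "All"
) -> List[GpuSample]:
    """Keep only samples that match the Server, GPU, and time-range selections."""
    return [
        s for s in samples
        if start <= s.timestamp <= end
        and (server is None or s.server == server)
        and (gpu_id is None or s.gpu_id == gpu_id)
    ]


# Illustrative data, not real measurements.
samples = [
    GpuSample("srv-a", "0", datetime(2024, 1, 1, 10, tzinfo=timezone.utc), 91.0),
    GpuSample("srv-a", "1", datetime(2024, 1, 1, 10, tzinfo=timezone.utc), 35.0),
    GpuSample("srv-b", "0", datetime(2024, 1, 1, 11, tzinfo=timezone.utc), 60.0),
    GpuSample("srv-b", "0", datetime(2024, 1, 3, 11, tzinfo=timezone.utc), 62.0),
]
window_start = datetime(2024, 1, 1, tzinfo=timezone.utc)
window_end = datetime(2024, 1, 2, tzinfo=timezone.utc)

all_servers = apply_filters(samples, window_start, window_end)
one_server = apply_filters(samples, window_start, window_end, server="srv-a")
one_gpu = apply_filters(samples, window_start, window_end, server="srv-a", gpu_id="1")
```

Because every panel would read from the same filtered set in this sketch, changing any one control changes all tables and charts together, which matches the global behavior described above.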
Panels in the Nvidia GPU Dashboard
The summary dashboard presents the following information:
| Panels | Description |
|---|---|
| Top 10 with Highest GPU Utilization | Identifies the GPUs with the highest utilization across servers, helping administrators quickly spot GPUs under heavy compute load. |
| Top 10 with Highest GPU Memory Utilization | Shows GPUs consuming the most GPU memory. |
| Time series metrics panels | |
| GPU Utilization | Shows the percentage of time the GPU processes workloads over the selected time range. |
| GPU Memory Utilization | Shows the percentage of GPU memory in use over time. |
| GPU Temperature | Shows the GPU’s operating temperature over time. |
| GPU Memory Temperature | Shows the operating temperature of GPU memory over time. |
| GPU Power Utilization | Shows the GPU’s power consumption over time. |
Hover over a data point to view a tooltip with the time stamp, server name, GPU ID, and metric value. If multiple GPUs are displayed in the chart, the tooltip lists one value per GPU at the selected time stamp. Each entry is color-coded to match the corresponding line in the chart, so you can identify which value belongs to which GPU.
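The Top 10 panels rank GPUs across servers by a utilization metric. The following Python snippet is a rough sketch of one way such a ranking could be computed. It assumes that each GPU is ranked by its mean utilization over the selected time range; the documentation does not state whether the dashboard uses the mean, the peak, or the latest value. The function name and the sample tuples are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# (server, gpu_id, utilization percent); an illustrative record shape.
Sample = Tuple[str, str, float]


def top_n_by_mean_utilization(
    samples: Iterable[Sample], n: int = 10
) -> List[Tuple[str, str, float]]:
    """Rank (server, GPU) pairs by mean utilization, highest first."""
    grouped: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for server, gpu_id, util in samples:
        grouped[(server, gpu_id)].append(util)

    ranked = sorted(
        ((server, gpu_id, mean(values)) for (server, gpu_id), values in grouped.items()),
        key=lambda row: row[2],
        reverse=True,
    )
    return ranked[:n]


# Illustrative data, not real measurements.
samples = [
    ("srv-a", "0", 90.0), ("srv-a", "0", 94.0),
    ("srv-a", "1", 30.0), ("srv-a", "1", 40.0),
    ("srv-b", "0", 60.0), ("srv-b", "0", 70.0),
]
top_two = top_n_by_mean_utilization(samples, n=2)
```

Grouping by the (server, GPU ID) pair rather than by GPU ID alone matters in this sketch, because the same GPU ID can appear on more than one server.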
Nvidia GPU Per Server Dashboard
The Nvidia GPU per Server Dashboard provides a focused, server-scoped view of GPU performance metrics. It enables detailed analysis of GPU utilization, memory usage, temperature, and power consumption for GPUs hosted on a selected server.
This dashboard is primarily used for deep-dive analysis and troubleshooting after identifying potential issues in the Nvidia GPU Dashboard. Isolating GPU metrics to a single server helps operators understand intra-server GPU behavior and resource distribution.
All tables and charts on this dashboard dynamically update based on the selected server, GPU, and time range. Drill-down navigation enables users to move between server-level and GPU-level views for further investigation.
You can access the Nvidia GPU per Server Dashboard in the following ways:
- Navigate to Dashboards > Helix Continuous Optimization, and then open the Nvidia GPU per Server Dashboard.
- From the Nvidia GPU Dashboard, click the Server, GPU ID, or GPU Name link to navigate to the corresponding GPU per Server Dashboard.
You can also open this page by selecting Nvidia GPU per Server Dashboard and then choosing the required Server or GPU ID from the available options.
The GPU Per Server Dashboard provides detailed GPU metrics for a specific server.
To view metrics for one server (all GPUs)
When you select one Server and All in the GPU list, charts show one line per GPU on that server.

In this view:
- All GPU-related charts are displayed, with each chart showing multiple lines where each line represents a different GPU on the server.
- At the bottom of the page, the dashboard displays individual sections for each GPU ID.
Click the expand control next to a GPU ID in these sections to view charts for that GPU.
You can expand or collapse individual GPU sections to compare metrics across multiple GPUs within the same server.
Use this view when you want to compare GPU performance within a single server.
To view metrics for one server (one GPU)
When you select one Server and one GPU in the GPU list, charts show a single line for the selected GPU.

In this view:
- All charts display data exclusively for the selected GPU.
- Each chart shows a single data line representing the selected GPU.
- No additional or expandable per-GPU sections are shown at the bottom of the page.
Use this view when you want to perform focused analysis of an individual GPU or investigate a specific performance or health issue.
Panels in the Nvidia GPU Per Server Dashboard
| Panels | Description |
|---|---|
| GPU Utilization | Shows the percentage of time the GPU is actively processing workloads. |
| GPU Memory Utilization | Shows the percentage of GPU memory currently in use. |
| GPU Temperature | Shows the GPU's current operating temperature. |
| GPU Memory Temperature | Shows the current operating temperature of the GPU memory. |
| GPU Power Utilization | Shows the GPU's power consumption during operation. |
| GPU model level metrics panels | |
| GPU Model, Driver Version, and VBios Version | Provides hardware and firmware details for the selected GPU. |
| GPU Utilization | Indicates how actively the selected GPU processes workloads. |
| GPU Temperature | Reports the operating temperature of the selected GPU. |
| GPU Memory Utilization | Indicates the level of GPU memory usage for the selected GPU. |
| GPU Memory Temperature | Reports the operating temperature of GPU memory for the selected GPU. |
| GPU Power Utilization | Reflects the power consumed by the selected GPU during operation. |
| GPU Utilization Forecast (3 months) | Shows the projected GPU utilization for the next three months based on historical usage trends. |
| GPU Memory Utilization Forecast | Shows the projected GPU memory utilization based on historical usage trends. |
| GPU Forecast (3 months) | Shows the projected overall GPU usage trend for the next three months. |
| GPU Memory Temperature forecast | Shows the projected GPU memory temperature based on historical temperature trends. |
| GPU Power Utilization Forecast | Shows the projected GPU power consumption based on historical usage patterns. |
| SM Clock Speed | Shows the operating speed of the GPU processing units, measured in MHz. |
| Memory Clock Speed | Shows the GPU memory's operating speed, measured in MHz. |
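
The forecast panels project future values from historical trends. The documentation does not describe the forecasting method that the dashboard uses, so the following Python snippet is only an illustrative stand-in: it fits an ordinary least-squares line to daily utilization values and extends it forward. The function name, the daily granularity, and the clamping to the 0 to 100 percent range are assumptions made for this sketch.

```python
from typing import List, Sequence


def linear_forecast(daily_values: Sequence[float], days_ahead: int) -> List[float]:
    """Extend a least-squares trend line fitted to daily percentage values.

    Illustrative only; this is not the algorithm used by the product.
    """
    n = len(daily_values)
    if n < 2:
        raise ValueError("need at least two data points to fit a trend")

    # Day indexes 0..n-1 serve as the x values.
    x_mean = (n - 1) / 2
    y_mean = sum(daily_values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(daily_values))

    slope = sxy / sxx
    intercept = y_mean - slope * x_mean

    # Clamp to 0..100 because the input is assumed to be a percentage.
    return [
        min(100.0, max(0.0, intercept + slope * (n + k)))
        for k in range(days_ahead)
    ]


# Seven days of illustrative utilization values, not real measurements.
history = [40.0, 42.0, 44.0, 46.0, 48.0, 50.0, 52.0]
projection = linear_forecast(history, days_ahead=3)
```

A straight line is the simplest possible trend model. It ignores weekly or daily seasonality, so treat it as a way to reason about what a forecast panel shows rather than as a reproduction of its values.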
Limitations and notes
- Forecast charts require at least 7 days of historical data for the selected Server or GPU.
- Only active servers appear in the Server filter. If a server is inactive, reactivate data collection and refresh the page.
- These dashboards read directly from the time-series data source configured for BMC Helix Dashboards; they do not use Capacity Optimization datamarts.
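If you export or otherwise have access to the sample time stamps for a GPU, you can estimate whether the 7-day requirement for forecast charts is likely to be met. The following Python snippet is a minimal sketch of such a check. It assumes that the requirement refers to the span between the oldest and newest sample; the documentation does not say how the product measures it. The function and constant names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Sequence

# Taken from the limitation above: at least 7 days of historical data.
MIN_HISTORY = timedelta(days=7)


def has_enough_history(
    timestamps: Sequence[datetime], minimum: timedelta = MIN_HISTORY
) -> bool:
    """Return True if the samples span at least the minimum history window."""
    if not timestamps:
        return False
    return max(timestamps) - min(timestamps) >= minimum


# Illustrative time stamps, not real measurements.
start = datetime(2024, 1, 1)
three_days = [start + timedelta(days=d) for d in range(4)]
eight_days = [start + timedelta(days=d) for d in range(9)]

short_ok = has_enough_history(three_days)
long_ok = has_enough_history(eight_days)
```

In this sketch, `short_ok` is `False` and `long_ok` is `True`, which mirrors the case where a newly onboarded server shows metric charts but no forecast charts yet.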
