Collecting GPU Data Using NVIDIA dcgm-exporter Scripts
There are two components involved in collecting and importing GPU metrics:
- A process runner task that executes the GPU collection scripts (dcgm-exporter utilities).
- The Generic - CSV Parser ETL, which imports the collected CSV data into BHCO.
The GPU metrics collection and processing workflow consists of the following stages:
- Collect – dcgm-collector.sh polls each GPU host and saves raw JSON metrics.
- Aggregate – dcgm-aggregator.sh converts JSON files into CSV for import.
- Automate – dcgm-integration.sh runs collection and aggregation on a schedule.
- Import – CSV files are imported into BHCO using the Generic – CSV Parser ETL.
- Verify – Check logs and Workspace to confirm metrics are imported and updated.
You can test the dcgm-exporter endpoint using the following command:
$ curl -s http://<gpu-host>:9400/metrics
The default dcgm-exporter port is 9400.
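If you only need to confirm that an endpoint is reachable, checking the HTTP status code is usually enough. The command below is a minimal example using standard curl options; a 200 response indicates that the exporter is serving metrics.
$ curl -s -o /dev/null -w "%{http_code}\n" http://<gpu-host>:9400/metrics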
Before you begin
Before you start collecting GPU metrics, ensure that the following prerequisites are met:
- dcgm-exporter installed and running on all GPU hosts.
- The REE must have network access to the GPU endpoints.
- The REE must have bash, curl, and jq installed.
Create a text file that lists all dcgm-exporter endpoints and their corresponding polling intervals, one endpoint per line. Example:
http://gpu-host1:9400/metrics,30
http://gpu-host2:9400/metrics,30
http://gpu-host3:9400/metrics,10
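Optionally, before scheduling collection, you can verify that every endpoint in the file responds. The following loop is a minimal, illustrative sketch; it assumes the one-entry-per-line <endpoint>,<interval> format shown above and a hypothetical file name, endpoints.txt.
$ while IFS=, read -r url interval; do
    echo "Checking ${url} (interval: ${interval})"
    curl -s -o /dev/null -w "  HTTP %{http_code}\n" "${url}"
  done < endpoints.txt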
Scripts and Usage
Use the following scripts to collect, aggregate, and prepare GPU metrics for import into BHCO.
Script | Purpose | When to Use | How to Run |
---|---|---|---|
dcgm-integration.sh | Wrapper that automates collection and aggregation | For scheduled, repeatable execution | $ ./dcgm-integration.sh -i <path_to_endpoint_file> |
dcgm-collector.sh | Polls GPU endpoints and stores raw JSON | Always as the first step | $ ./dcgm-collector.sh -l <dcgm-exporter-endpoint> -p <dcgm-exporter polling interval> -d <duration in seconds> |
dcgm-aggregator.sh | Aggregates JSON files into CSV | After collection, before ETL import | $ ./dcgm-aggregator.sh |
include/metrics_mapping.sh | Maps dcgm metrics to BHCO metrics | Always used by the integration script | Auto-included |
include/tools.sh | Checks and installs jq if missing | Always used by the integration script | Auto-included |
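For a one-off manual run, for example when validating a newly added GPU host before scheduling, you can chain the collector and aggregator yourself. The endpoint, interval, and duration values below are placeholders for illustration only; substitute the values that match your environment.
$ ./dcgm-collector.sh -l http://gpu-host1:9400/metrics -p 30 -d 300
$ ./dcgm-aggregator.sh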
The following steps outline how the scripts collect, process, and prepare GPU metrics for BHCO.
- The dcgm-integration.sh script reads the endpoint file and launches dcgm-collector.sh for each GPU host.
- The dcgm-collector.sh script polls each dcgm-exporter endpoint at the configured interval and stores the raw JSON files in the <BCO_HOME>/etl/scripts/nvidia-dcgm/dcgm-data/<GPU host>_<port> directory.
- dcgm-aggregator.sh processes the raw JSON files, generates CSV files in the <BCO_HOME>/etl/scripts/nvidia-dcgm/metrics-etl directory, and removes the processed JSON files. For details on available GPU metrics and their statistics, see GPU Metrics in NVIDIA Summary Datamart.
- All log files are stored in <BCO_HOME>/etl/scripts/nvidia-dcgm/ for review and troubleshooting.
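To confirm that collection and aggregation are producing output, you can list the data directories and logs directly on the REE. The commands below are a simple illustration; the exact file names depend on your GPU hosts and on when the scripts last ran, and the .log extension is an assumption to verify on your system.
$ ls <BCO_HOME>/etl/scripts/nvidia-dcgm/dcgm-data/
$ ls <BCO_HOME>/etl/scripts/nvidia-dcgm/metrics-etl/
$ ls <BCO_HOME>/etl/scripts/nvidia-dcgm/*.log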
You can test a GPU endpoint manually using the following command:
$ curl -s http://<gpu-host>:9400/metrics
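Because dcgm-exporter exposes metrics in Prometheus text format, you can filter the output for a single counter to confirm that values are being reported. The example below uses DCGM_FI_DEV_GPU_UTIL (GPU utilization), which is part of the exporter's default counter set; if you have customized /etc/dcgm-exporter/default-counters.csv, substitute a counter that is enabled in your configuration.
$ curl -s http://<gpu-host>:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL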
System Task Creation
Follow these steps to create and schedule a system task that runs the GPU integration scripts automatically.
- Go to Administration > ETL & System Tasks > System Tasks.
- On the System tasks page, click Add > Add process runner task. The Add task page displays the configuration properties.
- Create the Process Runner task on the Remote ETL Engine (REE) where the dcgm-exporter scripts are installed.
Configure the following settings:
Property | Description |
---|---|
Name | The name of the system task or ETL job. Helps identify the task in the scheduler or task list. |
Description | Optional field to provide details about the task’s purpose or functionality. |
Maximum execution time before warning | The maximum time the task is allowed to run before a warning is triggered. Example: 4 hours. |
Frequency | Determines how often the task runs. Can be Predefined (e.g., Each Day) or Custom (a specific number of days or hours). |
Predefined frequency | If using a predefined option, select from choices like Each Day, Each Week, etc. |
Start timestamp | The exact time the task starts. Includes: Hour (0–23), Minute (0–59), Week day (if applicable), and Month day (if applicable). |
Custom frequency | If using a custom schedule, define the interval, e.g., every 1 day. |
Custom start timestamp | The date and time when the task should first run, e.g., 09/10/2025 08:47. |
Running on scheduler | Specifies the Remote ETL Engine (REE) or scheduler node where the task will execute. |
- Click Save and schedule the task.
For more information, see Maintaining System tasks.
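The command that the Process Runner task runs is the integration wrapper described in Scripts and Usage. As an illustrative sketch, assuming the endpoint file was saved as endpoints.txt in the script directory (both the file name and its location are choices you make, not defaults), the task command would look similar to the following:
$ cd <BCO_HOME>/etl/scripts/nvidia-dcgm && ./dcgm-integration.sh -i endpoints.txt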
ETL Setup
To import GPU metrics collected by the dcgm-exporter scripts, you need to configure an ETL. For detailed step-by-step instructions on creating a Generic CSV Parser ETL, see Generic - CSV file parser.
GPU-specific ETL settings:
Property | Description |
---|---|
Type | Select Generic – CSV Parser. |
File location | <BCO_HOME>/etl/scripts/nvidia-dcgm/dcgm-data/metrics-etl – directory where scripts generate CSV files. |
Append suffix to parsed file | Enable to avoid overwriting previously imported CSV files. |
Entity catalog | Select or create a catalog for GPU entities. |
Collection level | Define data granularity (per GPU, per host, or per cluster). |
Object relationships | Map GPU metrics to their parent systems or clusters. |
Percentage format | Set to 0–100 for GPU metrics expressed as percentages. |
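Before running the ETL for the first time, it can help to confirm that the aggregator is writing CSV files to the configured file location and to glance at the header row that the parser will read. The commands below are a simple sketch; <latest_file> is a placeholder, and you should adjust the path if your scripts write to a different directory.
$ ls -lt <BCO_HOME>/etl/scripts/nvidia-dcgm/dcgm-data/metrics-etl/*.csv
$ head -1 <BCO_HOME>/etl/scripts/nvidia-dcgm/dcgm-data/metrics-etl/<latest_file>.csv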
Verification
- Check ETL logs – Ensure that CSV files were parsed and imported successfully.
- Verify GPU metrics – Navigate to Workspace > Systems to confirm that GPU metrics are visible.
For a visual representation and analysis of the imported data, see Servers GPU Views.
For detailed metric definitions, see GPU Metrics in NVIDIA Summary Datamart.
- Confirm updates – Make sure the metrics refresh according to the scheduled frequency.
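In addition to the checks above, you can scan the collection script logs on the REE for errors. This sketch assumes the log files under <BCO_HOME>/etl/scripts/nvidia-dcgm/ use a .log extension; check the actual file names on your system.
$ grep -i error <BCO_HOME>/etl/scripts/nvidia-dcgm/*.log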
Best Practices
- Always run dcgm-integration.sh via a System Task rather than manually to ensure consistent data collection.
- Update /etc/dcgm-exporter/default-counters.csv if you need to collect additional GPU metrics; see the example after this list.
- Keep your endpoint file current whenever GPU hosts are added or removed.
- Regularly monitor logs to troubleshoot any issues such as network errors, missing tools (like jq), or misconfigured endpoints.
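Each entry in default-counters.csv enables one DCGM field, using the format <DCGM field name>, <Prometheus type>, <help text>. As a hedged illustration, the line below enables the SM clock frequency counter; depending on your dcgm-exporter version it may already be present or commented out, so compare it against the existing entries in your file. You typically need to restart dcgm-exporter for the change to take effect.
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).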