Collecting GPU data using NVIDIA dcgm-exporter scripts


The NVIDIA dcgm-exporter integration enables BMC Helix Continuous Optimization administrators to collect GPU performance metrics from NVIDIA GPU–enabled hosts. The collected metrics include GPU utilization, memory usage, temperature, clock rates, and activity of the encoder and decoder.

You use this integration to monitor GPU performance and health, analyze workload behavior, and support capacity planning in BMC Helix Continuous Optimization. If GPU metrics are not collected, BMC Helix Continuous Optimization does not display GPU performance data, which limits visibility into GPU usage and capacity trends.

This integration is typically configured after dcgm-exporter is installed and running on GPU hosts and before GPU views and reports are used in BMC Helix Continuous Optimization. The Remote ETL Engine (REE) collects the metrics, processes them into CSV files, and imports the data into BMC Helix Continuous Optimization for analysis and reporting.

What is NVIDIA dcgm-exporter?

NVIDIA dcgm-exporter is an NVIDIA-provided component that exposes GPU performance and health metrics through an HTTP endpoint using NVIDIA Data Center GPU Manager (DCGM).

BMC Helix Continuous Optimization does not install or manage dcgm-exporter. You must install and run dcgm-exporter on GPU-enabled hosts, either directly on the host or as a container, using NVIDIA-supported methods. After dcgm-exporter is running, BMC Helix Continuous Optimization collects GPU metrics by polling its /metrics endpoint.

For installation and configuration details, see the official NVIDIA dcgm-exporter documentation: https://docs.nvidia.com/datacenter/dcgm/latest/gpu-telemetry/dcgm-exporter.html
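
For reference, on a host with the NVIDIA Container Toolkit installed, dcgm-exporter is typically started as a container along the lines of the following sketch. The image tag and any additional options vary by release, so confirm the exact command against the NVIDIA documentation linked above:

    # Minimal sketch: run dcgm-exporter in a container and expose the default
    # metrics port 9400. <tag> is a placeholder for the release you deploy.
    docker run -d --gpus all -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:<tag>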

GPU metric collection and import rely on the following components:

  • A Process Runner system task that executes the dcgm-exporter collection and aggregation scripts.
  • A Generic – CSV Parser ETL that imports the generated CSV files into BHCO.

Before you begin

Before you start collecting GPU metrics, ensure that the following requirements are met:

  • dcgm-exporter is installed and running on all GPU hosts.
  • The Remote ETL Engine (REE) has network access to the GPU endpoints.
  • bash, curl, and jq are installed on the REE.
  • A text file that lists all dcgm-exporter endpoints and their corresponding polling intervals (in seconds) is created. Example:

    http://gpu-host1:9400/metrics,30 
    http://gpu-host2:9400/metrics,30 
    http://gpu-host3:9400/metrics,10 
    
Important

Ensure the polling interval matches the data update frequency configured in dcgm-exporter.

Scripts and usage

Use the following scripts to collect, aggregate, and prepare GPU metrics for import into BHCO:

dcgm-integration.sh
  Purpose: Wrapper script that automates collection and aggregation.
  When to use: For scheduled, repeatable execution.
  How to run: ./dcgm-integration.sh -i <path_to_endpoint_file>

dcgm-collector.sh
  Purpose: Polls GPU endpoints and stores the raw JSON files.
  When to use: Always runs as the first step.
  How to run: ./dcgm-collector.sh -l <dcgm-exporter endpoint> -p <polling interval in seconds> -d <duration in seconds>

dcgm-aggregator.sh
  Purpose: Aggregates the raw JSON files into CSV files.
  When to use: After collection, before the ETL import.
  How to run: ./dcgm-aggregator.sh

include/metrics_mapping.sh
  Purpose: Maps dcgm-exporter metrics to BHCO metrics.
  When to use: Always used by the integration script.
  How to run: Included automatically.

include/tools.sh
  Purpose: Checks for jq and installs it if missing.
  When to use: Always used by the integration script.
  How to run: Included automatically.
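
For example, a one-off collector run against a single endpoint might look like the following; the host name, interval, and duration are illustrative, and the command is run from the scripts directory on the REE:

    # Poll one dcgm-exporter endpoint every 30 seconds for 300 seconds (5 minutes)
    ./dcgm-collector.sh -l http://gpu-host1:9400/metrics -p 30 -d 300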

To collect and process GPU metrics using the dcgm-exporter scripts

  1. The dcgm-integration.sh script reads the endpoint file and launches dcgm-collector.sh for each GPU host.
  2. The dcgm-collector.sh script polls each dcgm-exporter endpoint at the configured interval and stores the raw JSON files in the <BCO_HOME>/etl/scripts/nvidia-dcgm/dcgm-data/<GPU host>_<port> directory.
  3. The dcgm-aggregator.sh script processes the raw JSON files, generates CSV files in the <BCO_HOME>/etl/scripts/nvidia-dcgm/metrics-etl directory, and removes the processed JSON files. For details on available GPU metrics and their statistics, see GPU Metrics in NVIDIA Summary Datamart.
  4. All log files are stored in <BCO_HOME>/etl/scripts/nvidia-dcgm/ for review and troubleshooting.
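
For example, a one-off manual run of the full pipeline on the REE, useful for verifying the scripts before scheduling them, might look like this (the endpoint file path is illustrative):

    cd <BCO_HOME>/etl/scripts/nvidia-dcgm
    ./dcgm-integration.sh -i /opt/bmc/gpu-endpoints.txt   # collect and aggregate
    ls dcgm-data/       # raw JSON, one directory per <GPU host>_<port>
    ls metrics-etl/     # aggregated CSV files ready for the CSV Parser ETL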

How the GPU collection scripts work

The GPU collection scripts run only on the Remote ETL Engine (REE). They do not execute any commands on the GPU hosts.

The REE collects GPU metrics by sending HTTP requests to the dcgm-exporter /metrics endpoint exposed on each GPU-enabled host. The scripts use curl to retrieve the metrics and process the response locally on the REE.

  • No remote login is required.
  • No username or password is used.
  • No SSH access or key exchange is needed.

The only requirement is network connectivity from the REE to the dcgm-exporter endpoint (for example, port 9400 over HTTP).
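
Conceptually, the collection performed on the REE is equivalent to the following sketch. This illustrates the polling approach only and is not the contents of dcgm-collector.sh; the host name, interval, and output file naming are assumptions:

    ENDPOINT="http://gpu-host1:9400/metrics"   # example dcgm-exporter endpoint
    INTERVAL=30                                # example polling interval in seconds
    while true; do
      # Plain HTTP GET from the REE; no SSH, credentials, or remote commands
      curl -s "$ENDPOINT" > "metrics_$(date +%s).txt"
      sleep "$INTERVAL"
    done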

How to collect and import GPU metrics using NVIDIA dcgm-exporter

Follow these steps to collect GPU metrics from NVIDIA GPU–enabled hosts and import them into BMC Helix Continuous Optimization:

  1. Ensure prerequisites are met

    • dcgm-exporter is installed and running on all GPU-enabled hosts.
    • The Remote ETL Engine (REE) has network access to the GPU endpoints.
    • Bash, curl, and jq are installed on the REE.
  2. Create an endpoint file on the REE

    • List all dcgm-exporter endpoints along with their polling intervals in a text file. Example:

      http://gpu-host1:9400/metrics,30
      http://gpu-host2:9400/metrics,30
      http://gpu-host3:9400/metrics,10
  3. Create a process runner script on the REE

    • The script should execute the dcgm-integration.sh script using the endpoint file (a minimal example wrapper is sketched after this procedure):

      ./dcgm-integration.sh -i <path_to_endpoint_file> 
  4. Schedule the process runner script as a System Task

    • Configure the task to run at the desired frequency (every 5, 15, or 60 minutes), preferably starting at the top of the hour.
    • This ensures metrics are collected consistently for accurate reporting.
  5. Create and schedule a Generic – CSV Parser ETL

    • Configure the ETL to import the generated CSV files from:

      <BCO_HOME>/etl/scripts/nvidia-dcgm/metrics-etl 
    • Enable the option to append a .done suffix once a CSV file is processed.
    • Schedule the ETL to run a few minutes after the System Task to ensure CSV files are ready for import.
Note

Once configured, all GPU metric collection and processing takes place on the REE. No additional steps are required on the GPU hosts.
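
For step 3, the process runner script can be as simple as the following sketch; the script location and endpoint file name are assumptions, so adjust them to your environment:

    #!/bin/bash
    # Hypothetical wrapper executed by the Process Runner system task on the REE
    cd <BCO_HOME>/etl/scripts/nvidia-dcgm || exit 1
    ./dcgm-integration.sh -i <BCO_HOME>/etl/scripts/nvidia-dcgm/gpu-endpoints.txt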

To test a GPU endpoint manually

  1. Open a terminal on a system that has network access to the GPU host.
  2. Run the following command:

    $ curl -s http://<gpu-host>:9400/metrics 
  3. Replace <gpu-host> with the hostname or IP address of your GPU server.

The command returns the raw metrics exposed by dcgm-exporter, which helps verify that the endpoint is reachable and exporting data correctly.
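
To confirm that GPU utilization samples are present in the response, you can also filter the output for a specific counter. DCGM_FI_DEV_GPU_UTIL is a commonly exported utilization field; whether it appears depends on the counters enabled in your dcgm-exporter configuration:

    $ curl -s http://<gpu-host>:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL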

To create and schedule a system task

  1. Select Administration > ETL & System Tasks > System Tasks.
  2. On the System tasks page, select Add > Add process runner task.
    The Add task page displays the configuration properties. 

  3. Create a Process Runner task on the Remote ETL Engine (REE).


  4. Configure the following settings:

    Name: The name of the system task or ETL job. Helps identify the task in the scheduler or task list.
    Description: Optional field to provide details about the task’s purpose or functionality.
    Maximum execution time before warning: The maximum time the task is allowed to run before a warning is triggered. Example: 4 hours.
    Frequency: Determines how often the task runs. Can be Predefined (for example, Each Day) or Custom (a specific number of days or hours).
    Predefined frequency: If using a predefined option, select from choices such as Each Day or Each Week.
    Start timestamp: The exact time the task starts. Includes the hour (0–23), minute (0–59), week day (if applicable), and month day (if applicable).
    Custom frequency: If using a custom schedule, define the interval, for example, every 1 day.
    Custom start timestamp: The date and time when the task should first run, for example, 09/10/2025 08:47.
    Running on scheduler: Specifies the Remote ETL Engine (REE) or scheduler node where the task will execute.

  5. Click Save and schedule the task.

For more information, see Maintaining System tasks.

ETL setup

To import GPU metrics collected by the dcgm-exporter scripts, you need to configure an ETL. For detailed step-by-step instructions on creating a Generic CSV Parser ETL, see Generic - CSV file parser.

Important

Only the GPU-specific settings and paths are specified in the following table. All other ETL configuration steps, such as basic and advanced properties, scheduling, and simulation or production modes, are covered on the Generic - CSV file parser page.

GPU-specific ETL settings:

Type: Select Generic – CSV Parser.
File location: <BCO_HOME>/etl/scripts/nvidia-dcgm/metrics-etl – the directory where the scripts generate the CSV files.
Append suffix to parsed file: Enable this property so that processed CSV files are renamed (for example, with a .done suffix) and are not imported again.
Entity catalog: Select or create a catalog for GPU entities.
Collection level: Define the data granularity (per GPU, per host, or per cluster).
Object relationships: Map GPU metrics to their parent systems or clusters.
Percentage format: Set to 0–100 for GPU metrics expressed as percentages.


Note

Schedule the ETL to run at the same frequency as the system task that executes the GPU scripts, starting a few minutes later to ensure CSV files are ready.

To verify GPU metrics in BMC Helix Continuous Optimization

  1. Check ETL logs – Ensure that CSV files were parsed and imported successfully (see also the quick check sketched after this list).
  2. Verify GPU metrics – Navigate to Workspace > Systems to confirm that GPU metrics are visible. 
    For a visual representation and analysis of the imported data, see Servers GPU Views.
    For detailed metric definitions, see GPU Metrics in NVIDIA Summary Datamart.
  3. Confirm updates – Make sure the metrics refresh according to the scheduled frequency.
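
As a quick command-line check on the REE, you can also inspect the output directories and logs directly. This is a minimal sketch; the directories follow the defaults described earlier on this page, the .done suffix assumes the corresponding ETL option is enabled, and the *.log pattern is an assumption about the log file names:

    # Count CSV files that the ETL has already imported and renamed with .done
    ls <BCO_HOME>/etl/scripts/nvidia-dcgm/metrics-etl/*.done 2>/dev/null | wc -l

    # Scan the collection-script logs for recent errors
    grep -iE "error|fail" <BCO_HOME>/etl/scripts/nvidia-dcgm/*.log | tail -n 20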

Best practices

  • Always run dcgm-integration.sh via a System Task rather than manually to ensure consistent data collection.
  • Update /etc/dcgm-exporter/default-counters.csv if you need to collect additional GPU metrics (see the example entry after this list).
  • Keep your endpoint file current whenever GPU hosts are added or removed.
  • Regularly monitor logs to troubleshoot any issues such as network errors, missing tools (like jq), or misconfigured endpoints.
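
For example, to export an additional counter, add a line to the counters file in the format used by dcgm-exporter: a DCGM field ID, a Prometheus metric type, and help text. The entry below is only an illustration; consult the NVIDIA documentation for the available field IDs:

    # /etc/dcgm-exporter/default-counters.csv (illustrative entry)
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).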

 
