Collecting GPU Data Using NVIDIA dcgm-exporter Scripts


The NVIDIA dcgm-exporter integration enables collection of GPU metrics from NVIDIA GPUs, including:

  • GPU utilization, memory usage, temperature, clock rates
  • Encoder/decoder activity

Data is collected via the Remote ETL Engine (REE), processed into CSV files, and imported into BMC Helix Continuous Optimization (BHCO) for monitoring, performance analysis, and capacity planning.

Two components are used to collect and import GPU metrics:

  • A process runner task that executes the GPU collection scripts (dcgm-exporter utilities).
  • The Generic - CSV Parser ETL, which imports the collected CSV data into BHCO.

The GPU metrics collection and processing workflow is as follows:

  1. Collect – dcgm-collector.sh polls each GPU host and saves raw JSON metrics.
  2. Aggregate – dcgm-aggregator.sh converts JSON files into CSV for import.
  3. Automate – dcgm-integration.sh runs collection and aggregation on a schedule.
  4. Import – CSV files are imported into BHCO using the Generic – CSV Parser ETL.
  5. Verify – Check logs and Workspace to confirm metrics are imported and updated.

You can test the dcgm-exporter endpoint using the following command:

$ curl -s http://<gpu-host>:9400/metrics

The default dcgm-exporter port is 9400.

Before you begin

Before you start collecting GPU metrics, ensure that the following prerequisites are met:

  • dcgm-exporter installed and running on all GPU hosts.
  • REE must have network access to GPU endpoints.
  • REE must have bash, curl, and jq installed.
  • Create a text file that lists all dcgm-exporter endpoints and their polling intervals. Example:

    http://gpu-host1:9400/metrics,30 
    http://gpu-host2:9400/metrics,30 
    http://gpu-host3:9400/metrics,10 
Note: Ensure that the polling interval matches the data update frequency configured in dcgm-exporter.
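Before passing the endpoint file to the scripts, you can sanity-check its format. The following is a minimal sketch; the file path and the validation logic are illustrative and not part of the shipped scripts, only the "url,interval" line format comes from this document:

```shell
#!/usr/bin/env bash
# Create a sample endpoint file (stand-in for your real one).
cat > /tmp/endpoints.txt <<'EOF'
http://gpu-host1:9400/metrics,30
http://gpu-host2:9400/metrics,30
http://gpu-host3:9400/metrics,10
EOF

errors=0
while IFS=, read -r url interval; do
  # Each line must be an http(s) URL followed by a numeric polling interval.
  case "$url" in
    http://*|https://*) : ;;
    *) echo "bad URL: $url"; errors=$((errors + 1)) ;;
  esac
  case "$interval" in
    ''|*[!0-9]*) echo "bad interval: $interval"; errors=$((errors + 1)) ;;
  esac
done < /tmp/endpoints.txt

echo "validation errors: $errors"
```

A non-zero count points to a malformed line that would otherwise surface later as a collector failure.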

Scripts and Usage

Use the following scripts to collect, aggregate, and prepare GPU metrics for import into BHCO.

Script: dcgm-integration.sh
Purpose: Wrapper that automates collection and aggregation
When to use: For scheduled, repeatable execution
How to run: $ ./dcgm-integration.sh -i <path_to_endpoint_file>

Script: dcgm-collector.sh
Purpose: Polls GPU endpoints and stores raw JSON
When to use: Always, as the first step
How to run: $ ./dcgm-collector.sh -l <dcgm-exporter-endpoint> -p <dcgm-exporter polling interval> -d <duration in seconds>

Script: dcgm-aggregator.sh
Purpose: Aggregates JSON files into CSV
When to use: After collection, before ETL import
How to run: $ ./dcgm-aggregator.sh

Script: include/metrics_mapping.sh
Purpose: Maps dcgm metrics to BHCO metrics
When to use: Always used by the integration script
How to run: Auto-included

Script: include/tools.sh
Purpose: Checks for jq and installs it if missing
When to use: Always used by the integration script
How to run: Auto-included

The following steps outline how the scripts collect, process, and prepare GPU metrics for BHCO.

  1. The dcgm-integration.sh script reads the endpoint file and launches dcgm-collector.sh for each GPU host.
  2. The dcgm-collector.sh script polls each dcgm-exporter endpoint at the configured interval and stores the raw JSON files under the <BCO_HOME>/etl/scripts/nvidia-dcgm/dcgm-data/<GPU host>_<port> directory.
  3. The dcgm-aggregator.sh script processes the raw JSON files, generates CSV files in the <BCO_HOME>/etl/scripts/nvidia-dcgm/metrics-etl directory, and removes the processed JSON files. For details on available GPU metrics and their statistics, see GPU Metrics in NVIDIA Summary Datamart.
  4. All log files are stored in <BCO_HOME>/etl/scripts/nvidia-dcgm/ for review and troubleshooting.
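The per-host directory name used in step 2 follows the pattern <GPU host>_<port>, which can be derived from an endpoint URL. A sketch of that derivation using bash parameter expansion; the shipped scripts may implement it differently:

```shell
#!/usr/bin/env bash
# Illustrative: turn an endpoint URL into the "<GPU host>_<port>" directory
# name the collector uses under dcgm-data/.
endpoint="http://gpu-host1:9400/metrics"

hostport="${endpoint#*://}"   # strip scheme      -> gpu-host1:9400/metrics
hostport="${hostport%%/*}"    # strip URL path    -> gpu-host1:9400
dir_name="${hostport/:/_}"    # colon to underscore -> gpu-host1_9400

echo "$dir_name"
```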

You can test a GPU endpoint manually using the following command:

$ curl -s http://<gpu-host>:9400/metrics 

Replace <gpu-host> with the hostname or IP of your GPU server. This will show the raw metrics exposed by dcgm-exporter.
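The raw output is Prometheus-style text, one metric sample per line. The sketch below extracts a single value from a sample in that format; the metric name DCGM_FI_DEV_GPU_UTIL and the labels shown are typical dcgm-exporter defaults, but verify them against your own /metrics output:

```shell
#!/usr/bin/env bash
# Sample lines mimicking dcgm-exporter's Prometheus text format.
sample='# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",device="nvidia0"} 85
DCGM_FI_DEV_GPU_UTIL{gpu="1",device="nvidia1"} 42'

# In practice, pipe the live endpoint instead:
#   curl -s http://<gpu-host>:9400/metrics | awk ...
util=$(printf '%s\n' "$sample" \
  | awk '/^DCGM_FI_DEV_GPU_UTIL/ {print $NF; exit}')

echo "first GPU utilization: ${util}%"
```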

System Task Creation

Follow these steps to create and schedule a system task that runs the GPU integration scripts automatically.

  1. Go to Administration > ETL & System Tasks > System Tasks.
  2. On the System tasks page, click Add > Add process runner task. The Add task page displays the configuration properties. 
  3. Create a Process Runner task on the Remote ETL Engine (REE).
  4. Configure the following settings:

    Name – The name of the system task or ETL job. Helps identify the task in the scheduler or task list.
    Description – Optional field to provide details about the task’s purpose or functionality.
    Maximum execution time before warning – The maximum time the task is allowed to run before a warning is triggered. Example: 4 hours.
    Frequency – Determines how often the task runs. Can be Predefined (for example, Each Day) or Custom (a specific number of days or hours).
    Predefined frequency – If using a predefined option, select from choices such as Each Day, Each Week, and so on.
    Start timestamp – The exact time the task starts. Includes: Hour (0–23), Minute (0–59), Week day (if applicable), Month day (if applicable).
    Custom frequency – If using a custom schedule, define the interval, for example, every 1 day.
    Custom start timestamp – The date and time when the task should first run, for example, 09/10/2025 08:47.
    Running on scheduler – Specifies the Remote ETL Engine (REE) or scheduler node where the task executes.
  5. Click Save and schedule the task.

For more information, see Maintaining System tasks.

ETL Setup

To import GPU metrics collected by the dcgm-exporter scripts, you need to configure an ETL. For detailed step-by-step instructions on creating a Generic CSV Parser ETL, see  Generic - CSV file parser.

Only the GPU-specific settings and paths are described below. All other ETL configuration steps (basic and advanced properties, scheduling, simulation/production modes, and so on) are covered on the Generic - CSV file parser page.

GPU-specific ETL settings:

Type – Select Generic – CSV Parser.
File location – <BCO_HOME>/etl/scripts/nvidia-dcgm/dcgm-data/metrics-etl, the directory where the scripts generate CSV files.
Append suffix to parsed file – Enable to avoid overwriting previously imported CSV files.
Entity catalog – Select or create a catalog for GPU entities.
Collection level – Define data granularity (per GPU, per host, or per cluster).
Object relationships – Map GPU metrics to their parent systems or clusters.
Percentage format – Set to 0–100 for GPU metrics expressed as percentages.

Schedule the ETL to run at the same frequency as the system task that executes the GPU scripts, starting a few minutes later to ensure CSV files are ready.
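Before the ETL's scheduled run, you can confirm that the aggregator produced fresh CSV files. A sketch of such a freshness check; the temporary directory here is a stand-in for the metrics-etl output directory, and the 10-minute window is an example threshold:

```shell
#!/usr/bin/env bash
# Stand-in for the CSV output directory; simulate one aggregator output file.
csv_dir=$(mktemp -d)
touch "$csv_dir/gpu-host1_9400.csv"

# Count CSV files modified within the last 10 minutes; zero fresh files
# suggests the collection/aggregation task did not run as scheduled.
fresh=$(find "$csv_dir" -name '*.csv' -mmin -10 | wc -l)
echo "fresh CSV files: $fresh"
```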

Verification

  1. Check ETL logs – Ensure that CSV files were parsed and imported successfully.
  2. Verify GPU metrics – Navigate to Workspace > Systems to confirm that GPU metrics are visible. 
    For a visual representation and analysis of the imported data, see Servers GPU Views.
    For detailed metric definitions, see GPU Metrics in NVIDIA Summary Datamart.
  3. Confirm updates – Make sure the metrics refresh according to the scheduled frequency.

Best Practices

  • Always run dcgm-integration.sh via a System Task rather than manually to ensure consistent data collection.
  • Update /etc/dcgm-exporter/default-counters.csv if you need to collect additional GPU metrics.
  • Keep your endpoint file current whenever GPU hosts are added or removed.
  • Regularly monitor logs to troubleshoot any issues such as network errors, missing tools (like jq), or misconfigured endpoints.
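A simple way to monitor the logs is to scan them for common failure signatures. A sketch with a temporary stand-in log directory; the search patterns are illustrative, so adjust them to the messages you actually see under <BCO_HOME>/etl/scripts/nvidia-dcgm/:

```shell
#!/usr/bin/env bash
# Stand-in log directory with one healthy line and one curl failure.
log_dir=$(mktemp -d)
printf 'connected to gpu-host1:9400\ncurl: (7) Failed to connect to gpu-host2\n' \
  > "$log_dir/collector.log"

# Count lines that look like network or tooling failures.
hits=$(grep -riE 'error|failed|command not found' "$log_dir" | wc -l)
echo "suspect log lines: $hits"
```

Running such a scan after each scheduled collection makes endpoint or jq problems visible before they show up as gaps in BHCO.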

 


BMC Helix Continuous Optimization 25.4