Collecting GPU Data Using NVIDIA dcgm-exporter Scripts
There are two components involved in collecting and importing GPU metrics:
- A process runner task that executes the GPU collection scripts (dcgm-exporter utilities).
- The Generic - CSV Parser ETL, which imports the collected CSV data into BHCO.
The GPU metrics collection and processing workflow consists of the following stages:
- Collect – dcgm-collector.sh polls each GPU host and saves raw JSON metrics.
- Aggregate – dcgm-aggregator.sh converts JSON files into CSV for import.
- Automate – dcgm-integration.sh runs collection and aggregation on a schedule.
- Import – CSV files are imported into BHCO using the Generic – CSV Parser ETL.
- Verify – Check logs and Workspace to confirm metrics are imported and updated.
You can test the dcgm-exporter endpoint using the following command:
$ curl -s http://<gpu-host>:9400/metrics
The default dcgm-exporter port is 9400.
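If you only need to confirm that an endpoint is reachable, checking the HTTP status code is usually enough. The command below is a minimal example using standard curl options; a 200 response indicates that the exporter is serving metrics.
$ curl -s -o /dev/null -w "%{http_code}\n" http://<gpu-host>:9400/metrics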
Before you begin
Before you start collecting GPU metrics, ensure that the following prerequisites are met:
- dcgm-exporter installed and running on all GPU hosts.
- The REE must have network access to the GPU endpoints.
- The REE must have bash, curl, and jq installed.
Create a text file that lists all dcgm-exporter endpoints and their corresponding polling intervals, one endpoint per line. Example:
http://gpu-host1:9400/metrics,30
http://gpu-host2:9400/metrics,30
http://gpu-host3:9400/metrics,10
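Optionally, before scheduling collection, you can verify that every endpoint in the file responds. The following loop is a minimal, illustrative sketch; it assumes the one-entry-per-line <endpoint>,<interval> format shown above and a hypothetical file name, endpoints.txt.
$ while IFS=, read -r url interval; do
    echo "Checking ${url} (interval: ${interval})"
    curl -s -o /dev/null -w "  HTTP %{http_code}\n" "${url}"
  done < endpoints.txt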
Scripts and Usage
Use the following scripts to collect, aggregate, and prepare GPU metrics for import into BHCO.
Script | Purpose | When to Use | How to Run |
---|---|---|---|
dcgm-integration.sh | Wrapper that automates collection and aggregation | For scheduled, repeatable execution | $ ./dcgm-integration.sh -i <path_to_endpoint_file> |
dcgm-collector.sh | Polls GPU endpoints and stores raw JSON | Always as the first step | $ ./dcgm-collector.sh -l <dcgm-exporter-endpoint> -p <dcgm-exporter polling interval> -d <duration in seconds> |
dcgm-aggregator.sh | Aggregates JSON files into CSV | After collection, before ETL import | $ ./dcgm-aggregator.sh |
include/metrics_mapping.sh | Maps dcgm metrics to BHCO metrics | Always used by the integration script | Auto-included |
include/tools.sh | Checks and installs jq if missing | Always used by the integration script | Auto-included |
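For a one-off manual run, for example when validating a newly added GPU host before scheduling, you can chain the collector and aggregator yourself. The endpoint, interval, and duration values below are placeholders for illustration only; substitute the values that match your environment.
$ ./dcgm-collector.sh -l http://gpu-host1:9400/metrics -p 30 -d 300
$ ./dcgm-aggregator.sh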
The following steps outline how the scripts collect, process, and prepare GPU metrics for BHCO.
- The dcgm-integration.sh script reads the endpoint file and launches dcgm-collector.sh for each GPU host.
- The dcgm-collector.sh script polls each dcgm-exporter endpoint at the configured interval and stores the raw JSON files in the <BCO_HOME>/etl/scripts/nvidia-dcgm/dcgm-data/<GPU host>_<port> directory.
- dcgm-aggregator.sh processes the raw JSON files, generates CSV files in the <BCO_HOME>/etl/scripts/nvidia-dcgm/metrics-etl directory, and removes the processed JSON files. For details on available GPU metrics and their statistics, see GPU Metrics in NVIDIA Summary Datamart.
- All log files are stored in <BCO_HOME>/etl/scripts/nvidia-dcgm/ for review and troubleshooting.
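To confirm that collection and aggregation are producing output, you can list the data directories and logs directly on the REE. The commands below are a simple illustration; the exact file names depend on your GPU hosts and on when the scripts last ran, and the .log extension is an assumption to verify on your system.
$ ls <BCO_HOME>/etl/scripts/nvidia-dcgm/dcgm-data/
$ ls <BCO_HOME>/etl/scripts/nvidia-dcgm/metrics-etl/
$ ls <BCO_HOME>/etl/scripts/nvidia-dcgm/*.log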
You can test a GPU endpoint manually using the following command:
$ curl -s http://<gpu-host>:9400/metrics
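Because dcgm-exporter exposes metrics in Prometheus text format, you can filter the output for a single counter to confirm that values are being reported. The example below uses DCGM_FI_DEV_GPU_UTIL (GPU utilization), which is part of the exporter's default counter set; if you have customized /etc/dcgm-exporter/default-counters.csv, substitute a counter that is enabled in your configuration.
$ curl -s http://<gpu-host>:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL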
System Task Creation
Follow these steps to create and schedule a system task that runs the GPU integration scripts automatically.
- Go to Administration > ETL & System Tasks > System Tasks.
- On the System tasks page, click Add > Add process runner task. The Add task page displays the configuration properties.
- Create the Process Runner task on the Remote ETL Engine (REE) where the dcgm-exporter scripts are installed.
Configure the following settings:
Property | Description |
---|---|
Name | The name of the system task or ETL job. Helps identify the task in the scheduler or task list. |
Description | Optional field to provide details about the task’s purpose or functionality. |
Maximum execution time before warning | The maximum time the task is allowed to run before a warning is triggered. Example: 4 hours. |
Frequency | Determines how often the task runs. Can be Predefined (e.g., Each Day) or Custom (a specific number of days or hours). |
Predefined frequency | If using a predefined option, select from choices like Each Day, Each Week, etc. |
Start timestamp | The exact time the task starts. Includes: Hour (0–23), Minute (0–59), Week day (if applicable), and Month day (if applicable). |
Custom frequency | If using a custom schedule, define the interval, e.g., every 1 day. |
Custom start timestamp | The date and time when the task should first run, e.g., 09/10/2025 08:47. |
Running on scheduler | Specifies the Remote ETL Engine (REE) or scheduler node where the task will execute. |
- Click Save and schedule the task.
For more information, see Maintaining System tasks.
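The command that the Process Runner task runs is the integration wrapper described in Scripts and Usage. As an illustrative sketch, assuming the endpoint file was saved as endpoints.txt in the script directory (both the file name and its location are choices you make, not defaults), the task command would look similar to the following:
$ cd <BCO_HOME>/etl/scripts/nvidia-dcgm && ./dcgm-integration.sh -i endpoints.txt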
ETL Setup
To import GPU metrics collected by the dcgm-exporter scripts, you need to configure an ETL. For detailed step-by-step instructions on creating a Generic CSV Parser ETL, see Generic - CSV file parser.
GPU-specific ETL settings:
Property | Description |
---|---|
Type | Select Generic – CSV Parser. |
File location | <BCO_HOME>/etl/scripts/nvidia-dcgm/dcgm-data/metrics-etl – directory where scripts generate CSV files. |
Append suffix to parsed file | Enable to avoid overwriting previously imported CSV files. |
Entity catalog | Select or create a catalog for GPU entities. |
Collection level | Define data granularity (per GPU, per host, or per cluster). |
Object relationships | Map GPU metrics to their parent systems or clusters. |
Percentage format | Set to 0–100 for GPU metrics expressed as percentages. |
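Before running the ETL for the first time, it can help to confirm that the aggregator is writing CSV files to the configured file location and to glance at the header row that the parser will read. The commands below are a simple sketch; <latest_file> is a placeholder, and you should adjust the path if your scripts write to a different directory.
$ ls -lt <BCO_HOME>/etl/scripts/nvidia-dcgm/dcgm-data/metrics-etl/*.csv
$ head -1 <BCO_HOME>/etl/scripts/nvidia-dcgm/dcgm-data/metrics-etl/<latest_file>.csv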
Verification
- Check ETL logs – Ensure that CSV files were parsed and imported successfully.
- Verify GPU metrics – Navigate to Workspace > Systems to confirm that GPU metrics are visible.
For a visual representation and analysis of the imported data, see Servers GPU Views.
For detailed metric definitions, see GPU Metrics in NVIDIA Summary Datamart.
- Confirm updates – Make sure the metrics refresh according to the scheduled frequency.
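In addition to the checks above, you can scan the collection script logs on the REE for errors. This sketch assumes the log files under <BCO_HOME>/etl/scripts/nvidia-dcgm/ use a .log extension; check the actual file names on your system.
$ grep -i error <BCO_HOME>/etl/scripts/nvidia-dcgm/*.log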
Best Practices
- Always run dcgm-integration.sh via a System Task rather than manually to ensure consistent data collection.
- Update /etc/dcgm-exporter/default-counters.csv if you need to collect additional GPU metrics; see the example after this list.
- Keep your endpoint file current whenever GPU hosts are added or removed.
- Regularly monitor logs to troubleshoot any issues such as network errors, missing tools (like jq), or misconfigured endpoints.
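Each entry in default-counters.csv enables one DCGM field, using the format <DCGM field name>, <Prometheus type>, <help text>. As a hedged illustration, the line below enables the SM clock frequency counter; depending on your dcgm-exporter version it may already be present or commented out, so compare it against the existing entries in your file. You typically need to restart dcgm-exporter for the change to take effect.
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).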