Did you know that every GPU in your environment can generate valuable data? When properly analyzed, these metrics can optimize performance, reduce costs, and accelerate strategic decisions.
NVIDIA’s DCGM-Exporter allows you to extract detailed GPU metrics, seamlessly integrating with Prometheus and Grafana for smart dashboards and real-time visualizations.
Many companies miss opportunities by not properly tracking their GPU resources. The real value comes when you centralize metrics and turn them into actionable insights.
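Here you’ll learn, step by step, how to run DCGM-Exporter with Docker, deploy it on Kubernetes with Helm, scrape it with Prometheus, and visualize the metrics in Grafana.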
All with a focus on turning raw metrics into strategic information for your business.
To quickly run DCGM-Exporter on a GPU-enabled machine:
docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 \
nvcr.io/nvidia/k8s/dcgm-exporter:4.4.1-4.5.2-ubuntu22.04
Test the metrics endpoint:
curl localhost:9400/metrics
Expected output (excerpt):
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-604ac76c-d9cf-xxx"} 139
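To use these values outside Prometheus, for example in a quick health check, the text output can be parsed directly. The following is a minimal Python sketch, assuming the simple line shape shown above (metric name, labels in braces, numeric value). The helper parse_sample is hypothetical and not part of DCGM-Exporter, and it does not handle escaped quotes inside label values.

```python
import re

# Matches sample lines such as: NAME{key="value",...} 139
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Return (name, labels, value) for a sample line, or None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        return None
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


# The excerpt from the curl output above.
text = '''# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-604ac76c-d9cf-xxx"} 139'''

samples = [s for s in map(parse_sample, text.splitlines()) if s is not None]
for name, labels, value in samples:
    print(name, labels["gpu"], value)  # prints: DCGM_FI_DEV_SM_CLOCK 0 139.0
```

For anything beyond a quick check, a maintained parser for the Prometheus text format is likely a safer choice than a hand-written regular expression.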
NVIDIA maintains an official Helm chart to install DCGM-Exporter in Kubernetes clusters:
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install --generate-name gpu-helm-charts/dcgm-exporter
Check the pod:
kubectl get pods -l "app.kubernetes.io/name=dcgm-exporter" -n default
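Access metrics locally. Because the release was installed with --generate-name, the Service name typically carries a generated suffix; if svc/dcgm-exporter is not found, run kubectl get svc and substitute the actual name: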
kubectl port-forward svc/dcgm-exporter 8080:9400
curl http://127.0.0.1:8080/metrics
Add a scrape job to prometheus.yml to collect DCGM-Exporter metrics:
scrape_configs:
  - job_name: "dcgm-exporter"
    static_configs:
      - targets: ["host.docker.internal:9400"]
Use host.docker.internal in Docker Desktop (Windows/Mac). On Linux, replace it with the host machine IP, or with localhost when Prometheus runs with --network=host as in the command below.
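When more than one GPU host needs to be scraped, the targets list grows by one entry per host. As a rough sketch, the snippet below renders the same scrape job for a list of hosts using plain string formatting. The host names are hypothetical placeholders, and port 9400 is the one published by the docker run command earlier.

```python
def render_scrape_config(hosts, port=9400, job_name="dcgm-exporter"):
    """Render a prometheus.yml scrape_configs section with one target per host."""
    targets = ", ".join(f'"{host}:{port}"' for host in hosts)
    return (
        "scrape_configs:\n"
        f'  - job_name: "{job_name}"\n'
        "    static_configs:\n"
        f"      - targets: [{targets}]\n"
    )


# Hypothetical host names; replace them with your own machines.
config = render_scrape_config(["gpu-node-1", "gpu-node-2"])
print(config)
```

For larger or frequently changing fleets, Prometheus service discovery may be a better fit than a static list.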
Start Prometheus with the updated configuration (if a container named prometheus already exists, remove it first):
docker run -d --name prometheus --network=host \
-v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
prom/prometheus
Prometheus now scrapes the DCGM-Exporter endpoint; the job should appear as UP on the Targets page of the Prometheus UI.
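One way to confirm that samples are arriving is to ask Prometheus for a DCGM metric through its HTTP query API (/api/v1/query). The sketch below assumes Prometheus is reachable at http://localhost:9090, its default port, and reuses the DCGM_FI_DEV_SM_CLOCK metric from the earlier output. The function names are illustrative, and the sample response is hand-written in the shape the API returns, not captured from a real server.

```python
import json
import urllib.parse
import urllib.request


def extract_values(response):
    """Map each GPU label to its latest value from an instant-query response."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response}")
    return {
        item["metric"].get("gpu", "unknown"): float(item["value"][1])
        for item in response["data"]["result"]
    }


def query_prometheus(expr, base_url="http://localhost:9090"):
    """Run an instant query against Prometheus and return the decoded JSON."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


# Hand-written sample in the instant-query response shape (illustrative values).
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "DCGM_FI_DEV_SM_CLOCK", "gpu": "0"},
                "value": [1700000000.0, "139"],
            }
        ],
    },
}
print(extract_values(sample))  # prints: {'0': 139.0}
```

Against a live server, extract_values(query_prometheus("DCGM_FI_DEV_SM_CLOCK")) would return the current readings; an empty dictionary suggests the exporter has not been scraped yet.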
NVIDIA provides an official dashboard for metric visualization in the dcgm-exporter repository:
grafana/dcgm-exporter-dashboard.json
Just import the JSON into Grafana and start exploring real-time insights.
To fully leverage your data, send local Prometheus metrics to a central Prometheus, which connects directly with Grafana.
Want to know more or implement this integration in your environment? Get in touch and turn your metrics into smart decisions!
With DCGM-Exporter, you can monitor GPUs on standalone on-premises hosts or in Kubernetes clusters and feed the metrics into Prometheus and Grafana.