Advanced GPU Monitoring with DCGM-Exporter, Prometheus, and Grafana

Introduction

Did you know that every GPU in your environment can generate valuable data? When properly analyzed, these metrics can optimize performance, reduce costs, and accelerate strategic decisions.

NVIDIA’s DCGM-Exporter exposes detailed GPU metrics (clocks, temperature, power, utilization, memory) over an HTTP endpoint that Prometheus can scrape and Grafana can turn into real-time dashboards.


Why GPU Monitoring Matters

Many companies miss opportunities by not properly tracking their GPU resources. With strategic monitoring, you can:

  • Detect performance bottlenecks before they impact production.
  • Optimize resource usage and reduce costs.
  • Make data-driven decisions instead of relying on assumptions.
  • Monitor multiple clusters in a centralized way.

The real value comes when you can centralize metrics and turn them into actionable insights.


What You’ll Learn in This Post

Step by step, you’ll learn how to:

  1. Quickly set up DCGM-Exporter in Docker
  2. Deploy it in Kubernetes using the Helm chart
  3. Integrate the metrics into a central Prometheus
  4. Visualize them in Grafana dashboards

All with a focus on turning raw metrics into strategic information for your business.


Running DCGM-Exporter in Docker

To quickly run DCGM-Exporter on a GPU-enabled machine:

docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:4.4.1-4.5.2-ubuntu22.04

Test the metrics endpoint:

curl localhost:9400/metrics

Expected output:

# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-604ac76c-d9cf-xxx"} 139
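The full endpoint output is long, so it can help to pull out a single metric. The sketch below is one way to do that with awk; `dcgm_metric` is a hypothetical helper name, and it assumes every sample line looks like the one above (metric name, labels in braces, then the value as the last field).

```shell
# Hypothetical helper: print "gpu value" pairs for one DCGM metric read from stdin.
# Assumes sample lines of the form NAME{gpu="0",...} VALUE, as in the output above.
dcgm_metric() {
  metric="$1"
  awk -v m="$metric" '
    # Keep only sample lines that start with the requested metric name;
    # "# HELP" and "# TYPE" comment lines never match.
    index($0, m "{") == 1 {
      gpu = "?"
      if (match($0, /gpu="[^"]*"/)) {
        # Strip the leading gpu=" (5 chars) and the trailing quote.
        gpu = substr($0, RSTART + 5, RLENGTH - 6)
      }
      print gpu, $NF
    }'
}
```

Usage: `curl -s localhost:9400/metrics | dcgm_metric DCGM_FI_DEV_SM_CLOCK` prints one line per GPU with its index and current SM clock.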

Deploying DCGM-Exporter in Kubernetes

NVIDIA maintains an official Helm Chart to install DCGM-Exporter in Kubernetes clusters:

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install --generate-name gpu-helm-charts/dcgm-exporter

Check the pod:

kubectl get pods -l "app.kubernetes.io/name=dcgm-exporter" -n default

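Access metrics locally (if `kubectl get svc` shows the Service with a generated suffix, use that name instead of `dcgm-exporter`):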

kubectl port-forward svc/dcgm-exporter 8080:9400
curl http://127.0.0.1:8080/metrics
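Because the release is installed with `--generate-name`, the Service may end up with a name like `dcgm-exporter-<suffix>` rather than plain `dcgm-exporter`. A minimal sketch for looking it up by label instead of hard-coding it follows. `dcgm_service_name` is a hypothetical helper, and it assumes the chart puts the same `app.kubernetes.io/name` label on the Service as on the pods.

```shell
# Hypothetical helper: print the name of the first Service that carries
# the exporter's app label in the given namespace (default: "default").
# Assumption: the Service is labelled like the pods selected earlier.
dcgm_service_name() {
  ns="${1:-default}"
  kubectl get svc -n "$ns" \
    -l "app.kubernetes.io/name=dcgm-exporter" \
    -o jsonpath='{.items[0].metadata.name}'
}
```

Usage: `kubectl port-forward "svc/$(dcgm_service_name)" 8080:9400`, then run the same curl against port 8080.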

Setting up Local Prometheus in Docker

Add a scrape job to collect DCGM-Exporter metrics in prometheus.yml:

scrape_configs:
  - job_name: "dcgm-exporter"
    static_configs:
      - targets: ["host.docker.internal:9400"]

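The host.docker.internal name resolves under Docker Desktop (Windows/Mac). On Linux, when Prometheus runs with --network=host as in the command below, use localhost:9400 or the host machine's IP instead.

Start Prometheus with the configuration mounted (if a container named prometheus already exists, remove it first with docker rm -f prometheus):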

docker run -d --name prometheus --network=host \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus

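Prometheus will now scrape the DCGM-Exporter endpoint. You can confirm that the target is listed as UP on the Prometheus Targets page (http://localhost:9090/targets by default).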
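To check the scrape from the command line, you can ask the Prometheus HTTP query API for the built-in `up` series of the job. The sketch below is a rough check, not a robust JSON parser: `dcgm_target_up` is a hypothetical helper, the default URL assumes Prometheus listens on its standard port 9090, and the grep pattern assumes the compact JSON that the API normally returns.

```shell
# Hypothetical helper: succeed (exit 0) if at least one dcgm-exporter
# target reports up == 1 according to Prometheus.
# Assumptions: Prometheus at http://localhost:9090 unless a URL is passed,
# and a compact JSON response containing "value":[<timestamp>,"1"].
dcgm_target_up() {
  prom="${1:-http://localhost:9090}"
  curl -sG "$prom/api/v1/query" \
    --data-urlencode 'query=up{job="dcgm-exporter"}' |
    grep -q '"value":\[[^]]*,"1"]'
}
```

Usage: `dcgm_target_up && echo "scrape OK" || echo "target down or missing"`.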


Grafana Integration

NVIDIA provides an official dashboard for metric visualization. Import its JSON into Grafana to start exploring your GPU metrics in real time.
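The dashboard only shows data once Grafana has a Prometheus data source to query. As a sketch, the function below prints a Grafana data source provisioning file. `render_grafana_datasource` is a hypothetical helper, and the default URL assumes Prometheus is reachable from Grafana at http://localhost:9090.

```shell
# Hypothetical helper: print a Grafana data source provisioning file
# that points at a Prometheus server.
# Assumption: Prometheus is reachable at the URL passed as $1
# (default http://localhost:9090) from where Grafana runs.
render_grafana_datasource() {
  prom_url="${1:-http://localhost:9090}"
  cat <<EOF
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: ${prom_url}
    isDefault: true
EOF
}
```

Usage: `render_grafana_datasource > prometheus.yaml`, then place the file under Grafana's provisioning/datasources directory (for the official container image, /etc/grafana/provisioning/datasources/).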

To fully leverage your data, send local Prometheus metrics to a central Prometheus, which connects directly with Grafana, enabling:

  • Accurate, real-time visualizations.
  • Actionable insights for strategic decisions.
  • Simplified monitoring of multiple clusters.
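One common way to send local metrics to a central Prometheus is `remote_write` (federation is another). The sketch below prints a local prometheus.yml that keeps the scrape job from earlier and adds a `remote_write` section. `render_remote_write_config`, the example URL, and the `cluster` label are illustrative assumptions, not part of the original setup.

```shell
# Hypothetical helper: print a local prometheus.yml that scrapes
# DCGM-Exporter and forwards samples to a central Prometheus.
#   $1: remote-write URL of the central server (assumed endpoint)
#   $2: value for a "cluster" external label, so the central server
#       can tell clusters apart (label name is an assumption)
render_remote_write_config() {
  central_url="$1"
  cluster="$2"
  cat <<EOF
global:
  external_labels:
    cluster: "${cluster}"

scrape_configs:
  - job_name: "dcgm-exporter"
    static_configs:
      - targets: ["host.docker.internal:9400"]

remote_write:
  - url: "${central_url}"
EOF
}
```

Usage: `render_remote_write_config http://central.example:9090/api/v1/write cluster-a > prometheus.yml`. For this to work, the central Prometheus has to accept remote-write traffic, which it does when started with the `--web.enable-remote-write-receiver` flag.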

Want to know more or implement this integration in your environment? Get in touch and turn your metrics into smart decisions!


Conclusion

With DCGM-Exporter, you can monitor GPUs on standalone on-premises hosts or in Kubernetes clusters, and feed those metrics into Prometheus and Grafana.