Prometheus + Grafana Monitoring Stack

Scrape every host and container, build dashboards that actually answer questions, and get paged before users notice.

The Stack

Four components, each with one job:

  • Prometheus: Time-series database. Pulls metrics from targets on a schedule, stores them, evaluates alert rules.
  • Exporters: Small daemons on each host/service that expose metrics in Prometheus format. node_exporter for hosts, cAdvisor for containers, plus app-specific exporters.
  • Grafana: Visualization. Connects to Prometheus as a data source and renders dashboards.
  • Alertmanager: Receives alerts from Prometheus, routes them (email/Discord/PagerDuty), handles grouping and silencing.

This is the metrics half. The logs half is covered in the Loki + Promtail guide — Grafana queries both side by side.

Compose: The Whole Stack

services:
  prometheus:
    image: prom/prometheus:v2.55.0
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus:/etc/prometheus
      - prom_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=20GB'
      - '--web.enable-lifecycle'
    networks: [monitoring]
    expose: ["9090"]

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    volumes:
      - ./alertmanager:/etc/alertmanager
    networks: [monitoring]
    expose: ["9093"]

  node_exporter:
    image: prom/node-exporter:v1.8.2
    container_name: node_exporter
    restart: unless-stopped
    pid: host
    network_mode: host
    volumes:
      - /:/host:ro,rslave
    command:
      - '--path.rootfs=/host'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: unless-stopped
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    devices: ["/dev/kmsg"]
    networks: [monitoring]
    expose: ["8080"]

  grafana:
    image: grafana/grafana:11.3.0
    container_name: grafana
    restart: unless-stopped
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
      GF_USERS_ALLOW_SIGN_UP: 'false'
    networks: [monitoring, edge]
    expose: ["3000"]

volumes:
  prom_data:
  grafana_data:

networks:
  monitoring:
  edge:
    external: true

Put Grafana on the edge network so NPM/Authelia can front it; keep Prometheus and Alertmanager internal-only.

prometheus.yml

global:
  scrape_interval: 30s
  evaluation_interval: 30s
  external_labels:
    cluster: homelab

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  - job_name: node
    static_configs:
      - targets:
          - '192.168.1.100:9100'   # proxmox
          - '192.168.1.110:9100'   # truenas
          - '192.168.1.120:9100'   # npmserv

  - job_name: cadvisor
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: cloudflared
    static_configs:
      - targets: ['192.168.1.120:2000']

  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://lab.example.com
          - https://grafana.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
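
The blackbox job above rewrites every probe to blackbox-exporter:9115, but the Compose file does not define that service. A minimal sketch of what it could look like follows. It assumes the prom/blackbox-exporter image, whose bundled default config already defines an http_2xx module. The image tag is an assumption; pin whichever release you have tested.

```yaml
services:
  blackbox-exporter:
    # Assumed tag; check the project's releases before pinning.
    image: prom/blackbox-exporter:v0.25.0
    container_name: blackbox-exporter
    restart: unless-stopped
    # Same internal network as Prometheus, so the name
    # blackbox-exporter resolves from the scrape config.
    networks: [monitoring]
    expose: ["9115"]
```

Merge this under the existing services: key. Note that these probes originate inside your network, so they exercise DNS, the reverse proxy, and TLS, but not your ISP's inbound path.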

Scrape Targets Worth Adding

  • node_exporter on every Linux host. CPU, memory, disk, network, filesystem. The foundation.
  • cAdvisor on every Docker host. Per-container CPU, memory, network, blkio.
  • Proxmox PVE exporter for VM/CT-level metrics and PVE cluster health.
  • blackbox_exporter for synthetic checks — HTTP, TCP, ICMP probes against your public endpoints. Tells you if a service is reachable from outside, not just "up."
  • App-specific exporters: postgres_exporter, redis_exporter, nginx-prometheus-exporter, smartctl_exporter for disk health.

Recording Rules and Alerts

Drop into ./prometheus/rules/homelab.yml:

groups:
  - name: host
    rules:
      - alert: HostDown
        expr: up{job="node"} == 0
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: 'Host {{ $labels.instance }} is down'

      - alert: DiskFillingFast
        expr: |
          predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 4*24*3600) < 0
        for: 30m
        labels: { severity: warning }
        annotations:
          summary: '{{ $labels.instance }}:{{ $labels.mountpoint }} will fill in 4 days'

      - alert: MemoryPressure
        expr: |
          (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10
        for: 10m
        labels: { severity: warning }

  - name: containers
    rules:
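      - alert: ContainerRestarting
        # start time is a gauge, so count its changes() rather than taking rate();
        # more than 2 restarts in 15m distinguishes a loop from a single redeploy
        expr: changes(container_start_time_seconds{name!=""}[15m]) > 2
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: 'Container {{ $labels.name }} is restart-looping'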

  - name: synthetics
    rules:
      - alert: EndpointDown
        expr: probe_success == 0
        for: 5m
        # severity drives Alertmanager routing and inhibition
        labels: { severity: critical }
        annotations:
          summary: 'Synthetic probe failing for {{ $labels.instance }}'

predict_linear is the right primitive for capacity alerts — it extrapolates rather than firing on a static threshold. Tune the lookback window to your write rate.
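
One common refinement is to let the prediction fire only when the filesystem is already partly full, so a short burst of writes on a mostly empty disk does not page anyone. The sketch below shows this under assumptions: the alert name, the group name, and the 40% threshold are illustrative choices, not part of the rules file above.

```yaml
groups:
  - name: host-capacity            # hypothetical group name
    rules:
      - alert: DiskFillingFastGuarded   # hypothetical alert name
        expr: |
          (
            node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
              / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.40
          )
          and
          predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 4*24*3600) < 0
        for: 30m
        labels: { severity: warning }
        annotations:
          summary: '{{ $labels.instance }}:{{ $labels.mountpoint }} is under 40% free and trending to full within 4 days'
```

The and operator keeps only series present on both sides, so the alert needs both conditions to hold for the same instance and mountpoint.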

Alertmanager Routing

route:
  receiver: discord
  group_by: [alertname, cluster, severity]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers: ['severity="critical"']
      receiver: discord-critical
      group_wait: 0s

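receivers:
  # slack_configs sends the Slack-format payload that Discord's /slack endpoint
  # expects; webhook_configs would send Alertmanager's own JSON instead
  - name: discord
    slack_configs:
      - api_url: https://discord.com/api/webhooks/.../slack
        send_resolved: true
  - name: discord-critical
    slack_configs:
      - api_url: https://discord.com/api/webhooks/.../slack
        send_resolved: true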

inhibit_rules:
  - source_matchers: ['alertname=HostDown']
    target_matchers: ['severity=warning']
    equal: [instance]

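The Discord receivers post to Discord's Slack-compatible endpoint (the webhook URL with /slack appended), which expects a Slack-format payload. The inhibit rule suppresses warning alerts while HostDown is firing for the same instance, so one dead host does not also fire 30 "app on that host is unreachable" alerts. Because equal: [instance] requires identical label values, it only covers alerts whose instance label matches the HostDown alert exactly, port included.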

Grafana Setup

  • Provision the Prometheus data source via ./grafana/provisioning/datasources/prometheus.yml. Config-as-code beats clicking through UIs.
  • Start with community dashboards. Grafana IDs worth importing:
    • 1860 — Node Exporter Full
    • 14282 — cAdvisor (per-container)
    • 10180 — Proxmox via PVE exporter
    • 17215 — Blackbox synthetic checks
  • Build one home dashboard with the 8–12 metrics you actually look at: host CPU/mem/disk, internet uptime, backup job freshness, service status. Skip everything else on the home page.
  • Put Grafana behind Authelia (see the Authelia guide) and enable header auth so you skip Grafana's built-in login.
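
As a sketch of the first and last bullets: the data source file below assumes Prometheus is reachable at the Compose service name prometheus on the monitoring network, and the data source name is an arbitrary choice.

```yaml
# ./grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend makes the requests
    url: http://prometheus:9090    # Compose service name, internal network
    isDefault: true
    jsonData:
      timeInterval: 30s            # match scrape_interval in prometheus.yml
```

Header auth maps to Grafana's auth proxy settings. The fragment below assumes the proxy forwards the authenticated user in a Remote-User header, which is what Authelia typically sends. Verify the header name against your own proxy config before relying on it.

```yaml
services:
  grafana:
    environment:
      GF_AUTH_PROXY_ENABLED: 'true'
      GF_AUTH_PROXY_HEADER_NAME: Remote-User     # assumed header name
      GF_AUTH_PROXY_HEADER_PROPERTY: username
      GF_AUTH_PROXY_AUTO_SIGN_UP: 'true'
```

With the auth proxy enabled, anything that can reach Grafana directly can set that header, so keep port 3000 reachable only through the proxy.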

Retention and Storage

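  • Local retention: disk use scales with series count, not target count. Prometheus averages roughly 1–2 bytes per sample, so 30 days at a 30s scrape interval (86,400 samples) costs roughly 90–170KB per series. The 1–2GB-per-target figure corresponds to a target exposing around 10,000 series; a typical node_exporter host exposes far fewer. Plan accordingly.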
  • Long-term storage: If you want year+ retention, ship to Thanos, Mimir, or VictoriaMetrics. For most homelabs, 30 days local is plenty.
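  • Pre-aggregate with recording rules. Pre-compute 1h and 1d aggregates so dashboards stay fast over long time ranges. Recorded series are subject to the same retention as raw data, so this speeds up queries but does not extend history; true downsampling is a job for a long-term store such as Thanos.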
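
A minimal sketch of such recording rules, assuming node_exporter metrics and a hypothetical file ./prometheus/rules/recording.yml. The group names and rule names are illustrative and follow the level:metric:operations naming convention.

```yaml
groups:
  - name: host_aggregates_5m
    interval: 5m
    rules:
      # Fraction of CPU time spent non-idle, per host
      - record: instance:node_cpu_utilisation:avg_rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

  - name: host_aggregates_1h
    interval: 1h
    rules:
      # Hourly average built from the 5m series above, for long-range panels
      - record: instance:node_cpu_utilisation:avg_1h
        expr: avg_over_time(instance:node_cpu_utilisation:avg_rate5m[1h])
```

Point long-range dashboard panels at the 1h series and keep the raw query for short time ranges.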

Common Pitfalls

  • High cardinality labels kill Prometheus. Never label with user IDs, request paths with IDs, or anything else unbounded.
  • Scrape interval too aggressive: 30s is plenty for homelab. 5s adds 6x storage and load for no operational benefit.
  • Alert fatigue: If an alert fires more than weekly and you ignore it, it is broken. Tune the threshold or delete the rule.
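  • No alert testing: Use amtool to send a fake alert through your routing, for example docker exec alertmanager amtool alert add alertname=TestAlert severity=critical --alertmanager.url=http://localhost:9093. Confirm it actually reaches Discord/email.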
  • Dashboards-as-truth without backups: Grafana's SQLite DB contains all dashboards. Include grafana_data volume in backups, or provision dashboards from JSON in git.
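
Alert testing also has a rule-level counterpart: promtool can replay synthetic series against a rules file and check which alerts fire. The sketch below tests the HostDown rule. The file location ./prometheus/tests/homelab_test.yml is an assumption, chosen to sit outside the rules directory so that the rule_files glob in prometheus.yml does not try to load it as a rule file.

```yaml
# ./prometheus/tests/homelab_test.yml (hypothetical path)
rule_files:
  - ../rules/homelab.yml

evaluation_interval: 30s

tests:
  - interval: 30s
    input_series:
      # Up for two scrapes, then down for the rest of the test
      - series: 'up{job="node", instance="192.168.1.100:9100"}'
        values: '1 1 0+0x20'
    alert_rule_test:
      # Down since 1m, so the 5m "for" clause is satisfied by 10m
      - eval_time: 10m
        alertname: HostDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: 192.168.1.100:9100
            exp_annotations:
              summary: 'Host 192.168.1.100:9100 is down'
```

Since ./prometheus is mounted at /etc/prometheus, this can be run inside the container with docker exec prometheus promtool test rules /etc/prometheus/tests/homelab_test.yml.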

Validation Checklist

  • Prometheus Status → Targets shows every target UP
  • Recording and alerting rules evaluate without errors (Status → Rules)
  • Test alert fires end-to-end (Discord/email actually receives it)
  • Grafana dashboards load in <3s on a cold cache
  • One home dashboard with the metrics you actually check
  • Grafana data and Prometheus rules are in your backup job
  • Retention set explicitly; disk usage measured and under budget

- Crafted by Axiom|Spectre