HomeLab HQ | Data Extremes

The Stack

Four components, each with one job:

Prometheus: Time-series database. Pulls metrics from targets on a schedule, stores them, evaluates alert rules.
Exporters: Small daemons on each host/service that expose metrics in Prometheus format. node_exporter for hosts, cadvisor for containers, plus app-specific exporters.
Grafana: Visualization. Connects to Prometheus as a data source and renders dashboards.
Alertmanager: Receives alerts from Prometheus, routes them (email/Discord/PagerDuty), handles grouping and silencing.

This is the metrics half. The logs half is covered in the Loki + Promtail guide — Grafana queries both side by side.

Compose: The Whole Stack

services:
  prometheus:
    image: prom/prometheus:v2.55.0
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus:/etc/prometheus
      - prom_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=20GB'
      - '--web.enable-lifecycle'
    networks: [monitoring]
    expose: ["9090"]

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    volumes:
      - ./alertmanager:/etc/alertmanager
    networks: [monitoring]
    expose: ["9093"]

  node_exporter:
    image: prom/node-exporter:v1.8.2
    container_name: node_exporter
    restart: unless-stopped
    pid: host
    network_mode: host
    volumes:
      - /:/host:ro,rslave
    command:
      - '--path.rootfs=/host'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: unless-stopped
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    devices: ["/dev/kmsg"]
    networks: [monitoring]
    expose: ["8080"]

  grafana:
    image: grafana/grafana:11.3.0
    container_name: grafana
    restart: unless-stopped
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
      GF_USERS_ALLOW_SIGN_UP: 'false'
    networks: [monitoring, edge]
    expose: ["3000"]

volumes:
  prom_data:
  grafana_data:

networks:
  monitoring:
  edge:
    external: true

Put Grafana on the edge network so Nginx Proxy Manager (NPM) and Authelia can front it; keep Prometheus and Alertmanager internal-only.

prometheus.yml

global:
  scrape_interval: 30s
  evaluation_interval: 30s
  external_labels:
    cluster: homelab

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  - job_name: node
    static_configs:
      - targets:
          - '192.168.1.100:9100'   # proxmox
          - '192.168.1.110:9100'   # truenas
          - '192.168.1.120:9100'   # npmserv

  - job_name: cadvisor
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: cloudflared
    static_configs:
      - targets: ['192.168.1.120:2000']

  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://lab.example.com
          - https://grafana.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
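
The blackbox job above sends every probe to blackbox-exporter:9115, but the compose file does not define that service. A minimal sketch of one, meant to sit under services: in the same compose file. The image tag is an assumption (pin whatever release is current), and the sketch relies on the default config bundled in the image, which should already define the http_2xx module.

```yaml
services:
  blackbox-exporter:
    image: prom/blackbox-exporter:v0.25.0   # assumed tag; pin the current release
    container_name: blackbox-exporter       # must match the relabel replacement in prometheus.yml
    restart: unless-stopped
    networks: [monitoring]
    expose: ["9115"]
```

Because it only joins the monitoring network, Prometheus can reach it by service name while nothing is published on the host.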

Scrape Targets Worth Adding

node_exporter on every Linux host. CPU, memory, disk, network, filesystem. The foundation.
cadvisor on every Docker host. Per-container CPU, memory, network, blkio.
Proxmox PVE exporter for VM/CT-level metrics and PVE cluster health.
blackbox_exporter for synthetic checks — HTTP, TCP, ICMP probes against your public endpoints. Tells you if a service is reachable from outside, not just "up."
App-specific exporters: postgres_exporter, redis_exporter, nginx-prometheus-exporter, smartctl_exporter for disk health.
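
As an illustration of wiring in one of these, here is a sketch of a scrape job for the Proxmox exporter. It assumes the commonly used prometheus-pve-exporter, which by default listens on port 9221 and serves metrics at /pve with the Proxmox host passed as a target parameter; check its README before relying on this. The pve-exporter service name is hypothetical, and 192.168.1.100 is the proxmox host from the node job.

```yaml
scrape_configs:
  - job_name: pve
    metrics_path: /pve
    params:
      module: [default]
    static_configs:
      - targets: ['192.168.1.100']        # Proxmox host to query (from the node job)
    relabel_configs:
      # Same pattern as the blackbox job: the listed target becomes a URL parameter,
      # and the scrape itself goes to the exporter.
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: pve-exporter:9221    # hypothetical service name; 9221 assumed default port
```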

Recording Rules and Alerts

Drop into ./prometheus/rules/homelab.yml:

groups:
  - name: host
    rules:
      - alert: HostDown
        expr: up{job="node"} == 0
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: 'Host {{ $labels.instance }} is down'

      - alert: DiskFillingFast
        expr: |
          predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 4*24*3600) < 0
        for: 30m
        labels: { severity: warning }
        annotations:
          summary: '{{ $labels.instance }}:{{ $labels.mountpoint }} is predicted to fill within 4 days'

      - alert: MemoryPressure
        expr: |
          (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10
        for: 10m
        labels: { severity: warning }

  - name: containers
    rules:
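      - alert: ContainerRestarting
        # container_start_time_seconds is a gauge (a timestamp), so count how often
        # it changes rather than applying rate(); more than 2 restarts in 15m is a loop.
        expr: changes(container_start_time_seconds{name!=""}[15m]) > 2
        for: 5m
        annotations:
          summary: 'Container {{ $labels.name }} is restart-looping'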

  - name: synthetics
    rules:
      - alert: EndpointDown
        expr: probe_success == 0
        for: 5m
        annotations:
          summary: 'Synthetic probe failing for {{ $labels.instance }}'

predict_linear is the right primitive for capacity alerts — it extrapolates rather than firing on a static threshold. Tune the lookback window to your write rate.
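
One refinement worth considering, sketched here as a variant of the DiskFillingFast rule rather than a required change: only fire when the filesystem is also already fairly full. A short burst of writes on a mostly empty disk can make the extrapolation cross zero without being a real problem. The 0.40 free-space threshold is an assumption to tune.

```yaml
groups:
  - name: host
    rules:
      - alert: DiskFillingFast
        expr: |
          (
            node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
              / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.40
          )
          and
          predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 4*24*3600) < 0
        for: 30m
        labels: { severity: warning }
```

Both sides of the and carry the same instance, device, and mountpoint labels, so the match is one-to-one per filesystem.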

Alertmanager Routing

route:
  receiver: discord
  group_by: [alertname, cluster, severity]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers: ['severity="critical"']
      receiver: discord-critical
      group_wait: 0s

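receivers:
  - name: discord
    slack_configs:
      - api_url: https://discord.com/api/webhooks/.../slack
        send_resolved: true
  - name: discord-critical
    slack_configs:
      - api_url: https://discord.com/api/webhooks/.../slack
        send_resolved: true

inhibit_rules:
  - source_matchers: ['alertname=HostDown']
    target_matchers: ['severity=warning']
    equal: [instance]

The receivers point at Discord's Slack-compatible endpoint (append /slack to the webhook URL). That endpoint expects Slack-format payloads, so the receivers use slack_configs; the generic webhook_configs sends Alertmanager's own JSON format, which Discord does not accept. The inhibit rule suppresses warning-level alerts that share an instance label with a firing HostDown, so one dead host does not also produce 30 "app on that host is unreachable" alerts.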
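
As an alternative to the Slack-compatible endpoint, Alertmanager 0.25 and later include a native Discord receiver, so the v0.27.0 image in the compose file should support it. A minimal sketch, assuming the same webhook with the /slack suffix left off:

```yaml
receivers:
  - name: discord
    discord_configs:
      - webhook_url: https://discord.com/api/webhooks/...   # plain webhook URL, no /slack suffix
        send_resolved: true
```

Either approach works with the routing tree above; only the receiver definition changes.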

Grafana Setup

Provision the Prometheus data source via ./grafana/provisioning/datasources/prometheus.yml. Config-as-code beats clicking through UIs.
Start with community dashboards. Grafana IDs worth importing:
- 1860 — Node Exporter Full
- 14282 — cAdvisor (per-container)
- 10180 — Proxmox via PVE exporter
- 17215 — Blackbox synthetic checks
Build one home dashboard with the 8–12 metrics you actually look at: host CPU/mem/disk, internet uptime, backup job freshness, service status. Skip everything else on the home page.
Put Grafana behind Authelia (see the Authelia guide) and enable header auth so you skip Grafana's built-in login.
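
Two sketches for the items above. The first is a possible ./grafana/provisioning/datasources/prometheus.yml; it assumes Grafana reaches Prometheus by its compose service name on the monitoring network, and sets the minimum interval to match the 30s scrape interval.

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend makes the requests, not the browser
    url: http://prometheus:9090    # compose service name on the monitoring network
    isDefault: true
    editable: false                # keep the provisioned file as the source of truth
    jsonData:
      timeInterval: 30s            # matches global.scrape_interval
```

The second shows header auth as extra environment entries on the grafana service. It assumes Authelia forwards the authenticated user in a Remote-User header; the whitelist CIDR is hypothetical and should be narrowed to the address of your reverse proxy, since any client that can reach Grafana directly could otherwise forge the header.

```yaml
services:
  grafana:
    environment:
      GF_AUTH_PROXY_ENABLED: 'true'
      GF_AUTH_PROXY_HEADER_NAME: Remote-User      # assumed header set by Authelia
      GF_AUTH_PROXY_HEADER_PROPERTY: username
      GF_AUTH_PROXY_AUTO_SIGN_UP: 'true'
      GF_AUTH_PROXY_WHITELIST: 172.18.0.0/16      # hypothetical; restrict to the proxy's address
```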

Retention and Storage

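Local retention: disk usage scales with series count more than target count. Prometheus stores roughly 1–2 bytes per sample, so 30 days at a 30s scrape interval is 86,400 samples, or about 90–170KB, per series. A target exposing around 10,000 series lands at roughly 1–2GB; a host with a thousand or so series needs about a tenth of that. Plan accordingly.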
Long-term storage: If you want year+ retention, ship to Thanos, Mimir, or VictoriaMetrics. For most homelabs, 30 days local is plenty.
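Pre-aggregate with recording rules. Pre-compute 1h and 1d aggregates so dashboards that span weeks stay fast. Recorded series live in the same TSDB and expire under the same retention settings, so they speed up long-range queries but do not keep data past the retention window; that is what long-term storage is for.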
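
A sketch of what such pre-aggregation could look like, as a file under ./prometheus/rules/ alongside the alert rules. The group name, the recorded series names, and the choice of metrics are illustrative assumptions; the names follow the usual level:metric:operations convention.

```yaml
groups:
  - name: aggregates
    interval: 5m    # evaluate less often than the global 30s evaluation_interval
    rules:
      # Hourly average CPU utilisation per host (1 minus the idle fraction)
      - record: instance:node_cpu_utilisation:avg1h
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1h]))
      # Fraction of memory still available per host
      - record: instance:node_memory_available:ratio
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
      # Hourly average receive throughput per host, excluding loopback
      - record: instance:node_network_receive_bytes:rate1h
        expr: sum by (instance) (rate(node_network_receive_bytes_total{device!="lo"}[1h]))
```

Dashboards with wide time ranges can then query the recorded series directly instead of recomputing rate() over the raw data on every load.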

Common Pitfalls

High-cardinality labels kill Prometheus. Never label with user IDs, request paths that embed IDs, or anything else unbounded; every distinct label value creates a new time series.
Scrape interval too aggressive: 30s is plenty for homelab. 5s adds 6x storage and load for no operational benefit.
Alert fatigue: If an alert fires more than weekly and you ignore it, it is broken. Tune the threshold or delete the rule.
No alert testing: Use amtool alert add to push a synthetic alert through your routing. Confirm it actually reaches Discord/email.
Dashboards-as-truth without backups: Grafana's SQLite DB contains all dashboards. Include grafana_data volume in backups, or provision dashboards from JSON in git.
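
For the provision-from-git option, a minimal sketch of a dashboard provider file, for example ./grafana/provisioning/dashboards/provider.yml. The provider name and the json subdirectory are assumptions; the path is inside the container and relies on the provisioning mount from the compose file.

```yaml
apiVersion: 1
providers:
  - name: homelab
    type: file
    disableDeletion: false
    allowUiUpdates: false     # UI edits are not saved back; change the JSON in git instead
    options:
      path: /etc/grafana/provisioning/dashboards/json   # assumed directory holding exported dashboard JSON
      foldersFromFilesStructure: true
```

With this in place the dashboards are rebuilt from the JSON files on startup, so losing grafana_data no longer means losing them.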

Validation Checklist

Prometheus Status → Targets shows every target UP
Alert rules (and any recording rules) evaluate without errors (Status → Rules)
Test alert fires end-to-end (Discord/email actually receives it)
Grafana dashboards load in <3s on a cold cache
One home dashboard with the metrics you actually check
Grafana data and Prometheus rules are in your backup job
Retention set explicitly; disk usage measured and under budget