The Stack
Four components, each with one job:
- Prometheus: Time-series database. Pulls metrics from targets on a schedule, stores them, evaluates alert rules.
- Exporters: Small daemons on each host/service that expose metrics in Prometheus format. node_exporter for hosts, cadvisor for containers, plus app-specific exporters.
- Grafana: Visualization. Connects to Prometheus as a data source and renders dashboards.
- Alertmanager: Receives alerts from Prometheus, routes them (email/Discord/PagerDuty), handles grouping and silencing.
This is the metrics half. The logs half is covered in the Loki + Promtail guide — Grafana queries both side by side.
Compose: The Whole Stack
services:
  prometheus:
    image: prom/prometheus:v2.55.0
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus:/etc/prometheus
      - prom_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=20GB'
      - '--web.enable-lifecycle'
    networks: [monitoring]
    expose: ["9090"]
  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    volumes:
      - ./alertmanager:/etc/alertmanager
    networks: [monitoring]
    expose: ["9093"]
  node_exporter:
    image: prom/node-exporter:v1.8.2
    container_name: node_exporter
    restart: unless-stopped
    pid: host
    network_mode: host
    volumes:
      - /:/host:ro,rslave
    command:
      - '--path.rootfs=/host'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: unless-stopped
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    devices: ["/dev/kmsg"]
    networks: [monitoring]
    expose: ["8080"]
  grafana:
    image: grafana/grafana:11.3.0
    container_name: grafana
    restart: unless-stopped
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
      GF_USERS_ALLOW_SIGN_UP: 'false'
    networks: [monitoring, edge]
    expose: ["3000"]
volumes:
  prom_data:
  grafana_data:
networks:
  monitoring:
  edge:
    external: true
Put Grafana on the edge network so NPM/Authelia can front it; keep Prometheus and Alertmanager internal-only.
prometheus.yml
global:
  scrape_interval: 30s
  evaluation_interval: 30s
  external_labels:
    cluster: homelab
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
rule_files:
  - /etc/prometheus/rules/*.yml
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
  - job_name: node
    static_configs:
      - targets:
          - '192.168.1.100:9100' # proxmox
          - '192.168.1.110:9100' # truenas
          - '192.168.1.120:9100' # npmserv
  - job_name: cadvisor
    static_configs:
      - targets: ['cadvisor:8080']
  - job_name: cloudflared
    static_configs:
      - targets: ['192.168.1.120:2000']
  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://lab.example.com
          - https://grafana.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
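The blackbox job rewrites its scrape address to blackbox-exporter:9115, but the Compose file earlier in this guide does not define a service with that name. A minimal sketch of one follows, assuming the upstream prom/blackbox-exporter image and its bundled default config, which should already include an http_2xx module; the image tag is an assumption, so pin whatever release you have tested.

```yaml
# Sketch only: add under the existing "services:" key in the Compose file.
services:
  blackbox-exporter:
    image: prom/blackbox-exporter:v0.25.0   # assumed tag, not from the original guide
    container_name: blackbox-exporter
    restart: unless-stopped
    networks: [monitoring]                   # same network as Prometheus so the name resolves
    expose: ["9115"]                         # port the blackbox scrape job targets
```

The container needs working DNS and outbound access to the probed URLs, otherwise every probe reports probe_success 0 regardless of the real service state.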
Scrape Targets Worth Adding
- node_exporter on every Linux host. CPU, memory, disk, network, filesystem. The foundation.
- cadvisor on every Docker host. Per-container CPU, memory, network, blkio.
- Proxmox PVE exporter for VM/CT-level metrics and PVE cluster health.
- blackbox_exporter for synthetic checks: HTTP, TCP, ICMP probes against your public endpoints. Tells you if a service is reachable from outside, not just "up."
- App-specific exporters: postgres_exporter, redis_exporter, nginx-prometheus-exporter, smartctl_exporter for disk health.
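Wiring one of these exporters in is usually just another static scrape job. The sketch below shows three of them; the target hosts are placeholders reusing addresses from prometheus.yml, and the ports are the exporters' usual defaults as far as I know, so check them against your own deployment before copying.

```yaml
# Sketch only: append under the existing "scrape_configs:" key in prometheus.yml.
scrape_configs:
  - job_name: postgres
    static_configs:
      - targets: ['192.168.1.120:9187']   # placeholder host; 9187 assumed as postgres_exporter default
  - job_name: redis
    static_configs:
      - targets: ['192.168.1.120:9121']   # placeholder host; 9121 assumed as redis_exporter default
  - job_name: smartctl
    scrape_interval: 5m                    # disk health changes slowly; a longer interval is a judgment call
    static_configs:
      - targets: ['192.168.1.110:9633']   # placeholder host; 9633 assumed as smartctl_exporter default
```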
Recording Rules and Alerts
Drop into ./prometheus/rules/homelab.yml:
groups:
  - name: host
    rules:
      - alert: HostDown
        expr: up{job="node"} == 0
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: 'Host {{ $labels.instance }} is down'
      - alert: DiskFillingFast
        expr: |
          predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 4*24*3600) < 0
        for: 30m
        labels: { severity: warning }
        annotations:
          summary: '{{ $labels.instance }}:{{ $labels.mountpoint }} will fill in 4 days'
      - alert: MemoryPressure
        expr: |
          (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10
        for: 10m
        labels: { severity: warning }
  - name: containers
    rules:
      - alert: ContainerRestarting
        expr: rate(container_start_time_seconds[15m]) > 0
        for: 5m
        annotations:
          summary: 'Container {{ $labels.name }} is restart-looping'
  - name: synthetics
    rules:
      - alert: EndpointDown
        expr: probe_success == 0
        for: 5m
        annotations:
          summary: 'Synthetic probe failing for {{ $labels.instance }}'
predict_linear is the right primitive for capacity alerts — it extrapolates rather than firing on a static threshold. Tune the lookback window to your write rate.
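Alert rules can be checked offline before they ever page anyone. The sketch below is a promtool unit test for the HostDown rule: it feeds a synthetic up series that stays at 0 and expects the alert to be firing after the 5m hold. The file location (./prometheus/tests/homelab_test.yml) is hypothetical; it is kept outside the rules directory so the rule_files glob in prometheus.yml does not try to load it as a rule file.

```yaml
# Hypothetical file: ./prometheus/tests/homelab_test.yml
# Run with: promtool test rules prometheus/tests/homelab_test.yml
rule_files:
  - ../rules/homelab.yml        # assumed to resolve relative to this test file; adjust if not
evaluation_interval: 30s
tests:
  - interval: 30s
    input_series:
      # The node target reports down for the whole test window (21 samples, 10 minutes).
      - series: 'up{job="node", instance="192.168.1.100:9100"}'
        values: '0x20'
    alert_rule_test:
      - eval_time: 6m            # past the rule's "for: 5m", so the alert should be firing
        alertname: HostDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: '192.168.1.100:9100'
            exp_annotations:
              summary: 'Host 192.168.1.100:9100 is down'
```

The external cluster label is not listed in exp_labels because external labels are attached when alerts leave Prometheus, not during rule evaluation.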
Alertmanager Routing
route:
  receiver: discord
  group_by: [alertname, cluster, severity]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers: ['severity="critical"']
      receiver: discord-critical
      group_wait: 0s
receivers:
  - name: discord
    slack_configs:
      - api_url: https://discord.com/api/webhooks/.../slack
        send_resolved: true
  - name: discord-critical
    slack_configs:
      - api_url: https://discord.com/api/webhooks/.../slack
        send_resolved: true
inhibit_rules:
  - source_matchers: ['alertname=HostDown']
    target_matchers: ['severity=warning']
    equal: [instance]
The receivers post to Discord's Slack-compatible endpoint (append /slack to the webhook URL). That endpoint expects Slack-format payloads, so the receivers use slack_configs with api_url; the generic webhook_configs sends Alertmanager's own JSON format, which Discord does not accept. The inhibit rule prevents a host being down from also firing 30 "app on that host is unreachable" alerts. Because of equal: [instance], it only suppresses warnings that carry the same instance label as the HostDown alert.
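Alertmanager 0.25 and later also include a native Discord receiver, which takes the plain webhook URL without the /slack suffix. A minimal sketch of that alternative follows, assuming the v0.27.0 image from the Compose file; the elided part of the webhook URL is left elided, as in the original.

```yaml
# Sketch only: alternative "receivers:" section using the native Discord integration.
receivers:
  - name: discord
    discord_configs:
      - webhook_url: https://discord.com/api/webhooks/...   # no /slack suffix here
        send_resolved: true
  - name: discord-critical
    discord_configs:
      - webhook_url: https://discord.com/api/webhooks/...   # could point at a separate channel
        send_resolved: true
```

Either form works with the routing tree above, since routes refer to receivers only by name.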
Grafana Setup
- Provision the Prometheus data source via ./grafana/provisioning/datasources/prometheus.yml. Config-as-code beats clicking through UIs.
- Start with community dashboards. Grafana IDs worth importing:
  - 1860: Node Exporter Full
  - 14282: cAdvisor (per-container)
  - 10180: Proxmox via PVE exporter
  - 17215: Blackbox synthetic checks
- Build one home dashboard with the 8–12 metrics you actually look at: host CPU/mem/disk, internet uptime, backup job freshness, service status. Skip everything else on the home page.
- Put Grafana behind Authelia (see the Authelia guide) and enable header auth so you skip Grafana's built-in login.
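The data source provisioning mentioned in the first bullet can be sketched as follows. The URL assumes Grafana reaches Prometheus by its Compose service name on the shared monitoring network, and timeInterval mirrors the 30s scrape interval from prometheus.yml; the editable and isDefault choices are assumptions, not something the guide prescribes.

```yaml
# ./grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend makes the requests, not the browser
    url: http://prometheus:9090    # Compose service name and port from the stack above
    isDefault: true
    editable: false                # assumption: keep the provisioned source read-only in the UI
    jsonData:
      timeInterval: 30s            # match the global scrape_interval
```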
Retention and Storage
- Local retention: 30 days at a 30s scrape interval is roughly 1–2GB per target, depending on how many series each target exposes. Plan accordingly.
- Long-term storage: If you want year+ retention, ship to Thanos, Mimir, or VictoriaMetrics. For most homelabs, 30 days local is plenty.
- Downsample with recording rules before retention shrinks. Pre-compute 1h and 1d aggregates so dashboards stay fast as data ages.
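The downsampling bullet can be sketched as a recording-rule group. The rule names follow the common level:metric:operations convention and are made up for this example; the expressions use standard node_exporter metrics. The group runs at the global evaluation interval on purpose, since a very long group interval would leave gaps in instant queries against the recorded series.

```yaml
# Sketch only: e.g. ./prometheus/rules/downsample.yml (hypothetical file name)
groups:
  - name: downsample
    rules:
      # Hourly average CPU utilisation per host (1 minus the idle fraction).
      - record: instance:node_cpu_utilisation:avg_rate1h
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1h]))
      # Hourly average of the available-memory ratio per host.
      - record: instance:node_memory_available:ratio_avg1h
        expr: |
          avg_over_time(node_memory_MemAvailable_bytes[1h])
            / avg_over_time(node_memory_MemTotal_bytes[1h])
```

Dashboards covering long time ranges can then query the recorded series instead of recomputing the rate over raw samples on every load.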
Common Pitfalls
- High cardinality labels kill Prometheus. Never label with user IDs, request paths with IDs, or anything else unbounded.
- Scrape interval too aggressive: 30s is plenty for homelab. 5s adds 6x storage and load for no operational benefit.
- Alert fatigue: If an alert fires more than weekly and you ignore it, it is broken. Tune the threshold or delete the rule.
- No alert testing: Use amtool to send a fake alert through your routing. Confirm it actually reaches Discord/email.
- Dashboards-as-truth without backups: Grafana's SQLite DB contains all dashboards. Include the grafana_data volume in backups, or provision dashboards from JSON in git.
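Provisioning dashboards from git, as the last bullet suggests, needs a small provider file alongside the JSON exports. A minimal sketch follows, assuming the provisioning mount from the Compose file; the provider name and the json subdirectory are hypothetical.

```yaml
# ./grafana/provisioning/dashboards/dashboards.yml (hypothetical file name)
apiVersion: 1
providers:
  - name: homelab                  # hypothetical provider name
    type: file
    disableDeletion: true          # removing a JSON file does not delete the dashboard from Grafana
    allowUiUpdates: false          # edits go through git, not the UI
    options:
      path: /etc/grafana/provisioning/dashboards/json   # assumed location of the exported JSON files
      foldersFromFilesStructure: true                   # subdirectories become Grafana folders
```

With this in place the dashboards live in the repository, and the SQLite database mostly holds users, preferences, and anything created by hand in the UI.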
Validation Checklist
- Prometheus Status → Targets shows every target UP
- Recording rules evaluate without errors (Status → Rules)
- Test alert fires end-to-end (Discord/email actually receives it)
- Grafana dashboards load in <3s on a stale cache
- One home dashboard with the metrics you actually check
- Grafana data and Prometheus rules are in your backup job
- Retention set explicitly; disk usage measured and under budget