
Monitoring & Observability

Comprehensive monitoring stack providing visibility into infrastructure, services, and applications across the entire homelab. A multi-layered approach combines system monitoring, log aggregation, and real-time metrics.

System Monitoring

✅ Checkmk

Enterprise-grade monitoring platform tracking infrastructure health, service availability, and performance metrics across all systems.

Features:

  • Agent-based and agentless monitoring
  • Auto-discovery of services
  • Flexible alerting rules
  • Performance graphing
  • Distributed monitoring
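Custom metrics can be fed into Checkmk through its "local check" mechanism: a script in the agent's `local/` directory prints one status line per service. The sketch below is hypothetical (the snapshot age is a placeholder value, not a real measurement) but follows the documented output format.

```shell
#!/bin/sh
# Hypothetical Checkmk local check: report ZFS snapshot age.
# Deploy to /usr/lib/check_mk_agent/local/ on the monitored host.
SNAP_AGE_HOURS=4   # placeholder; a real check would compute this from zfs(8)
WARN=24; CRIT=48
if   [ "$SNAP_AGE_HOURS" -ge "$CRIT" ]; then STATUS=2   # CRIT
elif [ "$SNAP_AGE_HOURS" -ge "$WARN" ]; then STATUS=1   # WARN
else STATUS=0                                           # OK
fi
# Local check output format: <status> <item> <perfdata> <detail>
RESULT="$STATUS \"ZFS_Snapshot_Age\" age=${SNAP_AGE_HOURS};${WARN};${CRIT} Last snapshot ${SNAP_AGE_HOURS}h ago"
echo "$RESULT"
```

On the next agent poll the new service appears in Checkmk's service discovery with the warn/crit thresholds baked into the perfdata.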


Log Management

📋 Graylog

Centralized log aggregation and analysis platform. Collects, indexes, and analyzes logs from all infrastructure components and applications.

Features:

  • Real-time log streaming
  • Full-text search
  • Alert correlations
  • Dashboard creation
  • Multi-source ingestion
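Beyond syslog, applications can ship structured messages straight to a Graylog GELF HTTP input. A minimal sketch, assuming an input listening at `graylog.lan:12201` (adjust to whatever you configured under System → Inputs); the `_service` field is a custom field added for stream routing:

```shell
#!/bin/sh
# Build a GELF 1.1 payload and optionally POST it to a Graylog GELF HTTP input.
GELF_URL="http://graylog.lan:12201/gelf"
PAYLOAD='{"version":"1.1","host":"docker-host-1","short_message":"Container healthcheck failed","level":3,"_service":"jellyfin"}'
# Set SEND=1 to actually deliver the message to Graylog:
if [ "${SEND:-0}" = "1" ]; then
  curl -sf -X POST "$GELF_URL" -H 'Content-Type: application/json' -d "$PAYLOAD"
fi
echo "$PAYLOAD"
```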


Real-Time Metrics

📊 Netdata

Lightweight, real-time system monitoring with per-second metrics and zero-configuration deployment. Primarily used on VPS for public-facing services.

Features:

  • Real-time metrics (1-second granularity)
  • Zero-configuration auto-detection
  • Thousands of metrics collected
  • Minimal resource usage
  • Web-based dashboards
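Every Netdata chart is also queryable over its REST API, which is handy for ad-hoc scripting. This builds the query URL for the last 60 seconds of CPU data against the default host/port (assumptions; adjust for your VPS):

```shell
#!/bin/sh
# Query Netdata's /api/v1/data endpoint for recent system.cpu samples.
NETDATA="http://localhost:19999"   # Netdata's default listen address
URL="${NETDATA}/api/v1/data?chart=system.cpu&after=-60&format=json"
# Fetch it (requires a running Netdata instance):
#   curl -s "$URL"
echo "$URL"
```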


Container Monitoring

🔄 WUD (What's up Docker)

Container update tracking and notification system. Monitors Docker images for new versions and alerts when updates are available.

Features:

  • Multi-registry support
  • Update notifications
  • Webhook integrations
  • Per-container tracking
  • Scheduled checking

⏱️ Uptime Kuma

Service uptime monitoring with status pages and multi-channel notifications.

Features:

  • HTTP/HTTPS monitoring
  • TCP port checks
  • Ping monitoring
  • Status pages
  • Multi-notification support
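Uptime Kuma also supports "Push" monitors, which invert the check: the monitored job calls a heartbeat URL on a schedule, and Kuma alerts when the calls stop. The base URL and token below are placeholders; Kuma generates the real URL on the monitor's setup page.

```shell
#!/bin/sh
# Heartbeat URL for an Uptime Kuma Push monitor (placeholder host/token).
KUMA_PUSH="https://status.example.lan/api/push/abcd1234"
HEARTBEAT="${KUMA_PUSH}?status=up&msg=OK"
# Called from cron at the end of a backup job, for example:
#   curl -fsS "$HEARTBEAT" >/dev/null
echo "$HEARTBEAT"
```

This pattern is useful for cron jobs and backups, where "it stopped reporting" matters more than "the port is closed".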


Monitoring Architecture

Data Flow

Hosts/Services
  ├─ Checkmk Agents → Checkmk Server → Alerts/Dashboards
  ├─ Graylog Inputs → Graylog Server → Indexed Logs
  └─ Netdata Collectors → Netdata Dashboards
        ↓
Alert Channels (Email, Discord, Slack)

Monitoring Layers

Infrastructure Layer

  • Proxmox nodes (CPU, RAM, storage)
  • TrueNAS (pool health, disk SMART)
  • Network devices (switches, APs)
  • UPS status and battery

Application Layer

  • Docker containers
  • Kubernetes pods
  • Database performance
  • Web service response times

Log Layer

  • System logs (syslog)
  • Application logs (stdout/stderr)
  • Security logs (auth, firewall)
  • Audit trails


Alert Strategy

Severity Levels

Critical (Immediate Action)

  • Service down
  • Disk failure
  • Out of disk space
  • Backup failure

Warning (Next Business Day)

  • High CPU/memory usage
  • Disk > 80% full
  • Certificate expiring soon
  • Failed login attempts

Info (Informational)

  • Service restarts
  • Configuration changes
  • Update available

Notification Channels

Discord

  • Primary notification channel
  • Separate channels per severity
  • Rich embeds with context

Email

  • Critical alerts only
  • Sent to admin email
  • Includes runbook links

Mobile (Pushover)

  • Critical alerts for immediate attention
  • Push notifications to phone
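For the mobile channel, Pushover's messages API is a single HTTPS POST. A sketch of a reusable sender function; the endpoint is Pushover's real API, while the token/user environment variables are yours to supply:

```shell
#!/bin/sh
# Minimal Pushover sender for critical alerts.
# Expects PUSHOVER_TOKEN and PUSHOVER_USER in the environment.
notify_pushover() {
  curl -s \
    --form-string "token=${PUSHOVER_TOKEN}" \
    --form-string "user=${PUSHOVER_USER}" \
    --form-string "priority=1" \
    --form-string "message=$1" \
    https://api.pushover.net/1/messages.json
}
# Usage: notify_pushover "CRITICAL: TrueNAS pool degraded"
```

Setting `priority=1` makes the notification high-priority (bypasses quiet hours), which matches the "critical alerts only" policy above.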


Dashboard Organization

Checkmk Views

Infrastructure Overview

  • All hosts status
  • Service problems
  • Performance graphs
  • Downtime schedule

Docker Monitoring

  • Container status
  • Resource usage
  • Image versions
  • Network stats

Storage Health

  • Pool capacity
  • SMART status
  • Replication status
  • Backup job status

Graylog Streams

Application Logs

Stream: Docker Containers
Rules: source_type:docker
Dashboard: Container logs by service

Security Logs

Stream: Authentication
Rules: facility:auth OR facility:authpriv
Dashboard: Failed logins, sudo usage

Error Tracking

Stream: Errors
Rules: level:<=3 (syslog severity 3 = error; lower numbers are more severe)
Dashboard: Error count by source


Retention Policies

Checkmk

  • RRD Metrics: 400 days at decreasing granularity
  • Events: 30 days in database
  • Availability: 1 year

Graylog

  • Hot Storage: 7 days (fast SSD)
  • Warm Storage: 30 days (HDD)
  • Cold Archive: 90 days (compressed)
  • Deletion: After 90 days

Netdata

  • RAM: 1 hour at 1-second granularity
  • Disk: 7 days at 10-second granularity

Performance Monitoring

Key Metrics

System Metrics

  • CPU utilization and load average
  • Memory usage and swap
  • Disk I/O and latency
  • Network throughput and errors

Service Metrics

  • Response time (p50, p95, p99)
  • Request rate
  • Error rate
  • Saturation (queue depth)

Application Metrics

  • Database query times
  • Cache hit rates
  • API endpoint performance
  • Background job duration
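The percentiles above (p50/p95/p99) can be spot-checked from the command line. This quick-and-dirty p95 uses the nearest-rank method over a hard-coded sample of response times in milliseconds (illustrative data, not real measurements); monitoring tools typically interpolate, so their values can differ slightly:

```shell
#!/bin/sh
# Nearest-rank p95 over a sample of response times (ms) with sort/awk.
P95=$(printf '%s\n' 120 80 95 300 150 110 90 2500 130 100 |
  sort -n |
  awk '{a[NR]=$1} END {r=int(NR*0.95); if (r<1) r=1; print a[r]}')
echo "p95=${P95}ms"
```

Note how one 2500 ms outlier barely moves the p95 while it would wreck the mean, which is why latency alerting keys on percentiles.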


Log Correlation

Example: Troubleshooting Service Outage

  1. Checkmk Alert: "Jellyfin HTTP check failed"
  2. Graylog Search: Filter logs from Jellyfin container around alert time
  3. Error Found: "Database connection timeout"
  4. Root Cause: PostgreSQL container restarted (OOM)
  5. Resolution: Increase PostgreSQL memory limit

Graylog Query Examples

# Find all errors (syslog severity 3); set the search time range
# selector to "last 1 hour" rather than filtering on timestamp
level:3

# Docker container restarts
message:"Container died" OR message:"Started container"

# Failed authentication attempts (leading wildcards are disabled by default)
facility:auth AND message:failed

# High response times (ms)
application:nginx AND response_time:>=5000

Monitoring Best Practices

  1. Start Simple – Monitor essentials first (up/down, disk, CPU)
  2. Alert on Symptoms – Alert on user-facing issues, not every metric
  3. Reduce Noise – Tune thresholds to minimize false positives
  4. Document Runbooks – Link alerts to resolution steps
  5. Test Alerts – Verify notifications work before relying on them
  6. Review Regularly – Adjust thresholds based on normal patterns
  7. Correlate Data – Use multiple systems for context
  8. Retention Planning – Balance storage cost vs. historical value

Integration Examples

Checkmk → Discord Webhook

#!/bin/bash
# Checkmk notification script: post service alerts to a Discord webhook.
# Checkmk exports alert details as NOTIFY_* environment variables.
curl -X POST "$DISCORD_WEBHOOK" \
  -H "Content-Type: application/json" \
  -d "{
    \"content\": \"🚨 **$NOTIFY_HOSTNAME** - $NOTIFY_SERVICEDESC is $NOTIFY_SERVICESTATE\",
    \"embeds\": [{
      \"title\": \"Service Alert\",
      \"description\": \"$NOTIFY_SERVICEOUTPUT\",
      \"color\": 15158332
    }]
  }"

Graylog → Slack Alert

{
  "type": "slack",
  "configuration": {
    "webhook_url": "https://hooks.slack.com/services/xxx",
    "channel": "#alerts",
    "custom_message": "${alert_definition.title}\n${foreach backlog message}${message.message}\n${end}"
  }
}

Disaster Recovery

Monitoring System Failure

If Checkmk is Down:

  1. Check server status via Proxmox
  2. Review container logs
  3. Verify database connectivity
  4. Restore from backup if needed

If Graylog is Down:

  1. Logs remain buffered at the source (rsyslog)
  2. Check Elasticsearch/MongoDB status
  3. Verify storage availability
  4. Restore indices if corrupted
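The "buffered at the source" step works because rsyslog can spool to a disk-assisted queue while Graylog is unreachable, then replay on reconnect. A sketch of that forwarding action (the target host, port, and queue sizes are assumptions to adapt):

```
# /etc/rsyslog.d/50-graylog.conf -- forward logs to Graylog with a
# disk-assisted queue so messages survive a Graylog outage
action(type="omfwd" target="graylog.lan" port="5140" protocol="tcp"
       queue.type="LinkedList" queue.filename="graylog_fwd"
       queue.maxDiskSpace="1g" queue.saveOnShutdown="on"
       action.resumeRetryCount="-1")
```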

Backup Strategy

  • Checkmk Config: Daily exports via API
  • Graylog Config: MongoDB backups included in TrueNAS snapshots
  • Dashboards: Exported to Git repository