
Monitoring & Observability

Comprehensive monitoring stack providing visibility into infrastructure, services, and applications across the entire homelab. A multi-layered approach combines system monitoring, log aggregation, and real-time metrics.

System Monitoring

✅ Checkmk

Enterprise-grade monitoring platform tracking infrastructure health, service availability, and performance metrics across all systems.

Features:

  • Agent-based and agentless monitoring
  • Auto-discovery of services
  • Flexible alerting rules
  • Performance graphing
  • Distributed monitoring
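Custom metrics can be fed into Checkmk through its "local check" mechanism: a script in the agent's `local/` directory prints one status line per service. The sketch below is hypothetical (the snapshot age is a placeholder value, not a real measurement) but follows the documented output format.

```shell
#!/bin/sh
# Hypothetical Checkmk local check: report ZFS snapshot age.
# Deploy to /usr/lib/check_mk_agent/local/ on the monitored host.
SNAP_AGE_HOURS=4   # placeholder; a real check would compute this from zfs(8)
WARN=24; CRIT=48
if   [ "$SNAP_AGE_HOURS" -ge "$CRIT" ]; then STATUS=2   # CRIT
elif [ "$SNAP_AGE_HOURS" -ge "$WARN" ]; then STATUS=1   # WARN
else STATUS=0                                           # OK
fi
# Local check output format: <status> <item> <perfdata> <detail>
RESULT="$STATUS \"ZFS_Snapshot_Age\" age=${SNAP_AGE_HOURS};${WARN};${CRIT} Last snapshot ${SNAP_AGE_HOURS}h ago"
echo "$RESULT"
```

On the next agent poll the new service appears in Checkmk's service discovery with the warn/crit thresholds baked into the perfdata.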


Log Management

📋 Graylog

Centralized log aggregation and analysis platform. Collects, indexes, and analyzes logs from all infrastructure components and applications.

Features:

  • Real-time log streaming
  • Full-text search
  • Alert correlations
  • Dashboard creation
  • Multi-source ingestion
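Beyond syslog, applications can ship structured messages straight to a Graylog GELF HTTP input. A minimal sketch, assuming an input listening at `graylog.lan:12201` (adjust to whatever you configured under System → Inputs); the `_service` field is a custom field added for stream routing:

```shell
#!/bin/sh
# Build a GELF 1.1 payload and optionally POST it to a Graylog GELF HTTP input.
GELF_URL="http://graylog.lan:12201/gelf"
PAYLOAD='{"version":"1.1","host":"docker-host-1","short_message":"Container healthcheck failed","level":3,"_service":"jellyfin"}'
# Set SEND=1 to actually deliver the message to Graylog:
if [ "${SEND:-0}" = "1" ]; then
  curl -sf -X POST "$GELF_URL" -H 'Content-Type: application/json' -d "$PAYLOAD"
fi
echo "$PAYLOAD"
```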


Real-Time Metrics

📊 Netdata

Lightweight, real-time system monitoring with per-second metrics and zero-configuration deployment. Primarily used on VPS for public-facing services.

Features:

  • Real-time metrics (1-second granularity)
  • Zero-configuration auto-detection
  • Thousands of metrics collected
  • Minimal resource usage
  • Web-based dashboards
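Every Netdata chart is also queryable over its REST API, which is handy for ad-hoc scripting. This builds the query URL for the last 60 seconds of CPU data against the default host/port (assumptions; adjust for your VPS):

```shell
#!/bin/sh
# Query Netdata's /api/v1/data endpoint for recent system.cpu samples.
NETDATA="http://localhost:19999"   # Netdata's default listen address
URL="${NETDATA}/api/v1/data?chart=system.cpu&after=-60&format=json"
# Fetch it (requires a running Netdata instance):
#   curl -s "$URL"
echo "$URL"
```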


Container Monitoring

🔄 WUD (What's up Docker)

Container update tracking and notification system. Monitors Docker images for new versions and alerts when updates are available.

Features:

  • Multi-registry support
  • Update notifications
  • Webhook integrations
  • Per-container tracking
  • Scheduled checking

⏱️ Uptime Kuma

Service uptime monitoring with status pages and multi-channel notifications.

Features:

  • HTTP/HTTPS monitoring
  • TCP port checks
  • Ping monitoring
  • Status pages
  • Multi-notification support
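Uptime Kuma also supports "Push" monitors, which invert the check: the monitored job calls a heartbeat URL on a schedule, and Kuma alerts when the calls stop. The base URL and token below are placeholders; Kuma generates the real URL on the monitor's setup page.

```shell
#!/bin/sh
# Heartbeat URL for an Uptime Kuma Push monitor (placeholder host/token).
KUMA_PUSH="https://status.example.lan/api/push/abcd1234"
HEARTBEAT="${KUMA_PUSH}?status=up&msg=OK"
# Called from cron at the end of a backup job, for example:
#   curl -fsS "$HEARTBEAT" >/dev/null
echo "$HEARTBEAT"
```

This pattern is useful for cron jobs and backups, where "it stopped reporting" matters more than "the port is closed".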


Monitoring Architecture

Data Flow

Hosts/Services
  ├─ Checkmk Agents → Checkmk Server → Alerts/Dashboards
  ├─ Graylog Inputs → Graylog Server → Indexed Logs
  └─ Netdata Collectors → Netdata Dashboards
        ↓
Alert Channels (Email, Discord, Slack)

Monitoring Layers

Infrastructure Layer

  • Proxmox nodes (CPU, RAM, storage)
  • TrueNAS (pool health, disk SMART)
  • Network devices (switches, APs)
  • UPS status and battery

Application Layer

  • Docker containers
  • Kubernetes pods
  • Database performance
  • Web service response times

Log Layer

  • System logs (syslog)
  • Application logs (stdout/stderr)
  • Security logs (auth, firewall)
  • Audit trails


Alert Strategy

Severity Levels

Critical (Immediate Action)

  • Service down
  • Disk failure
  • Out of disk space
  • Backup failure

Warning (Next Business Day)

  • High CPU/memory usage
  • Disk > 80% full
  • Certificate expiring soon
  • Failed login attempts

Info (Informational)

  • Service restarts
  • Configuration changes
  • Update available

Notification Channels

Discord

  • Primary notification channel
  • Separate channels per severity
  • Rich embeds with context

Email

  • Critical alerts only
  • Sent to admin email
  • Includes runbook links

Mobile (Pushover)

  • Critical alerts for immediate attention
  • Push notifications to phone
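For the mobile channel, Pushover's messages API is a single HTTPS POST. A sketch of a reusable sender function; the endpoint is Pushover's real API, while the token/user environment variables are yours to supply:

```shell
#!/bin/sh
# Minimal Pushover sender for critical alerts.
# Expects PUSHOVER_TOKEN and PUSHOVER_USER in the environment.
notify_pushover() {
  curl -s \
    --form-string "token=${PUSHOVER_TOKEN}" \
    --form-string "user=${PUSHOVER_USER}" \
    --form-string "priority=1" \
    --form-string "message=$1" \
    https://api.pushover.net/1/messages.json
}
# Usage: notify_pushover "CRITICAL: TrueNAS pool degraded"
```

Setting `priority=1` makes the notification high-priority (bypasses quiet hours), which matches the "critical alerts only" policy above.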


Dashboard Organization

Checkmk Views

Infrastructure Overview

  • All hosts status
  • Service problems
  • Performance graphs
  • Downtime schedule

Docker Monitoring

  • Container status
  • Resource usage
  • Image versions
  • Network stats

Storage Health

  • Pool capacity
  • SMART status
  • Replication status
  • Backup job status

Graylog Streams

Application Logs

Stream: Docker Containers
Rules: source_type:docker
Dashboard: Container logs by service

Security Logs

Stream: Authentication
Rules: facility:auth OR facility:authpriv
Dashboard: Failed logins, sudo usage

Error Tracking

Stream: Errors
Rules: level:<=3 (syslog severity 3 = error; lower numbers are more severe)
Dashboard: Error count by source


Retention Policies

Checkmk

  • RRD Metrics: 400 days at decreasing granularity
  • Events: 30 days in database
  • Availability: 1 year

Graylog

  • Hot Storage: 7 days (fast SSD)
  • Warm Storage: 30 days (HDD)
  • Cold Archive: 90 days (compressed)
  • Deletion: After 90 days

Netdata

  • RAM: 1 hour at 1-second granularity
  • Disk: 7 days at 10-second granularity

Performance Monitoring

Key Metrics

System Metrics

  • CPU utilization and load average
  • Memory usage and swap
  • Disk I/O and latency
  • Network throughput and errors

Service Metrics

  • Response time (p50, p95, p99)
  • Request rate
  • Error rate
  • Saturation (queue depth)

Application Metrics

  • Database query times
  • Cache hit rates
  • API endpoint performance
  • Background job duration
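The percentiles above (p50/p95/p99) can be spot-checked from the command line. This quick-and-dirty p95 uses the nearest-rank method over a hard-coded sample of response times in milliseconds (illustrative data, not real measurements); monitoring tools typically interpolate, so their values can differ slightly:

```shell
#!/bin/sh
# Nearest-rank p95 over a sample of response times (ms) with sort/awk.
P95=$(printf '%s\n' 120 80 95 300 150 110 90 2500 130 100 |
  sort -n |
  awk '{a[NR]=$1} END {r=int(NR*0.95); if (r<1) r=1; print a[r]}')
echo "p95=${P95}ms"
```

Note how one 2500 ms outlier barely moves the p95 while it would wreck the mean, which is why latency alerting keys on percentiles.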


Log Correlation

Example: Troubleshooting Service Outage

  1. Checkmk Alert: "Jellyfin HTTP check failed"
  2. Graylog Search: Filter logs from Jellyfin container around alert time
  3. Error Found: "Database connection timeout"
  4. Root Cause: PostgreSQL container restarted (OOM)
  5. Resolution: Increase PostgreSQL memory limit

Graylog Query Examples

# Find all errors (syslog severity 3); set the search time range
# selector to "last 1 hour" rather than filtering on timestamp
level:3

# Docker container restarts
message:"Container died" OR message:"Started container"

# Failed authentication attempts (leading wildcards are disabled by default)
facility:auth AND message:failed

# High response times (ms)
application:nginx AND response_time:>=5000

Monitoring Best Practices

  1. Start Simple – Monitor essentials first (up/down, disk, CPU)
  2. Alert on Symptoms – Alert on user-facing issues, not every metric
  3. Reduce Noise – Tune thresholds to minimize false positives
  4. Document Runbooks – Link alerts to resolution steps
  5. Test Alerts – Verify notifications work before relying on them
  6. Review Regularly – Adjust thresholds based on normal patterns
  7. Correlate Data – Use multiple systems for context
  8. Retention Planning – Balance storage cost vs. historical value

Integration Examples

Checkmk → Discord Webhook

#!/bin/bash
# Checkmk notification script: post service alerts to a Discord webhook.
# Checkmk exports alert details as NOTIFY_* environment variables.
curl -X POST "$DISCORD_WEBHOOK" \
  -H "Content-Type: application/json" \
  -d "{
    \"content\": \"🚨 **$NOTIFY_HOSTNAME** - $NOTIFY_SERVICEDESC is $NOTIFY_SERVICESTATE\",
    \"embeds\": [{
      \"title\": \"Service Alert\",
      \"description\": \"$NOTIFY_SERVICEOUTPUT\",
      \"color\": 15158332
    }]
  }"

Graylog → Slack Alert

{
  "type": "slack",
  "configuration": {
    "webhook_url": "https://hooks.slack.com/services/xxx",
    "channel": "#alerts",
    "custom_message": "${alert_definition.title}\n${foreach backlog message}${message.message}\n${end}"
  }
}

Disaster Recovery

Monitoring System Failure

If Checkmk is Down:

  1. Check server status via Proxmox
  2. Review container logs
  3. Verify database connectivity
  4. Restore from backup if needed

If Graylog is Down:

  1. Logs remain buffered at the source (rsyslog)
  2. Check Elasticsearch/MongoDB status
  3. Verify storage availability
  4. Restore indices if corrupted
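The "buffered at the source" step works because rsyslog can spool to a disk-assisted queue while Graylog is unreachable, then replay on reconnect. A sketch of that forwarding action (the target host, port, and queue sizes are assumptions to adapt):

```
# /etc/rsyslog.d/50-graylog.conf -- forward logs to Graylog with a
# disk-assisted queue so messages survive a Graylog outage
action(type="omfwd" target="graylog.lan" port="5140" protocol="tcp"
       queue.type="LinkedList" queue.filename="graylog_fwd"
       queue.maxDiskSpace="1g" queue.saveOnShutdown="on"
       action.resumeRetryCount="-1")
```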

Backup Strategy

  • Checkmk Config: Daily exports via API
  • Graylog Config: MongoDB backups included in TrueNAS snapshots
  • Dashboards: Exported to Git repository