Monitoring & Observability¶
A comprehensive monitoring stack provides visibility into infrastructure, services, and applications across the entire homelab, using a multi-layered approach that combines system monitoring, log aggregation, and real-time metrics.
System Monitoring¶
✅ Checkmk¶
Enterprise-grade monitoring platform tracking infrastructure health, service availability, and performance metrics across all systems.
Features:
- Agent-based and agentless monitoring
- Auto-discovery of services
- Flexible alerting rules
- Performance graphing
- Distributed monitoring
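Beyond the built-in checks, the agent can be extended with local checks: small executables dropped into /usr/lib/check_mk_agent/local/ that print one status line per service. The sketch below is a hypothetical example (marker path, threshold, and service name are made up) that reports the age of a backup marker file.

#!/bin/bash
# Checkmk local check: each output line is "<status> <service_name> <perfdata> <detail>"
# with status 0=OK, 1=WARN, 2=CRIT, 3=UNKNOWN. Drop this script (executable) into
# /usr/lib/check_mk_agent/local/ on the monitored host.
MARKER="/var/lib/backup/last_success"   # hypothetical file touched by the backup job
WARN_AGE=$((26 * 3600))                 # warn if the last backup is older than 26 hours

if [ ! -f "$MARKER" ]; then
    echo "2 Backup_Freshness - no backup marker found at $MARKER"
    exit 0
fi

AGE=$(( $(date +%s) - $(stat -c %Y "$MARKER") ))
if [ "$AGE" -gt "$WARN_AGE" ]; then
    echo "1 Backup_Freshness age=$AGE last successful backup was ${AGE}s ago"
else
    echo "0 Backup_Freshness age=$AGE last successful backup was ${AGE}s ago"
fi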
Log Management¶
📋 Graylog¶
Centralized log aggregation and analysis platform. Collects, indexes, and analyzes logs from all infrastructure components and applications.
Features:
- Real-time log streaming
- Full-text search
- Alert correlations
- Dashboard creation
- Multi-source ingestion
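One low-friction way to feed container logs into Graylog is Docker's gelf logging driver pointed at a GELF UDP input. The address and tag below are assumptions and must match a GELF input configured in Graylog.

# Ship a container's stdout/stderr straight to a Graylog GELF input
# (graylog.lan:12201 and the tag are placeholders)
docker run -d --name demo-app \
  --log-driver gelf \
  --log-opt gelf-address=udp://graylog.lan:12201 \
  --log-opt tag="demo-app" \
  nginx:alpine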
Real-Time Metrics¶
📊 Netdata¶
Lightweight, real-time system monitoring with per-second metrics and zero-configuration deployment. Primarily used on VPS for public-facing services.
Features:
- Real-time metrics (1-second granularity)
- Zero-configuration auto-detection
- Thousands of metrics collected
- Minimal resource usage
- Web-based dashboards
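A minimal containerized deployment looks roughly like the sketch below; the official image documentation lists additional recommended mounts and capabilities, so treat this as a starting point rather than the full setup.

# Minimal Netdata container; mounting the Docker socket (read-only) lets it
# resolve container names for per-container metrics
docker run -d --name netdata \
  -p 19999:19999 \
  -v /proc:/host/proc:ro \
  -v /sys:/host/sys:ro \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  --restart unless-stopped \
  netdata/netdata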
Container Monitoring¶
🔄 WUD (What's up Docker)¶
Container update tracking and notification system. Monitors Docker images for new versions and alerts when updates are available.
Features:
- Multi-registry support
- Update notifications
- Webhook integrations
- Per-container tracking
- Scheduled checking
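WUD itself runs as a container with read access to the Docker socket. A bare-bones run command looks like this (image name and published port are assumptions; registries and notification triggers are configured through WUD_* environment variables).

# Bare-bones WUD deployment watching the local Docker daemon
docker run -d --name wud \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -p 3000:3000 \
  getwud/wud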
⏱️ Uptime Kuma¶
Service uptime monitoring with status pages and multi-channel notifications.
Features:
- HTTP/HTTPS monitoring
- TCP port checks
- Ping monitoring
- Status pages
- Multi-notification support
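Uptime Kuma is a single container with a persistent data volume; the run command below follows the project's documented quick start (port and volume name can be adjusted).

# Uptime Kuma with persistent monitor configuration in a named volume
docker run -d --restart=always \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  --name uptime-kuma \
  louislam/uptime-kuma:1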
Monitoring Architecture¶
Data Flow¶
Hosts/Services
        ↓
Checkmk Agents     → Checkmk Server → Alerts/Dashboards
Graylog Inputs     → Graylog Server → Indexed Logs
Netdata Collectors → Netdata Dashboards
        ↓
Alert Channels (Email, Discord, Slack)
Monitoring Layers¶
Infrastructure Layer
- Proxmox nodes (CPU, RAM, storage)
- TrueNAS (pool health, disk SMART)
- Network devices (switches, APs)
- UPS status and battery

Application Layer
- Docker containers
- Kubernetes pods
- Database performance
- Web service response times

Log Layer
- System logs (syslog)
- Application logs (stdout/stderr)
- Security logs (auth, firewall)
- Audit trails
Alert Strategy¶
Severity Levels¶
Critical (Immediate Action)
- Service down
- Disk failure
- Out of disk space
- Backup failure

Warning (Next Business Day)
- High CPU/memory usage
- Disk > 80% full
- Certificate expiring soon
- Failed login attempts

Info (Informational)
- Service restarts
- Configuration changes
- Update available
Notification Channels¶
Discord
- Primary notification channel
- Separate channels per severity
- Rich embeds with context

Email
- Critical alerts only
- Sent to admin email
- Includes runbook links

Mobile (Pushover)
- Critical alerts for immediate attention
- Push notifications to phone
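The Pushover channel is just an HTTPS POST to the Pushover message API. A critical alert with emergency priority looks roughly like this (the token and user key variable names are placeholders; priority 2 requires retry and expire values).

# Send an emergency-priority push notification via Pushover
# (PUSHOVER_APP_TOKEN / PUSHOVER_USER_KEY are placeholder variable names)
curl -s \
  --form-string "token=$PUSHOVER_APP_TOKEN" \
  --form-string "user=$PUSHOVER_USER_KEY" \
  --form-string "priority=2" \
  --form-string "retry=60" \
  --form-string "expire=3600" \
  --form-string "message=CRITICAL: backup job failed on TrueNAS" \
  https://api.pushover.net/1/messages.json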
Dashboard Organization¶
Checkmk Views¶
Infrastructure Overview
- All hosts status
- Service problems
- Performance graphs
- Downtime schedule

Docker Monitoring
- Container status
- Resource usage
- Image versions
- Network stats

Storage Health
- Pool capacity
- SMART status
- Replication status
- Backup job status
Graylog Streams¶
Application Logs
Stream: Docker Containers
Rules: source_type:docker
Dashboard: Container logs by service
Security Logs
Stream: Authentication
Rules: facility:auth OR facility:authpriv
Dashboard: Failed logins, sudo usage
Error Tracking
Stream: Errors
Rules: level:<=3 (syslog severity: error or worse)
Dashboard: Error count by source
Retention Policies¶
Checkmk¶
- RRD Metrics: 400 days at decreasing granularity
- Events: 30 days in database
- Availability: 1 year
Graylog¶
- Hot Storage: 7 days (fast SSD)
- Warm Storage: 30 days (HDD)
- Cold Archive: 90 days (compressed)
- Deletion: After 90 days
Netdata¶
- RAM: 1 hour at 1-second granularity
- Disk: 7 days at 10-second granularity
Performance Monitoring¶
Key Metrics¶
System Metrics
- CPU utilization and load average
- Memory usage and swap
- Disk I/O and latency
- Network throughput and errors

Service Metrics
- Response time (p50, p95, p99) – see the curl example below
- Request rate
- Error rate
- Saturation (queue depth)

Application Metrics
- Database query times
- Cache hit rates
- API endpoint performance
- Background job duration
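Response times reported by the dashboards can be spot-checked from any shell with curl's timing variables; the URL below is only an example.

# One-off latency sample: connect time, time to first byte, total time
curl -o /dev/null -s \
  -w 'connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  https://jellyfin.example.lan/health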
Log Correlation¶
Example: Troubleshooting Service Outage¶
1. Checkmk Alert: "Jellyfin HTTP check failed"
2. Graylog Search: Filter logs from the Jellyfin container around the alert time (see the docker logs sketch below)
3. Error Found: "Database connection timeout"
4. Root Cause: PostgreSQL container restarted (OOM)
5. Resolution: Increase PostgreSQL memory limit
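When Graylog is unavailable or the stream is noisy, the same correlation can be done directly against the container logs; the container names and time windows below are illustrative.

# Pull logs from the affected containers around the alert window
docker logs --since 15m jellyfin 2>&1 | grep -iE "timeout|error"
docker logs --since 15m postgres 2>&1 | grep -i "out of memory"
# Confirm an OOM kill on the host
dmesg -T | grep -i "killed process"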
Graylog Query Examples¶
# Find all errors in last hour
level:error AND timestamp:[now-1h TO now]
# Docker container restarts
message:"Container died" OR message:"Started container"
# Failed authentication attempts
facility:auth AND message:*failed*
# High response times
application:nginx AND response_time:>=5000
Monitoring Best Practices¶
- Start Simple – Monitor essentials first (up/down, disk, CPU)
- Alert on Symptoms – Alert on user-facing issues, not every metric
- Reduce Noise – Tune thresholds to minimize false positives
- Document Runbooks – Link alerts to resolution steps
- Test Alerts – Verify notifications work before relying on them
- Review Regularly – Adjust thresholds based on normal patterns
- Correlate Data – Use multiple systems for context
- Retention Planning – Balance storage cost vs. historical value
Integration Examples¶
Checkmk → Discord Webhook¶
#!/bin/bash
# Checkmk notification script: post service alerts to a Discord webhook.
# Checkmk exports alert details as NOTIFY_* environment variables.
curl -X POST "$DISCORD_WEBHOOK" \
  -H "Content-Type: application/json" \
  -d "{
    \"content\": \"🚨 **$NOTIFY_HOSTNAME** - $NOTIFY_SERVICEDESC is $NOTIFY_SERVICESTATE\",
    \"embeds\": [{
      \"title\": \"Service Alert\",
      \"description\": \"$NOTIFY_SERVICEOUTPUT\",
      \"color\": 15158332
    }]
  }"
Graylog → Slack Alert¶
{
  "type": "slack",
  "configuration": {
    "webhook_url": "https://hooks.slack.com/services/xxx",
    "channel": "#alerts",
    "custom_message": "${alert_definition.title}\n${foreach backlog message}${message.message}\n${end}"
  }
}
Disaster Recovery¶
Monitoring System Failure¶
If Checkmk is Down:
1. Check server status via Proxmox
2. Review container logs
3. Verify database connectivity
4. Restore from backup if needed

If Graylog is Down:
1. Logs are still buffered at source by rsyslog (see the queue sketch below)
2. Check Elasticsearch/MongoDB status
3. Verify storage availability
4. Restore indices if corrupted
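The at-source buffering relies on rsyslog using a disk-assisted queue for the forwarding action, so messages are spooled locally and retried while Graylog is down. A sketch of such a drop-in config (target host, port, and queue sizing are assumptions):

# Forward syslog to Graylog over TCP with a disk-assisted queue
# (graylog.lan:5140 must match the Graylog syslog input)
cat > /etc/rsyslog.d/30-graylog.conf <<'EOF'
action(type="omfwd" target="graylog.lan" port="5140" protocol="tcp"
       queue.type="LinkedList" queue.filename="graylog_fwd"
       queue.maxDiskSpace="1g" queue.saveOnShutdown="on"
       action.resumeRetryCount="-1")
EOF
systemctl restart rsyslog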
Backup Strategy¶
- Checkmk Config: Daily exports via API
- Graylog Config: MongoDB backups included in TrueNAS snapshots
- Dashboards: Exported to Git repository
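If the API export ever proves fragile, a whole-site backup via omd is a simple fallback; a cron entry along these lines (site name, schedule, and target path are placeholders) captures configuration and historical data in one archive.

# /etc/cron.d/checkmk-backup: nightly full-site backup with omd
# ("monitoring" is the placeholder site name; % must be escaped in cron)
0 3 * * * root omd backup monitoring /mnt/backups/checkmk/monitoring-$(date +\%F).tar.gz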