Server Monitoring with AI: Smart Alerts
Your server runs 24/7. But who is really monitoring it? Classic monitoring tools send alerts when CPU exceeds 90%... even if it's just an apt update running. Result: you receive 15 notifications a day, ignore them all, and when a real problem occurs, you miss it.
AI-powered monitoring changes the game entirely. Instead of dumb thresholds, an intelligent agent understands context, distinguishes a normal spike from a real anomaly, and only alerts you when it truly matters.
In this guide, we'll set up an intelligent monitoring system with OpenClaw that monitors your server and sends relevant Telegram alerts—not spam.
🔍 Classic Monitoring vs AI Monitoring
The Problem with Fixed Thresholds
Traditional monitoring operates with simple rules:
IF cpu > 90% THEN alert
IF ram > 85% THEN alert
IF disk > 80% THEN alert
This seems logical. In practice, it's unusable:
| Situation | Classic Threshold | AI Monitoring |
|---|---|---|
| apt upgrade running | 🔴 ALERT: CPU 95%! | ✅ Known process, ignored |
| Nightly backup | 🔴 ALERT: Disk I/O! | ✅ Scheduled task, normal |
| CPU spike at 3 AM for no reason | ⚠️ Alert only if above threshold | 🔴 ALERT: Abnormal activity |
| RAM slowly rising over 3 days | ❌ Never detected (below threshold) | 🔴 ALERT: Likely memory leak |
| Service restarting 5 times in 1h | ❌ No rule for this | 🔴 ALERT: Service instability |
What AI Brings
AI doesn't replace metric collection—it replaces human interpretation. An AI agent can:
- Understand context: A CPU spike during a cron job is normal
- Detect patterns: "This service restarted 3 times in 2h, which never happened before"
- Correlate metrics: RAM + CPU + I/O rising together = different problem than CPU alone
- Adjust sensitivity: Alert faster at night (no one should be using the server)
- Summarize intelligently: Instead of 10 alerts, one contextual message
📊 Essential Metrics to Monitor
Before configuring the agent, let's define what we monitor and why.
CPU
# Instant CPU usage (user %; the field position can vary across top versions)
top -bn1 | grep "Cpu(s)" | awk '{print $2}'
# Load average (1, 5, 15 minutes)
cat /proc/loadavg
# Number of cores (to interpret load)
nproc
What the AI needs to know: A load average of 4.0 on a 4-core machine = 100% used. On an 8-core machine = 50%.
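The per-core arithmetic is easy to delegate to the agent, but you can also make it explicit. A minimal sketch (`load_per_core` is a hypothetical helper, not part of the collection script):

```shell
#!/bin/bash
# Hypothetical helper: express a load average as a percentage of total core capacity
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.0f%%\n", load * 100 / cores }'
}

# With live values:
# load_per_core "$(awk '{print $1}' /proc/loadavg)" "$(nproc)"
load_per_core 4.0 4   # 4-core machine at full capacity -> 100%
load_per_core 4.0 8   # the same load is only half capacity on 8 cores
```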
RAM
# Used/available memory
free -h
# Detail with buffers/cache
free -m | awk 'NR==2{printf "Used: %sMB / %sMB (%.1f%%)\n", $3, $2, $3*100/$2}'
Classic trap: Linux uses free RAM as disk cache. 90% "used" RAM isn't necessarily a problem if 40% is cache.
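That is why the agent should reason on the "available" column, not "used". A small sketch of that rule (`mem_pressure` is a hypothetical helper; the thresholds are illustrative):

```shell
#!/bin/bash
# Hypothetical helper: judge memory pressure from "available" MB, not "used" MB,
# since Linux counts reclaimable cache as used. Thresholds are illustrative.
mem_pressure() {
  awk -v total="$1" -v avail="$2" 'BEGIN {
    pct = avail * 100 / total
    if (pct < 10)      print "CRITICAL"
    else if (pct < 25) print "WARN"
    else               print "OK"
  }'
}

# With live values (column 7 of "free -m" is "available"):
# read -r TOTAL AVAIL <<< "$(free -m | awk 'NR==2{print $2, $7}')"
# mem_pressure "$TOTAL" "$AVAIL"
mem_pressure 8000 3200   # 40% still available -> OK even if "used" looks high
```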
Disk
# Disk space
df -h /
# Inodes (often forgotten, can block even with free space)
df -i /
# Current I/O
iostat -x 1 3 2>/dev/null || echo "iostat not available"
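Space and inodes can share one rule: alert as soon as either crosses the threshold. A hedged sketch (`disk_check` is a hypothetical helper; it takes integer percentages):

```shell
#!/bin/bash
# Hypothetical helper: alert when either disk space or inode usage crosses
# a threshold (default 85%). Takes integer percentages without the "%" sign.
disk_check() {
  local space="$1" inodes="$2" limit="${3:-85}"
  if [ "$space" -ge "$limit" ] || [ "$inodes" -ge "$limit" ]; then
    echo "ALERT"
  else
    echo "OK"
  fi
}

# With live values ($5+0 strips the trailing "%"):
# disk_check "$(df / | awk 'NR==2{print $5+0}')" "$(df -i / | awk 'NR==2{print $5+0}')"
disk_check 42 12   # both fine
disk_check 40 91   # inodes exhausted even though space is fine
```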
Services and Uptime
# Critical services
systemctl is-active nginx postgresql docker openclaw
# Server uptime
uptime -p
# Listening ports
ss -tlnp | grep -E ':(80|443|5432|8080|3000)'
Recent Logs
# Recent system errors
journalctl -p err -n 20 --no-pager
# Failed SSH attempts (the unit is "ssh" rather than "sshd" on Debian/Ubuntu)
journalctl -u sshd -n 50 --no-pager | grep -i "failed\|invalid"
# OOM killer (processes killed due to lack of RAM)
dmesg | grep -i "oom\|killed process" | tail -5
🛠️ Metric Collection Script
Let's create a script that collects everything and produces a structured report for the AI agent to interpret.
#!/bin/bash
# /root/scripts/server-health.sh
# Collects server metrics for AI analysis
echo "=== SERVER HEALTH REPORT ==="
echo "Date: $(date -u +'%Y-%m-%d %H:%M UTC')"
echo "Hostname: $(hostname)"
echo "Uptime: $(uptime -p)"
echo ""
# CPU
echo "--- CPU ---"
echo "Load average: $(cat /proc/loadavg | awk '{print $1, $2, $3}')"
echo "Cores: $(nproc)"
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}')  # user CPU %
echo "CPU usage: ${CPU_USAGE}%"
echo ""
# RAM
echo "--- MEMORY ---"
free -m | awk 'NR==2{printf "Total: %sMB\nUsed: %sMB\nFree: %sMB\nBuffers/Cache: %sMB\nAvailable: %sMB\nUsage: %.1f%%\n", $2, $3, $4, $6, $7, $3*100/$2}'
echo ""
# Disk
echo "--- DISK ---"
df -h / | tail -1 | awk '{printf "Total: %s\nUsed: %s\nAvailable: %s\nUsage: %s\n", $2, $3, $4, $5}'
echo "Inodes: $(df -i / | tail -1 | awk '{print $5}') used"
echo ""
# Services
echo "--- SERVICES ---"
for svc in nginx postgresql docker; do
  # is-active prints a status even when it exits non-zero, so no || fallback
  STATUS=$(systemctl is-active "$svc" 2>/dev/null)
  echo "$svc: ${STATUS:-not-found}"
done
echo ""
# Ports
echo "--- LISTENING PORTS ---"
ss -tlnp 2>/dev/null | grep -E 'LISTEN' | awk '{print $4}' | sort -u
echo ""
# Recent errors
echo "--- RECENT ERRORS (last 30 min) ---"
journalctl -p err --since "30 min ago" --no-pager -q 2>/dev/null | tail -10
echo ""
# Security
echo "--- SECURITY ---"
# grep -c already prints 0 on no match, so no || fallback (it would print "0" twice)
FAILED_SSH=$(journalctl -u sshd --since "1 hour ago" --no-pager 2>/dev/null | grep -ci "failed\|invalid")
echo "Failed SSH attempts (1h): $FAILED_SSH"
# Docker (if available)
if command -v docker &>/dev/null; then
  echo ""
  echo "--- DOCKER ---"
  docker ps --format "{{.Names}}: {{.Status}}" 2>/dev/null
  STOPPED=$(docker ps -a --filter "status=exited" --format "{{.Names}}" 2>/dev/null | wc -l)
  echo "Stopped containers: $STOPPED"
fi
Make it executable:
chmod +x /root/scripts/server-health.sh
Test it:
/root/scripts/server-health.sh
You should see a complete structured report. This is what the AI agent will interpret.
⚙️ OpenClaw Configuration: Heartbeat + Cron
OpenClaw offers two mechanisms for periodic tasks: heartbeat and cron jobs. For monitoring, we'll use both.
Heartbeat: Light Continuous Monitoring
OpenClaw's heartbeat runs at regular intervals (configurable). Perfect for quick checks.
In your OpenClaw configuration, add to the heartbeat:
# In your HEARTBEAT.md or heartbeat configuration
## Server Monitoring
- Check server health via /root/scripts/server-health.sh
- If anomaly detected, alert via Telegram
- DO NOT alert for: short CPU spikes (<5min), RAM usage <85%, known cron jobs
Cron Job: Scheduled Detailed Report
For a more in-depth report (trend analysis, comparison with previous day), configure an OpenClaw cron:
# Health report every 6 hours
0 */6 * * * Execute /root/scripts/server-health.sh, analyze results, compare with previous reports. If anomaly, send Telegram summary. Otherwise, log silently.
Store History
For the AI to compare, store the reports:
# Create history directory
mkdir -p /root/monitoring/history
# Wrapper script that stores + analyzes
#!/bin/bash
# /root/scripts/monitor-and-store.sh
REPORT_DIR="/root/monitoring/history"
TIMESTAMP=$(date +%Y%m%d_%H%M)
REPORT_FILE="$REPORT_DIR/health_${TIMESTAMP}.txt"
# Collect
/root/scripts/server-health.sh > "$REPORT_FILE"
# Keep only last 7 days
find "$REPORT_DIR" -name "health_*.txt" -mtime +7 -delete
# Display for agent
cat "$REPORT_FILE"
The agent can then compare the current report with previous ones:
# Compare disk usage with yesterday
grep "Usage:" /root/monitoring/history/health_$(date -d yesterday +%Y%m%d)*.txt
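If you want the delta itself rather than raw lines, a tiny helper can diff the two "NN%" values (`usage_delta` is hypothetical; the commented lines show one possible way to pick the two newest reports):

```shell
#!/bin/bash
# Hypothetical helper: signed difference between two "NN%" values,
# e.g. yesterday's vs today's disk usage.
usage_delta() {
  awk -v prev="${1//%/}" -v curr="${2//%/}" 'BEGIN { printf "%+d%%\n", curr - prev }'
}

# One possible way to feed it from the two newest stored reports:
# FILES=($(ls -t /root/monitoring/history/health_*.txt | head -2))
# TODAY=$(grep -m1 "^Usage:" "${FILES[0]}" | awk '{print $2}')
# YESTERDAY=$(grep -m1 "^Usage:" "${FILES[1]}" | awk '{print $2}')
# usage_delta "$YESTERDAY" "$TODAY"
usage_delta 42% 45%   # prints +3%
```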
🤖 The Intelligent Monitoring Agent
Here's the core of the system: the prompt that transforms raw metrics into smart analysis.
Prompt for OpenClaw Heartbeat
## Task: Intelligent Server Monitoring
Execute `/root/scripts/server-health.sh` and analyze the results.
### Alert Rules
**DO NOT alert if:**
- CPU < 85% (normal usage)
- RAM < 80% (including buffers/cache)
- Disk < 75%
- Load average < number_of_cores × 0.8
- Resource-intensive processes are known tasks (apt, backup, cron)
**ALERT if:**
- A critical service is down (nginx, postgresql, docker)
- CPU > 90% for more than 5 minutes without identifiable reason
- Available RAM < 500MB
- Disk > 85% or inodes > 80%
- More than 50 failed SSH attempts in 1h
- A Docker container is stopped when it should be running
- OOM errors in logs
### Telegram Alert Format
If alert needed, send ONE structured message:
🚨 **Server Alert [hostname]**
**Problem:** [short description]
**Detail:** [contextual explanation]
**Impact:** [what's affected]
**Suggested Action:** [what to do]
### If Everything is Fine
Send nothing. Log silently.
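OpenClaw can deliver the Telegram message itself, but if you prefer to send it directly, the Bot API only needs a token and a chat ID (`format_alert` and `send_telegram` are illustrative helpers; create the bot with @BotFather first):

```shell
#!/bin/bash
# Illustrative helpers: build the structured message, then post it through
# the Telegram Bot API. BOT_TOKEN and CHAT_ID come from a bot you create
# with @BotFather.
format_alert() {
  printf '🚨 *Server Alert %s*\n*Problem:* %s\n*Detail:* %s\n' "$1" "$2" "$3"
}

send_telegram() {
  curl -s "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
    -d chat_id="${CHAT_ID}" \
    -d parse_mode="Markdown" \
    --data-urlencode text="$1" >/dev/null
}

MSG=$(format_alert "web-01" "Disk at 87%" "+5% in 24h, /var/log is growing")
echo "$MSG"
# send_telegram "$MSG"   # uncomment once BOT_TOKEN and CHAT_ID are set
```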
Concrete Analysis Example
Imagine this report:
--- CPU ---
Load average: 3.8 2.1 1.5
Cores: 4
CPU usage: 92%
--- MEMORY ---
Usage: 78.5%
--- SERVICES ---
nginx: active
postgresql: active
docker: active
Classic Monitoring: 🔴 ALERT: CPU 92%!
AI Monitoring:
- Load average 3.8 on 4 cores = 95% load
- BUT 5min load is 2.1 and 15min is 1.5
- → The spike is recent and transient (likely a deployment)
- RAM and services OK
- → No alert, monitor in next cycle
📱 Telegram Alerts: Quality vs Quantity
The Alert Spam Problem
Any monitoring system that sends more than 2-3 alerts per day on average ends up ignored. This is "alert fatigue"—a real problem in ops.
Anti-Spam Strategy
| Technique | How | Why |
|---|---|---|
| Cooldown | No more than 1 alert per hour for the same issue | Avoid bursts |
| Aggregation | Group related issues into 1 message | Fewer notifications |
| Escalation | 1st occurrence = log, 2nd = alert, 3rd = urgent alert | Filter false positives |
| Auto-resolution | Send "✅ Resolved" when issue disappears | Reduce stress |
| Daily Digest | Daily summary even if everything is fine | Confirm it's working |
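Of these techniques, aggregation is the simplest to implement: collect the issues found during one check cycle, then emit a single message (`aggregate_alerts` is a hypothetical helper):

```shell
#!/bin/bash
# Hypothetical helper: collapse all issues from one check cycle into a
# single message instead of one notification each.
aggregate_alerts() {
  local issues=("$@")
  [ "${#issues[@]}" -eq 0 ] && return 0
  echo "🚨 ${#issues[@]} issue(s) detected:"
  printf -- '- %s\n' "${issues[@]}"
}

aggregate_alerts "Disk at 87% (+3% in 24h)" "nginx restarted twice in 1h"
```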
Implementing Cooldown
#!/bin/bash
# /root/scripts/alert-cooldown.sh
# Checks if an alert was recently sent
ALERT_TYPE="$1" # e.g., "cpu_high", "disk_full"
COOLDOWN_DIR="/root/monitoring/cooldowns"
COOLDOWN_SECONDS=3600 # 1 hour
mkdir -p "$COOLDOWN_DIR"
LAST_ALERT_FILE="$COOLDOWN_DIR/$ALERT_TYPE"
if [ -f "$LAST_ALERT_FILE" ]; then
  LAST_TIME=$(cat "$LAST_ALERT_FILE")
  NOW=$(date +%s)
  DIFF=$((NOW - LAST_TIME))
  if [ "$DIFF" -lt "$COOLDOWN_SECONDS" ]; then
    echo "COOLDOWN_ACTIVE"
    exit 1
  fi
fi
# Record this alert
date +%s > "$LAST_ALERT_FILE"
echo "ALERT_OK"
exit 0
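Escalation can follow the same pattern as the cooldown script, with a counter per alert type instead of a timestamp. A sketch (names and paths are illustrative; use a persistent directory in production):

```shell
#!/bin/bash
# Sketch of the escalation rule: 1st occurrence -> LOG, 2nd -> ALERT,
# 3rd and beyond -> URGENT; "resolve" clears the counter. Use a persistent
# directory (e.g. /root/monitoring/escalation) in production.
COUNT_DIR="${COUNT_DIR:-$(mktemp -d)}"

escalate() {
  local type="$1" action="${2:-trigger}"
  local file="$COUNT_DIR/$type"
  if [ "$action" = "resolve" ]; then
    rm -f "$file"
    echo "RESOLVED"
    return 0
  fi
  local count=$(( $(cat "$file" 2>/dev/null || echo 0) + 1 ))
  echo "$count" > "$file"
  case "$count" in
    1) echo "LOG" ;;
    2) echo "ALERT" ;;
    *) echo "URGENT" ;;
  esac
}

escalate cpu_high           # LOG
escalate cpu_high           # ALERT
escalate cpu_high           # URGENT
escalate cpu_high resolve   # RESOLVED
```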
The Daily Digest
Even when everything is fine, a daily message reassures:
📊 **Daily Server Report**
📅 2025-01-15
✅ All services operational
💻 Avg CPU: 23% | Max: 67%
🧠 RAM: 4.2GB / 8GB (52%)
💾 Disk: 34GB / 80GB (42%)
🔒 12 failed SSH attempts
🐳 5 active Docker containers
Next check in 24h.
This is the kind of message you actually read. And when it says "⚠️ Disk at 78%, +3% in 24h", you act.
🔧 Advanced Configuration
Monitoring Specific Services
Adapt the script for your services:
# Add to server-health.sh script
# Check if a site responds
echo "--- WEB CHECK ---"
HTTP_CODE=$(curl -s -o /dev/null --max-time 5 -w "%{http_code}" http://localhost:80 2>/dev/null)
echo "Nginx HTTP: $HTTP_CODE"
# Check database size
echo "--- DATABASE ---"
if command -v psql &>/dev/null; then
  # adjust the user/authentication to your setup; 'main' is the database name
  DB_SIZE=$(psql -U postgres -t -c "SELECT pg_size_pretty(pg_database_size('main'));" 2>/dev/null)
  echo "DB size: $DB_SIZE"
fi
# Check SSL certificates
echo "--- SSL ---"
DOMAIN="your-domain.com"
EXPIRY=$(echo | openssl s_client -servername $DOMAIN -connect $DOMAIN:443 2>/dev/null | openssl x509 -noout -dates 2>/dev/null | grep notAfter | cut -d= -f2)
echo "SSL expiry: $EXPIRY"
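The raw notAfter date is awkward to reason about; converting it to days remaining gives the agent a clean threshold (`days_until` is a hypothetical helper relying on GNU `date -d`):

```shell
#!/bin/bash
# Hypothetical helper: turn an openssl "notAfter" date into days remaining,
# so the agent can alert well before expiry. Relies on GNU "date -d".
days_until() {
  local expiry_epoch now_epoch
  expiry_epoch=$(date -d "$1" +%s) || return 1
  now_epoch=$(date +%s)
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Combined with the EXPIRY variable from the SSL check above:
# DAYS=$(days_until "$EXPIRY")
# [ "$DAYS" -lt 14 ] && echo "ALERT: certificate for $DOMAIN expires in $DAYS days"
days_until "$(date -d '+30 days')"
```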