Server Monitoring with AI: Smart Alerts

Automation 🟡 Intermediate ⏱️ 13 min read 📅 2026-02-24

Your server runs 24/7. But who is really monitoring it? Classic monitoring tools send alerts when CPU exceeds 90%... even if it's just an apt update running. Result: you receive 15 notifications a day, ignore them all, and when a real problem occurs, you miss it.

AI-powered monitoring changes the game entirely. Instead of dumb thresholds, an intelligent agent understands context, distinguishes a normal spike from a real anomaly, and only alerts you when it truly matters.

In this guide, we'll set up an intelligent monitoring system with OpenClaw that monitors your server and sends relevant Telegram alerts—not spam.


🔍 Classic Monitoring vs AI Monitoring

The Problem with Fixed Thresholds

Traditional monitoring operates with simple rules:

IF cpu > 90% THEN alert
IF ram > 85% THEN alert
IF disk > 80% THEN alert

This seems logical. In practice, it's unusable:

| Situation | Classic Threshold | AI Monitoring |
|---|---|---|
| `apt upgrade` running | 🔴 ALERT: CPU 95%! | ✅ Known process, ignored |
| Nightly backup | 🔴 ALERT: Disk I/O! | ✅ Scheduled task, normal |
| CPU spike at 3 AM for no reason | ✅ Alert (if above threshold) | 🔴 ALERT: Abnormal activity |
| RAM slowly rising over 3 days | ❌ Never detected (below threshold) | 🔴 ALERT: Likely memory leak |
| Service restarting 5 times in 1h | ❌ No rule for this | 🔴 ALERT: Service instability |

What AI Brings

AI doesn't replace metric collection—it replaces human interpretation. An AI agent can:

  • Understand context: A CPU spike during a cron job is normal
  • Detect patterns: "This service restarted 3 times in 2h, which never happened before"
  • Correlate metrics: RAM + CPU + I/O rising together = different problem than CPU alone
  • Adjust sensitivity: Alert faster at night (no one should be using the server)
  • Summarize intelligently: Instead of 10 alerts, one contextual message
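The "restarted 3 times" pattern, for instance, is one `journalctl` query away. A minimal sketch (nginx is used as an example unit; adapt the service name and time window):

```shell
# Count service starts in the last 2 hours; more than one
# usually means the unit is crash-looping, not just restarting.
STARTS=$(journalctl -u nginx --since "2 hours ago" --no-pager 2>/dev/null \
    | grep -c "Started" || true)
echo "nginx starts in last 2h: $STARTS"
```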

📊 Essential Metrics to Monitor

Before configuring the agent, let's define what we monitor and why.

CPU

# Instant CPU usage (user time %, from top's summary line)
top -bn1 | grep "Cpu(s)" | awk '{print $2}'

# Load average (1, 5, 15 minutes)
cat /proc/loadavg

# Number of cores (to interpret load)
nproc

What the AI needs to know: A load average of 4.0 on a 4-core machine = 100% used. On an 8-core machine = 50%.
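That normalization is easy to compute directly. A small sketch dividing the 1-minute load by the core count:

```shell
# Normalized load: 1-minute load average divided by core count.
# Around 1.0 the CPU is saturated; above 1.0, work is queuing.
LOAD1=$(awk '{print $1}' /proc/loadavg)
CORES=$(nproc)
awk -v l="$LOAD1" -v c="$CORES" \
    'BEGIN {printf "Normalized load: %.2f (%.0f%% of capacity)\n", l/c, l/c*100}'
```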

RAM

# Used/available memory
free -h

# Detail with buffers/cache
free -m | awk 'NR==2{printf "Used: %sMB / %sMB (%.1f%%)\n", $3, $2, $3*100/$2}'

Classic trap: Linux uses free RAM as disk cache. 90% "used" RAM isn't necessarily a problem if 40% is cache.
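`MemAvailable` in `/proc/meminfo` already accounts for reclaimable cache, so it is the number worth feeding to the agent. A sketch:

```shell
# MemAvailable estimates how much memory new workloads can use
# without swapping (reclaimable page cache counts as available).
awk '/^MemTotal|^MemAvailable/ {printf "%s %d MB\n", $1, $2/1024}' /proc/meminfo
```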

Disk

# Disk space
df -h /

# Inodes (often forgotten, can block even with free space)
df -i /

# Current I/O
iostat -x 1 3 2>/dev/null || echo "iostat not available"

Services and Uptime

# Critical services
systemctl is-active nginx postgresql docker openclaw

# Server uptime
uptime -p

# Listening ports
ss -tlnp | grep -E ':(80|443|5432|8080|3000)'

Recent Logs

# Recent system errors
journalctl -p err -n 20 --no-pager

# Failed SSH attempts
journalctl -u sshd -n 50 --no-pager | grep -i "failed\|invalid"

# OOM killer (processes killed due to lack of RAM)
dmesg | grep -i "oom\|killed process" | tail -5

🛠️ Metric Collection Script

Let's create a script that collects everything and produces a structured report for the AI agent to interpret.

#!/bin/bash
# /root/scripts/server-health.sh
# Collects server metrics for AI analysis

echo "=== SERVER HEALTH REPORT ==="
echo "Date: $(date -u +'%Y-%m-%d %H:%M UTC')"
echo "Hostname: $(hostname)"
echo "Uptime: $(uptime -p)"
echo ""

# CPU
echo "--- CPU ---"
echo "Load average: $(awk '{print $1, $2, $3}' /proc/loadavg)"
echo "Cores: $(nproc)"
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}')
echo "CPU usage: ${CPU_USAGE}%"
echo ""

# RAM
echo "--- MEMORY ---"
free -m | awk 'NR==2{printf "Total: %sMB\nUsed: %sMB\nFree: %sMB\nBuffers/Cache: %sMB\nAvailable: %sMB\nUsage: %.1f%%\n", $2, $3, $4, $6, $7, $3*100/$2}'
echo ""

# Disk
echo "--- DISK ---"
df -h / | tail -1 | awk '{printf "Total: %s\nUsed: %s\nAvailable: %s\nUsage: %s\n", $2, $3, $4, $5}'
echo "Inodes: $(df -i / | tail -1 | awk '{print $5}') used"
echo ""

# Services
echo "--- SERVICES ---"
for svc in nginx postgresql docker; do
    STATUS=$(systemctl is-active $svc 2>/dev/null || echo "not-found")
    echo "$svc: $STATUS"
done
echo ""

# Ports
echo "--- LISTENING PORTS ---"
ss -tlnp 2>/dev/null | grep -E 'LISTEN' | awk '{print $4}' | sort -u
echo ""

# Recent errors
echo "--- RECENT ERRORS (last 30 min) ---"
journalctl -p err --since "30 min ago" --no-pager -q 2>/dev/null | tail -10
echo ""

# Security
echo "--- SECURITY ---"
FAILED_SSH=$(journalctl -u sshd --since "1 hour ago" --no-pager 2>/dev/null | grep -ci "failed\|invalid")
echo "Failed SSH attempts (1h): $FAILED_SSH"

# Docker (if available)
if command -v docker &>/dev/null; then
    echo ""
    echo "--- DOCKER ---"
    docker ps --format "{{.Names}}: {{.Status}}" 2>/dev/null
    STOPPED=$(docker ps -a --filter "status=exited" --format "{{.Names}}" 2>/dev/null | wc -l)
    echo "Stopped containers: $STOPPED"
fi

Make it executable:

chmod +x /root/scripts/server-health.sh

Test it:

/root/scripts/server-health.sh

You should see a complete structured report. This is what the AI agent will interpret.


⚙️ OpenClaw Configuration: Heartbeat + Cron

OpenClaw offers two mechanisms for periodic tasks: heartbeat and cron jobs. For monitoring, we'll use both.

Heartbeat: Light Continuous Monitoring

OpenClaw's heartbeat runs at regular intervals (configurable). Perfect for quick checks.

In your OpenClaw configuration, add to the heartbeat:

# In your HEARTBEAT.md or heartbeat configuration
## Server Monitoring
- Check server health via /root/scripts/server-health.sh
- If anomaly detected, alert via Telegram
- DO NOT alert for: short CPU spikes (<5min), RAM usage <85%, known cron jobs

Cron Job: Scheduled Detailed Report

For a more in-depth report (trend analysis, comparison with previous day), configure an OpenClaw cron:

# Health report every 6 hours
0 */6 * * * Execute /root/scripts/server-health.sh, analyze results, compare with previous reports. If anomaly, send Telegram summary. Otherwise, log silently.

Store History

For the AI to compare, store the reports:

# Create history directory
mkdir -p /root/monitoring/history

# Wrapper script that stores + analyzes
#!/bin/bash
# /root/scripts/monitor-and-store.sh

REPORT_DIR="/root/monitoring/history"
TIMESTAMP=$(date +%Y%m%d_%H%M)
REPORT_FILE="$REPORT_DIR/health_${TIMESTAMP}.txt"

# Collect
/root/scripts/server-health.sh > "$REPORT_FILE"

# Keep only last 7 days
find "$REPORT_DIR" -name "health_*.txt" -mtime +7 -delete

# Display for agent
cat "$REPORT_FILE"

The agent can then compare the current report with previous ones:

# Compare disk usage with yesterday
grep "Usage:" /root/monitoring/history/health_$(date -d yesterday +%Y%m%d)*.txt
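To see a trend rather than a single snapshot, the stored reports can be scanned in order. A sketch (the last `Usage:` line in each report produced by `server-health.sh` is the disk figure):

```shell
# Disk usage over time, one line per stored report.
for f in /root/monitoring/history/health_*.txt; do
    [ -e "$f" ] || continue               # skip if the glob matched nothing
    usage=$(grep "^Usage:" "$f" | tail -1 | awk '{print $2}')
    name=$(basename "$f" .txt)
    echo "${name#health_}: ${usage:-n/a}"
done
```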

🤖 The Intelligent Monitoring Agent

Here's the core of the system: the prompt that transforms raw metrics into smart analysis.

Prompt for OpenClaw Heartbeat

## Task: Intelligent Server Monitoring

Execute `/root/scripts/server-health.sh` and analyze the results.

### Alert Rules

**DO NOT alert if:**
- CPU < 85% (normal usage)
- RAM < 80% (including buffers/cache)
- Disk < 75%
- Load average < number_of_cores × 0.8
- Resource-intensive processes are known tasks (apt, backup, cron)

**ALERT if:**
- A critical service is down (nginx, postgresql, docker)
- CPU > 90% for more than 5 minutes without identifiable reason
- Available RAM < 500MB
- Disk > 85% or inodes > 80%
- More than 50 failed SSH attempts in 1h
- A Docker container is stopped when it should be running
- OOM errors in logs

### Telegram Alert Format

If alert needed, send ONE structured message:

🚨 **Server Alert [hostname]**
**Problem:** [short description]
**Detail:** [contextual explanation]
**Impact:** [what's affected]
**Suggested Action:** [what to do]

### If Everything is Fine
Send nothing. Log silently.

Concrete Analysis Example

Imagine this report:

--- CPU ---
Load average: 3.8 2.1 1.5
Cores: 4
CPU usage: 92%

--- MEMORY ---
Usage: 78.5%

--- SERVICES ---
nginx: active
postgresql: active
docker: active

Classic Monitoring: 🔴 ALERT: CPU 92%!

AI Monitoring:
- Load average 3.8 on 4 cores = 95% load
- BUT 5min load is 2.1 and 15min is 1.5
- → The spike is recent and transient (likely a deployment)
- RAM and services OK
- → No alert, monitor in next cycle


📱 Telegram Alerts: Quality vs Quantity

The Alert Spam Problem

Any monitoring system that sends more than 2-3 alerts per day on average ends up ignored. This is "alert fatigue"—a real problem in ops.

Anti-Spam Strategy

| Technique | How | Why |
|---|---|---|
| Cooldown | No more than 1 alert per hour for the same issue | Avoids bursts |
| Aggregation | Group related issues into 1 message | Fewer notifications |
| Escalation | 1st occurrence = log, 2nd = alert, 3rd = urgent alert | Filters false positives |
| Auto-resolution | Send "✅ Resolved" when the issue disappears | Reduces stress |
| Daily digest | Daily summary even when everything is fine | Confirms monitoring works |
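The escalation row can be implemented with a per-issue counter file, in the same spirit as the cooldown script below. A sketch (the script path is illustrative):

```shell
#!/bin/bash
# /root/scripts/alert-escalate.sh (illustrative path)
# 1st occurrence of an issue = LOG, 2nd = ALERT, 3rd+ = URGENT.
# Usage: alert-escalate.sh <issue_id>

ISSUE="${1:-unknown}"
COUNT_DIR="${COUNT_DIR:-/root/monitoring/escalation}"
mkdir -p "$COUNT_DIR"

COUNT_FILE="$COUNT_DIR/$ISSUE"
COUNT=$(( $(cat "$COUNT_FILE" 2>/dev/null || echo 0) + 1 ))
echo "$COUNT" > "$COUNT_FILE"

case "$COUNT" in
    1) echo "LOG" ;;
    2) echo "ALERT" ;;
    *) echo "URGENT" ;;
esac
```

Reset the counter file when the issue resolves so the next occurrence starts back at LOG.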

Implementing Cooldown

# /root/scripts/alert-cooldown.sh
# Checks if an alert was recently sent

ALERT_TYPE="$1"  # e.g., "cpu_high", "disk_full"
COOLDOWN_DIR="/root/monitoring/cooldowns"
COOLDOWN_SECONDS=3600  # 1 hour

mkdir -p "$COOLDOWN_DIR"

LAST_ALERT_FILE="$COOLDOWN_DIR/$ALERT_TYPE"

if [ -f "$LAST_ALERT_FILE" ]; then
    LAST_TIME=$(cat "$LAST_ALERT_FILE")
    NOW=$(date +%s)
    DIFF=$((NOW - LAST_TIME))

    if [ "$DIFF" -lt "$COOLDOWN_SECONDS" ]; then
        echo "COOLDOWN_ACTIVE"
        exit 1
    fi
fi

# Record this alert
date +%s > "$LAST_ALERT_FILE"
echo "ALERT_OK"
exit 0
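The agent, or any shell pipeline, can then gate delivery on the script's exit code. A sketch (`send_telegram_alert` is a placeholder for your actual delivery command):

```shell
# Only send the alert if the cooldown has expired (exit code 0).
if /root/scripts/alert-cooldown.sh "cpu_high"; then
    send_telegram_alert "🚨 CPU above 90% for 10+ minutes"
else
    echo "Alert suppressed (cooldown active)"
fi
```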

The Daily Digest

Even when everything is fine, a daily message reassures:

📊 **Daily Server Report**
📅 2025-01-15

✅ All services operational
💻 Avg CPU: 23% | Max: 67%
🧠 RAM: 4.2GB / 8GB (52%)
💾 Disk: 34GB / 80GB (42%)
🔒 12 failed SSH attempts
🐳 5 active Docker containers

Next check in 24h.

This is the kind of message you actually read. And when it says "⚠️ Disk at 78%, +3% in 24h", you act.
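The digest's averages can come straight from the stored history. A sketch that aggregates the `CPU usage:` lines across today's reports:

```shell
# Average and peak CPU across today's stored health reports.
grep -h "^CPU usage:" /root/monitoring/history/health_"$(date +%Y%m%d)"*.txt 2>/dev/null \
    | tr -d '%' \
    | awk '{sum += $3; if ($3 > max) max = $3; n++}
           END {if (n) printf "Avg CPU: %.0f%% | Max: %.0f%%\n", sum/n, max
                else print "No reports for today"}'
```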


🔧 Advanced Configuration

Monitoring Specific Services

Adapt the script for your services:

# Add to server-health.sh script

# Check if a site responds
echo "--- WEB CHECK ---"
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:80 2>/dev/null)
echo "Nginx HTTP: $HTTP_CODE"

# Check database size
echo "--- DATABASE ---"
if command -v psql &>/dev/null; then
    DB_SIZE=$(psql -U postgres -t -c "SELECT pg_size_pretty(pg_database_size('main'));" 2>/dev/null)
    echo "DB size: $DB_SIZE"
fi

# Check SSL certificates
echo "--- SSL ---"
DOMAIN="your-domain.com"
EXPIRY=$(echo | openssl s_client -servername $DOMAIN -connect $DOMAIN:443 2>/dev/null | openssl x509 -noout -dates 2>/dev/null | grep notAfter | cut -d= -f2)
echo "SSL expiry: $EXPIRY"
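The raw date string is easier to act on as a day count, continuing the snippet above (GNU `date` assumed, as elsewhere in this guide):

```shell
# Convert notAfter into days remaining so the agent can threshold on it.
if [ -n "$EXPIRY" ]; then
    EXPIRY_TS=$(date -d "$EXPIRY" +%s 2>/dev/null)
    if [ -n "$EXPIRY_TS" ]; then
        DAYS_LEFT=$(( (EXPIRY_TS - $(date +%s)) / 86400 ))
        echo "SSL days remaining: $DAYS_LEFT"
        if [ "$DAYS_LEFT" -lt 14 ]; then
            echo "⚠️ Certificate expires in under 2 weeks"
        fi
    fi
fi
```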