
Code Execution and Checkpoints in Hermes Agent

Hermes Agent 🔴 Advanced ⏱️ 11 min read 📅 2026-05-05

Introduction

Hermes Agent doesn't just execute shell commands. It has two mechanisms for running code and managing file modifications: Code Execution (running Python scripts that can call Hermes tools) and Checkpoints (automatic filesystem snapshots for rollback). Together they make Hermes a much safer development environment: the agent's file modifications can be undone, and complex scripts run in a separate, sandboxed child process.

Consider a concrete example: you ask the agent to refactor an authentication module, update the tests, and generate documentation. Without Code Execution, that can easily mean 15 to 20 separate tool calls, each consuming context tokens. With Code Execution, the agent writes one Python script that does everything in a single turn. And if the result isn't right? Checkpoints let you go back with a simple /rollback.

This article covers both features in detail, their configuration, and how to combine them for an optimal development workflow.

Code Execution: Running Python with Hermes Tools

The concept

The execute_code tool lets the AI model write and execute a Python script that can call Hermes tools (terminal, file read/write, search, patch, web). This is called Programmatic Tool Calling (PTC).

The problem it solves: when a task requires reading multiple files, processing them, and writing results, the agent would normally make one tool call per step — read, then process mentally, then write, then re-read, etc. Each call consumes context tokens and a conversation turn. execute_code collapses the entire chain into a single call: the agent writes a complete Python script, the script calls tools via RPC, and only the final result is returned to the LLM.

Architecture: the parent process generates a hermes_tools.py module with RPC stubs, then launches the script in a child process. Tool calls travel via a Unix socket (local execution) or files (remote execution via Docker/SSH) to the parent for dispatch. Only the script's stdout is returned to the LLM — intermediate tool results never enter the context.
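
The round trip can be pictured with a small, self-contained sketch: a generated stub writes a tool request to a Unix socket, and the parent dispatches it and writes back the result. This is not Hermes' actual wire format; the newline-delimited JSON framing, `call_tool`, `dispatcher`, and the fake `read_file` handler are all hypothetical, and a thread plays the parent so the example runs in one process.

```python
import json
import socket
import threading

# Hypothetical parent-side tool registry (stands in for the real Hermes tools).
TOOLS = {
    "read_file": lambda path: {"content": f"<contents of {path}>"},
}


def dispatcher(conn: socket.socket) -> None:
    """Parent side: handle one JSON request per line until the child hangs up."""
    reader = conn.makefile("r", encoding="utf-8")
    writer = conn.makefile("w", encoding="utf-8")
    with conn, reader, writer:
        for line in reader:
            request = json.loads(line)
            handler = TOOLS.get(request["tool"])
            if handler is None:
                reply = {"error": f"unknown tool {request['tool']!r}"}
            else:
                reply = {"result": handler(**request["args"])}
            writer.write(json.dumps(reply) + "\n")
            writer.flush()


def call_tool(reader, writer, tool: str, **args):
    """Child-side stub: send a request, then block until the reply arrives."""
    writer.write(json.dumps({"tool": tool, "args": args}) + "\n")
    writer.flush()
    reply = json.loads(reader.readline())
    if "error" in reply:
        raise RuntimeError(reply["error"])
    return reply["result"]


parent_end, child_end = socket.socketpair()  # AF_UNIX by default on POSIX
worker = threading.Thread(target=dispatcher, args=(parent_end,), daemon=True)
worker.start()

child_reader = child_end.makefile("r", encoding="utf-8")
child_writer = child_end.makefile("w", encoding="utf-8")
with child_end, child_reader, child_writer:
    result = call_tool(child_reader, child_writer, "read_file", path="auth.py")
    print(result["content"])  # only what the script prints would reach the model

worker.join(timeout=1)
```

In the real setup the child is a separate process, so whatever the dispatcher returns stays out of the model's context unless the script chooses to print it.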

Available tools in the sandbox

execute_code scripts have access to a strict subset of Hermes tools:

Tool            Description
read_file       Read files
write_file      Write files
search_files    Search file contents and names
patch           Targeted find-and-replace modifications
terminal        Execute shell commands
web_search      Web searches
web_extract     Web content extraction

The intersection of this list and your session's enabled toolsets determines which stubs are actually generated. If web isn't enabled in your session, web_search and web_extract won't be available in scripts.

Note the deliberate absence of tools with external effects: send_message, delegate_task, clarify, memory, and execute_code itself are not available. A script cannot send messages, delegate tasks, create subagents, or start a nested execute_code run.
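
Conceptually, stub generation is a set intersection between the sandbox whitelist and the tools enabled in the session. A minimal sketch; `SANDBOX_TOOLS` and `stubs_to_generate` are illustrative names, not Hermes internals, and toolsets are simplified to plain tool names:

```python
# Tools the sandbox is willing to expose (per the table above).
SANDBOX_TOOLS = frozenset({
    "read_file", "write_file", "search_files", "patch",
    "terminal", "web_search", "web_extract",
})


def stubs_to_generate(session_tools: set[str]) -> list[str]:
    """Return the sandbox tools that are also enabled in the session, sorted."""
    return sorted(SANDBOX_TOOLS & session_tools)


# A session without web tools, plus tools the sandbox never exposes.
session = {"terminal", "read_file", "write_file", "patch",
           "search_files", "send_message", "memory"}
print(stubs_to_generate(session))
# ['patch', 'read_file', 'search_files', 'terminal', 'write_file']
```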

Two execution modes

Project mode (default): the script runs in the session's working directory with the active virtual environment's Python. Project dependencies (pandas, numpy, requests, or anything else installed in that environment) and relative paths work normally. This is the mode to use for most development tasks.

Strict mode: the script runs in an isolated temp directory with Hermes' own Python (sys.executable). Maximum isolation and reproducibility, but project dependencies and relative paths won't work. Useful for standalone scripts that don't need external packages.

# In ~/.hermes/config.yaml
code_execution:
  mode: project    # project (default) | strict
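
A rough sketch of what the `mode` setting changes: which interpreter runs the script and in which directory. `resolve_execution` is a hypothetical helper, and the fallback to `sys.executable` when no virtual environment is active is an assumption.

```python
import sys
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def resolve_execution(mode: str, session_cwd: str,
                      venv: Optional[str] = None) -> Tuple[str, str]:
    """Pick (python_executable, working_directory) for a script run.

    Hypothetical helper illustrating the two modes; not Hermes' actual code.
    """
    if mode == "strict":
        # Hermes' own interpreter in a fresh, throwaway directory.
        return sys.executable, tempfile.mkdtemp(prefix="hermes_exec_")
    if mode == "project":
        if venv:
            # POSIX virtualenv layout (the sandbox is Linux/macOS only).
            python = str(Path(venv) / "bin" / "python")
        else:
            python = sys.executable  # assumed fallback: no active venv
        return python, session_cwd
    raise ValueError(f"unknown code_execution.mode: {mode!r}")
```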

Limits and security

The sandbox automatically applies several layers of protection:

Environment variable scrubbing: variables whose names contain API_KEY, TOKEN, SECRET, PASSWORD, or CREDENTIAL are removed from the child process environment. Even if a script tries to read os.environ['OPENROUTER_API_KEY'], the variable won't be there. Because the filtering is applied before the process launches, nothing inside the script can undo it.
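
The scrubbing step can be approximated with the standard library: filter the environment, then pass it to `subprocess.run(env=...)`. The marker list comes from the article; the case-insensitive substring match is an assumption about how Hermes compares names.

```python
import os
import subprocess
import sys

# Substrings from the article; case-insensitive matching is an assumption.
SECRET_MARKERS = ("API_KEY", "TOKEN", "SECRET", "PASSWORD", "CREDENTIAL")


def scrub_env(environ: dict[str, str]) -> dict[str, str]:
    """Return a copy of environ without variables whose names look secret."""
    return {
        name: value
        for name, value in environ.items()
        if not any(marker in name.upper() for marker in SECRET_MARKERS)
    }


child_env = scrub_env(dict(os.environ))
# The filtered dict is handed to the child at launch time, so the script
# cannot recover the removed variables from its own environment.
proc = subprocess.run(
    [sys.executable, "-c", "import os; print('OPENROUTER_API_KEY' in os.environ)"],
    env=child_env,
    capture_output=True,
    text=True,
    check=True,
)
print(proc.stdout.strip())  # False
```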

Tool whitelist: only the 7 tools listed above are available. No access to send_message, delegate_task, memory, or any tool with irreversible external effects.

Resource limits:
- Timeout: 5 minutes (300 seconds) by default
- Max tool calls: 50 per script
- Stdout cap: 50 KB maximum
- Stderr cap: 10 KB maximum
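
Two of these limits are easy to picture in code. A sketch under stated assumptions: `ToolCallBudget` is a hypothetical counter the dispatcher would consult before serving each RPC, and `run_script` enforces the timeout with `subprocess.run(timeout=...)`, which kills the child when time runs out.

```python
import subprocess
import sys

TIMEOUT_SECONDS = 300  # default timeout from the article
MAX_TOOL_CALLS = 50    # per-script tool-call budget from the article


class ToolCallBudget:
    """Counts RPC requests; the dispatcher would call charge() before each one."""

    def __init__(self, limit: int = MAX_TOOL_CALLS) -> None:
        self.limit = limit
        self.used = 0

    def charge(self) -> None:
        if self.used >= self.limit:
            raise RuntimeError(f"tool-call limit of {self.limit} reached")
        self.used += 1


def run_script(path: str, timeout: float = TIMEOUT_SECONDS) -> str:
    """Run a script in a child process, killing it if it exceeds the timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return f"Script timed out after {timeout} seconds"
    return proc.stdout
```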

If stdout exceeds 50 KB, only the first 500 bytes and the last 500 bytes are returned (head + tail), so the model sees the beginning and end of the output without saturating its context.
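
The head-plus-tail truncation fits in a few lines. A sketch, assuming 1 KB = 1024 bytes and a made-up omission marker; the 500-byte windows follow the figures above:

```python
STDOUT_CAP = 50 * 1024  # 50 KB cap (assuming 1 KB = 1024 bytes)
HEAD_BYTES = 500
TAIL_BYTES = 500


def truncate_stdout(data: bytes) -> bytes:
    """Keep the head and tail of oversized output, with a marker in between."""
    if len(data) <= STDOUT_CAP:
        return data
    omitted = len(data) - HEAD_BYTES - TAIL_BYTES
    marker = f"\n... [{omitted} bytes omitted] ...\n".encode()
    return data[:HEAD_BYTES] + marker + data[-TAIL_BYTES:]


clipped = truncate_stdout(b"x" * 100_000)
print(len(clipped))
```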

Platform restriction: the sandbox relies on Unix domain sockets (POSIX), so it runs on Linux and macOS only and is automatically disabled on Windows.

Built-in utility functions

The hermes_tools module injected into each script provides specific helpers:

from hermes_tools import terminal, read_file, write_file, search_files, patch

# Utility functions
from hermes_tools import json_parse, shell_quote, retry

# json_parse: a json.loads equivalent that tolerates control characters,
# useful for parsing shell command output that contains ANSI codes
raw_output = terminal("cat package.json")["output"]
data = json_parse(raw_output)

# shell_quote: safe shell escaping for dynamic values
pattern = "user's login failed"  # example value containing a quote
cmd = f"grep {shell_quote(pattern)} /var/log/app.log"
matches = terminal(cmd)["output"]

# retry: retries with exponential backoff, for flaky network operations
def fetch_api():
    return terminal("curl -sf https://api.example.com/status")["output"]

result = retry(fetch_api, max_attempts=3, delay=2)

The terminal() function in scripts is foreground-only — no background or PTY. It returns a dict with output, exit_code, and optionally error.
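
Outside the sandbox, the documented behavior of these helpers maps onto standard-library tools. The sketch below is an approximation, not the hermes_tools source: `json_parse_sketch` strips ANSI escape sequences and then calls `json.loads(..., strict=False)`, which accepts control characters inside strings; `shlex.quote` is the standard POSIX shell escaping; `retry_sketch` doubles the wait after each failed attempt.

```python
import json
import re
import shlex
import time

# CSI sequences such as "\x1b[32m" (colors) or "\x1b[2K" (clear line).
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;?]*[A-Za-z]")


def json_parse_sketch(raw: str):
    """Parse JSON from terminal output that may contain ANSI color codes."""
    return json.loads(ANSI_ESCAPE.sub("", raw), strict=False)


def retry_sketch(fn, max_attempts: int = 3, delay: float = 2.0):
    """Call fn until it succeeds, sleeping delay, 2*delay, 4*delay... between tries."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(delay * 2 ** attempt)


print(json_parse_sketch('\x1b[32m{"status": "ok"}\x1b[0m'))  # {'status': 'ok'}
print(shlex.quote("user's login failed"))  # 'user'"'"'s login failed'
```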

When does the agent trigger execute_code?

The agent decides on its own when to use execute_code, usually when a task needs 3+ tool calls with processing logic between them. Typical examples:

  • Data pipelines: read 5 JSON files, filter by criteria, aggregate results, write a report
  • Multi-file search: search a pattern across 10 files, extract matching lines, synthesize
  • Conditional loops: fetch N web pages, parse each response, accumulate results, stop when a condition is met
  • Complex transformations: read a 1000-row CSV, calculate metrics per group, generate a formatted output file
  • Bulk updates: read a list of files, apply the same replacement to each, verify syntax
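
To make the first pattern concrete, here is the kind of script the agent might write for a data pipeline. The layout (a directory of `*.json` files, each holding a list of events with `service` and `status` fields) and the `error_report` function are invented for illustration. It uses plain `pathlib`/`json` so it runs anywhere; inside execute_code the agent could call `read_file` and `write_file` from hermes_tools instead.

```python
import json
from collections import Counter
from pathlib import Path


def error_report(log_dir: str, report_path: str) -> Counter:
    """Read every JSON file in log_dir, count error events per service,
    write a small text report, and return the counts."""
    errors = Counter()
    for path in sorted(Path(log_dir).glob("*.json")):
        for event in json.loads(path.read_text(encoding="utf-8")):
            if event.get("status") == "error":  # filter step
                errors[event.get("service", "unknown")] += 1  # aggregate step
    lines = [f"{service}: {count}" for service, count in errors.most_common()]
    Path(report_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    # Only this one-line summary would be returned to the model.
    print(f"{sum(errors.values())} errors across {len(errors)} services")
    return errors
```

A call such as `error_report("logs", "error_report.txt")` reads, filters, aggregates, and writes in one pass; none of the per-file contents enter the model's context.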

For simple tasks (single tool call, no intermediate logic), the agent prefers direct tool calls — faster and just as effective.

Checkpoints: Automatic Filesystem Snapshots

The concept

Checkpoints create automatic snapshots of your working directory before each file modification by the agent. If the agent makes a mistake — deletes a file, breaks code, applies a wrong patch — you can instantly go back.

The implementation relies on shadow git repositories: a separate git repository is kept outside your project (under ~/.hermes/checkpoints/), so no .git folder appears in your working directory. Each snapshot is a git commit in this shadow repo. Your project doesn't need to be a git repository for checkpoints to work; the system works with any directory.

Activation

# Via CLI flag (simplest)
hermes --checkpoints

# Or in config.yaml
checkpoints:
  enabled: true
  max_snapshots: 50

Checkpoints are disabled by default. They must be explicitly enabled because they consume some disk space and CPU for git snapshots.

Technical architecture

Checkpoints are stored in ~/.hermes/checkpoints/{hash}/ where {hash} is the first 16 characters of the SHA-256 of the directory's absolute path. This deterministic approach ensures the same directory always uses the same shadow repo, even after a restart.
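
The path-to-directory mapping is straightforward to reproduce. A sketch, assuming the hash is taken over the UTF-8 bytes of the absolute path as returned by `os.path.abspath` (symlinks not resolved); `shadow_repo_dir` is a hypothetical name:

```python
import hashlib
import os


def shadow_repo_dir(workdir: str, root: str = "~/.hermes/checkpoints") -> str:
    """Map a working directory to its (deterministic) shadow repo location."""
    absolute = os.path.abspath(os.path.expanduser(workdir))
    digest = hashlib.sha256(absolute.encode("utf-8")).hexdigest()[:16]
    return os.path.join(os.path.expanduser(root), digest)


print(shadow_repo_dir("/home/alice/project"))
```

Because the name depends only on the path, restarting Hermes in the same directory finds the same shadow repo again.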

The shadow repo uses GIT_DIR + GIT_WORK_TREE environment variables to avoid polluting your project with a .git folder. The shadow repo is strictly isolated from the user's git configuration — it doesn't inherit ~/.gitconfig, hooks, or GPG signing.
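
Here is one way to build such an isolated git invocation. `GIT_DIR`, `GIT_WORK_TREE`, `GIT_CONFIG_GLOBAL` (git 2.32 and later), `GIT_CONFIG_NOSYSTEM`, and the `-c core.hooksPath=...` / `-c commit.gpgsign=false` overrides are standard git features; the helper names, the placeholder identity, and the use of `--allow-empty` are assumptions, not a description of Hermes' code.

```python
import os
import subprocess


def shadow_git_env(shadow_dir: str, work_tree: str) -> dict[str, str]:
    """Environment for running git against the shadow repo, isolated from user config."""
    env = dict(os.environ)
    env.update({
        "GIT_DIR": shadow_dir,            # repository data lives outside the project
        "GIT_WORK_TREE": work_tree,       # ...but tracks the project's files
        "GIT_CONFIG_GLOBAL": os.devnull,  # ignore ~/.gitconfig (git >= 2.32)
        "GIT_CONFIG_NOSYSTEM": "1",       # ignore the system-wide gitconfig
    })
    return env


def shadow_git_cmd(*args: str) -> list[str]:
    """git argv with per-command overrides that disable hooks and signing."""
    return [
        "git",
        "-c", f"core.hooksPath={os.devnull}",
        "-c", "commit.gpgsign=false",
        "-c", "user.name=Hermes Checkpoint",       # placeholder identity
        "-c", "user.email=checkpoint@localhost",   # (global config is ignored)
        *args,
    ]


def take_snapshot(shadow_dir: str, work_tree: str, reason: str) -> None:
    """Commit the current state of work_tree into the shadow repo."""
    env = shadow_git_env(shadow_dir, work_tree)
    if not os.path.isdir(shadow_dir):
        subprocess.run(shadow_git_cmd("init", "--quiet"), env=env, check=True)
    subprocess.run(shadow_git_cmd("add", "-A"), env=env, check=True)
    subprocess.run(
        shadow_git_cmd("commit", "--quiet", "--allow-empty", "-m", reason),
        env=env,
        check=True,
    )
```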

Automatic exclusions: the following items are not snapshotted because they're either too large, volatile, or sensitive:

  • Dependencies: node_modules/, dist/, build/, __pycache__/, .venv/, venv/
  • Environment variables: .env, .env.*, .env.local, .env.*.local
  • Cache: .next/, .nuxt/, .cache/, coverage/, .pytest_cache/
  • Metadata: .git/, *.log, .DS_Store

This list is sufficient for most projects. Directories with more than 50,000 files are automatically skipped to avoid slowdowns.
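
A sketch of how these exclusions and the size cut-off could be applied while walking the tree. The patterns mirror the list above; `files_to_snapshot` is hypothetical, and counting only non-excluded files toward the 50,000 limit is an assumption (the real implementation may instead hand these rules to git as exclude patterns).

```python
import fnmatch
import os

EXCLUDED_DIRS = {
    "node_modules", "dist", "build", "__pycache__", ".venv", "venv",
    ".next", ".nuxt", ".cache", "coverage", ".pytest_cache", ".git",
}
# ".env.*" also covers .env.local and .env.*.local
EXCLUDED_FILES = [".env", ".env.*", "*.log", ".DS_Store"]
MAX_FILES = 50_000


def files_to_snapshot(workdir: str):
    """Return sorted relative paths to snapshot, or None if the tree is too large."""
    selected = []
    for root, dirs, files in os.walk(workdir):
        # Prune excluded directories in place so os.walk never descends into them.
        dirs[:] = [d for d in dirs if d not in EXCLUDED_DIRS]
        for name in files:
            if any(fnmatch.fnmatchcase(name, pat) for pat in EXCLUDED_FILES):
                continue
            selected.append(os.path.relpath(os.path.join(root, name), workdir))
            if len(selected) > MAX_FILES:
                return None  # too many files: skip checkpointing this directory
    return sorted(selected)
```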

Automatic triggering

Checkpoints are created automatically before file modification operations:

  • write_file — before each file write (creation or full replacement)
  • patch — before each targeted modification (find-and-replace)

Only one checkpoint per directory per conversation turn. If the agent modifies 5 files in the same directory during a single turn, only one snapshot is taken before the first modification. This deduplication prevents redundant snapshots.

Checkpoints are not created for overly broad directories:
- / (filesystem root)
- $HOME (home directory)

This protection prevents accidental snapshots of the entire system, which would be expensive in time and disk space.
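
The once-per-turn rule and the root/$HOME guard can be expressed as a small gate object. `CheckpointGate` is a hypothetical name, and keying the deduplication on the resolved directory path is an assumption:

```python
from pathlib import Path


class CheckpointGate:
    """Decide whether a write should trigger a snapshot (hypothetical helper)."""

    def __init__(self) -> None:
        self._done_this_turn: set[str] = set()

    def new_turn(self) -> None:
        """Call at the start of each conversation turn."""
        self._done_this_turn.clear()

    def should_snapshot(self, directory: str) -> bool:
        resolved = str(Path(directory).expanduser().resolve())
        if resolved in (str(Path("/").resolve()), str(Path.home().resolve())):
            return False  # never snapshot the filesystem root or $HOME
        if resolved in self._done_this_turn:
            return False  # already snapshotted during this turn
        self._done_this_turn.add(resolved)
        return True
```

With this shape, five writes in one directory during a turn produce a single `True` (one snapshot) followed by four `False` results.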

The /rollback command

The /rollback command is your interface for managing checkpoints in session:

/rollback                 # List all available checkpoints
/rollback 1               # Restore checkpoint 1
/rollback diff 1          # Preview changes since checkpoint 1
/rollback 1 config.yaml   # Restore a single file from checkpoint 1

List checkpoints (/rollback with no argument): displays a table with the number, short hash, date, snapshot reason, and statistics (files changed, insertions, deletions). Checkpoints are listed from newest to oldest.

Full restore (/rollback N): all files in the directory return to checkpoint N's state. The last chat turn is also undone, because the filesystem state changed — the agent must reconsider its actions in the new context.

Partial restore (/rollback N file): only the specified file is restored from checkpoint N. Other files keep their current state. Ideal when the agent broke one specific file without touching others — just restore the damaged file.

Preview (/rollback diff N): displays modifications made since checkpoint N without changing anything. Essential for verifying what you'll restore before doing it.
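
The four forms map onto a small parser. This is a hypothetical sketch of the argument handling only (`RollbackCommand` and `parse_rollback` are illustrative names); it does not touch git.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RollbackCommand:
    action: str                       # "list", "restore" or "diff"
    checkpoint: Optional[int] = None  # checkpoint number from the list
    file: Optional[str] = None        # set only for a single-file restore


def parse_rollback(line: str) -> RollbackCommand:
    """Parse the four /rollback forms described above (hypothetical parser)."""
    parts = line.split()
    if not parts or parts[0] != "/rollback":
        raise ValueError("not a /rollback command")
    args = parts[1:]
    if not args:
        return RollbackCommand("list")
    if args[0] == "diff" and len(args) == 2 and args[1].isdigit():
        return RollbackCommand("diff", int(args[1]))
    if args[0].isdigit() and len(args) <= 2:
        file = args[1] if len(args) == 2 else None
        return RollbackCommand("restore", int(args[0]), file)
    raise ValueError(f"unrecognized /rollback arguments: {' '.join(args)}")


print(parse_rollback("/rollback 1 config.yaml"))
```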

Snapshot limits and pruning

50 snapshots maximum per directory by default. The oldest are automatically purged when the limit is reached. Pruning can also be triggered manually.

Configure with checkpoints.max_snapshots in your config.yaml. For long development sessions (multi-step refactors), increase this to 100 or 200.

Integrity verification

Rollback operations include security validations:

  • Relative path required: absolute paths are rejected — only paths relative to the working directory are accepted
  • Path traversal protection: attempts like ../../etc/passwd are blocked
  • Commit hash validation: hashes are verified (4-64 hex characters, not starting with -) to prevent git command injection
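
These three checks might look like this in Python. A sketch, not Hermes' code: the function names are hypothetical, and restricting hashes to lowercase hex (as git prints them) is an assumption.

```python
import re
from pathlib import Path

COMMIT_HASH = re.compile(r"[0-9a-f]{4,64}")


def validate_commit_hash(value: str) -> str:
    """Accept 4-64 hex characters; a leading '-' could be parsed as a git option."""
    if value.startswith("-") or not COMMIT_HASH.fullmatch(value):
        raise ValueError(f"invalid commit hash: {value!r}")
    return value


def validate_restore_path(workdir: str, user_path: str) -> Path:
    """Return the absolute target path, refusing anything outside workdir."""
    if Path(user_path).is_absolute():
        raise ValueError("absolute paths are not accepted")
    root = Path(workdir).resolve()
    target = (root / user_path).resolve()  # collapses any ".." components
    if target != root and root not in target.parents:
        raise ValueError(f"path escapes the working directory: {user_path!r}")
    return target
```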

Combining Code Execution and Checkpoints

The --checkpoints + execute_code combo is the most powerful setup for development with Hermes:

  1. Enable checkpoints at startup: hermes --checkpoints
  2. Ask the agent to perform complex modifications
  3. The agent uses execute_code to orchestrate operations
  4. The script's write_file and patch calls trigger automatic checkpoints (at most one per directory per turn)
  5. If the result is wrong: /rollback N to go back

It's like undo/redo for all agent operations, with the power of a Python script for complex operations.

Concrete workflow

hermes --checkpoints

# In the session:
> Refactor auth.py to use JWT instead of sessions.
  Update the 3 corresponding test files and verify syntax.

The agent generates an execute_code script that:
1. Reads auth.py and the test files
2. Analyzes the existing authentication logic
3. Rewrites the module with JWT
4. Updates the tests
5. Verifies Python syntax of each modified file
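
Step 5 needs no external tools: Python's `ast` module can parse each rewritten file without running it. A sketch, assuming the script already holds the new sources as strings (for example, from `read_file`); `syntax_errors` is a hypothetical helper.

```python
import ast


def syntax_errors(sources: dict[str, str]) -> dict[str, str]:
    """Map each file name to its syntax error; an empty dict means all files parse."""
    problems = {}
    for name, code in sources.items():
        try:
            ast.parse(code, filename=name)
        except SyntaxError as exc:
            problems[name] = f"line {exc.lineno}: {exc.msg}"
    return problems


print(syntax_errors({"auth.py": "def login(user):\n    return True\n"}))  # {}
```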

The script's writes are covered by automatic checkpoints (one per directory per turn). If a test fails:

> Run the tests with pytest

/rollback 3          # Go back to checkpoint 3 (before test rewrite)
> Fix the error in JWT tests

Recommended configuration

# ~/.hermes/config.yaml
checkpoints:
  enabled: true
  max_snapshots: 100

code_execution:
  mode: project

Always launch with: hermes --checkpoints

A Hostinger VPS with 2 vCPU and 4 GB RAM provides more than enough resources to run Hermes with checkpoints and code execution, even on substantial projects.

Conclusion

Code Execution and Checkpoints are complementary features that make Hermes Agent a reliable, reversible development environment. execute_code lets the agent process complex data in a single turn (saving tokens and time), while checkpoints let you cleanly undo the agent's file modifications with /rollback.

Key takeaways:

  • The agent uses execute_code on its own when a task needs 3+ tool calls, so there's no need to request it explicitly
  • Environment variable scrubbing protects your API keys even in Python scripts
  • Enable --checkpoints for any work involving file modifications — the ability to go back is priceless
  • /rollback is your safety net: full or partial restore, with diff preview
  • Project mode for everyday development, strict mode when isolation is the priority
  • Shadow git — your project doesn't need to be a git repo, checkpoints work anywhere

For more on security around code execution, see our article on security and permissions and discover how the filesystem and context system interacts with these mechanisms.