About ClawBench

ClawBench is an agent orchestration benchmark that tests AI models through the full agent stack -- not raw API calls. It evaluates thinking-block stripping, retry logic, tool orchestration, and end-to-end task completion through the OpenClaw gateway.

Unlike traditional benchmarks that test model capabilities in isolation, ClawBench tests the complete agent system: the model, the orchestration middleware, and the tool execution layer working together.

Categories

Tool Accuracy

Tests whether the agent calls the right tools with the correct arguments and interprets results correctly.

Code Generation

Evaluates the agent's ability to write, modify, and debug code in a sandboxed environment.

Reasoning

Measures logical reasoning, multi-step deduction, and the ability to handle ambiguity.

Error Recovery

Tests how the agent recovers from errors, retries failed operations, and adapts its approach.

Multi-Step

Evaluates the agent's ability to plan and execute multi-step tasks that require sequencing.

Research

Tests information gathering, synthesis, and the ability to answer questions requiring multiple sources.

Context Management

Measures how well the agent maintains context across long conversations and complex tasks.

Score Guide

90-100: Excellent -- agent handles complex orchestration reliably
70-89: Solid -- agent works well for most tasks
40-69: Functional -- agent works but has gaps
0-39: Needs work -- significant agent capability issues
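
The bands above amount to a simple threshold lookup. The following is a minimal sketch in POSIX shell, useful if you want to label scores in your own scripts. score_band is a hypothetical helper name, not part of the ClawBench CLI, and it assumes scores are whole numbers from 0 to 100.

```shell
# Hypothetical helper (not part of ClawBench): map a whole-number
# score from 0 to 100 to the band named in the Score Guide.
score_band() {
  # Reject empty or non-numeric input.
  case "$1" in
    ''|*[!0-9]*) echo "invalid"; return 1 ;;
  esac
  # Reject values above the documented range.
  if [ "$1" -gt 100 ]; then
    echo "invalid"
    return 1
  fi
  if [ "$1" -ge 90 ]; then
    echo "Excellent"
  elif [ "$1" -ge 70 ]; then
    echo "Solid"
  elif [ "$1" -ge 40 ]; then
    echo "Functional"
  else
    echo "Needs work"
  fi
}
```

For example, score_band 72 prints Solid, and score_band 39 prints Needs work.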

Prerequisites

Node.js 18+ required. Run node --version to check.
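
If you would rather script the version check than read the output by eye, a sketch along these lines may help. node_version_ok is a hypothetical helper, not something ClawBench or OpenClaw provides, and it assumes node --version prints a string of the form v18.17.0.

```shell
# Hypothetical helper: succeed if a `node --version` style string
# (for example "v18.17.0") has a major version of 18 or higher.
node_version_ok() {
  version="${1#v}"        # drop the leading "v", if any
  major="${version%%.*}"  # keep everything before the first dot
  # Fail on empty or non-numeric major versions.
  case "$major" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$major" -ge 18 ]
}

# Check the installed Node.js, if there is one on PATH.
if command -v node >/dev/null 2>&1; then
  if node_version_ok "$(node --version)"; then
    echo "Node.js version OK"
  else
    echo "Node.js 18+ is required" >&2
  fi
fi
```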

1. OpenClaw installed -- ClawBench runs through the OpenClaw gateway. Install from github.com/openclaw/openclaw

2. Gateway running -- openclaw gateway start

3. Gateway token -- Find it in ~/.openclaw/openclaw.json under gateway.auth.token

4. Enable chat completions -- In openclaw.json:

"gateway": { "http": { "endpoints": { "chatCompletions": { "enabled": true } } } }

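Prerequisites 3 and 4 both come down to reading keys from openclaw.json. Since Node.js is already required, one way to read them from a script is sketched below. config_get is a hypothetical helper, not part of OpenClaw or ClawBench, and it assumes the file is plain JSON that JSON.parse accepts (no comments or trailing commas).

```shell
# Hypothetical helper: print the value at a dot-separated key path
# in a JSON file. Assumes the file is strict JSON.
#   $1 = path to the config file
#   $2 = key path, e.g. gateway.auth.token
config_get() {
  node -e '
    const fs = require("fs");
    // With `node -e`, extra command-line arguments start at argv[1].
    const [file, keyPath] = process.argv.slice(1);
    const config = JSON.parse(fs.readFileSync(file, "utf8"));
    // Walk the key path one segment at a time, stopping at missing keys.
    const value = keyPath
      .split(".")
      .reduce((obj, key) => (obj == null ? undefined : obj[key]), config);
    if (value === undefined) process.exit(1);
    console.log(value);
  ' "$1" "$2"
}

# Example usage (left commented out so that sourcing this file has no side effects):
# config_get "$HOME/.openclaw/openclaw.json" gateway.auth.token
# config_get "$HOME/.openclaw/openclaw.json" gateway.http.endpoints.chatCompletions.enabled
```

The second call should print true once chat completions are enabled. The helper exits with a non-zero status when the key is missing, so it can also be used in an if test.
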
Run a Benchmark

Step 1 -- Install ClawBench:

git clone https://github.com/MrSlothuus/clawbench.git
cd clawbench
npm link

Step 2 -- Set your token:

export OPENCLAW_GATEWAY_TOKEN="your-token-here"

Step 3 -- Preview (no execution):

clawbench --dry-run

Step 4 -- Run the benchmark:

clawbench

Step 5 -- Submit to leaderboard:

clawbench --submit

Test a specific model:

clawbench --model anthropic/claude-sonnet-4-6 --submit
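
Steps 2 through 5 can be chained in a small wrapper so that a missing token is caught before anything runs. This is only a sketch: run_clawbench is a hypothetical function name, it uses just the clawbench flags shown above, and it assumes clawbench --dry-run exits with a non-zero status when the preview fails.

```shell
# Hypothetical wrapper around the steps above: verify the token,
# preview with --dry-run, then run the benchmark with any extra flags.
run_clawbench() {
  # Step 2: the gateway token must be set in the environment.
  if [ -z "${OPENCLAW_GATEWAY_TOKEN:-}" ]; then
    echo "OPENCLAW_GATEWAY_TOKEN is not set" >&2
    return 1
  fi
  # Step 3: preview only. Assumes a failed preview returns non-zero.
  clawbench --dry-run || return 1
  # Steps 4 and 5: the real run, passing through flags such as --submit.
  clawbench "$@"
}
```

For example, run_clawbench --submit previews the run and then submits the results, while run_clawbench with no arguments only previews and runs the benchmark.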

Full docs: github.com/MrSlothuus/clawbench
