About ClawBench
ClawBench is an agent orchestration benchmark that tests AI models through the full agent stack rather than through raw API calls. It evaluates thinking-block stripping, retry logic, tool orchestration, and end-to-end task completion through the OpenClaw gateway.
Unlike traditional benchmarks that test model capabilities in isolation, ClawBench tests the complete agent system: the model, the orchestration middleware, and the tool execution layer working together.
Categories
Tool Accuracy
Tests whether the agent calls the right tools with the correct arguments and interprets results correctly.
Code Generation
Evaluates the agent's ability to write, modify, and debug code in a sandboxed environment.
Reasoning
Measures logical reasoning, multi-step deduction, and the ability to handle ambiguity.
Error Recovery
Tests how the agent recovers from errors, retries failed operations, and adapts its approach.
Multi-Step
Evaluates the agent's ability to plan and execute multi-step tasks that require sequencing.
Research
Tests information gathering, synthesis, and the ability to answer questions requiring multiple sources.
Context Management
Measures how well the agent maintains context across long conversations and complex tasks.
Score Guide
| Score | Rating |
|---|---|
| 90-100 | Excellent -- agent handles complex orchestration reliably |
| 70-89 | Solid -- agent works well for most tasks |
| 40-69 | Functional -- agent works but has gaps |
| 0-39 | Needs work -- significant agent capability issues |
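The bands above amount to a simple threshold lookup. The following is a minimal shell sketch, assuming integer scores from 0 to 100; score_band is a hypothetical helper name and is not part of the ClawBench CLI.

```shell
# Map an integer score (0-100) to its rating band from the Score Guide.
# score_band is a hypothetical helper, not a ClawBench command.
score_band() {
  local score="${1:-}"
  # Reject empty or non-numeric input.
  case "$score" in
    ''|*[!0-9]*) echo "invalid"; return 1 ;;
  esac
  if [ "$score" -gt 100 ]; then
    echo "invalid"
    return 1
  elif [ "$score" -ge 90 ]; then
    echo "Excellent"
  elif [ "$score" -ge 70 ]; then
    echo "Solid"
  elif [ "$score" -ge 40 ]; then
    echo "Functional"
  else
    echo "Needs work"
  fi
}
```

For example, score_band 85 prints Solid, and score_band 39 prints Needs work.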
Prerequisites
Node.js 18+ is required. Run node --version to check.
1. OpenClaw installed -- ClawBench runs through the OpenClaw gateway. Install from github.com/openclaw/openclaw
2. Gateway running -- openclaw gateway start
3. Gateway token -- Find it in ~/.openclaw/openclaw.json under gateway.auth.token
4. Enable chat completions -- In openclaw.json:
"gateway": {
  "http": {
    "endpoints": {
      "chatCompletions": { "enabled": true }
    }
  }
}
Run a Benchmark
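Before working through the steps below, the prerequisites can be checked in one pass. This is a minimal sketch and not part of ClawBench or OpenClaw: the function names node_major, node_ok, and preflight are hypothetical. It assumes the default config path ~/.openclaw/openclaw.json and the OPENCLAW_GATEWAY_TOKEN variable set in Step 2. It does not verify that the gateway is running or that chat completions are enabled.

```shell
# Pre-flight sketch. All helper names here are hypothetical.

# Extract the major version from a string such as "v18.19.0".
node_major() {
  local v="${1:-}"
  v="${v#v}"          # drop the leading "v"
  echo "${v%%.*}"     # keep everything before the first dot
}

# Succeed only if the given version string is Node.js 18 or newer.
node_ok() {
  local major
  major="$(node_major "${1:-}")"
  case "$major" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$major" -ge 18 ]
}

# Check Node.js, the config file, and the token variable in turn.
preflight() {
  local config="${1:-$HOME/.openclaw/openclaw.json}"
  node_ok "$(node --version 2>/dev/null)" || { echo "Node.js 18+ not found"; return 1; }
  [ -f "$config" ] || { echo "missing $config"; return 1; }
  [ -n "${OPENCLAW_GATEWAY_TOKEN:-}" ] || { echo "OPENCLAW_GATEWAY_TOKEN is not set"; return 1; }
  echo "ok"
}
```

Calling preflight with no arguments checks the default config path; pass a path as the first argument to check a different file.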
Step 1 -- Install ClawBench:
git clone https://github.com/MrSlothuus/clawbench.git
cd clawbench
npm link
Step 2 -- Set your token:
export OPENCLAW_GATEWAY_TOKEN="your-token-here"
Step 3 -- Preview (no execution):
clawbench --dry-run
Step 4 -- Run the benchmark:
clawbench
Step 5 -- Submit to leaderboard:
clawbench --submit
Test a specific model:
clawbench --model anthropic/claude-sonnet-4-6 --submit
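The preview and submit steps can be chained so that nothing is submitted when the preview fails. The wrapper below is a sketch only: run_bench is a hypothetical name, it uses just the flags shown above (--dry-run, --model, --submit), and it assumes that clawbench exits non-zero on failure and that --submit runs the benchmark before submitting, as the single-command model example suggests.

```shell
# Hypothetical wrapper around the clawbench CLI.
# Usage: run_bench [model]
run_bench() {
  local model="${1:-}"
  # Preview first; stop if the dry run reports a failure.
  clawbench --dry-run || { echo "dry run failed; not submitting" >&2; return 1; }
  if [ -n "$model" ]; then
    # Benchmark one specific model and submit the result.
    clawbench --model "$model" --submit
  else
    # Benchmark the default model and submit the result.
    clawbench --submit
  fi
}
```

For example, run_bench anthropic/claude-sonnet-4-6 runs the preview and then the same command as the specific-model example above.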
Full docs: github.com/MrSlothuus/clawbench