# About ClawBench
ClawBench is an agent orchestration benchmark that tests AI models through the full agent stack -- not raw API calls. It evaluates thinking-block stripping, retry logic, tool orchestration, and end-to-end task completion through the OpenClaw gateway.
Unlike traditional benchmarks that test model capabilities in isolation, ClawBench tests the complete agent system: the model, the orchestration middleware, and the tool execution layer working together.
## Categories

### Tool Accuracy
Tests whether the agent calls the right tools with the correct arguments and interprets results correctly.

### Code Generation
Evaluates the agent's ability to write, modify, and debug code in a sandboxed environment.

### Reasoning
Measures logical reasoning, multi-step deduction, and the ability to handle ambiguity.

### Error Recovery
Tests how the agent recovers from errors, retries failed operations, and adapts its approach.

### Multi-Step
Evaluates the agent's ability to plan and execute multi-step tasks that require sequencing.

### Research
Tests information gathering, synthesis, and the ability to answer questions requiring multiple sources.

### Context Management
Measures how well the agent maintains context across long conversations and complex tasks.
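The retry-and-adapt behavior probed by the Error Recovery category can be sketched as a generic retry wrapper with exponential backoff. This is an illustration of the pattern only, not ClawBench's actual middleware; `withRetry`, the attempt count, and the backoff schedule are all assumptions.

```typescript
// Illustrative sketch: retry a flaky async operation with exponential
// backoff. Not ClawBench's real implementation -- names and defaults
// here are assumptions for the sake of the example.
async function withRetry<T>(
  op: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 250,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Wait longer after each failure: 250ms, 500ms, 1000ms, ...
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
      }
    }
  }
  throw lastError;
}
```

A well-scoring agent behaves like this wrapper end to end: it notices the failure, backs off, retries, and only surfaces an error once its attempts are exhausted.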
## Score Guide

| Score | Rating |
| --- | --- |
| 90-100 | Excellent -- agent handles complex orchestration reliably |
| 70-89 | Solid -- agent works well for most tasks |
| 40-69 | Functional -- agent works but has gaps |
| 0-39 | Needs work -- significant agent capability issues |
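The score bands above map to tiers by simple thresholds. As a quick illustration, here is that mapping as a small helper; `scoreTier` is a hypothetical function for this example, not part of the clawbench CLI.

```typescript
// Hypothetical helper mirroring the score guide table: map a
// ClawBench score (0-100) to its tier label.
function scoreTier(score: number): string {
  if (score >= 90) return "Excellent";
  if (score >= 70) return "Solid";
  if (score >= 40) return "Functional";
  return "Needs work";
}
```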
## Run a Benchmark

```shell
# Install and run
npx clawbench --gateway-token <your-token>

# Submit results to the leaderboard
npx clawbench --submit --gateway-token <your-token>

# Test a specific model
npx clawbench --submit --model anthropic/claude-sonnet-4-6 --gateway-token <token>
```