# About ClawBench
ClawBench is an agent orchestration benchmark that tests AI models through the full agent stack -- not raw API calls. It evaluates thinking-block stripping, retry logic, tool orchestration, and end-to-end task completion through the OpenClaw gateway.
Unlike traditional benchmarks that test model capabilities in isolation, ClawBench tests the complete agent system: the model, the orchestration middleware, and the tool execution layer working together.
## Categories

### Tool Accuracy
Tests whether the agent calls the right tools with the correct arguments and interprets results correctly.

### Code Generation
Evaluates the agent's ability to write, modify, and debug code in a sandboxed environment.

### Reasoning
Measures logical reasoning, multi-step deduction, and the ability to handle ambiguity.

### Error Recovery
Tests how the agent recovers from errors, retries failed operations, and adapts its approach.

### Multi-Step
Evaluates the agent's ability to plan and execute multi-step tasks that require sequencing.

### Research
Tests information gathering, synthesis, and the ability to answer questions requiring multiple sources.

### Context Management
Measures how well the agent maintains context across long conversations and complex tasks.
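The retry-and-adapt behavior probed by the Error Recovery category can be sketched as a generic retry wrapper with exponential backoff. This is an illustration of the pattern only, not ClawBench's actual middleware; `withRetry`, the attempt count, and the backoff schedule are all assumptions.

```typescript
// Illustrative sketch: retry a flaky async operation with exponential
// backoff. Not ClawBench's real implementation -- names and defaults
// here are assumptions for the sake of the example.
async function withRetry<T>(
  op: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 250,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Wait longer after each failure: 250ms, 500ms, 1000ms, ...
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
      }
    }
  }
  throw lastError;
}
```

A well-scoring agent behaves like this wrapper end to end: it notices the failure, backs off, retries, and only surfaces an error once its attempts are exhausted.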
## Score Guide

| Score | Rating |
| --- | --- |
| 90-100 | Excellent -- agent handles complex orchestration reliably |
| 70-89 | Solid -- agent works well for most tasks |
| 40-69 | Functional -- agent works but has gaps |
| 0-39 | Needs work -- significant agent capability issues |
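The score bands above map to tiers by simple thresholds. As a quick illustration, here is that mapping as a small helper; `scoreTier` is a hypothetical function for this example, not part of the clawbench CLI.

```typescript
// Hypothetical helper mirroring the score guide table: map a
// ClawBench score (0-100) to its tier label.
function scoreTier(score: number): string {
  if (score >= 90) return "Excellent";
  if (score >= 70) return "Solid";
  if (score >= 40) return "Functional";
  return "Needs work";
}
```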
## Run a Benchmark

```shell
# Install and run
npx clawbench --gateway-token <your-token>

# Submit results to the leaderboard
npx clawbench --submit --gateway-token <your-token>

# Test a specific model
npx clawbench --submit --model anthropic/claude-sonnet-4-6 --gateway-token <token>
```