ClawBench: Can AI Agents Complete Everyday Online Tasks?
Summary
ClawBench is the first benchmark to evaluate AI agents on real-world web tasks using live production websites. Instead of sandboxed environments, it uses targeted HTTP interception to safely block only final submission requests while preserving each site's authentic complexity. The benchmark comprises 153 everyday tasks across 144 platforms (booking flights, submitting job applications, making purchases). A novel five-layer recording system captures session replays, screenshots, HTTP traffic, agent reasoning, and browser actions, and an agentic evaluator compares agent trajectories against human references. Frontier models such as Claude Sonnet 4.6 achieve only a 33.3% success rate despite scoring 65-75% on traditional benchmarks, revealing a large gap between controlled evaluation and real-world performance.
Key findings
- Frontier AI models show dramatic performance drops from 65-75% on existing benchmarks to 6.5-33.3% on real-world web tasks
- Safe evaluation on live websites is possible through targeted HTTP interception of only final submission requests
- Five-layer behavioral recording enables traceable failure diagnosis beyond binary pass/fail scores
- Model performance varies significantly across task categories, with no single model dominating all domains
How to implement
- Integrate the HTTP interception mechanism into existing web automation testing frameworks to enable safe evaluation of RPA bots on production sites without triggering real transactions
- Adopt the five-layer recording infrastructure to build debugging tools for web agent failures, allowing developers to trace exactly where and why their automation scripts break on dynamic websites
- Use the agentic evaluator approach to automatically validate customer service chatbots that need to complete forms or reservations, comparing bot trajectories against reference trajectories recorded from human customer service representatives
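The interception idea in the first bullet can be sketched as a request filter: allow all traffic so the site behaves normally, but block state-changing requests to final-submission endpoints. The URL patterns and function name below are illustrative assumptions, not ClawBench's actual rules.

```python
import re

# Hypothetical "final submission" endpoint patterns; ClawBench's real
# pattern list is not published here, so these are assumptions.
SUBMISSION_PATTERNS = [
    r"/checkout/confirm",
    r"/bookings?/submit",
    r"/applications?/submit",
]

def should_block(method: str, url: str) -> bool:
    """Block only state-changing requests to final-submission endpoints,
    letting all other traffic (GETs, asset loads, search) pass through."""
    if method.upper() not in {"POST", "PUT", "PATCH"}:
        return False
    return any(re.search(pattern, url) for pattern in SUBMISSION_PATTERNS)
```

In a browser-automation framework such as Playwright, a route handler would call `should_block` on each outgoing request and abort the matching ones instead of sending them, so the agent can fill out a real checkout flow without ever placing an order.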
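The five-layer recording in the second bullet amounts to five timestamped event streams that can be lined up during debugging. The event schema and class below are an illustrative assumption rather than ClawBench's actual format; the five layer names come from the summary.

```python
import json
import time
from collections import defaultdict

# The five layers named in the summary above.
LAYERS = ("session_replay", "screenshot", "http", "reasoning", "action")

class Recorder:
    """Append-only, timestamped streams, one per layer. Shared timestamps
    let a debugger match a failed browser action to the HTTP traffic and
    agent reasoning that surrounded it."""

    def __init__(self) -> None:
        self.streams: dict[str, list] = defaultdict(list)

    def log(self, layer: str, payload: dict) -> None:
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer}")
        self.streams[layer].append({"t": time.time(), **payload})

    def dump(self) -> str:
        return json.dumps(self.streams, default=str)

rec = Recorder()
rec.log("action", {"type": "click", "selector": "#submit"})
rec.log("http", {"method": "POST", "url": "/bookings/submit", "blocked": True})
```

Replaying the streams side by side turns a binary pass/fail into a traceable failure: the `action` stream shows what the agent tried, and the `http` stream shows what actually left the browser.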
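For the third bullet, trajectory comparison can be prototyped with a crude overlap score before bringing in an LLM judge. Exact string matching of step labels is an assumption for illustration; a real agentic evaluator would match steps fuzzily and judge whether the agent's path achieves the same end state as the human reference.

```python
def step_overlap(agent_steps: list[str], reference_steps: list[str]) -> float:
    """Fraction of human reference steps that the agent also performed,
    in any order. A placeholder for an LLM-judge comparison."""
    if not reference_steps:
        return 1.0
    performed = set(agent_steps)
    hits = sum(1 for step in reference_steps if step in performed)
    return hits / len(reference_steps)

# Hypothetical trajectories: the agent stops before the final confirmation.
agent = ["open_site", "fill_form", "click_review"]
human = ["open_site", "fill_form", "click_review", "confirm_booking"]
score = step_overlap(agent, human)  # 3 of 4 reference steps matched -> 0.75
```

A score below 1.0 combined with the recording layers pinpoints the missing step (here, the unreached `confirm_booking`), which is exactly the diagnosis a binary pass/fail hides.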