ClawBench: Can AI Agents Complete Everyday Online Tasks?
Summary
ClawBench is the first benchmark to evaluate AI agents on real-world web tasks using live production websites. Instead of sandboxed environments, it uses targeted HTTP interception to safely block only final submission requests while preserving each site's authentic complexity. The benchmark comprises 153 everyday tasks across 144 platforms (booking flights, submitting job applications, making purchases). A novel five-layer recording system captures session replays, screenshots, HTTP traffic, agent reasoning, and browser actions, and an agentic evaluator compares agent trajectories against human references. Frontier models such as Claude Sonnet 4.6 achieve only a 33.3% success rate despite scoring 65-75% on traditional benchmarks, revealing a large gap between controlled evaluation and real-world performance.
Key findings
- Frontier AI models show dramatic performance drops from 65-75% on existing benchmarks to 6.5-33.3% on real-world web tasks
- Safe evaluation on live websites is possible through targeted HTTP interception of only final submission requests
- Five-layer behavioral recording enables traceable failure diagnosis beyond binary pass/fail scores
- Model performance varies significantly across task categories, with no single model dominating all domains
How to implement
- Integrate the HTTP interception mechanism into existing web automation testing frameworks to enable safe evaluation of RPA bots on production sites without triggering real transactions
- Adopt the five-layer recording infrastructure to build debugging tools for web agent failures, allowing developers to trace exactly where and why their automation scripts break on dynamic websites
- Use the agentic evaluator approach to automatically validate customer service chatbots that need to complete forms or reservations, comparing bot trajectories against reference trajectories recorded from human customer service representatives
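The interception idea in the first bullet can be sketched as a request filter: allow all traffic so the site behaves normally, but block state-changing requests to final-submission endpoints. The URL patterns and function name below are illustrative assumptions, not ClawBench's actual rules.

```python
import re

# Hypothetical "final submission" endpoint patterns; ClawBench's real
# pattern list is not published here, so these are assumptions.
SUBMISSION_PATTERNS = [
    r"/checkout/confirm",
    r"/bookings?/submit",
    r"/applications?/submit",
]

def should_block(method: str, url: str) -> bool:
    """Block only state-changing requests to final-submission endpoints,
    letting all other traffic (GETs, asset loads, search) pass through."""
    if method.upper() not in {"POST", "PUT", "PATCH"}:
        return False
    return any(re.search(pattern, url) for pattern in SUBMISSION_PATTERNS)
```

In a browser-automation framework such as Playwright, a route handler would call `should_block` on each outgoing request and abort the matching ones instead of sending them, so the agent can fill out a real checkout flow without ever placing an order.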
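The five-layer recording in the second bullet amounts to five timestamped event streams that can be lined up during debugging. The event schema and class below are an illustrative assumption rather than ClawBench's actual format; the five layer names come from the summary.

```python
import json
import time
from collections import defaultdict

# The five layers named in the summary above.
LAYERS = ("session_replay", "screenshot", "http", "reasoning", "action")

class Recorder:
    """Append-only, timestamped streams, one per layer. Shared timestamps
    let a debugger match a failed browser action to the HTTP traffic and
    agent reasoning that surrounded it."""

    def __init__(self) -> None:
        self.streams: dict[str, list] = defaultdict(list)

    def log(self, layer: str, payload: dict) -> None:
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer}")
        self.streams[layer].append({"t": time.time(), **payload})

    def dump(self) -> str:
        return json.dumps(self.streams, default=str)

rec = Recorder()
rec.log("action", {"type": "click", "selector": "#submit"})
rec.log("http", {"method": "POST", "url": "/bookings/submit", "blocked": True})
```

Replaying the streams side by side turns a binary pass/fail into a traceable failure: the `action` stream shows what the agent tried, and the `http` stream shows what actually left the browser.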
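For the third bullet, trajectory comparison can be prototyped with a crude overlap score before bringing in an LLM judge. Exact string matching of step labels is an assumption for illustration; a real agentic evaluator would match steps fuzzily and judge whether the agent's path achieves the same end state as the human reference.

```python
def step_overlap(agent_steps: list[str], reference_steps: list[str]) -> float:
    """Fraction of human reference steps that the agent also performed,
    in any order. A placeholder for an LLM-judge comparison."""
    if not reference_steps:
        return 1.0
    performed = set(agent_steps)
    hits = sum(1 for step in reference_steps if step in performed)
    return hits / len(reference_steps)

# Hypothetical trajectories: the agent stops before the final confirmation.
agent = ["open_site", "fill_form", "click_review"]
human = ["open_site", "fill_form", "click_review", "confirm_booking"]
score = step_overlap(agent, human)  # 3 of 4 reference steps matched -> 0.75
```

A score below 1.0 combined with the recording layers pinpoints the missing step (here, the unreached `confirm_booking`), which is exactly the diagnosis a binary pass/fail hides.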