ISSUE 004

AI Research Weekly – adversarial learning & more – April 26, 2026

/ 01 · 8.2/10

Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows

This paper addresses the 'Tools Tax' - the overhead of injecting all tool schemas into LLM prompts on every turn in Model Context Protocol (MCP) systems. The authors propose Tool Attention, which dynamically selects relevant tools using semantic similarity scores, stateful gating functions, and lazy schema loading. Their approach reduces tool-related tokens by 95% (47k to 2.4k tokens) while maintaining task success rates. The mechanism works by keeping compact tool summaries always in context but only loading full schemas for the top-k most relevant tools per turn, combined with a hallucination gate to handle false negatives.

  • The Tools Tax scales linearly with catalog size and dominates the effective context window past ~50 tools, causing reasoning degradation
  • Tool Attention achieves 95% reduction in tool tokens per turn while improving projected task success by 22 percentage points
  • Two-phase lazy loading (summaries always present, full schemas on-demand) preserves tool discoverability while minimizing context overhead
  • Semantic gating based on Intent-Schema Overlap scores provides both efficiency gains and defensive benefits against tool poisoning attacks
  • Implement Tool Attention middleware in enterprise chatbots that integrate 50+ internal APIs to reduce per-session costs from $55 to $8 while improving response accuracy
  • Deploy in code assistants like GitHub Copilot that access multiple development tools (Git, databases, CI/CD) to prevent context window exhaustion during long coding sessions
  • Integrate into customer service agents with access to CRM, ticketing, knowledge base, and communication tools to maintain reasoning quality across extended support conversations
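
The two-phase loading described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the catalog, the `gate_tools` helper, and the bag-of-words similarity (standing in for a real embedding model and for the Intent-Schema Overlap score) are all assumptions, and the stateful gating and hallucination gate are omitted.

```python
import math
from collections import Counter

# Hypothetical catalog: tool name -> (one-line summary, full schema).
CATALOG = {
    "git_commit": ("commit staged changes to a git repository",
                   {"name": "git_commit", "parameters": {"message": "string"}}),
    "sql_query": ("run a sql query against the analytics database",
                  {"name": "sql_query", "parameters": {"query": "string"}}),
    "send_email": ("send an email message to a recipient",
                   {"name": "send_email", "parameters": {"to": "string", "body": "string"}}),
}

def bow(text):
    """Bag-of-words vector; a stand-in for a real sentence embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def gate_tools(user_turn, catalog, k=1):
    """Return summaries for every tool and full schemas for the top-k only."""
    query = bow(user_turn)
    ranked = sorted(catalog,
                    key=lambda name: cosine(query, bow(catalog[name][0])),
                    reverse=True)
    summaries = {name: catalog[name][0] for name in catalog}
    schemas = [catalog[name][1] for name in ranked[:k]]
    return summaries, schemas

summaries, schemas = gate_tools(
    "run a query on the database for last week's sales", CATALOG, k=1)
print(sorted(summaries), [s["name"] for s in schemas])
```

Every summary stays in the prompt so the model can still discover a tool the gate missed, while only the selected schemas pay the full token cost.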
/ 02 · 8.0/10

The Sample Complexity of Multicalibration

This paper resolves the fundamental question of how many samples are needed to learn multicalibrated predictors. While regular calibration requires Θ(ε^-2) samples, the authors prove multicalibration requires Θ(ε^-3) samples, showing it's inherently harder. They construct ingenious hard instances using coding theory that force any learner to essentially solve a complex decoding problem. The results extend beyond means to other statistical properties like quantiles and expectiles, and hold for weighted Lp error metrics. Surprisingly, this matches the online adversarial setting's sample complexity, unlike regular calibration which is easier in batch settings.

  • Multicalibration sample complexity is Θ(ε^-3) for ECE metric, separating it from Θ(ε^-2) marginal calibration
  • Lower bounds extend to weighted Lp metrics with optimal exponent 3/p for p ∈ [1,2]
  • Results hold for regular elicitable properties including expectiles and bounded-density quantiles
  • Batch and online multicalibration have matching sample complexities, unlike marginal calibration
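  • When deploying fair ML models in production, use these bounds to estimate required training data size - for a target multicalibration error of ε ≤ 0.01, plan on the order of ε^-3 ≈ 1M samples rather than assuming the ε^-2 ≈ 10K that marginal calibration would suggest (constants and other problem parameters aside)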
  • For A/B testing platforms ensuring fairness across user segments, implement the randomized predictor construction to achieve multicalibration guarantees with provably minimal sample requirements
  • In recommendation systems requiring calibration across demographic groups, apply the Lp multicalibration framework with appropriate p values to balance different fairness objectives while respecting sample complexity constraints
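
The gap between the two rates is easy to see numerically. The sketch below only evaluates ε^-2 and ε^-3; the `constant` parameter is a placeholder assumption, since Θ(·) hides constant factors and any dependence on other problem parameters.

```python
def samples_needed(eps, exponent, constant=1.0):
    """Order-of-magnitude sample count: constant * eps ** (-exponent)."""
    return constant * eps ** (-exponent)

for eps in (0.1, 0.05, 0.01):
    marginal = samples_needed(eps, 2)   # marginal calibration rate
    multi = samples_needed(eps, 3)      # multicalibration rate
    print(f"eps={eps}: calibration ~{marginal:,.0f}, "
          f"multicalibration ~{multi:,.0f}, ratio ~{multi / marginal:,.0f}x")
```

The ratio between the two counts is 1/ε, so each halving of the error target doubles the extra data that multicalibration demands relative to marginal calibration.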
/ 03 · 8.0/10

MathDuels: Evaluating LLMs as Problem Posers and Solvers

MathDuels introduces a self-play benchmark where language models both create and solve mathematical problems. Unlike static benchmarks that saturate as models improve, MathDuels maintains discriminative power by having stronger models generate harder problems that challenge previously dominant solvers. The system uses a three-stage generation pipeline (meta-prompting, problem generation, difficulty amplification) and Rasch modeling to jointly estimate solver abilities and problem difficulties. Testing 19 frontier models reveals that authoring and solving capabilities are partially decoupled - strong problem solvers aren't necessarily strong problem authors. This dual-role evaluation exposes capability gaps invisible in solve-only benchmarks and automatically scales difficulty as participants strengthen.

  • Authoring and solving capabilities are partially decoupled - the best solver (GPT-5.4-high) ranks second overall while Gemini-3.1-Pro-high leads due to superior problem authoring
  • Problems from stronger newcomers break prior top-3 solvers at 3.4x the rate of other participants, demonstrating automatic difficulty scaling
  • 61% of generated problems provide discriminative signal by defeating at least one solver, with the remaining 39% solved by all participants
  • The three-stage generation pipeline roughly doubles error rates at each stage, creating progressively harder problems that test reasoning over computation
  • Build adaptive assessment systems for coding interviews where candidates author programming challenges for peers while solving others' problems, revealing both technical skills and problem-design thinking
  • Create self-scaling evaluation frameworks for AI research labs to continuously benchmark model capabilities without manual curation as new models are released
  • Develop competitive programming platforms where participants earn points both for solving problems and authoring challenges that stump other competitors, automatically maintaining difficulty levels
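
For readers unfamiliar with Rasch modeling, here is a minimal sketch of jointly estimating solver abilities and problem difficulties from a solved/unsolved matrix. The gradient-ascent fit, the learning rate, and the toy outcomes are illustrative assumptions; the paper's actual estimation procedure may differ.

```python
import math

def p_solve(ability, difficulty):
    """Rasch model: P(solved) = sigmoid(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def fit_rasch(outcomes, n_solvers, n_problems, lr=0.1, steps=500):
    """Joint maximum-likelihood fit by plain gradient ascent.

    outcomes: list of (solver_index, problem_index, solved), solved in {0, 1}.
    """
    theta = [0.0] * n_solvers      # solver abilities
    b = [0.0] * n_problems         # problem difficulties
    for _ in range(steps):
        g_theta = [0.0] * n_solvers
        g_b = [0.0] * n_problems
        for s, q, y in outcomes:
            err = y - p_solve(theta[s], b[q])
            g_theta[s] += err      # d log-likelihood / d theta_s
            g_b[q] -= err          # d log-likelihood / d b_q
        theta = [t + lr * g for t, g in zip(theta, g_theta)]
        b = [d + lr * g for d, g in zip(b, g_b)]
        # The model is shift-invariant, so pin the mean difficulty at zero.
        shift = sum(b) / len(b)
        theta = [t - shift for t in theta]
        b = [d - shift for d in b]
    return theta, b

# Made-up outcomes: 3 solvers x 3 problems.
outcomes = [(0, 0, 1), (0, 1, 1), (0, 2, 0),
            (1, 0, 1), (1, 1, 0), (1, 2, 0),
            (2, 0, 0), (2, 1, 1), (2, 2, 1)]
theta, b = fit_rasch(outcomes, n_solvers=3, n_problems=3)
print([round(t, 2) for t in theta], [round(d, 2) for d in b])
```

Because abilities and difficulties sit on one scale, a new, harder problem set shifts difficulty estimates without invalidating earlier solver rankings.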
/ 04 · 8.0/10

Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms

This paper introduces RedirectQA, a dataset that tests whether large language models consistently access the same factual knowledge when entities are referred to by different names (e.g., 'Pelé' vs 'Edson Arantes do Nascimento'). Testing 13 models, the authors find a 23.7% inconsistency rate: models answer correctly for one name but incorrectly for another, despite the underlying fact being identical. Models are more robust to minor spelling changes than to major lexical variations like abbreviations. Frequency analysis reveals that both entity-level and surface-level frequencies predict accuracy, suggesting cross-surface coupling rather than independent memorization of each name variant.

  • LLMs show 23.7% inconsistency in factual QA when only entity surface forms change, challenging surface-invariant memorization assumptions
  • Models are more robust to spelling variants (punctuation, diacritics) than lexical variants (abbreviations, aliases)
  • Both entity-level and surface-level frequencies predict accuracy, indicating cross-surface coupling in factual access
  • Even strong instruction-tuned models like GPT-4o-mini exhibit surface-form sensitivity in factual knowledge retrieval
  • Implement multi-variant entity testing in RAG system evaluation pipelines by generating alternative entity names to test knowledge base consistency before deployment
  • Build entity normalization preprocessing layers that identify and handle problematic abbreviations/aliases based on model-specific robustness patterns discovered through RedirectQA-style testing
  • Design factual QA training data augmentation strategies that oversample lexically diverse entity mentions to improve cross-surface consistency during fine-tuning
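
A multi-variant consistency check of the kind suggested above could be sketched like this. The metric here (share of entities whose correctness differs across surface forms) is one plausible reading of "inconsistency"; the paper's exact definition may differ, and the entity ids and results are made up.

```python
def inconsistency_rate(results):
    """Fraction of entities answered correctly for some surface forms but not all.

    results: dict mapping entity id -> {surface form: answered_correctly}.
    """
    inconsistent = sum(
        1 for forms in results.values() if len(set(forms.values())) > 1
    )
    return inconsistent / len(results)

# Made-up evaluation results for four entities.
results = {
    "entity_1": {"Pelé": True, "Edson Arantes do Nascimento": False},
    "entity_2": {"NASA": True, "National Aeronautics and Space Administration": True},
    "entity_3": {"Zürich": True, "Zurich": True},
    "entity_4": {"alias_a": False, "alias_b": False},
}
print(inconsistency_rate(results))
```

Entities that are wrong under every name count as consistent here, which keeps the metric focused on surface-form sensitivity rather than on missing knowledge.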
/ 05 · 8.0/10

Machine Behavior in Relational Moral Dilemmas: Moral Rightness, Predicted Human Behavior, and Model Decisions

This paper reveals a critical inconsistency in large language models: while they can accurately predict that humans weigh loyalty more heavily in close relationships when making moral decisions, the models themselves consistently prioritize fairness-based rules regardless of context. Using the Whistleblower's Dilemma across varying crime severity and relationship closeness, researchers tested six LLMs on three perspectives: moral rightness (what should be done), predicted human behavior (what people typically do), and the model's own decision. Models showed sophisticated understanding of human social dynamics but failed to incorporate this nuance into their own moral reasoning, creating a gap between internal world-modeling and decision-making that could lead to socially tone-deaf AI systems.

  • LLMs exhibit systematic inconsistency between their predictions of human moral behavior and their own moral decisions
  • Models accurately predict that humans prioritize loyalty in close relationships but fail to incorporate this social sensitivity into their own judgments
  • Moral reasoning varies significantly based on evaluative perspective (prescriptive vs. descriptive vs. decisional), revealing non-monolithic machine morality
  • Models demonstrate context-dependent moral foundation weighting, with loyalty increasing and fairness decreasing as relational closeness grows, but only in behavioral predictions
  • Build AI ethics advisors that explicitly separate 'what's right' from 'what people typically do' recommendations, helping users understand both normative standards and social expectations in workplace conflict resolution
  • Design customer service chatbots that adjust moral reasoning based on relationship context - applying stricter fairness principles for strangers but acknowledging loyalty considerations for long-term customers
  • Develop content moderation systems that account for relationship context when evaluating reported violations, distinguishing between stranger interactions and close friend dynamics to reduce over-moderation
/ 06 · 8.0/10

Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models

This paper introduces Transient Turn Injection (TTI), a novel attack method that bypasses LLM safety mechanisms through stateless, multi-turn interactions. Unlike traditional jailbreaks that rely on persistent conversation context, TTI uses an automated attacker LLM to generate independent prompts that gradually elicit harmful information without triggering per-turn safety filters. Testing across 13 state-of-the-art models reveals significant vulnerabilities, with Gemini models showing 34-40% unsafe response rates and only Anthropic's Constitutional AI showing strong resistance. The work demonstrates that current stateless moderation approaches are insufficient against sophisticated adversarial strategies.

  • TTI attacks consistently outperform traditional PAIR attacks across all tested models, with TTI achieving 2-10x higher success rates
  • Gemini model variants show highest vulnerability to TTI with 34-40% unsafe response rates, while Anthropic Claude shows strongest resistance at 2%
  • Medical and harmful content categories represent the most common vulnerability types across nearly all evaluated models
  • Per-turn safety filters fail against distributed adversarial intent, requiring session-level and context-aware defenses
  • Implement session-level risk scoring systems that aggregate prompt similarity and escalation patterns across API calls to detect coordinated TTI attacks before they succeed
  • Deploy Constitutional AI-style training pipelines that embed ethical reasoning directly into model weights rather than relying solely on post-hoc safety classifiers
  • Build automated red-teaming frameworks that simulate TTI attacks against internal LLM deployments to identify vulnerabilities before production release
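
Session-level risk scoring can be sketched as a decayed running total kept per client rather than per conversation, since TTI prompts arrive as independent requests. The class name, thresholds, decay factor, and the per-request `turn_risk` score (assumed to come from some existing classifier) are all hypothetical.

```python
from collections import defaultdict

class SessionRiskMonitor:
    """Accumulates per-request risk per client so distributed probing adds up."""

    def __init__(self, per_turn_threshold=0.8, session_threshold=1.5, decay=0.9):
        self.per_turn_threshold = per_turn_threshold
        self.session_threshold = session_threshold
        self.decay = decay
        self._score = defaultdict(float)   # client id -> decayed cumulative risk

    def check(self, client_id, turn_risk):
        """Return 'block_turn', 'flag_session', or 'allow' for one request."""
        if turn_risk >= self.per_turn_threshold:
            return "block_turn"            # an ordinary per-turn filter catches this
        self._score[client_id] = self.decay * self._score[client_id] + turn_risk
        if self._score[client_id] >= self.session_threshold:
            return "flag_session"          # individually mild requests accumulated
        return "allow"

monitor = SessionRiskMonitor()
decisions = [monitor.check("api-key-123", risk) for risk in (0.5, 0.5, 0.5, 0.5)]
print(decisions)
```

Each request in the example stays under the per-turn threshold, yet the fourth one pushes the decayed total past the session threshold, which is the pattern a per-turn filter alone cannot see.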
/ 07 · 8.0/10

TraceScope: Interactive URL Triage via Decoupled Checklist Adjudication

TraceScope presents a novel two-stage system for analyzing sophisticated phishing attacks that evade traditional detection. An operator agent interacts with suspicious websites in a sandboxed browser to bypass CAPTCHAs and interaction gates, recording all activity. A separate adjudicator agent then analyzes the recorded evidence using a MITRE ATT&CK checklist to render verdicts. The system achieved 0.94 precision and 0.78 recall on live URLs, significantly outperforming existing visual classifiers that rely on static snapshots. The key innovation lies in decoupling risky interaction from analysis, enabling safe investigation of evasive threats.

  • 25% of URLs appearing offline to standard crawlers are actually live but use anti-bot cloaking
  • Interactive phishing detection achieves 0.78 recall vs 0.25 for best static baseline on live URLs
  • Logo-less and non-credential phishing (crypto wallets, PII harvesting) represent major blind spots for existing detectors
  • Decoupled operator-adjudicator architecture with immutable evidence prevents observer effects while enabling forensic analysis
  • Security operations centers can deploy TraceScope as second-stage analysis for URLs flagged by lightweight filters, automating the manual triage workflow that currently requires human analysts
  • Email security vendors can integrate the operator-adjudicator pattern to analyze suspicious attachments or embedded content, capturing dynamic behaviors missed by static sandboxes
  • Threat intelligence platforms can adapt the evidence recording and MITRE ATT&CK mapping components to build automated incident response reports from any interactive malware analysis
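
The operator-adjudicator split can be illustrated with a small sketch in which the operator emits immutable evidence and the adjudicator is a pure function over it. Everything here is hypothetical: the `Event` record, the stubbed `operate` function, and the two checklist items, which are placeholders and not actual MITRE ATT&CK techniques or TraceScope's checklist.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

@dataclass(frozen=True)
class Event:
    """One recorded observation from the sandboxed browsing session."""
    kind: str     # e.g. "navigation", "captcha", "form_field"
    detail: str

Evidence = Tuple[Event, ...]

# Hypothetical checklist: item name -> predicate over the recorded events.
CHECKLIST: Dict[str, Callable[[Evidence], bool]] = {
    "requests_credentials": lambda ev: any(
        e.kind == "form_field" and "password" in e.detail for e in ev),
    "gated_by_interaction": lambda ev: any(e.kind == "captcha" for e in ev),
}

def operate(url: str) -> Evidence:
    """Stand-in for the operator agent; a real one would drive a sandboxed browser."""
    return (
        Event("navigation", url),
        Event("captcha", "checkbox challenge passed"),
        Event("form_field", "input type=password"),
    )

def adjudicate(evidence: Evidence) -> Dict[str, bool]:
    """Evaluate every checklist item against the frozen evidence only."""
    return {item: check(evidence) for item, check in CHECKLIST.items()}

evidence = operate("https://example.test/login")
verdict = adjudicate(evidence)
print(verdict)
```

Because the adjudicator never touches the live site, the same evidence can be re-judged later with a revised checklist without repeating the risky interaction.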
/ 08 · 8.0/10

Compliance Moral Hazard and the Backfiring Mandate

This paper develops a mechanism design framework for competing firms to share risk signals truthfully in federated learning settings. The key innovation is Temporal Value Assignment (TVA), which uses strictly proper scoring rules to credit institutions for early accurate warnings about risky customers. The surprising main result is that mandatory information sharing without proper incentives can actually reduce welfare below autarky levels due to strategic underreporting. The framework addresses three strategic frictions: compliance moral hazard, adversarial adaptation, and information destruction through intervention. A network Shapley value characterization shows that coalition value depends on cross-border transaction volume rather than institution size.

  • Information-sharing mandates can backfire: mandatory sharing without incentive alignment performs barely better than no sharing (56% vs 54% of first-best welfare), because strategic underreporting yields global models that are worse than honest local ones
  • TVA mechanism achieves 87% of first-best welfare by implementing truthful reporting as a Bayes-Nash equilibrium through temporal discounting that creates a first-mover advantage for early detection
  • Network Shapley value is proportional to weighted cross-border degree, meaning coalition design should prioritize high inter-institutional transaction volume over total asset size
  • Temporal discounting with γ close to 1 provides sublinear regret against adaptive adversaries and acts as a commitment device that deters strategic manipulation
  • Build federated fraud detection across payment processors (Stripe, Square, PayPal) where each processor contributes local transaction graph embeddings and receives credits based on early accurate merchant risk predictions, with temporal discounting favoring faster flagging
  • Implement collaborative cybersecurity threat intelligence sharing among cloud providers where each contributes attack pattern features and earns reputation scores for early accurate threat warnings, preventing free-riding on others' detection investments
  • Deploy multi-platform abuse detection federation across social media companies where platforms share user behavior embeddings and receive regulatory compliance credits for timely identification of coordinated inauthentic behavior networks
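
To see why a strictly proper scoring rule rewards truthful and early reports, consider the sketch below. It uses the textbook quadratic (Brier-style) rule with a multiplicative discount γ^delay; the summary does not say which rule or discount form TVA actually uses, so both are assumptions, as are the function names.

```python
def brier_credit(report, outcome, delay, gamma=0.95):
    """Discounted quadratic score for one risk report.

    report:  stated probability that the customer is risky, in [0, 1]
    outcome: 1 if the customer turned out risky, else 0
    delay:   periods elapsed before the report was filed
    """
    return (gamma ** delay) * (1.0 - (report - outcome) ** 2)

def expected_credit(report, belief, delay, gamma=0.95):
    """Expected credit for a reporter whose true belief is `belief`."""
    return (belief * brier_credit(report, 1, delay, gamma)
            + (1.0 - belief) * brier_credit(report, 0, delay, gamma))

belief = 0.7
grid = [i / 100 for i in range(101)]
best_report = max(grid, key=lambda r: expected_credit(r, belief, delay=0))
print(best_report)
print(expected_credit(belief, belief, delay=0) > expected_credit(belief, belief, delay=3))
```

Under this rule, expected credit is maximized by reporting one's true belief, and the same report earns less the longer it is withheld, which is the first-mover incentive described above.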
/ 09 · 8.0/10

Fine-Tuning Regimes Define Distinct Continual Learning Problems

This paper reveals that continual learning method comparisons are unreliable because they depend heavily on which network parameters remain trainable during sequential task learning. The authors show that changing only the 'fine-tuning regime' (which layers can be updated) can completely reverse the relative rankings of standard continual learning algorithms like EWC, LwF, and GEM. They formalize this as constrained optimization over parameter subspaces and demonstrate across five datasets that deeper adaptation regimes produce larger gradient updates, more forgetting, and stronger coupling between the two. This challenges the validity of existing comparative studies and suggests continual learning benchmarks need regime-aware evaluation protocols.

  • Method rankings in continual learning are not preserved across different fine-tuning regimes - the same algorithms can rank completely differently based solely on trainable depth
  • Deeper adaptation regimes (more trainable parameters) consistently produce larger gradient magnitudes and higher forgetting rates
  • The relationship between gradient magnitude and forgetting becomes stronger in more permissive training regimes
  • Current continual learning benchmarks may be drawing regime-specific conclusions that don't generalize across deployment conditions
  • Redesign continual learning benchmarks for production ML systems by testing across multiple fine-tuning regimes before selecting algorithms for sequential model updates in recommendation systems or fraud detection
  • Implement regime-aware hyperparameter tuning in MLOps pipelines that need to adapt models to new data streams, testing both shallow and deep adaptation strategies
  • Build more robust continual learning evaluation frameworks for edge AI systems where computational constraints require different trainable depth configurations across deployment environments
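
A fine-tuning regime can be pictured as a mask over which layers receive updates. The toy sketch below (layer names, values, and helper functions are all hypothetical) applies the same gradients under three regimes and measures the size of the resulting parameter change; it only illustrates the constrained-update idea and does not reproduce the paper's experiments.

```python
LAYERS = ["conv1", "conv2", "conv3", "head"]

def regime(trainable_depth):
    """Names of the last `trainable_depth` layers (>= 1); earlier layers stay frozen."""
    return set(LAYERS[-trainable_depth:])

def masked_sgd_step(params, grads, trainable, lr=0.1):
    """Apply SGD only to layers in `trainable`; frozen layers are copied unchanged."""
    updated = {}
    for name, weights in params.items():
        if name in trainable:
            updated[name] = [w - lr * g for w, g in zip(weights, grads[name])]
        else:
            updated[name] = list(weights)
    return updated

def update_norm(old, new):
    """Euclidean norm of the total parameter change across all layers."""
    return sum((a - b) ** 2 for n in old for a, b in zip(old[n], new[n])) ** 0.5

params = {name: [1.0, -1.0] for name in LAYERS}
grads = {name: [0.5, 0.5] for name in LAYERS}

norms = {}
for depth in (1, 2, 4):
    new_params = masked_sgd_step(params, grads, regime(depth))
    norms[depth] = update_norm(params, new_params)
    print(depth, sorted(regime(depth)), round(norms[depth], 4))
```

Running a method comparison once per regime, rather than once overall, is the kind of regime-aware protocol the paper argues for.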
/ 10 · 7.5/10

When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

This paper addresses hallucinations in Large Vision-Language Models (LVLMs) by introducing HalluScope, a benchmark that separates hallucination causes into visual perception failures, learned object co-occurrence biases, and textual instruction presuppositions. The key finding is that modern LVLMs hallucinate primarily due to over-reliance on textual priors rather than visual perception limitations. The authors propose HalluVL-DPO, a preference optimization framework using synthetic training data to fine-tune models toward more visually grounded responses. Results show substantial improvements on targeted hallucination tasks while preserving general multimodal capabilities.

  • Modern LVLM hallucinations stem primarily from textual instruction presuppositions (35-85% performance drop) rather than visual backbone failures
  • Adversarial objects (contextually plausible but absent) cause 8-37% accuracy drops due to learned co-occurrence biases
  • Sample-weighted preference optimization using synthetic data effectively reduces prompt-induced hallucinations while maintaining general performance
  • Cross-model transferability of synthetic preference data enables scalable hallucination mitigation across different LVLM architectures
  • Build automated quality control systems for medical imaging reports by fine-tuning LVLMs with HalluVL-DPO to reduce false positive findings when radiologists' prompts contain implicit assumptions about pathology presence
  • Develop more reliable autonomous vehicle perception systems by applying the diagnostic framework to identify when navigation instructions override visual evidence about road conditions or obstacles
  • Create robust content moderation tools for social media platforms by training models to resist generating false claims about image content when user-submitted captions contain misleading presuppositions
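
Sample-weighted preference optimization can be sketched with the standard DPO loss, scaled per example by a weight. The loss formula is the usual DPO objective; how HalluVL-DPO actually sets its weights is not stated in this summary, so the `weight` field and the toy log-probabilities are assumptions.

```python
import math

def weighted_dpo_loss(batch, beta=0.1):
    """Weighted mean of the DPO loss over a batch.

    Each example holds summed log-probabilities of the chosen (grounded) and
    rejected (hallucinated) responses under the policy and a frozen reference
    model, plus a per-sample weight.
    """
    total, weight_sum = 0.0, 0.0
    for ex in batch:
        margin = beta * ((ex["pi_chosen"] - ex["ref_chosen"])
                         - (ex["pi_rejected"] - ex["ref_rejected"]))
        loss = math.log1p(math.exp(-margin))   # equals -log(sigmoid(margin))
        total += ex["weight"] * loss
        weight_sum += ex["weight"]
    return total / weight_sum

# Toy numbers: the policy already prefers the grounded answer in the first pair.
batch = [
    {"pi_chosen": -10.0, "ref_chosen": -12.0,
     "pi_rejected": -15.0, "ref_rejected": -12.0, "weight": 1.0},
    {"pi_chosen": -11.0, "ref_chosen": -11.0,
     "pi_rejected": -11.0, "ref_rejected": -11.0, "weight": 2.0},
]
print(round(weighted_dpo_loss(batch), 4))
```

Raising the weight on pairs built from misleading prompts would concentrate the gradient on prompt-induced hallucinations, which is one way to read "sample-weighted" here.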