ISSUE 004 April 26, 2026

AI Research Weekly – adversarial learning & more – April 26, 2026

/ 01 · 8.2/10

Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows

Summary

This paper addresses the 'Tools Tax': the overhead of injecting every tool schema into the LLM prompt on every turn in Model Context Protocol (MCP) systems. The authors propose Tool Attention, which dynamically selects relevant tools using semantic similarity scores, stateful gating functions, and lazy schema loading. Their approach reduces tool-related tokens by 95% (47k to 2.4k tokens per turn) while maintaining task success rates. The mechanism keeps compact tool summaries in context at all times but loads full schemas only for the top-k most relevant tools per turn, and adds a hallucination gate to handle false negatives.

Key findings

The Tools Tax scales linearly with catalog size and dominates the effective context window past ~50 tools, causing reasoning degradation
Tool Attention achieves 95% reduction in tool tokens per turn while improving projected task success by 22 percentage points
Two-phase lazy loading (summaries always present, full schemas on-demand) preserves tool discoverability while minimizing context overhead
Semantic gating based on Intent-Schema Overlap scores provides both efficiency gains and defensive benefits against tool poisoning attacks

How to implement

Implement Tool Attention middleware in enterprise chatbots that integrate 50+ internal APIs to reduce per-session costs from $55 to $8 while improving response accuracy
Deploy in code assistants like GitHub Copilot that access multiple development tools (Git, databases, CI/CD) to prevent context window exhaustion during long coding sessions
Integrate into customer service agents with access to CRM, ticketing, knowledge base, and communication tools to maintain reasoning quality across extended support conversations
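
The two-phase idea described above (summaries always present, full schemas only for the top-k tools) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the catalog, the tool names, the bag-of-words cosine similarity (a stand-in for a real semantic embedding), and the `resolve_tool_call` fallback (one possible reading of the hallucination gate) are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical tool catalog: names, summaries, and schemas are illustrative only.
CATALOG = {
    "git_commit": {
        "summary": "commit staged changes to the git repository",
        "schema": {"type": "object", "properties": {"message": {"type": "string"}}},
    },
    "sql_query": {
        "summary": "run a read only sql query against the database",
        "schema": {"type": "object", "properties": {"query": {"type": "string"}}},
    },
    "send_email": {
        "summary": "send an email message to a recipient",
        "schema": {"type": "object", "properties": {"to": {"type": "string"}}},
    },
}


def bag_of_words(text):
    """Count lowercase word tokens (assumed stand-in for an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_tool_context(query, catalog, k=1):
    """Return summaries for every tool plus full schemas for the top-k matches."""
    q = bag_of_words(query)
    scores = {name: cosine(q, bag_of_words(tool["summary"])) for name, tool in catalog.items()}
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return {
        "summaries": {name: tool["summary"] for name, tool in catalog.items()},
        "schemas": {name: catalog[name]["schema"] for name in top},
    }


def resolve_tool_call(name, context, catalog):
    """Assumed fallback: if the model calls a gated-out tool, load its schema lazily."""
    if name not in catalog:
        raise KeyError(f"unknown tool: {name}")
    if name not in context["schemas"]:
        context["schemas"][name] = catalog[name]["schema"]
    return context["schemas"][name]


context = build_tool_context("run a sql query to count users in the database", CATALOG, k=1)
print(sorted(context["schemas"]))
```

In a real deployment the similarity function would come from an embedding model and k would be tuned against task success; the sketch only shows where the gating decision sits relative to prompt assembly.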

/ 02 · 8.0/10

The Sample Complexity of Multicalibration

Summary

This paper resolves the fundamental question of how many samples are needed to learn multicalibrated predictors. While regular calibration requires Θ(ε^-2) samples, the authors prove that multicalibration requires Θ(ε^-3) samples, making it inherently harder. They construct hard instances using coding theory that force any learner to essentially solve a decoding problem. The results extend beyond means to other statistical properties like quantiles and expectiles, and hold for weighted Lp error metrics. Surprisingly, this matches the sample complexity of the online adversarial setting, unlike regular calibration, which is easier in the batch setting.

Key findings

Multicalibration sample complexity is Θ(ε^-3) for the ECE metric, separating it from the Θ(ε^-2) of marginal calibration
Lower bounds extend to weighted Lp metrics with optimal exponent 3/p for p ∈ [1,2]
Results hold for regular elicitable properties including expectiles and bounded-density quantiles
Batch and online multicalibration have matching sample complexities, unlike marginal calibration

How to implement

When deploying fair ML models in production, use these bounds to estimate the required training data size: for a multicalibration error ≤ 0.01, plan for on the order of ε^-3 = 1M samples (ignoring constants) rather than assuming that the ε^-2 = 10K samples of marginal calibration suffice
For A/B testing platforms ensuring fairness across user segments, implement the randomized predictor construction to achieve multicalibration guarantees with provably minimal sample requirements
In recommendation systems requiring calibration across demographic groups, apply the Lp multicalibration framework with appropriate p values to balance different fairness objectives while respecting sample complexity constraints
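
The gap between the two rates is easy to check numerically. The snippet below is a back-of-the-envelope sketch only: it evaluates ε^-2 and ε^-3 with the hidden constant assumed to be 1, which the Θ(·) bounds do not specify, so the outputs are orders of magnitude rather than exact requirements. The function name `samples_needed` is made up for this example.

```python
def samples_needed(epsilon, exponent, constant=1.0):
    """Order-of-magnitude sample count: constant * epsilon ** (-exponent)."""
    if not 0 < epsilon < 1:
        raise ValueError("epsilon must be in (0, 1)")
    return constant * epsilon ** (-exponent)


for eps in (0.1, 0.05, 0.01):
    marginal = samples_needed(eps, 2)  # marginal calibration rate
    multi = samples_needed(eps, 3)     # multicalibration rate
    print(f"eps={eps}: marginal ~{marginal:,.0f}, multicalibration ~{multi:,.0f}")
```

Halving the target error multiplies the multicalibration requirement by 8 rather than 4, which is the practical consequence of the extra factor of 1/ε.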

/ 03 · 8.0/10

MathDuels: Evaluating LLMs as Problem Posers and Solvers

Summary

MathDuels introduces a self-play benchmark where language models both create and solve mathematical problems. Unlike static benchmarks that saturate as models improve, MathDuels maintains discriminative power by having stronger models generate harder problems that challenge previously dominant solvers. The system uses a three-stage generation pipeline (meta-prompting, problem generation, difficulty amplification) and Rasch modeling to jointly estimate solver abilities and problem difficulties. Testing 19 frontier models reveals that authoring and solving capabilities are partially decoupled - strong problem solvers aren't necessarily strong problem authors. This dual-role evaluation exposes capability gaps invisible in solve-only benchmarks and automatically scales difficulty as participants strengthen.

Key findings

Authoring and solving capabilities are partially decoupled - the best solver (GPT-5.4-high) ranks second overall while Gemini-3.1-Pro-high leads due to superior problem authoring
Problems from stronger newcomers break prior top-3 solvers at 3.4x the rate of other participants, demonstrating automatic difficulty scaling
61% of generated problems provide discriminative signal by defeating at least one solver, with the remaining 39% solved by all participants
Three-stage generation pipeline roughly doubles error rates at each stage, creating progressively harder problems that test reasoning over computation

How to implement

Build adaptive assessment systems for coding interviews where candidates author programming challenges for peers while solving others' problems, revealing both technical skills and problem-design thinking
Create self-scaling evaluation frameworks for AI research labs to continuously benchmark model capabilities without manual curation as new models are released
Develop competitive programming platforms where participants earn points both for solving problems and authoring challenges that stump other competitors, automatically maintaining difficulty levels
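
For readers unfamiliar with Rasch modeling, the joint estimation of solver ability and problem difficulty can be sketched as below. The Rasch model itself is standard (the probability of a solve is the logistic function of ability minus difficulty); the plain gradient-ascent fit, the small L2 penalty that keeps estimates finite, and the toy outcome matrix are assumptions for illustration and are not taken from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_rasch(outcomes, n_solvers, n_problems, steps=2000, lr=0.05, l2=0.01):
    """Fit abilities and difficulties by gradient ascent on the penalized log-likelihood.

    outcomes: list of (solver_index, problem_index, solved) with solved in {0, 1}.
    """
    ability = [0.0] * n_solvers
    difficulty = [0.0] * n_problems
    for _ in range(steps):
        # Gradient of the L2 penalty (assumed regularizer, keeps estimates finite).
        grad_a = [-l2 * a for a in ability]
        grad_d = [-l2 * d for d in difficulty]
        for i, j, solved in outcomes:
            residual = solved - sigmoid(ability[i] - difficulty[j])
            grad_a[i] += residual
            grad_d[j] -= residual
        ability = [a + lr * g for a, g in zip(ability, grad_a)]
        difficulty = [d + lr * g for d, g in zip(difficulty, grad_d)]
    return ability, difficulty


# Toy data: solver 0 solves everything, solver 2 solves only the easiest problem.
toy_outcomes = [
    (0, 0, 1), (0, 1, 1), (0, 2, 1),
    (1, 0, 1), (1, 1, 1), (1, 2, 0),
    (2, 0, 1), (2, 1, 0), (2, 2, 0),
]
ability, difficulty = fit_rasch(toy_outcomes, n_solvers=3, n_problems=3)
print([round(a, 2) for a in ability], [round(d, 2) for d in difficulty])
```

Because abilities and difficulties live on the same scale, a new model's problems can be placed relative to existing solvers without re-scoring the whole pool by hand.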

/ 04 · 8.0/10

Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms

Summary

This paper introduces RedirectQA, a dataset that tests whether large language models consistently access the same factual knowledge when entities are referred to by different names (e.g., 'Pelé' vs 'Edson Arantes do Nascimento'). Testing 13 models, the authors find a 23.7% inconsistency rate, where a model answers correctly for one name but incorrectly for another even though the underlying fact is identical. Models are more robust to minor spelling changes than to major lexical variations like abbreviations. Frequency analysis reveals that both entity-level and surface-level frequencies predict accuracy, suggesting cross-surface coupling rather than independent memorization of each name variant.

Key findings

LLMs show 23.7% inconsistency in factual QA when only entity surface forms change, challenging surface-invariant memorization assumptions
Models are more robust to spelling variants (punctuation, diacritics) than lexical variants (abbreviations, aliases)
Both entity-level and surface-level frequencies predict accuracy, indicating cross-surface coupling in factual access
Even strong instruction-tuned models like GPT-4o-mini exhibit surface-form sensitivity in factual knowledge retrieval

How to implement

Implement multi-variant entity testing in RAG system evaluation pipelines by generating alternative entity names to test knowledge base consistency before deployment
Build entity normalization preprocessing layers that identify and handle problematic abbreviations/aliases based on model-specific robustness patterns discovered through RedirectQA-style testing
Design factual QA training data augmentation strategies that oversample lexically diverse entity mentions to improve cross-surface consistency during fine-tuning
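
A RedirectQA-style consistency check can be prototyped in a few lines. The sketch below assumes a hypothetical `ask` callable that wraps whatever model is under test; here it is replaced by a stub with hard-coded answers so the example runs on its own. The question templates and the exact-match scoring are also assumptions, not the dataset's actual protocol.

```python
def is_correct(ask, template, surface, answer):
    """Ask the question with one surface form and compare by exact match."""
    reply = ask(template.format(entity=surface))
    return reply.strip().lower() == answer.lower()


def inconsistency_rate(ask, items):
    """Fraction of items where correctness differs between the two surface forms."""
    if not items:
        raise ValueError("items must not be empty")
    flipped = 0
    for item in items:
        canonical_ok = is_correct(ask, item["template"], item["canonical"], item["answer"])
        alias_ok = is_correct(ask, item["template"], item["alias"], item["answer"])
        flipped += canonical_ok != alias_ok
    return flipped / len(items)


# Stub model (assumption): knows the short names but misses one long alias.
STUB_ANSWERS = {
    "In which country was Pelé born?": "Brazil",
    "In which country was Edson Arantes do Nascimento born?": "I am not sure",
    "In which country is NASA headquartered?": "United States",
    "In which country is National Aeronautics and Space Administration headquartered?": "United States",
}


def stub_ask(question):
    return STUB_ANSWERS.get(question, "unknown")


ITEMS = [
    {"template": "In which country was {entity} born?",
     "canonical": "Pelé", "alias": "Edson Arantes do Nascimento", "answer": "Brazil"},
    {"template": "In which country is {entity} headquartered?",
     "canonical": "NASA", "alias": "National Aeronautics and Space Administration",
     "answer": "United States"},
]

rate = inconsistency_rate(stub_ask, ITEMS)
print(f"inconsistency rate: {rate:.2f}")
```

Swapping `stub_ask` for a real model client turns this into a pre-deployment regression check for a RAG pipeline.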

/ 05 · 8.0/10

Machine Behavior in Relational Moral Dilemmas: Moral Rightness, Predicted Human Behavior, and Model Decisions

Summary

This paper reveals a critical inconsistency in large language models: while they can accurately predict that humans weigh loyalty more heavily in close relationships when making moral decisions, the models themselves consistently prioritize fairness-based rules regardless of context. Using the Whistleblower's Dilemma across varying crime severity and relationship closeness, researchers tested six LLMs on three perspectives: moral rightness (what should be done), predicted human behavior (what people typically do), and the model's own decision. Models showed sophisticated understanding of human social dynamics but failed to incorporate this nuance into their own moral reasoning, creating a gap between internal world-modeling and decision-making that could lead to socially tone-deaf AI systems.

Key findings

LLMs exhibit systematic inconsistency between their predictions of human moral behavior and their own moral decisions
Models accurately predict that humans prioritize loyalty in close relationships but fail to incorporate this social sensitivity into their own judgments
Moral reasoning varies significantly based on evaluative perspective (prescriptive vs. descriptive vs. decisional), revealing non-monolithic machine morality
Models demonstrate context-dependent moral foundation weighting, with loyalty increasing and fairness decreasing as relational closeness grows, but only in behavioral predictions

How to implement

Build AI ethics advisors that explicitly separate 'what's right' from 'what people typically do' recommendations, helping users understand both normative standards and social expectations in workplace conflict resolution
Design customer service chatbots that adjust moral reasoning based on relationship context - applying stricter fairness principles for strangers but acknowledging loyalty considerations for long-term customers
Develop content moderation systems that account for relationship context when evaluating reported violations, distinguishing between stranger interactions and close friend dynamics to reduce over-moderation
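
The three-perspective probing design lends itself to a simple grid. The sketch below builds one prompt per combination of severity, closeness, and perspective, and computes the gap between a model's own decision and its prediction of human behavior. All prompt wording, the two-level severity and closeness scales, and the function names are hypothetical; the paper's actual stimuli are not reproduced here.

```python
from itertools import product

# Hypothetical wording for the three perspectives named in the summary.
PERSPECTIVES = {
    "moral_rightness": "What is the morally right thing to do?",
    "predicted_human": "What would most people actually do?",
    "model_decision": "What would you do?",
}
SEVERITIES = ("minor", "serious")          # assumed two-level scale
CLOSENESS = ("stranger", "close friend")   # assumed two-level scale


def build_prompts():
    """One prompt per (severity, closeness, perspective) cell."""
    prompts = {}
    for severity, closeness, (name, question) in product(
        SEVERITIES, CLOSENESS, PERSPECTIVES.items()
    ):
        scenario = (
            f"You witness a {closeness} commit a {severity} crime. "
            "You can report it or protect them."
        )
        prompts[(severity, closeness, name)] = f"{scenario} {question} Answer 'report' or 'protect'."
    return prompts


def prediction_decision_gap(report_rates):
    """Model's own report rate minus its predicted human report rate, per scenario.

    report_rates maps (severity, closeness, perspective) to the fraction of
    responses that chose 'report'.
    """
    return {
        (severity, closeness): report_rates[(severity, closeness, "model_decision")]
        - report_rates[(severity, closeness, "predicted_human")]
        for severity, closeness in product(SEVERITIES, CLOSENESS)
    }


prompts = build_prompts()
print(len(prompts), "prompts")
```

A persistently positive gap in the close-relationship cells would correspond to the pattern the paper describes: the model predicts that people protect those close to them, yet reports anyway.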

/ 06 · 8.0/10

Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models

Summary

This paper introduces Transient Turn Injection (TTI), a novel attack method that bypasses LLM safety mechanisms through stateless, multi-turn interactions. Unlike traditional jailbreaks that rely on persistent conversation context, TTI uses an automated attacker LLM to generate independent prompts that gradually elicit harmful information without triggering per-turn safety filters. Testing across 13 state-of-the-art models reveals significant vulnerabilities, with Gemini models showing 34-40% unsafe response rates and only Anthropic's Constitutional AI showing strong resistance. The work demonstrates that current stateless moderation approaches are insufficient against sophisticated adversarial strategies.

Key findings

TTI attacks consistently outperform traditional PAIR attacks across all tested models, with TTI achieving 2-10x higher success rates
Gemini model variants show the highest vulnerability to TTI with 34-40% unsafe response rates, while Anthropic Claude shows the strongest resistance at 2%
Medical and harmful content categories represent the most common vulnerability types across nearly all evaluated models
Per-turn safety filters fail against distributed adversarial intent, requiring session-level and context-aware defenses

How to implement

Implement session-level risk scoring systems that aggregate prompt similarity and escalation patterns across API calls to detect coordinated TTI attacks before they succeed
Deploy Constitutional AI-style training pipelines that embed ethical reasoning directly into model weights rather than relying solely on post-hoc safety classifiers
Build automated red-teaming frameworks that simulate TTI attacks against internal LLM deployments to identify vulnerabilities before production release
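
To make the session-level defense concrete, here is a minimal sketch of a risk accumulator keyed by client rather than by conversation. It assumes a per-turn risk score in [0, 1] already exists (for example from a moderation classifier, which is not shown), and the thresholds, decay factor, and class name are illustrative values rather than anything recommended by the paper.

```python
from collections import defaultdict


class SessionRiskMonitor:
    """Accumulate per-turn risk per client with exponential decay."""

    def __init__(self, per_turn_threshold=0.8, session_threshold=1.5, decay=0.9):
        self.per_turn_threshold = per_turn_threshold
        self.session_threshold = session_threshold
        self.decay = decay
        self._scores = defaultdict(float)

    def observe(self, client_id, turn_risk):
        """Record one request and return 'block_turn', 'flag_session', or 'allow'."""
        if not 0.0 <= turn_risk <= 1.0:
            raise ValueError("turn_risk must be in [0, 1]")
        # Older requests fade, recent ones dominate the running score.
        self._scores[client_id] = self._scores[client_id] * self.decay + turn_risk
        if turn_risk >= self.per_turn_threshold:
            return "block_turn"
        if self._scores[client_id] >= self.session_threshold:
            return "flag_session"
        return "allow"

    def score(self, client_id):
        return self._scores[client_id]


monitor = SessionRiskMonitor()
# Four moderately risky requests, each below the per-turn threshold.
decisions = [monitor.observe("client-a", 0.5) for _ in range(4)]
print(decisions)
```

Each request in the example passes a per-turn filter on its own, but the decayed sum crosses the session threshold on the fourth call, which is the kind of distributed intent that per-turn moderation misses.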

/ 07 · 8.0/10

TraceScope: Interactive URL Triage via Decoupled Checklist Adjudication

Summary

TraceScope presents a novel two-stage system for analyzing sophisticated phishing attacks that evade traditional detection. An operator agent interacts with suspicious websites in a sandboxed browser to bypass CAPTCHAs and interaction gates, recording all activity. A separate adjudicator agent then analyzes the recorded evidence using a MITRE ATT&CK checklist to render verdicts. The system achieved 0.94 precision and 0.78 recall on live URLs, significantly outperforming existing visual classifiers that rely on static snapshots. The key innovation lies in decoupling risky interaction from analysis, enabling safe investigation of evasive threats.

Key findings

25% of URLs appearing offline to standard crawlers are actually live but use anti-bot cloaking
Interactive phishing detection achieves 0.78 recall vs 0.25 for best static baseline on live URLs
Logo-less and non-credential phishing (crypto wallets, PII harvesting) represent major blind spots for existing detectors
Decoupled operator-adjudicator architecture with immutable evidence prevents observer effects while enabling forensic analysis

How to implement

Security operations centers can deploy TraceScope as second-stage analysis for URLs flagged by lightweight filters, automating the manual triage workflow that currently requires human analysts
Email security vendors can integrate the operator-adjudicator pattern to analyze suspicious attachments or embedded content, capturing dynamic behaviors missed by static sandboxes
Threat intelligence platforms can adapt the evidence recording and MITRE ATT&CK mapping components to build automated incident response reports from any interactive malware analysis
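
The operator-adjudicator split can be illustrated with an immutable evidence record and a checklist that only ever reads it. Everything specific in this sketch is an assumption: the event kinds, the three checklist items (which are placeholders, not real MITRE ATT&CK entries), the `min_hits` rule, and the example URL. The actual system uses LLM agents and a sandboxed browser, neither of which is modeled here.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Evidence:
    """Immutable record produced by the operator stage."""
    url: str
    events: tuple  # (kind, detail) string pairs in capture order


def operator_record(url, raw_events):
    """Stand-in for the sandboxed browsing run: freeze whatever was observed."""
    return Evidence(url=url, events=tuple((str(k), str(d)) for k, d in raw_events))


# Hypothetical checklist items; these are NOT real MITRE ATT&CK entries.
CHECKLIST = {
    "interaction_gate_present": lambda ev: any(k == "captcha" for k, _ in ev.events),
    "credential_field_present": lambda ev: any(
        k == "form_field" and "password" in d for k, d in ev.events
    ),
    "wallet_seed_requested": lambda ev: any(
        k == "form_field" and "seed phrase" in d for k, d in ev.events
    ),
}


def adjudicate(evidence, checklist=CHECKLIST, min_hits=2):
    """Evaluate every checklist item against the frozen evidence only."""
    hits = [name for name, check in checklist.items() if check(evidence)]
    verdict = "phishing" if len(hits) >= min_hits else "inconclusive"
    return verdict, hits


evidence = operator_record(
    "https://example.test/login",
    [("captcha", "solved"), ("form_field", "password input")],
)
verdict, hits = adjudicate(evidence)
print(verdict, hits)
```

Because the adjudicator never touches the live page, the same evidence can be re-adjudicated later with a revised checklist, which is what makes the record useful for forensics.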

/ 08 · 8.0/10

Compliance Moral Hazard and the Backfiring Mandate

Summary

This paper develops a mechanism design framework for competing firms to share risk signals truthfully in federated learning settings. The key innovation is Temporal Value Assignment (TVA), which uses strictly proper scoring rules to credit institutions for early accurate warnings about risky customers. The surprising main result is that mandatory information sharing without proper incentives can actually reduce welfare below autarky levels due to strategic underreporting. The framework addresses three strategic frictions: compliance moral hazard, adversarial adaptation, and information destruction through intervention. A network Shapley value characterization shows that coalition value depends on cross-border transaction volume rather than institution size.

Key findings

Information-sharing mandates can backfire: mandatory sharing without incentive alignment performs barely better than no sharing (56% vs 54% of first-best welfare), because strategic underreporting produces global models that are worse than honest local ones
TVA mechanism achieves 87% of first-best welfare by implementing truthful reporting as Bayes-Nash equilibrium through temporal discounting that creates first-mover advantage for early detection
Network Shapley value is proportional to weighted cross-border degree, meaning coalition design should prioritize high inter-institutional transaction volume over total asset size
Temporal discounting with γ close to 1 provides sublinear regret against adaptive adversaries and acts as a commitment device that deters strategic manipulation

How to implement

Build federated fraud detection across payment processors (Stripe, Square, PayPal) where each processor contributes local transaction graph embeddings and receives credits based on early accurate merchant risk predictions, with temporal discounting favoring faster flagging
Implement collaborative cybersecurity threat intelligence sharing among cloud providers where each contributes attack pattern features and earns reputation scores for early accurate threat warnings, preventing free-riding on others' detection investments
Deploy multi-platform abuse detection federation across social media companies where platforms share user behavior embeddings and receive regulatory compliance credits for timely identification of coordinated inauthentic behavior networks
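
The core incentive idea, a strictly proper scoring rule combined with temporal discounting, can be sketched without the rest of the mechanism. The example below uses the Brier (quadratic) score, which is a standard strictly proper rule, and multiplies it by γ raised to the reporting delay. This is an illustrative stand-in, not the paper's TVA formula; the value γ = 0.95 and the function names are assumptions.

```python
def brier_score(report, outcome):
    """Quadratic (Brier) score, higher is better; strictly proper for binary outcomes."""
    return 1.0 - (report - outcome) ** 2


def discounted_credit(report, outcome, delay, gamma=0.95):
    """Credit for a risk report made `delay` periods after the signal was available."""
    if not 0.0 <= report <= 1.0:
        raise ValueError("report must be a probability")
    if delay < 0 or not 0.0 < gamma <= 1.0:
        raise ValueError("delay must be >= 0 and gamma in (0, 1]")
    return (gamma ** delay) * brier_score(report, outcome)


def expected_credit(report, true_prob, delay, gamma=0.95):
    """Expected credit when the customer is truly risky with probability true_prob."""
    return (
        true_prob * discounted_credit(report, 1, delay, gamma)
        + (1.0 - true_prob) * discounted_credit(report, 0, delay, gamma)
    )


# With a true risk of 0.3, search a grid of possible reports for the best one.
grid = [i / 100 for i in range(101)]
best_report = max(grid, key=lambda r: expected_credit(r, 0.3, delay=2))
print(best_report)
```

Two properties carry the incentive argument in this toy version: for any fixed delay the expected credit peaks at the truthful report, and for any fixed report an earlier submission earns more.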

/ 09 · 8.0/10

Fine-Tuning Regimes Define Distinct Continual Learning Problems

Summary

This paper reveals that continual learning method comparisons are unreliable because they depend heavily on which network parameters remain trainable during sequential task learning. The authors show that changing only the 'fine-tuning regime' (which layers can be updated) can completely reverse the relative rankings of standard continual learning algorithms like EWC, LwF, and GEM. They formalize this as constrained optimization over parameter subspaces and demonstrate across five datasets that deeper adaptation regimes produce larger gradient updates, more forgetting, and stronger coupling between the two. This challenges the validity of existing comparative studies and suggests continual learning benchmarks need regime-aware evaluation protocols.

Key findings

Method rankings in continual learning are not preserved across different fine-tuning regimes - the same algorithms can rank completely differently based solely on trainable depth
Deeper adaptation regimes (more trainable parameters) consistently produce larger gradient magnitudes and higher forgetting rates
The relationship between gradient magnitude and forgetting becomes stronger in more permissive training regimes
Current continual learning benchmarks may be drawing regime-specific conclusions that don't generalize across deployment conditions

How to implement

Redesign continual learning benchmarks for production ML systems by testing across multiple fine-tuning regimes before selecting algorithms for sequential model updates in recommendation systems or fraud detection
Implement regime-aware hyperparameter tuning in MLOps pipelines that need to adapt models to new data streams, testing both shallow and deep adaptation strategies
Build more robust continual learning evaluation frameworks for edge AI systems where computational constraints require different trainable depth configurations across deployment environments
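
A regime-aware comparison needs two small pieces: a way to define which layers are trainable, and a check for whether method rankings survive a change of regime. The sketch below shows both in plain Python. The layer names and the accuracy numbers are invented placeholders used only to exercise the functions; they are not results from the paper.

```python
def trainable_layers(layer_names, depth):
    """Unfreeze only the last `depth` layers; everything earlier stays frozen."""
    if not 0 <= depth <= len(layer_names):
        raise ValueError("depth must be between 0 and the number of layers")
    return layer_names[len(layer_names) - depth:]


def ranking(scores):
    """Method names ordered best to worst (higher score is better)."""
    return sorted(scores, key=scores.get, reverse=True)


def rankings_preserved(scores_by_regime):
    """True if every regime orders the methods identically."""
    orders = [ranking(scores) for scores in scores_by_regime.values()]
    return all(order == orders[0] for order in orders)


LAYERS = ["conv1", "conv2", "conv3", "fc"]  # hypothetical network

# Placeholder numbers for illustration only, NOT results from the paper.
TOY_SCORES = {
    "head_only": {"EWC": 0.71, "LwF": 0.68, "GEM": 0.65},
    "full_network": {"EWC": 0.58, "LwF": 0.61, "GEM": 0.66},
}

for regime, depth in (("head_only", 1), ("full_network", len(LAYERS))):
    print(regime, trainable_layers(LAYERS, depth), ranking(TOY_SCORES[regime]))
print("rankings preserved:", rankings_preserved(TOY_SCORES))
```

Running a check like `rankings_preserved` over several trainable depths before choosing an algorithm is one way to avoid drawing a regime-specific conclusion.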

/ 10 · 7.5/10

When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

Summary

This paper addresses hallucinations in Large Vision-Language Models (LVLMs) by introducing HalluScope, a benchmark that separates hallucination causes into visual perception failures, learned object co-occurrence biases, and textual instruction presuppositions. The key finding is that modern LVLMs hallucinate primarily due to over-reliance on textual priors rather than visual perception limitations. The authors propose HalluVL-DPO, a preference optimization framework using synthetic training data to fine-tune models toward more visually grounded responses. Results show substantial improvements on targeted hallucination tasks while preserving general multimodal capabilities.

Key findings

Modern LVLM hallucinations stem primarily from textual instruction presuppositions (35-85% performance drop) rather than visual backbone failures
Adversarial objects (contextually plausible but absent) cause 8-37% accuracy drops due to learned co-occurrence biases
Sample-weighted preference optimization using synthetic data effectively reduces prompt-induced hallucinations while maintaining general performance
Cross-model transferability of synthetic preference data enables scalable hallucination mitigation across different LVLM architectures

How to implement

Build automated quality control systems for medical imaging reports by fine-tuning LVLMs with HalluVL-DPO to reduce false positive findings when radiologists' prompts contain implicit assumptions about pathology presence
Develop more reliable autonomous vehicle perception systems by applying the diagnostic framework to identify when navigation instructions override visual evidence about road conditions or obstacles
Create robust content moderation tools for social media platforms by training models to resist generating false claims about image content when user-submitted captions contain misleading presuppositions
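
For orientation, a sample-weighted preference loss can be written down in a few lines. The sketch below implements the standard DPO objective with a per-example weight and a weighted mean; how HalluVL-DPO actually derives or applies its weights is not specified in this summary, so the weighting scheme, the β value, and the dictionary keys are assumptions. Inputs are sequence log-probabilities that would come from the policy and a frozen reference model.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def weighted_dpo_loss(batch, beta=0.1):
    """Weighted mean of the standard DPO loss over a batch of preference pairs.

    Each example holds log-probabilities of the chosen (visually grounded) and
    rejected (hallucinated) responses under the policy and reference models,
    plus a non-negative sample weight.
    """
    total_weight = sum(ex["weight"] for ex in batch)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    loss = 0.0
    for ex in batch:
        chosen_gain = ex["policy_chosen"] - ex["ref_chosen"]
        rejected_gain = ex["policy_rejected"] - ex["ref_rejected"]
        margin = beta * (chosen_gain - rejected_gain)
        loss += ex["weight"] * -log_sigmoid(margin)
    return loss / total_weight


# A policy identical to the reference gives zero margin on every pair.
neutral = [{"policy_chosen": -5.0, "ref_chosen": -5.0,
            "policy_rejected": -6.0, "ref_rejected": -6.0, "weight": 2.0}]
# A policy that has shifted probability toward the grounded response.
improved = [{"policy_chosen": -3.0, "ref_chosen": -5.0,
             "policy_rejected": -9.0, "ref_rejected": -6.0, "weight": 2.0}]

print(weighted_dpo_loss(neutral), weighted_dpo_loss(improved))
```

The loss starts at log 2 when the policy matches the reference and falls as the policy prefers the grounded response more strongly than the reference does; larger weights make a pair count more toward that average.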
