ISSUE 006

AI Research Weekly – autonomous driving & more – May 10, 2026

/ 01 · 8.0/10

The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models

This research discovers that large language models internally organize social roles along a single dominant dimension: granularity, ranging from micro-level individual perspectives to macro-level institutional reasoning. The authors construct a 'Granularity Axis' using contrast-based methods on 75 ordered social roles, finding it aligns with the primary component of role representation space and accounts for over half the variance. Crucially, they demonstrate this axis is causally manipulable through activation steering, allowing real-time control over whether an LLM responds from an individual, community, organizational, or institutional perspective. The findings reveal that role-conditioned behavior operates along a continuous social-scale manifold rather than discrete personas.

  • Social role granularity is the dominant geometric axis in LLM role representation space, accounting for 52.6% of variance in Qwen3-8B
  • A contrast-based Granularity Axis constructed from micro/macro endpoints successfully predicts intermediate granularity levels with monotonic ordering
  • The axis transfers across model families (Qwen3-8B and Llama-3.1-8B) and remains stable across layers and prompt variations
  • Activation steering along this axis causally shifts output granularity, enabling real-time control of social perspective scale
  • Models differ in steering responsiveness based on their default operating regime, with some showing ceiling effects
  • Build multi-agent debate systems that actively monitor and prevent perspective collapse by measuring agent positions on the granularity axis during conversations
  • Develop customer service chatbots that dynamically adjust social perspective based on query type - individual support mode for personal issues, institutional mode for policy questions
  • Create policy simulation platforms that apply granularity steering to ensure different stakeholder agents (citizens, organizations, governments) maintain appropriate perspective scales throughout scenarios
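
As a rough illustration of the contrast-based construction and steering described above, the sketch below builds an axis from the difference between mean macro-role and mean micro-role activations, projects an intermediate role onto it, and shifts a hidden state along it. The toy 3-dimensional vectors, the function names, and the steering strength `alpha` are assumptions made for illustration; the paper's actual layers, normalization, and role set are not reproduced here.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def granularity_axis(micro_acts, macro_acts):
    """Unit vector pointing from the micro-role mean to the macro-role mean."""
    micro_mean = mean_vector(micro_acts)
    macro_mean = mean_vector(macro_acts)
    diff = [b - a for a, b in zip(micro_mean, macro_mean)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def project(activation, axis):
    """Scalar position of an activation along the axis."""
    return sum(a * x for a, x in zip(activation, axis))

def steer(hidden, axis, alpha):
    """Shift a hidden state by alpha units along the axis."""
    return [h + alpha * x for h, x in zip(hidden, axis)]

# Toy activations (hypothetical values, not taken from any model).
micro = [[0.0, 1.0, 0.2], [0.2, 0.8, 0.0]]   # individual-level roles
macro = [[2.0, 1.0, 0.1], [2.2, 0.9, 0.1]]   # institution-level roles
axis = granularity_axis(micro, macro)

community = [1.0, 0.9, 0.1]                   # an intermediate role
scores = [project(v, axis) for v in (micro[0], community, macro[0])]
steered = steer(community, axis, alpha=1.0)
```

In this toy setup the intermediate role projects between the two endpoints, and because the axis has unit length, steering with `alpha=1.0` raises the projection by exactly one unit.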
Read paper
/ 02 · 8.0/10

AI Co-Mathematician: Accelerating Mathematicians with Agentic AI

The AI Co-Mathematician introduces a stateful workspace where multiple AI agents collaborate asynchronously on mathematical research tasks. Unlike traditional chat interfaces, it maintains persistent project state, manages uncertainty through review cycles, and produces native LaTeX documents with margin annotations. The system delegates work across specialized agents (literature review, computation, theorem proving) while allowing human steering. Early users solved open problems including a Kourovka Notebook question. The system scored 48% on FrontierMath Tier 4, surpassing previous AI systems through orchestration rather than raw model capability improvements.

  • Stateful multi-agent orchestration significantly outperforms single-model approaches, achieving 48% accuracy on FrontierMath Tier 4 versus 19% for base Gemini 3.1 Pro
  • Interactive human steering is crucial - early users successfully resolved open mathematical problems through iterative collaboration with the AI system
  • Native mathematical artifacts (LaTeX with margin annotations) enable better uncertainty communication and workflow integration than transient chat outputs
  • Asynchronous parallel workstreams with hard programmatic constraints prevent common AI failure modes like hallucinated proofs and premature success claims
  • Build a stateful research assistant for legal teams that maintains case files, delegates document review to specialized agents, and produces annotated briefs with uncertainty markers for complex litigation workflows
  • Create an engineering design workspace where AI agents handle parallel workstreams for requirements analysis, literature review, and prototype validation while maintaining persistent project state and review cycles
  • Develop a scientific research platform that orchestrates AI agents for hypothesis generation, literature synthesis, and experimental design while tracking failed approaches and enabling human steering of long-running investigations
Read paper
/ 03 · 7.8/10

Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration

This paper introduces LOPE (Lorem Perturbation for Exploration), which prepends randomly generated Latin placeholder text to prompts during reinforcement learning training. When all generated responses fail for a difficult question (zero-advantage problem), LOPE adds Lorem Ipsum sequences to shift the model's output distribution and discover new reasoning pathways. Experiments show consistent improvements across 1.7B-7B parameter models on math reasoning tasks. The key insight is that task-irrelevant, low-perplexity perturbations can break models out of local reasoning patterns without corrupting comprehension, outperforming both standard resampling and high-temperature generation approaches.

  • Lorem Ipsum perturbations unlock orthogonal reasoning pathways that standard logit-space exploration (temperature sampling) cannot reach
  • Effective perturbations require pseudo-Latin vocabulary and low perplexity to avoid interfering with English reasoning while maintaining response quality
  • LOPE maintains higher question-level success rates throughout training compared to naive resampling, improving data utilization efficiency
  • Training signal shaping techniques are necessary to handle off-policy optimization when mixing perturbed and original responses
  • Integrate LOPE into existing RL training pipelines for code generation models to improve success rates on hard programming problems by prepending Latin sequences when all initial solutions fail
  • Apply prompt perturbation during fine-tuning of customer service chatbots to discover diverse response strategies for difficult queries, using controlled noise to break out of repetitive answer patterns
  • Implement LOPE in mathematical reasoning systems like homework helpers or tutoring applications to generate alternative solution approaches when standard methods fail, improving problem-solving coverage
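
The trigger rule described above (perturb only when every sampled response for a question fails) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the word list, the prefix length of 12 words, and the function names are assumptions, and the training-signal shaping needed for the resulting off-policy samples is left out.

```python
import random

# Small pseudo-Latin vocabulary (an illustrative subset, not the paper's list).
LOREM_WORDS = [
    "lorem", "ipsum", "dolor", "sit", "amet", "consectetur",
    "adipiscing", "elit", "sed", "do", "eiusmod", "tempor",
]

def lorem_prefix(num_words, rng):
    """Return a random pseudo-Latin sentence of num_words words."""
    words = [rng.choice(LOREM_WORDS) for _ in range(num_words)]
    return " ".join(words).capitalize() + "."

def all_failed(rewards):
    """True when every sampled response scored zero (the zero-advantage case)."""
    return len(rewards) > 0 and all(r == 0 for r in rewards)

def maybe_perturb(prompt, rewards, rng, num_words=12):
    """Prepend a Lorem Ipsum prefix only when the whole group failed."""
    if all_failed(rewards):
        return lorem_prefix(num_words, rng) + "\n\n" + prompt
    return prompt

rng = random.Random(0)
question = "What is the sum of the first 100 positive integers?"
unchanged = maybe_perturb(question, rewards=[0, 1, 0, 0], rng=rng)
perturbed = maybe_perturb(question, rewards=[0, 0, 0, 0], rng=rng)
```

Questions with at least one successful rollout keep their original prompt, so the perturbation only spends extra samples where the standard group would have produced no learning signal.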
Read paper
/ 04 · 7.5/10

TIDE: Every Layer Knows the Token Beneath the Context

TIDE introduces a parallel memory system that maintains token identity information throughout all transformer layers, addressing two critical problems: rare tokens receiving insufficient training updates due to frequency imbalances, and semantically distinct tokens becoming indistinguishable in similar contexts. The architecture adds K independent memory blocks that map token indices to semantic vectors, injected into each layer via learned routers. This provides rare tokens with amplified gradient signals and prevents contextual collapse by maintaining discrete token-specific information independent of contextual hidden states. Experiments show consistent improvements across model scales and downstream tasks.

  • Rare tokens receive orders of magnitude fewer gradient updates than common tokens, leading to systematically undertrained embeddings
  • Contextual collapse causes semantically distinct tokens to become indistinguishable in similar syntactic environments due to FFN Lipschitz constraints
  • TIDE's K-pathway gradient amplification provides rare tokens with K-fold increase in training signal compared to standard transformers
  • Memory blocks learn to specialize for different frequency regimes, with routers adaptively weighting memory contributions based on token rarity
  • Implement TIDE architecture in domain-specific language models handling technical terminology or rare entities (medical, legal, scientific) to improve rare term understanding and generation accuracy
  • Deploy TIDE in multilingual models where low-resource languages suffer from rare token problems, using memory blocks to maintain language-specific semantic representations across all layers
  • Integrate TIDE into code generation models to better handle rare API names, variable identifiers, and domain-specific tokens that appear infrequently in training data
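
One way to picture the injection step summarized above is a router that mixes K token-indexed memory vectors into a layer's hidden state. The sketch below is an assumption-heavy toy: the softmax gating, the linear router, the additive injection, and all names are illustrative choices, since the summary only states that K memory blocks map token indices to vectors and are injected via learned routers.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def inject_token_memory(hidden, token_id, memory_blocks, router_weights):
    """Add a router-weighted mix of K token-indexed memory vectors to a hidden state.

    memory_blocks: K tables, each mapping token_id -> vector (same width as hidden).
    router_weights: K weight vectors; gate logit k is dot(router_weights[k], hidden).
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    gates = softmax(logits)
    out = list(hidden)
    for gate, block in zip(gates, memory_blocks):
        vec = block[token_id]
        out = [o + gate * v for o, v in zip(out, vec)]
    return out, gates

# Toy setup: K = 2 blocks, hidden width 2, vocabulary of 3 tokens (hypothetical values).
blocks = [
    {0: [0.0, 0.0], 1: [2.0, 0.0], 2: [1.0, 1.0]},
    {0: [0.0, 0.0], 1: [0.0, 4.0], 2: [1.0, 1.0]},
]
router = [[0.0, 0.0], [0.0, 0.0]]   # untrained router: equal gates
mixed, gates = inject_token_memory([1.0, 2.0], token_id=1,
                                   memory_blocks=blocks, router_weights=router)
```

Because the lookup depends only on the token index, the injected vector stays the same for a given token regardless of context, which is the property the summary credits with preventing contextual collapse.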
Read paper
/ 05 · 7.5/10

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

UniPool fundamentally redesigns Mixture-of-Experts (MoE) by replacing the standard approach of giving each layer its own private expert set with a single global pool of experts shared across all layers. This addresses redundancy in deep MoE layers where experts learn similar transformations. The approach uses pool-level load balancing and NormRouter for stable training. Across five model scales (182M-978M parameters), UniPool consistently outperforms vanilla MoE while using only 41.6-66.7% of the expert parameters, enabling sublinear expert scaling with model depth.

  • Deep MoE layers exhibit substantial expert redundancy - randomizing routing in single deep layers drops accuracy by only 1.0-1.6 points
  • Global expert sharing consistently improves validation loss across all tested scales (182M-978M parameters)
  • Pool size becomes an explicit scaling hyperparameter - expert parameters can grow sublinearly with depth while maintaining performance
  • Shared pool requires co-designed components: pool-level auxiliary loss and NormRouter for stable training
  • Deploy larger language models with reduced memory footprint by using UniPool's sublinear expert scaling to maintain quality with fewer expert parameters
  • Optimize existing MoE model architectures by replacing per-layer expert sets with shared pools, potentially cutting expert parameter count by roughly 33-58% (UniPool uses 41.6-66.7% of the vanilla expert parameters)
  • Build deeper transformer models more efficiently by allocating expert capacity globally rather than linearly with depth, enabling better depth-width tradeoffs
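
To make the parameter accounting concrete, the sketch below compares a per-layer MoE with a single shared pool and shows two layers routing into the same pool. The layer count, expert counts, router scores, and function names are hypothetical, and the calculation assumes all experts have the same size; pool-level load balancing and NormRouter are not modeled.

```python
def expert_param_fraction(num_layers, experts_per_layer, pool_size):
    """Shared-pool expert count as a fraction of the per-layer MoE expert count."""
    vanilla_experts = num_layers * experts_per_layer
    return pool_size / vanilla_experts

def top_k_experts(router_scores, k):
    """Indices of the k highest-scoring experts in the shared pool."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical configuration: 12 layers with 8 private experts each, versus one pool of 64.
fraction = expert_param_fraction(num_layers=12, experts_per_layer=8, pool_size=64)

# Two layers score the same 4-expert pool differently for the same token.
layer_scores = {
    "layer_3": [0.1, 0.9, 0.3, 0.7],
    "layer_9": [0.8, 0.2, 0.6, 0.1],
}
selected = {name: top_k_experts(s, k=2) for name, s in layer_scores.items()}
```

With a shared pool, depth and expert count are decoupled: adding layers adds routers but not experts, so `pool_size` becomes the knob that sets total expert parameters.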
Read paper
/ 06 · 7.2/10

Continuous Latent Diffusion Language Model

Cola DLM proposes a hierarchical approach to language modeling that first maps text to continuous latent variables via a VAE, then models semantic structure using a block-causal diffusion transformer, and finally generates text through conditional decoding. This separates global semantic organization from local text realization, enabling non-autoregressive generation and natural multimodal extension. The key insight is using diffusion for latent prior transport rather than token-level recovery. Experiments show competitive scaling behavior compared to autoregressive baselines, though with notable likelihood-generation quality misalignment.

  • Diffusion-based latent prior transport outperforms token-level observation recovery for text generation
  • Generation quality and likelihood estimation can be structurally misaligned in continuous latent language models
  • Joint training of VAE and diffusion transformer from stable initialization achieves better scaling than fixed or scratch training
  • The approach naturally extends to unified text-image modeling through shared continuous latent space
  • Build a document editing system that generates coherent long-form content by modeling global document structure in latent space before generating local text segments
  • Develop a multimodal chatbot that processes both text queries and images by mapping both modalities to a shared continuous latent space for unified reasoning
  • Create a content generation pipeline for marketing materials that first plans semantic structure (tone, key points) in latent space, then realizes text conditioned on visual brand elements
Read paper
/ 07 · 7.2/10

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

This paper introduces SCALELOGIC, a synthetic logical reasoning framework that enables precise control over reasoning depth and logical expressiveness for studying RL training efficiency. The key discovery is that training compute follows a power law T ∝ D^γ with reasoning depth, where the scaling exponent γ increases monotonically from 1.04 to 2.60 as logical expressiveness grows from simple implication-only to full first-order reasoning with quantification. More expressive training settings yield larger downstream performance gains (up to +10.66 points) on real mathematical reasoning benchmarks, demonstrating that what models train on, not just training volume, shapes transfer performance. Curriculum-based training substantially improves scaling efficiency across all settings.

  • RL training cost follows power law T ∝ D^γ with reasoning depth, where scaling exponent γ increases monotonically with logical expressiveness (1.04 to 2.60)
  • More expressive training logic yields larger downstream performance gains on mathematical reasoning benchmarks (up to +10.66 percentage points)
  • Curriculum training substantially improves scaling efficiency, reducing power-law exponent from 2.60 to 2.30 in most expressive setting
  • Power-law scaling relationship holds across multiple RL algorithms (DAPO, GRPO, GSPO), indicating broad applicability
  • Build RL training curricula for mathematical reasoning models by starting with simple implication-only problems and progressively introducing conjunction, negation, and quantification to maximize compute efficiency
  • Design synthetic data generation pipelines for code reasoning assistants using SCALELOGIC's controlled complexity framework to systematically scale training difficulty while maintaining verifiable rewards
  • Optimize resource allocation for reasoning model training by using the discovered power-law relationships to predict training compute requirements based on target reasoning depth and logical complexity
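
The practical meaning of the reported exponents is easiest to see with a little arithmetic: under T ∝ D^γ, doubling reasoning depth multiplies training compute by 2^γ. The sketch below computes that multiplier for the exponents quoted above and shows how an exponent could be estimated from (depth, cost) pairs with a log-log least-squares fit. The function names and the synthetic data are assumptions for illustration, not the paper's measurements.

```python
import math

def compute_multiplier(depth_ratio, gamma):
    """Factor by which training compute grows when depth grows by depth_ratio."""
    return depth_ratio ** gamma

def fit_gamma(depths, costs):
    """Least-squares slope of log(cost) against log(depth)."""
    xs = [math.log(d) for d in depths]
    ys = [math.log(c) for c in costs]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

for gamma in (1.04, 2.30, 2.60):
    factor = compute_multiplier(2, gamma)
    print(f"gamma={gamma:.2f}: doubling depth multiplies compute by {factor:.2f}")

# Synthetic check: costs generated from an exact power law recover the exponent.
depths = [2, 4, 8, 16]
costs = [3.0 * d ** 2.6 for d in depths]
gamma_hat = fit_gamma(depths, costs)
```

Doubling depth costs about 2.06× at γ = 1.04 but about 6.06× at γ = 2.60, and the curriculum setting's γ = 2.30 brings that down to about 4.92×.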
Read paper
/ 08 · 7.2/10

A²TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

A²TGPO improves reinforcement learning for multi-turn AI agents by solving three key problems in credit assignment. Instead of pooling rewards across all turns, it groups rewards by turn position, recognizing that agents at the same interaction depth face similar contexts. It rescales accumulated rewards to prevent later turns from dominating gradients, and adapts clipping ranges based on how informative each turn is. Testing on question-answering tasks with tool use shows consistent improvements of roughly 1.7 points over existing methods across different model sizes.

  • Turn-group normalization by (prompt, turn-index) eliminates cross-position incomparability in multi-turn agent training
  • Variance-rescaled discounted accumulation keeps advantage magnitudes comparable across different trajectory depths
  • Adaptive turn-level clipping based on information gain allows more aggressive updates for informative turns
  • Method achieves +1.75 average improvement on multi-hop tasks and +1.69 on single-hop tasks across three model backbones
  • Train customer service chatbots that use multiple tools (CRM lookup, knowledge base search, ticket creation) by properly crediting which tool calls actually help resolve issues versus just extending conversations
  • Improve code generation agents that iteratively run tests, read documentation, and refactor code by better rewarding the specific debugging steps that lead to working solutions
  • Optimize web navigation agents for e-commerce or form-filling tasks by distinguishing between productive clicks that advance toward goals versus exploratory actions that gather context
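
The turn-group normalization idea can be sketched as a group-wise standardization keyed on (prompt, turn index). This is a simplified illustration under assumptions: the record layout, the use of population standard deviation, and the `eps` constant are choices made here, and the variance-rescaled discounted accumulation and adaptive clipping are not shown.

```python
from collections import defaultdict
from statistics import mean, pstdev

def turn_group_advantages(records, eps=1e-8):
    """Normalize each reward against others sharing its (prompt, turn index).

    records: list of dicts with keys 'prompt', 'turn', 'reward'.
    Returns one advantage per record, in input order.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["prompt"], rec["turn"])].append(rec["reward"])

    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}

    advantages = []
    for rec in records:
        mu, sigma = stats[(rec["prompt"], rec["turn"])]
        advantages.append((rec["reward"] - mu) / (sigma + eps))
    return advantages

# Three rollouts of one prompt; turn-1 rewards are on a much larger scale than turn-0.
records = [
    {"prompt": "q1", "turn": 0, "reward": 0.2},
    {"prompt": "q1", "turn": 1, "reward": 2.0},
    {"prompt": "q1", "turn": 0, "reward": 0.4},
    {"prompt": "q1", "turn": 1, "reward": 3.0},
    {"prompt": "q1", "turn": 0, "reward": 0.6},
    {"prompt": "q1", "turn": 1, "reward": 4.0},
]
advantages = turn_group_advantages(records)
```

After normalization the turn-0 and turn-1 advantages have the same spread even though the raw turn-1 rewards are ten times larger, which is the cross-position comparability the method is after.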
Read paper
/ 09 · 7.0/10

Continuous-Time Distribution Matching for Few-Step Diffusion Distillation

This paper introduces Continuous-Time Distribution Matching (CDM), which improves few-step diffusion model generation by replacing fixed discrete timestep training with dynamic continuous scheduling. Instead of training only at predetermined timesteps that match inference, CDM uses randomly sampled continuous timesteps and introduces an off-trajectory supervision mechanism. This mechanism uses velocity-driven extrapolation to create intermediate points between sampling steps and enforces distribution matching on these off-trajectory latents. The approach achieves state-of-the-art 4-step generation quality without requiring GANs or reward models, demonstrating that the common assumption of strict training-inference alignment is unnecessarily restrictive.

  • Training-inference timestep alignment is not necessary and actually restricts performance - dynamic continuous scheduling outperforms fixed discrete schedules
  • Distribution matching loss captures the teacher's CFG-free distribution rather than merely acting as a training stabilizer
  • Velocity-driven off-trajectory supervision effectively corrects numerical integration errors that occur during few-step inference
  • CDM achieves state-of-the-art 4-step generation quality without requiring complex auxiliary objectives like GANs or reward models
  • Integrate CDM into existing text-to-image APIs to reduce inference latency from 50-100 steps to 4 steps while maintaining quality, enabling real-time image generation for interactive applications
  • Apply continuous-time scheduling to custom diffusion model training pipelines for domain-specific applications like medical imaging or product visualization where speed is critical
  • Implement velocity-driven extrapolation in existing diffusion distillation frameworks to improve quality of few-step video generation models for content creation platforms
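
One plausible reading of "velocity-driven extrapolation" is a single Euler step along the predicted velocity from a sampling step to a randomly drawn continuous time between steps. The sketch below illustrates only that reading: the uniform sampling, the time convention (1.0 is noise, 0.0 is data), the 4-step grid, and all names are assumptions, and the distribution matching loss applied to the resulting latent is omitted.

```python
import random

def extrapolate(latent, velocity, t_from, t_to):
    """Euler step: move the latent from t_from to t_to along the given velocity."""
    dt = t_to - t_from
    return [x + dt * v for x, v in zip(latent, velocity)]

def sample_intermediate_time(t_start, t_end, rng):
    """Draw a continuous timestep uniformly between two sampling steps."""
    lo, hi = min(t_start, t_end), max(t_start, t_end)
    return rng.uniform(lo, hi)

rng = random.Random(0)
schedule = [1.0, 0.75, 0.5, 0.25, 0.0]   # hypothetical 4-step inference grid
latent = [0.5, -1.0, 2.0]                # toy latent at t = 1.0
velocity = [1.0, 0.0, -2.0]              # toy velocity prediction at t = 1.0

# Pick a time strictly inside the first step interval and extrapolate to it.
t_mid = sample_intermediate_time(schedule[0], schedule[1], rng)
off_trajectory = extrapolate(latent, velocity, schedule[0], t_mid)
```

Supervising latents at times like `t_mid`, which the 4-step sampler never visits, is what distinguishes this from training only on the fixed inference grid.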
Read paper
/ 10 · 6.8/10

SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation

SwiftI2V tackles the computational challenge of generating high-resolution (2K) videos from images by splitting the process into two stages: low-resolution motion generation followed by high-resolution detail synthesis. The key innovation is Conditional Segment-wise Generation (CSG), which processes videos in small temporal segments with bidirectional attention between segments, avoiding memory explosion while maintaining quality. This achieves 202× speedup over end-to-end approaches while matching quality, enabling 2K video generation on consumer GPUs like RTX 4090.

  • Two-stage decoupling of motion modeling and detail synthesis enables 202× GPU-time reduction compared to end-to-end 2K video generation
  • Conditional Segment-wise Generation with bidirectional attention maintains quality while processing videos in bounded memory segments
  • Stage transition training with synthesized artifacts bridges the gap between separately trained cascade stages
  • Method achieves competitive VBench-I2V scores while reducing peak memory to 33.5GB, enabling consumer GPU deployment
  • Integrate SwiftI2V into content creation platforms to offer real-time 2K video generation from user-uploaded images without requiring expensive GPU clusters
  • Deploy segmented generation approach in mobile video editing apps, processing long-form content in streaming fashion to avoid memory constraints
  • Adapt bidirectional segment attention mechanism for other high-resolution generative tasks like panoramic image synthesis or long document generation
Read paper
/ 11 · HIDDEN GEM · 8.5/10

XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity

XL-SafetyBench introduces the first benchmark separating country-specific LLM safety into two dimensions: adversarial robustness against localized attacks (jailbreak benchmark) and cultural sensitivity awareness (cultural benchmark). Testing 37 models across 10 countries reveals that safety and cultural awareness aren't coupled in frontier models, and that local models often appear safe due to comprehension failures rather than genuine alignment. The benchmark uses native-speaker validation and country-grounded content rather than translations, providing more accurate safety assessment for global AI deployment.

  • Safety alignment and cultural awareness are uncoupled capabilities that should be evaluated separately
  • Local models exhibit strong ASR-NSR trade-off (r=-0.81), indicating apparent safety stems from comprehension failure
  • Country-grounded attacks reveal vulnerabilities missed by translation-based benchmarks
  • Cultural sensitivity requires implicit detection within natural tasks, not just explicit cultural knowledge
  • Build automated safety monitoring systems that separately track jailbreak resistance and cultural appropriateness for customer-facing chatbots deployed across multiple countries
  • Create model selection frameworks for global enterprises that weight safety vs cultural competence based on deployment regions and use cases
  • Develop training pipelines that specifically target cultural sensitivity deficits without compromising adversarial robustness for multilingual AI assistants
Read paper
/ 12 · HIDDEN GEM · 8.2/10

KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning

KinDER introduces a comprehensive benchmark for robot physical reasoning with 25 procedurally-generated environments targeting five core challenges: spatial relations, multi-object manipulation, tool use, geometric constraints, and dynamic constraints. The benchmark includes both 2D and 3D environments using object-centric states, enabling fair comparison across task planning, reinforcement learning, imitation learning, and foundation model approaches. Evaluation of 13 baselines reveals that bilevel planning achieves highest success rates (57%) but foundation models show surprising capabilities on complex tasks, while standard RL methods struggle with sparse rewards. The benchmark provides standardized metrics, teleoperation interfaces, and demonstration datasets to accelerate physical reasoning research.

  • Bilevel planning with engineered skills achieves highest overall success rate (57%) but requires significant engineering effort per environment
  • Foundation models (VLA) surprisingly outperform classical methods on complex tool-use tasks like DynPushPullHook2D despite different training domains
  • Vision-language models cannot effectively leverage visual information beyond object-centric states in physical reasoning tasks
  • Standard reinforcement learning methods (PPO, SAC) fail on most environments due to sparse rewards and long horizons
  • Imitation learning methods show reasonable performance and generalization to out-of-distribution object counts
  • Use KinDER environments to systematically evaluate and benchmark new robot learning algorithms before deploying on expensive physical hardware, focusing on specific physical reasoning capabilities like tool use or packing
  • Fine-tune vision-language-action models on KinDER demonstration datasets to improve physical reasoning capabilities for warehouse automation tasks involving object manipulation and tool use
  • Integrate KinDER's object-centric state representation and procedural generation framework into existing robot simulation pipelines to create diverse training scenarios for manipulation policies
Read paper
/ 13 · HIDDEN GEM · 8.0/10

EMO: Pretraining Mixture of Experts for Emergent Modularity

EMO introduces a novel training approach for Mixture-of-Experts models that enforces modularity by constraining all tokens within a document to select experts from a shared pool. This simple modification during pretraining causes expert subsets to specialize along semantic domains (math, code, health) rather than syntactic patterns. The resulting model can deploy only 25% of experts for domain-specific tasks with just 1% performance drop, compared to 10%+ degradation in standard MoEs. EMO matches full-model performance while enabling memory-efficient deployment and fine-grained capability control.

  • Document-level expert pooling during training induces semantic specialization without human-defined domain labels
  • Expert subsets retain 97-99% performance when using only 12.5-25% of total experts for domain-specific tasks
  • EMO experts specialize at semantic levels (domains like math/code) versus syntactic patterns (prepositions/punctuation) in standard MoEs
  • Global load balancing is critical for stable training when combining document constraints with expert utilization objectives
  • Deploy lightweight domain-specific models by extracting code-specialized expert subsets for IDE autocomplete, reducing memory footprint by 75% while maintaining performance
  • Build modular chatbot systems where different expert subsets handle medical queries, legal advice, and creative writing, enabling targeted capability updates without full model retraining
  • Implement content filtering by selectively disabling expert clusters associated with adult content or misinformation while preserving other capabilities for child-safe applications
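
The document-level constraint can be illustrated with a two-step routing sketch: first choose a small expert pool for the whole document, then let each token pick its top-k experts only from that pool. How the pool is chosen is an assumption here (highest total router score across the document), as are the function names and toy scores; the summary only states that all tokens in a document select from a shared pool.

```python
def document_pool(token_scores, pool_size):
    """Pick the pool_size experts with the highest total router score in a document."""
    num_experts = len(token_scores[0])
    totals = [sum(tok[e] for tok in token_scores) for e in range(num_experts)]
    ranked = sorted(range(num_experts), key=lambda e: totals[e], reverse=True)
    return set(ranked[:pool_size])

def route_within_pool(token_scores, pool, k):
    """For each token, choose its top-k experts restricted to the document pool."""
    choices = []
    for tok in token_scores:
        allowed = sorted(pool, key=lambda e: tok[e], reverse=True)
        choices.append(sorted(allowed[:k]))
    return choices

# Toy router scores: 3 tokens x 6 experts (hypothetical values).
scores = [
    [0.9, 0.1, 0.8, 0.0, 0.2, 0.7],
    [0.7, 0.2, 0.9, 0.1, 0.8, 0.1],
    [0.8, 0.9, 0.6, 0.0, 0.1, 0.2],
]
pool = document_pool(scores, pool_size=3)
routing = route_within_pool(scores, pool, k=2)
```

In this example the second token would prefer expert 4 if routed freely, but the pool constraint redirects it to experts the rest of the document also uses, which is the pressure that pushes experts toward document-level (semantic) specialization.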
Read paper
/ 14 · HIDDEN GEM · 8.0/10

SkillOS: Learning Skill Curation for Self-Evolving Agents

SkillOS introduces a novel approach to building self-evolving AI agents through learned skill curation. The system separates a frozen agent executor from a trainable skill curator that manages an external skill repository. Using reinforcement learning on grouped task streams, the curator learns to insert, update, and delete skills based on their downstream impact on related future tasks. The approach consistently outperforms memory-free and existing memory-based baselines across household automation, e-commerce, and reasoning tasks, while requiring fewer interaction steps. Notably, a small trained curator outperforms frontier models used directly for curation, and the learned curator generalizes across different executor models and task domains.

  • RL-trained 8B skill curator outperforms Gemini-2.5-Pro used directly for skill curation, showing specialized training beats raw model scale
  • Grouping related tasks for training enables learning complex skill operations (update/delete) from delayed feedback signals
  • Skills evolve from task-specific procedures to higher-level meta-strategies and develop richer internal structure over time
  • Learned curators generalize across different executor models and task domains without retraining
  • Build customer service agents that learn from interaction patterns to develop reusable troubleshooting skills, automatically refining responses based on success rates across similar support tickets
  • Create code review assistants that accumulate project-specific best practices from past reviews, evolving domain-specific guidelines while generalizing patterns across different codebases
  • Develop personal productivity agents that learn user-specific workflows from task completion histories, building personalized automation skills that adapt as work patterns change
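
The split between a frozen executor and a curator that edits an external repository suggests a simple interface: the curator emits insert, update, and delete operations, and the repository applies them. The sketch below shows only that interface; the class, the operation format, and the example skills are hypothetical, and the RL training that decides which operations to emit is not shown.

```python
class SkillRepository:
    """External store of named skills that a curator can edit."""

    def __init__(self):
        self._skills = {}

    def insert(self, name, text):
        if name in self._skills:
            raise KeyError(f"skill already exists: {name}")
        self._skills[name] = text

    def update(self, name, text):
        if name not in self._skills:
            raise KeyError(f"unknown skill: {name}")
        self._skills[name] = text

    def delete(self, name):
        # Deleting a missing skill is treated as a no-op.
        self._skills.pop(name, None)

    def apply(self, op):
        """Apply one curator operation given as a dict with an 'action' key."""
        action = op["action"]
        if action == "insert":
            self.insert(op["name"], op["text"])
        elif action == "update":
            self.update(op["name"], op["text"])
        elif action == "delete":
            self.delete(op["name"])
        else:
            raise ValueError(f"unsupported action: {action}")

    def names(self):
        return sorted(self._skills)

    def get(self, name):
        return self._skills[name]

repo = SkillRepository()
for op in [
    {"action": "insert", "name": "find_item", "text": "Search shelves first."},
    {"action": "insert", "name": "checkout", "text": "Confirm cart, then pay."},
    {"action": "update", "name": "find_item", "text": "Check the last known location, then shelves."},
    {"action": "delete", "name": "checkout"},
]:
    repo.apply(op)
```

Keeping the repository separate from the executor is what lets the same curated skills be reused when the executor model is swapped.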
Read paper
/ 15 · HIDDEN GEM · 8.0/10

SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

This paper introduces SWE-WebDevBench, a comprehensive evaluation framework for AI app-building platforms that goes beyond code quality to assess whether platforms can function as complete software agencies. Testing six platforms across 68 metrics, it reveals four critical shortcomings: specification bottlenecks where platforms compress rich requirements into oversimplified plans, frontend-backend decoupling where polished UIs mask broken infrastructure, universal production readiness failures with no platform exceeding 60% engineering quality, and widespread security vulnerabilities. The benchmark introduces novel concepts like canary requirements for testing genuine comprehension and separate evaluation of app creation versus modification requests.

  • No AI platform exceeds 60% engineering quality score, with all requiring 15-66 developer-hours of post-generation work to reach production readiness
  • Frontend engineering quality (61-74% across platforms) poorly predicts backend infrastructure capability (0-49% on background jobs), revealing systematic decoupling
  • Canary retention rates vary 5.5× across platforms (17.7%-97.7%), indicating most platforms silently drop culturally-specific requirements users cannot verify
  • App modification requests consistently degrade quality compared to creation, with surviving requirements showing 3× higher loss rates than new ones
  • Platform engineering teams can implement the 68-metric evaluation suite to identify specific architectural blind spots in their AI coding platforms, prioritizing fixes for backend infrastructure generation over frontend polish based on the decoupling findings
  • Product teams building AI-powered development tools can adopt the canary requirements methodology to test whether their systems genuinely understand domain-specific constraints versus pattern-matching on generic SaaS templates
  • Enterprise software buyers can use the ACR/AMR evaluation framework to assess whether AI platforms can handle iterative development workflows without breaking existing functionality during feature additions or modifications
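
A canary retention rate of the kind reported above can be computed as the share of planted requirements that survive into the generated app. The sketch below is a guess at the simplest version of such a metric: the requirement IDs, the set-membership check, and the function name are hypothetical, and the benchmark's actual scoring procedure may differ.

```python
def canary_retention(canaries, implemented):
    """Share of planted canary requirements that survive into the built app."""
    if not canaries:
        raise ValueError("need at least one canary requirement")
    kept = sum(1 for c in canaries if c in implemented)
    return kept / len(canaries)

# Hypothetical requirement IDs: four canaries planted, two found in the generated app.
planted = ["CAN-1", "CAN-2", "CAN-3", "CAN-4"]
found_in_app = {"CAN-1", "CAN-4", "REQ-7"}
rate = canary_retention(planted, found_in_app)

# Spread between the best and worst retention rates quoted above.
spread = 97.7 / 17.7
```

The ratio of the two quoted extremes comes out at about 5.5, matching the 5.5× spread in the summary.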
Read paper