Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization
Summary
This paper addresses a critical problem in multimodal reasoning models: while reinforcement learning improves answer accuracy, it often produces unfaithful reasoning that contradicts final answers or misrepresents visual content. The authors propose Faithful GRPO (FGRPO), which treats logical consistency and visual grounding as hard constraints rather than reward terms. Using Lagrangian optimization with decoupled advantage computation, FGRPO adaptively enforces these constraints during training. Experiments show FGRPO reduces inconsistency rates from 26% to 1.7% while improving both visual grounding (+13%) and accuracy (+2%) on spatial reasoning benchmarks, demonstrating that faithful reasoning and correct answers are complementary rather than competing objectives.
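The constrained formulation can be sketched as a Lagrangian whose multipliers are raised by dual ascent whenever a constraint is violated. The sketch below is illustrative only: the function names, thresholds (`eps_c`, `eps_g`), and learning rate are assumptions, not the paper's exact values.

```python
# Hypothetical sketch of FGRPO's constrained objective; the paper's exact
# notation, thresholds, and learning rates are not reproduced here.

def fgrpo_loss(task_advantage, consistency_violation, grounding_error,
               lam_c, lam_g, eps_c=0.02, eps_g=0.05):
    """Per-sample Lagrangian: maximize task advantage subject to
    consistency-violation and grounding-error rates staying below the
    (assumed) thresholds eps_c and eps_g."""
    penalty = lam_c * (consistency_violation - eps_c) \
            + lam_g * (grounding_error - eps_g)
    return -task_advantage + penalty  # minimized by the policy optimizer

def dual_ascent(lam, violation_rate, eps, lr=0.1):
    """Adaptive dual update: the multiplier rises while its constraint is
    violated and decays toward zero once it is satisfied, so no manual
    weight tuning is needed."""
    return max(0.0, lam + lr * (violation_rate - eps))
```

Because the multipliers track observed violation rates, constraint pressure grows only where the policy actually misbehaves, which is how the method avoids the fixed-weight tuning problem.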
Key findings
- Standard RLVR (reinforcement learning with verifiable rewards) training creates an accuracy-faithfulness tradeoff: models achieve higher accuracy but produce reasoning that contradicts their answers (26% inconsistency rate) or misrepresents visual content
- Treating consistency and grounding as hard constraints via Lagrangian optimization eliminates this tradeoff, improving both reasoning quality and final accuracy
- Decoupled advantage computation prevents signal cancellation in group normalization, enabling each constraint to contribute meaningful gradients during training
- Adaptive dual ascent automatically balances constraint pressures without manual weight tuning, outperforming fixed-multiplier approaches
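The decoupled-advantage point above can be illustrated with a toy group of rollouts (a sketch under assumed reward scales, not the paper's code): summing reward signals before group normalization lets a high-magnitude signal swamp a low-magnitude one, while normalizing each signal within the group first preserves every signal's gradient contribution.

```python
import numpy as np

# Sketch (not the paper's code) of decoupled group-advantage computation.
# Coupled: sum the reward signals, then normalize the total within the group.
# Decoupled: normalize each signal within the group first, then combine.

def coupled_advantage(rewards):                 # rewards: (group, n_signals)
    total = rewards.sum(axis=1)
    return (total - total.mean()) / (total.std() + 1e-8)

def decoupled_advantage(rewards, weights=None):
    if weights is None:
        weights = np.ones(rewards.shape[1])
    per_signal = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    return per_signal @ weights                 # each signal keeps unit scale

# Group of 4 rollouts: signal 0 has magnitude 10, signal 1 magnitude 0.1.
r = np.array([[10.0, 0.1],
              [ 0.0, 0.0],
              [10.0, 0.0],
              [ 0.0, 0.1]])
# Rollouts 0 and 2 differ only in the small signal: under coupled
# normalization their advantages are nearly identical (the small signal's
# gradient is lost), while decoupling keeps them clearly separated.
```

In this example `coupled_advantage(r)[0]` and `coupled_advantage(r)[2]` differ by about 0.02, whereas the decoupled advantages differ by 2.0, so the constraint signal still produces a usable gradient.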
How to implement
- Integrate FGRPO into visual question answering systems for autonomous vehicles to ensure spatial reasoning explanations accurately describe road scenes and justify navigation decisions
- Apply constrained optimization framework to medical imaging AI systems to enforce that diagnostic reasoning chains are both logically consistent and accurately grounded in radiological findings
- Adapt decoupled advantage computation technique to robotics training pipelines where multiple reward signals (safety, efficiency, accuracy) need independent normalization to prevent dominant signals from nullifying others
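For a multi-signal setting like the robotics case above, the adaptive dual ascent can be simulated with a self-contained toy loop. Everything here is illustrative: the policy response `0.30 / (1 + lam)` is a made-up stand-in that simply assumes violations shrink as constraint pressure grows; only the update rule reflects the mechanism described above.

```python
# Toy simulation of adaptive dual ascent for a single constraint
# (e.g. a safety-violation rate in a robotics pipeline).

eps, lr = 0.02, 10.0        # allowed violation rate, dual learning rate
lam, violation = 0.0, 0.30  # initial multiplier and observed violation rate

history = []
for step in range(200):
    lam = max(0.0, lam + lr * (violation - eps))  # dual ascent update
    violation = 0.30 / (1.0 + lam)                # stand-in policy response
    history.append((lam, violation))
# The multiplier rises while the constraint is violated, then settles near
# the point where the violation rate meets the threshold eps.
```

Running one such loop per constraint (safety, efficiency, accuracy) gives each signal its own multiplier, so no signal's pressure has to be hand-balanced against the others.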