Vector Policy Optimization: Training for Diversity Improves Test-Time Search
Summary
Vector Policy Optimization (VPO) addresses a key limitation in current LLM training: models trained with standard RL methods collapse to low-diversity outputs that perform poorly in test-time search scenarios. VPO trains models to generate multiple candidate solutions within single rollouts while optimizing across randomly sampled reward weightings rather than fixed scalar rewards. This encourages models to cover the Pareto frontier of different objective trade-offs. Across four domains, VPO consistently outperforms standard approaches on best@k metrics, with improvements growing as search budget increases. The method works by separating exploration (handled during training via diverse reward optimization) from exploitation (handled by test-time search procedures).
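The random reward weighting described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the two objective names, the uniform-simplex weight distribution, and the choice to score a multi-candidate rollout by its best candidate under the sampled weights are all assumptions made here.

```python
import random


def sample_weights(num_objectives, rng):
    """Draw a weight vector uniformly from the probability simplex.

    Normalizing i.i.d. Exponential(1) draws yields a flat Dirichlet sample.
    ASSUMPTION: the source only says the weightings are randomly sampled.
    """
    draws = [rng.expovariate(1.0) for _ in range(num_objectives)]
    total = sum(draws)
    return [d / total for d in draws]


def scalarize(reward_vector, weights):
    """Collapse a vector reward into a scalar via a weighted sum."""
    return sum(w * r for w, r in zip(weights, reward_vector))


def rollout_reward(candidate_rewards, weights):
    """ASSUMPTION: a rollout is scored by its best candidate under the weights."""
    return max(scalarize(r, weights) for r in candidate_rewards)


rng = random.Random(0)
# Hypothetical (correctness, brevity) reward vectors for two candidates in one rollout.
candidates = [(0.9, 0.2), (0.4, 0.8)]
for _ in range(3):
    w = sample_weights(2, rng)
    print([round(x, 2) for x in w], round(rollout_reward(candidates, w), 3))
```

Under this sketch, a rollout containing both candidates scores well whichever objective the sampled weights favor, while a rollout containing only one of them is penalized under some weightings. That is one way to read the pressure toward covering different trade-offs.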
Key findings
- VPO consistently outperforms scalar RL baselines on best@k metrics across diverse domains, with gaps widening as search budget increases
- Multi-answer generation alone is insufficient; stochastic reward scalarization is critical for maintaining candidate diversity throughout training
- VPO-trained models unlock problems in evolutionary search that scalar-trained models cannot solve at any candidate budget
- The gains hold even when VPO is evaluated on the same scalar objective used to train the baseline methods, suggesting that the reasoning strategies it preserves improve performance
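The findings above are stated in terms of best@k. The summary does not define the metric, so the sketch below assumes the common reading: the best score among k candidates sampled for a problem. Given n scored samples, it computes the expected best score over a uniformly random size-k subset, which avoids depending on sample order. The function name and the example scores are hypothetical.

```python
from math import comb


def expected_best_at_k(scores, k):
    """Expected maximum score over a uniformly random size-k subset of `scores`."""
    n = len(scores)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(scores)")
    ordered = sorted(scores)
    # The score at sorted index i (0-based) is the maximum of exactly
    # comb(i, k - 1) subsets: the other k - 1 members come from the i scores below it.
    total = sum(s * comb(i, k - 1) for i, s in enumerate(ordered))
    return total / comb(n, k)


# Hypothetical per-candidate scores for a single problem.
scores = [0.2, 0.9, 0.5]
for k in range(1, len(scores) + 1):
    print(k, round(expected_best_at_k(scores, k), 4))
```

With k = 1 this reduces to the mean score, and with k = n to the maximum, so the curve over k shows how much a larger search budget can extract from a model's candidate pool.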
How to implement
- Implement VPO for training coding assistants that generate multiple solution approaches, then use test-time search to select the best implementation based on performance, readability, and maintainability trade-offs
- Train customer service chatbots with VPO to generate diverse response candidates optimizing for different objectives (helpfulness, brevity, empathy), then select responses based on customer context and satisfaction metrics
- Apply VPO to train scientific reasoning models that generate multiple hypothesis pathways, allowing downstream verification systems to select the most promising research directions based on experimental feasibility and theoretical soundness
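The selection step these applications share can be sketched as a Pareto filter followed by a context-specific weighted choice. This is only an illustration under stated assumptions: the candidate names, the objective scores, and the two weight profiles are hypothetical, and a weighted sum is just one possible selection rule.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep candidates whose score vectors no other candidate dominates."""
    return [
        (name, scores)
        for name, scores in candidates
        if not any(dominates(other, scores) for _, other in candidates)
    ]


def select(candidates, weights):
    """Pick the non-dominated candidate with the highest weighted score."""
    front = pareto_front(candidates)
    return max(front, key=lambda c: sum(w * s for w, s in zip(weights, c[1])))


# Hypothetical (performance, readability, maintainability) scores.
candidates = [
    ("fast_but_dense", (0.9, 0.3, 0.4)),
    ("clean_and_slow", (0.5, 0.9, 0.8)),
    ("dominated", (0.4, 0.3, 0.3)),
]
latency_critical = (0.8, 0.1, 0.1)
long_lived_codebase = (0.2, 0.4, 0.4)
print(select(candidates, latency_critical)[0])
print(select(candidates, long_lived_codebase)[0])
```

With strictly positive weights, the weighted-sum maximizer is already non-dominated, so the explicit filter mainly serves to inspect which trade-offs the candidate set covers before a context picks among them.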