⭐️ Stability: Splits the unstable exponential weighting into two bounded components, enabling
stable diffusion-policy learning.
⭐️ Controllability: Reconstructs the final policy via a linear combination of the two dichotomous
policies’ scores, yielding CFG-like greediness control with a single knob ω.
⭐️ Scalability: Outperforms strong baselines across offline, offline-to-online, and large-scale
Vision-Language-Action tasks.
DIPOLE is a reinforcement learning framework for stable, controllable, and
scalable optimization of diffusion policies. It reformulates KL-regularized RL and decomposes
policy improvement into a pair of dichotomous diffusion policies, enabling precise control over policy
optimality at inference time while maintaining stable training dynamics.
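Conceptually, the inference-time reconstruction resembles classifier-free guidance applied to the two policies' scores. Below is a minimal sketch of that idea; the function and argument names (`dipole_guided_score`, `score_pos`, `score_neg`) and the exact linear rule are illustrative assumptions, not the repo's actual API:

```python
def dipole_guided_score(score_pos, score_neg, x_t, t, omega: float):
    """Compose the two dichotomous policies' scores, CFG-style.

    score_pos / score_neg: callables returning each diffusion policy's
    denoising score at noisy action x_t and timestep t (hypothetical
    signatures; the real interfaces may differ).
    omega: the single greediness knob -- omega = 0 follows the
    behavior-leaning policy, larger omega leans toward optimality.
    """
    s_neg = score_neg(x_t, t)
    s_pos = score_pos(x_t, t)
    # Linear combination of the two scores, analogous to
    # classifier-free guidance with guidance scale omega.
    return s_neg + omega * (s_pos - s_neg)
```

The combined score is then plugged into the usual reverse-diffusion sampler, so greediness can be tuned at inference time without retraining either policy.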