Dichotomous Diffusion Policy Optimization

1Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences,
2School of Artificial Intelligence, University of Chinese Academy of Sciences,
3Institute for AI Industry Research (AIR), Tsinghua University,
4The Chinese University of Hong Kong, 5Shanghai Jiao Tong University, 6Peking University, 7Xiaomi EV

Under Review

*Equal Contribution. Corresponding Author. Work done during internships. Project lead.

DIPOLE algorithm overview

Takeaways

⭐️ Stability: Splits the unstable exponential weighting into two bounded components, enabling stable diffusion-policy learning.

⭐️ Controllability: Reconstructs the final policy via a linear combination of the two dichotomous policies’ scores, yielding CFG-like greediness control with a single knob ω.

⭐️ Scalability: Outperforms strong baselines across offline, offline-to-online, and large-scale Vision-Language-Action tasks.

DIPOLE is a reinforcement learning framework for stable, controllable, and scalable optimization of diffusion policies. It reformulates KL-regularized RL and decomposes policy improvement into a pair of dichotomous diffusion policies, enabling precise control over policy optimality at inference time while maintaining stable training dynamics.

Method

🔀 Stable Training

DIPOLE represents the optimal policy as a combination of two diffusion policies, each trained with stable, bounded sigmoid-weighted regression.

Positive Policy (π⁺)

Encouraged toward high-reward actions, capturing behaviors preferred by the reward signal.

Negative Policy (π⁻)

Drawn toward low-reward actions, stabilizing learning by explicitly modeling the undesirable behaviors that the final policy is steered away from at inference.

This dichotomous decomposition avoids exponential loss explosion and enables efficient learning from both high- and low-quality data.
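For concreteness, below is a minimal PyTorch sketch of what one such training step could look like. It assumes the exponential weight exp(A/β) of KL-regularized RL is split into the two bounded sigmoids σ(A/β) and σ(−A/β) (using exp(x) = σ(x)/σ(−x)); the noise schedule, advantage estimate, and all names (eps_plus_net, eps_minus_net, beta) are illustrative assumptions, not the paper's implementation.

```python
import torch

def dipole_losses(eps_plus_net, eps_minus_net, obs, actions, advantage, beta=1.0):
    """Sigmoid-weighted denoising regression for the two dichotomous policies.

    Sketch only: assumes the exponential weight exp(A/beta) is split as
    sigmoid(A/beta) / sigmoid(-A/beta), one bounded weight per policy.
    Network and argument names are illustrative placeholders.
    """
    b = actions.shape[0]
    t = torch.rand(b, device=actions.device)                        # diffusion time in [0, 1]
    alpha_bar = torch.cos(0.5 * torch.pi * t).pow(2).unsqueeze(-1)  # simple cosine schedule
    noise = torch.randn_like(actions)
    noisy_actions = alpha_bar.sqrt() * actions + (1 - alpha_bar).sqrt() * noise

    # Bounded weights: sigma(A/beta) emphasizes high-reward actions for pi+,
    # sigma(-A/beta) emphasizes low-reward actions for pi-.
    w_plus = torch.sigmoid(advantage / beta).detach()
    w_minus = torch.sigmoid(-advantage / beta).detach()

    # Per-sample epsilon-prediction errors, weighted and averaged.
    err_plus = (eps_plus_net(obs, noisy_actions, t) - noise).pow(2).mean(dim=-1)
    err_minus = (eps_minus_net(obs, noisy_actions, t) - noise).pow(2).mean(dim=-1)
    return (w_plus * err_plus).mean() + (w_minus * err_minus).mean()
```

Because both weights lie in (0, 1), neither regression target can blow up the way a raw exponential weight can, which is the source of the stability claim above.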


🎛️ CFG-style Controllable Inference

At inference time, DIPOLE reconstructs the final policy by linearly combining the scores (noise predictions) of the two diffusion policies, in the style of Classifier-Free Guidance (CFG):

ε = (1 + ω) · ε⁺ − ω · ε⁻

The greediness factor ω acts as a continuous and interpretable control knob, smoothly interpolating between:

  • ω = 0: conservative behavior, following only the sigmoid-weighted positive policy π⁺
  • larger ω: increasingly greedy, higher-reward behavior
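The sketch below shows how this combination could be dropped into an ordinary denoising loop. Only the combination rule comes from the formula above; the DDIM-style sampler, the cosine schedule, and the names (eps_plus_net, eps_minus_net, n_steps) are assumptions for illustration.

```python
import torch

@torch.no_grad()
def sample_action(eps_plus_net, eps_minus_net, obs, action_dim, omega=1.0, n_steps=50):
    """Reverse diffusion with the DIPOLE combination eps = (1 + w) * eps_plus - w * eps_minus.

    Sketch only: a plain deterministic DDIM-like sampler with a cosine schedule;
    sampler details and names are illustrative, only the combination rule is from the paper.
    """
    b = obs.shape[0]
    a = torch.randn(b, action_dim, device=obs.device)
    ts = torch.linspace(1.0, 0.0, n_steps + 1, device=obs.device)

    for i in range(n_steps):
        t, t_next = ts[i].expand(b), ts[i + 1].expand(b)
        ab = torch.cos(0.5 * torch.pi * t).pow(2).clamp(min=1e-4).unsqueeze(-1)
        ab_next = torch.cos(0.5 * torch.pi * t_next).pow(2).unsqueeze(-1)

        # CFG-style greediness control: omega = 0 follows pi+ alone,
        # larger omega pushes further away from the negative policy's score.
        eps = (1 + omega) * eps_plus_net(obs, a, t) - omega * eps_minus_net(obs, a, t)

        # DDIM update: predict the clean action, then re-noise to the next level.
        a0 = (a - (1 - ab).sqrt() * eps) / ab.sqrt()
        a = ab_next.sqrt() * a0 + (1 - ab_next).sqrt() * eps

    return a
```

Since ω only enters this inference-time combination, the same trained pair of policies can be made more or less greedy by changing a single scalar, without any retraining.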

🤖 Algorithm

DIPOLE Algorithm

Experiment

Representative Tasks

Some representative cases of RL benchmarks.


Autonomous Driving Demos

Some representative cases of autonomous driving.




Quantitative Results

ExORL Results

ExORL Results. We report the average score over 8 random seeds. DIPOLE achieves the best performance.


OGBench Results

OGBench Results. We report the aggregate score on all single tasks for each category, averaging over 8 random seeds.


OGBench Offline-to-Online Results

OGBench Offline-to-Online Results. We report the score on the default task for each category, averaging over 8 random seeds. (humanoidmaze-m: humanoidmaze-medium-navigate)


NAVSIM Closed-Loop Results

NAVSIM Closed-Loop Results. We scale up DIPOLE to a large VLA model, demonstrating its potential for real-world applications. (navtrain/navtest represent different data splits used for trajectory rollout)

BibTeX


@article{liang2026dipole,
  title={Dichotomous Diffusion Policy Optimization},
  author={Ruiming Liang and Yinan Zheng and Kexin Zheng and Tianyi Tan and Jianxiong Li and Liyuan Mao and Zhihao Wang and Guang Chen and Hangjun Ye and Jingjing Liu and Jinqiao Wang and Xianyuan Zhan},
  journal={arXiv preprint arXiv:2601.00898},
  year={2026}
}