Dichotomous Diffusion Policy Optimization

1Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences,
2School of Artificial Intelligence, University of Chinese Academy of Sciences,
3Institute for AI Industry Research (AIR), Tsinghua University,
4The Chinese University of Hong Kong, 5Shanghai Jiao Tong University, 6Peking University, 7Xiaomi EV

Under Review

*Equal Contribution. Corresponding Author. Work done during internships. Project lead.

DIPOLE algorithm overview

Takeaways

⭐️ Stability: Splits the unstable exponential weighting into two bounded components, enabling stable diffusion-policy learning.

⭐️ Controllability: Reconstructs the final policy via a linear combination of the two dichotomous policies’ scores, yielding CFG-like greediness control with a single knob ω.

⭐️ Scalability: Outperforms strong baselines across offline, offline-to-online, and large-scale Vision-Language-Action tasks.

DIPOLE is a reinforcement learning framework for stable, controllable, and scalable optimization of diffusion policies. It reformulates KL-regularized RL and decomposes policy improvement into a pair of dichotomous diffusion policies, enabling precise control over policy optimality at inference time while maintaining stable training dynamics.

Method

🔀 Stable Training

DIPOLE represents the optimal policy as a combination of two diffusion policies, each trained with stable, bounded sigmoid-weighted regression.

Positive Policy (π⁺)

Encouraged toward high-reward actions, capturing behaviors preferred by the reward signal.

Negative Policy (π⁻)

Drawn toward low-reward actions, stabilizing learning by explicitly modeling the undesirable behaviors that the final policy is steered away from at inference.

This dichotomous decomposition avoids exponential loss explosion and enables efficient learning from both high- and low-quality data.
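For concreteness, below is a minimal PyTorch sketch of what one such training step could look like. It assumes the exponential weight exp(A/β) of KL-regularized RL is split into the two bounded sigmoids σ(A/β) and σ(−A/β) (using exp(x) = σ(x)/σ(−x)); the noise schedule, advantage estimate, and all names (eps_plus_net, eps_minus_net, beta) are illustrative assumptions, not the paper's implementation.

```python
import torch

def dipole_losses(eps_plus_net, eps_minus_net, obs, actions, advantage, beta=1.0):
    """Sigmoid-weighted denoising regression for the two dichotomous policies.

    Sketch only: assumes the exponential weight exp(A/beta) is split as
    sigmoid(A/beta) / sigmoid(-A/beta), one bounded weight per policy.
    Network and argument names are illustrative placeholders.
    """
    b = actions.shape[0]
    t = torch.rand(b, device=actions.device)                        # diffusion time in [0, 1]
    alpha_bar = torch.cos(0.5 * torch.pi * t).pow(2).unsqueeze(-1)  # simple cosine schedule
    noise = torch.randn_like(actions)
    noisy_actions = alpha_bar.sqrt() * actions + (1 - alpha_bar).sqrt() * noise

    # Bounded weights: sigma(A/beta) emphasizes high-reward actions for pi+,
    # sigma(-A/beta) emphasizes low-reward actions for pi-.
    w_plus = torch.sigmoid(advantage / beta).detach()
    w_minus = torch.sigmoid(-advantage / beta).detach()

    # Per-sample epsilon-prediction errors, weighted and averaged.
    err_plus = (eps_plus_net(obs, noisy_actions, t) - noise).pow(2).mean(dim=-1)
    err_minus = (eps_minus_net(obs, noisy_actions, t) - noise).pow(2).mean(dim=-1)
    return (w_plus * err_plus).mean() + (w_minus * err_minus).mean()
```

Because both weights lie in (0, 1), neither regression target can blow up the way a raw exponential weight can, which is the source of the stability claim above.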


🎛️ CFG-style Controllable Inference

At inference time, DIPOLE reconstructs the final policy by linearly combining the scores (noise predictions) of the two diffusion policies, in the style of Classifier-Free Guidance (CFG):

ε = (1 + ω) · ε⁺ − ω · ε⁻

The greediness factor ω acts as a continuous and interpretable control knob, smoothly interpolating between:

  • ω = 0: conservative behavior, following only the sigmoid-weighted positive policy π⁺
  • larger ω: increasingly greedy, higher-reward behavior
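The sketch below shows how this combination could be dropped into an ordinary denoising loop. Only the combination rule comes from the formula above; the DDIM-style sampler, the cosine schedule, and the names (eps_plus_net, eps_minus_net, n_steps) are assumptions for illustration.

```python
import torch

@torch.no_grad()
def sample_action(eps_plus_net, eps_minus_net, obs, action_dim, omega=1.0, n_steps=50):
    """Reverse diffusion with the DIPOLE combination eps = (1 + w) * eps_plus - w * eps_minus.

    Sketch only: a plain deterministic DDIM-like sampler with a cosine schedule;
    sampler details and names are illustrative, only the combination rule is from the paper.
    """
    b = obs.shape[0]
    a = torch.randn(b, action_dim, device=obs.device)
    ts = torch.linspace(1.0, 0.0, n_steps + 1, device=obs.device)

    for i in range(n_steps):
        t, t_next = ts[i].expand(b), ts[i + 1].expand(b)
        ab = torch.cos(0.5 * torch.pi * t).pow(2).clamp(min=1e-4).unsqueeze(-1)
        ab_next = torch.cos(0.5 * torch.pi * t_next).pow(2).unsqueeze(-1)

        # CFG-style greediness control: omega = 0 follows pi+ alone,
        # larger omega pushes further away from the negative policy's score.
        eps = (1 + omega) * eps_plus_net(obs, a, t) - omega * eps_minus_net(obs, a, t)

        # DDIM update: predict the clean action, then re-noise to the next level.
        a0 = (a - (1 - ab).sqrt() * eps) / ab.sqrt()
        a = ab_next.sqrt() * a0 + (1 - ab_next).sqrt() * eps

    return a
```

Since ω only enters this inference-time combination, the same trained pair of policies can be made more or less greedy by changing a single scalar, without any retraining.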

🤖 Algorithm

DIPOLE Algorithm

Experiment

Representative Tasks

Some representative cases of RL benchmarks.


Autonomous Driving Demos

Some representative cases of autonomous driving.




Quantitative Results

ExORL Results

ExORL Results. We report the average score over 8 random seeds. DIPOLE achieves the best performance.


OGBench Results

OGBench Results. We report the aggregate score on all single tasks for each category, averaging over 8 random seeds.


OGBench Offline-to-Online Results

OGBench Offline-to-Online Results. We report the score on the default task for each category, averaging over 8 random seeds. (humanoidmaze-m: humanoidmaze-medium-navigate)


NAVSIM Closed-Loop Results

NAVSIM Closed-Loop Results. We scale up DIPOLE to a large VLA model, demonstrating its potential for real-world applications. (navtrain/navtest represent different data splits used for trajectory rollout)

BibTeX


@article{liang2026dipole,
  title={Dichotomous Diffusion Policy Optimization},
  author={Ruiming Liang and Yinan Zheng and Kexin Zheng and Tianyi Tan and Jianxiong Li and Liyuan Mao and Zhihao Wang and Guang Chen and Hangjun Ye and Jingjing Liu and Jinqiao Wang and Xianyuan Zhan},
  journal={arXiv preprint arXiv:2601.00898},
  year={2026}
}