DPO (Direct Preference Optimization), published by Rafailov et al. at Stanford in 2023, is a preference-optimisation method designed to skip the expensive and unstable RL step in RLHF. Instead of training a separate reward model and then applying PPO, it derives a closed-form objective that updates the model directly from preference data: the reward can be expressed as the β-scaled log-ratio between the policy and a frozen reference model, so fitting preferences reduces to a simple binary classification loss and the policy optimises an implicit reward model in one shot. Because it is simpler, more stable and far cheaper, the open-source community adopted DPO quickly; Zephyr, Mixtral-Instruct and many modern post-training pipelines all rely on it. It hasn't fully replaced RLHF, but it is now the first tool most teams reach for in the alignment toolkit.
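A minimal sketch of that objective in PyTorch may make the "implicit reward" concrete. The function and argument names here are illustrative, not from the paper's codebase; it assumes you have already computed summed per-token log-probabilities for each preferred and rejected completion under both the trainable policy and the frozen reference model, and the β default of 0.1 is just a commonly used setting.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Each argument is a 1-D tensor of summed log-probabilities for the
    chosen (preferred) or rejected completion, under the policy being
    trained or the frozen reference model.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Binary classification on the reward margin:
    # maximise log sigmoid(chosen - rejected), i.e. prefer the chosen answer.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Because everything is differentiable and needs only forward passes through two language models, this can be dropped into an ordinary supervised training loop, which is exactly where the cost and stability advantages over PPO come from.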