The original DPO paper (Rafailov et al. 2023) provides a nice and simple derivation of the DPO loss in Appendices A.1 and A.2. However, while reading the paper, I noticed certain steps in the derivation were omitted for brevity. In this post, I aim to derive the DPO objective in a more detailed, step-by-step manner.

Deriving the Optimal Policy of the KL-Constrained Reward Maximization Problem

The reward maximization problem in the RL fine-tuning phase (Equation 3 in the paper) maximizes the expected reward $r(x, y)$ of a prompt $x$ and a response $y$, minus a $\beta$-weighted KL term between the target policy $\pi$ and the reference policy $\pi_\mathrm{ref}$:

$$ \max _\pi \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi(y \mid x)} [r(x, y)]-\beta \mathbb{D}_{\mathrm{KL}}\left[\pi(y \mid x) \| \pi_{\mathrm{ref}}(y \mid x)\right] $$

We first expand the KL term into its expectation form and merge it with the expectation over responses in the reward term. This gives us an expectation over $(x, y)$ of the reward minus the $\beta$-scaled log ratio between the target policy and the reference policy.

$$ \begin{align*} \max _\pi \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi(y \mid x)} & [r(x, y)]-\beta \mathbb{D}_{\mathrm{KL}}\left[\pi(y \mid x) \| \pi_{\mathrm{ref}}(y \mid x)\right] \\ & = \max _\pi \mathbb{E}_{x \sim \mathcal{D}} \mathbb{E}_{y \sim \pi(y \mid x)} [r(x, y)]-\beta \mathbb{D}_{\mathrm{KL}}\left[\pi(y \mid x) \| \pi_{\mathrm{ref}}(y \mid x)\right] \\ & = \max _\pi \mathbb{E}_{x \sim \mathcal{D}} \mathbb{E}_{y \sim \pi(y \mid x)} [r(x, y)]-\beta \sum_{y} \pi(y \mid x) \log \left( \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \right) \\ & = \max _\pi \mathbb{E}_{x \sim \mathcal{D}} \mathbb{E}_{y \sim \pi(y \mid x)} [r(x, y)]-\beta \mathbb{E}_{y \sim \pi(y \mid x)} \left[ \log \left( \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \right) \right] \\ & = \max _\pi \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{y \sim \pi(y \mid x)} \left[r(x, y)-\beta \log \left( \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \right) \right] \right] \end{align*} $$
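As a sanity check, here is a small NumPy sketch (a toy setup I made up: a single prompt with five possible responses and random rewards) verifying that the closed-form KL and the merged expectation form give the same objective value:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.5

# Toy setup: a single prompt x with 5 possible responses y.
r = rng.normal(size=5)              # rewards r(x, y)
pi_ref = rng.dirichlet(np.ones(5))  # reference policy pi_ref(y|x)
pi = rng.dirichlet(np.ones(5))      # target policy pi(y|x)

# Form 1: E_y[r(x, y)] - beta * KL(pi || pi_ref)
kl = np.sum(pi * np.log(pi / pi_ref))
form1 = np.sum(pi * r) - beta * kl

# Form 2: E_y[r(x, y) - beta * log(pi(y|x) / pi_ref(y|x))]
form2 = np.sum(pi * (r - beta * np.log(pi / pi_ref)))

print(np.isclose(form1, form2))  # True
```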

We then turn the maximization problem into a minimization problem by flipping the sign of the two terms. Following the original paper, we also divide the whole expression by $\beta$ (which does not change the minimizer, since $\beta > 0$) so that the reward term takes the form $\frac{1}{\beta} r(x, y)$, which will be useful in the next step.

$$ \begin{align*} & = \max _\pi \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{y \sim \pi(y \mid x)} \left[r(x, y)-\beta \log \left( \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \right) \right] \right] \\ & = \min _\pi \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{y \sim \pi(y \mid x)} \left[- r(x, y)+\beta \log \left( \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \right) \right] \right] \\ & = \min _\pi \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{y \sim \pi(y \mid x)} \left[\beta \log \left( \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \right) - r(x, y) \right] \right] \\ & = \min _\pi \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{y \sim \pi(y \mid x)} \left[\log \left( \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \right) - \frac{1}{\beta} r(x, y) \right] \right] \\ \end{align*} $$

The original paper defines a partition function $Z(x)$ as

$$ Z(x)=\sum_y \pi_{\mathrm{ref}}(y \mid x) \exp \left(\frac{1}{\beta} r(x, y)\right), $$

which allows us to fold the reward term into the denominator of the log ratio and, by adding and subtracting $\log Z(x)$, normalize that denominator into a proper distribution:

$$ \begin{align*} & = \min _\pi \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{y \sim \pi(y \mid x)} \left[\log \left( \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \right) - \frac{1}{\beta} r(x, y) \right] \right] \\ & =\min _\pi \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{y \sim \pi(y \mid x)}\left[\log \frac{\pi(y \mid x)}{\pi_{\text {ref }}(y \mid x)}-\log\left( \exp \left( \frac{1}{\beta} r(x, y) \right) \right)\right] \right] \\ & =\min _\pi \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{y \sim \pi(y \mid x)}\left[\log \frac{\pi(y \mid x)}{\pi_{\text {ref }}(y \mid x) \exp \left( \frac{1}{\beta} r(x, y) \right)}\right] \right] \\ & =\min _\pi \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{y \sim \pi(y \mid x)}\left[\log \frac{\pi(y \mid x)}{\pi_{\text {ref }}(y \mid x) \exp \left( \frac{1}{\beta} r(x, y) \right)} + \log \left( Z(x) \right) - \log \left( Z(x) \right) \right] \right]\\ & =\min _\pi \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{y \sim \pi(y \mid x)}\left[\log \frac{\pi(y \mid x) Z(x)}{\pi_{\text {ref }}(y \mid x) \exp \left( \frac{1}{\beta} r(x, y) \right)} - \log \left( Z(x) \right) \right] \right]\\ & = \min _\pi \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{y \sim \pi(y \mid x)} \left[\log \left( \frac{\pi(y \mid x)}{\frac{1}{Z(x)} \pi_{\mathrm{ref}}(y \mid x) \exp \left(\frac{1}{\beta} r(x, y)\right)} \right)- \log Z(x) \right] \right] \end{align*} $$

We now define the optimal policy as

$$ \pi^*(y \mid x)=\frac{1}{Z(x)} \pi_{\text {ref }}(y \mid x) \exp \left(\frac{1}{\beta} r(x, y)\right) $$

and plug it into the denominator of the log ratio. Note that $\pi^*$ is a valid probability distribution: $\pi^*(y \mid x) \geq 0$ for all $y$, and $\sum_y \pi^*(y \mid x) = 1$ by the definition of $Z(x)$. Since $Z(x)$ does not depend on $y$, the $\log Z(x)$ term can also be pulled out of the inner expectation.

$$ \begin{align*} & = \min _\pi \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{y \sim \pi(y \mid x)} \left[\log \left( \frac{\pi(y \mid x)}{\frac{1}{Z(x)} \pi_{\mathrm{ref}}(y \mid x) \exp \left(\frac{1}{\beta} r(x, y)\right)} \right) \right] - \log Z(x) \right] \\ & = \min _\pi \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{y \sim \pi(y \mid x)} \left[ \log \left( \frac{\pi(y \mid x)}{\pi^* (y \mid x)} \right) \right] - \log Z(x) \right] \\ & = \min _\pi \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{D}_{\mathrm{{KL}}} (\pi(y \mid x) \| \pi^*(y \mid x)) - \log Z(x) \right] \\ \end{align*} $$
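The same kind of numerical sanity check (again on a made-up discrete toy example) confirms this last identity: for any policy $\pi$, the inner objective equals the KL divergence to $\pi^*$ minus $\log Z(x)$.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 0.5

r = rng.normal(size=5)              # rewards r(x, y)
pi_ref = rng.dirichlet(np.ones(5))  # reference policy
pi = rng.dirichlet(np.ones(5))      # an arbitrary policy

Z = np.sum(pi_ref * np.exp(r / beta))
pi_star = pi_ref * np.exp(r / beta) / Z

lhs = np.sum(pi * (np.log(pi / pi_ref) - r / beta))  # inner objective
rhs = np.sum(pi * np.log(pi / pi_star)) - np.log(Z)  # KL(pi || pi*) - log Z(x)

print(np.isclose(lhs, rhs))  # True
```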

Since $Z(x)$ does not depend on $\pi$, for a given prompt $x$ the minimum of the above problem is achieved when the KL term is minimized. We show that the KL term is non-negative using the inequality $\log z \leq z - 1$.

$$ \begin{align*} -\mathbb{D}_{\mathrm{KL}} (\pi(y \mid x) \| \pi^*(y \mid x)) & = - \sum _{y} \pi (y \mid x) \log \frac{\pi (y \mid x)}{\pi^* (y \mid x)} \\ & = \sum _{y} \pi (y \mid x) \log \frac{\pi^* (y \mid x)}{\pi (y \mid x)} \\ & \leq \sum _{y} \pi (y \mid x) \left( \frac{\pi^* (y \mid x)}{\pi (y \mid x)} - 1 \right) \\ & = \sum _{y} \left( \pi^* (y \mid x) - \pi (y \mid x) \right) \\ & = \sum _{y} \pi^* (y \mid x) - \sum _{y} \pi (y \mid x) \\ & = 1 - 1 \\ & = 0 \\ \mathbb{D}_{\mathrm{KL}} (\pi(y \mid x) \| \pi^*(y \mid x)) & \geq 0 \end{align*} $$

Now we show that this minimum of 0 is attained when the two distributions are equal, which makes the log term vanish (and since $\log z = z - 1$ holds only at $z = 1$, equality in the bound above forces $\pi = \pi^*$, so 0 is attained only in this case). Setting $\pi (y \mid x) = \pi^* (y \mid x)$, we have

$$ \begin{align*} \mathbb{D}_{\mathrm{KL}} (\pi(y \mid x) \| \pi^*(y \mid x)) & = \sum _{y} \pi (y \mid x) \log \frac{\pi (y \mid x)}{\pi^* (y \mid x)} \\ & = \sum _{y} \pi (y \mid x) \log \frac{\pi^* (y \mid x)}{\pi^* (y \mid x)} \\ & = \sum _{y} \pi (y \mid x) \log 1 \\ & = \sum _{y} \pi (y \mid x) \cdot 0 \\ & = 0 \end{align*} $$
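Both facts are easy to check numerically on random distributions (an illustrative snippet, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return np.sum(p * np.log(p / q))

p = rng.dirichlet(np.ones(10))
q = rng.dirichlet(np.ones(10))

print(kl(p, q) >= 0)            # non-negative for any pair of distributions
print(np.isclose(kl(p, p), 0))  # exactly zero when the distributions are equal
```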

Finally, since the KL term attains its minimum of 0 exactly when the two distributions coincide, the optimal solution $\pi (y \mid x)$ is simply

$$ \pi(y \mid x)=\pi^*(y \mid x)=\frac{1}{Z(x)} \pi_{\text {ref }}(y \mid x) \exp \left(\frac{1}{\beta} r(x, y)\right) $$
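To build some intuition, the sketch below (same toy discrete setup as before) constructs $\pi^*$ explicitly, checks that it is a valid distribution, and verifies that no randomly sampled policy achieves a higher value of the original KL-constrained objective:

```python
import numpy as np

rng = np.random.default_rng(3)
beta = 0.5

r = rng.normal(size=5)              # rewards r(x, y)
pi_ref = rng.dirichlet(np.ones(5))  # reference policy

# Closed-form optimal policy.
Z = np.sum(pi_ref * np.exp(r / beta))
pi_star = pi_ref * np.exp(r / beta) / Z
print(np.isclose(pi_star.sum(), 1.0))  # a valid distribution

def objective(pi):
    """E_y[r(x, y)] - beta * KL(pi || pi_ref)."""
    return np.sum(pi * r) - beta * np.sum(pi * np.log(pi / pi_ref))

# No randomly sampled policy should beat pi_star.
samples = rng.dirichlet(np.ones(5), size=10_000)
print(objective(pi_star) >= max(objective(p) for p in samples))  # True
```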

Deriving the DPO Loss Under the Bradley-Terry Model

Given the optimal policy derived above (written here as $\pi_r$ to make its dependence on the reward function $r$ explicit),

$$ \pi_r(y \mid x)=\frac{1}{Z(x)} \pi_{\text {ref }}(y \mid x) \exp \left(\frac{1}{\beta} r(x, y)\right) $$

we can rearrange the terms to express the reward function in terms of its corresponding optimal policy $\pi_r$, the reference policy $\pi_\mathrm{ref}$, and the partition function $Z(\cdot)$.

$$ \begin{align*} \pi_r(y \mid x) & =\frac{1}{Z(x)} \pi_{\text {ref }}(y \mid x) \exp \left(\frac{1}{\beta} r(x, y)\right) \\ \log(\pi_r(y \mid x)) & = \log\left( \frac{1}{Z(x)} \right) + \log\left(\pi_{\text {ref }}(y \mid x)\right) + \log \left(\exp \left(\frac{1}{\beta} r(x, y)\right) \right) \\ \log(\pi_r(y \mid x)) & = \log\left( \frac{1}{Z(x)} \right) + \log\left(\pi_{\text {ref }}(y \mid x)\right) + \frac{1}{\beta} r(x, y) \\ \frac{1}{\beta} r(x, y) & = \log(\pi_r(y \mid x)) - \log\left(\pi_{\text {ref }}(y \mid x)\right) - \log\left( \frac{1}{Z(x)} \right) \\ \frac{1}{\beta} r(x, y) & = \log\left(\frac{\pi_r(y \mid x)}{\pi_{\text {ref }}(y \mid x)} \right) + \log\left( Z(x) \right) \\ r(x, y) & = \beta \log\left(\frac{\pi_r(y \mid x)}{\pi_{\text {ref }}(y \mid x)} \right) + \beta \log\left( Z(x) \right) \\ \end{align*} $$
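Continuing the toy example, we can check numerically that this inversion recovers the original reward exactly (up to floating-point error):

```python
import numpy as np

rng = np.random.default_rng(4)
beta = 0.5

r = rng.normal(size=5)
pi_ref = rng.dirichlet(np.ones(5))

Z = np.sum(pi_ref * np.exp(r / beta))
pi_r = pi_ref * np.exp(r / beta) / Z

# Invert the closed form: beta * log(pi_r / pi_ref) + beta * log Z(x) recovers r.
r_recovered = beta * np.log(pi_r / pi_ref) + beta * np.log(Z)
print(np.allclose(r_recovered, r))  # True
```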

Under the Bradley-Terry model, the preference probability in terms of the ground-truth reward $r^*$ is

$$ \begin{align*} p^*\left(y_1 \succ y_2 \mid x\right) & =\frac{\exp \left(r^*\left(x, y_1\right)\right)}{\exp \left(r^*\left(x, y_1\right)\right)+\exp \left(r^*\left(x, y_2\right)\right)} \\ & = \frac{1}{1 + \exp (r^*(x, y_2) - r^* (x, y_1))} \end{align*} $$

Applying the reparameterization above to $r^*$ (denoting its optimal policy by $\pi^*$), i.e., $r^*(x, y)=\beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}+\beta \log Z(x)$, the $\beta \log Z(x)$ terms cancel in the difference, and we can write the preference probability in terms of the target policy and the reference policy:

$$ \begin{align*} p^*\left(y_1 \succ y_2 \mid x\right)&=\frac{1}{1+\exp \left(\beta \log \frac{\pi^*\left(y_2 \mid x\right)}{\pi_{\mathrm{ref}}\left(y_2 \mid x\right)}-\beta \log \frac{\pi^*\left(y_1 \mid x\right)}{\pi_{\mathrm{ref}}\left(y_1 \mid x\right)}\right)} \\ &=\sigma\left(\beta \log \frac{\pi^*\left(y_1 \mid x\right)}{\pi_{\mathrm{ref}}\left(y_1 \mid x\right)}-\beta \log \frac{\pi^*\left(y_2 \mid x\right)}{\pi_{\mathrm{ref}}\left(y_2 \mid x\right)}\right) \end{align*} $$
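Since both responses share the same prompt, only the policy log-ratios matter. A small numerical check (toy rewards again) confirms that the preference probability computed from the policies matches the one computed from the raw rewards:

```python
import numpy as np

rng = np.random.default_rng(5)
beta = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

r = rng.normal(size=5)              # ground-truth rewards r*(x, y)
pi_ref = rng.dirichlet(np.ones(5))
Z = np.sum(pi_ref * np.exp(r / beta))
pi_star = pi_ref * np.exp(r / beta) / Z

y1, y2 = 0, 3
p_from_rewards = sigmoid(r[y1] - r[y2])  # Bradley-Terry on raw rewards
p_from_policies = sigmoid(beta * np.log(pi_star[y1] / pi_ref[y1])
                          - beta * np.log(pi_star[y2] / pi_ref[y2]))

print(np.isclose(p_from_rewards, p_from_policies))  # True: log Z(x) cancels
```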

Replacing the unknown optimal policy $\pi^*$ with a parameterized policy $\pi_\theta$, and writing $y_w$ for the preferred and $y_l$ for the dispreferred response in each pair, maximizing the likelihood of the preference data amounts to minimizing the negative log-likelihood of the above preference function:

$$ \mathcal{L}_{\mathrm{DPO}}\left(\pi_\theta ; \pi_{\mathrm{ref}}\right)=-\mathbb{E}_{\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta\left(y_w \mid x\right)}{\pi_{\mathrm{ref}}\left(y_w \mid x\right)}-\beta \log \frac{\pi_\theta\left(y_l \mid x\right)}{\pi_{\mathrm{ref}}\left(y_l \mid x\right)}\right)\right] $$
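In code, the loss is straightforward once we have the summed log-probabilities of each response under the policy and the frozen reference model. Below is a minimal NumPy sketch (the function name and example numbers are my own, not from the paper); a real implementation would obtain these log-probabilities from the language model:

```python
import numpy as np

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Mean DPO loss over a batch of preference pairs.

    Each argument is an array of summed log-probabilities log pi(y | x)
    for the chosen (w) / rejected (l) responses under the trained policy
    or the frozen reference model.
    """
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log sigmoid(margin) == log(1 + exp(-margin)); logaddexp keeps it stable.
    return np.mean(np.logaddexp(0.0, -margin))

# Made-up log-probabilities for two preference pairs.
print(dpo_loss(np.array([-12.0, -20.0]), np.array([-15.0, -18.0]),
               np.array([-13.0, -21.0]), np.array([-14.0, -19.0])))
```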

This optimization problem can then be viewed as fitting an implicit reward model, using the reward parameterization shown previously:

$$ r(x, y) = \beta \log\left(\frac{\pi_r(y \mid x)}{\pi_{\text {ref }}(y \mid x)} \right) + \beta \log\left( Z(x) \right) $$

Deriving the DPO Gradient

We define a function $u$ of $\theta$ as the argument of the sigmoid in the DPO loss:

$$ u (\theta)=\beta \log \frac{\pi_\theta\left(y_w \mid x\right)}{\pi_{\text {ref }}\left(y_w \mid x\right)}-\beta \log \frac{\pi_\theta\left(y_l \mid x\right)}{\pi_{\text {ref }}\left(y_l \mid x\right)} $$

We can then employ two useful identities for the sigmoid function $\sigma$:

$$ \sigma^{\prime}(x)=\sigma(x)(1-\sigma(x)), \quad \sigma(-x)=1-\sigma(x) $$

By applying the chain rule (through $\log$, $\sigma$, and $u$) and using the above identities, we can write the gradient of the DPO loss as:

$$ \begin{align*} \nabla_\theta \mathcal{L}_{\mathrm{DPO}}\left(\pi_\theta ; \pi_{\mathrm{ref}}\right) & =-\nabla_\theta \mathbb{E}_{\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[\log \sigma\left(u\right)\right] \\ &=-\mathbb{E}_{\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[ \frac{1}{\sigma(u)} \sigma^{\prime}(u) \nabla_\theta(u)\right] \\ &=-\mathbb{E}_{\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[ \frac{1}{\sigma(u)} \sigma(u) \sigma(-u) \nabla_\theta(u)\right] \\ &=-\mathbb{E}_{\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[\sigma(-u) \nabla_\theta(u)\right] \\ \end{align*} $$
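A quick finite-difference check of the scalar identity used above, $\frac{d}{du}\left[-\log \sigma(u)\right] = -\sigma(-u)$ (just an illustrative snippet):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_sigmoid(u):
    return -np.log(sigmoid(u))

eps = 1e-6
for u in [-2.0, -0.5, 0.0, 1.5]:
    numeric = (neg_log_sigmoid(u + eps) - neg_log_sigmoid(u - eps)) / (2 * eps)
    analytic = -sigmoid(-u)
    print(f"u={u:+.1f}  numeric={numeric:+.6f}  analytic={analytic:+.6f}")
```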

Plugging $u$ back in and substituting $\hat{r}_\theta(x, y)=\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text {ref }}(y \mid x)}$, so that $\sigma(-u)=\sigma\left(\hat{r}_\theta\left(x, y_l\right)-\hat{r}_\theta\left(x, y_w\right)\right)$ and $\nabla_\theta u=\beta\left[\nabla_\theta \log \pi_\theta\left(y_w \mid x\right)-\nabla_\theta \log \pi_\theta\left(y_l \mid x\right)\right]$, we obtain the DPO gradient

$$ \begin{align*} & \nabla_\theta \mathcal{L}_{\mathrm{DPO}}\left(\pi_\theta ; \pi_{\mathrm{ref}}\right)= \\ & -\beta \mathbb{E}_{\left(x, y_w, y_l\right) \sim \mathcal{D}}[\underbrace{\sigma\left(\hat{r}_\theta\left(x, y_l\right)-\hat{r}_\theta\left(x, y_w\right)\right)}_{\text {higher weight when reward estimate is wrong }}[\underbrace{\nabla_\theta \log \pi_\theta\left(y_w \mid x\right)}_{\text {increase likelihood of } y_w}-\underbrace{\nabla_\theta \log \pi_\theta\left(y_l \mid x\right)}_{\text {decrease likelihood of } y_l}]] \end{align*} $$
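The weighting term is worth dwelling on: when the implicit reward margin $\hat{r}_\theta(x, y_w)-\hat{r}_\theta(x, y_l)$ is negative (the model currently prefers the rejected response), the gradient weight $\sigma\left(\hat{r}_\theta(x, y_l)-\hat{r}_\theta(x, y_w)\right)$ approaches 1; when the margin is large and positive, the weight decays toward 0. A tiny sketch with made-up margins:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical implicit reward margins r_hat(x, y_w) - r_hat(x, y_l).
for margin in [-2.0, 0.0, 2.0]:
    weight = sigmoid(-margin)  # = sigma(r_hat_l - r_hat_w)
    print(f"margin={margin:+.1f} -> gradient weight {weight:.3f}")
```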

References

[1] Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.