The original DPO paper (Rafailov et al. 2023) provides a nice and simple derivation of the DPO loss in Appendices A.1 and A.2. However, while reading the paper, I noticed certain steps in the derivation were omitted for brevity. In this post, I aim to derive the DPO objective in a more detailed, step-by-step manner.

Deriving the Optimal Policy of the KL-Constrained Reward Maximization Problem

The reward maximization problem in the RL fine-tuning phase (Equation 3 in the paper) is defined as the difference between the reward $r(x, y)$ of a prompt $x$ and a response $y$, and a KL term between the target policy $\pi_\theta(y \mid x)$ and the reference policy $\pi_{\text{ref}}(y \mid x)$:
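$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[r(x, y)\big] \;-\; \beta\, \mathbb{D}_{\text{KL}}\big[\pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x)\big],$$

where $\beta$ controls the strength of the KL penalty that keeps $\pi_\theta$ close to $\pi_{\text{ref}}$.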

We first expand the KL term into its expectation form and merge it with the expectation over responses in the reward term. This gives us the expectation, over $y \sim \pi_\theta(y \mid x)$, of the difference between the reward $r(x, y)$ and $\beta$ times the log ratio of the target policy to the reference policy:
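$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}}\,\mathbb{E}_{y \sim \pi_\theta(y \mid x)}\left[r(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\right]$$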

We then turn the maximization problem into a minimization problem by flipping the sign of the two terms. The original paper also divides the whole expression by $\beta$ so that the second term has the form $\frac{1}{\beta} r(x, y)$, which will be useful in the next step:
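$$\min_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}}\,\mathbb{E}_{y \sim \pi_\theta(y \mid x)}\left[\log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} - \frac{1}{\beta} r(x, y)\right]$$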

The original paper defines a partition function $Z(x)$ as
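$$Z(x) = \sum_y \pi_{\text{ref}}(y \mid x)\, \exp\!\left(\frac{1}{\beta} r(x, y)\right),$$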

which allows us to write the scaled reward term as simply $\frac{1}{\beta} r(x, y) = \log \exp\!\left(\frac{1}{\beta} r(x, y)\right)$ and, after adding and subtracting $\log Z(x)$, fold it into the denominator of the log ratio:
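$$\min_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}}\,\mathbb{E}_{y \sim \pi_\theta(y \mid x)}\left[\log \frac{\pi_\theta(y \mid x)}{\frac{1}{Z(x)}\,\pi_{\text{ref}}(y \mid x)\, \exp\!\left(\frac{1}{\beta} r(x, y)\right)} - \log Z(x)\right]$$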

We now define the optimal policy $\pi^*$ as
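$$\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y \mid x)\, \exp\!\left(\frac{1}{\beta} r(x, y)\right),$$

which is a valid probability distribution because it is non-negative and sums to one by the definition of $Z(x)$,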

and plug it into the denominator of the log ratio:
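$$\min_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}}\left[\mathbb{E}_{y \sim \pi_\theta(y \mid x)}\left[\log \frac{\pi_\theta(y \mid x)}{\pi^*(y \mid x)}\right] - \log Z(x)\right] = \min_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}}\Big[\mathbb{D}_{\text{KL}}\big(\pi_\theta(y \mid x) \,\|\, \pi^*(y \mid x)\big) - \log Z(x)\Big]$$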

Given a single prompt $x$, and since $Z(x)$ does not depend on $\pi_\theta$, the minimum of the above problem is achieved when the KL term is minimized. We can demonstrate that the KL term is non-negative using the elementary inequality $\log t \le t - 1$ (for $t > 0$).
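Applying this inequality to $t = \pi^*(y \mid x)/\pi_\theta(y \mid x)$ inside the expectation gives

$$\mathbb{D}_{\text{KL}}\big(\pi_\theta(y \mid x) \,\|\, \pi^*(y \mid x)\big) = \mathbb{E}_{y \sim \pi_\theta(y \mid x)}\left[-\log \frac{\pi^*(y \mid x)}{\pi_\theta(y \mid x)}\right] \ge \mathbb{E}_{y \sim \pi_\theta(y \mid x)}\left[1 - \frac{\pi^*(y \mid x)}{\pi_\theta(y \mid x)}\right] = 1 - \sum_y \pi^*(y \mid x) = 0.$$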

Now we show that the minimum of 0 is achieved exactly when the two distributions are equal (the inequality above is strict unless $t = 1$, i.e., unless $\pi_\theta(y \mid x) = \pi^*(y \mid x)$), which corresponds to the log term being 0. By setting $\pi_\theta(y \mid x) = \pi^*(y \mid x)$, we have
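$$\mathbb{D}_{\text{KL}}\big(\pi^*(y \mid x) \,\|\, \pi^*(y \mid x)\big) = \mathbb{E}_{y \sim \pi^*(y \mid x)}\left[\log \frac{\pi^*(y \mid x)}{\pi^*(y \mid x)}\right] = \mathbb{E}_{y \sim \pi^*(y \mid x)}\big[\log 1\big] = 0.$$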

Finally, when the KL term is minimized, the optimal solution is simply
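$$\pi_\theta(y \mid x) = \pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y \mid x)\, \exp\!\left(\frac{1}{\beta} r(x, y)\right).$$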

Deriving the DPO Loss Under the Bradley-Terry Model

Given the optimal policy derived above,
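$$\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y \mid x)\, \exp\!\left(\frac{1}{\beta} r(x, y)\right),$$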

we can rearrange the terms to express the reward function $r(x, y)$ in terms of its corresponding optimal policy $\pi^*$, the reference policy $\pi_{\text{ref}}$, and the partition function $Z(x)$.
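Taking the logarithm of both sides and multiplying by $\beta$ gives

$$\log \pi^*(y \mid x) = \log \pi_{\text{ref}}(y \mid x) + \frac{1}{\beta} r(x, y) - \log Z(x) \;\;\Longrightarrow\;\; r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x).$$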

We now substitute this expression for $r(x, y)$ into the Bradley-Terry model, which gives the probability that a response $y_1$ is preferred over a response $y_2$ for a prompt $x$:
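$$p^*(y_1 \succ y_2 \mid x) = \frac{\exp\big(r(x, y_1)\big)}{\exp\big(r(x, y_1)\big) + \exp\big(r(x, y_2)\big)} = \sigma\big(r(x, y_1) - r(x, y_2)\big),$$

where $\sigma$ is the logistic (sigmoid) function.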

Writing it in terms of the optimal policy $\pi^*$ and the reference policy $\pi_{\text{ref}}$, and noting that the $\beta \log Z(x)$ terms cancel inside the sigmoid, we have
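$$p^*(y_1 \succ y_2 \mid x) = \sigma\!\left(\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)} - \beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)}\right).$$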

Maximizing the likelihood of the preference distribution can then be written as minimizing the negative log-likelihood of the above preference function, with the optimal policy replaced by a parametrized policy $\pi_\theta$ and the expectation taken over a dataset of preference pairs $(x, y_w, y_l) \sim \mathcal{D}$, where $y_w$ is the preferred and $y_l$ the dispreferred response:
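$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right].$$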

This optimization problem can then be viewed as fitting an implicit reward model, whose form follows from the rearrangement shown previously, expressed through the policy and the reference policy:
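$$\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)},$$

so that the loss can be written compactly as $\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\big[\log \sigma\big(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)\big)\big]$; the reward is recovered only up to the prompt-dependent term $\beta \log Z(x)$, which cancels in the difference.

For readers who prefer code, below is a minimal PyTorch-style sketch of this loss. It assumes we have already computed the summed per-token log-probabilities of each response under the trained policy and the frozen reference model; the function and argument names are illustrative, not taken from the paper or any particular library.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Minimal DPO loss sketch over a batch of preference pairs.

    Each argument is a 1-D tensor of per-sequence log-probabilities
    (summed token log-probs) of the preferred ("chosen") or dispreferred
    ("rejected") response under the policy or the reference model.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin, averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```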

Deriving the DPO Gradient

We define the argument of the sigmoid as a function $u$ of the policy parameters $\theta$:
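$$u(\theta) = \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} = \hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l),$$

so that $\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\big[\log \sigma\big(u(\theta)\big)\big]$.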

We can then employ two useful identities for the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$:
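$$\sigma'(x) = \sigma(x)\big(1 - \sigma(x)\big), \qquad 1 - \sigma(x) = \sigma(-x).$$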

By applying the chain rule (on $u$) and using the above identities, we can write the gradient of the DPO loss as:
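$$\nabla_\theta \mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\frac{\sigma'\big(u(\theta)\big)}{\sigma\big(u(\theta)\big)}\,\nabla_\theta u(\theta)\right] = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\sigma\big(-u(\theta)\big)\,\nabla_\theta u(\theta)\Big].$$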

After plugging $u(\theta)$ back in and substituting $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ with the implicit reward $\hat{r}_\theta(x, y)$, we have the DPO gradient
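$$\nabla_\theta \mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)\,\big[\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\big]\Big],$$

where the reference-policy terms drop out because they do not depend on $\theta$, and the weight $\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)$ is larger when the implicit reward ranks the dispreferred response higher.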

References

[1] Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.