The original DPO paper (Rafailov et al. 2023) provides a nice and simple derivation of the DPO loss in Appendices A.1 and A.2. However, while reading the paper, I noticed certain steps in the derivation were omitted for brevity. In this post, I aim to derive the DPO objective in a more detailed, step-by-step manner.
Deriving the Optimal Policy of the KL-Constrained Reward Maximization Objective
The reward maximization problem in the RL fine-tuning phase (Equation 3 in the paper) is defined as the difference between the reward of a prompt $x$ and a response $y$, $r(x, y)$, and a KL term between the target policy $\pi(y \mid x)$ and the reference policy $\pi_{\mathrm{ref}}(y \mid x)$, weighted by a coefficient $\beta$:
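$$
\max_{\pi}\;\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(y \mid x)}\big[r(x, y)\big] \;-\; \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi(y \mid x)\,\|\,\pi_{\mathrm{ref}}(y \mid x)\big]
$$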
We first expand the KL term into its expectation form and merge it with the expectation over responses in the reward term. This gives us, inside a single expectation over $y \sim \pi(y \mid x)$, the difference between the reward $r(x, y)$ and $\beta$ times the log ratio between the target policy and the reference policy:
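$$
\max_{\pi}\;\mathbb{E}_{x \sim \mathcal{D}}\,\mathbb{E}_{y \sim \pi(y \mid x)}\left[r(x, y) - \beta \log\frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\right]
$$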
We then turn the maximization problem into a minimization problem by flipping the sign of the two terms. The original paper also divides the whole expression by $\beta$, so that the reward term takes the form $\frac{1}{\beta} r(x, y)$, which will be useful in the next step:
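$$
\min_{\pi}\;\mathbb{E}_{x \sim \mathcal{D}}\,\mathbb{E}_{y \sim \pi(y \mid x)}\left[\log\frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} - \frac{1}{\beta}\, r(x, y)\right]
$$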
The original paper defines a partition function $Z(x)$ as
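$$
Z(x) = \sum_{y}\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\left(\frac{1}{\beta}\, r(x, y)\right),
$$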
which allows us to absorb the reward term into a normalized distribution: multiplying and dividing by $Z(x)$ inside the log, the objective becomes
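$$
\min_{\pi}\;\mathbb{E}_{x \sim \mathcal{D}}\,\mathbb{E}_{y \sim \pi(y \mid x)}\left[\log\frac{\pi(y \mid x)}{\frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\exp\!\left(\frac{1}{\beta}\, r(x, y)\right)} - \log Z(x)\right]
$$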
We now define the optimal policy $\pi^*(y \mid x)$ as
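$$
\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\left(\frac{1}{\beta}\, r(x, y)\right),
$$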
and plug it into the denominator of the log ratio. Since $Z(x)$ does not depend on $y$, the $-\log Z(x)$ term can be pulled out of the inner expectation, and the remaining term is exactly the KL divergence between $\pi$ and $\pi^*$:
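$$
\min_{\pi}\;\mathbb{E}_{x \sim \mathcal{D}}\left[\mathbb{E}_{y \sim \pi(y \mid x)}\left[\log\frac{\pi(y \mid x)}{\pi^*(y \mid x)}\right] - \log Z(x)\right]
= \min_{\pi}\;\mathbb{E}_{x \sim \mathcal{D}}\Big[\mathbb{D}_{\mathrm{KL}}\big(\pi(y \mid x)\,\|\,\pi^*(y \mid x)\big) - \log Z(x)\Big]
$$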
Given a single prompt $x$, the minimum of the above problem is achieved when the KL term is minimized, since $Z(x)$ does not depend on $\pi$. We demonstrate that the KL term is non-negative using an identity: $\log t \le t - 1$ for all $t > 0$ (equivalently, $-\log t \ge 1 - t$), with equality if and only if $t = 1$. Applying it with $t = \frac{\pi^*(y \mid x)}{\pi(y \mid x)}$ gives
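$$
\mathbb{D}_{\mathrm{KL}}\big(\pi(y \mid x)\,\|\,\pi^*(y \mid x)\big)
= \mathbb{E}_{y \sim \pi(y \mid x)}\left[-\log\frac{\pi^*(y \mid x)}{\pi(y \mid x)}\right]
\ge \mathbb{E}_{y \sim \pi(y \mid x)}\left[1 - \frac{\pi^*(y \mid x)}{\pi(y \mid x)}\right]
= 1 - \sum_{y}\pi^*(y \mid x) = 0
$$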
Now we show that the minimum of 0 is achieved only when the two distributions are equal: equality in the identity above requires $t = 1$, i.e., $\pi(y \mid x) = \pi^*(y \mid x)$ for every $y$, which corresponds to the log term being 0. Indeed, by setting $\pi(y \mid x) = \pi^*(y \mid x)$, we have
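$$
\mathbb{D}_{\mathrm{KL}}\big(\pi^*(y \mid x)\,\|\,\pi^*(y \mid x)\big)
= \mathbb{E}_{y \sim \pi^*(y \mid x)}\left[\log\frac{\pi^*(y \mid x)}{\pi^*(y \mid x)}\right]
= \mathbb{E}_{y \sim \pi^*(y \mid x)}\big[\log 1\big] = 0
$$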
Finally, since the KL term attains its minimum of 0 exactly at $\pi = \pi^*$, the optimal solution is simply
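$$
\pi(y \mid x) = \pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\left(\frac{1}{\beta}\, r(x, y)\right)
$$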
Deriving the DPO Loss Under the Bradley-Terry Model
Given the optimal policy derived above,
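$$
\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,\exp\!\left(\frac{1}{\beta}\, r(x, y)\right),
$$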
we can rearrange the terms to express the reward function $r(x, y)$ in terms of its corresponding optimal policy $\pi^*$, the reference policy $\pi_{\mathrm{ref}}$, and the partition function $Z(x)$:
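$$
r(x, y) = \beta \log\frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
$$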
By substituting this expression for $r(x, y)$ into the Bradley-Terry model, the $\beta \log Z(x)$ terms cancel in the difference of rewards:
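$$
\begin{aligned}
p(y_1 \succ y_2 \mid x) &= \frac{\exp\big(r(x, y_1)\big)}{\exp\big(r(x, y_1)\big) + \exp\big(r(x, y_2)\big)} = \sigma\big(r(x, y_1) - r(x, y_2)\big) \\
&= \sigma\!\left(\beta\log\frac{\pi^*(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)} + \beta\log Z(x) - \beta\log\frac{\pi^*(y_2 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)} - \beta\log Z(x)\right) \\
&= \sigma\!\left(\beta\log\frac{\pi^*(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)} - \beta\log\frac{\pi^*(y_2 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)}\right)
\end{aligned}
$$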
Writing it in terms of the target policy $\pi_\theta$ and the reference policy $\pi_{\mathrm{ref}}$, with $y_w$ denoting the preferred response and $y_l$ the dispreferred one, we have
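$$
p(y_w \succ y_l \mid x) = \sigma\!\left(\beta\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)
$$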
Maximizing the likelihood of the preference distribution over a dataset of preference pairs $\mathcal{D} = \{(x, y_w, y_l)\}$ can then be written as minimizing the negative log likelihood of the above preference function:
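$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$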
This optimization problem can then be viewed as fitting an implicit reward $\hat{r}_\theta(x, y) = \beta \log\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ under the reward modeling objective shown earlier in the paper (Equation 2):
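$$
\mathcal{L}_R(r_\phi, \mathcal{D}) = -\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\Big[\log\sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]
$$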
Deriving the DPO Gradient
We define a shorthand function $u$ of the policy parameters $\theta$ for the argument of the sigmoid in the DPO loss:
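$$
u(\theta) = \beta\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)},
\qquad
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\big[\log\sigma\big(u(\theta)\big)\big]
$$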
We can then employ two useful identities for the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$:
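$$
\frac{d\sigma(z)}{dz} = \sigma(z)\big(1 - \sigma(z)\big), \qquad 1 - \sigma(z) = \sigma(-z)
$$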
By applying the chain rule (on $u$) and using the above identities, we can write the gradient of the DPO loss as:
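$$
\begin{aligned}
\nabla_\theta\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
&= -\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\left[\frac{1}{\sigma\big(u(\theta)\big)}\,\sigma\big(u(\theta)\big)\Big(1 - \sigma\big(u(\theta)\big)\Big)\,\nabla_\theta u(\theta)\right] \\
&= -\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\Big[\sigma\big(-u(\theta)\big)\,\nabla_\theta u(\theta)\Big]
\end{aligned}
$$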
After plugging back $\nabla_\theta u(\theta) = \beta\big(\nabla_\theta \log\pi_\theta(y_w \mid x) - \nabla_\theta \log\pi_\theta(y_l \mid x)\big)$ (the reference-policy terms do not depend on $\theta$) and substituting $\beta \log\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ with the implicit reward $\hat{r}_\theta(x, y)$, we have the DPO gradient:
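$$
\nabla_\theta\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\beta\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\Big[\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)\Big(\nabla_\theta \log\pi_\theta(y_w \mid x) - \nabla_\theta \log\pi_\theta(y_l \mid x)\Big)\Big]
$$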
References
[1] Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.