In this post, I derive the policy gradient, that is, the gradient of the reinforcement learning objective with respect to the policy parameters. This tutorial follows the steps in the first part of Lecture 5 of CS285 at UC Berkeley.
Terminology
In reinforcement learning, a trajectory is a sequence of states and actions collected by a policy, $\tau = (\mathbf{s}_1, \mathbf{a}_1, \dots, \mathbf{s}_T, \mathbf{a}_T)$. A trajectory distribution is a probability distribution over such sequences of states and actions:

$$\underbrace{p_\theta(\mathbf{s}_1, \mathbf{a}_1, \dots, \mathbf{s}_T, \mathbf{a}_T)}_{p_\theta(\tau)} = p(\mathbf{s}_1) \prod_{t=1}^{T} \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)\, p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)$$

In the equation above, the chain rule of probability factors it into the initial state distribution $p(\mathbf{s}_1)$ multiplied by the product of the policy probability $\pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)$ and the transition probability $p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)$ over all time steps.
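To make the factorization concrete, here is a minimal sketch that samples a trajectory and computes its probability with the chain rule. The tabular MDP (its sizes, distributions, and the `sample_trajectory` helper) is made up for illustration and is not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP: 3 states, 2 actions, horizon T = 4.
n_states, n_actions, T = 3, 2, 4
p_s1 = np.array([0.6, 0.3, 0.1])                                            # p(s_1)
policy = rng.dirichlet(np.ones(n_actions), size=n_states)                   # pi_theta(a | s), one row per state
transition = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # p(s' | s, a)

def sample_trajectory():
    """Sample tau = (s_1, a_1, ..., s_T, a_T) and return it with its probability p_theta(tau)."""
    s = rng.choice(n_states, p=p_s1)
    prob = p_s1[s]                                    # start with the initial state probability
    traj = []
    for t in range(T):
        a = rng.choice(n_actions, p=policy[s])
        prob *= policy[s, a]                          # multiply in pi_theta(a_t | s_t)
        traj.append((s, a))
        s_next = rng.choice(n_states, p=transition[s, a])
        prob *= transition[s, a, s_next]              # multiply in p(s_{t+1} | s_t, a_t)
        s = s_next
    return traj, prob

traj, prob = sample_trajectory()
print(traj, prob)
```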
The reinforcement learning objective can be written as finding the optimal parameters $\theta^\star$ that maximize the expected reward under the trajectory distribution:

$$\theta^\star = \arg\max_\theta \, \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[ \sum_{t} r(\mathbf{s}_t, \mathbf{a}_t) \right]$$
In the following derivation, I will use $J(\theta)$ to represent the expected reward $\mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[ \sum_{t} r(\mathbf{s}_t, \mathbf{a}_t) \right]$, and replace $\sum_{t} r(\mathbf{s}_t, \mathbf{a}_t)$ with $r(\tau)$ for simplicity.
Differentiating the Policy Directly
Now, we have our objective

$$\theta^\star = \arg\max_\theta J(\theta)$$

and the expectation of rewards

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[ r(\tau) \right]$$
We expand the expectation over the continuous variable $\tau$ into its integral form:

$$J(\theta) = \int p_\theta(\tau)\, r(\tau)\, d\tau$$
Then, the gradient (or the derivative) of the expected reward can be written as

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, r(\tau)\, d\tau = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau$$

by moving the differentiation operator directly inside the integral, since both operators are linear.
Now, we need to use the log-derivative identity

$$p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau) = p_\theta(\tau)\, \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} = \nabla_\theta p_\theta(\tau)$$

to help us expand the $\nabla_\theta p_\theta(\tau)$ term by applying it in reverse:

$$\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$$
If we replace the $\nabla_\theta p_\theta(\tau)$ term in the gradient of the expected reward with the left-hand side of the identity and convert the integral back into expectation form, we get

$$\nabla_\theta J(\theta) = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[ \nabla_\theta \log p_\theta(\tau)\, r(\tau) \right]$$
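A quick way to sanity-check this score-function form is to compare it with a finite-difference gradient on a problem where the expectation can be computed exactly. The sketch below uses a single softmax-parameterized categorical distribution with a fixed reward per outcome as a one-step stand-in for $p_\theta(\tau)$ and $r(\tau)$; the setup and names are mine, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)

r = np.array([1.0, 2.0, -0.5])      # fixed "reward" r(x) for each of 3 outcomes
theta = np.array([0.2, -0.1, 0.4])  # softmax logits, playing the role of the parameters

def probs(th):
    z = np.exp(th - th.max())
    return z / z.sum()

def J(th):
    # Exact expected reward E[r(x)] under the categorical distribution.
    return probs(th) @ r

# Score-function estimate: E[ grad_theta log p_theta(x) * r(x) ].
# For a softmax categorical, grad_theta log p_theta(x) = onehot(x) - probs(theta).
N = 200_000
samples = rng.choice(3, size=N, p=probs(theta))
onehot = np.eye(3)[samples]
grad_est = ((onehot - probs(theta)) * r[samples, None]).mean(axis=0)

# Finite-difference gradient of the exact objective, for comparison.
eps = 1e-5
grad_fd = np.array([(J(theta + eps * e) - J(theta - eps * e)) / (2 * eps) for e in np.eye(3)])

print(grad_est)   # should match grad_fd up to Monte Carlo noise
print(grad_fd)
```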
Recall that the trajectory distribution is

$$p_\theta(\tau) = p(\mathbf{s}_1) \prod_{t=1}^{T} \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)\, p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)$$

and if we take the logarithm of both sides, the product turns into a sum:

$$\log p_\theta(\tau) = \log p(\mathbf{s}_1) + \sum_{t=1}^{T} \Big[ \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t) + \log p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t) \Big]$$
We proceed to substitute the right-hand side of this equation for $\log p_\theta(\tau)$ inside the expectation:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[ \nabla_\theta \Big( \log p(\mathbf{s}_1) + \sum_{t=1}^{T} \big[ \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t) + \log p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t) \big] \Big)\, r(\tau) \right]$$
Both $\log p(\mathbf{s}_1)$ and $\log p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)$ do not depend on $\theta$, so their gradients are zero and they drop out, leaving the policy gradient

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[ \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t) \right) \left( \sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t) \right) \right]$$
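This final expression is exactly what a REINFORCE-style estimator computes: sample trajectories under the current policy, then average the product of the summed score $\sum_t \nabla_\theta \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)$ and the summed reward. Below is a minimal numpy sketch for a tabular softmax policy; the MDP, its tables, and the helper names are made up for illustration, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular MDP and softmax policy.
n_states, n_actions, T = 3, 2, 5
p_s1 = np.array([0.5, 0.3, 0.2])                                            # p(s_1)
transition = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # p(s' | s, a)
reward = rng.normal(size=(n_states, n_actions))                             # r(s, a)
theta = np.zeros((n_states, n_actions))                                     # policy logits

def pi(theta, s):
    """Softmax policy pi_theta(. | s) over actions in state s."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def policy_gradient_estimate(theta, n_trajectories=5_000):
    """Monte Carlo estimate of E[(sum_t grad log pi(a_t|s_t)) * (sum_t r(s_t, a_t))]."""
    grad = np.zeros_like(theta)
    for _ in range(n_trajectories):
        s = rng.choice(n_states, p=p_s1)
        grad_log_pi, total_reward = np.zeros_like(theta), 0.0
        for t in range(T):
            probs = pi(theta, s)
            a = rng.choice(n_actions, p=probs)
            # For a softmax policy, grad_theta log pi_theta(a | s) = onehot(a) - probs in row s.
            grad_log_pi[s] += np.eye(n_actions)[a] - probs
            total_reward += reward[s, a]
            s = rng.choice(n_states, p=transition[s, a])
        grad += grad_log_pi * total_reward
    return grad / n_trajectories

print(policy_gradient_estimate(theta))
```

In practice, the same quantity is usually obtained by backpropagating through $\sum_t \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)$ with an automatic differentiation library rather than writing the per-step gradient by hand.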