In this post, I derive the policy gradient, i.e., the gradient of the reinforcement learning objective with respect to the policy parameters. This tutorial follows the steps in the first part of lecture 5 of CS285 at UC Berkeley.

Terminology

In reinforcement learning, a trajectory is a sequence of states and actions collected by a policy, $\tau = (\mathbf{s}_1, \mathbf{a}_1, \dots, \mathbf{s}_T, \mathbf{a}_T)$, and its probability can be defined as

$$p_\theta(\tau) = p_\theta(\mathbf{s}_1, \mathbf{a}_1, \dots, \mathbf{s}_T, \mathbf{a}_T) = p(\mathbf{s}_1) \prod_{t=1}^{T} \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)\, p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t).$$

A trajectory distribution is a probability distribution over a sequence of states and actions. In the equation above, it is written with the chain rule of probability: the initial state distribution $p(\mathbf{s}_1)$ multiplied by the product of the policy probability $\pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)$ and the transition probability $p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)$ over all time steps.
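To make this factorization concrete, here is a minimal NumPy sketch with a small tabular MDP invented purely for illustration (the state/action counts, `p_s1`, `P`, and `pi` are all assumptions, not part of the post): it samples a trajectory and computes $\log p_\theta(\tau)$ as the sum of the log initial-state, log policy, and log transition probabilities.

```python
import numpy as np

# Toy tabular MDP for illustration (all sizes and distributions are assumptions).
rng = np.random.default_rng(0)
n_states, n_actions, T = 3, 2, 5

p_s1 = np.array([0.6, 0.3, 0.1])                                   # initial state distribution p(s_1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # transitions p(s'|s,a)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)              # policy pi_theta(a|s)

def sample_trajectory():
    """Roll out the policy for T steps; return the visited states and actions."""
    states, actions = [rng.choice(n_states, p=p_s1)], []
    for _ in range(T):
        a = rng.choice(n_actions, p=pi[states[-1]])
        actions.append(a)
        states.append(rng.choice(n_states, p=P[states[-1], a]))
    return states, actions

def log_prob(states, actions):
    """log p_theta(tau): log initial state + sum of log policy and log transition terms."""
    logp = np.log(p_s1[states[0]])
    for t in range(T):
        logp += np.log(pi[states[t], actions[t]])
        logp += np.log(P[states[t], actions[t], states[t + 1]])
    return logp

states, actions = sample_trajectory()
print(log_prob(states, actions))
```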

The reinforcement learning objective can be written as finding the optimal parameters that maximize the expected reward under the trajectory distribution:

$$\theta^\star = \arg\max_\theta \, \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t} r(\mathbf{s}_t, \mathbf{a}_t)\right].$$

In the following derivation, I will use $J(\theta)$ to represent the expected reward $\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t} r(\mathbf{s}_t, \mathbf{a}_t)\right]$, and replace the total reward $\sum_{t} r(\mathbf{s}_t, \mathbf{a}_t)$ with $r(\tau)$ for simplicity.

Differentiating the Policy Directly

Now, we have our objective

$$\theta^\star = \arg\max_\theta J(\theta)$$

and the expectation of rewards

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[r(\tau)\right].$$

We expand the expectation for continuous variables into its integral form:

$$J(\theta) = \int p_\theta(\tau)\, r(\tau)\, d\tau.$$

Then, the gradient (or the derivative) of the expected reward can be written as

$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau$$

by directly moving the differentiation operator inside the integral, which we can do because integration is linear.

Now, we need to use the log-derivative identity

$$p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau) = p_\theta(\tau)\, \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} = \nabla_\theta p_\theta(\tau)$$

to help us expand $\nabla_\theta p_\theta(\tau)$ by applying it in reverse as

$$\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau).$$

If we replace the $\nabla_\theta p_\theta(\tau)$ term in the gradient of the expectation by the left-hand side of the identity and convert the integral back into expectation form, we will get

$$\nabla_\theta J(\theta) = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\right].$$
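As a quick numerical sanity check of this step (a sketch under assumed toy choices, not part of the lecture): take a one-dimensional Gaussian $p_\theta = \mathcal{N}(\theta, 1)$ with $f(x) = x^2$ playing the role of $r(\tau)$; the estimator $\mathbb{E}\left[\nabla_\theta \log p_\theta(x)\, f(x)\right]$ should match the analytic gradient $\nabla_\theta \mathbb{E}[x^2] = 2\theta$.

```python
import numpy as np

# Sanity check (toy, assumed setup): grad_theta E_{x~N(theta,1)}[x^2]
# should equal E[ d/dtheta log N(x|theta,1) * x^2 ] = E[(x - theta) * x^2].
rng = np.random.default_rng(1)
theta, n = 0.5, 2_000_000

x = rng.normal(theta, 1.0, size=n)
score = x - theta                      # d/dtheta log N(x | theta, 1)
score_estimate = np.mean(score * x**2)

analytic = 2 * theta                   # since E[x^2] = theta^2 + 1
print(score_estimate, analytic)        # the two values should be close
```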

Recall that the trajectory distribution is

$$p_\theta(\tau) = p(\mathbf{s}_1) \prod_{t=1}^{T} \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)\, p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t),$$

and if we take the logarithm of both sides, we will get

$$\log p_\theta(\tau) = \log p(\mathbf{s}_1) + \sum_{t=1}^{T} \Big( \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t) + \log p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t) \Big).$$

We proceed to substitute the right-hand side of this equation for $\log p_\theta(\tau)$ inside the expectation:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \Big( \log p(\mathbf{s}_1) + \sum_{t=1}^{T} \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t) + \log p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t) \Big)\, r(\tau)\right].$$

Both $\log p(\mathbf{s}_1)$ and $\log p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)$ do not depend on $\theta$, so their gradients are zero and they drop out, which simplifies the expression to

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)\right)\left(\sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t)\right)\right].$$
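This expectation can be estimated by sampling trajectories from the policy. The sketch below (the tabular softmax policy, the random MDP, and all names are assumptions for illustration, not the lecture's setup) averages $\big(\sum_t \nabla_\theta \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)\big)\big(\sum_t r(\mathbf{s}_t, \mathbf{a}_t)\big)$ over sampled rollouts, a REINFORCE-style Monte Carlo estimate of the gradient above.

```python
import numpy as np

# Monte Carlo estimate of the policy gradient above (toy, assumed setup):
# average (sum_t grad log pi_theta(a_t|s_t)) * (sum_t r(s_t, a_t)) over rollouts.
rng = np.random.default_rng(2)
n_states, n_actions, T, N = 3, 2, 5, 1000

theta = np.zeros((n_states, n_actions))                            # softmax policy logits
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # transitions p(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                         # reward r(s, a)

def pi(s):
    """pi_theta(. | s) as a softmax over the logits theta[s]."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

grad = np.zeros_like(theta)
for _ in range(N):
    s = rng.choice(n_states)                      # uniform initial state (assumption)
    glogp, ret = np.zeros_like(theta), 0.0
    for _ in range(T):
        probs = pi(s)
        a = rng.choice(n_actions, p=probs)
        glogp[s] += np.eye(n_actions)[a] - probs  # grad_theta log pi_theta(a|s) for a softmax policy
        ret += R[s, a]
        s = rng.choice(n_states, p=P[s, a])
    grad += glogp * ret                           # (sum of score terms) * (total reward)

grad /= N
print(grad)                                       # estimate of grad_theta J(theta)
```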

Reference

[1] CS 285: Deep Reinforcement Learning, UC Berkeley.