In this post, I derive the policy gradient, that is, the gradient of the reinforcement learning objective with respect to the policy parameters. This tutorial follows the steps in the first part of Lecture 5 of CS285 at UC Berkeley.
Terminology
In reinforcement learning, a trajectory is a sequence of states and actions collected by a policy, $\tau = (\mathbf{s}_1, \mathbf{a}_1, \dots, \mathbf{s}_T, \mathbf{a}_T)$. A trajectory distribution is a probability distribution over such sequences of states and actions:

$$\underbrace{p_\theta(\mathbf{s}_1, \mathbf{a}_1, \dots, \mathbf{s}_T, \mathbf{a}_T)}_{p_\theta(\tau)} = p(\mathbf{s}_1) \prod_{t=1}^{T} \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)\, p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)$$

In the equation above, the chain rule of probability factors it into the initial state distribution $p(\mathbf{s}_1)$ multiplied by the product of the policy probability $\pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)$ and the transition probability $p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)$ over all time steps.
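To make the factorization concrete, here is a minimal sketch that samples a trajectory and computes its probability with the chain rule. The tabular MDP (its sizes, distributions, and the `sample_trajectory` helper) is made up for illustration and is not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP: 3 states, 2 actions, horizon T = 4.
n_states, n_actions, T = 3, 2, 4
p_s1 = np.array([0.6, 0.3, 0.1])                                            # p(s_1)
policy = rng.dirichlet(np.ones(n_actions), size=n_states)                   # pi_theta(a | s), one row per state
transition = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # p(s' | s, a)

def sample_trajectory():
    """Sample tau = (s_1, a_1, ..., s_T, a_T) and return it with its probability p_theta(tau)."""
    s = rng.choice(n_states, p=p_s1)
    prob = p_s1[s]                                    # start with the initial state probability
    traj = []
    for t in range(T):
        a = rng.choice(n_actions, p=policy[s])
        prob *= policy[s, a]                          # multiply in pi_theta(a_t | s_t)
        traj.append((s, a))
        s_next = rng.choice(n_states, p=transition[s, a])
        prob *= transition[s, a, s_next]              # multiply in p(s_{t+1} | s_t, a_t)
        s = s_next
    return traj, prob

traj, prob = sample_trajectory()
print(traj, prob)
```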
The reinforcement learning objective can be written as finding the optimal parameters $\theta^\star$ that maximize the expected reward under the trajectory distribution:

$$\theta^\star = \arg\max_\theta \, \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[ \sum_{t} r(\mathbf{s}_t, \mathbf{a}_t) \right]$$
In the following derivation, I will use $J(\theta)$ to represent the expected reward $\mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[ \sum_{t} r(\mathbf{s}_t, \mathbf{a}_t) \right]$, and replace $\sum_{t} r(\mathbf{s}_t, \mathbf{a}_t)$ with $r(\tau)$ for simplicity.
Differentiating the Policy Directly
Now, we have our objective

$$\theta^\star = \arg\max_\theta J(\theta)$$

and the expectation of rewards

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[ r(\tau) \right]$$
We expand the expectation over the continuous variable $\tau$ into its integral form:

$$J(\theta) = \int p_\theta(\tau)\, r(\tau)\, d\tau$$
Then, the gradient (or the derivative) of the expected reward can be written as

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, r(\tau)\, d\tau = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau$$

by moving the differentiation operator directly inside the integral, since both operators are linear.
Now, we need to use the log-derivative identity

$$p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau) = p_\theta(\tau)\, \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} = \nabla_\theta p_\theta(\tau)$$

to help us expand the $\nabla_\theta p_\theta(\tau)$ term by applying it in reverse:

$$\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$$
If we replace the $\nabla_\theta p_\theta(\tau)$ term in the gradient of the expected reward with the left-hand side of the identity and convert the integral back into expectation form, we get

$$\nabla_\theta J(\theta) = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[ \nabla_\theta \log p_\theta(\tau)\, r(\tau) \right]$$
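A quick way to sanity-check this score-function form is to compare it with a finite-difference gradient on a problem where the expectation can be computed exactly. The sketch below uses a single softmax-parameterized categorical distribution with a fixed reward per outcome as a one-step stand-in for $p_\theta(\tau)$ and $r(\tau)$; the setup and names are mine, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)

r = np.array([1.0, 2.0, -0.5])      # fixed "reward" r(x) for each of 3 outcomes
theta = np.array([0.2, -0.1, 0.4])  # softmax logits, playing the role of the parameters

def probs(th):
    z = np.exp(th - th.max())
    return z / z.sum()

def J(th):
    # Exact expected reward E[r(x)] under the categorical distribution.
    return probs(th) @ r

# Score-function estimate: E[ grad_theta log p_theta(x) * r(x) ].
# For a softmax categorical, grad_theta log p_theta(x) = onehot(x) - probs(theta).
N = 200_000
samples = rng.choice(3, size=N, p=probs(theta))
onehot = np.eye(3)[samples]
grad_est = ((onehot - probs(theta)) * r[samples, None]).mean(axis=0)

# Finite-difference gradient of the exact objective, for comparison.
eps = 1e-5
grad_fd = np.array([(J(theta + eps * e) - J(theta - eps * e)) / (2 * eps) for e in np.eye(3)])

print(grad_est)   # should match grad_fd up to Monte Carlo noise
print(grad_fd)
```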
Recall that the trajectory distribution is

$$p_\theta(\tau) = p(\mathbf{s}_1) \prod_{t=1}^{T} \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)\, p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)$$

and if we take the logarithm of both sides, the product turns into a sum:

$$\log p_\theta(\tau) = \log p(\mathbf{s}_1) + \sum_{t=1}^{T} \Big[ \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t) + \log p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t) \Big]$$
We proceed to substitute the right-hand side of this equation for $\log p_\theta(\tau)$ inside the expectation:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[ \nabla_\theta \Big( \log p(\mathbf{s}_1) + \sum_{t=1}^{T} \big[ \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t) + \log p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t) \big] \Big)\, r(\tau) \right]$$
Both $\log p(\mathbf{s}_1)$ and $\log p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)$ do not depend on $\theta$, so their gradients are zero and they drop out, leaving the policy gradient

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[ \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t) \right) \left( \sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t) \right) \right]$$
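This final expression is exactly what a REINFORCE-style estimator computes: sample trajectories under the current policy, then average the product of the summed score $\sum_t \nabla_\theta \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)$ and the summed reward. Below is a minimal numpy sketch for a tabular softmax policy; the MDP, its tables, and the helper names are made up for illustration, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular MDP and softmax policy.
n_states, n_actions, T = 3, 2, 5
p_s1 = np.array([0.5, 0.3, 0.2])                                            # p(s_1)
transition = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # p(s' | s, a)
reward = rng.normal(size=(n_states, n_actions))                             # r(s, a)
theta = np.zeros((n_states, n_actions))                                     # policy logits

def pi(theta, s):
    """Softmax policy pi_theta(. | s) over actions in state s."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def policy_gradient_estimate(theta, n_trajectories=5_000):
    """Monte Carlo estimate of E[(sum_t grad log pi(a_t|s_t)) * (sum_t r(s_t, a_t))]."""
    grad = np.zeros_like(theta)
    for _ in range(n_trajectories):
        s = rng.choice(n_states, p=p_s1)
        grad_log_pi, total_reward = np.zeros_like(theta), 0.0
        for t in range(T):
            probs = pi(theta, s)
            a = rng.choice(n_actions, p=probs)
            # For a softmax policy, grad_theta log pi_theta(a | s) = onehot(a) - probs in row s.
            grad_log_pi[s] += np.eye(n_actions)[a] - probs
            total_reward += reward[s, a]
            s = rng.choice(n_states, p=transition[s, a])
        grad += grad_log_pi * total_reward
    return grad / n_trajectories

print(policy_gradient_estimate(theta))
```

In practice, the same quantity is usually obtained by backpropagating through $\sum_t \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)$ with an automatic differentiation library rather than writing the per-step gradient by hand.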