In this post, I derive the policy gradient. This tutorial follows the steps in the first part of lecture 5 of CS285 at UC Berkeley.

## Terminology

In reinforcement learning, a trajectory $\tau$ is a sequence of states and actions $s_{1}, a_{1}, \ldots, s_{T}, a_{T}$ collected by a policy. Its probability under the policy can be written as

$$p_{\theta}(\tau) = p_{\theta}(s_{1}, a_{1}, \ldots, s_{T}, a_{T}) = p(s_{1}) \prod_{t=1}^{T} \pi_{\theta}(a_{t} \mid s_{t})\, p(s_{t+1} \mid s_{t}, a_{t}).$$

A **trajectory distribution** is a probability distribution over a sequence of states and actions. In the equation above, it is expressed through the chain rule of probability: the initial state distribution $p(s_{1})$ is multiplied by the product of the policy probability $\pi_{\theta}(a_{t} \mid s_{t})$ and the transition probability $p(s_{t+1} \mid s_{t}, a_{t})$ over all time steps.
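To make the factorization concrete, here is a minimal sketch that computes $p_{\theta}(\tau)$ for a tiny hypothetical MDP with two states and two actions (all probabilities below are made up for illustration, not taken from the lecture):

```python
import numpy as np

# A tiny hypothetical MDP (all numbers are made up for illustration).
p_s1 = np.array([0.7, 0.3])                # initial state distribution p(s_1)
pi = np.array([[0.6, 0.4],                 # policy pi_theta(a | s): rows = states
               [0.2, 0.8]])
P = np.array([[[0.9, 0.1], [0.5, 0.5]],    # transitions p(s' | s, a): shape (s, a, s')
              [[0.3, 0.7], [0.8, 0.2]]])

def trajectory_prob(states, actions):
    """p_theta(tau) = p(s_1) * prod_t pi_theta(a_t | s_t) * p(s_{t+1} | s_t, a_t)."""
    prob = p_s1[states[0]]
    for t in range(len(actions)):
        prob *= pi[states[t], actions[t]]
        if t + 1 < len(states):            # the last action has no successor factor
            prob *= P[states[t], actions[t], states[t + 1]]
    return prob

# Probability of the trajectory s_1=0, a_1=1, s_2=1, a_2=0:
print(trajectory_prob([0, 1], [1, 0]))     # 0.7 * 0.4 * 0.5 * 0.2 = 0.028
```

Each factor in the loop corresponds to one term in the product above, which is the point of the chain-rule decomposition.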

The **reinforcement learning objective** can be written as finding the optimal parameter $\theta$ that maximizes the expected reward under the trajectory distribution.

In the following derivation, I will use $J(\theta)$ to represent the expected total reward $E_{\tau \sim p_{\theta}(\tau)}\left[\sum_{t=1}^{T} r(s_{t}, a_{t})\right]$, and replace $\sum_{t=1}^{T} r(s_{t}, a_{t})$ with $r(\tau)$ for simplicity.
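The notation above can be sketched numerically: sample trajectories $\tau$ from a tiny hypothetical MDP and estimate $J(\theta)$ as the average of $r(\tau)$ over samples (all numbers below are invented for illustration):

```python
import numpy as np

# A tiny hypothetical MDP (all numbers are made up for illustration).
rng = np.random.default_rng(0)
T = 3                                          # horizon
p_s1 = np.array([0.7, 0.3])                    # p(s_1)
pi = np.array([[0.6, 0.4], [0.2, 0.8]])        # pi_theta(a | s)
P = np.array([[[0.9, 0.1], [0.5, 0.5]],        # p(s' | s, a)
              [[0.3, 0.7], [0.8, 0.2]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])         # r(s, a)

def sample_return():
    """Sample one trajectory tau and return r(tau) = sum_t r(s_t, a_t)."""
    s, total = rng.choice(2, p=p_s1), 0.0
    for _ in range(T):
        a = rng.choice(2, p=pi[s])
        total += r[s, a]
        s = rng.choice(2, p=P[s, a])
    return total

# Monte Carlo estimate of J(theta) = E_{tau ~ p_theta}[r(tau)]
J_estimate = np.mean([sample_return() for _ in range(50_000)])
print(J_estimate)
```

This is only a sampling-based illustration of the expectation; the derivation below works with it analytically.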

## Differentiating the Policy Directly

Now, we have our objective

$$\theta^{\star} = \arg\max_{\theta} J(\theta),$$

and the expected reward

$$J(\theta) = E_{\tau \sim p_{\theta}(\tau)}[r(\tau)].$$

We expand the expectation over the continuous variable $\tau$ into its integral form:

$$J(\theta) = E_{\tau \sim p_{\theta}(\tau)}[r(\tau)] = \int p_{\theta}(\tau)\, r(\tau)\, d\tau.$$

Then, the gradient of the expected reward, $\nabla_{\theta} J(\theta)$, can be computed by moving the differentiation operator $\nabla_{\theta}$ inside the integral, because the operator is linear.

$$\nabla_{\theta} J(\theta) = \int \nabla_{\theta} p_{\theta}(\tau)\, r(\tau)\, d\tau.$$

Now, we need to use the log-derivative identity

$$\frac{d}{dx} \log g(x) = \frac{1}{g(x)} \frac{d g(x)}{dx}$$

to help us expand $\nabla_{\theta} p_{\theta}(\tau)$ by applying it in reverse:

$$p_{\theta}(\tau) \nabla_{\theta} \log p_{\theta}(\tau) = p_{\theta}(\tau) \frac{\nabla_{\theta} p_{\theta}(\tau)}{p_{\theta}(\tau)} = \nabla_{\theta} p_{\theta}(\tau).$$

If we replace the $\nabla_{\theta} p_{\theta}(\tau)$ term in the gradient of the expectation with the left-hand side, $p_{\theta}(\tau) \nabla_{\theta} \log p_{\theta}(\tau)$, and convert the integral back into expectation form, we will get

$$\nabla_{\theta} J(\theta) = \int \nabla_{\theta} p_{\theta}(\tau)\, r(\tau)\, d\tau = \int p_{\theta}(\tau) \nabla_{\theta} \log p_{\theta}(\tau)\, r(\tau)\, d\tau = E_{\tau \sim p_{\theta}(\tau)}\left[\nabla_{\theta} \log p_{\theta}(\tau)\, r(\tau)\right].$$

Recall that the trajectory distribution is

$$p_{\theta}(\tau) = p(s_{1}) \prod_{t=1}^{T} \pi_{\theta}(a_{t} \mid s_{t})\, p(s_{t+1} \mid s_{t}, a_{t}),$$

and if we take the logarithm of both sides, we will get

$$\log p_{\theta}(\tau) = \log p(s_{1}) + \sum_{t=1}^{T} \left[ \log \pi_{\theta}(a_{t} \mid s_{t}) + \log p(s_{t+1} \mid s_{t}, a_{t}) \right].$$

We proceed to substitute the right-hand side of this equation for $\log p_{\theta}(\tau)$ inside the expectation:

$$\begin{aligned} \nabla_{\theta} J(\theta) &= E_{\tau \sim p_{\theta}(\tau)}\left[\nabla_{\theta} \log p_{\theta}(\tau)\, r(\tau)\right] \\ &= E_{\tau \sim p_{\theta}(\tau)}\left[\nabla_{\theta} \left[\log p(s_{1}) + \sum_{t=1}^{T} \left(\log \pi_{\theta}(a_{t} \mid s_{t}) + \log p(s_{t+1} \mid s_{t}, a_{t})\right)\right] r(\tau)\right]. \end{aligned}$$

Neither $\log p(s_{1})$ nor $\log p(s_{t+1} \mid s_{t}, a_{t})$ depends on $\theta$, so their gradients vanish and the expression simplifies to

$$\nabla_{\theta} J(\theta) = E_{\tau \sim p_{\theta}(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t})\right)\left(\sum_{t=1}^{T} r(s_{t}, a_{t})\right)\right].$$
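As a sanity check on the final formula, here is a minimal numeric sketch (not from the lecture; the softmax parameterization, rewards, and sample size are all my own assumptions). For a one-step "bandit" MDP with a softmax policy, the Monte Carlo estimate $\frac{1}{N}\sum_i \nabla_{\theta}\log\pi_{\theta}(a_i)\, r(a_i)$ should agree with a finite-difference gradient of the exact $J(\theta)$:

```python
import numpy as np

# Hypothetical one-step MDP: one state, 3 actions, softmax policy (made up).
rng = np.random.default_rng(0)
rewards = np.array([1.0, 0.0, 2.0])       # r(s, a) for each action
theta = np.array([0.1, 0.2, -0.3])

def policy(theta):
    z = np.exp(theta - theta.max())       # softmax pi_theta(a)
    return z / z.sum()

def J(theta):
    return policy(theta) @ rewards        # exact expected reward

def pg_estimate(theta, n=200_000):
    """Monte Carlo estimate of grad J via the policy gradient formula."""
    probs = policy(theta)
    actions = rng.choice(3, size=n, p=probs)
    # For a softmax policy, grad_theta log pi(a) = one_hot(a) - probs.
    grad_log = np.eye(3)[actions] - probs
    return (grad_log * rewards[actions, None]).mean(axis=0)

# Finite-difference gradient of the exact J(theta) for comparison.
eps = 1e-5
fd = np.array([(J(theta + eps * np.eye(3)[i]) - J(theta - eps * np.eye(3)[i])) / (2 * eps)
               for i in range(3)])
print(pg_estimate(theta), fd)             # should agree up to sampling noise
```

The agreement up to Monte Carlo noise is exactly what the derivation promises: an unbiased gradient estimator built only from sampled actions, their log-probability gradients, and their rewards.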