This blog post is a short tutorial on how to derive the gradient of a policy. It follows the steps in the first part of Lecture 5 of CS 285 at UC Berkeley [1]. OpenAI Spinning Up also has a more detailed tutorial on the same topic. It uses slightly different notation, but the derivation is the same.


In reinforcement learning, a trajectory $\tau$ is a sequence of states and actions $\mathbf{s}_{1}, \mathbf{a}_{1}, \ldots, \mathbf{s}_{T}, \mathbf{a}_{T}$ collected by running a policy. The probability of a trajectory under the policy $\pi_{\theta}$ is

\[p_{\theta}(\tau) = p_{\theta}\left(\mathbf{s}_{1}, \mathbf{a}_{1}, \ldots, \mathbf{s}_{T}, \mathbf{a}_{T}\right)=p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} \pi_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right) p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right).\]

A trajectory distribution is a probability distribution over sequences of states and actions. In the equation above, it is factorized with the chain rule of probability and the Markov property: the initial state distribution $p\left(\mathbf{s}_{1}\right)$ is multiplied by the product, over all time steps, of the policy probability $\pi_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)$ and the transition probability $p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)$.
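To make the factorization concrete, here is a minimal Python sketch. It assumes a hypothetical two-state, two-action MDP with a tabular softmax policy. The names `INITIAL`, `TRANSITION`, `policy`, `sample_trajectory`, and `log_prob`, and all of the numbers, are made up for illustration and do not come from the lecture. The sketch samples a trajectory and evaluates $\log p_{\theta}(\tau)$ term by term.

```python
import math
import random

# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
INITIAL = [0.6, 0.4]                       # p(s_1)
TRANSITION = {                             # p(s' | s, a)
    (0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9],
}

def policy(theta, s):
    """Softmax policy pi_theta(. | s) with one logit per (state, action)."""
    m = max(theta[s])
    exps = [math.exp(x - m) for x in theta[s]]
    z = sum(exps)
    return [e / z for e in exps]

def sample_trajectory(theta, T, rng):
    """Sample (s_1, a_1, ..., s_T, a_T) plus the final next state s_{T+1}."""
    s = rng.choices([0, 1], weights=INITIAL)[0]
    states, actions = [], []
    for _ in range(T):
        a = rng.choices([0, 1], weights=policy(theta, s))[0]
        states.append(s)
        actions.append(a)
        s = rng.choices([0, 1], weights=TRANSITION[(s, a)])[0]
    return states, actions, s

def log_prob(theta, states, actions, s_last):
    """log p(s_1) + sum_t [log pi(a_t | s_t) + log p(s_{t+1} | s_t, a_t)]."""
    total = math.log(INITIAL[states[0]])
    next_states = states[1:] + [s_last]
    for s, a, s_next in zip(states, actions, next_states):
        total += math.log(policy(theta, s)[a])
        total += math.log(TRANSITION[(s, a)][s_next])
    return total

if __name__ == "__main__":
    theta = [[0.3, -0.2], [0.0, 1.0]]      # arbitrary policy logits
    states, actions, s_last = sample_trajectory(theta, 3, random.Random(0))
    print(states, actions, log_prob(theta, states, actions, s_last))
```

A quick way to check the factorization is to sum `math.exp(log_prob(...))` over every possible trajectory of a fixed length; the result should be 1 up to floating-point error.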

The reinforcement learning objective is to find the parameters $\theta^{\star}$ that maximize the expected total reward under the trajectory distribution.

\[\theta^{\star}=\arg \max _{\theta} E_{\tau \sim p_{\theta}(\tau)}\left[\sum_{t} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right]\]

In the following derivation, I will use $J(\theta)$ to denote the expected total reward $E_{\tau \sim p_{\theta}(\tau)}\left[\sum_{t} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right]$, and write $r(\tau)$ for the total reward of a trajectory, $\sum_{t=1}^{T} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)$, for simplicity.

Differentiating the Policy Directly

Now, we have our objective

\[\theta^{\star}=\arg \max _{\theta} J(\theta)\]

and expectation of rewards

\[J(\theta) = E_{\tau \sim p_{\theta}(\tau)}[r(\tau)].\]

We expand the expectation for continuous variables to its integral form.

\[J(\theta) = E_{\tau \sim p_{\theta}(\tau)}[r(\tau)] = \int p_{\theta}(\tau) r(\tau) d \tau\]

Then, the gradient of the expected reward, $\nabla_{\theta} J(\theta)$, can be written by moving the differentiation operator $\nabla_{\theta}$ inside the integral. This works because differentiation is linear (assuming $p_{\theta}$ is regular enough for the exchange), and $r(\tau)$ does not depend on $\theta$.

\[\nabla_{\theta} J(\theta) =\int \nabla_{\theta} p_{\theta}(\tau) r(\tau) d \tau\]

Now, we need to use the log-derivative identity

\[\frac{d}{dx} \log f(x) = \frac{f'(x)}{f(x)}\]

to rewrite $\nabla_{\theta} p_{\theta}(\tau)$ by applying it in reverse:

\[\textcolor{#eb3a32}{p_{\theta}(\tau) \nabla_{\theta} \log p_{\theta}(\tau)} =p_{\theta}(\tau) \frac{\nabla_{\theta} p_{\theta}(\tau)}{p_{\theta}(\tau)}=\textcolor{#3a79f6}{\nabla_{\theta} p_{\theta}(\tau)}.\]

If we replace the $\textcolor{#3a79f6}{\nabla_{\theta} p_{\theta}(\tau)}$ term in the gradient of expectation by the left hand side $\textcolor{#eb3a32}{p_{\theta}(\tau) \nabla_{\theta} \log p_{\theta}(\tau)}$ and convert it back into the expectation form, we will get

\[\begin{aligned} \nabla_{\theta} J(\theta) & =\int \textcolor{#3a79f6}{\nabla_{\theta} p_{\theta}(\tau)} r(\tau) d \tau \\ & = \int \textcolor{#eb3a32}{p_{\theta}(\tau) \nabla_{\theta} \log p_{\theta}(\tau)} r(\tau) d \tau \\ & = E_{\tau \sim p_{\theta}(\tau)}\left[\nabla_{\theta} \log p_{\theta}(\tau) r(\tau)\right].\end{aligned}\]
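The identity above can be checked numerically on a toy problem. The sketch below is only an illustration under assumptions of my own: a single-step problem with two actions, a Bernoulli policy whose probability of action 1 is `sigmoid(theta)`, and made-up rewards in `REWARD`. It compares the exact derivative of $J(\theta)$ with a Monte Carlo average of $\nabla_{\theta} \log p_{\theta}(\tau) \, r(\tau)$.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

REWARD = [1.0, 3.0]  # hypothetical rewards for actions 0 and 1

def exact_gradient(theta):
    """d/dtheta of J(theta) = (1 - p) * r(0) + p * r(1), with p = sigmoid(theta)."""
    p = sigmoid(theta)
    return p * (1.0 - p) * (REWARD[1] - REWARD[0])

def score_function_estimate(theta, n_samples, rng):
    """Monte Carlo average of grad log p_theta(a) * r(a)."""
    p = sigmoid(theta)
    total = 0.0
    for _ in range(n_samples):
        a = 1 if rng.random() < p else 0
        grad_log_prob = a - p  # derivative of log Bernoulli(a; sigmoid(theta))
        total += grad_log_prob * REWARD[a]
    return total / n_samples

if __name__ == "__main__":
    theta = 0.5
    print("exact:   ", exact_gradient(theta))
    print("estimate:", score_function_estimate(theta, 200_000, random.Random(0)))
```

With enough samples the two numbers should agree closely. Note that the estimate only needs samples from the policy and the gradient of its log-probability, never the gradient of the reward.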

Recall that the trajectory distribution is

\[p_{\theta}(\tau) =p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} \pi_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right) p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right),\]

and if we take the logarithm of both sides, we will get

\[\log p_{\theta}(\tau)=\log p\left(\mathbf{s}_{1}\right)+\sum_{t=1}^{T} \left[\log \pi_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)+\log p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)\right].\]

We proceed to substitute the right hand side of this equation for $\log p_{\theta}(\tau)$ inside the expectation.

\[\begin{aligned} \nabla_{\theta} J(\theta) & =E_{\tau \sim p_{\theta}(\tau)}\left[\nabla_{\theta} \log p_{\theta}(\tau) r(\tau)\right] \\ & = E_{\tau \sim p_{\theta}(\tau)}\left[\nabla_{\theta} \left[ \log p\left(\mathbf{s}_{1}\right)+\sum_{t=1}^{T} \left( \log \pi_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)+\log p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right) \right) \right] r(\tau)\right] \end{aligned}\]

Neither $\log p\left(\mathbf{s}_{1}\right)$ nor $\log p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)$ depends on $\theta$, so their gradients are zero. Dropping them and writing $r(\tau)$ out as a sum of rewards, the expression simplifies to

\[\nabla_{\theta} J(\theta) = E_{\tau \sim p_{\theta}(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right)\left(\sum_{t=1}^{T} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)\right].\]

This completes the derivation of the policy gradient.
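In practice, the expectation is usually approximated by averaging over $N$ sampled trajectories:

\[\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N}\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}\left(\mathbf{a}_{i, t} \mid \mathbf{s}_{i, t}\right)\right)\left(\sum_{t=1}^{T} r\left(\mathbf{s}_{i, t}, \mathbf{a}_{i, t}\right)\right).\]

The following Python sketch shows one way this estimator could look. It is not code from the lecture: the two-state MDP, the reward table, the tabular softmax policy, and the function names `policy_gradient_estimate` and `expected_return` are all assumptions made for illustration. `expected_return` computes $J(\theta)$ exactly for this tiny MDP so that the estimate can be compared against a finite-difference gradient.

```python
import math
import random

# Hypothetical 2-state, 2-action MDP; every number here is illustrative.
INITIAL = [0.6, 0.4]                                            # p(s_1)
TRANSITION = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
              (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}           # p(s' | s, a)
REWARD = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.5}   # r(s, a)

def policy(theta, s):
    """Tabular softmax policy pi_theta(. | s)."""
    m = max(theta[s])
    exps = [math.exp(x - m) for x in theta[s]]
    z = sum(exps)
    return [e / z for e in exps]

def policy_gradient_estimate(theta, T, n_trajectories, rng):
    """Average of (sum_t grad log pi(a_t | s_t)) * (sum_t r(s_t, a_t))."""
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(n_trajectories):
        score = [[0.0, 0.0], [0.0, 0.0]]   # sum_t grad_theta log pi(a_t | s_t)
        ret = 0.0                          # r(tau) = sum_t r(s_t, a_t)
        s = rng.choices([0, 1], weights=INITIAL)[0]
        for _ in range(T):
            probs = policy(theta, s)
            a = rng.choices([0, 1], weights=probs)[0]
            # For a softmax, d log pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s).
            for b in (0, 1):
                score[s][b] += (1.0 if a == b else 0.0) - probs[b]
            ret += REWARD[(s, a)]
            s = rng.choices([0, 1], weights=TRANSITION[(s, a)])[0]
        for i in (0, 1):
            for b in (0, 1):
                grad[i][b] += score[i][b] * ret / n_trajectories
    return grad

def expected_return(theta, T):
    """Exact J(theta), computed by propagating the state distribution forward."""
    dist, total = list(INITIAL), 0.0
    for _ in range(T):
        nxt = [0.0, 0.0]
        for s in (0, 1):
            probs = policy(theta, s)
            for a in (0, 1):
                w = dist[s] * probs[a]
                total += w * REWARD[(s, a)]
                for s2 in (0, 1):
                    nxt[s2] += w * TRANSITION[(s, a)][s2]
        dist = nxt
    return total

if __name__ == "__main__":
    theta = [[0.0, 0.0], [0.0, 0.0]]
    print(policy_gradient_estimate(theta, 3, 50_000, random.Random(0)))
```

The estimator never touches `TRANSITION` or `INITIAL` except to simulate the environment, which mirrors the derivation: the dynamics terms drop out of the gradient, so only the policy's log-probabilities need to be differentiated.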


[1] CS 285 Deep Reinforcement Learning by Professor Sergey Levine on YouTube.