Deriving Policy Gradient
In this blog post, I give a short tutorial on how to derive the policy gradient. It follows the steps in the first part of Lecture 5 of CS285 at UC Berkeley. OpenAI Spinning Up also has a more detailed tutorial on this derivation. While they use slightly different notation, they cover the same material.
Terminology
In reinforcement learning, a trajectory $\tau$ is a sequence of states and actions $\mathbf{s}_{1}, \mathbf{a}_{1}, \ldots, \mathbf{s}_{T}, \mathbf{a}_{T}$ collected by a policy, and its probability under that policy is
\[p_{\theta}(\tau) = p_{\theta}\left(\mathbf{s}_{1}, \mathbf{a}_{1}, \ldots, \mathbf{s}_{T}, \mathbf{a}_{T}\right)=p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} \pi_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right) p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right).\]This trajectory distribution $p_{\theta}(\tau)$ is a probability distribution over sequences of states and actions. The equation above factorizes it using the chain rule of probability: the initial state distribution $p\left(\mathbf{s}_{1}\right)$ is multiplied by the policy probability $\pi_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)$ and the transition probability $p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)$ at every time step.
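To make this factorization concrete, here is a minimal sketch of computing $p_{\theta}(\tau)$ for a recorded trajectory in a tabular MDP. The arrays `initial_probs`, `policy_probs`, and `transition_probs` are hypothetical placeholders I introduce for illustration, not anything from the lecture.

```python
import numpy as np

def trajectory_prob(states, actions, initial_probs, policy_probs, transition_probs):
    """p_theta(tau) = p(s_1) * prod_t pi_theta(a_t | s_t) * p(s_{t+1} | s_t, a_t).

    states, actions: integer indices of a recorded trajectory (tabular MDP).
    initial_probs[s]           = p(s_1 = s)
    policy_probs[s, a]         = pi_theta(a | s)
    transition_probs[s, a, s2] = p(s2 | s, a)
    """
    prob = initial_probs[states[0]]
    for t in range(len(actions)):
        prob *= policy_probs[states[t], actions[t]]
        if t + 1 < len(states):  # the final action may have no recorded next state
            prob *= transition_probs[states[t], actions[t], states[t + 1]]
    return prob
```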
The reinforcement learning objective is to find the optimal parameter $\theta$ that maximizes the expected total reward under the trajectory distribution:
\[\theta^{\star}=\arg \max _{\theta} E_{\tau \sim p_{\theta}(\tau)}\left[\sum_{t} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right]\]In the following derivation, I will use $J(\theta)$ to denote the expected reward $E_{\tau \sim p_{\theta}(\tau)}\left[\sum_{t} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right]$, and replace the total reward $\sum_{t=1}^{T} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)$ with $r(\tau)$ for brevity.
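Since the expectation cannot be computed exactly for most environments, $J(\theta)$ is usually estimated by sampling: roll the policy out $N$ times and average the total rewards. A minimal sketch, assuming a hypothetical `sample_trajectory(theta)` helper that runs one rollout under $\pi_{\theta}$ and returns its per-step rewards:

```python
import numpy as np

def estimate_J(theta, sample_trajectory, num_samples=100):
    """Monte Carlo estimate of J(theta) = E_{tau ~ p_theta(tau)}[r(tau)]."""
    returns = []
    for _ in range(num_samples):
        rewards = sample_trajectory(theta)  # one rollout under pi_theta (hypothetical helper)
        returns.append(sum(rewards))        # r(tau) = sum_t r(s_t, a_t)
    return np.mean(returns)
```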
Differentiating the Policy Directly
Now, we have our objective
\[\theta^{\star}=\arg \max _{\theta} J(\theta)\]and the expected reward
\[J(\theta) = E_{\tau \sim p_{\theta}(\tau)}[r(\tau)].\]We expand the expectation for continuous variables to its integral form.
\[J(\theta) = E_{\tau \sim p_{\theta}(\tau)}[r(\tau)] = \int p_{\theta}(\tau) r(\tau) d \tau\]Then, the gradient of the expected reward, $\nabla_{\theta} J(\theta)$, is obtained by moving the differentiation operator $\nabla_{\theta}$ inside the integral, which we can do because both operators are linear (and, under mild regularity conditions, can be exchanged).
\[\nabla_{\theta} J(\theta) =\int \nabla_{\theta} p_{\theta}(\tau) r(\tau) d \tau\]Now, we need to use the log-derivative identity
\[\nabla_{x} \log f(x) = \frac{\nabla_{x} f(x)}{f(x)}\]to rewrite $\nabla_{\theta} p_{\theta}(\tau)$ by applying it in reverse:
\[\textcolor{#eb3a32}{p_{\theta}(\tau) \nabla_{\theta} \log p_{\theta}(\tau)} =p_{\theta}(\tau) \frac{\nabla_{\theta} p_{\theta}(\tau)}{p_{\theta}(\tau)}=\textcolor{#3a79f6}{\nabla_{\theta} p_{\theta}(\tau)}.\]If we replace the $\textcolor{#3a79f6}{\nabla_{\theta} p_{\theta}(\tau)}$ term in the gradient of the expectation with the left-hand side $\textcolor{#eb3a32}{p_{\theta}(\tau) \nabla_{\theta} \log p_{\theta}(\tau)}$ and convert the integral back into expectation form, we get
\[\begin{aligned} \nabla_{\theta} J(\theta) & =\int \textcolor{#3a79f6}{\nabla_{\theta} p_{\theta}(\tau)} r(\tau) d \tau \\ & = \int \textcolor{#eb3a32}{p_{\theta}(\tau) \nabla_{\theta} \log p_{\theta}(\tau)} r(\tau) d \tau \\ & = E_{\tau \sim p_{\theta}(\tau)}\left[\nabla_{\theta} \log p_{\theta}(\tau) r(\tau)\right].\end{aligned}\]Recall that the trajectory distribution is
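This score-function (log-derivative) trick is easy to verify numerically on a distribution much simpler than a trajectory distribution. The toy example below is my own illustration, not from the lecture: for $x \sim \mathcal{N}(\theta, 1)$ and $r(x) = x^{2}$, we know $E[x^{2}] = \theta^{2} + 1$, so the true gradient is $\nabla_{\theta} E[r(x)] = 2\theta$, and the estimator $E\left[\nabla_{\theta} \log p_{\theta}(x)\, r(x)\right]$ should recover the same value.

```python
import numpy as np

theta = 1.5
rng = np.random.default_rng(0)
x = rng.normal(loc=theta, scale=1.0, size=1_000_000)  # samples from p_theta(x) = N(theta, 1)

r = x ** 2          # "reward" of each sample
score = x - theta   # grad_theta log N(x; theta, 1) = (x - theta) / sigma^2, with sigma = 1

grad_estimate = np.mean(score * r)  # E[grad_theta log p_theta(x) * r(x)]
print(grad_estimate)                # close to the analytic gradient 2 * theta = 3.0
```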
\[p_{\theta}(\tau) =p\left(\mathbf{s}_{1}\right) \prod_{t=1}^{T} \pi_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right) p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right),\]and if we take the logarithm of both sides, we will get
\[\log p_{\theta}(\tau)=\log p\left(\mathbf{s}_{1}\right)+\sum_{t=1}^{T}\left[\log \pi_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)+\log p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)\right].\]We then substitute the right-hand side of this equation for $\log p_{\theta}(\tau)$ inside the expectation.
\[\begin{aligned} \nabla_{\theta} J(\theta) & =E_{\tau \sim p_{\theta}(\tau)}\left[\nabla_{\theta} \log p_{\theta}(\tau) r(\tau)\right] \\ & = E_{\tau \sim p_{\theta}(\tau)}\left[\nabla_{\theta} \left[ \log p\left(\mathbf{s}_{1}\right)+\sum_{t=1}^{T}\left[\log \pi_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)+\log p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)\right]\right] r(\tau)\right] \end{aligned}\]Neither $\log p\left(\mathbf{s}_{1}\right)$ nor $\log p\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}_{t}\right)$ depends on $\theta$, so their gradients vanish and we can drop them, leaving
\[\nabla_{\theta} J(\theta) = E_{\tau \sim p_{\theta}(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}\left(\mathbf{a}_{t} \mid \mathbf{s}_{t}\right)\right)\left(\sum_{t=1}^{T} r\left(\mathbf{s}_{t}, \mathbf{a}_{t}\right)\right)\right].\]This completes the derivation of the policy gradient.
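In practice the expectation is approximated with a Monte Carlo average over $N$ sampled trajectories, $\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N}\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}\left(\mathbf{a}_{i, t} \mid \mathbf{s}_{i, t}\right)\right) r\left(\tau_{i}\right)$, which is the REINFORCE estimator. Below is a minimal PyTorch sketch of one gradient ascent step using that estimator for a discrete-action policy; the `trajectories` data layout is a hypothetical interface I assume for illustration, not code from the lecture.

```python
import torch
import torch.nn as nn

def policy_gradient_step(policy: nn.Module, trajectories, learning_rate=1e-2):
    """One gradient ascent step on J(theta) using the REINFORCE estimator.

    trajectories: list of (states, actions, rewards) tuples, where states is a
    float tensor [T, obs_dim], actions a long tensor [T], rewards a float tensor [T].
    """
    surrogate = 0.0
    for states, actions, rewards in trajectories:
        logits = policy(states)  # [T, num_actions]
        log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
        # (sum_t log pi(a_t | s_t)) * r(tau); its gradient is one term of the estimator.
        surrogate = surrogate + log_probs.sum() * rewards.sum()
    surrogate = surrogate / len(trajectories)

    policy.zero_grad()
    surrogate.backward()  # parameter gradients now hold the policy gradient estimate
    with torch.no_grad():
        for p in policy.parameters():
            p += learning_rate * p.grad  # gradient ascent on J(theta)
```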
Reference
[1] CS 285: Deep Reinforcement Learning, Lecture 5, by Professor Sergey Levine, UC Berkeley, on YouTube.