# Bellman equations for reinforcement learning

24 September 2016

These are notes on Lecture 2 of David Silver’s course on reinforcement learning.

The goal is to set up a framework which is commonly used in the current reinforcement learning literature.

## Markov reward processes (MRP)

We define the MRP framework and derive some recursive equations.

- $$\mathcal S$$ $$\dots$$ finite set of states
- $$\mathcal P$$ $$\dots$$ state transition matrix
- $$\mathcal R$$ $$\dots$$ reward function
- $$\gamma$$ $$\dots$$ discount factor

States $$\mathcal S = \{1, \dotsc, n\}$$.

State transition matrix: \begin{align} \mathcal P_{ss'} &= \P(S_{t + 1} = s' \given S_t = s) \\
&= \P(S_{t + 1} = s' \given S_t = s, S_{t - 1} = s_{t - 1}, \dotsc, S_1 = s_1) & \text{(Markov property)} \\
\mathcal P &= \begin{bmatrix} \mathcal P_{11} & \cdots & \mathcal P_{1n} \\
\vdots & \ddots & \vdots \\
\mathcal P_{n1} & \cdots & \mathcal P_{nn} \end{bmatrix}. \end{align}
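As a concrete illustration (a made-up two-state MRP, not from the lecture), each row of $$\mathcal P$$ is a probability distribution over successor states:

```python
import numpy as np

# Hypothetical 2-state transition matrix: entry P[s, s'] is the
# probability of moving from state s to state s'.
P = np.array([
    [0.9, 0.1],   # from state 1: stay w.p. 0.9, move w.p. 0.1
    [0.5, 0.5],   # from state 2: uniform over both states
])

# Every row of a stochastic matrix sums to 1.
assert np.allclose(P.sum(axis=1), 1.0)
```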

Reward function: \begin{align} \mathcal R(s) &= \E[R_{t + 1} \given S_t = s] \\
&=: \mathcal R_s, \label{eq:notes/bellman/reward} \end{align} where $$R_{t + 1}$$ is a random variable.

Discount factor $$\gamma \in [0, 1]$$.

Return \begin{align} G_t &= \underbrace{\sum_{k = 0}^\infty \gamma^k R_{t + k + 1}}_{\text{random variable}}. \end{align}
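The return is easy to compute for a sampled reward sequence; here is a minimal sketch with hypothetical rewards (truncating the infinite sum at the end of the sequence):

```python
import math

def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Hypothetical sampled rewards R_{t+1}, R_{t+2}, R_{t+3}.
rewards = [1.0, 1.0, 1.0]
g = discounted_return(rewards, gamma=0.5)
assert math.isclose(g, 1.0 + 0.5 + 0.25)  # 1.75
```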

Value function \begin{align} v(s) &= \E[G_t \given S_t = s] \\
&= \E[R_{t + 1} + \gamma R_{t + 2} + \cdots \given S_t = s] \\
&= \E[R_{t + 1} \given S_t = s] + \gamma \E[R_{t + 2} + \gamma R_{t + 3} + \cdots \given S_t = s] & \text{(linearity of expectations)} \\
&= \mathcal R_s + \gamma \E[G_{t + 1} \given S_t = s] & \text{(from reward function in \eqref{eq:notes/bellman/reward})}\\
&= \mathcal R_s + \gamma \int g_{t + 1} p(G_{t + 1} = g_{t + 1} \given S_t = s) \,\mathrm dg_{t + 1} & \text{(denote } p \text{ as the density of the corresponding r.v.)} \\
&= \mathcal R_s + \gamma \int g_{t + 1} \left(\int p(G_{t + 1} = g_{t + 1} \given S_{t + 1} = s', S_t = s) p(S_{t + 1} = s' \given S_t = s) \,\mathrm d s'\right) \,\mathrm dg_{t + 1} & \text{(marginalize over } S_{t + 1} \text{)} \\
&= \mathcal R_s + \gamma \int g_{t + 1} \left(\int p(G_{t + 1} = g_{t + 1} \given S_{t + 1} = s') \mathcal P_{ss'} \,\mathrm ds'\right) \,\mathrm dg_{t + 1} & \text{(Markov property)} \\
&= \mathcal R_s + \gamma \int \left(\int g_{t + 1} p(G_{t + 1} = g_{t + 1} \given S_{t + 1} = s') \,\mathrm dg_{t + 1}\right) \mathcal P_{ss'} \,\mathrm ds' \\
&= \mathcal R_s + \gamma \int \E[G_{t + 1} \given S_{t + 1} = s'] \mathcal P_{ss'} \,\mathrm ds' \\
&= \mathcal R_s + \gamma \int v(s') \mathcal P_{ss'} \,\mathrm ds' \\
&= \mathcal R_s + \gamma \sum_{s' \in \mathcal S} v(s') \mathcal P_{ss'}. & \text{(the state space is discrete)} \end{align}
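In matrix form the last line reads $$v = \mathcal R + \gamma \mathcal P v$$, which is a linear system we can solve directly. A minimal sketch on a hypothetical two-state MRP (illustrative numbers, not from the lecture):

```python
import numpy as np

# Hypothetical 2-state MRP.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])        # P[s, s'] = P(S_{t+1}=s' | S_t=s)
R = np.array([1.0, -1.0])         # R[s] = E[R_{t+1} | S_t=s]
gamma = 0.9

# The Bellman equation v = R + gamma * P v is linear, so for
# gamma < 1 we can solve v = (I - gamma * P)^{-1} R exactly.
v = np.linalg.solve(np.eye(2) - gamma * P, R)

# Check: v satisfies the Bellman equation.
assert np.allclose(v, R + gamma * P @ v)
```

This direct solve costs $$O(n^3)$$ for $$n$$ states, which is why iterative methods are preferred for large state spaces.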

## Markov decision processes (MDP)

We define the MDP framework and derive recursive equations for various quantities.

- $$\mathcal S$$ $$\dots$$ finite set of states
- $$\mathcal A$$ $$\dots$$ finite set of actions
- $$\mathcal P$$ $$\dots$$ state transition matrix
- $$\mathcal R$$ $$\dots$$ reward function
- $$\gamma$$ $$\dots$$ discount factor

Policy: \begin{align} \pi(a \given s) = \P(A_t = a \given S_t = s) \end{align}

State transition matrix: \begin{align} \mathcal P_{ss'}^a = \P(S_{t + 1} = s' \given S_t = s, A_t = a). \end{align}

Also define \begin{align} \mathcal P_{ss'}^{\pi} &:= \P(S_{t + 1} = s' \given S_t = s) \\
&= \sum_{a \in \mathcal A} \P(S_{t + 1} = s' \given S_t = s, A_t = a) \P(A_t = a \given S_t = s) \\
&= \sum_{a \in \mathcal A} \mathcal P_{ss'}^a \pi(a \given s) \end{align} and \begin{align} \mathcal P^{\pi} &:= \begin{bmatrix} \mathcal P_{11}^{\pi} & \cdots & \mathcal P_{1n}^{\pi} \\
\vdots & \ddots & \vdots \\
\mathcal P_{n1}^{\pi} & \cdots & \mathcal P_{nn}^{\pi} \end{bmatrix}. \end{align}

Reward function \begin{align} \mathcal R(s, a) &= \E[R_{t + 1} \given S_t = s, A_t = a] \\
&=: \mathcal R_s^a. \end{align}

Also define \begin{align} \mathcal R_s^\pi &:= \E[R_{t + 1} \given S_t = s] \\
&= \int r_{t + 1} p(R_{t + 1} = r_{t + 1} \given S_t = s) \,\mathrm dr_{t + 1} \\
&= \int r_{t + 1} \int p(R_{t + 1} = r_{t + 1} \given S_t = s, A_t = a) p(A_t = a \given S_t = s) \,\mathrm da \,\mathrm dr_{t + 1} \\
&= \int \left(\int r_{t + 1} p(R_{t + 1} = r_{t + 1} \given S_t = s, A_t = a) \,\mathrm dr_{t + 1}\right) \pi(a \given s) \,\mathrm da \\
&= \int \E[R_{t + 1} \given S_t = s, A_t = a] \pi(a \given s) \,\mathrm da \\
&= \sum_{a \in \mathcal A} \mathcal R_s^a \pi(a \given s) & \text{(the action space is discrete)} \end{align} and \begin{align} \mathcal R^{\pi} &:= [\mathcal R_1^{\pi} \cdots \mathcal R_n^{\pi}]^T. \end{align}
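Averaging over the policy reduces an MDP to an MRP with matrix $$\mathcal P^\pi$$ and reward vector $$\mathcal R^\pi$$. A minimal sketch on a hypothetical two-state, two-action MDP (illustrative numbers only):

```python
import numpy as np

# Hypothetical MDP: P[a, s, s'] = P(S_{t+1}=s' | S_t=s, A_t=a),
# R[s, a] = R_s^a, pi[s, a] = pi(a | s).
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.7, 0.3]]])
R = np.array([[1.0, 0.0],
              [2.0, -1.0]])
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])

# P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}
P_pi = np.einsum('sa,asz->sz', pi, P)
# R^pi_s = sum_a pi(a|s) R_s^a
R_pi = (pi * R).sum(axis=1)

# P^pi is again a stochastic matrix: rows sum to 1.
assert np.allclose(P_pi.sum(axis=1), 1.0)
```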

State-value function of an MDP \begin{align} v_\pi(s) &:= \E[G_t \given S_t = s]. \end{align}

Action-value function of an MDP \begin{align} q_\pi(s, a) &:= \E[G_t \given S_t = s, A_t = a]. \end{align}

## Bellman Expectation Equations

We have the following Bellman Expectation Equations: \begin{align} v_\pi(s) &= \mathcal R_s^\pi + \gamma \sum_{s' \in \mathcal S} v_\pi(s') \mathcal P_{ss'}^\pi \\
v_\pi(s) &= \sum_{a \in \mathcal A} \pi(a \given s) \left(\mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} v_\pi(s') \mathcal P_{ss'}^a\right) \\
v_\pi(s) &= \sum_{a \in \mathcal A} \pi(a \given s) q_\pi(s, a) \\
q_\pi(s, a) &= \mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^a \sum_{a' \in \mathcal A} \pi(a' \given s') q_\pi(s', a') \\
q_\pi(s, a) &= \mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^a v_\pi(s') \end{align}
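The expectation equation can also be used as an update rule: repeatedly applying the backup converges to $$v_\pi$$ because it is a $$\gamma$$-contraction. A minimal iterative policy-evaluation sketch on a hypothetical MDP (made-up numbers, not from the lecture):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP and stochastic policy.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.7, 0.3]]])   # P[a, s, s']
R = np.array([[1.0, 0.0], [2.0, -1.0]])    # R[s, a] = R_s^a
pi = np.array([[0.5, 0.5], [0.2, 0.8]])    # pi[s, a] = pi(a | s)
gamma = 0.9

# Iterate v(s) <- sum_a pi(a|s) (R_s^a + gamma sum_s' P^a_{ss'} v(s')).
v = np.zeros(2)
for _ in range(1000):
    q = R + gamma * np.einsum('asz,z->sa', P, v)  # q[s, a]
    v_new = (pi * q).sum(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
```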

## Optimal policies

We formalize what it means to find an optimal policy.

Optimal value function: \begin{align} v_*(s) &= \max_\pi v_\pi(s). \end{align}

Optimal action-value function: \begin{align} q_*(s, a) = \max_\pi q_\pi(s, a). \end{align}

Optimal policy theorem:

Define a partial order $$\geq$$ on policies so that $$\pi \geq \pi'$$ if \begin{align} v_\pi(s) \geq v_{\pi'}(s), \forall s \in \mathcal S. \end{align} For any MDP,

- There exists an optimal policy $$\pi_*$$ such that \begin{align} \pi_\ast \geq \pi, \forall \pi. \end{align}
- All optimal policies achieve the optimal value function: \begin{align} v_{\pi_\ast}(s) = v_\ast(s). \end{align}
- All optimal policies achieve the optimal action-value function: \begin{align} q_{\pi_\ast}(s, a) = q_\ast(s, a). \end{align}

One example of an optimal policy is the deterministic greedy policy \begin{align} \pi_\ast(a \given s) &= \delta(a - \mathrm{argmax}_{a' \in \mathcal A} q_\ast(s, a')). \end{align}

## Bellman Optimality Equations

This gives rise to Bellman Optimality Equations: \begin{align} v_\ast(s) &= \max_{a \in \mathcal A} q_\ast(s, a) \\
v_\ast(s) &= \max_{a \in \mathcal A} \left(\mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^a v_\ast(s')\right) \\
q_\ast(s, a) &= \mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^a v_\ast(s') \\
q_\ast(s, a) &= \mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^a \max_{a' \in \mathcal A} q_\ast(s', a'). \end{align}
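Unlike the expectation equations, the optimality equations are nonlinear (because of the max), so they are typically solved iteratively. A minimal value-iteration sketch on a hypothetical two-state, two-action MDP (made-up numbers), which also reads off a greedy optimal policy at the end:

```python
import numpy as np

# Hypothetical MDP: P[a, s, s'] and R[s, a] = R_s^a.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.7, 0.3]]])
R = np.array([[1.0, 0.0], [2.0, -1.0]])
gamma = 0.9

# Value iteration: apply the Bellman optimality backup
#   v(s) <- max_a (R_s^a + gamma sum_s' P^a_{ss'} v(s'))
# until convergence (also a gamma-contraction).
v = np.zeros(2)
for _ in range(1000):
    q = R + gamma * np.einsum('asz,z->sa', P, v)  # q[s, a]
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

# A deterministic greedy policy picks argmax_a q(s, a) in each state.
greedy_actions = q.argmax(axis=1)
```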
