Tuan Anh Le

Bellman equations for reinforcement learning

24 September 2016

These are notes on Lecture 2 of David Silver’s course on reinforcement learning.

The goal is to set up a framework which is commonly used in the current reinforcement learning literature.

Structure:
Markov reward processes
Markov decision processes → Bellman Expectation Equations
Optimal policies → Bellman Optimality Equations


Markov reward processes (MRP)

We define the MRP framework and derive some recursive equations.

\(\mathcal S\) \(\dots\) finite set of states
\(\mathcal P\) \(\dots\) state transition matrix
\(\mathcal R\) \(\dots\) reward function
\(\gamma\) \(\dots\) discount factor

States \(\mathcal S = \{1, \dotsc, n\}\).

State transition matrix: \begin{align} \mathcal P_{ss'} &= \P(S_{t + 1} = s' \given S_t = s) \\
&= \P(S_{t + 1} = s' \given S_t = s, S_{t - 1} = s_{t - 1}, \dotsc, S_1 = s_1) & \text{(Markov property)} \\
\mathcal P &= \begin{bmatrix} \mathcal P_{11} & \cdots & \mathcal P_{1n} \\
\vdots & \ddots & \vdots \\
\mathcal P_{n1} & \cdots & \mathcal P_{nn} \end{bmatrix}. \end{align}

Reward function: \begin{align} \mathcal R(s) &= \E[R_{t + 1} \given S_t = s] \\
&=: \mathcal R_s, \label{eq:notes/bellman/reward} \end{align} where \(R_{t + 1}\) is a random variable.

Discount factor \(\gamma \in [0, 1]\).

Return \begin{align} G_t &= \underbrace{\sum_{k = 0}^\infty \gamma^k R_{t + k + 1}}_{\text{random variable}}. \end{align}
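
As a concrete example, here is a small MRP in code; the three-state chain, the numbers, and the variable names below are illustrative assumptions, not anything from the lecture. Each row of the transition matrix is a probability distribution, and the return is just a discounted sum of rewards along one sampled trajectory.

```python
import numpy as np

# A made-up 3-state MRP (states indexed 0, 1, 2), purely for illustration.
P = np.array([[0.5, 0.5, 0.0],   # P[s, s2] = P(S_{t+1} = s2 | S_t = s)
              [0.2, 0.3, 0.5],
              [0.0, 0.0, 1.0]])  # state 2 is absorbing
R = np.array([1.0, 2.0, 0.0])    # R[s] = expected immediate reward R_s
gamma = 0.9

assert np.allclose(P.sum(axis=1), 1.0)  # every row is a probability distribution

# Discounted return G_t for one sampled reward sequence R_{t+1}, R_{t+2}, ...
rewards = [1.0, 2.0, 0.0, 0.0]
G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)  # 1.0 + 0.9 * 2.0 = 2.8
```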

Value function \begin{align} v(s) &= \E[G_t \given S_t = s] \\
&= \E[R_{t + 1} + \gamma R_{t + 2} + \cdots \given S_t = s] \\
&= \E[R_{t + 1} \given S_t = s] + \gamma \E[R_{t + 2} + \gamma R_{t + 3} + \cdots \given S_t = s] & \text{(linearity of expectation)} \\
&= \mathcal R_s + \gamma \E[G_{t + 1} \given S_t = s] & \text{(from the reward function in \eqref{eq:notes/bellman/reward})}\\
&= \mathcal R_s + \gamma \int g_{t + 1} p(G_{t + 1} = g_{t + 1} \given S_t = s) \,\mathrm dg_{t + 1} & \text{(}p\text{ denotes the density of the corresponding random variable)} \\
&= \mathcal R_s + \gamma \int g_{t + 1} \left(\int p(G_{t + 1} = g_{t + 1} \given S_{t + 1} = s', S_t = s) p(S_{t + 1} = s' \given S_t = s) \,\mathrm ds'\right) \,\mathrm dg_{t + 1} & \text{(marginalize over } S_{t + 1}\text{)} \\
&= \mathcal R_s + \gamma \int g_{t + 1} \left(\int p(G_{t + 1} = g_{t + 1} \given S_{t + 1} = s') \mathcal P_{ss'} \,\mathrm ds'\right) \,\mathrm dg_{t + 1} & \text{(Markov property)} \\
&= \mathcal R_s + \gamma \int \left(\int g_{t + 1} p(G_{t + 1} = g_{t + 1} \given S_{t + 1} = s') \,\mathrm dg_{t + 1}\right) \mathcal P_{ss'} \,\mathrm ds' \\
&= \mathcal R_s + \gamma \int \E[G_{t + 1} \given S_{t + 1} = s'] \mathcal P_{ss'} \,\mathrm ds' \\
&= \mathcal R_s + \gamma \int v(s') \mathcal P_{ss'} \,\mathrm ds' \\
&= \mathcal R_s + \gamma \sum_{s' \in \mathcal S} v(s') \mathcal P_{ss'} & \text{(}\mathcal S\text{ is discrete).} \end{align}
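
Stacking the last line over all states gives the matrix form \(v = \mathcal R + \gamma \mathcal P v\), so for \(\gamma < 1\) the value function can be read off in closed form as \(v = (I - \gamma \mathcal P)^{-1} \mathcal R\). A minimal numerical sketch, reusing the made-up MRP from the previous block:

```python
import numpy as np

# Same made-up 3-state MRP as in the previous sketch.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# Matrix form of the Bellman equation, v = R + gamma * P v,
# solved directly as v = (I - gamma * P)^{-1} R.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)

# Sanity check: v satisfies the recursion derived above (up to numerical error).
assert np.allclose(v, R + gamma * P @ v)
print(v)
```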


Markov decision processes (MDP)

We define the MDP framework and derive recursive equations for various quantities.

\(\mathcal S\) \(\dots\) finite set of states
\(\mathcal A\) \(\dots\) finite set of actions
\(\mathcal P\) \(\dots\) state transition matrix
\(\mathcal R\) \(\dots\) reward function
\(\gamma\) \(\dots\) discount factor

Policy: \begin{align} \pi(a \given s) = \P(A_t = a \given S_t = s) \end{align}

State transition matrix: \begin{align} \mathcal P_{ss’}^a = \P(S_{t + 1} = s’ \given S_t = s, A_t = a). \end{align}

Also define \begin{align} \mathcal P_{ss'}^{\pi} &:= \P(S_{t + 1} = s' \given S_t = s) \\
&= \sum_{a \in \mathcal A} \P(S_{t + 1} = s' \given S_t = s, A_t = a) \P(A_t = a \given S_t = s) \\
&= \sum_{a \in \mathcal A} \mathcal P_{ss'}^a \pi(a \given s) \end{align} and \begin{align} \mathcal P^{\pi} &:= \begin{bmatrix} \mathcal P_{11}^{\pi} & \cdots & \mathcal P_{1n}^{\pi} \\
\vdots & \ddots & \vdots \\
\mathcal P_{n1}^{\pi} & \cdots & \mathcal P_{nn}^{\pi} \end{bmatrix}. \end{align}

Reward function \begin{align} \mathcal R(s, a) &= \E[R_{t + 1} \given S_t = s, A_t = a] \\
&=: \mathcal R_s^a. \end{align}

Also define \begin{align} \mathcal R_s^\pi &:= \E[R_{t + 1} \given S_t = s] \\
&= \int r_{t + 1} p(R_{t + 1} = r_{t + 1} \given S_t = s) \,\mathrm dr_{t + 1} \\
&= \int r_{t + 1} \int p(R_{t + 1} = r_{t + 1} \given S_t = s, A_t = a) p(A_t = a \given S_t = s) \,\mathrm da \,\mathrm dr_{t + 1} \\
&= \int \left(\int r_{t + 1} p(R_{t + 1} = r_{t + 1} \given S_t = s, A_t = a) \,\mathrm dr_{t + 1}\right) p(A_t = a \given S_t = s) \,\mathrm da \\
&= \int \E[R_{t + 1} \given S_t = s, A_t = a] \pi(a \given s) \,\mathrm da \\
&= \sum_{a \in \mathcal A} \mathcal R_s^a \pi(a \given s) & \text{(}\mathcal A\text{ is discrete)} \end{align} and \begin{align} \mathcal R^{\pi} &:= [\mathcal R_1^{\pi} \cdots \mathcal R_n^{\pi}]^T. \end{align}
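
In code, \(\mathcal P^\pi\) and \(\mathcal R^\pi\) are just action-weighted averages of \(\mathcal P^a\) and \(\mathcal R^a\). The sketch below uses a made-up two-state, two-action MDP; the array layouts (e.g. indexing the transition tensor as `P[a, s, s2]`) and all numbers are assumptions for illustration only.

```python
import numpy as np

# Made-up 2-state, 2-action MDP; layouts and numbers are assumptions for this sketch.
P = np.array([[[0.9, 0.1],    # P[a, s, s2] = P(S_{t+1} = s2 | S_t = s, A_t = a)
               [0.4, 0.6]],
              [[0.2, 0.8],
               [0.1, 0.9]]])
R = np.array([[1.0, 0.5],     # R[s, a] = expected reward R_s^a
              [0.0, 2.0]])
pi = np.array([[0.7, 0.3],    # pi[s, a] = pi(a | s)
               [0.5, 0.5]])

# Marginalize the action out, exactly as in the definitions above.
P_pi = np.einsum('sa,asn->sn', pi, P)   # P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}
R_pi = (pi * R).sum(axis=1)             # R^pi_s     = sum_a pi(a|s) R^a_s

print(P_pi)   # each row still sums to 1
print(R_pi)
```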

State-value function of an MDP \begin{align} v_\pi(s) &:= \E_\pi[G_t \given S_t = s], \end{align} where \(\E_\pi\) denotes an expectation over trajectories obtained by following the policy \(\pi\).

Action-value function of an MDP \begin{align} q_\pi(s, a) &:= \E_\pi[G_t \given S_t = s, A_t = a]. \end{align}


Bellman Expectation Equations

We have the following Bellman Expectation Equations: \begin{align} v_\pi(s) &= \mathcal R_s^\pi + \gamma \sum_{s' \in \mathcal S} v_\pi(s') \mathcal P_{ss'}^\pi \\
v_\pi(s) &= \sum_{a \in \mathcal A} \pi(a \given s) \left(\mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} v_\pi(s') \mathcal P_{ss'}^a\right) \\
v_\pi(s) &= \sum_{a \in \mathcal A} \pi(a \given s) q_\pi(s, a) \\
q_\pi(s, a) &= \mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^a \sum_{a' \in \mathcal A} \pi(a' \given s') q_\pi(s', a') \\
q_\pi(s, a) &= \mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^a v_\pi(s'). \end{align}
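
The first equation, stacked over states, reads \(v_\pi = \mathcal R^\pi + \gamma \mathcal P^\pi v_\pi\), so for small MDPs policy evaluation can be done in closed form as \(v_\pi = (I - \gamma \mathcal P^\pi)^{-1} \mathcal R^\pi\), with \(q_\pi\) then given by the last equation. A sketch on the same made-up MDP as in the previous block (all numbers remain illustrative assumptions):

```python
import numpy as np

# Same made-up 2-state, 2-action MDP as in the previous sketch.
P = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.2, 0.8], [0.1, 0.9]]])   # P[a, s, s2]
R = np.array([[1.0, 0.5], [0.0, 2.0]])     # R[s, a]
pi = np.array([[0.7, 0.3], [0.5, 0.5]])    # pi[s, a]
gamma = 0.9

P_pi = np.einsum('sa,asn->sn', pi, P)      # P^pi
R_pi = (pi * R).sum(axis=1)                # R^pi

# First equation in matrix form: v_pi = R^pi + gamma * P^pi v_pi.
v_pi = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)

# Last equation: q_pi(s, a) = R_s^a + gamma * sum_{s'} P^a_{ss'} v_pi(s').
q_pi = R + gamma * np.einsum('asn,n->sa', P, v_pi)

# Consistency check of the third equation: v_pi(s) = sum_a pi(a|s) q_pi(s, a).
assert np.allclose(v_pi, (pi * q_pi).sum(axis=1))
print(v_pi)
print(q_pi)
```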


Optimal policies

We formalize what it means to find an optimal policy.

Optimal value function: \begin{align} v_*(s) &= \max_\pi v_\pi(s). \end{align}

Optimal action-value function: \begin{align} q_*(s, a) = \max_\pi q_\pi(s, a). \end{align}

Optimal policy theorem:

Define a partial order \(\geq\) on policies so that \(\pi \geq \pi'\) if \begin{align} v_\pi(s) \geq v_{\pi'}(s), \quad \forall s \in \mathcal S. \end{align} For any MDP, there exists an optimal policy \(\pi_\ast\) that is better than or equal to all other policies, \(\pi_\ast \geq \pi\) for all \(\pi\); moreover, all optimal policies achieve the optimal value function, \(v_{\pi_\ast}(s) = v_\ast(s)\), and the optimal action-value function, \(q_{\pi_\ast}(s, a) = q_\ast(s, a)\).

One example of an optimal policy is \begin{align} \pi_\ast(a \given s) &= \delta\left(a - \mathrm{argmax}_{a' \in \mathcal A} q_\ast(s, a')\right), \end{align} i.e. the deterministic policy that acts greedily with respect to \(q_\ast\).
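
As a small sketch of what this greedy policy looks like in code (the q-table below is a hypothetical array indexed as `q[s, a]`), the deterministic policy puts all probability mass on the maximizing action:

```python
import numpy as np

# Hypothetical action-value table q[s, a]; any array of this shape would do.
q = np.array([[1.0, 3.0],
              [2.5, 0.5]])

# Deterministic greedy policy: all probability mass on argmax_a q(s, a).
greedy = np.zeros_like(q)
greedy[np.arange(q.shape[0]), q.argmax(axis=1)] = 1.0
print(greedy)   # [[0. 1.] [1. 0.]]
```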


Bellman Optimality Equations

This gives rise to the Bellman Optimality Equations: \begin{align} v_\ast(s) &= \max_{a \in \mathcal A} q_\ast(s, a) \\
v_\ast(s) &= \max_{a \in \mathcal A} \left(\mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^a v_\ast(s')\right) \\
q_\ast(s, a) &= \mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^a v_\ast(s') \\
q_\ast(s, a) &= \mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^a \max_{a' \in \mathcal A} q_\ast(s', a'). \end{align}
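
Because of the \(\max\), these equations are nonlinear in \(v_\ast\) and have no general closed-form solution, so they are typically solved iteratively. Below is a minimal value-iteration sketch that repeatedly applies the second equation as a backup, on the same made-up MDP as in the earlier sketches (all numbers are assumptions):

```python
import numpy as np

# Same made-up 2-state, 2-action MDP as in the earlier sketches.
P = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.2, 0.8], [0.1, 0.9]]])   # P[a, s, s2]
R = np.array([[1.0, 0.5], [0.0, 2.0]])     # R[s, a]
gamma = 0.9

# Value iteration: repeatedly apply the Bellman optimality backup
#   v(s) <- max_a ( R_s^a + gamma * sum_{s'} P^a_{ss'} v(s') ),
# a gamma-contraction, so the iterates converge to v_*.
v = np.zeros(R.shape[0])
while True:
    q = R + gamma * np.einsum('asn,n->sa', P, v)
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

v_star = v_new
q_star = R + gamma * np.einsum('asn,n->sa', P, v_star)
print(v_star)                  # optimal state values v_*
print(q_star.argmax(axis=1))   # acting greedily w.r.t. q_* gives an optimal policy
```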
