Jon's Dev Blog

Notes on Reinforcement Learning 3

January 28, 2023

I was studying reinforcement learning a while ago, attempting to educate myself about deep Q learning. As part of that effort, I read through the first few chapters of Reinforcement Learning: An Introduction by Sutton and Barto. Here are my notes on Chapter 3. Like all of my other notes, these were never intended to be shared, so apologies in advance if they make no sense to anyone.

Chapter 3

Agent-Environment Interface

They mainly cover basic terminology here. The main difference from bandit problems is that the state can change with each action.

  • $\mathcal{S}$ = state space, $\mathcal{A}$ = action space, $\mathcal{R}$ = reward space. These are all finite, with $\mathcal{R} \subseteq \mathbb{R}$.

  • As with the bandit problems, $A_t$ and $R_t$ are the action and reward at time $t$. The book uses $R_{t+1}$ to denote the reward given for action $A_t$. MDPs introduce another time series, $S_t$, to denote the state at time $t$. Thus $S_t$ and $A_t$ "go together," and $S_{t+1}$ and $R_{t+1}$ are "jointly determined."

  • The probability distributions governing the dynamics of an MDP are given by the density function:

    $$p(s', r \,\vert\, s, a) := \mathbb{P}(S_{t+1} = s', R_{t+1} = r \,\vert\, S_t = s, A_t = a)$$

    Other useful equations are:

    $$\begin{aligned} p(s' \,\vert\, s, a) &= \sum_{r\in\mathcal{R}} p(s', r \,\vert\, s, a), \\ r(s, a) &:= \mathbb{E}[R_{t+1} \,\vert\, S_t = s, A_t = a] = \sum_{r\in\mathcal{R}} r \sum_{s'\in\mathcal{S}} p(s', r \,\vert\, s, a), \\ r(s, a, s') &:= \mathbb{E}[R_{t+1} \,\vert\, S_t = s, A_t = a, S_{t+1} = s'] = \sum_{r\in\mathcal{R}} r \cdot \frac{p(s', r \,\vert\, s, a)}{p(s' \,\vert\, s, a)}. \end{aligned}$$
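
To keep the bookkeeping straight, here's a minimal sketch (mine, not the book's; the toy states, actions, and rewards are all made up) that stores the four-argument dynamics $p(s', r \,\vert\, s, a)$ as a plain Python dict and derives the other three quantities from it:

```python
# Dynamics of a toy two-state MDP, stored as
# p[(s, a)][(s_next, r)] = probability of that (next state, reward) pair.
p = {
    ("s0", "a"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "b"): {("s1", 0.0): 1.0},
    ("s1", "a"): {("s1", 1.0): 1.0},
    ("s1", "b"): {("s0", 0.0): 0.2, ("s1", 1.0): 0.8},
}

def p_next_state(s_next, s, a):
    """p(s' | s, a): marginalize the reward out of the joint dynamics."""
    return sum(prob for (sp, _r), prob in p[(s, a)].items() if sp == s_next)

def expected_reward(s, a):
    """r(s, a) = E[R_{t+1} | S_t = s, A_t = a]."""
    return sum(r * prob for (_sp, r), prob in p[(s, a)].items())

def expected_reward_given_next(s, a, s_next):
    """r(s, a, s') = E[R_{t+1} | S_t = s, A_t = a, S_{t+1} = s']."""
    joint = sum(r * prob for (sp, r), prob in p[(s, a)].items() if sp == s_next)
    return joint / p_next_state(s_next, s, a)

print(p_next_state("s1", "s0", "a"))                # 0.5
print(expected_reward("s0", "a"))                   # 0.0 * 0.5 + 1.0 * 0.5 = 0.5
print(expected_reward_given_next("s0", "b", "s1"))  # 0.0
```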

Goals and Rewards

Note that, as with bandit problems, $R_t$ is stochastic. But the reward signal is also the only thing we can really tune about a given system. In practice, the reward is based on the full state-action-state transition, so the randomness comes from the environment.

Key insight: keep rewards simple, with small, finite support. For some reason, I think of this as an extension of defining really simple prior distributions, since in this case the value (return) is determined by percolating rewards backwards from terminal states.

Returns and Episodes

Define a new random variable $G_t$ to be the return at time $t$. So if an agent interacts for $T$ time steps, this would be defined as

$$G_t = R_{t+1} + R_{t+2} + \cdots + R_T.$$

Unified Notation for Episodic and Continuing Tasks

Here, the book allows $T$ to be infinite. In this case, we need a discount factor for future rewards, or else the return would be a potentially divergent series. Let $\gamma$ be the discount factor (possibly equal to 1 for finite episodes), so that

$$\begin{aligned} G_t &:= \sum_{k=0}^\infty \gamma^k R_{t+k+1} \\ &= R_{t+1} + \gamma G_{t+1}. \end{aligned}$$

This unified notation is defined after discussing terminal states, which help deal with the problem of finite episodes. A terminal (absorbing) state is a sink in the state-transition graph: it transitions only to itself and always gives reward zero. This allows us to use infinite sums even for finite episodes.
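
As a sanity check on the recursion, here's a small sketch (mine) that computes $G_t$ for every step of a finite episode by sweeping backwards over the rewards; with $\gamma = 1$ it reduces to the plain episodic sum above:

```python
def returns(rewards, gamma=0.9):
    """Return [G_0, ..., G_{T-1}] given rewards [R_1, ..., R_T],
    via the recursion G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0."""
    g = 0.0
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out

print(returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
print(returns([1.0, 1.0, 1.0], gamma=1.0))  # [3.0, 2.0, 1.0]
```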

Policies and Value Functions

A policy is a conditional distribution over actions, conditioned on a state:

$$\mathbb{P}_\pi(a \,\vert\, s) := \pi(a \,\vert\, s).$$

The value of a state is the expected return, with respect to the policy distribution:

$$v_\pi(s) := \mathbb{E}_\pi[G_t \,\vert\, S_t = s].$$

The quality of an action $a$ at state $s$ is the expected return

$$q_\pi(s, a) := \mathbb{E}_\pi[G_t \,\vert\, S_t = s, A_t = a].$$

We call $q_\pi$ the action-value function.
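
These definitions translate almost directly into code. A rough sketch (mine, reusing the toy dynamics dict `p` and the `returns()` helper from the sketches above) that estimates $v_\pi(s)$ by Monte Carlo, i.e. as an average of sampled returns under a tabular policy:

```python
import random

gamma = 0.9

# A tabular policy: pi[s][a] = probability of taking action a in state s.
pi = {
    "s0": {"a": 0.5, "b": 0.5},
    "s1": {"a": 0.9, "b": 0.1},
}

def sample(dist):
    """Draw one key from a {outcome: probability} dict."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs)[0]

def rollout(s, steps=50):
    """Follow pi from state s for a fixed number of steps; return the rewards."""
    rewards = []
    for _ in range(steps):
        a = sample(pi[s])
        s, r = sample(p[(s, a)])  # toy dynamics dict from the earlier sketch
        rewards.append(r)
    return rewards

def mc_value(s, episodes=2000):
    """Monte Carlo estimate of v_pi(s): average of sampled (truncated) returns."""
    return sum(returns(rollout(s), gamma)[0] for _ in range(episodes)) / episodes

print(mc_value("s0"))  # approximates E_pi[G_t | S_t = s0]
```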

Exercise 3.12

Give an equation for $v_\pi$ in terms of $q_\pi$ and $\pi$.

Solution:

$$\begin{aligned} v_\pi(s) &= \mathbb{E}_\pi[G_t \,\vert\, S_t = s] \\ &= \sum_{a\in\mathcal{A}} \mathbb{E}_\pi[G_t \,\vert\, A_t = a, S_t = s] \cdot \mathbb{P}[A_t = a \,\vert\, S_t = s] \\ &= \sum_{a\in\mathcal{A}} q_\pi(s, a) \cdot \pi(a \,\vert\, s). \end{aligned}$$
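
As a quick numerical check of this identity (my own sketch, reusing the policy dict `pi` from above; the action values are made up):

```python
# Made-up action values for state s0, just to exercise the formula.
q = {("s0", "a"): 4.0, ("s0", "b"): 2.0}

def v_from_q(s, q, pi):
    """Exercise 3.12: v_pi(s) = sum_a pi(a | s) * q_pi(s, a)."""
    return sum(prob * q[(s, a)] for a, prob in pi[s].items())

print(v_from_q("s0", q, pi))  # 0.5 * 4.0 + 0.5 * 2.0 = 3.0
```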

Exercise 3.13

Give an equation for $q_\pi$ in terms of $v_\pi$ and the four-argument $p$.

Solution:

$$\begin{aligned} q_\pi(s, a) &= \mathbb{E}_\pi[G_t \,\vert\, S_t = s, A_t = a] \\ &= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \,\vert\, S_t = s, A_t = a] \\ &= \sum_{s', r} p(s', r \,\vert\, s, a) \cdot \big[ r + \gamma\, \mathbb{E}_\pi[G_{t+1} \,\vert\, S_{t+1} = s', A_t = a] \big] \\ &= \sum_{s', r} p(s', r \,\vert\, s, a) \cdot \big[ r + \gamma\, \mathbb{E}_\pi[G_{t+1} \,\vert\, S_{t+1} = s'] \big] \\ &= \sum_{s', r} p(s', r \,\vert\, s, a) \cdot \big[ r + \gamma\, v_\pi(s') \big]. \end{aligned}$$

The fourth line follows from the third because of the Markov property.
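
And the companion sketch for this direction (mine again, reusing the toy dynamics dict `p` from the first sketch; the value table is made up):

```python
def q_from_v(s, a, v, p, gamma=0.9):
    """Exercise 3.13: q_pi(s, a) = sum_{s', r} p(s', r | s, a) * (r + gamma * v(s'))."""
    return sum(prob * (r + gamma * v[s_next])
               for (s_next, r), prob in p[(s, a)].items())

# Made-up state values, just to exercise the formula.
v = {"s0": 5.0, "s1": 8.0}
print(q_from_v("s0", "a", v, p))  # 0.5*(0 + 0.9*5.0) + 0.5*(1.0 + 0.9*8.0) = 6.35
```

Substituting the Exercise 3.13 expression into the Exercise 3.12 one eliminates $q_\pi$ and gives the Bellman equation for $v_\pi$, which is where the book goes next.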



Written by Jon Lamar: Machine learning engineer, former aspiring mathematician, cyclist, family person.