Jon's Dev Blog

Notes on Reinforcement Learning 4

April 17, 2024

I was studying reinforcement learning a while ago, attempting to educate myself about deep Q learning. As part of that effort, I read through the first few chapters of Reinforcement Learning: An Introduction by Sutton and Barto. Here are my notes on Chapter 4. Like all of my other notes, these were never intended to be shared, so apologies in advance if they make no sense to anyone.

Chapter 4

This chapter is all about dynamic programming, which is exactly what you'd think, although the treatment here is simultaneously more abstract than the typical CS undergrad's view of solving coding problems and very specific to solving tabular MDPs. There is not a lot of practical information in this chapter about solving real problems, but the discussion is worth it for the theoretical backbone it provides.

Policy Evaluation

Let $\pi$ be an arbitrary policy. Recall that by definition, $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t=s]$. This expands into the Bellman equation for $v_\pi$.

$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi[G_t \mid S_t=s] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t=s] \\
&= \mathbb{E}_\pi\bigl[R_{t+1} + \gamma\,\mathbb{E}_\pi[G_{t+1} \mid S_{t+1}] \,\big|\, S_t=s\bigr] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t=s] \\
&= \sum_{a\in\mathcal{A}}\pi(a\mid s)\sum_{\substack{s'\in\mathcal{S} \\ r\in\mathcal{R}}} p(s',r\mid s,a)\bigl(r+\gamma v_\pi(s')\bigr)
\end{aligned}
$$

This relationship between $v_\pi$ and itself allows us to update its values iteratively:

$$
v_{k+1}(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t=s]
$$

This process of assigning a value function to a policy is called policy evaluation (get it?).
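
To make the update above concrete, here is a minimal Python sketch of iterative policy evaluation (my own, not the book's pseudocode). I'm assuming the dynamics are given as an array P[s, a, s'] of transition probabilities and expected rewards R[s, a], which collapses the sum over r in the Bellman equation.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, theta=1e-8):
    """Iteratively apply the Bellman expectation update until the values stop changing.

    P[s, a, s'] : transition probabilities p(s' | s, a)
    R[s, a]     : expected immediate reward for taking action a in state s
    policy[s, a]: action probabilities pi(a | s)
    """
    v = np.zeros(P.shape[0])
    while True:
        # q[s, a] = r(s, a) + gamma * sum_s' p(s' | s, a) v(s')
        q = R + gamma * P @ v
        # v_{k+1}(s) = sum_a pi(a | s) q(s, a)
        v_new = (policy * q).sum(axis=1)
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new
```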

Policy Improvement

Now let's flip the problem. Suppose we have a policy $\pi$ and the resulting value function $v_\pi$. Note that this does not mean $\pi$ is greedy with respect to $v_\pi$: some of the action values $q_\pi(s,a)$ may contradict the policy. For example, suppose $\pi$ is nearly uniform over the actions at a state $s$, and the action $a$ with the highest action value $q_\pi(s,a)$ has a relatively low probability $\pi(a \mid s)$. Then it is possible that $q_\pi(s,a) > v_\pi(s)$. If that is the case, then the policy $\pi'$ defined by selecting $a$ at state $s$ and agreeing with $\pi$ elsewhere is strictly better than $\pi$ (in the sense that its expected return is at least as great in every state and strictly greater at $s$).
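
As a sketch of the improvement step (reusing the same hypothetical P and R arrays from the evaluation sketch above), acting greedily with respect to the current value function looks like this:

```python
def policy_improvement(P, R, v, gamma=0.9):
    """Return the deterministic policy that is greedy with respect to v, as a one-hot array."""
    q = R + gamma * P @ v                  # q[s, a], estimated from the current value function
    greedy = np.zeros_like(R)
    greedy[np.arange(R.shape[0]), q.argmax(axis=1)] = 1.0
    return greedy
```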

Policy Iteration

Thus, once a value function has been defined from a policy, we can improve the policy by replacing it with the greedy policy with respect to that value function. Iterating this two-step process of evaluation followed by improvement is known as policy iteration:

$$
\pi_0 \overset{E}{\longrightarrow} v_{\pi_0} \overset{I}{\longrightarrow} \pi_1 \overset{E}{\longrightarrow} v_{\pi_1} \overset{I}{\longrightarrow} \pi_2 \overset{E}{\longrightarrow} \cdots
$$

This feels eerily similar to Expectation-maximization algorithms in statistics, about which I know almost nothing.
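
Gluing the two sketches above together gives the whole loop. Again, this is my own toy version rather than the book's pseudocode; it stops when the greedy policy no longer changes.

```python
def policy_iteration(P, R, gamma=0.9):
    """Alternate policy evaluation (E) and greedy improvement (I) until the policy is stable."""
    n_states, n_actions = R.shape
    policy = np.full((n_states, n_actions), 1.0 / n_actions)  # start from the uniform policy
    while True:
        v = policy_evaluation(P, R, policy, gamma)       # E step
        new_policy = policy_improvement(P, R, v, gamma)  # I step
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy
```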

Other topics in this chapter

There are other topics, mostly related to the observation that the above algorithm involves repeated full scans of the state-action space at every step. Notably, the authors assert without proof that Bellman updates like the above can be done in piecemeal fashion and will still converge to optimal value-policy pairs.
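
As one illustration of what "piecemeal" can mean (my example, not an algorithm from the book), value iteration can sweep the states one at a time and back each one up in place, reusing values updated earlier in the same sweep:

```python
def value_iteration_in_place(P, R, gamma=0.9, theta=1e-8):
    """Value iteration with in-place (piecemeal) Bellman optimality backups."""
    v = np.zeros(P.shape[0])
    while True:
        delta = 0.0
        for s in range(P.shape[0]):
            # Back up state s immediately, using whatever values are currently in v.
            best = np.max(R[s] + gamma * P[s] @ v)
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            return v
```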



Written by Jon Lamar: Machine learning engineer, former aspiring mathematician, cyclist, family person.