Markov Decision Processes 03: optimal policy variation



This post follows up on what was discussed here, and I will show you how changing the value of the discount factor may affect what the optimal policy is.

For that, consider the MDP defined below:

This MDP is inspired by the example I have been using; I added an extra state and stripped the MDP of most of its transitions so we can focus on what is essential here. From the MDP above there is a clear contender for the title of optimal policy: $$\begin{align} &\pi(H) = E\\ &\pi(T) = D\\ &\pi(TU) = D\end{align}$$

Assume that when we enter the MDP there is a $50\%$ chance we start Hungry and a $50\%$ chance we start Thirsty. If we use the policy $\pi$, then the expected reward is $$E[R | \pi] = 0.5E[R_T | \pi] + 0.5E[R_H | \pi]$$ where $E[R_s | \pi]$ is the expected reward we get by following policy $\pi$ starting from state $s$. From my previous post we know that $E[R_T | \pi] = E[R_H | \pi] = \frac{1}{1-\gamma}$ and hence $$E[R | \pi] = \frac{1}{1-\gamma}$$

But what if we followed the seemingly masochistic policy $\pi'$ where we eat compulsively, getting a negative reward, just to be able to get a bigger reward when we drink and recover? Define $\pi'$ as $$\begin{align} &\pi'(H) = CE\\ &\pi'(TU) = D\\ &\pi'(T) = D\end{align}$$

Can we compute the expected reward for this policy, assuming the same $50/50$ probabilities for the initial state? Of course we can; just notice that $$\begin{cases} \begin{align} &E[R_H | \pi'] = -2 + \gamma E[R_{TU} | \pi'] \\ &E[R_{TU} | \pi'] = 5 + \gamma E[R_T | \pi'] \\ &E[R_T | \pi'] = 1 + \gamma E[R_H | \pi'] \end{align} \end{cases}$$ and we can compute those by solving the system it encodes, $$\begin{cases} \begin{align} &x = -2 + \gamma y \\ &y = 5 + \gamma z \\ &z = 1 + \gamma x \end{align} \end{cases}$$ giving $$\begin{cases} \begin{align} &x = -\frac{\gamma ^2+5 \gamma -2}{\gamma ^3-1}\\ &y = -\frac{-2 \gamma ^2+\gamma +5}{\gamma ^3-1}\\ &z = -\frac{5 \gamma ^2-2 \gamma +1}{\gamma ^3-1} \end{align} \end{cases}$$ for a final expected reward of $$E[R | \pi'] = \frac{6 \gamma ^2 + 3 \gamma - 1}{2 \left(1 - \gamma^3\right)}$$

Is this value greater than $E[R | \pi]$? With a bit more maths (after clearing the denominators, the comparison boils down to $4\gamma^2 + \gamma - 3 = (4\gamma - 3)(\gamma + 1) \geq 0$), one can find that, for the legal values of $\gamma$ (that is, $0 \leq \gamma < 1$), $$E[R | \pi'] \geq E[R | \pi] \iff \gamma \geq \frac34$$ For example, if $\gamma = 0.9$, then the regular policy $\pi$ has an expected reward of $10$ but the masochistic policy $\pi'$ is worth $\approx 12.1$; at $\gamma = 0.75$ they are both worth $4$; and for $\gamma = 0.5$, $\pi$ has expected reward $2$ while $\pi'$ only has $\approx 1.14$.
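If you want to check these numbers yourself, here is a minimal sketch in Python (assuming `numpy`; the transition table is reconstructed from the Bellman equations above, since the diagram itself is not reproduced here, so treat the state and action names as an assumption). It evaluates a deterministic policy by solving the linear system $(I - \gamma P)V = r$:

```python
import numpy as np

# (state, action) -> (reward, next state), reconstructed from the
# Bellman equations in the text; the diagram is assumed, not quoted.
MDP = {
    ("H", "E"):  (1, "T"),    # Eat while Hungry
    ("H", "CE"): (-2, "TU"),  # Compulsively Eat while Hungry
    ("T", "D"):  (1, "H"),    # Drink while Thirsty
    ("TU", "D"): (5, "T"),    # Drink after eating compulsively
}
STATES = ["H", "T", "TU"]

def evaluate(policy, gamma):
    """Expected reward per state for a deterministic policy:
    solve (I - gamma * P) V = r."""
    n = len(STATES)
    P, r = np.zeros((n, n)), np.zeros(n)
    for i, s in enumerate(STATES):
        reward, nxt = MDP[(s, policy[s])]
        r[i] = reward
        P[i, STATES.index(nxt)] = 1.0
    return dict(zip(STATES, np.linalg.solve(np.eye(n) - gamma * P, r)))

pi  = {"H": "E",  "T": "D", "TU": "D"}   # the regular policy
pi2 = {"H": "CE", "T": "D", "TU": "D"}   # the masochistic policy

for gamma in (0.5, 0.75, 0.9):
    v, w = evaluate(pi, gamma), evaluate(pi2, gamma)
    print(gamma,
          0.5 * (v["H"] + v["T"]),    # E[R | pi]  with the 50/50 start
          0.5 * (w["H"] + w["T"]))    # E[R | pi'] with the 50/50 start
```

Running this prints $2$ versus $\approx 1.14$ at $\gamma = 0.5$, a tie at $4$ for $\gamma = 0.75$, and $10$ versus $\approx 12.1$ at $\gamma = 0.9$, matching the closed-form expressions above.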

This actually makes some sense: we would only consider the policy $\pi'$ if the $+5$ we can get, even though it comes amortized, is better than the $2$ we lose right now; i.e. if we see the $+5$ as a long-term investment that will compensate that loss. In particular, losing $2$ now and gaining an amortized $5\gamma$ must be better than the safer option, which is to go back and forth between being Hungry and Thirsty: if we are starting in the Hungry position, we either do $$CE \to D \to D,\ \text{following } \pi'$$ or we do $$E \to D \to E,\ \text{following } \pi$$

The first sequence is worth $-2 + 5\gamma + \gamma^2$ while the second sequence of actions is worth $1 + \gamma + \gamma^2$; therefore we would prefer $\pi'$ over $\pi$ whenever $$\begin{align} -2 + 5\gamma + \gamma^2 > 1 + \gamma + \gamma^2 &\iff -2 + 5\gamma > 1 + \gamma \\ &\iff -2 + 4\gamma > 1 \\ &\iff \gamma > \frac34 \end{align} $$

If we start in the Thirsty position, we either do $$D \to CE \to D,\ \text{following } \pi'$$ or we do $$D \to E \to D,\ \text{following } \pi$$

The first sequence is worth $1 - 2\gamma + 5\gamma^2$ while the second sequence of actions is worth $1 + \gamma + \gamma^2$; this means that, when we start Thirsty, we would prefer $\pi'$ over $\pi$ whenever (remember that $\gamma \geq 0$) $$\begin{align} 1 - 2\gamma + 5\gamma^2 > 1 + \gamma + \gamma^2 &\iff - 3\gamma + 4\gamma^2 > 0 \\ &\iff \gamma > \frac34 \end{align}$$

So everything adds up nicely: regardless of our starting position, we would prefer $\pi'$ over $\pi$ whenever $\gamma > 0.75$, which is exactly the final result we got.
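As a quick sanity check of this three-step comparison, the snippet below (plain Python, using the same reward sequences as above) computes the discounted value of each short sequence and confirms that $\pi'$ only wins strictly when $\gamma > 3/4$:

```python
def discounted(rewards, gamma):
    """Discounted value of a short sequence of rewards."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

for gamma in (0.5, 0.75, 0.9):
    # Starting Hungry:  CE -> D -> D under pi'  vs  E -> D -> E under pi
    hungry = discounted([-2, 5, 1], gamma) > discounted([1, 1, 1], gamma)
    # Starting Thirsty: D -> CE -> D under pi'  vs  D -> E -> D under pi
    thirsty = discounted([1, -2, 5], gamma) > discounted([1, 1, 1], gamma)
    print(gamma, hungry, thirsty)   # True only for gamma > 0.75
```

At $\gamma = 0.75$ exactly, both comparisons come out as ties, which is consistent with the two policies having the same expected reward there.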

This concludes this post. We have seen how the value of $\gamma$ can affect what the optimal policy is! This example might seem fabricated, but that is because I wanted to show these changes in the optimal policy with the simplest example possible. In real life, MDPs are much bigger and more complex, so it shouldn't be hard to believe that the value of $\gamma$ can end up being decisive.

In a future post, we will tackle algorithms to find the optimal policy in a general MDP, and hopefully I will find a good application example to show how they work!

This post was brought to you by the #100DaysOfMLCode initiative.

All posts on the MDP series:
MDP#01 Basics
MDP#02 Discount Factor
MDP#03 Optimal Policy Variation

  - RGS
