Markov Decision Processes 02: how the discount factor works



< change language

In this previous post I defined a Markov Decision Process and explained all of its components; now, we will be exploring what the discount factor $\gamma$ really is and how it influences the MDP.

Let us start with the complete example of last post:

In this MDP the states are Hungry and Thirsty (which we will represent with $H$ and $T$) and the actions are Eat and Drink (which we will represent with $E$ and $D$). The transition probabilities are specified by the numbers on top of the arrows. In the previous post we put forward that the best policy for this MDP was defined as $$\begin{cases} \pi(H) = E\\ \pi(T) = D\end{cases}$$ but I didn't really prove that. I will do that in a second, but first what are all the other possible policies? Well, recall that the policy $\pi$ is the "best strategy" to be followed, and $\pi$ is formally seen as a function from the states to the actions, i.e. $\pi: S \to A$. Because of that, we must know what $\pi(H)$ and $\pi(T)$ are: $$\begin{cases} \pi(T) = \text{action to be taken when Thirsty}\\ \pi(H) = \text{action to be taken when Hungry}\end{cases}$$ We know that both $\pi(H)$ and $\pi(T)$ can be either $D$ or $E$ so we really have $4$ possible policies: $$\begin{align} \pi_1(H) = E&,\ \pi_1(T) = E\\ \pi_2(H) = E&,\ \pi_2(T) = D\\ \pi_3(H) = D&,\ \pi_3(T) = E\\ \pi_4(H) = D&,\ \pi_4(T) = D\end{align}$$ All we have to do now is to measure how good each policy is, and to do that, we will use the rewards! Intuitively, what would be the best policy? The best policy is the one that allows us to collect more rewards.

How do we measure the rewards we can collect, then? We use the mathematical notion of expected value. For each policy, we should compute the expected reward we would get by following it, and then stick with the policy that has a bigger expected reward. We will compute the expected reward of $\pi_2$, the policy I claimed to be the best one, but first we will start with this simpler MDP:

In this MDP we only have one state and one action, which gives a reward of $+1$ whenever we choose it. The obvious policy here is to take the only action available. Suppose we are at a given time step $t$. To compute how good our position in the MDP is, we have to consider the reward we will get when we perform the next action (which will give us $+1$) but we must also consider the rewards we will get in the future:

except that those rewards we will get in the future aren't as good as the $+1$ we are going to receive right now! To compensate for that, we will make them less valuable by multiplying a number smaller than $1$ which we call a discount factor: our $\gamma$. When a reward $r$ is only one time step away, we multiply it by $\gamma$ yielding $\gamma r$; if a reward is three time steps away, we multiply it by $\gamma^3$ giving $\gamma^3 r$ and so on and so forth:

Therefore we get that the expected reward is $$1 + \gamma + \gamma^2 + \gamma^3 + \cdots$$ which can be simplified to $\frac{1}{1 - \gamma}$ when $0 \leq \gamma < 1$.

This computation was fairly easy because our actions never had bifurcations: even though we were looking into the future, we were always sure of where we would be... Now we will compute how good the policy $\pi_2$ is!

Let us assume that when we enter the MDP, we have a $50\%$ probability of starting Hungry and a $50\%$ probability of starting Thirsty. This means that the expected worth of the policy is $$E[R | \pi_2] = 0.5E[R_H|\pi_2] + 0.5E[R_T|\pi_2]$$ where $E[R_H | \pi_2]$ is the expected worth of starting Hungry and $E[R_T | \pi_2]$ is the expected worth of starting Thirsty.

How do we compute $E[R_H | \pi_2]$? Recall that this expected value can be interpreted as the total rewards we will get if we start Hungry and then follow the policy $\pi_2$. If we are Hungry and follow $\pi_2$, our action will be $\pi_2(H) = E$. If we Eat when we are Hungry, we get a reward $+1$ and then: there is a $10\%$ chance that we stay Hungry and there is a $90\%$ chance that we become Thirsty. This means that $$E[R_H | \pi_2] = 1 + 0.1\gamma E[R_H | \pi_2] + 0.9\gamma E[R_T | \pi_2]$$ What does this expression mean? The term $+1$ is the immediate reward we get when we choose to Eat and the remainder $0.1\gamma E[R_H | \pi_2] + 0.9\gamma E[R_T | \pi_2]$ accounts for the future rewards we hope to get (as in, expect to get).
Recall that when we are evaluating the worth of an action, we also have to think ahead (into the future) and account for the rewards we might get if we go down that road, making them slightly less valuable by multiplying with $\gamma$. Therefore, $$0.1\gamma E[R_H | \pi_2] + 0.9\gamma E[R_T | \pi_2]$$ breaks down into $0.1\gamma E[R_H | \pi_2]$ for what we might get if, after Eating, we stay Hungry ($E[R_H]$), and $0.9\gamma E[R_T | \pi_2]$ for what we might get if we become Thirsty ($E[R_T]$). Those expected values are multiplied by $\gamma$ because they are only future rewards and then the extra coefficient, either $0.1$ or $0.9$, accounts for the fact that we are not certain of where we will be.

In a similar way we get that $$E[R_T | \pi_2] = 1 + 0.1\gamma E[R_T | \pi_2] + 0.9\gamma E[R_H | \pi_2]$$ For the sake of brevity let me write $x = E[R_H | \pi_2]$ and $y = E[R_T | \pi_2]$. Rewriting the two equations from above we have $$\begin{cases} x = 1 + 0.1\gamma x + 0.9\gamma y \\ y = 1 + 0.1\gamma y + 0.9\gamma x \end{cases}$$ which is a system of two equations with two unknowns! If we solve it we find $$\begin{cases} x = E[R_H | \pi_2] = \frac{1}{1 - \gamma} \\ y = E[R_T | \pi_2] = \frac{1}{1 - \gamma} \end{cases}$$ that we can plug in the first expression $$E[R | \pi_2] = 0.5E[R_H|\pi_2] + 0.5E[R_T|\pi_2]$$ to get $$E[R | \pi_2] = \frac{1}{1 - \gamma}$$ It is not surprising that this gave the exact same result as the simpler MDP above: even though we are not sure of where we will start and we won't be able to tell to where will be going, whenever we act we get a reward of $+1$; for example, this could happen:

If $\pi_2$ is in fact the best policy, then this value $\frac{1}{1 - \gamma}$ will be greater than the worth of any of the three remaining policies. It is fairly simple to see that $E[R | \pi_3] = 0$. Computing $E[R | \pi_1]$ and $E[R | \pi_4]$ is similar, so we will do $\pi_1$ only: $$\begin{cases} E[R_H | \pi_1] = 1 + 0.1\gamma E[R_H | \pi_1] + 0.9\gamma E[R_T | \pi_1]\\ E[R_T | \pi_1] = 0 + \gamma E[R_T | \pi_1] \end{cases}$$ which gives (if $\gamma < 1$) $$\begin{cases} E[R_H | \pi_1] = \frac{1}{1 - 0.1\gamma}\\ E[R_T | \pi_1] = 0 \end{cases}$$ that adds up to $$E[R | \pi_1] = 0.5\frac{1}{1 - 0.1\gamma} < \frac{1}{1 - \gamma} = E[R | \pi_2]$$ showing that the policy $\pi_2$ is the best one!

After all these calculations, you could try to convince yourself that the general expressions for $E[R | \pi]$ and $E[R_s | \pi]$ are $$E[R | \pi] = \sum_{s \in S} P(s_0 = s)E[R_s | \pi]$$ where $P(s_0 = s)$ is the probability that we start at state $s$, and writing $a = \pi(s)$ $$E[R_s | \pi] = \sum_{s' \in S} P_{a}(s, s')\left( R_{a}(s, s') + \gamma E[R_{s'} | \pi] \right)$$
In this post we calculated some expected rewards with the discount factor. In the next post we will see an example of how changing the discount factor can change the best policy to follow!

This post was brought to you by the #100DaysOfMLCode initiative.

All posts on the MDP series:
MDP#01 Basics
MDP#02 Discount Factor
MDP#03 Optimal Policy Variation
(tradução ainda não está disponível)

  - RGS

Popular posts from this blog

Pocket maths: how to compute averages in your head

Tutorial on programming a memory card game