Markov Decision Processes 01: the basics




In this post I will introduce Markov Decision Processes, a common tool used in Reinforcement Learning, a branch of Machine Learning. By the end of the post you will be able to make some sense of the figure above!
I will couple the formal details, definitions and maths with an intuitive example that will accompany us throughout this post. In later posts we will make our example more complete and use other examples to explain other properties and characteristics of MDPs.

Let me introduce the context of the example:
From a simplistic point of view, I only have two moods: "hungry" and "thirsty". Thankfully, my parents taught me how to eat and how to drink, so I can fulfill the needs I just mentioned. Of course, eating when I am hungry makes me happy, just as drinking when I am thirsty makes me happy! Not only that, but eating when I am hungry usually satisfies me, much like drinking when I am thirsty usually satisfies me.

Suppose that, given the situation I just described, you want to formally study how my happiness evolves as I take different actions and feel different moods. Markov Decision Processes can be used for exactly that: studying a system with different states (my moods) and different possible actions (eating/drinking), as well as planning the optimal behaviour (the behaviour that makes me the happiest).

Googling for the definition of MDP will eventually take you to the mathematical one, so we might as well just deal with it now. An MDP is a quintuple $(S,A,P_a,R_a,\gamma)$ where:
$S$ is a set of states, describing the possible situations you can be in; in my example, $S = \{ \text{Hungry}, \text{Thirsty} \}$ i.e., the states are my moods.

$A$ is the set of actions we can take in each state; in my example, $A = \{ \text{Eat}, \text{Drink} \}$.

$P_a$ is the collection of all transition probabilities. $P_a(s, s') = P(s_{t+1} = s' | s_t = s, a_t = a)$ is the probability that we transition to state $s'$ if we take action $a$ while in state $s$. In our notation, $s_t$ is the state we are in at time step $t$ and $a_t$ is the action performed at time step $t$. In the example (check the figure below) we have $P_E(H, T) = P_D(T, H) = 0.9$ (if we Eat when Hungry/Drink when Thirsty, there is a $90\%$ chance we flip states); $P_E(H, H) = P_D(T, T) = 0.1$ (if we Eat when Hungry/Drink when Thirsty, there is a $10\%$ chance we don't satisfy our needs); $P_E(T, T) = P_D(H, H) = 1$ (if we Eat when Thirsty/Drink when Hungry, we end up exactly as we started).
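To make the notation concrete, here is a minimal Python sketch of these transition probabilities; the single-letter labels "H", "T", "E", "D" are just shorthand I am choosing for this illustration:

```python
import random

# Transition probabilities P_a(s, s') as a nested dictionary:
# P[action][state] maps each possible next state to its probability.
P = {
    "E": {"H": {"T": 0.9, "H": 0.1}, "T": {"T": 1.0}},  # Eat
    "D": {"T": {"H": 0.9, "T": 0.1}, "H": {"H": 1.0}},  # Drink
}

def sample_next_state(state, action):
    """Draw s_{t+1} according to P_a(s, s')."""
    next_states = list(P[action][state])
    weights = list(P[action][state].values())
    return random.choices(next_states, weights=weights)[0]

print(sample_next_state("H", "E"))  # prints "T" about 90% of the time
```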

$R_a$ is the collection of all immediate rewards. $R_a(s, s')$ is a numerical value representing the reward we get when we reach state $s'$ by using action $a$ when in state $s$. Notice that the reward may be good, if $R_a(s, s') > 0$, or it may be bad, if $R_a(s, s') < 0$. In our example the rewards won't depend on $s'$; for example, if we are Hungry and we Eat we get the exact same reward regardless of the state we go to: $R_E(H,H) = R_E(H,T) = R_D(T,T) = R_D(T,H) = +1$, i.e. trying to meet my needs gives a $+1$ reward, while $R_E(T,T) = R_D(H,H) = 0$, i.e. ignoring my needs is neither rewarded nor penalized.
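Continuing the sketch, and because in this example the reward does not depend on the next state $s'$, a small lookup table indexed by action and current state is enough (again, using my shorthand labels):

```python
# Immediate rewards R_a(s, s'); here they don't depend on s',
# so R[action][state] is a single number.
R = {
    "E": {"H": 1, "T": 0},  # Eating pays off when Hungry, does nothing when Thirsty
    "D": {"T": 1, "H": 0},  # Drinking pays off when Thirsty, does nothing when Hungry
}
```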

Finally, $\gamma \in [0, 1]$ is the discount factor, which can be thought of as an exchange rate: getting a $+1$ right now is as good as getting $+1/\gamma$ in the next time step. For example, if $\gamma = 0.5$, then getting a $+2$ in the next time step is as good as getting a $+1$ in this time step. When we are trying to decide what the best action to take at a given point is, we need to consider the rewards of the possible actions, BUT we should also look some steps ahead and take into account the rewards of the actions that will become available in the future. The bigger $\gamma$ is, the more importance we give to future rewards. When $\gamma$ is small, it is as if we don't really care much about what is going to happen in the future; we only care about getting rewards now. The two limits are $\gamma = 0$ and $\gamma = 1$. When $\gamma = 0$, we care nothing about the future: a reward of $+1$ now is better than any reward you can be offered in the next time step. When $\gamma = 1$, all future rewards are equally important, whether they are promised to you in the next time step or in $1000000$ time steps.
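To see how $\gamma$ weighs rewards that arrive at different time steps, here is a tiny sketch of the discounted sum of a sequence of rewards:

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(discounted_return([1, 1, 1], 0.5))  # 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1, 1, 1], 0.0))  # only the immediate reward counts: 1.0
print(discounted_return([1, 1, 1], 1.0))  # every reward counts the same: 3.0
```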

Now that we have defined an MDP, what do we do with it? Usually one is looking for a strategy that maximizes one's rewards. Think of the MDP as a game and suppose the rewards are your points. Whenever you are in a state you must choose an action, and then your points change according to the reward. Your task is to find the best possible strategy for the game. In this context, we usually call our "winning strategy" a policy. Commonly represented as $\pi$, the policy is a function $\pi: S \to A$ that receives states and outputs actions: when we are in a state $s \in S$ we should take action $\pi(s)$. For our example it is quite clear that our winning strategy (i.e. our policy) is to Drink when we are Thirsty ($\pi(T) = D$) and to Eat when we are Hungry ($\pi(H) = E$), right?
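Putting the pieces together, here is a small sketch that follows this policy and accumulates the discounted reward, reusing `P`, `R` and `sample_next_state` from the snippets above (the episode length and the value of $\gamma$ are arbitrary choices for the illustration):

```python
# The policy from the example: Eat when Hungry, Drink when Thirsty.
policy = {"H": "E", "T": "D"}

def run_episode(start_state, steps=10, gamma=0.9):
    """Follow the policy for a fixed number of steps,
    summing the discounted rewards collected along the way."""
    state, total = start_state, 0.0
    for k in range(steps):
        action = policy[state]
        total += gamma**k * R[action][state]
        state = sample_next_state(state, action)
    return total

print(run_episode("H"))  # sums 0.9**k for k = 0..9, roughly 6.51
```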

How can we justify that? And for more complex examples, how do we find a winning strategy?
To answer the first question, a bit of maths can do the trick... as for the second question, the area of Reinforcement Learning provides us with algorithms to find the winning strategy!

Both questions will be answered in future posts, so subscribe to our mailing list, like the blog on facebook and follow the blog on instagram so as to stay tuned!

This post was brought to you because of the #100DaysofMLCode initiative.

All posts on the MDP series:
MDP#01 Basics
MDP#02 Discount Factor
MDP#03 Optimal Policy Variation
(translation unavailable for now)

  - RGS
