Reinforcement Learning 101

Reinforcement learning is one of the most exciting branches of data science right now. It is how AlphaGo, the program built by Google's DeepMind, beat the world's top-ranked Go player in 2017, and it plays a part in many of the game-playing and decision-making artificial intelligence systems in use today.

Rather than being given explicit instructions about which actions to take, the learning agent is placed in an environment whose rules are programmed in, and it gets rewarded or penalised based on its actions. It learns from these outcomes, so that the next time it is in the same scenario it knows what it has tried before and how well (or badly) it went those times.

The process is similar to house-training a dog. When the dog messes in the correct place, it gets praised and rewarded. Conversely, if it messes in the wrong place, it may get punished. Over time, the dog learns to repeat the behaviours that earn rewards and to avoid the behaviours that lead to punishments.


Consider the example shown below. The dog needs to mess every 12 seconds. The red box is the designated mess area, and messing anywhere but the red box results in a large punishment for our virtual dog. Conversely, the green box is the dog's bed, so it wants to maximise the time spent there. You can see that the dog stays in bed for as long as it can, and then makes its way to the red box to mess and then heads back to bed - exactly as would be expected!

To show that this behaviour isn't simply hard-coded, I have made two squares inaccessible, so the dog can't walk through them. These are the black squares, and they are chosen at random every time the page is refreshed.

How it works

This example is coded up using a simple Q-table. Information about the current "state" (location and time remaining) is used to index a large score-card, which holds a score for each action that could be taken from that state. That score-card is the Q-table.
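To make the score-card idea concrete, here is a minimal Python sketch of such a table. It is not the code behind the demo: the grid size, the 12 time values and the 5 actions are taken from the figures quoted later in this post, while the action names and the `best_action` helper are made up for illustration.

```python
from itertools import product

# Assumed dimensions: a 4x3 grid and 12 possible "time remaining" values.
N_COLS, N_ROWS, N_TIMES = 4, 3, 12
ACTIONS = ["up", "down", "left", "right", "stay"]  # hypothetical action names

# One list of scores per state; a state is (column, row, time remaining).
q_table = {
    state: [0.0] * len(ACTIONS)
    for state in product(range(N_COLS), range(N_ROWS), range(N_TIMES))
}

def best_action(q_table, state):
    """Look up the scores for one state and return the highest-scoring action."""
    scores = q_table[state]
    return ACTIONS[scores.index(max(scores))]

# Pretend the agent has learned that "left" is good in one particular state.
q_table[(2, 1, 5)][ACTIONS.index("left")] = 1.0
print(best_action(q_table, (2, 1, 5)))  # prints "left"
```

Choosing an action is then just a lookup followed by picking the largest score, which is why this approach is quick to run.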

At first, the Q-table has no information to work from, so the agent makes moves randomly and sees what happens. If a move results in a reward, the Q-value stored for that state and action is increased. Conversely, if the chosen action results in a punishment of some sort, the Q-value is lowered. As the agent learns over time, it can take fewer random actions, because it has learned which ones result in rewards rather than punishments.
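The loop described above can be sketched with the standard tabular Q-learning update and an "epsilon-greedy" rule for choosing between random and learned moves. This is an illustration under assumptions, not the demo's actual code: the environment is a made-up five-cell corridor with a reward at the far end (much smaller than the dog's grid), and the learning rate, discount factor and exploration schedule are arbitrary choices.

```python
import random

N_CELLS = 5              # hypothetical corridor: cells 0..4, reward in cell 4
ACTIONS = [-1, +1]       # step left or step right
ALPHA, GAMMA = 0.5, 0.9  # assumed learning rate and discount factor

def step(state, action):
    """Move within the corridor; reaching the last cell pays 1 and ends the episode."""
    next_state = min(max(state + action, 0), N_CELLS - 1)
    done = next_state == N_CELLS - 1
    reward = 1.0 if done else -0.01  # small penalty for every other move
    return next_state, reward, done

def choose_action(q_table, state, epsilon, rng):
    """Epsilon-greedy: act randomly with probability epsilon, otherwise use the table."""
    if rng.random() < epsilon:
        return rng.randrange(len(ACTIONS))
    scores = q_table[state]
    return scores.index(max(scores))

def train(episodes=200, seed=0):
    rng = random.Random(seed)
    q_table = [[0.0, 0.0] for _ in range(N_CELLS)]  # starts with no information
    for episode in range(episodes):
        epsilon = max(0.05, 1.0 - episode / 100)  # fewer random moves over time
        state, done = 0, False
        while not done:
            a = choose_action(q_table, state, epsilon, rng)
            next_state, reward, done = step(state, ACTIONS[a])
            # Target = immediate reward plus discounted best score of the next state.
            future = 0.0 if done else max(q_table[next_state])
            target = reward + GAMMA * future
            # Nudge the stored score towards the target.
            q_table[state][a] += ALPHA * (target - q_table[state][a])
            state = next_state
    return q_table

q_table = train()
# Best learned move for each non-terminal cell.
policy = [ACTIONS[row.index(max(row))] for row in q_table[:-1]]
print(policy)  # prints [1, 1, 1, 1]
```

After training, the best-scoring move in every cell is "step right", towards the reward, even though nothing in the code says so directly.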

Looking to the Future

The Q-table approach is simple and generally pretty quick to run, but it can have a large memory overhead. It needs to store a result for every action in every possible state. For a 4*3 grid with 12 seconds between incidents and 5 possible actions in any state, that is 4*3*12*5 = 720 values, which is easily manageable. However, for a game of Connect 4*, where the grid is 6*7 and each grid-space can take one of three different states (red, yellow, or empty), the number of possible arrangements is 3^(6*7) = 3^42, or roughly 109 billion billion, which is very much unmanageable. (That figure counts every arrangement, including ones that could never come up in a real game, but the reachable ones are still far too many for a table.)
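The arithmetic is easy to check in Python. The variable names below are just for illustration.

```python
# Dog example: 4*3 grid positions, 12 time values, 5 actions.
dog_entries = 4 * 3 * 12 * 5

# Connect 4: each of the 6*7 grid-spaces is red, yellow or empty.
connect4_boards = 3 ** (6 * 7)

print(dog_entries)               # prints 720
print(connect4_boards)           # prints 109418989131512359209
print(f"{connect4_boards:.3e}")  # prints 1.094e+20
```

A billion billion is 10^18, so 1.094e+20 is the "109 billion billion" quoted above.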

* Connect 4 was the first example I tried for this blog, but I couldn't get a satisfactory AI with Q-tables. Then I had the idea for a pooping dog and the rest is history...

The most advanced reinforcement learning usually uses Deep Learning to train the agent. It is much more sophisticated, but it can also significantly improve results. A neural network is used to estimate (rather than look up) the best action to take for a given state. And by getting creative with the network architecture and back-propagation methods, it is possible to further improve the agent's behaviour and how quickly it learns, without unreasonable demands for memory or other resources.
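To give a feel for "estimate rather than look up", here is a rough sketch of a tiny network that takes a Connect 4 board and returns one score per column. Everything here is an assumption for illustration: the board encoding, the single hidden layer of 64 units and the helper names are invented, the weights are random, and no training is shown. Real systems use a deep learning library rather than plain Python lists.

```python
import math
import random

N_CELLS, N_HIDDEN, N_ACTIONS = 42, 64, 7  # 6*7 board, assumed hidden size, 7 columns

def init_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) plus zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(inputs, layer):
    """One fully connected layer: a weighted sum plus bias for each output."""
    weights, biases = layer
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]

def q_values(board, hidden_layer, output_layer):
    """Estimate one score per column for a board encoded as 42 numbers."""
    hidden = [max(0.0, h) for h in dense(board, hidden_layer)]  # ReLU activation
    return dense(hidden, output_layer)

rng = random.Random(0)
hidden_layer = init_layer(N_CELLS, N_HIDDEN, rng)
output_layer = init_layer(N_HIDDEN, N_ACTIONS, rng)

empty_board = [0.0] * N_CELLS  # assumed encoding: 0 = empty, +1 = red, -1 = yellow
estimates = q_values(empty_board, hidden_layer, output_layer)

# Total number of weights and biases the network has to store.
n_params = N_CELLS * N_HIDDEN + N_HIDDEN + N_HIDDEN * N_ACTIONS + N_ACTIONS
print(len(estimates), n_params)  # prints "7 3207"
```

The point of the comparison is the storage cost: this network holds a few thousand numbers no matter how many board arrangements exist, whereas a table needs an entry for every one of them.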

Deep Learning is almost certainly the future of reinforcement learning. That is just one more reason (to add to the many others) to get familiar with Deep Learning!

Final notes

This has been one of the most fun coding examples I have blogged about. I'm not sure I've ever had variables called "time-to-poop" or "poop-interval" before. I guess there really is a first time for everything!