Gaming vicariously 🕹️ 👾 – Deep Reinforcement Learning
Supervisor: Dr L.J.M. Aslett
Reinforcement learning enables us to train models which aim to maximise some reward in a given environment, and crucially to do so without having to manually figure out the optimal strategy. Instead, in order to train the model we essentially conduct massive trial-and-error experimentation, slowly learning a strategy which increases the reward. In reinforcement learning, these models are often called 'agents' to reflect the fact that they are making decisions and learning from their actions, but essentially these are just statistical models under the hood.
The simplest example of this might be learning to play a game of noughts and crosses without having to manually program a strategy. Our model (or agent) will initially make random placements of its symbol (X or O) each turn and, very occasionally, just by pure chance it will win a game. The goal is then to update the model, nudging it towards the choices which resulted in a win and away from those that resulted in losses: i.e. to reinforce good choices. Reinforcement learning provides the rigorous statistical and mathematical framework to achieve this.
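To make the trial-and-error idea concrete, here is a minimal, purely illustrative sketch in Python (much cruder than proper reinforcement learning, and not the method used in this project): the agent plays noughts and crosses against a random opponent and simply nudges a preference score for every move it made up after a win and down after a loss. All names and settings below are hypothetical.

```python
# Toy sketch of trial-and-error learning for noughts and crosses.
# The agent plays 'X' against a uniformly random 'O' opponent.
import random
from collections import defaultdict

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

pref = defaultdict(float)   # (board state, move) -> preference score
alpha = 0.1                 # size of the nudge after each game

def agent_move(board, explore=0.2):
    """Mostly pick the currently-preferred move, sometimes explore at random."""
    moves = [i for i, s in enumerate(board) if s == " "]
    if random.random() < explore:
        return random.choice(moves)
    state = "".join(board)
    return max(moves, key=lambda m: pref[(state, m)])

for episode in range(50_000):
    board, history, player = [" "] * 9, [], "X"
    while winner(board) is None and " " in board:
        if player == "X":
            m = agent_move(board)
            history.append(("".join(board), m))   # remember the choice made
        else:
            m = random.choice([i for i, s in enumerate(board) if s == " "])
        board[m] = player
        player = "O" if player == "X" else "X"
    result = winner(board)
    reward = 1.0 if result == "X" else (-1.0 if result == "O" else 0.0)
    for state, move in history:   # reinforce (or penalise) every choice made
        pref[(state, move)] += alpha * reward
```

Even this crude scheme will slowly start to favour winning moves; reinforcement learning replaces the ad hoc nudge with a principled, gradient-based update.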
Because of the entirely simulated environment, video games have proved to be the perfect breeding ground for recent advances in reinforcement learning. Indeed, in 2013 Google DeepMind kick-started a resurgence of interest when they trained the first "deep reinforcement learning" agent (that is, using deep neural networks as the model) to play retro arcade games, using only the individual pixel values of the screen as input (Mnih et al, 2013; Mnih et al, 2015). This is remarkable because the model encodes no domain knowledge or understanding of the rules of the game – these are all learned by the model purely through experience. Progress was rapid, and six years later Google DeepMind again broke new ground, training an agent which achieved Grandmaster level in the far more challenging StarCraft II game (Vinyals et al, 2019). Application of these advances is now happening outside video games, for example in robotics, where extensive initial training is done on a virtual physics simulation, followed by transfer learning to the physical world (Tobin et al, 2017).
As a tangible example (and to explain the project title!), below you can see me playing the classic Atari game Pong vicariously through an agent that I trained using so-called policy gradient reinforcement learning (Williams, 1992). Here, the agent is a very simple logistic regression model, where we want to predict whether to move the paddle up or down given the current state of the game:
$$P(\text{move up}) = \frac{e^{x^T \beta}}{1 + e^{x^T \beta}}, \qquad P(\text{move down}) = 1 - P(\text{move up})$$

where $x \in \mathbb{R}^d$ is a set of inputs, such as the ball position, direction of travel, paddle position, etc., and $\beta \in \mathbb{R}^d$ are the model parameters to be learned. Logistic regression is the core building block of deep neural networks, so this is a simple starting point for a more sophisticated approach. Note that such a simple up/down model means the agent always has to be moving the paddle, which gives it the appearance of being high on caffeine ... clearly this could be improved! Every game that the model plays all the way through to either a win or a loss is called an 'episode'. I have saved the model after each additional order of magnitude in training episodes so that you can see the progression of learning below.
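For concreteness, here is a hedged sketch of how such a logistic policy can be sampled from and updated with a Monte Carlo policy gradient (REINFORCE-style) rule. The feature extraction and the game loop that would produce the state–action pairs and the episode reward are omitted, and all names below are illustrative rather than part of the actual trained agent shown on this page.

```python
# Sketch: logistic policy for up/down paddle moves, plus a REINFORCE-style
# update applied after each complete episode (win or loss).
import numpy as np

rng = np.random.default_rng(0)
d = 6                        # e.g. ball x/y, ball velocity x/y, two paddle y's
beta = np.zeros(d)           # model parameters to be learned

def p_up(x, beta):
    """P(move up) = exp(x'beta) / (1 + exp(x'beta)), i.e. the logistic model."""
    return 1.0 / (1.0 + np.exp(-(x @ beta)))

def sample_action(x, beta):
    """Sample an action from the policy: 1 = move up, 0 = move down."""
    return int(rng.random() < p_up(x, beta))

def grad_log_policy(x, a, beta):
    """Gradient of log pi(a | x) for the logistic policy: (a - P(up)) * x."""
    return (a - p_up(x, beta)) * x

def reinforce_update(beta, episode, total_return, lr=1e-3):
    """One Monte Carlo policy gradient step after a complete episode.

    `episode` is a list of (x, a) pairs seen during one full game and
    `total_return` is the episode's reward, e.g. +1 for a win, -1 for a loss.
    """
    g = sum(grad_log_policy(x, a, beta) for x, a in episode)
    return beta + lr * total_return * g
```

In practice the return would simply be +1 for a win and -1 for a loss (possibly discounted so that actions near the end of the episode get more credit), and updates would typically be averaged over batches of episodes to reduce the Monte Carlo noise.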
Go ahead! ... Change the toggles to experiment, and even change to 'human' to jump in and play against the agent yourself!
[Interactive demo: 'Left player' and 'Right player' toggles let you select, for each side, the agent at different stages of training or a human player.]
As you can see, after playing the game about a million times even a very simple logistic regression agent has become quite skillful, though the playing technique lacks grace or artistry. In this case, we could manually devise the rules for an agent quite easily ourselves, but the crucial point is that the model parameters were actually learned purely through trial-and-error experience here.
In this project, you will start out by learning about deep neural networks and supervised learning. There are elegant connections between traditional supervised learning and reinforcement learning which will make the latter easier to learn with this knowledge. After this, the primary focus will be on learning Monte Carlo versions of policy gradient deep reinforcement learning, with the ultimate goal being to apply this to some problem, such as playing a retro Atari game like Pong, Space Invaders 👾, etc.
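For orientation, the Monte Carlo policy gradient (REINFORCE) idea mentioned above rests on the following standard identity, written here with notation introduced only for illustration: $\pi_\beta(a \mid x)$ is the policy (such as the logistic model above), $J(\beta)$ its expected total reward, $\tau$ an episode generated by playing the policy, and $G_t$ the return (accumulated reward) following step $t$:

$$\nabla_\beta J(\beta) \;=\; \mathbb{E}_{\tau \sim \pi_\beta}\!\left[\, \sum_{t} G_t \, \nabla_\beta \log \pi_\beta(a_t \mid x_t) \right]$$

The expectation is estimated by Monte Carlo, i.e. by averaging over simulated episodes, which is precisely the sense in which the agent learns a good strategy purely from trial-and-error experience.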
Prerequisites
Statistical Concepts II, Statistical Methods III and quite strong programming skills in either R or Python are essential.
Topics in Statistics III/IV is a slight advantage, but definitely not necessary.
Background
Friedman et al (2009) is an excellent book on supervised learning methods. Deep learning is thoroughly covered in Goodfellow et al (2016). Sutton & Barto (2018) provides an extensive introduction to a broad array of reinforcement learning approaches. All are available in the library or for free online (legally!), see below.
References
Friedman, J., Hastie, T. and Tibshirani, R., 2009. The Elements of Statistical Learning. 2nd Edition. New York: Springer series in statistics. Durham Library, or free online copy.
Goodfellow, I., Bengio, Y. and Courville, A., 2016. Deep Learning. MIT Press. Durham Library, or free online copy.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. and Riedmiller, M., 2013. Playing Atari with deep reinforcement learning. NIPS. Deepmind website.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G. and Petersen, S., 2015. Human-level control through deep reinforcement learning. Nature, 518(7540), pp.529-533. DOI: 10.1038/nature14236
Sutton, R.S. and Barto, A.G., 2018. Reinforcement Learning: An Introduction. 2nd Edition. MIT Press, Cambridge, MA. Durham Library, or free online copy.
Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W. and Abbeel, P., 2017. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 23-30. DOI: 10.1109/IROS.2017.8202133
Vinyals, O., Babuschkin, I., Czarnecki, W.M., Mathieu, M., Dudzik, A., Chung, J., Choi, D.H., Powell, R., Ewalds, T., Georgiev, P. and Oh, J., 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782), pp.350-354. DOI: 10.1038/s41586-019-1724-z
Williams, R.J., 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, pp.229-256. DOI: 10.1007/BF00992696