How does Deep Reinforcement Learning expand the boundaries of learning?

Customer Experience 28 July 2021

Deep Reinforcement Learning has performed spectacularly in recent years, enabling programs to learn very powerful and robust strategies in complex environments. Led by DeepMind, these algorithms have revolutionized artificial intelligence in many areas, from arcade games (Agent57) and board games (AlphaGo) to video games (AlphaStar).

The learning processes of Deep Reinforcement Learning

Reinforcement Learning is a branch of Machine Learning in which an agent interacts with an environment through various actions in order to maximize its cumulative reward. The agent’s policy is the rule that determines, at each moment, which action to take depending on the state of the environment.

The agent can train many times in the given environment and learn from its successes and failures, gradually converging toward its optimal policy. A classic example of Reinforcement Learning (RL) is the inverted pendulum game, also known as cart-pole (Figure 1). At any time t, four numerical variables describe the state of the game: the position and speed of the cart, and the angle and angular speed of the pole. On the basis of these four values, the agent decides whether to push the cart to the left or to the right, and then finds itself in a new state at time t+1. In the long run, its goal is to keep the pole upright for as long as possible.
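The loop described above (observe four state values, push left or right, land in a new state) can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: it assumes the usual textbook cart-pole equations with illustrative physical constants, and the names `step`, `run_episode`, `heuristic_policy` and `average_return` are hypothetical. The hand-written rule only stands in for a policy that a real agent would learn through training.

```python
import math
import random

# Illustrative constants (assumptions, not taken from the article).
GRAVITY, CART_MASS, POLE_MASS = 9.8, 1.0, 0.1
POLE_HALF_LENGTH, FORCE, DT = 0.5, 10.0, 0.02
X_LIMIT, ANGLE_LIMIT = 2.4, math.radians(12)


def step(state, action):
    """Advance the game by one time step; action 0 pushes left, 1 pushes right."""
    x, x_dot, theta, theta_dot = state
    force = FORCE if action == 1 else -FORCE
    total_mass = CART_MASS + POLE_MASS
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    temp = (force + POLE_MASS * POLE_HALF_LENGTH * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        POLE_HALF_LENGTH * (4.0 / 3.0 - POLE_MASS * cos_t ** 2 / total_mass)
    )
    x_acc = temp - POLE_MASS * POLE_HALF_LENGTH * theta_acc * cos_t / total_mass
    # Simple Euler integration from time t to time t+1.
    new_state = (x + DT * x_dot, x_dot + DT * x_acc,
                 theta + DT * theta_dot, theta_dot + DT * theta_acc)
    done = abs(new_state[0]) > X_LIMIT or abs(new_state[2]) > ANGLE_LIMIT
    return new_state, 1.0, done  # reward of 1 for every step survived


def random_policy(state, rng):
    """Untrained behavior: push left or right at random."""
    return rng.choice((0, 1))


def heuristic_policy(state, rng):
    """Hand-written stand-in for a trained policy: push toward the side the pole falls."""
    _, _, theta, theta_dot = state
    return 1 if theta + theta_dot > 0 else 0


def run_episode(policy, rng, max_steps=500):
    """Play one game and return the total reward collected."""
    state = tuple(rng.uniform(-0.05, 0.05) for _ in range(4))
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state, rng)
        state, reward, done = step(state, action)
        total_reward += reward
        if done:
            break
    return total_reward


def average_return(policy, episodes=30, seed=0):
    """Average total reward over several games."""
    rng = random.Random(seed)
    return sum(run_episode(policy, rng) for _ in range(episodes)) / episodes


if __name__ == "__main__":
    print("random policy:   ", average_return(random_policy))
    print("heuristic policy:", average_return(heuristic_policy))
```

Comparing the two averages gives a rough idea of the gap between an untrained agent and one that follows a sensible policy.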

Figure 1: Agent during (top) and after (bottom) training

Deep Reinforcement Learning (or Deep RL) follows these same principles but uses Deep Learning to handle much more complex observations, such as images. This has been achieved in the Breakout game (Figure 2): by analyzing the game screen in depth, the agent is able to understand the structure of its environment and deduce how to behave. Here we can see that the agent figured out, without any help, that by digging a hole in the wall of bricks it could send the ball to the other side and break many bricks in one go!

Figure 2: Agent practicing Breakout

We can therefore say that the way a Deep RL agent learns closely resembles human learning. Through training, the agent tests different policies and discovers that some are more effective than others. It can then refine these policies and master their intricacies. Deep RL makes it possible to work with complex environments (images, sounds, etc.) and probabilistic environments (where the evolution of the environment cannot be predicted with certainty). Finally, the agent can grasp the concept of strategy, making short-term sacrifices when it expects them to serve its long-term goals. It is through this technology that artificial intelligences were able to learn to walk by themselves.
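The idea of accepting a short-term sacrifice for a long-term gain is commonly formalized with a discounted sum of rewards. The sketch below is only an illustration under assumptions: the discount factor `gamma` is not mentioned in this article, and the two reward sequences are made up for the example.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each weighted by gamma raised to its delay in steps."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Two hypothetical reward sequences (made up for illustration).
greedy = [1, 1, 1, 0, 0, 0]      # collects small rewards right away
patient = [-1, 0, 0, 0, 0, 10]   # accepts an early loss for a later payoff

for gamma in (0.5, 0.99):
    print(gamma, discounted_return(greedy, gamma), discounted_return(patient, gamma))
```

With a small `gamma` the agent is short-sighted and the greedy sequence scores higher; with `gamma` close to 1 the patient sequence wins, which is the behavior described above.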

All these reasons led Deep RL research to focus on strategy games. They make it easy to measure the performance of an algorithm (does training lead the agent to win more games against its opponent?). But above all, they are good indicators of “intelligence”, and make it possible to judge a program’s ability to beat the best human players. Using this technology, DeepMind’s AlphaGo program learned the board game Go and managed to beat the world’s best player, even though many experts had considered the game far too complex to be mastered by a computer.

An example of Deep Reinforcement Learning: the checkers game

In this section, we will illustrate how Deep RL can be used to teach an agent to play checkers. The objective here is to produce a strong and robust AI, one that is able to win against the best players as well as against beginners. 

We could decide to represent the state of the checkers board at any given moment with descriptive variables (the number of men and kings per color, the number of pieces on the edges, the centers of gravity, etc.). However, it would be very difficult to derive information related to the geometric structure of the board (alignments between pieces, areas of interest, possible future sequences…). This is why, in this context, we prefer to use Deep RL and analyze the whole checkerboard as an image.
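One way to picture "the checkerboard as an image" is a stack of binary planes, one per piece type, much like the color channels of a picture. The sketch below is an assumption-laden illustration rather than the encoding actually used for the agent: it assumes an 8x8 board with three starting rows per side, and the piece symbols and the function names `initial_board` and `encode` are hypothetical.

```python
# Hypothetical one-character symbols for the contents of a square:
# "men" are ordinary pieces, "kings" are crowned pieces.
EMPTY, WHITE_MAN, WHITE_KING, BLACK_MAN, BLACK_KING = ".", "w", "W", "b", "B"
PLANES = (WHITE_MAN, WHITE_KING, BLACK_MAN, BLACK_KING)


def initial_board(size=8, rows_per_side=3):
    """Starting layout: pieces sit on the dark squares of the first rows of each side."""
    board = [[EMPTY] * size for _ in range(size)]
    for row in range(size):
        for col in range(size):
            if (row + col) % 2 == 1:  # dark squares only
                if row < rows_per_side:
                    board[row][col] = BLACK_MAN
                elif row >= size - rows_per_side:
                    board[row][col] = WHITE_MAN
    return board


def encode(board):
    """Turn the board into binary planes: planes[p][row][col] is 1 if piece p is there."""
    return [[[1 if cell == piece else 0 for cell in row] for row in board]
            for piece in PLANES]


planes = encode(initial_board())
print(len(planes), "planes of", len(planes[0]), "x", len(planes[0][0]), "squares")
```

A convolutional network can read such planes the way it reads the channels of an image, which preserves the geometric relationships (alignments, neighboring squares) that summary statistics lose.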

In order to learn and improve, the agent must perform actions and observe their consequences, but unlike the Breakout game where the agent plays alone, it now needs an opponent. This creates a new difficulty: it would be far too time-consuming to have human beings play against the agent in every game. The trick is to make the agent play against different versions of itself (static copies of itself saved after a certain amount of training), so that at each training level it becomes better than it was before.

We begin by creating a level 0 agent which makes its moves randomly. After sufficient training against a copy of itself, it “progresses” and moves up to a higher level. It can then train against itself and against its lower levels (see Figure 3). This process continues, with a greater number of games played at each stage against increasingly difficult opponents, until a very advanced level is reached. Once this ultimate level is reached, we obtain an agent that has learned to win against all its previous versions, and therefore against a very large number of possible opposing strategies!
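The structure of this level-by-level self-play can be sketched as follows. Everything specific here is a loudly labeled assumption: `ToyAgent` replaces the neural network with a single "skill" number, `play_game` is a made-up logistic model of who wins, and the update rule is a placeholder for a real Deep RL training step. Only the outer loop (train against frozen copies, then freeze a new copy as the next level) reflects the procedure described above.

```python
import copy
import math
import random


class ToyAgent:
    """Stand-in for a neural-network agent: one number summarizes its strength."""

    def __init__(self, skill=0.0):
        self.skill = skill

    def update(self, won):
        # Placeholder for a real Deep RL update driven by the game's outcome.
        # This toy simply gains a fixed amount of skill per game, win or lose.
        self.skill += 0.01


def play_game(agent, opponent, rng):
    """Hypothetical match model: the stronger side wins more often."""
    p_win = 1.0 / (1.0 + math.exp(opponent.skill - agent.skill))
    return rng.random() < p_win


def train_by_self_play(levels=5, base_games=200, seed=0):
    """Return the list of frozen agents, from level 0 up to the final level."""
    rng = random.Random(seed)
    agent = ToyAgent()
    pool = [copy.deepcopy(agent)]  # level 0: untrained agent
    for level in range(1, levels + 1):
        for _ in range(base_games * level):  # more games at each stage
            # The pool holds static copies of the current and all lower levels.
            opponent = rng.choice(pool)
            agent.update(play_game(agent, opponent, rng))
        pool.append(copy.deepcopy(agent))  # freeze a copy as the new level
    return pool


def win_rate(agent, opponent, games=200, seed=1):
    """Fraction of games the agent wins against a given opponent."""
    rng = random.Random(seed)
    return sum(play_game(agent, opponent, rng) for _ in range(games)) / games


if __name__ == "__main__":
    pool = train_by_self_play()
    print("final level vs level 0:", win_rate(pool[-1], pool[0]))
```

In a real system the agents in the pool would be saved network weights, and promotion to the next level would typically depend on a measured win rate rather than on a fixed number of games.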


Figure 3: Training stages of the checkers agent

Deep Reinforcement Learning serving digital marketing

For marketing purposes, we could imagine a brand using its website as a Deep RL agent. In this case, the agent plays a “game” with each user over the course of their visits to the website, and aims to win them over by the end of each browsing session.

Throughout the user’s journey on the website, the agent can interact with them by choosing among various available actions: redirecting the user to more personalized pages, suggesting certain items, sending push notifications, e-mails or even discount coupons… After having practiced sufficiently with a large number of people, the agent will be able to carry out the sequence of actions most likely to steer its users in the desired direction.
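To make the idea of "choosing among available actions and learning from many visitors" concrete, here is a deliberately simplified sketch. It is not Deep RL: it is a tabular, single-step stand-in (an epsilon-greedy value table) that only illustrates the explore-then-exploit loop. The action names, the `signal` strings and the class `SessionAgent` are all hypothetical.

```python
import random

# Hypothetical action set, loosely based on the options listed above.
ACTIONS = ("personalized_page", "item_suggestion", "push_notification",
           "email", "discount_coupon")


class SessionAgent:
    """Keeps a running average reward for each (signal, action) pair."""

    def __init__(self, epsilon=0.1, seed=0):
        self.epsilon = epsilon  # probability of trying a random action
        self.rng = random.Random(seed)
        self.value = {}         # (signal, action) -> average observed reward
        self.count = {}         # (signal, action) -> number of observations

    def choose(self, signal):
        """Pick an action for the current browsing signal."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)  # explore
        # Exploit: take the action with the best average reward so far.
        return max(ACTIONS, key=lambda a: self.value.get((signal, a), 0.0))

    def learn(self, signal, action, reward):
        """Update the running average with the reward observed after the action."""
        key = (signal, action)
        n = self.count.get(key, 0) + 1
        self.count[key] = n
        old = self.value.get(key, 0.0)
        self.value[key] = old + (reward - old) / n


agent = SessionAgent(epsilon=0.1)
agent.learn("viewed_cart", "discount_coupon", 1.0)
print(agent.choose("viewed_cart"))
```

A Deep RL agent would replace the table with a neural network and would reason over whole sequences of actions within a session, rather than one action at a time.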

Such a tool can adapt to many different working environments and, above all, to the specific objectives of any given company. One could maximize conversion rates or the amount of money spent, or even encourage potential buyers to choose certain items first (for example, overstocked items or those about to expire)! To achieve this, we simply need to adapt the reward system so that the Deep RL algorithm aligns its strategies with the business goals.
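Adapting the reward system to a business goal can be as simple as changing a few weights. The following sketch is an assumption, not a description of an existing product: the fields of `SessionOutcome`, the weights in `RewardWeights` and the function `session_reward` are hypothetical names chosen to mirror the three objectives mentioned above.

```python
from dataclasses import dataclass


@dataclass
class SessionOutcome:
    converted: bool       # did the visitor buy anything?
    revenue: float        # amount spent during the session
    priority_items: int   # purchased items that were overstocked or near expiry


@dataclass
class RewardWeights:
    conversion: float = 0.0
    revenue: float = 0.0
    priority_items: float = 0.0


def session_reward(outcome, weights):
    """Reward handed to the agent at the end of a browsing session."""
    return (weights.conversion * float(outcome.converted)
            + weights.revenue * outcome.revenue
            + weights.priority_items * outcome.priority_items)


outcome = SessionOutcome(converted=True, revenue=40.0, priority_items=2)
maximize_conversion = RewardWeights(conversion=1.0)
maximize_revenue = RewardWeights(revenue=1.0)
clear_stock = RewardWeights(priority_items=1.0)

for goal in (maximize_conversion, maximize_revenue, clear_stock):
    print(goal, session_reward(outcome, goal))
```

The same session is scored differently under each goal, so the same learning algorithm ends up favoring different strategies.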

It should also be noted that, because of current legislation on personal data, and the European GDPR in particular, it is becoming increasingly difficult to track a user from one session to another. Decisions will therefore have to be made on the basis of weak signals (navigation during the session) in order to obtain medium-term results. This is precisely what Deep RL makes possible.


In conclusion, Deep Reinforcement Learning has emerged as the AI technology that most closely resembles human intelligence, for it is able to assimilate extremely complex concepts and develop long-term strategic thinking. As the world continues to discover its immense potential, many sectors are adopting it and transforming their ways of working. Digital marketing seems to be the next one on the list.

