Machine learning in a nutshell—part 4: reinforcement learning

Customer Experience 26 April 2019

After discovering supervised and unsupervised learning, the most common types of machine learning, here is the third and final type: reinforcement learning, the most complex of the three, but also the closest to real life.

As a reminder, in supervised learning the data scientist provides the algorithm with examples of the right decision to make (for example, "this item belongs to category 1"), after which the machine learning model learns to link its observations to the labels given by the expert. A set of labelled data is therefore needed beforehand.

In reinforcement learning, it is not necessary to have historical data for the algorithm to make decisions on its own. Indeed, only knowledge of the current context is necessary. It is by interacting with the latter that the algorithm’s decisions will become increasingly relevant, just like a human being in his or her learning phase!


Reinforcement learning: vocabulary for dummies

Let’s start with some much-needed vocabulary to better understand reinforcement learning. In this article, we will talk about agents, actions, states, rewards, transitions, policies, environments and, finally, regret. We will use the example of the famous Super Mario game as an illustration (see diagram below).

  • The agent is the one who makes decisions, who acts. Let’s take Mario as an example.
  • His decisions are called actions. They are predefined, for example: going right, going left, jumping, crouching, and so on.
  • States are what define the context in which the agent evolves at a given moment, for example: there is a hole in front of the agent, a block above, or a monster on his way.
  • With each action performed, the agent impacts the next state by modifying the environment. The next state will not be the same if the agent goes right rather than left, for instance. However, the next state is not always predictable: a monster can suddenly appear in Mario’s path, among other things! This is captured by transition probabilities between states.
  • The policy is the strategy the agent adopts with regard to the environment in order to choose its actions. Mathematically speaking, it is a function that maps each state to the action to be performed in response.
  • By choosing an action in a given state, the agent receives a reward. This is a quantified objective that the agent will seek to optimize over time. In our example, it depends on whether Mario stays alive, dies or collects a coin (beware of the connotation of the term: a reward can be a positive or a negative consequence!).
  • All these elements define the environment in which the agent operates. At the beginning of the experiment, the agent only knows the existing states and the actions that can be carried out in the environment. It does not know the rewards or the transitions, and therefore does not yet know how to achieve its goal. It will have to explore, that is, try out, different actions in each state, several times over, because the rewards obtained can vary significantly from one attempt to the next. This is referred to as a probabilistic environment. Beware: the machine should not be left to explore purely at random, at the risk of greatly increasing the time it takes to understand its environment!
  • Finally, to measure the performance of a reinforcement learning algorithm, we measure regret: the difference between the average reward obtained by the agent and the average reward it would have obtained had it applied the optimal policy from the start. The smaller the difference, the less time the agent takes to learn and the more efficient the algorithm is.
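
To make the notions of exploration, probabilistic rewards and regret concrete, here is a minimal sketch in Python. The two actions, their reward probabilities and the epsilon-greedy strategy are all invented for illustration; this is one simple way an agent can balance exploring and exploiting.

```python
import random

# Two hypothetical actions with mean rewards unknown to the agent.
TRUE_MEANS = {"left": 0.3, "right": 0.7}

def pull(action, rng):
    """Probabilistic environment: the reward varies from one try to the next."""
    return 1.0 if rng.random() < TRUE_MEANS[action] else 0.0

def run_epsilon_greedy(steps=1000, epsilon=0.1, seed=0):
    """Explore at random with probability epsilon, otherwise exploit the best estimate."""
    rng = random.Random(seed)
    counts = {a: 0 for a in TRUE_MEANS}
    values = {a: 0.0 for a in TRUE_MEANS}  # running average reward per action
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(list(TRUE_MEANS))   # explore
        else:
            action = max(values, key=values.get)    # exploit
        reward = pull(action, rng)
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
        total_reward += reward
    # Regret: what the optimal policy would have earned on average, minus what we got.
    regret = max(TRUE_MEANS.values()) * steps - total_reward
    return total_reward, regret
```

The smaller the regret, the faster the agent has figured out that "right" is the better action.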

[Diagram: reinforcement learning illustrated with Super Mario]

Reinforcement learning: it’s your turn to play!

Reinforcement learning is a form of machine learning widely used to power the artificial intelligence of games. The best-known case is probably the AlphaGo Zero solution, developed by Google DeepMind, which can beat the best Go players in the world. The first versions of the algorithm tried to reproduce human behavior by analyzing games actually played by amateurs and professionals, so that was supervised learning. For its final solution, DeepMind then used reinforcement learning alone and trained the model to play… against itself!

As seen in our example, reinforcement learning has also been used to learn how to play Super Mario. To do this, some human work was needed first of all, to define the states. This is why it is necessary to know the game well, in order to be able to summarize a state and then encode it for the computer.

For example, we can divide the visible environment around Mario into a grid of squares (say, 10 pixels wide each) and note what we observe in each square (a monster, a coin, a hole, etc.). Then we have to define the actions; to keep it simple, say right, left, down and up. And finally, we have to give a value to the rewards, for example: -1,000 if the action kills Mario, 0 if nothing happens, +10 if Mario kills a monster, +100 for each coin picked up by Mario and +10,000 if Mario completes the level. All that remains is to let Mario play and learn by himself. There we go!
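
The reward table above can be put directly into code. The sketch below uses the article's reward values, plus a single step of tabular Q-learning, one common way to learn from such rewards; the state names, learning rate and discount factor are hypothetical.

```python
from collections import defaultdict

# Reward values from the example above.
REWARDS = {"dies": -1000, "nothing": 0, "kills_monster": 10,
           "coin": 100, "wins_level": 10000}
ACTIONS = ["right", "left", "down", "up"]

Q = defaultdict(float)  # Q[(state, action)] = estimated long-term value

def q_update(state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Nudge Q(s, a) toward reward + gamma * best value of the next state."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# Example transition: jumping under a coin block earns Mario a coin.
q_update("coin_block_above", "up", REWARDS["coin"], "clear_path")
```

After this single update, the estimated value of jumping in that state has moved halfway toward the observed reward, since the next state is still worth nothing in the table.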

Where does marketing stand in all this?

In marketing, the applications of reinforcement learning are endless… but the approach has not yet been widely adopted, because it is quite complex to implement.

For example, we could use reinforcement learning on an e-commerce website to find the best price for each product, in order to maximize sales. In this case, the state of the model would be the day of the year, together with past sales and the stock of available products; the actions would consist of raising or lowering prices; and the number of sales would be the reward. By varying prices, we would impact the number of sales (the rewards), which would in turn change our stock, i.e. the state we will be in the next day. This next state is therefore uncertain at the time the price is chosen, since we do not yet know exactly how many sales we will make (this is called a probabilistic state).
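
A toy version of this pricing loop might look as follows; the candidate prices, the demand curve and the stock level are all invented for illustration, and a simple epsilon-greedy strategy stands in for a full reinforcement learning model.

```python
import random

PRICES = [8.0, 10.0, 12.0]  # candidate prices (the actions: pick a higher or lower one)

def simulate_day(price, stock, rng):
    """Probabilistic environment: demand falls as price rises, sales are capped by stock."""
    demand = max(0, int(rng.gauss(100 - 6 * price, 5)))
    sales = min(demand, stock)
    return sales * price, stock - sales  # (reward = revenue, tomorrow's stock)

rng = random.Random(42)
avg_revenue = {p: 0.0 for p in PRICES}
pulls = {p: 0 for p in PRICES}
stock = 50_000
for day in range(365):
    if rng.random() < 0.1:                        # explore another price
        price = rng.choice(PRICES)
    else:                                         # exploit the best price so far
        price = max(avg_revenue, key=avg_revenue.get)
    revenue, stock = simulate_day(price, stock, rng)
    pulls[price] += 1
    avg_revenue[price] += (revenue - avg_revenue[price]) / pulls[price]
```

Over the simulated year, the loop converges on the price with the highest average revenue, while the remaining stock carries over from one day to the next, exactly the state dependence described above.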

In e-commerce, it is also possible to optimize the ordering of products on a page to improve sales. Indeed, within a list of products, the performance of a particular product does not necessarily reflect its real attractiveness, since it is biased by its placement on the page. In practice, products placed at the top of the page are more likely to be seen, and therefore to be purchased. An algorithm could make it possible to disentangle the product’s actual performance from the impact of its visibility. By testing several orderings (the actions), reinforcement learning makes it possible to find the optimal one. And of course, this optimal order is constantly evolving, as products (and their availability) change every day!
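
As a sketch of the position-bias point (not the reinforcement learning algorithm itself): the products, click rates and "seen-probability" per slot below are invented numbers, but they show how raw performance can be corrected before reordering.

```python
# Assumed probability that each slot is actually seen (slot 0 = top of page).
POSITION_BIAS = [1.0, 0.6, 0.3]

observed_click_rate = {"A": 0.20, "B": 0.18, "C": 0.12}  # performance as displayed
current_slot = {"A": 1, "B": 0, "C": 2}

# Observed clicks = P(seen at slot) * P(click if seen): dividing out the
# position bias recovers each product's actual attractiveness.
attractiveness = {p: ctr / POSITION_BIAS[current_slot[p]]
                  for p, ctr in observed_click_rate.items()}
optimal_order = sorted(attractiveness, key=attractiveness.get, reverse=True)
# Product C, buried at the bottom of the page, turns out to be the most attractive.
```

Product B looks strong only because it sits at the top; once the bias is divided out, C should be promoted.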

Reinforcement learning and recommendation: our algorithm presented at the NeurIPS conference

fifty-five, Criteo and Facebook recently developed a reinforcement learning algorithm and published it at the NeurIPS conference: it makes it possible to model how the performance of a marketing activation (for example, a recommendation algorithm) evolves with the number of times the user has been exposed to that activation in the past. Let’s take two examples to illustrate.

First example: suppose that a VoD platform tries to determine what kind of film to recommend to users (action, romance, comedy, thriller…) at a given moment. A user’s desire to see an action movie will not be the same after watching 10 action movies in the past month as after watching only 1, even if their fondness for action movies remains unchanged. The reinforcement learning algorithm used by the platform must therefore be able to anticipate user fatigue, in order to offer other types of film depending on past interactions. This makes it possible to diversify the recommendations offered to users.
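
A toy version of this fatigue effect could look like this; the genres, affinity values and decay formula are all invented for illustration, not the model from the publication.

```python
# Baseline fondness per genre: unchanged over time in this toy model.
BASE_AFFINITY = {"action": 1.0, "comedy": 0.7}

def expected_payoff(genre, recent_views):
    """Appetite decays with recent exposure, even though fondness is stable."""
    return BASE_AFFINITY[genre] / (1 + recent_views[genre])

def recommend(recent_views):
    """Pick the genre with the best exposure-adjusted payoff."""
    return max(BASE_AFFINITY, key=lambda g: expected_payoff(g, recent_views))

# After 10 recent action movies, comedy wins despite its lower base affinity;
# with no recent action movies, action wins again.
```

This is exactly the diversification effect described above: the recommendation changes with past exposure even though the underlying preferences do not.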

Second example: on an e-commerce website, if you choose to place the “promotions” block on the right, users who regularly visit the website will gradually get used to this placement, until they no longer pay attention to it. To overcome this problem, the reinforcement learning algorithm can anticipate this form of fatigue by moving the block to the left of the page after a given period of time, to continuously maintain the user’s attention. The algorithm we developed makes it possible to measure this evolution, in order to improve customer knowledge, alternate between the various possibilities and thus maximize long-term engagement!

Want to find out more?

Discover the publication by Romain Warlop, Alessandro Lazaric (Facebook AI Research) and Jérémie Mary (Criteo Research) here!

All in all, it should be kept in mind that any action potentially influences the future environment and can bias analyses. With reinforcement learning, we therefore let the algorithm learn “on its own”, interacting with its environment and observing what happens there. This form of machine learning is thus the closest to human learning: it is how a child learns, for example, by trying to fit objects of various geometric shapes into holes matching those shapes. Finally, unlike supervised learning, which simply looks for the best action to perform at a specific point in time, reinforcement learning seeks to optimize performance over the long term, which sometimes forces it to “take a step back to better move forward”!

Thanks to all those who read this series of articles—machine learning has almost no more secrets for you 🙂

Would you like another cup of tea?