# #AI – Simpson’s Paradox, or, how to make numbers lie

Data Architecture | 26 April 2019

Simpson’s paradox is named after statisticians Edward Simpson and George Udny Yule. It can arise in any statistical study, and though it is often illustrated with medical cases, it applies just as readily elsewhere – particularly in marketing. Learn how Simpson’s paradox can give new meaning to “the numbers lie”!

“There are three kinds of lies: lies, damned lies, and statistics.”

Mark Twain

**A paradox that is mostly medical…**

Let’s begin with a medical example. Say I go to the doctor’s and learn I have kidney stones. The doctor shows me a statistical study about the effectiveness of two treatments. Treatment A is open surgery, and treatment B is a minimally invasive surgery (percutaneous nephrolithotomy). The study results are as follows:

These results seem to indicate that I should choose **treatment B**, as it shows a higher success rate!

To be sure, I get a second opinion. The second doctor shows me results from **the same experiment** but displayed differently:

The doctor explains that kidney stones are either small or big, there is no medium. Based on these results, he recommends treatment A, as its success rate is higher in both cases.

Oh no! What should I choose?

Let’s start with the wrong answers:

*“Failure rates are higher for open surgery, so I’ll choose B.”* For this example, we can assume that the consequences of failure are the same for either option.

*“The numbers are insignificant because the sample is too small.”* For this example, we can assume that if we multiplied the number of participants by one million, the results would stay the same.

So the correct response is… treatment A! Looking at the results broken down by stone size, treatment A performs better for both small and big kidney stones. But why is the result reversed when we aggregate the data?

The “kidney stone size” variable is what we call a **confounding variable**, which means that it **impacts both the treatment choice and treatment effectiveness** (regardless of what the treatment is).

If we focus on *big kidney stones*, we see that both treatments work less well than they do on *small kidney stones*; this medical case is tougher to treat. However, treatment A was applied to *big kidney stones* far more often than to *small kidney stones*, and vice versa for treatment B. Thus, treatment A’s overall effectiveness is dragged down (and the opposite is true for treatment B).
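The reversal is easy to verify numerically. The counts below come from the classic kidney stone study (Charig et al., 1986) on which this example is usually based – treat them as illustrative, since the article’s own tables are not reproduced here:

```python
# Success counts from the classic kidney stone study (Charig et al., 1986).
# Illustrative figures: format is (successes, patients treated).
data = {
    "A": {"small": (81, 87),   "big": (192, 263)},  # open surgery
    "B": {"small": (234, 270), "big": (55, 80)},    # percutaneous
}

for treatment, groups in data.items():
    total_successes = sum(s for s, n in groups.values())
    total_patients = sum(n for s, n in groups.values())
    print(f"Treatment {treatment}:")
    for size, (s, n) in groups.items():
        print(f"  {size} stones: {s / n:.0%}")
    print(f"  overall:      {total_successes / total_patients:.0%}")
```

Treatment A wins in each subgroup (93% vs. 87% on small stones, 73% vs. 69% on big ones), yet treatment B wins overall (83% vs. 78%), because A was mostly given the hard cases.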

Still not convinced? Let’s look at a second example. Say you want to determine which of two amateur tennis players is better, even though they can’t play against one another (they live too far apart!). Player 1 plays against Nadal and Federer 90% of the time, and against children 10% of the time. Player 2 plays against Nadal and Federer 10% of the time, and against children 90% of the time. If we simply looked at each player’s overall victory percentage, player 1 would, of course, look like the worse player. The skill level of the opponent is therefore a confounding variable.
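A quick back-of-the-envelope calculation makes the point. The win rates below are made up for illustration: suppose both players are equally skilled, beating children 95% of the time and the pros 5% of the time. Only the opponent mix differs:

```python
# Hypothetical win rates: both players have identical per-opponent skill,
# yet their raw overall win percentages differ wildly.
def overall_win_rate(share_vs_pros, rate_vs_pros=0.05, rate_vs_children=0.95):
    # Weighted average of win rates over the mix of opponents faced.
    return share_vs_pros * rate_vs_pros + (1 - share_vs_pros) * rate_vs_children

player1 = overall_win_rate(0.9)  # faces the pros 90% of the time
player2 = overall_win_rate(0.1)  # faces the pros 10% of the time
print(f"Player 1: {player1:.0%}, Player 2: {player2:.0%}")  # 14% vs. 86%
```

Identical skill, yet 14% against 86% overall – the opponent mix does all the work.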

**…But sometimes marketing!**

Now, let’s see how this paradox can be applied to the marketing world.

On my website, I have two algorithms to personalize user experience. I’d like to know which is more effective, so I set up a simple A/B Test: when a new visitor comes to my site, I display version A half the time and version B the other half. If a given user has already seen one version, I continue displaying the same version to avoid mixing audiences. After one month, I check back for my results: 10% of visitors to site version A bought a product, compared to 7% of version B visitors.

Satisfied but cautious, I then decide to continue the test for a second month. Because version A worked better, I decide to change the proportions: each new site visitor now has an 80% chance of seeing version A, and a 20% chance of getting version B. As before, repeat visitors continue to see the same version. At the end of the month, I look at both versions’ overall performance: 9% conversion rate for version A and 10% for version B.

Any idea where the problem lies?

It’s the same as before! User history is the confounding variable. Returning visitors who already know the brand naturally have a greater chance of converting. In month 2, version A receives many more new users (a tough break, as they’re harder to convert), and so its overall performance decreases. But if we break down each version’s performance by user history, we can see that version A still performs better.

*Performance with new visitors*

*Performance with repeat visitors*
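The same aggregation trap can be sketched with hypothetical month-2 traffic numbers (chosen to mirror the article’s story, not taken from it):

```python
# Hypothetical month-2 traffic: version A beats B within each visitor
# segment, yet loses overall because its 80% traffic share brings it
# far more hard-to-convert new visitors.
segments = {
    #          version: (visitors, conversions)
    "new":    {"A": (8000, 400), "B": (2000, 80)},   # A: 5.0%, B: 4.0%
    "repeat": {"A": (5000, 700), "B": (5000, 600)},  # A: 14.0%, B: 12.0%
}

for version in ("A", "B"):
    visitors = sum(segments[seg][version][0] for seg in segments)
    conversions = sum(segments[seg][version][1] for seg in segments)
    print(f"Version {version} overall: {conversions / visitors:.1%}")
```

Version A converts better among both new visitors (5.0% vs. 4.0%) and repeat visitors (14.0% vs. 12.0%), yet its overall rate (about 8.5%) falls below version B’s (about 9.7%).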

**The solution: be careful when choosing algorithm variables!**

To avoid these types of misinterpretations, you need **deep business knowledge** of what you’re trying to measure. Make a list of all the criteria that could affect the result; the data scientist can then use this list to derive as many variables as possible from those criteria.

For example, if you want to analyze an e-commerce site’s performance, one criterion could be that the number of page views has a significant impact. With this information, the data scientist can create several variables, such as the average number of page views per day, the difference in pages viewed when a visitor returns two days in a row, the change in this difference over recent weeks, etc. The data scientist can also enrich the algorithm with variables inspired by his or her own knowledge of models and of the data he or she has access to.
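As a minimal sketch of how one criterion fans out into several variables – the log format and variable names here are assumptions for illustration, not from the article:

```python
from datetime import date

# Hypothetical page-view log: (user_id, visit_date, pages_viewed).
log = [
    ("u1", date(2019, 4, 1), 12),
    ("u1", date(2019, 4, 2), 7),
    ("u2", date(2019, 4, 1), 3),
    ("u2", date(2019, 4, 3), 5),
]

def features_for(user_id, log):
    """Derive several variables from the single 'page views' criterion."""
    rows = sorted((d, p) for u, d, p in log if u == user_id)
    pages = [p for _, p in rows]
    feats = {"avg_pages_per_visit": sum(pages) / len(pages)}
    # Difference in pages viewed between two consecutive-day visits, if any.
    for (d1, p1), (d2, p2) in zip(rows, rows[1:]):
        if (d2 - d1).days == 1:
            feats["consecutive_day_page_diff"] = p2 - p1
    return feats

print(features_for("u1", log))
```

Each business criterion on the list can be expanded this way, which is how a dozen criteria quickly become a hundred candidate variables.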

It could also be interesting, for example, to create variables that take users’ interests and their evolution into account. We could count the number of products viewed in each category in order to display the categories most likely to interest the user, calculate the value of the products viewed per category, measure the overall disparity of the products consulted in each category, etc.

Thus, with just a dozen criteria, it is easy to obtain a hundred different variables for the algorithm, and then to let the algorithm measure each variable’s contribution from historical data.

**Thorough business expertise is thus key to avoiding a biased and irrelevant algorithmic model!** For instance, we often decide to remove user gender from the model to avoid a sexist bias.

In conclusion, thanks to the computing power and vast amount of processed data available today, we can create a large number of input variables and then let the algorithm measure the real impact of each. **However, human intervention is still necessary to ensure that chosen variables are relevant… and ethical!** To learn more about this issue, come back for our upcoming article “What if artificial intelligence couldn’t exist without human intelligence?”