P-hacking: Common mistakes and how to avoid them

Fernando Duarte Salhani · Published in GetNinjas
Jul 1, 2019 · 7 min read

“[p-hacking] is the misuse of data analysis to find patterns in data that can be presented as statistically significant when in fact there is no real underlying effect” — Wikipedia

From the definition above, one could think that p-hacking is used to deceive or lie using data. However, it isn’t always malicious. This blog post covers some ways in which p-hacking may happen accidentally and lead to wrong decisions. To do so, we will perform a case study using a simulation.

The experiment: Coin tossing

The experiment is very straightforward. Imagine there are 2 groups of 10,000 coins that will be tossed at random. Each coin has a type associated with it: penny, nickel, dime, quarter, half-dollar or dollar.

The goal is to evaluate whether there is a significant difference between the 2 groups by calculating the p-value for the results. The experiment is performed 1,000 times.

The point is: there is no difference between the groups. The data was generated in exactly the same way for both. So, how many times will we arrive at the WRONG conclusion that there is a difference between them?
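To make this concrete, here is a simplified sketch of how one run of the simulation can be set up: fair coins modelled as Bernoulli(0.5) tosses and a chi-square test on the heads/tails counts of the two groups. The helper names are illustrative, and the notebook linked at the end of this post contains the actual simulations.

import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)
N_COINS = 10_000
COIN_TYPES = ["penny", "nickel", "dime", "quarter", "half-dollar", "dollar"]

def toss_group(n=N_COINS):
    """Toss n fair coins, each with a randomly assigned coin type."""
    return {
        "type": rng.choice(COIN_TYPES, size=n),
        "heads": rng.integers(0, 2, size=n),  # 1 = heads, 0 = tails
    }

def p_value(group_a, group_b):
    """Chi-square test on the 2x2 heads/tails contingency table."""
    heads_a, heads_b = group_a["heads"].sum(), group_b["heads"].sum()
    table = [
        [heads_a, len(group_a["heads"]) - heads_a],
        [heads_b, len(group_b["heads"]) - heads_b],
    ]
    _, p, _, _ = chi2_contingency(table)
    return p

print(f"p-value for a single run: {p_value(toss_group(), toss_group()):.3f}")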

The correct way of testing

How should a test like this be performed?

  1. Determine the number of samples needed for each variation
  2. Perform the experiment for that number of samples
  3. Get the result

By doing this and looking only at the final result of each experiment, we find a “significant” difference between the groups in only 44 of the 1,000 simulations (4.4%).

Looking at the definition of p-value (the chance of randomly getting a result at least as extreme as the one you observed), this number makes a lot of sense. With the significance threshold set at 0.05, we expect approximately 5% of the simulations to reach this so-called “statistical significance”.
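Continuing the sketch above, the loop that measures this false positive rate can look something like this (again an illustrative version, not the exact code of the notebook):

# Reuses toss_group() and p_value() from the sketch above.
N_SIMULATIONS = 1_000
ALPHA = 0.05

# The two groups are generated identically, so every "significant" result
# is a false positive; we expect roughly 5% of the runs to cross ALPHA.
false_positives = sum(
    p_value(toss_group(), toss_group()) < ALPHA
    for _ in range(N_SIMULATIONS)
)
print(f"False positive rate: {false_positives / N_SIMULATIONS:.1%}")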

Repeatability

In order to lower the chances of false positives, you may as well repeat the experiment and see which attempts will, once again, reach a p-value of 5% or less. The probability that both runs behave this way is now 0.05² = 0.25%. In my simulation, only 2 out of the 1,000 experiments had a p-value below 5% in both runs (0.2%).
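A rough sketch of that check, reusing the helpers above: an experiment only counts as a winner if two independent runs are both significant.

# Short-circuit: the second run is only performed when the first one
# already looked "significant", mirroring how a re-test would be done.
both_significant = sum(
    p_value(toss_group(), toss_group()) < ALPHA
    and p_value(toss_group(), toss_group()) < ALPHA
    for _ in range(N_SIMULATIONS)
)
print(f"Significant in both runs: {both_significant / N_SIMULATIONS:.2%}")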

P-hacking in a business setting

In business, few people will care about the meaning of the data or the methods used. A product manager wants to answer “does this feature improve this aspect of the product?”, not “what are the chances that my results are due to randomness?”. Directors want fast deliveries and results. They want the answer to “how is the test doing?” to be “good” or “bad”, not “we need two more weeks to evaluate our results”.

Those different mindsets, although completely natural, may lead to rushed and sloppy experimentation, which causes issues like:

  • Your feature doesn’t perform as it did during the test
  • User behavior detected qualitatively doesn’t show up when a test is run
  • Some features end up being applied only to counter-intuitive segmentations

Looking for a result where there is none

Now let’s go back to the coin tossing experiment and manipulate the data in ways that are very common in a business setting, ways that can affect the results and produce “significance” where there is none.

Peeking at the data and continuous observation

“How is the test performing? Has it reached significance?”

Who hasn’t heard these questions? How do they affect our results?

Peeking at the data and calculating significance mid-test can lead to a considerable rise in the occurrence of false positives. In our simulation, only 4.4% of the experiments produced a false positive when analyzed just once, at the end. If the test were also analyzed exactly at its midpoint and stopped whenever the p-value dropped below 0.05, that rate would be 8.1%. With 10 peeks during the test, it goes up to 18.5%.

[Chart: empirical data from the experiment]

In the extreme case in which we calculate the p-value after every single toss to check whether the test has reached significance, the false positive rate would reach a whopping 66.3%. (All values are based on the empirical simulation and may differ slightly from the theoretical ones.)
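Here is a sketch of how peeking can be simulated with the helpers above: the test is checked at evenly spaced interim points and stopped as soon as the p-value dips below 0.05 (the checkpoint spacing is my simplification).

def peeking_run(n_peeks, n=N_COINS, alpha=ALPHA):
    """Return True if any interim peek (or the final look) is 'significant'."""
    a, b = toss_group(n), toss_group(n)
    checkpoints = np.linspace(n // n_peeks, n, n_peeks, dtype=int)
    for k in checkpoints:
        partial_a = {"heads": a["heads"][:k]}
        partial_b = {"heads": b["heads"][:k]}
        if p_value(partial_a, partial_b) < alpha:
            return True  # the test would be stopped here and declared a winner
    return False

for n_peeks in (1, 2, 10):
    rate = sum(peeking_run(n_peeks) for _ in range(N_SIMULATIONS)) / N_SIMULATIONS
    print(f"{n_peeks:>2} look(s): false positive rate ≈ {rate:.1%}")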

In conclusion:

Don’t peek at your data ahead of time!

Unplanned segmentation

This is one of the most common forms of p-hacking, both malicious and naive. It consists of slicing your data in ways that were not planned ahead of time in order to find patterns.

In our experiment, we used different types of coin. Let’s suppose we decide to analyze each type separately to see whether there is a “significant” difference between versions A and B. Considering we had pennies, nickels, dimes, quarters, half-dollars and dollars, we find p-values below 5% for at least one of the types in 28.2% of the experiments (282 out of 1,000).
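A sketch of that slicing, again reusing the helpers from the first snippet: each coin type is tested on its own, and an experiment is flagged if any slice looks “significant”.

def any_slice_significant(alpha=ALPHA):
    """Test each coin type separately; True if any slice crosses alpha."""
    a, b = toss_group(), toss_group()
    for coin in COIN_TYPES:
        slice_a = {"heads": a["heads"][a["type"] == coin]}
        slice_b = {"heads": b["heads"][b["type"] == coin]}
        if p_value(slice_a, slice_b) < alpha:
            return True
    return False

rate = sum(any_slice_significant() for _ in range(N_SIMULATIONS)) / N_SIMULATIONS
print(f"At least one 'significant' coin type: {rate:.1%}")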

Once again, the p-value has lost its meaning. With n slices, each tested at a threshold p, the probability of at least one slice showing a p-value below that threshold is:

1 - (1 - p)^n

For 6 variations (n = 6, p = 0.05), that probability is ~26.5%.

Now imagine an e-commerce site running an A/B test that affects over 100 products. If we analyze the results for each product separately, the chance of finding at least one product for which the variation “outperforms” the control is above 99%, even if the variation has no real effect at all.
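Plugging the numbers from this section into the formula above:

p = 0.05
for n in (6, 100):
    # 1 - (1 - p)^n: probability of at least one false positive among n slices
    print(f"n = {n:>3}: {1 - (1 - p) ** n:.1%}")
# n =   6: 26.5%
# n = 100: 99.4%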

[Comic strip by XKCD]

Iterating over a test

For this last example, imagine you ran a test to change something on your website. After it was over, there wasn’t a significant difference to determine that the variation beat the control group. What do you do?

Some will conclude there is no noticeable difference between the two versions and will decide on A or B based on other criteria, such as time to implement, the preferences of the main stakeholders, or some long-term project you intend to work on.

Others, however, will want to adapt version B until it outperforms version A. They will do that by performing small tweaks and trying again. What’s the harm in that?

This attitude has exactly the same effect as slicing your data. With every new iteration, your chances of getting a false positive increase: 5% for the first experiment, 9.7% for the second, then 14.2%, 18.5%, 22.6%, 26.5%, 30.2%…

How to avoid all of this?

Plan your tests ahead

This might sound obvious, but it is very common to forget it and just wait for significance. Not only does that increase the probability of misleading results, it also invalidates most of the statistical properties used in the test. P-value, confidence level, and statistical power all lose their meaning when good testing practices are ignored.

Define relevant segmentation before choosing a p-value

For the same p-value threshold, increasing the number of slices of your data increases the likelihood of a result that rejects the null hypothesis by chance. Nonetheless, if your segmentation is important, it makes sense to pick a smaller target p-value for your test.

If you have 5 slices to consider and use a p-value of 0.05, the chances of having at least one false positive rise to 22.6%. But if you pick 0.01 as your p-value, that probability falls to 4.9%.
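The same formula makes it easy to check these numbers, or to go the other way and pick a per-slice threshold for a desired overall rate (this inverse step is a Šidák-style correction, my addition, not part of the original test plan):

n_slices = 5
for alpha in (0.05, 0.01):
    print(f"p < {alpha}: family-wise rate = {1 - (1 - alpha) ** n_slices:.1%}")
# p < 0.05: 22.6%,  p < 0.01: 4.9%

# Inverse (Šidák-style): per-slice threshold for a 5% overall rate.
per_slice_alpha = 1 - (1 - 0.05) ** (1 / n_slices)
print(f"Per-slice threshold for a 5% family-wise rate: {per_slice_alpha:.4f}")  # ~0.0102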

This might not be ideal when there are several segmentations, but it does solve part of the problem.

Don’t peek at your data

This is very important and one of the toughest tasks when performing a test with a business purpose. However, there are ways to apply early stopping that, unlike the usual peek at the results, use valid statistical methods that help maintain the meaning of your data.

Optimal stopping

There are a few methods to perform early stopping that preserve the statistical meaning behind your data without losing the confidence level. I stumbled upon some of them when I read this post on Netflix’s Techblog.

Sequential analysis, or sequential hypothesis testing, is a method that allows you to observe your data during the experiment without sacrificing statistical confidence. By determining the number of peeks beforehand, you can divide your test into equal chunks and calculate the p-value after each chunk is collected. By using a reduced p-value threshold at each look, you can keep the overall false positive rate at the desired level.

There are several ways of choosing p-values for sequential analysis, like the O’Brien-Fleming, Pocock and Haybittle-Peto methods.

To test them out, I used these three techniques on the coin tossing data set and got the following false positive rates when splitting the data into 5 sets:

  • O’Brien-Fleming: 4.2%
  • Pocock: 4.1%
  • Haybittle-Peto: 4.5%
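As a rough illustration of the idea (not a full implementation of any of the three methods), here is a Pocock-style version of the peeking simulation above: the same nominal threshold is used at every look, but it is much stricter than 0.05. The value of roughly 0.016 for 5 looks at an overall alpha of 0.05 is an approximate figure from standard group-sequential tables, not computed here.

POCOCK_5_LOOKS = [0.016] * 5  # approximate nominal threshold per look

def sequential_run(thresholds, n=N_COINS):
    """Stop early only when the look-specific threshold is crossed."""
    a, b = toss_group(n), toss_group(n)
    checkpoints = np.linspace(n // len(thresholds), n, len(thresholds), dtype=int)
    for k, alpha_k in zip(checkpoints, thresholds):
        partial_a = {"heads": a["heads"][:k]}
        partial_b = {"heads": b["heads"][:k]}
        if p_value(partial_a, partial_b) < alpha_k:
            return True
    return False

rate = sum(sequential_run(POCOCK_5_LOOKS) for _ in range(N_SIMULATIONS)) / N_SIMULATIONS
print(f"Pocock-style boundaries, 5 looks: false positive rate ≈ {rate:.1%}")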

What to do with this?

Good testing practices are not just for statisticians, data analysts and data scientists. The key to performing a good test, and to ensuring efficient and precise results, is to understand the concepts behind the methods used. By being aware of how we may inadvertently misuse our data and taint our test results, we can act to prevent it.

Optimal stopping can be interesting if, and only if, we implement it correctly. Be careful not to make decisions based on misleading data. If you are adopting a data-driven perspective in your company (and I think everyone should), your decisions are only as good as your data.

Notebook with some of the simulations mentioned here
