We recently released Sequential Testing on Statsig, a much-requested feature that solves the “peeking problem” and shows valid results even when checking on an experiment early. This is achieved by adjusting the p-values and confidence intervals to account for the increase in false positive rates associated with continuous monitoring of experiments. Here we outline our approach to Sequential Testing and recommended best practices.

## The Need for Sequential Testing

A common concern when running online A/B tests is the “peeking problem”: the notion that making early ship decisions as soon as statistically significant results are observed leads to inflated false positive rates. This stems from a tension between two aspects of online experimentation:

Unlike A/B tests conducted in fields like Psychology and Drug Testing, state-of-the-art online experimentation platforms use live data streams and can surface results immediately. These results can then be updated to reflect the most up-to-date insights as data collection continues. Naturally, we want to leverage this powerful capability to make the best decisions as early as possible.

### Limitations of the underlying statistical test

In hypothesis testing, we accept a predetermined false positive rate, typically 5% (alpha = 0.05). When the p-value is less than 0.05, it’s common practice to reject the null hypothesis and attribute the observed effect to the treatment we’re testing. We do this knowing that there’s a 5% chance that a statistically significant result is actually just random noise.

However, ongoing monitoring while waiting for significance leads to a compounding effect of the 5% false positive rate. Picture the test as a 20-sided die: if you roll it once, you’ll have a 5% (1 in 20) chance of getting a 1, and every additional peek at the results is effectively another roll.
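To put a number on that compounding: if each peek behaved like an independent roll, the chance of at least one false positive across k peeks would be 1 − 0.95^k, which is already about 40% after ten peeks. Interim looks at accumulating data are correlated rather than independent, so the real inflation is smaller, but it remains far above the nominal 5%. The simulation below is a sketch of the effect, not Statsig’s implementation; the sample sizes, the ten-look schedule, and the use of a t-test are illustrative assumptions. It runs A/A tests, where the null hypothesis is true by construction, and “ships” as soon as any interim check looks significant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative assumptions: 2,000 simulated A/A tests (no true effect),
# 10,000 users per arm, and 10 equally spaced interim checks.
n_experiments = 2_000
n_per_arm = 10_000
looks = np.linspace(1_000, n_per_arm, 10, dtype=int)

false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(size=n_per_arm)  # control
    b = rng.normal(size=n_per_arm)  # treatment, same distribution as control
    # Peeking: declare a win as soon as any interim look shows p < 0.05
    if any(stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05 for n in looks):
        false_positives += 1

print(f"False positive rate with peeking: {false_positives / n_experiments:.1%}")
```

With ten looks, runs of this sketch typically land near 20%, roughly four times the 5% error rate we signed up for.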
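Sequential Testing counteracts this by raising the evidence bar at early looks. As one concrete illustration of the general idea (a textbook group-sequential boundary in the style of O’Brien and Fleming, not necessarily the exact adjustment Statsig applies), the z-statistic threshold can be scaled by sqrt(n_max / n): early peeks require overwhelming evidence, and the bar relaxes toward the familiar 1.96 as the experiment reaches its planned sample size.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_experiments = 2_000
n_per_arm = 10_000
looks = np.linspace(1_000, n_per_arm, 10, dtype=int)

# O'Brien-Fleming-shaped boundary (illustrative, not Statsig's exact method):
# the z threshold decays from a very conservative value at early looks toward
# the fixed-horizon 1.96 at the planned end. Exact constants are computed
# numerically; this sqrt scaling is the standard approximation.
z_final = stats.norm.ppf(1 - 0.05 / 2)             # ~1.96
thresholds = z_final * np.sqrt(n_per_arm / looks)  # ~6.2 at the first look

false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(size=n_per_arm)
    b = rng.normal(size=n_per_arm)
    # Ship early only if the interim z-statistic clears the widened bar
    if any(abs(stats.ttest_ind(a[:n], b[:n]).statistic) > t
           for n, t in zip(looks, thresholds)):
        false_positives += 1

print(f"False positive rate with adjusted thresholds: "
      f"{false_positives / n_experiments:.1%}")
```

The same peeking schedule now yields an overall false positive rate close to the nominal 5% (slightly above it, since the exact boundary constant is a touch larger than 1.96). Equivalently, the correction can be reported as adjusted p-values and widened confidence intervals at each look, which is the form described above.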