# The trouble with hypothesis testing and statistical significance

It is not uncommon for me to encounter the following question while marketing a solution: “What are the hypotheses you are testing, and how will you prove or disprove them?”, which alludes to standard hypothesis testing techniques.

In this context, I had a short conversation with a friend a couple of days back on the prevalent culture in many analytics organisations and academic research alike, and its possible drawbacks. He was a strong proponent of this culture; also, to be fair, I didn’t put forth my view clearly.

It turns out that null hypothesis testing and statistical significance testing are often regarded as being among the “august procedures of analysis”; computing statistical significance and using it as an “aid” to put forth an “objective” scientific argument is taken to be the norm. This, based on my reading, has many a time resulted in sloppy scientific work and erroneous judgements. Therefore, I’m writing this blog to explore the idea and to invite comments from readers about any misunderstandings I might have.

# The backdrop

## What are some of the cornerstones of a scientific theory?

A scientific theory should be objective (or at least attempt to provide a knob over subjective aspects), repeatable, and testable.

The reader is requested to review [Carl Sagan, 2014]. The article covers several aspects that are necessary to develop “good theories”. We will refer to it later to show how statistical significance testing, the way it is commonly used, violates several of these aspects.

## What is statistical hypothesis testing?

Statistical hypothesis testing is a procedure used to argue for the validity (or invalidity) of a scientific argument (in the context of a scientific theory) in four steps [Johnson, 1999]:

1. An experimenter develops a “null hypothesis” about a phenomenon or parameter. This hypothesis is generally the opposite of what he/she (from now on, for simplicity, I’ll use one of he or she) wants to prove through research (aka the “research hypothesis”). This research hypothesis should ideally be generated inductively through numerous experiments or deductively from theory.
2. Subsequently, she collects data from “samples” of the population under question. Typically, the population is sampled twice: once to generate what’s known as the “experimental group” and once for the “control group”. The parameter or phenomenon to be investigated is applied or varied for the experimental group, while it is kept constant for the control group. The experimenter also ensures, through a fair process, that the control and experimental groups are sampled such that they are not biased by the experimenter or by any other external parameter or phenomenon. However, it is understood that random sampling may still introduce differences, which show up in the computations done in the next step.
3. The effects of changing the said parameter in the experimental group are observed. A statistical test of the null hypothesis is used to determine a “statistic” and a “p-value”. The p-value gives the probability of obtaining results at least as “extreme” as the observations if the null hypothesis is held true. In other words, it measures the “rarity” of observing the effects under the null hypothesis.
4. An interpretation is made of the statistic and the p-value. A low p-value is taken as grounds to reject the null hypothesis, and this rejection is then (often wrongly) extended to other conclusions about the phenomenon being studied. This forms the crux of the discussion in this blog.
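The four steps above can be sketched in a few lines of Python. This is only an illustration under assumptions of mine: the measurements are made-up numbers, the statistic is the difference of group means, and the p-value comes from an exact one-sided permutation test (one of many possible tests; nothing above prescribes this particular one). The function name `permutation_p_value` is hypothetical.

```python
from itertools import combinations

# Step 2: hypothetical measurements (made up for illustration only).
control = [4.1, 5.0, 4.8, 5.3, 4.6]
experimental = [5.9, 6.3, 5.5, 6.8, 6.1]


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(control, experimental):
    """One-sided exact permutation test on the difference of means.

    Step 1 (null hypothesis): the group labels are irrelevant, so every
    way of splitting the pooled data into two groups is equally likely.
    """
    pooled = control + experimental
    indices = range(len(pooled))
    # Step 3: the observed statistic.
    observed = mean(experimental) - mean(control)

    at_least_as_extreme = 0
    total = 0
    for exp_idx in combinations(indices, len(experimental)):
        chosen = set(exp_idx)
        exp = [pooled[i] for i in indices if i in chosen]
        ctl = [pooled[i] for i in indices if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point round-off.
        if mean(exp) - mean(ctl) >= observed - 1e-12:
            at_least_as_extreme += 1
    # Step 3: p-value = share of splits at least as extreme as observed.
    return observed, at_least_as_extreme / total


observed, p_value = permutation_p_value(control, experimental)
print("observed difference in means: %.3f" % observed)
print("permutation p-value: %.5f" % p_value)
```

With these made-up numbers every experimental value exceeds every control value, so only one of the 252 possible splits is as extreme as the observed one and the p-value is 1/252 ≈ 0.004. Step 4, the interpretation, is exactly the part that code cannot do for us.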

## Why is statistical hypothesis testing done?

In theory, statistical testing is done to disprove that a “random setting” has caused an interesting observation in an experiment.

In practice, it is done to show that a parameter of interest to a scientist contributes significantly to a phenomenon. In this process, it often gives way to erroneous or opportunistic interpretations of the p-value (called fantasies about statistical significance in Craver, 1978) by both the experimenter and the consumer of the results of an experiment.

# The Story

## Where does it all go wrong?

My understanding is that the misuse of statistical significance testing is due to an incorrect understanding of, or poor adherence to, scientific methods. These are elaborated below.

1. The belief that statistical significance testing is objective: Although the computation of p-values uses a mathematical procedure, significance testing and its outcome depend largely on the experiment design. The subjectivity is handled by the “experimenter” and not the “consumer” of the results of an experiment.

For example, let’s consider an experiment I found in Craver, 1978. Say we want to study the effect of Vitamin C on the common cold. As a first step, we define the null hypothesis $latex H$ to denote that Vitamin C has no effect on the common cold. Our objective is to disprove $latex H$.

For this, we set up an experiment with $latex N$ “matching pairs” (a matching pair comprises two subjects that are as similar as possible; this is for the creation of the experimental and control groups). Next, we count the number of pairs in which the administration of Vitamin C leads to a “beneficial effect” on the common cold. (In other experiments, one may just be concerned with some “effect”, whether “beneficial” or “detrimental”; in this case, a minor variant of the explanation given here can be used.)

Let’s assume that in this experiment, $latex k$ pairs turn out to show a “beneficial” effect and the remaining $latex N-k$ pairs don’t. Now, under the null hypothesis, we can compute how “often” such an outcome can happen as follows: the number of pairs showing a beneficial effect follows the same distribution as the number of heads in $latex N$ tosses of a balanced coin. So, if $latex p$ is the probability of observing a beneficial effect in a pair, we get $latex p = 0.5$. The number of pairs showing an effect, out of $latex N$ pairs, is then given by a binomial distribution. A plot of this distribution for two values of $latex N$ is shown below:

So, we have observed $latex k$ pairs having a beneficial effect; going by the standard process, we compute the p-value as $latex P[X \ge k|N, p]$, the probability of an outcome at least as extreme as the one observed. For example, for $latex N=10$, using the tables from [IITMTables, 2015], if $latex k=8$, we get a p-value of $latex 1 - P[X \le 7] = 1 - 0.9453 = 0.0547$. This p-value is used to reject or accept the null hypothesis.

Now, something magical (and arbitrary) is done. Different “fields” have different tolerances for the p-value. In some fields, a p-value below $latex 0.05$ is required for the null hypothesis to be rejected, while others require a p-value below $latex 0.001$, and so on. So, depending on the “academic context”, the hypothesis is accepted or rejected.

This is where it gets more interesting. Apart from this ad hoc nature of accepting and rejecting the null hypothesis, there is another problem with this approach: the computation of the p-value itself depends on the sampling process used. The above result is valid only if the experimenter decided in advance to do $latex N$ controlled trials and then compute the p-value. If she instead decided to conduct trials in series, stopping as soon as at least $latex k$ pairs with an effect and at least $latex k$ pairs without one had been observed (so that $latex N$ is determined by the stopping rule rather than fixed in advance), the probability distribution under the null hypothesis would change. The distribution for such a case is given below.

To illustrate the difference, let’s take a case where we have $latex 17$ pairs out of which $latex 13$ show effects and $latex 4$ don’t. The p-value for this, from figure-1, is $latex 0.049$ (this is the probability of getting more than $latex 12$ or fewer than $latex 5$ pairs with an effect). Now, under the conditions for figure-2, the p-value (i.e. the probability of needing $latex 17$ pairs or more) is $latex 0.0211$.

In any case, if the approach is wildly different, say, conducting experiments in phases, it would affect the p-values too (refer to Berger & Berry, 1988 for a detailed treatment of this statement). Therefore, the intentions of the experimenter play an important role in the computation of the p-value. This subjectivity is sometimes ignored or opportunistically presented in papers.
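The numbers in this example can be checked exactly instead of by table lookup or simulation. The sketch below is my own reconstruction, not code from any of the references: the function names are hypothetical, and the sequential design is read as “keep adding pairs until both outcomes have been seen at least 4 times”, which is an assumption about the stopping rule.

```python
from math import comb


def binom_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def upper_tail(k, n, p=0.5):
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


def fixed_n_two_sided(k, n):
    """Design 1: n is fixed in advance; two-sided p-value under p = 0.5.

    Counts outcomes at least as lopsided as k out of n in either
    direction; the two tails are equal by symmetry.
    """
    hi = max(k, n - k)
    return min(1.0, 2 * upper_tail(hi, n))


def sequential_p_value(n_observed, k_min):
    """Design 2 (assumed stopping rule): sample until both outcomes
    have occurred at least k_min times.

    The experiment needs n_observed pairs or more exactly when, after
    n_observed - 1 pairs, one outcome has occurred fewer than k_min times.
    """
    m = n_observed - 1
    return sum(
        binom_pmf(i, m)
        for i in range(m + 1)
        if i < k_min or m - i < k_min
    )


print("N=10, 8 or more beneficial pairs: %.4f" % upper_tail(8, 10))
print("N=17 fixed, 13 vs 4 (two-sided):  %.4f" % fixed_n_two_sided(13, 17))
print("sequential, stopped at 17 pairs:  %.4f" % sequential_p_value(17, 4))
```

The three values come out at roughly 0.0547, 0.049 and 0.021. The last one is the exact counterpart of the simulated figure quoted above, and it differs from the fixed-N value even though the observed data (13 versus 4) are identical.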

2. Incorrect allusions to the “fantasies of significance” [Craver, 1978] in literature: Apart from the subjectivity being largely controlled by the experimenter, the following erroneous interpretations of p-values often distort scientific results:

   1. Odds-against-chance fantasy: This one is quite obvious. Often, the p-value is taken to be the probability that the result was due to chance. This amounts to $latex P[H_0|E]$, where the null hypothesis is denoted by $latex H_0$ and $latex E$ is the observed evidence. That quantity is what Bayesian statistics provides; standard statistical techniques deal exclusively with likelihoods. In contrast, the p-value is computed from $latex P[E | H_0]$ (the probability of evidence at least as extreme as $latex E$); in other words, we assume $latex H_0$ to be true while computing the p-value. At this juncture, it is also interesting to note that the null hypothesis is almost always false and, further, that it is often (incorrectly, but for “practical purposes”) set after the data has been collected.
   2. Replicability or reliability fantasy: This one is easier to spot. If $latex R$ denotes the event that the results replicate, the p-value does not measure the probability of $latex R$; neither does it state anything about how replicable the results are, or whether the same strength of differences will be observed in a subsequent trial. In probabilistic terms, it does not measure $latex P[R|E]$ or $latex P[R|H]$, where $latex E$ is the observed evidence and $latex H$ is any hypothesis. In fact, there was an article stating that 75% of psych research results could not be replicated [Gaurdian, 2015] 🙂
   3. Valid research fantasy: While it seems absurd for someone to fall into this trap, it is an indirect endorsement of the scientific importance of results and is therefore, according to Craver, the most serious. This amounts to stating that the research hypothesis is true, as opposed to saying that there is evidence against the null hypothesis being true. In other words, we are incorrectly working with $latex P[H_1|E]$ instead of $latex P[E|H_0]$, where $latex H_0$ and $latex H_1$ are the null and research hypotheses, respectively. It is important to understand that, using the p-value, one may reject the null hypothesis. But this is not a direct indicator of the research hypothesis being true; other alternative hypotheses will also have to be considered to arrive at such a conclusion.

3. Faulty design of experiments and poor research methodologies:

   1. Not disclosing or missing out aspects of an experiment: As described previously, when using statistical significance testing, the experimenter has to make available all the decisions that went into the sampling of the control and experimental groups. These have a bearing on the computation of p-values.
   2. Having very large N: With a very large sample size, the p-value almost invariably becomes small and displays “statistical significance”. This is because the two groups are often not sampled from exactly the same “population” with respect to the variable being measured, and a large enough sample will detect even a negligible difference. Since the experimenter has complete control over the experiment, statistically significant results can be obtained if she chooses to.
   3. Incorrect consideration of alternative hypotheses or statistically insignificant results: In the literature, results that do not show statistical significance are often ignored and not interpreted against the research hypothesis. Further, once statistical significance is established, future work focuses on eliminating alternative hypotheses, and the null hypothesis is no longer considered a prime contender. All of these aspects lead to corruption of the scientific method: refer to observation selection, misunderstanding the nature of statistics, excluded middle, straw man, and suppressed evidence in Carl Sagan, 2014.
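The “very large N” point is easy to see numerically. The following is a rough sketch under stated assumptions: the scenario (51% of matched pairs showing a benefit, regardless of sample size) is hypothetical, and the tail probability uses the normal approximation to the binomial with a continuity correction rather than an exact computation. The helper name `approx_upper_tail` is mine.

```python
from math import erfc, sqrt


def approx_upper_tail(k, n):
    """Normal approximation (with continuity correction) to P[X >= k]
    for X ~ Binomial(n, 0.5). An approximation, not an exact value."""
    mean = n / 2.0
    sd = sqrt(n) / 2.0
    z = (k - 0.5 - mean) / sd
    return 0.5 * erfc(z / sqrt(2.0))


# Hypothetical scenario: 51% of pairs show a benefit at every sample size.
for n in (100, 10_000, 1_000_000):
    k = (51 * n) // 100
    print("N=%8d  k=%7d  one-sided p ~ %.3g" % (n, k, approx_upper_tail(k, n)))
```

The practical size of the effect is the same in all three rows (one extra beneficial pair per hundred), yet the approximate p-value goes from far above 0.05 at N=100 to below 0.05 at N=10,000 and to a vanishingly small number at N=1,000,000.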

## Why are we still using it, then?

In my opinion, although many people have vehemently criticized null hypothesis testing for several decades, its use is perpetuated by a system of research that places importance on periodic/short-turnaround publications with results, sensationalism, and the piling up of hypotheses to keep the machinery chugging (irrespective of what comes out of it). Some of the possible, specific reasons are listed below (many of these are from the references mentioned at the end of the blog).

1. Small-size-data research: Many areas of research such as psychology, education, wild-life research, sociology often deal with sample sizes that are very small. When questions are raised about the generalizability of results, it is easy to duck behind “statistical significance” of the results obtained.
2. Complexity=Awesomeness attitude: All the complexity that goes into doing null hypothesis testing comes out as something “rigorous” and “sophisticated”. This perception could be a “soft” contributing factor to its use in putting forth a scientific theory.
3. Illusion of objectivity: Doing a statistical significance test and using a pre-defined threshold (while ignoring all the baggage that comes with it) gives an illusion that all has been done according to procedure.
4. Incorrect allusion to replicability or, more generally, the “fantasies of significance”: described above; this happens because many of the quantities discussed above (like replicability, the odds of the research hypothesis being true, reliability, etc.) are the cornerstones of a good scientific theory.
5. Enforced by journals and thesis supervisors: For whatever reason.
6. Easy availability in statistical packages: Makes them a natural choice when conducting experiments!
7. Taught in all basic statistics courses: As one of the basic processes to be used with scientific theory building.
8. Human Nature! I found these two tables in one of the references [Johnson, 1999]: about how a scientist, when it comes down to ground reality, views the outcome of his/her experiment.

## An alternative: Bayesian statistics

So, if the statistical significance testing road is full of potholes, what is an alternative?

The most straightforward answer is to compute the quantities that the p-value is most often mistaken for. So, we could instead compute $latex P[R|E]$, $latex P[H_0|E]$, and $latex P[H_1|E]$. A popular way to do this is to use Bayesian statistics.

A general introduction to this is the following…

We define the prior belief about the phenomenon or parameter being studied in probabilistic terms. This belief comes from one or more of:

1. Common sense and Intuition
2. Previous research findings
3. Situational constraints

Let’s represent it as $latex P[\theta]$ where $latex \theta$ is the variable of interest. It is worth noting here that there is a shift in the meaning of probability from the frequentist view to a notion of belief or uncertainty.

This prior belief is combined with empirical observation, which is given by a “likelihood” (say, $latex P[X|\theta]$), to give what’s known as a “posterior probability”, $latex P[\theta|X]$. $latex P[R|E]$, $latex P[H_0|E]$, and $latex P[H_1|E]$ can be viewed as posterior probabilities. Specifically, Bayes’ rule is used to get to this:

$latex \textrm{Posterior} = \frac{\textrm{Prior} \times \textrm{Likelihood}}{\textrm{Evidence}}$

Here, the evidence is $latex P[X]$; so we get

$latex P[\theta | X] = \frac{ P[\theta] P[X | \theta]}{P[X]}$

$latex P[X]$ can further be expanded using the total probability rule. For continuous distributions, it is given by:

$latex P[X] = \int_{\theta \in \Theta} P[X | \theta]P[\theta] d\theta$

For discrete distributions, it is given by:

$latex P[X] = \sum_{\theta \in \Theta} P[X | \theta]P[\theta]$.
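To make the discrete form concrete, here is a minimal sketch of Bayes’ rule over a handful of candidate values of $latex \theta$. The grid of candidate values and the uniform prior are assumptions chosen purely for illustration; the data (13 “successes” in 17 trials) borrow the numbers from the Vitamin C example.

```python
from math import comb

# Hypothetical discrete set of candidate values for theta
# (the probability of a beneficial effect in one matched pair).
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = [1.0 / len(thetas)] * len(thetas)  # uniform prior belief P[theta]

N, k = 17, 13  # observed data X: k beneficial pairs out of N


def likelihood(theta):
    """Binomial likelihood P[X | theta]."""
    return comb(N, k) * theta**k * (1 - theta) ** (N - k)


# Numerator of Bayes' rule for every candidate theta.
joint = [likelihood(t) * pr for t, pr in zip(thetas, prior)]
# Evidence P[X] via the total probability rule (the discrete sum).
evidence = sum(joint)
posterior = [j / evidence for j in joint]

for t, post in zip(thetas, posterior):
    print("theta=%.1f  posterior=%.4f" % (t, post))
```

The posterior sums to one by construction, and with these inputs most of the belief ends up on $latex \theta = 0.7$, the candidate closest to the observed fraction 13/17.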

As an example (a general derivation of what’s in Berger & Berry, 1988), instead of computing the p-value for the study of Vitamin C’s (beneficial) effect on common cold, if we want to find out $latex P[H_0 | E ]$ we could do the following:

Let $latex p$ be the probability that Vitamin C has a positive effect on common cold. Under the null hypothesis, $latex p=0.5$. Otherwise, $latex p\ne0.5$. Now, instead of deciding on what $latex p$ could be under the research hypothesis, we could parametrize it into a “prior belief”, by saying that it would be at most $latex p_0$ (one of many ways of doing it).

Since each matching pair can either have an effect or no effect, say, E, and NE, respectively,  the number of E observed after $latex N$ trials is given by a binomial of the form:

$latex P[k|p;N] = {N \choose k} p^k (1-p)^{N-k}$

For the null hypothesis, since $latex p=0.5$, we get:

$latex P[k|H_0; N] = {N \choose k} 0.5^N$

Now, let’s set the prior probability of the null hypothesis (for the purposes of illustration) to $latex P[H_0] = 0.5$, and for the remaining $latex 0.5$ assigned to the alternative hypothesis, $latex H_1$, let’s have $latex p \sim U[1-p_0, p_0]$ (uniformly distributed between $latex 1-p_0$ and $latex p_0$, with $latex p_0 > 0.5$, so that the density is $latex \frac{1}{2p_0 - 1}$). Now, applying Bayes’ rule, we get

$latex P[H_0 | N; k] = \frac{ {N \choose k} 0.5^N \times 0.5}{ {N \choose k} 0.5^N \times 0.5 + 0.5 \times {N \choose k} \times \frac{1}{2p_0 - 1} \int_{1 - p_0}^{p_0} p^k (1 - p)^{N-k} dp}$

This is equal to $latex P[H_0 | N; k] = [ 1 + \frac{2^N}{2p_0 - 1} \int_{1-p_0}^{p_0} p^k (1-p)^{N-k} dp] ^ {-1}$

Here, the value of $latex p_0$ can be set as prior information and therefore, the subjectivity is not done away with, but instead placed at the control of the consumer.
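The dependence on $latex p_0$ can be explored numerically. The sketch below evaluates the posterior probability of the null hypothesis for the 13-out-of-17 example; it assumes $latex 0.5 < p_0 \le 1$, so that the uniform prior on $latex [1-p_0, p_0]$ has positive width $latex 2p_0 - 1$ and density $latex 1/(2p_0 - 1)$. The use of Simpson’s rule for the integral and the function names are my choices, not something taken from Berger & Berry.

```python
def integrate(f, a, b, steps=2000):
    """Composite Simpson's rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def posterior_null(N, k, p0):
    """P[H0 | N, k] with P[H0] = 0.5 and p ~ U[1 - p0, p0] under H1."""
    if not 0.5 < p0 <= 1.0:
        raise ValueError("p0 must lie in (0.5, 1]")
    integral = integrate(lambda p: p**k * (1 - p) ** (N - k), 1 - p0, p0)
    # The binomial coefficient and the 0.5 prior weights cancel out.
    return 1.0 / (1.0 + 2**N / (2 * p0 - 1) * integral)


for p0 in (0.6, 0.7, 0.8, 0.9, 1.0):
    print("p0=%.1f  P[H0 | N=17, k=13] = %.3f" % (p0, posterior_null(17, 13, p0)))
```

For $latex p_0 = 1$ (a flat prior over all of $latex [0, 1]$) the integral has a closed form and the posterior probability of the null hypothesis is about 0.25, which is a long way from the 0.049 that the p-value is so often mistaken for.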

In practice, such analysis is done with priors that have interesting mathematical properties; the prior distribution is often chosen to be of a form that, when combined with the likelihood, gives the same form for the posterior. For example, a beta distribution, when combined with a likelihood that’s a binomial distribution, again gives a beta distribution. This simplifies mathematical calculations and also enables us to carry forward our beliefs to subsequent experiments in a principled form. Such priors are called “conjugate priors”. I’ve found Bishop, 2006 to contain a good treatment of such conjugate priors for practical purposes. Further, Bayesian approaches depend only on the observed data and not on the experimental methods the experimenter had in mind, and therefore clear up a lot of the confusion brought about by standard statistical testing.
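A minimal sketch of the beta-binomial conjugate update, assuming a Beta(1, 1) (uniform) prior and an arbitrary split of the 17 pairs into two batches; both are illustrative choices of mine. The update itself is the standard one: a Beta(a, b) prior combined with s successes and f failures gives a Beta(a + s, b + f) posterior.

```python
def beta_update(a, b, successes, failures):
    """Posterior Beta parameters after observing binomial data."""
    return a + successes, b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


a0, b0 = 1.0, 1.0  # assumed uniform prior on p

# All 17 pairs at once: 13 with a beneficial effect, 4 without.
batch = beta_update(a0, b0, 13, 4)

# The same data in two (hypothetical) phases: 10 pairs, then 7 more.
step1 = beta_update(a0, b0, 8, 2)
step2 = beta_update(step1[0], step1[1], 5, 2)

print("batch posterior:      Beta(%g, %g)" % batch)
print("sequential posterior: Beta(%g, %g)" % step2)
print("posterior mean of p:  %.4f" % beta_mean(*batch))
```

Both routes end at Beta(14, 5), with a posterior mean of 14/19 ≈ 0.737: the posterior from the first phase simply becomes the prior for the second, and how the data collection was staged leaves the final belief unchanged.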

# References

1. [Bishop, 2006] Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
2. [Craver, 1978] Ronald P. Craver. The case against statistical significance testing. Vol 48. Issue 3. Pages 378-399. Harvard Educational Review. 1978.
3. [Craver, 1993] Ronald P. Craver. The Case against Statistical Significance Testing, Revisited. The Journal of Experimental Education, Vol. 61, No. 4, Statistical Significance Testing in Contemporary Practice (Summer, 1993), pp. 287-292. 1993
4. [Berger & Berry, 1988] James O. Berger and Donald A. Berry. Statistical Analysis and the Illusion of Objectivity. American Scientist. pp. 159-165. 1988
5. [Lee & Wagenmakers, 2005] Michael D. Lee and Eric-Jan Wagenmakers. Bayesian Statistical Inference in Psychology: Comment on Trafimow. American Psychological Association: Psychological Review. Vol 112. No. 3. pp. 662-668. 2005
6. [Feinberg & Gonzalez, 2012] Fred M. Feinberg and Richard Gonzalez. Bayesian Modeling for Psychologists: An applied approach. APA Handbook of Research methods in Psychology. Vol 2. Research Designs. 2012.
7. [Frayley, 2003] R. Chris Fraley. The Statistical Significance Testing Controversy: A Critical Analysis. Graduate Seminar in Methods and Measurement. Fall 2003.
8. [Gelman, 2012] Andrew Gelman. The inevitable problems with statistical significance and 95% intervals. http://andrewgelman.com/2012/02/02/the-inevitable-problems-with-statistical-significance-and-95-intervals/. 2012
9. [Higgs, 2015] Megan D. Higgs. Do We Really Need the S-word?. American Scientist. http://www.americanscientist.org/issues/id.15964,y.0,no.,content.true,page.1,css.print/issue.aspx. Retrieved 2015.
10. [Carl Sagan, 2014] https://www.brainpickings.org/2014/01/03/baloney-detection-kit-carl-sagan/
11. [IITMTables, 2015] http://math.usask.ca/~laverty/S245/Tables/BinTables.pdf
12. [Ziliak & McCloskey, 2009] Stephen T. Ziliak and Deirdre N. McCloskey. The Cult of Statistical Significance. Section on Statistical Education. JSM 2009.
13. [Gaurdian, 2015] http://www.theguardian.com/science/2015/aug/27/study-delivers-bleak-verdict-on-validity-of-psychology-experiment-results
14. [Johnson, 1999] Douglas H. Johnson. The Insignificance of Statistical Significance Testing. USGS Northern Prairie Wildlife Research Center. Paper 225. 1999.

# Appendix

Code used to generate the two plots:

bindist.py

```python
#!/usr/bin/env python3
# Plots the binomial distribution for a fixed number of trials N,
# estimated by simulation. Ported from the Python 2.7 original.

from itertools import groupby

import matplotlib.pyplot as plt
import numpy as np


def autolabel(rects, ax):
    # Attach the bar height as a vertical text label above each bar.
    for rect in rects:
        height = rect.get_height()
        print(rect.get_height(), rect.get_width())
        ax.text(rect.get_x() + rect.get_width() / 2., 1.05 * height,
                '%0.4f' % (height), ha='center', va='bottom',
                fontdict={"rotation": "vertical", "size": "smaller"})


def plotBinomialDistribution(N, p, ax, pltCnt=100000):
    # Simulate pltCnt experiments of N trials each and turn the counts
    # of each outcome into relative frequencies.
    x = np.random.binomial(N, p, pltCnt)
    y = [(k, len(list(v)) / float(pltCnt)) for k, v in groupby(sorted(x))]
    sortedDict = sorted(y, key=lambda x: x[0])

    x = [v[0] for v in sortedDict]
    vals = [v[1] for v in sortedDict]

    ax.ticklabel_format(useOffset=False)
    rects = ax.bar(x, vals)
    ax.set_title("N=%d; p=%1.2f" % (N, p))
    return rects


if __name__ == "__main__":
    N = [8, 17]
    p = 0.5

    f, axarr = plt.subplots(2, 1, sharex=False, sharey=True)

    for ii in range(0, len(N)):
        print(ii)
        rects = plotBinomialDistribution(N[ii], p, axarr[ii])
        autolabel(rects, axarr[ii])

    plt.show()
    x = input()  # wait for Enter before exiting
```

atleastk.py

```python
#!/usr/bin/env python3
# Plots the distribution of the number of matched pairs needed when
# sampling continues until at least k positives and at least k negatives
# have been seen. Ported from the Python 2.7 original.

import matplotlib.pyplot as plt
import numpy as np


def autolabel(rects, ax):
    # Attach the bar height as a vertical text label above each bar.
    for rect in rects:
        height = rect.get_height()
        print(rect.get_height(), rect.get_width())
        ax.text(rect.get_x() + rect.get_width() / 2., 1.05 * height,
                '%0.4f' % (height), ha='center', va='bottom',
                fontdict={"rotation": "vertical", "size": "smaller"})


def plotAtleastK(k, p, ax, pltCnt=100000):
    pDist = dict()
    for ii in range(0, pltCnt):
        positives = 0
        negatives = 0
        # The loop condition and the random draw were garbled in the
        # original listing; they are reconstructed here from the stopping
        # rule described in the text (at least k of each outcome).
        while positives < k or negatives < k:
            if np.random.rand() < p:
                positives += 1
            else:
                negatives += 1
        key = positives + negatives
        if key not in pDist:
            pDist[key] = 0

        pDist[key] += 1
    print(pDist)
    maxVal = max(pDist.keys())
    minVal = min(pDist.keys())
    print(maxVal, minVal)
    x = np.zeros(maxVal - minVal + 1)
    vals = np.array(range(minVal, maxVal + 1))

    # Convert counts into relative frequencies.
    for k2, v in pDist.items():
        pDist[k2] = v / float(pltCnt)
        x[k2 - minVal] = pDist[k2]

    ax.ticklabel_format(useOffset=False)
    rects = ax.bar(vals, x, 1.0)
    ax.set_title("k=%d; p=%1.2f" % (k, p))
    ax.set_ylabel("Probability")
    ax.set_xlim([minVal, maxVal])
    return rects


if __name__ == "__main__":
    k = [4, 7]
    p = 0.5

    f, axarr = plt.subplots(2, 1, sharex=False, sharey=True)

    for ii in range(0, len(k)):
        print(ii)
        rects = plotAtleastK(k[ii], p, axarr[ii])
        autolabel(rects, axarr[ii])

    axarr[-1].set_xlabel("Matched Pairs")
    plt.show()
    x = input()  # wait for Enter before exiting
```