Null Hypothesis Significance Testing (NHST) is common in research, notably in the biomedical and social sciences.
The practice is nonsense and harmful.
At best, it hinders science and wastes money. At worst, it hurts people.
It should be abandoned.
A very short summary
- You should consider no longer using significance testing (NHST), because it slows the progress of science and damages real lives.
- NHST's continued adoption is supported by cognitive biases, notably: dichotomania (black-and-white thinking), nullism (false notion that “zero” is objective), and statistical reification (confusing data and models with physical reality).
- NHST is not the same as p‑values.
- They usually come together but you can have NHST without p‑values and p‑values without NHST.
- p‑values are widely misinterpreted.
- There are valid uses for p‑values; their mindless degradation into “significant or not” isn't one of them.
What is significance testing?
A statistical procedure that promises to give a yes-or-no verdict of whether a discovery has been made.
Many statistical tests exist, such as ANOVA, χ², and t-tests. A certain probability is calculated over the results of those tests. This probability is called a p‑value.
The significance test is this: if this p‑value is found to be...
- below the agreed-upon threshold, the result is said to “have reached significance”, a discovery is declared, and researchers feel joy.
- above the agreed-upon threshold, it “has not reached significance”, no discovery is declared, and researchers feel doomed.
The usual threshold is 0.05, but it varies by field.
Three kinds of significance levels
There are three different interpretations for “significance level”:
- The setting to 5% or any other level — an early proposal by Fisher.
- The interpretation as α — based on Neyman-Pearson's theory.
- The report of exact ps, no thresholds — a later proposal by Fisher.
Significance in the...
- first is a property of the test, set by mere convention;
- second is a property of the test, set by cost-benefit analysis;
- third is a property of the data.
So here's one thing Null Hypothesis Significance Testing is: a mix of the three theories.
I invite you to see Question 4 (page 9) of Gigerenzer et al. 2004. Then ask yourself: “are you committing the confusion of Dr. Publish‑Perish?”
(Only if you're interested in much more technical detail, see Schneider 2014 and Perezgonzalez 2015.)
What about that “null” part?
So, what is a “null hypothesis” anyway? Depends on who you ask. It can mean:
- the hypothesis that you wish to refute (i.e. nullify); or
- the hypothesis of zero effect and zero systematic error, which you wish to refute.
Pick your terminology. People who adopt the...
- former will say that any hypothesis can be tested (be your “null”), and that zero effect and zero systematic error is a nil-null hypothesis;
- latter will say that zero effect and zero systematic error is a null hypothesis; some will say the null is the only one to test; others know that other hypotheses can be tested, not only the null.
This means that “null” is ambiguous. As if we didn't already have enough semantic confusion with the word “significance”.
What significance testing isn't
For one, significance testing isn't the p‑value.
- You can calculate a p‑value without giving a yes/no verdict of “statistical significance”.
- And you can “test significance” using things other than p‑values.
It's true that, in practice:
- If you see a p‑value, it's highly likely that a “significance test” follows it.
- If you see a “significance test”, it's highly likely that it was through a p‑value.
However, they are separate things. You can test significance with intervals. Or Bayes factors. Or you can eyeball your results and declare “Given the null and assumptions, the chance of this observation is very low. I declare that Yes, significant!”
And this is an issue.
No, not the use of intuitions. Something deeper, that applies regardless of what you are using for declaring “significance”.
Anxious apes dichotomize. Anxious apes reify.
People expect certainty from single studies.
“Is it true or not? Effect or no effect? Significant or not significant? Will Freshfutazol™ cure my grandmother's athlete's foot or not? Tell me!”
This anxiety consumes people. Humans, like cows, ruminate.
So Mrs. Significance comes along and says:
“Behold, mortal! The numbers have been transmuted, and I have thy answer! Do I have thy attention now? Good. So... (dramatic drumming) Significant! May the journals open their doors to thy achievement.”
And yet, how dangerous. Because reality doesn't care if someone declares an issue closed. Reality doesn't care if the neurons of a couple of apes (that would be you and me) rearranged and those apes now believe Freshfutazol™ is the best thing invented since sliced bread.
Freshfutazol™ will do its thing to cure your grandmother's ailments — or it won't. Freshfutazol™ is indifferent to what you think of it. Freshfutazol™ doesn't care the least that Mrs. Significance showed up and pompously gave a verdict.
Statistical models and their outputs are not reality itself.
We want to be more certain and less anxious. And this is one reason why Mrs. Significance is so appealing. But her soothing powers come at the cost of delusion. And this delusive overconfidence can be, and often is, harmful.
Why is significance testing a problem?
It doesn't deliver on its promise of telling discoveries and non-discoveries apart.
- Not (only) because the threshold is “too high” (lowering it won't solve the problem);
- not (only) because of issues with p‑values (or Bayes factors, or intervals);
- but because mindlessly defaulting to a single hypothesis of zero effect and zero systematic error is unjustified and produces distortions;
- and because of an unjustified expectation that single studies can give final, yes-or-no answers about matters of existence based on arbitrary thresholds.
Ok, more concretely:
Null hypothesis significance testing...
- hinders the advancement of cumulative science, causes predictable nonreplication of studies, and is a major contributor to publication bias;
- distracts from the much more important “neglected factors” of “prior and related evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding” (McShane et al. 2018);
- harms people as an effect of the disinformation it promotes and the information it suppresses.
- it enables publication of junk research (masqueraded as “significant”);
- it discourages people from pursuing promising research (upfront deemed “insignificant”).
This happens because
- it allows and invites p-hacking (maliciously or unconsciously), so in the end anything can be made “significant”;
- its mindless mechanical procedures are easy;
- its output is reassuring, in part because of some cognitive biases, notably:
  - dichotomania: black-and-white thinking of significant-or-not
  - nullism: default of zero effect and zero systematic error
  - statistical reification: conflation of data and models with physical reality
- the gatekeeping role of “statistical significance” for having one's research accepted is a terrible proposition maintained for historical and psychological reasons rather than merit;
- its output is a wildly unreliable answer to a question you're usually not asking.
And since the difference between significant and nonsignificant is not significant... Uh‑oh.
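To make that last line concrete, here is a minimal sketch with two hypothetical studies. The estimates and standard errors are made up for illustration, and the helper `two_sided_p` is just a name chosen here; it assumes normally distributed estimates and a zero-effect null.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(estimate, std_error):
    """Two-sided p-value for a normal estimate against a zero-effect null."""
    z = estimate / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical studies, numbers chosen for illustration only.
est_a, se_a = 25.0, 10.0  # study A
est_b, se_b = 10.0, 10.0  # study B

p_a = two_sided_p(est_a, se_a)
p_b = two_sided_p(est_b, se_b)

# Now compare the two studies with each other (assumed independent).
diff = est_a - est_b
se_diff = sqrt(se_a**2 + se_b**2)
p_diff = two_sided_p(diff, se_diff)

print(f"study A: p = {p_a:.3f}")
print(f"study B: p = {p_b:.3f}")
print(f"A minus B: p = {p_diff:.3f}")
```

Under these assumptions, study A lands below 0.05 and study B above it, yet the comparison between the two also lands above 0.05. One “significant” and one “nonsignificant” study do not show that the studies disagree.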
Nullism is harmful
Usually only one hypothesis is tested, and it assumes zero effect and zero systematic error.
It implies a belief that “starting from zero” is always rational and impartial and unbiased and objective. It isn't. Why?
Rejection of the null doesn't imply your hypothesis “is true”
The usual logic is that if p is low, the alternative hypothesis is proved. Wrong.
Other hypotheses could better fit. Noise in sampling and measurement may better explain the low p. This is often overlooked.
Your (nil-)null hypothesis isn't only “Freshfutazol™ produces zero effect”: it's also all other model assumptions. It presupposes zero systematic error. A tiny p doesn't imply you discovered faster-than-light neutrinos. Maybe it was just a loose cable.
Rejection of the null doesn't imply large effects
Now suppose your measurements are flawless. You say “Freshfutazol™ produces an effect”. But you already knew that; “zero difference” is rarely true. An increase in precision will find some difference.
Rejecting a (nil-)null is like saying “this place is not sterile”.
Anything fits this exclusion — from pigsty to palace.
If this large uncertainty were kept in mind, fine. But the significant-or-not compulsion collapses it into “this place is dirty”. So even tiny, clinically irrelevant differences will be blown out of proportion.
Testing only the null brings a bias that favors the null
Consider Greenland 2016:
“Testing only the no-effect hypothesis simply assumes, without grounds, that erroneously defaulting to no effect is the least costly error, and in this sense is a methodologic bias toward the null.”
Suppose “this substance has no side effects” is the only hypothesis, and “p > 0.05”. It's approved. Harms from any false negatives would fall on users.
Science does not demand that you assume “no effect” as a starting point. In fact, often you should not.
Everything null in multiple comparisons? Unlikely.
So, there's this nice cartoon everybody likes to show. The point it makes is that the more data someone analyzes, the higher the chance of some “significant” result showing up. Then the person sweeps under the rug all the nonsignificant results and selectively reports the significant ones. This is not cool, and you shouldn't do it: rather, display all the analyses. So the criticism is valid.
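The cartoon's arithmetic can be sketched in a few lines. This assumes every null is true and all tests are independent, which is a simplification; the function name is made up for this example.

```python
def chance_of_any_significant(n_tests, alpha=0.05):
    """Chance that at least one of n_tests independent tests falls below
    alpha when every null hypothesis is true (simplifying assumption)."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 20, 100):
    chance = chance_of_any_significant(n_tests)
    print(f"{n_tests:>3} tests: {chance:.2f}")
```

With 20 tests the chance of at least one “significant” result is roughly 64% under these assumptions, even though nothing is going on.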
Scrupulous researchers then go one step further: “Hey, I'm not doing multiple comparisons to cheat. Here, I will prove. I will compensate by tweaking the threshold to make things harder.”
One popular way to do that is the Bonferroni correction. It's like this: if you made 20 comparisons, you compensate by dividing your 0.05 threshold by 20. So now some comparison is only “significant” if its p is below 0.0025. This addresses the false positive problem.
But... it may greatly increase the rate of false negatives. Causal relationships are there, and you throw them away, scared by the possibility of their being “only chance”.
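A rough sketch of that trade-off, assuming a two-sided z-test whose statistic is normally distributed around some true shift. The shift of 2.8 is a hypothetical value picked so that the uncorrected test has about 80% power; the function name is invented for this example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_sided_z(shift, alpha):
    """Probability that a two-sided z-test at level alpha rejects
    when the test statistic is distributed as N(shift, 1)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    upper_tail = 1 - STD_NORMAL.cdf(z_crit - shift)
    lower_tail = STD_NORMAL.cdf(-z_crit - shift)
    return upper_tail + lower_tail


n_comparisons = 20
alpha = 0.05
bonferroni_alpha = alpha / n_comparisons  # the corrected threshold
shift = 2.8  # hypothetical real effect, in standard-error units

power_plain = power_two_sided_z(shift, alpha)
power_bonf = power_two_sided_z(shift, bonferroni_alpha)

print(f"threshold {alpha}: false-negative rate {1 - power_plain:.2f}")
print(f"threshold {bonferroni_alpha}: false-negative rate {1 - power_bonf:.2f}")
```

In this sketch a real effect that would be missed about one time in five at 0.05 is missed more often than not at 0.0025.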
The problem is that you assume no relationship whatsoever between any of the things you analyze. If you always do this, you're implying that you live in a universe where nothing is expected to have any effect on anything else. This is just wrong in principle and in general (with exceptions). It throws away promising avenues of investigation by creating a penalty for gathering information.
If you deal with biological data and have been correcting your multiple comparisons, then you should read the short misconception 5 of Rothman 2014; and the four pages of Rothman 1990.
(Technical reading for statisticians: Gelman suggests (pdf) that with Bayesian inference and the correct prior the problem of multiple comparisons disappears.)
Flat priors: that's “null” for Bayesians, with similar vices
(If this title is unintelligible to you, just skip to the next section)
Uninformative priors are a Bayesian version of nullism.
- In theory, they have the virtues of: simplifying assumptions; making Frequentism dovetail into Bayesianism; and making p‑values transmute into posterior probabilities.
- In practice, they inherit vices of nullism and make posterior distributions unnecessarily implausible.
You may find Gelman 2013 to be a short and useful read. From there:
“The casual view of the P value as posterior probability of the truth of the null hypothesis is false and not even close to valid under any reasonable model, yet this misunderstanding persists even in high-stakes settings.”
Greenland 2017 has more about nullism and other cognitive distortions behind NHST.
What is a p-value?
“No-one understands what a p-value is, not even research professors or people teaching statistics.” (Haller & Kraus 2002)
Tongue-in-cheek but meaningful (by Nicholas Maxwell):
“p‑value is the degree to which the data are embarrassed by the null hypothesis.”
Ok, let's try a formal definition: p‑value is “the probability of obtaining a test statistic equal to or more extreme than what was actually observed, conditional on the null hypothesis (including all model assumptions) being true, and comparable only to samples of the same magnitude”.
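As a toy instance of that definition, here is a minimal sketch of an exact two-sided p-value for coin flips. The null hypothesis (a fair coin, independent flips) and the observed result (8 heads in 10 flips) are assumptions chosen for illustration, as is the function name.

```python
from math import comb


def binomial_two_sided_p(n_flips, n_heads):
    """Exact two-sided p-value under a fair-coin null: the total probability
    of every outcome at least as far from n_flips / 2 as the observed one."""
    observed_distance = abs(n_heads - n_flips / 2)
    favourable = 0
    for k in range(n_flips + 1):
        if abs(k - n_flips / 2) >= observed_distance:
            favourable += comb(n_flips, k)
    return favourable / 2**n_flips


p = binomial_two_sided_p(10, 8)
print(f"p = {p}")  # 112 / 1024
```

The number says how often a fair coin would give 0, 1, 2, 8, 9 or 10 heads in 10 flips. It says nothing directly about whether the coin is fair.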
Since everyone misunderstands what this entails, we pass to the fertile terrain of...
What a p-value isn't
Misconceptions of p‑values are widespread. Wikipedia has a dedicated page just for it. Goodman 2008 showed 12 of them. Greenland et al. 2016 pointed out 25 misinterpretations — of p‑values, confidence intervals, and power.
In particular, p-values CANNOT tell you:
- the importance of your result,
- the strength of the evidence,
- the size of an effect,
- the probability that rejecting the null is a wrong decision,
- the probability that the data were produced by random chance alone,
- the probability that the result can be replicated, or
- the probability that your hypothesis is true.
You cannot claim any of the above based on a p‑value.
p-values: by nature unsuitable for “significance”
Worth repeating: NHST is not the same as p‑values. They usually come together but you can have NHST without p‑values and p‑values without NHST. There are valid uses for p‑values; their mindless degradation into “significant or not” isn't one of them.
However, let's suppose you insist that the goal of finding “significance” is valid; that from a single study you can give a verdict of whether some effect “is real or not”. Nonsense, but suppose.
p‑values, by their very nature, would be unsuitable for that goal. In particular, they:
- answer a question that you're probably not asking;
- are sensitive to model violations;
- are sensitive to sample size — increase it and you get “significance”;
- are volatile — significant today, nonsignificant tomorrow.
These points should help you see the meaninglessness of the label “significant” obtained from a calculated p‑value.
p-values answer a question you're probably not asking
“It does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!”
p‑values answer the chance of observations given a hypothesis; whereas what you usually want is the chance of a hypothesis given observations.
Let's pick two groups: all the bread-eaters in the world, and all the psychologists in the world. Clearly, there are bread-eating psychologists. However:
- The percentage of psychologists who eat bread is one thing.
- The percentage of bread-eaters who are psychologists is a different thing.
Those are not interchangeable. Not at all the same thing. Without additional information, you can't infer one from the other.
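A small sketch with invented head counts shows how far apart the two percentages can be. All numbers below are hypothetical.

```python
# Hypothetical head counts, for illustration only.
bread_eaters = 900_000
psychologists = 1_000
bread_eating_psychologists = 900

# Share of psychologists who eat bread.
p_bread_given_psych = bread_eating_psychologists / psychologists

# Share of bread-eaters who are psychologists.
p_psych_given_bread = bread_eating_psychologists / bread_eaters

print(f"psychologists who eat bread: {p_bread_given_psych:.1%}")
print(f"bread-eaters who are psychologists: {p_psych_given_bread:.1%}")
```

Going from one to the other requires the sizes of both groups, which is exactly the “additional information” mentioned above.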
Which leads us to this example from Gorard 2016:
- “Assuming that a bag of 100 marbles contains 50 red and 50 blue (the null hypothesis) it is easy to calculate the precise probability of drawing out 7 red and 3 blue in a random sample of 10 marbles. This is, in effect, what significance tests do. But this does not tell us whether the bag really does contain 50 red and 50 blue marbles. How could it? In practice, we would not know the number of each colour in the bag. In this situation, drawing 7 red and 3 blue in a sample would not tell us the colours of those left. Yet this is what significance test advocates claim we can do. It is a kind of magic-thinking — an erroneous superstitious belief.”
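The marble calculation can be sketched with the hypergeometric formula. The bag compositions other than 50/50 are hypothetical alternatives added here for comparison, and the function name is made up.

```python
from math import comb


def prob_of_sample(red_in_bag, bag_size=100, sample_size=10, red_drawn=7):
    """Probability of drawing exactly red_drawn red marbles, without
    replacement, from a bag with red_in_bag red marbles out of bag_size."""
    blue_in_bag = bag_size - red_in_bag
    blue_drawn = sample_size - red_drawn
    ways = comb(red_in_bag, red_drawn) * comb(blue_in_bag, blue_drawn)
    return ways / comb(bag_size, sample_size)


for red in (30, 50, 70, 90):
    print(f"{red} red in bag: P(7 red, 3 blue) = {prob_of_sample(red):.3f}")
```

Each line is the chance of the sample given a bag. None of them is the chance of a bag given the sample, and the same sample is quite compatible with several different bags.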
p-values are sensitive to violations of model assumptions
Violations of any model assumptions (perfect randomization, zero systematic error, full reporting, etc.) can disturb your calculated p‑value. As already mentioned, a loose cable is all you need to have a p of virtually zero. Your model assumes good measures (no loose cables), and this violation makes your data much more incompatible with your model, which the p‑value reflects.
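Here is a minimal simulation of the loose cable, assuming normally distributed measurements with known standard deviation and a two-sided z-test. The true effect is set to zero; `instrument_bias` is a hypothetical systematic error added to one group only.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def z_test_p(sample_a, sample_b, sigma=1.0):
    """Two-sided p-value for a difference in means, sigma assumed known."""
    n_a, n_b = len(sample_a), len(sample_b)
    z = (mean(sample_a) - mean(sample_b)) / (sigma * sqrt(1 / n_a + 1 / n_b))
    return 2 * (1 - NormalDist().cdf(abs(z)))


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000
true_effect = 0.0       # the treatment does nothing
instrument_bias = 0.3   # hypothetical "loose cable" in the treated arm

control = [rng.gauss(0.0, 1.0) for _ in range(n)]
treated_clean = [rng.gauss(true_effect, 1.0) for _ in range(n)]
treated_biased = [x + instrument_bias for x in treated_clean]

p_clean = z_test_p(treated_clean, control)
p_biased = z_test_p(treated_biased, control)

print(f"no bias:   p = {p_clean:.4f}")
print(f"with bias: p = {p_biased:.2e}")
```

The tiny p in the biased run is real incompatibility between data and model. The part of the model that failed is the measurement assumption, not the zero effect.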
p-values are sensitive to sample size
If the null hypothesis isn't true (and it never is), then some difference exists between the groups. Improve your precision and it will be detected. Sample size increases, p‑value decreases. So “significance” can be bought, and a small p does not imply a big effect.
Here is the problem:
When people declare “significant!”, they think they detected some sizeable difference or effect. But the above shows that you can move your p up and down by varying your sample size with effect size kept constant!
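A minimal sketch of that, assuming a two-group z-test with known unit variance and an observed difference that equals the same small effect size every time. The effect size of 0.1 standard deviations is hypothetical, and so is the function name.

```python
from math import sqrt
from statistics import NormalDist


def p_for_fixed_effect(effect_size, n_per_group):
    """Two-sided p-value of a two-group z-test when the observed
    standardized difference is exactly effect_size."""
    z = effect_size * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(z))


effect_size = 0.1  # hypothetical, in standard-deviation units
for n_per_group in (50, 200, 800, 3200):
    p = p_for_fixed_effect(effect_size, n_per_group)
    print(f"n = {n_per_group:>4} per group: p = {p:.5f}")
```

The effect never changes. Only the sample size does, and under these assumptions the p crosses 0.05 somewhere between 200 and 800 per group.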
p-values are volatile
p‑values are extremely volatile, even at larger sample sizes. They vary wildly when repeatedly sampling the same populations under the same methodology.
- Watch video of their dance.
- Read abstract and conclusion of this short paper.
- Read this short article.
This is by construction, by the way. It's not a “defect”. This is what they are supposed to do. They reflect random deviation from the model. They don't converge. (See p.5 of Amrhein, Trafimow, & Greenland 2018)
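You can watch the dance yourself with a small simulation. This sketch repeats the same hypothetical study 20 times: same populations, same true effect (0.5 standard deviations), same sample size (32 per group), analysed with a z-test assuming known variance. All of those settings are assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def replicate_p(rng, effect=0.5, n=32, sigma=1.0):
    """Run one simulated two-group study and return its two-sided p-value."""
    control = [rng.gauss(0.0, sigma) for _ in range(n)]
    treated = [rng.gauss(effect, sigma) for _ in range(n)]
    z = (mean(treated) - mean(control)) / (sigma * sqrt(2 / n))
    return 2 * (1 - NormalDist().cdf(abs(z)))


rng = random.Random(1)  # fixed seed so the run is repeatable
p_values = [replicate_p(rng) for _ in range(20)]

print(sorted(round(p, 3) for p in p_values))
print("below 0.05:", sum(p < 0.05 for p in p_values), "of", len(p_values))
```

Change the seed and the list changes with it. Nothing about the populations or the method differs between replications, only the luck of the draw.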
You say: “But intervals and Bayes factors also dance! Why single out the p‑value?”
First, I'm not arguing that other metrics are better for testing significance “because they wouldn't dance”. Rather, I'm arguing that the wild dance of p‑values disqualifies the very notion and usefulness of claiming “significance”.
Second, the black-and-white thinking inherent to yes-or-no testing of significance is itself problematic whether you use p, CI, Bayes factors or whatever. We are dichotomaniacal apes. We can and should dispense with this habit. Let the variables be continuous. Embrace the uncertainty.
In increasing order of change:
- Maintain things exactly as they are.
- Reform significance testing by lowering the cutoff of p<0.05.
- Abandon significance testing rather than just reforming.
- Maintain. Those unaware of the issue can't help but continue doing the same. And those motivated to advance their careers through production of junk science may prefer that things don't change. (Maybe you have been in the first group, until now. Hopefully, you are not in the second.)
- Reform NHST. To produce a “true p‑value” of 0.05, you need to aim for at most 0.005, or even 0.001 (see Taleb 2018). At first sight, a lower threshold might then seem reasonable. Benjamin et al. 2017 proposed just that, arguing it would much reduce the rate of published false positives.
But that doesn't take p‑hacking into account. Once it does, the reduction in false positives disappears, and it could make the replication crisis worse (Crane 2017). Some further reasons why such reforms could bring net harm are an increased overconfidence in published results, exaggeration of effect sizes, and discounting of valid findings (McShane et al. 2018).
- Abandon NHST. So, the current practice is harmful, and reforming doesn't help, plus could make it worse. Seems like sufficient reason to propose abandonment. Well, there are more. NHST is a form of “uncertainty laundering” poorly suited for the biomedical and social sciences, with their “small and variable effects and noisy measurements”. Plus, this yes-or-no mindset has “no ontological basis” and promotes bad reasoning. Plus the nil-null is uninteresting to calibrate against. Those issues aren't addressed by tweaking cutoff levels. So we had better get rid of thresholds.
If you read nothing else, please read the clear McShane et al. 2018 (pdf), since attempts to summarize it won't do it justice.
Then Trafimow et al. 2018 (pdf),
those two pages by Greenland,
and Gorard 2016.
A quick discussion
So “Abandon null hypothesis significance testing” is what is being urged here. See what you can do.
Worth repeating that dichotomization (yes/no to “significance”) and obsession with “zero effect, zero difference” nulls are serious problems intrinsic to NHST. They are not intrinsic problems of p‑values, intervals, or Bayes factors. So we can (should) eliminate NHST without throwing those babies out with the bathwater. Even if some babies are treacherous.
Elaboration on valid use cases for p‑values is beyond the scope of this site. If you are educated about them, go for it. For now, let's make sure you don't misuse them.
Objections to the abandonment of NHST
“Still, shouldn't we keep using significance testing, because...”
No. Read for example Schmidt et al. 1997, who for three years collected objections to the abandonment of significance testing — all rejected.
False objections you will see addressed there:
- Without significance testing we would not know whether a finding is real or due to chance.
- Hypothesis testing would not be possible without significance tests.
- The problem is not significance tests, but failure to develop a tradition of replicating studies.
- When studies have a large number of relationships, we need significance tests to identify those that are real.
- Confidence intervals are themselves significance tests.
- Significance testing ensures objectivity in the interpretation of research data.
- It is the misuse, not the use, of significance testing that is the problem.
- It is futile to try to reform data analysis methods, so why try?
See the FAQ for additional objections and questions.
What you can do
“So what if there were no null ritual or NHST? Nothing would be lost, except confusion, anxiety, and a platform for lazy theoretical thinking. Much could be gained, such as knowledge about different statistical tools, training in statistical thinking, and a motivation to deduce precise predictions from one’s hypotheses.
Should we ban the null ritual? Certainly — it is a matter of intellectual integrity. Every researcher should have the courage not to surrender to the ritual, and every editor, textbook writer, and adviser should feel obliged to promote statistical thinking and reject mindless rituals.” (Gigerenzer et al. 2004)
So you have read those pages, seen this letter, are aware of potential personal costs. Conditional on your having understood and agreed — what can you do?
There is no simple universal method for scientific inference. Likewise, there's no simple universal recommendation of what to do in situations where your personal convictions conflict with those of people around you.
The specifics of what you will do (if anything) depend on your personal situation, values, goals, temperament, people and institutions involved, and many other variables that only you, personally, can evaluate.
With all that in mind, below are suggestions. Starting upstream:
- If you donate to institutions that fund research: Say you disapprove of significance testing. If possible, make your donation conditional on them only funding research that uses proper methodology.
- If you fund research: Make clear that significance testing is not an acceptable tool in research you fund. Make your grants conditional upon researchers using proper methodology.
- If you are a professor: Educate students about the huge contrast between significance testing's popularity and its usefulness.
Explain that there's no one-size-fits-all magical solution (no, the brave Bayesian gods are not that powerful either). Emphasize quality of measurement, carefulness of design, thoughtfulness of analyses, relevance of the topic.
- If you are an advisor to researchers: Don't bully them into using significance testing. Encourage them not to. Enforce integrity of research practices.
- If you are a researcher: Don't use significance testing in your research. Don't passively allow your co-authors to use it (speak up, show them this site). If you so decide, consider adding text to justify its absence. Don't say “significant” when you mean “relevant” or “large” (it's misleading). Analyze and report all your data and results, regardless of the outcome — you found whatever you found, and that's ok.
(See also: 4.2 and App. B here and pp. 8,9 here)
- If you are an editor or reviewer: Disregard significance testing as a metric for acceptance of an article, and say so clearly. Evaluate articles exclusively on merits such as quality of measurement, carefulness of design, thoughtfulness of analyses, relevance of the topic. (See also: 4.3 here)
- If you are a science writer, journalist or a reader of scientific literature: Ignore significance tests when you see them in publications. Ignore conclusions that are claimed to follow from them.
(See also: pp. 9,10 here)
- See Gigerenzer's papers, in particular:
Gigerenzer & Marewski 2015 for the delusion of worshipping single tools such as p‑values or Bayes factors and using them as “universal hammers”;
Gigerenzer 2004 about mindless statistics in social science (starring Dr. Publish-Perish);
Gigerenzer et al. 2004 talks about the mass confusion behind the null ritual.
Simmons et al. 2011 show how you can make anything turn “significant” in psychology;
Westover et al. 2011 for a didactic paper for a medical audience, with examples that doctors will find familiar;
Amrhein & Greenland 2018 (or here) also suggesting abandonment of NHST;
Amrhein et al. 2017 for an in-depth analysis of p‑values and unreplicability (“dichotomous threshold thinking must give way to non-automated informed judgment”);
Greenland 2017 for the cognitive biases of dichotomania, nullism, and statistical reification behind NHST;
Rothman 2014 tells us about six persistent research misconceptions (guess which one is number 6);
Orlitzky 2011 for suggestions about institutional changes;
Gelman & Stern 2012 for an ironically appropriate thrice-significant title;
Gorard 2015 for a suggestion of what to do instead of NHST in the social sciences.
- Neither Gigerenzer nor Seife minces words on Edge (2014) (Seife's criticism is valid even though he commits common misconception #2 from Greenland et al. 2016);
likewise, Colquhoun condemns NHST on Aeon (2016);
Colquhoun also gives didactic examples of the perils of (misinterpreting) p‑values, on Chalkdust (2015);
Amrhein argues inferential statistics is not inferential, on Medium (2018);
Cumming shows p‑values dancing all over the place in this simulated replication video (your study could be any of those lines; “significant”, you said?);
and, of course, this cartoon and this one.
- Also quotes and open letters.
Follow the links, understand them, read about potential personal risks, make up your mind.