Null Hypothesis Significance Testing (NHST) is common in research, usually through so‑called “p‑values”.
The practice is nonsense and harmful. It should be abandoned. At best, it hinders science and wastes money. At worst, it hurts people.
What is significance testing?
A statistical procedure that promises a yes-or-no verdict on whether a discovery has been made.
Many statistical tests exist, such as ANOVA, χ², and t-tests. From the result of such a test, a certain probability is calculated. This probability is called a p‑value.
The significance test is this: if this p‑value is found to be...
- below the agreed-upon threshold, the result is said to “have reached significance”, a discovery is declared, and researchers feel joy.
- above the agreed-upon threshold, it “has not reached significance”, no discovery is declared, and researchers feel doomed.
The usual threshold is 0.05, though it varies by field.
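As a concrete sketch of the ritual (a hypothetical coin-flip example, not anyone's prescribed method): compute a p-value for an observed result under a null hypothesis, then compare it to the threshold and declare a verdict.

```python
from math import comb

def binom_p_two_sided(k, n, p0=0.5):
    """Exact two-sided binomial p-value: the probability, assuming the
    null (success probability p0), of any outcome no more likely than
    the one observed."""
    pmf = lambda i: comb(n, i) * p0**i * (1 - p0)**(n - i)
    cutoff = pmf(k) * (1 + 1e-9)   # small tolerance for float comparison
    return min(1.0, sum(pmf(i) for i in range(n + 1) if pmf(i) <= cutoff))

# Invented data: 41 "successes" out of 50 trials, null of a 50/50 process.
p = binom_p_two_sided(41, 50)
verdict = "significant" if p < 0.05 else "not significant"   # the ritual verdict
```

Everything interesting about the study (design, measurement, plausibility) is absent from that last line; the verdict depends only on which side of 0.05 the number lands.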
Three kinds of significance levels
Did you know that there are three different interpretations for “significance level”?
- Setting it to 5% (or any other fixed level) — an early proposal by Fisher.
- Interpreting it as α — from the Neyman-Pearson theory.
- Reporting exact p's, with no thresholds — a later proposal by Fisher.
Significance in the...
- first is a property of the test, set by mere convention;
- second is a property of the test, set by cost-benefit analysis;
- third is a property of the data.
So here's one thing Null Hypothesis Significance Testing is: a confused conflation of the three theories; “a mishmash that doesn't exist in statistics proper”.
For more about that, I invite you to see Question 4 (page 9) of Gigerenzer et al. 2004. Then ask yourself: “am I committing the emotional and intellectual confusion of Dr. Publish‑Perish?”
What significance testing isn't
For one, significance testing isn't the p‑value.
- You can calculate a p‑value without giving a yes/no verdict of “statistical significance”.
- And you can “test significance” using things other than p‑values.
It's true that, in practice:
- If you observe a p‑value being used, it's highly likely that a “significance test” follows it, explicitly or implied.
- If you observe a “significance test” being used, it's highly likely that it was through a p‑value.
However, they are separate things. A bunch of stuff can be used to test significance. It can be done with intervals. Or Bayes factors.
Or you can eyeball your results and declare “Hmmm... Given the null and assumptions, the chance of this observation is very low. Way off the expected. I declare that Yes, significant!”
And this is an issue.
Not the use of intuitions, because some data may be so clear that you don't need complicated statistics to declare what you see. No, it's something else, deeper, and it applies regardless of what you are using for declaring “significance”.
People expect certainty from single studies.
“Is it true or not? Yes or no? Effect or no effect? Significant or not significant? You just finished a study on this new-generation Freshfutazol™, so will it cure my grandmother's athlete's foot or not? Tell me!”
This anxiety consumes people. Humans, like cows, ruminate.
So Mrs. Significance comes along and says:
“Behold, mortal! The numbers have been shaken and transmuted, and I have thy answer! Do I have thy attention now? Good. So... (dramatic drumming) Significant! Congratulations! May the journals open their doors to thy achievement.”
And yet, how dangerous. Because reality doesn't care if someone declares an issue closed. Reality doesn't care if the neurons of a couple of apes (that would be you and me) rearranged and those apes now believe Freshfutazol™ is the best thing invented since sliced bread.
Freshfutazol™ will do its thing to cure your grandmother's ailments — or it won't. Freshfutazol™ is indifferent to what you think of it. Freshfutazol™ doesn't care the least that Mrs. Significance showed up and pompously gave a verdict.
We want to be more certain and less anxious. And this is one reason why Mrs. Significance is so appealing. But her soothing powers come at the cost of delusion. And this delusive overconfidence can be, and often is, harmful.
Why is significance testing a problem?
It doesn't deliver on its promise of telling discoveries and non-discoveries apart.
- Not (only) because the threshold “is bad” (if it were only that, we could tweak it and be done);
- not (only) because of the p‑values (or Bayes factors, or intervals) that feed it;
- not (only) because a single study is often insufficient to drive conclusions;
- but also because of an unwarranted expectation that it can give final, yes-or-no answers about matters of existence.
Ok, more concretely: why is NHST a problem?
- hinders the advancement of cumulative science, causes predictable nonreplication of studies, and is a major contributor to publication bias;
- distracts from the much more important “neglected factors” of “prior and related evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding” (McShane & Gelman 2017);
- harms people through the disinformation it promotes and the information it suppresses;
- it enables publication of junk research (masqueraded as “significant”);
- it discourages people from pursuing promising research (deemed “insignificant” upfront).
This happens because
- it allows and invites p-hacking (maliciously or unconsciously), so in the end anything can be made “significant”;
- its mechanical procedures are easy, and its output is psychologically reassuring;
- the gatekeeping role of “statistical significance” for having one's research accepted is a terrible proposition maintained for historical and psychological reasons rather than merit;
- its output is a wildly volatile answer to a question you're not really asking.
And since the difference between significant and nonsignificant is not significant... Uh‑oh.
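A bit of arithmetic makes the p-hacking point above concrete: run enough tests on pure noise and a “significant” result becomes nearly guaranteed.

```python
# Chance of at least one p < 0.05 among k independent tests
# when every null hypothesis is actually true:
alpha = 0.05
chance = {k: 1 - (1 - alpha) ** k for k in (1, 5, 20, 60)}
# chance[1] = 0.05, chance[20] ≈ 0.64, chance[60] ≈ 0.95.
# Flexible analyses (subgroups, outcomes, covariates) multiply k quietly,
# often without the researcher ever noticing that they did.
```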
What is a p-value?
“No-one understands what a p-value is, not even research professors or people teaching statistics.” (Haller & Krauss 2002)
Ok, ok. Let's try. A p‑value is “the probability of obtaining a result equal to or more extreme than what was actually observed, conditional on the null hypothesis plus all other model assumptions being true, and comparable only to samples of the same magnitude”.
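To make the definition concrete, here is a minimal brute-force simulation (an invented coin-flip example): estimate the p-value as the fraction of null-generated datasets at least as extreme as the observed one.

```python
import random
random.seed(1)

# Invented example: we observed 60 heads in 100 flips, and the null
# hypothesis says the coin is fair. One-sided p-value, by simulation:
n, observed_heads = 100, 60
sims = 50_000
as_extreme = sum(
    sum(random.random() < 0.5 for _ in range(n)) >= observed_heads
    for _ in range(sims)
)
p_value = as_extreme / sims   # estimates P(>= 60 heads | fair coin), ~0.03
# Note what was NOT computed: the probability that the coin is fair.
```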
Ready to misunderstand that in 5 minutes? Great, so we can now move on to the fertile terrain of...
What a p-value isn't
Misconceptions of p‑values are widespread. Wikipedia has a dedicated page just for them. Goodman 2008 listed 12 of them. Greenland et al. 2016 pointed out 25 misinterpretations — of p‑values, confidence intervals, and power.
In particular, p-values CANNOT tell you:
- the importance of your result,
- the strength of the evidence,
- the size of an effect,
- the probability that rejecting the null is a wrong decision,
- the probability that the data were produced by random chance alone,
- the probability that the result can be replicated, or
- the probability that your hypothesis is true.
You cannot claim any of the above based on a p‑value.
Why are p-values a problem?
Because you won't use them correctly. And because even if you did, what they answer is unlikely to be useful or interesting.
Let's start with the “using correctly” part.
Social scientist? No p-values for you.
Why? Because p‑values assume you have randomized samples. And you probably don't.
It's a built-in requirement for calculating a p‑value that your data come from a randomized sample.
So unless you have legitimate reasons not to, you'd better follow this rule.
(What if you didn't? Then on top of p-values' intrinsic volatility and distortions from noise in your measurements, you would also have the usually large distortions from nonrandom sampling. This would demand even stronger assumptions for any claims you wish to make about the general population. Would you be able to provide them? Statistical pyrotechnics may be insufficient to rescue hopelessly low-quality data. Worse, the added layer of apparent sophistication may breed overconfidence.)
What do I mean by random? I mean you'd better get very close to no design bias, no dropouts, no non-responses, no measurement error, and no loss of blinding.
So if your samples are not randomized over the population about which you want to draw inferences, you will have a distorted input for the calculation of a p‑value, which in turn will be distorted input for the calculation of the significance test, whose output will be distorted.
And this is following p‑values and significance testing's own terms. This is before any of the abundant additional criticism given by this site is taken into consideration.
From Gorard 2017, Sociological Research Online:
- “In the last five years I have conducted a large number of systematic and scoping reviews covering perhaps 60,000 research reports. In real-life research, I have never seen a significance test used in a correct context (meeting the basic assumptions) and with the results described correctly.”
By the way, Gigerenzer 2004 is mandatory reading for mindful social scientists.
- “If psychologists are so smart, why are they so confused? Why is statistics carried out like compulsive hand washing?”
If this still doesn't make sense, see this question here.
Ok, now let's suppose you solved randomization and measurement. Next:
You'll misunderstand it
See what a p-value isn't. Are you sure you are not claiming any of those things that you cannot conclude?
You'll be misled by its unreliability
Because p‑values are, by construction, extremely volatile, regardless of sample size. Their value will vary wildly when repeatedly sampling the same populations under the same methodology. This should convince you:
- Watch a video of their dance.
- Read the abstract and conclusion of this short paper.
- Read this short article.
Mind you, the issue exists even if you're not evaluating “significance”. But if you are, then significant today, nonsignificant tomorrow.
Worth repeating: the difference between significant and nonsignificant is not itself significant.
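The volatility can be seen in a few lines of simulation (a sketch using a rough z-test and made-up numbers): twenty identical replications of the same true effect produce wildly different p-values.

```python
import math, random
random.seed(7)

def one_sided_p(a, b):
    """Rough two-sample z-test p-value for mean(a) > mean(b).
    (A sketch: treats sample variances as known; fine for illustration.)"""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    z = (ma - mb) / math.sqrt(va / len(a) + vb / len(b))
    return 0.5 * math.erfc(z / math.sqrt(2))   # P(Z >= z)

ps = []
for _ in range(20):   # 20 replications: same population, same n, same method
    treated = [random.gauss(0.5, 1) for _ in range(30)]   # true effect: 0.5 SD
    control = [random.gauss(0.0, 1) for _ in range(30)]
    ps.append(one_sided_p(treated, control))
# The 20 p-values scatter widely: some land below 0.05, others far above,
# from experiments that are identical in every respect except sampling noise.
```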
p‑values are, more often than not, calibrated against a hypothesis that assumes zero effect and zero systematic error. Also more often than not, this is uninteresting and unlikely.
Physicists normally do this thing of “let's assume this cow is a perfectly frictionless sphere” — unlikely scenario, but at least in their case it's interesting, and useful for coming up with workable simplified models. More about cows later.
Now you say: “I'm going to rebel against this default! I will use non-nil nulls!” Fine, but you will still have to deal with the detail that...
It answers the wrong question
p‑values answer the chance of observations given a hypothesis; whereas what we usually want is the chance of a hypothesis given observations.
Those are not interchangeable. Not at all the same thing. Without additional information, you can't jump from one to the other.
Let's pick two groups: all the bread-eaters in the world, and all the psychologists in the world. Clearly, there are psychologists who eat bread. However:
- The percentage of psychologists who eat bread is one thing.
- The percentage of bread-eaters who are psychologists is a different thing.
And you can't infer one just from the other.
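With invented round numbers, the asymmetry is stark:

```python
# All numbers invented purely for illustration:
bread_eaters = 5_000_000_000       # people who eat bread
psychologists = 1_500_000          # psychologists worldwide
psych_bread_eaters = 1_400_000     # psychologists who eat bread

p_bread_given_psych = psych_bread_eaters / psychologists   # ≈ 0.93
p_psych_given_bread = psych_bread_eaters / bread_eaters    # ≈ 0.00028
# Same overlap, two wildly different conditional probabilities.
# Knowing one tells you almost nothing about the other.
```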
Significance testing in action:
- “From past studies, we know most psychologists eat bread. Students were randomly sampled from the Mechatronic Department of the Yangon Technological University. My null assumes that this population consists exclusively of psychologists. If they are, we expect bread-eating to be prevalent. We observed the sample, and most were found to eat bread. That observation is expected, fitting the null almost perfectly. So it's not statistically significant, and we must accept the null hypothesis that they are all psychologists.”
Do you see anything weird in the logic above? Good. (What is it?)
If you prefer numbers, here is an example from Gorard 2016:
- “Assuming that a bag of 100 marbles contains 50 red and 50 blue (the null hypothesis) it is easy to calculate the precise probability of drawing out 7 red and 3 blue in a random sample of 10 marbles. This is, in effect, what significance tests do. But this does not tell us whether the bag really does contain 50 red and 50 blue marbles. How could it? In practice, we would not know the number of each colour in the bag. In this situation, drawing 7 red and 3 blue in a sample would not tell us the colours of those left. Yet this is what significance test advocates claim we can do. It is a kind of magic-thinking — an erroneous superstitious belief.”
“It does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!”
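Gorard's marble probability is indeed easy to compute in the forward direction (straightforward hypergeometric counting):

```python
from math import comb

# Probability of drawing 7 red and 3 blue in a sample of 10,
# given a bag of 50 red and 50 blue (the null hypothesis):
p_draw = comb(50, 7) * comb(50, 3) / comb(100, 10)   # ≈ 0.113
# Trivial to calculate in this direction; it still says
# nothing about the contents of an unknown bag.
```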
Too many problems for too few solutions
So, p‑values are almost certain to be a source of confusion. And unlikely to be a source of answers you want to questions you care about.
(Well, I don't really know what you want, and probably neither do you. I'd still guess: what you want to know is probably not what p‑values answer.)
In increasing order of change:
- Maintain things exactly as they are.
- Reform significance testing by lowering the cutoff of p<0.05.
- Abandon significance testing rather than just reforming.
- Ban p-values as well, any use of them.
- Maintain. Those unaware of the issue can't help but continue doing the same. And those motivated to advance their careers through production of junk science may prefer that things don't change. (Maybe you have been in the first group, until now. Hopefully, you are not in the second.)
- Reform NHST. To produce a “true p‑value” of 0.05, you need to aim for at most 0.005, or even 0.001 (see Taleb 2018). At first sight, a lower threshold might then seem reasonable. Benjamin et al. 2017 proposed just that, arguing it would much reduce the rate of published false positives.
But that doesn't take p‑hacking into account. Once p‑hacking is accounted for, the reduction in false positives disappears, and the lower threshold could make the replication crisis worse (Crane 2017). Further reasons why such reforms could bring net harm include increased overconfidence in published results, exaggeration of effect sizes, and discounting of valid findings (McShane & Gelman 2017).
- Abandon NHST. So, the current practice is harmful, and reforming doesn't help, plus could make things worse. That seems like sufficient reason to propose abandonment. Well, there is more. NHST is a form of “uncertainty laundering” poorly suited to the biomedical and social sciences, with their “small and variable effects and noisy measurements”. Plus, this yes-or-no mindset has “no ontological basis” and promotes bad reasoning. Plus, the nil-null is uninteresting to calibrate against. Those issues aren't addressed by tweaking cutoff levels. So we'd better get rid of thresholds.
Please read the short and clear McShane & Gelman 2017, since attempts to summarize it won't do it justice. Then quickly Trafimow et al. 2017, and then the incisive Gorard 2016. For an amusing account of events, you may like the alpha wars.
So what happens to p‑values once NHST is over and they lose their gatekeeping role?
One could still use them as just another tool, without launching into yes-or-no claims of significance. And their use would decline on its own, circumscribed to their (very) limited applications.
- Ban p-values. Many, however, see in p‑values so few redeeming qualities and so much potential for misuse that, if not an outright ban, they more emphatically suggest the tool be “relegated to the scrap heap” (Lindley 1999).
So “Abandon NHST” is what is being urged here. See what you can do.
No ban on p‑values, though. Despite all their confusion and limitations, they have their use in some fields, in some cases. They should, however, be much de-emphasized.
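The threshold debate above rests on simple false-positive arithmetic, sketched here with illustrative numbers (the prior, alpha, and power are all assumptions, and power is optimistically held fixed):

```python
def false_positive_fraction(prior, alpha, power):
    """Of all results declared 'significant', what fraction are false positives?"""
    false_pos = alpha * (1 - prior)   # true nulls that cross the threshold
    true_pos = power * prior          # real effects that cross it
    return false_pos / (false_pos + true_pos)

# If 10% of tested hypotheses are real and power is 80%:
at_05 = false_positive_fraction(0.10, 0.05, 0.80)    # ≈ 0.36
at_005 = false_positive_fraction(0.10, 0.005, 0.80)  # ≈ 0.05
# Lowering the threshold looks great on paper, but the arithmetic
# assumes honest, single tests: exactly what p-hacking breaks.
```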
Objections to the abandonment of NHST
“Still, shouldn't we keep using significance testing, because...”
Read, for example, Schmidt et al. 1997, who spent three years collecting objections to the abandonment of significance testing, and rejected every one of them.
False objections you will see addressed there:
- Without significance testing we would not know whether a finding is real or due to chance.
- Hypothesis testing would not be possible without significance tests.
- The problem is not significance tests, but failure to develop a tradition of replicating studies.
- When studies have a large number of relationships, we need significance tests to identify those that are real.
- Confidence intervals are themselves significance tests.
- Significance testing ensures objectivity in the interpretation of research data.
- It is the misuse, not the use, of significance testing that is the problem.
- It is futile to try to reform data analysis methods, so why try?
See the FAQ for additional objections and questions.
What you can do right now
“So what if there were no null ritual or NHST? Nothing would be lost, except confusion, anxiety, and a platform for lazy theoretical thinking. Much could be gained, such as knowledge about different statistical tools, training in statistical thinking, and a motivation to deduce precise predictions from one’s hypotheses.
Should we ban the null ritual? Certainly — it is a matter of intellectual integrity. Every researcher should have the courage not to surrender to the ritual, and every editor, textbook writer, and adviser should feel obliged to promote statistical thinking and reject mindless rituals.” (Gigerenzer et al. 2004)
So you have read those pages, seen this letter, are aware of potential personal costs, and decided to do something. What can you do?
Here are suggestions, starting upstream:
- If you donate to institutions that fund research: Say you disapprove of significance testing. If possible, make your donation conditional on them funding only research that uses proper methodology.
- If you fund research: Make clear that significance testing is not an acceptable tool in research you fund. Make your grants conditional upon researchers using proper methodology.
- If you are a professor: Educate students about the huge contrast between significance testing's popularity and its usefulness.
Explain that there's no one-size-fits-all magical solution (no, the brave Bayesian gods are not that powerful either). Emphasize quality of measurement, carefulness of design, thoughtfulness of analyses, relevance of the topic.
- If you are an advisor to researchers: Don't bully them into using significance testing. Encourage them not to. Enforce integrity of research practices.
- If you are a researcher: Don't use significance testing in your research. Don't passively allow your co-authors to use it. Refuse to have your name on articles that do. And don't use the word “significant” when you mean “relevant”.
- If you are an editor-in-chief of a journal: Disregard significance testing as a metric for acceptance of an article, and say so clearly. Evaluate articles exclusively on merits such as quality of measurement, carefulness of design, thoughtfulness of analyses, relevance of the topic.
- If you are a reader of scientific literature: Ignore significance testing when you see it in publications. Ignore conclusions that are claimed to follow from it.
- See Gigerenzer's papers, in particular:
Gigerenzer & Marewski 2015 for the delusion of worshipping single tools such as p‑values and using them as “universal hammers”;
Gigerenzer 2004 about mindless statistics in social science (starring Dr. Publish-Perish);
Gigerenzer et al. 2004 for the mass confusion behind the null ritual;
Simmons et al. 2011 show how you can make anything turn “significant” in psychology;
Westover et al. 2011 for a didactic paper for a medical audience, with examples that doctors will find familiar;
Amrhein & Greenland 2017 (or here) also suggesting abandonment of NHST;
Amrhein et al. 2017 for an in-depth analysis of p‑values and unreplicability (“dichotomous threshold thinking must give way to non-automated informed judgment”);
Gelman & Stern 2006 for an ironically appropriate thrice-significant title;
Orlitzky 2011 for suggestions about institutional changes; Gorard 2015 for a suggestion of what to do instead of NHST in the social sciences.
- Neither Gigerenzer nor Seife mince words on Edge (2014) (Seife's criticism is valid even though he commits common misconception #2 from Greenland et al. 2016);
likewise, Colquhoun condemns NHST on Aeon (2016);
Colquhoun also gives didactic examples of the perils of p‑values, on Chalkdust (2015);
Amrhein argues inferential statistics is not inferential, on Medium (2018);
Cumming shows p‑values dancing all over the place in this simulated replication video (your study could be any of those lines; “significant”, you said?);
and both this cartoon and this one illustrate the problem.
- Also quotes and open letters.
Follow the links, understand them, read about potential personal risks, make up your mind.