Null Hypothesis Significance Testing (NHST) is common in research, notably in the biomedical and social sciences. The practice is nonsense and harmful.
At best, it hinders science and wastes money. At worst, it hurts people.
It should be abandoned.
NHST is a statistical procedure that promises to give a yes-or-no verdict on whether a discovery has been made.
Many statistical tests exist, such as ANOVA, χ², and t-tests. From the results of such a test, a certain probability is calculated. This probability is called a p‑value.
The significance test is this: if this p‑value is found to be...
The usual threshold in practice is 0.05, though it varies by field.
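Here is a minimal sketch of the ritual itself (the function name, the observed statistic, and the threshold below are illustrative, not taken from any particular study): compute a p‑value from a test statistic, compare it to a threshold, announce a verdict.

```python
import math

def two_sided_p_from_z(z):
    # Two-sided p-value for a standard-normal test statistic:
    # under the null, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# The ritual: compute p, compare it to a threshold, declare a verdict.
alpha = 0.05                        # the conventional threshold
z_observed = 2.1                    # a made-up test statistic
p = two_sided_p_from_z(z_observed)
verdict = "significant" if p < alpha else "not significant"
print(f"p = {p:.4f} -> {verdict}")  # p = 0.0357 -> significant
```

Notice that the entire "discovery" verdict hinges on which side of an arbitrary line a single number falls.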
There are three different interpretations for “significance level”:
Significance in the...
So here's one thing Null Hypothesis Significance Testing is: a mix of the three theories.
I invite you to see Question 4 (page 9) of Gigerenzer et al. 2004. Then ask yourself: “Am I committing the confusion of Dr. Publish‑Perish?”
So, what is a “null hypothesis” anyway? Depends on who you ask. It can mean:
Pick your terminology. People who adopt the...
This means that “null” is ambiguous. As if we didn't already have enough semantic confusion with the word “significance”.
For one, significance testing isn't the p‑value.
It's true that, in practice:
However, they are separate things. You can test significance with intervals. Or Bayes factors. Or you can eyeball your results and declare “Given the null and assumptions, the chance of this observation is very low. I declare that Yes, significant!”
And this is an issue.
No, not the use of intuitions. Something deeper, that applies regardless of what you are using for declaring “significance”.
People expect certainty from single studies.
“Is it true or not? Effect or no effect? Significant or not significant? Will Freshfutazol™ cure my grandmother's athlete's foot or not? Tell me!”
This anxiety consumes people. Humans, like cows, ruminate.
So Mrs. Significance comes along and says:
“Behold, mortal! The numbers have been transmuted, and I have thy answer! Do I have thy attention now? Good. So... (dramatic drumming) Significant! May the journals open their doors to thy achievement.”
And yet, how dangerous. Because reality doesn't care if someone declares an issue closed. Reality doesn't care if the neurons of a couple of apes (that would be you and I) rearranged and those apes now believe Freshfutazol™ is the best thing invented since sliced bread.
Freshfutazol™ will do its thing to cure your grandmother's ailments — or it won't. Freshfutazol™ is indifferent to what you think of it. Freshfutazol™ doesn't care in the least that Mrs. Significance showed up and pompously gave a verdict.
Statistical models and their outputs are not reality itself.
We want to be more certain and less anxious. And this is one reason why Mrs. Significance is so appealing. But her soothing powers come at the cost of delusion. And this delusive overconfidence can be, and often is, harmful.
It doesn't deliver on its promise of telling discoveries and non-discoveries apart.
Ok, more concretely:
Null hypothesis significance testing...
This happens because
And since the difference between significant and nonsignificant is not significant... Uh‑oh.
Usually only one hypothesis is tested, and it assumes zero effect and zero systematic error. It implies a belief that “starting from zero” is always rational and impartial and unbiased and objective. It isn't. Why?
The usual logic is that if p is low, the alternative hypothesis is proved. Wrong.
Other hypotheses could fit the data better. Noise in sampling and measurement may explain the low p better than your preferred alternative does. This is often overlooked.
Your (nil-)null hypothesis isn't only “Freshfutazol™ produces zero effect”: it's also all other model assumptions. It presupposes zero systematic error. A tiny p doesn't imply you discovered faster-than-light neutrinos. Maybe it was just a loose cable.
Now suppose your measurements are flawless. You say “Freshfutazol™ produces an effect”. But you already knew that; “zero difference” is rarely true. An increase in precision will find some difference.
Rejecting a (nil-)null is like saying “this place is not sterile”.
Anything fits this exclusion — from pigsty to palace.
If this large uncertainty were kept in mind, fine. But the significant-or-not compulsion collapses it into “this place is dirty”. So even tiny, clinically irrelevant differences will be blown out of proportion.
Consider Greenland 2016:
“Testing only the no-effect hypothesis simply assumes, without grounds, that erroneously defaulting to no effect is the least costly error, and in this sense is a methodologic bias toward the null.”
Suppose “this substance has no side effects” is the only hypothesis tested, and “p > 0.05”. It's approved. Harms from any false negatives would fall on users.
Science does not demand that you assume “no effect” as a starting point. In fact, often you should not.
So, there's this nice cartoon everybody likes to show. The point it makes is that the more data someone analyzes, the higher the chance of some “significant” result showing up. Then the person sweeps under the rug all the nonsignificant results and selectively reports the significant ones. This is not cool, and you shouldn't do it: rather, display all the analyses. So the criticism is valid.
Scrupulous researchers then go one step further: “Hey, I'm not doing multiple comparisons to cheat. Here, I'll prove it: I'll compensate by tweaking the threshold to make things harder.”
One popular way to do that is the Bonferroni correction. It's like this: if you made 20 comparisons, you compensate by dividing your 0.05 threshold by 20. So now some comparison is only “significant” if its p is below 0.0025. This addresses the false positive problem.
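A toy simulation of that arithmetic (all names and numbers here are illustrative): run 20 comparisons in a world where every null really is true, and count how many cross each threshold.

```python
import math
import random

random.seed(0)

def two_sided_p_from_z(z):
    # Two-sided p-value for a standard-normal statistic under the null.
    return math.erfc(abs(z) / math.sqrt(2))

m = 20                            # number of comparisons
alpha = 0.05
bonferroni_alpha = alpha / m      # 0.05 / 20 = 0.0025

# Every null is true here: each test statistic is pure standard-normal noise.
p_values = [two_sided_p_from_z(random.gauss(0, 1)) for _ in range(m)]

uncorrected = sum(p < alpha for p in p_values)
corrected = sum(p < bonferroni_alpha for p in p_values)
print(uncorrected, corrected)
```

The correction indeed suppresses false positives — but, as the next paragraph argues, it does so by making every individual test much harder to pass, true effects included.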
But... it may greatly increase the rate of false negatives. Causal relationships are there, and you throw them away, scared by the possibility that they are “only chance”.
The problem is that you assume no relationship whatsoever between any of the things you analyze. If you always do this, you're implying that you live in a universe where nothing is expected to have any effect on anything else. This is just wrong in principle and in general (with exceptions). It throws away promising avenues of investigation by creating a penalty for gathering information.
(If this title is unintelligible to you, just skip to the next section)
Uninformative priors are a Bayesian version of nullism.
You may find Gelman 2013 to be a short and useful read. From there:
“The casual view of the P value as posterior probability of the truth of the null hypothesis is false and not even close to valid under any reasonable model, yet this misunderstanding persists even in high-stakes settings.”
Greenland 2017 has more about nullism and other cognitive distortions behind NHST.
“No-one understands what a p-value is, not even research professors or people teaching statistics.” (Haller & Krauss 2002)
Tongue-in-cheek but meaningful (by Nicholas Maxwell):
“p‑value is the degree to which the data are embarrassed by the null hypothesis.”
Ok, ok. Let's try a common formal definition:
“p‑value is the probability of obtaining a test statistic equal to or more extreme than what was actually observed, conditional on the null hypothesis (including all model assumptions) being true.”
Another formal definition — an unconditional one (by Sander Greenland):
“p is the observed value of the random variable P, which in turn serves as a unit‑scaled index of compatibility between the data and the proposed decoding (summarizing) model M from which P is derived.”
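The “equal to or more extreme” clause of the first definition can be made concrete with a Monte Carlo sketch (all numbers below are made up): simulate the test statistic many times under the null model, then see what fraction of those simulations is at least as extreme as the observed value.

```python
import random

random.seed(1)

def simulated_p(observed_stat, null_draws):
    # Fraction of null-simulated statistics at least as extreme as the
    # observed one: "equal to or more extreme than what was actually
    # observed, conditional on the null ... being true", by simulation.
    hits = sum(abs(d) >= abs(observed_stat) for d in null_draws)
    return hits / len(null_draws)

def null_mean_diff(n=30):
    # Null model: both groups drawn from the same distribution, so any
    # difference in their means is pure sampling noise.
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    return sum(a) / n - sum(b) / n

null_distribution = [null_mean_diff() for _ in range(5000)]
observed = 0.6                    # a hypothetical observed mean difference
print(simulated_p(observed, null_distribution))
```

Note that everything here is conditional on the null model being a faithful description of the data-generating process, assumptions included.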
Great. With that out of the way, we now pass to the fertile terrain of...
Misconceptions of p‑values are widespread. Wikipedia has a dedicated page just for them. Goodman 2008 listed 12 of them. Greenland et al. 2016 pointed out 25 misinterpretations — of p‑values, confidence intervals, and power.
In particular, p-values CANNOT tell you:
You cannot claim any of the above based on a p‑value.
Worth repeating: NHST is not the same as p‑values. They usually come together but you can have NHST without p‑values and p‑values without NHST. There are valid uses for p‑values; their mindless degradation into “significant or not” isn't one of them.
However, let's suppose you insist that the goal of finding “significance” is valid; that from a single study you can give a verdict of whether some effect “is real or not”. Nonsense, but suppose.
p‑values, by their very nature, would be unsuitable for that goal. In particular, they:
These points should help you see the meaninglessness of the label “significant” obtained from a calculated p‑value.
p‑values answer the chance of observations given a hypothesis; whereas what you usually want is the chance of a hypothesis given observations.
Let's pick two groups: all the bread-eaters in the world, and all the psychologists in the world. Clearly, there are bread-eating psychologists. However:
Those are not interchangeable. Not at all the same thing. Without additional information, you can't infer one from the other.
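A toy numerical version of the same asymmetry (all numbers made up purely for illustration):

```python
# Made-up counts, purely for illustration.
bread_eaters = 5_000_000_000                 # people who eat bread
psychologists = 1_000_000                    # psychologists
bread_eating_psychologists = 900_000         # psychologists who eat bread

# P(eats bread | psychologist): high.
p_bread_given_psych = bread_eating_psychologists / psychologists

# P(psychologist | eats bread): tiny.
p_psych_given_bread = bread_eating_psychologists / bread_eaters

print(p_bread_given_psych)    # 0.9
print(p_psych_given_bread)    # 0.00018
```

Same intersection, wildly different conditionals. The direction of the conditioning matters, just as P(data | hypothesis) and P(hypothesis | data) differ.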
Which leads us to this example from Gorard 2016:
Violations of any model assumptions (perfect randomization, zero systematic error, full reporting, etc.) can disturb your calculated p‑value. As already mentioned, a loose cable is all you need to have a p of virtually zero. Your model assumes good measures (no loose cables), and this violation makes your data much more incompatible with your model, which the p‑value reflects.
If the null hypothesis isn't true (and it never is), then some difference exists between the groups. Improve your precision and it will be detected. Sample size increases, p‑value decreases. So “significance” can be bought, and a small p does not imply a big effect.
Here is the problem:
When people declare “significant!”, they think they detected some sizeable difference or effect. But the above shows that you can move your p up and down by varying your sample size with effect size kept constant!
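A sketch of buying significance with sample size, using a one-sample z-test (the effect size and sample sizes are made up): hold a tiny standardized effect fixed and grow n; p marches toward zero.

```python
import math

def two_sided_p(effect, n):
    # One-sample z-test with known unit variance: the statistic grows like
    # z = effect * sqrt(n), so p shrinks as n grows, effect held constant.
    z = effect * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

effect = 0.05                     # a tiny, fixed standardized effect
for n in (100, 1_000, 10_000, 100_000):
    print(n, two_sided_p(effect, n))
```

Nothing about the effect changed between the first line of output and the last; only the sample size did.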
p‑values are extremely volatile, even at larger sample sizes. They vary wildly when repeatedly sampling the same populations under the same methodology.
This is by construction, by the way. It's not a “defect”. This is what they are supposed to do. They reflect random deviation from the model. They don't converge. (See p.5 of Amrhein, Trafimow, & Greenland 2018)
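A sketch of that dance (the population parameters and sample size are made up): draw repeated samples from the very same population with the very same method, and watch p jump around.

```python
import math
import random

random.seed(2)

def one_sample_p(sample):
    # z-test of "true mean is 0", treating the sd as known (sd = 1),
    # a simplification to keep the sketch short.
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

# Same population (true mean 0.3, sd 1), same n, same method, 40 times over.
ps = sorted(one_sample_p([random.gauss(0.3, 1) for _ in range(50)])
            for _ in range(40))
print(f"smallest p: {ps[0]:.6f}   largest p: {ps[-1]:.3f}")
```

The spread typically spans orders of magnitude, so some replications land on each side of any threshold you pick.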
You say: “But intervals and Bayes factors also dance! Why single out the p‑value?”
First, I'm not arguing that other metrics are better for testing significance “because they wouldn't dance”. Rather, I'm arguing that the wild dance of p‑values disqualifies the very notion and usefulness of claiming “significance”.
Second, the black-and-white thinking inherent to yes-or-no testing of significance is itself problematic whether you use p, CI, Bayes factors or whatever. We are dichotomaniacal apes. We can and should dispense with this habit. Let the variables be continuous. Embrace the uncertainty.
In increasing order of change:
So “Abandon null hypothesis significance testing” is what is being urged here. See what you can do.
Worth repeating that dichotomization (yes/no to “significance”) and obsession with “zero effect, zero difference” nulls are serious problems intrinsic to NHST. They are not intrinsic problems of p‑values, intervals, or Bayes factors. So we can (should) eliminate NHST without throwing those babies out with the bathwater. Even if some babies are treacherous.
Elaboration on valid use cases for p‑values is beyond the scope of this site. If you are educated about them, go for it. For now, let's make sure you don't misuse them.
“Still, shouldn't we keep using significance testing, because...”
No. Read, for example, Schmidt et al. 1997, who collected objections to the abandonment of significance testing for three years, all of which they rejected.
False objections you will see addressed there:
See the FAQ for additional objections and questions.
“So what if there were no null ritual or NHST? Nothing would be lost, except confusion, anxiety, and a platform for lazy theoretical thinking. Much could be gained, such as knowledge about different statistical tools, training in statistical thinking, and a motivation to deduce precise predictions from one’s hypotheses.
Should we ban the null ritual? Certainly — it is a matter of intellectual integrity. Every researcher should have the courage not to surrender to the ritual, and every editor, textbook writer, and adviser should feel obliged to promote statistical thinking and reject mindless rituals.” (Gigerenzer et al. 2004)
There is no simple universal method for scientific inference. Likewise, there's no simple universal recommendation of what to do in situations where your personal convictions conflict with those of people around you.
The specifics of what you will do (if anything) depend on your personal situation, values, goals, temperament, people and institutions involved, and many other variables that only you, personally, can evaluate.
With all that in mind, below are suggestions. Starting upstream:
Follow the links, understand them, read about potential personal risks, make up your mind.