Null Hypothesis Significance Testing (NHST) is common in research, notably in the biomedical and social sciences. The practice is nonsense and harmful.

At best, it hinders science and wastes money. At worst, it hurts people.

It should be abandoned.


What is significance testing?

A statistical procedure that promises to give a yes-or-no verdict of whether a discovery has been made.

Many statistical tests exist, such as ANOVA, χ², and t-tests. From the results of such a test, a certain probability is calculated. This probability is called a p‑value.

The significance test is this: if this p‑value is found to be below a chosen threshold, the result is declared “statistically significant”; otherwise, “not significant”.

The usual threshold is 0.05, though it varies by field.
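The ritual can be sketched in a few lines. Below is a toy permutation test; the data, the group names, and the 0.05 cutoff are all illustrative, not a recommendation:

```python
import random

random.seed(1)

# Made-up outcomes for a treated and a control group (illustrative only).
treated = [2.1, 2.5, 2.9, 3.0, 2.7, 2.4]
control = [2.0, 2.2, 1.9, 2.3, 2.1, 2.5]

def permutation_p_value(a, b, n_perm=10_000):
    """Two-sided p-value: how often does shuffling the group labels
    produce a mean difference at least as extreme as the observed one?"""
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            hits += 1
    return hits / n_perm

p = permutation_p_value(treated, control)

# The "significance test" is nothing more than this comparison
# against a conventional cutoff:
verdict = "significant" if p < 0.05 else "not significant"
print(p, verdict)
```

Note that all the information in the data gets collapsed into a single yes-or-no verdict on the last line. That collapse is the subject of the rest of this page.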

Three kinds of significance levels

There are three different interpretations for “significance level”:

  1. Setting a conventional level such as 5% — an early proposal by Fisher.
  2. Interpreting it as α — based on the Neyman–Pearson theory.
  3. Reporting exact p‑values, with no thresholds — a later proposal by Fisher.

Significance in the...

  1. first is a property of the test, set by mere convention;
  2. second is a property of the test, set by cost-benefit analysis;
  3. third is a property of the data.

So here's one thing Null Hypothesis Significance Testing is: a mix of the three theories.

I invite you to see Question 4 (page 9) of Gigerenzer et al. 2004. Then ask yourself: “am I committing the confusion of Dr. Publish‑Perish?”

(Only if you're interested in much more technical detail, see Schneider 2014 and Perezgonzalez 2015.)

What about that “null” part?

So, what is a “null hypothesis” anyway? Depends on who you ask. It can mean:

  1. the hypothesis of zero effect or zero difference (the “nil” hypothesis); or
  2. the hypothesis to be nullified, whatever effect size it states.

Pick your terminology. People who adopt the...

This means that “null” is ambiguous. As if we didn't already have enough semantic confusion with the word “significance”.

What significance testing isn't

For one, significance testing isn't the p‑value.

It's true that, in practice, the two usually come together: significance is typically tested by comparing a p‑value against a threshold.

However, they are separate things. You can test significance with intervals. Or Bayes factors. Or you can eyeball your results and declare “Given the null and assumptions, the chance of this observation is very low. I declare that Yes, significant!”

And this is an issue.

No, not the use of intuitions. Something deeper, which applies regardless of what you use to declare “significance”.

Anxious apes dichotomize. Anxious apes reify.

People expect certainty from single studies.

“Is it true or not? Effect or no effect? Significant or not significant? Will Freshfutazol™ cure my grandmother's athlete's foot or not? Tell me!”

This anxiety consumes people. Humans, like cows, ruminate.

So Mrs. Significance comes along and says:
“Behold, mortal! The numbers have been transmuted, and I have thy answer! Do I have thy attention now? Good. So... (dramatic drumming) Significant! May the journals open their doors to thy achievement.”

How soothing!

And yet, how dangerous. Because reality doesn't care if someone declares an issue closed. Reality doesn't care if the neurons of a couple of apes (that would be you and me) rearranged and those apes now believe Freshfutazol™ is the best thing invented since sliced bread.

Freshfutazol™ will do its thing to cure your grandmother's ailments — or it won't. Freshfutazol™ is indifferent to what you think of it. Freshfutazol™ doesn't care in the least that Mrs. Significance showed up and pompously gave a verdict.

Statistical models and their outputs are not reality itself.

We want to be more certain and less anxious. And this is one reason why Mrs. Significance is so appealing. But her soothing powers come at the cost of delusion. And this delusive overconfidence can be, and often is, harmful.

Why is significance testing a problem?

It doesn't deliver on its promise of telling discoveries and non-discoveries apart.

Ok, more concretely:

Null hypothesis significance testing...

This happens because

And since the difference between significant and nonsignificant is not significant... Uh‑oh.

Nullism is harmful

Usually only one hypothesis is tested, and it assumes zero effect and zero systematic error. It implies a belief that “starting from zero” is always rational and impartial and unbiased and objective. It isn't. Why?

Rejection of the null doesn't imply your hypothesis “is true”

The usual logic is that if p is low, the alternative hypothesis is proved. Wrong.

Other hypotheses could fit the data better. Noise in sampling and measurement may explain the low p just as well. This is often overlooked.

Your (nil-)null hypothesis isn't only “Freshfutazol™ produces zero effect”: it's also all other model assumptions. It presupposes zero systematic error. A tiny p doesn't imply you discovered faster-than-light neutrinos. Maybe it was just a loose cable.

Rejection of the null doesn't imply large effects

Now suppose your measurements are flawless. You say “Freshfutazol™ produces an effect”. But you already knew that; “zero difference” is rarely true. An increase in precision will find some difference.

Rejecting a (nil-)null is like saying “this place is not sterile”.
Anything fits this exclusion — from pigsty to palace.

If this large uncertainty were kept in mind, fine. But the significant-or-not compulsion collapses it into “this place is dirty”. So even tiny, clinically irrelevant differences will be blown out of proportion.

Testing only the null brings a bias that favors the null

Consider Greenland 2016:
“Testing only the no-effect hypothesis simply assumes, without grounds, that erroneously defaulting to no effect is the least costly error, and in this sense is a methodologic bias toward the null.”

Suppose “this substance has no side effects” is the only hypothesis tested, and “p > 0.05”. The substance is approved. The harm from any false negatives would fall on the users.

Science does not demand that you assume “no effect” as a starting point. In fact, often you should not.

Everything null in multiple comparisons? Unlikely.

So, there's this nice cartoon everybody likes to show. The point it makes is that the more data someone analyzes, the higher the chance of some “significant” result showing up. Then the person sweeps under the rug all the nonsignificant results and selectively reports the significant ones. This is not cool, and you shouldn't do it: rather, display all the analyses. So the criticism is valid.

Scrupulous researchers then go one step further: “Hey, I'm not doing multiple comparisons to cheat. Here, I'll prove it: I'll compensate by tweaking the threshold to make things harder.”

One popular way to do that is the Bonferroni correction. It's like this: if you made 20 comparisons, you compensate by dividing your 0.05 threshold by 20. So now some comparison is only “significant” if its p is below 0.0025. This addresses the false positive problem.
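As arithmetic, the correction is trivial. A minimal sketch, using the 20 comparisons from the example above (the p‑value of 0.01 below is invented for illustration):

```python
alpha = 0.05        # the conventional threshold
n_comparisons = 20  # number of comparisons made

# Bonferroni: each comparison must clear a far stricter per-comparison cutoff.
adjusted = alpha / n_comparisons
print(adjusted)

# The cost: a hypothetical comparison with p = 0.01 would have passed
# the original cutoff, but is now discarded as "nonsignificant".
p = 0.01
print(p < alpha)     # True
print(p < adjusted)  # False
```

The last two lines preview the trade-off discussed next: the stricter cutoff buys fewer false positives at the price of more false negatives.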

But... it may greatly increase the rate of false negatives. Causal relationships are there, and you throw them away, scared by the possibility that they are “only chance”.

The problem is that you assume no relationship whatsoever between any of the things you analyze. If you always do this, you're implying that you live in a universe where nothing is expected to have any effect on anything else. That is wrong in principle and in general (exceptions exist). It throws away promising avenues of investigation by creating a penalty for gathering information.

If you deal with biological data and have been correcting your multiple comparisons, then you should read the short misconception 5 of Rothman 2014; and the four pages of Rothman 1990.

(Technical reading for statisticians: Gelman suggests (pdf) that with Bayesian inference and the correct prior the problem of multiple comparisons disappears.)

Flat priors: that's “null” for Bayesians, with similar vices

(If this title is unintelligible to you, just skip to the next section)

Uninformative priors are a Bayesian version of nullism.

You may find Gelman 2013 to be a short and useful read. From there:
“The casual view of the P value as posterior probability of the truth of the null hypothesis is false and not even close to valid under any reasonable model, yet this misunderstanding persists even in high-stakes settings.”

Greenland 2017 has more about nullism and other cognitive distortions behind NHST.

What is a p-value?

Nobody knows.

“No-one understands what a p-value is, not even research professors or people teaching statistics.” (Haller & Krauss 2002)

Tongue-in-cheek but meaningful (by Nicholas Maxwell):
“p‑value is the degree to which the data are embarrassed by the null hypothesis.”

Ok, ok. Let's try a common formal definition:
“p‑value is the probability of obtaining a test statistic equal to or more extreme than what was actually observed, conditional on the null hypothesis (including all model assumptions) being true.”

Another formal definition — an unconditional one (by Sander Greenland):
“p is the observed value of the random variable P, which in turn serves as a unit‑scaled index of compatibility between the data and the proposed decoding (summarizing) model M from which P is derived.”

Great. With that out of the way, we now pass to the fertile terrain of...

What a p-value isn't

Misconceptions about p‑values are widespread. Wikipedia has a page dedicated to them. Goodman 2008 listed 12 of them. Greenland et al. 2016 pointed out 25 misinterpretations — of p‑values, confidence intervals, and power.

In particular, p-values CANNOT tell you:

  1. the importance of your result,
  2. the strength of the evidence,
  3. the size of an effect,
  4. the probability that rejecting the null is a wrong decision,
  5. the probability that the data were produced by random chance alone,
  6. the probability that the result can be replicated, or
  7. the probability that your hypothesis is true.

You cannot claim any of the above based on a p‑value.

p-values: by nature unsuitable for “significance”

Worth repeating: NHST is not the same as p‑values. They usually come together but you can have NHST without p‑values and p‑values without NHST. There are valid uses for p‑values; their mindless degradation into “significant or not” isn't one of them.

However, let's suppose you insist that the goal of finding “significance” is valid; that from a single study you can give a verdict of whether some effect “is real or not”. Nonsense, but suppose.

p‑values, by their very nature, would be unsuitable for that goal. In particular, they:

  1. answer a question that you're probably not asking;
  2. are sensitive to model violations;
  3. are sensitive to sample size — increase it and you get “significance”;
  4. are volatile — significant today, nonsignificant tomorrow.

These points should help you see the meaninglessness of the label “significant” obtained from a calculated p‑value.

p-values answer a question you're probably not asking

“It does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!”
(Cohen 1994)

p‑values answer the chance of observations given a hypothesis; whereas what you usually want is the chance of a hypothesis given observations.

Let's pick two groups: all the bread-eaters in the world, and all the psychologists in the world. Clearly, there are bread-eating psychologists. However:

  1. the probability that a given psychologist eats bread is very high;
  2. the probability that a given bread-eater is a psychologist is minuscule.
Those are not interchangeable. Not at all the same thing. Without additional information, you can't infer one from the other.
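The gap between the two conditional probabilities can be made concrete with numbers. All counts below are hypothetical, chosen only to make the point:

```python
# Hypothetical counts, for illustration only.
psychologists = 1_000_000            # assumed number of psychologists
bread_eaters = 5_000_000_000         # assumed number of bread-eaters
bread_eating_psychologists = 950_000

# P(eats bread | psychologist): very high.
p_bread_given_psych = bread_eating_psychologists / psychologists

# P(psychologist | eats bread): tiny.
p_psych_given_bread = bread_eating_psychologists / bread_eaters

print(p_bread_given_psych)  # 0.95
print(p_psych_given_bread)  # 0.00019
```

Same numerator, wildly different denominators. A p‑value gives you the analogue of the first quantity; what you usually want is the analogue of the second.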

Which leads us to this example from Gorard 2016:

p-values are sensitive to violations of model assumptions

Violations of any model assumption (perfect randomization, zero systematic error, full reporting, etc.) can distort your calculated p‑value. As already mentioned, a loose cable is all it takes to get a p of virtually zero: your model assumes good measurement (no loose cables), and the violation makes your data far more incompatible with the model, which the p‑value reflects.

p-values are sensitive to sample size

If the null hypothesis isn't true (and it never is), then some difference exists between the groups. Improve your precision and it will be detected. Sample size increases, p‑value decreases. So “significance” can be bought, and a small p does not imply a big effect.
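A sketch of this effect, using a plain two-sample z-test with a made-up, fixed difference of 0.05 and a known standard deviation of 1. Only the sample size changes:

```python
from math import sqrt
from statistics import NormalDist

def two_sample_p(observed_diff, sigma, n):
    """Two-sided p-value for a mean difference between two groups of size n,
    assuming a known standard deviation sigma (a simple z-test)."""
    z = observed_diff / (sigma * sqrt(2 / n))
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A tiny, fixed "effect": a difference of 0.05 with sigma = 1.
# The effect never changes; only n grows.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(two_sample_p(0.05, 1.0, n), 4))
```

Run it and watch the p‑value fall below any cutoff you like, for the same negligible difference. “Significance” purchased with sample size alone.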

Here is the problem:

When people declare “significant!”, they think they detected some sizeable difference or effect. But the above shows that you can move your p up and down by varying your sample size with effect size kept constant!

p-values are volatile

p‑values are extremely volatile, even at larger sample sizes. They vary wildly when repeatedly sampling the same populations under the same methodology.

  1. Watch a video of their dance.
  2. Read the abstract and conclusion of this short paper.
  3. Read this short article.

This is by construction, by the way. It's not a “defect”. This is what they are supposed to do. They reflect random deviation from the model. They don't converge. (See p.5 of Amrhein, Trafimow, & Greenland 2018)
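You can watch the dance yourself. A sketch with simulated data: the true effect (0.3), sigma (1), and sample size (50 per group) are invented and held fixed across all 20 replications, so every difference between the p‑values is pure sampling noise:

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(7)

def z_test_p(a, b, sigma=1.0):
    """Two-sided z-test p-value for the difference of two sample means,
    assuming a known standard deviation sigma and equal group sizes."""
    n = len(a)
    z = (mean(a) - mean(b)) / (sigma * sqrt(2 / n))
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same population, same true effect, same methodology, 20 replications.
ps = []
for _ in range(20):
    treated = [random.gauss(0.3, 1.0) for _ in range(50)]
    control = [random.gauss(0.0, 1.0) for _ in range(50)]
    ps.append(z_test_p(treated, control))

print(sorted(round(x, 3) for x in ps))
```

The sorted list typically spans well below and well above 0.05: the same study design declares “significant!” on some days and “nothing here” on others.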

You say: “But intervals and Bayes factors also dance! Why single out the p‑value?”

First, I'm not arguing that other metrics are better for testing significance “because they wouldn't dance”. Rather, I'm arguing that the wild dance of p‑values disqualifies the very notion and usefulness of claiming “significance”.

Second, the black-and-white thinking inherent to yes-or-no testing of significance is itself problematic whether you use p, CI, Bayes factors or whatever. We are dichotomaniacal apes. We can and should dispense with this habit. Let the variables be continuous. Embrace the uncertainty.

The options

In increasing order of change:

  1. Maintain things exactly as they are.
  2. Reform significance testing by lowering the p < 0.05 cutoff.
  3. Abandon significance testing rather than just reforming.

Shortest analysis:

  1. Maintain. Those unaware of the issue can't help but continue doing the same. And those motivated to advance their careers through production of junk science may prefer that things don't change. (Maybe you have been in the first group, until now. Hopefully, you are not in the second.)

  2. Reform NHST. To produce a “true p‑value” of 0.05, you need to aim for at most 0.005, or even 0.001 (see Taleb 2018). At first sight, a lower threshold might then seem reasonable. Benjamin et al. 2017 proposed just that, arguing it would much reduce the rate of published false positives.

    But that doesn't take p‑hacking into account. Once it does, the reduction in false positives disappears, and it could make the replication crisis worse (Crane 2017). Some further reasons why such reforms could bring net harm are an increased overconfidence in published results, exaggeration of effect sizes, and discounting of valid findings (McShane et al. 2018).

  3. Abandon NHST. So, the current practice is harmful, and reforming doesn't help, plus could make it worse. Seems like sufficient reason to propose abandonment. Well, there are more. NHST is a form of “uncertainty laundering” poorly suited for the biomedical and social sciences, with their “small and variable effects and noisy measurements”. Plus, this yes-or-no mindset has “no ontological basis” and promotes bad reasoning. Plus the nil-null is uninteresting to calibrate against. Those issues aren't addressed by tweaking cutoff levels. So we better get rid of thresholds.

    If you read nothing else, please read the clear McShane et al. 2018 (pdf), since attempts to summarize it won't do it justice. Then Trafimow et al. 2018 (pdf), those two pages by Greenland, and Gorard 2016.

A quick discussion

So “Abandon null hypothesis significance testing” is what is being urged here. See what you can do.

Worth repeating: dichotomization (yes/no to “significance”) and the obsession with “zero effect, zero difference” nulls are serious problems intrinsic to NHST. They are not intrinsic problems of p‑values, intervals, or Bayes factors. So we can (and should) eliminate NHST without throwing those babies out with the bathwater. Even if some babies are treacherous.

Elaboration on valid use cases for p‑values is beyond the scope of this site. If you are educated about them, go for it. For now, let's make sure you don't misuse them.

Objections to the abandonment of NHST

“Still, shouldn't we keep using significance testing, because...”

No. Read, for example, Schmidt et al. 1997, who over three years collected objections to the abandonment of significance testing — all of which they rejected.

False objections you will see addressed there:

  1. Without significance testing we would not know whether a finding is real or due to chance.
  2. Hypothesis testing would not be possible without significance tests.
  3. The problem is not significance tests, but failure to develop a tradition of replicating studies.
  4. When studies have a large number of relationships, we need significance tests to identify those that are real.
  5. Confidence intervals are themselves significance tests.
  6. Significance testing ensures objectivity in the interpretation of research data.
  7. It is the misuse, not the use, of significance testing that is the problem.
  8. It is futile to try to reform data analysis methods, so why try?

See the FAQ for additional objections and questions.

What you can do

“So what if there were no null ritual or NHST? Nothing would be lost, except confusion, anxiety, and a platform for lazy theoretical thinking. Much could be gained, such as knowledge about different statistical tools, training in statistical thinking, and a motivation to deduce precise predictions from one’s hypotheses.

Should we ban the null ritual? Certainly — it is a matter of intellectual integrity. Every researcher should have the courage not to surrender to the ritual, and every editor, textbook writer, and adviser should feel obliged to promote statistical thinking and reject mindless rituals.” (Gigerenzer et al. 2004)

So you have read those pages, seen this letter, and are aware of the potential personal costs. Conditional on your having understood and agreed — what can you do?

There is no simple universal method for scientific inference. Likewise, there's no simple universal recommendation of what to do in situations where your personal convictions conflict with those of people around you.

The specifics of what you will do (if anything) depend on your personal situation, values, goals, temperament, people and institutions involved, and many other variables that only you, personally, can evaluate.

With all that in mind, below are suggestions. Starting upstream:




Next steps

Follow the links, understand them, read about potential personal risks, make up your mind.