Null Hypothesis Significance Testing (NHST) is common in research, usually through so‑called “p‑values”.

The practice is nonsense and harmful. It should be abandoned. At best, it hinders science and wastes money. At worst, it hurts people.


What is significance testing?

A statistical procedure that promises to give a yes-or-no verdict on whether a discovery has been made.

Many statistical tests exist, such as ANOVA, χ², and t-tests. A certain probability is calculated from the results of those tests. This probability is called a p‑value.

The significance test is this: if the p‑value falls below a chosen threshold, the result is declared “significant”; if not, it is declared “not significant”.

The usual threshold is 0.05, but it varies by field.
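To make the ritual concrete, here is a minimal sketch of it in Python. The numbers are invented and scipy is assumed to be available; the point is only to show the mechanics being described, not to endorse them.

```python
# A minimal sketch of the ritual as commonly practiced (invented data).
from scipy import stats

control = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2, 4.7]
treated = [5.4, 5.9, 5.1, 5.7, 5.5, 6.0, 5.2, 5.8]

# Some test is run over the data (here, a two-sample t-test)...
t_stat, p_value = stats.ttest_ind(treated, control)

# ...and the verdict hinges on a conventional threshold.
alpha = 0.05
verdict = "significant" if p_value < alpha else "not significant"
print(f"p = {p_value:.3f} -> {verdict}")
```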

Three kinds of significance levels

Did you know that there are three different interpretations for “significance level”?

  1. The level fixed at 5%, or at any other conventional value — an early proposal by Fisher.
  2. The level interpreted as α, a long-run error rate — based on Neyman-Pearson theory.
  3. The report of exact p‑values, with no thresholds — a later proposal by Fisher.

Significance in the...

  1. first is a property of the test, set by mere convention;
  2. second is a property of the test, set by cost-benefit analysis;
  3. third is a property of the data.

So here's one thing Null Hypothesis Significance Testing is: a confused conflation of the three theories; “a mishmash that doesn't exist in statistics proper”.

For more about that, I invite you to see Question 4 (page 9) of Gigerenzer et al. 2004. Then ask yourself: “are you committing the emotional and intellectual confusion of Dr. Publish‑Perish?”

What significance testing isn't

For one, significance testing isn't the p‑value.

It's true that, in practice, significance is almost always tested by means of a p‑value.

However, they are separate things. Plenty of tools can be used to test significance. It can be done with intervals. Or with Bayes factors.

Or you can eyeball your results and declare “Hmmm... Given the null and assumptions, the chance of this observation is very low. Way off the expected. I declare that Yes, significant!”

And this is an issue.

Not the use of intuitions, because some data may be so clear that you don't need complicated statistics to declare what you see. No, it's something else, deeper, and it applies regardless of what you are using for declaring “significance”.

Anxious apes

People expect certainty from single studies.

“Is it true or not? Yes or no? Effect or no effect? Significant or not significant? You just finished a study on this new-generation Freshfutazol™, so will it cure my grandmother's athlete's foot or not? Tell me!”

This anxiety consumes people. Humans, like cows, ruminate.

So Mrs. Significance comes along and says:
“Behold, mortal! The numbers have been shaken and transmuted, and I have thy answer! Do I have thy attention now? Good. So... (dramatic drumming) Significant! Congratulations! May the journals open their doors to thy achievement.”

How soothing!

And yet, how dangerous. Because reality doesn't care if someone declares an issue closed. Reality doesn't care if the neurons of a couple of apes (that would be you and me) rearranged and those apes now believe Freshfutazol™ is the best thing invented since sliced bread.

Freshfutazol™ will do its thing to cure your grandmother's ailments — or it won't. Freshfutazol™ is indifferent to what you think of it. Freshfutazol™ doesn't care the least that Mrs. Significance showed up and pompously gave a verdict.

We want to be more certain and less anxious. And this is one reason why Mrs. Significance is so appealing. But her soothing powers come at the cost of delusion. And this delusive overconfidence can be, and often is, harmful.

Why is significance testing a problem?

It doesn't deliver on its promise of telling discoveries and non-discoveries apart.

Ok, more concretely: why is NHST a problem?

Significance testing will happily flip its verdict: the same effect, studied the same way, comes out “significant” in one sample and “nonsignificant” in the next.

This happens because the p‑value feeding the verdict is itself highly unstable from sample to sample, while the threshold it is compared against is an arbitrary line.

And since the difference between significant and nonsignificant is not significant... Uh‑oh.

What is a p-value?

Nobody knows.

“No-one understands what a p-value is, not even research professors or people teaching statistics.” (Haller & Krauss 2002)

Ok, ok. Let's try. A p‑value is “the probability of obtaining a result equal to or more extreme than what was actually observed, conditional on the null hypothesis plus all other model assumptions being true, and comparable only to samples of the same magnitude”.
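If it helps to see that definition as a computation, here is a toy sketch. The scenario is invented (100 flips of a coin, 60 heads observed, the null being that the coin is fair), and the p‑value is approximated by simulation rather than by a formula:

```python
# Toy sketch of the definition: the probability, assuming the null (a fair
# coin) and the rest of the model, of a result at least as extreme as the
# one observed. The scenario is invented.
import random

n_flips, observed_heads, n_sims = 100, 60, 20_000

def heads_in_fair_flips(n):
    return sum(random.random() < 0.5 for _ in range(n))

# Two-sided: count simulated results at least as far from 50 as 60 is.
as_extreme = sum(
    abs(heads_in_fair_flips(n_flips) - 50) >= abs(observed_heads - 50)
    for _ in range(n_sims)
)
print(f"simulated p ≈ {as_extreme / n_sims:.3f}")  # close to 0.057
```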

Ready to misunderstand that in 5 minutes? Great, so we can now pass to the fertile terrain of...

What a p-value isn't

Misconceptions of p‑values are widespread. Wikipedia has a dedicated page just for them. Goodman 2008 lists 12 of them. Greenland et al. 2016 point out 25 misinterpretations — of p‑values, confidence intervals, and power.

In particular, p-values CANNOT tell you:

  1. the importance of your result,
  2. the strength of the evidence,
  3. the size of an effect,
  4. the probability that rejecting the null is a wrong decision,
  5. the probability that the data were produced by random chance alone,
  6. the probability that the result can be replicated, or
  7. the probability that your hypothesis is true.

You cannot claim any of the above based on a p‑value.

Why are p-values a problem?

Because you won't use them correctly. And because even if you did, what they answer is unlikely to be useful or interesting.

Let's start with the “using correctly” part.

Social scientist? No p-values for you.

Why? Because p‑values assume you have randomized samples. And you probably don't.

It's a built-in requirement for calculating a p‑value that your data come from a randomized sample. So unless you have legitimate reasons not to, you'd better follow this rule.

(What if you didn't? Then on top of p-values' intrinsic volatility and distortions from noise in your measurements, you would also have the usually large distortions from nonrandom sampling. This would demand even stronger assumptions for any claims you wish to make about the general population. Would you be able to provide them? Statistical pyrotechnics may be insufficient to rescue hopelessly low-quality data. Worse, the added layer of apparent sophistication may breed overconfidence.)

What do I mean by random? I mean you'd better get very close to no design bias, no dropouts, no non-responses, no measurement error, and no loss of blinding.

So if your samples are not randomized over the population about which you want to draw inferences, you will have a distorted input for the calculation of the p‑value, which in turn will be a distorted input for the significance test, whose output will, in turn, be distorted.
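A toy sketch of that chain, with made-up numbers: one population, zero true effect, but the “treated” group is drawn non-randomly (self-selected volunteers). The distorted input dutifully produces a confident verdict about nothing.

```python
# Sketch: zero true effect, but a non-random (self-selected) "treated" sample.
# The test then "detects" a difference that is pure selection bias.
import random
from scipy import stats

random.seed(1)
population = [random.gauss(50, 10) for _ in range(100_000)]  # one population, no effect

control = random.sample(population, 50)          # random sample
volunteers = [x for x in population if x > 55]   # self-selected: non-random
treated = random.sample(volunteers, 50)          # "treated" group, same population

print(stats.ttest_ind(treated, control).pvalue)  # tiny p, yet no effect exists
```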

And this is following p‑values and significance testing's own terms. This is before any of the abundant additional criticism given by this site is taken into consideration.

From Gorard 2017, Sociological Research Online:

By the way, Gigerenzer 2004 is mandatory reading for mindful social scientists.

If this still doesn't make sense, see this question here.

Ok, now let's suppose you solved randomization and measurement. Next:

You'll misunderstand it

See what a p-value isn't. Are you sure you are not claiming any of those things that you cannot conclude?

Ok. Then:

You'll be misled by its unreliability

Because p‑values are, by construction, extremely volatile, regardless of sample size. Their value will vary wildly when repeatedly sampling the same populations under the same methodology. This should convince you:

  1. Watch the video of their dance.
  2. Read the abstract and conclusion of this short paper.
  3. Read this short article.

Mind you, the issue exists even if you're not evaluating “significance”. But if you are, then significant today, nonsignificant tomorrow.

Worth repeating: the difference between significant and nonsignificant is not itself significant.
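If you want to see the dance on your own machine, here is a minimal simulation sketch (invented numbers: a true effect of half a standard deviation, thirty observations per group, ten replications of the exact same study):

```python
# Sketch of the "dance": same population, same true effect, same method,
# repeated sampling -- the p-value still swings wildly between replications.
import random
from scipy import stats

random.seed(0)
for replication in range(1, 11):
    control = [random.gauss(0.0, 1.0) for _ in range(30)]
    treated = [random.gauss(0.5, 1.0) for _ in range(30)]  # true effect: 0.5 SD
    p = stats.ttest_ind(treated, control).pvalue
    print(f"replication {replication}: p = {p:.3f}")
```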

Uninteresting. Unlikely.

p‑values are, more often than not, calibrated against a hypothesis that assumes zero effect and zero systematic error. Also more often than not, this is uninteresting and unlikely.

Physicists normally do this thing of “let's assume this cow is a perfectly frictionless sphere” — unlikely scenario, but at least in their case it's interesting, and useful for coming up with workable simplified models. More about cows later.

Now you say: “I'm going to rebel against this default! I will use non-nil nulls!” Fine, but you will still have to deal with the detail that...

It answers the wrong question

p‑values answer the chance of observations given a hypothesis; whereas what we usually want is the chance of a hypothesis given observations.

Those are not interchangeable. Not at all the same thing. Without additional information, you can't jump from one to the other.

Let's pick two groups: all the bread-eaters in the world, and all the psychologists in the world. Clearly, there are psychologists who eat bread. However: the probability that someone eats bread, given that they are a psychologist, is very high; while the probability that someone is a psychologist, given that they eat bread, is minuscule.

And you can't infer one just from the other.
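To put illustrative numbers on it (every figure below is invented): nearly all psychologists eat bread, yet almost no bread-eater is a psychologist.

```python
# Invented counts, just to show the two directions of conditioning.
bread_eaters = 5_000_000_000     # people who eat bread (assumption)
psychologists = 1_000_000        # psychologists in the world (assumption)
psych_who_eat_bread = 950_000    # the overlap (assumption)

p_bread_given_psych = psych_who_eat_bread / psychologists  # ≈ 0.95
p_psych_given_bread = psych_who_eat_bread / bread_eaters   # ≈ 0.0002

print(p_bread_given_psych, p_psych_given_bread)
```

Same overlap, wildly different conditional probabilities, depending on which way you condition.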

Significance testing in action:

“If the null hypothesis were true, a result like this one would be very unlikely. We got this result. Therefore, the null hypothesis is probably false: the effect is real.”

Do you see anything weird in the logic above? Good. (What is it?)

If you prefer numbers, here is an example from Gorard 2016:

“It does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!”
(Cohen 1994)

Too many problems for too few solutions

So, p‑values are almost certain to be a source of confusion. And unlikely to be a source of answers you want to questions you care about.

(Well, I don't really know what you want, and probably neither do you. I'd still guess: what you want to know is probably not what p‑values answer.)

The options

In increasing order of change:

  1. Maintain things exactly as they are.
  2. Reform significance testing by lowering the p < 0.05 cutoff.
  3. Abandon significance testing rather than just reforming it.
  4. Ban p-values as well, prohibiting any use of them.

Shortest analysis:

  1. Maintain. Those unaware of the issue can't help but continue doing the same. And those motivated to advance their careers through production of junk science may prefer that things don't change. (Maybe you have been in the first group, until now. Hopefully, you are not in the second.)

  2. Reform NHST. To produce a “true p‑value” of 0.05, you need to aim for at most 0.005, or even 0.001 (see Taleb 2018). At first sight, a lower threshold might then seem reasonable. Benjamin et al. 2017 proposed just that, arguing it would greatly reduce the rate of published false positives.

    But that doesn't take p‑hacking into account. Once it does, the reduction in false positives disappears, and it could make the replication crisis worse (Crane 2017); a toy sketch of one p‑hacking mechanism appears after this list. Some further reasons why such reforms could bring net harm are increased overconfidence in published results, exaggeration of effect sizes, and discounting of valid findings (McShane & Gelman 2017).

  3. Abandon NHST. So: the current practice is harmful, reforming doesn't help, and reform could even make things worse. That seems like sufficient reason to propose abandonment. Well, there are more reasons. NHST is a form of “uncertainty laundering” poorly suited for the biomedical and social sciences, with their “small and variable effects and noisy measurements”. Plus, this yes-or-no mindset has “no ontological basis” and promotes bad reasoning. Plus, the nil-null is uninteresting to calibrate against. Those issues aren't addressed by tweaking cutoff levels. So we'd better get rid of thresholds.

    Please read the short and clear McShane & Gelman 2017, since attempts to summarize it won't do it justice. Then quickly read Trafimow et al. 2017, and then the incisive Gorard 2016. For an amusing account of events, you may like the alpha wars.

    So what happens to p‑values once NHST is over and they lose their gatekeeping role? One could still use them as just another tool, without launching into yes-or-no claims of significance. And their use would decline on its own, circumscribed to their (very) limited applications.

  4. Ban p-values. Many, however, see in p‑values so few redeeming qualities and so much potential for misuse that, if not an outright ban, they more emphatically suggest the tool be “relegated to the scrap heap” (Lindley 1999).
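As promised above, here is a toy sketch of one p‑hacking mechanism (trying several outcomes and reporting only the best one). It is not a reproduction of Crane's analysis, and all numbers are invented; it only shows why the realized false-positive rate can have little to do with whatever nominal threshold is in force.

```python
# Toy p-hacking sketch: no true effect anywhere, but each "study" tries
# several outcomes and reports only the smallest p-value.
import random
from scipy import stats

random.seed(0)
studies, outcomes_tried, n = 2_000, 10, 30
hits_05 = hits_005 = 0

for _ in range(studies):
    best_p = min(
        stats.ttest_ind(
            [random.gauss(0, 1) for _ in range(n)],
            [random.gauss(0, 1) for _ in range(n)],
        ).pvalue
        for _ in range(outcomes_tried)
    )
    hits_05 += best_p < 0.05
    hits_005 += best_p < 0.005

print(f"'significant' at 0.05:  {hits_05 / studies:.1%}")   # roughly 40%, not the nominal 5%
print(f"'significant' at 0.005: {hits_005 / studies:.1%}")  # roughly 5%, not the nominal 0.5%
```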


So “Abandon NHST” is what is being urged here. See what you can do.

No ban on p‑values, though. Despite all their confusion and limitations, they have their use in some fields, in some cases. They should, however, be much de-emphasized.

Objections to the abandonment of NHST

“Still, shouldn't we keep using significance testing, because...”


Read, for example, Schmidt & Hunter 1997, who for three years collected objections to the abandonment of significance testing — all rejected.

False objections you will see addressed there:

  1. Without significance testing we would not know whether a finding is real or due to chance.
  2. Hypothesis testing would not be possible without significance tests.
  3. The problem is not significance tests, but failure to develop a tradition of replicating studies.
  4. When studies have a large number of relationships, we need significance tests to identify those that are real.
  5. Confidence intervals are themselves significance tests.
  6. Significance testing ensures objectivity in the interpretation of research data.
  7. It is the misuse, not the use, of significance testing that is the problem.
  8. It is futile to try to reform data analysis methods, so why try?

See the FAQ for additional objections and questions.

What you can do right now

“So what if there were no null ritual or NHST? Nothing would be lost, except confusion, anxiety, and a platform for lazy theoretical thinking. Much could be gained, such as knowledge about different statistical tools, training in statistical thinking, and a motivation to deduce precise predictions from one’s hypotheses.

Should we ban the null ritual? Certainly — it is a matter of intellectual integrity. Every researcher should have the courage not to surrender to the ritual, and every editor, textbook writer, and adviser should feel obliged to promote statistical thinking and reject mindless rituals.” (Gigerenzer et al. 2004)

So you have read those pages, seen this letter, are aware of potential personal costs, and decided to do something. What can you do?

Here are suggestions, starting upstream:




Next steps

Follow the links, understand them, read about potential personal risks, make up your mind.