**Null Hypothesis Significance Testing (NHST)** is common in research, usually through so‑called “**p‑values**”.

The practice is **nonsense and harmful**. It **should be abandoned**. At best, it hinders science and wastes money. At worst, it hurts people.

- **Significance testing**: what it is, what it isn't, and why a problem.
- The **p‑value**: what it is, what it isn't, and why a problem.
- **Why**, among the options, the **end of significance testing is best**.
- **False** objections you may have.
- **What you can do about it**.
- And further sources.

A statistical procedure that *promises* to give a yes-or-no verdict of whether a discovery has been made.

Many statistical tests exist, such as ANOVA, χ², and t-tests. A certain probability is calculated over the results of those tests. This probability is called a *p*‑value.

The significance test is this: if this *p*‑value is found to be...

- *below* the agreed-upon threshold, the result is said to “have reached significance”, a discovery is declared, and researchers feel joy.
- *above* the agreed-upon threshold, it “has not reached significance”, no discovery is declared, and researchers feel doomed.

The usual threshold is 0.05, but it varies by field.
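To make the ritual concrete, here is a minimal sketch in Python with made-up data: a p-value for a difference in group means, computed via a permutation test, then reduced to the customary yes/no verdict. All numbers are illustrative, not a recommendation.

```python
import random

random.seed(0)

# Hypothetical measurements for two groups (illustrative numbers).
control   = [4.1, 5.0, 4.8, 5.2, 4.6, 4.9, 5.1, 4.7]
treatment = [5.3, 5.9, 5.6, 6.1, 5.4, 5.8, 6.0, 5.5]

observed = sum(treatment) / len(treatment) - sum(control) / len(control)

# Permutation test: how often does a random relabeling of the pooled data
# produce a difference at least as extreme as the observed one?
pooled = control + treatment
n_sims, extreme = 20_000, 0
for _ in range(n_sims):
    random.shuffle(pooled)
    diff = sum(pooled[8:]) / 8 - sum(pooled[:8]) / 8
    if abs(diff) >= abs(observed):
        extreme += 1

p_value = extreme / n_sims

# The "significance test" is nothing more than this comparison:
verdict = "significant" if p_value < 0.05 else "not significant"
```

The p-value computation is legitimate; the reduction of it to a binary verdict against a conventional cutoff is the ritual this page criticizes.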

Did you know that there are **three** different **interpretations** for “significance level”?

- The setting to 5% or any other level — an early proposal by **Fisher**.
- The interpretation as α — based on **Neyman-Pearson**'s theory.
- The report of exact *p*s, no thresholds — a later proposal by **Fisher**.

**Significance** in the...

- first is a property of the **test**, set by **mere convention**;
- second is a property of the **test**, set by **cost-benefit analysis**;
- third is a property of the **data**.

So here's one thing **Null Hypothesis Significance Testing is**: a **confused conflation** of the three theories; *“a mishmash that doesn't exist in statistics proper”*.

For more about that, I invite you to see Question 4 (page 9) of Gigerenzer et al. 2004. Then ask yourself: **“are you committing the emotional and intellectual confusion of Dr. Publish‑Perish?”**

For one, **significance testing isn't the p‑value.**

- You can calculate a *p*‑value without giving a yes/no verdict of “statistical significance”.
- And you can “test significance” using things other than *p*‑values.

It's true that, in *practice*:

- If you observe a *p*‑value being used, it's highly likely that a “significance test” follows it, explicitly or implied.
- If you observe a “significance test” being used, it's highly likely that it was through a *p*‑value.

However, they are separate things. Many tools can be used to test significance: intervals, for example, or Bayes factors.

Or you can eyeball your results and declare *“Hmmm... Given the null and assumptions, the chance of this observation is very low. Way off the expected. I declare that Yes, significant!”*

And this is an issue.

Not the use of intuitions, because some data may be so clear that you don't need complicated statistics to declare what you see. No, it's something else, deeper, and it applies regardless of what you are using for declaring “significance”.

People expect certainty from single studies.

*“Is it true or not? Yes or no? Effect or no effect? Significant or not significant? You just finished a study on this new-generation Freshfutazol™, so will it cure my grandmother's athlete's foot or not? Tell me!”*

This anxiety consumes people. Humans, like cows, ruminate.

So Mrs. Significance comes along and says:

**“Behold, mortal!** The numbers have been shaken and transmuted, and I have thy answer! Do I have thy attention now? Good. So... (dramatic drumming) **Significant!** Congratulations! May the journals open their doors to thy achievement.”

How soothing!

And yet, how dangerous. Because reality doesn't care if someone declares an issue closed. Reality doesn't care if the neurons of a couple of apes (that would be you and me) rearranged and those apes now believe Freshfutazol™ is the best thing invented since sliced bread.

Freshfutazol™ will do its thing to cure your grandmother's ailments — or it won't. Freshfutazol™ is indifferent to what you think of it. Freshfutazol™ doesn't care the least that Mrs. Significance showed up and pompously gave a verdict.

We want to be more certain and less anxious. And this is one reason why Mrs. Significance is so appealing. But her soothing powers come at the cost of delusion. And this delusive overconfidence can be, and often is, harmful.

It doesn't deliver on its promise of telling discoveries and non-discoveries apart.

- Not (only) because the threshold “is bad” (*if* it was only this, we could tweak it, problem solved);
- not (only) because of the *p*‑values (or Bayes factors, or intervals) that feed it;
- not (only) because a single study is often insufficient to drive conclusions;
- but also because of an unwarranted expectation that it can give final, yes-or-no answers about matters of *existence*.

Ok, more concretely: why is NHST a problem?

**Significance testing**

- hinders the advancement of cumulative science, causes predictable nonreplication of studies, and is a major contributor to publication bias;
- distracts from the much more important *“neglected factors”* of *“prior and related evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding”* (McShane & Gelman 2017);
- harms people as an effect of the disinformation it promotes and the information it suppresses;
- enables publication of junk research (masqueraded as “significant”);
- discourages people from pursuing promising research (upfront deemed “insignificant”).

**This happens because**

- it allows and invites *p*-hacking (maliciously or unconsciously), so in the end anything can be made “significant”;
- its mechanical procedures are easy, and its output is psychologically reassuring;
- the gatekeeping role of “statistical significance” for having one's research accepted is a terrible proposition, maintained for historical and psychological reasons rather than merit;
- its output is a wildly volatile answer to a question you're not really asking.

And since the difference between significant and nonsignificant is not significant... Uh‑oh.
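The *p*-hacking point is easy to demonstrate. When the null is exactly true, a valid *p*‑value is uniformly distributed on [0, 1], so measuring enough outcomes per study practically guarantees that something crosses the 0.05 line. A minimal simulation (the figure of 20 outcomes per study is an illustrative assumption):

```python
import random

random.seed(2)

def one_null_p_value():
    """When the null hypothesis is exactly true, a valid p-value
    is uniformly distributed on [0, 1]."""
    return random.random()

n_studies = 10_000
n_outcomes = 20   # e.g. 20 measured endpoints per study (assumption)

# Count studies where at least one endpoint comes out "significant"
# by chance alone.
hacked = sum(
    any(one_null_p_value() < 0.05 for _ in range(n_outcomes))
    for _ in range(n_studies)
)

rate = hacked / n_studies   # should be close to 1 - 0.95**20
```

With 20 independent null outcomes, the chance of at least one “significant” result is 1 - 0.95**20, roughly 64%, with no real effect anywhere in sight.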

**Nobody knows.**

*“No-one understands what a p-value is, not even research professors or people teaching statistics.” (Haller & Krauss 2002)*

Ok, ok. Let's try. A *p*‑value is *“the probability of obtaining a result equal to or more extreme than what was actually observed, conditional on the null hypothesis plus all other model assumptions being true, and comparable only to samples of the same magnitude”*.
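The definition is easier to hold onto as a simulation. A hypothetical example: we saw 8 heads in 10 coin flips, and we ask how often, *assuming the null of a fair coin*, a result at least that far from the expected 5 occurs:

```python
import random

random.seed(1)

observed_heads = 8     # hypothetical observation: 8 heads in 10 flips
n_flips = 10
n_sims = 100_000

# Null hypothesis: the coin is fair (P(heads) = 0.5).
at_least_as_extreme = 0
for _ in range(n_sims):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    # Two-sided: at least as far from the expected 5 as what we observed.
    if abs(heads - 5) >= abs(observed_heads - 5):
        at_least_as_extreme += 1

p_value = at_least_as_extreme / n_sims   # exact answer is 112/1024 ≈ 0.109
```

Note what the number is: the frequency of data this extreme *given* the null and the model. Nothing in it is a probability of the null itself.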

Ready to misunderstand that in 5 minutes? Great, so we can now pass to the fertile terrain of...

Misconceptions of *p*‑values are widespread. Wikipedia has a dedicated page just for it. Goodman 2008 showed 12 of them. Greenland et al. 2016 pointed out **25 misinterpretations** — of *p*‑values, confidence intervals, and power.

In particular, **p‑values CANNOT tell you**:

- the importance of your result,
- the strength of the evidence,
- the size of an effect,
- the probability that rejecting the null is a wrong decision,
- the probability that the data were produced by random chance alone,
- the probability that the result can be replicated, or
- the probability that your hypothesis is true.

You cannot claim any of the above based on a *p*‑value.

- The 1st item means that “statistically significant” (itself a problematic term) **has nothing to do** with “clinically relevant” or “economically important”. (And yet, 80% of these American economists got it wrong.)
- The 4th item doesn't work either; after all, *p*‑values are **probabilities of data**, not of hypotheses. (And yet, 73% of these German professors of statistics got it wrong.)
- The 6th item refers to an **unconditional** probability of observing the data. It overlooks that *p*‑values are probabilities of data **conditional on the null** (plus model assumptions). This has been dubbed “the replication fallacy”. (And yet, 60% of these British professors got it wrong.)
- The 7th item means you **must not, cannot, will not** declare the probability of your “discovery” being true or false based on the *p*‑value you found. (And yet, 93% of these American physicians got it wrong.)

Because you won't use them correctly. And because even if you did, what they answer is unlikely to be useful or interesting.

Let's start with the “using correctly” part.

Why? Because **p‑values assume you have randomized samples**. And you probably don't.

It's a built-in requirement for calculating a *p*‑value that your data come from a *randomized* sample.
So unless you have legitimate reasons not to, you'd better follow this rule.

(What if you didn't? Then on top of p-values' intrinsic volatility *and* distortions from noise in your measurements, you would *also* have the usually large distortions from nonrandom sampling. This would demand even stronger assumptions for any claims you wish to make about the general population. Would you be able to provide them? Statistical pyrotechnics may be insufficient to rescue hopelessly low-quality data. Worse, the added layer of apparent sophistication may breed overconfidence.)

What do I mean by random? I mean you'd better get very close to no design bias, no dropouts, no non-responses, no measurement error, and no loss of blinding.

So if your samples are not randomized over the population about which you want to draw inferences, you will have a **distorted input** for the calculation of a *p*‑value, which in turn will be **distorted input** for the calculation of the significance test, whose **output will be distorted**.

And this is following *p*‑values and significance testing's own terms. *This is before any of the abundant additional criticism given by this site is taken into consideration.*
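To see the distortion in miniature, here is a hypothetical simulation: the true group difference is exactly zero, but outcome-dependent dropout in the treated group manufactures a sizable gap before any *p*‑value is ever computed. The dropout rule and all numbers are illustrative assumptions.

```python
import random

random.seed(3)

# True state of the world: zero effect. Both groups are drawn from
# the very same distribution (standard normal).
control   = [random.gauss(0.0, 1.0) for _ in range(1000)]
treatment = [random.gauss(0.0, 1.0) for _ in range(1000)]

# Nonrandom attrition: treated participants with low outcomes tend to
# drop out (kept with probability 0.3) -- a common real-world bias.
observed_treatment = [x for x in treatment
                      if x > -0.5 or random.random() < 0.3]

def mean(xs):
    return sum(xs) / len(xs)

# A spurious positive gap appears, created by the sampling bias alone.
spurious_gap = mean(observed_treatment) - mean(control)
```

Any *p*‑value computed on `observed_treatment` versus `control` would be fed this distorted input, exactly as described above: garbage in, “significance” out.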

From Gorard 2017, *Sociological Research Online*:

- “In the last five years I have conducted a large number of systematic and scoping reviews covering perhaps 60,000 research reports. In real-life research, I have **never** seen a significance test used in a correct context (meeting the basic assumptions) and with the results described correctly.”

By the way, Gigerenzer 2004 is mandatory reading for mindful social scientists.

- “If **psychologists** are so smart, why are they so confused? Why is statistics carried out like compulsive hand washing?”

If this still doesn't make sense, see this question here.

Ok, now let's suppose you solved randomization and measurement. Next:

See what a p-value isn't. Are you sure you are not claiming any of those things that you cannot conclude?

Ok. Then:

Because *p*‑values are, by construction, extremely volatile, regardless of sample size. Their value will vary wildly when repeatedly sampling the *same* populations under the *same* methodology. This should convince you:

- Watch the video of their dance.
- Read the abstract and conclusion of this short paper.
- Read this short article.

Mind you, the issue exists even if you're not evaluating “significance”. But if you are, then **significant today, nonsignificant tomorrow**.

Worth repeating: the difference between significant and nonsignificant is not itself significant.
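The dance is easy to reproduce. The sketch below repeats the very same experiment (same population, same method, same sample size) 200 times and collects the *p*‑values from a simple two-sided z-test; the true effect size and sample size are illustrative assumptions.

```python
import math
import random

random.seed(4)

def z_test_p(sample):
    """Two-sided p-value for H0: mean = 0, with known sd = 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))   # = 2 * (1 - Phi(|z|))

# Same population (true mean 0.4, sd 1), same method, same n = 30,
# repeated 200 times.
p_values = []
for _ in range(200):
    sample = [random.gauss(0.4, 1.0) for _ in range(30)]
    p_values.append(z_test_p(sample))
```

Run it and inspect `min(p_values)` and `max(p_values)`: the same experiment yields *p*‑values ranging from well below 0.01 to above 0.5, with nothing about the world having changed. Significant today, nonsignificant tomorrow.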

*p*‑values are, more often than not, calibrated against a hypothesis that assumes zero effect and zero systematic error. Also more often than not, this is uninteresting and unlikely.

Physicists normally do this thing of *“let's assume this cow is a perfectly frictionless sphere”* — unlikely scenario, but at least in *their* case it's interesting, and useful for coming up with workable simplified models. More about cows later.

Now you say: *“I'm going to rebel against this default! I will use non-nil nulls!”* Fine, but you will still have to deal with the detail that...

*p*‑values answer the chance of observations given a hypothesis; whereas what we usually want is the chance of a hypothesis given observations.

Those are **not interchangeable. Not at all the same thing.** Without additional information, you can't jump from one to the other.

Let's pick two groups: all the bread-eaters in the world, and all the psychologists in the world. Clearly, there are psychologists who eat bread. However:

**The percentage of psychologists who eat bread is one thing.**
**The percentage of bread-eaters who are psychologists is a different thing.**

And you can't infer one just from the other.
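In numbers (all counts made up for illustration):

```python
# Hypothetical world counts -- illustrative only.
bread_eaters  = 5_000_000_000   # most humans eat bread
psychologists = 1_000_000
both          = 950_000         # psychologists who also eat bread

# Same intersection, two very different conditional probabilities:
p_bread_given_psy = both / psychologists    # 0.95    -- very high
p_psy_given_bread = both / bread_eaters     # 0.00019 -- tiny
```

The two conditionals share the same intersection yet differ by several orders of magnitude; moving from one to the other requires the base rates, which is exactly what Bayes' theorem supplies and what a *p*‑value alone does not.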

Significance testing in action:

- “From past studies, we know most psychologists eat bread. Students were randomly sampled from the *Mechatronic Department of the Yangon Technological University*. My null assumes that this population consists exclusively of psychologists. If they are, we expect bread-eating to be prevalent. We observed the sample, and most were found to eat bread. That observation is expected, fitting the null almost perfectly. So it's not statistically significant, and we must accept the null hypothesis that they are all psychologists.”

Do you see anything *weird* in the logic above? Good. (What is it?)

If you prefer numbers, here is an example from Gorard 2016:

- “Assuming that a bag of 100 marbles contains 50 red and 50 blue (the null hypothesis) it is easy to calculate the precise probability of drawing out 7 red and 3 blue in a random sample of 10 marbles. This is, in effect, what significance tests do. But this does not tell us whether the bag really does contain 50 red and 50 blue marbles. **How could it?** In practice, we would not know the number of each colour in the bag. In this situation, drawing 7 red and 3 blue in a sample would not tell us the colours of those left. Yet this is what significance test advocates claim we can do. It is a kind of magic-thinking — an erroneous superstitious belief.”

*“It does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!”*

(Cohen 1994)
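Gorard's marble arithmetic is easy to check: the exact probability of drawing 7 red and 3 blue from a bag of 50 red and 50 blue, computed under the null, is a one-liner.

```python
from math import comb

# Hypergeometric probability: draw 10 marbles (without replacement)
# from a bag of 50 red + 50 blue, and get exactly 7 red and 3 blue,
# ASSUMING the 50/50 null is true.
p = comb(50, 7) * comb(50, 3) / comb(100, 10)   # ≈ 0.113
```

Easy to compute *given* the null; but the number, by itself, says nothing about whether the bag really holds 50 of each.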

So, *p*‑values are almost certain to be a source of confusion. And unlikely to be a source of answers you want to questions you care about.

(Well, I don't *really* know what you want, and probably neither do you. I'd still guess: what you want to know is probably not what *p*‑values answer.)

In increasing order of change:

1. **Maintain things** exactly as they are.
2. **Reform significance testing** by lowering the cutoff of *p* < 0.05.
3. **Abandon significance testing** rather than just reforming.
4. **Ban p-values** as well, any use of them.

Shortest analysis:

- **Maintain.** Those unaware of the issue can't help but continue doing the same. And those motivated to advance their careers through production of junk science may *prefer* that things don't change. (Maybe you have been in the first group, until now. Hopefully, you are not in the second.)
- **Reform NHST.** To produce a “true *p*‑value” of 0.05, you need to aim for at most 0.005, or even 0.001 (see Taleb 2018). At first sight, a lower threshold might then seem reasonable. Benjamin et al. 2017 proposed just that, arguing it would much reduce the rate of published false positives. But that doesn't take *p*‑hacking into account. Once it does, the reduction in false positives disappears, and it could make the replication crisis *worse* (Crane 2017). Some further reasons why such reforms could bring net *harm* are an increased overconfidence in published results, exaggeration of effect sizes, and discounting of valid findings (McShane & Gelman 2017).
- **Abandon NHST.** So, the current practice is harmful, and reforming doesn't help; it could even make things worse. That seems like sufficient reason to propose abandonment. Well, there is more. NHST is a form of “uncertainty laundering” poorly suited for the biomedical and social sciences, with their “small and variable effects and noisy measurements”. Plus, this yes-or-no mindset has “no ontological basis” and promotes bad reasoning. Plus, the nil-null is uninteresting to calibrate against. Those issues aren't addressed by tweaking cutoff levels, so we had better get rid of thresholds. Please read the short and clear McShane & Gelman 2017, since attempts to summarize it won't do it justice. Then quickly Trafimow et al. 2017, and then the incisive Gorard 2016. For an amusing account of events, you may like the alpha wars. So what happens to *p*‑values once NHST is over and they lose their gatekeeping role? One could use them as just another tool, without launching into yes-or-no claims of significance. And their use would decline by itself, circumscribed to its (very) limited applications.
- **Ban p-values.** Many, however, see in *p*‑values so few redeeming qualities and so much potential for misuse that, if not an outright ban, they more emphatically suggest the tool be “relegated to the scrap heap” (Lindley 1999).

So “**Abandon NHST**” is what is being ~~proposed~~ urged here. See what you can do.

No ban on *p*‑values, though. Despite all their confusion and limitations, they have their use in *some* fields, in *some* cases. They should, however, be much de-emphasized.

*“Still, shouldn't we keep using significance testing, because...”*

No.

Read for example Schmidt et al. 1997, who for three years collected objections to the abandonment of significance testing — all rejected.

**False objections** you will see addressed there:

- Without significance testing we would not know whether a finding is real or due to chance.
- Hypothesis testing would not be possible without significance tests.
- The problem is not significance tests, but failure to develop a tradition of replicating studies.
- When studies have a large number of relationships, we need significance tests to identify those that are real.
- Confidence intervals are themselves significance tests.
- Significance testing ensures objectivity in the interpretation of research data.
- It is the *misuse*, not the use, of significance testing that is the problem.
- It is futile to try to reform data analysis methods, so why try?

See the FAQ for additional objections and questions.

*“So what if there were no null ritual or NHST? Nothing would be lost, except confusion, anxiety, and a platform for lazy theoretical thinking. Much could be gained, such as knowledge about different statistical tools, training in statistical thinking, and a motivation to deduce precise predictions from one’s hypotheses.*

*Should we ban the null ritual? Certainly — it is a matter of intellectual integrity. Every researcher should have the courage not to surrender to the ritual, and every editor, textbook writer, and adviser should feel obliged to promote statistical thinking and reject mindless rituals.”* (Gigerenzer et al. 2004)

So you have read those pages, seen this letter, are aware of potential personal costs, and decided to do something. What can you do?

Here are suggestions, starting upstream:

- **If you donate to institutions that fund research**: Say you disapprove of significance testing. If possible, make your donation conditional on them only funding research that uses proper methodology.
- **If you fund research**: Make clear that significance testing is not an acceptable tool in research you fund. Make your grants conditional upon researchers using proper methodology.
- **If you are a professor**: Educate students about the huge contrast between significance testing's popularity and its usefulness. Explain that there's no one-size-fits-all magical solution (no, the brave Bayesian gods are not *that* powerful either). Emphasize quality of measurement, carefulness of design, thoughtfulness of analyses, and relevance of the topic.
- **If you are an advisor to researchers**: Don't bully them into using significance testing. Encourage them *not* to. Enforce integrity of research practices.
- **If you are a researcher**: Don't use significance testing in your research. Don't passively allow your co-authors to use it. Refuse to have your name on articles that do. And don't use the word “significant” when you mean “relevant”.
- **If you are an editor-in-chief of a journal**: Disregard significance testing as a metric for acceptance of an article, and say so clearly. Evaluate articles exclusively on merits such as quality of measurement, carefulness of design, thoughtfulness of analyses, and relevance of the topic.
- **If you are a reader of scientific literature**: Ignore significance tests when you see them in publications. Ignore conclusions that are claimed to follow from them.

**Papers**:

- See Gigerenzer's papers, in particular:
  - Gigerenzer & Marewski 2015 for the delusion of worshipping single tools such as *p*‑values and using them as “universal hammers”;
  - Gigerenzer 2004 about mindless statistics in social science (starring Dr. Publish-Perish);
  - Gigerenzer et al. 2004 about the mass confusion behind the null ritual.
- Then:
  - Simmons et al. 2011 show how you can make anything turn “significant” in psychology;
  - Westover et al. 2011 for a didactic paper for a medical audience, with examples that doctors will find familiar;
  - Amrhein & Greenland 2017 (or here) also suggesting abandonment of NHST;
  - Amrhein et al. 2017 for an in-depth analysis of *p*‑values and unreplicability (*“dichotomous threshold thinking must give way to non-automated informed judgment”*);
  - Gelman & Stern 2012 for an ironically appropriate thrice-significant title;
  - Orlitzky 2011 for suggestions about institutional changes;
  - Gorard 2015 for a suggestion of what to do instead of NHST in the social sciences.

**Popular**:

- Neither Gigerenzer nor Seife mince words on Edge (2014) (Seife's criticism is valid even though he commits common misconception #2 from Greenland et al. 2016);
- likewise, Colquhoun condemns NHST on Aeon (2016);
- Colquhoun also gives didactic examples of the perils of *p*‑values, on Chalkdust (2015);
- Amrhein argues inferential statistics is not inferential, on Medium (2018);
- Cumming shows *p*‑values dancing all over the place in this simulated replication video (your study could be *any* of those lines; “significant”, you said?);
- and both this cartoon and this one illustrate the problem.
- Also quotes and open letters.

Follow the links, understand them, read about potential personal risks, make up your mind.