Questions and objections

Here are direct links to read about respectable people, replacements, hammers and toolboxes, CI and Bayes, replication first, the ethics of NHST.

Wouldn't that mean all those respectable people using significance testing for decades have been... misguided?

Yes, it would.

The observation is not surprising, considering that history is paved with misguided respectable people.

But... if not significance testing, then what?!

First problem with this question: an implied assumption that you can only abandon something once you have found a suitable replacement.

What statistical tools to use is an important question. It just doesn't need to be answered at the same time as the question of what tools not to use.

In the same spirit, from Gorard 2016: “The volatile results produced by such testing are so prevalent in epidemiology that Le Fanu (1999) suggested, almost seriously, that all departments of epidemiology be closed down as a service to medicine.”

(Consider spending 2 minutes with 2.4 from Gorard 2017, and maybe 1.5 as well.)

The general point is that if a practice causes net harm, abandon it.

If you're moving in a clearly wrong direction, stop moving. Then look at a map, or ask others, to figure out which of the many possible good places you could go to instead.

Same asymmetry here: We have one clear, widely used wrong statistical tool to abandon (significance testing). But we don't have one clear right tool to replace it in all the cases for which NHST is the wrong tool.

Which brings us to the...

Second problem with this question: an implied assumption that there are simple replacements.

“First, don't look for a magic alternative to NHST, some other objective mechanical ritual to replace it. It doesn't exist.” (Cohen 1994)

“None of the statistical tools should replace significance testing as the new magic method giving clear-cut mechanical answers.” (Trafimow et al. 2017)

And I find Gigerenzer & Marewski 2015 particularly useful against the hope for a universal method: “Surrogates have been created, most notably the quest for significant p values. This form of surrogate science fosters delusions and borderline cheating and has done much harm, creating, for one, a flood of irreproducible results. Proponents of the ‘Bayesian revolution’ should be wary of chasing yet another chimera: an apparently universal inference procedure. A better path would be to promote both an understanding of the various devices in the ‘statistical toolbox’ and informed judgment to select among these.”

So a recommendation is: Critical thinking and informed judgment to select tools according to the situation.

I know. You dislike this because it sounds like an abstract non-actionable platitude. Unfortunately, the concrete actionable non–platitude-sounding suggestions, while psychologically comfortable, are often delusions.

The problem of researchers hammering away with NHST won’t be solved by trading their hammers for empty toolboxes.

If the job is to clean glassware, I'd rather hire the person who comes with an empty toolbox thinking “I'm not sure what I will use” than the one who comes with a hammer, full of initiative and eager to start.

Same for baking bread.

Same for milking cows.

Glass breaks under hammers, flour is rather indifferent, and cows moo in despair. Before any considerations about how to clean glass, bake bread or milk cows, the most valuable action is to keep the hammer guy away from all of them.

Good. Damage contained.

Now, the hammer guy is still in love with his tool. You approach him.
— Listen, there aren't useful things to do with a hammer here.
— What to do instead? I can't do nothing.
— Of course you can. Please do nothing immediately. Hammering is a disaster.
— What tool would I use next, though?
— I don't know. It depends on the job. But first, quit the hammer.
— Seriously, what tool?
— It depends on the job. Cleaning glass requires different tools than baking bread or milking cows.
— But isn't there a universal tool I can use in all those jobs?
— Your brain. Which, if properly functioning, should produce: “Hammer: no good, retire it now. Next: learn about other tools.”
— Ok, so will you teach me about other tools?
— Unlikely. I'm here to contain the harmful use of hammers, to avoid unnecessary damage to glassware, bread, and cows.
— But... but... look, those cows, from the fungiculturists. On Monday I went there with the hammer and got milk and—
— Cows release all sorts of liquids under stress. You thought it was milk. And everybody believed you, which is a problem, because now the foamy nondairy thing you extracted is for sale, and people will become unsuspecting victims of stressed cows' effluents. Please don't do that. They expect good milk, not... Well, now, what about leaving your hammer here and studying cowmilking techniques? It may take a while, and people may think you're odd, what with their attachment to hammers. But then you can actually milk cows.
— I'm actually sick o' cows.
— Ok. Bread then?
— Bread.

Your cow analogy is... I keep thinking of it.

Oh my.

Imagine if people started to associate destructive hammers with significance testing, milk with useful valid outputs, and stressed-out cows' pee with thoughtlessly produced p‑values. Imagine if they visualized the distressed cow's face as the poor mammal detects the approach of the Nasty Hammer of Significance Testing™.

Imagine if people remembered this every time they started pressing buttons in statistical software. Imagine, God forbid, if they spent 3 minutes reading the tiny 1.4 and 1.5 of Gorard 2017. Imagine if that made them pause. What kind of world would that be?

So, please don't think about this analogy all the time. Tell your friends to do the same.

PS: If bovine effluents and scholarly commentary thereon are not topics of your interest, do not read this letter.

Are confidence intervals and Bayes factors a satisfactory replacement?

There is no simple universal replacement. See “Second problem” above.

That said:

For a thoughtful short answer that addresses both, I do recommend:
McShane et al. 2018 (pdf) (number 4, from page 10).

Shouldn't we first solve the crisis of replication, which is more important than the matter of significance testing?

First, there's faulty logic in this general kind of argument. See 2.4 of Gorard 2017.

But in this case it so happens that they are both part of the same phenomenon. See this video from Cumming, which shows how p‑values are expected to behave from one replication to the next under constant methodology and population. Non-replication is in part (although not exclusively) a consequence of the binary volatility of significance testing.

This is also objection 3 (p. 8) from Schmidt et al. 1997. And see Halsey et al. 2015.
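
The study-to-study volatility mentioned above can be sketched with a small simulation. This is only an illustration, not a reproduction of Cumming's video or of any cited paper. The function name `replicate_p_values`, the true effect of 0.5 standard deviations, the 32 participants per group, and the use of a z-test with a known standard deviation of 1 are all assumptions chosen to keep the example short.

```python
import random
from statistics import NormalDist, mean


def replicate_p_values(n_studies, n_per_group=32, true_effect=0.5, seed=0):
    """Simulate identical studies and return one two-sided p-value per study.

    Hypothetical setup: two groups drawn from normal populations with a
    known standard deviation of 1, compared with a z-test.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Standard error of the difference between two group means (sd = 1).
    se = (2 / n_per_group) ** 0.5
    p_values = []
    for _ in range(n_studies):
        # Same population, same method, same sample size every time.
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        z = (mean(treated) - mean(control)) / se
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values


if __name__ == "__main__":
    # Ten "replications" of exactly the same study.
    for p in replicate_p_values(10):
        verdict = "significant" if p < 0.05 else "not significant"
        print(f"p = {p:.3f}  ({verdict})")
```

Under these assumed settings the effect is real and identical in every run, yet the verdict at the 0.05 threshold is expected to flip back and forth between studies, with roughly half of them landing on each side. Changing `n_per_group` or `true_effect` shifts that proportion.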

More accurately, though, let's question the framing. Consider that “If replications do not find the same results, this is not necessarily a crisis, but is part of a natural process by which science evolves. The goal of scientific methodology should be to direct this evolution toward ever more accurate descriptions of the world and how it works, not toward ever more publication of inferences, decisions, or conclusions.” (Amrhein, Trafimow, & Greenland 2018)

You talk as if NHST is useless and borderline immoral. It's not that bad, really.

I may have understated the issue, then. “Harmful and immoral” is a better approximation.

Seems like a good time to read the aptly‑titled “Damaging Real Lives Through Obstinacy: Re‑Emphasising Why Significance Testing is Wrong” (Gorard 2016).

All of it.

But in particular:

... What is it with the funny domain?

Oh. That?