Open letters


Letter to you, researcher

“To stop the ritual, we also need more guts and nerves. We need some pounds of courage to cease playing along in this embarrassing game. This may cause friction with editors and colleagues, but it will in the end help them to enter the dawn of statistical thinking.” (Gigerenzer 2004)

So, you've covered the first page, the quotes, the FAQ. You saw some suggestions of what you can do. What now? What stands in your way?

It may be that you face a dilemma. What you now believe is right may seem at odds with what your peers are doing. Or what your advisor is telling you to do. Or what the journal seems to demand. And you don't want to suggest they're being unreasonable. Understandable. Read this overview of what you might run into if you are inclined to abandon NHST.

Here's a relevant story from Gigerenzer 2004:

“Who is to blame for the null ritual? Always someone else. A smart graduate student told me that he did not want problems with his thesis advisor. When he finally got his Ph.D. and a post-doc, his concern was to get a real job. Soon he was an assistant professor at a respected university, but he still felt he could not afford statistical thinking because he needed to publish quickly to get tenure. The editors required the ritual, he apologized, but after tenure, everything would be different and he would be a free man. Years later, he found himself tenured, but still in the same environment. And he had been asked to teach a statistics course, featuring the null ritual. He did.”

So, ponder this: are you this person?

I wish you success in your future research, and I hope your advisor and peers are ethical, capable people with whom these issues can be discussed openly.

Letters to authors

“Neural correlates”

To: the authors of “Neural correlates of maintaining one's political beliefs in the face of counterevidence”, published in Scientific Reports (Nature) in 2016.

Dear authors,

I appreciate your paper. The discussion of religion and politics, like the discussion of which sports team is the best, is often normatively proscribed. This is because everybody remains sure that (i) they are on the right side and (ii) the other side is populated by idiots. This is usually not the best recipe for progress and better understanding.

Statistics is subject to similar issues. Salsburg 1985 wrote about “The religion of statistics as practiced in medical journals”, for example: “Invocation of the religious dogmas of Statistics will result in publication in prestigious journals. This form of Salvation yields fruit in this world (increases in salary, prestige, invitations to speak at meetings) and beyond this life (continual references in the citation indexes)”.

One persistent dogma of this kind is the notion of statistical significance. The associated ritual is Null Hypothesis Significance Testing (NHST).

I noticed it in your paper. It appears in sentences such as “failed to reject the null hypothesis that the signal change data...” or “Signal in the posterior insula (r = −0.293, p = 0.066) and ventral anterior insula (r = −0.255, p = 0.112) did not significantly correlate with belief change.”

I therefore invite you to read the rest of this site. I also invite you to read McShane et al. 2018 in full. It is clear, and it makes the specific observation that “in neuroimaging, the voxelwise NHST approach misses the point in that there are typically no true zeros and changes are generally happening at all brain locations at all times. Graphing images of estimates and uncertainties makes sense to us, but we see no advantage in using a threshold.”
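To illustrate what reporting estimates and uncertainties (instead of a threshold verdict) could look like for the two insula correlations quoted above, here is a minimal Python sketch. It uses the Fisher z transformation with a normal approximation, which is not necessarily the test the authors ran. The sample size N = 40 is an assumption: it is consistent with the reported r and p values, but it should be checked against the paper.

```python
import math
from statistics import NormalDist

def correlation_summary(r: float, n: int, level: float = 0.95):
    """Return (lower, upper, p) for a Pearson correlation r from n pairs.

    Fisher z transformation: atanh(r) is treated as approximately normal
    with standard error 1 / sqrt(n - 3). The p-value is the two-sided
    normal-approximation test of zero correlation.
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    lower = math.tanh(z - crit * se)
    upper = math.tanh(z + crit * se)
    p = 2 * (1 - NormalDist().cdf(abs(z) / se))
    return lower, upper, p

N = 40  # ASSUMPTION: consistent with the reported r and p; check against the paper

for region, r in [("posterior insula", -0.293), ("ventral anterior insula", -0.255)]:
    lower, upper, p = correlation_summary(r, N)
    print(f"{region}: r = {r:.3f}, 95% CI [{lower:.2f}, {upper:.2f}], p = {p:.3f}")
```

Under these assumptions the approximate p-values come out close to the reported ones, and both intervals run from a moderately negative correlation (about −0.5) to just above zero, overlapping almost completely. That says more about the data than “did not significantly correlate”.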

Having read it, how do you now evaluate your beliefs about significance testing?

I leave with my best regards, and wish you success in your future research.


(Will the authors of this study maintain their statistical beliefs in the face of counterevidence? Or will they promptly recognize the quasi-religious nature of significance testing and exorcise it from future papers?

Once their default mode network is primed, my sincere wish is that BOLD signaling from their insula and amygdala be low, and that their dorsomedial prefrontal cortices remain mostly unresponsive.

Regrettably, lack of funding rules out fMRI and control groups for this humble experiment. We will have to see whether changes in beliefs manifest themselves through other markers.)

“Pseudo-profound bullshit”

To: the authors of “On the reception and detection of pseudo-profound bullshit”, published in Judgment and Decision Making in 2015.

Ah, the beauty of fighting bullshit. An amusing paper, which I recommend to readers.

Hence the distress one feels upon seeing it maculated by asterisk-laden tables, a sad reminder of how profoundly “significant” its calculated p‑values were found to be.

Your paper talks about ontological confusion — an issue even the best of us may fall prey to. Null hypothesis significance testing is an instance of ontological confusion. Or to use the terminology proposed by Greenland 2017, it's “statistical reification”.

Would you then consider updating your beliefs and abandoning significance testing in future papers?

Why you, out of all those who populate their articles with this misguided piece of statistics? Precisely because you seem to be gifted with a heightened preparedness to detect bullshit.

I leave with my best regards, and wish you success in your future research.

PS: In case it's not clear, I mean well. Read the postscript.

“Milk, soft drinks, and insulin sensitivity”

To: the authors of “Effect of high milk and sugar-sweetened and non-caloric soft drink intake on insulin sensitivity after 6 months in overweight and obese adults: a randomized controlled trial”, published in the European Journal of Clinical Nutrition in 2018.

Dear authors,

I'm writing for three reasons.

First, I want to express my appreciation for your research on this topic. The importance of “attacking” the rise of type 2 diabetes (T2D) cannot be overstated.

Second, your paper uses null hypothesis significance testing. I would like to invite you to read this site, which makes the point that this practice should be abandoned. Start with the quotes.

Third, I would like to call your attention to the use of the word “significant(ly)” throughout this paper. I do sincerely hope this site persuades you; that would mean fewer researchers using NHST. Either way, a serious source of ambiguity can be avoided.

As I wrote here, “statistically significant” (itself a problematic term) has nothing to do with “clinically relevant” or “economically important”. And it's not a direct measure of effect sizes.

Let's have a look at your paper. Here are all sentences with the word “significant(ly)”:

  1. A two-tailed P-value < 0.05 was considered significant.
  2. ...if non-significant, another approximate F-test was used to evaluate if there was a time-independent treatment effect.
  3. None of the other parameters differed significantly between the groups.
  4. One study reported a significant, unbeneficial increase in fasting insulin with milk compared to a non-dairy diet after 2 weeks.
  5. No significant differences were observed after 6 months.
  6. Analysis of the OGTT measured before and after the intervention showed no significant time× treatment effect or treatment effect during the 2 h test for glucose (P = 0.601 and 0.835, respectively) or insulin (P = 0.349 and 0.552, respectively)
  7. There was no significant difference in the time course of glucose (a) and insulin (b) concentration after the four different beverage groups according to the repeated measures analysis of variance.
  8. The analysis showed significant differences between the groups for total cholesterol and triacylglycerol (P = 0.01 and 0.02, respectively).
  9. For triacylglycerol, the pairwise comparisons showed a significantly higher concentration with SSSD compared to NCSD and WATER (P = 0.045 and 0.045, respectively).
  10. However, the intake of fat (percentage of fat) was significantly higher with WATER compared to MILK and SSSD (P < 0.05 and P < 0.01, respectively) but not compared to NCSD.
  11. Calcium intake was significantly higher with MILK compared to all other groups before intervention and during intervention (P = 0.01 and P < 0.001, respectively).
  12. A student's t test within the MILK beverage group showed a significantly higher calcium intake after 6 months compared to before (t-test: P < 0.001).
  13. As expected SSSD significantly increased plasma concentration of total cholesterol and triacylglycerol as compared to NCSD and plasma triacylglycerol concentration as compared to WATER.

About number 1, consider that “The obligatory statement in most research articles, ‘p‑values below 0.05 were considered statistically significant’ is an empty exercise in semantics.” (Goodman 1993), because “Not only is p < 0.05 very weak evidence for rejecting the null hypothesis, but statements like this perpetuate the division of results into ‘significant’ and ‘non-significant’.” (Colquhoun 2017)

Numbers 3–13 share a semantic pattern, and this is the main point I want to make. All those expressions are suggestive of effect sizes. They talk about “significant differences”, “significant increases”, “significantly higher concentration”, and even (explicitly) “significant effect”. Those evoke “big”, “substantial”, “much”. At the same time, the context suggests that you mean “statistical significance”, as indicated by the p‑values mentioned. The problem is that p‑values are not effect sizes (although, for a fixed sample size, they do correlate with effect sizes). So those sentences are substantially misleading.

Consider number 9, about there being a “significantly higher concentration” of triacylglycerol (TG) with SSSD. The image that comes to mind is of patients' TGs skyrocketing because they had too many regular Cokes. But then we see p = 0.045 and suspect: “oh, they mean that the calculated difference passed the significance test, that's all”. This ambiguity can be quite confusing even to highly educated readers.

The calculated p is the probability of observing a difference at least as extreme as the one observed, if the null hypothesis and all other model assumptions are true. But it doesn't tell us how big that difference is. You could have the very same p for different effect sizes, and different p-values for the same effect size. See the graph on page 3 of Goodman 2008.
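The claim that the same p can accompany very different effect sizes (and vice versa) can be checked with a few lines of arithmetic. This minimal Python sketch uses a two-sample z-test in which both groups share a known standard deviation, a simplification rather than the analysis used in the paper; the SD of 0.5 mmol/L and all sample sizes are hypothetical.

```python
from statistics import NormalDist

def two_sided_p(diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value for a difference in group means, using a two-sample
    z-test in which both groups share a known standard deviation `sd`."""
    se = sd * (2 / n_per_group) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(diff) / se))

SD = 0.5  # HYPOTHETICAL common standard deviation, in mmol/L

# Same p, different effect sizes: a 10x smaller difference with 100x more people.
p_small_effect = two_sided_p(diff=0.02, sd=SD, n_per_group=10_000)
p_large_effect = two_sided_p(diff=0.20, sd=SD, n_per_group=100)

# Same effect size, different p: only the sample size changes.
p_same_effect_few = two_sided_p(diff=0.20, sd=SD, n_per_group=25)

print(f"0.02 mmol/L, n = 10000 per group: p = {p_small_effect:.4f}")
print(f"0.20 mmol/L, n = 100 per group:   p = {p_large_effect:.4f}")
print(f"0.20 mmol/L, n = 25 per group:    p = {p_same_effect_few:.4f}")
```

Because p depends only on the difference divided by its standard error, and the standard error shrinks with the square root of the sample size, a tenfold smaller difference measured on a hundredfold larger sample gives the same p.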

The yes/no verdict on the “statistical significance” of each variable is not to be taken seriously. It is also silent about how large the increase in TG concentration was.

Consider that you can have a “(statistically) significant” difference (at, say, p = 0.02) of a (clinically) insignificant magnitude of 0.01 mmol/L. This would be sad, given the significance (i.e. importance) of TGs as a marker for T2D and the (economic) significance of reducing the burden that T2D imposes on public funds.
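How large would a study have to be for a 0.01 mmol/L difference to come out at p = 0.02? The minimal Python sketch below solves the p-value formula of a two-sample z-test (with a known, shared SD) for the per-group sample size. The SD of 0.5 mmol/L is a hypothetical value chosen for illustration, not a figure from the paper.

```python
import math
from statistics import NormalDist

def n_for_p(diff: float, sd: float, p: float) -> int:
    """Per-group sample size at which an observed difference `diff` yields a
    two-sided p-value of (at most) `p` in a two-sample z-test with known `sd`."""
    z = NormalDist().inv_cdf(1 - p / 2)
    # p = 2 * (1 - Phi(diff / se)) with se = sd * sqrt(2 / n); solve for n
    return math.ceil(2 * (z * sd / diff) ** 2)

SD = 0.5  # HYPOTHETICAL between-person SD of triacylglycerol, in mmol/L
n = n_for_p(diff=0.01, sd=SD, p=0.02)
print(f"A 0.01 mmol/L difference reaches p = 0.02 with about {n} people per group")
```

Under these assumptions, roughly 27,000 people per group would do it. The difference stays clinically trivial; only the p-value has moved.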

The prevalence of these conflations, however, is high. It is best to avoid the word whenever a synonym does the job, which is almost always the case.

One more thing: beware of the construction “there was a tendency for total cholesterol to be lower after NCSD compared to MILK and WATER (P = 0.089 and 0.074, respectively)”. The phrase is meaningless, and being in this list is not fun. See also here.

In closing, consider abandoning significance testing completely. Here's what you can do. If you report p‑values at all, don't classify them as “significant or not”. Put emphasis on effect sizes. If you want to qualify them with adjectives, talk about “large effects”, “small differences”, “negligible increases in concentrations”.
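As a minimal sketch of reporting that puts the effect size first, the Python function below turns group summaries into an estimated difference, a confidence interval, and an exact p-value, and deliberately attaches no “significant” label. The normal approximation and the unpooled standard error are simplifying assumptions, and the summary statistics are made up for illustration.

```python
from statistics import NormalDist

def summarize_difference(mean_a, sd_a, n_a, mean_b, sd_b, n_b, level=0.95):
    """Return (difference, lower, upper, p) for mean_a - mean_b.

    Uses an unpooled standard error and a normal approximation, so it is
    only a rough guide for small samples (a t-based interval would be wider).
    """
    diff = mean_a - mean_b
    se = (sd_a**2 / n_a + sd_b**2 / n_b) ** 0.5
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    p = 2 * (1 - NormalDist().cdf(abs(diff) / se))
    return diff, diff - crit * se, diff + crit * se, p

# HYPOTHETICAL summary statistics (mmol/L), not taken from the paper
diff, lower, upper, p = summarize_difference(1.60, 0.60, 15, 1.25, 0.55, 15)
print(f"Difference: {diff:+.2f} mmol/L (95% CI {lower:+.2f} to {upper:+.2f}), p = {p:.2f}")
```

With these made-up numbers the reader sees that the data are compatible with anything from a slight decrease to a fairly large increase, and can judge whether that range matters clinically.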

And thank you for working on this problem.

I leave with my best regards, and wish you success in your future research.

Postscript to authors

To: the authors of the papers above.

Your articles are published and you move on. Time passes. Then a hand rises from some funny-looking URL and a voice goes “actually, if you notice here...”. You wonder why you should care. Fair.

So some clarifications about intentions seem appropriate.

And it's in this spirit that those letters were written.

Again, my sincere best regards.