The "Replication Crisis", Null Hypothesis Significance Testing (NHST), and related issues

Overviews

A list of issues and difficulties confronting psychological researchers lately, presented here informally and shrilly, which is consistent with how I present it in class. This was written for my graduate stats class, so some of it may lose you, but in general it's a decent quick overview of the replication crisis.

Wikipedia article on the Replication Crisis, which is a pretty good overview with good references.

A good introduction to the replication crisis from the Veritasium YouTube channel, focusing on Daryl Bem's completely implausible research on precognition and "psi" phenomena that nevertheless was published in a top social psychology journal, and the connected issues that raises.

A syllabus / reading list for a seminar titled Everything Is F***ed, only the asterisks are made more rude at this link. It's a good starting point for where to read about all the headaches science in general and psychology in particular are currently up against. This was a VERY popular and frequently shared blog post from the moment it appeared in August 2016.

What has happened down here is the winds have changed: Andrew Gelman's incisive 2016 blog post recounting his impression of the developing problems. Point of information for the curious: the headings are lines from Randy Newman's song "Louisiana 1927" about the devastating Mississippi River flood that should have been foreseen and prepared for but somehow took everyone by surprise and became a huge disaster.

When the Revolution Came for Amy Cuddy: Very good report from the New York Times on the turmoil that has developed within psychology as a result of reconsidering the field's research methods and standards.



Notable papers

Ioannidis (2005) - Why Most Published Research Findings Are False: early forebodings about the current concerns with replication and the accuracy of published research

Simmons, Nelson, and Simonsohn (2011) - False Positive Psychology: influential demonstration of "researcher degrees of freedom" making it too easy to find p-values less than .05; note that here "degrees of freedom" is used to describe the many pathways available in data analysis, not the df of a significance test
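To make the "researcher degrees of freedom" idea concrete, here is a minimal simulation sketch (Python with numpy and scipy; the group size and number of outcomes are made-up illustrative values) of just one such pathway: measuring several outcome variables and reporting whichever one comes out significant. Even when nothing is going on, the chance of getting at least one p < .05 climbs far above the nominal 5%.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    n_simulations = 10_000   # simulated "studies" in which the null is true
    n_per_group = 20         # participants per group (made-up number)
    n_outcomes = 5           # outcome variables measured -- one researcher degree of freedom

    false_positives = 0
    for _ in range(n_simulations):
        # Both groups come from the same population on every outcome: no real effect
        group_a = rng.normal(size=(n_per_group, n_outcomes))
        group_b = rng.normal(size=(n_per_group, n_outcomes))
        p_values = [stats.ttest_ind(group_a[:, k], group_b[:, k]).pvalue
                    for k in range(n_outcomes)]
        if min(p_values) < .05:      # report whichever outcome "worked"
            false_positives += 1

    print("nominal alpha: .05")
    print(f"actual false-positive rate: {false_positives / n_simulations:.3f}")
    # With 5 independent outcomes this lands near 1 - .95**5, roughly .23

Choosing among dependent variables is only one of the gambits Simmons et al. consider; optional stopping, covariate choices, and dropping conditions compound the problem further.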

Nosek et al. (2015) - Estimating the reproducibility of psychological science: in a study that made a big splash in psychology, the Open Science Collaboration performed replications of 100 experiments from three prominent journals and found fairly low rates of replicability according to various criteria: the mean effect size of the replication effects was half that of the original effects; 97% of original studies had significant results (p < .05) while only 36% of replications did; only 47% of original effect sizes fell within the 95% confidence interval of the replication effect size; only 39% of effects were subjectively rated to have replicated the original result; and combining original and replication results left 68% with statistically significant effects. This opened up lots of arguments about what counts as successful replication (p-values? effect sizes? general interpretation of results?) but was a huge wake-up call to the field.

Munafo et al (2017) - A manifesto for reproducible science: proposals for how to correct the course of science (not just psychology!), many of which have been gaining ground

American Statistical Association statement on p values (2016): some long needed but long unheeded clarifications, warnings, and interpretations, in light of the recognition that so much research using p-values has not been replicable; none of this was new with this statement, but much of it had evidently remained vague to many researchers

BASP Editorial Null Hypothesis Banning (2015): the journal Basic and Applied Social Psychology made the decision to completely ban the use of Null Hypothesis Significance Testing in published articles, forcing researchers to investigate other methods of data analysis that might yield better science



Early warnings

Gigerenzer (1993): examination of the NHST controversy by contrasting the incompatible original views of Fisher and Neyman & Pearson with the unsatisfying hybrid of their views that became the dominant method of data analysis; the metaphor of the statistical superego, ego, and id works quite well

Cohen (1994) - famous criticism of Null Hypothesis Significance Testing from before the current version of the crisis

Howell Ch.4: accurate treatment of the logic and controversies of hypothesis testing, possibly more accessible than Cohen's (1994) paper

Wilkinson and APA Task Force (1999): the American Psychological Association's initial response to the growing NHST controversy with then-new recommendations for treatment of data

Book Review of The Cult Of Statistical Significance from the journal Science, June 2008. This one-page article focuses on one consequence of psychology's misplaced emphasis on null hypothesis significance testing, which is the neglect of effect sizes and their measurement.
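For a quick sense of what gets lost when effect size is ignored: with a large enough sample, a trivially small difference produces an arbitrarily impressive p-value. A minimal sketch (Python with numpy and scipy; the sample size and the 0.03-SD "effect" are made-up numbers):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n = 100_000                                        # very large groups (made-up number)
    group_a = rng.normal(loc=0.00, scale=1, size=n)
    group_b = rng.normal(loc=0.03, scale=1, size=n)    # true difference: 0.03 SD, i.e., trivial

    t, p = stats.ttest_ind(group_a, group_b)
    pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
    cohens_d = (group_b.mean() - group_a.mean()) / pooled_sd

    print(f"p = {p:.1e}")                  # "highly significant" -- far below .05
    print(f"Cohen's d = {cohens_d:.3f}")   # about 0.03, negligible by any convention

The p-value only says the difference is probably not exactly zero; the effect size is what tells you whether it is big enough to matter.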

Cohen (1990): general advice about treatment of data

Cowles & Davis (1982): historical roots of the "p<.05" significance level

Note that psychology's problems with data analysis and NHST in particular have been clearly identified and presented in major journals going back to at least the 1960s (e.g., Greenwald (1975), Lykken (1968), Meehl (1967), Bakan (1966), etc.).



Other directions

Howell Ch. 5 Excerpt on Bayes's Theorem: provides a brief description of Bayes's Theorem, the foundation of Bayesian inference, an alternative to NHST for data analysis
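For reference, the theorem itself is just P(H|D) = P(D|H) P(H) / P(D): it gives the probability of a hypothesis given the data, which is the quantity a p-value is so often misread as. A tiny sketch with made-up prior and likelihoods (all three input numbers are hypothetical):

    # Bayes's Theorem with illustrative, made-up numbers
    prior_h = 0.10              # P(H): prior probability the hypothesis is true
    p_data_given_h = 0.80       # P(D | H): probability of data like these if H is true
    p_data_given_not_h = 0.05   # P(D | not-H): probability of such data otherwise

    p_data = p_data_given_h * prior_h + p_data_given_not_h * (1 - prior_h)   # P(D)
    posterior_h = p_data_given_h * prior_h / p_data                          # P(H | D)

    print(f"P(H | D) = {posterior_h:.2f}")   # 0.64 with these numbers

Note how the answer depends on the prior and on how probable the data would be under the alternative -- two ingredients NHST never asks for.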

Dienes (2011): makes the case that Bayesian inference is what most people really believe is appropriate and want to use when analyzing data [original here, requires logging in with UConn NetID and password: Dienes (2011)]



Quotes

Ben Goldacre, in Bad Pharma (2012), sums it up this way: the gambits give researchers an abundance of chances to find something when the tools assume you have had just one chance. Scouring different subgroups and otherwise "trying and trying again" are classic ways to blow up the actual probability of obtaining an impressive, but spurious, finding -- and that remains so even if you ditch P-values and never compute them.
- Deborah G. Mayo, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, Cambridge: Cambridge University Press, 2018, Tour I ("Beyond Probabilism and Performance"), p. 20

There is a growing realization that statistically significant claims in scientific publications are routinely mistaken. A dataset can be analyzed in so many different ways (with the choices being not just what statistical test to perform but also decisions on what data to [in]clude or exclude, what measures to study, what interactions to consider, etc.), that very little information is provided by the statement that a study came up with a p < .05 result. The short version is that it's easy to find a p < .05 comparison even if nothing is going on, if you look hard enough -- and good scientists are skilled at looking hard enough and subsequently coming up with good stories (plausible even to themselves, as well as to their colleagues and peer reviewers) to back up any statistically-significant comparisons they happen to come up with.
This problem is sometimes called "p-hacking" or "researcher degrees of freedom" (Simmons, Nelson, and Simonsohn, 2011). In a recent article, we spoke of "fishing expeditions, with a willingness to look hard for patterns and report any comparisons that happen to be statistically significant" (Gelman, 2013a).
But we are starting to feel that the term "fishing" was unfortunate, in that it invokes an image of a researcher trying out comparison after comparison, throwing the line into the lake repeatedly until a fish is snagged. We have no reason to think that researchers regularly do that. We think the real story is that researchers can perform a reasonable analysis given their assumptions and their data, but had the data turned out differently, they could have done other analyses that were just as reasonable in those circumstances.
We regret the spread of the terms "fishing" and "p-hacking" (and even "researcher degrees of freedom") for two reasons: first, because when such terms are used to describe a study, there is the misleading implication that researchers were consciously trying out many different analyses on a single data set; and, second, because it can lead researchers who know they did not try out many different analyses to mistakenly think they are not so strongly subject to problems of researcher degrees of freedom. Just to be clear: we are not saying that Simmons et al. (2011), Vul et al. (2009), Francis (2013), and others who have written about researcher degrees of freedom were themselves saying that p-hacking implies that researchers are cheating or are even aware of performing data-dependent analyses. But the perception remains, hence the need for this paper.
Our key point here is that it is possible to have multiple potential comparisons, in the sense of a data analysis whose details are highly contingent on data, without the researcher performing any conscious procedure of fishing or examining multiple p-values.
- Andrew Gelman and Eric Loken (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time.

William L. Thompson, statistical and environmental biologist with the National Park Service in Alaska and the author of numerous books on ecology, who compiled a bibliography of "326 Articles/Books Questioning the Indiscriminate Use of Statistical Hypothesis Tests in Observational Studies", wrote:
"Unfortunately, this approach was (and continues to be) pounded into us at both the introductory and advanced level of statistics in universities throughout the world. The general lack of awareness of problems with statistical hypothesis testing is especially acute in my own field (ecology/environmental science/fish and wildlife biology). This number of articles [namely, the 326] pales in comparison to the vast array of articles devoted to [using and teaching the technique]... in the social, medical, and statistical sciences. Indeed, until very recently, I was one of the "unaware" who blindly applied statistical hypothesis tests to observational data without considering the validity of such an approach."


Uttal (2003) Psychomythics pp. 126-127, citing Nickerson (2000) citing Gregson (1997) citing various:

... much of psychological investigation is bogged down in more or less mindless applications of techniques that are eminently suited for discovering what type of fertilizer to employ. (Townsend, 1994, p. 321)

... at the very least we should stop the practice of using p values to sanctify data merely to appease the Simplicios and put some effort into using data to sift through theories. (Gonzalez, 1994, p. 328)

The principles of significance testing and estimation are simply wrong, and clearly beyond repair, they are the phlogiston and alchemy of twentieth century statistics; and statisticians in the next century will look back on them in sheer wonderment. (Howson & Urbach, 1994, p. 51)

To which Gregson (1997) himself added, "Why psychologists have seemingly been untouched by these criticisms [of null hypothesis testing] is a question about the sociology of science and scientists, not a question about statistics at all! Professor G. A. Barnard (in England) recently commented [at a discussion at the Royal Statistical Society] that significance testing was something that survived in statistical backwaters, like the Journals of the American Psychological Association..." (p. 63)


Some other descriptions of NHST over the years:

"a kind of essential mindlessness in the conduct of research" (Bakan, 1966).

"a method that has involved more fantasy than fact... representing a corrupt form of the scientific method" (Carver, 1978).

"the most bone headed, misguided procedure ever institutionalized in the rote training of science students" (Rozeboom, 1977).

"... NHST has not only failed to support the advance of psychology but also has seriously impeded it." (Cohen, 1994).


Still Not Significant: a sadly hilarious blog post listing many, many delusional euphemisms for insisting that non-significant results are still sorta significant