Here are the issues I mentioned in class that are currently making life difficult for researchers in Psychology, not to mention Medicine and Biology and I guess Science in general. I'm making it a link instead of just emailing it because I have good intentions about elaborating these points, with references included, in the future (future weeks? semesters? lifetimes?). This is nowhere near a complete discussion of even the points mentioned, and is mostly over the top rhetorically but hey, I'm not submitting it for publication. Just some things to think about.

 

1) replication / reproducibility failure in psychology, medicine, biology, etc. See Nosek et aaaaaaaallllllll, "Estimating the reproducibility of psychological science" (Open Science Collaboration, Science 349, 2015) for an examination of Psychology's "replication crisis", though there have been some decent "it's not a crisis" responses as well. Also see John Ioannidis's famous paper "Why most published research findings are false," PLoS Med. 2, e124 (2005).

 

2) NHST (Null Hypothesis Significance Testing), a.k.a. the use of p-values as a means of data analysis -- and oh boy will you get familiar with the use and interpretation of p-values in this course. Consider its general uselessness for providing evidence for or against hypotheses (Google it to see the decades of arguments against it, not to mention the 2016 statement on p-values from the American Statistical Association). Consider the incredibly frequent misinterpretation of p-values by most people using them (as surveys have shown). Consider the credulousness it gives rise to: Daryl Bem recently used p-values to argue that "psi phenomena" such as mind-reading exist -- see J. Pers. Soc. Psychol. 100, 407–425 (2011). Consider the lack of power in most studies (as described by Cohen from the 1960s to the 1990s, among others), where "power" is the probability that a study will correctly reject a false null hypothesis, or roughly, will find an effect when it's actually there to be found -- in other words, even if psychologists believe in p-value significance as the criterion for discovery, they barely give themselves a fighting chance to achieve it due to too-small sample sizes, too-small effect sizes, and too much variability in their measures.
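
To make the power point concrete, here's a minimal simulation sketch in Python -- the d = 0.4 effect and the 20-per-group sample size are hypothetical numbers picked for illustration, not anybody's actual study:

```python
# A minimal simulation of statistical power: the proportion of studies that
# reach p < .05 when a real effect is there to be found. The effect size
# (d = 0.4) and sample size (n = 20 per group) are hypothetical choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d, n, n_sims = 0.4, 20, 10_000

hits = 0
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n)        # null group
    treatment = rng.normal(d, 1.0, n)        # group with a true effect of size d
    _, p = stats.ttest_ind(treatment, control)
    hits += p < .05

print(f"Estimated power: {hits / n_sims:.2f}")  # roughly 0.23 -- a long shot
```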

 

3) p-hacking (largely deliberate, though well-intentioned) -- the manipulation of data in search of significant results (p-values less than .05), including things like making decisions about outliers and data transformations and choice of tests, dividing samples into subsets for further analysis, considering lots and lots of results to capitalize on the inevitable spurious p-value that's less than .05, and so forth. When the only thing worth finding is a p<.05, you look HARD. It's common because most researchers make an effort to uncover what's really in their data, and p-values have seemed to be the sign of legitimate findings, there waiting to be discovered like a sculpture still concealed in a block of marble.
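
Here's a small sketch of why looking HARD works so well even when there's nothing to find -- all the numbers (20 outcome measures, 30 per group) are invented, but the logic is the standard multiple-comparisons one: test enough things and a spurious p<.05 is nearly guaranteed.

```python
# A small illustration of why hunting across many tests finds "significance"
# even in pure noise: with 20 independent outcome measures and no real effects,
# most datasets still yield at least one p < .05. All numbers are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_per_group, n_outcomes, n_sims = 30, 20, 5_000

at_least_one = 0
for _ in range(n_sims):
    a = rng.normal(size=(n_outcomes, n_per_group))  # group A, 20 null outcomes
    b = rng.normal(size=(n_outcomes, n_per_group))  # group B, same population
    _, p = stats.ttest_ind(a, b, axis=1)
    at_least_one += (p < .05).any()

print(f"Datasets with at least one spurious p < .05: {at_least_one / n_sims:.2f}")
# theory says about 1 - 0.95**20, i.e. roughly 0.64
```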

 

4) forking paths in data analysis, to use statistician Andrew Gelman's term, which I think is mostly non-deliberate on researchers' part since they don't even have to be aware of it. It just seems like reasonable, good-faith investigation of the data, but it may bias you to capitalize on things that just happened to be true of your sample. Oh look, this unexpected finding happened. We should pursue that with further analyses, doing things we wouldn't otherwise have done since now there's the prospect of something important just dangling there. And here's a result that didn't turn out significant -- surely there's no need to follow that up. But here's one that IS significant -- by all means let's see what the post-hoc tests show. And so forth. Gelman does a better job of describing this. His blog is so in-your-face it's almost exhausting.

 

5) journals' publication biases and the file drawer problem: You didn't get p<.05, your results are not significant, you've failed to find evidence in favor of a hypothesis. Is this not knowledge you've gained, then? But try getting that negative knowledge published. Editors and reviewers, and readers for that matter, are biased in favor of hypotheses that evidence DOES support. It's sometimes even an explicit editorial policy that non-significant results aren't worth publishing. Those non-significant results go into the researcher's file drawer and silently represent the lack of support for a hypothesis, while if any finding DOES support that same hypothesis (maybe just due to sampling error!), it gets published and publicized and becomes part of the literature. Not surprising then that results may fail to replicate. This is part of why Ioannidis said most published findings are probably false.

 

6) researchers' incentives for publication, grants, tenure, careers, etc.: Academic success is based on publishing and getting grants. You must publish if you want tenure, if you want a job, if you want to feed your kids. If publication requires significant p-values, you will find them. If there are better data analyses to use but everyone is still speaking the language of p-values, your forays into Bayesian data analysis will remain a mere hobby. Fortunately most people today know the language of p-values is inadequate, but new norms have yet to emerge beyond including Effect Size estimates in grant proposals and that sort of thing. The general unrelenting need to publish in order to maintain and advance a career places an emphasis on volume of output, since volume is easy to quantify by counting published papers, as opposed to quality of output, which is not so easy to quantify -- though when that attempt is made, e.g. by assigning an Impact Factor to measure the importance of various journals, it can end up providing still more perverse incentives.

 

7) Fraud also happens. Recently notorious was the Stapel case, but it's not exactly an isolated incident (see also the Ruggiero and Hauser cases). It's probably not a big problem, but there should be some kind of checks and inspections that could protect against it. Because, for instance, I just said it's not a big problem, but how the heck do I know that??? Making replication of studies a respectable and publishable activity would help.

 

 

 

 

 

Here are some potential remedies that have been proposed.

In terms of data analyses to replace NHST and p-values:

1) most journals now require reporting of Effect Size measures and encourage interpretation in those terms -- i.e., focusing on how much of an effect a manipulation had, not just deciding whether it had an effect or not, which is all NHST offers. These ES measures typically include things like a) correlational expressions like R-squared, describing how much variation in the DV was accounted for or explained by the IV, or b) a standardized difference between two means, such as Cohen's d, which takes account of both the values of the means and their variability so we can tell whether the difference counts as large or not.
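
For concreteness, here's a minimal sketch of both kinds of ES measure computed on made-up data (the group means, SDs, and the pooled-SD flavor of Cohen's d are just illustrative choices):

```python
# A minimal sketch of the two effect-size ideas mentioned above, using made-up
# numbers: Cohen's d (standardized mean difference) and R-squared from a
# simple correlation.
import numpy as np

rng = np.random.default_rng(2)
control = rng.normal(50, 10, 40)       # hypothetical DV scores, control group
treatment = rng.normal(55, 10, 40)     # hypothetical DV scores, treated group

# Cohen's d: difference between means scaled by the pooled standard deviation
pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
d = (treatment.mean() - control.mean()) / pooled_sd
print(f"Cohen's d = {d:.2f}")          # ~0.5 is conventionally called "medium"

# R-squared: proportion of variance in the DV accounted for by an IV
iv = rng.normal(size=100)
dv = 0.4 * iv + rng.normal(size=100)   # DV partly driven by the IV
r = np.corrcoef(iv, dv)[0, 1]
print(f"R-squared = {r**2:.2f}")       # share of DV variance "explained"
```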

2) emphasis on Confidence Intervals, which provide an estimate of the value of a population parameter such as a mean, based on the sample value obtained, with a window around it constructed so that, across repeated studies, a specified percentage of such intervals (usually 95%) would capture the population value. Unfortunately CIs are built from the same math as p-values (a 95% CI contains exactly those parameter values a .05-level test would fail to reject), and they're usually used as a fancy substitute with all the same interpretive problems, except that their interpretation is EVEN MORE obscure and similarly not very useful. There are ways to interpret CIs without just making them equivalent to p-values though, and those approaches are respectable. Says me, the arbiter of respectability.
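
Here's a minimal sketch of the conventional 95% CI for a mean, on made-up data, mostly to show how little machinery is involved:

```python
# A minimal sketch of a 95% confidence interval for a mean, with made-up data.
# The interval is built so that, across repeated studies, 95% of such intervals
# would capture the true population mean -- which is not the same as saying
# there's a 95% chance this particular interval contains it.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(100, 15, 25)       # hypothetical scores, n = 25

mean = sample.mean()
sem = stats.sem(sample)                # standard error of the mean
ci = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)
print(f"Mean = {mean:.1f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
```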

3) Bayesian Data Analysis, based on Bayes's Theorem, which provides a general scheme for updating your level of acceptance of a hypothesis based on prior knowledge and newly collected data. Which seems like exactly what science should be all about (see Zoltan Dienes, "Bayesian Versus Orthodox Statistics: Which Side Are You On?", Perspectives on Psychological Science 6(3), 274–290 (2011)). But aside from a core group of psychologists using Bayes's Theorem in their theories of cognition, and those actively promoting it as an analytic tool, it remains largely unknown, untaught, and unused in psychology. Wave of the future, I think. Which is what Bayesians in psychology have been saying for a little over half a century.
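
As a tiny taste of what "updating based on prior knowledge and new data" looks like, here's a sketch of the simplest textbook case, a beta-binomial update -- the flat prior and the 12-successes-in-20-trials data are made up:

```python
# A minimal sketch of Bayesian updating with a beta-binomial example: start
# with a prior over some success rate, observe data, and get a posterior.
from scipy import stats

# Prior: Beta(1, 1), i.e. no strong opinion about the success rate
prior_a, prior_b = 1, 1

# New data: 12 successes in 20 trials (hypothetical)
successes, trials = 12, 20

# Bayes's Theorem, in conjugate form: posterior is Beta(a + successes, b + failures)
post_a = prior_a + successes
post_b = prior_b + (trials - successes)
posterior = stats.beta(post_a, post_b)

print(f"Posterior mean success rate: {posterior.mean():.2f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
# Unlike a p-value, this is a direct statement about the hypothesis given the data.
```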

4) Meta-Analysis is a set of techniques for combing through the published literature to identify a bunch of studies all getting at the same hypothesis, and then combining their data and results to get an overall evaluation of the hypothesis, as if they all made up one big study. This is going to become more and more useful, and common.
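
The arithmetic at the heart of the simplest (fixed-effect) version is just inverse-variance weighting; here's a sketch using invented study results:

```python
# A minimal sketch of a fixed-effect meta-analysis: combine several studies'
# effect estimates by weighting each by the inverse of its variance. The
# effect sizes and standard errors below are invented for illustration.
import numpy as np

effects = np.array([0.30, 0.12, 0.45, 0.05])   # hypothetical per-study effects (e.g. d)
ses = np.array([0.15, 0.10, 0.20, 0.08])       # hypothetical standard errors

weights = 1 / ses**2                           # more precise studies count for more
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))

print(f"Pooled effect = {pooled:.2f} +/- {1.96 * pooled_se:.2f} (95% CI half-width)")
```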

5) Dynamical Systems Theory provides a set of analysis tools that have little in common with statistical approaches but seem more plausible and appropriate for summarizing and modeling results in certain domains of science including certain parts of psychology. It's heavily mathematical in ways that conventional statistics isn't, and has an unfamiliar language for describing the phenomena it's applied to. So there's a long shallow learning curve to it (not a "steep" learning curve, that would mean it was learned quickly!), but it could certainly be a step towards better quantification of observations in psychology, which p-values are eminently not suited for. Physics took off when it realized it was mostly talking about differential equations (back in 1687), so maybe psychology just needs to find the right math.
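
For flavor, here's a toy sketch of what that kind of modeling looks like: a differential equation (logistic growth, with made-up parameters) stepped forward in time, rather than a significance test on group means.

```python
# A minimal sketch of the dynamical-systems flavor of modeling: describe how a
# quantity changes over time with a differential equation and simulate it.
# The logistic growth model and its parameters here are just a toy illustration.
import numpy as np

r, K = 0.5, 100.0          # growth rate and carrying capacity (made up)
x, dt = 5.0, 0.1           # initial state and time step

trajectory = []
for _ in range(200):
    dx = r * x * (1 - x / K)   # dx/dt for logistic growth
    x += dx * dt               # simple Euler integration step
    trajectory.append(x)

print(f"State after 20 time units: {trajectory[-1]:.1f} (approaches K = {K})")
```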

6) Exploratory Data Analysis (EDA), proposed and described by John Tukey in a 1977 book, is a set of techniques and displays and strategies for visualizing data in ways that could reveal patterns and relationships without mindlessly reducing them to the output of a statistical test. I wouldn't say it's caught on hugely, but there's a lot of potential and a lot to recommend it even if it's not a substitute for mathematical formulations. It is universally recommended by serious data analysts that the first step in any analysis is to graph the data and inspect it for what kinds of things it might reveal. Jumping straight to p-values is frowned upon and a sign that someone doesn't necessarily know what they're doing. So there's that.
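
In practice that first step can be as simple as the following sketch (made-up data, matplotlib for the plotting): a histogram and a scatterplot will show you skew, outliers, or a nonlinear relationship that a single summary statistic would happily hide.

```python
# A minimal sketch of the "look at your data first" spirit of EDA, using
# matplotlib and made-up data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x = rng.normal(size=200)
y = x**2 + rng.normal(scale=0.5, size=200)   # nonlinear relationship on purpose

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(y, bins=30)                          # distribution shape, outliers
ax1.set_title("Distribution of y")
ax2.scatter(x, y, s=10)                       # relationship a correlation would miss
ax2.set_title("y vs. x")
plt.tight_layout()
plt.show()
```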

7) Other statistics are occasionally proposed to convey new information, such as "p-rep", the probability of replication, which had its moment a decade or so ago. I think people were soon dissatisfied with how little information it really conveyed and how closely related it actually was to p-values. Still, interesting idea, and others like that pop up and get trendy once in a while, so maybe something will catch on. (I'm actually at a loss to think of another example.)

 

And NOT as data analysis alternatives but more generally as healthy steps the field might take to ensure more rigor in published findings, there are:

8) Pre-Registration of research protocols and data analysis plans on centralized web sites or journal hosts, so that a researcher commits to doing exactly what they plan from the start, as motivated by theory and previous findings. This could cut down on following surprising data into new off-the-cuff hypotheses since anything not in the original research plan would clearly have a lesser discovery status, with sampling error assumed to be a major potential player in any outcomes not predicted or planned for. The hard part is getting researchers to play by these much stricter rules, especially if not everyone else is doing it. But those who do will gain a bit of added respectability for their results.

9) embracing Replication as a necessary and worthwhile pursuit for researchers, so a body of knowledge develops that represents stable, repeatable findings rather than just interesting results "confirmed" by one p-value. Which is an exaggeration, but not a huge one. Nobody presently gets any career benefit from redoing a study someone else has already done, especially in exactly the same way. But it would be much better for psychology if that happened a lot more.

10) changing professional incentives to favor reproducibility and thoroughness over novelty and ... "prolificacy", is that the noun? You know, being prolific -- publishing a whole lot. That doesn't seem to be an important component of successful science, but it IS the major component of a successful vita. Which is not at all to say that prolific publishers aren't doing good science -- just that it's the major criterion used for deciding who's doing good science, which is probably inappropriate. So the governing body in charge of creating the professional incentive structure should definitely make a few tweaks to help science get better. Haha, there is no such governing body, obviously. It's all us. How does this thing even work?

11) The statistical education of graduate students is probably the biggest lever the field has to move itself in the right direction, as far as data analysis is concerned. People do what they've been taught, especially in those formative years, and then later they teach what they've been taught. Make Bayesian data analysis part of the graduate student stats preparation and suddenly they can handle it and maybe they think that way about data. Teach other useful but currently unfamiliar techniques along with skepticism of NHST, and in just a couple of generations maybe the field will have moved on.