Here are the issues I mentioned in class that are
currently making life difficult for researchers in Psychology, not to mention
Medicine and Biology and I guess Science in general. I'm making it a link
instead of just emailing it because I have good intentions of elaborating on
these points, with references included, in the future (future weeks? semesters?
lifetimes?). This is nowhere near a complete discussion of even the points mentioned,
and is mostly over the top rhetorically but hey, I'm not submitting it for
publication. Just some things to think about.
1) replication / reproducibility failure in psychology,
medicine, biology, etc. See Nosek et aaaaaaaallllllll, "Estimating the
reproducibility of psychological science" (Open Science Collaboration,
Science 349, 2015) for an examination of Psychology's "replication
crisis", though there have been some decent "it's not a crisis"
responses as well. Also see John Ioannidis's famous paper "Why most published
research findings are false," PLoS Med. 2, e124 (2005).
2) NHST (Null Hypothesis Significance Testing), a.k.a.
the use of p-values as a means of data analysis -- and oh boy will you get
familiar with the use and interpretation of p-values in this course. Consider
its general uselessness for providing evidence for or against hypotheses (google
it to see the decades of arguments against it, not to mention the 2016
statement on p-values from the American Statistical Association). Consider the incredibly
frequent misinterpretation of p-values by most people using them (as surveys
have shown). Consider the credulousness it gives rise to: Bem recently used
p-values to argue that "psi phenomena" such as mind-reading exist --
see J. Pers. Soc. Psych. 100, 407–425 (2011). Consider the lack of power in
most studies (as described by Cohen from the 1960s to the 1990s, among others),
where "power" is the probability that a study will correctly reject a
false null hypothesis, or roughly, will find an effect when it's actually there
to be found -- in other words, even if psychologists believe in p-value
significance as the criterion for discovery, they barely give themselves a
fighting chance to achieve it due to too-small sample sizes, too-small effect
sizes, and too much variability in their measures.
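Just to make the power problem concrete, here's a little simulation sketch in Python (my own made-up numbers, not from any particular study): a two-group t-test with a true effect of d = 0.3 and 30 people per group, run over and over to see how often it actually comes out significant.

    # Rough power simulation: two-group t-test, true effect d = 0.3, n = 30 per group.
    # Illustrative numbers only.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    d, n, alpha, sims = 0.3, 30, 0.05, 10_000

    hits = 0
    for _ in range(sims):
        control = rng.normal(0.0, 1.0, n)
        treatment = rng.normal(d, 1.0, n)     # true difference of 0.3 SD
        _, p = stats.ttest_ind(treatment, control)
        hits += p < alpha

    print(f"Estimated power: {hits / sims:.2f}")   # comes out around 0.2

Roughly a one-in-five chance of detecting an effect that really is there. That's what "underpowered" means in practice.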
3) p-hacking (largely deliberate, though
well-intentioned) -- the manipulation of data in search of significant results
(p-values less than .05), including things like making decisions about outliers
and data transformations and choice of tests, dividing samples into subsets for
further analysis, consideration of lots and lots of results to capitalize on
the inevitable spurious p-value that's less than .05, and so forth. When the
only thing worth finding is a p<.05, you look HARD. It's common because most
researchers make an effort to uncover what's really in their data, and p-values
have seemed to be the sign of legitimate findings, there waiting to be
discovered like a sculpture still concealed in a block of marble.
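To see how easy it is to stumble onto that spurious p < .05, here's another quick sketch (same caveats: Python, invented setup): run twenty comparisons on pure noise and ask how often at least one of them comes out "significant."

    # How often does at least one of 20 null comparisons give p < .05?
    # Both groups are pure noise in every comparison.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n, tests, sims = 30, 20, 5_000

    runs_with_false_alarm = 0
    for _ in range(sims):
        ps = [stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue
              for _ in range(tests)]
        runs_with_false_alarm += min(ps) < 0.05

    print(f"At least one 'significant' result: {runs_with_false_alarm / sims:.2f}")
    # Analytically, 1 - 0.95**20 is about 0.64 -- roughly two times out of three.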
4) forking paths in data analysis, to use statistician
Andrew Gelman's term, which I think is mostly non-deliberate on researchers'
part since they don't even have to be aware of it. It just feels like
reasonable, careful investigation of the data, but it can bias you toward
capitalizing on things that just happened to be true of your particular
sample. Oh look, this unexpected finding
happened. We should pursue that with further analyses, doing things we wouldn't
otherwise have done since now there's the
prospect of something important just dangling there. And here's a result
that didn't turn out significant, surely there's no need to follow that up. But
here's one that IS significant -- by all means let's see what the post-hoc
tests show. And so forth. Gelman does a better job of describing this. His blog
is so in-your-face it's almost exhausting.
5) journals' publication biases and the file drawer
problem: You didn't get p<.05, your results are not significant, you've
failed to find evidence in favor of a hypothesis. Is this not knowledge you've
gained, then? But try getting that negative knowledge published. Editors and
reviewers, and readers for that matter, are biased in favor of hypotheses that
evidence DOES support. It's sometimes even an explicit editorial policy that
non-significant results aren't worth publishing. Those non-significant results
go into the researcher's file drawer and silently represent the lack of support
for a hypothesis, while if any finding DOES support that same hypothesis (maybe
just due to sampling error!), it gets published and publicized and becomes part
of the literature. Not surprising then that results may fail to replicate. This
is part of why Ioannidis said most published findings are probably false.
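A small sketch of that mechanism (Python again, numbers invented): if only the studies that hit p < .05 get published, the published literature overstates the true effect, because the lucky overestimates are the ones that cross the threshold.

    # File-drawer sketch: a small true effect, small studies, and only the
    # "significant" ones make it into print.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    true_d, n, sims = 0.2, 20, 5_000

    published = []
    for _ in range(sims):
        treatment = rng.normal(true_d, 1.0, n)
        control = rng.normal(0.0, 1.0, n)
        _, p = stats.ttest_ind(treatment, control)
        if p < 0.05:                          # out of the file drawer it comes
            published.append(treatment.mean() - control.mean())

    print(f"True effect: {true_d}")
    print(f"Mean published effect: {np.mean(published):.2f}")  # much larger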
6) researchers' incentives for publication, grants,
tenure, careers, etc.: Academic success is based on publishing and getting
grants. You must publish if you want tenure, if you want a job, if you want to
feed your kids. If publication requires significant p-values, you will find
them. If there are better data analyses to use but everyone is still speaking
the language of p-values, your forays into Bayesian data analysis will remain a
mere hobby. Fortunately most people today know the language of p-values is
inadequate, but new norms have yet to emerge beyond including Effect Size
estimates in grant proposals and that sort of thing. The general unrelenting
need to publish in order to maintain and advance a career places an emphasis on
volume of output since it's easy to quantify by counting published papers, as
opposed to quality of output which is not so easy to quantify -- though when
that attempt is made, e.g. by assigning an Impact Factor to measure the
importance of various journals, it can end up providing still more perverse
incentives.
7) Fraud also happens. Recently notorious was the
Stapel case, but it's not exactly an isolated incident (see also the Ruggiero
and Hauser cases). It's not a big problem, but there should be some kind of
checks and inspections that could protect against it. Because, for instance, I
just said it's not a big problem, but how the heck do I know that??? Making replication
of studies a respectable and publishable activity would help.
Here are some potential remedies that have been
proposed.
In terms of data analyses to replace NHST and
p-values:
1) most journals now require reporting of Effect Size
measures and encourage interpretation in those terms -- i.e., focusing on how
much of an effect a manipulation had, not just deciding whether it had an
effect or not as NHST offers. These ES measures typically include things like
a) correlational expressions such as R-squared, describing how much of the
variation in the DV was accounted for (i.e., explained) by the IV, or b) a
standardized difference between two means, such as Cohen's d, which takes
account of both the difference between the means and their variability so we
can tell whether that difference counts as large or not.
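For example, here's a minimal sketch of Cohen's d for two independent groups (made-up data): the difference between the means divided by their pooled standard deviation.

    # Cohen's d = (mean difference) / (pooled SD). Made-up numbers.
    import numpy as np

    def cohens_d(x, y):
        nx, ny = len(x), len(y)
        pooled_var = ((nx - 1) * np.var(x, ddof=1) +
                      (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
        return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

    treatment = np.array([5.1, 6.3, 4.8, 7.0, 5.9, 6.4])
    control = np.array([4.2, 5.0, 4.6, 5.5, 4.9, 5.1])
    print(f"Cohen's d = {cohens_d(treatment, control):.2f}")
    # Comes out well above Cohen's conventional "large" benchmark of 0.8.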
2) emphasis on Confidence Intervals, which provide an
estimate of the value of a population parameter such as a mean, based on the
sample value obtained, with a window around it constructed so that, in the
long run, a stated percentage of such windows (say 95%) would capture the true
population value. Unfortunately CI's are built from exactly the same machinery
as p-values (a 95% CI is just the set of parameter values a significance test
wouldn't reject at the .05 level) and are usually used as a fancy substitute
with all the same interpretive problems, except that their interpretation is
EVEN MORE obscure and similarly not very useful. There
are ways to interpret CI's without just making them equivalent to p-values
though, and those approaches are respectable. Says me, the arbiter of
respectability.
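For what it's worth, here's what the computation looks like (Python, made-up sample), along with the careful way to say what the 95% means:

    # 95% confidence interval for a mean, computed by hand.
    import numpy as np
    from scipy import stats

    x = np.array([4.2, 5.0, 4.6, 5.5, 4.9, 5.1, 4.4, 5.3])
    n = len(x)
    mean = x.mean()
    sem = x.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(0.975, df=n - 1)     # two-sided 95%
    lo, hi = mean - t_crit * sem, mean + t_crit * sem

    print(f"mean = {mean:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
    # The "95%" describes the procedure: in repeated sampling, 95% of intervals
    # built this way would capture the true population mean.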
3) Bayesian Data Analysis, based on Bayes's Theorem,
which provides a general scheme for updating your level of acceptance of a
hypothesis based on prior knowledge and newly collected data. Which seems like
exactly what science should be all about (see Zoltan Dienes, "Bayesian
Versus Orthodox Statistics: Which Side Are You On?", Perspectives on
Psychological Science 6(3) 274–290). But aside from a core group of
psychologists using Bayes's Theorem in their theories of cognition, and those
actively promoting it as an analytic tool, it remains largely unknown,
untaught, and unused in psychology. Wave of the future, I think. Which is what
Bayesians in psychology have been saying for a little over half a century.
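Just to illustrate the updating idea (a toy example of my own, not anything from the course): estimate a hit rate with a Beta-Binomial model, where the prior and the new data combine into a posterior by simple addition.

    # Toy Bayesian update: prior belief about a proportion + new data -> posterior.
    from scipy import stats

    prior_a, prior_b = 1, 1          # flat Beta(1, 1) prior: no strong opinion
    hits, misses = 14, 6             # newly collected (invented) data

    posterior = stats.beta(prior_a + hits, prior_b + misses)

    print(f"Posterior mean: {posterior.mean():.2f}")
    print(f"95% credible interval: {posterior.ppf(0.025):.2f} to {posterior.ppf(0.975):.2f}")
    # Unlike a p-value, this is a direct statement about the parameter,
    # given the prior and the data.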
4) Meta-Analysis is a set of techniques for combing
through the published literature to identify a bunch of studies all getting at
the same hypothesis, and then combining their data and results to get an
overall evaluation of the hypothesis, as if they all made up one big study. This is
going to become more and more useful, and common.
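In its simplest (fixed-effect) form, the combining step is just an inverse-variance weighted average of the per-study effect sizes; a sketch with invented numbers:

    # Fixed-effect meta-analysis: precise studies get more weight.
    import numpy as np

    effects = np.array([0.40, 0.15, 0.55, 0.10, 0.30])     # e.g., Cohen's d per study
    variances = np.array([0.04, 0.02, 0.09, 0.01, 0.05])   # their sampling variances

    weights = 1.0 / variances
    pooled = np.sum(weights * effects) / np.sum(weights)
    pooled_se = np.sqrt(1.0 / np.sum(weights))

    print(f"Pooled effect = {pooled:.2f} (SE {pooled_se:.2f})")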
5) Dynamical Systems Theory provides a set of analysis
tools that have little in common with statistical approaches but seem more
plausible and appropriate for summarizing and modeling results in certain
domains of science including certain parts of psychology. It's heavily
mathematical in ways that conventional statistics isn't, and has an unfamiliar
language for describing the phenomena it's applied to. So there's a long
shallow learning curve to it (not a "steep" learning curve, that
would mean it was learned quickly!), but it could certainly be a step towards
better quantification of observations in psychology, which p-values are
eminently not suited for. Physics took off when it realized it was mostly
talking about differential equations (back in 1687), so maybe psychology just
needs to find the right math.
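Just to show the flavor (a standard textbook toy example, nothing course-specific): logistic growth, dx/dt = r*x*(1 - x/K), integrated numerically, where the object of interest is the behavior over time rather than a p-value.

    # A classic toy dynamical system: logistic growth.
    import numpy as np
    from scipy.integrate import solve_ivp

    r, K = 1.0, 100.0                # growth rate and carrying capacity (made up)
    sol = solve_ivp(lambda t, x: r * x * (1 - x / K),
                    t_span=(0, 10), y0=[5.0], t_eval=np.linspace(0, 10, 6))

    for t, x in zip(sol.t, sol.y[0]):
        print(f"t = {t:4.1f}, x = {x:6.1f}")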
6) Exploratory Data Analysis (EDA), proposed and
described by John Tukey in a 1977 book, is a set of techniques and displays and
strategies for visualizing data in ways that could reveal patterns and
relationships without mindlessly reducing them to the output of a statistical
test. I wouldn't say it's caught on hugely but there's a lot of potential and a
lot to recommend it even if it's not a substitute for mathematical
formulations. It is universally recommended by serious data analysts that the
first step in any analysis is to graph data and inspect it for what kinds of
things it might reveal. Jumping straight to p-values is frowned upon and a sign
that someone doesn't necessarily know what they're doing. So there's that.
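A minimal look-first sketch (Python, invented data), of the kind Tukey would presumably approve of as a starting point:

    # Look at the data before testing anything.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(3)
    x = rng.normal(50, 10, 200)
    y = 0.5 * x + rng.normal(0, 8, 200)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.hist(x, bins=20)             # distribution shape, skew, outliers
    ax1.set_title("Distribution of x")
    ax2.scatter(x, y, s=10)          # shape of the relationship, not just a number
    ax2.set_title("y vs. x")
    plt.tight_layout()
    plt.show()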
7) Other statistics are occasionally proposed to convey
new information, such as "p-rep", or the probability of replication,
which had its moment a decade or so ago. I think people were soon dissatisfied
with how little information it really conveyed and how closely related it
actually was to p-values. Still, interesting idea, and others like that pop up
and get trendy once in a while, so maybe something will catch on. (I'm actually
at a loss to think of another example.)
And NOT as data analysis alternatives but more
generally as healthy steps the field might take to ensure more rigor in
published findings, there are:
8) Pre-Registration of research protocols and data
analysis plans on centralized web sites or journal hosts, so that a researcher
commits to doing exactly what they plan from the start, as motivated by theory
and previous findings. This could cut down on following surprising data into
new off-the-cuff hypotheses since anything not in the original research plan
would clearly have a lesser discovery status, with sampling error assumed to be
a major potential player in any outcomes not predicted or planned for. The hard
part is getting researchers to play by these much stricter rules, especially if
not everyone else is doing it. But those who do will gain a bit of added respectability
for their results.
9) embracing Replication as a necessary and worthwhile
pursuit for researchers, so a body of knowledge develops that represents stable
repeatable findings rather than just interesting results "confirmed"
by one p-value. Which is an exaggeration, but not a huge one. Nobody presently
gets any career benefit from re-doing a study someone else has already done,
especially in exactly the same way. But it would be much better for psychology
if that happened a lot more.
10) changing professional incentives to favor reproducibility
and thoroughness over novelty and ... "prolificacy", is that the
noun? You know, being prolific -- publishing a whole lot. That doesn't seem to
be an important component of successful science, but it IS the major component
of a successful vita. Which is not at all to say that prolific publishers
aren't doing good science -- just that it's the major criterion used for deciding
who's doing good science, which is probably inappropriate. So the governing
body in charge of creating the professional incentive structure should
definitely make a few tweaks to help science get better. Haha, there is no such
governing body, obviously. It's all us. How does this thing even work?
11) The statistical education of graduate students is
probably the biggest lever the field has to move itself in the right direction,
as far as data analysis is concerned. People do what they've been taught,
especially in those formative years, and then later they teach what they've
been taught. Make Bayesian data analysis part of the graduate student stats
preparation and suddenly they can handle it and maybe they think that way about
data. Teach other useful but currently unfamiliar techniques along with
skepticism of NHST, and in just a couple of generations maybe the field will
have moved on.