Like many psychologists, I was dismayed to see the results of a recent study that attempted to replicate 100 different psychology studies and managed to support the original results in only 36% of cases. The inferential statistics used to make sense of psychology studies are meant to sift through patterns and separate the reliable ones (those strong enough that they probably reflect some real phenomenon rather than a blip in the data) from the spurious. Clearly, in many cases, they are failing.
The standard in psychology research is that to call a result significant, one must achieve a p-value of 0.05 or lower. Strictly speaking, this means that if there were no real effect, data at least as extreme as yours would arise less than 5% of the time by chance; informally, it is the field's threshold for deciding a result is probably not just a meaningless data blip. In my own experience as an academic researcher, I have seen many, many studies fail to meet that 0.05 benchmark. The results are therefore deemed non-significant, and most likely, the researchers either shelve the study completely or tweak their design and re-run it.
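To make that 5% threshold concrete, here is a minimal sketch (my own illustration, not from the original study) that simulates many experiments where the true effect is zero, applies a simple one-sample z-test to each, and counts how often chance alone produces p < 0.05:

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def two_sided_p(sample, sigma=1.0):
    """Two-sided p-value from a one-sample z-test against a true mean of 0."""
    n = len(sample)
    z = fmean(sample) / (sigma / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(42)  # fixed seed so the run is reproducible
experiments = 1000
# Every "experiment" draws 30 observations from N(0, 1): no real effect exists,
# so any significant result here is, by construction, a false positive.
false_positives = sum(
    two_sided_p([random.gauss(0, 1) for _ in range(30)]) < 0.05
    for _ in range(experiments)
)
rate = false_positives / experiments
print(f"false positive rate: {rate:.3f}")  # hovers near 0.05
```

The observed rate lands near 0.05, which is exactly what the threshold promises: even with no effect anywhere, roughly one experiment in twenty will look "significant."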
This is not a new problem. In 1979, Robert Rosenthal (author of the best statistics text I ever used) wrote a now-classic paper on the file drawer problem. In the thirty-odd years since, other authors have explored the problem empirically (including at least one study that concluded it’s not much of a problem at all, which nonetheless has not become the accepted wisdom). And many people have proposed solutions, none of which have yet gotten significant traction.
The emphasis on positive over null results is a problem for (at least) two reasons. First, finding that a relationship does not exist between two variables is useful knowledge for the field. When we relegate our non-significant findings to the file drawer, we deny others in the field the opportunity to learn which phenomena do not reliably influence one another. Imagine how many redundant research studies might be avoided if we shared our null results.
The second problem is that by not paying attention to null results, we make it harder to detect false positives (Type 1 errors): statistically significant findings that arise by chance even when no real effect exists, which a 0.05 threshold permits about 5% of the time. If we paid attention to both positive and null results, we'd be better equipped to detect that something was amiss when a single study found a relationship that 49 others did not. But without the evidence of the null results available, positive results appear more robust than they really are.
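The 49-to-1 scenario is worth quantifying. Assuming the 50 studies are independent (an idealization), each has a 5% chance of a false positive under a true null, so the chance that at least one of them comes up significant is surprisingly high:

```python
# Probability that at least one of 50 independent studies of a
# nonexistent effect reaches p < 0.05 by chance alone.
alpha = 0.05
n_studies = 50
p_at_least_one = 1 - (1 - alpha) ** n_studies
print(f"{p_at_least_one:.1%}")  # roughly 92%
```

In other words, if only the significant study gets published, the literature will very often contain a confident-looking "effect" that 49 unpublished null results would have contradicted.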
Several solutions have been proposed to help bring null results out of the file drawer. The Psych File Drawer archives null results and makes them available to other researchers. Some have suggested registering new studies in a clinical trials-like registry, so their outcomes can be tracked regardless of publication status. A few journals are dedicated specifically to publishing null findings, including the logically named Journal of Articles in Support of the Null Hypothesis. And other social science journals are revising editorial policies to put more emphasis on replication and acceptance of null results (not that replication is a perfect process either).
Marcus Munafo of the University of Bristol, one of the researchers who participated in the replication effort, perhaps summed up the dilemma best when he told The Guardian:
If I want to get promoted or get a grant, I need to be writing lots of papers. But writing lots of papers and doing lots of small experiments isn’t the way to get one really robust right answer. What it takes to be a successful academic is not necessarily that well aligned with what it takes to be a good scientist.
Now we are left with the sobering finding that 64 of the 100 studies that were part of the psychology knowledge base failed to replicate, even though the replication attempts often had greater statistical power than the originals. It’s uncomfortable to think that many of the things we “know” based on research may not in fact be true. Our job is to use that discomfort as motivation to do better and learn more. I look forward to seeing how our field evolves as a result.