In the past decade or so, there’s been an increased focus in psychology, and especially in social psychology, on replicating studies. The basic idea is that if a study’s results are accurate, then other researchers should be able to repeat the same study design and get the same results. If they can’t, there might be a serious problem, such as:
- The original researchers didn’t describe the study design well enough for someone to accurately repeat it.
- The original results were wrong: maybe there was a data management error, an undetected confound, or some other issue.
- There was some kind of researcher malfeasance: rare, but unfortunately not impossible.
A number of the social psychology studies I grew up on, so to speak, have failed this replication test (a priming study conducted in 1996 by John Bargh is a prominent example). My initial reaction to the news each time has been something along the lines of: “Crap. That’s another great paper I can’t cite anymore.” But as I think about it, I’m not sure that should necessarily be the case.
Here’s why.
- Effect sizes for many of these social psychology studies are very small. Small effect sizes will be more difficult to replicate. Bargh, Chen, and Burrows (1996) do not report the effect size in their study, which found that priming people with words associated with old age leads to slower walking; the actual difference in mean walking time was about one second. That’s probably not a big enough behavior change to be meaningful for understanding walking speed, but as Matthew Lieberman points out, it does have value for understanding how priming might nudge people’s behavior.
- Some things change with time. Bargh, Chen, and Burrows published their walking speed study in 1996. I would argue that perceptions of age have particularly changed in the last 20 years. Today’s 60-year-olds are not the ones of my childhood. People are working longer, using technology at more advanced ages, and living longer (although they are also more likely to develop conditions such as hypertension or cancer than they were in past decades). The particular stereotype around what it means to be elderly may well have shifted with the vitality of the American senior population, which would certainly limit the ability of researchers to replicate the initial Bargh, Chen, and Burrows study. Whether or not that really is what’s behind the failed replication, it’s worth considering that the specific time and place of a study might influence its results.
- Expectation effects might be at play. As Jason Mitchell notes, people conducting a replication study are likely to be doing so because they are skeptical of the original results. Now, most psychology experiments that I’ve been a part of are conducted as double-blind studies. This means that the experimenter (the person actually in the room with the participant, giving direction) is not aware of which experimental condition a participant is assigned to. This helps prevent the experimenter from influencing results with tone of voice or body language. While I am sure people conducting replication studies maintain the same sorts of safeguards as people originating the studies, it’s not inconceivable that all members of the study team are familiar with the initial research and may be conveying expectations to participants as a result.

As I said, I do think there’s value in replicating results. I also think that replication studies are stronger if they somehow incorporate and build on the original findings, especially since there may be reasons outside of researcher control why exact findings don’t replicate. In the case of the Bargh study I’ve referenced here, it may be that replication studies help us understand that stereotypes can subtly influence behavior, but that the specific stereotype-behavior pairing from 1996 is no longer so powerful.