Then we run other studies with other manipulations such as having participants sit at a disgusting table or watch a disgusting film clip and in each of those studies we find the same thing, that people make harsher moral judgments. These are called conceptual replications and in psychology, social psychology, we do them all the time. Usually we report them in the same paper.
Our entire literature is built on those conceptual replications, but those are not the ones that people are now discussing. They are different. They're called direct replications. The idea there is that you take an experiment in exactly the same way and repeat it with that precise method. A direct replication, for example, would be to take that same study with a dirty desk and then again have participants complete a moral questionnaire.
That's different. That's what some people consider more valid in a way. They say it's similar to clinical trials in medicine or it's more similar to the hard sciences. But then of course if you think of the hard sciences, what they do is very different from what we do in social psychology because for example, they have a specific pill like 50 milligrams of Lipitor, and they look at the outcome in terms of people's blood lipid levels. It's very clear what needs to be measured: the pill and the outcome.
Whereas for social psychology, our outcomes and also our manipulations often are more complicated. There are many ways to induce disgust and there are many types of moral outcomes one can look at. And indeed people have looked at all kinds of factors when it comes to disgust and moral judgment and there's an entire literature based on those conceptual replications, even though nobody's ever done any given study twice. It's a bit of a different interpretation of what's considered a replication.
Intuitively it sounds like one would have to find the same result if one had an original finding, if the finding was true. But it turns out that's not necessarily the case at all and that's very counter-intuitive. This is a complicated story but there's a very good paper by David Stanley and Jeffrey Spence where they talk about the expectations for replications and they run computer simulations where they do experiments thousands of times under perfect conditions with nothing but measurement error. And even then one gets a great variability of results. The conclusion is that any one given study is not that conclusive. That's why normally we do lots of studies to see if there is a general pattern.
One thing, though, with the direct replications, is that now there can be findings where one gets a negative result, and that's something we haven't had in the literature so far, where one does a study and then it doesn't match the earlier finding. Some people have said that well, that is not something that should be taken personally by the researcher who did the original work, it's just science. These are usually people outside of social psychology because our literature shows that there are two core dimensions when we judge a person's character. One is competence—how good are they at whatever they're doing. And the second is warmth or morality—how much do I like the person and is it somebody I can trust.
These are very relevant dimensions. Somebody's work is clearly relevant to how they're judged and how they perceive themselves. It's interesting to look at these direct replications and how they've been evaluated among colleagues and in the literature. It's an interesting situation because it points to the fact that people often use these intuitions that it seems like it's a really scientific way of confirming a previous finding when in fact that's not necessarily the case. In this context it's useful to think of how evidence is used in other contexts.
There was a really important paper in 1964 by Herbert Packer, a law professor. He made the distinction between two types of law. One is the due process model of law, the other is the crime control model of law. Due process is where the burden of proof always has to be very high. For example before you point the finger at anybody, you have to have some evidence to make an accusation. The law recognizes that if you were to just accuse somebody of something without any proof, that would be a crime. And the burden of proof also has to be really high before we as a society make a conviction. We usually consider any criminal innocent until proven guilty, so we're very careful how we assemble the evidence, what we examine and so on. And if we cannot make a conclusive judgment, we say that the person walks.
It's a very labor-intensive process and Packer calls it an "obstacle course" that ensures that we figure out the truth and don't convict an innocent person. In science we do something similar. We have an obstacle course where we consider lots of data, we run various controls, checks, we do all kinds of things and then I suppose our version of what in the law is the Fourth Amendment is our publication ethics. These laws are our way of ensuring that editors cannot just decide on their own what they want to print but we have independent experts confirm the validity of the findings. This is our peer review system. The idea is that what we consider a verdict also has undergone due process.
In contrast to the idea of due process there is the crime control model of law. That's now very different. First of all, the burden of proof is much lower. That's the case when it comes to suspicions, where it's all about trying to look for suspicious activity and often it's not even clear what it is. For example in the former East Germany there was a system in place where each citizen had a file, their neighbors were spying on them, and everything was subject to monitoring.
It's all based on suspicions and there's a low burden of proof for the suspicion and then also a very low burden of proof when it comes to conviction. In that system, a person is assumed guilty until proven innocent, and the burden of proof is not very high. The goal is to convict people very quickly because there are so many suspicious people. Packer calls it an "assembly-line conveyor belt." The goal is the suppression of crime at any cost, to make sure that not a single criminal slips through. Of course that leads to some errors, and some innocent people get convicted but then, that's acceptable. The usual phrase is that "the innocent have nothing to fear", but, in reality, they have a lot to fear.
Crime control comes from crisis. In the United States, for example, some of these measures were put in place by the government in the form of the USA PATRIOT Act where various civil liberties were curtailed, where citizens were encouraged to be on the alert to look for any suspicious activity. It wasn't quite clear what to look for but one had to be careful anyway. It was because of that crisis, because of that horrible event that had happened that there was this unbelievable betrayal by somebody right within the community, in fact, several people who turned out to be terrorists and we had no idea.
In social psychology we had a problem a few years ago where one highly prominent psychologist turned out to have defrauded and betrayed us on an unprecedented scale. Diederik Stapel had fabricated data and then some 60-something papers were retracted. Everybody was in shock. In a way it led the field into a mindset to do with crime control, the sense that times are different now, we need to do things differently from what we used to do and we need to be more careful. We need to look for the fraudsters; we need to look for the false positives. And that has led to a different way of looking at evidence. This is also when this idea of direct replications was developed for the first time where people suggested that to be really scientific we should do what the clinical trials do rather than our regular way of replication that we've always done.
Let's look at how replications are currently done, how these direct replications are carried out. First of all, there is no clear system in terms of how findings are selected for replication. At the moment, the only criterion is that a study has to be feasible, that is, easy to conduct, and that it's important, or rather the finding is important.
But then it's very hard to define what's important. In that sense, anything could be important and anything could be suspicious. What has happened is that some people have singled out certain findings that they find counterintuitive and often it's people who don't work in the research area, who wouldn't have any background on the literature or on the methods, but who nevertheless have a strong opinion that the findings somehow don't seem plausible.
There's been a disproportional number of studies that have been singled out simply because they're easy to conduct and the results are surprising to some people outside of the literature. It's unfortunate because there is not necessarily a scientific reason to investigate those findings, and at the same time, we know there may be some findings that we should go after more systematically but we really don't know which ones they are. There is no systematic effort to target specific findings.
There are also some issues with the quality of some of these replication projects. They're set up for very efficient data collection, so sometimes it really resembles that idea of an "assembly line conveyor belt" that Packer described for the crime control model where there's lots of data that's being collected even though it's not necessarily done all that carefully.
For example, there was a large-scale project called the "Many Labs" Replication Project. They went around the world and had various labs participate and rerun earlier studies. There was one original study conducted in the United States where participants had been presented with an American flag and they were asked about their attitudes about President Obama. Then Many Labs went around the world and presented participants in Italy, in Poland, in Malaysia, in Brazil and many other countries with an American flag and asked them about their views on President Obama. This is taking that idea of direct replication very literally. There have been other examples like this as well where it's not clear whether the kind of psychological process we're trying to capture is realized in that experiment.
The conclusions are also interesting and, again, it relates to Packer's idea of this quick processing of evidence where it's all about making the verdict and that verdict has to be final. Often the way these replications are interpreted is as if one single experiment disproves everything that has come before. That's a bit surprising, especially when a finding is negative, if an effect was not confirmed. We don't usually do that with positive findings. We don't say this now proves once and for all that such and such effect is real. It perhaps comes with that idea that it intuitively seems like this is the real study because we repeated exactly what had been done before.
There's a number of problems with how these replications are done, but at the same time, some people feel very strongly that they are the only right way to basically confirm whether the effect exists. The studies usually are not so much about whether the effect is confirmed; it's more about whether that particular method got a significant finding. It's just that one example rather than the whole body of literature that's available. Now, one reason why some people feel so strongly about these direct replications is that perhaps they've taken on a moral connotation.
Linda Skitka has talked about moral conviction where people feel like they have a moral mandate where something is so important that it just by default has to be right: it's just a better way of doing things and that's basically the end of it. And when that happens, Skitka has shown that people feel like the regular rules don't apply. That has recently happened where there was a journal special issue with 15 replication papers covering 27 earlier reported effects. That issue went in print without having undergone any peer review.
It may not seem like a big deal but peer review is one of our laws; these are our publication ethics to ensure that whatever we declare as truth is unbiased. I took issue with the fact that there was no peer review and one of my findings was reported to not be replicated by some researchers. I looked at their data, looked at their paper and I found what I consider a statistical problem. What was really interesting though, was that when I alerted the editors, they were not very interested. They were not interested at all. In fact, they denied me the right to a published response. I had to fight tooth and nail to get that response.
And at every step of the way I was made to feel like whatever I could possibly say must be wrong. Mind you, that was without that paper, or any of the other papers, having gone through peer review. When that whole thing became public, it was interesting to observe people because one thing I pointed to was this idea of replication bullying, that now if a finding doesn't replicate, people take to social media and declare that they "disproved" an effect, and make inappropriate statements that go well beyond the data.
But that was not the main point about the bullying. The much more serious issue was what happened in the publication process, because again this is the published truth. Of course it's easy to say, and some people did say that peer review is not always accurate. Some reviewers make mistakes and anyway, maybe it wasn't such a big deal that there was no peer review.
But again let's think about it in the legal context. This is to declare a verdict on people's work, on the quality of people's work, without a judge and without having given the people whose work is concerned any right to even look at the verdicts, never mind to defend themselves.
Interestingly, people didn't see it that way. When I raised the issue, some people said yes, well, it's too bad she felt bullied but it's not personal and why can't scientists live up to the truth when their finding doesn't replicate? It was quite interesting just to see how people arrived at those judgments because it's ultimately a judgment of wrongdoing because it is personal. If my finding is wrong, there are two possibilities. Either I didn't do enough work and/or reported it prematurely when it wasn't solid enough or I did something unethical. It's also about allegations of wrongdoing. It's quite interesting how quickly people made those allegations.
People really didn't fully appreciate what it means that there was no peer review. Some people raised various general points such as "we have all these findings in the literature that don't replicate, so that's why we must do replications." All that is true but then of course I don't know how I can help with that because so far I don't know of a single person who failed to replicate that particular finding that concerned the effect of physical cleanliness and moral cleanliness. In fact, in my lab, we've done some direct replications, not conceptual replications, so repeating the same method. That's been done in my lab, that's been done in a different lab in Switzerland, in Germany, in the United States and in Hong Kong; all direct replications. As far as I can tell it is a solid effect. But people just repeat the mantra of well, it's important to do direct replications, we need to do them more often and so on. They go by the intuition that if there was a study with a large sample that repeated exactly the same method, it must be the right study, the ultimate study, when that is not necessarily the case.
What happened on social media was also interesting because the whole thing played out quite publicly and there were various heated discussions and at some point some people said oh, but what we really need is to look at the data. I had been required to make all my raw data available so people were crunching numbers and there were blogs with all kinds of colorful pictures. At the end all the blogs concluded Schnall is definitely wrong. She is definitely wrong about that claim that there's a concern about her replication finding, no, there absolutely is not.
That was then called "post publication review" by the editors when in reality those self-appointed reviewers neglected to do the main part of their assignment, which was to evaluate the quality of the replication paper; in particular, the rejoinder where I am in print accused of hunting for artifacts. In terms of what was considered due process, nine months after I raised the concern that there was no peer review and although I found a technical problem in the replication, there still hasn't been any independent review. And that's not just for myself but that's for a total of 44 colleagues who all now have the label of "failure to replicate." There was no independent verification; there was no judge for that verdict.
Such judgments are made quickly nowadays in social psychology and definitively. There are now news reports about so and so many findings replicated, so many findings did not replicate when in reality it's usually a single experiment and nobody mentions all the conceptual replications which are part of the literature. When one looks at how these replications are done, they have a lot of the features of the crime control mindset, so there are no clear criteria for what's suspicious. We don't know what a false positive looks like or what we're looking for.
Then the quality criteria are oftentimes not nearly as high as for the original work. The people who are running them sometimes have motivations to not necessarily want to find an effect as it appears. We now have all these findings that didn't go through any peer review and yet there are exaggerated claims of what they can tell us.
When crime control is implemented by governments, it's a means of control, and it creates fear. And it is used in times of crisis. It's the kind of situation where people just aren't sure what's happening and they worry that they may become a suspect because anybody can become a suspect. That's the most worrisome thing that I learned throughout this whole experience where after I raised these concerns about the special issue, I put them on a blog, thinking I would just put a few thoughts out there.
That blog had some 17,000 hits within a few days. I was flooded with e-mails from the community, people writing to me to say things like "I'm so glad that finally somebody's saying something." I even received one e-mail from somebody writing to me anonymously, expressing support but not wanting to reveal their name. Each and every time I said: "Thank you for your support. Please also speak out. Please say something because we need more people to speak out openly."
Almost no one did so. They all kept quiet and they say they can't afford to speak out; they can't afford to question the replication movement because they don't have tenure yet, they don't have jobs yet and they can't afford to become a target. That's really the worrisome thing here, that we have created a system where there's just so much uncertainty or so much variability regarding what's done that people probably are not as much afraid that their findings don't replicate, as they're afraid of the fact that there's absolutely no due process. Anybody could be singled out at any point and there are no clear criteria for how the verdict is handed down.
That's a real problem and of course one could think that well, when it comes to governments that implement crime control, sometimes in times of crisis it can be useful. For example, if you have to be so sure that a particular person doesn't blow up a building or an airplane and you have good reason to believe that that might happen, it may still be useful to detain them even if it's wrong, if it's an error, to do so just to be on the safe side. When it comes to crime control it can be good to be on the safe side as far as criminals are concerned. But if we have that kind of crime control mindset when science is concerned, that's never a good thing because it comes with errors. We'll have errors across the board. We have them regarding our false positives and our false negatives. We'll just have a bunch of errors. And now we already have them in the literature in that particular special issue. Even the so-called "successfully replicated" findings have errors.
It's a problem for the accuracy of the published record because those are our verdicts. Those are what researchers build on. The whole idea was to increase the credibility of the published record. It's also a problem for all the people who put in the hard work running the replication studies and doing exactly what was expected of them and they now end up with publications that are not very valuable on a scientific level.
What social psychology needs to do as a field is to consider our intuitions about how we make judgments, about evidence, about colleagues, because some of us have been singled out again and again and again. And we've been put under suspicion; whole areas of research topics such as embodied cognition and priming have been singled out by people who don't work on the topics. False claims have been made about replication findings that in fact are not as conclusive as they seem. As a field we have to set aside our intuitions and move ahead with due process when we evaluate negative findings.
If junior people in the community aren't comfortable joining the discussion, then we have a real problem. If they're too afraid of being targeted for replication simply because it's not clear what's going to happen once their findings are under scrutiny, we really need to be careful. I appeal to colleagues to say, look, we often use intuitions. We do it all the time. But we know from the research that that's not the way to make a good decision, a good judgment. And we should treat our findings and our colleagues with at least the same respect that we give to murder suspects. We hear them out, we let them talk, we look at the evidence and then we make a decision.
THE REALITY CLUB
FIERY CUSHMAN: One of the things I really appreciated that you brought up is what is the appropriate analogy between science and law and the kind of standards and due process that get used in law. The analogy that you invoke is to criminal law where at least in the United States the standard of evidence is beyond a reasonable doubt, which is a very high standard of evidence. And criminal law, not entirely but mostly, deals with intentional harms; the equivalent in science would be intentional fraud.
For instance if I were to accidentally spill my coffee on Laurie, that would be handled through tort law where a different standard of evidence is applied. It's a preponderance of the evidence so … the idea is Laurie's got a claim, I've got a claim and 51 percent in favor of either one of us is going to decide the matter. Another interesting analogy would be to the area of libel law.
There's a concept in U.S. law that different people who occupy different roles in society are held to different standards in terms of when they can make a claim of libel against somebody else. As a private citizen, there's a fairly low standard. I can make a claim that someone's libeled me or slandered me in a broader array of circumstances. If I'm a public official, especially an elected official, then it's incredibly hard for me to make a successful claim of slander or libel. And the reason is because the … legal scholars and justices have interpreted the Constitution to imply that if you put yourself out in a public arena, then you're opening yourself up to criticism and the existence of that criticism is vital to a well-functioning democracy.
SCHNALL: Sure, that's right.
CUSHMAN: I'm curious to hear more from you about what are the appropriate analogies within the law and what kind of standard of due process are you envisioning? I think you brought up one issue that is whether or not a replication paper should be subject to peer review. I wouldn't be surprised at all if every person in this tent right now feels strongly that any publication in the literature needs to be subject to peer review. Are there elements of due process that go beyond that where extra scrutiny is required for replication …?
SCHNALL: Well, I will say this. We know how easy it is for any study to fail. There is almost no limit to the reasons for a given experiment to fail and sometimes you figure out what the problem was, you made an error, there was something that you didn't anticipate. Sometimes you don't figure it out. There are always many reasons for a study to go wrong and everything would have to go right to get the effect.
We have to apply a really high standard before we infer that there is no effect. In a way, before you declare that there definitely is no effect, the burden of proof has to be really high.
CUSHMAN: And do you think that burden of proof is most appropriately applied by the action editor and reviewers or by the readership?
SCHNALL: Based on the paper by Stanley and Spence I mentioned earlier, the conclusion is that there's very little you can say based on a single study. Practically all the large-scale replication projects that are being conducted now such as the Reproducibility Project, they will say very little about the robustness of the effect because it's just a one-off experiment. It's practically impossible to read much into that one experiment. And we usually don't do that, either. That's why we usually have a line of work rather than one single one that we consider conclusive.
It's about doing lots of studies as we've always done and getting at that effect from different angles rather than putting the weight into that one study. Just because it's the exact repetition of an earlier method and just because it has a large sample doesn't mean it's the conclusive study. In a way, that's really a misconception at the moment where people think that's the best kind of study to run when in fact it's not.
DAVID PIZARRO: There is a way in which, as you point out, social psychologists are well suited to see the problems in the scientific process because as you say, we know very well that people use evidence in very different ways depending on what their motivations are. And you rightfully point out a lot of the issues with people motivated in this way. It's not … if I wanted to show that say somebody's studies were wrong I could just do a poor job. I could claim to replicate Laurie's monkey studies and since I don't know anything about doing it, that's problematic.
But at the same time so much of what we know is that we have equal errors on the side of trying to find what you're convinced about. This has been problematic and there are reasons why this might be magnified in psychology right now or maybe across all sciences right now because of the ease with which we communicate. This has always been problematic. There's this way in which you can have an ideal answer where you say that … science corrects itself. You're motivated to find this, I'm motivated to find that, and at the end of the day it'll work itself out because there will be this body of evidence.
But the truth of the matter is that people get trampled in the process. I can imagine that … I don't know, if Newton and Leibniz had Facebook there would be flame wars and people would take sides. And there's a way in which this is just extra-problematic now. But it's not a new problem, at least in principle. I don't know if there are any good ideas aside from, say, just being more rigorous about publishing, about how to go about fixing given the ability to smear reputations in the modern world.
SCHNALL: Sure. Well, one key thing is to select, to really go after phenomena or methods for which we have evidence that they're in some way not as reliable as we hoped they were.
PIZARRO: But how do you get that evidence?
SCHNALL: Well, that would have to be something that the field as a whole decides; right?
PIZARRO: It's a problematic first step …
SCHNALL: Well, the way right now is that you can go to a website and anonymously nominate a replication target. One doesn't have to give any evidence of why it makes a good replication target except that it needs to be easy to run and important. Basically there are no criteria. That's the problematic thing because if we want to go in depth into a specific phenomenon, we need to do that, rather than just covering lots of different things and doing a one-off study that will tell us very little.
That's really a key thing for the field to decide, how to select replication targets because it does come at a huge reputational cost. There is no question about it but at the same time it needs to be done. We need to go after those potential false positives.
About your earlier point about people's expectations, one can always have biases this way or the other way. I would imagine just considering how easy it is for any given study to go wrong, that it's easier to get a study not to work than for it to work just based on bias. That's just a hunch.
LAURIE SANTOS: Let me follow up on that. One of the things I haven't liked in following the replication crisis is the fact that these effects are seen as either/or. Like either having a dirty table is going to cause moral evaluations or it won't. And in psychology the depth of our effects and the amount that they're going to stick varies.
I could have all kinds of preconceived notions that Müller-Lyer is a fake effect and it doesn't work. I'm going to bring 20 people in and they're going to see it. Social psychology is not as profound as perception; that is why it's probably a lot more interesting. But it raises the question of can you use these null effects to see the boundary conditions on these things?
You brought up the case of Many Labs looking at the American flag in Brazil. Probably a boundary condition on the American flag effect is you have to be American. When we don't see replications there, we learn something about the effect. That it’s bounded: we see it only in Americans.
Is there a way to move the debate closer to that, that we can learn something important and scientific about the boundary conditions of different effects through non-replications? And I see the issue if somebody is doing what Dave was going to do in my monkey study, he's just actively and intentionally doing a crappy job. I'm not sure any of the cases are like that. I would like to believe that most of the cases are people who are curious about whether these effects….
SCHNALL: Yes, that's right. The issue is with these one-off experiments where a single experiment is taken to be representative of the effect as a whole as opposed to just one particular method. And that's what we normally do anyway with the conceptual replications. We do a line of studies. If one wanted to really go after specific false positives by using direct replications, one would have to use a series of direct replications rather than a one-off across a large number of phenomena.
SANTOS: But do we still learn something from the direct … I run a study. Somebody directly replicates it and they don't see the same result.
SCHNALL: Again, that paper by Stanley and Spence, that's an excellent paper. It's quite stunning just how much variability one can get. For example, they did computer simulations with perfect testing conditions, thousands of simulations, nothing but just measurement error. If you have a known correlation coefficient of 0.3 with a known reliability of 0.7, what you get as a correlation coefficient can range from 0 to 0.5. It can be much, much smaller and much, much larger than the real thing, 0.3. And there's just such a big range that any one given study tells us very little.
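[Editorial illustration] The kind of simulation described here can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the procedure from the Stanley and Spence paper: the sample size per study (100), the number of simulated studies (2,000), the normal distributions, and every function name below are hypothetical choices made for illustration. Only the true correlation of 0.3 and the reliability of 0.7 come from the discussion. Because each simulated study draws a finite sample, the spread it shows combines measurement error with ordinary sampling variability.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_study(true_r, reliability, n, rng):
    """Run one simulated study and return the observed correlation.

    True scores correlate at true_r; each observed score mixes the
    true score with independent noise so that its reliability
    (share of variance due to the true score) equals `reliability`.
    """
    xs, ys = [], []
    for _ in range(n):
        true_x = rng.gauss(0, 1)
        true_y = true_r * true_x + math.sqrt(1 - true_r ** 2) * rng.gauss(0, 1)
        obs_x = math.sqrt(reliability) * true_x + math.sqrt(1 - reliability) * rng.gauss(0, 1)
        obs_y = math.sqrt(reliability) * true_y + math.sqrt(1 - reliability) * rng.gauss(0, 1)
        xs.append(obs_x)
        ys.append(obs_y)
    return pearson_r(xs, ys)


def simulate_replications(true_r=0.3, reliability=0.7, n=100,
                          n_studies=2000, seed=1):
    """Repeat the same study many times (n and n_studies are assumptions)."""
    rng = random.Random(seed)
    return [simulate_study(true_r, reliability, n, rng) for _ in range(n_studies)]


if __name__ == "__main__":
    results = simulate_replications()
    print("smallest observed r:", round(min(results), 2))
    print("average observed r: ", round(sum(results) / len(results), 2))
    print("largest observed r: ", round(max(results), 2))
```

Under these assumptions the average observed correlation sits below the true 0.3, because unreliable measures attenuate it toward 0.3 × 0.7 = 0.21, and individual studies scatter widely around that value. The exact range depends on the assumed sample size, so the numbers printed here should not be read as reproducing the figures quoted above.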
L.A. PAUL: It seems to me that it might be productive to distinguish a couple of things. It's my job as a philosopher anyway. And I heard you talk about a couple of things that I wanted to sort of separate. One issue involves the evidential standard, namely, what's the standard that our evidence has to reach before we draw a conclusion? Another thing though that I was hearing you talk about is what counts as evidence? And then embedded in that is also a question about, if we establish a particular evidential standard, do we … are we keeping it constant as we move from context to context? And so one reason why it's helpful to distinguish these things is because I don't think that you want to argue that it's OK if my evidential standard is low and other people's evidential standard is high.
Rather, what I hear you talking about are two problems. One is that we're not being careful enough about what counts as evidence and so the quality of the replication studies must be looked at to understand whether or not we should even judge these results as evidence. And then the second problem is that it seems like we are holding different people and different groups to different standards. And for high quality scientific research and inquiry, we need to have a constant standard. It's a little bit related to what Fiery was raising.
SCHNALL: Yes. That's exactly my point. First of all, we consider: Is it admissible evidence? For example, I have had people write to me, ask me for my experimental materials. They take materials that were done in a lab. They run an online study then they don't find the effect and they make a big deal out of it on social media and blogs and so on, failure to replicate an effect. Well, is that something that should be put out there because it in no way repeated the original study? It was a lab study as opposed to an online study. It's obvious why they're not as highly controlled in each case.
What is considered admissible evidence? And the line is very easily blurred for two reasons when it comes to social psychology. One is that there are all these discussions on social media. The reason why they're everywhere is because social psychology seems so intuitive to people. Everybody has an opinion: Do I believe that finding, yes or no? As opposed to string theory where people will accept that they just don't know enough about it.
It's a real problem that people feel they know enough about it to say that yes, it's probably true or not. It's just an intuitive judgment as opposed to a scientific judgment and that's a big problem now with social psychology where everybody feels like they can be a social psychologist and make conclusions and put up studies online or … especially now with some of these replication efforts that don't require any expertise, so in a way they propagate that image that anybody can sign up and run a study.
In the end, it's about admissible evidence and ultimately, we need to hold all scientific evidence to the same high standard. Right now we're using a lower standard for the replications involving negative findings when in fact this standard needs to be higher. To establish the absence of an effect is much more difficult than the presence of an effect.