
jonathan weinberg

These results are interesting, but I'm not sure that they really count as "experimenter bias", as I would have thought that is understood. I mean, there's nothing about this "effect" that they've found that actually raises any worries about anyone's results! At the level of the experiments themselves, the experimenters have a hypothesis of the form, "On this vignette, more subjects will answer X than Y" (or whatever). And the real danger of experimenter bias arises when one can get results that appear to confirm or disconfirm such a hypothesis, but where it's just one's inclinations as a researcher that are driving the results. In these cases, the problem isn't about the experimental _results_, but rather, at worst, that researchers don't always select experimental _hypotheses_ that are best at shedding light on the bigger theoretical game they might have been interested in. This is a caution against hasty inference, and not really about any problems with the experiments themselves. (Which is not to say that there couldn't be problems with experimenter bias at the level of the experiments themselves, in xphi, but most labs that I know are appropriately careful about that.)

One thing that makes this not a particularly deep problem is that it is something we can expect to already be well compensated for by the dialectic nature of the research community. If I choose vignettes that will skew in favor of my theory, but my theory is not in fact correct, then it will also be easy for someone else to find vignettes that skew against my theory. At the end of the day, we'll have several pieces of good experimental knowledge: that _these_ vignettes give results one way, but _those_ give results another way. And we can thus revise our theories based on that expanded & expanding set of results (or have good grounds to chastise those who do not so revise). So, that's all good!

But, in contrast, what's scary about the more standard version of experimenter bias is that it poses a threat to exactly this healthy dynamic, by undermining the individual results themselves. That I use one set of vignettes that go one way, and you use a different set that go another way -- no worries there. What's scary (for some epistemological reading of "scary") is when _you and I use the same vignettes but end up with different results anyway_.

It also seems to me worth emphasizing, in the context of various ongoing debates about philosophical methodology, that one should expect this "experimental design bias" to afflict armchair use of case judgments _as badly as, if not worse than_, it afflicts x-phi practice.

None of this makes the results of the study non-cool! But I don't think it really justifies much hand-wringing, is all.

Brent Strickland

Hi Jonathan,

Your comments are quite interesting, and it was a pleasure reading them (being one of the authors on the original paper). I certainly agree with you that standard armchair philosophy is likely to be more influenceable by "experimental design bias." One can cherry pick one's examples in support of whatever point one is trying to make even more easily if one is not constrained by having to actually carry out an experiment.

I suppose I am nevertheless more skeptical than you in that I think these findings present grounds for somewhat serious concern. They almost certainly fall within the scope of what is traditionally considered to be a standard "experimenter bias" effect. The key point is that just entertaining a certain hypothesis leads the experimenter to set up or carry out an experiment in such a way that it is more likely than it should be to yield a desired result. Our experiment, like the classic Rosenthal experiments, meets this definition.

Your point that any flaws in experimental design could be evened out by the process of scientific dialogue was thought-provoking. To a certain extent, this is a very reasonable point since the scientific process does sometimes carry out this function, at least when viewed over very long time scales. However, I think the X-phi community as a whole would benefit from instituting some sort of procedure for ensuring that their stimuli are not subtly weighted in favor of the hypothesis being tested. For starters, it is quite possible that an entire field can be biased towards a given hypothesis, perhaps because it is intuitively more appealing than its opposite. And thus the field simply fails to seriously consider (and test) the alternative hypothesis. In this case, the "ironing out" process may require a very long time or be non-existent. In the meantime, graduate students and professors waste a lot of time and resources chasing a red herring.

The argument from wasted energy is compelling even in the case where the scientific process works as it should. Thus imagine that one day we go on to discover that the Knobe and Prinz findings were much ado about nothing, and their findings were simply due to experimenter bias. In this case, many man hours that could have been used chasing down a real effect would have been wasted in attempting to explain a ghost. If from the get-go people are willing to implement measures that prevent bias in stimulus design, the problem of wasted time virtually disappears since your findings will be more reliable from the very beginning.

One problem I've been thinking about lately is trying to find an easy and reliable way to prevent bias during the process of stimulus creation itself. For example, could m-turkers create some stimuli without ever having been exposed to a hypothesis? Or maybe one could give one hypothesis to Group A and the opposite hypothesis to Group B and take some sort of average? In any case, I'd be curious to hear people's thoughts along these lines.

jonathan weinberg

Hi Brent,

I wonder if I'm not thinking of the right Rosenthal experiments? The only stuff I know about is the educational stuff, and that seems to me to be clearly of the sort that I was describing as casting doubt on results themselves: the relevant experimental hypotheses of Rosenthal's targets were about student abilities, but Rosenthal showed that researchers' expectations would be a substantial driver of student scores, which gives a reason not to trust results that were supposed to show that students better in such-and-such a way can be expected to get thus-and-so better test results. So Rosenthal's expectancy effects would count as dangerous, in the terms of my earlier comment. Am I not thinking of the right set of results?

Though I think it's really a very cool thing for y'all to be exploring, I must confess I'm a bit doubtful about how well we can crowdsource stimuli development, especially once one has to use stimuli even a little more complicated than single-sentence ones. I guess it doesn't seem at all obvious to me that this involves _less_ of an expenditure of energy, to go through the sort of process y'all did, than what practitioners are already doing!

Also, as I noted in my earlier comment, it's probably going to be a mistake to think of the kind of experimenter bias you're describing as one that would produce _wasted_ energy, because the results that folks like Prinz, etc. get are still perfectly legitimate results. That is, the kind of effect you describe _isn't_ one that can lead to these sorts of outcomes: "Thus imagine that one day we go on to discover that the Knobe and Prinz findings were much ado about nothing, and their findings were simply due to experimenter bias." Even if K&P's _theoretical_ hypotheses turn out to be mistaken, their _experimental_ hypotheses will still have generated useful & interesting results -- even if their selection of those particular experimental hypotheses was driven by experimenter bias. That they went in search of those findings as opposed to other, different ones that would have falsified their theory -- that might be due to experimenter bias. But _the findings themselves_ would still stand, it seems to me. And that's a good thing.

Brent Strickland

Hi again and thanks for the thoughtful comments. I'm not sure if you and I had the same Rosenthal studies in mind or not. The Rosenthal study I was thinking of is Rosenthal & Lawson, 1964. In this study, experimenters were given hypothesis A (i.e. that their rats were bred to be good maze learners) or hypothesis B (i.e. that their rats were bred to be bad maze learners). The experimenters then carried out their studies, and those in group A found that their rats performed better on a maze learning task than those in group B despite the fact that there was no systematic difference in breeding between the two groups of rats.

We tried to model our paper on this classic study. Thus in our experiment, similarly to that one, experimenters were given hypothesis A (i.e. that people intuitively think that groups/corporations can have intentional but not phenomenal states) or the opposite hypothesis B (i.e. that people intuitively think that groups/corporations can have phenomenal but not intentional states). The experimenters then designed their studies, and those in group A (i.e. those who had the same expectations as Knobe & Prinz) were able to replicate the results from Knobe & Prinz (2008). However those experimenters in group B who had differing expectations about the outcome of the experiment systematically failed to replicate Knobe & Prinz's results.

So here, just as in the Rosenthal & Lawson results, it seems that the outcome of the experiment depends completely on what outcome one expects going into it. It's true that in the Rosenthal experiment we know in advance what the results should be since there was no true genetic difference between the two groups. Unfortunately in our study we didn't have that luxury, and you are right that our results may be less "dangerous" for this reason.

Brent Strickland

The thing that nevertheless gives our results some element of danger is that experimenters in group A and in group B should have gotten identical outcomes since they were given identical instructions on how to create their experiment, and they were asking the same theoretical question (i.e. do people have differing expectations about the types of mental states that groups can possess). Instead, it looks like Knobe & Prinz (2008) only replicates when people have the same expectations about the outcome that the authors originally had in mind when creating their study. By analogy, if we had given explicit instructions to undergrads on how to design a Stroop test and they correctly carried out those instructions, we would have likely obtained evidence of a Stroop interference effect regardless of the hypothesis or expectations that we would have given to our experimenters. So the systematic lack of replication of the Knobe & Prinz results by the experimenters in group B is somewhat disturbing.

So to respond to this: "Even if K&P's _theoretical_ hypotheses turn out to be mistaken, their _experimental_ hypotheses will still have generated useful & interesting results -- even if their selection of those particular experimental hypotheses was driven by experimenter bias."

The worry is less about their selection of experimental hypotheses and more that their selection of materials to test their hypothesis was questionable. I don't think our experiments conclusively show that Knobe & Prinz got their results simply due to bias, but the worry now is that the original results may have only obtained due to (unconsciously) cherry picking their stimuli to unfairly stack the deck in their favor. Thus of all the many possible sentences that they could have tested, they maybe tested just a very particular subset of sentences that gave them the results they were looking for.

Brent Strickland

If that turns out to be the case (which it may not), then the value of the findings would be severely lessened since they would reveal only a bias in stimulus selection and not anything interesting about how people ascribe mental states to groups vs. individuals. I agree that there could be a silver lining here in that at least Knobe and Prinz will have introduced a new and interesting topic that could generate further research, and the work might nevertheless have some residual value.

But in a hypothetical case where the Knobe and Prinz results were purely generated by biased stimulus selection, I would still think that there would have been wasted energy in terms of follow-up studies and replications which could have been avoided by implementing checks in the design process to avoid creating biased stimuli. You may be right that the cure may be worse than the disease, but I'm quite optimistic on this point in that I think a simple solution can be developed that saves energy and time in the long run. The man hours that are required for reviewing a paper, creating follow-up studies, etc. (for me at least) are likely to be much greater than those needed to get a few more undergrads to design some stimuli.

Your doubts about crowd-sourcing stimulus development are well taken. I agree with you that many of the scenarios may be too complicated to be done by on-line participants. I will say that in some of my own experiments on causal reasoning, I have started crowd-sourcing stimulus design with very simple sentences and have been getting pretty good results. I know that Danny Oppenheimer at Princeton has also started doing the same thing, also to good effect. It could be interesting to know what the limits of this approach are though, and secondly to maybe come up with other methods for the cases that involve more involved stimuli like those often used in experimental philosophy.

Eric Schwitzgebel

Cool study! One pernicious way in which such bias might arise is in piloting vignettes and in assessing which ones are "successful" and thus merit follow-up study and publication.

It's a tricky issue, because you don't want to publish just every old junky thing you run through mTurk or pilot with undergrads, but on the other hand as soon as you start making choices about what to follow up on, you've got Rosenthalesque experimenter-effect and file-drawer problems.

Josh Weisberg

Very interesting study, Brent. As a general point, it would surprise me greatly if such biases are not present in x-phi, as they seem to be in all experimental practice. Obviously, we must work to control them.

But I wonder if your choice of the Knobe and Prinz thing on phenomenal consciousness might have made things look worse than they are. This study has already been challenged (Sytsma and Machery, Phelan, and others, get contrary results). That is, the K&P results look extremely sensitive to vignette presentation and that is perhaps unsurprising as they are about the "concept" of "phenomenal consciousness." My guess is that things are not at all clear in the folk psychology, so it is very easy to move the results. I take this to be more about the subject matter (consciousness) than about x-phi practice in general.

Have you tried other prominent x-phi studies? I wonder if they are as sensitive. But similar problems may (and I think do) arise with free will (how in the heck do you tell "naive" subjects about a deterministic universe without biasing them?). Anyway, I wonder if it's not the rather arcane intuitions we are trying to elicit, rather than experimenter bias.

And for what it's worth, it seems to me that Jonathan's point about the "dialectic nature" of x-phi-ers might be supported by the literature following K&P's study. But I don't think that's too different from usual scientific practice.

(PS: I wonder how one would crowd source sentences/vignettes about phenomenal consciousness?)

Cool stuff. Thanks!

Brent Strickland

@Eric Schwitzgebel that's a really interesting point. The problem of stimulus filtering (i.e. piloting stimuli then taking the ones that work) comes up ALL the time, and you are totally correct that this type of procedure is ripe for bias and cherry picking. I'm not sure about the best way to avoid bias in situations like these. A lot of the time "flawed" stimuli are picked out in lab meetings when a researcher is showing failed results, and then people find post-hoc reasons why the non-working stimuli were bad. One useful method I've seen is to ask from the very beginning if, on the whole, the items are showing a trend in the right direction or not. If not, then just stop the experiment. Another way that one could imagine doing this process is trying to get feedback from one's lab group on stimulus choice BEFORE the audience sees the results.

Brent Strickland

@Josh Weisberg

Thanks for the positive comments Josh. I see your point about the Knobe and Prinz study being particularly sensitive to phrasing/wording etc. I haven't had a chance to test other prominent X-phi results, but my guess is that you would find bias effects in roughly one third of the results. The reason why I think the problem is likely to be widespread is that bias effects appear around a third of the time across a wide range of psychological domains (animal learning, social psych, memory, response time, etc.). For a really cool review on this, see an old BBS article called "Interpersonal expectancy effects: The first 345 studies" by Rosenthal and Rubin.

As for your and Jonathan's point about the dialectic nature of the K&P literature being able to weed out bias, I agree that this is possible. But I'm worried that this whole literature may just have sprouted up because of an original study that found a result due simply to biased stimulus selection. So I'm saying it would have been good to have a check in place from the very beginning to avoid wasted intellectual effort and time. But I definitely see you guys' point here.

As for your final question, if you (or anyone else) want to try to figure out a way to crowd source sentences about phenomenal consciousness together, I'd definitely be game! I'm most interested in comparing the effectiveness of different possible solutions. One simple starting point would be to have three groups of on-line experimenters: (1) receives hypothesis A, (2) receives hypothesis B, (3) receives no hypothesis. Then you give all m-turkers clear instructions on the types of sentences they need to build (e.g. all sentences must have a group as a grammatical subject and must contain the verb "desire"; the experimenter can then choose the tense and any complements)....

Joshua Knobe

Hi Brent,

Like everyone else who has commented here, I think that this is a really nice paper, and I really appreciate the way that you are bringing these issues to our attention.

Anyway, I know that you were originally just using this effect about phenomenal state attribution as a case study to demonstrate a broader point, but since questions about this particular effect keep coming up in the comments, I thought it might be helpful to make three quick points about it.

1. In your actual study, you find a significant main effect in the direction predicted by the original hypothesis. (People are more willing to attribute non-phenomenal than phenomenal states to group agents.) So how do you see your results as bearing on the truth of the original hypothesis itself?

2. Subsequent studies have provided a lot of helpful information about how the stimuli have to be designed in order for the original effect to come out. For example, Adam Arico has shown that the effect only comes out when the stimuli don't include an intentional object. So there is a difference between people's responses to the two versions of (a) but not to the two versions of (b).

(a) Microsoft [Bill Gates] is upset.
(b) Microsoft [Bill Gates] is upset about the court's recent ruling.

It is an important finding that the effect only comes out when we use sentences like (a) -- and this finding might ultimately be used to show that our whole way of conceptualizing the effect was incorrect -- but either way, it does seem that there is some real phenomenon here. The question is just about what that phenomenon is telling us.

3. Our original claim was that there is an important connection between attributing phenomenal states and seeing an entity as embodied. (Since group agents don't have bodies, people don't see them as having phenomenal states.) It should be noted, though, that the evidence for this claim doesn't just come from vignette studies with group agents. For example, Gray and colleagues provide evidence by looking at phenomenal state attributions to people in various states of undress:

Of course, I don't mean to deny that Prinz and I might have been guilty of experimenter bias, and I certainly wouldn't want to insist that our original hypothesis was completely correct all along. Still, it does seem that this basic area of research has clearly been a fruitful one, and work on it by other researchers definitely does seem to be leading to interesting and important results.

Brent Strickland

Hi Josh (Knobe),

Thanks for the comments. To respond to some of your questions…

(1) I definitely don't think that our study rules out the truth of your and Jesse's original hypothesis. There could be experimenter bias effects in your study, but people nevertheless could still be slightly less inclined to attribute phenomenal than intentional mental states to groups. One data point from our own study that speaks somewhat in favor of this is that participants who had your original hypothesis in mind when designing the stimuli for the study were able to replicate your results, but those with the opposite hypothesis didn't get the opposite results (i.e. with higher ratings for phenomenal ascriptions to groups). They just got a null result. Perhaps the reason why you don't get the complete flip is because your original idea was correct?

Nevertheless, I think the lack of replication is still a worry in the sense that it casts a degree of doubt on the original piece of evidence that you guys put forward for your claims. For example, if this were instead an experiment looking at the Stroop effect, I think you would find the same basic pattern of results regardless of the hypothesis that the experimenter had in mind when designing the stimuli.

So to summarize, I think our results cast some degree of doubt, but the jury is still out with regard to whether or not your effects were just due to biased stimulus selection.

(2) The Adam Arico results sound interesting. Are you referring to the results from the paper called "Folk Psychology, Consciousness, and Context Effects"? In that paper, they do get a pattern of results similar to the one you suggest but with a slight twist. From their results, it looks like added context (in the form of a prepositional phrase) brings naturalness ratings for phenomenal (i.e. feeling) attributions to groups to the level of phenomenal state ascriptions for individuals. BUT they get an identical pattern of results for non-phenomenal (i.e. non-feeling) mental states. So if you ascribe a non-feeling mental state to a group without context, that is just as unnatural as assigning that group a feeling mental state without context. While this is definitely an informative and useful result with respect to your original findings (because I definitely think it helps clarify what is going on), my worry still remains.

IF (and that's a capital if on purpose) it turns out that your original results were due just to biased stimulus selection, calling Arico and company's efforts "wasted" would likely be too strong of a claim. Nevertheless Arico and company may have preferred to clarify the nature of (and model the mechanisms for) a more solid effect.

(3) The study you mentioned by Gray and colleagues is pretty cool. I'll take a closer look.

In any case, I definitely don't want to hate on this area of research too much as I think the theoretical questions being asked by you guys are original and this research has the potential to be very important in how we understand theory-of-mind abilities. My worry is that this area (like many others) may have started out on methodologically unstable foundations, and that developing strategies to remove bias would be a big step forward because it would remove a lot of (perhaps unwarranted) doubts.

To take another example, consider infancy researchers. They go to great lengths to ensure that both the researchers running the experiments and the people coding the babies' reactions have minimal exposure to the experimental hypotheses and/or the experimental condition in which a given baby appeared. I imagine that even before these imperfections were rectified, the field found some solid results and they were able to make new and interesting theories. Nevertheless, now that the field has been cleaned up, many doubts about experimenter effects have been eliminated. The field is better off for it because you don't have that worry in the back of your mind going "Yeah, but do we think babies can count to three just because the person coding the data knew the hypothesis?" Infancy research is definitely more believable because of the protocols the field has in place, and my hope is that X-Phi can take a similar step forward.

Joshua Knobe


I completely agree with you about the importance of developing structures that help to address this problem. (Maybe we could even name such structures after you and refer to them as 'stricktures.') :)

In any case, I think that the results Adam Arico obtained are pretty clearly *not* a waste of time. The key thing to keep in mind is what the distinction between 'feeling' and 'non-feeling' amounted to in the study. The distinction there was just between sentences that actually use the English word 'feeling' and those that do not. For example, the distinction between (1) and (2).

(1) Microsoft is feeling upset.
(2) Microsoft is upset.

Just as you say, this distinction doesn't have any impact on anything. Participants say that both of those sentences sound bad.

However, the sentences come out sounding fine if we add an intentional object, as in (3) and (4).

(3) Microsoft is feeling upset about the court's ruling.
(4) Microsoft is upset about the court's ruling.

This result seems to suggest that there is something importantly different about the states being attributed in (3) and (4) that licenses attributions of those states to groups when more straightforward phenomenal state attributions like (1) and (2) are not licensed.

Ultimately, what I'm saying here is basically a version of Jonathan Weinberg's point above. Perhaps it was a bias on our part that led us to use sentences like (1) and (2) in the original study, but subsequent work has helped to clarify the precise conditions under which the effect we obtained there arises.

Brent Strickland

Hey Josh,

Haha, I don't know what the best thing to call structures for avoiding bias would be. Maybe I will cross that bridge once the problem has been solved?

As for the Arico results, I actually quite like them and I agree they add an interesting data point in the discussion here. So I hope nothing I say is interpreted as a slight on the quality of his work or contribution, for which I have the utmost respect. Having said all that, I'm still going to play devil's advocate for a bit.

The original point I was trying to make was a conditional one. For the sake of argument, let's imagine for a second that all of your and Jesse's original effects are simply due to biased selection of stimuli (again I'm not sure if this is the case or not, but let's imagine for a bit). By this I mean that there would be no true difference in how people ascribe phenomenal vs. intentional states to groups vs. individuals. Just natural sounding sentences happened to have been chosen for the experimental cases where the sentences needed to sound natural for the original hypothesis to be true, and non-natural sounding sentences happened to have been chosen for the opposite condition.

In a case like that, the benefit of running follow-up studies is dramatically reduced because the effect in question is not a true one. Instead for Adam to understand the full range of cases in which your original effects would and would not obtain, he would just need to have a general theory of what makes sentences sound natural. While a general theory of what makes sentences sound natural could be useful and interesting, the problem is not likely to be tractable because there are likely to be hundreds of ways that one could modify the stimuli in question to alter their naturalness (e.g. removing/adding a prepositional phrase as in Arico's results, removing a direct object from a non-feeling sentence as in (5) below, altering the prepositional complement of a feeling sentence as in (6), or modifying background knowledge as in (7)).

(5) Microsoft believes. (sounds bad; should sound good)

(6) Microsoft is feeling upset at the comment it heard on television. (sounds bad; should sound good)

(7) Apple's sales are up. Microsoft is worried/feeling worried. (sounds good; should sound bad)

So my point here is that if Arico knew before starting to study this topic that the original effect was due to biased stimulus selection, then he likely would have preferred to concentrate on a more tractable problem (or at least I would have in his place). Like I've been saying though, it's hard to know right now what the nature of the original effect is, so it could turn out that the effects that are being uncovered in this research paradigm will ultimately tell us something important about mental state attribution as opposed to charting a few of the parameters that can influence how natural sentences sound.

Joshua Knobe

Hi Brent,

I completely agree with the basic way in which you are understanding the potential effects of experimenter bias. At this point, I guess our one point of disagreement is about whether that outcome is a plausible one in this particular case.

My own sense is that the full body of work following up on these original studies -- including the results of your own experiment -- provides strong reason to suspect that there really is an effect here. I'm still somewhat inclined to think that the original hypothesis we developed about this effect was at least more or less on the right track, but as a number of commenters have already noted, other researchers have developed quite different views about what is going on here. In my view, the hypotheses developed by those other researchers are also highly promising, and it may turn out in the end that they are right and we are wrong. However, even if that does turn out to be the case, there would still be some kind of real phenomenon that was being uncovered by this stream of research, just not the one that Jesse and I originally thought there was.

