Our results show that in spite of the fact that all participants knew for sure that neither the stranger nor the shocks were real, the participants who saw and heard her tended to respond to the situation at the subjective, behavioural and physiological levels as if it were real. This result reopens the door to direct empirical studies of obedience and related extreme social situations, an area of research that is otherwise not open to experimental study for ethical reasons, through the employment of virtual environments.

Following the style of the original experiments, the participants were invited to administer a series of word association memory tests to the (female) virtual human representing the stranger. When she gave an incorrect answer, the participants were instructed to administer an ‘electric shock’ to her, increasing the voltage each time. She responded with increasing discomfort and protests, eventually demanding termination of the experiment. Of the 34 participants, 23 saw and heard the virtual human, and 11 communicated with her only through a text interface.

Stanley Milgram's 1960s experimental findings that people would administer apparently lethal electric shocks to a stranger at the behest of an authority figure remain critical for understanding obedience. Yet, due to the ethical controversy that his experiments ignited, it is nowadays impossible to carry out direct experimental studies in this area. In the study reported in this paper, we have used a similar paradigm to the one used by Milgram within an immersive virtual environment. Our objective has not been the study of obedience in itself, but of the extent to which participants would respond to such an extreme social situation as if it were real in spite of their knowledge that no real events were taking place.

Funding: The work described in this paper was funded under the European Union IST FET research grant PRESENCIA IST-2001-37927. The funding agency was not involved at all in the study, nor in any aspect of analysis or writing up of the results, nor in the preparation of the paper.

The results suggest that the participants were stressed by the situation, and certainly more so when they interacted directly with a visible Learner rather than only through a text interface with a hidden Learner. This is demonstrated with an analysis of their subjective, behavioural and physiological responses. On the whole the results, at least for some of the participants, were stronger than we expected prior to the experiment. Our study was subject to full ethical scrutiny: there was no deception, participants gave informed consent, and we ensured that any distress to participants was transitory.

The aim of the study was therefore to investigate how people would respond to such a dilemma within a virtual environment, the broader aim being to assess whether such powerful social-psychological studies could be usefully carried out within virtual environments. From our previous experience with virtual environments that depict social settings we expected that participants would exhibit stress in response to the behaviour of the virtual Learner. A specific hypothesis was that the stress would be greater in a situation where the Learner could be seen and heard in comparison to one where she would only communicate with the participant through text.

The study of presence forms the wider background to our work and in this experiment we specifically wished to investigate whether participants would reach such a high level of presence that they would withdraw from the experiment, or exhibit signs of stress or behaviours that indicated that the virtual person was being treated as if real, in spite of their certain knowledge that no one real was protesting or being hurt by electric shocks. Another way to consider the situation is that the experiment established a dilemma for the participants: they had agreed to take part in it, and would be paid for their trouble, yet there was a virtual person (the Learner) who eventually strongly objected to its continuation. Of course, participants had been told in advance as part of the normal ethical procedures that they could withdraw at any time without giving reasons. However, the objections to continuation were not from anyone real, so why stop?

Previous work has shown that people tend to respond realistically to events within such environments and even to virtual humans in spite of their relatively low fidelity compared to reality [10]. For example, virtual environments have been used in studies of social anxiety and behavioural problems [11], [12], and individuals with paranoid tendencies have been shown to experience paranoid thoughts in the company of virtual characters [13]–[15]. These provide specific examples of ‘presence’ – the tendency of participants to respond to virtual events and situations as if they were real [9], [16]–[18]. However, such previous studies involving virtual humans have been limited to situations where participants only react to rather than initiate significant interaction with them (for example, see the review in [19]). In our study the human participants were required to carry out actions that would cause ‘pain’ to a virtual character. In this situation the behaviour of the participants had consequences for the condition of the virtual human that would be dangerous were it a real person.

An immersive virtual environment is formed by a computer-generated surrounding real-time (stereoscopic) display of virtual sensory data from a viewpoint determined by the tracked position and orientation of the participant's head [7]. This delivers a life-sized virtual reality within which a person can experience events and interact with representations of objects and virtual humans. Our experiment took place in a projection-based virtual reality system of the generic type that is called a ‘Cave’ [8] – specifically a Trimension ReaCTor – that has three back-projected vertical screens (3 m×2.2 m) and a floor screen (from a ceiling-mounted projector) (3 m×3 m) controlled by a Silicon Graphics Onyx 2. This system, and how stereo projection and head-tracking are achieved, was described in an earlier paper [9] (and see Materials and Methods).

Milgram's paradigm was an experiment that subjects were led to believe was a study of the effects of punishment on learning. The subjects, referred to as Teachers, were asked to administer electric shocks of increasing voltages to another subject (the Learner) whenever he gave a wrong answer in a word-memory experiment. A lottery to choose who would be ‘Teacher’ and who ‘Learner’ was carried out at the start of the experiment. In fact, the whole situation was contrived: there were no actual shocks, the lottery was fixed, and the Learner was a confederate of the experimenter. Contrary to expectations, a high proportion of subjects (65% in one condition, n = 40) continued to give ‘shocks’ to the maximum 450 volts, in spite of screams of protest from the Learner. Almost all subjects exhibited signs of distress and many expressed their fears regarding the well-being of the Learner, nevertheless continuing to give shocks to the end.

In an attempt to understand events in which people carry out horrific acts against their fellows, Stanley Milgram carried out a series of experiments in the 1960s at Yale University that directly attempted to investigate whether ordinary people might obey the orders of an authority figure to cause pain to a stranger. He showed that in a social structure with recognised lines of authority, ordinary people could be relatively easily persuaded to give what seemed to be even lethal electric shocks to another randomly chosen person [1], [2]. His results are often cited today, for example, recently in helping to explain how people become embroiled in organised prisoner abuse [3] and even suicide bombings [4]. However, his study also ignited a far-reaching debate about the ethics of deception and of putting subjects in a highly distressing situation in the course of research [5], [6], and as a result this line of research is no longer amenable to direct experimental studies.

The participants in the VC often behaved in a way that only made sense if they were responding to the virtual character as if she were real. For example, when she asked participants to speak louder, they invariably did so. The voices of some participants showed increasing frustration at her wrong answers. At times when the Learner vigorously objected, many turned to the experimenter sitting nearby and asked what they should do. The experimenter would say: ‘Although you can stop whenever you want, it is best for the experiment that you continue, but you can stop whenever you want.’ As we have seen, some did stop before the end. Some giggled at the Learner's protests, as was observed by Milgram in the original experiments. When the Learner failed to answer at the 28th and 29th questions, one participant repeatedly called out to her ‘Hello? Hello? …’ in a concerned manner, then turned to the experimenter and, seemingly worried, said: ‘She's not answering …’ In the debriefing interviews many said that they were surprised by their own responses, and all said that it had produced negative feelings – for some this was a direct feeling, in others it was mediated through a ‘what if it were real?’ feeling. Others said that they continually had to reassure themselves that nothing was really happening, and it was only on that basis that they could continue giving the shocks.

We recorded the times between the completion of the participants reading out the five words and the moment that they said “Incorrect”, signifying that they would now give the shock. Those in the VC waited very much longer before giving the shock than those in the HC, especially at the 28th question, as shown in Figure 4. Moreover, 8 of the VC participants repeated the question (i.e., read out the five words again) at least once after receiving no answer to question 28, and 6 repeated the question after no response to question 29. None of those in the HC repeated the question. From the point of view of the reality of the situation there was no rational need for the participants in the VC to wait so much longer than those in the HC, let alone repeat the question. Why did participants wait and repeat the question? This must be because the situation was not experienced as being like watching a movie. Although individuals watching a horror movie may sometimes scream, or when watching a sports game on television may shout at the players, they do not expect that their actions can have any effect on the outcome of the movie or the game. Here, however, the situation was quite different. The actions of the participants actually mattered, and they behaved accordingly – they needed to wait, or withdraw altogether, in order to stave off or avoid the act of administering the shock and the unpleasant consequences that would follow from this.

Similarly, the Learner did not respond to the 28th and 29th questions (in both conditions) – unknown to the participants these were the final two questions. In response to the 28th question the Learner simply ‘stared’ at the participant saying nothing (VC). After the shock she seemed to fall unconscious and made no further responses, and then 3 of the VC participants withdrew, failing to give the next shock (Supporting Information, Movie S3).

As part of the briefing before the experiment started, participants were given an information sheet that said that if the Learner replies with any answer other than the correct one, “or if she does not answer in a short time, you will say ‘Incorrect’” (italics in the original sheet given to participants). “You will then administer a shock to her using the machine in front of you. You then call out the correct answer.” The Learner gave no answer to the 11th question in both VC and HC. In the VC she shouted “Stop the experiment!” just after the question (see Supporting Information, Movie S2). In order to remind the participants of the rule and emphasise it once again, the experimenter said at that moment (to participants in both groups): “If she doesn't answer remember that it is incorrect,” and in the VC case the Learner then responded angrily “Don't listen to him, I don't want to continue!” After this the participants invariably and immediately said “Incorrect” and administered the (6th) shock.

Our simulation of a version of Milgram's experiment shows that in spite of their knowledge that the situation was artificial, participants were aroused (skin conductance analysis), this was associated with stress (ECG analysis), and the intensity was greater for those who saw and heard the virtual character during the learning period compared with those who did not. Nevertheless, the same types of response might be generated by someone watching a distressing or violent film – so what was special about this depiction within virtual reality? The answer can be found in the behaviours that many participants directed back towards the virtual Learner. Participants in the VC made unscripted comments to her and two even emphasised the correct answers while reading out the words, as they later admitted, in an attempt to help her. Such behaviour was not observed in the HC participants. This type of evidence, though convincing for the experimenters who witnessed it, is nevertheless anecdotal. However, there was behaviour that is easily quantifiable and that illustrates the extent of engagement by many of the participants in the VC.

Electrodermal activity indicates that there is greater overall arousal in the VC participants than in the HC (SCL) and greater specific orienting responses around the time of the shocks (SCR). However, to obtain some idea of the associated valence we turn to heart rate and heart rate variability. The mean heart rates (beats per minute) were analysed for the VC and HC over (i) the first 256 s in the baseline, (ii) the first 256 s of the learning session, and (iii) the last 256 s of the learning session. For the VC these were: (i) 70.2±12.4 bpm, (ii) 74.4±14.2 bpm and (iii) 78.0±14.6 bpm. Using a paired non-parametric sign test the difference between (i) and (ii) is significant with P<0.01 and between (ii) and (iii) with P<0.05. For the HC the equivalent values are: (i) 77.7±12.7 bpm, (ii) 82.0±15.0 bpm and (iii) 78.7±11.4 bpm. In this case the difference between (i) and (ii) is significant with P<0.01 but the difference between (ii) and (iii) is not significant at the 5% level. Hence, heart rates increase significantly from the baseline to the start of the learning session for both groups, but only for the VC does the heart rate show a significant increase by the end of the learning session. We also analysed the event-related heart rate (HR) and heart rate variability (HRV) around the time of the shocks. Overall for the VC the mean HR increases and HRV decreases significantly, which is an indicator of stress [25]. There are no significant changes for the HC. (The values for each individual together with further information can be found in Supporting Tables S4 to S7.) HR and HRV can be influenced by many factors such as respiration, movement, stress and so on. However, the two conditions required exactly the same physical tasks so that these differences cannot be due, for example, to the movements of the participants in pressing the shock button, nor to the sounds of the shocks, but must be caused by the protesting behaviour of the Learner in the VC.
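The paired sign test used for the heart-rate comparisons can be sketched as follows. This is a minimal illustration only: the function name `paired_sign_test` and the heart-rate values are hypothetical (the individual values are in the Supporting Tables and are not reproduced here), and the test is the standard exact two-sided sign test with ties dropped, which may differ in detail from the procedure actually run.

```python
from math import comb


def paired_sign_test(before, after):
    """Exact two-sided sign test for paired samples (ties are dropped)."""
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    n_pos = sum(d > 0 for d in diffs)
    k = min(n_pos, n - n_pos)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return n_pos, n, min(1.0, 2 * tail)


# Hypothetical mean heart rates (bpm) for eight participants:
# baseline period versus start of the learning session.
baseline = [68.0, 72.5, 80.1, 65.3, 90.2, 74.8, 70.0, 77.7]
learning_start = [71.2, 75.0, 84.6, 66.1, 95.0, 79.3, 69.5, 81.0]

n_pos, n, p_value = paired_sign_test(baseline, learning_start)
print(f"{n_pos} of {n} participants increased; two-sided P = {p_value:.4f}")
```

The sign test uses only the direction of each participant's change, not its size, which makes it robust to the large between-person spread in resting heart rate reported above.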

We restrict attention to SCRs in an 8 s neighbourhood of the shocks (−6 s to +2 s) and let N_i be the number of such SCRs observed for the ith participant, and A_i be the mean of the corresponding SCR amplitudes (i = 1,…,23 for the VC and i = 24,…,33 for the HC). The period (−6 s to +2 s) was chosen on the basis of Figure 2, as likely to include the time just after the Learner gave an answer up to the time of administration of the shock (but not including the response to the sound of the shock itself). To validly test for significant differences between VC and HC we need to control for confounding variables, in particular the spontaneous SCR rate of individuals as available from the baseline recordings, and other factors such as their psychological profile. These variables were included in a standard log-linear Poisson regression which showed that N is significantly higher for VC than for HC (P = 0.0123) and that the same is the case for A (P = 0.025). Full details of the regression are provided in Supporting Information Tables S2 and S3.
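For orientation, a log-linear Poisson regression of the SCR counts has the general form below. The symbols are assumptions introduced here for illustration: c_i is an indicator equal to 1 for VC participants and 0 for HC participants, b_i is the participant's spontaneous SCR rate from the baseline, and z_ik stands for any further covariates such as psychological profile scores. The covariates actually used are those listed in Supporting Information Tables S2 and S3.

```latex
N_i \sim \mathrm{Poisson}(\mu_i), \qquad
\log \mu_i = \beta_0 + \beta_1 c_i + \beta_2 b_i + \sum_{k} \gamma_k z_{ik}
```

Under this form, a significantly positive \beta_1 corresponds to the reported finding that N is higher for the VC than for the HC after adjusting for the other variables, with \exp(\beta_1) interpretable as the multiplicative change in the expected count.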

For the earlier shocks, where the Learner displays little distress in the VC, we would expect the VC and HC responses to be similar. However, as the shocks continue we would expect to see some evidence of a differential response between these two groups. At the time that each shock is given (time 0 in Figure 2a) we have the individual SCL value for each participant in the VC and the HC. A Wilcoxon rank sum test can be used to test the null hypothesis, for each shock, that the two sets of values could have come from the same population. Figure 3 represents the significance levels for this null hypothesis, revealing that for the earlier shocks the null hypothesis is not rejected, but that it would be rejected for later shocks in favour of the alternative hypothesis that the SCL for the VC is generally higher than that for the HC.
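A per-shock comparison of this kind can be sketched with a Wilcoxon rank sum test. The code below is an illustration under stated assumptions: the SCL values are hypothetical, the function names are invented for this sketch, and the P value uses the large-sample normal approximation without a tie correction, which is only rough for groups this small.

```python
from math import erfc, sqrt


def midranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Rank sum of x, z score, and one-sided P for 'x tends to exceed y'."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    w = sum(ranks[:n1])
    mean_w = n1 * (n1 + n2 + 1) / 2
    var_w = n1 * n2 * (n1 + n2 + 1) / 12  # no tie correction applied
    z = (w - mean_w) / sqrt(var_w)
    p_greater = 0.5 * erfc(z / sqrt(2))  # upper tail of the standard normal
    return w, z, p_greater


# Hypothetical baseline-corrected SCL values at the moment of one shock.
scl_vc = [1.2, 0.9, 1.5, 2.0, 0.7, 1.8]
scl_hc = [0.3, 0.6, 0.8, 0.1, 1.0]

w, z, p_greater = rank_sum_test(scl_vc, scl_hc)
print(f"W = {w}, z = {z:.3f}, one-sided P = {p_greater:.4f}")
```

Repeating this call once per shock, with the SCL values taken at that shock, would give a sequence of significance levels of the kind plotted against shock number.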

If we entirely eliminate any possible differences between the VC and HC groups by translating each 20 s segment to start at height 0, then although each group has a similarly shaped waveform, there is a highly significant difference between them over regions of the curve, as shown by the 95% confidence intervals (Figure 2b). For example, if we consider time zero and use the non-parametric rank sum test to test the hypothesis that the two samples of SCL values at this point (when the shock is administered) could have come from the same population, the hypothesis is rejected (P = 10.1×10⁻⁴). This also eliminates the possibility that the results are solely due to movement artefacts, since both groups carried out the same physical movements in order to press the shock button. The difference between the two conditions is therefore most likely due to the visible presence of the Learner. The peak in the curves after the shock point is probably due to the sound of the shock.

The first physiological response we consider is electrodermal activity (EDA) [21], [22], of which two aspects are considered: skin conductance level (SCL) and skin conductance response (SCR). SCL reflects the overall level of sympathetic arousal whereas SCR reflects transient sympathetic arousal, either spontaneous or in response to events [23], [24]. Each individual's raw SCL was used without smoothing or detrending, and was sampled at 32 Hz (see Materials and Methods). We took into account the natural variation of SCL between individuals by subtracting the mean SCL for each participant obtained from the baseline period from their SCL waveform during the learning period. The mean of the SCL time series over all intervals of ±10 s around each shock was found over all participants in the VC and also over all in the HC (these are sometimes referred to as the ‘event-triggered averages’). The resulting mean SCL waveforms were significantly different from what would be expected by chance for both the VC (Figure 2a) and HC (Figure S1, Supporting Information), and the mean SCL waveform was also significantly higher for the VC than the HC (Figure S2, Supporting Information).
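The baseline correction and event-triggered averaging described above can be sketched for a single participant as follows. The function name, the toy signal and the event positions are hypothetical; at the 32 Hz sampling rate mentioned in the text, a ±10 s window would correspond to `half_window = 320` samples, whereas the toy example uses a much shorter window so the arithmetic can be checked by hand.

```python
def event_triggered_average(signal, baseline, event_indices, half_window):
    """Average baseline-corrected segments centred on each event index.

    signal        -- SCL samples from the learning period
    baseline      -- SCL samples from the baseline period
    event_indices -- sample indices at which shocks were given
    half_window   -- number of samples kept on each side of an event
    """
    baseline_mean = sum(baseline) / len(baseline)
    corrected = [value - baseline_mean for value in signal]

    # Keep only events whose full window lies inside the recording.
    segments = [
        corrected[i - half_window : i + half_window + 1]
        for i in event_indices
        if i - half_window >= 0 and i + half_window < len(corrected)
    ]
    if not segments:
        raise ValueError("no event has a complete window in the signal")

    width = 2 * half_window + 1
    return [sum(seg[t] for seg in segments) / len(segments) for t in range(width)]


# Toy data: a steadily rising signal, a flat baseline, and two events.
toy_signal = [float(v) for v in range(20)]
toy_baseline = [1.0, 1.0, 1.0]
average = event_triggered_average(toy_signal, toy_baseline, [5, 10], half_window=2)
print(average)
```

A group waveform of the kind compared between the VC and HC would then be obtained by averaging these per-participant results, or by pooling all segments across the participants in a group.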

The Autonomic Perceptions Questionnaire (APQ) is a 24-item visual-analogue scale that was used to assess self-awareness of various physiological indicators (e.g., ‘trembling or shaking’, ‘face becoming hot’, ‘perspiration’). High scores indicate greater subjective awareness of somatic state, and have been found to correlate positively with anxiety, heart rate, skin conductance responses, respiration, face temperature, and blood volume [20]. It was administered to participants in both groups before the experiment, reporting on how they were feeling ‘right now’ (Before-score), and then after the experiment reporting on ‘how you were feeling during the experience’ (After-score). For the VC the median Before-score was 7.6 (range 0.38 to 39.4) and the median After-score was 14.8 (range 0.00 to 52.7), showing increased perception of somatic responses during the study (medians significantly different using a Wilcoxon paired signed-rank test, P = 0.013). For the HC the median Before-score was 12.1 (range 4.9 to 29.2) and the median After-score was 17.4 (range 5.7 to 31.0) (no significant difference, P = 0.28).

A clear behavioural difference between the two groups was the different levels of early withdrawal from the experiment. All participants in the Hidden Condition (HC) administered all 20 shocks. However, in the Visible Condition (VC) 17 gave all 20 shocks, 3 gave 19 shocks, and 18, 16 and 9 shocks were given by one person each. At the end of the final relaxation period they were asked: ‘Did it ever occur to you before the end of the experiment that you wanted to stop?’, requiring a yes/no answer. (If the participant had actually stopped before giving all shocks then the answer was recorded as ‘yes’.) 12/23 in the VC and 1/11 in the HC answered ‘yes’, and all who wanted to stop said that this was because of their negative feelings about what was happening. Of those 12 in the VC who wanted to stop before the end, 5 claimed to be well-acquainted with the original Milgram study, and therefore we cannot rule out the possibility that this influenced their behaviour. However, if we treat ‘wanting to stop’ as a binary response variable in order to test for differences between the proportions (using binary logistic regression) then the VC was significantly different from the HC (χ² = 6.691 on 1 d.f., P = 0.0097) whereas knowledge of Milgram did not have a significant impact (χ² = 1.525 on 1 d.f., P = 0.22) and there was no interaction effect between group and knowledge of Milgram.
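The group comparison can be illustrated with a likelihood-ratio test computed directly from the counts given above (12 of 23 in the VC and 1 of 11 in the HC). This is a simplified sketch, not the full analysis: it fits a logistic model with group as the only predictor, whose fitted probabilities are just the observed proportions, and compares it with a null model that has a single pooled proportion. The function name is hypothetical, and knowledge of Milgram is left out because the cross-tabulation needed for it is not given in the text.

```python
from math import erfc, log, sqrt


def lr_chi2_two_groups(yes_a, n_a, yes_b, n_b):
    """Likelihood-ratio chi-square (1 d.f.) for a difference in proportions."""

    def log_likelihood(yes, n, p):
        # Binomial log-likelihood without the constant term;
        # terms with a zero count are skipped so log(0) is never evaluated.
        total = 0.0
        if yes:
            total += yes * log(p)
        if n - yes:
            total += (n - yes) * log(1 - p)
        return total

    p_a, p_b = yes_a / n_a, yes_b / n_b
    p_pooled = (yes_a + yes_b) / (n_a + n_b)

    ll_full = log_likelihood(yes_a, n_a, p_a) + log_likelihood(yes_b, n_b, p_b)
    ll_null = log_likelihood(yes_a, n_a, p_pooled) + log_likelihood(yes_b, n_b, p_pooled)

    chi2 = max(0.0, 2 * (ll_full - ll_null))
    p_value = erfc(sqrt(chi2 / 2))  # chi-square upper tail for 1 d.f.
    return chi2, p_value


# Counts of participants who wanted to stop, taken from the text.
chi2, p_value = lr_chi2_two_groups(yes_a=12, n_a=23, yes_b=1, n_b=11)
print(f"chi-square = {chi2:.3f} on 1 d.f., P = {p_value:.4f}")
```

With these counts the sketch gives a statistic of about 6.69 and a P value of about 0.0097, in line with the values reported for the group effect.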

This was a between-groups experiment with two conditions. In one condition (‘Visible’, n = 23) the Learner was seen and heard throughout and she responded to the shocks with increasing signs of discomfort, eventually protesting that she had ‘never agreed to this’ and wanted to stop. At the penultimate shock her head slumped forward and she made no further responses. In the second condition (‘Hidden’, n = 11) the Learner was not seen or heard apart from a few seconds of introductions at the start of the experiment, her answers were communicated only through text, and there were no protests. Both conditions were otherwise identical, and carried out in the same setting. Each experimental session was divided into three periods with the participants seated by the shock machine and wearing the virtual reality and physiological recording equipment. There was a baseline period of 5 minutes, the learning period of about 10 minutes, and a final relaxation period of 5 minutes followed by an interview (see Materials and Methods).

Participants interacted with a female virtual character, referred to as the Learner, seen seated behind a transparent partition (Figure 1a). Their task was to read out five words addressed to the Learner: the first was a cue word and the other four were candidate words, one of which was the word associated with the cue that the Learner was supposed to have memorised beforehand. There were 32 sets of these 5 words (including some repetitions). On 20 out of the 32 trials the Learner gave the wrong answer, with the later trials more likely to result in a wrong answer than the earlier ones (Table S1, Supporting Information). On the desk in front of the participant was an ‘electric shock machine’ with a shock button, voltage indicators and a knob for turning up the voltage level (Figure 1b). The participant was instructed that each time the Learner gave an incorrect answer he or she should turn up the voltage by one unit and press the shock button, which would give a shock to the Learner. Each shock was accompanied by an ‘electric’ buzz sound.

Discussion

The main conclusion of our study is that humans tend to respond realistically at subjective, physiological, and behavioural levels in interaction with virtual characters notwithstanding their cognitive certainty that they are not real. The specific conclusion of this study is that within the context of the particular experimental conditions described participants became stressed as a result of giving ‘electric shocks’ to the virtual Learner. It could even be said that many showed care for the well-being of the virtual Learner – demonstrated, for example, by their delay in administering the shocks after her failure to answer towards the end of the experiment. To some extent based on previous evidence this was to be expected. In fact, it has even been taken for granted that virtual humans can substitute for real humans when studying the responses of people to a social situation. For example, this was the strategy used in the fMRI study described in [19], where participants passively observed virtual characters gazing at the participants themselves or at other virtual characters. However, no previous experiments have studied what might happen when participants have to actively engage in behaviours that would have consequences for the virtual humans. The evidence of our experiments suggests that presence is maintained and that people do tend to respond to the situation as if it were real. We review the evidence for this in subsequent paragraphs.

First, several participants withdrew from the experiment before termination. We have been conducting experimental studies with virtual environments since the early 1990s, with altogether hundreds of participants. Ethical rules require us to inform the participants that they may withdraw from the experiment at any time without giving reasons. Nevertheless, withdrawal is extremely rare, and has only previously occurred due to simulator sickness with no more than about 5 participants out of all the hundreds. Second, there were physiological responses that indicated stress (the SCL, SCR and ECG analysis). There were differential responses within groups (comparing the baseline to the learning session) and between groups (comparing those in the VC with those in the HC). Third, subjectively reported physiological symptoms also differed between groups. Finally, there were clear behavioural differences between the HC and the VC regarding responses to a failure of the Learner to reply to the questions. All these factors, together with the non-quantifiable participant behaviour observed by the experimenters, show a pattern of responses similar to those found in the original Milgram studies, although at lesser intensity.

In the original studies by Milgram it was found that the smaller the ‘distance’ between the Learner and the Teacher the more likely that the Teacher would refuse to give the higher level of shocks. For example, at one extreme the Learner was hidden as in the case of our HC, although unlike in our condition he protested by banging on the wall. At another extreme the subjects had to force the Learner's hand onto the shock machine in order to administer the shock. A similar result regarding ‘distance’ was found here, comparing the responses of the HC with the VC. However, it must also be said that the objections of the virtual Learner were much less extreme and violent than those of Milgram's actor. The virtual Learner complained and even screamed, but there was none of the banging and shouting and protestations of a heart condition expressed by the original actor. One of our participants, for example, reported that although he was affected by the protestations of the virtual Learner, he wasn't too upset, because she didn't protest enough, did not for example scream at and insult him nor writhe in agony in the chair.

Our study leaves open many avenues of further research. We carried out this experiment using two conditions that are far apart. However, we do not know what would have happened if the virtual Learner in the HC had issued protests through text. Neither do we know whether simply the voice of the virtual Learner would have been sufficient to provoke the responses, nor what would have happened if the protests of the Learner had been extremely violent. During our pilot studies we did try a condition with three participants where the Learner was seen but did not show any signs of discomfort and did not protest. One of those participants claimed to see signs of discomfort in the behaviour of the Learner (even though none had been programmed), and said that he felt uncomfortable continuing with the experiment. It is possible that very minimal cues are sufficient to provoke the stress responses in some people.

This issue of minimal cues is important in another sense. Our virtual Learner could never be confused with a real human. Her visual representation was not realistic, and her behaviours were only as realistic as could be programmed with the resources available to us (see, for example, Movie S1). Nevertheless, there were evidently strong responses to her. How is this possible? It has been pointed out before that the phenomenon of presence in virtual environments is an important research question in its own right, closely related to the question of consciousness [9]. People tend to respond to virtual environments as if the objects and events depicted are real, in spite of low fidelity representations and certain knowledge that the events taking place are within a virtual reality. However, the perceptual and neural mechanisms that underlie this are largely unexplored.

The line of research opened up by Milgram stopped forty years ago due to ethical concerns, despite the tremendous importance of this work in the understanding of human behaviour. It has been argued before that immersive virtual environments can provide a useful tool for social psychology [26]. Our results reinforce this argument and show that virtual environments can provide an alternative methodology for pursuing laboratory-based experimental research even in this type of extreme social situation. For example, in future experiments within the Milgram obedience paradigm we plan to make the experimenter a virtual character, thus allowing manipulations of the type of person that the experimenter represents (for example, personality type, clothing, and so on) and also supporting a greater degree of conflict between the demands of the experimenter and the protests of the Learner than is possible when the experimenter is a real person.

The argument regarding the utility of virtual environments applies not simply to obedience research but to all social and psychological research where, for ethical or safety reasons, it is not possible to immerse experimental participants into the actual phenomena to be studied. For example, one of the motivations for our Milgram study was a longer term goal to explore ‘bystander behaviour’ in street violence. There is a well-known result in social psychology that counter-intuitively predicts, amongst other things, that the greater the size of a crowd that is watching street violence, the less likely it is that anyone will attempt to intervene to stop it. This is a vital area of current social-psychological research given the current level of perceived crime in urban areas – yet in order to study this researchers are forced at best to use videos that require people to judge likely responses to such situations [27], and the same techniques have been used in the Milgram obedience paradigm [28]. Milgram's own results clearly show that taking people's opinions about their own or others' behaviours in such circumstances at face value is far from reliable. We suggest that immersive virtual environments provide an alternative way forward in this area of research.