Implementation of the SR system

The SR system consists of three sub-modules: a recording module, an experience module and a control computer. The recording module (Fig. 1, left) was equipped with a microphone and a panoramic video camera, which recorded a panoramic movie that was then stored on the control computer. The experience module (Fig. 1, right) consisted of an HMD, a head-mounted camera, an orientation sensor, noise-cancelling headphones and the same microphone used by the recording module. The camera was mounted at the front of the HMD and the orientation sensor was mounted on a rim. The experience module alternately presented two different types of scenes: the first was a real-time scene captured by the head-mounted camera and the microphone (live scene) and the second was a scene that had been recorded and edited in advance by the recording module (recorded scene). During presentation of the recorded scene, the panoramic movie was cropped in real-time to fit the HMD display size. The cropped area was determined based on the participant's head orientation, which was obtained from the orientation sensor (i.e., when a participant turned to the left, the cropped area shifted accordingly). Therefore, provided that the head was kept in a stable position, natural visuo-motor coupling was ensured in both the live and recorded scenes. Additionally, by setting the head position close to the location where the panoramic camera had been placed when recording the movie, the visuo-motor experiences of live and recorded scenes were similar enough to be indistinguishable. In both scenes, an identical image was presented to each eye, meaning that there was no binocular parallax. In this way, participants' reality could be manipulated by covertly switching back and forth between the live scene and the recorded scenes.
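The orientation-dependent cropping described above can be illustrated with a minimal sketch. It is not the implementation used in the SR system: it assumes an equirectangular panorama whose columns span 360° of yaw and whose rows span 180° of pitch, with yaw 0 at the centre column, and all function names are hypothetical.

```python
def crop_columns(yaw_deg, pano_width, view_width):
    """Panorama column indices visible at a given head yaw.

    Assumption: columns span 360 degrees, yaw 0 maps to the centre column,
    and positive yaw is a turn to the right. Columns wrap around the seam.
    """
    centre = pano_width / 2 + yaw_deg / 360.0 * pano_width
    start = int(round(centre - view_width / 2))
    return [(start + i) % pano_width for i in range(view_width)]


def crop_rows(pitch_deg, pano_height, view_height):
    """Top and bottom row (bottom exclusive) visible at a given head pitch.

    Assumption: rows span 180 degrees and positive pitch means looking up.
    The window is clamped so that it never leaves the panorama.
    """
    centre = pano_height / 2 - pitch_deg / 180.0 * pano_height
    top = int(round(centre - view_height / 2))
    top = max(0, min(pano_height - view_height, top))
    return top, top + view_height


def crop_frame(frame, yaw_deg, pitch_deg, view_width, view_height):
    """Crop one panoramic frame (a list of pixel rows) to the display size."""
    cols = crop_columns(yaw_deg, len(frame[0]), view_width)
    top, bottom = crop_rows(pitch_deg, len(frame), view_height)
    return [[row[c] for c in cols] for row in frame[top:bottom]]


if __name__ == "__main__":
    # Toy 4 x 8 "panorama" whose pixels record their own (row, column).
    pano = [[(r, c) for c in range(8)] for r in range(4)]
    # Turning the head to the left shifts the cropped window to the left.
    print(crop_frame(pano, yaw_deg=0, pitch_deg=0, view_width=4, view_height=2))
    print(crop_frame(pano, yaw_deg=-90, pitch_deg=0, view_width=4, view_height=2))
```

Because the window is selected from a single fixed viewpoint, this kind of cropping reproduces changes in head orientation but not changes in head position.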

Figure 1 Substitutional Reality System. In the recording module (left), the panoramic view was recorded in advance by a panoramic camera and stored in the data storage connected to the control computer. In the experience module (right), either a live scene captured by a head-mounted camera or recorded scenes cropped from a pre-recorded movie were shown on a head-mounted display (HMD). The cropped area presented in the recorded scenes was determined in real-time using head orientation information calculated from the HMD orientation sensor. Scene examples are shown here. In the recorded scene, a person in a lab coat, who was not present in the live scene, waved his hand. When the covert switch from the live to the recorded scene was performed successfully, the participant believed that the person in the lab coat was physically present.

The experience of the SR system was determined by the scene sequence (including the live scene), which could be either fixed (as in Experiment I below) or manually adapted by the experimenters depending on the participants' responses. Such manual sequence manipulation is useful when more complex and interactive scene selection is required.

Performance of the SR system (Experiment I)

We assessed the performance of the SR system (n = 21, see Methods regarding Experiment I) by examining the following three points: (1) whether the SR system could covertly substitute reality successfully, (2) how a participant's CR was modulated when exposed to an unrealistic, extremely contradictory event and (3) whether we could re-establish participants' CR after they had explicitly noticed the substitution and the mechanism of the SR system.

To address these questions, we designed a sequence of scene presentations. A five-frame comic strip depicts how the sequence was presented (Fig. 2). We employed three scenes that were recorded prior to the experience session. The three scenes correspond, respectively, to the three questions described above. The first scene was designated the “Normal Question” scene, in which the experimenter appeared and asked several questions (e.g., “Do you feel OK with HMD?” or “Can you look around?”). In this case, the experimenter pretended to speak to the participant during the recording session, although the experimenter was actually speaking to the panoramic camera. The second scene was extremely contradictory and was referred to as the “Doppelgänger” scene, in which the participant appeared from the door with the experimenter, walked close to the panoramic camera, had a conversation with the experimenter (2–3 minutes) and walked out of the room. This scene was recorded when the participant was invited into the experimental room to receive instructions (Fig. 2a). The third scene was a “Fake Live” scene, in which the experimenter behaved as if he was talking in real-time, saying, “So, this is the live scene. I'm here. Can you tell?” (Fig. 2d).

Figure 2 A cartoon depiction for each step of Experiment I's sequence is shown. (a) During the recording session, the participant was invited into the room and received instructions about the experiment. During this time, everything was recorded for the Doppelgänger scene. (b) Normal Question scene. After the covert substitution from the live scene to the recorded scene, the participant replied naturally to the experimenter, indicating that the substitution was successful. (c) Doppelgänger scene. The participant saw himself, thereby realising that the scene he had experienced was not live. (d) Fake Live scene. The SR system worked even after the Doppelgänger scene. Seven of 10 participants could not detect that the given scene was recorded. (e) The Live scene after the Fake Live scene. The participant was not certain whether he was experiencing live or recorded scenes any more. See DISCUSSION. Colour bars at the right of each box indicate scene differentiation (orange for a live scene and green for a recorded scene). For convenience, the microphone and connection cables are omitted from the drawings.

During the experiment, we instructed the participants to sit back in the chair with their hands resting on their thighs and to freely look around the room, but not to look down at themselves because their body would not be visible in the recorded scenes. Each participant first experienced a live scene via the head-mounted camera and the microphone. During the live portion of the experiment, the experimenter asked questions that were similar to the ones asked in the Normal Question scene and confirmed that the HMD was comfortable. When the participant moved his/her head, the experimenter manually switched the live scene to the Normal Question scene. Switching during head movement enhanced substitution performance; this issue is examined in Experiment III. If the participant did not look around the room spontaneously, we asked him/her to do so. During the experiment, all of the participants verbally responded to the experimenter's questions in the Normal Question scene as if the scene were taking place in real time (Fig. 2b; additionally, see the supplementary video S1). Afterwards, all of the participants reported that they did not notice the switch and that they believed they were experiencing actual events throughout the entire session. This result shows that (1) without any prior knowledge about the SR system, people did not recognise the substitution and (2) an interaction could be established with people appearing in previously recorded scenes (in this case, a fake conversation including simple questions and responses).

Next, we switched the scene to the Doppelgänger scene (Fig. 2c and the supplementary video S1 at 1:32). When the participants saw themselves in the recorded scene, all of the participants became aware that they were not experiencing live reality. Not surprisingly, the Doppelgänger scene was too contradictory to maintain a CR.

Finally, we switched the scene to the Fake Live scene (Fig. 2d and the supplementary video S1 at 2:07). Ten of the 21 participants experienced this optional scene after the Doppelgänger scene. Seven of them could not detect that the given scene was a recorded scene. We confirmed this from their replies to the experimenter in the scene (e.g., “Yes, I know this is live, of course”), indicating that they had re-established CR. The remaining three noticed that the scene had been recorded previously, stating that they had noticed a difference in sound quality between the live and the Doppelgänger scenes and used this auditory difference as a cue in the Fake Live scene. At the end of Experiment I, we switched back to the live scene and explained that the previous Fake Live scene was also a recorded scene. The participants who did not detect the substitution during the Fake Live scene were often confused during this conversation because their conviction became uncertain (Fig. 2e and the supplementary video S1 at 2:45).

We observed an interesting behaviour in one participant during the Normal Question scene. The participant happened to raise his hand in front of his eyes, although he had been instructed not to do so. Although his hand was invisible to him, he did not notice the switch and continued to respond to the experimenter's (recorded) questions. After the experiment, he reported that he was confused when he could not see his hand, but he thought that he might have put his hand somewhere outside his field of view. Although an “invisible hand” would seem to be strongly contradictory, the reality substitution worked and the contradiction was compensated for with confabulation. This observation suggests that participants' CR can be maintained with strong conviction even in apparently contradictory situations.

In the following studies, we designed two verification experiments that manipulated two major factors (motion parallax and scene switch timing) to determine how they influenced substitution performance.

Difference in motion parallax (Experiment II)

When head position changes, the shape and depth of objects in the visual field change accordingly. Even when head position is stationary, changing the orientation of the head can alter the shape of objects (motion parallax). Although there was normal motion parallax in the live scenes, it was absent in the recorded scenes in the SR system because the viewpoint of the panoramic camera was fixed. Therefore, if the participants paid attention to the difference in motion parallax when changing their head position or orientation, they would be able to differentiate live and recorded scenes. However, it must be emphasised that none of the participants in Experiment I spontaneously noticed the difference in motion parallax, even after they were informed about the substitution trick. They kept looking around at visual objects at various depths (∼1.5 m), but could not use the parallax difference as a clue until we explained it. This suggests that the visuo-motor experience could be natural enough without motion parallax in the SR system and that object distance may play a minor role in influencing successful substitution.

To examine this proposal, in Experiment II we tested the effect of motion parallax on substitution performance when it was explicitly explained to participants and used by them as a discrimination clue. The participants (n = 10) were told about the mechanism of the SR system, then asked to sit alone in a room, where one red chair was placed in front of them (Fig. 3a–c). Each participant was asked to determine whether the scene he/she was viewing was live or recorded by monitoring the motion parallax around the red chair that was induced by his/her own head motion. There were three different distances (1.0 m/2.5 m/4.0 m) between the participant and the chair (Fig. 3a). In general, longer distances cause less motion parallax. To introduce a wide variety of head movements, participants received two instructions in a randomised order (Fig. 3b). With the “Head Only” instruction, the participants were asked to rotate their head without body displacement. With the “Head and Upper-body” instruction, the participants were asked to displace their upper body (i.e., move their shoulders) and change their head orientation to induce greater motion parallax. Figure 4a shows the correct detection rates for each distance. As we expected, the correct detection rate was higher with the “Head and Upper-body” instruction than with the “Head Only” instruction. However, a statistical comparison showed no significant differences between the three distance conditions under either instruction [Friedman test: p = 0.627]. Figure 4b shows the time lag between scene switching and correct detection in the six conditions. A two-way repeated measures ANOVA revealed a significant main effect of distance [F(2,18) = 4.85, p < 0.05], with no significant main effect of displacement [F(1,18) = 3.37, p > 0.05]. Multiple comparisons showed a significant difference between the 1.0 m and 4.0 m conditions (Scheffé's test: p < 0.01). There was no significant distance-by-displacement interaction (F = 0.0648, p = 0.94).
Although motion parallax is an important factor for SR system performance, the high and constant correct detection rates across the different distances indicate that object distance does not necessarily affect the subjective discriminability of scenes. This finding is consistent with the observation in Experiment I that participants did not spontaneously find the difference in motion parallax, even though they looked around at objects at different distances. It is important to note that further investigation applying the SR system in different environments is needed to generalise the results.
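The statement that longer distances cause less motion parallax follows from simple geometry: when the eye shifts sideways by d, the direction to an object at distance D straight ahead changes by atan(d/D). The sketch below evaluates this for the three chair distances used in Experiment II. The 0.10 m sideways shift is purely illustrative (it is not a measured value from the experiment), and the function name is hypothetical.

```python
import math


def parallax_shift_deg(lateral_shift_m, distance_m):
    """Change in the direction to an object (in degrees) after a sideways
    shift of the viewpoint, for an object initially straight ahead."""
    return math.degrees(math.atan2(lateral_shift_m, distance_m))


if __name__ == "__main__":
    # Assumed sideways head/upper-body shift; illustrative only.
    shift_m = 0.10
    # Chair distances taken from Experiment II.
    for distance_m in (1.0, 2.5, 4.0):
        angle = parallax_shift_deg(shift_m, distance_m)
        print(f"{distance_m:.1f} m: {angle:.2f} deg")
```

Under this assumption the angular change at 4.0 m is roughly a quarter of that at 1.0 m, which is the sense in which the three distances were expected to provide different degrees of motion parallax.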

Figure 3 Experimental Design of Experiment II. Two independent conditions were applied. (a) In the first condition, there were three different distances (1 m, 2.5 m and 4 m) from an object in the visual field, presumably providing different degrees of motion parallax. (b) In the second condition, there were two different instructions for head movement. With the “Head Only” instruction, the participants could only change their head orientation. With the “Head and Upper-body” instruction, the participants could move their upper body in addition to their head. In both cases, the participants were instructed to keep their eyes on the chair (the line of sight is indicated with a grey dashed arrow). (c) Temporal sequence of Exp. II for discriminating between live and recorded scenes. Live scenes or recorded scenes were pseudo-randomly selected and presented (10 sec) interspersed with a 3 sec fixation period. The participants were asked to report whether the scene was live or recorded by pressing a button.

Figure 4 Results of Experiment II. (a) Correct detection rates for the three distance conditions in Experiment II are indicated. All data were averaged across the participants (n = 10). No significant difference was observed (Friedman's test: p = 0.627) between conditions. (b) Response latencies for the six conditions are shown. A two-way repeated measures ANOVA revealed a significant difference between the distance conditions. The p-values were obtained through post-hoc analysis (Scheffé's test). Error bars indicate the mean ± standard error. * indicates significance levels (p < 0.05).

Head speed and detection rate of scene switching (Experiment III)

Even when head orientation was the same, the images from the live and recorded scenes could not be identical because of fluctuations in the orientation sensor and differences in motion parallax. Thus, the image inevitably slipped at the switch onset between the live and recorded scenes. In Experiment I, to prevent the participants from noticing this visual slip, we heuristically switched the scenes manually only when the participants moved their heads, so that the slip was perceptually masked during the scene transition. Although this worked well, it did not establish the appropriate range of head speeds for successful substitution. Here, we attempted to determine the optimal range of head speeds for successful switching in the SR system.

In Experiment III (Fig. 5), the participants were instructed to sit in a chair, to keep their head position stable according to the Head Only instruction from Experiment II and to look in different directions by turning their head intermittently at one of four speeds: “Motionless” (<32 deg/sec), “Slow” (32–64 deg/sec), “Fast” (64–96 deg/sec) and “Very Fast” (>96 deg/sec) (Fig. 5a). The speed of the “Very Fast” condition roughly corresponds to the speed attained when an individual turns around quickly. Head speed was monitored by the orientation sensor on the HMD and scene switching occurred when the speed exceeded the instructed speed (see Fig. 5b). Participants were asked to focus on the onset of the scene switch and to press a button on an interface box as soon as they detected the switch. Figure 6 shows the correct detection rates for the four speed conditions. A one-way repeated measures ANOVA revealed a significant main effect of speed (F(3,27) = 19.38, p < 0.01) (Fig. 6). Multiple comparisons showed significant differences between the “Motionless” condition (76 ± 2%) and the other three conditions (45 ± 3%, 36 ± 2% and 21 ± 2% for “Slow”, “Fast” and “Very Fast”, respectively) (Scheffé's test: p < 0.001), indicating that switch detection was easier when the participants did not move their head, with even “Slow” head motion significantly reducing detection performance. Detection of visual changes is known to decrease during head movements with an HMD (i.e., head movement suppression; ref. 18). The result suggests that the same suppression also occurred in our system, which hid the visual slip during the scene switch.
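The speed bands and the switching rule can be sketched as follows. This is a minimal illustration rather than the experimental software: the function names are hypothetical, the band edges are treated as half-open intervals (the text does not say which band a speed of exactly 32, 64 or 96 deg/sec belongs to), and the 5 to 15 sec waiting period is taken from the Figure 5 caption. How the threshold was handled for the “Motionless” condition is not specified in the text, so the threshold is left as a parameter.

```python
import random

# Speed bands from Experiment III, in deg/sec.
# Assumption: lower edge inclusive, upper edge exclusive.
SPEED_BANDS = [
    ("Motionless", 0.0, 32.0),
    ("Slow", 32.0, 64.0),
    ("Fast", 64.0, 96.0),
    ("Very Fast", 96.0, float("inf")),
]


def classify_head_speed(speed_deg_per_sec):
    """Return the name of the band that contains a head speed."""
    for name, low, high in SPEED_BANDS:
        if low <= speed_deg_per_sec < high:
            return name
    raise ValueError("head speed must be non-negative")


def draw_wait_sec(rng=random):
    """Waiting period before a switch may occur (5 to 15 sec)."""
    return rng.uniform(5.0, 15.0)


def first_switch_time(samples, threshold_deg_per_sec, wait_sec):
    """Time of the first sample at which a switch would be triggered.

    samples is an iterable of (time_sec, speed_deg_per_sec) pairs from the
    orientation sensor. A switch requires that the waiting period has
    elapsed and that head speed exceeds the threshold.
    """
    for time_sec, speed in samples:
        if time_sec >= wait_sec and speed > threshold_deg_per_sec:
            return time_sec
    return None


if __name__ == "__main__":
    # Made-up sensor trace, for illustration only.
    trace = [(0.0, 10.0), (4.0, 80.0), (6.0, 50.0), (7.0, 70.0), (8.0, 90.0)]
    print(classify_head_speed(70.0))
    print(first_switch_time(trace, threshold_deg_per_sec=64.0, wait_sec=5.0))
```

Gating the switch on head speed automates what the experimenter did by hand in Experiment I, where scenes were switched only while the participant's head was moving.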

Figure 5 Experimental Design of Experiment III. (a) Screenshot of the HMD. A bar at the bottom of the display indicated the current head speed and a vertical line indicated the target head speed threshold. When the head speed exceeded the threshold, the bar changed from red to green. The participants were asked to maintain the target head speed. Scene switches occurred only when the participant's head speed exceeded the target speed (bar in green). Participants were asked to press a button upon identifying a scene switch. (b) Example time course of Experiment III for detecting the switch with the “Fast” instruction. The green line indicates actual head speed. A switch occurred after a short time had passed (randomly chosen from 5 to 15 sec) and when the head speed exceeded the instructed head speed. A response (button press) within 3 sec of the switch occurring was considered a correct response.
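The scoring rule in the caption above (a button press within 3 sec of a switch counts as a correct response) can be expressed as a short sketch. The function names and the example trials are hypothetical, and treating a press at exactly 3 sec as correct is an assumption.

```python
def is_correct_detection(switch_time_sec, press_times_sec, window_sec=3.0):
    """True if any button press falls within the window after the switch.

    Assumption: presses before the switch do not count, and a press at
    exactly window_sec after the switch does.
    """
    return any(
        0.0 <= press - switch_time_sec <= window_sec
        for press in press_times_sec
    )


def detection_rate(trials, window_sec=3.0):
    """Fraction of switches detected.

    trials is a list of (switch_time_sec, press_times_sec) pairs.
    """
    hits = sum(
        is_correct_detection(switch, presses, window_sec)
        for switch, presses in trials
    )
    return hits / len(trials)


if __name__ == "__main__":
    # Made-up trials, for illustration only.
    example_trials = [
        (10.0, [11.2]),        # pressed 1.2 sec after the switch
        (25.0, [29.5]),        # pressed too late
        (40.0, []),            # no press
        (55.0, [54.0, 57.9]),  # one early press, one inside the window
    ]
    print(detection_rate(example_trials))
```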