Technologies often imitate natural objects, giving rise to artificial diamonds, artificial flowers, artificial fur, and countless other artefacts. How are we to judge the success of such imitations? In 1950, Alan Turing proposed an influential answer for the specific case of artificial intelligence: an imitation is successful when we cannot distinguish it from the real thing (Turing, 1950). In his original argument, Turing imagined a human evaluator engaged in natural language conversations with a real human and a computer designed to generate human-like responses. The evaluator would be informed that one of the two partners is a computer, and asked to determine which one. To focus the evaluation on quality of thought rather than quality of speech, the dialogue would be mediated by text only (e.g. keyboard and screen). If the evaluator cannot reliably distinguish the computer from the human, the computer is said to pass the test.

As a target of imitation, intelligent conversation is enormously complex. No current machine appears close to passing the Turing test. However, the logic of the test itself is straightforward, and provides a means for assessing the maturity of imitation technologies generally: given the imitation alongside the real thing, can an observer tell which is which?

Here we bring this logic to bear on a much more tightly circumscribed imitation technology - artificial faces (see Fig. 1). The past decade has seen increasing interest in the realism of computer-generated faces (Holmes, Banks, & Farid, 2016; Nightingale, Wade, & Watson, 2017). Our concern is artificial face images of a very different kind, specifically, unretouched photos of artificial faces in the real world. Images in this category differ from digital images in at least two important ways. First, digitally generated or manipulated images are not snapshots of the physical environment. They only exist in print and on screen, and that limits the ways in which viewers can encounter them. Our focus is physical artefacts that exist in the real world and are caught on camera. Second, digital image manipulation has been a part of mainstream media for a generation. As such, the level of public understanding that images may be “photoshopped” is high. One consequence of this development is that photorealistic images carry less evidential weight than they once did - all images are suspect in this sense (see Kasra, Shen, & O’Brien, 2018). Since the real world cannot be photoshopped in the same way, physical artefacts are more protected from this slide in credibility.

Fig. 1 Schematic illustrating parallels between the standard Turing test (left) and a similar test for synthetic faces (right). In both cases, an evaluator is given the task of trying to determine which presentation is the genuine article and which is the imitation. The evaluator is limited to using a computer interface to make the determination.

Artificial faces in the real world may not be intended to pass for genuine faces, even when they strive for realism in some respect. A marble bust might capture the proportions of a real face, but none of the movement; a robotic head might capture some facial movement, but remain disembodied. Hyper-realistic silicone masks differ from these examples in that they are worn by a real person, and so are seen in the context of a real body. Moreover, they are constructed from a flexible material, so they relay the wearer’s rigid and non-rigid head movements - at least at the gross scale (e.g. head turns; opening and closing of the mouth). These characteristics set hyper-realistic masks apart from other artificial faces, as they allow them to be fully embedded in natural social situations (see Fig. 2 for examples).

Fig. 2 Example trials from the Caucasian image set. Each mask image was randomly paired with one real-face image from the set, independently set for each participant. Correct answers from left to right: Z, M, Z, Z, M. For source information, see Additional file 1.

These natural social situations place unusual demands on imitation technologies, as humans tend to be especially attuned to social stimuli. Face perception offers abundant evidence of such tuning. For example, humans are predisposed to detect face-like patterns (Robertson, Jenkins, & Burton, 2017), and this tendency is present from early infancy (Morton & Johnson, 1991). Faces capture our attention (Langton, Law, Burton, & Schweinberger, 2008; Theeuwes & Van der Stigchel, 2006), and having captured attention, tend to retain it (Bindemann, Burton, Hooge, Jenkins, & De Haan, 2005). While viewing a face, we make inferences about the mind behind it, including emotional state from facial expression (Ekman & Friesen, 1971; Ueda & Yoshikawa, 2018; Young et al., 1997) and direction of attention from eye gaze (Baron-Cohen, Wheelwright, Hill, Raste, & Plumb, 2001; Friesen & Kingstone, 1998). We also use faces to identify individual people (Burton, Bruce, & Hancock, 1999; Burton, Jenkins, & Schweinberger, 2011), which can trigger retrieval of personal information from memory (Bruce & Young, 1986). All of these processes require high sensitivity to subtleties of facial appearance. There is even some evidence that these processes can become tuned to specific populations through social exposure. For example, children tend to be better at recognising young faces than old faces (and vice versa; Anastasi & Rhodes, 2005; Neil, Cappagli, Karaminis, Jenkins, & Pellicano, 2016); Japanese viewers tend to be better at recognising East Asian faces than Western faces (and vice versa; O’Toole, Deffenbacher, Valentin, & Abdi, 1994). Perhaps most relevant for the current study, discrimination between faces and non-face objects can be accomplished rapidly and accurately. Using saccadic reaction times, Crouzet, Kirchner, and Thorpe (2010) found that viewers could differentiate images of faces versus vehicles at 90% accuracy in under 150 milliseconds - significantly faster than discriminations that did not involve faces. The findings of Crouzet et al. (2010) were based on images from different categories. Nevertheless, they provide an interesting baseline against which to compare the more nuanced discriminations investigated here.

Taken together, these findings suggest that faces may be particularly difficult objects to imitate. Faces attract the glare of attention, and details of their appearance convey socially significant information. Even so, there is some evidence that hyper-realistic silicone masks can pass for real faces, at least in certain situations. In a previous study (Sanders et al., 2017), passers-by consistently failed to notice that a live confederate was wearing a hyper-realistic mask, and showed little evidence of having detected the mask covertly. Out of 160 participants in the critical condition, only two spontaneously reported the mask, and only a further three reported the mask following prompting. These low detection rates are consistent with the idea that hyper-realistic masks successfully imitate real faces. However, several aspects of the experimental procedure complicate this interpretation. For example, masks were not mentioned during the main phase of data collection, and participants had no reason to expect to see a mask. It is possible that participants might have detected the masks more often had they been expecting them. Moreover, responses were collected in a live social setting. It is possible that respondents were reluctant to inspect or to discuss the appearance of a person who was physically present (albeit out of earshot) - and especially reluctant to declare that person’s face to be artificial.

These matters of interpretation arise in part from our approach to testing, which prioritised ecological validity over experimental control. Here we adopt the complementary approach of two-alternative forced-choice (2AFC) testing, which strikes the opposite balance (see Bogacz, Brown, Moehlis, Holmes, & Cohen, 2006, for a review). The 2AFC method originated in psychophysical research (Fechner, 1860/1966), where it was developed to measure quantities such as perceptual acuity. Our application is closer in spirit to the Turing test, in that our main interest concerns the realism of artificial stimuli.

In 2AFC testing, the participant is presented with two stimuli, one of which is the target, and must choose which of the two is the target. This contrasts with the tasks that we used previously (Sanders et al., 2017; Sanders & Jenkins, 2018), in which participants viewed individual stimuli and made categorical judgements. There are several reasons why the proposed 2AFC testing should sharpen observers’ ability to distinguish hyper-realistic masks from real faces. First, the task instructions ensure that participants are aware in advance that masks will be presented. Second, social influence is minimised, as the task is computer based. Third, the task always involves two stimuli at a time: one is always a mask and the other is always a real face. Thus, even when participants are uncertain whether one of the images is the target, they can still solve the task indirectly if they are certain about the other image.
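To make the structure of a 2AFC trial concrete, the following is a minimal sketch rather than the procedure actually used in the study. It assumes that each mask image is paired with a different, randomly chosen real-face image, that the mask is equally likely to appear on the left or the right, and that the response keys map as Z = mask on the left, M = mask on the right. The function names, file names, and the exact binomial comparison against chance are all hypothetical and chosen for illustration only.

```python
import random
from math import comb


def build_trials(mask_images, real_images, rng):
    """Pair each mask with one randomly chosen real face (assumed: no real
    face is reused, so real_images must be at least as long as mask_images)
    and randomise the side on which the mask appears."""
    partners = rng.sample(real_images, k=len(mask_images))
    trials = []
    for mask, real in zip(mask_images, partners):
        mask_on_left = rng.random() < 0.5
        left, right = (mask, real) if mask_on_left else (real, mask)
        # Assumed key mapping: Z = mask on the left, M = mask on the right.
        trials.append({"left": left, "right": right,
                       "correct_key": "Z" if mask_on_left else "M"})
    rng.shuffle(trials)  # randomise presentation order
    return trials


def proportion_correct(trials, responses):
    """Fraction of trials on which the response matched the correct key."""
    hits = sum(r == t["correct_key"] for t, r in zip(trials, responses))
    return hits / len(trials)


def p_at_least(hits, n, chance=0.5):
    """Exact one-sided binomial probability of scoring at least `hits`
    correct out of `n` trials by guessing alone."""
    return sum(comb(n, k) * chance**k * (1 - chance)**(n - k)
               for k in range(hits, n + 1))


if __name__ == "__main__":
    rng = random.Random(1)  # a fresh seed per participant gives independent pairings
    masks = [f"mask_{i:02d}.jpg" for i in range(5)]  # hypothetical file names
    reals = [f"real_{i:02d}.jpg" for i in range(8)]
    trials = build_trials(masks, reals, rng)
    perfect = [t["correct_key"] for t in trials]  # an observer who never errs
    print(proportion_correct(trials, perfect))    # 1.0
    print(p_at_least(5, 5))                       # 0.03125
```

Because every trial contains exactly one mask and one real face, chance performance is 50%, which is why the sketch compares observed accuracy against a guessing rate of 0.5.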

To test for other-race effects in this task, we collected data in both the UK and Japan. Although other-race effects are most strongly associated with identity-based tasks, such as face recognition (Meissner & Brigham, 2001) and face matching (Megreya, White, & Burton, 2011), our question here is whether they can also arise when distinguishing real faces from other face-like stimuli (Robertson et al., 2017) - a task more akin to face detection. The live viewing study by Sanders et al. (2017) could not address this point fully, as in naturalistic settings, the base-rate probabilities of encountering own-race and other-race faces are not well matched. Moreover, participants had no insight into the probability of a mask being present, even in the laboratory-based experiments. The 2AFC task gets around these limitations by allowing us to present own-race and other-race items equally often. We expect that equating background probabilities in this way will allow us to reach a more definitive answer.