Having set the stage, we now briefly consider a series of classic, textbook ideas from cognitive science. To avoid having this review grow unmanageable, we have selected nine topics that illustrate our point. (Table 1 lists other candidate phenomena that we could have addressed, but excluded for brevity.) Because each finding is well known, we provide only brief explanations, noting the basics.

Table 1 Twenty additional findings from cognitive science that appear challenging to explain from an embodied cognition perspective

Word frequency and related effects

The well-known Hebb (1949) learning rule states that concurrent activation of neurons strengthens the connections between them. Therefore, repeated exposure to a stimulus increases the fluency of neural subpopulations responding to its presence. In word perception, a person encodes a spoken or printed string, which automatically activates its corresponding meaning or syntax. The more commonly any word is experienced, the faster and more robust its lexical access becomes – the word frequency effect. Perhaps no experimental variable has pervaded the cognitive literature to a greater extent: Frequency affects every word-perception task, and it moderates the impact of other variables such that common words are largely immune to variations in other lexical dimensions, but rare words show many effects. In addition to laboratory measures such as lexical decision or naming, word frequency predicts eye fixation durations in reading (Inhoff & Rayner, 1986; Staub, White, Drieghe, Hollway, & Rayner, 2010) and ERP waveforms that occur before overt responses are generated (Polich & Donchin, 1988).
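As an illustration, the Hebbian mechanism just described can be caricatured in a few lines (a toy sketch with an arbitrary input pattern and learning rate, not any published model): each exposure applies the rule Δw = η · pre · post, so frequently experienced patterns come to evoke stronger, more "fluent" responses.

```python
import numpy as np

def hebbian_strength(pattern, n_exposures, lr=0.1):
    """Hebb rule: dw_i = lr * pre_i * post. On each exposure the input
    pattern (pre) co-occurs with an active output unit (post = 1),
    so connection weights grow with frequency of exposure."""
    w = np.zeros(len(pattern))
    for _ in range(n_exposures):
        post = 1.0                # output unit is active during the event
        w += lr * pattern * post  # concurrent activation strengthens weights
    return w @ pattern            # response evoked by the pattern afterward

word = np.array([1.0, 0.0, 1.0, 1.0])  # toy input pattern for a "word"
rare = hebbian_strength(word, 5)
common = hebbian_strength(word, 500)
assert common > rare  # higher frequency -> stronger, more fluent response
```

The point of the sketch is only that frequency effects fall out of simple associative strengthening, which is exactly why they demand a memory-based account.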

Beyond word perception, frequency effects also arise in recognition memory (e.g., the mirror effect; Glanzer & Adams, 1990), but the effect is flipped. Whereas high-frequency words show advantages in perception, low-frequency words show advantages in recognition memory. Because of its ubiquity, word frequency must be addressed by any viable model of word perception; the most prominent accounts are connectionist (neural network) models that track the statistical properties of large word corpora (e.g., Perry et al., 2010; Sibley, Kello, Plaut, & Elman, 2008). With respect to memory, word frequency is assumed to correlate with distinctiveness, allowing greater differentiation of targets and lures (e.g., Wagenmakers et al., 2004).

Can the core principles of EC help explain word frequency effects? The first principle is that “cognitive processing is influenced by the body.” Stated plainly, we cannot conceive of any embodied account of frequency effects without resorting to the trivially true notion that people's bodies are present every time they perceive (or produce) words. Even then, frequency effects would imply that different bodily states exist across trials in word-perception tasks, and that opposite bodily states exist across trials in recognition memory. In similar fashion, the second EC principle (that “cognition is situated” in the environment) cannot explain the frequency effect, unless it merely means that words are experienced in the environment. Finally, explaining frequency effects without representations appears impossible – they reflect a lifetime of linguistic memory, and that memory must reside in some form.

Perhaps more reasonably, an EC theorist could argue that word perception involves motor simulation, which becomes more fluent with expertise (see Footnote 1). For example, one might appeal to the classic motor theory (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; Liberman & Mattingly, 1985), which posits that speech perception is accomplished through recreation of the motor commands for speaking. Similar theories have been offered for handwritten word perception (Babcock & Freyd, 1988). By this view, word frequency could be construed as a motor fluency effect, consistent with EC.

Despite this possible account, matters become far more complicated for EC when other word-perception findings are considered. Specifically, there are myriad effects showing that perception of any given word is profoundly affected by its relations to other potential words in memory. Regularity and consistency effects (Monaghan & Ellis, 2002; Seidenberg, Waters, Barnes, & Tanenhaus, 1984) show that perception of a word, such as GAVE, is affected by the presence and “strength” of other similar words with different pronunciations (WAVE and SAVE are “friends,” but HAVE is an “enemy”). Neighborhood effects (Andrews, 1989; Ziegler & Muneaux, 2007) show that perception is affected by the sheer numbers of words that resemble any given word. Imageability and concreteness effects (Strain, Patterson, & Seidenberg, 1995) show that word perception can be affected by semantic factors. All these effects interact with word frequency. Considering EC, neither motor simulation nor “bodily influences” can explain effects that derive from stable relationships among “covert” words. Words that are not present in the environment, but exist in memory, affect the perception of words that are actually shown. All these effects (and the models that predict them) are inherently statistical, with fine-tuned tradeoffs among myriad, complex variables. They cannot reasonably be explained by reference to the body, or the environment, or a mind devoid of lexical representations.

Concepts and prototypes

People are capable of remarkable feats of categorization. When motivating the EC theory, Glenberg and Kaschak (2002) described Harnad’s (1990) symbol grounding problem: A foreigner lands in a Chinese airport, speaking no Chinese, with only a Chinese dictionary. This is characterized as an impossible problem because unknown symbols can only be mapped onto other unknown symbols. But, although the traveler cannot read the airport signs, is he entirely out of luck? If he stumbles across baggage claim, can he identify his suitcase on the conveyor belt? Can he discriminate employees from other passengers? Can he locate an exit and a taxi? The answer to all these questions is clearly “yes” – you can travel anywhere in the world and rely upon past experience to help you classify new objects or interpret situations.

People experience the world largely in categories, fluently recognizing tables and parrots and lemons that were never previously encountered. People also have strong intuitive ideas about category prototypes, their central tendencies or best representations (even for ad hoc categories, defined on the spot; Barsalou, 1983). Two major theories explain how people derive prototypes. According to prototype views (e.g., Homa, Sterling, & Trepel, 1981; Reed, 1972), as category exemplars are experienced, perceivers gradually abstract generalities across items, unconsciously generating prototypes, even without veridically experiencing them. According to exemplar views (e.g., Medin & Schaffer, 1978; Nosofsky, 1988), perceivers store each category example in memory; prototypes are emergent properties of the memory-trace population. (For an interesting discussion regarding the logical limitations of exemplar theories, see Murphy, 2015.) By either theory, perceptual classification is a hallmark of cognitive life: We constantly recognize new instances of known categories, using prior knowledge to mediate new perception.
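The contrast between the two theories can be made concrete with a minimal sketch (made-up stimuli and a free similarity parameter; not a fit of either published model): a prototype account compares a probe to the stored central tendency, whereas an exemplar account sums the probe's similarity to every stored trace.

```python
import numpy as np

def prototype_evidence(probe, exemplars):
    """Prototype view: compare the probe to the abstracted central
    tendency (here, simply the mean of the studied exemplars)."""
    proto = exemplars.mean(axis=0)
    return np.exp(-np.linalg.norm(probe - proto))

def exemplar_evidence(probe, exemplars, c=1.0):
    """Exemplar view: sum similarity to every stored trace; similarity
    decays exponentially with distance (cf. the generalized context model)."""
    dists = np.linalg.norm(exemplars - probe, axis=1)
    return np.exp(-c * dists).sum()

# Toy "category": random distortions around an unseen prototype
rng = np.random.default_rng(0)
prototype = np.zeros(4)
studied = prototype + rng.normal(0, 0.5, size=(20, 4))  # training distortions
far_item = prototype + 3.0  # a pattern remote from the category

# Both accounts favor the never-seen prototype over a distant pattern
assert prototype_evidence(prototype, studied) > prototype_evidence(far_item, studied)
assert exemplar_evidence(prototype, studied) > exemplar_evidence(far_item, studied)
```

Note that both toy models classify the unseen prototype well, which is why behavioral prototype advantages alone cannot discriminate the theories.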

Research on prototype abstraction largely stems from Posner and Keele (1968, 1970), who had participants learn to classify “dot pattern” stimuli into categories. Each categorized pattern was actually a distorted version of an unseen prototype, with different categories derived from different prototypes. In subsequent transfer tests, people classified old and new items, including the unseen prototypes. Posner and Keele (1968) found that prototypes elicited the best classification, relative to transfer patterns that were equally similar to other training patterns. Posner and Keele (1970) later found that if testing was delayed by a week, the unseen prototypes were remembered better than the actually studied patterns (Homa & Cultice, 1984; Omohundro, 1981). Similar results have been obtained hundreds of times, and in various populations, such as amnesics (Knowlton & Squire, 1993), newborn infants (Walton & Bower, 1993), and nonhuman animals (e.g., Smith, Redford, & Haas, 2008; Wasserman, Kiedinger, & Bhatt, 1988).

In what manner might EC help us understand prototype abstraction? More basic still, how might EC help explain ubiquitous perceptual classification, such as recognizing dogs or books? As stated by Glenberg et al. (2013, p. 573), “thinking is an activity strongly influenced by the body and the brain interacting with the environment.” Does this assertion illuminate how a person recognizes common objects, or learns the central tendency of dot patterns? Without adding numerous complex assumptions, there is no reasonable way to argue that “the body” plays any role in these common acts of categorization. Similarly, the environment cannot explain the data, nor can off-loading, nor culture, nor emotions. Prototype abstraction is a purely cognitive activity, rooted in the relationships among encoded memories. Moreover, people have rich conceptual structures that guide thinking and behavior. Is a canary a bird? What is a “doggier” dog, a dachshund or a golden retriever? Where might you go for an unpleasant vacation? When answering these questions, are your answers somehow explained by bodily states or potential actions? People possess so much general knowledge, with no appreciable connection to the body, that it seems untenable to posit embodiment as a basis for thinking.

Although we consider conceptual knowledge difficult to reconcile with EC, there have been prior attempts (e.g., Allport, 1985; Barsalou, 2008). For example, making conceptual judgments is slower when a person must “switch implied modalities” from one trial to another in an experiment (e.g., Barsalou, 1999; Pecher, Zeelenberg, & Barsalou, 2004), such as judging whether “lemons are tart,” followed by whether “thunder is loud.” Pecher et al. (2004) suggested that people use sensorimotor simulation while accessing conceptual knowledge in such a task. Although we do not question the results, we note that a vast array of conceptual questions do not logically entail sensorimotor dimensions, and the implied manner of simulation is far too scientifically flexible. Is titanium a metal? A person might answer “yes” by finding that knowledge in memory, or by internally simulating some experience wherein she touched titanium and realized that it felt like other metals. The latter possibility is more complex, still requires memory, and appears unmotivated. It is a theory of embodiment, simply for the sake of embodiment.

Short-term memory scanning

In Chapter 1 of his book Embodied Cognition, Shapiro (2011) nicely summarizes the classic study by Sternberg (1966) on short-term memory scanning. In this procedure, a person first memorizes a series of one to six digits. A moment later, a test digit is shown, and the person must quickly indicate whether it was in the original set. Sternberg hypothesized that such “scanning of short-term memory” might engage any of three processes. It could occur in parallel, which would create flat RT functions: “Yes” and “no” responses would be equally fast and unaffected by set size. Alternatively, scanning might occur in a serial, self-terminating manner, with the person serially searching working memory for a match to the test digit, responding “yes” when a match is found or continuing until all options are exhausted. This would create a pattern wherein RTs increase as set size increases, but the slope for “no” responses would be double the slope for “yes” responses. (The serial, self-terminating search nicely accords with common experience: Once you find your keys, you stop searching for them.) Finally, a person might serially scan the memory set, but always scan the entire set, even if a match is detected along the way (serial-exhaustive search). This would create a pattern wherein “yes” and “no” RTs would again overlap, both increasing with larger set sizes.
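The three hypothesized scanning processes translate directly into simple RT equations; a minimal sketch follows (the 250-ms decision time and 40-ms-per-item scanning rate are illustrative values in the range Sternberg estimated):

```python
def parallel_rt(set_size, matched, base=250, scan=40):
    """Parallel search: all items compared at once -> flat RT function."""
    return base + scan  # one comparison step; set size is irrelevant

def serial_self_terminating_rt(set_size, matched, base=250, scan=40):
    """Stop at the first match: 'yes' trials scan (n + 1) / 2 items on
    average, 'no' trials must scan all n, so the 'no' slope is double."""
    items = (set_size + 1) / 2 if matched else set_size
    return base + scan * items

def serial_exhaustive_rt(set_size, matched, base=250, scan=40):
    """Always scan the whole set: 'yes' and 'no' share one slope --
    the pattern Sternberg (1966) actually observed."""
    return base + scan * set_size

# Diagnostic signatures of each model
assert parallel_rt(2, True) == parallel_rt(6, True)          # flat
assert serial_exhaustive_rt(4, True) == serial_exhaustive_rt(4, False)
no_slope = serial_self_terminating_rt(6, False) - serial_self_terminating_rt(2, False)
yes_slope = serial_self_terminating_rt(6, True) - serial_self_terminating_rt(2, True)
assert no_slope == 2 * yes_slope                              # 2:1 slopes
```

The assertions simply restate the slope signatures that let Sternberg adjudicate among the three processes.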

The actual, surprising result matched the serial-exhaustive search prediction: RTs increased linearly with set size, with no divergence of “yes” and “no” trials. Thus, memory scanning is akin to searching your entire house for your keys, even after finding them. The counterintuitive result makes sense when considering that scanning time (approximately 40 ms per item) is very fast, whereas decision time (“yes” vs. “no”) is estimated to require about 250 ms. The original Sternberg (1966) study has been cited over 3,000 times (Google Scholar) and has inspired numerous empirical and theoretical extensions (Kahana & Sekuler, 2002; Monsell, 1978). For example, Nosofsky, Little, Donkin & Fific (2011) recently applied an exemplar-based random-walk model to the Sternberg paradigm, fitting an impressive array of data, including RT distributions from individual participants.

How might EC help us understand the Sternberg memory-scanning paradigm? Even Shapiro (2011) offered no embodied account: The speed of internal scanning is too fast to correspond with simulated action, and there is no reasonable way to attribute the results to the environment, perception–action loops, or a mind without representations. From a scientific perspective, there appears to be little gained from asserting that short-term memory scanning is rooted in bodily experience.

Priming effects

In any word-perception task (e.g., lexical decision, naming, identification), there are myriad and robust priming effects, costs and benefits based on recent context. Priming arises in both perception and memory, from various underlying relationships. Arguably the strongest are repetition priming effects, which arise in word perception (Forster & Davis, 1984; Scarborough, Cortese, & Scarborough, 1977) and memory (Jacoby & Dallas, 1981). In repetition priming, an item is presented at Time 1 and repeated at Time 2, with wide variations across experiments in terms of materials, tasks, and delays between repetitions. The effects are profound, changing perceptual fluency, feelings of memory, neural habituation, and other measures. There are also numerous form-priming effects, wherein perception of a word (e.g., clock) is affected by preceding words that are orthographically or phonologically similar (e.g., flock, click).

In some regards, priming effects appear consistent with EC. For example, assume that word perception inherently involves motor simulation of the articulatory gestures used for word production. It becomes easy to predict that repeated simulation will become fluent. Other priming results also naturally emerge from EC. For example, the modality effect shows that repetition priming is stronger when both word presentations occur in the same modality (e.g., visual–visual), rather than changing modalities across repetitions (Scarborough, Gerard, & Cortese, 1979). Priming appears loosely tethered to the perceptual channel used for encoding, which seems consistent with an embodied account, relative to accounts based on abstract underlying symbols.

Once again, however, broader examination of priming quickly undermines any logical connection to EC, unless we resort to trivial truisms. Consider semantic priming (e.g., Becker, 1980; Meyer & Schvaneveldt, 1971): Processing a word such as bread improves processing of the related word butter (relative to a “neutral” prime, such as #####), and impairs processing of an unrelated word, such as giraffe. Semantic priming seems to entail both automatic activation of semantic neighbors in memory and strategic expectancy effects (Neely, 1977). With respect to EC, how might we explain semantic priming? Does the body explain why doctor primes nurse? Those are moderately “embodied” concepts, but what about sky priming cloud, or China priming Japan? The word light can create priming effects for switch, heavy, dark, weight, bulb, and house. In trying to explain such effects, do we gain any leverage from asserting that “cognitive processes are influenced by the body”? Do they suggest a mind without representations? Clearly, priming is guided by the person’s immediate environment (i.e., the presented words), but this statement is theoretically empty. Finally, there is a rich literature on masked priming, wherein primes are subliminal, yet create patterns of semantic and form priming (Abrams, Klinger, & Greenwald, 2002; Kinoshita, 2006; Lupker & Davis, 2009). Such effects help elucidate what happens when lexical representations receive an activation boost, without strategic responding by observers. We cannot envision any reasonable embodied account that predicts or explains subliminal priming.

Face perception

People are both remarkably good and remarkably poor at face perception, a domain that demonstrates perception, attention, and memory working together. Imagine that you are at the airport, waiting to meet someone as they exit the terminal. Depending upon whom you are meeting, the experiences may differ dramatically. Perhaps you are meeting someone for the first time, but she has provided a description: “I’m blonde and will be wearing a red jacket.” Given this clue, you can tune visual attention, allowing red to “pop out” from the crowd, then focusing on each person who catches your eye. But, even if you see a person matching this description, “blonde” is a broad and common category, so several false alarms may occur before an eventual hit. Alternatively, perhaps you have seen a photograph of the person. This would allow you to scan the crowd, pausing to consider potential matches, and eventually spot someone who is probably correct. But if the person has changed hairstyles since the photograph was taken, the task will be challenging, with high potential for a miss. Finally, perhaps you are picking up a spouse or close friend. In this case, you can disengage attention almost entirely, loosely scanning the crowd, confident that your eyes will be drawn to your familiar target, regardless of clothing or variations in appearance.

In the foregoing example, the various target individuals differ only in their familiarity to the observer. In more cognitive terms, they differ in the degrees to which they allow top-down matching from memory. In visual search, speed and accuracy are powerfully affected by the quality of internal target representations (Hout & Goldinger, 2014). An expert radiologist can detect CT anomalies better than a novice; you can find your own child quickly on a crowded playground. In face perception, there are profound performance differences based on top-down knowledge. Given unfamiliar faces, observers are surprisingly poor at detecting whether two photos depict the same person (Megreya & Burton, 2006, 2008; Papesh & Goldinger, 2014). Consider the photo-ID matching task shown in Fig. 1, with the simple task of deciding whether each license photo matches the adjacent person. Even knowing that half the examples are mismatching faces, the task is quite challenging (see Footnote 2).

Face matching is very different when viewing familiar people. Anecdotally, it is trivially easy to recognize a close friend, even if she changes hair color. Familiar actors are easily recognized across movies. In a recent study, Jenkins et al. (2011) had U.K. participants sort 40 photographs into separate piles, such that each pile should only contain photographs of the same person. Unknown to participants, only two individuals (both Dutch celebrities) were included in the set of 40 photographs. No participants accurately sorted the photographs into two piles, with 7.5 piles as the median performance. In contrast, nearly all Dutch participants (for whom the celebrities were familiar) sorted the photographs into two piles. Similarly, people may attend reunions once every 10 years, but easily recognize hundreds of old friends (Bahrick, Bahrick, & Wittlinger, 1975).

Another robust effect is the own-race bias (ORB): People are better at discriminating among (unknown) members of their own race, relative to other races (Chiroro & Valentine, 1995; Goldinger, He, & Papesh, 2009; Meissner & Brigham, 2001; Valentine & Endo, 1992). This effect does not reflect inherent differences in physiognomic variability across races and is not strongly predicted by racial attitudes. Instead, it seems to emerge as a function of perceptual expertise (the contact hypothesis), developing early in childhood (Kelly et al., 2007). Some findings suggest that own-race faces are processed more holistically than other-race faces (Michel, Rossion, Han, Chung, & Caldara, 2006), allowing own-race faces to be classified in a higher dimensional space. The ORB is widely observed in face learning, memory and neural-processing measures. But, as with face matching, the ORB does not affect the perception of familiar faces, which appear to enjoy “special” status.

Taking these ideas together, face perception is surprisingly error-prone when processing unknown faces, especially from other races. On the other hand, familiarity confers robust face recognition, despite myriad changes in appearance, age, or context. Even without familiarity, people fluently appreciate faces as high-dimensional perceptual objects, instantly classifying them with respect to sex, race, age, attractiveness, and emotional state. For known people, however, we are often sensitive to subtle cues that strangers might not appreciate. Face perception therefore illustrates a general principle in cognitive science: In perception, classification, and memory, theories must account for both the generality and specificity of knowledge. For example, people can appreciate dogs as a category, and can discriminate dachshunds, Dalmatians, and Pomeranians. But they can also recognize their own dogs as familiar pets.

In EC, a recurring theme is that embodiment connects cognitive, cultural and emotional processing (Glenberg, 2010). Of all topics considered thus far, embodiment appears best positioned to address face perception. Faces generate expressions by virtue of motor commands, leading to visible displays that are easily simulated and imitated. Faces (and people) move, allowing the full leverage of perception–action loops for tracking changes over time. Having conceded these points, we still cannot understand how face perception “works” from an EC perspective. How do bodily influences predict the own-race bias? What EC principle predicts the dramatic changes that arise between known and unknown people? Perhaps, once we become familiar with someone (even indirectly, as with famous actors), we develop fluent routines for simulating their idiosyncratic facial gestures. Given this hypothesis, why are there such dramatic differences between recognizing static images of known and unknown people? Finally, although EC is claimed to encompass cultural and emotional processing, the mechanisms to achieve such connections are unexplained.

Faces are visible and expressive parts of other people's bodies and seem perfect for embodied theories of person perception. Yet, we quickly encounter the same conceptual barriers as before: How can EC explain large psychological effects that clearly derive from stored knowledge? In everyday cognitive life, you can scan a crowded room and easily spot your friend, an amazing perceptual feat. The fluency and stability conferred from known faces cannot be attributed to bodily states, or cues in the environment, or a mind without representations. Although person perception likely involves perception–action loops, they are not sufficient. In ecological psychology, there are principles to explain how a person tracks and intercepts a Frisbee (e.g., optic flow, tau). But what if there are multiple Frisbees in the air, and the perceiver must catch only his own? Perhaps all the flying Frisbees belong to the perceiver and, once they are airborne, he is told to “catch the one you bought last month.” Now, personal memory must be used in concert with ecological principles. Although perception–action coupling is critical to achieving the goal (as in McBeath et al., 1995), even “strongly embodied” behaviors are easily understood to require a broad array of psychological processes. Face perception has all the hallmarks of embodiment, but EC fails to address its inherently cognitive dimensions.

Serial recall

In a simple memory task, people may hear a series of words, then later recall them (either while trying to preserve order, or in any order). The results can be plotted, showing recall rates as a function of each word’s position in the original list (McCrary & Hunter, 1953; Deese & Kaufman, 1957). In almost all cases, items are best recalled from the beginning and ending of the list (the primacy and recency effects, respectively), leading to a U-shaped serial position curve (SPC; Murdock, 1962). The SPC is a classic result, easily replicated in a classroom and across numerous changes of materials, modes of presentation, and participants (Eslinger & Grattan, 1994).

The most common account of the SPC posits that the primacy and recency effects reflect different memory systems. Early items are rehearsed and transferred into long-term memory (LTM), allowing later retrieval. Late items are still active in short-term memory (STM) when testing begins, and can thus be recalled if no distraction occurs (e.g., Rundus, 1971). Consistent with this theory, behavioral manipulations elicit double dissociations of the primacy and recency effects. For example, presenting items faster decreases the primacy effect but leaves the recency effect unchanged. Conversely, distracting participants just after list presentation will eliminate the recency effect but leave the primacy effect unchanged. Different forms of brain damage selectively modulate each effect, leaving the other untouched.
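This dual-store logic can be caricatured in a few lines (toy probabilities chosen for illustration, not fit to any dataset): long-term strength falls off across serial positions, the last few items remain in a short-term buffer, and either route suffices for recall, yielding the U-shaped curve. Setting the buffer span to zero mimics post-list distraction, eliminating recency while leaving primacy intact.

```python
import math

def serial_position_curve(list_len=15, stm_span=3):
    """Toy dual-store model: LTM strength falls off with serial position
    (early items get the most rehearsal); the last few items are also
    retrievable from STM. Recall succeeds via either route."""
    curve = []
    for pos in range(list_len):
        p_ltm = 0.2 + 0.6 * math.exp(-pos / 2)               # primacy gradient
        p_stm = 0.9 if pos >= list_len - stm_span else 0.0   # recency buffer
        curve.append(p_ltm + (1 - p_ltm) * p_stm)            # independent routes
    return curve

curve = serial_position_curve()
assert curve[0] > curve[7] and curve[-1] > curve[7]  # U-shaped SPC

# "Distraction" empties STM: recency vanishes, primacy is untouched
distracted = serial_position_curve(stm_span=0)
assert distracted[-1] < curve[-1] and distracted[0] == curve[0]
```

The dissociation in the last two lines is the behavioral signature that a theory positing no STM/LTM division must somehow reproduce.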

Unlike prior topics in this review, the theoretical division of STM and LTM has been directly addressed in the EC literature, most notably by Glenberg (1997). In our view, his theory is fairly schematic, with no clear account for the enormous empirical literature on serial recall. The first claim is that memory reflects modality- and effector-specific interactions with the world (Barsalou, 2008; Glenberg, 2010), meaning either real or simulated sensorimotor experiences. We cannot discriminate this claim from any cognitive theory, wherein memories reflect real or imagined experiences. The second claim is that memory is not dissociable into systems or subsystems. Glenberg (1997) explicitly rejected the hypothesis of short- and long-term stores, stating that STM is simply an “illusion.” Many cognitive theories posit continuity between these systems, for example, suggesting that STM is an activated subset of LTM (e.g., Cowan, 1993). However, by positing no division at all, it appears difficult for EC to predict primacy and recency effects, or to accommodate all the neurological and behavioral data for dissociations.

Speaking more generally from EC principles (rather than focusing on one specific article), we arrive at a familiar impasse. The data are simple: Words are presented in serial order, but recall creates a U-shaped function, with leading and trailing branches that are independently affected by different manipulations. As before, we must ask how the body, or the environment, or sensorimotor simulations create this pattern. Although we do not advocate for any particular model, the cognitive literature offers numerous computational models that address serial recall. Such models can predict the SPC and related effects; many produce impressive quantitative fits across dozens of experiments. The response from EC is a blanket rejection of the principles that motivate those models, with no coherent alternative explanation.

Generalization in psychological space

In classic research on associative learning in dogs, Pavlov (1927) famously discovered that if some signal (e.g., a bell) consistently preceded the delivery of food, the dogs would quickly learn its predictive value, and the signal could then trigger salivation alone. He also discovered stimulus generalization: Other sounds could also trigger salivation, with stronger responses for sounds that more closely resembled the original signal. In the following decades, generalization became a bedrock principle of learning and behavior: Once a person or animal learns something about stimulus X, that learning will generalize to stimulus Y, as a function of the perceived similarity between X and Y. Generalization can take many forms, such as perceptual confusion, slower discrimination, or implicit biases (e.g., a man dislikes his boss, then feels irrational hostility toward other people who resemble his boss).

Regardless of the organism or stimuli involved, the generalization gradient is a function that describes the “drop-off” in responding as the similarity between learned and novel stimuli decreases. Although learning theorists (such as Hull, 1943) were eager to discover a systematic function governing generalization, they became discouraged: When physical stimulus differences were measured, many different gradients were observed. Moreover, gradients differed across species, and across individual animals or people. Decades later, Shepard (1987) proposed a solution – a universal law of generalization is achievable when relations among stimuli are cast in psychological space, such as one derived using multidimensional scaling. Shepard showed that, once stimuli are properly represented in this abstract space, generalization across stimuli decreases exponentially with their psychological distance. He then derived a mathematical theory wherein simple geometric assumptions predict the exponential gradient, across numerous conditions. The concepts from Shepard’s theory have been expanded (Chater & Vitányi, 2003) and are critical to models of perceptual classification (e.g., the generalized context model; Nosofsky, 1984, 1988).
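Shepard's law itself is compact enough to state directly (a minimal sketch; the coordinates stand in for positions recovered by multidimensional scaling, and the sensitivity parameter is a free constant):

```python
import math

def generalization(x, y, sensitivity=1.0):
    """Shepard (1987): g(x, y) = exp(-c * d(x, y)), where d is distance
    in psychological (e.g., MDS-derived) space, not physical space."""
    d = math.dist(x, y)
    return math.exp(-sensitivity * d)

trained = (0.0, 0.0)              # coordinates in a 2-D psychological space
near, far = (0.5, 0.0), (2.0, 0.0)

assert generalization(trained, trained) == 1.0  # identical stimuli
assert generalization(trained, near) > generalization(trained, far)
```

The crucial move is that the exponential form holds only over distances in the abstract space, which is precisely the representational commitment the text argues EC cannot accommodate.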

Without articulating many new assumptions, it appears impossible for EC to explain (or coherently address) lawful generalization across items in psychological space. As Shepard (1987, p. 1318) wrote, “Analogously in psychology, a law that is invariant across perceptual dimensions, modalities, individuals, and species may be attainable only by formulating that law with respect to the appropriate abstract psychological space.” By definition and design, the principles of EC are exceedingly concrete, such as bodily cues and actions, movement in the environment, external resources, and cognition without representations. None of these ideas comport with stimulus relations inside abstract psychological spaces. The universal law of generalization is an elegant achievement in cognitive science. It is inconsistent with EC, not only because EC is too vague to allow mathematical formulation, but because its core tenets directly contradict the critical ideas that make Shepard’s law possible.

Mental rotation

Among all topics in cognitive science, mental imagery is perhaps the most challenging to study with scientific rigor. A person may affirm that she is imagining some object or action, creating activity that registers in fMRI, but how can we evaluate the substance of her imagery? The best known approach is the mental rotation procedure, developed by Shepard and Metzler (1971). In this task, a person is shown two figures and must quickly decide whether they are identical, or mirror images of each other. The objects are misaligned, with orientations that mismatch along the vertical axis to various degrees. Shepard and Metzler’s data were striking: RTs to correctly classify “same” pairs increased in linear fashion as the angle of rotation increased, suggesting that people mentally rotated one image, relative to the other, until they could appreciate a match. Since the original study, hundreds of experiments have replicated and extended mental rotation, finding similar results across objects and procedures (e.g., Cooper & Shepard, 1973; Jolicoeur, 1985).

When it comes to mental rotation, there is considerable evidence that motor activity accompanies mental imagery, although EC does not provide a complete account. In behavioral data, Wexler, Kosslyn, and Berthoz (1998) observed systematic patterns of facilitation and interference when people performed concurrent mental and physical rotations. Dozens of neuroimaging studies have shown activity in premotor and motor cortices (among other brain regions) during mental rotation. These studies typically indicate motor-related activity as a fundamental correlate of mental rotation (Cohen & Bookheimer, 1994; Richter, Somorjai, Summers, & Jarmasz, 2000; Vingerhoets, de Lange, Vandemaele, Deblaere, & Achten, 2002; see Zacks, 2008, for a meta-analysis). When motor cortex is stimulated using TMS, mental rotation performance changes (Ganis, Keenan, Kosslyn, & Pascual-Leone, 2000). Unlike prior topics in this review, mental rotation is influenced by the body and is accompanied (at least concurrently) by motor simulation. Moreover, in keeping with the spirit of this article, the embodiment hypothesis makes sense with respect to mental rotation: The task does not clearly require stored representations, the psychological process has a clear physical counterpart, and the task naturally recruits brain regions that typically guide object manipulation in space. Nevertheless, it remains challenging to argue that EC helps to explain mental imagery in a broader sense: Although mental rotation recruits motor systems, how might we address other forms of imagery (such as conjuring a mental image of a rose) that lack corresponding motoric tasks? In our view, the more reasonable claim is that the human mind can recruit motor knowledge when it benefits a task, but motor knowledge cannot explain other common forms of mental imagery.

Sentence processing

The preceding sections have focused on classic findings from cognitive science (e.g., semantic priming), without regard to their presence or absence in the EC literature. In this final section, we specifically focus on the most prominent finding that motivates EC. The action-sentence compatibility effect (ACE; Glenberg & Kaschak, 2002) is a hallmark finding in EC, implicating the motor system in language comprehension. In the ACE paradigm, people make sensibility judgments about sentences that imply movement either toward or away from themselves. For example, concrete sentences might be, “Close the drawer,” or “You tossed the keys to Christine.” Experiments may also include abstract sentences such as, “You told Mike about the theory” versus “Mike told you about the theory,” implying movement away from and toward the participant.

Glenberg and Kaschak (2002) developed an innovative method, allowing them to examine whether overt motor behaviors interact with (theorized) motor simulation during language processing. Participants made “yes/no” sensibility decisions using a special response box with buttons located near to and far from the body, plus a central key that served as a launching point. The “sensible” response button was located either near or far, such that responding involved moving the arm either toward or away from oneself. When sentence-implied movements matched the required response movements, reading times (the latency between sentence onset and releasing the “start” key) were relatively fast. When the implied and intended movements were incompatible, reading times were slower. As Glenberg and Kaschak (2002, p. 558) wrote, “These data are consistent with the claim that language comprehension is grounded in bodily action, and they are inconsistent with abstract symbol theories of meaning.”
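To make the design concrete, the ACE's 2 × 2 logic (sentence-implied direction crossed with response-movement direction) can be sketched in a few lines. The RT values below are hypothetical placeholders for illustration, not Glenberg and Kaschak's data:

```python
# Illustrative sketch of the ACE design: sentence-implied movement direction
# is crossed with the direction of the required response movement.
from itertools import product

IMPLIED = ["toward", "away"]   # e.g., "Open the drawer" vs. "Close the drawer"
RESPONSE = ["toward", "away"]  # location of the "sensible" response button

def predicted_rt(implied, response, base=1500, cost=100):
    """Toy prediction (ms): incompatible trials incur a latency cost.
    The base and cost values are hypothetical, not empirical estimates."""
    return base if implied == response else base + cost

for implied, response in product(IMPLIED, RESPONSE):
    label = "compatible" if implied == response else "incompatible"
    print(f"{implied:>6} sentence / {response:>6} response: "
          f"{label}, predicted RT = {predicted_rt(implied, response)} ms")
```

The compatibility effect is simply the RT difference between the mismatching and matching cells of this design.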

The ACE is widely cited as evidence that language comprehension is embodied, rather than symbolic. As of January 2015, Glenberg and Kaschak (2002) had been cited over 1,400 times (Google Scholar). The ACE has also motivated numerous studies examining motor activity during word or sentence perception (including behavioral data, neuroimaging, EMG measures, and TMS interference; Borreggine & Kaschak, 2006; Buccino et al., 2005; Chersi, Thill, Ziemke, & Borghi, 2010; de Vega, Moreno & Castillo, 2013; de Vega & Urrutia, 2011; Glenberg et al., 2008a, b; Kaschak & Borreggine, 2008; Nazir et al., 2008; Pulvermüller et al., 2005; Sato et al., 2008; Zwaan & Taylor, 2006). These studies have typically produced results consistent with the EC view of language processing. For example, Pulvermüller et al. (2005) used MEG to show that processing action verbs produces premotor and motor activity within 200 ms of word onset. Across studies, the typical account is that sentence processing requires internal simulation that recruits corresponding sensorimotor brain areas (an idea often linked to mirror neurons; e.g., Glenberg & Gallese, 2012).

The ACE has generated considerable debate, with authors questioning the results and interpretation (e.g., Arbib, Gasser, & Barrès, 2014; Mahon & Caramazza, 2008; Weiskopf, 2010). Our goal is to address a broader issue: Once researchers have defined an arena for scientific inquiry, there is a strong tendency for other researchers to focus on that arena. In the case of Glenberg and Kaschak (2002) and many following studies, there has been a strong focus on motor-related words and phrases. Many theorists have noted that purely abstract language poses a challenge to embodied accounts of language, and some EC theorists have conceded that hybrid theories may be required (e.g., Zwaan, 2014). Despite this concession, we must ask our familiar question: As with word frequency, prototype abstraction, and other findings, does EC really help explain sentence processing?

Here is the problem, stated plainly: In the present article, the vast majority of sentences cannot be “simulated,” or mapped onto actions, in any transparent manner. That is true of this sentence, the prior one, and nearly every one before it. Consider the earlier sentence: “Familiar actors are easily recognized across movies.” This is a perfectly legitimate sentence, but it offers no obvious (or subtle) route to simulation. Even though our opening paragraph described actions being performed by a young woman, it included sentences such as: “Upon seeing an unfamiliar car in a numbered parking spot, she wonders whether new neighbors have moved in downstairs.” This sentence is readily understood and can be visually imagined, but how exactly would the motor system intercede in comprehension? If vanishingly few sentences are suitable candidates for motor simulation (this sentence, once again, is not), then positing simulation as a core principle is theoretically empty.

To their credit, Glenberg and Kaschak (2002) recognized this issue in their original article. They dismissed it, however, with a flourish of speculation, using the chimerical power of affordances. As they wrote (p. 563):

What is the scope of this analysis? Clearly, our data illustrate an action-based understanding for only a limited set of English constructions. Furthermore, the constructions we examined are closely associated with explicit action. Even the abstract transfer sentences are not far removed from literal action. Although we have not attempted a formal or an experimental analysis of how to extend the scope of the [indexical hypothesis], we provide three sketches that illustrate how it may be possible to do so. Consider first how we might understand such sentences as “The dog is growling” or “That is a beautiful sunset.” We propose that language is used and understood in rich contexts and that, in those rich contexts, some statements are understood as providing new perspectives—that is, as highlighting new affordances for action. Thus, while taking a walk in a neighborhood, one person may remark that an approaching dog is quite friendly. A companion might note, “The dog is growling.” This statement is meant to draw attention to a new aspect of the situation (i.e., a changing perspective), thereby revealing new affordances. These new affordances change the possibilities for action and, thus, change the meaning of the situation. A similar analysis applies to such sentences as “That is a beautiful sunset.” The statement is meant to change the meaning of a situation by calling attention to an affordance: The sunset affords looking at, and acting on this affordance results in the goal of a pleasurable experience.

We have two general responses to this quote. First, in mechanical terms, we cannot conceive of any language comprehension system that would allow a person to appreciate the affordances of a sunset as a precondition to understanding a sentence about that sunset. The claim is that motor simulations (or situational affordances) are integral to linguistic processing, but what system could theoretically activate such high-level semantics before the sentence itself is processed? This problem arises even for clear “motor” sentences, such as “Jane handed David the stapler.” Although “handing something” could activate a motor simulation, how would the rest of the sentence (two people and a stapler) become part of that simulation, in advance of sentence understanding? There are well-known theories in word perception (e.g., Harm & Seidenberg, 2004) wherein semantic features can generate top-down feedback to facilitate perception, typically for words that are “disadvantaged” (low-frequency, inconsistent words; Strain et al., 1995). Such a system could be conceived for motoric features, which are conceptually akin to concreteness, but their potential role is logically limited to a small set of sentences.

Second, we are powerfully struck by the similarity between Glenberg and Kaschak’s (2002) speculation and the earlier quote from Chomsky (1959). As presented, affordances are wholly unconstrained. Given the hypothesis that context constrains interpretation, we could doubtless find many confirming examples. However, we could also generate thousands of sentences with no contextual relevance (or affordances), and people would readily understand them all. “Few people realize it, but Hitler adored paintings of kittens.” An appeal to affordances does not address the motor simulation hypothesis, and it renders the embodied account untestable. Taking the EC principles in turn, the claim that language perception is “fundamentally embodied” or entails motor simulation is untenable. There are far too many sentences (like this one) wherein “simulation” makes no sense. Appealing to the environment (or context-specific affordances) does not help, because countless sentences are understandable without connections to context. Finally, explaining sentence perception without internal representations appears hopeless.