Auditory speech perception is known to categorize continuous acoustic inputs into discrete phonetic units, a property generally considered crucial for the acquisition and preservation of phonological classes in a native language. Somatosensory input associated with speech production has also been shown to influence speech perception when the acoustic signal is unclear, inducing shifts in category boundaries. Using a specifically designed experimental paradigm, we demonstrate that in the absence of auditory feedback, somatosensory feedback on its own enables the identification of phonological categories. This finding indicates that the auditory and orosensory correlates of speech articulation could jointly contribute to the emergence, maintenance, and evolution of phonological categories.

Auditory speech perception enables listeners to access phonological categories from speech sounds. During speech production and speech motor learning, speakers experience matched auditory and somatosensory inputs. Accordingly, access to phonetic units might also be provided by somatosensory information. The present study assessed whether humans can identify vowels using somatosensory feedback, without auditory feedback. A tongue-positioning task was used in which participants were required to achieve different tongue postures within the /e, ε, a/ articulatory range, in a procedure that was entirely nonspeech-like, involving distorted visual feedback of tongue shape. Tongue postures were measured using electromagnetic articulography. At the end of each tongue-positioning trial, subjects were required to whisper while holding the reached vocal tract configuration, with auditory feedback masked by noise, and to identify the vowel associated with the reached tongue posture. Masked auditory feedback ensured that vowel categorization was based on somatosensory rather than auditory feedback. A separate group of subjects was required to classify the whispered sounds auditorily. In addition, we modeled the link between vowel categories and tongue postures in normal speech production with a Bayesian classifier based on the tongue postures recorded from the same speakers for several repetitions of the /e, ε, a/ vowels during a separate speech production task. Overall, our results indicate that vowel categorization is possible with somatosensory feedback alone, with an accuracy similar to that of the auditory perception of whispered sounds, and in congruence with normal speech articulation, as accounted for by the Bayesian classifier.

Producing speech requires precise control of vocal tract articulators in order to perform the specific movements that give rise to speech sounds. The sensory correlates of speech production are therefore both auditory (associated with the spectrotemporal characteristics of sounds) and somatosensory (related to the position or shape of the vocal tract articulators and to contacts between articulators and vocal tract boundaries). While the propagation of sounds is the means through which linguistic information passes between speakers and listeners, most recent models of speech motor control [DIVA (1), FACTS (2), HSFC (3), Bayesian GEPPETO (4), and ACT (5)] posit that both auditory and somatosensory information is used during speech production for the planning, monitoring, and correction of movements. The crucial role of auditory information has been documented in experiments using bite blocks or lip tubes (6, 7), in which articulation was shown to be reorganized in order to preserve the acoustic characteristics of speech. The importance of somatosensory information in speech production has been shown in studies in which external perturbations of jaw movement induced compensatory reactions (8, 9). A study of speech production in cochlear-implanted patients who switched their implants on and off (10) has provided evidence that combining both sensory inputs results in greater accuracy in speech production.

Auditory speech perception involves the categorization of speech sounds into discrete phonetic units (11, 12), and neural correlates of phonetic representations have been found in the superior temporal gyrus (13). However, it is unknown whether somatosensory information in speech can also be categorized by listeners in a similar and coherent way. In nonspeech tasks, there is extensive evidence of the use of somatosensory information for categorization as, for example, in tactile object recognition (14), but in speech, no study so far has addressed whether speakers are able to identify phonemes based on somatosensory information.

In the present study, we provide evidence that phonological representations can be accessed from somatosensory information alone. Our results indicate that participants are able to recognize vowels, based on tongue postures, without any contribution of auditory feedback. This finding required the design of a paradigm adapted to the specificities of the tongue. Indeed, unlike studies of limb movement, in which tests of somatosensory processing can be conducted by using a robotic device to passively displace the limb, inducing passive displacement of the tongue is very challenging: the tongue is difficult to access inside the oral cavity and highly resistant to displacement. In our paradigm, speakers were instead required to position their tongue using visual feedback in a task that provided no information about the shape and location of the tongue. We show that subjects succeeded in this positioning task, although some speakers were more successful than others. We then tested whether, once they had positioned their tongue, speakers were able to provide a somatosensory categorization of the vowel associated with the reached tongue position. At each reached tongue posture, subjects were asked to whisper in order to enable a subsequent independent auditory evaluation of their articulation by a group of external listeners. The speakers' auditory feedback was masked by noise in order to ensure that categorization was based on somatosensation only.

We found that speakers were able to identify vowels based on tongue somatosensory information, and there was good concordance with listeners' judgments of the corresponding sounds. Finally, we show that subjects' somatosensory classification of vowels was close to the classification provided by a Bayesian classifier constructed separately from subjects' vowel articulations recorded under normal speech conditions, i.e., with phonation and auditory feedback. These results suggest that phonological categories are specified not only in auditory terms but also in somatosensory ones. They support the idea that in the sensory–motor representation of phonetic units, somatosensory feedback plays a role similar to that of auditory feedback.

Results

In order to assess whether vocal tract somatosensory information can be used for phonetic categorization, we first needed a means to instruct subjects to achieve a set of tongue postures without relying on normal speech movement or auditory feedback that might provide nonsomatic cues. We designed a tongue-positioning task using electromagnetic articulography (EMA) to visually guide eight subjects (henceforth somatosensory subjects) toward different target tongue postures, evenly spanning the articulatory range of the three vowels /e, ε, and a/ (Fig. 1B). Nine target tongue postures were specified for each subject, including three vowel tongue postures corresponding to normal productions of the vowels /e, ε, and a/, and six intermediate tongue postures distributed regularly over the same tongue vowel workspace. Vowel tongue postures were recorded with EMA for each subject during a preliminary speech production task involving repetitions of vowels under normal speech conditions (Materials and Methods). Fig. 1 shows the placement of the EMA sensors and illustrates the set of nine target tongue postures for one representative subject (the sets of targets for the other subjects are presented in SI Appendix, Fig. S3).

Fig. 1. (A) Sagittal view of EMA sensors (black dots) and (B) example of target tongue postures from one subject. Targets include three vowel tongue postures (black lines) and six additional intermediate tongue postures (gray dashed lines) distributed regularly over the /e-ε-a/ tongue workspace. The solid line on the top of B represents the subject’s palate trace.

On each trial of the tongue-positioning task (Fig. 2A), one of the nine target tongue postures was displayed on a screen under a spatial transformation intended to remove visual information about tongue position and shape (Fig. 2B). Target positions were always displayed in the same horizontally aligned configuration at the center of the screen, ensuring that targets in all trials looked the same (red circles in Fig. 2A, Bottom). Subjects were provided with real-time visual feedback of their tongue movements according to the spatial transformation shown in Fig. 2B and were instructed to 1) move their tongue in order to reach the displayed target within a 5-s time interval (reaching phase), 2) hold the reached tongue posture and whisper with auditory feedback masked by noise (whispering task), and 3) identify the vowel associated with the reached tongue posture (somatosensory identification task).

Fig. 2. (A) Design of each trial of the tongue-positioning, whispering, and somatosensory identification tasks. (Top) The real positions of sensors (sagittal view) corresponding to the target (red circles and lines) and sensors corresponding to the subject's tongue (black circles and lines). (Bottom) The modified positions of sensors as displayed to the subjects. The lip target sensors were displayed as vertically aligned circles in the left part of the display and were intended to help subjects keep their lip position constant during the task. (Bottom Right) The three-alternative forced-choice display of the somatosensory identification task. (B) Illustration of the spatial transformation used for the visual display in the tongue-positioning and whispering tasks. Red circles and lines correspond to the target tongue posture, and black circles and lines correspond to the actual tongue shape. The target tongue posture is transformed in such a way that all segments become equal in length and horizontally aligned. The actual tongue shape is then deformed so as to preserve δᵢ and αᵢ after transformation.
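To make the transformation in Fig. 2B concrete, the sketch below shows one plausible implementation, assuming (since the text does not give the exact formulas) that δᵢ denotes the distance of each actual sensor from its target sensor and αᵢ the angle of that offset relative to the local target segment direction; names and conventions here are illustrative only.

```python
import numpy as np

def transform_display(target, actual, spacing=1.0):
    """Illustrative sketch of the Fig. 2B display transformation.

    target, actual: (N, 2) arrays of midsagittal sensor positions.
    Target sensors are remapped onto a horizontal line with equal
    segment lengths; each actual sensor is redrawn so that its offset
    distance (delta_i) and angle (alpha_i) relative to the local
    target segment direction are preserved.  A plausible
    reconstruction, not the authors' exact implementation.
    """
    n = len(target)
    # Transformed targets: equally spaced points on a horizontal line.
    t_disp = np.stack([np.arange(n) * spacing, np.zeros(n)], axis=1)

    a_disp = np.empty_like(t_disp)
    for i in range(n):
        j = min(i, n - 2)                      # local segment index
        seg = target[j + 1] - target[j]
        seg_angle = np.arctan2(seg[1], seg[0])

        off = actual[i] - target[i]            # sensor offset from target
        delta_i = np.hypot(off[0], off[1])     # preserved distance
        alpha_i = np.arctan2(off[1], off[0]) - seg_angle  # preserved angle

        # Rebuild the offset relative to the (horizontal) display segment.
        a_disp[i] = t_disp[i] + delta_i * np.array(
            [np.cos(alpha_i), np.sin(alpha_i)])
    return t_disp, a_disp
```

Because every target maps to the same horizontal configuration, all trials look alike regardless of which target is displayed, which is what removes visual cues about absolute tongue position and shape.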

Importantly, the target tongue postures used in the tongue-positioning task were chosen to sample the articulatory workspace from /e/ to /a/ via /ε/ as comprehensively and as evenly as possible, in order to obtain a set of well-distributed articulatory configurations for the primary aim of the study: evaluating the use of somatosensory information in the identification of these vowels. Thus, for the tongue-positioning task to be carried out correctly, it was not required that the tongue postures which subjects produced at the end of the reaching phase exactly match the displayed target tongue postures, but rather that, overall, the set of tongue configurations uniformly sample the articulatory workspace for the vowels studied here.

The sounds that were whispered by the eight somatosensory subjects were identified by eight additional listeners (henceforth auditory subjects) in a forced-choice (/e/, /ε/, or /a/) auditory identification task. This perceptual evaluation is crucial since it is the only way to assess whether or not the tongue position described by the EMA sensors corresponded to a vowel and whether or not it was associated with clear auditory characteristics in the /e/–/ε/–/a/ vowel range. However, it is also crucial that the whispering phase did not influence the somatosensory identification performed by somatosensory subjects, by providing either auditory or motor cues that might have helped them in their identification task. In regard to possible auditory cues, we took a number of precautions to minimize the likelihood that subjects could hear themselves during the whispering phase with auditory feedback masked by noise. To check the effectiveness of this approach, we asked subjects to report whether or not they could hear themselves whisper, and no subject reported that he/she could. We also evaluated the acoustic power of the whispers and found it to be more than 40 dB below the masking noise, a level previously shown to make the auditory perception of vowels in French impossible (15) (see SI Appendix for details). In regard to possible motor cues, as is commonly observed in postural control, subjects did not strictly maintain a stable tongue posture during the whispering phase (see SI Appendix, Fig. S1, for a quantification of this articulatory variability). To check the possibility that these small tongue movements could have provided helpful motor cues for the somatosensory identification task, we carried out a number of analyses of these articulatory variations, which are described in SI Appendix. We found no indication supporting such a possibility. In particular, we found no evidence suggesting that these small movements were directed toward vowel tongue postures. These observations indicate that the somatosensory identification task was not biased by auditory or motor cues during whispering.
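As a minimal illustration of the acoustic check reported above, the level difference between a whisper and the masking noise can be estimated from RMS amplitudes; this sketch assumes both recordings are expressed on the same calibrated scale (the authors' exact measurement procedure is described in SI Appendix):

```python
import numpy as np

def level_difference_db(whisper, noise):
    """Level of the whisper relative to the masking noise, in dB.

    Negative values mean the whisper is below the noise; the paper
    reports differences of more than 40 dB (i.e., below -40 dB here).
    Assumes both signals share the same calibrated amplitude scale.
    """
    rms = lambda x: np.sqrt(np.mean(np.square(np.asarray(x, dtype=float))))
    return 20.0 * np.log10(rms(whisper) / rms(noise))
```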

The analysis of the data was divided into two main parts. The first part was devoted to the evaluation of the tongue-positioning task. This first part is a prerequisite for the second, since subjects who did not perform the tongue-positioning task correctly cannot be expected to succeed in the somatosensory identification task. More specifically, we assessed whether the participants reached tongue postures 1) that varied continuously and smoothly over the whole /e, ε, and a/ articulatory range and 2) that corresponded to sounds that could be reliably identified auditorily as a vowel in the /e, ε, and a/ range. Tongue postures meeting these evaluation criteria will henceforth be referred to as vowel-like tongue postures.

In the second part of the study, we assessed performance in the somatosensory identification task. This assessment involved three main steps. First, we evaluated the separability of the three vowel categories as obtained by somatosensory categorization. Second, we compared the somatosensory categorization with the auditory categorization provided by the auditory subjects who evaluated the whispered sounds. Finally, we compared the somatosensory categorization with the outcome of a Bayesian classifier that relied on tongue postures recorded during normal speech movements.
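The text does not spell out the classifier's form at this point; one standard realization consistent with the description, sketched below under that assumption, fits a multivariate Gaussian per vowel to the posture vectors recorded during normal production and labels a new posture by its posterior probability under uniform priors:

```python
import numpy as np

class GaussianVowelClassifier:
    """Sketch of a Bayesian classifier over tongue postures.

    Each vowel is modeled as a multivariate Gaussian fitted to the
    posture vectors (e.g., 4 EMA sensors x 2 coordinates = 8 dims)
    recorded during normal productions of /e/, /E/, /a/.  A Gaussian
    class-conditional model with uniform priors is one standard
    choice; the paper's exact model may differ.
    """

    def fit(self, postures_by_vowel):
        # postures_by_vowel: dict mapping vowel -> (n_rep, n_dim) array
        self.params = {}
        for vowel, X in postures_by_vowel.items():
            mu = X.mean(axis=0)
            # Regularize the covariance: few repetitions per vowel.
            cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
            self.params[vowel] = (mu, np.linalg.inv(cov),
                                  np.linalg.slogdet(cov)[1])
        return self

    def posterior(self, x):
        # Gaussian log-likelihoods; uniform priors cancel out.
        logp = {v: -0.5 * ((x - mu) @ prec @ (x - mu) + logdet)
                for v, (mu, prec, logdet) in self.params.items()}
        m = max(logp.values())
        w = {v: np.exp(l - m) for v, l in logp.items()}
        z = sum(w.values())
        return {v: wv / z for v, wv in w.items()}

    def classify(self, x):
        post = self.posterior(x)
        return max(post, key=post.get)
```

A reached tongue posture is then assigned the vowel maximizing the posterior, so the classifier's output can be compared directly with the subjects' somatosensory answers.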

Stage 1: Evaluation of the Tongue-Positioning Task.

Articulatory analysis of the tongue-positioning task. Fig. 3 shows, for each of the participants in the study, the set of tongue postures reached in the tongue-positioning task, superimposed on the average tongue configurations measured for /e, ε, and a/ during normal speech production. It can be seen that for subjects S3, S6, S7, and S8 the set of reached tongue postures (Fig. 3, Bottom, left-hand side of each panel) uniformly covers the /e, ε, and a/ range, whereas for the remaining subjects (S1, S2, S4, and S5; Fig. 3, Top, left-hand side of each panel) there were noticeable deviations from the average vowel postures. In order to quantify this observation, we first evaluated whether the set of reached tongue postures actually covered the expected /e, ε, and a/ range of tongue configurations. We also assessed quantitatively whether these tongue postures were uniformly distributed over the range and direction associated with the set of target tongue postures. We conducted two principal component analyses (PCAs) for each subject, one using their set of target tongue postures and the other using their set of reached tongue postures. Details about this analysis are provided in SI Appendix. We summarize the main results below.

Fig. 3. (A) Distribution of the set of 90 tongue postures reached by each somatosensory subject across trials in the tongue-positioning task. For each subject, the set of reached tongue postures (gray lines) is represented on the left-hand side of each panel, together with the average /e, ε, a/ tongue postures (black lines) obtained from the speech production task (Materials and Methods). The right-hand side of each panel presents the distribution of reached tongue postures in the main /e, ε, a/ direction (represented vertically), as determined by a PCA carried out on the set of target tongue postures. (B) Clustering of the eight subjects in two classes (elliptical contours drawn by hand), based on the proportion of variance explained by the first two principal components describing the set of reached tongue postures.

The PCA on the nine target tongue postures showed that the target tongue posture workspace was well described by a single dimension for all subjects but S2. The second PCA showed that only subjects S3, S6, S7, and S8 produced reached tongue postures that were also well represented by a single dimension (see the two clusters in Fig. 3B). Moreover, for these four subjects we also found a good match between the single direction characterizing the target tongue postures and the one characterizing the reached tongue postures. We also tested whether or not the reached tongue postures covered the task workspace uniformly. To do so, we estimated the densities of the reached tongue postures of each subject along the dimension defined by the first principal component of the target tongue postures. In each panel of Fig. 3A, the right side shows the resulting distribution over the range of the nine target tongue postures. For all subjects apart from S1 and S5 we observe fairly uniform distributions, and this was confirmed by a Kolmogorov–Smirnov test.
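For concreteness, the sketch below shows how the two PCA-based criteria and the uniformity check just described could be computed with scikit-learn and SciPy; centering and rescaling choices are assumptions and may differ from the authors' analysis (SI Appendix):

```python
import numpy as np
from scipy.stats import kstest
from sklearn.decomposition import PCA

def evaluate_positioning(targets, reached):
    """Illustrative version of the positioning-task evaluation.

    targets: (9, n_dim) target tongue postures for one subject;
    reached: (90, n_dim) postures reached across trials.  Returns the
    share of variance on the first principal component of each set
    and a Kolmogorov-Smirnov test of uniform coverage along the main
    target direction.
    """
    pca_t = PCA().fit(targets)
    pca_r = PCA().fit(reached)

    # One-dimensionality of each workspace.
    var1_targets = pca_t.explained_variance_ratio_[0]
    var1_reached = pca_r.explained_variance_ratio_[0]

    # Project reached postures onto the main target direction and
    # rescale to [0, 1] over the target range.
    axis = pca_t.components_[0]
    t_proj = (targets - pca_t.mean_) @ axis
    r_proj = (reached - pca_t.mean_) @ axis
    u = np.clip((r_proj - t_proj.min()) / np.ptp(t_proj), 0.0, 1.0)

    # Uniform coverage: KS test against the uniform distribution.
    ks_stat, p_value = kstest(u, 'uniform')
    return var1_targets, var1_reached, ks_stat, p_value
```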
Auditory analysis of the adequacy of the reached tongue postures. In order to investigate the adequacy of the reached tongue postures for purposes of somatosensory vowel identification, we evaluated whether the whispered sounds associated with these postures were compatible with vowel production. To do so, subjects' whispered productions were classified by a separate group of listeners (auditory subjects). First, we analyzed the consistency of the responses provided by auditory subjects in the auditory identification task, henceforth called auditory answers. Second, we inferred from the set of auditory answers given for each whispered sound a canonical auditory label for that sound (see Materials and Methods for details) and assessed whether these auditory labels were associated with well-separated clusters of tongue postures.

Consistency of Auditory Answers Across Auditory Subjects. We assessed the consistency of the auditory classification of each reached tongue posture recorded during the tongue-positioning task by computing the entropy of the distribution of auditory answers attributed to each of the whispered utterances (see SI Appendix for details). We expected that whispers which sound like whispered vowels would be labeled consistently by auditory subjects and would therefore result in distributions of auditory answers with low entropy. On the other hand, we expected that whispers which do not sound like whispered vowels would be labeled with greater uncertainty, resulting in distributions of auditory answers with greater entropy (close to uniformly random answers). Fig. 4 presents violin plots of the distribution of entropies of auditory answers for each whispered sound produced by each of the somatosensory subjects. It can be seen that the whispers of all but two somatosensory subjects (S4 and S5) have low average entropy, with violin plots being larger at the base than at the top. This means that most of the whispers produced by the somatosensory subjects were classified consistently across auditory subjects. Statistical analysis confirmed that the distribution of entropy differs across subjects (Kruskal–Wallis χ² = 123.95, P < 0.001). Pairwise comparisons revealed that the entropy distributions for subjects S4 and S5 were significantly greater than for the others (P < 0.01, Wilcoxon tests with Benjamini–Hochberg correction for multiple comparisons). This indicates that the whispers produced by these two subjects conveyed little information about vowel identity, presumably because the associated reached tongue postures did not correspond to one of the sounds /e, ε, and a/ under test.

Fig. 4. Distribution of entropy of auditory answers for each whisper produced by each somatosensory subject (violin plots). Small entropy values correspond to whispers with low uncertainty about vowel identity; in particular, zero entropy values correspond to whispers that were given the same label by all auditory subjects. Data points correspond to the different whispers performed across trials by the somatosensory subjects during the positioning task. They are distributed over 10 entropy values, which correspond to the possible entropy values for three vowel labels and eight auditory answers (see SI Appendix for details). The dashed line indicates the theoretical maximum uncertainty (entropy of 1.1) of auditory labeling. Gray bars represent the average entropy of the auditory answers for each somatosensory subject.
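The entropy measure can be made concrete as follows; entropy in nats is assumed here, consistent with the stated maximum of 1.1 ≈ ln 3 for three labels:

```python
import numpy as np
from collections import Counter

def answer_entropy(labels):
    """Entropy (in nats) of the auditory answers given to one whisper.

    labels: the eight auditory subjects' vowel labels for one whisper,
    e.g. ['e', 'e', 'E', 'a', 'e', 'e', 'E', 'e'].  Zero means all
    listeners agreed; the maximum for three labels is ln(3) ~ 1.1,
    the dashed line in Fig. 4.  With 8 answers over 3 labels, only
    10 distinct entropy values can occur, as noted in the caption.
    """
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))
```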

Consistency and Separability of the Clusters of Reached Tongue Postures Associated with the Three Auditory Labels. In normal speech production there is substantial similarity of tongue posture across repetitions of the same vowel (consistency across repetitions), and these tongue postures can also be reliably distinguished from the tongue postures associated with repetitions of another vowel (separability across vowels). In order to check whether the tongue-positioning task reproduced these speech characteristics, we asked whether or not the set of tongue postures associated, for example, with sounds that listeners judged as the vowel /a/ was different from the set of tongue postures associated with sounds that listeners judged as /ε/. We assessed the auditory labels assigned by the set of auditory subjects by evaluating the consistency and the separability of the grouping of tongue postures made on the basis of these labels. We expected that whispers carrying relevant information for vowel classification should be associated with articulatory characteristics that are 1) consistent within each category and 2) different enough across categories to preserve their distinctiveness. Hence, if the four EMA sensors are good descriptors of the posture of the tongue, the auditory labels for whispers that sound like one of the /e, ε, and a/ vowels should be associated with quite compact and well-separated clusters of reached tongue postures. In the case of whispers that do not sound like one of these vowels, the clusters of reached tongue postures should be wide (more variable) and largely overlap one another. Silhouette scores provide a measure of clustering quality in terms of consistency and separability, by comparing how well each data point belongs to its own cluster as compared to its neighboring clusters (16). Our silhouette analysis assigns to each tongue posture a value in the [−1, 1] range, with positive values corresponding to data well inside their cluster, values close to 0 corresponding to data near the boundary of their cluster, and negative values corresponding to data closer to a neighboring cluster than to their own (see SI Appendix for details). The gray bars in Fig. 5, Middle, show for each somatosensory subject the average silhouette score of clusters of reached tongue postures based on auditory labels. We call these scores auditory clustering scores. They are arranged in Fig. 5 in descending order from left to right. Fig. 5, Middle, also shows average silhouette scores of clusters of reached tongue postures based on somatosensory labels (see below for an analysis of somatosensory clustering). Examples of tongue clusters associated with high and low average silhouette scores are shown in the upper left and upper right panels of Fig. 5, respectively (figures corresponding to the other subjects are presented in SI Appendix, Fig. S5). It can be seen that for a subject with high silhouette scores (S6; Fig. 5, Left), the clusters of tongue postures are well separated, with moderate overlap between auditory vowel categories. In contrast, the clusters of a subject with low silhouette scores (S5; Fig. 5, Right) show strong overlap. Furthermore, for well-separated clusters, the tongue postures associated with the auditory labels /e/, /ε/, and /a/ are correctly distributed from high to low around the tongue postures characteristic of each vowel (black lines in Fig. 5, Left).

Fig. 5. Clustering of the reached tongue postures associated with auditory and somatosensory labels. (Middle) Bars show the auditory and somatosensory clustering scores obtained for each somatosensory subject, arranged in descending order (from left to right) of auditory clustering scores (light gray bars). Auditory clustering scores are significantly different from chance (P < 0.01) for all subjects except S4 and S5. Somatosensory clustering scores are significantly different from chance (P < 0.01) for all subjects. (Left and Right) The clusters of reached tongue postures associated with each vowel category for representative subjects with (Left) good and (Right) poor clustering scores ([Top] auditory scores and [Bottom] somatosensory scores). For each vowel category, the associated target tongue postures are specifically colored in order to distinguish them from the other reached tongue postures displayed in gray (as in Fig. 3). Black lines correspond to the three vowel tongue postures, indicated for reference (upper, middle, and bottom black lines corresponding to /e/, /ε/, and /a/, respectively).

In order to assess the significance of the auditory clustering scores, as compared to the null hypothesis that tongue postures were labeled at random, we performed a nonparametric randomization test (17) with 10⁴ random permutations. Auditory clustering scores were significantly different from chance (P < 0.01 with Holm–Bonferroni correction for multiple comparisons) for all subjects except S4 and S5. These two subjects also have the highest entropy values of the auditory answers provided by listeners (Fig. 4). This is consistent with our hypothesis, stated above, that their whispers conveyed little relevant information for vowel identification because their reached tongue postures corresponded poorly to configurations compatible with vowel production. Two other somatosensory subjects (S1 and S2) had silhouette scores that were clearly lower than those of the remaining four subjects, despite the fact that their whispers seemed to convey some relevant information for vowel identification, as indicated by the low entropy of the auditory answers (Fig. 4). In contrast, the high auditory clustering scores of the remaining four somatosensory subjects (S3, S6, S7, and S8), for whom low-entropy auditory classification was also observed, indicate that these subjects regularly reached tongue postures compatible with typical productions of the vowels /e/, /ε/, and /a/, and that the regularities observed in the EMA sensor positions are reliable indicators of the whole tongue posture typically associated with these vowels. This interpretation of the relationship between tongue postures and sounds is further supported by the fact that the three subjects with the highest clustering scores are also those who present the fewest data points with high entropy in the violin plots of Fig. 4.
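To make the silhouette analysis and the randomization test concrete, the sketch below computes the mean silhouette score of the labeled tongue postures and compares it with a label-shuffling null distribution; the Euclidean metric and other details are assumptions, not the authors' exact implementation:

```python
import numpy as np
from sklearn.metrics import silhouette_score

def clustering_score_with_null(postures, labels, n_perm=10_000, seed=0):
    """Mean silhouette score of labeled postures plus a permutation test.

    postures: (n, n_dim) reached tongue postures; labels: per-posture
    vowel labels (auditory or somatosensory).  The null distribution
    is built by shuffling labels, mirroring the randomization test
    with 10^4 permutations reported above (Euclidean distance assumed).
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = silhouette_score(postures, labels)

    null = np.empty(n_perm)
    for k in range(n_perm):
        null[k] = silhouette_score(postures, rng.permutation(labels))

    # One-sided p value: chance of a random labeling scoring as high.
    p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return observed, p_value
```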