Although number words are highly frequent in languages like English, and appear regularly in child-directed speech, children's acquisition of them is slow and labored [1]. Ask a three-year-old for “3 balls,” and they are likely to give you a handful instead, having treated “3,” rather indiscriminately, like “some” [2]. This behavior does not stem from an inability to recognize differences between set-sizes: even 6-month-olds are able to discriminate between large set-sizes if the ratio is at least 2:1 [3]–[6], and this discriminability ratio becomes more fine-tuned over time [7]–[9]. Children's difficulties with number are thus unlikely to be due to problems with detecting differences in quantity [10]. Nor do they stem from an inability to grasp the relationship between language and quantity: one- and two-year-olds grasp that number words relate to quantities [1], [11] and are often quite adept at reciting the count sequence [1], [12]. The puzzle, then, is why children – who clearly both recognize number words as quantity designators and discriminate between set-sizes – go through an extended phase where they fail to understand how specific words match to specific quantities [13].

An ordinary child learning about number certainly will not suffer from any lack of exposure to count-relevant auditory and visual stimuli: count words are highly frequent and sets of items are everywhere. However, learning to discriminate which words match with which sets is not a trivial problem: it involves 1) abstracting representations of specific set-sizes from the variable objects that make up any particular set, and then 2) mapping those representations onto specific number words. Here, we show how tightly coupled these processes are in learning [14] and how they are effectively impeded by the way information is structured in English and many other languages. We present a formal analysis and a series of simulations that illustrate the problem and suggest a means of correcting it. Further, our simulations offer a solution to a puzzle relating to the nature of numerical knowledge: while most English-speaking children will eventually learn to recognize and name sets of items in the small number range 1–4 without relying on counting [15], [16], in most cases this ability does not reliably develop much beyond these values [17]. In our model, this pattern emerges naturally as a result of the discriminatory requirements of number learning and the characteristics of the environment in which children learn number words.

A training experiment then puts this analysis of number learning to the test, contrasting the performance gains of children after typical number training – in which information was presented as usual – with those of children after restructured number training – in which the sequencing of linguistic information was manipulated to make it more conducive to learning and discrimination. The experiment reveals that when information is structured appropriately, 3-year-olds rapidly improve their accuracy and consistency not only on trained number sets (2, 4, 6) but also on untrained sets (3, 5, 7). The improvement of the children following our intervention is particularly remarkable given that other recent training studies with older children have failed to find improvement even for trained numbers [18], a finding replicated by the children in our ‘typically structured’ training condition.

In what follows, we describe and model the problem of learning number sets in learning and information theoretic terms. Given that straightforward applications of this approach are rare in language research, it is helpful to provide a basic outline of learning theory at the outset, particularly since contemporary models of learning represent a significant departure from the classic stimulus-response paradigm and do not share many of its limitations [19], [20].

Importantly, learning is no longer conceived of as simply a running tally of rewards and punishments; nor is it thought to be a process of accumulating simple associations between cues and outcomes in isolation. Instead, learning is best understood as a process that has evolved to help a learner better predict events in the world around her by weighing and assessing the informativity of cues for predicting relevant outcomes. In a similar vein, learning is no longer conceived of as simply a series of stimulus→response associations. Rather, it is understood as a process in which all the information available to a learner – both from the environment and from prior experience – is brought to bear on the task of predicting an outcome. Learning models describe the way that this information is sampled and processed for the purpose of better predicting events in the environment [21].

In line with this, experimental work in animal learning has demonstrated that when learning the predictive relationship between a given cue and a given outcome, animals do not simply chart how often cues predict certain outcomes; they also track how often cues fail to predict potential outcomes. The engine that drives learning is not positive reinforcement, but surprise, or more formally, ‘prediction error’ [22], [23]. In learning models, prediction error is formalized as the discrepancy between the expected and actual outcomes a learner experiences, and learning is a process of incrementally updating a learner's expectations in response to events [24].

These formalized learning rules have been shown to accurately predict the behavior of humans and animals across a wide variety of learning tasks, and to accurately reflect the firing patterns of mid-brain dopamine neurons [25]–[27]. When an event in a learner's environment is incorrectly predicted, it provokes an error response [20]. This response is bidirectional: if an unexpected event occurs, dopaminergic activity spikes; if an expected event does not occur, activity dampens. More subtly, the strength of this spike – or dampening effect – is contingent on how poorly predicted the event was to begin with. Greater discrepancies between expectation and reality result in more error, and so more learning occurs; conversely, as discrepancies shrink, errors decrease in kind, and learning asymptotes [27]–[30].

Given the weight of behavioral and neurobiological support for this learning process, and the insight it offers into the way children learn other verbal categories [21], we next consider whether it might help explain why children are so taxed by the challenge of acquiring an understanding of number.
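The error-driven updating outlined in this section can be made concrete with a minimal sketch in the style of the Rescorla-Wagner rule, one widely used formalization of prediction-error learning. The function names `rw_update` and `train`, the cue name `"tone"`, the learning rate of 0.1 and the coding of outcomes (1 when the outcome occurs, 0 when it does not) are all illustrative assumptions, not parameters taken from the simulations reported here.

```python
def rw_update(weights, cues, outcome_present, rate=0.1, max_strength=1.0):
    """Apply one error-driven update and return (new_weights, error).

    The expectation is the summed weight of all cues present on the trial.
    The error is the signed gap between what happened and what was expected.
    """
    prediction = sum(weights.get(cue, 0.0) for cue in cues)
    target = max_strength if outcome_present else 0.0
    error = target - prediction  # positive: surprise; negative: omission
    updated = dict(weights)
    for cue in cues:
        # Every cue present is adjusted in proportion to the shared error.
        updated[cue] = updated.get(cue, 0.0) + rate * error
    return updated, error


def train(trials, rate=0.1):
    """Run a sequence of (cues, outcome_present) trials from zero weights."""
    weights, errors = {}, []
    for cues, outcome_present in trials:
        weights, error = rw_update(weights, cues, outcome_present, rate)
        errors.append(error)
    return weights, errors


# Hypothetical example: one cue that is always followed by the outcome.
weights, errors = train([(["tone"], True)] * 50)

# After learning, omitting the expected outcome yields a negative error.
_, omission_error = rw_update(weights, ["tone"], False)
```

In this sketch the error is largest on the first trial and shrinks as the cue's weight approaches its ceiling, which mirrors the asymptotic pattern described above; the negative `omission_error` corresponds to the dampening response when an expected event fails to occur.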

Information Structure in English

The first problem that a child learning number words must overcome is that she will never encounter numerical sets independently: she may encounter three apples, or three bears, but she will never encounter a “set of three” on its own [31]. To further complicate matters, it is virtually impossible to ascertain the meaning of a given number word from a single encounter. For example, for a child faced with two apples and three oranges, the cues to the words “2” and “less” and “3” and “more” will initially be identical. This creates a discrimination problem: over time, a child must learn to discriminate which features appropriately match a given word in a given context (Figure 1).

Figure 1. The challenges presented by number learning. This picture contains nine objects: one red ball, two hats, three balls and four bears; there are more bears than balls or hats, fewer hats than balls, and more balls and hats than bears. Somehow, a child must discern the cues that discriminate between appropriate and inappropriate usage of each word. Unless one assumes that children's vocabulary is innate, this problem will have to be solved even if children are granted some innate representation of number. That is, even if children have some internal concept of two, they still need to map the presence of two things in the environment to the word “2” and not to, say, “3,” which might be heard in the same context. https://doi.org/10.1371/journal.pone.0022501.g001

In many biological and computational models of learning, this kind of problem is solved by adjusting the degree to which various features in the environment are valued as cues to predicting a relevant outcome. This ‘adjustment process’ is competitive. Over the course of learning, features compete for predictive value, a contest which highlights reliably informative features, while downgrading or even eliminating uninformative features [21], [23], [32], [33]. Characterized in these terms, number learning is a process of coming to value the appropriate set-size as the most reliable cue to a given number word, while at the same time discriminating it from other less reliable competitors (such as alternate set-sizes and other object features). The end goal is one of establishing which set-size best predicts which number word.

Notably, so long as a specific set-size is the most informative predictor of a number word in the learning environment, competitive discrimination learning ought to lead naturally to successful number learning [21], [23], [32], [33], allowing a child to discover and form a strong association between, say, set-size three and the word “3,” while simultaneously weakening any spurious associations to “3”. With the correct association in place – and with ever-reducing interference from competitors – a child will then be able to accurately use and comprehend “3” (Figure 2).

Figure 2. How the number three is learned over time. In competitive discrimination learning, positive evidence (reinforcement) increases associative value for cues, whereas negative evidence (prediction-error) correspondingly decreases value. In the left panel, each of the features present potentially predicts “3.” In the center panel, many of these unhelpful features will later erroneously cause “3” to be expected. Because these unhelpful cues will result in prediction-error when “2” is heard instead, they will lose value as cues to “3,” both in this instance, and in other cases where they erroneously predict a number word. Further, because discrimination learning is competitive, they will lose associative value to more reliably predictive cues (namely, set-size three). In the right panel, further positive evidence means that three continues to gain value with respect to the initial set of cues. As can be seen, learning is facilitated both by positive evidence – hearing the word “3” after seeing sets of three – and negative evidence – unlearning erroneous cues to “3,” like round and green. Provided that the relationship between the labels and the set-sizes is reliable, set-size three will eventually be learned as the meaning of “3.” https://doi.org/10.1371/journal.pone.0022501.g002
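The competition among cues to “3” can be illustrated with a small simulation. This is a toy sketch under stated assumptions, not the model reported in this paper: it uses a Rescorla-Wagner style update, a learning rate of 0.1, and a hypothetical four-trial environment in which shape and color vary independently of set-size. The function `learn_label` and the feature names are invented for the example.

```python
from itertools import cycle, islice


def learn_label(trials, label, rate=0.1, n_trials=800):
    """Learn how strongly each feature predicts hearing `label`.

    All features present on a trial share a single prediction error,
    so value gained by one feature is unavailable to the others.
    """
    weights = {}
    for features, heard in islice(cycle(trials), n_trials):
        prediction = sum(weights.get(f, 0.0) for f in features)
        error = (1.0 if heard == label else 0.0) - prediction
        for f in features:
            weights[f] = weights.get(f, 0.0) + rate * error
    return weights


# Hypothetical environment: (features of the scene, number word heard).
TRIALS = [
    (("set-size three", "round", "green"), "3"),
    (("set-size two", "round", "green"), "2"),
    (("set-size three", "square", "red"), "3"),
    (("set-size two", "square", "red"), "2"),
]

after_one = learn_label(TRIALS, "3", n_trials=1)     # first exposure only
after_many = learn_label(TRIALS, "3", n_trials=800)  # extended experience

for feature, value in sorted(after_many.items(), key=lambda kv: -kv[1]):
    print(f"{feature:15s} {value:+.3f}")
```

After a single trial every feature present is an equally good candidate cue to “3.” With further experience, features such as round and green incur prediction error whenever “2” is heard, and set-size three ends up carrying most of the associative value. In this toy environment the incidental features retain a small residual weight rather than falling exactly to zero, and set-size two acquires a negative weight.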

However, the picture is somewhat more complicated than this suggests. Given that learning is driven by prediction, the temporal structure of information can play a critical role in whether or not competitive learning actually occurs. Indeed, the effects of competitive learning can be isolated by comparing learning in a situation where complex (multi-feature) stimuli predict a series of discrete classes, to its inverse [21]. As Figure 2 shows, learning to predict a discrete Label – such as “2” or “3” – from a complex set of Features (FL-learning) [21] allows for competitive learning amongst features, causing value to shift from features that produce more error to those that produce less. However, when this arrangement is temporally reversed, and the process becomes one of learning to predict a complex set of Features from a discrete Label (LF-learning), competition between cues cannot occur, since the label is the only cue present (value cannot transfer to other cues when there are no other cues) [21], [23]. Although these two processes appear similar, the differences in their temporal sequencing result in their having markedly different information structures, which produce very different patterns of learning [21]. This can be illustrated in relation to color, another aspect of vocabulary that children master only after a noticeable delay [34].

Children's pattern of delay in learning color words bears a striking resemblance to the pattern observed in number learning. Although color words appear in children's vocabularies from a very young age, sighted children's early use of them is comparable to that of blind children: that is, they can produce them in familiar contexts (“yellow banana”), but cannot pick out novel objects by color, or reliably apply color words in unfamiliar contexts [35], [36]. Here again, children do not appear to grasp how specific words match to specific hues.

Colors and numbers share several notable characteristics that may help explain the common pattern. First, like numbers, colors are properties of the environment, and cannot be encountered independently. Second, as with set-sizes, many different shades of color are present in any given context (Figure 1). This means that in order to learn to map colors to their labels, a child must somehow discriminate the range of hues that best predict a specific color label from an environment in which color is ubiquitous [21], [37]. Fortunately, the difficulty of this problem can be significantly reduced if a child is encouraged to localize mappings – for example, by seeking to extract color matches from known objects. This situation allows the environment to be sampled in a way that is far more informative [36]. Unfortunately, as we will show in a moment, the structure of many languages proves largely unhelpful to learners in this regard [21].

To understand why, consider a child learning about the relationship between the features of a ball and various color labels (Figure 3). As noted above, there are two possible ways this process can be structured temporally: either the various Features of the ball can predict the color Label (Feature-to-Label learning, FL) or the color Label can predict the ball's Features (Label-to-Feature learning, LF) [21]. Because FL-sequencing produces competitive learning, whereas LF-sequencing does not, the results of learning from these information structures differ markedly [21], [38]. However, as Figure 3 illustrates, which of the two learning sequences results depends critically on how a child's attention is directed in time, which, in turn, depends on whether the novel color label is introduced before or after the familiar noun.

Figure 3. Linguistic sequence determines learning sequence. Learning can be dramatically affected by how information is presented to a learner in time [39]. Here, a child learns about the relationship between the features of a ball and various color labels. As illustrated, there are two possible ways this process can be structured temporally: either the child hears the color word used postnominally, which promotes Feature-to-Label learning (the Features of the ball predict the color Label, bottom panel), or the child hears the color word used prenominally, which promotes Label-to-Feature learning (the color Label predicts the ball's Features, top panel) [21]. Prior research into category learning indicates that only FL-sequencing facilitates accurate category acquisition, whereas LF-sequencing does not [21], [38], [40]. https://doi.org/10.1371/journal.pone.0022501.g003

Like adults, children track linguistically relevant events in their environment as speech unfolds in time [41]–[44], often directing their gaze at objects or object features as they are labeled in discourse. However, this kind of linguistically mediated visual attention requires that children actually know the meanings of labels. Because children learn the semantics of common nouns long before they learn those of common colors and numbers [45], [46], a typical 2½-year-old will readily direct her gaze toward a ball (or ball-like item) upon hearing the word “ball,” whereas a color word such as “blue” or “red” will not direct her visual attention in this way [47]. What this means, in practice, is that the sequence of events in an English sentence employing a postnominal construction (such as “Look! The ball is blue”) presents the information a child needs for color-label discrimination prior to the label that needs to be learned about, a sequence which supports FL-learning. However, the opposite is true for prenominal constructions (such as “Look at the blue ball”). Here, the color label is heard prior to the known label, which means that the child's attention is not drawn to the ball until after she hears “blue.” Accordingly, prenominal presentation typically promotes LF-learning.

These two information structures can have dramatically different effects on learning. In FL-learning, all of the features of the ball are initially available as potential cues to “blue,” but with experience, unreliable features (such as shape, size and texture) lose value to the most reliable feature (color). This results in competitive learning, which produces predictive representations that value features relative to their informativity – that is, how well they predict the relevant label. Over time, this allows children to master the meanings of color labels [21]. By contrast, in LF-learning, competitive learning amongst cues is not possible – as there is, in effect, just one cue (the label) – and as a consequence, a child will learn a simple, probabilistic representation of the relationship between the label and object features (specifically, the co-occurrence probability between the label and each feature, normalized by the probability of the label). Because overlapping, unreliable features will not be appropriately ‘unlearned,’ color discrimination will be poor. Consistent with this analysis, a prior study found that training with postnominal constructions (FL) significantly improved the accuracy and consistency of two-year-olds' color word application, whereas a similar schedule of prenominal training (LF) had no effect on performance at all [21].
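The contrast between the two information structures can be sketched in code. The example below is an illustrative toy, not the simulation reported in this paper: it assumes a Rescorla-Wagner style update with a learning rate of 0.1 and a hypothetical world of four objects in which shape varies independently of hue. The names `train_fl`, `train_lf` and `fl_support` are invented for the sketch.

```python
from itertools import cycle, islice

# Hypothetical objects: (features, color label). Only hue predicts the label.
OBJECTS = [
    (("blue hue", "round"), "blue"),
    (("blue hue", "square"), "blue"),
    (("red hue", "round"), "red"),
    (("red hue", "square"), "red"),
]
FEATURES = ("blue hue", "red hue", "round", "square")
LABELS = ("blue", "red")


def train_fl(objects, rate=0.1, n_trials=2000):
    """Features -> Label: the features present jointly predict each label."""
    w = {(f, lab): 0.0 for f in FEATURES for lab in LABELS}
    for features, heard in islice(cycle(objects), n_trials):
        for lab in LABELS:
            prediction = sum(w[(f, lab)] for f in features)
            error = (1.0 if lab == heard else 0.0) - prediction
            for f in features:  # shared error term: features compete
                w[(f, lab)] += rate * error
    return w


def train_lf(objects, rate=0.1, n_trials=2000):
    """Label -> Features: the label is the only cue, so nothing competes."""
    w = {(lab, f): 0.0 for lab in LABELS for f in FEATURES}
    for features, heard in islice(cycle(objects), n_trials):
        for f in FEATURES:
            error = (1.0 if f in features else 0.0) - w[(heard, f)]
            w[(heard, f)] += rate * error
    return w


def fl_support(weights, features, label):
    """Summed FL evidence that an object with `features` is called `label`."""
    return sum(weights[(f, label)] for f in features)


fl = train_fl(OBJECTS)
lf = train_lf(OBJECTS)

print("FL support for 'blue' given a red round object:",
      round(fl_support(fl, ("red hue", "round"), "blue"), 3))
print("LF weight from 'blue' to round:", round(lf[("blue", "round")], 3))
```

Under these assumptions, FL training leaves a red round object with essentially no net support for “blue,” because the error shared among features lets hue outcompete shape. LF training leaves the weight from “blue” to round hovering near 0.5, the rate at which round objects accompany the label, since with a single cue there is nothing for that value to be lost to.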

Unfortunately for English-speaking children, however, color words are used prenominally around 70% of the time in child-directed speech [48], which may help explain why color acquisition is typically delayed [21], [35]. This also raises the question of whether information structure plays a similar role in the acquisition of number words. In English and many other languages, number words are far more likely to occur in a prenominal position (e.g., “those three chairs”), than in a postnominal position (e.g., “those chairs, the three of them”). If our analysis is correct, hearing a number word postnominally will facilitate competitive discrimination learning (helping a child discriminate what it is about, say, those chairs that predicts the word “three”), while instances in which number words occur prenominally will be far less helpful to a child trying to learn to isolate the appropriate semantic cues to number words (i.e., set-sizes).

Of course, words are not the only cues that a child has to guide visual attention, and there may be alternate ‘routes’ to FL-learning, even when a prenominal expression is used. Research into joint attention has shown that children also make use of social cues such as gaze and gesture in learning to discriminate a word's meaning [49]–[51]. For example, a parent might hand a child a handful of cookies before saying, “Here are three cookies,” or else point or look directly to a trio of cookies before mentioning their set-size. However, situations in which this kind of explicit instruction takes place are not representative of the majority of contexts in which children encounter number words [45]. Moreover, there is a great deal of variability in caregiver-child interaction during language learning: while some parents engage in frequent and sustained verbal interactions with their children, and explicitly label new objects, others only rarely communicate directly with their children, and do not engage in overt teaching behaviors [52].

To better isolate the effects of word order on number learning, we make a simplifying assumption here that all prenominal constructions support LF-learning, and all postnominal constructions support FL-learning. The analysis we present suggests that even learning in socially guided situations will benefit greatly from the information structure in postnominal constructions, and that postnominal ordering may be critical for learning in the majority of contexts in which children naturally encounter number words in speech.