Humans and animals have a “number sense,” an innate capability to intuitively assess the number of visual items in a set, its numerosity. This capability implies that mechanisms to extract numerosity indwell the brain’s visual system, which is primarily concerned with visual object recognition. Here, we show that network units tuned to abstract numerosity, and therefore reminiscent of real number neurons, spontaneously emerge in a biologically inspired deep neural network that was merely trained on visual object recognition. These numerosity-tuned units underlay the network’s number discrimination performance, which showed all the characteristics of human and animal number discrimination as predicted by the Weber-Fechner law. These findings explain the spontaneous emergence of the number sense based on mechanisms inherent to the visual system.

Humans and animals have a “number sense,” an innate capability to intuitively assess the number of visual items in a set, its “numerosity” (1, 2). This capacity allows newborn human infants (3) and animals (4) to assess the number of items in a visual scene. Human psychophysics (5, 6), brain imaging studies in humans (7, 8), and single-neuron recordings in animals support the direct and automatic assessment of numerosity in the brain. In animals that had not been trained to judge number, single neurons spontaneously responded to numerosity and were tuned to preferred numerosities (9, 10). These “number neurons,” which also exist in the human brain (11), are regarded as the neuronal foundation of numerical information processing (12).

The innate presence of the number sense implies that mechanisms to extract numerosity indwell the brain’s visual system, although it is by nature primarily concerned with visual objects. In recent years, biologically inspired deep neural networks have provided valuable insights into the workings of the visual system. Generative neural networks, a class of deep networks that learn to form an internal model of the sensory input, have been shown to become sensitive to numerosity but could not explain the emergence of real number neurons (13). Here, we use a hierarchical convolutional neural network (HCNN), a class of biologically inspired models that have recently achieved great success in computer vision applications (14, 15) and in modeling the ventral visual stream (16, 17). Like the brain, these models comprise several feedforward, retinotopically organized layers containing individual network units that mimic different types of visual neurons. The training procedure autonomously determines each unit’s selectivity for individual features so as to maximize the network’s performance on a given task. Here, we built such a network and trained it on a visual object recognition task unrelated to numbers to explore whether and how sensitivity to number would spontaneously emerge.

RESULTS

Numerosity selectivity spontaneously emerges in a deep neural network trained for object classification

We trained a deep neural network to classify objects in natural images. The network model was an instance of HCNNs (18), originally inspired by the discovery of simple and complex cells in early visual cortex (19). The network (Fig. 1A and Table 1; see Materials and Methods for details) can be conceptually divided into two parts: a feature extraction network that learned to convert natural images into a high-level representation suitable for object classification, and a classification network that produced object-class probabilities based on this representation. The network consisted mainly of convolutional layers and pooling layers. Network units in convolutional layers performed local filtering operations analogous to simple cells in the visual cortex, whereas units in pooling layers aggregated responses in local patches of their input, similar to complex cells. Convolutional-layer units that shared the same receptive field competed with each other through a simple form of lateral inhibition (14).

Fig. 1. An HCNN for object recognition. (A) Simplified architecture of the HCNN. The feature extraction network consists of convolutional layers that compute multiple feature maps. Each feature map represents the presence of a certain visual feature at all possible locations in the input and is computed by convolving the input with a filter and then applying a nonlinear activation function. Max-pooling layers aggregate responses by computing the maximum response in small nonoverlapping regions of their input. The classification network consists of a global average-pooling layer that computes the average response in each input feature map and a fully connected layer in which the response of each unit represents the probability that a specific object class is present in the input image. (B) Successful classification of a wolf spider, as distinct from other arthropods, is shown as an example. Example images are representative of those used in the test set; the top five predictions made by the network for each image are ranked by confidence. Ground-truth labels are shown above each image. Images shown here are from the public domain (Wikimedia Commons).

Table 1. Description of the layers in the HCNN.

We trained the network on object recognition using the ILSVRC2012 ImageNet dataset [(14); see Materials and Methods for details]. This dataset contains around 1.2 million images that have been classified into 1000 categories based on the most prominent object depicted in each image. After training, the network was tested on object classification with 50,000 new images that it had never seen before. The network achieved a highly significant object classification accuracy of 49.9% (chance level = 0.1%; P < 0.001, binomial test) on this dataset. Figure 1B shows examples of the test images and the predictions made by the network.
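
As a concrete illustration of this class of model, the following PyTorch sketch shows the kind of architecture described above: convolution plus a nonlinearity (simple cells), local response normalization as a simple form of lateral inhibition, max-pooling (complex cells), and a classification stage consisting of global average pooling followed by a fully connected layer. The layer counts, filter sizes, and channel numbers are illustrative assumptions and do not reproduce the configuration given in Table 1.

```python
# A minimal sketch (not the authors' exact model) of the HCNN structure described
# above. Layer counts, filter sizes, and channel numbers are illustrative.
import torch
import torch.nn as nn

class ToyHCNN(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        # Feature extraction: convolution (simple cells) + nonlinearity,
        # local response normalization (lateral inhibition among units that
        # share a receptive field), and max-pooling (complex cells).
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Classification: global average pooling followed by a fully connected
        # layer producing one score per object class.
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.features(x)        # high-level feature maps
        x = self.global_pool(x)     # average response per feature map
        x = torch.flatten(x, 1)
        return self.classifier(x)   # class scores (softmax applied in the loss)

# Example: a single 224x224 RGB image yields 1000 class scores.
model = ToyHCNN()
scores = model(torch.randn(1, 3, 224, 224))
print(scores.shape)  # torch.Size([1, 1000])
```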
To explore whether the network trained on object classification with natural images could spontaneously assess the number of items in dot displays (their numerosity), we investigated whether different numerosities elicit different activations in the network units. To that aim, we discarded the classification network and presented only the feature extraction network with newly generated images of dot patterns depicting numerosities ranging from 1 to 30, following the stimulus protocol used in monkey experiments (20). Figure 2A shows examples of these images.

To control for the effect that the visual appearance of the dot displays might have on unit activations, we used 21 images for each numerosity across three different stimulus sets. The first stimulus set (standard set) showed circular dots of random size and spacing. The second stimulus set (control set 1) displayed dots of equal total dot area and dot density across numerosities. The third stimulus set (control set 2) consisted of items of different geometric shapes with equal overall convex hull across numerosities (see Materials and Methods for details).

Fig. 2. Numerosity-tuned units emerging in the HCNN. (A) Examples of the stimuli used to assess numerosity encoding. Standard stimuli contain dots of the same average radius. Dots in Area & Density stimuli have a constant total area and density across all numerosities. Dots in Shape & Convex hull stimuli have random shapes and a uniform pentagonal convex hull (for numerosities >4). (B) Tuning curves for individual numerosity-selective network units. Colored curves show the average responses for each stimulus set. Black curves show the average responses over all stimulus sets. Error bars indicate SEM. PN, preferred numerosity. (C) Same as (B), but for neurons in monkey prefrontal cortex (20). Only the average responses over all stimulus sets are shown. (D) Distribution of preferred numerosities of the numerosity-selective network units. (E) Same as (D), but for real neurons recorded in monkey prefrontal cortex [data from (20)].

We presented a total of 336 images to the network and recorded the responses of the final layer. A two-way analysis of variance (ANOVA) with numerosity and stimulus set as factors was performed to detect network units with a significant main effect of numerosity (P < 0.01) but no significant effect of stimulus set and no interaction. Of the 37,632 network units in the final layer, 3601 (9.6%) met these criteria and were thus numerosity selective. The responses of numerosity-selective units exhibited a clear tuning pattern (Fig. 2B) that was virtually identical to that of real neurons [Fig. 2C; real neurons from (20)]: each network unit responded maximally to one of the presented numerosities, its preferred numerosity, and progressively decreased its response as the presented numerosity deviated from the preferred numerosity. The distribution of preferred numerosities covered the entire range (1 to 30) of presented numerosities, with more network units preferring smaller than larger numerosities (Fig. 2D), similar to the distribution observed in real neurons (Fig. 2E) (20).
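
The selection of numerosity-selective units can be sketched as follows, assuming the final-layer responses are stored as an array of shape (units × images) together with per-image numerosity and stimulus-set labels. The use of statsmodels and all variable and function names are illustrative assumptions, not the analysis code used here.

```python
# A sketch of the unit-selection step (illustrative, not the authors' code).
# responses: array of shape (n_units, n_images); numerosity and stim_set give
# the per-image numerosity and stimulus-set labels.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

def numerosity_selective_units(responses, numerosity, stim_set, alpha=0.01):
    """Return indices of units with a significant numerosity main effect
    (P < alpha) but no significant stimulus-set effect or interaction."""
    selective = []
    for u in range(responses.shape[0]):
        df = pd.DataFrame({"resp": responses[u],
                           "num": numerosity,
                           "stimset": stim_set})
        fit = smf.ols("resp ~ C(num) * C(stimset)", data=df).fit()
        table = anova_lm(fit, typ=2)
        p_num = table.loc["C(num)", "PR(>F)"]
        p_set = table.loc["C(stimset)", "PR(>F)"]
        p_int = table.loc["C(num):C(stimset)", "PR(>F)"]
        if p_num < alpha and p_set >= alpha and p_int >= alpha:
            selective.append(u)
    return np.array(selective)

# Toy example: 20 units, 336 images (16 numerosity levels x 3 sets x 7 images).
rng = np.random.default_rng(0)
numerosity = np.repeat(np.arange(16), 21)
stim_set = np.tile(np.repeat(np.arange(3), 7), 16)
responses = rng.random((20, 336))
print(numerosity_selective_units(responses, numerosity, stim_set))
```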

Tuning properties of numerosity-selective network units

If the numerosity-selective network units are analogous to numerosity-selective neurons found in the brain, then they should exhibit the same tuning properties. To investigate this, we averaged the responses of numerosity-selective network units that have the same preferred numerosity and normalized them to the 0 to 1 activation range to create pooled network tuning curves (Fig. 3). The pooled tuning curves revealed characteristics of real neurons (12): on a linear number scale, the tuning curves were asymmetric peak functions, with slopes decaying more sharply toward smaller than toward larger numerosities. This pattern suggests that the units’ tuning is better represented on a nonlinearly compressed, possibly logarithmic, scale on which large numerosities lie closer together than small numerosities.

Fig. 3. Tuning curves of numerosity-selective network units. Average tuning curves of numerosity-selective network units tuned to each numerosity. Each curve is computed by averaging the responses of all numerosity-selective units that have the same preferred numerosity. The pooled responses are normalized to the 0 to 1 range. The preferred numerosity and the number of numerosity-selective network units are indicated above each curve. Error bars indicate SEM.

To verify this, we first plotted the pooled network tuning curves once on a linear scale and again on a logarithmic scale (Fig. 4A). On the logarithmic scale, the tuning curves became more symmetric and had a near-constant tuning width across preferred numerosities. To quantify this effect, we fit Gaussian functions to the tuning curves plotted on a linear scale and on three nonlinearly compressed scales, namely, two power scales and a logarithmic scale [f(x) = x^0.5, f(x) = x^0.33, f(x) = log2(x)]. These scales represent progressively stronger nonlinear compression, from the linear scale to the logarithmic scale. The Gaussian function was chosen because it is a standard symmetric function: if a scale suits the tuning curves, the curves should become symmetric around their preferred numerosities when plotted on that scale, and the goodness of fit (r^2 score) of the Gaussian function should therefore increase (21). We found that the Gaussian function fit the data significantly better on each of the nonlinear scales than on the linear scale (P < 0.05, paired t test) (Fig. 4B), whereas the goodness of fit did not differ significantly between the nonlinear scales (P > 0.05). Furthermore, we plotted the SD of the Gaussian fit, as a measure of tuning curve width, against the preferred numerosity associated with each curve (Fig. 4C). On the linear scale, the clear positive slope of the Gaussian widths (r = 0.96, P = 2.1 × 10^−9) indicated that tuning width increased systematically with preferred numerosity. In contrast, the slope was close to zero on the logarithmic scale (r = 0.20, P = 0.47), indicating that tuning widths were invariant across preferred numerosities, consistent with lognormal tuning curves.
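
The scale comparison can be sketched as follows, assuming each pooled tuning curve is a vector of normalized mean responses over numerosities 1 to 30. A Gaussian is fit after transforming the numerosity axis with the respective scaling function, and the resulting r^2 and fitted SD correspond to the goodness-of-fit and tuning-width measures described above; the function and variable names are illustrative.

```python
# A sketch of fitting Gaussians to pooled tuning curves on different numerosity
# scales (illustrative names, not the authors' code).
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, amp, mu, sigma):
    return amp * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

SCALES = {
    "linear":   lambda x: x,
    "pow(1/2)": lambda x: x ** 0.5,
    "pow(1/3)": lambda x: x ** (1.0 / 3.0),
    "log2":     np.log2,
}

def fit_on_scale(numerosities, tuning_curve, scale):
    """Fit a Gaussian after transforming the numerosity axis; return the
    goodness of fit (r^2) and the fitted tuning width (SD)."""
    x = SCALES[scale](np.asarray(numerosities, dtype=float))
    y = np.asarray(tuning_curve, dtype=float)
    p0 = [y.max(), x[np.argmax(y)], (x.max() - x.min()) / 4]  # initial guess
    params, _ = curve_fit(gaussian, x, y, p0=p0, maxfev=10000)
    residuals = y - gaussian(x, *params)
    r2 = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
    return r2, abs(params[2])

# Toy example: a curve that is symmetric on a log axis (peak near numerosity 8)
# is fit better on the compressed scales than on the linear scale.
nums = np.arange(1, 31)
curve = np.exp(-0.5 * ((np.log2(nums) - 3.0) / 0.8) ** 2)
for scale in SCALES:
    r2, sd = fit_on_scale(nums, curve, scale)
    print(f"{scale:>9}: r^2 = {r2:.3f}, SD = {sd:.2f}")
```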
Fig. 4. Tuning properties of numerosity-selective network units. (A) Left: Average tuning curves for network units preferring each numerosity, plotted on a linear scale. Right: Same tuning curves plotted on a logarithmic scale. (B) Average goodness-of-fit measure for fitting Gaussian functions to the tuning curves on different scales [P linear-log = 0.009; P linear-pow(1/2) = 0.003; P linear-pow(1/3) = 0.001]. (C) SD of the best-fitting Gaussian function for each of the tuning curves of numerosity-selective network units on different scales.

Previous network models of number coding postulated summation units, that is, units that monotonically increase or decrease their responses with increasing number, either as necessary precursors to tuned number detectors (22, 23) or as the actual output units (13). In our network, however, summation units were negligible in both respects. In the output layer (layer 13), only 0.5% of all units were summation units, in stark contrast to the 9.6% of tuned number units. The preceding intermediate layers (layers 12 and 11) contained only 0.9 and 2.3% summation units, respectively. Crucially, when we eliminated the responses of all summation units before testing the model, the proportions of tuned units, their distribution, and their average tuning curves were qualitatively unchanged. Therefore, summation units were not necessary for the network to develop number detectors.
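
Summation units can be identified in several ways; because the exact criterion is not spelled out in this section, the following sketch assumes that a unit counts as a summation unit when its mean response varies strongly and monotonically with numerosity (Spearman rank correlation), which is only one possible operationalization.

```python
# A sketch of one possible way to flag summation units (assumed criterion:
# a strong, statistically reliable monotonic trend of the mean response with
# numerosity, measured by Spearman rank correlation).
import numpy as np
from scipy.stats import spearmanr

def summation_units(mean_resp, numerosities, alpha=0.01, min_rho=0.9):
    """mean_resp: array of shape (n_units, n_numerosities) holding each unit's
    mean response to every presented numerosity."""
    flagged = []
    for u, resp in enumerate(mean_resp):
        rho, p = spearmanr(numerosities, resp)
        if p < alpha and abs(rho) > min_rho:  # monotonic increase or decrease
            flagged.append(u)
    return np.array(flagged)

# Toy example: unit 0 increases with numerosity (summation-like), unit 1 is
# tuned (peaked at 15), unit 2 is roughly flat.
nums = np.arange(1, 31)
units = np.vstack([
    nums / nums.max(),
    np.exp(-0.5 * ((nums - 15) / 5.0) ** 2),
    0.5 + 0.01 * np.cos(nums),
])
print(summation_units(units, nums))  # expected: [0]
```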