Communicative interactions involve a kind of procedural knowledge that is used by the human brain for processing verbal and nonverbal inputs and for language production. Although considerable work has been done on modeling human language abilities, it has been difficult to bring them together into a comprehensive tabula rasa system compatible with current knowledge of how verbal information is processed in the brain. This work presents a cognitive system, entirely based on a large-scale neural architecture, which was developed to shed light on the procedural knowledge involved in language elaboration. The main component of this system is the central executive, which is a supervising system that coordinates the other components of the working memory. In our model, the central executive is a neural network that takes as input the neural activation states of the short-term memory and yields as output mental actions, which control the flow of information among the working memory components through neural gating mechanisms. The proposed system is capable of learning to communicate through natural language starting from tabula rasa, without any a priori knowledge of the structure of phrases, the meaning of words, or the role of the different classes of words, only by interacting with a human through a text-based interface, using an open-ended incremental learning process. It is able to learn nouns, verbs, adjectives, pronouns and other word classes, and to use them in expressive language. The model was validated on a corpus of 1587 input sentences, based on literature on early language assessment, at the level of a child about four years old, and produced 521 output sentences, expressing a broad range of language processing functionalities.

Funding: This study was supported by Regione Autonoma della Sardegna, O. P. FSE 2007-2012 L.R.7/2007, BG, GLM; United Kingdom, Engineering and Physical Sciences Research Council, BABEL project, AC; European Community, Seventh Framework Programme, POETICON++ project, AC. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2015 Golosio et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Neural gating mechanisms play an important role in the cortex and in other regions of the brain [ 39 ]. They rely on the action of bistable neurons, i.e. neurons that can oscillate between a quiescent “down” state, associated with a hyperpolarized membrane potential, and an “up” state, characterized by a membrane potential that is just below the cell's firing threshold. The gatekeeper neurons can modulate the membrane potential of the bistable neurons, shifting them from the “down” state to the “up” state and vice versa. Different types of neural gating mechanisms have been observed in the brain. Fig 3 represents the type of gating mechanism that is exploited in our model. In this example, a gatekeeper neuron is fully connected to a set of bistable neurons. When the gating signal is “off”, the gate is closed: the bistable neurons are in the “down” state, and they do not respond to the input signal. Conversely, when the gating signal is “on” the gate is open: the bistable neurons are in the “up” state and they transmit the input signal to the second set of neurons. The bistable neurons therefore perform a type of biological AND relative to their inputs.
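The AND-like behavior of the gated bistable neurons can be illustrated with a minimal Python sketch. The numerical values of the threshold, of the "up" and "down" potentials and of the input drive are arbitrary assumptions chosen for illustration, and the function name is hypothetical; none of them is taken from the model.

```python
# Illustrative values in arbitrary units (assumptions, not model parameters).
THRESHOLD = 1.0        # firing threshold of a bistable neuron
UP_POTENTIAL = 0.9     # "up" state: just below the firing threshold
DOWN_POTENTIAL = -1.0  # "down" state: hyperpolarized membrane potential
INPUT_DRIVE = 0.5      # depolarization produced by an active input


def bistable_fires(input_active, gate_on):
    """Return True if a bistable neuron transmits its input signal."""
    # The gatekeeper neuron sets the baseline state of the bistable neuron.
    baseline = UP_POTENTIAL if gate_on else DOWN_POTENTIAL
    # An active input adds a fixed depolarization to the baseline.
    potential = baseline + (INPUT_DRIVE if input_active else 0.0)
    return potential >= THRESHOLD


def gated_output(input_signal, gate_on):
    """Apply the same gate to every neuron of a bistable set."""
    return [int(bistable_fires(bool(x), gate_on)) for x in input_signal]
```

With the gate open, `gated_output([1, 0, 1], True)` reproduces the input pattern; with the gate closed, every output is zero, whatever the input.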

All classical neurobiological models of language attribute a fundamental role to Broca's area, which includes Brodmann's areas (BA) 44 and 45, in the left frontal cortex. Several studies show that BA 47 and the ventral part of BA 6 are also involved in language processing tasks [ 34 – 36 ]. The language-relevant part of the frontal cortex is thus the left inferior frontal gyrus (LIFG) which comprises BA 44, 45, 47 and 6. Results from neuroimaging and psycholinguistic studies show that LIFG is involved in the unification operations required for binding individual words into larger structures [ 37 , 38 ]. Hagoort [ 37 ] proposes a model that distinguishes three functional components of language processing: memory, unification and control. Fig 2 shows the main areas of the cortex that support the three components.

Localization of brain areas that are involved in language comprehension and production requires the combination of findings from neuroimaging and psycholinguistic research. Several studies on the functional neuroanatomy of language indicate that both semantic and syntactic processes involve mainly the left frontal cortex and part of the temporal cortex [ 34 – 38 ]. The left frontal cortex is considered to be responsible for strategic and executive aspects of language processing. The left temporal cortex supports the processes that identify phonetic and lexical elements. It is involved in storage and retrieval of phonological, syntactic and semantic information from memory.

contain the cue (“7”) and are similar to phrase 1, in the sense that both phrase 2 and phrase 3 are close to phrase 1 in the input space of the state-action association system. Unfortunately, phrase 2 is closer. If the choice was based solely on similarity with the phrase retrieved during training, the system would choose phrase 2, and following the same action sequence of the training example, it would give a wrong answer, i.e. “nine” instead of “ten”. In our model, the generalization capabilities are supported by a “comparison structure”, which is an additional component of the STM that recognizes similarities among elements of different STM components. For instance, it can recognize that one word in the phonological store is equal to a word of the phrase stored in the goal stack. In our example, the comparison structure allows the system to recognize that the third word of phrase 2 ("three") is equal to the fourth word in the goal phrase "add the number three". In a simple neural model of the comparison structure, the neurons that compare those two words will be activated. Our model includes a comparison structure, which is part of the input to the state-action association system of the central executive. We will show that the connections from the comparison structure to the central executive are weighted more than the connections from the phonological store to the central executive, therefore in the above example the system will select phrase 3 rather than phrase 2, and it will give the correct answer.

Through this sequence, the system will extract the phrase "add the number three" and push it onto the goal stack. It will then transfer the sequence "7 8 2 5" to the phonological store, transfer the first number (“7”) to the focus of attention and use it as a cue to retrieve information from LTM. Now we come to another question: why should the retrieval process be modeled using a neural architecture or, more generally, why should it be described as a statistical process? In principle there could be thousands of phrases that could be retrieved from LTM using the digit "7" as a cue. How can the system choose the appropriate phrase among them? The system can recognize that some of the phrases that can be retrieved from LTM using the digit “7” as a cue are similar to the one retrieved during the training stage, which was:

Since this sentence is similar to that of the first task, the central executive will provide the same output, i.e. the same mental-action sequence.

At this point, one may wonder why a neural architecture is necessary to model this process. Apart from the obvious consideration that our brain is a neural architecture, why is a symbolic model not enough? What we try to emphasize in our work is that the decision processes operated by the central executive are not rule-based processes; they are statistical decision processes. In our model, the central executive is a neural network that takes as input the signal from the STM components (the internal state) and provides as output mental actions that direct the flow of information among the slave systems. Therefore, the central executive should comprise a state-action association system. If the central executive were not a statistical tool, the system would not be able to generalize. But how might the generalization arise in the previous example? Suppose that an artificial model of the working memory was trained to respond to the "add the number two" task described above, and that it is tested on a similar task, but with different numbers:

In the next section we will illustrate how our model implements the "mental action sequence" (a,b,1–6), which includes the two actions a and b described above and the actions 1–6 listed previously. In the following sections, we will also demonstrate that a broad range of tasks in human language processing can be performed using iterations of this basic action sequence. A minimal system that can perform this sequence should include (at least) the following components:

and so on, until the last digit is processed. Additionally, several studies [ 32 , 33 ] suggest that the task goal should be stored in the working memory in some directly accessible form. Therefore, the previous sequence should be extended by including at the beginning, before step 1, two other operations, such as:

We assume that the subject has memorized additions with small numbers in LTM, so that the cognitive load for a single addition is small. The sequence of mental operations that are performed by the subject can be the following:

In classical tasks used to study working memory capacity [ 31 ], a subject is asked to hold in mind a short sequence of digits and to perform some simple process on each of these digits (or on a subset), for example adding the number two to each digit. Consider, for instance, the following task:

Baddeley's model is supported by evidence from experimental psychology, neuropsychology and cognitive neuroscience (see Ref. [ 28 ] for a review). However, some criticism has been raised and alternative models have been proposed. Cowan [ 29 ] proposed a working memory model in which the LTM was not a separate component, but a part of the working memory. Cowan's model consists of four components: a central executive, an LTM, an activated memory and a focus of attention. The central executive directs attention and controls voluntary processing. The activated memory is the subset of LTM in a state of temporary activation, and it can hold a large number of activated elements. The focus of attention is a subset of the activated memory. It has a limited capacity and can hold up to about four independent items or chunks. According to Baddeley, the differences between his view and that of Cowan are mainly in "emphasis and terminology" [ 28 ]. In particular, the episodic buffer of his model has a similar role to Cowan's focus of attention. McElree [ 30 ] suggested a focus of attention limited to a single chunk. Oberauer [ 31 ] proposed a model that distinguishes three states of representations in WM: the activated part of LTM, the region of direct access and the focus of attention. The region of direct access roughly corresponds to the broader focus of attention in Cowan's model, with a scope of about four chunks. The focus of attention in Oberauer's model corresponds to the single-chunk focus of McElree's model. The function of the focus of attention is to select a single item or chunk from the direct-access region.

In 2000, Baddeley [ 27 ] extended this model by adding a third slave system, the episodic buffer, which binds information from different domains (phonological, visual, spatial, semantic) to form integrated units of information with chronological ordering. Fig 1 shows a schematic diagram of this model.

In 1974, Baddeley and Hitch [ 26 ] proposed a working memory model composed of three main components: a central executive and two slave systems, i.e. the phonological loop and the visuo-spatial sketchpad. The central executive operates as a supervisory system by controlling the flow of information from and to the slave systems. The slave systems are responsible for short-term maintenance of information: the phonological loop stores verbal content, while the visuo-spatial sketchpad stores visual and spatial information.

Although there are different perspectives regarding the organization of memory in the human brain, all approaches recognize at least two types of memory: the short-term memory (STM) and the long-term memory (LTM). STM can be defined as the capacity of the human mind to hold a limited amount of information in a readily accessible state for a short period of time. In contrast, LTM is a large repository of knowledge and of information on prior events, which can be stored in the mind for long periods of time. The term working memory (WM) has been defined in different ways; however, most researchers assume that WM includes (at least) the STM and the processing mechanisms used for temporarily storing and manipulating information in the STM.

Our model uses adaptive neural gating mechanisms to control the flow of information among different subsystems of the short-term memory. Such mechanisms are controlled by a state-action association system, which learns through Hebbian changes in the synaptic strengths. We claim that this model can develop from tabula rasa a broad range of language processing functionalities. We propose that adaptive neural gating mechanisms have an important role in the development of language processing skills at the sentence level. We test our hypothesis by evaluating our model on a database based on literature on early language assessment, using a k-fold cross-validation technique.

Dominey and Hinaut [ 24 , 25 ] proposed a neural model of brain areas involved in language processing, able to learn grammatical constructions and to generalize the acquired knowledge to novel constructions. In their work, language understanding is identified as the ability to recognize the thematic role of the open-class words in the surface form of sentences, and meaning is interpreted as a mapping from the surface form to a functional form of sentences. This notion of understanding is not sufficient for the purpose of the present work, which is more focused on the elaboration of verbal information in the working memory. The purpose of our work is to contribute to understanding the mechanisms that make the human brain able to develop a broad range of language processing skills, starting from a tabula rasa condition. Such skills involve a procedural knowledge that is used to process verbal information at the sentence level, to combine it with information retrieved from long-term memory, to select relevant items and to plan language production. Here we present a comprehensive cognitive neural model, aimed at explaining how this procedural knowledge is developed, through a neural-network structure and biologically motivated learning rules.

The symbolic approach dominated the research in the field of natural language processing (NLP) for several decades. Natural language itself appears to be a strongly symbolic activity, because words can be considered symbols used to represent real objects, concepts, events, and actions. The formal language theory, introduced in the '50s, used algebra and set theory to define formal languages as sequences of symbols. This theory includes the context-free grammar, defined by Chomsky [ 12 ]. Today the field of NLP is dominated by machine learning approaches, which include neural-network based approaches, support vector machines, Bayesian approaches and many others (see Ref. [ 13 ] for a review). Neural network language models have widely been used in NLP, demonstrating superior performance in next-word prediction and other standard NLP tasks over conventional approaches, such as n-gram models. Recently, deep learning techniques based on recurrent neural networks (RNNs) have been used successfully for several NLP tasks, including speech recognition [ 14 ], parsing [ 15 , 16 ], machine translation [ 17 ] and sentiment analysis of text [ 18 ]. Although some of these models are biologically inspired, they are mainly designed as engineering solutions to specific problems in NLP. It is important to point out that NLP has been treated very differently in computer science, linguistics, and cognitive science. The connectionist approach has proved suitable for modeling the cognitive foundations of language processing [ 19 – 21 ]. Connectionist models have been used to explain the emergence of language skills with only simple learning rules that operate at a neural level, instead of requiring detailed innate knowledge. The connectionist approach emphasizes the role of learning through the interaction with the environment. According to this approach, language skills are the behavioral manifestation of internal representations and processes that take place in the brain.
Although connectionist models have been widely used in the field of NLP, little work was done to integrate neural models of language into comprehensive cognitive models compatible with current knowledge of how verbal information is stored and processed in the brain, i.e. with verbal working memory models. Miikkulainen [ 21 , 22 ] and Fidelman et al. [ 23 ] presented a cognitive neural architecture able to parse script-based stories, to store them in episodic memory, to generate paraphrases of the narratives, and to answer questions about them. Their model was tested on a small corpus of nine scripts, each of which consisted of 4–7 sentences.

The central idea of the connectionist approach is that mental processes can be modeled as emergent processes of networks of highly interconnected processing units. The information is represented by activation signals flowing through such networks. The most used type of connectionist model is the artificial neural network (ANN) model, which has been widely used to account for different aspects of human cognition, including memory, perception, attention, pattern recognition and language. In many cases, connectionist architectures have been very effective in explaining some features of human behavior described by psychological findings. However, up to now they have never been implemented in large-scale simulations for tasks that require complex reasoning [ 6 ]. Recently, Eliasmith et al. proposed a 2.5-million neuron model of the brain, able to process visual image sequences and to respond through movements of a physically modeled arm [ 9 ]. Other large-scale neural simulations have been reported [ 10 , 11 ]; however, they focus on the biological realism of the neuron model, and none of them deals with the problem of natural language elaboration.

The attempts to build artificial systems capable of simulating important aspects of human cognitive abilities have a long history, and have contributed to the debate between two different theoretical approaches, computationalism and connectionism. According to the computational theory of mind, the brain is an information processing system, and thought can be described as a computation that operates on mental states [ 1 , 2 ]. This perspective has led to the implementation of a class of cognitive architectures called symbolic [ 3 – 5 ] (see Refs. [ 6 ] and [ 7 ] for a review). Different criteria have been proposed for the classification of cognitive architectures [ 6 , 8 ]. We will use here the simple taxonomy proposed by Duch et al. [ 6 ], which focuses on how information is represented and processed. In symbolic architectures, information is represented by high-level symbols. Cognition takes place as a computation that operates on symbol structures and produces symbolic outputs. Symbolic architectures can realize high-level cognitive functions, such as complex reasoning and planning. However, the main issue of such architectures is that all information must be represented and processed in the form of symbols pertaining to a predefined domain. This constraint makes it difficult for such systems to recognize regularities in large datasets, particularly in the presence of noisy data and in dynamic environments.

Methods

The ANNABELL model

The model presented in this work, called ANNABELL (Artificial Neural Network with Adaptive Behavior Exploited for Language Learning), is a cognitive neural architecture, designed to help understand the cognitive processes involved in early language development. The source code of the software, the User Guide and the datasets used for its validation are available on the ANNABELL web site at https://github.com/golosio/annabell/wiki. The global organization of the system is compatible with the multicomponent working memory (M-WM) framework. However, our work is focused on the role of executive functions in language processing tasks, and not on many other important questions concerning WM, such as those related to working memory capacity or information maintenance in STM. Therefore, for the sake of simplicity, our model does not take into account many effects that are of central importance for working memory theories, such as phonological/semantic similarity, word length effect, recency, and other effects in serial and free recall tasks. We also do not take a position in the controversy on whether information in the phonological store is maintained by passive storage or by active rehearsal, and it is again for reasons of simplicity that we have chosen passive maintenance. The building blocks of the model are artificial neurons. The system is based on the concept of sparse-signal map (SSM). An SSM is simply an ANN that has only a small fraction of all neurons active at a given time. The advantage of this representation is that it can be implemented in a very efficient way both in terms of computation time and in terms of memory usage, therefore it can partially compensate for the relatively limited parallelism of available hardware compared to the biological brain. The design of the neuron model focused on computational efficiency rather than biological details.
It is important to point out that the purpose of this approach is not an engineering solution to the human-machine dialogue problem, but a cognitive model of how verbal information is processed in the brain. Computational efficiency is necessary for building a large-scale neural model of the verbal working memory, able to sustain a long training procedure on a relatively large database. The system is composed of several SSMs, connected to each other either by fixed-weight or by variable-weight (learnable) connections. The latter are updated through a discrete version of the Hebbian learning rule, combined with the k-winner-take-all rule. Most of the learnable connections are virtual: they are not actually allocated in memory, unless their default weight value is modified. As will be explained below, a connection weight is modified only if the presynaptic neuron is active and the postsynaptic neuron is one of the winners of the k-winner-take-all competition. As the signal is sparse, only a small fraction of the neurons is active at a given time, therefore most learnable connections remain virtual, i.e. they are not allocated in memory. With this approach memory requirements and, most importantly, computation time are greatly reduced compared to conventional techniques. The use of virtual connections produces a gain of more than three orders of magnitude in execution time, because the weighted sum used to compute the neuron input signals (which is the part of the simulation that takes most of the execution time) is limited to the connections that are actually allocated in memory. The communication between the system and the human interlocutor is achieved through an interface that converts words into input patterns, submits them one by one to the system, extracts output patterns and converts them to words.
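The idea of virtual connections described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ANNABELL implementation: the class name, the method names and the choice of a default weight of zero (so that unallocated connections contribute nothing to the weighted sum) are all hypothetical.

```python
from collections import defaultdict


class VirtualConnections:
    """Learnable connections that occupy memory only once they are modified.

    Assumption: the default weight is zero, so a connection that was never
    modified can be skipped when computing the input signal.
    """

    def __init__(self):
        # postsynaptic index -> {presynaptic index: weight}
        self._allocated = defaultdict(dict)

    def set_weight(self, post, pre, weight):
        """Allocate (or overwrite) the connection from `pre` to `post`."""
        self._allocated[post][pre] = weight

    def allocated_count(self):
        """Number of connections that are actually stored in memory."""
        return sum(len(row) for row in self._allocated.values())

    def input_signal(self, post, active_pre):
        """Weighted sum over allocated connections from active neurons only."""
        row = self._allocated.get(post, {})
        return sum(w for pre, w in row.items() if pre in active_pre)
```

The cost of `input_signal` depends on the number of allocated connections of the target neuron, not on the total number of possible presynaptic neurons, which is the source of the saving discussed in the text.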
The network architecture is designed in such a way that the system can process phrases using mental actions, which are elementary operations on word groups and phrases that are used, for instance, for acquiring the words of the input phrases, for memorizing phrases, for extracting word groups from the working phrase, for retrieving memorized phrases from word groups through an association mechanism, etc. Such actions are performed by special neurons, called mental action neurons, which can control the flow of signal between different subnetworks. A key feature of the model is that the connections that are affected by the reward mechanism are connected to mental action neurons, rather than being directly connected to output words or phrases. In this way, the system learns preferentially to build the output through sequences of elementary operations on word groups or phrases. This type of architecture underpins the generalization capabilities of the system. The system was implemented on a PC equipped with a high-performance GPU (graphics processing unit) NVIDIA Kepler GK104 having 1536 processing units (called cores). GPUs are programmable logic chips that are widely used not only for graphical applications, but more generally for high-performance-computing applications that require a high degree of parallelism. The current version of the system is composed of 2.1 million neurons, interconnected through 33 billion virtual connections. At the end of the complete learning process described in this work, the number of real (allocated) connections was 27 million. The size of the system is comparable to that of the neural architecture described in Ref. [9], although our model privileges computational efficiency over biological details.
The ability to perform real-time communication and the large scale of the network make our system adequate for sustaining a relatively long developmental process (this property is called open-ended, cumulative learning in developmental robotics [40]). The system is being trained through an approach that, compared to those used for other artificial systems, is much more similar to the way children are trained in language. This process is conducted by personifying the system as a child in a virtual social environment. The validation of its performance is inspired by the literature on early language assessment. Test sessions are used to assess syntax, semantics, pragmatic language skills, communicative interactions, language processing skills and comprehension of sentence structure.

Learning mechanisms and signal flow control

The ANNABELL system is entirely composed of interconnected artificial neurons, and all processes are achieved at the neural level. Although different subsystems can be distinguished by their function, the whole system has a unitary structure. The subnetworks are arranged in layers that determine the update order, with both forward and backward (recurrent) connections among different layers. The system uses a standard artificial neuron model. The neurons are connected among each other by directional weighted connections (links). Three types of connections are used:

fixed-weight connections, which do not change during the learning process;

variable-weight (learnable) connections, which are modified by the learning process;

forcing connections, which are variable-weight connections that have a positive or negative weight much greater in absolute value than that of the other two connection types, thus they can force the target neurons to a high-level or to a low-level state.

The total input signal of each neuron is evaluated as the weighted sum of the signals coming from its input connections:

$$ y_i = \sum_{j \in S_i} w_{ij} \, o_j + b_i $$

where $i$ is the neuron index, $y_i$ is its total input signal, $S_i$ is the set of neurons that are connected to the other ends of its input connections, $j$ is an index that runs on the set $S_i$, $w_{ij}$ are the weights of the input connections, $o_j$ are the output signals of the neurons connected to its input, and $b_i$ is a bias signal. The neuron output is computed from the total input by a nonlinear activation function [41]:

$$ o_i = f(y_i) $$

which approaches zero as $y_i$ tends to minus infinity, or one as $y_i$ tends to plus infinity. Two types of activation functions are used in the model, i.e. the Heaviside step function for the neurons that receive their input from fixed-weight connections, and the logistic function [41] for the neurons that receive it from variable-weight connections. In the subnetworks that have learnable input connections, the inhibitory competition among neurons is modeled using the k-winner-take-all rule, i.e. the k neurons with the highest activation state are switched on, while all the remaining neurons are left off. This rule provides a computationally effective approximation of the activation dynamics produced by inhibitory interneurons [42]. The Hebbian theory provides a theoretical basis for the learning mechanisms in biological neural networks [41,43]. According to this theory, the strength of the synaptic junction between two neurons is increased when the outputs of the two neurons are strongly correlated, i.e. when the two neurons fire together.
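The neuron model and the k-winner-take-all competition described above can be sketched as follows. This is a minimal illustration in plain Python, not the GPU implementation used in the work. The function names are hypothetical, and two details are assumptions that the text does not specify: the value of the step function at exactly zero, and the way ties are broken in the competition (here, the lowest index wins).

```python
import math


def total_input(weights, outputs, bias):
    """Weighted sum of the presynaptic outputs plus the bias signal."""
    return sum(w * o for w, o in zip(weights, outputs)) + bias


def heaviside(y):
    """Step activation; the value at y == 0 is an assumption (off)."""
    return 1.0 if y > 0 else 0.0


def logistic(y):
    """Logistic activation, written to avoid overflow for large |y|."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)


def k_winner_take_all(activations, k):
    """Switch on the k most active neurons and leave all the others off."""
    # sorted() is stable, so equal activations keep their index order.
    ranked = sorted(range(len(activations)),
                    key=lambda i: activations[i], reverse=True)
    winners = set(ranked[:k])
    return [1 if i in winners else 0 for i in range(len(activations))]
```

For example, `k_winner_take_all([0.1, 0.9, 0.4, 0.8], 2)` switches on the second and fourth neurons only.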
In our model, the learnable connections are modified through a discrete version of the Hebbian learning rule (DHL rule), combined with the k-winner-take-all rule: the connection weight is modified only if the postsynaptic neuron is one of the k winners of the k-winner-take-all competition; if the presynaptic neuron at the other end of the connection is in the same activation state as the winner neuron (i.e. in the high-level state “on”) the connection weight is saturated to its maximum value. In the opposite case, it is saturated to its minimum value. A detailed description of the learning algorithms and of the statistical properties of the state-action association system is provided in S4 Appendix. In the ANNABELL model, the flow of information among different parts of the system is controlled by the central executive, which includes a set of gatekeeper neurons, a set of mental-action neurons and a state-action association system (see Fig 4).

Fig 4. Schematic diagram of the ANNABELL system main components. https://doi.org/10.1371/journal.pone.0140866.g004

The gatekeeper neurons are neurons that can control the flow of signal between different subnetworks by acting in a similar way as an increase or a decrease of the bias signal, as described in the previous section. The output connections of the gatekeeper neurons are generally fully connected to one or more subnetworks, in such a way that they can allow or inhibit the flow of signal through such subnetworks. The mental-action neurons are neurons that trigger elementary operations, called mental actions, on word groups, phrase buffers and other subnetworks. The output connections of the (mental) action neurons are connected to the gatekeeper neurons. Each action neuron performs a mental action by activating simultaneously one or more gatekeeper neurons. The connections between the action neurons and the gatekeeper neurons have fixed, predetermined weights, in such a way that each action neuron corresponds to a well-defined operation. The mental action neurons and the gatekeeper neurons are based on the same simple neuron model used for all neurons of the system. Their specialization is only a result of the way they are connected to other subnetworks. The state-action association system is a structure that is trained by a rewarding procedure to associate mental actions to the internal states of the system. The input and the output connections of this system follow a distributed model, i.e. the state-action association network is fully connected to the subnetworks that represent the internal state of the system (input) and to the action neurons (output). Its input and output connections are updated through the DHL rule combined with the k-winner-take-all rule.
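The DHL rule mentioned above can be sketched as a single update step. This is a simplified illustration under stated assumptions: the saturation values `W_MAX` and `W_MIN` are arbitrary, the function name and the dictionary-based weight storage are hypothetical, and the full algorithm is the one given in S4 Appendix, not this sketch.

```python
W_MAX = 1.0   # assumed maximum (saturated) weight
W_MIN = -1.0  # assumed minimum (saturated) weight


def dhl_update(weights, n_pre, active_pre, winners):
    """Apply one discrete Hebbian step to the input connections of the winners.

    weights:    dict mapping (post, pre) -> weight, modified in place
    n_pre:      number of presynaptic neurons
    active_pre: set of presynaptic neurons in the "on" state
    winners:    set of postsynaptic neurons that won the k-winner-take-all
    """
    for post in winners:
        for pre in range(n_pre):
            # Presynaptic neuron "on" like the winner: saturate to maximum.
            # Otherwise: saturate to minimum.
            weights[(post, pre)] = W_MAX if pre in active_pre else W_MIN
    return weights
```

Connections that enter neurons outside `winners` are never touched, so their weights stay at whatever value they had before the update.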
Note that, although the gating signals are sent by the gatekeeper neurons, it is the state-action association system that controls which action neurons are active, and thus which gatekeeper neurons are active. Therefore the decision of which gates should be open and which should not is made by the state-action association system. A key feature of the ANNABELL system that is particularly important for its generalization capabilities is that the learnable connections that are affected by the reward (i.e. the connections of the state-action association SSM) are connected to action neurons, rather than being directly connected to output words or phrases. In this way, the system learns preferentially to build the output through sequences of elementary operations on word groups or phrases.
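The chain from action neuron to gatekeeper neurons to signal flow can be sketched as follows. The action names are those used in the action list of this paper; the gate names, the dictionary representation of the fixed connections and the helper functions are hypothetical and serve only to illustrate the idea.

```python
# Fixed, predetermined connections from action neurons to gatekeeper neurons.
# Gate names are hypothetical; the real gatekeepers are described in S5 Appendix.
ACTION_TO_GATES = {
    "FLUSH_WG": {"clear_word_group"},
    "GET_W": {"working_phrase_to_word_group"},
    "WG_OUT": {"word_group_to_output"},
}


def open_gates(active_action):
    """Return the gatekeeper neurons switched on by a single action neuron."""
    return ACTION_TO_GATES[active_action]


def gated_copy(source, gate_open):
    """Let the source pattern through when the gate is open, else block it."""
    return list(source) if gate_open else [0] * len(source)


# Usage: the state-action association system has selected WG_OUT.
gates = open_gates("WG_OUT")
word_group = [1, 0, 1, 1]
output = gated_copy(word_group, "word_group_to_output" in gates)
blocked = gated_copy(word_group, "clear_word_group" in gates)
```

Because learning only changes which action neuron is selected, and never the action-to-gate wiring, what is learned is a choice among well-defined operations rather than a direct mapping to output words.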

Global organization of the model

The global organization of our model is compatible with the M-WM framework. This section presents an overview of the system architecture and operating modes. S5 Appendix provides a detailed description of the architecture, while S3 Appendix describes in detail, on concrete examples, how the neural activation patterns evolve and how the connection weights are modified. However, we must point out that the details of the implementation and the further divisions into subcomponents, as described in S3 Appendix and in S5 Appendix, mainly respond to the need to build a neural-network model suitable for simulations that produce cognitively relevant behavior, and should not be considered a premature attempt to map the model architecture to neural circuits in the biological brain.

The ANNABELL model comprises four main components, as shown in Fig 4: a verbal short-term memory (STM), a verbal long-term memory (LTM), a central executive (CE) and a reward structure.

The STM includes a phonological store, a focus of attention, a goal stack and a comparison structure. The phonological store maintains the working phrase. The focus of attention holds up to about four words. It is involved in several functions, including language production planning, and it is also used as a cue for retrieving information from LTM. For reasons of simplicity, our model does not include a visuo-spatial system or other types of sensory inputs; therefore, unlike Baddeley's episodic buffer, the focus of attention of our model can hold only verbal content. The goal stack is a structure for storing goal chunks that contribute to decision-making processes. The comparison structure recognizes similarities among words in the phonological store, in the focus of attention and in the goal stack, and is also used for decision-making processes.
The LTM includes a structure for memorizing phrases and a retrieval structure that uses the focus of attention as a cue for retrieving memorized phrases.

The CE is a supervisory system that controls all decision-dependent processes through neural gating mechanisms, as described in the previous section. It is important to point out that the central executive does not necessarily correspond to a well-localized area of the brain. It is a system that accounts for functions that could be distributed in different areas. How such functions map onto anatomical locations is an empirical question that is still under investigation.

The reward structure memorizes and retrieves the sequences of internal states of the system and the mental actions performed by the system (state-action sequences). When an exploration phase produces a target output, the reward structure retrieves the state-action sequence and rewards the association between each internal state and the corresponding mental action, by triggering Hebbian changes of the state-action association synaptic weights.

Mental actions, executed through neural gating mechanisms, are used to perform elementary operations on phrases, such as incrementing the phrase index, extracting a single word from the working-phrase buffer and mapping it to the word-group buffer, retrieving a memorized phrase from a word group, storing the working phrase in the goal stack, etc. The system can perform three types of actions.

Acquisition actions. These actions are used during the acquisition and association phases, for acquiring the input phrases, memorizing them and building the associations between word groups and memorized phrases.

Elaboration actions. These actions are used during the exploration and exploitation phases, for extracting word groups from the working phrase, for retrieving memorized phrases from word groups through the association mechanism, for retrieving memorized phrases belonging to the same context, and for composing output phrases.

Reward actions. These actions are used by the rewarding system and can be executed in parallel to the elaboration actions. They are used for memorizing the state-action sequences produced during the exploration and exploitation phases, for retrieving such sequences after a reward signal and for triggering the changes of the state-action-association connection weights.

A complete list of the actions is presented in S5 Appendix.

The ANNABELL system is composed of several subnetworks. Fig 5 represents a schematic diagram of the main subnetworks in the STM and in the LTM. Each rectangular block in this diagram represents a subnetwork composed of interconnected artificial neurons. A detailed description of the system architecture is provided in S5 Appendix.


Fig 5. Schematic diagram of the main system architecture. Each rectangle represents a subnetwork, which is composed of interconnected artificial neurons. Only the main subnetworks are represented in this diagram. The arrows that join the rectangles represent directional connections among neurons of different subnetworks. https://doi.org/10.1371/journal.pone.0140866.g005

Communication between the human interlocutor and the system takes place through a user interface. The interface converts words into input patterns and submits them one by one to the system, extracts output patterns and converts them to words. It also sends reward signals to the system when prompted by the human. The interface includes a monitor tool that can be used to display the content of the SSMs that compose the system.

The system can work in five operating modes, which are briefly described below.

Acquisition. In this operating mode, the words of a phrase are acquired one by one and stored in the input-phrase buffer.

Association. In this operating mode, the input phrase is copied from the input-phrase buffer to the working-phrase buffer and it is stored in a long-term memory (represented by the block Memorized Phrases in Fig 5). After that, all possible groups of contiguous words (with at most four words in a group) are extracted from the working phrase and copied to the word-group buffer, and the association between each word group and the whole phrase is memorized in a long-term memory (the block retrieval structure in Fig 5).

Exploration. In this mode the system executes partially random sequences of elementary operations (mental actions) on word groups and phrase buffers. The human interlocutor can suggest to the system a target phrase or a target word group. The exploration is terminated when it produces the target phrase or target word group, or when the number of iterations exceeds a predefined limit.

Reward. When the exploration process produces a phrase or a word group that the teacher recognizes as worth rewarding (target phrase or target word group), the teacher can activate a rewarding procedure. In this operating mode the system retrieves the state-action sequence that led to the target phrase or target word group. The association between each state of the sequence and the corresponding action is rewarded by changing the connection weights of the state-action association SSM through the DHL rule.

Exploitation. In this operating mode the state-action association SSM, trained by the rewarding procedure, is used to associate a mental action to each system state. The state-action-association SSM is updated through the k-winner-take-all rule. It receives as input the internal state of the system (represented by a dashed rectangle in Fig 5), and it sends its output to the elaboration-actions SSM, which is updated through the (one) winner-take-all rule. In this way a single elaboration action is selected, namely the one that is most represented among the outputs of the k winners of the state-action-association SSM.

The basic action sequence used during the exploration operating mode is the following:

W_FROM_WK: initializes the phrase index (PhI) to zero, to prepare the extraction of words from the working-phrase buffer;

NEXT_W (N1 times): skips N1 words of the working-phrase buffer;

FLUSH_WG: clears the content of the word-group buffer;

GET_W, NEXT_W (N2 times): copies N2 consecutive words from the working-phrase buffer to the word-group buffer;

WG_OUT (0/1 times): copies the word-group buffer content to the output buffer;

RETR_AS (0/1 times): retrieves a phrase associated with the word group by the association mechanism.

N1 and N2 are random integers. N1 can be zero, while N2 must be greater than or equal to one. The range of N1 and N2 depends on the maximum phrase size (ten words in the current implementation). Additionally, the system can optionally execute the following actions:

GET_START_PH (0/1 times): retrieves the starting phrase in the same context of the working phrase;

GET_NEXT_PH (N3 times): retrieves sequentially phrases belonging to the same context.

The basic action sequence can be iterated several times, until the system produces an output. If the output does not correspond to the target output, the whole process is restarted. When the working phrase indicates a task that cannot be executed immediately, it can be set as a goal by inserting it in an SSM that acts as a goal stack, with the action PUSH_GOAL. When the goal is reached, the phrase can be removed from the stack with the action DROP_GOAL. S3 Appendix describes in detail, on two examples, how the neural activation patterns evolve, how the connection weights are modified during training, and how these weight changes make the system able to generalize the acquired knowledge to new sentences.
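Two of the steps described above lend themselves to a short symbolic sketch: the extraction of contiguous word groups in the association mode, and one random pass of the basic exploration sequence. The real system performs both at the neural level. The function names, the uniform sampling of N1 and N2, their bounds (chosen here so that the copy stays inside the phrase) and the 0.5 probability for the optional actions are all assumptions, not details taken from the paper.

```python
import random

MAX_PHRASE_SIZE = 10  # maximum phrase size stated in the text
MAX_GROUP_SIZE = 4    # maximum word-group size stated in the text


def word_groups(phrase, max_len=MAX_GROUP_SIZE):
    """All groups of contiguous words containing at most max_len words."""
    words = phrase.split()
    return [
        " ".join(words[start:start + size])
        for start in range(len(words))
        for size in range(1, max_len + 1)
        if start + size <= len(words)
    ]


def basic_action_sequence(rng, phrase_len=MAX_PHRASE_SIZE):
    """One random pass of the basic exploration sequence (list of action names)."""
    n1 = rng.randint(0, phrase_len - 1)   # words to skip; may be zero
    n2 = rng.randint(1, phrase_len - n1)  # words to copy; at least one
    actions = ["W_FROM_WK"] + ["NEXT_W"] * n1 + ["FLUSH_WG"]
    actions += ["GET_W", "NEXT_W"] * n2
    if rng.random() < 0.5:                # WG_OUT is executed 0 or 1 times
        actions.append("WG_OUT")
    if rng.random() < 0.5:                # RETR_AS is executed 0 or 1 times
        actions.append("RETR_AS")
    return actions


# Usage
groups = word_groups("the turtle is a reptile")
sequence = basic_action_sequence(random.Random(0), phrase_len=5)
```

For a five-word phrase the first function yields 5 + 4 + 3 + 2 = 14 word groups, each of which would be associated with the whole phrase in the retrieval structure.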

The database

The database of sentences used for training and testing the system is organized in five datasets, each devoted to a thematic group, i.e. people, parts of the body, categorization, communicative interactions and movement in a text-based virtual environment. Each of those datasets includes declarative sentences, conversational sentences and interrogative sentences.

Declarative sentences are used to give some information to the system without expecting a response. As the system has no sensory input, apart from that provided by the text-based interface, all the information must be provided in the form of input sentences.

Interrogative sentences are questions that expect an answer from the system. In the training stage, for each question the teacher suggests the associations that can be used to build a valid answer. In the test stages, the questions are used to verify whether the system is able to generalize what it learned during the training phase. An answer is considered to be correct only if it is both syntactically and semantically correct.

Conversational sentences that expect a turn taking from the system are treated in the same way as the questions: for this type of sentences, in the training stage the teacher suggests response sentences that are appropriate for the conversation. On the other hand, conversational sentences that do not expect a turn taking are treated as declarative sentences.

The people dataset

The first dataset is devoted to the subject people, and it is partially inspired by the Language Development Survey work of Rescorla et al. [44,45]. The sentences of this dataset have been prepared by casting the system as a four-year-old girl in her social environment, which includes the two parents, a sister, a friend, two cousins, the four grandparents, two aunts, two uncles and six other children, for a total of twenty persons. Four of those persons, namely the two parents, the sister and the friend, are considered to have a closer relationship to the system, which means that the dataset provides more information for those four persons than for the others. In some cases, the two cousins are also included in the group of closer persons. Some sentences depend on the possible relationships between the persons and the system. In such cases, we distinguish nine types of relationships, i.e. father, mother, sister, friend, cousin, grandmother, grandfather, aunt and uncle. The six other children are included in the social environment mainly for training and evaluating the system in age comparison tasks.

Some declarative sentences (how-to sentences) are used to provide prescriptions on how to accomplish some specific tasks, for instance “to answer if someone is younger or older than you, you should compare your age with his age”, or to express language rules in a simple verbal form, such as “the possessive pronoun for a woman is her”. Table 1 shows the types of declarative sentences used in the people dataset. The total number of declarative sentences in this dataset is 225.


Table 1. Sentences of the people dataset. The social environment described in this dataset includes twenty persons. In the second column, <person> can be “Mum”, “Dad”, or the name of one of the other eighteen persons, <relationship> can be “father”, “mother”, “sister”, “friend”, “cousin”, “Grandma”, “Grandpa”, “aunt” or “uncle”. <number> can be a number or, in row 21, also “some” or “many”. The “(s)” denotes the possibility of a plural form. In row 15, <verb> and <complement> describe the profession in terms understandable for a preschool child, e.g. “the journalist writes in the newspaper”. The sentences in row 24 use the present progressive, as in “Susan is reading a book”. The sentences in row 25 (how-to sentences) are verbal prescriptions, expressed in natural language, that are used to instruct the system on how to perform specific tasks in language processing. https://doi.org/10.1371/journal.pone.0140866.t001

The questions used in the people dataset are also inspired by the work of Rescorla [44,45], and they are appropriate for a preschool child, as in the following examples: “what does your father do?”, “what games do you like?”, “do you have a sister?”, “is Dad older than Mum?”, etc. A full list of the declarative sentences and of the questions can be found in the files that are distributed with the software package. The questions explore the meaning of words, but they are also used to train the system in language and reasoning skills, such as:

- use of personal and possessive pronouns;
- answering polar (yes/no) questions, alternative (choice) questions, wh-questions and question-like imperative sentences (e.g. “tell me”);
- counting and comparing numbers, as for instance in age comparison: “is Letizia older or younger than your sister?”;
- learning language rules: “the possessive pronoun for a female person is her”.

The following question/answer example illustrates some of the abilities acquired by the system:

Q: is your friend younger than you?
A: no, she is older.

The system is able to answer the question Q by following a line of reasoning that it has learned through the communication with the human, thanks to its adaptive behavior. The system uses the past experience listed below.

1) The system has been taught to count.
2) The system has been taught to decide whether another child is younger or older than the girl that it impersonates, through the following phrase: “to answer if someone is younger or older than you, you should compare your age with his age”.
3) The system has learned the age of the girl that it impersonates: “you are four years old”.
4) The system has learned that the words “your friend” refer to the friend Letizia: “Letizia is your friend”.
5) The system knows the age of Letizia: “Letizia is five years old”.
6) The system has learned how to use personal pronouns, therefore it can answer using the personal pronoun “she” instead of the name Letizia.

The teacher taught the system to answer questions similar to the question Q, guiding it through a series of mental operations (associations and extractions of word groups from sentences), through the exploration-reward method described previously. At this point the system is able to generalize the procedure and to answer questions similar to those used for training. It is important to emphasize that this whole process takes place in the system at a subsymbolic (neural) level and that phrase memorization and learning take place in the form of synaptic weight changes through the DHL rule. The examples in S1 Appendix show in more detail how the system is trained to answer a question.
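The line of reasoning behind the answer to question Q can be made explicit with a purely symbolic sketch. This is only an illustration of the logical steps: the actual system stores no dictionaries and runs no such function, since everything happens through neural activations and weight changes. The data structures, the function name and the wording of the cases not shown in the text are assumptions.

```python
# Facts taken from the declarative sentences quoted in the text.
REFERENT = {"your friend": "Letizia"}  # "Letizia is your friend"
AGE = {"you": 4, "Letizia": 5}         # "you are four years old", "Letizia is five years old"
PRONOUN = {"Letizia": "she"}           # assumed outcome of the pronoun training


def is_younger_than_you(who):
    """Answer "is <who> younger than you?" by comparing the two ages."""
    name = REFERENT.get(who, who)      # resolve "your friend" to a name
    pronoun = PRONOUN[name]            # answer with a pronoun, not the name
    if AGE[name] < AGE["you"]:
        return f"yes, {pronoun} is younger"  # assumed wording
    if AGE[name] > AGE["you"]:
        return f"no, {pronoun} is older"
    return f"no, {pronoun} is the same age"  # case not covered in the text


answer = is_younger_than_you("your friend")
```

With the facts above, the comparison 5 > 4 leads to the answer given in the example, “no, she is older”.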

The parts of the body dataset

The second dataset is devoted to the main parts of the body, and it is also partially based on the words of this subject category included in the Language Development Survey. Through this dataset, the system is trained to recognize the definition of a word as well as different ways to specify the location of an object. After the training, the system should be able to answer questions of the type “what is” and “where is”. Table 2 shows the types of declarative sentences used in this dataset. Thirty-three body parts are considered. For each of them, a declarative sentence provides a simple definition in a form that should be understandable for a preschool child. Other sentences specify the locations of the body parts. In this case the correspondence between body parts and sentences is not one-to-one, because the location of a body part can be described in more than one way. Eight declarative sentences describe in simple terms the function of some body parts, e.g. “with your legs you can walk, run and jump”. Finally, six sentences are how-to sentences. The total number of declarative sentences in the parts of the body dataset is 122.


Table 2. Sentences of the parts-of-the-body dataset. Thirty-three body parts are included in the dataset. In the first column, <part> is the name of a body part. The “(s)” refers to a possible plural form. https://doi.org/10.1371/journal.pone.0140866.t002

Only five types of questions are used in this dataset, i.e.

what is the <part>?
what are the <part>s?
where is the <part>?
where are the <part>s?
what can you do with your <part>(s)?

where <part> is the name of a body part.

The categorization dataset

The third dataset is used for evaluating the categorization capabilities of the system. This dataset uses 62 different animal names from 6 categories: 13 mammals, 13 birds, 13 fish, 8 reptiles, 4 amphibians and 11 insects. The membership of each animal name in one of the six categories is specified by 62 declarative sentences of the form “the <animal> is a <category>”, where <animal> is an animal name and <category> is one of the six categories listed previously, as for instance “the turtle is a reptile”. Six further sentences of the form “<category>s are animals” and one how-to sentence are included to train the system to deal with categorization hierarchies. The dataset also includes 48 declarative sentences of the form “the <animal> is <adjective>”, where <adjective> is one of the five adjectives big, dangerous, domestic, fast or small. The total number of sentences in this dataset is 117.

In the training stage, the human teacher asks the system to name an animal belonging to one of the categories, e.g. “tell me a mammal”, and then guides the system to a correct answer, as shown in detail in Sect. 2 in S1 Appendix. A single training example, involving one animal name from one category, is sufficient. After that, the system is able to answer correctly the analogous question for all 6 categories. This test shows that the system is able to learn that the word pair “is a” is used in sentences such as “the dog is a mammal” to state that a concept belongs to a category, and that the word group “tell me a” in a question can be used to ask for a concept to be retrieved from a category.

A more complex categorization task in the same dataset involves the ability to learn categorization hierarchies. In this case, the human asks the system two consecutive questions, as in the following example:

Q: what is the turtle?
A: it is an animal
Q: what kind of animal?
A: a reptile

Other questions in this dataset are used to evaluate the system's capability to combine information on categories and adjectives, as in the following example:

Q: tell me a big reptile
A: crocodile
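The three kinds of categorization question shown above can be mimicked with a small symbolic lookup. It only illustrates what the answers depend on: the system itself holds this knowledge as memorized phrases and learned state-action associations, not as dictionaries. The facts below are limited to those stated or implied by the examples in the text, and the function names are assumptions.

```python
# "the <animal> is a <category>" sentences (crocodile is implied by the example).
CATEGORY = {"turtle": "reptile", "dog": "mammal", "crocodile": "reptile"}
# "<category>s are animals" sentences.
SUPERCATEGORY = {"reptile": "animal", "mammal": "animal"}
# "the <animal> is <adjective>" sentences (implied by the "big reptile" example).
ADJECTIVES = {"crocodile": {"big"}}


def article(noun):
    """Indefinite article for a noun (simple vowel test)."""
    return "an" if noun[0] in "aeiou" else "a"


def what_is(animal):
    """First question of the hierarchy: answer with the top-level category."""
    top = SUPERCATEGORY[CATEGORY[animal]]
    return f"it is {article(top)} {top}"


def what_kind(animal):
    """Follow-up question: answer with the specific category."""
    category = CATEGORY[animal]
    return f"{article(category)} {category}"


def tell_me(category, adjective=None):
    """Return one animal of the category, optionally having the adjective."""
    for animal, cat in CATEGORY.items():
        if cat == category and (
            adjective is None or adjective in ADJECTIVES.get(animal, set())
        ):
            return animal
    return None


first = what_is("turtle")
second = what_kind("turtle")
third = tell_me("reptile", "big")
```

The combined query succeeds only for an animal that satisfies both the category fact and the adjective fact, which is the ability the last example in the text is meant to test.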

The communicative interactions dataset

The fourth dataset is devoted to communicative interactions, and it is based on a mother/child dialogue extracted from the Warren-Leubecker corpus [46,47], which is part of the CHILDES database [48]. This corpus contains data from 20 children interacting with one of their parents. The sessions took place in the child’s home. The parent was instructed to bring the child into conversation and to talk to him as naturally as possible. This corpus appeared to be more appropriate than others for training the system, because the children's ages were appropriate and because verbal communication was predominant over nonverbal communication, play and actions. The session used in this work is based on the file “david.cha”, which contains a transcription of the dialogue between a child aged 5 years and 10 months and his mother.

The system was trained in a text-based virtual environment. First, we hypothesized what kind of past experiences of the child could be compatible with the David dialogue: one day a relative brought the child to an amusement park; the child played a video game (Pacman). Another day, at the kindergarten, the teacher organized a costume party, where each child was to dress as a character that represents a letter of the alphabet. At home, the mother helped the child to prepare his letterman costume. Those past experiences are described through a first set of 52 declarative sentences. Then we described similar possible past experiences of the child impersonated by the system (a little girl, in our case): one day her father brought her, her sister and her cousin to the central park, where they played hide-and-seek and other games; another day, she was in Susan's room; aunt Carol told Susan to tidy up her room, therefore Susan started to put things inside her toy-box… Those experiences are described through another set of 44 declarative sentences, similar in syntax to those of the first set but different in content.
The training is based on this second set. A further 18 sentences in this dataset are how-to sentences. The human teacher guided the system into a conversation similar in syntax to the David dialogue, but related to a different past experience, and suggested either possible answers to the questions or sentences appropriate for the conversation. In the test stage, the human interlocutor had a conversation with the ANNABELL system similar to that taken from the Warren-Leubecker corpus. Sect. 2 in S2 Appendix lists the declarative sentences used to build the system experience in a virtual text-based environment. Sect. 3 in S2 Appendix lists the sentences used to train the system.