First published Thu Feb 6, 2014; substantive revision Wed Feb 26, 2014

The following article outlines the goals and methods of computational linguistics (in historical perspective), and then delves in some detail into the essential concepts of linguistic structure and analysis (section 2), interpretation (sections 3–5), and language use (sections 6–7), as well as acquisition of knowledge for language (section 8), statistical and machine learning techniques in natural language processing (section 9), and miscellaneous applications (section 10).

Computational linguistics is the scientific and engineering discipline concerned with understanding written and spoken language from a computational perspective, and building artifacts that usefully process and produce language, either in bulk or in a dialogue setting. To the extent that language is a mirror of mind, a computational understanding of language also provides insight into thinking and intelligence. And since language is our most natural and most versatile means of communication, linguistically competent computers would greatly facilitate our interaction with machines and software of all sorts, and put at our fingertips, in ways that truly meet our needs, the vast textual and other resources of the internet.

The theoretical goals of computational linguistics include the formulation of grammatical and semantic frameworks for characterizing languages in ways enabling computationally tractable implementations of syntactic and semantic analysis; the discovery of processing techniques and learning principles that exploit both the structural and distributional (statistical) properties of language; and the development of cognitively and neuroscientifically plausible computational models of how language processing and learning might occur in the brain.

The practical goals of the field are broad and varied. Some of the most prominent are: efficient text retrieval on some desired topic; effective machine translation (MT); question answering (QA), ranging from simple factual questions to ones requiring inference and descriptive or discursive answers (perhaps with justifications); text summarization; analysis of texts or spoken language for topic, sentiment, or other psychological attributes; dialogue agents for accomplishing particular tasks (purchases, technical trouble shooting, trip planning, schedule maintenance, medical advising, etc.); and ultimately, creation of computational systems with human-like competency in dialogue, in acquiring language, and in gaining knowledge from text.

The methods employed in theoretical and practical research in computational linguistics have often drawn upon theories and findings in theoretical linguistics, philosophical logic, cognitive science (especially psycholinguistics), and of course computer science. However, early work from the mid-1950s to around 1970 tended to be rather theory-neutral, the primary concern being the development of practical techniques for such applications as MT and simple QA. In MT, central issues were lexical structure and content, the characterization of “sublanguages” for particular domains (for example, weather reports), and the transduction from one language to another (for example, using rather ad hoc graph transformation grammars or transfer grammars). In QA, the concern was with characterizing the question patterns encountered in a specific domain, and the relationship of these question patterns to the forms in which answers might be stored, for instance in a relational database.

By the mid-1960s a number of researchers emboldened by the increasing power and availability of general-purpose computers, and inspired by the dream of human-level artificial intelligence, were designing systems aimed at genuine language understanding and dialogue. The techniques and theoretical underpinnings employed varied greatly. An example of a program minimally dependent on linguistic or cognitive theory was Joseph Weizenbaum's ELIZA program, intended to emulate (or perhaps caricature) a Rogerian psychiatrist. ELIZA relied on matching user inputs to stored patterns (brief word sequences interspersed with numbered slots, to be filled from the input), and returned one of a set of output templates associated with the matched input pattern, instantiated with material from the input. While ELIZA and its modern chatbot descendants are often said to rely on mere trickery, it can be argued that human verbal behavior is to some degree reflexive in the manner of ELIZA, i.e., we function in “preprogrammed” or formulaic manner in certain situations, for example, in exchanging greetings, or in responding at a noisy party to comments whose contents, apart from an occasional word, eluded us.

A very different perspective on linguistic processing was proffered in the early years by researchers who took their cue from ideas about associative processes in the brain. For example, M. Ross Quillian (1968) proposed a model of word sense disambiguation based on “spreading activation” in a network of concepts (typically corresponding to senses of nouns) interconnected through relational links (typically corresponding to senses of verbs or prepositions). Variants of this “semantic memory” model were pursued by researchers such as Rumelhart, Lindsay and Norman (1972), and remain an active research paradigm in computational models of language and cognition. Another psychologically inspired line of work was initiated in the 1960s and pursued for over two decades by Roger Schank and his associates, but in his case the goal was full story understanding and inferential question answering. A central tenet of the work was that the representation of sentential meaning as well as world knowledge centered around a few (e.g., 11) action primitives, and inference was driven by rules associated primarily with these primitives (a prominent exponent of a similar view was Yorick Wilks). Perhaps the most important aspect of Schank's work was the recognition that language understanding and inference were heavily dependent on a large store of background knowledge, including knowledge of numerous “scripts” (prototypical ways in which familiar kinds of complex events, such as dining at a restaurant, unfold) and plans (prototypical ways in which people attempt to accomplish their goals) (Schank & Abelson 1977).

More purely AI-inspired approaches that also emerged in the 1960s were exemplified in systems such as Sad Sam (Lindsay 1963), Sir (Raphael 1968) and Student (Bobrow 1968). These featured devices such as pattern matching/transduction for analyzing and interpreting restricted subsets of English, knowledge in the form of relational hierarchies and attribute-value lists, and QA methods based on graph search, formal deduction protocols and numerical algebra. An influential idea that emerged slightly later was that knowledge in AI systems should be framed procedurally rather than declaratively—to know something is to be able to perform certain functions (Hewitt 1969). Two quite impressive systems that exemplified such a methodology were shrdlu (Winograd 1972) and Lunar (Woods et al. 1972), which contained sophisticated proceduralized grammars and syntax-to-semantics mapping rules, and were able to function fairly robustly in their “micro-domains” (simulated blocks on a table, and a lunar rock database, respectively). In addition, shrdlu featured significant planning abilities, enabled by the microplanner goal-chaining language (a precursor of Prolog). Difficulties that remained for all of these approaches were extending linguistic coverage and the reliability of parsing and interpretation, and most of all, moving from microdomains, or coverage of a few paragraphs of text, to more varied, broader domains. Much of the difficulty of scaling up was attributed to the “knowledge acquisition bottleneck”—the difficulty of coding or acquiring the myriad facts and rules evidently required for more general understanding. Classic collections containing several articles on the early work mentioned in the last two paragraphs are Marvin Minsky's Semantic Information Processing (1968) and Schank and Colby's Computer Models of Thought and Language (1973).

Since the 1970s, there has been a gradual trend away from purely procedural approaches to ones aimed at encoding the bulk of linguistic and world knowledge in more understandable, modular, re-usable forms, with firmer theoretical foundations. This trend was enabled by the emergence of comprehensive syntactico-semantic frameworks such as Generalized Phrase Structure Grammar (GPSG), Head-driven Phrase Structure Grammar (HPSG), Lexical-Functional Grammar (LFG), Tree-Adjoining Grammar (TAG), and Combinatory Categorial Grammar (CCG), where in each case close theoretical attention was paid both to the computational tractability of parsing, and the mapping from syntax to semantics. Among the most important developments in the latter area were Richard Montague's profound insights into the logical (especially intensional) semantics of language, and Hans Kamp's and Irene Heim's development of Discourse Representation Theory (DRT), offering a systematic, semantically formal account of anaphora in language.

A major shift in nearly all aspects of natural language processing began in the late 1980s and was virtually complete by the end of 1995: this was the shift to corpus-based, statistical approaches (signalled for instance by the appearance of two special issues on the subject by the quarterly Computational Linguistics in 1993). The new paradigm was enabled by the increasing availability and burgeoning volume of machine-readable text and speech data, and was driven forward by the growing awareness of the importance of the distributional properties of language, the development of powerful new statistically based learning techniques, and the hope that these techniques would overcome the scalability problems that had beset computational linguistics (and more broadly AI) since its beginnings.

The corpus-based approach has indeed been quite successful in producing comprehensive, moderately accurate speech recognizers, part-of-speech (POS) taggers, parsers for learned probabilistic phrase-structure grammars, and even MT and text-based QA systems and summarization systems. However, semantic processing has been restricted to rather shallow aspects, such as extraction of specific data concerning specific kinds of events from text (e.g., location, date, perpetrators, victims, etc., of terrorist bombings) or extraction of clusters of argument types, relational tuples, or paraphrase sets from text corpora. Currently, the corpus-based, statistical approaches are still dominant, but there appears to be a growing movement towards integration of formal logical approaches to language with corpus-based statistical approaches in order to achieve deeper understanding and more intelligent behavior in language comprehension and dialogue systems. There are also efforts to combine connectionist and neural-net approaches with symbolic and logical ones. The following sections will elaborate on many of the topics touched on above. General references for computational linguistics are Allen 1995, Jurafsky and Martin 2009, and Clark et al. 2010.

Language is structured at multiple levels, beginning in the case of spoken language with patterns in the acoustic signal that can be mapped to phones (the distinguishable successive sounds of which languages are built up). Groups of phones that are equivalent for a given language (not affecting the words recognized by a hearer, if interchanged) are the phonemes of the language. The phonemes in turn are the constituents of morphemes (minimal meaningful word segments), and these provide the constituents of words. (In written language one speaks instead of characters, graphemes, syllables, and words.) Words are grouped into phrases, such as noun phrases, verb phrases, adjective phrases and prepositional phrases, which are the structural components of sentences, expressing complete thoughts. At still higher levels we have various types of discourse structure, though this is generally looser than lower-level structure.

Techniques have been developed for language analysis at all of these structural levels, though space limitations will not permit a serious discussion of methods used below the word level. It should be noted, however, that the techniques developed for speech recognition in the 1980s and 1990s were very influential in turning NLP research towards the new corpus-based, statistical approach referred to above. One key idea was that of hidden Markov models (HMMs), which model “noisy” sequences (e.g., phone sequences, phoneme sequences, or word sequences) as if generated probabilistically by “hidden” underlying states and their transitions. Individually or in groups, successive hidden states model the more abstract, higher-level constituents to be extracted from observed noisy sequences, such as phonemes from phones, words from phonemes, or parts of speech from words. The generation probabilities and the state transition probabilities are the parameters of such models, and importantly these can be learned from training data. Subsequently the models can be efficiently applied to the analysis of new data, using fast dynamic programming algorithms such as the Viterbi algorithm. These quite successful techniques were later generalized to higher-level structure, soon influencing all aspects of NLP.
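The Viterbi decoding step for such a model can be sketched in a few lines of Python. In the sketch below the tag set, the word sequence, and all probabilities are invented toy values for illustration, not parameters learned from data.

```python
# A minimal HMM part-of-speech tagger decoded with the Viterbi algorithm.
# States are POS tags; the hand-set probabilities below are illustrative only.

def viterbi(words, states, start_p, trans_p, emit_p):
    # best[t][s]: probability of the best state sequence ending in s at step t
    best = [{s: start_p[s] * emit_p[s].get(words[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[t-1][r] * trans_p[r][s] * emit_p[s].get(words[t], 0.0), r)
                for r in states)
            best[t][s] = prob
            back[t][s] = prev
    # Reconstruct the most probable tag sequence by following back-pointers
    last = max(best[-1], key=best[-1].get)
    tags = [last]
    for t in range(len(words) - 1, 1 - 1, -1):
        last = back[t][last]
        tags.append(last)
        if t == 1:
            break
    return list(reversed(tags))

states = ['Det', 'N', 'V']
start_p = {'Det': 0.6, 'N': 0.3, 'V': 0.1}
trans_p = {'Det': {'Det': 0.0, 'N': 0.9, 'V': 0.1},
           'N':   {'Det': 0.2, 'N': 0.2, 'V': 0.6},
           'V':   {'Det': 0.6, 'N': 0.3, 'V': 0.1}}
emit_p = {'Det': {'a': 0.7, 'the': 0.3},
          'N':   {'mortal': 0.4, 'love': 0.1, 'star': 0.5},
          'V':   {'loves': 0.6, 'star': 0.2, 'detected': 0.2}}

print(viterbi(['the', 'star', 'loves', 'a', 'mortal'],
              states, start_p, trans_p, emit_p))
# → ['Det', 'N', 'V', 'Det', 'N']
```

Note that the ambiguous word star is resolved to N here because the Det → N transition outweighs the Det → V transition, illustrating how HMMs combine emission and transition evidence.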

Before considering how grammatical structure can be represented, analyzed and used, we should ask what basis we might have for considering a particular grammar “correct”, or a particular sentence “grammatical,” in the first place. Of course, these are primarily questions for linguistics proper, but the answers we give certainly have consequences for computational linguistics.

Traditionally, formal grammars have been designed to capture linguists' intuitions about well-formedness as concisely as possible, in a way that also allows generalizations about a particular language (e.g., subject-auxiliary inversion in English questions) and across languages (e.g., a consistent ordering of nominal subject, verb, and nominal object for declarative, pragmatically neutral main clauses). Concerning linguists' specific well-formedness judgments, it is worth noting that these are largely in agreement not only with each other, but also with judgments of non-linguists—at least for “clearly grammatical” and “clearly ungrammatical” sentences (Pinker 2007). Also the discovery that conventional phrase structure supports elegant compositional theories of meaning lends credence to the traditional theoretical methodology.

However, traditional formal grammars have generally not covered any one language comprehensively, and have drawn sharp boundaries between well-formedness and ill-formedness, when in fact people's (including linguists') grammaticality judgments for many sentences are uncertain or equivocal. Moreover, when we seek to process sentences “in the wild”, we would like to accommodate regional, genre-specific, and register-dependent variations in language, dialects, and erroneous and sloppy language (e.g., misspellings, unpunctuated run-on sentences, hesitations and repairs in speech, faulty constituent orderings produced by non-native speakers, and fossilized errors by native speakers, such as “for you and I”—possibly a product of schoolteachers inveighing against “you and me” in subject position). Consequently, linguists' idealized grammars need to be made variation-tolerant in most practical applications. The way this need has typically been met is by admitting a far greater number of phrase structure rules than linguistic parsimony would sanction—say, 10,000 or more rules instead of a few hundred. These rules are not directly supplied by linguists (computational or otherwise), but rather can be “read off” corpora of written or spoken language that have been decorated by trained annotators (such as linguistics graduate students) with their basic phrasal tree structure. Unsupervised grammar acquisition (often starting with POS-tagged training corpora) is another avenue (see section 9), but results are apt to be less satisfactory. In conjunction with statistical training and parsing techniques, this loosening of grammar leads to a rather different conception of what constitutes a grammatically flawed sentence: it is not necessarily one rejected by the grammar, but one whose analysis requires some rarely used rules.

As mentioned in section 1.2, the representations of grammars used in computational linguistics have varied from procedural ones to ones developed in formal linguistics, and systematic, tractably parsable variants developed by computationally oriented linguists. Winograd's shrdlu program, for example, contained code in his programmar language expressing,

To parse a sentence, try parsing a noun phrase (NP); if this fails, return NIL, otherwise try parsing a verb phrase (VP) next and if this fails, or succeeds with words remaining, return NIL, otherwise return success.

Similarly Woods' grammar for lunar was based on a certain kind of procedurally interpreted transition graph (an augmented transition network, or ATN), where the sentence subgraph might contain an edge labeled NP (analyze an NP using the NP subgraph) followed by an edge labeled VP (analogously interpreted). In both cases, local feature values (e.g., the number and person of an NP and VP) are registered, and checked for agreement as a condition for success. A closely related formalism is that of definite clause grammars (e.g., Pereira & Warren 1982), which employ Prolog to assert “facts” such as that if the input word sequence contains an NP reaching from index I1 to index I2 and a VP reaching from index I2 to index I3, then the input contains a sentence reaching from index I1 to index I3. (Again, feature agreement constraints can be incorporated into such assertions as well.) Given the goal of proving the presence of a sentence, the goal-chaining mechanism of Prolog then provides a procedural interpretation of these assertions.

At present the most commonly employed declarative representations of grammatical structure are context-free grammars (CFGs) as defined by Noam Chomsky (1956, 1957), because of their simplicity and efficient parsability. Chomsky had argued that only deep linguistic representations are context-free, while surface form is generated by transformations (for example, in English passivization and in question formation) that result in a non-context-free language. However, it was later shown that on the one hand, unrestricted Chomskian transformational grammars allowed for computationally intractable and even undecidable languages, and on the other, that the phenomena regarded by Chomsky as calling for a transformational analysis could be handled within a context-free framework by use of suitable features in the specification of syntactic categories. Notably, unbounded movement, such as the apparent movement of the final verb object to the front of the sentence in “Which car did Jack urge you to buy?”, was shown to be analyzable in terms of a gap (or slash) feature of type /NP[wh] that is carried by each of the two embedded VPs, providing a pathway for matching the category of the fronted object to the category of the vacated object position. Within non-transformational grammar frameworks, one therefore speaks of unbounded (or long-distance) dependencies instead of unbounded movement. At the same time it should be noted that at least some natural languages have been shown to be mildly context-sensitive (e.g., Dutch and Swiss German exhibit cross-serial dependencies where a series of nominals “NP1 NP2 NP3 …” need to be matched, in the same order, with a subsequent series of verbs, “V1 V2 V3 …”). Grammatical frameworks that seem to allow for approximately the right degree of mild context sensitivity include Head Grammar, Tree-Adjoining Grammar (TAG), Combinatory Categorial Grammar (CCG), and Linear Indexed Grammar (LIG). 
Head grammars allow insertion of a complement between the head of a phrase (e.g., the initial verb of a VP, the final noun of an NP, or the VP of a sentence) and an already present complement; they were a historical predecessor of Head-Driven Phrase Structure Grammar (HPSG), a type of unification grammar (see below) that has received much attention in computational linguistics. However, unrestricted HPSG can generate the recursively enumerable (in general only semi-decidable) languages.

A typical (somewhat simplified) sample fragment of a context-free grammar is the following, where phrase types are annotated with feature-value pairs:

S[vform:v] → NP[pers:p numb:n case:subj] VP[vform:v pers:p numb:n]
VP[vform:v pers:p numb:n] → V[subcat:_np vform:v pers:p numb:n] NP[case:obj]
NP[pers:3 numb:n] → Det[pers:3 numb:n] N[numb:n]
NP[numb:n pers:3 case:c] → Name[numb:n pers:3 case:c]

Here v, n, p, c are variables that can assume values such as ‘past’, ‘pres’, ‘base’, ‘pastparticiple’, … (i.e., various verb forms), ‘1’, ‘2’, ‘3’ (1st, 2nd, and 3rd person), ‘sing’, ‘plur’, and ‘subj’, ‘obj’. The subcat feature indicates the complement requirements of the verb. The lexicon would supply entries such as

V[subcat:_np vform:pres numb:sing pers:3] → loves
Det[pers:3 numb:sing] → a
N[pers:3 numb:sing] → mortal
Name[pers:3 numb:sing gend:fem case:subj] → Thetis,

allowing, for example, a phrase structure analysis of the sentence “Thetis loves a mortal” (where we have omitted the feature names for simplicity, leaving only their values, and ignored the case feature):



Figure 1: Syntactic analysis of a sentence as a parse tree
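The feature agreement assumed by such a grammar can be sketched as a small unification routine. The dictionary encoding of feature structures below, and the treatment of single lower-case letters as rule variables, are simplifications invented for this illustration; real unification grammars use considerably richer machinery.

```python
# A sketch of feature-agreement checking for rules like
#   S[vform:v] -> NP[pers:p numb:n case:subj] VP[vform:v pers:p numb:n]
# Single lower-case letters (n, p, v, c) act as variables shared across
# the categories of one rule; other values are constants.

def unify(pattern, feats, bindings):
    """Extend bindings so that pattern matches feats, or return None."""
    for attr, val in pattern.items():
        if len(val) == 1 and val.islower():      # variable: look up any binding
            val = bindings.get(val, val)
        if len(val) == 1 and val.islower():      # still unbound: bind it now
            bindings[val] = feats[attr]
        elif feats.get(attr) != val:             # constant: must agree exactly
            return None
    return bindings

np_feats = {'pers': '3', 'numb': 'sing', 'case': 'subj'}   # from "Thetis"
vp_feats = {'vform': 'pres', 'pers': '3', 'numb': 'sing'}  # from "loves a mortal"

b = unify({'pers': 'p', 'numb': 'n', 'case': 'subj'}, np_feats, {})
if b is not None:                                # NP matched; now check the VP
    b = unify({'vform': 'v', 'pers': 'p', 'numb': 'n'}, vp_feats, b)
print(b)   # → {'p': '3', 'n': 'sing', 'v': 'pres'}
```

Because p and n are bound by the NP before the VP is checked, a person or number mismatch (e.g., “Thetis love a mortal”) would make the second call return None, rejecting the analysis.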

As a variant of CFGs, dependency grammars (DGs) also enjoy wide popularity. The difference from CFGs is that hierarchical grouping is achieved by directly subordinating words to words (allowing for multiple dependents of a head word), rather than phrases to phrases. For example, in the sentence of figure 1 we would treat Thetis and mortal as dependents of loves, using dependency links labeled subj and obj respectively, and the determiner a would in turn be a dependent of mortal, via a dependency link mod (for modifier). Projective dependency grammars are ones with no crossing dependencies (so that the descendants of a node form a continuous text segment), and these generate the same languages as CFGs. Significantly, mildly non-projective dependency grammars, allowing a head word to dominate two separated blocks, provide the same generative capacity as the previously mentioned mildly context-sensitive frameworks that are needed for some languages (Kuhlmann 2013).
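The projectivity condition can be checked mechanically on a set of head-dependent arcs. The encoding below, giving each word's head by position with 0 for an artificial root, is just one illustrative convention.

```python
# Check whether a dependency analysis is projective, i.e., has no crossing
# arcs. heads[i] is the position of the head of word i+1 (0 = artificial
# root); word positions are 1-based.

def is_projective(heads):
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for (a, b) in arcs:
        for (c, d) in arcs:
            if a < c < b < d:        # arcs (a,b) and (c,d) cross
                return False
    return True

# "Thetis loves a mortal": loves (position 2) is the root; Thetis (1) and
# mortal (4) depend on loves; a (3) depends on mortal.
print(is_projective([2, 0, 4, 2]))   # → True

# A crossing structure of the cross-serial kind: word 1 depends on word 3
# and word 2 on word 4, so the two arcs cross.
print(is_projective([3, 4, 0, 3]))   # → False
```

The second call illustrates why cross-serial dependencies of the Dutch/Swiss-German sort fall outside projective (and hence context-free-equivalent) dependency grammars.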

As noted at the beginning of this section, traditional formal grammars proved too limited in coverage and too rigid in their grammaticality criteria to provide a basis for robust coverage of natural languages as actually used, and this situation persisted until the advent of probabilistic grammars derived from sizable phrase-bracketed corpora (notably the Penn Treebank). The simplest example of this type of grammar is a probabilistic context-free grammar or PCFG. In a PCFG, each phrase structure rule X → Y1 … Yk is assigned a probability, viewed as the probability that a constituent of type X will be expanded into a sequence of (immediate) constituents of types Y1, …, Yk. At the lowest level, the expansion probabilities specify how frequently a given part of speech (such as Det, N, or V) will be realized as a particular word. Such a grammar provides not only a structural but also a distributional model of language, predicting the frequency of occurrence of various phrase sequences and, at the lowest level, word sequences.
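The probability a PCFG assigns to a particular parse tree is simply the product of the probabilities of the rules used at its nodes. In the sketch below, both the nested-tuple tree encoding and the rule probabilities are invented for illustration; a real grammar would estimate them from treebank counts.

```python
# Score a parse tree under a toy PCFG: multiply the probability of the rule
# applied at each node. Rule probabilities here are invented, not estimated.

rule_prob = {
    ('S', ('NP', 'VP')): 1.0,
    ('VP', ('V', 'NP')): 0.4,
    ('NP', ('Det', 'N')): 0.5,
    ('NP', ('Name',)): 0.2,
    ('Name', ('Thetis',)): 0.01,
    ('V', ('loves',)): 0.05,
    ('Det', ('a',)): 0.4,
    ('N', ('mortal',)): 0.002,
}

def tree_prob(tree):
    """tree = (label, child, child, ...); leaves are plain word strings."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob[(label, rhs)]              # rule applied at this node
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)                # recurse into subtrees
    return p

parse = ('S', ('NP', ('Name', 'Thetis')),
              ('VP', ('V', 'loves'), ('NP', ('Det', 'a'), ('N', 'mortal'))))
print(tree_prob(parse))    # product of the eight rule probabilities above
```

Since every rule expanding a given category must share the probability mass, the grammar also defines a distribution over word strings, as described above.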

However, the simplest models of this type do not model the statistics of actual language corpora very accurately, because the expansion probabilities for a given phrase type (or part of speech) X ignore the surrounding phrasal context and the more detailed properties (such as head words) of the generated constituents. Yet context and detailed properties are very influential; for example, whether the final prepositional phrase in “She detected a star with {binoculars, planets}” modifies detected or planets is very dependent on word choice. Such modeling inaccuracies lead to parsing inaccuracies (see next subsection), and therefore generative grammar models have been refined in various ways, for example (in so-called lexicalized models) allowing for specification of particular phrasal head words in rules, or (in tree substitution grammars) allowing expansion of nonterminals into subtrees of depth 2 or more. Nevertheless, it seems likely that fully accurate distributional modeling of language would need to take account of semantic content, discourse structure, and intentions in communication, not only of phrase structure. Possibly construction grammars (e.g., Goldberg 2003), which emphasize the coupling between the entrenched patterns of language (including ordinary phrase structure, clichés, and idioms) and their meanings and discourse function, will provide a conceptual basis for building statistical models of language that are sufficiently accurate to enable more nearly human-like parsing accuracy.

Natural language analysis in the early days of AI tended to rely on template matching, for example, matching templates such as (X has Y) or (how many Y are there on X) to the input to be analyzed. This of course depended on having a very restricted discourse and task domain. By the late 1960s and early 70s, quite sophisticated recursive parsing techniques were being employed. For example, Woods' lunar system used a top-down recursive parsing strategy interpreting an ATN in the manner roughly indicated in section 2.2 (though ATNs in principle allow other parsing styles). It also saved recognized constituents in a table, much like the class of parsers we are about to describe. Later parsers were influenced by the efficient and conceptually elegant CFG parsers described by Jay Earley (1970) and (separately) by John Cocke, Tadao Kasami, and Daniel Younger (e.g., Younger 1967). The latter algorithm, termed the CYK or CKY algorithm after the three authors who discovered it independently, was particularly simple, using a bottom-up dynamic programming approach to first identify and tabulate the possible types (nonterminal labels) of sentence segments of length 1 (i.e., words), then the possible types of sentence segments of length 2, and so on, always building on the previously discovered segment types to recognize longer phrases. This process runs in cubic time in the length of the sentence, and a parse tree can be constructed from the tabulated constituents in quadratic time. The CYK algorithm assumes a Chomsky Normal Form (CNF) grammar, allowing only productions of form Np → Nq Nr, or Np → w, i.e., generation of two nonterminals or a word from any given nonterminal. This is only a superficial limitation, because arbitrary CF grammars are easily converted to CNF.
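A minimal CYK recognizer can be written directly from this description; the CNF grammar below is an invented toy covering the earlier example sentence.

```python
# A minimal CYK recognizer for a grammar in Chomsky Normal Form.
# table[(i, j)] holds the nonterminals that can generate the segment
# reaching from position i to position j of the input.

def cyk(words, binary_rules, lexical_rules):
    n = len(words)
    table = {(i, i + 1): {A for A, w in lexical_rules if w == words[i]}
             for i in range(n)}
    for length in range(2, n + 1):           # segments of increasing length
        for i in range(n - length + 1):
            j = i + length
            table[(i, j)] = {A for A, B, C in binary_rules
                             for k in range(i + 1, j)       # split point
                             if B in table[(i, k)] and C in table[(k, j)]}
    return 'S' in table[(0, n)]

binary_rules = [('S', 'NP', 'VP'), ('VP', 'V', 'NP'), ('NP', 'Det', 'N')]
lexical_rules = [('NP', 'Thetis'), ('V', 'loves'), ('Det', 'a'), ('N', 'mortal')]

print(cyk(['Thetis', 'loves', 'a', 'mortal'], binary_rules, lexical_rules))  # → True
print(cyk(['loves', 'Thetis'], binary_rules, lexical_rules))                 # → False
```

The three nested loops over length, start position, and split point give the cubic running time mentioned above; keeping back-pointers alongside the table entries would allow recovery of parse trees.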

The method most frequently employed nowadays in fully analyzing sentential structure is chart parsing. This is a conceptually simple and efficient dynamic programming method closely related to the algorithms just mentioned; i.e., it begins by assigning possible analyses to the smallest constituents and then inferring larger constituents based on these, until an instance of the top-level category (usually S) is found that spans the given text or text segment. There are many variants, depending on whether only complete constituents are posited or incomplete ones as well (to be progressively extended), and whether we proceed left-to-right through the word stream or in some other order (e.g., some seemingly best-first order). A common variant is a left-corner chart parser, in which partial constituents are posited whenever their “left corner”—i.e., leftmost constituent on the right-hand side of a rule—is already in place. Newly completed constituents are placed on an agenda, and items are successively taken off the agenda and used if possible as left corners of new, higher-level constituents, and to extend partially completed constituents. At the same time, completed constituents (or rather, categories) are placed in a chart, which can be thought of as a triangular table of width n and height n (the number of words processed), where the cell at indices (i, j), with j > i, contains the categories of all complete constituents so far verified reaching from position i to position j in the input. The chart is used both to avoid duplication of constituents already built, and ultimately to reconstruct one or more global structural analyses. (If all possible chart entries are built, the final chart will allow reconstruction of all possible parses.) Chart-parsing methods carry over to PCFGs essentially without change, still running within a cubic time bound in terms of sentence length. 
An extra task is maintaining probabilities of completed chart entries (and perhaps bounds on probabilities of incomplete entries, for pruning purposes).
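That probabilistic bookkeeping can be sketched by extending a bottom-up chart to record, for each span and category, the probability of the best sub-analysis found so far (a Viterbi parse score). The toy CNF grammar and probabilities below are invented for illustration.

```python
# A sketch of probabilistic chart bookkeeping: best[(i, j, A)] holds the
# probability of the best analysis of category A over the span (i, j).
# Grammar and probabilities are invented toy values.

def best_parse_prob(words, binary_rules, lexical_rules):
    n = len(words)
    best = {}
    for i, w in enumerate(words):                      # lexical chart entries
        for (A, word), p in lexical_rules.items():
            if word == w:
                best[(i, i + 1, A)] = max(best.get((i, i + 1, A), 0.0), p)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for (A, B, C), p in binary_rules.items():
                for k in range(i + 1, j):
                    q = p * best.get((i, k, B), 0.0) * best.get((k, j, C), 0.0)
                    if q > best.get((i, j, A), 0.0):   # keep only the best
                        best[(i, j, A)] = q
    return best.get((0, n, 'S'), 0.0)

binary_rules = {('S', 'NP', 'VP'): 1.0, ('VP', 'V', 'NP'): 0.4,
                ('NP', 'Det', 'N'): 0.5}
lexical_rules = {('NP', 'Thetis'): 0.002, ('V', 'loves'): 0.05,
                 ('Det', 'a'): 0.4, ('N', 'mortal'): 0.002}

print(best_parse_prob(['Thetis', 'loves', 'a', 'mortal'],
                      binary_rules, lexical_rules))
```

Because only the maximum probability per span and category is retained, entries whose probability bound falls below that of a competing analysis can also be pruned, as mentioned above.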

Because of their greater expressiveness, TAGs and CCGs are harder to parse in the worst case (O(n⁶)) than CFGs and projective DGs (O(n³)), at least with current algorithms (see Vijay-Shankar & Weir 1994 for parsing algorithms for TAG, CCG, and LIG based on bottom-up dynamic programming). However, it does not follow that TAG parsing or CCG parsing is impractical for real grammars and real language, and in fact parsers exist for both that are competitive with more common CFG-based parsers.

Finally we mention connectionist models of parsing, which perform syntactic analysis using layered (artificial) neural nets (ANNs, NNs) (see Palmer-Brown et al. 2002; Mayberry and Miikkulainen 2008; and Bengio 2008 for surveys). There is typically a layer of input units (nodes), one or more layers of hidden units, and an output layer, where each layer has (excitatory and inhibitory) connections forward to the next layer, typically conveying evidence for higher-level constituents to that layer. There may also be connections within a hidden layer, implementing cooperation or competition among alternatives. A linguistic entity such as a phoneme, word, or phrase of a particular type may be represented within a layer either by a pattern of activation of units in that layer (a distributed representation) or by a single activated unit (a localist representation).

One of the problems that connectionist models need to confront is that inputs are temporally sequenced, so that in order to combine constituent parts, the network must retain information about recently processed parts. Two possible approaches are the use of simple recurrent networks (SRNs) and, in localist networks, sustained activation. SRNs use one-to-one feedback connections from the hidden layer to special context units aligned with the previous layer (normally the input layer or perhaps a secondary hidden layer), in effect storing their current outputs in those context units. Thus at the next cycle, the hidden units can use their own previous outputs, along with the new inputs from the input layer, to determine their next outputs. In localist models it is common to assume that once a unit (standing for a particular concept) becomes active, it stays active for some length of time, so that multiple concepts corresponding to multiple parts of the same sentence, and their properties, can be simultaneously active. A problem that arises is how the properties of an entity that are active at a given point in time can be properly tied to that entity, and not to other activated entities. (This is the variable binding problem, which has spawned a variety of approaches—see Browne and Sun 1999). One solution is to assume that unit activation consists of pulses emitted at a globally fixed frequency, and pulse trains that are in phase with one another correspond to the same entity (e.g., see Henderson 1994). Much current connectionist research borrows from symbolic processing perspectives, by assuming that parsing assigns linguistic phrase structures to sentences, and treating the choice of a structure as simultaneous satisfaction of symbolic linguistic constraints (or biases). 
Also, more radical forms of hybridization and modularization are being explored, such as interfacing a NN parser to a symbolic stack, or using a neural net to learn the probabilities needed in a statistical parser, or interconnecting the parser network with separate prediction networks and learning networks. For an overview of connectionist sentence processing and some hybrid methods, see Crocker (2010).
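The copy-back mechanism of an SRN can be illustrated with a minimal forward pass in Python. This is only a toy sketch with invented dimensions and untrained random weights, not a working parser: each cycle the hidden units combine the new input with their own previous outputs, held in the context units.

```python
import math
import random

random.seed(0)

def srn_step(x, context, W_in, W_ctx):
    """One cycle of a simple recurrent (Elman-style) network: each hidden
    unit sees the current input AND the previous hidden outputs, which
    were copied one-to-one into the context units."""
    hidden = []
    for j in range(len(W_in[0])):
        s = sum(x[i] * W_in[i][j] for i in range(len(x)))
        s += sum(context[k] * W_ctx[k][j] for k in range(len(context)))
        hidden.append(math.tanh(s))
    return hidden  # becomes the next context vector

# Toy configuration: 3 input units, 2 hidden units, random weights.
W_in = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
W_ctx = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]

context = [0.0, 0.0]  # context units start at rest
for word_vec in ([1, 0, 0], [0, 1, 0], [0, 0, 1]):  # a 3-"word" sequence
    context = srn_step(word_vec, context, W_in, W_ctx)
# context now summarizes the whole input sequence in 2 activations
```

A trained SRN would adjust the weights by backpropagation; here the point is only the feedback loop that lets the hidden layer retain information about recently processed parts.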

If natural language were structurally unambiguous with respect to some comprehensive, effectively parsable grammar, our parsing technology would presumably have attained human-like accuracy some time ago, instead of levelling off at about 90% constituent recognition accuracy. In fact, however, language is ambiguous at all structural levels: at the level of speech sounds (“recognize speech” vs. “wreck a nice beach”); morphology (“un-wrapped” vs. “unwrap-ped”); word category (round as an adjective, noun, verb or adverb); compound word structure (wild goose chase); phrase category (nominal that-clause vs. relative clause in “the idea that he is entertaining”); and modifier (or complement) attachment (“He hit the man with the baguette”). The parenthetical examples here have been chosen so that their ambiguity is readily noticeable, but ambiguities are far more abundant than is intuitively apparent, and the number of alternative analyses of a moderately long sentence can easily run into the thousands.

Naturally, alternative structures lead to alternative meanings, as the above examples show, and so structural disambiguation is essential. The problem is exacerbated by ambiguities in the meanings and discourse function even of syntactically unambiguous words and phrases, as discussed below (section 4). But here we just mention some of the structural preference principles that have been employed to achieve at least partial structural disambiguation. First, some psycholinguistic principles that have been suggested are Right Association (RA) (or Late Closure, LC), Minimal Attachment (MA), and Lexical Preference (LP). The following examples illustrate these principles:

(2.1) (RA) He bought the book that I had selected for Mary. (Note the preference for attaching for Mary to selected rather than bought.)
(2.2) (MA?) She carried the groceries for Mary. (Note the preference for attaching for Mary to carried, rather than groceries, despite RA. The putative MA-effect might actually be an LP-like verb modification preference.)
(2.3) (LP) She describes men who have worked on farms as cowboys. (Note the preference for attaching as cowboys to describes, rather than worked.)

Another preference noted in the literature is for parallel structure in coordination, as illustrated by the following examples:

(2.4) They asked for tea and coffee with sugar. (Note the preference for the grouping [[tea and coffee] with sugar], despite RA.)
(2.5) John decided to buy a novel, and Mary, a biography. (The partially elided conjunct is understood as “Mary decided to buy a biography”.)
(2.6) John submitted short stories to the editor, and poems too. (The partially elided conjunct is understood as “submitted poems to the editor too”.)

Finally, the following example serves to illustrate the significance of frequency effects, though such effects are hard to disentangle from semantic biases for any single sentence (improvements in parsing through the use of word and phrase frequencies provide more compelling evidence):

(2.7) What are the degrees of freedom that an object in space has? (Note the preference for attaching the relative clause to degrees of freedom, rather than freedom, attributable to the tendency of degree(s) of freedom to occur as a “multiword”.)

Language serves to convey meaning. Therefore the analysis of syntactic structure takes us only partway towards mechanizing that central function, and the merits of particular approaches to syntax hinge on their utility in supporting semantic analysis, and in generating language from the meanings to be communicated.

This is not to say that syntactic analysis is of no value in itself—it can provide a useful support in applications such as grammar checking and statistical MT. But for the more ambitious goal of inferring and expressing the meaning of language, an essential requirement is a theory of semantic representation, and how it is related to surface form, and how it interacts with the representation and use of background knowledge. We will discuss logicist approaches, cognitive science approaches, and (more briefly) emerging statistical approaches to meaning representation.

Most linguistic semanticists, cognitive scientists, and anthropologists would agree that in some sense, language is a mirror of mind. But views diverge concerning how literally or non-literally this tenet should be understood. The most literal understanding, which we will term the logicist view, is the one that regards language itself as a logical meaning representation with a compositional, indexical semantics—at least when we have added brackets as determined by parse trees, and perhaps certain other augmentations (variables, lambda-operators, etc.). In itself, such a view makes no commitments about mental representations, but application of Occam's razor and the presumed co-evolution of thought and language then suggest that mentalese is itself language-like. The common objection that “human thinking is not logical” carries no weight with logicists, because logical meaning representations by no means preclude nondeductive modes of inference (induction, abduction, etc.); nor are logicists impressed by the objection that people quickly forget the exact wording of verbally conveyed information, because both canonicalization of inputs and systematic discarding of all but major entailments can account for such forgetting. Also, the assumption of a language-like, logical mentalese certainly does not preclude other modes of representation and thought, such as imagistic ones, and synergistic interaction with such modes (Paivio 1986; Johnston & Williams 2009).

Since Richard Montague (see especially Montague 1970, 1973) deserves much of the credit for demonstrating that language can be logically construed, let us reconsider the sentence structure in figure 1 and the corresponding grammar rules and vocabulary, but this time suppressing features, and instead indicating how logical interpretations expressed in (a variant of) Montague's type-theoretic intensional logic can be obtained compositionally. We slightly “twist” Montague's type system so that the possible-world argument always comes last, rather than first, in the denotation of a symbol or expression. For example, a two-place predicate will be of type (e → (e → (s → t))) (successively applying to an entity, another entity, and finally a possible world to yield a truth value), rather than Montague's type (s → (e → (e → t))), where the world argument is first. This dispenses with numerous applications of Montague's intension (∧) and extension (∨) operators, and also slightly simplifies truth conditions. For simplicity we are also ignoring contextual indices here, and treating nouns and VPs as true or false of individuals, rather than individual concepts (as employed by Montague to account for such sentences as “The temperature is 90 and rising”).

S → NP VP; S′ = NP′(VP′)
VP → V NP; VP′ = (λx NP′(λy V′(y)(x)))
NP → Det N; NP′ = Det′(N′)
NP → Name; NP′ = Name′

Here primed constituents represent the intensional logic translations of the corresponding constituents. (Or we can think of them as metalinguistic expressions standing for the set-theoretic denotations of the corresponding constituents.) Several points should be noted. First, each phrase structure rule is accompanied by a unique semantic rule (articulated as the rule-to-rule hypothesis by Emmon Bach (1976)), where the denotation of each phrase is fully determined by the denotations of its immediate constituents: the semantics is compositional.

Second, in the S′-rule, the subject is assumed to be a second-order predicate that is applied to the denotation of the VP (a monadic predicate) to yield a sentence intension, whereas we would ordinarily think of the subject-predicate semantics as being the other way around, with the VP-denotation being applied to the subject. But Montague's contention was that his treatment was the proper one, because it allows all types of subjects—pronouns, names, and quantified NPs—to be handled uniformly. In other words, an NP always denotes a second-order property, or (roughly speaking) a set of first-order properties (see also Lewis 1970). So for example, Thetis denotes the set of all properties that Thetis (a certain contextually determined individual with that name) has; (more exactly, in the present formulation Thetis denotes a function from properties to sentence intensions, where the intension obtained for a particular property yields truth in worlds where the entity referred to has that property); some woman denotes the union of all properties possessed by at least one woman; and every woman denotes the set of properties shared by all women. Accordingly, the S′-rule yields a sentence intension that is true at a given world just in case the second-order property denoted by the subject maps the property denoted by the VP to such a truth-yielding intension.

Third, in the VP′-rule, variables x and y are assumed to be of type e (they take basic individuals as values), and the denotation of a transitive verb should be thought of as a function that is applied first to the object, and then to the subject (yielding a function from worlds to truth values—a sentence intension). The lambda-abstractions in the VP′-rule can be understood as ensuring that the object NP, which like any NP denotes a second-order property, is correctly applied to an ordinary property (that of being the love-object of a certain x), and the result is a predicate with respect to the (still open) subject position. The following is an interpreted sample vocabulary:

V → loves; V′ = loves
Det → a; Det′ = λP λQ(∃x[P(x) ∧ Q(x)])
(For comparison: Det → every; Det′ = λP λQ(∀x[P(x) ⊃ Q(x)]))
N → mortal; N′ = mortal
Name → Thetis; Name′ = λP(P(Thetis))

Note the interpretation of the indefinite determiner (on line 2) as a generalized quantifier—in effect a second-order predicate over two ordinary properties, where these properties have intersecting truth domains. We could have used an atomic symbol for this second-order predicate, but the above way of expanding it shows the relation of the generalized quantifier to the ordinary existential quantifier. Though it is a fairly self-evident matter, we will indicate in section 4.1 how the sentence “Thetis loves a mortal” yields the following representation after some lambda-conversions:

(∃x [mortal(x) ∧ loves(x)(Thetis)]).

(The English sentence also has a generic or habitual reading, “Thetis loves mortals in general”, which we ignore here.) This interpretation has rather a classical look to it, but only because of the reduction from generalized to ordinary quantifiers that we have built into the lexical semantics of the indefinite a in the above rules, instead of using an atomic symbol for it. Montague was particularly interested in dealing satisfactorily with intensional locutions, such as “John seeks a unicorn.” This does not require the existence of a unicorn for its truth—John has a certain relation to the unicorn-property, rather than to an existing unicorn. Montague therefore treated all predicate arguments as intensions; i.e., he rendered “John seeks a unicorn” as

seeks(λQ ∃x[unicorn(∧x) ∧ Q(∧x)]) (∧john),

which can be reduced to a version where unicorn is extensionalized to unicorn*:

seeks(λQ ∃x[unicorn*(x) ∧ Q(∧x)]) (∧john).

But ultimately Montague's treatment of NPs, though it was in a sense the centerpiece of his proposed conception of language-as-logic, was not widely adopted in computational linguistics. This was in part because that community was not convinced that an omega-order logic was needed for NL semantics, in part because it found the somewhat complex treatment of NPs in various argument positions (and in particular, the treatment of scope ambiguities in terms of multiple syntactic analyses) unattractive, and in part because it was preoccupied with other semantic issues, such as adequately representing events and their relationships, and developing systematic nominal and verb “ontologies” for broad-coverage NL analysis. Nonetheless, the construal of language as logic left a strong imprint on computational semantics, generally steering the field towards compositional approaches, and in some approaches such as CCG, providing a basis for a syntax tightly coupled to a type-theoretic semantics (Bach et al. 1987; Carpenter 1997).
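The compositional machinery sketched above can be mimicked extensionally in Python, treating denotations as functions over a small invented model (individuals, a mortal-set, and a loves-relation are all made up here) and suppressing possible worlds; this is a sketch of the rule-to-rule derivation only, not of the intensional logic.

```python
# Invented finite model (worlds suppressed, so everything is extensional).
D = {"Thetis", "Achilles", "Zeus"}
mortal_set = {"Achilles"}
loves_rel = {("Thetis", "Achilles")}  # (lover, loved)

# Lexical denotations, curried as in the text (object first, then subject):
mortal = lambda x: x in mortal_set
loves = lambda y: lambda x: (x, y) in loves_rel

# NP denotations are second-order predicates (sets of properties):
Thetis = lambda P: P("Thetis")                                  # Name′ = λP P(Thetis)
a = lambda P: lambda Q: any(P(x) and Q(x) for x in D)           # Det′ for "a"
every = lambda P: lambda Q: all((not P(x)) or Q(x) for x in D)  # Det′ for "every"

# VP′-rule: VP′ = λx NP′(λy V′(y)(x));  S′-rule: S′ = NP′(VP′)
VP = lambda x: a(mortal)(lambda y: loves(y)(x))
S = Thetis(VP)
print(S)  # True: "Thetis loves a mortal" holds in this model
```

Evaluating S unfolds exactly to ∃x[mortal(x) ∧ loves(x)(Thetis)], the lambda-converted form given in the text, here checked by exhaustive search over the finite domain.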

An alternative to Montague's syntax-based approach to quantifier scope ambiguity is to regard NPs of form Det+N (or strictly, Det+N-bar) as initially unscoped higher-order predicates in an underspecified logical form, to be subsequently “raised” so as to apply to a first-order predicate obtained by lambda-abstraction of the vacated term position. For example, in the sentence “Everyone knows a poem”, with the object existentially interpreted, we would have the underspecified LF

knows〈a(poem)〉〈every(person)〉

(without reducing determiners to classical quantifiers) and we can now “raise” 〈a(poem)〉 to yield

a(poem)(λy knows(y)〈every(person)〉),

and then “raise” 〈every(person)〉 to yield either

a(poem)(λy every(person)(λx knows(y)(x))),

or

every(person)(λx a(poem)(λy knows(y)(x))).

Thus we obtain a reading according to which there is a poem that everyone knows, and another according to which everyone knows some poem (not necessarily the same one). (More on scope disambiguation will follow in section 4). A systematic version of this approach, known as Cooper storage (see Barwise & Cooper 1981) represents the meaning of phrases in two parts, namely a sequence of NP-interpretations (as higher-order predicates) and the logical matrix from which the NP-interpretations were extracted.

But one can also take a more conventional approach, where first of all, the use of “curried” (Schönfinkel-Church-Curry) functions in the semantics of predication is avoided in favor of relational interpretations, using lexical semantic formulas such as loves′ = λyλx(loves(x, y)), and second, unscoped NP-interpretations are viewed as unscoped restricted quantifiers (Schubert & Pelletier 1982). Thus the unscoped LF above would be knows(〈∃poem〉, 〈∀person〉), and scoping of quantifiers, along with their restrictors, now involves “raising” quantifiers to take scope over a sentential formula, with simultaneous introduction of variables. The two results corresponding to the two alternative scopings are then

(∃y: poem(y))(∀x: person(x))knows(x, y),

and

(∀x: person(x))(∃y: poem(y))knows(x, y).

While this strategy departs from the strict compositionality of Montague Grammar, it achieves results that are often satisfactory for the intended purposes and does so with minimal computational fuss. A related approach to logical form and scope ambiguity enjoying some current popularity is minimal recursion semantics (MRS) (Copestake et al. 2005), which goes even further in fragmenting the meaningful parts of an expression, with the goal of allowing incremental constraint-based assembly of these pieces into unambiguous sentential LFs. Another interesting development is an approach based on continuations, a notion taken from programming language theory (where a continuation is a program execution state as determined by the steps still to be executed after the current instruction). This also allows for a uniform account of the meaning of quantifiers, and provides a handle on such phenomena as “misplaced modifiers”, as in “He had a quick cup of coffee” (Barker 2004).
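Returning to the two scopings of “Everyone knows a poem”, their truth-conditional difference can be verified directly over a small invented model in which each person knows a different poem, so that the wide-scope-existential reading fails while the wide-scope-universal reading holds.

```python
# Invented finite model where the two scopings come apart:
persons = {"p1", "p2"}
poems = {"Iliad", "Odyssey"}
knows = {("p1", "Iliad"), ("p2", "Odyssey")}  # each knows a different poem

# (∃y: poem(y))(∀x: person(x)) knows(x, y) — one poem known by everyone
wide_exists = any(all((x, y) in knows for x in persons) for y in poems)

# (∀x: person(x))(∃y: poem(y)) knows(x, y) — possibly different poems
wide_forall = all(any((x, y) in knows for y in poems) for x in persons)

print(wide_exists, wide_forall)  # False True
```

The nesting of `any` and `all` mirrors the order of the raised restricted quantifiers: whichever quantifier is raised last takes widest scope.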

An important innovation in logical semantics was discourse representation theory (DRT) (Kamp 1981; Heim 1982), aimed at a systematic account of anaphora. In part, the goal was to provide a semantic explanation for (in)accessibility of NPs as referents of anaphoric pronouns, e.g., in contrasting examples such as “John doesn't drive a car; *he owns it,” vs. “John drives a car; he owns it”. More importantly, the goal was to account for the puzzling semantics of sentences involving donkey anaphora, e.g., “If John owns a donkey, he beats it.” Not only is the NP a donkey, the object of the if-clause, accessible as referent of the anaphoric it, contrary to traditional syntactic binding theory (based on the notion of C-command), but furthermore we seem to obtain an interpretation of the type “John beats every donkey that he owns”, which cannot be obtained by “raising” the embedded indefinite a donkey to take scope over the entire sentence. There is also a weaker reading of the type, “If John owns a donkey, he beats a donkey that he owns”, and this reading also is not obtainable via any scope analysis. Kamp and Heim proposed a dynamic process of sentence interpretation in which a discourse representation structure (DRS) is built up incrementally. A DRS consists of a set of discourse referents (variables) and a set of conditions, where these conditions may be simple predications or equations over discourse referents, or certain logical combinations of DRS's (not of conditions). The DRS for the sentence under consideration can be written linearly as

[: [x, y: john(x), donkey(y), owns(x, y)] ⇒ [u, v: he(u), it(v), beats(u, v), u=x, v=y]]

or diagrammed as



Figure 2: DRS for “If John owns a donkey, he beats it”

Here x, y, u, v are discourse referents introduced by John, a donkey, he, and it, and the equations u=x, v=y represent the result of reference resolution for he and it. Discourse referents in the antecedent of a conditional are accessible in the consequent, and discourse referents in embedding DRSs are accessible in the embedded DRSs. Semantically, the most important idea is that discourse referents are evaluated dynamically. We think of a variable assignment as a state, and this state changes as we evaluate a DRS outside-to-inside, left-to-right. For example (simplifying a bit), the conditional DRS in figure 2 is true (in a given model) if every assignment with domain {x, y} that makes the antecedent true can be extended to an assignment (new state) with domain {x, y, u, v} that makes the consequent true.

On the face of it, DRT is noncompositional (though DRS construction rules are systematically associated with phrase structure rules); but it can be recast in compositional form, still of course with a dynamic semantics. A closely related approach, dynamic predicate logic (DPL), retains the classical quantificational syntax, but in effect treats existential quantification as nondeterministic assignment, and provides an overtly compositional alternative to DRT (Groenendijk & Stokhof 1991). Perhaps surprisingly, the impact of DRT on practical computational linguistics has been quite limited, though it certainly has been and continues to be actively employed in various projects. One reason may be that donkey anaphora rarely occurs in the text corpora most intensively investigated by computational linguists so far (though it is arguably pervasive and extremely important in generic sentences and generic passages, including those found in lexicons or sources such as Open Mind Common Sense—see sections 4.3 and 8.3). Another reason is that reference resolution for non-donkey pronouns (and definite NPs) is readily handled by techniques such as Skolemization of existentials, so that subsequently occurring anaphors can be identified with the Skolem constants introduced earlier. Indeed, it turns out that both explicit and implicit variants of Skolemization, including functional Skolemization, are possible even for donkey anaphora (e.g., in sentences such as “If every man has a gun, many will use it”—see Schubert 2007). Finally, another reason for the limited impact of DRT and other dynamic semantic theories may be precisely that they are dynamic: The evaluation of a formula in general requires its preceding and embedding context, and this interferes with the kind of knowledge modularity (the ability to use any given knowledge item in a variety of different contexts) desirable for inference purposes.
Here it should be noted that straightforward translation procedures from DRT, DPL, and other dynamic theories to static logics exist (e.g., to FOL, for nonintensional versions of the dynamic approaches), but if such a conversion is desirable for practical purposes, then the question arises whether starting with a dynamic representation is at all advantageous.
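The dynamic truth conditions for the donkey conditional can be sketched by brute force over a small invented model (the individuals, ownership, and beating facts below are all made up): every assignment over {x, y} verifying the antecedent must extend to an assignment over {x, y, u, v} verifying the consequent, which yields the “beats every donkey he owns” reading.

```python
from itertools import product

# Invented model for "If John owns a donkey, he beats it":
D = {"john", "d1", "d2"}
john = {"john"}
donkey = {"d1", "d2"}
owns = {("john", "d1"), ("john", "d2")}
beats = {("john", "d1"), ("john", "d2")}

def antecedent(g):  # [x, y: john(x), donkey(y), owns(x, y)]
    return g["x"] in john and g["y"] in donkey and (g["x"], g["y"]) in owns

def consequent(g):  # [u, v: beats(u, v), u=x, v=y] (after reference resolution)
    return g["u"] == g["x"] and g["v"] == g["y"] and (g["u"], g["v"]) in beats

# Dynamic truth: each antecedent-verifying state extends to a
# consequent-verifying state.
ok = all(
    any(consequent({**g, "u": u, "v": v}) for u, v in product(D, D))
    for g in ({"x": x, "y": y} for x, y in product(D, D))
    if antecedent(g)
)
print(ok)  # True: John beats every donkey he owns in this model
```

If one beating fact were removed from the model, some antecedent-verifying assignment would fail to extend, and the conditional DRS would be false, as the universal reading requires.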

A long-standing issue in linguistic semantics has been the theoretical status of thematic roles in the argument structure of verbs and other argument-taking elements of language (e.g., Dowty 1991). The syntactically marked cases found in many languages correspond intuitively to such thematic roles as agent, theme, patient, instrument, recipient, goal, and so on, and in English, too, the sentence subject and object typically correspond respectively to the agent and theme or patient of an action, and other roles may be added as an indirect object or more often as prepositional phrase complements and adjuncts. To give formal expression to these intuitions, many computational linguists decompose verbal (and other) predicates derived from language into a core predicate augmented with explicit binary relations representing thematic roles. For example, the sentence

(3.1) John kicked the ball to the fence

might be represented (after referent determination) as

∃e(kick(e) ∧ before(e, Now1) ∧ agent(e, John) ∧ theme(e, Ball2) ∧ goal-loc(e, Fence3)),

where e is thought of as the kicking event. Such a representation is called neo-Davidsonian, acknowledging Donald Davidson's advocacy of the view that verbs tacitly introduce existentially quantified events (Davidson 1967a). The prefix neo- indicates that all arguments and adjuncts are represented in terms of thematic roles, which was not part of Davidson's proposal but is developed, for example, in (Parsons 1990). (Parsons attributes the idea of thematic roles to the 4th century BCE Sanskrit grammarian Pāṇini.) One advantage of this style of representation is that it absolves the writer of the interpretive rules from the vexing task of distinguishing verb complements, to be incorporated into the argument structure of the verb, from adjuncts, to be used to add modifying information. For example, it is unclear in (3.1) whether to the fence should be treated as supplying an argument of kick, or whether it merely modifies the action of John kicking the ball. Perhaps most linguists would judge the latter answer to be correct (because an object can be kicked without the intent of propelling it to a goal location), but intuitions are apt to be ambivalent for at least one of a set of verbs such as dribble, kick, maneuver, move and transport.

However, thematic roles also introduce new difficulties. As pointed out by Dowty (1991), thematic roles lack well-defined semantics. For example, while (3.1) clearly involves an animate agent acting causally upon a physical object, and the PP evidently supplies a goal location, it is much less clear what the roles should be in (web-derived) sentences such as (3.2–3.4), and what semantic content they would carry:

(3.2) The surf tossed the loosened stones against our feet.
(3.3) A large truck in front of him blocked his view of the traffic light.
(3.4) Police used a sniffer dog to smell the suspect's luggage.

As well, the uniform treatment of complements and adjuncts in terms of thematic relations does not absolve the computational linguist from the task of identifying the subcategorized constituents of verb phrases (and similarly, NPs and APs), so as to guide syntactic and semantic expectations in parsing and interpretation. And these subcategorized constituents correspond closely to the complements of the verb, as distinct from any adjuncts. Nevertheless, thematic role representations are widely used, in part because they mesh well with frame-based knowledge representations for domain knowledge. These are representations that characterize a concept in terms of its type (relating this to supertypes and subtypes in an inheritance hierarchy), and a set of slots (also called attributes or roles) and corresponding values, with type constraints on values. For example, in a purchasing domain, we might have a purchase predicate, perhaps with supertype acquire, subtypes like purchase-in-installments, purchase-on-credit, or purchase-with-cash, and attributes with typed values such as (buyer (a person-or-group)), (seller (a person-or-group)), (item (a thing-or-service)), (price (a monetary-amount)), and perhaps time, place, and other attributes. Thematic roles associated with relevant senses of verbs and nouns such as buy, sell, purchase, acquire, acquisition, take-over, pick up, invest in, splurge on, etc., can easily be mapped to standard slots like those above. This leads into the issue of canonicalization, which we briefly discuss below under a separate heading.
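The flat, conjunctive character of neo-Davidsonian representations lends itself to a database-like encoding; the sketch below (with invented constants such as e1, following (3.1)) shows the advantage mentioned above: adjuncts become additional role tuples rather than extra arguments of the verb.

```python
# Neo-Davidsonian facts for (3.1), as predicate tuples (constants invented):
facts = {
    ("kick", "e1"),
    ("before", "e1", "Now1"),
    ("agent", "e1", "John"),
    ("theme", "e1", "Ball2"),
    ("goal-loc", "e1", "Fence3"),
}

# An adjunct is just one more conjunct; the verb's arity is unchanged:
facts.add(("manner", "e1", "forcefully"))

def role(event, r):
    """Return the filler of thematic role r for an event, or None."""
    return next((t[2] for t in facts if t[0] == r and t[1] == event), None)

print(role("e1", "agent"), role("e1", "goal-loc"))  # John Fence3
```

Mapping these role tuples into the slots of a frame (e.g., a kick frame with agent and theme attributes) is then a matter of matching role names to slot names, as suggested for the purchasing example.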

A more consequential issue in computational semantics has been the expressivity of the semantic representation employed, with respect to phenomena such as event and temporal reference, nonstandard quantifiers such as most, plurals, modification, modality and other forms of intensionality, and reification. Full discussion of these phenomena would be out of place here, but some commentary on each is warranted, since the process of semantic interpretation and understanding (as well as generation) clearly depends on the expressive devices available in the semantic representation.

Event and situation reference are essential in view of the fact that many sentences seem to describe events or situations, and to qualify and refer to them. For example, in the sentences

(3.5) Molly barked last night for several minutes. This woke up the neighbors.

the barking event is in effect predicated to have occurred last night and to have lasted for several minutes, and the demonstrative pronoun this evidently refers directly to it; in addition the past tense places the event at some point prior to the time of speech (and would do so even without the temporal adverbials). These temporal and causal relations are readily handled within the Davidsonian (or neo-Davidsonian) framework mentioned above:

(3.5′) bark(Molly, E) ∧ last-night(E, S) ∧ before(E, S) ∧ duration(E)=minutes(N) ∧ several(N). cause-to-wake-up(E, Neighbors, E′) ∧ before(E′, S).

However, examples (3.6) and (3.7) suggest that events can be introduced by negated or quantified formulas, as was originally proposed by Reichenbach (1947):

(3.6) No rain fell for a month, and this caused widespread crop failures.
(3.7) Each superpower imperiled the other with its nuclear arsenal. This situation persisted for decades.

Barwise and Perry (1983) reconceptualized this idea in their Situation Semantics, though this lacks the tight coupling between sentences and events that is arguably needed to capture causal relations expressed in language. Schubert (2000) proposes a solution to this problem in an extension of FOL incorporating an operator that connects situations or events with sentences characterizing them.

Concerning nonstandard quantifiers such as most, we have already sketched the generalized quantifier approach of Montague Grammar, and pointed out the alternative of using restricted quantifiers; an example might be (Most x: dog(x))friendly(x). Instead of viewing most as a second-order predicate, we can specify its semantics by analogy with classical quantifiers: The sample formula is true (under a given interpretation) just in case a majority of individuals satisfying dog(x) (when used as value of x) also satisfy friendly(x). Quantifying determiners such as few, many, much, almost all, etc., can be treated similarly, though ultimately the problem of vagueness needs to be addressed as well (which of course extends beyond quantifiers to predicates and indeed all aspects of a formal semantic representation). Vague quantifiers, rather than setting rigid quantitative bounds, seem instead to convey probabilistic information, as if a somewhat unreliable measuring instrument had been applied in formulating the quantified claim, and the recipient of the information needs to take this unreliability into account in updating beliefs. Apart from their vagueness, the quantifiers under discussion are not first-order definable (e.g., Landman 1991), so that they cannot be completely axiomatized in FOL. But this does not prevent practical reasoning, either by direct use of such quantifiers in the logical representations of sentences (an approach in the spirit of natural logic), or by reducing them to set-theoretic or mereological relations within an FOL framework.
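Although most is not first-order definable in general, over a finite model the restricted-quantifier semantics just given is directly computable: count the restrictor's instances and check that a majority satisfy the body. A minimal sketch (the dogs and their dispositions are invented):

```python
def most(restrictor, body, domain):
    """(Most x: restrictor(x)) body(x), evaluated over a finite domain:
    true iff a strict majority of restrictor-instances satisfy body."""
    instances = [x for x in domain if restrictor(x)]
    return sum(1 for x in instances if body(x)) * 2 > len(instances)

D = {"rex", "fido", "spot"}
dog = lambda x: True                       # all three are dogs
friendly = lambda x: x in {"rex", "fido"}  # two of the three are friendly

print(most(dog, friendly, D))  # True: (Most x: dog(x)) friendly(x)
```

The vagueness issue noted above would replace the rigid majority threshold with something probabilistic; the sketch fixes a strict-majority reading purely for definiteness.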

Plurals, as for instance in

(3.8) People gathered in the town square,

present a problem in that the argument of a predicate can be an entity comprised of multiple basic individuals (those we ordinarily quantify over, and ascribe properties to). Most approaches to this problem employ a plural operator, say, plur, allowing us to map a singular predicate P into a plural predicate plur(P), applicable to collective entities. These collective entities are usually assumed to form a join semilattice with atomic elements (singular entities) that are ordinary individuals (e.g., Scha 1981; Link 1983; Landman 1989, 2000). When an overlap relation is assumed, and when all elements of the semilattice are assumed to have a supremum (completeness), the result is a complete Boolean algebra except for lack of a bottom element (because there is no null entity that is a part of all others). One theoretical issue is the relationship of the semilattice of plural entities to the semilattice of material parts of which entities are constituted. Though there are differences in theoretical details (e.g., Link 1983; Bunt 1985), it is agreed that these semilattices should be aligned in this sense: When we take the join of material parts of which several singular or plural entities are constituted, we should obtain the material parts of the join of those singular or plural entities. Note that while some verbal predicates, such as (intransitive) gather, are applicable only to collections, others, such as ate a pizza, are variously applicable to individuals or collections. Consequently, a sentence such as

(3.9) The children ate a pizza,

allows for both a collective reading, where the children as a group ate a single pizza, and a distributive reading, where each of the children ate a pizza (presumably a different one!). One way of dealing with such ambiguities in practice is to treat plural NPs as ambiguous between a collection-denoting reading and an “each member of the collection” reading. For example, the children in (3.9) would be treated as ambiguous between the collection of children (which is the basic sense of the phrase) and each of the children. This entails that a reading of type each of the people should also be available in (3.8) — but we can assume that this is ruled out because (intransitive) gather requires a collective argument. In a sentence such as

(3.10) Two poachers caught three aracaris,

we then obtain four readings, based on the two interpretations of each NP. No readings are ruled out, because both catching and being caught can be individual or collective occurrences. Some theorists would posit additional readings, but if these exist, they could be regarded as derivative from readings in which at least one of the terms is collectively interpreted. But what is uncontroversial is that plurals call for an enrichment in the semantic representation language to allow for collections as arguments. In an expression such as plur(child), both the plur operator, which transforms a predicate into another predicate, and the resulting collective predicate, are of nonstandard types.
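One simple way to sketch the plur operator is to model collective entities as frozensets of individuals (a deliberate simplification of the semilattice structure; the individuals and eating facts are invented). The sketch also shows the collective/distributive contrast for (3.9):

```python
def plur(P):
    """plur maps a singular predicate P to a predicate of collections:
    true of a multi-member collection all of whose members satisfy P.
    (A simplification: real accounts use a join semilattice.)"""
    return lambda c: (isinstance(c, frozenset) and len(c) > 1
                      and all(P(x) for x in c))

child = lambda x: x in {"amy", "ben"}
children = frozenset({"amy", "ben"})

# "ate a pizza" may hold of individuals (distributive) or of the
# collection itself (collective):
ate_pizza_individual = {"amy", "ben"}  # each ate a (different) pizza
ate_pizza_collective = {children}      # the group shared one pizza

distributive = all(x in ate_pizza_individual for x in children)
collective = children in ate_pizza_collective

print(plur(child)(children), distributive, collective)  # True True True
```

Note that plur and the collective predicates it yields are of nonstandard types, exactly as the text observes: they apply to collections, not to the basic individuals over which the singular predicates are defined.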

Modification is a pervasive phenomenon in all languages, as illustrated in the following sentences:

(3.11) Mary is very smart. (3.12) Mary is an international celebrity. (3.13) The rebellion failed utterly.

In (3.11), very functions as a predicate modifier, in particular a subsective modifier, since the set of things that are very(P) is a subset of the things that are P. Do we need such modifiers in our logical forms? We could avoid use of a modifier in this case by supposing that smart has a tacit argument for the degree of smartness, where smart(x, d) means that x is smart to degree d; adding that d > T for some threshold T would signify that x is very smart. Other degree adjectives could be handled similarly. However, such a strategy is unavailable for international celebrity in (3.12). International is again subsective (and not intersective—an international celebrity is not something that is both international and a celebrity), and while one can imagine definitions of the particular combination, international celebrity, in an ordinary FOL framework, requiring such definitions to be available for constructing initial logical forms could create formidable barriers to broad-coverage interpretation. (3.13) illustrates a third type of predicate modification, namely VP-modification by an adverb. Note that the modifier cannot plausibly be treated as an implicit predication utter(E) about a Davidsonian event argument of fail. Taken together, the examples indicate the desirability of allowing for monadic-predicate modifiers in a semantic representation. Corroborative evidence is provided in the immediately following discussion.
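The degree-argument analysis of very mentioned for (3.11) can be sketched directly (a toy model of ours, with invented degree values and threshold): a gradable adjective takes a degree argument, and very raises the threshold of application, which makes its subsective character explicit.

```python
# Toy degree semantics for "very": smart(x, d) means x is smart to degree d,
# and very(P) holds of x iff P holds of x at a raised threshold T + boost.
degrees = {"mary": 0.9, "sam": 0.6}   # invented degrees in [0, 1]

def smart(x, d):
    """x is smart to degree at least d."""
    return degrees.get(x, 0) >= d

T = 0.5  # contextual threshold for plain "smart"

def very(adj, boost=0.3):
    """Subsective modifier: very(P) entails P, since its threshold is higher."""
    return lambda x: adj(x, T + boost)

plain_smart = lambda x: smart(x, T)
print(plain_smart("mary"), plain_smart("sam"))  # True True
print(very(smart)("mary"), very(smart)("sam"))  # True False
```

Since the raised threshold is at least T, everything that is very(smart) is also smart, exactly the subsective property; and as the text notes, no comparable trick is available for international celebrity.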

Intensionality has already been mentioned in connection with Montague Grammar, and there can be no doubt that a semantic representation for natural language needs to capture intensionality in some way. The sentences

(3.14) John believes that our universe is infinite. (3.15) John looked happy. (3.16) John designed a starship. (3.17) John wore a fake beard.

all involve intensionality. The meaning (and thereby the truth value) of the attitudinal sentence (3.14) depends on the meaning (intension) of the subordinate clause, not just its truth value (extension). The meaning of (3.15) depends on the meaning of happy, but does not require happy to be a property of John or anything else. The meaning of (3.16) does not depend on the actual existence of a starship, but does depend on the meaning of that phrase. And fake beard in (3.17) refers to something other than an actual beard, though its meaning naturally depends on the meaning of beard. A Montagovian analysis certainly would deal handily with such sentences. But again, we may ask how much of the expressive richness of Montague's type theory is really essential for computational linguistics. To begin with, sentences such as (3.14) are expressible in classical modal logics, without committing to higher types. On the other hand (3.16) resists a classical modal analysis, even more firmly than Montague's “John seeks a unicorn,” for which an approximate classical paraphrase is possible: “John tries (for him) to find a unicorn”. A modest concession to Montague, sufficient to handle (3.15)–(3.17), is to admit intensional predicate modifiers into our representational vocabulary. We can then treat look as a predicate modifier, so that look(happy) is a new predicate derived from the meaning of happy. Similarly we can treat design as a predicate modifier, if we are willing to treat a starship as a predicative phrase, as we would in “The Enterprise is a starship”. And finally, fake is quite naturally viewed as a predicate modifier, though unlike most nominal modifiers, it is not intersective (#John wore something that was a beard and was fake) or even subsective (#John wore a particular kind of beard). 
Note that this form of intensionality does not commit us to a higher-order logic—we are not quantifying over predicate extensions or intensions so far, only over individuals (aside from the need to allow for plural entities, as noted). The rather compelling case for intensional predicate modifiers in our semantic vocabulary reinforces the case made above (on the basis of extensional examples) for allowing predicate modification.

Reification, like the phenomena already enumerated, is also pervasive in natural languages. Examples are seen in the following sentences.

(3.18) Humankind may be on a path to self-destruction. (3.19) Snow is white. (3.20) Politeness is a virtue. (3.21) Driving recklessly is dangerous. (3.22) For John to sulk is unusual. (3.23) That our universe is infinite is a discredited notion.

(3.18)–(3.21) are all examples of predicate reification. Humankind in (3.18) may be regarded as the name of an abstract kind derived from the nominal predicate human, i.e., with lexical meaning K(human), where K maps predicate intensions to individuals. The status of abstract kinds as individuals is evidenced by the fact that the predicate “be on a path to self-destruction” applies as readily to ordinary individuals as to kinds. The name-like character of the term is apparent from the fact that it cannot readily be premodified by an adjective. The subjects in (3.19) and (3.20) can be similarly analyzed in terms of kinds K(snow) and K(-ness(polite)). (Here -ness is a predicate modifier that transforms the predicate polite, which applies to ordinary (usually human) individuals, into a predicate over quantities of the abstract stuff, politeness.) But in these cases the K operator does not originate in the lexicon, but in a rule pair of type “NP → N, NP′ = K(N′)”. This allows for modification of the nominal predicate before reification, in phrases such as fluffy snow or excessive politeness. The subject of (3.21) might be rendered logically as something like Ka(-ly(reckless)(drive)), where Ka reifies action-predicates, and -ly transforms a monadic predicate intension into a subsective predicate modifier. Finally (3.22) illustrates a type of sentential-meaning reification, again yielding a kind; but in this case it is a kind of situation—the kind whose instances are characterized by John sulking. Here we can posit a reification operator Ke that maps sentence intensions into kinds of situations. This type of sentential reification needs to be distinguished from that-clause reification, such as appears to be involved in (3.14). 
We mentioned the possibility of a modal-logic analysis of (3.14), but a predicative analysis, where the predicate applies to a reified sentence intension (a proposition), is actually more plausible, since it allows a uniform treatment of that-clauses in contexts like (3.14) and (3.23). The use of reification operators is a departure from a strict Montagovian approach, but is plausible if we seek to limit the expressiveness of our semantic representation by taking predicates to be true or false of individuals, rather than of objects of arbitrarily high types, and likewise take quantification to be over individuals in all cases, i.e., to be first-order.

Some computational linguists and AI researchers wish to go much further in avoiding expressive devices outside those of standard first-order logic. One strategy that can be used to deal with intensionality within FOL is to functionalize all predicates, save one or two. For example, we can treat predications, such as that Romeo loves Juliet, as values of functions that “hold” at particular times: Holds(loves(Romeo, Juliet), t). Here loves is regarded as a function that yields a reified property, while Holds (or in some proposals, True), and perhaps equality, are the only predicates in the representation language. Then we can formalize (3.14), for example, without recourse to intensional semantics as

Holds(believes(John, infinite(Universe)), t)

(where t is some specific time). Humankind in (3.18) can perhaps be represented as the set of all humans as a function of time:

∀x∀t[Holds(member(x, Humankind), t) ↔ Holds(human(x), t)],

(presupposing some axiomatization of naïve set theory); and, as one more example, (3.22) might be rendered as

Holds(unusual(sulk(John)), t)

(for some specific time t). However, a difficulty with this strategy is encountered for quantification within intensional contexts, as in the sentence “John believes that every galaxy harbors some life-form.” While we can represent the (implausible) wide-scope reading “For every galaxy there is some life-form such that John believes that the galaxy harbors that life-form,” using the Holds strategy, we cannot readily represent the natural narrow-scope reading because FOL disallows variable-binding operators within functional terms (but see McCarthy 1990). An entirely different approach is to introduce “eventuality” arguments into all predicates, and to regard a predication as providing a fact about the actual world only if the eventuality corresponding to that predication has been asserted to “occur” (Hobbs 2003). The main practical impetus behind such approaches is to be able to exploit existing FOL inference techniques and technology. However, there is at present no reason to believe that any inferences that are easy in FOL are difficult in a meaning representation more nearly aligned with the structure of natural language; on the contrary, recent work in implementing natural logic (MacCartney & Manning 2009) suggests that a large class of obvious inferences can be most readily implemented in syntactically analyzed natural language (modulo some adjustments)—a framework closer to Montagovian semantics than an FOL-based approach.
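The functionalization strategy can be sketched concretely (our own toy encoding, with invented names): predications like loves(Romeo, Juliet) are built as inert terms, and Holds is the single genuine predicate, here modeled as membership in a set of (term, time) facts.

```python
# Toy model of the "functionalize all predicates" strategy: reified
# predications are nested tuples, and Holds is the only real predicate.

def term(functor, *args):
    """Build a reified predication as an inert (hashable) term."""
    return (functor,) + args

facts = set()

def assert_holds(t, time):
    facts.add((t, time))

def holds(t, time):
    return (t, time) in facts

loves = term("loves", "Romeo", "Juliet")
assert_holds(loves, "t1")

# (3.14): Holds(believes(John, infinite(Universe)), t)
believes = term("believes", "John", term("infinite", "Universe"))
assert_holds(believes, "t1")

print(holds(loves, "t1"))                             # True
print(holds(term("loves", "Juliet", "Romeo"), "t1"))  # False
```

Note that the troublesome case discussed next, quantification inside an intensional context, cannot be expressed this way: a variable bound outside the term cannot be bound inside it.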

Another important issue has been canonicalization (or normalization): What transformations should be applied to initial logical forms in order to minimize difficulties in making use of linguistically derived information? The uses that should be facilitated by the choice of canonical representation include the interpretation of further texts in the context of previously interpreted text (and general knowledge), as well as inferential question answering and other inference tasks.

We can distinguish two types of canonicalization: logical normalization and conceptual canonicalization. An example of logical normalization in sentential logic and FOL is the conversion to clause form (Skolemized, quantifier-free conjunctive normal form). The rationale is that reducing multiple logically equivalent formulas to a single form reduces the combinatorial complexity of inference. However, full normalization may not be possible in an intensional logic with a “fine-grained” semantics, where for instance a belief that the Earth is round may differ semantically from the belief that the Earth is round and the Moon is either flat or not flat, despite the logical equivalence of those beliefs.
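The conversion to clause form can be seen in a standard worked example (ours, not drawn from the text):

```latex
% Standard worked example of conversion to clause form.
% Start:
\forall x\,[P(x) \rightarrow \exists y\,[R(x,y) \wedge Q(y)]]
% Eliminate the conditional, Skolemize y as f(x), drop quantifiers:
\neg P(x) \vee (R(x,f(x)) \wedge Q(f(x)))
% Distribute \vee over \wedge, yielding two clauses:
\{\;\neg P(x) \vee R(x,f(x)),\quad \neg P(x) \vee Q(f(x))\;\}
```

All formulas logically equivalent to the original reduce to the same clause set (up to variable renaming), which is what makes the normalization useful for inference.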

Conceptual canonicalization involves more radical changes: We replace the surface predicates (and perhaps other elements of the representational vocabulary) with canonical terms from a smaller repertoire, and/or decompose them using thematic roles or frame slots. For example, in a geographic domain, we might replace the relations (between countries) is next to, is adjacent to, borders on, is a neighbor of, shares a border with, etc., with a single canonical relation, say borders-on. In the domain of physical, communicative, and mental events, we might go further and decompose predicates into configurations of primitive predicates. For example, we might express “x walks” in the manner of Schank as

∃e, e′(ptrans(e, x, x) ∧ move(e′, x, feet-of(x)) ∧ by-means-of(e′, e)),

where ptrans(e, x, y) is a primitive predicate expressing that event e is a physical transport by agent x of object y, move expresses bodily motion by an agent, and by-means-of expresses the instrumental-action relation between the move event and the ptrans event. As discussed earlier, these multi-argument predicates might be further decomposed, with ptrans(e, x, y) rewritten as ptrans(e) ∧ agent(e, x) ∧ theme(e, y), and so on. As in the case of logical normalization, conceptual canonicalization is intended to simplify inference, and to minimize the need for the axioms on which inference is based.
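Both steps of conceptual canonicalization, collapsing surface relations to a canonical one and decomposing predicates into thematic-role predications, can be sketched as simple table-driven rewriting (the tables below are our own illustrations):

```python
# Toy conceptual canonicalization: map surface relations to a canonical
# relation, and decompose pred(e, x, y) into pred(e) ∧ agent(e, x) ∧ theme(e, y).

CANONICAL = {
    "is next to": "borders-on",
    "is adjacent to": "borders-on",
    "borders on": "borders-on",
    "is a neighbor of": "borders-on",
    "shares a border with": "borders-on",
}

def canonicalize(relation, a, b):
    """Replace a surface relation with its canonical counterpart, if any."""
    return (CANONICAL.get(relation, relation), a, b)

def decompose(pred, event, agent, theme):
    """Rewrite a multi-argument predication using thematic roles."""
    return [(pred, event), ("agent", event, agent), ("theme", event, theme)]

print(canonicalize("is adjacent to", "France", "Spain"))
# ('borders-on', 'France', 'Spain')
print(decompose("ptrans", "e1", "x", "x"))
# [('ptrans', 'e1'), ('agent', 'e1', 'x'), ('theme', 'e1', 'x')]
```

The inference payoff is that one axiom about borders-on now covers five surface relations; the cost, as discussed below, is that meaning differences among those relations are erased.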

A question raised by canonicalization, especially by the stronger versions involving reduction to primitives, is whether significant meaning is lost in this process. For example, the concept of being neighboring countries, unlike mere adjacency, suggests the idea of side-by-side existence of the populations of the countries, in a way that resembles the side-by-side existence of neighbors in a local community. More starkly, reducing the notion of walking to transporting oneself by moving one's feet fails to distinguish walking from running, hopping, skating, and perhaps even bicycling. Therefore it may be preferable to regard conceptual canonicalization as inference of important entailments, rather than as replacement of superficial logical forms by equivalent ones in a more restricted vocabulary. Another argument for this preference is computational: If we decompose complex actions, such as dining at a restaurant, into constellations of primitive predications, we will need to match the many primitive parts of such constellations even in answering simple questions such as “Did John dine at a restaurant?”. We will comment further on primitives in the context of the following subsection.

While many AI researchers have been interested in semantic representation and inference as practical means for achieving linguistic and inferential competence in machines, others have approached these issues from the perspective of modeling human cognition. Prior to the 1980s, computational modeling of NLP, and of cognition more broadly, was pursued almost exclusively within a representationalist paradigm, i.e., one that regarded all intelligent behavior as reducible to symbol manipulation (Newell and Simon's physical symbol systems hypothesis). In the 1980s, connectionist (or neural) models enjoyed a resurgence, and came to be seen by many as rivalling representationalist approaches. We briefly summarize these developments under two subheadings below.

“A physical symbol system has the necessary and sufficient means for general intelligent action.” –Allen Newell and Herbert Simon (1976: 116)

Some of the cognitively motivated researchers working within a representationalist paradigm have been particularly concerned with cognitive architecture, including the associative linkages between concepts, distinctions between types of memories and types of representations (e.g., episodic vs. semantic memory, short-term vs. long-term memory, declarative vs. procedural knowledge), and the observable processing consequences of such architectures, such as sense disambiguation, similarity judgments, and cognitive load as reflected in processing delays. Others have been more concerned with uncovering the actual internal conceptual vocabulary and inference rules that seem to underlie language and thought. M. Ross Quillian's semantic memory model, and models developed by Rumelhart, Norman and Lindsay (Rumelhart et al. 1972; Norman et al. 1975) and by Anderson and Bower (1973) are representative of the former perspective, while Schank and his collaborators (Schank and Colby 1973; Schank and Abelson 1977; Schank and Riesbeck 1981; Dyer 1983) are representative of the latter. A common thread in cognitively motivated theorizing about semantic representation has been the use of graphical semantic memory models, intended to capture direct relations as well as more indirect associations between concepts, as illustrated in Figure 3:



Figure 3

This particular example is loosely based on Quillian (1968). Quillian suggested that one of the functions of semantic memory, conceived in this graphical way, was to enable word sense disambiguation through spreading activation. For example, processing of the sentence, “He watered the plants”, would involve activation of the terms water and plant, and this activation would spread to concepts immediately associated with (i.e., directly linked to) those terms, and in turn to the neighbors of those concepts, and so on. The preferred senses of the initially activated terms would be those that led to early “intersection” of activation signals originating from different terms. In particular, the activation signals propagating from sense 1 (the living-plant sense) of plant would reach the concept for the stuff, water, in four steps (along the pathways corresponding to the information that plants may get food from water), and the same concept would be reached in two steps from the term water, used as a verb, whose semantic representation would express the idea of supplying water to some target object. Though the sense of plant as a manufacturing apparatus would probably lead eventually to the water concept as well, the corresponding activation path would be longer, and so the living-plant sense of plant would “win”.
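Spreading activation amounts to a breadth-first search from each activated term, with the preferred sense determined by the earliest meeting point. Here is a toy sketch over a hand-built graph (ours, loosely echoing the water/plant example; path lengths differ from Quillian's):

```python
# Quillian-style spreading activation for sense disambiguation over a tiny
# directed concept graph; activation meeting earliest selects the sense.
from collections import deque

GRAPH = {
    "water(verb)": ["supply", "water(stuff)"],
    "supply": ["water(stuff)"],
    "plant(living)": ["food", "grow"],
    "food": ["water(stuff)"],          # plants may get food from water
    "plant(factory)": ["building", "machinery"],
    "building": ["materials"],
    "machinery": ["metal"],
    "water(stuff)": [], "grow": [], "materials": [], "metal": [],
}

def distances(source):
    """Breadth-first spread of activation from a source node."""
    dist, queue = {source: 0}, deque([source])
    while queue:
        node = queue.popleft()
        for nbr in GRAPH.get(node, []):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def intersection_cost(a, b):
    """Length of the earliest intersection of activation from a and b."""
    da, db = distances(a), distances(b)
    return min((da[n] + db[n] for n in set(da) & set(db)), default=None)

# The living-plant sense meets water(verb) sooner than the factory sense:
print(intersection_cost("water(verb)", "plant(living)"))   # 3
print(intersection_cost("water(verb)", "plant(factory)"))  # None
```

In a fuller graph the factory sense would eventually reach the water concept too, just along a longer path, so the living-plant sense would still “win”.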

Such conceptual representations have tended to differ from logical ones in several respects. One, as already discussed, has been the emphasis by Schank and various other researchers (e.g., Wilks 1978; Jackendoff 1990) on “deep” (canonical) representations and primitives. An often cited psychological argument for primitives is the fact that people rather quickly forget the exact wording of what they read or are told, recalling only the “gist”; it is this gist that primitive decomposition is intended to derive. However, this involves a questionable assumption that subtle distinctions between, say, walking to the park, ambling to the park, or traipsing to the park are simply ignored in the interpretive process, and as noted earlier it neglects the possibility that seemingly insignificant semantic details are pruned from memory after a short time, while major entailments are retained for a longer time.

Another common strain in much of the theorizing about conceptual representation has been a certain diffidence concerning logical representations and denotational semantics. The relevant semantics of language is said to be the transduction from linguistic utterances to internal representations, and the relevant semantics of the internal representations is said to be the way they are deployed in understanding and thought. For both the external language and the internal (mentalese) representation, it is said to be irrelevant whether or not the semantic framework provides formal truth conditions for them. The rejection of logical semantics has sometimes been summarized in the dictum that one cannot compute with possible worlds.

However, it seems that any perceived conflict between conceptual semantics and logical semantics can be resolved by noting that these two brands of semantics are quite different enterprises with quite different purposes. Certainly it is entirely appropriate for conceptual semantics to focus on the mapping from language to symbolic structures (in the head, realized ultimately in terms of neural assemblies or circuits of some sort), and on the functioning of these structures in understanding and thought. But logical semantics, as well, has a legitimate role to play, both in considering how words (and larger linguistic expressions) relate to the world and how the symbols and expressions of the internal semantic representation relate to the world. This role is metatheoretic in that the goal is not to posit cognitive entities that can be computationally manipulated, but rather to provide a framework for theorizing about the relationship between the symbols people use, externally in language and internally in their thinking, and the world in which they live. It is surely undeniable that utterances are at least sometimes intended to be understood as claims about things, properties, and relationships in the world, and as such are at least sometimes true or false. It would be hard to understand how language and thought could have evolved as useful means for coping with the world, if they were incapable of capturing truths about it.

Moreover, logical semantics shows how certain syntactic manipulations lead from truths to truths regardless of the specific meanings of the symbols involved in these manipulations (and these notions can be extended to uncertain inference, though this remains only very partially understood). Thus, logical semantics provides a basis for assessing the soundness (or otherwise) of inference rules. While human reasoning as well as reasoning in practical AI systems often needs to resort to unsound methods (abduction, default reasoning, Bayesian inference, analogy, etc.), logical semantics nevertheless provides an essential perspective from which to classify and study the properties of such methods. A strong indication that cognitively motivated conceptual representations of language are reconcilable with logically motivated ones is the fact that all proposed conceptual representations have either borrowed deliberately from logic in the first place (in their use of predication, connectives, set-theoretic notions, and sometimes quantifiers) or can be transformed to logical representations without much difficulty, despite being cognitively motivated.

As noted earlier, the 1980s saw the re-emergence of connectionist computational models within mainstream cognitive science theory (e.g., Feldman and Ballard 1982; Rumelhart and McClelland 1986; Gluck and Rumelhart 1990). We have already briefly characterized connectionist models in our discussion of connectionist parsing. But the connectionist paradigm was viewed as applicable not only to specialized functions, but to a broad range of cognitive tasks including recognizing objects in an image, recognizing speech, understanding language, making inferences, and guiding physical behavior. The emphasis was on learning, realized by adjusting the weights of the unit-to-unit connections in a layered neural network, typically by a back-propagation process that distributes credit or blame for a successful or unsuccessful output to the units involved in producing the output (Rumelhart and McClelland 1986).
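The credit-assignment idea behind back-propagation can be illustrated in its simplest form, the delta rule for a single linear unit (a toy sketch of ours; full back-propagation extends this scheme through hidden layers):

```python
# Delta-rule learning for one linear unit: each weight receives blame for
# the output error in proportion to its input's contribution.
def train_step(w, x, target, lr=0.1):
    y = sum(wi * xi for wi, xi in zip(w, x))   # unit output
    error = y - target                         # blame for this output
    return [wi - lr * error * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
for _ in range(100):
    w = train_step(w, [1.0, 2.0], 5.0)
print([round(wi, 2) for wi in w])  # [1.0, 2.0]: the unit now outputs 5.0
```

With this input the error halves on every step, so the weights converge geometrically; distributing such updates backward through multiple layers is what the back-propagation process adds.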

From one perspective, the renewal of interest in connectionism and neural modeling was a natural step in the endeavor to elaborate abstract notions of cognitive content and functioning to the point where they can make testable contact with brain theory and neuroscience. But it can also be seen as a paradigm shift, to the extent that the focus on subsymbolic processing began to be linked to a growing skepticism concerning higher-level symbolic processing as models of mind, of the sort associated with earlier semantic network-based and rule-based architectures. For example, Ramsay et al. (1991) argued that the demonstrated capacity of connectionist models to perform cognitively interesting tasks undermined the then-prevailing view of the mind as a physical symbol system. But others have continued to defend the essential role of symbolic processing. For example, Anderson (1983, 1993) contended that while theories of symbolic thought need to be grounded in neurally plausible processing, and while subsymbolic processes are well-suited for exploiting the statistical structure of the environment, nevertheless understanding the interaction of these subsymbolic processes required a theory of representation and behavior at the symbolic level.

What would it mean for the semantic content of an utterance to be represented in a neural network, enabling, for example, inferential question-answering? The anti-representationalist (or “eliminativist”) view would be that no particular structures can be or need to be identified as encoding semantic content. The input modifies the activity of the network and the strengths of various connections in a distributed way, such that the subsequent behavior of the network effectively implements inferential question-answering. However, this leaves entirely open how a network would learn this sort of behavior. The most successful neural net experiments have been aimed at mapping input patterns to class labels or to other very restricted sets of outputs, and they have required numerous labeled examples (e.g., thousands of images labeled with the class of the objects depicted) to learn their task. By contrast, humans excel at “one-shot” learning, and can perform complex tasks based on such learning.

A less radical alternative to the eliminativist position, termed the subsymbolic hypothesis, was proposed by Smolensky (1988), to the effect that mental processing cannot be fully and accurately described in terms of symbol manipulation, requiring instead a description at the level of subsymbolic features, where these features are represented in a distributed way in the network. Such a view does not preclude the possibility that assemblies of units in a connectionist system do in fact encode symbols and more complex entities built out of symbols, such as predications and rules. It merely denies that the behavior engendered by these assemblies can be adequately modelled as symbol manipulation. In fact, much of the neural net research over the past two or three decades has sought to understand how neural nets can encode symbolic information (e.g., see Smolensky et al. 1992; Browne and Sun 2001).

Distributed schemes associate a set of units and their activation states with particular symbols or values. For example, Feldman (2006) proposes that concepts are represented by the activity of a cluster of neurons; triples of such clusters representing a concept, a role, and a filler (value) are linked together by triangle nodes to represent simple attributes of objects. Language understanding is treated as a kind of simulation that maps language onto a more concrete domain of physical action or experience, guided by background knowledge in the form of a temporal Bayesian network.

Global schemes encode symbols in overlapping fashion over all units. One possible global scheme is to view the activation states of the units, with each unit generating a real value between −1 and 1, as propositions: State p entails state q (equivalently, p is at least as specific as q) if the activation qᵢ of each unit i in state q satisfies pᵢ ≤ qᵢ ≤ 0, or qᵢ = 0, or 0 ≤ qᵢ ≤ pᵢ, depending on whether the activation pᵢ of that unit in state p is negative, zero, or positive, respectively. Propositional symbols can then be interpreted in terms of such states, and truth functions in terms of simple max-min operations and sign inversions performed on network states. (See Blutner 2004; however, Blutner ultimately focuses on a localist scheme in which units represent atomic propositions and connections represent biconditionals.) Holographic neural network schemes (e.g., Manger et al. 1994; Plate 2003) can also be viewed as global; in the simplest cases these use one “giant neuron” that multiplies an input vector whose components are complex numbers by a complex-valued matrix; a component of the resultant complex-valued output vector, written in polar coordinates as re^(iθ), supplies a classification through the value of θ and a confidence level through the value of r. A distinctive characteristic of such networks is their ability to classify or reconstruct patterns from partial or noisy inputs.
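The entailment relation between activation states defined above is easy to state in code; the following toy sketch (ours) implements that componentwise test, together with sign inversion as negation:

```python
# Entailment between activation states: p entails q iff each component of q
# has the same sign as the corresponding component of p but is no more extreme.
def entails(p, q):
    """p is at least as specific as q."""
    for pi, qi in zip(p, q):
        if pi < 0:
            if not (pi <= qi <= 0): return False
        elif pi == 0:
            if qi != 0: return False
        else:
            if not (0 <= qi <= pi): return False
    return True

def neg(p):
    """Sign inversion implements negation of a state."""
    return [-x for x in p]

p = [0.8, -0.5, 0]
q = [0.3, -0.5, 0]
print(entails(p, q))  # True: p is more extreme, hence more specific
print(entails(q, p))  # False
print(neg(q))         # [-0.3, 0.5, 0]
```

Conjunction and disjunction via max-min operations can be defined on the same states, but we omit them here.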

The status of the subsymbolic hypothesis remains an issue for debate and further research. Certainly it is unclear how symbolic approaches could match certain characteristics of neural network approaches, such as their ability to cope with novel instances and their graceful degradation in the face of errors or omissions. On the other hand, some neural network architectures for storing knowledge and performing inference have been shown (or designed) to be closely related to “soft logics” such as fuzzy logic (e.g., Kasabov 1996; Kecman 2001) or “weight-annotated Poole systems” (Blutner 2004), suggesting the possibility that neural network models of cognition may ultimately be characterizable as implementations of such soft logics. Researchers more concerned with practical advances than biologically plausible modeling have also explored the possibility of hybridizing the symbolic and subsymbolic approaches, in order to gain the advantages of both (e.g., Sun 2001). A quite formal example of this, drawing on ideas by Dov Gabbay, is d'Avila Garcez (2004).

Finally, we should comment on the view expressed in some of the cognitive science literature that mental representations of language are primarily imagistic (e.g., Damasio 1994; Humphrey 1992). Certainly there is ample evidence for the reality and significance of mental imagery (Johnson-Laird 1983; Kosslyn 1994). Also creative thought often seems to rely on visualization, as observed early in the 20th century by Poincaré (1913) and Hadamard (1945). But as was previously noted, symbolic and imagistic representations may well coexist and interact synergistically. Moreover, cognitive scientists who explore the human language faculty in detail, such as Steven Pinker (1994, 2007) or any of the representationalist or connectionist researchers cited above, all seem to reach the conclusion that the content derived from language (and the stuff of thought itself) is in large part symbolic—except in the case of the eliminativists who deny representations altogether. It is not hard to see, however, how raw intuition might lead to the meanings-as-images hypothesis. It appears that vivid consciousness is associated mainly with the visual cortex, especially area V1, which is also crucially involved in mental imagery (e.g., Baars 1997: chapter 6). Consequently it is entirely possible that vast amounts of non-imagistic encoding and processing of language go unnoticed, while any evoked imagistic artifacts become part of our conscious experience. Further, the very act of introspecting on what sort of imagery, if any, is evoked by a given sentence may promote construction of imagery and awareness thereof.

In its broadest sense, statistical semantics is concerned with semantic properties of words, phrases, sentences, and texts, engendered by their distributional characteristics in large text corpora. For example, terms such as cheerful, exuberant, and depressed may be considered semantically similar to the extent that they tend to occur flanked by the same (or in turn similar) nearby words. (For some purposes, such as information retrieval, identifying labels of documents may be used as occurrence contexts.) Through careful distinctions among various occurrence contexts, it may also be possible to factor similarity into more specific relations such as synonymy, entailment, and antonymy. One basic difference between (standard) logical semantic relations and relations based on distributional similarity is that the latter are a matter of degree. Further, the underlying abstractions are very different, in that statistical semantics does not relate strings to the world, but only to their contexts of occurrence (a notion similar to, but narrower than, Wittgenstein's notion of meaning as use). However, statistical semantics does admit elegant formalizations. Various concepts of similarity and other semantic relations can be captured in terms of vector algebra, by viewing the occurrence frequencies of an expression as values of the components of a vector, with the components corresponding to the distinct contexts of occurrence. In this way, one arrives at a notion of semantics based on metrics and operators in vector spaces, where vector operators can mimic Boolean operators in various ways (Gärdenfors 2000; Widdows 2004; Clarke 2012).
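The vector-space view can be made concrete with a toy example (our own invented co-occurrence counts): each word is a vector of counts over contexts, and distributional similarity is the cosine of the angle between vectors.

```python
# Distributional similarity: words as co-occurrence count vectors over
# contexts, compared by cosine similarity (toy counts, for illustration).
from math import sqrt

# columns: counts of co-occurrence with "mood", "smile", "party", "therapy"
counts = {
    "cheerful":  [4, 5, 3, 0],
    "exuberant": [3, 4, 4, 0],
    "depressed": [4, 0, 0, 5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

print(round(cosine(counts["cheerful"], counts["exuberant"]), 2))  # 0.97
print(round(cosine(counts["cheerful"], counts["depressed"]), 2))  # 0.35
```

Note that similarity here is graded, unlike classical logical relations, and that nothing in the computation relates the words to the world, only to their contexts of occurrence.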

But how does this bear on meaning representation of natural language sentences and texts? In essence, the representation of sentences in statistical semantics consists of the sentences themselves. The idea that sentences can be used directly, in conjunction with distributional knowledge, as objects enabling inference is a rather recent and surprising one, though it was foreshadowed by many years of work on question answering based on large text corpora. The idea has gained traction as a result of recent efforts to devise statistically based algorithms for determining textual entailment, a program pushed forward by a series of Recognizing Textual Entailment (RTE) Challenges initiated in 2005, organized by the PASCAL Network of Excellence, and more recently by the National Institute of Standards and Technology (NIST). Recognizing textual entailment requires judgments as to whether one given linguistic string entails a second one, in a sense of entailment that accords with human intuitions about what a person would naturally infer (with reliance on knowledge about word meanings, general knowledge such as that any person who works for a branch of a company also works for that company, and occasional well-known specific facts). For example, “John is a fluent French speaker” textually entails “John speaks French”, while “The gastronomic capital of France is Lyon” does not entail that “The capital of France is Lyon”. Some examples are intermediate; e.g., “John was born in France” is considered to heighten the probability that John speaks French, without fully entailing it (Glickman and Dagan 2005). Initial results in the annual competitions were poor (not far above the random guessing mark), but have steadily improved, particularly with the injection of some reasoning based on ontologies and on some general axioms about the meanings of words, word classes, relations, and phrasal patterns (e.g., de Salvo Braz et al. 2005).

It is noteworthy that the conception of sentences as meaning representations echoes Montague's contention that language is logic. Of course, Montague understood “sentences” as unambiguous syntactic trees. But research in textual entailment seems to be moving towards a similar conception, as exemplified in the work of Dagan et al. (2008), where statistical entailment relations are based on syntactic trees, and these are generalized to templates that may replace subtrees by typed variables. Also Clarke (2012) proposes a very general vector-algebraic framework for statistical semantics, where “contexts” for sentences might include (multiple) parses and even (multiple) logical forms for the sentences, and where statistical sentence meanings can be built up compositionally from their proper parts. One way of construing degrees of entailment in this framework is in terms of the entailment probabilities relating each possible logical form of the premise sentence to each possible logical form of the hypothesis in question.

Having surveyed three rather different brands of semantics, we are left with the question of which of these brands serves best in computational linguistic practice. It should be clear from what has been said above that the choice of semantic “tool” will depend on the computational goals of the practitioner. If the goal, for example, is to create a dialogue-based problem-solving system for circuit fault diagnosis, emergency response, medical contingencies, or vacation planning, then an approach based on logical (or at least symbolic) representations of the dialogue, underlying intentions, and relevant constraints and knowledge is at present the only viable option. Here it is of less importance whether the symbolic representations are based on some presumed logical semantics for language, or some theory of mental representation—as long as they are representations that can be reasoned with. The most important limitation