1 Introduction

Understanding human linguistic abilities is a central problem for cognitive science. A key theoretical question is whether the knowledge that underlies these abilities is probabilistic or categorical in nature. Cognitive scientists and computational linguists have debated this issue for at least the past two decades (Ambridge, Bidgood, Pine, Rowland, & Freudenthal, 2012; Fanselow, Féry, Schlesewsky, & Vogel, 2006; Keller, 2000; Manning, 2003; Sorace & Keller, 2005; Sprouse, 2007), and indeed from the earliest days of modern linguistics (Chomsky, 1957, 1975; Hockett, 1955).

It is widely believed that much of human and animal cognition is probabilistic (Chater, Tenenbaum, & Yuille, 2006). But in some respects, natural language is different from other cognitive domains. Language is a set of discrete combinatorial systems above the phonetic level (phonology, morphology, and syntax). The absence of such systems in other species has led some researchers to posit a distinct rule-driven mechanism for combining and manipulating symbols at the core of language, as well as the high-order cognitive abilities that involve it (Chomsky, 1965, 1995; Fodor, 1983, 2000; Hauser, Chomsky, & Fitch, 2002).

However, language use clearly involves probabilistic inference. The ability to recognize phonemes in a noisy environment, for example, requires an ability to assess the relative likelihood of different phoneme sequences (Clayards, Tanenhaus, Aslin, & Jacobs, 2008; Lieberman, 1963; Swinney, 1979). There are no obvious non-probabilistic explanations for these kinds of phenomena. Similarly, frequency effects in word recognition and production are a staple of the psycholinguistic literature (Ambridge, Kidd, Rowland, & Theakston, 2015; Levy, 2008; Piantadosi, Tily, & Gibson, 2011). For a survey of evidence for the central role of probabilistic inference across a wide variety of linguistic processes, see Chater and Manning (2006).

Before proceeding, we need to clarify two important issues: a methodological distinction between competence and performance, and a terminological question concerning our use of the terms grammaticality and acceptability. We adopt what we take to be a minimal and uncontroversial version of the competence–performance distinction for linguistic activity. The competence component abstracts away from those aspects of linguistic output that are not directly conditioned by linguistic knowledge, specifically grammatical knowledge. Performance encompasses the production and interpretation of linguistic expressions. The distinction turns on the difference between the processes that are responsible for an event like someone interrupting his/her production of a sentence due to a distraction, and those that govern phenomena such as subject‐verb agreement. The distinction becomes problematic when it is applied to purely linguistic phenomena, where we have limited information concerning the mechanisms involved in the representation of linguistic knowledge. Still, we do have a reasonable understanding of at least some processing elements. Crocker and Keller (2006), for example, discuss the role of local ambiguity and processing load as factors that may cause difficulties in comprehension.

We use grammaticality (in the narrow sense of syntactic grammaticality) to refer to the theoretical competence that underlies the performance phenomenon of speaker acceptability judgements. We measure acceptability in experiments when we ask subjects to rate sentences. Grammaticality is one of the possible elements in determining an acceptability judgement. This view is widespread in linguistics, and we follow it here. Of course, other factors can affect acceptability: semantic plausibility, various types of processing difficulty, and so on, can individually or jointly cause grammaticality and acceptability to come apart. A grammatical sentence may be unacceptable because it is hard to process, and an ungrammatical sentence may be judged acceptable because of various features of the processing system. It is important to recognize that grammatical competence is a theoretical entity, not directly accessible to observation or measurement. The primary evidence available for ascertaining its properties is speakers' acceptability judgements.

In the light of these distinctions, we can specify the theoretical question that we are addressing, in terms of two slightly caricatured alternatives. First we have the idea that the underlying grammatical competence generates a set of structures (or of sound–meaning pairs). On this approach, there is a binary distinction between those elements that are in the set of well‐formed structures and those that are not. In addition to the grammar, there are performance components. These are processing devices of various types that may be probabilistic. In this framework, which many theoretical linguists assume, the formal device that encodes syntactic competence is categorical. It generates all and only the grammatical structures of a language.

On the second alternative, grammatical competence does not define a set of well‐formed structures with a binary membership condition. Instead, it generates a probability distribution over a set of structures that includes both well‐formed and ill‐formed elements.

Of course, these two approaches do not exhaust the set of choices. There are other logically possible alternatives, but these have not been articulated to the same level of systematic clarity and detail as the two models that we focus on here. We will briefly take up these other alternatives in Section 4.1

Both views have strengths and weaknesses. The probabilistic approach can model certain aspects of linguistic behavior—disambiguation, perception, etc.—quite easily, but it does not naturally account for intuitions of grammaticality. By contrast, binary categorical models can easily express the distinction between grammatical and ungrammatical sentences, but they construe all sentences in each class as having the same status with respect to well‐formedness. They do not, in themselves, allow for distinctions among more or less likely words or constructions, nor do they express different degrees of naturalness.

Part of this debate hinges on a disagreement over what constitutes the central range of data to be explained. One view takes the observed facts of actual linguistic use to be the relevant data. This perspective is often associated with corpus linguistics and the use of statistical models trained on these corpora. However, language use is repetitive, and it may fail to contain the crucial examples that distinguish between different theories of syntax. These examples typically combine several different phenomena, and they may be rare to the point of non-occurrence in observed speech. As a result, syntacticians of a Chomskyan orientation have traditionally relied on artificially constructed example data, which they test through informal speaker judgement queries. Important questions can be, and have been, raised concerning the rigor and reliability of these methods (Gibson & Fedorenko, 2013; Gibson, Piantadosi, & Fedorenko, 2013; Schütze, 1996; Sprouse & Almeida, 2013), but we will pass over this debate here.

While probabilistic methods have frequently been used to model naturally occurring speech, they have seldom, if ever, been applied to the prediction of acceptability judgments. Indeed, theoretical linguists, following Chomsky (1957), tend to dismiss probability as irrelevant to syntax. In the early days of generative grammar (Chomsky, 1975), there was some interest in probabilistic approaches, but it quickly disappeared in the wake of Chomsky's criticisms.

One might, initially, suggest that it is possible to treat the probability of a sentence as a measure of its grammaticality, with 1 indicating full grammaticality and 0 complete ill‐formedness. This move misconstrues the nature of the values in a probability distribution that a language model determines. The probability of a sentence, s, for a model, is the probability that a randomly selected sentence will be s, and not a measure of its relative grammaticality. One of the defining characteristics of probabilities is that they must sum to 1. If we add up the probabilities of every possible sentence, the total is 1. Hence, the probability of each individual sentence is very small.
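To make this concrete, the following sketch builds a toy unigram language model (the vocabulary and probabilities are invented for illustration, not estimated from data) and checks that the probabilities of all sentences sum to nearly 1, so each individual sentence receives only a tiny share of the probability mass:

```python
from itertools import product

# Toy unigram language model. The vocabulary and the probabilities
# below are invented for illustration; they are not corpus estimates.
p_word = {"the": 0.3, "cat": 0.2, "sleeps": 0.2}  # word probabilities sum to 0.7
p_stop = 0.3  # probability of ending the sentence

def sentence_prob(words):
    """P(sentence) = product of its word probabilities, times P(stop)."""
    p = p_stop
    for w in words:
        p *= p_word[w]
    return p

# Summing over every sentence up to length 12 already gets close to 1,
# so any individual sentence can claim only a tiny share of the mass.
total = sum(sentence_prob(s)
            for n in range(13)
            for s in product(p_word, repeat=n))
print(round(total, 4))  # 0.9903 (approaches 1 as the length bound grows)
print(round(sentence_prob(["the", "cat", "sleeps"]), 6))  # 0.0036
```

Even in this three-word toy language, a perfectly well-formed sentence has probability well below 1%; with a realistic vocabulary, sentence probabilities shrink to minuscule values.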

Consider, for example, the sentence “When the Indians went hunting, whether for animals or for rival Indians, their firepower was deadly.” From a traditional linguistic perspective, this sentence is perfectly grammatical, and it does, in fact, receive a high acceptability rating from native speakers of English (a mean rating of 3.69 on a scale of 1 to 4 in our crowdsourced annotation experiments). It occurs once in the British National Corpus (BNC), which contains almost 5 million sentences. If we constructed a new corpus of the same size, we would be very surprised if this exact sentence occurred at all. Thus, the probability of this sentence is much less than 1 in 5 million.

There is a qualitative difference between the numbers that express probabilities and those that measure acceptability. Both are values that represent objective features of a sentence, but they are entirely distinct properties, determined in different ways, and there is no direct relationship between them. The probability of a sentence is affected by several factors that do not, in general, determine its acceptability. If we take two acceptable sentences and join them with a conjunction, the result is often perfectly acceptable, but its probability may only slightly exceed the product of the probabilities of the two conjuncts.2 Longer sentences will generally have lower probabilities. Moreover, the probability of individual lexical items is an important element in generating the probability of the sentences in which they appear. “I saw a cat” and “I saw a yak” are roughly equivalent in acceptability, but the word “yak” is much less probable than “cat.” This creates a significant difference in the probability values of these two sentences. Short, comparatively unacceptable sentences may therefore have higher probabilities than very long acceptable sentences that contain rare words.
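The lexical-frequency point can be illustrated with a minimal sketch; the unigram frequencies below are assumptions made for illustration, not corpus counts:

```python
import math

# Hypothetical relative word frequencies, assumed for illustration.
freq = {"i": 0.02, "saw": 0.001, "a": 0.03, "cat": 1e-4, "yak": 1e-7}

def unigram_logprob(sentence):
    """Log probability of a sentence under a unigram model:
    the sum of the log frequencies of its words."""
    return sum(math.log(freq[w]) for w in sentence.lower().split())

lp_cat = unigram_logprob("I saw a cat")
lp_yak = unigram_logprob("I saw a yak")

# The two sentences are equally acceptable, yet the rare word "yak"
# makes the second a thousand times less probable than the first.
print(round(math.exp(lp_cat - lp_yak)))  # 1000
```

The entire probability gap comes from a single lexical substitution that leaves acceptability untouched, which is exactly why raw probability cannot serve as a grammaticality score.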

One straightforward way of deriving grammaticality from probabilities would be to fix some small positive threshold ε and to consider as grammatical all those sentences whose probability is above ε. However, this has some undesirable consequences. Most important, since all of the probabilities must sum to 1, there can be infinitely many sentences with non-zero probability, but only finitely many sentences with probability above any positive threshold. Indeed, since 1/ε is a finite number, there can be at most 1/ε sentences whose probability is above ε; if there were more, the total probability would exceed 1. To illustrate, assume that ε = 0.01. Then at most 100 (i.e., 1/0.01) sentences can count as grammatical, since more than 100 sentences, each with probability at least 0.01, would give a total probability greater than 1, which is impossible. The claim that there are only finitely many grammatical sentences is, of course, entirely unreasonable from a linguistic perspective. See Clark and Lappin (2011) for additional discussion of this issue.
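The counting argument can be checked mechanically; the five-sentence distribution below is hypothetical:

```python
# In any probability distribution, at most 1/eps outcomes can have
# probability above eps. The five-"sentence" distribution is invented.
def count_above(dist, eps):
    """Count the outcomes whose probability exceeds the threshold."""
    return sum(1 for p in dist.values() if p > eps)

dist = {"s1": 0.4, "s2": 0.3, "s3": 0.2, "s4": 0.05, "s5": 0.05}
assert abs(sum(dist.values()) - 1.0) < 1e-9  # it is a distribution

eps = 0.01
# No matter how the mass is arranged, at most 1/eps = 100 sentences
# could clear the threshold; here only 5 sentences exist at all.
print(count_above(dist, eps))  # 5
assert count_above(dist, eps) <= 1 / eps
```

Raising the threshold only shrinks the set further: with ε = 0.25, just two of these sentences qualify. A threshold-based notion of grammaticality therefore always yields a finite grammar.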

However, there is clearly some relation between acceptability and probability. After all, native speakers are more likely to produce acceptable rather than unacceptable sentences, and so the probability mass is concentrated largely on the acceptable sentences. All else being equal, acceptable sentences are more likely than unacceptable sentences, once we have controlled for confounding factors.

It does therefore seem, in principle, possible to predict acceptability on the basis of a probabilistic model. But this requires that we find a way of filtering out those aspects of probability that vary independently of acceptability, and so cannot be used to predict it. We propose that a probabilistic model can generate both probabilities and acceptability judgments if we augment it with an acceptability measure that compensates for other factors, notably lexical frequency and sentence length. These are functions that normalize the probability value of a sentence through an equation that discounts its length and the frequency of its lexical items. Some measures also magnify the contribution of other factors to the acceptability value of the sentence.

We experiment with various different acceptability measures, which we will explain in detail. To illustrate how they operate, we describe the simplest one that we apply. Suppose that we have a probabilistic model M that assigns a probability value to a sentence s, which we write as P_M(s). We normalize this probability using the formula log P_M(s) / |s|: we take the logarithm of the probability of s and divide it by s's length. This score is no longer a number between 0 and 1. Since the probability is less than 1, its logarithm, and hence the score, is negative. But crucially this number will not, in general, decrease in proportion to the length of a sentence. If we pick a threshold value, we may have an infinite number of sentences whose score is above that threshold. We will see that, when applied to the distribution of a suitable language model M, scores of this kind (which incorporate other information in addition to sentence length) correlate surprisingly well with human acceptability ratings.
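A small sketch may help here. It assumes, purely for illustration, a model under which every word has the same probability, and shows that the raw log probability falls with sentence length while the length-normalized score does not:

```python
import math

# Assume, purely for illustration, a model under which every word
# has probability 0.01. Then log P(s) = |s| * log(0.01).
P_WORD = 0.01

def log_prob(n_words):
    """Log probability of an n-word sentence under the toy model."""
    return n_words * math.log(P_WORD)

def mean_log_prob(n_words):
    """The simplest acceptability measure: log P(s) / |s|."""
    return log_prob(n_words) / n_words

for n in (5, 10, 40):
    print(n, round(log_prob(n), 1), round(mean_log_prob(n), 3))
# 5 -23.0 -4.605
# 10 -46.1 -4.605
# 40 -184.2 -4.605
```

The raw log probability shrinks with every added word, but the normalized score stays flat, so a single length-independent threshold (and an infinite set of sentences above it) becomes possible.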

The core contribution of this paper is to demonstrate that grammatical competence can be probabilistic rather than categorical in nature. We wish to show that it is possible for such a theory of competence to model both the probabilities of actual language use and, crucially, to accurately predict human acceptability judgments.

We present two families of experiments that support this claim. In Section 2, we describe experiments on various datasets which demonstrate pervasive gradience in a wide range of acceptability judgements over sentences from different domains and languages. Some datasets are generated by drawing sentences from a corpus and introducing errors through round-trip machine translation. We use crowdsourcing to obtain native speaker acceptability judgements. In addition, we use test sets of linguists' constructed examples (both good and starred), and we filter one of these test sets to eliminate semantic/pragmatic anomaly. We examine both mean and individual judgement patterns. We compare the results to two non-linguistic benchmark classifiers, one binary and the other gradient, which we also test through crowdsourcing. The results of these experiments show that sentence acceptability judgements, both individual and aggregate, are intrinsically gradient in nature.

In Section 3, we present computational modeling work that shows how some probabilistic models, trained on large corpora of well-formed sentences and enriched with an acceptability measure, predict acceptability judgments with encouraging levels of accuracy. We experiment with a variety of different models representing the current state of the art in machine learning for computational linguistics, and we test them on the crowdsourced annotation data described in Section 2. Our models include N-grams, Bayesian Hidden Markov Models of different levels of complexity, and recurrent neural networks. All of them are entirely unsupervised. They are trained on raw text that contains no syntactic annotation and no information about acceptability. Each model is trained on approximately 100M words of text. We also apply a variety of different acceptability measures to map probabilities to acceptability scores, and we determine the correlations between these scores and the human judgments.
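As a sketch of the kind of evaluation involved, the following computes a Pearson correlation between model scores and mean human ratings; both data columns are invented for illustration:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data, invented for illustration: acceptability-measure
# scores from a model and mean human ratings (1-4) for five sentences.
scores = [-2.1, -3.5, -2.8, -5.0, -4.2]
ratings = [3.8, 2.9, 3.4, 1.6, 2.2]
print(round(pearson(scores, ratings), 3))  # 0.996
```

A correlation near 1 would indicate that the model's acceptability scores track the human ratings closely; the experiments in Section 3 report such correlations for real model scores and real crowdsourced ratings.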

Our experimental work suggests two main conclusions. First, gradience is intrinsic to acceptability judgements. Second, grammatical competence can be naturally represented by a probabilistic model. The second conclusion is supported by the fact that our language models, augmented by acceptability measures, predict the observed gradient acceptability data to an encouraging level of accuracy.

Before presenting our experimental work, we need to address two points. First, one might ask why acceptability judgements are directly relevant to a theory of linguistic competence, given that they may, in part, be generated by performance factors external to competence. In fact, such judgements have been the primary data by which linguists have tested their theories since the emergence of modern theoretical linguistics in the 1950s. Chomsky (1965) identifies the descriptive adequacy of a theory of grammar with its capacity to predict speakers' linguistic intuitions. It seems reasonable to identify intuitions with acceptability judgements. This data constitutes the core evidence for evaluating theories of syntactic competence.

Second, we wish to stress that our experimental work does not show that a binary formal grammar is excluded as a viable theory of competence. However, it does indicate that our probabilistic account achieves coverage of acceptability judgements in a way that binary formal grammars, even when augmented by current theories of processing, have not yet been shown to do.

The structure of our argument is as follows. An adequate theory of competence must account for the observed distribution of speakers' acceptability judgements. Several of our language models, enriched with acceptability scoring measures, predict mean speakers' acceptability judgements to an encouragingly high degree of accuracy, across a range of test set domains and several languages. By contrast, classical formal grammars cannot, on their own, explain these judgement patterns. In principle, they might be able to do so if they are supplemented with a theory of processing. To date no such combined account has been formulated that can accommodate the data of acceptability judgements to the extent that our best performing language models can. We conclude that characterizing grammatical knowledge as a probabilistic classifier does, at present, offer a better account of a crucial set of facts relevant to the assessment of a theory of competence.

The choice between the two approaches remains open. A proper comparison between them awaits the emergence of a fully articulated model that integrates a binary formal grammar into a precise account of processing. Such a model must be able to generate acceptability ratings in a way that permits comparison with the predictions of our enriched language models.