Ocean of Awareness

Mon, 04 Jun 2018

It is often said that parsing is a "solved problem". Given the level of frustration with the state of the art, the underuse of the very powerful technique of Language-Oriented Programming due to problematic tools, and the vast superiority of human parsing ability over computers, this requires explanation.

On what grounds would someone say that parsing is "solved"? To understand this, we need to look at the history of Parsing Theory. In fact, we'll have to start decades before computer Parsing Theory exists, with a now nearly-extinct school of linguistics, and its desire to put the field on a strictly scientific basis.

1929: Bloomfield redefines "language"

In 1929 Leonard Bloomfield, as part of his effort to create a linguistics that would be taken seriously as a science, published his "Postulates". The "Postulates" include his definition of language:

The totality of utterances that can be made in a speech community is the language of that speech-community.

There is no reference in this definition to the usual view, that the utterances of a language "mean" something. This omission is not accidental:

The statement of meanings is therefore the weak point in language-study, and will remain so until human knowledge advances very far beyond its present state. In practice, we define the meaning of a linguistic form, wherever we can, in terms of some other science.

Bloomfield is passing the buck, because the behaviorist science of his time rejects any claims about mental states as unverifiable statements -- essentially, as claims to be able to read minds. "Hard" sciences like physics, chemistry and even biology avoid dealing with unverifiable mental states. Bloomfield and the behaviorists want to make the methods of linguistics as close to hard science as possible.

Draconian as Bloomfield's exclusion of meaning is, it is a big success. Known as structural linguistics, Bloomfield's approach dominates linguistics for the next couple of decades.

1955: Noam Chomsky graduates

Noam Chomsky earns his PhD at the University of Pennsylvania. His teacher, Zelig Harris, is a prominent Bloomfieldian, and Chomsky's early work is thought to be in the Bloomfield school. Chomsky becomes a professor at MIT. MIT does not have a linguistics department, and Chomsky is free to teach his own approach to the subject.

The term "language" as of 1956

Chomsky publishes his "Three models" paper, one of the most important papers of all time. His definition of language uses the terminology of set theory:

By a language then, we shall mean a set (finite or infinite) of sentences, each of finite length, all constructed from a finite alphabet of symbols.

This definition is pure Bloomfield in substance, but signs of departure from the behaviorist orthodoxy are apparent in "Three Models" -- Chomsky is quite willing to talk about what sentences mean, when it serves his purposes. For an utterance with multiple meanings, Chomsky's new model produces multiple syntactic derivations. Each of these syntactic derivations "looks" like the natural representation of one of the meanings. Chomsky points out that the insight into semantics that his new model provides is a very desirable property to have.

1959: Chomsky reviews Skinner

In 1959, Chomsky reviews a book by B.F. Skinner on linguistics. Skinner is the most prominent behaviorist of the time.

Chomsky's review removes all doubt about where he stands on behaviorism or on the relevance of linguistics to the study of meaning. His review galvanizes the opposition to behaviorism, and Chomsky establishes himself as behaviorism's most prominent and effective critic.

In later years, Chomsky will make it clear that he had had no intention of avoiding semantics:

[...] it would be absurd to develop a general syntactic theory without assigning an absolutely crucial role to semantic considerations, since obviously the necessity to support semantic interpretation is one of the primary requirements that the structures generated by the syntactic component of a grammar must meet.

1961: Oettinger discovers pushdown automata

While the stack itself goes back to Turing, its significance for parsing becomes an object of interest in itself with Samuelson and Bauer's 1959 paper. Mathematical study of stacks as models of computing begins with Anthony Oettinger's 1961 paper.

Oettinger 1961 is full of evidence that stacks (which he calls "pushdown stores") are still very new. For example, Oettinger does not use the terms "push" or "pop", but instead describes operations on his pushdown stores using a set of vector operations which will later form the basis of the APL language.

Oettinger defines 4 languages. Oettinger's definitions all follow the behaviorist model -- they are sets of strings. Oettinger's pushdown stores will eventually be called deterministic pushdown automata (DPDA's) and become the basis of a model of language and the subject of a substantial literature, all of which will use the behaviorist definition of "language".
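To make the "sets of strings" view concrete, here is a minimal sketch in Python of a deterministic pushdown recognizer. The language { a^n b^n | n >= 1 } and the function name are my own illustrative choices, not taken from Oettinger's paper. Note what the recognizer returns: a bare yes or no about membership, with no account of the structure of the string.

```python
def accepts_anbn(text: str) -> bool:
    """Deterministic pushdown recognizer for { a^n b^n | n >= 1 }.

    Hypothetical example language, chosen only for illustration.
    """
    stack = []               # the "pushdown store"
    state = "reading_a"
    for ch in text:
        if state == "reading_a" and ch == "a":
            stack.append("A")        # push one marker per 'a'
        elif ch == "b" and stack:
            state = "reading_b"      # once a 'b' is seen, no more 'a's allowed
            stack.pop()              # pop one marker per 'b'
        else:
            return False             # no move is defined: reject
    # Accept only if at least one 'b' was read and every 'a' was matched.
    return state == "reading_b" and not stack


print(accepts_anbn("aabb"))   # True
print(accepts_anbn("aab"))    # False
```

At every step the next move is fixed by the current state, the input symbol and the stack, which is what "deterministic" means here.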

Oettinger hopes that DPDA's will be an adequate basis for the study of both computer and natural language translation. (Oettinger's own field is Russian translation.) DPDA's soon prove totally inadequate for natural languages.

But for dealing with computing languages, DPDA's will have a much longer life. As of 1961, all algorithms with acceptable speed are using stacks with various modifications.

The development of a theory of pushdown algorithms should hopefully lead to systematic techniques for generating algorithms satisfying given requirements to replace the ad hoc invention of each new algorithm.

The search for a comprehensive theory of stack-based parsing quickly becomes identified with the search for a theoretical basis for practical parsing.

1965: Knuth discovers LR(k)

Donald Knuth reports his new results on stack-based parsing. In a pivotal paper, Knuth sets out a theory that encompasses all the "tricks" used for efficient parsing up to that time. With this, Oettinger's hope for a theory to replace "ad hoc invention" is fulfilled. In an exhilarating (and exhausting) 39-page demonstration of mathematical virtuosity, Knuth shows that stack-based parsing is equivalent to a new and unexpected class of grammars. Knuth calls these LR(k), and provides a parsing algorithm for them.

Knuth's new algorithm might be expected to be "the one to rule them all". Unfortunately, while deterministic and linear, it is not practical -- it requires huge tables well beyond the memory capabilities of the time.
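For readers who have not seen a table-driven stack parser, the following is a rough sketch of the general shape: a state machine plus a stack. It is not Knuth's LR(k) construction. It is a hand-built LR(0) table for a toy grammar of my own choosing (S ::= "(" S ")" | "x"), and the table and function names are hypothetical.

```python
# Hand-built LR(0) tables for the toy grammar  S ::= "(" S ")" | "x".
# The state numbers are an assumption of this sketch, derived by hand.
SHIFT = {(0, "("): 2, (0, "x"): 3, (2, "("): 2, (2, "x"): 3, (4, ")"): 5}
GOTO = {(0, "S"): 1, (2, "S"): 4}
REDUCE = {3: ("S", 1), 5: ("S", 3)}   # state -> (left-hand side, rule length)
ACCEPT_STATE = 1


def lr0_accepts(tokens):
    """Return True if `tokens` is a sentence of the toy grammar."""
    stack = [0]          # stack of parser states
    pos = 0              # index of the next unread token
    while True:
        state = stack[-1]
        if state in REDUCE:
            # LR(0): reduce without looking at the next token.
            lhs, length = REDUCE[state]
            del stack[-length:]
            stack.append(GOTO[(stack[-1], lhs)])
        elif state == ACCEPT_STATE:
            return pos == len(tokens)     # accept only at end of input
        elif pos < len(tokens) and (state, tokens[pos]) in SHIFT:
            stack.append(SHIFT[(state, tokens[pos])])
            pos += 1
        else:
            return False                  # no table entry: syntax error


print(lr0_accepts("((x))"))   # True
print(lr0_accepts("(x"))      # False
```

The tables here have a handful of entries. The tables Knuth's algorithm needs for realistic grammars and larger k are what made it impractical in 1965.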

The impracticality of his LR(k) algorithm does not suggest to Knuth that the stack-based model is inappropriate as a model of practical parsing. Instead it suggests to him, and to the field, that the boundary of practical parsing lies in a subclass of the LR(k) grammars.

To be sure, Knuth, in his program for further research, does suggest investigation of parsers for superclasses of LR(k). He even describes a new superclass of his own: LR(k,t), which is LR(k) with more aggressive lookahead. But he is clearly unenthusiastic about LR(k,t). It is reasonable to suppose that Knuth is even more negative about the more general approaches that he does not bother to mention.

In any case, those reading Knuth's LR(k) paper focus almost exclusively on his suggestions for research within the stack-based model. These include grammar rewrites; streamlining of the LR(k) tables; and research into LR(k) subclasses. It is LR(k) subclassing that will receive the most attention.

The idea that the solution to the parsing problem must be stack-based is not without foundation. In 1965, the limits of computer technology are severe. For practitioners, any parsing technique that requires much more than a reasonably-sized state machine and a stack is not likely to be adopted. After all, only four years earlier, stacks were bleeding edge.

The practitioners of 1965 are inclined to believe that, like it or not, they are stuck with stack-based parsing. But why do the theoreticians feel compelled to follow them? The answer is that theoreticians talk themselves into it, using a misleading equivalence based on the behaviorist definition of language.

"Language" as of 1965

Knuth defines language as follows:

The language defined by G is

{ α | S => α and α is a string over T }

namely, the set of all terminal strings derivable from S by using the productions of G as substitution rules.

Here G is a grammar whose start symbol is S and whose set of terminals is T. This is the behaviorist definition of language translated into set-theoretic terms.
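As a small illustration of this definition, the sketch below enumerates the terminal strings derivable from S, up to a length bound. The grammar, the function name and the bound are assumptions of mine. The sketch also assumes the grammar has no empty productions, so that a sentential form never gets shorter and the search can be cut off by length.

```python
from collections import deque


def language(grammar, start, max_len):
    """Terminal strings of length <= max_len derivable from `start`.

    `grammar` maps each nonterminal to a list of right-hand sides.
    Assumes no empty right-hand sides (an assumption of this sketch).
    """
    seen = {(start,)}
    queue = deque(seen)
    strings = set()
    while queue:
        form = queue.popleft()
        # Find the leftmost nonterminal, if any.
        idx = next((i for i, sym in enumerate(form) if sym in grammar), None)
        if idx is None:
            strings.add("".join(form))    # all terminals: a sentence
            continue
        for rhs in grammar[form[idx]]:
            new_form = form[:idx] + tuple(rhs) + form[idx + 1:]
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return strings


# Hypothetical example grammar:  S ::= a S b | a b
G = {"S": [["a", "S", "b"], ["a", "b"]]}
print(sorted(language(G, "S", 6), key=len))   # ['ab', 'aabb', 'aaabbb']
```

The result is a plain set of strings. Nothing in it records which productions were used to derive each one.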

Knuth proves, to the satisfaction of the profession, the "equivalence" of LR(k) and DPDA's. LR(k) is a class of grammars and the DPDA model is of languages -- sets of strings. At first glance, this is an "apples and oranges" comparison -- how do you prove the equivalence of a class of languages and a class of grammars?

Knuth does this by reducing the class of DPDA languages and the class of grammars to their lowest common denominator, which is the language. And, of course, the "language" in the usage of Parsing Theory is a set of strings, without consideration of their syntax.

Every grammar, when stripped of its syntax, defines a language. So Knuth compares the language which results from stripping down the LR(k) grammars, to the language of DPDA's. After some very impressive mathematics, Knuth is able to show that the two languages are equivalent.
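What "stripping the syntax" throws away can be seen with a small sketch. The two toy grammars below are my own examples, not Knuth's. Both define the same set of strings, but they assign different structure to them: one gives a string two parse trees, the other gives it one. The tree counter assumes a grammar with no empty rules and no rules whose right-hand side is a single nonterminal.

```python
from functools import lru_cache


def count_trees(grammar, start, tokens):
    """Count the parse trees of `tokens` under `grammar`.

    Assumes no empty rules and no unit rules (assumptions of this sketch).
    """
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def sym(symbol, i, j):
        # Number of ways `symbol` can derive tokens[i:j].
        if symbol not in grammar:                      # terminal
            return 1 if j == i + 1 and tokens[i] == symbol else 0
        return sum(seq(tuple(rhs), i, j) for rhs in grammar[symbol])

    @lru_cache(maxsize=None)
    def seq(rhs, i, j):
        # Number of ways the symbol sequence `rhs` can derive tokens[i:j].
        if not rhs:
            return 1 if i == j else 0
        head, rest = rhs[0], rhs[1:]
        # The head takes at least one token and leaves one per remaining symbol.
        return sum(sym(head, i, k) * seq(rest, k, j)
                   for k in range(i + 1, j - len(rest) + 1))

    return sym(start, 0, len(tokens))


AMBIGUOUS = {"E": [("E", "-", "E"), ("n",)]}   # E ::= E - E | n
LEFT_ASSOC = {"E": [("E", "-", "n"), ("n",)]}  # E ::= E - n | n

sentence = ["n", "-", "n", "-", "n"]
print(count_trees(AMBIGUOUS, "E", sentence))    # 2
print(count_trees(LEFT_ASSOC, "E", sentence))   # 1
```

As sets of strings the two grammars are indistinguishable. As descriptions of what "n - n - n" means, they are not.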

In theoretical mathematics, of course, you can define "equivalent" however you like. But if the purpose is to suggest limits in practice, you have to be much more careful. And in fact, as Knuth's paper shows, if you equate languages and grammars, you get into a very serious degree of magical thinking. Using the Knuth algorithm,

parsing LR(k) grammars for arbitrary k is hopelessly impractical;

parsing LR(1) grammars is impractical, but close to the boundary; and

parsing LR(0) grammars is very practical.

A problem for the relevance of Knuth's proof of equivalence is that, if you just look at sets of strings without regard to syntax, LR(1) and LR(k) are equivalent. That means that from the sets-of-strings point of view, hopelessly impractical and borderline impractical are the same thing.

Worse, both LR(1) and LR(k) are equivalent to LR(0) for most applications. If you add an explicit end marker to an LR(1) language, which in most applications is easy to do, your LR(1) language becomes LR(0). Therefore, for most applications,

LR(k) = LR(1) = LR(0)

This means that, in the world of sets-of-strings, extremely impractical and very practical are usually the same thing.
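One way to get a feel for what the end marker does is the "prefix property": no string in the language is a proper prefix of another. In the textbook treatments I am aware of, this is the property that separates LR(0) languages from other deterministic languages, but treat that connection as background I am assuming here, not something argued in this post. The sketch below checks the property on a finite sample of a toy language of my own choosing, with and without a "$" end marker.

```python
def has_prefix_property(strings):
    """True if no string in the sample is a proper prefix of another."""
    strings = set(strings)
    return not any(s != t and t.startswith(s)
                   for s in strings for t in strings)


# Hypothetical sample: the first few strings of the language a, aa, aaa, ...
sample = {"a" * n for n in range(1, 6)}
marked = {s + "$" for s in sample}      # the same strings with an end marker

print(has_prefix_property(sample))   # False: "a" is a prefix of "aa"
print(has_prefix_property(marked))   # True: every string ends at its "$"
```

A check on a finite sample proves nothing about the full language, of course. It only shows how cheaply the end marker changes the picture.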

Clearly the world of sets of strings is a magical one, in which we can easily transport ourselves across the boundary between practical and impractical. We can take visions of a magical world back into the world of practice, but we cannot assume they will be helpful. In that light, it is no surprise that Joop Leo will show how to extend practical parsing well beyond LR(k).

Comments, etc.

I encourage those who want to know more about the story of Parsing Theory to look at my Parsing: a timeline 3.0. In particular, "Timeline 3.0" tells the story of the search for a good LR(k) subclass, and what happened afterwards.

To learn about Marpa, my Earley/Leo-based parsing project, there is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

posted at: 06:58