How can we tackle natural language if we have linguistic models like this?

Writing on social chatbots, Mariya Yao published a detailed and approachable overview of the most popular approaches to natural language processing (NLP), which highlights the theories guiding those approaches. I found her summary interesting because I’m a data-centered linguist (sociolinguistics and corpus linguistics) who partners with computer scientists and programmers to make text analysis tools. From my perspective, Yao’s summary points to three critical reasons why progress in NLP has been so slow:

NLP research is dominated by very smart, computationally sophisticated people who know much less about language than they do about computers. When NLP researchers have partnered with experts in language, it’s been with linguists who ultimately come from a tradition grounded in philosophy (not science) and who use introspection and intuition instead of data. Scientific, data-grounded approaches to language — emergent theories and data-oriented disciplines such as sociolinguistics and corpus linguistics — are much less used in NLP research.

Consider this quote from the MIT Media Lab (which Yao cites):

“Language is grounded in experience. Unlike dictionaries which define words in terms of other words, humans understand many basic words in terms of associations with sensory-motor experiences.”

That sounds reasonable to a layperson, but it’s at odds with a scientific approach based on empirical observation of the real world: in actuality, language is grounded in social interaction with other humans. There’s a huge disconnect between a theory that imagines how language works and a data-derived theory acknowledging that human beings learn language from being around other human beings. NLP researchers like data in their tools and methods, but they are often working from language theories that aren’t data-driven.

There’s a serious problem with not having data-based theory to guide research: intuition-based models like syntax/semantics/pragmatics limit research. You may have heard that “all models are wrong, but some are useful,” but some models are both wrong and not very useful. The moment you adopt data-based approaches to language, you have to drop an idea like syntax (the idea that there is a hidden code behind language use), because it doesn’t match the data. In the real world, language doesn’t have an inherent system or code — it has an emergent structure: there are hundreds of structured ways of talking across the world, changing and evolving slowly but constantly as we talk and write, and ultimately grounded in massive amounts of social interaction.

There may be more utility in a data-grounded model that treats language as a multi-level whole where each level is connected:

- Lexical: language at the level of words.
- Lexico-grammatical: language at the level of types of words.
- Thematic: language at the level of themes or messages.

With that model, wrong as it is, we at least have something practical and useful. We can think through how a human recognizes something like genre, e.g. gothic (deserted locales, family secrets, dread, etc.). But we can also use machines to do the same classification: at the lexical level by looking at the frequencies of the most common words (him, his, was, etc.), or at the level of lexico-grammar through stance markers (reporting events, personal pronouns, fear, etc.). And because these levels are visibly connected as a kind of whole, we now have multiple features to use as hooks, so that machines can do a better job with real-world tasks like sorting through millions of documents to find relevant ones, classifying social media posts to detect potential danger, or understanding the online recruiting techniques of violent extremists.
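To make the lexical level concrete, here is a minimal sketch of the kind of classification described above. The word list, the snippets, and the nearest-profile comparison are all hypothetical stand-ins (not from Yao’s article or any real corpus study): each text is reduced to the relative frequencies of a handful of high-frequency function words, and a new text gets the label of the closest reference profile.

```python
from collections import Counter

# A small, hypothetical set of high-frequency function words of the
# kind (him, his, was, ...) used as lexical-level genre features.
FUNCTION_WORDS = ["him", "his", "was", "the", "of", "and"]

def lexical_profile(text):
    """Relative frequency of each function word in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Hypothetical labelled snippets standing in for a training corpus.
REFERENCE = {
    "gothic": "the shadow of his dread was upon him and the house was silent",
    "manual": "press the button and the light of the panel was on",
}

def classify(text):
    """Assign the label of the nearest reference lexical profile."""
    query = lexical_profile(text)
    return min(REFERENCE,
               key=lambda g: distance(lexical_profile(REFERENCE[g]), query))

print(classify("his fear was a secret and the dread was near him"))
# → gothic
```

A real system would of course use far more words, more texts, and a proper classifier, and would combine this lexical level with lexico-grammatical features like stance markers; the point is only that each level of the model yields machine-countable hooks.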

Current work on NLP, whether it’s for commercial tasks like understanding consumer attitudes or for moon-shot goals like general linguistic AI (computers that can “read” like humans), will require data-grounded theories of language. Without them, computational experts in NLP are reduced to technicians: deep in their understanding of their tools, but lost as to where and to what those tools should be applied.

There’s hope though. In Yao’s article, the last of the four approaches to NLP she summarizes is “Interactive Learning Approaches.” She quotes Stanford computer science professor Percy Liang:

“Language is intrinsically interactive…How do we represent knowledge, context, memory? Maybe we shouldn’t be focused on creating better models, but rather better environments for interactive learning.”

All the things we do in language — choosing specific words that have context-dependent effects, subtly constructing stance through style moves — are the result of billions of intelligent human interactions over time. And so if we ever hope to artificially replicate something of the power and usefulness of human linguistic intelligence, we would do well to remember that intelligence is characterized by sociality and interaction.