Grammar and style-checking tools for Emacs


Grammar is hard, both for human beings and for software programs. These days, writers who use free software generally have their choice of reliable utilities for catching spelling mistakes, regardless of what editors or word processors they use. The outlook for grammar-and-style checking is not nearly as rosy. I recently explored the options available for Emacs, and was underwhelmed by the status quo.

But the limited options available testify primarily to the difficulty of the problem, rather than indicting the development community. Natural-language processing is at the heart of grammar-checking, and there are few relevant projects available to the public that offer much of a general solution. Those that do exist (and have free-software compatible licenses) tend to come from academia. As a result, users can choose either lightweight programs that offer only a limited set of simple grammar checks, or more complete grammar-checkers that can involve awkward glue code to hook Emacs into an external service.

Perhaps it goes without saying, but I limited my research to grammar utilities that support my native language, English. Yet, as far as I can discern, the situation is not dramatically better for any other languages—in fact, once one ventures too far outside of the European languages, the situation seems to be much worse on a practical level. The theoretical problems abound, and one is at the mercy of whoever has the funds to support the necessary research. I also limited this search to tools with Emacs integration, but a bit of looking suggests that the number and variety of solutions available for Vim and other editors are similar.

Limited tools, unlimited problem space

Among the first discoveries one makes when reading about grammar checking is that there is a wide range of errors that someone might consider a grammatical mistake. The simplest are obvious syntactic errors, like repeated words—and there are, indeed, quite a few options available to catch duplicates. Simple string matching will catch most of these, although false positives are possible. Programs can also easily check documents against a blacklist, so commonly misused patterns (such as "rather then") can be highlighted. Only slightly more complicated to catch are grammatical constructions like the passive voice; regular expressions can match the most common forms of verbs as they are used in the passive voice (such as "are used").
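The regular-expression approach described above can be sketched in a few lines of Emacs Lisp. The function name and the pattern below are illustrative only; the pattern matches a handful of "to be" forms followed by a word ending in "ed" or "en," and is far from exhaustive.

```elisp
;; A minimal sketch of regexp-based passive-voice detection.
;; The pattern is illustrative, not exhaustive: it will miss many
;; passive constructions and flag some false positives.
(defun my-find-passive-voice ()
  "Search forward for a simple passive-voice construction."
  (interactive)
  (re-search-forward
   "\\b\\(?:is\\|are\\|was\\|were\\|be\\|been\\|being\\)\\s-+\\w+\\(?:ed\\|en\\)\\b"
   nil t))
```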

But not everyone agrees on whether or not many such stylistic rules are genuinely grammatical rules. It is common for textbooks and schools to teach students to avoid slang, contractions, and the like (especially in "formal" writing), but those are conventions largely about what is appropriate, not what is correct. Detecting genuine grammar mistakes like subject/verb disagreement, misplaced commas, or dangling participles is apparently far more difficult.

Consequently, there are multiple options available to tackle the syntactic issues that can be dealt with through regular expressions and simple blacklists. But the use of a single blacklist for grammatical mistakes and for words that are undesirable for stylistic reasons (for example, words that are regarded as imprecise, like "some," or that add no information, like "very") muddles the picture. Fortunately, some of these tools are flexible enough that users can adapt them to issue warnings about their particular set of concerns.

Duplicate words

At the simple end of the offerings is the dupwords.el package written in the 1990s by Stephen Eglen. Naive double-word detection is almost trivial; some existing spell-checkers for Emacs already perform the function. Eglen's script improves matters by being able to detect repeats that are separated by a user-configurable threshold of other words. Setting the variable dw-forward-words changes this threshold; the default is one (which catches adjacent duplicates only). Setting it to a negative value will catch duplicates anywhere within the same sentence.

Eglen's script is sentence-oriented; it will not catch situations where the same word ends one sentence and starts the next (for that, there are other solutions to be found with a bit of searching, such as this function by Matthew Morley). The script must be called explicitly; M-x dw-check-to-end will check from the cursor point to the end of the active buffer.
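Putting the pieces above together, a user's init file might configure the threshold like so; the variable and command names are those described in the article.

```elisp
;; Configure dupwords.el's look-ahead window.
(setq dw-forward-words 5)     ; catch repeats up to five words apart
;; (setq dw-forward-words -1) ; catch repeats anywhere in the same sentence
;; Then run M-x dw-check-to-end to scan from point to the end of the buffer.
```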

Diction

A step up from dupwords.el is the diction.el package by Sven Utke, which depends on the operating system's GNU diction package. While perhaps not terribly well-known, diction is a classic UNIX text-processing utility. It can find duplicate words as well as match problematic words from the program's rules database. The default databases are stored in /usr/share/diction/, and currently cover English, German, Dutch, and C. Each entry can include a recommended substitute or a brief explanation of why the word in question is frowned upon.

The English database focuses on unnecessarily verbose language, such as recommending that "along the lines of" be replaced with "like," and on pointing out the distinctions between often-confused pairs of words (such as proceed and precede). Many of the recommendations are drawn from Strunk and White's The Elements of Style, which is a classic manual on writing style. But the book has its share of critics, who contend that it contains lots of "rules" that are little more than opinion on whether certain words and phrases are "inelegant" or overused.

For Emacs users, GNU diction is likely to highlight an excessive number of words, many of which are hits on Strunk and White's stylistic recommendations—at least, that is the case when using the built-in diction database. But it is possible to create a custom database that is more useful for a particular user or writing project. The diction.el script contains some logic to automatically deduce the correct database to use based on the ispell dictionary in use in the active buffer; to point the script to a different database, this value needs to be overwritten using the command:

M-x set-variable RET diction-ruleset RET "databasename"

Like the previous tool, diction.el must be invoked by the user. Calling M-x diction-buffer will scan the current Emacs buffer. The diction-ruleset variable is per-buffer, so users who wish to use different custom databases for different files will either need to set the variable separately or add the command to the relevant mode hooks for each file type.
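Because the variable is per-buffer, the mode-hook approach mentioned above might look like the following; the database path is illustrative.

```elisp
;; Select a custom diction database for Markdown buffers.
;; diction-ruleset is buffer-local, so the setting applies only
;; to buffers created in this mode.
(add-hook 'markdown-mode-hook
          (lambda ()
            (setq diction-ruleset "~/.diction/my-rules")))
```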

Write good

Benjamin Beckwith's writegood-mode uses a similar approach to diction.el, but it relies on a custom blacklist that covers slightly different ground. It matches three classes of error: duplicate words, passive voice constructions, and "weasel words," a term more-or-less synonymous with "stylistic problems" as listed in the GNU diction database.

The writegood-mode blacklist, however, is adapted from a set of shell scripts by Matt Might at the University of Utah. Might's list was assembled from years of reading student papers; it breaks "weasel words" into three categories:

Salt and pepper words that "look and feel like technical words, but convey nothing." Examples include "various," "fairly," and "a number of."

Beholder words that tell the reader how to react, such as "interestingly," "clearly," or "surprisingly."

Adverbs, which Might says should be removed from all "technical" writing.

Writegood-mode's list of weasel words is editable; one only needs to add a string to the write-good-weasel-words list. But, notably, the list consists of string literals, not regular expressions; if one decides to supplement it in bulk or to add a lot of variations, it could grow unwieldy.
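Supplementing the list from one's init file is a one-liner per entry; the variable name is the one described above, and the example words are, of course, a matter of taste.

```elisp
;; Add custom entries to writegood-mode's weasel-word blacklist.
;; Each entry is a literal string, not a regular expression.
(add-to-list 'write-good-weasel-words "paradigm")
(add-to-list 'write-good-weasel-words "synergy")
```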

On the plus side, writegood-mode is an Emacs minor mode, which is a class of feature commonly used to perform on-the-fly syntax highlighting and indentation. Thus, when activated, writegood-mode highlights all of the matching words in the current buffer as one continues to work on the document. That is more convenient than periodically stopping to re-run a command, and users can selectively enable the mode based on the type of document (in addition to enabling it manually). In addition, using syntax highlighting makes it simple for the user to ignore false positives, whereas using a function that steps through each flagged word sequentially can quickly become an interminable chore.
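Enabling the minor mode selectively by document type is the usual Emacs hook idiom; a sketch:

```elisp
;; Turn on writegood-mode automatically for prose-oriented buffers.
(add-hook 'text-mode-hook #'writegood-mode)
(add-hook 'markdown-mode-hook #'writegood-mode)
```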

Art Bollocks

Another minor-mode option worth considering is artbollocks-mode, which was originally written by Rob Myers and was later revived by Sacha Chua. The name, incidentally, is a reference to a famous article criticizing postmodern art, which contended that postmodernism is more of a linguistic argument about art than it is an approach to creativity itself.

In a sense, the original Art Bollocks was an attack on weasel words, and that is what artbollocks-mode focuses on as well. It includes checks to highlight passive-voice constructions, "jargon" words, duplicated words, and a set of weasel words that covers the same general categories described by Might. In addition, each of these checks can be enabled or disabled individually, and there are commands available to compute some statistics about the active buffer (such as its Flesch-Kincaid readability score).

Writegood-mode is newer, but artbollocks-mode includes a larger list of weasel and jargon words—although, it should be pointed out, some of those words originate from art criticism and may not be useful in other disciplines. The distinction between weasel words and jargon could be useful for anyone hoping to tailor artbollocks-mode to their own writing; the different categories are highlighted in different colors.

As far as modifications go, artbollocks-mode is not as simple to update as writegood-mode. Rather than a list of literal strings to search for, artbollocks-mode uses a single regular expression for each of its checks, and those regular expressions are optimized with Emacs's regexp-opt function. The optimization makes for faster string matching, but it also requires the user to regenerate the optimized regular expressions, out of band, in order to update the mode.
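Regenerating such a pattern by hand is straightforward, if tedious; regexp-opt compiles a list of strings into a single optimized alternation. The word list below is illustrative.

```elisp
;; Compile a word list into one optimized regular expression.
;; The 'words argument wraps the result in word-boundary markers,
;; yielding a single \< ... \> alternation matching any of the words.
(regexp-opt '("various" "fairly" "quite" "a number of") 'words)
```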

Style versus grammar

The utilities examined above focus on writing style, rather than fundamental English grammar. But, in a lot of the online debates, mailing-list threads, and Stack Overflow answers that I examined when looking for Emacs grammar-checking tools, users were interested in stylistic issues. After all, the underlying concern is clarity: whether the problem at hand is a vague adverb or a split infinitive, the user wants it fixed.

So the style-oriented tools clearly have their place, and many writers seem to find them useful. Nevertheless, many of the same writers probably have "real" grammar-checking in mind when they first go looking for such an Emacs utility. Next time, we'll take a look at the tools available for assessing grammatical correctness from Emacs. All of the tools involve linking to external processes or even remote servers, which raises its own set of hurdles for those intent on working with a purely free-software solution.

