
Precision and clarity

My Ph.D. advisor, Olin Shivers, taught me that technical writing is a balancing act between precision, clarity and marketing.

After a recent round of paper submissions with my own Ph.D. students, I've identified mechanically recognizable ways that precision and clarity leak out of a paper: weasel words and abuse of the passive voice.

So, I've written shell scripts to detect these leaks.

(I don't think I'll ever be able to write a shell script that detects bad marketing for a scientific idea.)

Weasel words

Weasel words--phrases or words that sound good without conveying information--obscure precision.

I notice three kinds of weasel words in my students' writing: (1) salt and pepper words, (2) beholder words and (3) lazy words.

Salt and pepper words

New grad students sprinkle in salt and pepper words for seasoning. These words look and feel like technical words, but convey nothing.

My favorite salt and pepper words and phrases are "various," "a number of," "fairly," and "quite." Sentences that cut these words out become stronger.

Bad: It is quite difficult to find untainted samples. Better: It is difficult to find untainted samples.

Bad: We used various methods to isolate four samples. Better: We isolated four samples.

Beholder words

Beholder words are those whose meaning is a function of the reader; for example: interestingly, surprisingly, remarkably, or clearly.

Peer reviewers don't like judgments drawn for them.

Bad: False positives were surprisingly low. Better: To our surprise, false positives were low. Good: To our surprise, false positives were low (3%).

Lazy words

Students insert lazy words in order to avoid making a quantitative characterization. They give the impression that the author has not yet conducted said characterization.

These words make the science feel unfirm and unfinished.

The two worst offenders in this category are the words "very" and "extremely." These two adverbs are never excusable in technical writing. Never.

Other offenders include several, exceedingly, many, most, few, vast.

Bad: There is a very close match between the two semantics. Better: There is a close match between the two semantics.

Adverbs

In technical writing, adverbs tend to come off as weasel words.

I'd even go so far as to say that the removal of all adverbs from any technical writing would be a net positive for my newest graduate students. (That is, when new graduate students insert an adverb, they weaken the sentence more often than they strengthen it.)

Bad: We offer a completely different formulation of CFA. Better: We offer a different formulation of CFA.
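One rough way to act on this advice is to flag every word ending in "-ly" and review each hit by hand. The sketch below is a heuristic, not one of the original scripts: the helper name find_adverbs is hypothetical, and the suffix test will also flag non-adverbs such as "only," "apply," and "family."

```shell
#!/bin/sh
# Sketch: flag candidate adverbs by their "-ly" suffix.
# find_adverbs is a hypothetical helper, not one of the original scripts.
find_adverbs() {
    # -w: whole words only; -o: print just the match; -n: line numbers
    grep -E -i -n -o -w '[a-z]+ly' "$@"
}

# Try it on the "Bad" example sentence from above.
printf '%s\n' "We offer a completely different formulation of CFA." | find_adverbs
```

Given file names instead of standard input, as in find_adverbs *.tex, grep also prefixes each hit with the file it came from.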

A script to find weasel words

With this script, you can supply an alternate list of weasel words in a file if you don't like the default:

#!/bin/bash

weasels="many|various|very|fairly|several|extremely\
|exceedingly|quite|remarkably|few|surprisingly\
|mostly|largely|huge|tiny|((are|is) a number)\
|excellent|interestingly|significantly\
|substantially|clearly|vast|relatively|completely"

wordfile=""

# Check for an alternate weasel file; the last one found wins.
if [ -f "$HOME/etc/words/weasels" ]; then
    wordfile="$HOME/etc/words/weasels"
fi

if [ -f "$WORDSDIR/weasels" ]; then
    wordfile="$WORDSDIR/weasels"
fi

if [ -f words/weasels ]; then
    wordfile="words/weasels"
fi

# An alternate file replaces the default list rather than extending it.
if [ ! "$wordfile" = "" ]; then
    weasels="xyzabc123"
    for w in $(cat "$wordfile"); do
        weasels="$weasels|$w"
    done
fi

if [ "$1" = "" ]; then
    echo "usage: $(basename "$0") <file> ..."
    exit
fi

# grep -E replaces the deprecated egrep; "$@" keeps file names with spaces intact.
grep -E -i -n --color "\\b($weasels)\\b" "$@"

exit $?
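As a usage sketch, the alternate list is a plain file of words separated by whitespace, which the script joins into a single alternation. The helper name build_pattern and the three example words below are assumptions for illustration; only the joining loop mirrors the script.

```shell
#!/bin/sh
# Sketch: build the alternation the weasel script would search for,
# given a whitespace-separated word file (build_pattern is hypothetical).
build_pattern() {
    pattern="xyzabc123"      # same placeholder seed the script uses
    for w in $(cat "$1"); do
        pattern="$pattern|$w"
    done
    printf '%s\n' "$pattern"
}

# Example word file; these words are illustrative only.
tmp=$(mktemp)
printf 'obviously\nbasically\narguably\n' > "$tmp"
build_pattern "$tmp"
rm -f "$tmp"
```

Two consequences follow from the loop: an alternate file replaces the default list instead of extending it, and because the file is split on whitespace, a multi-word phrase such as "a number of" ends up as three separate entries.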

Passive voice

There are times when the passive voice is acceptable in technical writing.

I also believe, as with adverbs, that removal of the passive voice would have been a net improvement for over half the technical writing I've edited. (That is, students abuse the passive voice more often than they use it well.)

Of course, I do not advocate dogmatic removal of the passive voice.

The passive voice is tough to shake. Even while writing this article, I caught myself defaulting to the passive in situations where the active was better.

The passive voice is bad when it hides relevant or explanatory information:

Bad: Termination is guaranteed on any input. Better: Termination is guaranteed on any input by a finite state-space. OK: A finite state-space guarantees termination on any input.

In the first sentence, the passive hides relevant information.

The second sentence includes the relevant information, but the passive misplaces the emphasis.

The third sentence contains all the relevant information, and it feels crisp.

There's one case where I think the passive is preferable in technical writing--when the subject is truly irrelevant:

OK: 4 mL HCl were added to the solution.

Even in this example, I personally don't believe it's egregious to use "we":

OK (to me): We added 4 mL HCl to the solution.

In summary, for each use of the passive highlighted by my script, ask the following questions:

Is the agent relevant yet unclear? Does the text read better with the sentence in the active?

If the answer to both questions is "yes," then change to the active.

If only the answer to the first question is "yes," then specify the agent.

A script to find passive voice

#!/bin/bash

# Irregular past participles; regular ones are caught by the "-ed" pattern below.
irregulars="awoken|\
been|born|beat|\
become|begun|bent|\
beset|bet|bid|\
bidden|bound|bitten|\
bled|blown|broken|\
bred|brought|broadcast|\
built|burnt|burst|\
bought|cast|caught|\
chosen|clung|come|\
cost|crept|cut|\
dealt|dug|dived|\
done|drawn|dreamt|\
driven|drunk|eaten|fallen|\
fed|felt|fought|found|\
fit|fled|flung|flown|\
forbidden|forgotten|\
foregone|forgiven|\
forsaken|frozen|\
gotten|given|gone|\
ground|grown|hung|\
heard|hidden|hit|\
held|hurt|kept|knelt|\
knit|known|laid|led|\
leapt|learnt|left|\
lent|let|lain|lighted|\
lost|made|meant|met|\
misspelt|mistaken|mown|\
overcome|overdone|overtaken|\
overthrown|paid|pled|proven|\
put|quit|read|rid|ridden|\
rung|risen|run|sawn|said|\
seen|sought|sold|sent|\
set|sewn|shaken|shaven|\
shorn|shed|shone|shod|\
shot|shown|shrunk|shut|\
sung|sunk|sat|slept|\
slain|slid|slung|slit|\
smitten|sown|spoken|sped|\
spent|spilt|spun|spit|\
split|spread|sprung|stood|\
stolen|stuck|stung|stunk|\
stridden|struck|strung|\
striven|sworn|swept|\
swollen|swum|swung|taken|\
taught|torn|told|thought|\
thrived|thrown|thrust|\
trodden|understood|upheld|\
upset|woken|worn|woven|\
wed|wept|wound|won|\
withheld|withstood|wrung|\
written"

if [ "$1" = "" ]; then
    echo "usage: $(basename "$0") <file> ..."
    exit
fi

# A form of "to be", optional spaces, then a past participle.
# grep -E replaces the deprecated egrep; "$@" keeps file names with spaces intact.
grep -E -n -i --color \
"\\b(am|are|were|being|is|been|was|be)\
\\b[ ]*(\w+ed|($irregulars))\\b" "$@"

exit $?
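To see what the pattern flags, the sketch below applies the same shape of regular expression to the example sentences from this section. The abbreviated irregular list and the function name is_passive are assumptions for illustration, not part of the script.

```shell
#!/bin/sh
# Sketch: a form of "to be", optional spaces, then a past participle
# (a regular "-ed" form or one of a few irregulars).
irregulars="given|known|made|shown|taken|written"   # abbreviated list

# is_passive is a hypothetical helper: succeeds if the sentence matches.
is_passive() {
    printf '%s\n' "$1" |
        grep -E -i -q "\\b(am|are|were|being|is|been|was|be)\\b[ ]*(\\w+ed|($irregulars))\\b"
}

for s in \
    "Termination is guaranteed on any input." \
    "A finite state-space guarantees termination on any input." \
    "4 mL HCl were added to the solution."
do
    if is_passive "$s"; then
        echo "flagged:     $s"
    else
        echo "not flagged: $s"
    fi
done
```

The pattern is a heuristic. It only sees a participle that directly follows the verb, so "was quickly added" slips through, and it flags any "-ed" word after a form of "to be," so "is indeed" is a false positive. Treat each hit as a prompt to ask the two questions above, not as a verdict.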

A script to find lexical illusions

Read the following text:

Many readers are not aware that the
the brain will automatically ignore
a second instance of the word "the"
when it starts a new line.

Read that same text again, but with different line breaks:

Many readers are not aware that the the brain
will automatically ignore a second instance
of the word "the" when it starts a new line.

Duplicating words is a phenomenon of electronic composition.

Duplicate words seem to arise as cut-and-paste accidents, most frequently with the word "the."

Unfortunately, it can be difficult to proofread away duplicate words, because this lexical illusion prevents us from finding them.

No reviewer will shoot down a submission solely because it contains duplicate words, but when small mistakes like spelling errors and duplicate words pile up, they convey a lack of proofreading.

Reviewers will (rightfully) interpret inadequate proofreading as a lack of respect for their time and attention.

Fortunately, a short perl script hunts these bugs down:

#!/usr/bin/env perl

# Finds duplicate adjacent words.

use strict ;

my $DupCount = 0 ;

if (!@ARGV) {
  print "usage: dups <file> ...\n" ;
  exit ;
}

while (1) {
  my $FileName = shift @ARGV ;

  # Exit code = number of duplicates found.
  exit $DupCount if (!$FileName) ;

  open FILE, $FileName or die $!;

  my $LastWord = "" ;
  my $LineNum = 0 ;

  while (<FILE>) {
    chomp ;

    $LineNum ++ ;

    my @words = split (/(\W+)/) ;

    foreach my $word (@words) {
      # Skip spaces:
      next if $word =~ /^\s*$/ ;

      # Skip punctuation:
      if ($word =~ /^\W+$/) {
        $LastWord = "" ;
        next ;
      }

      # Found a dup?
      if (lc($word) eq lc($LastWord)) {
        print "$FileName:$LineNum $word\n" ;
        $DupCount ++ ;
      } # Thanks to Sean Cronin for tip on case.

      # Mark this as the last word:
      $LastWord = $word ;
    }
  }

  close FILE ;
}

Makefile integration

I keep a local copy of the scripts in the bin/ directory of each paper's repository. Then, I add a make proof rule to Makefile:

# Check style:
proof:
	echo "weasel words: "
	sh bin/weasel *.tex
	echo
	echo "passive voice: "
	sh bin/passive *.tex
	echo
	echo "duplicates: "
	perl bin/dups *.tex
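One caveat when wiring these checks into make: grep exits with status 1 when it finds no matches, the dups script exits with the number of duplicates it found, and make stops a recipe at the first command that exits non-zero. A later check may therefore never run. The sketch below is one possible workaround, assuming a wrapper script (here called bin/proof, a hypothetical name) that the rule invokes instead; the helper run_check is likewise an assumption.

```shell
#!/bin/sh
# Sketch: print a heading, run one check, and carry on whatever its
# exit status was (run_check is a hypothetical helper).
run_check() {
    label=$1
    shift
    echo "$label: "
    "$@" || true
    echo
}

# Intended use, from the root of a paper's repository:
#   run_check "weasel words"  sh bin/weasel *.tex
#   run_check "passive voice" sh bin/passive *.tex
#   run_check "duplicates"    perl bin/dups *.tex
```

The trade-off is that make proof then always succeeds, so the output has to be read rather than relied on as a pass/fail signal.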

A few words on marketing

A grad student's first impulse when she starts grad school is to assume that, as long as she tells the whole truth and nothing but the truth, everything she writes has to be accepted for publication.

But, there are a lot of true things.

Given the volume of submissions to top peer-reviewed venues, there will always be more than enough technically correct papers to fill the venue.

The function of peer review has become to decide which true things are worth knowing.

In that sense, peer reviewers are the guardians of the scientific community's most limited resource: our collective attention span.

To market a paper, the author must make a compelling case for why her idea deserves access to that resource.

Resources

Benjamin Beckwith has contributed a "writegood" mode for emacs inspired by these scripts.

[1] For example, my colleague John Regehr suggested simple scripts to catch students' use of superfluous phrases like "Note that" and "Notice that." Others have suggested scripts to catch use of the future tense in technical writing.