$\begingroup$

Over on the TeX stackexchange, we have been discussing how to detect "rivers" in paragraphs in this question.

In this context, rivers are bands of white space that result from accidental alignment of interword spaces in the text. Since this can be quite distracting to a reader, bad rivers are considered a symptom of poor typography. An example of text with rivers is this one, where there are two rivers flowing diagonally.

There is interest in detecting these rivers automatically, so that they can be avoided (probably by manual editing of the text). Raphink is making some progress at the TeX level (which only knows of glyph positions and bounding boxes), but I feel confident that the best way to detect rivers is with some image processing (since glyph shapes are very important and not available to TeX). I have tried various ways to extract the rivers from the above image, but my simple idea of applying a small amount of elliptical blurring doesn't seem to be good enough. I also tried some filtering based on the Radon and Hough transforms, but I didn't get anywhere with that either. The rivers are very visible to the feature-detection circuits of the human eye/retina/brain, and I would think this could somehow be translated into some kind of filtering operation, but I am not able to make it work. Any ideas?

To be specific, I'm looking for some operation that will detect the two rivers in the above image without producing too many false positives.
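To make the blurring idea mentioned above concrete, here is a minimal sketch in Python (NumPy only). It is an illustration under assumptions, not a tested detector: the function names and the `sigma_y`, `sigma_x` and `threshold` values are hypothetical, and the input is assumed to be a 2-D array holding 1.0 for ink and 0.0 for white paper. The blur is elongated vertically, so that white space persisting over several lines stays light while ordinary interword gaps are filled in by the ink above and below.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel, truncated at about 3 sigma and normalised to sum to 1."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def anisotropic_blur(img, sigma_y, sigma_x):
    """Separable Gaussian blur with different widths down columns and along rows.

    Edges are implicitly zero-padded, so the image border looks like white paper.
    The image must be at least as large as the kernels in each direction.
    """
    ky, kx = gaussian_kernel(sigma_y), gaussian_kernel(sigma_x)
    out = np.apply_along_axis(lambda c: np.convolve(c, ky, mode="same"), 0, img)
    out = np.apply_along_axis(lambda r: np.convolve(r, kx, mode="same"), 1, out)
    return out

def river_mask(ink, sigma_y=6.0, sigma_x=1.0, threshold=0.1):
    """Flag pixels whose blurred ink density is below `threshold`.

    ink: 2-D array, 1.0 where there is ink, 0.0 where the paper is white.
    The default parameter values are guesses and would need tuning to the
    resolution, font size and line spacing of a real scan.
    """
    blurred = anisotropic_blur(ink.astype(float), sigma_y, sigma_x)
    return blurred < threshold

# Synthetic check: a solid block of "text" with one vertical white channel.
ink = np.ones((60, 60))
ink[:, 28:32] = 0.0
mask = river_mask(ink)
print(mask[30, 29], mask[30, 10])
```

Because the kernel here is aligned with the vertical axis, it only responds to roughly vertical channels. Diagonal rivers like the two in the example would presumably need the same filter applied at several orientations, and the zero-padded border will also be flagged as white, so margins have to be masked out separately.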

EDIT: endolith asked why I am pursuing an image-processing-based approach given that in TeX we have access to the glyph positions, spacings, etc., and it might be much faster and more reliable to use an algorithm that examines the actual text. My reason for doing things the other way is that the shape of the glyphs can affect how noticeable a river is, and at the text level it is very difficult to take this shape into account (it depends on the font, on ligaturing, etc.). For an example of how the shape of the glyphs can be important, consider the following two examples, where the only difference between them is that I have replaced a few glyphs with others of almost the same width, so that a text-based analysis would consider them equally good/bad. Note, however, that the rivers in the first example are much worse than in the second.