Document Arc Diagrams

By: Jeff Clark
Date: Sat, 28 Apr 2007

I have written before about Martin Wattenberg's Arc Diagrams for visualizing structure within strings. They are an intriguing way of showing repetition at varying scales within a linear sequence. When applied to music they produce beautiful images that illustrate the structure of a piece. I noted that for most narrative text these diagrams likely wouldn't work very well because of the lack of regular repetition, but that it might be fruitful to explore some lower-dimensional derived feature of the text.

In my recent exploration of ways to visualize arbitrary text documents I tried out something visually inspired by Wattenberg's Arc Diagrams. Rather than using arcs to connect identical patterns within a document, I'm instead connecting segments that contain similar words. Here is the algorithm:

1. break the document up into a stream of words
2. throw away any 'stop words' (a, at, of, the ...)
3. divide the remaining stream of more interesting words into 50 equal segments based on linear position
4. calculate a similarity metric between each pair of segments based on the amount of overlapping words
5. draw a diagram where the document segments are connected by arcs with the transparency determined by the similarity between the segments. Use a threshold so that weakly connected arcs don't get drawn at all.
6. show the top two words for each arc drawn at both segment endpoints
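The steps above can be sketched roughly in Python. This is a minimal illustration, not the code behind the diagrams. The post doesn't name the similarity metric, the stop-word list, or the threshold value, so Jaccard overlap of distinct words, the tiny STOP_WORDS set, and the default threshold of 0.2 are all assumptions. "Top two words" is likewise read here as the two most frequent words shared by both segments. All function names are hypothetical, and the drawing step is left out.

```python
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stop-word list (an assumption; the post does not give its list).
STOP_WORDS = {"a", "an", "and", "at", "in", "is", "it", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def split_segments(words, n_segments=50):
    """Divide the word stream into n_segments nearly equal consecutive slices."""
    n = len(words)
    bounds = [i * n // n_segments for i in range(n_segments + 1)]
    return [words[bounds[i]:bounds[i + 1]] for i in range(n_segments)]


def jaccard(seg_a, seg_b):
    """Overlap of distinct words; Jaccard is an assumed stand-in for the metric."""
    set_a, set_b = set(seg_a), set(seg_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def top_shared_words(seg_a, seg_b, k=2):
    """The k most frequent words present in both segments (assumed reading)."""
    counts_a, counts_b = Counter(seg_a), Counter(seg_b)
    shared = {w: min(counts_a[w], counts_b[w])
              for w in counts_a.keys() & counts_b.keys()}
    # Sort by descending shared count, then alphabetically for stable output.
    return sorted(shared, key=lambda w: (-shared[w], w))[:k]


def document_arcs(text, n_segments=50, threshold=0.2):
    """Return (i, j, similarity, top_words) for each segment pair at or above threshold."""
    segments = split_segments(tokenize(text), n_segments)
    arcs = []
    for i, j in combinations(range(n_segments), 2):
        sim = jaccard(segments[i], segments[j])
        if sim >= threshold:  # weakly connected pairs are not drawn at all
            arcs.append((i, j, sim, top_shared_words(segments[i], segments[j])))
    return arcs
```

Each returned tuple describes one arc. A drawing layer could use the similarity value directly as the arc's opacity and place the two words at segment positions i and j.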

Update: The interactive application is available now for Document Arc Diagrams.

Here are a few sample diagrams:

Despite the arbitrary nature of the segmentation, the technique appears to reveal some aspect of the document structure in a visually interesting manner. In Alice in Wonderland, for example, it shows what appear to be four distinct scenes in the last half of the text. The third is highlighted in orange, and its high-frequency words are Alice, Mock, Turtle, and Gryphon. The third example is for the lyrics of a song and shows darker lines because the similarity between segments is stronger. There are also regular patterns that repeat multiple times, which isn't surprising for song lyrics. It would be interesting to use a line-based or syllable/phoneme-based segmentation for song lyrics rather than the simplistic approach taken here.

I will post an interactive application soon that will let anyone explore a fixed set of documents.