Quantified cantillation

When read publicly, the Torah is often sung using a system of cantillation marks, called trop in Yiddish. There are many different cantillation marks. Each has a name and a unique sound (or sounds), and each tends to appear in combination with particular other trop.

When the cycle of readings started over this year after Simchas Torah, it seemed like there were more telisha gedolas in Bereshit (Genesis) and more telisha ketanas in D’varim (Deuteronomy). I decided to find out whether this was really the case.

First, I needed a dataset. Tanach.us offers the entire Tanakh in XML form, including trop and vowels. I was only interested in the Torah, so I downloaded XML files for each of the five sfarim (books). I went through the XML and tabulated how many of each trop were present in each pasuk (sentence).
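A minimal sketch of that tabulation step, using only the Python standard library. It assumes the trop are stored as Unicode combining characters (the "HEBREW ACCENT" code points) inside the verse text, and it runs on a toy XML string. The tag names `c`, `v`, and `w` are assumptions for illustration and should be checked against the actual files. Note also that Unicode names the marks with its own spellings (for example "telisha qetana"), and a single mark may not map one-to-one onto a trop name.

```python
import unicodedata
import xml.etree.ElementTree as ET
from collections import Counter

ACCENT_PREFIX = "HEBREW ACCENT "


def count_trop(text):
    """Count cantillation marks in a string, keyed by lowercased Unicode name."""
    counts = Counter()
    for ch in text:
        name = unicodedata.name(ch, "")
        if name.startswith(ACCENT_PREFIX):
            counts[name[len(ACCENT_PREFIX):].lower()] += 1
    return counts


# Toy document. The element names (c = chapter, v = verse, w = word) are
# assumptions, not a verified description of the Tanach.us schema.
SAMPLE_XML = (
    "<book><c n='1'>"
    "<v n='1'><w>\u05d0\u05a0</w><w>\u05d1\u0591</w></v>"
    "<v n='2'><w>\u05d2\u05a9</w><w>\u05d3\u05a9</w></v>"
    "</c></book>"
)


def tabulate(xml_text):
    """Return a list of ((chapter, verse), Counter) pairs, one per pasuk."""
    root = ET.fromstring(xml_text)
    rows = []
    for chapter in root.iter("c"):
        for verse in chapter.iter("v"):
            text = "".join(verse.itertext())
            rows.append(((chapter.get("n"), verse.get("n")), count_trop(text)))
    return rows


rows = tabulate(SAMPLE_XML)
for ref, counts in rows:
    print(ref, dict(counts))
```

Keying on the Unicode character name avoids hard-coding a table of code points, at the cost of using Unicode's spellings rather than the ones in this post.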

Aggregating by sefer to answer my original question about the relative frequencies of telisha gedolas and telisha ketanas, we see that my intuition was somewhat correct: while ketanas outnumber gedolas throughout, D’varim has more ketanas overall.

However, the ratio of telisha gedola to telisha ketana is actually not substantially different in D’varim and Bereshit. So while overall counts are higher, the relative frequencies are not so different.
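To make the difference between counts and ratios concrete, here is a small sketch that sums per-pasuk counts by sefer and then divides gedolas by ketanas. The numbers are invented purely for illustration and are not the real counts. The function names are hypothetical.

```python
from collections import Counter, defaultdict

# Toy per-pasuk rows: (sefer, counts). These values are made up for the example.
toy_rows = [
    ("Bereshit", {"telisha gedola": 1, "telisha ketana": 0}),
    ("Bereshit", {"telisha gedola": 0, "telisha ketana": 2}),
    ("D'varim", {"telisha gedola": 1, "telisha ketana": 1}),
    ("D'varim", {"telisha gedola": 1, "telisha ketana": 3}),
]


def totals_by_sefer(rows):
    """Sum the per-pasuk counts into one Counter per sefer."""
    totals = defaultdict(Counter)
    for sefer, counts in rows:
        totals[sefer].update(counts)  # Counter.update adds counts together
    return totals


def gedola_ketana_ratio(totals):
    """Ratio of gedolas to ketanas per sefer, skipping sfarim with no ketanas."""
    return {
        sefer: c["telisha gedola"] / c["telisha ketana"]
        for sefer, c in totals.items()
        if c["telisha ketana"]
    }


totals = totals_by_sefer(toy_rows)
ratios = gedola_ketana_ratio(totals)
print(ratios)
```

In this toy data the second sefer has twice as many of each mark, so its counts are higher while its ratio is identical.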

Aggregating by sefer is interesting, but I wanted to see more continuous variations. Looking at a series of what for most trop would be zeros and ones, with an occasional two or three, isn’t that useful, but Zach (a Ph.D. student in Statistics) suggested a moving average, and that worked quite nicely. We used a 500-pasuk-wide window, which struck a balance between detail and low-pass filtering. (I come from a signal processing background, not time-series analysis.)
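A moving average over per-pasuk counts can be sketched in a few lines. This version uses a trailing window and keeps a running sum. Whether the window should be trailing or centered, and how to treat the edges, are assumptions here, and the toy series and small window are only for illustration.

```python
def moving_average(counts, window=500):
    """Trailing moving average; returns len(counts) - window + 1 values."""
    if window <= 0 or window > len(counts):
        raise ValueError("window must be between 1 and len(counts)")
    total = sum(counts[:window])
    out = [total / window]
    for i in range(window, len(counts)):
        # Slide the window: add the newest value, drop the oldest one.
        total += counts[i] - counts[i - window]
        out.append(total / window)
    return out


# Toy series of per-pasuk counts for one trop (mostly zeros and ones).
series = [0, 1, 0, 0, 2, 0, 1, 0]
smoothed = moving_average(series, window=4)
print(smoothed)
```

A wider window smooths more but hides short-lived changes, which is the trade-off between detail and low-pass filtering mentioned above.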

As with the initial bar graph, you can really see the number of telisha ketanas explode in D’varim. But more interestingly, we can get a sense of how they track each other through the Torah.

Seeing how different trop track each other is fun. There are some things that you’d expect. For example, munakh is often associated with katan, revi’i, and mapakh–pashta, and we see that clearly here.

Particularly striking is the tight correlation between zarka and segol.

Other combinations, though, like darga–tevir, are more loosely correlated.

(For more correlations, here are the pasuk by pasuk and moving window correlation tables.)
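For reference, a correlation between two trop series can be computed as a plain Pearson coefficient. This is a sketch under the assumption that Pearson correlation is the measure of interest. The toy series are invented. Keep in mind that smoothed series can look more strongly correlated than the raw pasuk-by-pasuk counts.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Toy per-pasuk counts, invented for the example.
zarka = [0, 1, 0, 2, 1, 0]
segol = [0, 1, 0, 2, 1, 0]
tevir = [1, 0, 1, 0, 0, 1]

print(pearson(zarka, segol))
print(pearson(zarka, tevir))
```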

While these patterns are intuitive, the fact that trop — especially common ones like merkha and tipkha — aren’t uniformly distributed across the Torah was, to me, somewhat less expected. A big reason for this is changes in sentence structure. This becomes extremely obvious when looking at etnakhta, which essentially functions as a comma.

The rather dramatic plunge toward the beginning of B’midbar seems to come from a shift in sentence structure. Checking the text, this part of the Torah contains quite a bit of genealogy, made up of many single-phrase sentences (“So-and-so begat so-and-so”), along with many occurrences of the common pasuk “וידבר ה` אל־משה לאמר”.

I did a bit of digging into this, and oddly, it looks like the drop in words per pasuk actually lags the drop in etnakhtas. I’m not sure why.
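One way to put a number on a lag like this is to shift one series against the other and see which shift gives the highest correlation. The sketch below is an assumption about how such a check could be done, not a record of how the digging above was carried out. The function names and toy series are hypothetical.

```python
import math


def pearson_or_zero(xs, ys):
    """Pearson correlation, returning 0.0 when either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_lag(xs, ys, max_lag):
    """Return (lag, r) maximizing correlation; lag > 0 means ys trails xs."""
    if len(xs) != len(ys) or max_lag >= len(xs) - 1:
        raise ValueError("series must match and be longer than max_lag + 1")
    n = len(xs)
    best = None
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = xs[: n - lag], ys[lag:]  # compare xs[t] with ys[t + lag]
        else:
            a, b = xs[-lag:], ys[: n + lag]  # compare xs[t - lag] with ys[t]
        r = pearson_or_zero(a, b)
        if best is None or r > best[1]:
            best = (lag, r)
    return best


# Toy series: the second is the first delayed by two positions.
leader = [0, 0, 1, 3, 1, 0, 0, 0]
follower = [0, 0, 0, 0, 1, 3, 1, 0]
lag, r = best_lag(leader, follower, max_lag=3)
print(lag, r)
```

On real data the peak is usually much less sharp than in this toy case, so a single best lag should be read with some caution.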

I could imagine running a logistic regression to see whether words per pasuk predicts the presence of an etnakhta, but I’m going to cut myself off now.

If you’re interested in playing around with this yourself, everything is on GitHub. If you just want to cut to the chase, here’s a CSV file of the raw data. And here’s an IPython Notebook.