
A couple of days ago, I mentioned ("Sarah Koenig", 2/5/2015) that David Talkin was releasing a new pitch tracking program called REAPER (available from github at the link). After a few minor improvements in documentation, it's ready for the general public.

The reaper program uses the EpochTracker class to simultaneously estimate the location of voiced-speech "epochs" or glottal closure instants (GCI), voicing state (voiced or unvoiced) and fundamental frequency (F0 or "pitch"). We define the local (instantaneous) F0 as the inverse of the time between successive GCI.
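The definition in that last sentence is easy to make concrete. A minimal sketch in Python, using made-up GCI times rather than actual REAPER output:

```python
def f0_from_gcis(gci_times):
    """Instantaneous f0 as defined above: the inverse of the time
    between successive glottal closure instants.

    gci_times: sorted GCI times in seconds.
    Returns one f0 estimate per inter-GCI interval.
    """
    return [1.0 / (t2 - t1) for t1, t2 in zip(gci_times, gci_times[1:])]

# Hypothetical GCIs spaced 10 ms apart correspond to a 100 Hz voice:
f0_estimates = f0_from_gcis([0.000, 0.010, 0.020, 0.030])
```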

After trying it out, I can recommend it whole-heartedly — it's robust and accurate and fast. It's my new standard pitch tracker.

It's easy to download and build, at least on OS X and Linux systems. (I haven't tried it on Windows under Cygwin, because my Windows laptop is out on loan.) Its output is in the form of Edinburgh Speech Tools files, but the ASCII version of those files is easy to assimilate into other programs.
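The ASCII track format is line-oriented: a short header terminated by a marker line, then one frame per line. A hedged sketch of a reader — the field layout (time, voicing flag, f0 per row, header ending in `EST_Header_End`) is my reading of REAPER's ASCII f0 output, and the sample data below is synthetic, not real program output:

```python
def read_est_ascii_f0(lines):
    """Parse an ASCII Edinburgh Speech Tools track (assumed layout:
    header ending in 'EST_Header_End', then 'time voiced f0' rows).
    Returns a list of (time, voiced, f0) tuples."""
    frames, in_data = [], False
    for line in lines:
        line = line.strip()
        if not in_data:
            if line == "EST_Header_End":
                in_data = True
            continue
        t, v, f0 = line.split()
        frames.append((float(t), int(v), float(f0)))
    return frames

# A tiny synthetic file for illustration (not real REAPER output):
sample = """EST_File Track
DataType ascii
NumFrames 3
EST_Header_End
0.000 0 -1.000
0.005 1 112.500
0.010 1 110.200""".splitlines()

frames = read_est_ascii_f0(sample)
voiced_f0 = [f for _, v, f in frames if v == 1]  # keep voiced frames only
```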

Here are the "Glottal Closure Instants" that it finds for a challenging stretch of a recent This American Life episode (Prologue to "If You Don't Have Anything Nice to Say, SAY IT IN ALL CAPS", 1/23/2015), where Ira Glass gets down to 27 Hz or so:

[audio clip]

In order for that passage to be tracked accurately, I had to change the "minimum f0" to 20 Hz from the default 40 Hz — for speakers whose voices are less heroically creaky, the default settings work well.

As a quick demonstration, I tracked Ira Glass's voice through the whole of that prologue passage (until the music kicks in, about 51 seconds):

[audio clip]

… and compared it to Kai Ryssdal's voice in two recent Marketplace segments:

"Coming soon: New York's first men's fashion week", 2/5/2015 (45 seconds, 10.1 seconds to analyze on my rather antique laptop):

[audio clip]

"Goldman Sachs' reputation sinks even lower", 2/6/2015, (36 seconds, 6.5 seconds to analyze):

[audio clip]

The distribution of REAPER's f0 estimates confirms that Ira Glass's voice is lower overall than Kai Ryssdal's (median 95.2 versus 146.8 Hz, an interval of more than a musical fifth), and that he is much more often in the "creaky" perceptual range below 70 Hz (17% of f0 estimates vs. 4%):

Here's the same distribution on a semitone scale, which is probably more perceptually appropriate:

If we apply the same metric to the samples of Sarah Koenig's radio voice mentioned in the earlier post, we find that the samples from 2000 (TAL #151 and #162) have 5% of their f0 estimates below 70 Hz, while the sample from 2014 (TAL #537) has 16%.
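The numbers behind these comparisons — median f0, fraction of estimates below 70 Hz, and the semitone conversion for the perceptual scale — are all simple to compute once the voiced-frame f0 estimates are in hand. A sketch with fabricated f0 values, not the actual track data:

```python
import math
from statistics import median

def creak_fraction(f0_values, threshold_hz=70.0):
    """Fraction of f0 estimates below the crude 'creaky' cutoff of 70 Hz."""
    return sum(1 for f in f0_values if f < threshold_hz) / len(f0_values)

def interval_semitones(f_hi, f_lo):
    """Musical interval between two frequencies, in semitones:
    a frequency ratio of 2**(n/12) corresponds to n semitones."""
    return 12.0 * math.log2(f_hi / f_lo)

# Fabricated f0 estimates (Hz), for illustration only:
f0 = [60, 65, 90, 95, 100, 110, 120, 130, 140, 150]
med = median(f0)          # 105
frac = creak_fraction(f0) # 0.2 — two of ten estimates fall below 70 Hz

# The two medians reported above differ by about 7.5 semitones,
# a bit more than a musical fifth (7 semitones):
glass_vs_ryssdal = interval_semitones(146.8, 95.2)
```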

The usual "creakiness" definitions are more complex, and involve a combination of human auditory perception and human evaluation of time-domain or frequency-domain evidence for period-doubling or irregular glottal oscillation. Thus Kristine Yu and Hiu Wai Lam, "The role of creaky voice in Cantonese tonal perception", Journal of the Acoustical Society of America 2014:

A token was defined to be creaky if it had the auditory percept of creaky voice, as determined by the authors and if: (1) there were alternating cycles of amplitude and/or frequency or irregular glottal pulses in the waveform or wide-band spectrogram, (2) missing values or discontinuities in the f0 track determined by Praat's autocorrelation algorithm with default settings, or (3) the appearance of strong subharmonics or lack of harmonic structure in the narrow-band spectrogram.

This (entirely appropriate) definition combines aspects of period-doubling and erratic phonation, with the presence of sounds that are simply low enough in pitch for listeners to start to hear individual glottal cycles. A problem with such definitions, however, is that they involve a lot of human perceptual testing and human annotation. And this means that a meaningful attempt to evaluate the claims of a "vocal fry epidemic" among young women in America — that is, to investigate the distribution of "vocal fry" (by which people mostly mean "creak") across age, gender, and time — would be a daunting amount of work, because it requires analyzing natural speech samples from hundreds if not thousands of speakers.

We might (and should) try to automate such human annotations — but there may be a much simpler way.

As I noted yesterday in "Vocal creak and fry, exemplified", any sequence of buzz-like oscillations will sound "creaky" when its frequency gets low enough, even if the oscillations are perfectly periodic. The laryngeal and pulmonary gestures that produce these low fundamental frequencies in human speech generally also tend to produce period-doubling and chaotic oscillation, but the low fundamental frequency alone is enough to create the perception of creakiness.

So I hypothesize that given an accurate-enough pitch tracker, a simple metric based on the distribution of estimated f0 values will correlate quite well with human perceptions of the voice-quality characteristics commonly called "vocal fry". And it looks to me — based on these two admittedly limited tests — as if REAPER is accurate enough to support this research.

We need a better metric on f0 distributions than just my crude "percent below 70 Hz" attempt. And we should explore various automated measurements of jitter and/or period-doubling. But I like the idea that a simple quantification of f0 distributions might work well enough that we can finally test (aspects of) the widespread perception that young women are doing something different with their voices that includes increased amounts of vocal "creak" or "fry".
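One such automated measurement might be "local jitter", which compares successive glottal periods (in this setting, the inter-GCI intervals). This is one common definition among several variants; the period values below are fabricated for illustration:

```python
def local_jitter(periods):
    """Mean absolute difference between consecutive glottal periods,
    divided by the mean period. Perfectly regular phonation gives 0;
    alternating long-short periods give a large value."""
    diffs = [abs(p2 - p1) for p1, p2 in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

# Regular 10 ms periods (100 Hz) vs. alternating long-short periods,
# the time-domain signature of period-doubling:
regular = [0.010] * 6
alternating = [0.008, 0.012] * 3
```

A rising jitter value over a stretch of speech would flag the erratic-phonation side of "creak", complementing the purely distributional f0 metric.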
