Forced-alignment and segmentation of airflow data


March 23, 2009

An airflow system like the SQLab EVA2 used in our lab creates separate WAV-like files for audio, oral airflow, and nasal airflow. Usually we use a program like WaveSurfer or Praat to view these files and extract our measurements.

But what if you want to extract measurements for all of the segments in a reasonably sized corpus of continuous speech? This can, of course, be done by hand, but the process is tedious, labor-intensive, and inescapably subjective. This tutorial explains how to use Praat and HTK to segment audio recordings and then use that segmentation to extract phone-level airflow data for an entire corpus. The segmentation is unlikely to be as good as what a trained linguist might achieve, but it’s definitely faster and makes a very good starting point. The corpus I’m working with is a set of recordings of the TIMIT prompts that I made for my doctoral candidacy qualifying project, Modeling Airflow for Concatenative Speech Synthesis.


  1. Convert the EVA2 files to WAV format.

SQLab distributes a piece of Windows-only, closed-source software called RIFF Edit for converting from their proprietary format to WAV format. It works great on my Mac using CrossOver, which, I believe, is just a very nice version of Wine.

  2. Break the audio recordings up into separate files (one prompt/sentence per file).

Of course there are many ways to do this. Usually one would record each prompt separately, so that the prompts naturally end up in separate audio files. The clumsy EVA2 system makes this arrangement difficult, though, so I used a TextGrid to mark each prompt on an interval tier. This can be done automatically using Praat’s “Annotate -> To TextGrid (Silences)” command (some clean-up is usually necessary, but it saves a lot of work). Once the prompts are marked in the TextGrid, I run this Praat script that I did not write.

(As a quick aside, Kyle Gorman has written a terrific little Python class for manipulating TextGrids; it does exactly what you’d want in just the way you’d probably expect.)
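For a quick look at which intervals ended up labeled before extracting anything, the TextGrid can be skimmed from the command line. This is only a minimal sketch, not part of my original workflow: it assumes the TextGrid was saved in Praat's long text format (the one with `xmin = `, `xmax = `, and `text = ` lines), and the function name `textgrid_intervals` is made up for illustration.

```shell
# Hypothetical helper: print "start end label" for every non-empty interval
# in a long-format Praat TextGrid. Assumes one "key = value" pair per line.
textgrid_intervals() {
    awk '
        /^[[:space:]]*xmin = / { xmin = $3 }
        /^[[:space:]]*xmax = / { xmax = $3 }
        /^[[:space:]]*text = / {
            label = $0
            sub(/^[^"]*"/, "", label)          # strip everything up to the opening quote
            sub(/"[[:space:]]*$/, "", label)   # strip the closing quote
            if (label != "") print xmin, xmax, label
        }
    ' "$1"
}

# Usage (the file name is a placeholder):
#   textgrid_intervals prompts.TextGrid
```

Empty intervals are skipped, so the number of lines printed should match the number of prompts you expect to extract.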

  3. Now use the same TextGrid and script to extract prompts from the OAF (oral airflow) and NAF (nasal airflow) files.

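This is the great strength of using something like Praat in the first place: because the three channels are recorded together, a single TextGrid marks the same prompt boundaries in all of them. Be sure to check a random sample of your alignments to make sure everything worked, though I’ve never had a problem.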
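Before spot-checking by ear, it can help to confirm that every extracted audio file has both airflow counterparts. The sketch below assumes the extracted files follow the naming used by the shell script later in this post (audio files matching `a*.wav`, with `oaf_` and `naf_` prefixed to the same name for the airflow channels); the function name `missing_channels` is hypothetical.

```shell
# Hypothetical helper: list airflow files that are missing for any audio file.
# Assumes audio files match a*.wav and airflow files are oaf_<name>, naf_<name>.
missing_channels() {
    local dir=${1:-.}
    local f base prefix
    for f in "$dir"/a*.wav; do
        [ -e "$f" ] || continue          # the glob matched nothing
        base=$(basename "$f")
        for prefix in oaf_ naf_; do
            [ -e "$dir/$prefix$base" ] || echo "$prefix$base"
        done
    done
}

# Usage: prints nothing when every audio file has both airflow files.
#   missing_channels .
```

This only checks that the files exist; it says nothing about whether the boundaries are right, so listening to a random sample is still worthwhile.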

  4. Label all three files properly.

My utterances are in a single file (identical to the file needed by Festival for voice creation) with the format:

(audio_0003 "This was easy for us.")
(audio_0004 "Jane may earn more money by working hard.")
(audio_0005 "She is thinner than I am.")
(audio_0006 "Bright sunshine shimmers on the ocean.")
(audio_0007 "Nothing is as offensive as innocence.")
(audio_0008 "Why yell or worry over silly items?")
(audio_0009 "Where were you while we were away?")
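As a rough illustration of how a line in this format splits into an utterance ID and a prompt, here is a small sketch using `sed`. The function names `utt_id` and `utt_prompt` are invented for this example, and it assumes each utterance sits on a single line with the prompt wrapped in one pair of double quotes, as in the listing above.

```shell
# Hypothetical helpers for one line of the utterance file.

# Print the utterance ID: the first word after the opening parenthesis.
utt_id() {
    printf '%s\n' "$1" | sed -e 's/^(\([^ ]*\) .*$/\1/'
}

# Print the prompt: everything between the first and the last double quote.
utt_prompt() {
    printf '%s\n' "$1" | sed -e 's/^[^"]*"\(.*\)"[^"]*$/\1/'
}

line='(audio_0003 "This was easy for us.")'
utt_id "$line"       # prints: audio_0003
utt_prompt "$line"   # prints: This was easy for us.
```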


Then I run the following shell script to double-check the audio and create the prompt files.


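# Path to the utterance file described above.
export UTTS=../

echo "which file are we starting with [ 1 = first ]?"
read START

# Default to the first prompt if nothing was entered.
if [ "$START" == "" ]; then
    START=1
fi

for i in `ls a*.wav | sort -n -k1.2`; do
    num=`printf "%04d" $START`

    # Look up this prompt's line and pull out the quoted text.
    line=`grep "audio_$num" $UTTS`
    prompt=`echo $line | awk -F\" '{print $2}'`
    echo $line
    afplay $i
    echo "make $i $num [y] ?"

    read ANSWER

    if [ "$ANSWER" == "n" ]; then
        # n: leave this recording alone.
        echo doing nothing
    else
        if [ "$ANSWER" == "p" ]; then
            # p: this is a retake, so step back and replace the previous prompt.
            let "START = START - 1"
            num=`printf "%04d" $START`
            line=`grep "audio_$num" $UTTS`
            prompt=`echo $line | awk -F\" '{print $2}'`

            echo overwriting $i to $num ...
        else
            echo saving $i to $num ...
        fi

        # Copy all three channels under the prompt number and write the prompt text.
        cp $i audio_${num}.wav
        cp oaf_$i oaf_${num}.wav
        cp naf_$i naf_${num}.wav
        echo $prompt > ${num}.txt
        rm $i

        let "START = START + 1"
    fi
done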