Make all columns categorical variables, except the Start and End times (integers).
4
Convert values in Speaker column from names to initials.
Trim the audio for the duration of each caption (with 250 ms of padding before and after). This results in ~240 audio files, each 2-10 seconds long and each containing a target word.
import os
from pydub import AudioSegment

def trim_audio(df):
    """Use caption timestamps to trim audio."""
    for i in df.index:
        episode = df.loc[i, 'Episode']
        word = df.loc[i, 'Word']
        speaker = df.loc[i, 'Speaker']
        count = df.loc[i, 'Number']
        out_file = os.path.join('audio', 'words', f'episode_{word}_{speaker}_{count}.wav')
        if not os.path.isfile(out_file):
            in_file = os.path.join('audio', 'episodes', f'{episode}.mp4')
            audio = AudioSegment.from_file(in_file, format='mp4')
            start = max(df.loc[i, 'Start'] - 250, 0)
            end = min(len(audio), df.loc[i, 'End'] + 250)
            clip = audio[start:end]
            clip.export(out_f=out_file, format='wav')

trim_audio(timestamps)
1
Go through dataframe that has the example words to annotate and their timestamps.
2
Make file name for current word, and if it does not already exist…
3
Open the audio file for the whole episode.
4
Trim the episode audio to start 250 ms before the caption’s start timestamp and end 250 ms after its end timestamp; save it.
2.3 Wells Lexical Set
The Wells lexical set is a set of examples for each vowel/diphthong, chosen to be maximally distinguishable and consistent. You can read more about it on Wikipedia and John Wells’ blog. These recordings are going to be more controlled than the vowels pulled from the episode recordings and easier to annotate because they’re spoken more slowly and carefully. This set contains some fairly low-frequency words, which is why there’s not a lot of overlap with the words pulled from the episodes.
1
Combine word lists from episodes and Wells lexical set (see the sketch after this list).
2
Dataframe for Wells lexical set with columns for List, Vowel, and Word.
3
Subset dataframe for episode word list, with columns for Vowel and Word.
4
Fill the NA values of List with episode.
5
Sort and reset index.
6
Print, not including index.
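A minimal sketch of this chunk, assuming the episode words and their vowels live in the timestamps dataframe (the lexical set values are the ones in the printed table):

import pandas as pd

# Wells lexical set words for each vowel (values from the printed table)
lexicalset = pd.DataFrame({
    'List': 'lexicalset',
    'Vowel': ['i', 'o', 'o', 'u', 'æ', 'æ', 'ɑ', 'ɑ', 'ɑ', 'ɔ', 'ɔ', 'ɔ',
              'ɛ', 'ɛ', 'ɪ', 'ɪ', 'ʊ', 'ʊ', 'ʌ'],
    'Word': ['fleece', 'force', 'goat', 'goose', 'bath', 'trap', 'lot', 'palm',
             'start', 'cloth', 'north', 'thought', 'dress', 'square', 'kit',
             'near', 'cure', 'foot', 'strut'],
})

# Episode words (assumed to be in the timestamps dataframe with a Vowel column)
episode_words = timestamps[['Vowel', 'Word']].drop_duplicates()

# Combine, fill in List for the episode words, sort, and print without the index
words = pd.concat([lexicalset, episode_words])
words['List'] = words['List'].fillna('episode')
words = words.sort_values(['Vowel', 'Word']).reset_index(drop=True)
print(words.to_string(index=False))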
List        Vowel  Word
episode     i      beat
episode     i      believe
lexicalset  i      fleece
episode     i      people
lexicalset  o      force
lexicalset  o      goat
episode     u      blue
lexicalset  u      goose
episode     u      through
episode     u      who
episode     æ      bang
lexicalset  æ      bath
episode     æ      hand
episode     æ      laugh
lexicalset  æ      trap
episode     ɑ      ball
episode     ɑ      father
episode     ɑ      honorific
lexicalset  ɑ      lot
lexicalset  ɑ      palm
lexicalset  ɑ      start
episode     ɔ      bought
lexicalset  ɔ      cloth
episode     ɔ      core
lexicalset  ɔ      north
lexicalset  ɔ      thought
episode     ɔ      wrong
episode     ə      among
episode     ə      famous
episode     ə      support
episode     ɛ      bet
lexicalset  ɛ      dress
episode     ɛ      guest
episode     ɛ      says
lexicalset  ɛ      square
episode     ɪ      bit
episode     ɪ      finish
lexicalset  ɪ      kit
lexicalset  ɪ      near
episode     ɪ      pin
episode     ʊ      could
lexicalset  ʊ      cure
lexicalset  ʊ      foot
episode     ʊ      foot
episode     ʊ      put
episode     ʌ      another
episode     ʌ      but
episode     ʌ      fun
lexicalset  ʌ      strut
2.4 An Interlude in Praat
Now we have about 400 audio clips of Gretchen and Lauren saying words that are good examples of each vowel. The vowel data that will actually go into the plots is F1 and F2 [todo: link to overview of formants]. The easiest way to calculate the vowel formants is with Praat, a program designed for phonetic analysis.
Here’s an example of what that looked like:
The vowel [ɪ] is highlighted in pink, and you can see it’s darker on the spectrogram than the consonants [k] and [t] before and after it. I placed an annotation (the blue lines, which Praat calls a “point tier”) right in the middle of the vowel sound. This one is pretty easy: Lauren was speaking slowly and without anyone overlapping, so the vowel sound is long and clear. The formants are the lines of red dots, and the popup window shows the formant values at the vowel annotation time. We’ll be using F1 (the bottom one) and F2 (the second from the bottom).
You can download Praat and see the documentation here. It’s fairly old, so a con is that the interface isn’t necessarily intuitive if you’re used to modern programs, but a pro is that there are a ton of resources available for learning how to use it. Here’s a tutorial about getting started with Praat, and here’s one for recording your own audio and calculating the formants in it.
After going through all of the audio clips, I had a .TextGrid file (Praat’s annotation file format) for each audio clip, each including the timestamp for the middle(ish) of the vowel. You can copy formant values manually out of Praat, or you can use Praat scripting to export them to a csv file (see this tutorial, for example). But I prefer to go back to Python instead of wrangling Praat scripting code.
2.5 Read Annotation Data
There are packages that read Praat TextGrid files, but I kept getting errors. Luckily, the TextGrids for these annotations are simple text files, and we only need to extract one variable (the time of the point tier). These two functions do that:
import pandas as pd

def get_tier(text):
    """Get annotation info from Praat TextGrid."""
    tg = text.split('class = "TextTier"')[1]
    tg = tg.splitlines()[1:]
    tg = pd.Series(tg)
    tg = tg.str.partition('=')
    tg.drop(columns=1, inplace=True)
    tg.rename(columns={0: 'Name', 2: 'Value'}, inplace=True)
    tg['Name'] = tg['Name'].str.strip()
    tg['Value'] = tg['Value'].str.strip()
    tg.set_index('Name', inplace=True)
    return tg['Value'].to_dict()

def get_point_tier_time(t):
    """Get time from TextGrid PointTier."""
    tg = get_tier(t)
    time = tg['number']
    time = float(time)
    return round(time, 4)
1
The section we need in the TextGrid files starts with this string.
2
Split string by line breaks and convert to pandas series.
3
Split into columns by = character, where the first column is the variable name, the second column is = (and gets dropped), and the third column is the variable value.
4
Remove extra whitespace.
5
Make Name into index, so the dataframe can be converted to a dictionary.
6
Read TextGrid file using function defined immediately above.
7
The variable we want (the timestamp for the PointTier annotation) is called number.
8
Convert the time from a string to a float, and round it to 4 digits.
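For example, reading one file and pulling out its point tier time (the annotations directory here is a guess, and the file name is a hypothetical one following the list_word_speaker_number pattern):

with open(os.path.join('annotations', 'episode_bit_L_1.TextGrid')) as f:
    print(get_point_tier_time(f.read()))  # e.g. 1.2345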
These functions cycle through the list of TextGrid files, extract the point tier times, and put them into a dataframe with the rest of the information for each word (see the sketch after this list):
1
Try to get timestamp from PointTier annotation for each TextGrid file, using function defined in previous code chunk.
2
Put results into a dataframe.
3
Remove the path prefix and file type suffix from the file names, leaving just the list_word_speaker_number format.
4
Get other variables from File, using function defined immediately below.
5
When File is split by _, List (episode or lexicalset) is the first item, Word is the second item, Speaker (G or L) is the third item, and Count is the fourth item in the resulting list.
6
Merge with the word list dataframe to add a column for Vowel by matching on Word.
7
Organize.
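A minimal sketch of those functions, assuming the names used here (read_annotations, split_file_names) and the combined words dataframe from the word list section above:

def read_annotations(files):
    """Extract PointTier times from TextGrid files into a dataframe."""
    times = {}
    for file in files:
        with open(file) as f:
            try:
                times[file] = get_point_tier_time(f.read())
            except (IndexError, KeyError):
                print(f'No PointTier annotation found in {file}')
    df = pd.DataFrame({'File': list(times), 'Time': list(times.values())})
    # Strip the path prefix and the .TextGrid suffix from the file names
    df['File'] = df['File'].apply(os.path.basename).str.removesuffix('.TextGrid')
    df = split_file_names(df)
    # Add the Vowel column by matching on Word
    df = df.merge(words[['Word', 'Vowel']].drop_duplicates(), on='Word', how='left')
    # Organize
    return df[['List', 'Vowel', 'Word', 'Speaker', 'Count', 'Time']].sort_values(['Vowel', 'Word'])

def split_file_names(df):
    """Split File into List, Word, Speaker, and Count."""
    df[['List', 'Word', 'Speaker', 'Count']] = df['File'].str.split('_', expand=True)
    return df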
Make list of TextGrid files and read PointTier times:
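Something like this, assuming the TextGrid files live in an annotations directory (the path is a guess):

from glob import glob

tg_files = sorted(glob(os.path.join('annotations', '*.TextGrid')))
annotations = read_annotations(tg_files)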
Group the data by Vowel and Speaker, then calculate the min, mean, and max of F1 and F2 for each Vowel + Speaker combination. Print the results as a table, with values rounded to whole numbers.
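A sketch of that summary, assuming the formant measurements have been collected into a dataframe (called formants here) with F1 and F2 columns:

summary = (
    formants
    .groupby(['Vowel', 'Speaker'])[['F1', 'F2']]
    .agg(['min', 'mean', 'max'])
    .round()
    .astype(int)
)
print(summary)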