Lingthusiasm Vowel Plots

Contents

  • 2.1 Setup
  • 2.2 Trim Audio from Episodes
  • 2.3 Wells Lexical Set
  • 2.4 An Interlude in Praat
  • 2.5 Read Annotation Data
  • 2.6 Calculate Formants

Part 2: Annotating the Audio

Author: Bethany Gardner

Published: February 9, 2024

DOI: 10.5281/zenodo.10642632


There are two sets of data going into these vowel plots:

  1. Vowels pulled from the Lingthusiasm episode recordings, which were located in Part 1
  2. Vowels from Gretchen & Lauren recording the Wells lexical set for me

The next steps are to trim the words out of the episode audio files for #1, then annotate the vowels for both #1 and #2.

2.1 Setup

"""Part 2 of Lingthusiasm Vowel Plots: Trimming Audio and Getting Vowel Formants."""

import glob
import os
import pandas as pd
from pytube import Playlist, YouTube
from pydub import AudioSegment
import parselmouth
  1. File utilities.
  2. Dataframes.
  3. Getting captions and audio data from YouTube.
  4. Working with audio files.
  5. Interface with Praat.

Get video info from Lingthusiasm’s all episodes playlist:

video_list = Playlist('https://www.youtube.com/watch?v=xHNgepsuZ8c&' +
                      'list=PLcqOJ708UoXQ2wSZelLwkkHFwg424u8tG')

Go through each video and download audio (if not already downloaded):

def get_audio(videos):
    """Download episode audio from Youtube."""
    for url in videos:
        video = YouTube(url)
        video.bypass_age_gate()

        title = video.title
        episode = int(title[:2])

        audio_file_name = os.path.join(
            'audio', 'episodes', f'{episode}.mp4')
        if not os.path.isfile(audio_file_name):
            audio_stream = video.streams.get_audio_only()
            print(f'downloading {episode}')
            audio_stream.download(filename=audio_file_name)


get_audio(video_list)
  1. Go through the list of video URLs and open each one as a YouTube object.
  2. Calling bypass_age_gate() is needed for the download to work.
  3. The video title is an attribute of the YouTube object, and the episode number is the first word of the title.
  4. Create the file name for the episode audio.
  5. If the file is not already downloaded, select and download the highest-quality audio-only stream.
downloading 88
downloading 87
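The episode-number step above just slices the first two characters of the title. Here's that assumption isolated as a tiny sketch (the title string is made up for illustration):

```python
def episode_number(title: str) -> int:
    """Parse the leading episode number from a video title like '88: ...'.

    Mirrors the int(title[:2]) step in get_audio above; assumes every
    title starts with a two-digit episode number.
    """
    return int(title[:2])


print(episode_number('88: What visualizing our vowels tells us'))  # 88
```

Note this would raise a ValueError for a hypothetical single-digit title like '9: …', since int('9:') isn't valid; the episodes in this range are all two digits.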

2.2 Trim Audio from Episodes

Open the timestamps data from Part 1:

timestamps = pd.read_csv( 
    'data/timestamps_annotate.csv',
    usecols=[
        'Vowel', 'Word', 'Speaker', 'Number',
        'Episode', 'Start', 'End'
    ],
    dtype={
        'Vowel': 'category', 'Word': 'category', 'Speaker': 'category',
        'Number': 'category', 'Episode': 'category',
        'Start': 'int', 'End': 'int'
    }
)
timestamps['Speaker'] = timestamps['Speaker'] \
    .str.replace('retchen', '').str.replace('auren', '') \
    .astype('category')
  1. Keep columns specifying word variables.
  2. And keep columns specifying where the audio is.
  3. Make all columns categorical variables, except the Start and End times (integers).
  4. Convert values in the Speaker column from names to initials.
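As a quick illustration of the initials conversion, here's the same chained string replacement on a toy series (not the real data):

```python
import pandas as pd

# Toy speaker column, mirroring the str.replace chain above.
speakers = pd.Series(['Gretchen', 'Lauren', 'Gretchen'], dtype='category')
initials = (speakers
            .str.replace('retchen', '', regex=False)
            .str.replace('auren', '', regex=False)
            .astype('category'))

print(list(initials))  # ['G', 'L', 'G']
```

Stripping the tail of each name leaves just the initial, and casting back to category keeps the column compact.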

Trim the audio for the duration of each caption (with 250 ms before and after). This results in ~240 audio files, each 2-10 seconds long and each containing a target word.

def trim_audio(df):
    """Use caption timestamps to trim audio."""

    for i in df.index:
        episode = df.loc[i, 'Episode']
        word = df.loc[i, 'Word']
        speaker = df.loc[i, 'Speaker']
        count = df.loc[i, 'Number']

        out_file = os.path.join(
            'audio', 'words', f'episode_{word}_{speaker}_{count}.wav')
        if not os.path.isfile(out_file):
            in_file = os.path.join('audio', 'episodes', f'{episode}.mp4')
            audio = AudioSegment.from_file(in_file, format='mp4')
            start = max(df.loc[i, 'Start'] - 250, 0)
            end = min(len(audio), df.loc[i, 'End'] + 250)
            clip = audio[start:end]
            clip.export(out_f=out_file, format='wav')


trim_audio(timestamps)
  1. Go through the dataframe that has the example words to annotate and their timestamps.
  2. Make the file name for the current word, and if it does not already exist…
  3. Open the audio file for the whole episode.
  4. Trim the episode audio to start 250 ms before the caption start and end 250 ms after the caption end (clamped to the episode bounds); save it.
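The clamping in trim_audio can be isolated as a small helper (a sketch; pydub slices AudioSegments by millisecond index, so these values index directly into the audio):

```python
def padded_window(start_ms: int, end_ms: int, audio_len_ms: int,
                  pad_ms: int = 250) -> tuple[int, int]:
    """Pad a caption window by pad_ms on each side, clamped to the audio.

    Mirrors the max()/min() bounds checks in trim_audio above, so the
    clip never starts before 0 or ends past the episode length.
    """
    return max(start_ms - pad_ms, 0), min(audio_len_ms, end_ms + pad_ms)


print(padded_window(100, 2000, 300_000))      # (0, 2250): clamped at the start
print(padded_window(5000, 299_900, 300_000))  # (4750, 300000): clamped at the end
```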

2.3 Wells Lexical Set

The Wells lexical set is a set of examples for each vowel/diphthong, chosen to be maximally distinguishable and consistent. You can read more about it on Wikipedia and John Wells’ blog. These recordings are going to be more controlled than the vowels pulled from the episode recordings and easier to annotate because they’re spoken more slowly and carefully. This set contains some fairly low-frequency words, which is why there’s not a lot of overlap with the words pulled from the episodes.

[Figure: The Wells lexical set]

wells_lexical_set = {
    '\u0069': 'fleece',   # i
    '\u026A': ['kit', 'near'],  # ɪ
    '\u025B': ['dress', 'square'],  # ɛ
    '\u00E6': ['trap', 'bath'],  # æ
    '\u006F': ['force', 'goat'],  # o
    '\u0075': 'goose',  # u
    '\u028A': ['cure', 'foot'],  # ʊ
    '\u0254': ['cloth', 'north', 'thought'],  # ɔ
    '\u0251': ['lot', 'palm', 'start'],  # ɑ
    '\u028C': 'strut'  # ʌ
}

wells_lexical_set = pd.DataFrame.from_dict(wells_lexical_set, orient='index') \
    .rename(columns={0: 'Word'}) \
    .explode('Word') \
    .reset_index(names='Vowel')
  1. Dictionary where the keys are the IPA vowel unicode characters and the values are the word(s).
  2. Convert to a dataframe with columns for Vowel and Word.
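Here's the same scalar-vs-list pattern on a toy dictionary: from_dict keeps each value as one cell, and explode then splits the list values into rows while leaving scalars alone (reset_index(names=...) needs pandas ≥ 1.5):

```python
import pandas as pd

# Toy version of the lexical-set dictionary: one scalar, one list.
toy = {'i': 'fleece', '\u026A': ['kit', 'near']}  # i, ɪ

df = (pd.DataFrame.from_dict(toy, orient='index')
        .rename(columns={0: 'Word'})
        .explode('Word')
        .reset_index(names='Vowel'))

print(list(df['Vowel']), list(df['Word']))
# ['i', 'ɪ', 'ɪ'] ['fleece', 'kit', 'near']
```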

Here’s the full set of words for each vowel:

word_list = pd.concat([
    pd.DataFrame({
        'List': 'lexicalset',
        'Vowel': wells_lexical_set['Vowel'],
        'Word': wells_lexical_set['Word']
    }),
    timestamps[['Vowel', 'Word']].drop_duplicates()
  ])
word_list = word_list.fillna('episode')
word_list = word_list.sort_values(by=['Vowel', 'Word'])
word_list = word_list.reset_index(drop=True)

word_list.style.hide()
  1. Combine the word lists from the episodes and the Wells lexical set.
  2. Dataframe for the Wells lexical set, with columns for List, Vowel, and Word.
  3. Subset of the timestamps dataframe for the episode word list, with columns for Vowel and Word.
  4. Fill the NA values of List with episode.
  5. Sort and reset the index.
  6. Print, not including the index.
List Vowel Word
episode i beat
episode i believe
lexicalset i fleece
episode i people
lexicalset o force
lexicalset o goat
episode u blue
lexicalset u goose
episode u through
episode u who
episode æ bang
lexicalset æ bath
episode æ hand
episode æ laugh
lexicalset æ trap
episode ɑ ball
episode ɑ father
episode ɑ honorific
lexicalset ɑ lot
lexicalset ɑ palm
lexicalset ɑ start
episode ɔ bought
lexicalset ɔ cloth
episode ɔ core
lexicalset ɔ north
lexicalset ɔ thought
episode ɔ wrong
episode ə among
episode ə famous
episode ə support
episode ɛ bet
lexicalset ɛ dress
episode ɛ guest
episode ɛ says
lexicalset ɛ square
episode ɪ bit
episode ɪ finish
lexicalset ɪ kit
lexicalset ɪ near
episode ɪ pin
episode ʊ could
lexicalset ʊ cure
lexicalset ʊ foot
episode ʊ foot
episode ʊ put
episode ʌ another
episode ʌ but
episode ʌ fun
lexicalset ʌ strut

2.4 An Interlude in Praat

Now, we have about 400 audio clips of Gretchen and Lauren saying words that are good examples of each vowel. The vowel data that will actually be going into the plots is F1 and F2, the first two formants (the resonant frequencies of the vocal tract that characterize vowel quality). The easiest way to calculate the vowel formants is using Praat, software designed for doing phonetic analysis.

Here’s an example of what that looked like:

Praat screenshot of Lauren saying “kit.”

The vowel [ɪ] is highlighted in pink, and you can see it’s darker on the spectrogram than the consonants [k] and [t] before and after it. I placed an annotation (the blue lines, which Praat calls a “point tier”) right in the middle of the vowel sound—this one is pretty easy, because Lauren was speaking slowly and without anyone overlapping, so the vowel sound is long and clear. The formants are the lines of red dots, and the popup window is the formant values at the vowel annotation time. We’ll be using F1 (the bottom one) and F2 (the second from the bottom).

You can download Praat and see the documentation here. It’s fairly old, so a con is that the interface isn’t necessarily intuitive if you’re used to modern programs, but a pro is that there are a ton of resources available for learning how to use it. Here’s a tutorial about getting started with Praat, and here’s one for recording your own audio and calculating the formants in it.

After going through all of the audio clips, I had a .TextGrid file (Praat’s annotation file format) for each audio clip that includes the timestamp for middle(ish) of the vowel. You can copy formant values manually out of Praat, or you can use Praat scripting to export them to a csv file (see this tutorial, for example). But I prefer to go back to Python instead of wrangling Praat scripting code.

2.5 Read Annotation Data

There are packages that read Praat TextGrid files, but I kept getting errors. Luckily, the TextGrids for these annotations are simple text files, and we only need to extract one variable (the time of the point tier). These two functions do that:

def get_tier(text):
    """Get annotation info from Praat TextGrid."""

    tg = text.split('class = "TextTier"')[1]
    tg = tg.splitlines()[1:]
    tg = pd.Series(tg)
    tg = tg.str.partition('=')
    tg.drop(columns=1, inplace=True)
    tg.rename(columns={0: 'Name', 2: 'Value'}, inplace=True)
    tg['Name'] = tg['Name'].str.strip()
    tg['Value'] = tg['Value'].str.strip()
    tg.set_index('Name', inplace=True)

    return tg['Value'].to_dict()


def get_point_tier_time(t):
    """Get time from TextGrid PointTier."""

    tg = get_tier(t)
    time = tg['number']
    time = float(time)

    return round(time, 4)
  1. The section we need in the TextGrid files starts with this string.
  2. Split the string by line breaks and convert to a pandas series.
  3. Split into columns by the = character, where the first column is the variable name, the second column is = (and gets dropped), and the third column is the variable value.
  4. Remove extra whitespace.
  5. Make Name into the index, so the dataframe can be converted to a dictionary.
  6. Read the TextGrid text using the function defined immediately above.
  7. The variable we want (the timestamp for the PointTier annotation) is called number.
  8. Convert the time from character to numeric and round to 4 digits.
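To make the splitting concrete, here's an invented fragment in the shape of a TextGrid point-tier section, run through the same partition-on-= steps as get_tier (the field names follow Praat's TextGrid format, but this exact excerpt and its timestamp are made up):

```python
import pandas as pd

# Invented excerpt of the point-tier section of a TextGrid file.
sample = '''class = "TextTier"
    name = "vowel"
    xmin = 0
    xmax = 2.5
    points: size = 1
    points [1]:
        number = 0.27734
'''

# Drop the 'class = ...' line, then split each line on '='.
tg = pd.Series(sample.splitlines()[1:]).str.partition('=')
tg = tg.drop(columns=1).rename(columns={0: 'Name', 2: 'Value'})
tg['Name'] = tg['Name'].str.strip()
tg['Value'] = tg['Value'].str.strip()
info = tg.set_index('Name')['Value'].to_dict()

print(round(float(info['number']), 4))  # 0.2773
```

Lines without an = (like `points [1]:`) partition into an empty value, which is harmless once the series is converted to a dictionary.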

These functions cycle through the list of TextGrid files, extract the point tier times, and put them into a dataframe with the rest of the information for each word:

def read_textgrid_times(file_list, word_list):
    """Read textgrid files into dataframe."""

    tg_times = []
    for file_name in file_list:
        with open(file_name, encoding='utf-8') as t_file:
            t = t_file.read()
            try:
                tg = get_point_tier_time(t)
            except KeyError:
                tg = None
            tg_times.append(tg)

    df = pd.DataFrame({'File': file_list, 'Vowel_Time': tg_times})
    df['File'] = df['File'].str.rpartition('\\')[2]
    df['File'] = df['File'].str.removesuffix('.TextGrid')

    return textgrid_vars(df, word_list)


def textgrid_vars(df, word_list):
    """Format df of vowel timestamps."""

    df['List'] = df['File'].str.split('_', expand=True)[0]
    df['Word'] = df['File'].str.split('_', expand=True)[1]
    df['Speaker'] = df['File'].str.split('_', expand=True)[2]
    df['Count'] = df['File'].str.split('_', expand=True)[3]

    df = pd.merge(df, word_list, how='left', on=['Word', 'List'])

    return df[['List', 'Vowel', 'Word', 'Speaker', 'Count', 'Vowel_Time']]
  1. Try to get the timestamp from the PointTier annotation for each TextGrid file, using the function defined in the previous code chunk.
  2. Put the results into a dataframe.
  3. Remove the path prefix and file-type suffix from the file names, leaving just the list_word_speaker_number format.
  4. Get the other variables from File, using the function defined immediately below.
  5. When File is split by _, List (episode or lexicalset) is the first item, Word is the second, Speaker (G or L) is the third, and Count is the fourth item in the resulting list.
  6. Merge with the word list dataframe to add a column for Vowel, matching on Word and List.
  7. Organize the columns.
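The four-way file-name split can also be done in a single pass, which avoids re-splitting the same strings four times (a toy illustration with made-up file stems):

```python
import pandas as pd

# Hypothetical file stems in the list_word_speaker_count format.
files = pd.Series(['episode_among_G_1', 'lexicalset_trap_L_3'])

parts = files.str.split('_', expand=True)
parts.columns = ['List', 'Word', 'Speaker', 'Count']

print(parts.loc[1, 'Word'], parts.loc[1, 'Speaker'])  # trap L
```

One expand=True split returns all four columns at once; the repeated str.split calls in textgrid_vars produce the same result, just less efficiently.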

Make list of TextGrid files and read PointTier times:

tg_list = glob.glob(os.path.join('audio', 'words', '*.TextGrid'))
formants = read_textgrid_times(tg_list, word_list)

pd.concat([formants.head(), formants.tail()]).style.hide()
  1. Get a list of all .TextGrid files in the audio/words/ directory.
  2. Get the data from each TextGrid file, using the functions defined in the previous two code chunks.
  3. Get the first and last five rows, then display as a table, not including the row numbers.
List Vowel Word Speaker Count Vowel_Time
episode ə among G 1 1.445500
episode ə among G 2 2.705600
episode ə among G 3 3.847700
episode ə among G 4 2.780400
episode ə among G 5 2.237000
lexicalset æ trap G 2 0.363400
lexicalset æ trap G 3 0.235500
lexicalset æ trap L 1 0.277300
lexicalset æ trap L 2 0.221500
lexicalset æ trap L 3 0.181500

2.6 Calculate Formants

It’s possible to export the formants from Praat, but I think it’s easier to use the parselmouth package here, which runs Praat from Python.

def get_formants(df):
    """Get F1 and F2 at specified time."""

    for i in df.index:
        file = os.path.join(
            'audio', 'words',
            (df.loc[i, 'List'] + '_' + df.loc[i, 'Word'] + '_' +
            df.loc[i, 'Speaker'] + '_' + df.loc[i, 'Count'] + '.wav')
        )
        audio = parselmouth.Sound(file)
        formants = audio.to_formant_burg(
            time_step=0.01, max_number_of_formants=5
        )
        if 'fleece_L_2' in file:
            formants = audio.to_formant_burg(
                time_step=0.01, max_number_of_formants=4
            )
        vowel_time = df.loc[i, 'Vowel_Time']
        df.loc[i, 'F1'] = formants.get_value_at_time(1, vowel_time)
        df.loc[i, 'F2'] = formants.get_value_at_time(2, vowel_time)

    return df
  1. Reconstruct the current audio file name from the List + Word + Speaker + Count variables.
  2. Use the parselmouth package to open the audio file.
  3. Call Praat via parselmouth and calculate the formants. These are the default settings: every 0.01 seconds, up to 5 formants.
  4. Because of artefacts in the recording, this file needs a limit of 4 formants to identify them correctly.
  5. Get the timestamp of the vowel for the current audio file.
  6. Get F1 and F2 at that time from the parselmouth formant object.

Calculate the formants and summarize the results:

formants = get_formants(formants)

pd.concat([formants.head(), formants.tail()]).style.hide()
  1. Calculate the vowel formants for each word, using the function defined in the previous code chunk.
  2. Get the first and last five rows, then display as a table, not including the row numbers.
List Vowel Word Speaker Count Vowel_Time F1 F2
episode ə among G 1 1.445500 765.824033 1259.934246
episode ə among G 2 2.705600 602.121127 1253.818304
episode ə among G 3 3.847700 733.394036 1174.821421
episode ə among G 4 2.780400 766.380340 1372.590268
episode ə among G 5 2.237000 625.651118 1259.707370
lexicalset æ trap G 2 0.363400 993.455993 1672.473430
lexicalset æ trap G 3 0.235500 994.743332 1709.214039
lexicalset æ trap L 1 0.277300 915.395294 1516.026185
lexicalset æ trap L 2 0.221500 903.045126 1466.625440
lexicalset æ trap L 3 0.181500 987.048619 1601.687398
formants.groupby(['Vowel', 'Speaker']) \
    .agg({'F1': ['min', 'mean', 'max'], 'F2': ['min', 'mean', 'max']}) \
    .style \
    .format(precision=0)
  1. Group the data by Vowel and Speaker, then calculate the min, mean, and max of F1 and F2 for each Vowel + Speaker combination. Print the results as a table, with values rounded to whole numbers.
Vowel  Speaker  F1 min  F1 mean  F1 max  F2 min  F2 mean  F2 max
i      G        244     353      436     2284    2566     2882
i      L        355     420      512     2240    2635     2924
o      G        505     592      691     781     1087     1491
o      L        497     583      665     727     1011     1269
u      G        337     397      536     977     1589     2050
u      L        293     396      481     1416    1834     2140
æ      G        586     859      1141    1438    1802     2187
æ      L        784     914      1015    1236    1619     2127
ɑ      G        513     721      884     753     1024     1439
ɑ      L        443     726      900     672     1123     1347
ɔ      G        338     695      900     604     1016     1214
ɔ      L        483     600      766     609     925      1273
ə      G        404     575      766     1175    1505     1893
ə      L        384     621      935     1208    1513     1863
ɛ      G        512     721      883     1629    1827     2093
ɛ      L        489     669      803     1670    1954     2327
ɪ      G        367     483      624     1675    2098     2693
ɪ      L        275     452      533     1940    2383     2797
ʊ      G        316     521      661     962     1365     1940
ʊ      L        407     488      633     950     1344     2111
ʌ      G        528     726      971     1239    1470     1719
ʌ      L        528     728      970     1140    1382     1799

Save results as data/formants.csv:

formants.to_csv('data/formants.csv', index=False)

Citation

BibTeX citation:
@online{gardner2024,
  author = {Gardner, Bethany},
  title = {Lingthusiasm {Vowel} {Plots}},
  date = {2024-02-09},
  url = {https://bethanyhgardner.github.io/lingthusiasm-vowel-plots},
  doi = {10.5281/zenodo.10642632},
  langid = {en}
}
For attribution, please cite this work as:
Gardner, Bethany. 2024. “Lingthusiasm Vowel Plots.” February 9, 2024. https://doi.org/10.5281/zenodo.10642632.