Lingthusiasm Vowel Plots

Contents

  • 2.1 Setup
  • 2.2 Trim Audio from Episodes
  • 2.3 Wells Lexical Set
  • 2.4 An Interlude in Praat
  • 2.5 Read Annotation Data
  • 2.6 Calculate Formants

Part 2: Annotating the Audio

Author: Bethany Gardner

Published: February 9, 2024

DOI: 10.5281/zenodo.10642632


There are two sets of data going into these vowel plots:

  1. Vowels pulled from the Lingthusiasm episode recordings, which were located in Part 1
  2. Vowels from Gretchen & Lauren recording the Wells lexical set for me

The next steps are to trim the words out of the episode audio files for #1, then annotate the vowels for both #1 and #2.

2.1 Setup

"""Part 2 of Lingthusiasm Vowel Plots: Trimming Audio and Getting Vowel Formants."""

import glob
import os
import pandas as pd
from pytube import Playlist, YouTube
from pydub import AudioSegment
import parselmouth
  1. File utilities.
  2. Dataframes.
  3. Getting captions and audio data from YouTube.
  4. Working with audio files.
  5. Interface with Praat.

Get video info from Lingthusiasm’s all episodes playlist:

video_list = Playlist('https://www.youtube.com/watch?v=xHNgepsuZ8c&' +
                      'list=PLcqOJ708UoXQ2wSZelLwkkHFwg424u8tG')

Go through each video and download audio (if not already downloaded):

def get_audio(videos):
    """Download episode audio from Youtube."""
    for url in videos:
        video = YouTube(url)
        video.bypass_age_gate()

        title = video.title
        episode = int(title[:2])

        audio_file_name = os.path.join(
            'audio', 'episodes', f'{episode}.mp4')
        if not os.path.isfile(audio_file_name):
            audio_stream = video.streams.get_audio_only()
            print(f'downloading {episode}')
            audio_stream.download(filename=audio_file_name)


get_audio(video_list)
  1. Go through the list of video URLs and open each one as a YouTube object.
  2. Calling bypass_age_gate() is needed for the download to work.
  3. The video title is an attribute of the YouTube object, and the episode number is the first word of the title.
  4. Create the file name for the episode audio.
  5. If the file is not already downloaded, select and download the highest-quality audio-only stream.
downloading 88
downloading 87
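The episode-number step above just slices the first two characters of the title. Here's that assumption isolated as a tiny sketch (the title string is made up for illustration):

```python
def episode_number(title: str) -> int:
    """Parse the leading episode number from a video title like '88: ...'.

    Mirrors the int(title[:2]) step in get_audio above; assumes every
    title starts with a two-digit episode number.
    """
    return int(title[:2])


print(episode_number('88: What visualizing our vowels tells us'))  # 88
```

Note this would raise a ValueError for a hypothetical single-digit title like '9: …', since int('9:') isn't valid; the episodes in this range are all two digits.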

2.2 Trim Audio from Episodes

Open the timestamps data from Part 1:

timestamps = pd.read_csv( 
    'data/timestamps_annotate.csv',
    usecols=[
        'Vowel', 'Word', 'Speaker', 'Number',
        'Episode', 'Start', 'End'
    ],
    dtype={
        'Vowel': 'category', 'Word': 'category', 'Speaker': 'category',
        'Number': 'category', 'Episode': 'category',
        'Start': 'int', 'End': 'int'
    }
)
timestamps['Speaker'] = timestamps['Speaker'] \
    .str.replace('retchen', '').str.replace('auren', '') \
    .astype('category')
  1. Keep columns specifying word variables.
  2. And keep columns specifying where the audio is.
  3. Make all columns categorical variables, except the Start and End times (integers).
  4. Convert values in the Speaker column from names to initials.
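As a quick illustration of the initials conversion, here's the same chained string replacement on a toy series (not the real data):

```python
import pandas as pd

# Toy speaker column, mirroring the str.replace chain above.
speakers = pd.Series(['Gretchen', 'Lauren', 'Gretchen'], dtype='category')
initials = (speakers
            .str.replace('retchen', '', regex=False)
            .str.replace('auren', '', regex=False)
            .astype('category'))

print(list(initials))  # ['G', 'L', 'G']
```

Stripping the tail of each name leaves just the initial, and casting back to category keeps the column compact.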

Trim the audio for the duration of each caption (with 250 ms before and after). This results in ~240 audio files, each 2-10 seconds long and each containing a target word.

def trim_audio(df):
    """Use caption timestamps to trim audio."""

    for i in df.index:
        episode = df.loc[i, 'Episode']
        word = df.loc[i, 'Word']
        speaker = df.loc[i, 'Speaker']
        count = df.loc[i, 'Number']

        out_file = os.path.join(
            'audio', 'words', f'episode_{word}_{speaker}_{count}.wav')
        if not os.path.isfile(out_file):
            in_file = os.path.join('audio', 'episodes', f'{episode}.mp4')
            audio = AudioSegment.from_file(in_file, format='mp4')
            start = max(df.loc[i, 'Start'] - 250, 0)
            end = min(len(audio), df.loc[i, 'End'] + 250)
            clip = audio[start:end]
            clip.export(out_f=out_file, format='wav')


trim_audio(timestamps)
  1. Go through the dataframe that has the example words to annotate and their timestamps.
  2. Make the file name for the current word, and if it does not already exist…
  3. Open the audio file for the whole episode.
  4. Trim the episode audio to start 250 ms before the caption start and end 250 ms after the caption end (clamped to the episode bounds); save it.
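The clamping in trim_audio can be isolated as a small helper (a sketch; pydub slices AudioSegments by millisecond index, so these values index directly into the audio):

```python
def padded_window(start_ms: int, end_ms: int, audio_len_ms: int,
                  pad_ms: int = 250) -> tuple[int, int]:
    """Pad a caption window by pad_ms on each side, clamped to the audio.

    Mirrors the max()/min() bounds checks in trim_audio above, so the
    clip never starts before 0 or ends past the episode length.
    """
    return max(start_ms - pad_ms, 0), min(audio_len_ms, end_ms + pad_ms)


print(padded_window(100, 2000, 300_000))      # (0, 2250): clamped at the start
print(padded_window(5000, 299_900, 300_000))  # (4750, 300000): clamped at the end
```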

2.3 Wells Lexical Set

The Wells lexical set is a set of examples for each vowel/diphthong, chosen to be maximally distinguishable and consistent. You can read more about it on Wikipedia and John Wells’ blog. These recordings are going to be more controlled than the vowels pulled from the episode recordings and easier to annotate because they’re spoken more slowly and carefully. This set contains some fairly low-frequency words, which is why there’s not a lot of overlap with the words pulled from the episodes.

[Figure: The Wells lexical set]

wells_lexical_set = {
    '\u0069': 'fleece',   # i
    '\u026A': ['kit', 'near'],  # ɪ
    '\u025B': ['dress', 'square'],  # ɛ
    '\u00E6': ['trap', 'bath'],  # æ
    '\u006F': ['force', 'goat'],  # o
    '\u0075': 'goose',  # u
    '\u028A': ['cure', 'foot'],  # ʊ
    '\u0254': ['cloth', 'north', 'thought'],  # ɔ
    '\u0251': ['lot', 'palm', 'start'],  # ɑ
    '\u028C': 'strut'  # ʌ
}

wells_lexical_set = pd.DataFrame.from_dict(wells_lexical_set, orient='index') \
    .rename(columns={0: 'Word'}) \
    .explode('Word') \
    .reset_index(names='Vowel')
  1. Dictionary where the keys are the IPA vowel unicode characters and the values are the word(s).
  2. Convert to a dataframe with columns for Vowel and Word.
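Here's the same scalar-vs-list pattern on a toy dictionary: from_dict keeps each value as one cell, and explode then splits the list values into rows while leaving scalars alone (reset_index(names=...) needs pandas ≥ 1.5):

```python
import pandas as pd

# Toy version of the lexical-set dictionary: one scalar, one list.
toy = {'i': 'fleece', '\u026A': ['kit', 'near']}  # i, ɪ

df = (pd.DataFrame.from_dict(toy, orient='index')
        .rename(columns={0: 'Word'})
        .explode('Word')
        .reset_index(names='Vowel'))

print(list(df['Vowel']), list(df['Word']))
# ['i', 'ɪ', 'ɪ'] ['fleece', 'kit', 'near']
```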

Here’s the full set of words for each vowel:

word_list = pd.concat([
    pd.DataFrame({
        'List': 'lexicalset',
        'Vowel': wells_lexical_set['Vowel'],
        'Word': wells_lexical_set['Word']
    }),
    timestamps[['Vowel', 'Word']].drop_duplicates()
  ])
word_list = word_list.fillna('episode')
word_list = word_list.sort_values(by=['Vowel', 'Word'])
word_list = word_list.reset_index(drop=True)

word_list.style.hide()
  1. Combine the word lists from the episodes and the Wells lexical set.
  2. Dataframe for the Wells lexical set, with columns for List, Vowel, and Word.
  3. Subset of the timestamps dataframe for the episode word list, with columns for Vowel and Word.
  4. Fill the NA values of List with episode.
  5. Sort and reset the index.
  6. Print, not including the index.
List Vowel Word
episode i beat
episode i believe
lexicalset i fleece
episode i people
lexicalset o force
lexicalset o goat
episode u blue
lexicalset u goose
episode u through
episode u who
episode æ bang
lexicalset æ bath
episode æ hand
episode æ laugh
lexicalset æ trap
episode ɑ ball
episode ɑ father
episode ɑ honorific
lexicalset ɑ lot
lexicalset ɑ palm
lexicalset ɑ start
episode ɔ bought
lexicalset ɔ cloth
episode ɔ core
lexicalset ɔ north
lexicalset ɔ thought
episode ɔ wrong
episode ə among
episode ə famous
episode ə support
episode ɛ bet
lexicalset ɛ dress
episode ɛ guest
episode ɛ says
lexicalset ɛ square
episode ɪ bit
episode ɪ finish
lexicalset ɪ kit
lexicalset ɪ near
episode ɪ pin
episode ʊ could
lexicalset ʊ cure
lexicalset ʊ foot
episode ʊ foot
episode ʊ put
episode ʌ another
episode ʌ but
episode ʌ fun
lexicalset ʌ strut

2.4 An Interlude in Praat

Now, we have about 400 audio clips of Gretchen and Lauren saying words that are good examples of each vowel. The vowel data that will actually be going into the plots is F1 and F2, the first two formants (the resonant frequencies of the vocal tract that characterize vowel quality). The easiest way to calculate the vowel formants is using Praat, software designed for doing phonetic analysis.

Here’s an example of what that looked like:

Praat screenshot of Lauren saying “kit.”

The vowel [ɪ] is highlighted in pink, and you can see it’s darker on the spectrogram than the consonants [k] and [t] before and after it. I placed an annotation (the blue lines, which Praat calls a “point tier”) right in the middle of the vowel sound—this one is pretty easy, because Lauren was speaking slowly and without anyone overlapping, so the vowel sound is long and clear. The formants are the lines of red dots, and the popup window is the formant values at the vowel annotation time. We’ll be using F1 (the bottom one) and F2 (the second from the bottom).

You can download Praat and see the documentation here. It’s fairly old, so a con is that the interface isn’t necessarily intuitive if you’re used to modern programs, but a pro is that there are a ton of resources available for learning how to use it. Here’s a tutorial about getting started with Praat, and here’s one for recording your own audio and calculating the formants in it.

After going through all of the audio clips, I had a .TextGrid file (Praat’s annotation file format) for each audio clip that includes the timestamp for middle(ish) of the vowel. You can copy formant values manually out of Praat, or you can use Praat scripting to export them to a csv file (see this tutorial, for example). But I prefer to go back to Python instead of wrangling Praat scripting code.

2.5 Read Annotation Data

There are packages that read Praat TextGrid files, but I kept getting errors. Luckily, the TextGrids for these annotations are simple text files, and we only need to extract one variable (the time of the point tier). These two functions do that:

def get_tier(text):
    """Get annotation info from Praat TextGrid."""

    tg = text.split('class = "TextTier"')[1]
    tg = tg.splitlines()[1:]
    tg = pd.Series(tg)
    tg = tg.str.partition('=')
    tg.drop(columns=1, inplace=True)
    tg.rename(columns={0: 'Name', 2: 'Value'}, inplace=True)
    tg['Name'] = tg['Name'].str.strip()
    tg['Value'] = tg['Value'].str.strip()
    tg.set_index('Name', inplace=True)

    return tg['Value'].to_dict()


def get_point_tier_time(t):
    """Get time from TextGrid PointTier."""

    tg = get_tier(t)
    time = tg['number']
    time = float(time)

    return round(time, 4)
  1. The section we need in the TextGrid files starts with this string.
  2. Split the string by line breaks and convert to a pandas series.
  3. Split into columns by the = character, where the first column is the variable name, the second column is = (and gets dropped), and the third column is the variable value.
  4. Remove extra whitespace.
  5. Make Name into the index, so the dataframe can be converted to a dictionary.
  6. Read the TextGrid text using the function defined immediately above.
  7. The variable we want (the timestamp for the PointTier annotation) is called number.
  8. Convert the time from character to numeric and round to 4 digits.
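To make the splitting concrete, here's an invented fragment in the shape of a TextGrid point-tier section, run through the same partition-on-= steps as get_tier (the field names follow Praat's TextGrid format, but this exact excerpt and its timestamp are made up):

```python
import pandas as pd

# Invented excerpt of the point-tier section of a TextGrid file.
sample = '''class = "TextTier"
    name = "vowel"
    xmin = 0
    xmax = 2.5
    points: size = 1
    points [1]:
        number = 0.27734
'''

# Drop the 'class = ...' line, then split each line on '='.
tg = pd.Series(sample.splitlines()[1:]).str.partition('=')
tg = tg.drop(columns=1).rename(columns={0: 'Name', 2: 'Value'})
tg['Name'] = tg['Name'].str.strip()
tg['Value'] = tg['Value'].str.strip()
info = tg.set_index('Name')['Value'].to_dict()

print(round(float(info['number']), 4))  # 0.2773
```

Lines without an = (like `points [1]:`) partition into an empty value, which is harmless once the series is converted to a dictionary.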

These functions cycle through the list of TextGrid files, extract the point tier times, and put them into a dataframe with the rest of the information for each word:

def read_textgrid_times(file_list, word_list):
    """Read textgrid files into dataframe."""

    tg_times = []
    for file_name in file_list:
        with open(file_name, encoding='utf-8') as t_file:
            t = t_file.read()
            try:
                tg = get_point_tier_time(t)
            except KeyError:
                tg = None
            tg_times.append(tg)

    df = pd.DataFrame({'File': file_list, 'Vowel_Time': tg_times})
    df['File'] = df['File'].str.rpartition('\\')[2]
    df['File'] = df['File'].str.removesuffix('.TextGrid')

    return textgrid_vars(df, word_list)


def textgrid_vars(df, word_list):
    """Format df of vowel timestamps."""

    df['List'] = df['File'].str.split('_', expand=True)[0]
    df['Word'] = df['File'].str.split('_', expand=True)[1]
    df['Speaker'] = df['File'].str.split('_', expand=True)[2]
    df['Count'] = df['File'].str.split('_', expand=True)[3]

    df = pd.merge(df, word_list, how='left', on=['Word', 'List'])

    return df[['List', 'Vowel', 'Word', 'Speaker', 'Count', 'Vowel_Time']]
  1. Try to get the timestamp from the PointTier annotation for each TextGrid file, using the function defined in the previous code chunk.
  2. Put the results into a dataframe.
  3. Remove the path prefix and file-type suffix from the file names, leaving just the list_word_speaker_number format.
  4. Get the other variables from File, using the function defined immediately below.
  5. When File is split by _, List (episode or lexicalset) is the first item, Word is the second, Speaker (G or L) is the third, and Count is the fourth item in the resulting list.
  6. Merge with the word list dataframe to add a column for Vowel, matching on Word and List.
  7. Organize the columns.
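The four-way file-name split can also be done in a single pass, which avoids re-splitting the same strings four times (a toy illustration with made-up file stems):

```python
import pandas as pd

# Hypothetical file stems in the list_word_speaker_count format.
files = pd.Series(['episode_among_G_1', 'lexicalset_trap_L_3'])

parts = files.str.split('_', expand=True)
parts.columns = ['List', 'Word', 'Speaker', 'Count']

print(parts.loc[1, 'Word'], parts.loc[1, 'Speaker'])  # trap L
```

One expand=True split returns all four columns at once; the repeated str.split calls in textgrid_vars produce the same result, just less efficiently.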

Make list of TextGrid files and read PointTier times:

tg_list = glob.glob(os.path.join('audio', 'words', '*.TextGrid'))
formants = read_textgrid_times(tg_list, word_list)

pd.concat([formants.head(), formants.tail()]).style.hide()
  1. Get a list of all .TextGrid files in the audio/words/ directory.
  2. Get the data from each TextGrid file, using the functions defined in the previous two code chunks.
  3. Get the first and last five rows, then display as a table, not including the row numbers.
List Vowel Word Speaker Count Vowel_Time
episode ə among G 1 1.445500
episode ə among G 2 2.705600
episode ə among G 3 3.847700
episode ə among G 4 2.780400
episode ə among G 5 2.237000
lexicalset æ trap G 2 0.363400
lexicalset æ trap G 3 0.235500
lexicalset æ trap L 1 0.277300
lexicalset æ trap L 2 0.221500
lexicalset æ trap L 3 0.181500

2.6 Calculate Formants

It’s possible to export the formants from Praat, but I think it’s easier to use the parselmouth package here, which runs Praat from Python.

def get_formants(df):
    """Get F1 and F2 at specified time."""

    for i in df.index:
        file = os.path.join(
            'audio', 'words',
            (df.loc[i, 'List'] + '_' + df.loc[i, 'Word'] + '_' +
            df.loc[i, 'Speaker'] + '_' + df.loc[i, 'Count'] + '.wav')
        )
        audio = parselmouth.Sound(file)
        formants = audio.to_formant_burg(
            time_step=0.01, max_number_of_formants=5
        )
        if 'fleece_L_2' in file:
            formants = audio.to_formant_burg(
                time_step=0.01, max_number_of_formants=4
            )
        vowel_time = df.loc[i, 'Vowel_Time']
        df.loc[i, 'F1'] = formants.get_value_at_time(1, vowel_time)
        df.loc[i, 'F2'] = formants.get_value_at_time(2, vowel_time)

    return df
  1. Reconstruct the current audio file name from the List + Word + Speaker + Count variables.
  2. Use the parselmouth package to open the audio file.
  3. Call Praat via parselmouth and calculate the formants. These are the default settings: every 0.01 seconds, up to 5 formants.
  4. Because of artefacts in the recording, this file needs a limit of 4 formants to identify them correctly.
  5. Get the timestamp of the vowel for the current audio file.
  6. Get F1 and F2 at that time from the parselmouth formant object.

Calculate the formants and summarize the results:

formants = get_formants(formants)

pd.concat([formants.head(), formants.tail()]).style.hide()
  1. Calculate the vowel formants for each word, using the function defined in the previous code chunk.
  2. Get the first and last five rows, then display as a table, not including the row numbers.
List Vowel Word Speaker Count Vowel_Time F1 F2
episode ə among G 1 1.445500 765.824033 1259.934246
episode ə among G 2 2.705600 602.121127 1253.818304
episode ə among G 3 3.847700 733.394036 1174.821421
episode ə among G 4 2.780400 766.380340 1372.590268
episode ə among G 5 2.237000 625.651118 1259.707370
lexicalset æ trap G 2 0.363400 993.455993 1672.473430
lexicalset æ trap G 3 0.235500 994.743332 1709.214039
lexicalset æ trap L 1 0.277300 915.395294 1516.026185
lexicalset æ trap L 2 0.221500 903.045126 1466.625440
lexicalset æ trap L 3 0.181500 987.048619 1601.687398
formants.groupby(['Vowel', 'Speaker']) \
    .agg({'F1': ['min', 'mean', 'max'], 'F2': ['min', 'mean', 'max']}) \
    .style \
    .format(precision=0)
  1. Group the data by Vowel and Speaker, then calculate the min, mean, and max of F1 and F2 for each Vowel + Speaker combination. Print the results as a table, with values rounded to whole numbers.
Vowel  Speaker  F1 min  F1 mean  F1 max  F2 min  F2 mean  F2 max
i      G        244     353      436     2284    2566     2882
i      L        355     420      512     2240    2635     2924
o      G        505     592      691     781     1087     1491
o      L        497     583      665     727     1011     1269
u      G        337     397      536     977     1589     2050
u      L        293     396      481     1416    1834     2140
æ      G        586     859      1141    1438    1802     2187
æ      L        784     914      1015    1236    1619     2127
ɑ      G        513     721      884     753     1024     1439
ɑ      L        443     726      900     672     1123     1347
ɔ      G        338     695      900     604     1016     1214
ɔ      L        483     600      766     609     925      1273
ə      G        404     575      766     1175    1505     1893
ə      L        384     621      935     1208    1513     1863
ɛ      G        512     721      883     1629    1827     2093
ɛ      L        489     669      803     1670    1954     2327
ɪ      G        367     483      624     1675    2098     2693
ɪ      L        275     452      533     1940    2383     2797
ʊ      G        316     521      661     962     1365     1940
ʊ      L        407     488      633     950     1344     2111
ʌ      G        528     726      971     1239    1470     1719
ʌ      L        528     728      970     1140    1382     1799

Save results as data/formants.csv:

formants.to_csv('data/formants.csv', index=False)

Citation

BibTeX citation:
@online{gardner2024,
  author = {Gardner, Bethany},
  title = {Lingthusiasm {Vowel} {Plots}},
  date = {2024-02-09},
  url = {https://bethanyhgardner.github.io/lingthusiasm-vowel-plots},
  doi = {10.5281/zenodo.10642632},
  langid = {en}
}
For attribution, please cite this work as:
Gardner, Bethany. 2024. “Lingthusiasm Vowel Plots.” February 9, 2024. https://doi.org/10.5281/zenodo.10642632.