The first steps are to find the words for #1, and we’ll come back to #2 later.
1.1 Setup
"""Part 1 of Lingthusiasm Vowel Plots: Finding Vowels to Annotate."""import reimport pandas as pdimport requestsfrom bs4 import BeautifulSoupfrom thefuzz import processfrom pytube import Playlist, YouTube
1. `re`: regex functions.
2. `pandas`: dataframe functions.
3. `requests` and `bs4`: scraping data from webpages.
4. `thefuzz`: fuzzy string matching.
5. `pytube`: getting captions and audio from YouTube.
(Note: All of this could also be done in R, but I’ve done it in Python here primarily because there are Python packages that can get data from YouTube without having to set up anything on the YouTube API side. Here, all you need to do is install the package.)
1.2 Get Transcript Text & Speakers
1.2.1 List Episodes
The Lingthusiasm website has a page listing all of the available transcripts. Step 1 is to load that page and get the list of URLs to the transcripts. (They have similar but not identical structures, so it’s easiest to read the list from the website instead of trying to construct them here.)
This function uses the BeautifulSoup package to return an HTML object and a text string for a URL:
```python
def get_html_text(url):
    """Use BeautifulSoup to get the webpage text from the URL."""
    resp = requests.get(url, timeout=1000)
    html = BeautifulSoup(resp.text, 'html.parser')
    return {'html': html, 'text': html.get_text()}
```
1. Connect to the webpage.
2. Load the HTML data from the webpage.
3. Return the HTML data object and the text from the HTML data object.
This function uses BeautifulSoup to filter just the transcript URLs from the HTML data:
```python
def get_transcript_links(url):
    """Get URLs to episode transcripts from HTML page."""
    html = get_html_text(url)['html']
    url_objs = html.find_all('a')
    urls = [l.get('href') for l in url_objs]
    urls = pd.Series(data=urls, name='URL', dtype='string')
    urls = urls[urls.str.contains('transcript')]
    urls = urls[::-1]
    return urls.reset_index(drop=True)
```
1. Get the HTML data from the webpage, using the function defined above.
2. Filter using the a tag to get all of the link items.
3. Get the href item from each a tag, which is the text of the URL.
4. Convert from list to pandas series.
5. Filter to only include URLs containing the word transcript.
6. Reverse the order so the earliest episodes come first.
7. Return with the index (row numbers) reset.
There are 84 episodes available (as of early October 2023):
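A minimal sketch of that count, plus the helper that parses the episode number out of each URL, matching the annotated steps below (the listing-page URL is an assumption):

```python
# Assumed URL for the page that lists all the transcripts.
transcript_links = get_transcript_links('https://lingthusiasm.com/transcripts')
len(transcript_links)  # 84 as of early October 2023

def transcript_number_from_url(url):
    """Parse the episode number out of a transcript URL (sketch)."""
    start = url.find('episode-')              # 1: find location of 'episode-'
    number = url[start + 8:]                  # 2: 'episode-' is 8 characters long
    number = re.split(r'[^0-9]', number)[0]   # 3: trim any text after the number
    return int(number)                        # 4: convert from string to integer
```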
1. Find the location of episode-, since the episode number immediately follows it.
2. Subset the URL starting 8 characters after the start of the episode- string (i.e., immediately after it, since episode- is 8 characters long).
3. Most of the transcript URLs have more text after the number; if so, trim to keep just the number.
4. Return the episode number, converted from string to integer.
The text returned from the transcript pages has information at the top and bottom that we don’t need. "This is a transcript for" marks the start of the transcript, and "This work is licensed under" marks the end, so subset the transcript text to only include that section:
```python
def trim_transcript(text):
    """Find start and end of transcript text."""
    start_index = text.find('This is a transcript for')
    end_index = text.find('This work is licensed under')
    return text[start_index:end_index]
```
This function cleans up the text column so it plays a bit nicer with Excel CSV export:
```python
def clean_text(l):
    """Clean text column so it opens correctly as Excel CSV."""
    l = l.strip()
    l = l.replace('\u00A0', ' ').replace('–', '--').replace('…', '...')
    l = re.sub('“|”|‘|’', "'", l)
    return ' '.join(l.split())
```
1. Remove leading and trailing whitespace.
2. Replace non-breaking spaces, en dashes, and ellipses.
3. Replace slanted quotes.
4. Remove double spaces between words.
After a bit of trial and error, it’s easiest to split on the speaker names, since the paragraph formatting isn’t identical across all the pages:
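As a sketch, the split regex can be built from a list of speaker names (the names here are illustrative; the original pages include guest names as well):

```python
# Hypothetical speaker list; the real one would include every guest name.
# Wrapping each name in parentheses makes it a capture group, so re.split()
# keeps the speaker names in the output instead of dropping them.
speaker_names = ['Gretchen', 'Lauren']
speaker_regex = '|'.join(f'({name}:)' for name in speaker_names)
# speaker_regex == '(Gretchen:)|(Lauren:)'
```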
Regex looks like (Name1:)|(Name2:) etc. Surrounding names in parentheses makes it a capture group, so the names are included in the list of split items instead of dropped.
This function puts it all together to read the transcripts into one dataframe:
```python
def get_transcripts(urls):
    """Get transcript text from URLs and format into dataframe."""
    df = []
    for l in urls:
        cur_text = get_html_text(l)['text']
        cur_text = trim_transcript(cur_text)
        cur_lines = re.split(speaker_regex, cur_text)
        cur_lines = [l for l in cur_lines if l is not None]
        cur_lines = cur_lines[1:]
        speakers = cur_lines[::2]
        turns = cur_lines[1::2]
        cur_df = pd.DataFrame({
            'Episode': transcript_number_from_url(l),
            'Speaker': speakers,
            'Text': [clean_text(line) for line in turns],
        })
        cur_df['Speaker'] = cur_df['Speaker'].str.removesuffix(':')
        cur_df['Turn'] = cur_df.index + 1
        cur_df = cur_df.set_index(['Episode', 'Turn'], drop=True)
        df.append(cur_df)
    df = pd.concat(df)
    df['Speaker'] = df['Speaker'].astype('category')
    df['Text'] = df['Text'].astype('string')
    return df
```
1. Make a list to store results.
2. Read the text string (vs. the HTML object) from the URL, using the function defined above.
3. Trim to only include the transcript section, using the function defined above.
4. Split the string into a list with one item for each speaker turn, using the regex defined above.
5. Drop items in the list that are None (which are included because of the capture group syntax).
6. Drop the initial “This is a transcript for…” line.
7. Now all the odd items are speaker labels and the even items are turn text: [::2] is every 2nd item starting at 0, and [1::2] is every 2nd item starting at 1.
8. Clean up the strings and put everything into a dataframe.
9. Make a column for turn number, which is index + 1.
10. Set the Episode and Turn columns as indices.
11. Add to the list of parsed episodes.
12. Combine the list of dataframes into one dataframe. This works because each Episode + Turn index combination is unique.
13. Set datatypes explicitly, to avoid warnings later.
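A sketch of the calls the next annotations describe, reusing transcript_links from above:

```python
transcripts = get_transcripts(transcript_links)
# Show the first and last 10 rows as a styled table.
pd.concat([transcripts.head(10), transcripts.tail(10)]).style
```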
1. Run get_transcripts() on the list of links to each transcript, defined above.
2. Show the first and last 10 rows of the resulting dataframe. .style prints it as a nicer-looking table.
| Episode | Turn | Speaker | Text |
|---|---|---|---|
| 1 | 1 | Lauren | Welcome to Lingthusiasm! A podcast that's enthusiastic about linguistics. I'm Lauren Gawne. |
| 1 | 2 | Gretchen | And I'm Gretchen McCulloch. And today we're going to be talking about universal language and why it doesn't work. But first a little bit about what Lingthusiasm is. |
| 1 | 3 | Lauren | So I guess the important thing is to unpack 'a podcast that's enthusiastic about linguistics'. |
| 1 | 4 | Gretchen | Yeah! We chose our tagline because we're here to explore interesting things that language has to offer and especially what looking at language from a linguistics perspective can tell us about how language works. |
| 1 | 5 | Lauren | Both of us have blogs where we do a fair amount of that, but that's a fairly solitary enterprise and we wanted to take the opportunity to have more of a conversation -- something that's a little less disembodied |
| … | … | … | … |
| 84 | 184 | Lauren | Ah, well, that's okay, because now we're at the end of this episode, I'm pointing to everyone. [Music] |
| 84 | 185 | Lauren | For more Lingthusiasm and links to all the things pointed to in this episode, go to lingthusiasm.com. You can listen to us on Apple Podcasts, Google Podcasts, Spotify, SoundCloud, YouTube, or wherever else you get your podcasts. You can follow @lingthusiasm on Twitter, Facebook, Instagram, and Tumblr. You can get 'Etymology isn't Destiny' on t-shirts and tote bags and lots of other items, and aesthetic IPA posters, and other Lingthusiasm merch at lingthusiasm.com/merch. I tweet and blog as Superlinguo. |
| 84 | 186 | Gretchen | I can be found as @GretchenAMcC on Twitter, my blog is AllThingsLinguistic.com, and my book about internet language is called Because Internet. Lingthusiasm is able to keep existing thanks to the support of our patrons. If you wanna get an extra Lingthusiasm episode to listen to every month, our entire archive of bonus episodes to listen to right now, or if you just wanna help keep the show running ad-free, go to patreon.com/lingthusiasm or follow the links from our website. Patrons can also get access to our Discord chatroom to talk with other linguistics fans and be the first to find out about new merch and other announcements. Recent bonus topics include interviews with Sarah Dopierala and Martha Tsutsui-Billins about their own linguistic research and their work on Lingthusiasm and a very special Lingthusiasmr episode where we read The Harvard Sentences to you [ASMR voice] in a calm, soothing voice. [Regular voice] Can't afford to pledge? That's okay, too. We also really appreciate it if you can recommend Lingthusiasm to anyone in your life who's curious about language. |
| 84 | 187 | Lauren | Lingthusiasm is created and produced by Gretchen McCulloch and Lauren Gawne. Our Senior Producer is Claire Gawne, our Editorial Producer is Sarah Dopierala, our Production Assistant is Martha Tsutsui-Billins, and our Editorial Assistant is Jon Kruk. Our music is 'Ancient City' by The Triangles. |
These words aren’t all as controlled as you’d probably want for an experiment, but they’re high frequency enough that there are multiple tokens for each one in the episodes.
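A sketch of that dictionary, abridged to four of the ten vowels (the regexes here are illustrative, and the full word list appears in the token counts below):

```python
words = {
    'i': {'beat': u'\\bbeat\\b',
          'believe': u'\\bbelieve\\b',
          'people': u'\\bpeople\\b'},
    'ɪ': {'bit': u'\\bbit\\b',
          'finish': u'\\bfinish',        # no trailing \b: also matches finishes/finishing
          'pin': u'\\bpin\\b'},
    'ɛ': {'bet': u'\\bbet\\b',
          'guest': u'\\bguest',          # also matches guests
          'says': u'\\bsays\\b'},
    'ɑ': {'ball': u'\\bball\\b',
          'father': u'\\bfather\\b',
          'honorific': u'\\bhonorific'}, # also matches honorifics
    # ... plus entries for u, æ, ɔ, ə, ʊ, and ʌ (see the counts below)
}
```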
1. Set up a nested dictionary, where the outer keys are the IPA vowels, the inner keys are the words, and the values are the regex strings for the words.
2. \u makes it a unicode string.
3. \b means search at word boundaries, and only a few words here also include suffixes (e.g., only return results for bit, but return results for finish, finishes, and finishing).
4. Also get finishes/finishing.
5. Also get guests.
6. Also get honorifics. Using honorific instead of honor because there’s an episode about honorifics.
1.3.2 Find Words In Transcript
This function goes through the transcript dataframe and finds turns by Gretchen or Lauren containing the target words:
```python
def filter_for_words(words, df_all):
    """Find turns by Gretchen/Lauren that contain the target words."""
    df_gl = df_all[
        (df_all['Speaker'] == "Gretchen") | (df_all['Speaker'] == "Lauren")
    ]
    df_words = pd.DataFrame()
    for vowel, examples in words.items():
        for word, reg in examples.items():
            has_word = df_gl[
                df_gl['Text'].str.contains(pat=reg, flags=re.IGNORECASE)
            ]
            has_word.insert(0, 'Vowel', vowel)
            has_word.insert(1, 'Word', word)
            has_word.set_index(['Word', 'Vowel'], inplace=True, append=True)
            has_word.index = has_word.index.reorder_levels(
                ['Vowel', 'Word', 'Episode', 'Turn'])
            df_words = pd.concat([df_words, has_word])
    return df_words.sort_index()
```
1. Filter to only include rows where Speaker is Gretchen or Lauren, not guests.
2. Create a dataframe to store results.
3. Loop through the outer layer of the dictionary, where keys are vowels and values are dictionaries of example words.
4. Loop through the inner dictionaries, where keys are the words and values are the regexes.
5. Filter to only include rows where the Text column matches the word regex. re.IGNORECASE means the search is not case-sensitive.
6. Make columns for the current Vowel and Word, then add them to the index. The results dataframe is now indexed by unique combinations of Vowel, Word, Episode, and Turn.
7. Add results for the current word to the dataframe of all results.
8. Sorting the dataframe indices makes searching faster later, and pandas will throw warnings if you search this large-ish dataframe before sorting.
Next, this function trims the conversation turn around the target word, so it can be matched to the caption timestamps later:
```python
def trim_turn(df):
    """Find location of target word in conversation turn and subset
    -/+ 25 characters around it."""
    df['Text_Subset'] = pd.Series(dtype='string')
    for vowel, word, episode, turn in df.index:
        text_full = df.loc[vowel, word, episode, turn]['Text']
        word_loc = re.search(re.compile(word, re.IGNORECASE), text_full)
        word_loc = word_loc.span()[0]
        sub_start = max(word_loc - 25, 0)
        sub_end = min(word_loc + 25, len(text_full))
        text_sub = text_full[sub_start:sub_end]
        df.loc[(vowel, word, episode, turn), 'Text_Subset'] = str(text_sub)
    return df
```
1. Make a new column for the subset text, so it can be specified as a string. If you insert values later without initializing the column as a string, pandas will throw warnings.
2. Go through each row of the dataframe of transcript turns by Gretchen or Lauren that contain target words.
3. Use the index values to get the value of Text.
4. Search for the current row’s Word in the current row’s Text (again case-insensitive). This returns a match object, and the first item of its span() tuple is the location of the first letter of Word in Text.
5. The start index of Text_Subset is 25 characters before the start of the Word, or the beginning of the string, whichever is larger.
6. The end index of Text_Subset is 25 characters after the start of the Word, or the end of the string, whichever is smaller.
7. Insert Text_Subset into the dataframe.
Subset the transcripts dataframe to only include turns by Gretchen/Lauren that contain a target word:
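A sketch of those two steps (variable names assumed):

```python
# Filter to Gretchen/Lauren turns containing target words, then trim each
# turn to -/+ 25 characters around the target word.
words_df = trim_turn(filter_for_words(words, transcripts))
pd.concat([words_df.head(10), words_df.tail(10)]).style
```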
1. Run the two functions just defined on the dataframe of all the transcripts, to filter to only include turns by Gretchen or Lauren that contain the target words, then trim the text of each turn around the target word.
2. Show the first and last 10 rows of the resulting dataframe. .style prints it as a nicer-looking table.
| Vowel | Word | Episode | Turn | Speaker | Text | Text_Subset |
|---|---|---|---|---|---|---|
| i | beat | 1 | 119 | Gretchen | no, but you like, you just beat the potato but you don't necessarily deep-fried it you could just pan-fry it. | , but you like, you just beat the potato but you d |
| i | beat | 12 | 86 | Gretchen | Yeah, we make it into something like what we have. But another example is, so the sound /ɪ/ as in 'bit,' is not a sound that French has. So you'll have French speakers saying -- well, it kind of has it, but not really -- so if you have something like 'beat' versus 'bit,' for French speakers those are both just kind of 'beat.' | you have something like 'beat' versus 'bit,' for F |
| i | beat | 18 | 68 | Gretchen | And so you have your, like, duh-DUH beat, your iamb, with weak-strong -- | have your, like, duh-DUH beat, your iamb, with wea |
| i | beat | 18 | 100 | Gretchen | And the other translators tend to render it in prose, or in, like, shortened lines, but without paying attention to that beat in the same sort of way. | paying attention to that beat in the same sort of |
| i | beat | 22 | 208 | Lauren | Yep. I've already used the word 'proximal' in this episode so you're gonna beat me. | episode so you're gonna beat me. |
| … | … | … | … | … | … | … |
| ʌ | fun | 82 | 127 | Gretchen | Yes, we did. Because it's sometimes when you know that you're gonna need a whole bunch of examples in a text or an episode, it's fun to theme them so that you're not just reaching for the same -- like I think we used examples like 'I like cake' a lot, or like, 'I eat ice cream.' A lot of our examples are about ice cream and cake, which is fun. I mean, we do like both of these things, but sometimes, for an episode that's gonna be really example heavy, using horses and farmyard animals and stuff is a fun way to make it a little more distinct from other episodes. | text or an episode, it's fun to theme them so that |
| ʌ | fun | 82 | 137 | Gretchen | That was a deliberate choice back in the day. That was a style guide thing. Undoing that, now that it's become this unconscious thing that people are still doing, is more challenging. The fun thing, I guess, about how a lot of example sentences in old school syntax papers use 'John' and 'Mary' and 'Bill' is that there is someone who took a bunch of example sentences from papers since the refer to the same people and stitched them together into a single narrative about the adventures of John and Mary and Bill. | is more challenging. The fun thing, I guess, about |
| ʌ | fun | 83 | 1 | Gretchen | Welcome to Lingthusiasm, a podcast that's enthusiastic about linguistics! I'm Gretchen McCulloch. I'm here with Dr. Pedro Mateo Pedro who's an Assistant Professor at the University of Toronto, Canada, a native speaker of Q'anjob'al, and a learner of Kaqchikel. Today, we're getting enthusiastic about kids acquiring Indigenous languages. But first, some announcements. We love looking up whether two words that look kind of similar are actually historically related, but the history of a word doesn't have to define how it's used today. To celebrate how we can grow up to be more than we ever expected, we have new merch that says, 'Etymology isn't Destiny.' Our artist, Lucy Maddox, has made 'Etymology isn't Destiny' into a swoopy, cursive design with a fun little destiny star on the dot of the eye, available in black, white, and my personal favourite, rainbow gradient. This design is available on lots of different colours and styles of shirts. We've got hoodies, tank tops, t-shirts in classic fit, relaxed fit, curved fit -- plus mugs, notebooks, stickers, water bottles, zipper pouches. You know, if it's on Redbubble, we might've put 'Etymology isn't Destiny' on it. We also have tons of other lingthusiastic merch available in our merch store at lingthusiasm.com/merch. I have to say, it makes a great gift to give to a linguistics enthusiast in your life or to request as a gift if you are that linguistics enthusiast. We also wanna give a special shoutout to our aesthetic redesign of the International Phonetic Alphabet. Last year, we reorganised the classic IPA chart to have colours and have little cute circles and not just be boring grey lines of boxes and to even more elegantly represent the principle that the location of the symbols and rows and columns represents the place and degree of constriction in the mouth. I think it looks really cool. It's also a fun little puzzle to sit there and figure out which of the specific circles around different things stands for what. We've now made this aesthetic IPA chart redesign available on lots more merch options, including several different sizes of posters from small ones you can put on a corkboard to large ones you can put up in your hallway. They look really, really good, especially if you have some sort of office-y space that needs to be decorated. Plus, it's on tote bags and notebooks and t-shirts. If you want everyone you meet to know that you're a giant linguistics nerd, you can take them to conferences and use them to start nerdy conversations with people. If you like the idea of linguistics merch but none of ours so far is quite hitting your aesthetic, or if there's an item that Redbubble sells that you think one of our existing designs would look good on, we've added quite a few merch items in response to people's requests over the years, so we'd love to know where the gaps still are and keep an eye on lingthusiasm.com/merch. Our most recent bonus episode was a behind-the-scenes interview with Sarah Dopierala, who you may recognise as a name from the end credits, about what it's like doing transcripts from a linguistics perspective and her life generally as a linguistics grad student. You can go to patreon.com/lingthusiasm to get access to all of the many bonus episodes and to help Lingthusiasm keep running. [Music] | y, cursive design with a fun little destiny star o |
| ʌ | fun | 84 | 167 | Gretchen | I find this a fun and interesting way of thinking about meaning because the way that I think we're sometimes introduced to meaning in a formal context is through dictionary definitions that give you a description of a cat that says something like, 'A cat is a furry creature with four legs, a long tail, purrs,' whatever other attributes you wanna assign to a cat. But in practice, I didn't learn the word 'cat' by looking it up in a dictionary or having a description provided to me. I learned the word 'cat' by having a bunch of cats pointed out to me. Sometimes, a cat has three legs or is one of those weird hairless cats. | I find this a fun and interesting way o |
| ʌ | fun | 84 | 173 | Gretchen | We did a whole bonus episode about meaning and the ways that we interact with it talking about the 'Is a pizza a sandwich?' meme, which I would highly recommend because I think it's a very fun bonus episode. It turns out that this is actually how meaning works for most people in context. You see a bunch of examples of cats or birds or chairs or whatever. You come up with a sense of what's in the cluster based on generalising from those examples rather than by having an exact list of parameters because some stuff is an exception to those parameters, and it's still a point in the cluster. | ause I think it's a very fun bonus episode. It tur |
Here’s how many tokens we have for each speaker and word:
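A sketch of the counting step the annotation below describes (names assumed):

```python
token_counts = (
    words_df.groupby(['Vowel', 'Word', 'Speaker'], observed=True)
    .size()
    .to_frame(name='Tokens')   # name the column of counts
    .reset_index()             # turn Vowel/Word/Speaker back into columns
)
token_counts.style.hide()      # print as a table without row numbers
```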
Count the number of items in each combination of Vowel, Word, and Speaker. Convert those results to a dataframe and call the column of counts Tokens. Change Vowel, Word, and Speaker from indices to columns. Print the results as a table, not including row numbers.
| Vowel | Word | Speaker | Tokens |
|---|---|---|---|
| i | beat | Gretchen | 10 |
| i | beat | Lauren | 8 |
| i | believe | Gretchen | 25 |
| i | believe | Lauren | 23 |
| i | people | Gretchen | 981 |
| i | people | Lauren | 679 |
| u | blue | Gretchen | 29 |
| u | blue | Lauren | 11 |
| u | through | Gretchen | 123 |
| u | through | Lauren | 120 |
| u | who | Gretchen | 512 |
| u | who | Lauren | 329 |
| æ | bang | Gretchen | 4 |
| æ | bang | Lauren | 2 |
| æ | hand | Gretchen | 48 |
| æ | hand | Lauren | 31 |
| æ | laugh | Gretchen | 7 |
| æ | laugh | Lauren | 5 |
| ɑ | ball | Gretchen | 15 |
| ɑ | ball | Lauren | 11 |
| ɑ | father | Gretchen | 9 |
| ɑ | father | Lauren | 7 |
| ɑ | honorific | Gretchen | 4 |
| ɑ | honorific | Lauren | 5 |
| ɔ | bought | Gretchen | 5 |
| ɔ | bought | Lauren | 4 |
| ɔ | core | Gretchen | 7 |
| ɔ | core | Lauren | 2 |
| ɔ | wrong | Gretchen | 49 |
| ɔ | wrong | Lauren | 21 |
| ə | among | Gretchen | 18 |
| ə | among | Lauren | 6 |
| ə | famous | Gretchen | 19 |
| ə | famous | Lauren | 18 |
| ə | support | Gretchen | 42 |
| ə | support | Lauren | 24 |
| ɛ | bet | Gretchen | 13 |
| ɛ | bet | Lauren | 9 |
| ɛ | guest | Gretchen | 22 |
| ɛ | guest | Lauren | 3 |
| ɛ | says | Gretchen | 93 |
| ɛ | says | Lauren | 40 |
| ɪ | bit | Gretchen | 267 |
| ɪ | bit | Lauren | 242 |
| ɪ | finish | Gretchen | 8 |
| ɪ | finish | Lauren | 11 |
| ɪ | pin | Gretchen | 18 |
| ɪ | pin | Lauren | 5 |
| ʊ | could | Gretchen | 444 |
| ʊ | could | Lauren | 174 |
| ʊ | foot | Gretchen | 13 |
| ʊ | foot | Lauren | 6 |
| ʊ | put | Gretchen | 250 |
| ʊ | put | Lauren | 116 |
| ʌ | another | Gretchen | 239 |
| ʌ | another | Lauren | 130 |
| ʌ | but | Gretchen | 1785 |
| ʌ | but | Lauren | 1049 |
| ʌ | fun | Gretchen | 187 |
| ʌ | fun | Lauren | 57 |
Gretchen and Lauren each say all of the words more than once, and most of the words have 5+ tokens to pick from.
1.4 Get Timestamps
The transcripts on the Lingthusiasm website have all the speakers labelled, but no timestamps, and the captions on YouTube have timestamps, but few speaker labels. To find where in the episodes the token words are, we’ll have to combine them.
We’re going to be getting the caption data and audio from Lingthusiasm’s “all episodes” playlist, using the pytube package.
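A sketch of loading the playlist with pytube (the playlist URL is a placeholder):

```python
# Placeholder URL for the "all episodes" playlist.
playlist = Playlist('https://www.youtube.com/playlist?list=...')
video_urls = playlist.video_urls  # links to each episode's video
```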
This function uses the YouTube playlist link to download captions for each video:
```python
def get_captions(videos):
    """Get and format captions from YouTube URLs."""
    df = pd.DataFrame()
    for url in videos:
        video = YouTube(url)
        video.bypass_age_gate()
        title = video.title
        number = int(title[:2])
        caps = caption_version(video.captions)
        try:
            caps = parse_captions(caps.xml_captions)
        except AttributeError:
            caps = pd.DataFrame()
        caps.insert(0, 'Episode', number)
        df = pd.concat([df, caps])
    df = df[['Episode', 'Text', 'Start', 'End']]
    df[['Start', 'End']] = df[['Start', 'End']].astype('float32')
    return df.sort_values(['Episode', 'Start'])
```
1. Start a dataframe for results.
2. Go through the list of video URLs and open each one as a YouTube object.
3. Need to include this to access captions.
4. The video title is an attribute of the YouTube object, and the episode number is the first word of the title.
5. Load the captions, which returns a dictionary with language names as keys and captions as values. Select which English version to use (function defined below).
6. Convert the XML data to a dataframe (function defined below). If this doesn’t work, make an empty placeholder dataframe.
7. Add an Episode column (integer).
8. Add results from the current video to the dataframe of all results.
9. Return the dataframe of all captions, after organizing the columns and sorting the rows by episode number then time.
Now, some helper functions to select which caption version to download and then reformat the data from XML to a dataframe.
Caption data is most commonly formatted as XML or SRT, both of which are text formats but neither of which is easily convertible to a dataframe.
```python
def caption_version(cur_captions):
    """Select which version of the captions to use (formatting varies)."""
    if 'en' in cur_captions.keys():
        return cur_captions['en']
    elif 'en-GB' in cur_captions.keys():
        return cur_captions['en-GB']
    elif 'a.en' in cur_captions.keys():
        return cur_captions['a.en']


def caption_times(xml):
    """Find timestamps in XML data."""
    lines = pd.Series(xml.split('<p')[1:], dtype='string')
    df = lines.str.extract(r'(?P<Start>t="[0-9]*")')
    df['Start'] = df['Start'].str.removeprefix('t="').str.removesuffix('"')
    df['Start'] = df['Start'].astype('int')
    return df


def caption_text(xml):
    """Find text and filter out extra tags in XML data."""
    lines = pd.Series(xml.split('<p')[1:], dtype='string')
    tags = r'"[0-9]*"|t=|d=|w=|a=|ac=|</p>|</s>|<s\s|>\s|>|</body|</timedtext'
    texts = lines.str.replace(tags, '', regex=True)
    texts = texts.str.replace('&#39;', '\'') \
        .str.replace('&quot;', "'") \
        .str.replace('&lt;', '[') \
        .str.replace('&gt;', ']') \
        .str.replace('&amp;', ' and ')
    texts = [clean_text(line) for line in texts.to_list()]
    return pd.DataFrame({'Text': texts}, dtype='string')


def parse_captions(xml):
    """Combine timestamp rows and text rows."""
    times = caption_times(xml)
    texts = caption_text(xml)
    df = times.join(texts)
    df = df[df['Text'].str.contains(r'[a-z]')]
    df = df.reset_index(drop=True)
    df['End'] = df['Start'].shift(-1) - 1
    return df
```
1. Prefer the en captions, and if those aren’t available, try en-GB, then a.en. The main difference is in the formatting, and, I think, whether they are user- or auto-generated.
2. The XML data is one giant string. Split it into a list using the <p tag, which results in one string per caption item (timestamps, text, and various other tags).
3. Each caption item has a start time preceded by t=. This regex gets t="[numbers]" and puts each result into a column called Start.
4. Remove t=" from the beginning of the Start column and " from the end, leaving a number that can be converted from a string to an integer.
5. The XML data is one giant string. Split it into a list using the <p tag, as above.
6. This is a list of the other variables that come with the caption text, including t= for start time and d= for duration. The formatting varies somewhat based on which version of the captions you get, and the reason this selects the en captions before the en-GB and a.en captions is that their formatting is simpler.
7. Take all the tags/variables out, leaving only the actual caption text.
8. Replace HTML entities, remove special characters, and collapse extra spaces.
9. Return as a dataframe, with Text specified as a string column. If you leave it as the default object type, pandas will throw warnings later.
10. Get dataframes with the caption times and caption texts, using the two functions just defined. Join the results into one dataframe (this works because both are indexed by row number).
11. Remove rows where the caption text is blank, then reset the index (row number).
12. Make a column for the End time, which is the Start time of the next row minus 1.
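A sketch of the loading step the next annotations describe:

```python
captions = get_captions(video_urls)
captions = captions[captions['Episode'] <= 84]  # only analyze episodes 1-84
pd.concat([captions.head(10), captions.tail(10)]).style
```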
1. Load and format captions for each video in the “all episodes” playlist, using the functions defined above.
2. Only analyze episodes 1-84.
3. Show the first and last 10 rows of the resulting dataframe. .style prints it as a nicer-looking table.
|     | Episode | Text | Start | End |
|---|---|---|---|---|
| 0 | 1 | [introductory music playing] | 160 | 3189 |
| 1 | 1 | L: Welcome to Lingthusiasm a podcast that's | 3190 | 6759 |
| 2 | 1 | enthusiastic about linguistics | 6760 | 9218 |
| 3 | 1 | I'm Lauren Gawne. | 9219 | 11319 |
| 4 | 1 | G: and I'm Gretchen McCulloch and today we're going to be talking about | 11320 | 12399 |
| … | … | … | … | … |
| 449 | 84 | your life who's curious about language. L: Lingthusiasm is created and produced by | 2298000 | 2301899 |
| 450 | 84 | Gretchen McCulloch and Lauren Gawne. Our Senior Producer is Claire Gawne, | 2301900 | 2305559 |
| 451 | 84 | our Editorial Producer is Sarah Dopierala, our Production Assistant is Martha Tsutsui-Billins, | 2305560 | 2311679 |
| 452 | 84 | and our Editorial Assistant is Jon Kruk. Our music is 'Ancient City' by The Triangles. | 2311680 | 2316359 |
| 453 | 84 | G: Stay lingthusiastic! [Music] | 2316360 | nan |
Save results to data/captions.csv:
```python
captions.to_csv('data/captions.csv', index=False)
```
1.5 Match Timestamps to Transcripts
Now match the text from the transcript (target word start -/+ 25 characters) to the caption timestamps, using the thefuzz package to find the closest match.
```python
def match_transcript_times(df_trans, df_cap):
    """Use fuzzy matching to match transcript turn to caption timestamp."""
    df_times = df_trans.reset_index(drop=False)
    to_cat = ['Vowel', 'Word', 'Episode']
    df_times[to_cat] = df_times[to_cat].astype('category')
    df_times['Text_Caption'] = pd.Series(dtype='string')
    for i in df_times.index:
        episode = df_times.loc[i, 'Episode']
        captions_subset = df_cap[df_cap['Episode'] == episode]
        text = df_times.loc[i, 'Text_Subset']
        text = match_exceptions(text, search_exceptions)
        if text is not None:
            match = process.extractOne(text, captions_subset['Text'])
            if match is not None:
                df_times.loc[i, 'Text_Caption'] = match[0]
                df_times.loc[i, 'Match_Rate'] = match[1]
                df_times.loc[i, 'Start'] = captions_subset.loc[match[2], 'Start']
                df_times.loc[i, 'End'] = captions_subset.loc[match[2], 'End']
    return df_times
```
1. Make a copy of the transcripts-containing-target-words dataframe, convert Vowel, Word, and Episode from indices to columns, and specify that those columns are categorical.
2. Make a column for results and specify it as a string type. Again, this will work without this line, but will result in warnings from pandas.
3. Subset the captions dataframe to only include the episode of the current row, so we have less to search through.
4. Text from the transcript (target word -/+ 25 characters) to search for in the captions.
5. There are a few cases where the search text needs to be tweaked or subset to get the correct match, and a few cases where a token needs to be skipped because the audio quality wasn’t good enough. These cases are defined in the next code chunk.
6. Find the row in the captions dataframe (subset to just include the current episode) that best matches the search text (target word -/+ 25 characters).
7. If a match is identified, a tuple is returned where the first item is the matching text, the second item is the certainty (where 100 is an exact match), and the third item is the row index (in the Text column of the captions dataframe, subset to just include the current episode).
8. Use the row number from the match to get the corresponding Start and End times.
9. The results include all of the target words found in the transcript, with columns added for Text_Caption, Start, and End.
Fuzzy string matching accounts for a lot of the differences between the transcripts (which are proofread/edited) and the captions (not all proofread/edited, with some speaker names and initials mixed in). But it wasn’t always able to account for the differences in splitting sections, since the captions aren’t split by speaker turns. So, there are some cases where matching the right timestamp requires a transcript text that’s subset further (the trim list below) or tweaked (the replace dictionary below). This isn’t the most elegant solution, but it works (and it was still faster than opening up 84 30-minute audio files to find and extract 200 3-second clips).
Then, after going through all the clips and annotating the vowels (see Part 2), there are some that get skipped (the skip list below) because of the audio quality (e.g., overlapping speech).
```python
search_exceptions = {
    'replace': {
        "all to a sing": "all to a single place",
        "and you have a whole room full":
            "speakers who have a way of communicating with each other that",
        "'bang' or something": "bang or something",
        "emphasising the beat gestures in a very": "B gestures",
        "by the foot and they really want":
            "Yeah they rented the books by the foot and they really want",
        "father's brother's wife": "father's brother's wife",
        "foot/feet": "foot feet",
        "is known as a": "repetition is known as a",
        "an elastic thing with a ball on the end.":
            "paddle thing that has an elastic thing with a ball on the end",
        "pin things": "trying to use language names as a way to pin things",
        "red, green, yellow, blue, brown, pink, purple":
            "L: Black, white, red, green, yellow, blue, brown",
        "see; red, green, yellow, blue, brown, purple, pink":
            "green, yellow, blue, brown, purple, pink,",
        "tied up with": "complicated or really difficult histories",
        "to cite": "believe I got to",
        "your father's brother, and you": "your father's brother",
        "your father's brothers or your":
            "your father's brothers or your mother's brothers",
    },
    'trim': [
        'a lot of very core', 'a more honorific one if you',
        'an agreement among', 'audiobook', 'auditorily', 'ball of rocks',
        'between those little balls', 'big yellow', 'bonus topics',
        'bought a cot', 'bought! And it was true', 'bought some cheese',
        'but let me finish', 'core idea', 'core metaphor',
        'core set of imperatives', 'core thing', 'finished with their turn',
        'finished yours', 'guest, Gabrielle', 'guest, Pedro',
        'honorific register', 'how could this plan',
        'imperative or this honorific', 'letter from her father',
        'like if I bought', 'on which country you bought', 'pangrams',
        'pater -- father', 'returning special guest',
        'see among languages are based', 'substitute',
        'that did it with the Big Bang', 'the idea is that honorifics',
        'wanna hear more from our guests', 'with your elbow or your foot',
        'You bet', 'you tap your foot by the treat',
    ],
    'skip': [
        'believe you left this', 'bit of a rise', 'By the foot!',
        'can pin down', 'existing thanks', 'featuring', 'fun and delightful',
        'fun to flip', 'fun with adults', 'group of things', 'his data?',
        'in November', 'local differences', 'people notice', 'She says no',
        'situation where', 'things wrong', 'what letters', 'who are around you',
    ],
}


def match_exceptions(text, exceptions):
    """Check if text to match is an exception and gets replaced or subset."""
    for k, v in exceptions['replace'].items():
        if k in text:
            return v
    for s in exceptions['trim']:
        if s in text:
            return s
    return None if any(s in text for s in exceptions['skip']) else text
```
1. Cases where the search text needs to be replaced to match correctly.
2. Cases where the search text needs to be subset to match correctly.
3. Cases where the search text needs to be skipped because of audio quality.
4. If a replace key string is a substring of the search text, return that key’s value.
5. If a trim string is a substring of the search text, return that string.
6. If a skip string is a substring of the search text, return None. If none of the exceptions apply, return the original search text.
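A sketch of the matching step the next annotations describe (variable names and the output path are assumptions):

```python
matched = match_transcript_times(words_df, captions)
matched = matched[matched['Match_Rate'].notna()]  # keep rows with a match
matched.to_csv('data/matched_times.csv', index=False)  # hypothetical filename
```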
1. Get timestamps for the selected turns (function defined two code chunks above).
2. Most rows identify a match; filter out the rows that don’t.
3. Save all results.
Now select a subset to annotate. Pick the 5 highest matches (to caption timestamps from transcript) for each word + speaker, and if there are multiple good matches, pick the ones from the later episodes since those have higher sound quality.
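A sketch of that selection (names assumed):

```python
tokens = (
    matched.sort_values(['Match_Rate', 'Episode'], ascending=False)
    .groupby(['Vowel', 'Word', 'Speaker'], observed=True)
    .head(5)                                   # best 5 matches per group
    .sort_values(['Vowel', 'Word', 'Speaker'])
    .reset_index(drop=True)
)
# Within-group counter, used later to name the audio clips.
tokens['Token'] = (
    tokens.groupby(['Word', 'Speaker'], observed=True).cumcount() + 1
)
```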
1. Sort with the highest match rates and later episodes at the top. Then group by Vowel, Word, and Speaker and select the 5 highest within each Vowel + Word + Speaker group, which gets the closest matches; if there are multiple good matches, this picks the ones from later episodes (higher audio quality). Then re-sort by Vowel, Word, and Speaker and convert those from indices to columns.
2. Add a new column counting rows within each Word + Speaker combination, which we’ll use later to name files.
3. Sort the columns.
We have 5 tokens of most words (except for the words with fewer than 5 tokens in the transcripts):
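A sketch of that final count (names assumed):

```python
(
    tokens.groupby(['Vowel', 'Word', 'Speaker'], observed=True)
    .size()
    .to_frame(name='Tokens')
    .reset_index()
    .style.hide()   # drop the row numbers
)
```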
Count number of rows for each Vowel + Word + Speaker combination. Convert those from indices to columns, so this is a dataframe instead of a series, and thus can be printed as a table by making it a style object. .hide() drops the row numbers.