The first steps are to find the words for #1, and we’ll come back to #2 later.
1.1 Setup
"""Part 1 of Lingthusiasm Vowel Plots: Finding Vowels to Annotate."""import reimport pandas as pdimport requestsfrom bs4 import BeautifulSoupfrom thefuzz import processfrom pytube import Playlist, YouTube
1. `re`: regex functions.
2. `pandas`: dataframe functions.
3. `requests` and `bs4`: scraping data from webpages.
4. `thefuzz`: fuzzy string matching.
5. `pytube`: getting captions and audio from YouTube.
(Note: All of this could also be done in R, but I’ve done it in Python here primarily because there are Python packages that can get data from YouTube without having to set up anything on the YouTube API side. Here, all you need to do is install the package.)
1.2 Get Transcript Text & Speakers
1.2.1 List Episodes
The Lingthusiasm website has a page listing all of the available transcripts. Step 1 is to load that page and get the list of URLs to the transcripts. (They have similar but not identical structures, so it’s easiest to read the list from the website instead of trying to construct them here.)
This function uses the BeautifulSoup package to return an HTML object and a text string for a URL:
```python
def get_html_text(url):
    """Use BeautifulSoup to get the webpage text from the URL."""
    resp = requests.get(url, timeout=1000)
    html = BeautifulSoup(resp.text, 'html.parser')
    return {'html': html, 'text': html.get_text()}
```
1. Connect to the webpage.
2. Load the HTML data from the webpage.
3. Return the HTML data object and the text from the HTML data object.
This function uses BeautifulSoup to filter just the transcript URLs from the HTML data:
```python
def get_transcript_links(url):
    """Get URLs to episode transcripts from HTML page."""
    html = get_html_text(url)['html']
    url_objs = html.find_all('a')
    urls = [l.get('href') for l in url_objs]
    urls = pd.Series(data=urls, name='URL', dtype='string')
    urls = urls[urls.str.contains('transcript')]
    urls = urls[::-1]
    return urls.reset_index(drop=True)
```
1. Get the HTML data from the webpage, using the function defined above.
2. Filter using the a tag to get all of the link items.
3. Get the href item from each a tag, which is the text of the URL.
4. Convert from list to pandas series.
5. Filter to only include URLs containing the word transcript.
6. Reverse the order so the earliest episodes come first.
7. Return with the index (row numbers) reset.
There are 84 episodes available (as of early October 2023):
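A minimal sketch of that count, plus the helper that parses the episode number out of each URL, matching the annotated steps below (the listing-page URL is an assumption):

```python
# Assumed URL for the page that lists all the transcripts.
transcript_links = get_transcript_links('https://lingthusiasm.com/transcripts')
len(transcript_links)  # 84 as of early October 2023

def transcript_number_from_url(url):
    """Parse the episode number out of a transcript URL (sketch)."""
    start = url.find('episode-')              # 1: find location of 'episode-'
    number = url[start + 8:]                  # 2: 'episode-' is 8 characters long
    number = re.split(r'[^0-9]', number)[0]   # 3: trim any text after the number
    return int(number)                        # 4: convert from string to integer
```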
1. Find the location of episode-, since the episode number immediately follows it.
2. Subset the URL starting 8 characters after the start of the episode- string (i.e., immediately after it, since episode- is 8 characters long).
3. Most of the transcript URLs have more text after the number; if so, trim to keep just the number.
4. Return the episode number, converted from string to integer.
The text returned from the transcript pages has information at the top and bottom that we don’t need. "This is a transcript for" marks the start of the transcript, and "This work is licensed under" marks the end, so subset the transcript text to only include that section:
```python
def trim_transcript(text):
    """Find start and end of transcript text."""
    start_index = text.find('This is a transcript for')
    end_index = text.find('This work is licensed under')
    return text[start_index:end_index]
```
This function cleans up the text column so it plays a bit nicer with Excel CSV export:
```python
def clean_text(l):
    """Clean text column so it opens correctly as Excel CSV."""
    l = l.strip()
    l = l.replace('\u00A0', ' ').replace('–', '--').replace('…', '...')
    l = re.sub('“|”|‘|’', "'", l)
    return ' '.join(l.split())
```
1. Remove leading and trailing whitespace.
2. Replace non-breaking spaces, en dashes, and ellipses.
3. Replace slanted quotes.
4. Remove double spaces between words.
After a bit of trial and error, it’s easiest to split on the speaker names, since the paragraph formatting isn’t identical across all the pages:
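As a sketch, the split regex can be built from a list of speaker names (the names here are illustrative; the original pages include guest names as well):

```python
# Hypothetical speaker list; the real one would include every guest name.
# Wrapping each name in parentheses makes it a capture group, so re.split()
# keeps the speaker names in the output instead of dropping them.
speaker_names = ['Gretchen', 'Lauren']
speaker_regex = '|'.join(f'({name}:)' for name in speaker_names)
# speaker_regex == '(Gretchen:)|(Lauren:)'
```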
Regex looks like (Name1:)|(Name2:) etc. Surrounding names in parentheses makes it a capture group, so the names are included in the list of split items instead of dropped.
This function puts it all together to read the transcripts into one dataframe:
```python
def get_transcripts(urls):
    """Get transcript text from URLs and format into dataframe."""
    df = []
    for l in urls:
        cur_text = get_html_text(l)['text']
        cur_text = trim_transcript(cur_text)
        cur_lines = re.split(speaker_regex, cur_text)
        cur_lines = [l for l in cur_lines if l is not None]
        cur_lines = cur_lines[1:]
        speakers = cur_lines[::2]
        turns = cur_lines[1::2]
        cur_df = pd.DataFrame({
            'Episode': transcript_number_from_url(l),
            'Speaker': speakers,
            'Text': [clean_text(line) for line in turns],
        })
        cur_df['Speaker'] = cur_df['Speaker'].str.removesuffix(':')
        cur_df['Turn'] = cur_df.index + 1
        cur_df = cur_df.set_index(['Episode', 'Turn'], drop=True)
        df.append(cur_df)
    df = pd.concat(df)
    df['Speaker'] = df['Speaker'].astype('category')
    df['Text'] = df['Text'].astype('string')
    return df
```
1. Make a list to store results.
2. Read the text string (vs. the HTML object) from the URL, using the function defined above.
3. Trim to only include the transcript section, using the function defined above.
4. Split the string into a list with one item for each speaker turn, using the regex defined above.
5. Drop items in the list that are None (which are included because of the capture group syntax).
6. Drop the initial “This is a transcript for…” line.
7. Now all the odd items are speaker labels and the even items are turn text: [::2] is every 2nd item starting at 0, and [1::2] is every 2nd item starting at 1.
8. Clean up the strings and put everything into a dataframe.
9. Make a column for turn number, which is index + 1.
10. Set the Episode and Turn columns as indices.
11. Add to the list of parsed episodes.
12. Combine the list of dataframes into one dataframe. This works because each Episode + Turn index combination is unique.
13. Set datatypes explicitly, to avoid warnings later.
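A sketch of the calls the next annotations describe, reusing transcript_links from above:

```python
transcripts = get_transcripts(transcript_links)
# Show the first and last 10 rows as a styled table.
pd.concat([transcripts.head(10), transcripts.tail(10)]).style
```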
1. Run get_transcripts() on the list of links to each transcript, defined above.
2. Show the first and last 10 rows of the resulting dataframe. .style prints it as a nicer-looking table.
| Episode | Turn | Speaker | Text |
|---|---|---|---|
| 1 | 1 | Lauren | Welcome to Lingthusiasm! A podcast that's enthusiastic about linguistics. I'm Lauren Gawne. |
| 1 | 2 | Gretchen | And I'm Gretchen McCulloch. And today we're going to be talking about universal language and why it doesn't work. But first a little bit about what Lingthusiasm is. |
| 1 | 3 | Lauren | So I guess the important thing is to unpack 'a podcast that's enthusiastic about linguistics'. |
| 1 | 4 | Gretchen | Yeah! We chose our tagline because we're here to explore interesting things that language has to offer and especially what looking at language from a linguistics perspective can tell us about how language works. |
| 1 | 5 | Lauren | Both of us have blogs where we do a fair amount of that, but that's a fairly solitary enterprise and we wanted to take the opportunity to have more of a conversation -- something that's a little less disembodied |
| … | … | … | … |
| 84 | 184 | Lauren | Ah, well, that's okay, because now we're at the end of this episode, I'm pointing to everyone. [Music] |
| 84 | 185 | Lauren | For more Lingthusiasm and links to all the things pointed to in this episode, go to lingthusiasm.com. You can listen to us on Apple Podcasts, Google Podcasts, Spotify, SoundCloud, YouTube, or wherever else you get your podcasts. You can follow @lingthusiasm on Twitter, Facebook, Instagram, and Tumblr. You can get 'Etymology isn't Destiny' on t-shirts and tote bags and lots of other items, and aesthetic IPA posters, and other Lingthusiasm merch at lingthusiasm.com/merch. I tweet and blog as Superlinguo. |
| 84 | 186 | Gretchen | I can be found as @GretchenAMcC on Twitter, my blog is AllThingsLinguistic.com, and my book about internet language is called Because Internet. Lingthusiasm is able to keep existing thanks to the support of our patrons. If you wanna get an extra Lingthusiasm episode to listen to every month, our entire archive of bonus episodes to listen to right now, or if you just wanna help keep the show running ad-free, go to patreon.com/lingthusiasm or follow the links from our website. Patrons can also get access to our Discord chatroom to talk with other linguistics fans and be the first to find out about new merch and other announcements. Recent bonus topics include interviews with Sarah Dopierala and Martha Tsutsui-Billins about their own linguistic research and their work on Lingthusiasm and a very special Lingthusiasmr episode where we read The Harvard Sentences to you [ASMR voice] in a calm, soothing voice. [Regular voice] Can't afford to pledge? That's okay, too. We also really appreciate it if you can recommend Lingthusiasm to anyone in your life who's curious about language. |
| 84 | 187 | Lauren | Lingthusiasm is created and produced by Gretchen McCulloch and Lauren Gawne. Our Senior Producer is Claire Gawne, our Editorial Producer is Sarah Dopierala, our Production Assistant is Martha Tsutsui-Billins, and our Editorial Assistant is Jon Kruk. Our music is 'Ancient City' by The Triangles. |
These words aren’t all as controlled as you’d probably want for an experiment, but they’re high frequency enough that there are multiple tokens for each one in the episodes.
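A sketch of that dictionary, abridged to four of the ten vowels (the regexes here are illustrative, and the full word list appears in the token counts below):

```python
words = {
    'i': {'beat': u'\\bbeat\\b',
          'believe': u'\\bbelieve\\b',
          'people': u'\\bpeople\\b'},
    'ɪ': {'bit': u'\\bbit\\b',
          'finish': u'\\bfinish',        # no trailing \b: also matches finishes/finishing
          'pin': u'\\bpin\\b'},
    'ɛ': {'bet': u'\\bbet\\b',
          'guest': u'\\bguest',          # also matches guests
          'says': u'\\bsays\\b'},
    'ɑ': {'ball': u'\\bball\\b',
          'father': u'\\bfather\\b',
          'honorific': u'\\bhonorific'}, # also matches honorifics
    # ... plus entries for u, æ, ɔ, ə, ʊ, and ʌ (see the counts below)
}
```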
1. Set up a nested dictionary, where the outer keys are the IPA vowels, the inner keys are the words, and the values are the regex strings for the words.
2. \u makes it a unicode string.
3. \b means search at word boundaries, and only a few words here also include suffixes (e.g., only return results for bit, but return results for finish, finishes, and finishing).
4. Also get finishes/finishing.
5. Also get guests.
6. Also get honorifics. Using honorific instead of honor because there’s an episode about honorifics.
1.3.2 Find Words In Transcript
This function goes through the transcript dataframe and finds turns by Gretchen or Lauren containing the target words:
```python
def filter_for_words(words, df_all):
    """Find turns by Gretchen/Lauren that contain the target words."""
    df_gl = df_all[
        (df_all['Speaker'] == "Gretchen") | (df_all['Speaker'] == "Lauren")
    ]
    df_words = pd.DataFrame()
    for vowel, examples in words.items():
        for word, reg in examples.items():
            has_word = df_gl[
                df_gl['Text'].str.contains(pat=reg, flags=re.IGNORECASE)
            ]
            has_word.insert(0, 'Vowel', vowel)
            has_word.insert(1, 'Word', word)
            has_word.set_index(['Word', 'Vowel'], inplace=True, append=True)
            has_word.index = has_word.index.reorder_levels(
                ['Vowel', 'Word', 'Episode', 'Turn'])
            df_words = pd.concat([df_words, has_word])
    return df_words.sort_index()
```
1. Filter to only include rows where Speaker is Gretchen or Lauren, not guests.
2. Create a dataframe to store results.
3. Loop through the outer layer of the dictionary, where keys are vowels and values are dictionaries of example words.
4. Loop through the inner dictionaries, where keys are the words and values are the regexes.
5. Filter to only include rows where the Text column matches the word regex. re.IGNORECASE means the search is not case-sensitive.
6. Make columns for the current Vowel and Word, then add them to the index. The results dataframe is now indexed by unique combinations of Vowel, Word, Episode, and Turn.
7. Add results for the current word to the dataframe of all results.
8. Sorting the dataframe indices makes searching faster later, and pandas will throw warnings if you search this large-ish dataframe before sorting.
Next, this function trims the conversation turn around the target word, so it can be matched to the caption timestamps later:
```python
def trim_turn(df):
    """Find location of target word in conversation turn and subset
    -/+ 25 characters around it."""
    df['Text_Subset'] = pd.Series(dtype='string')
    for vowel, word, episode, turn in df.index:
        text_full = df.loc[vowel, word, episode, turn]['Text']
        word_loc = re.search(re.compile(word, re.IGNORECASE), text_full)
        word_loc = word_loc.span()[0]
        sub_start = max(word_loc - 25, 0)
        sub_end = min(word_loc + 25, len(text_full))
        text_sub = text_full[sub_start:sub_end]
        df.loc[(vowel, word, episode, turn), 'Text_Subset'] = str(text_sub)
    return df
```
1. Make a new column for the subset text, so it can be specified as a string. If you insert values later without initializing the column as a string, pandas will throw warnings.
2. Go through each row of the dataframe of transcript turns by Gretchen or Lauren that contain target words.
3. Use the index values to get the value of Text.
4. Search for the current row’s Word in the current row’s Text (again case-insensitive). This returns a match object, and the first item of its span() tuple is the location of the first letter of Word in Text.
5. The start index of Text_Subset is 25 characters before the start of the Word, or the beginning of the string, whichever is larger.
6. The end index of Text_Subset is 25 characters after the start of the Word, or the end of the string, whichever is smaller.
7. Insert Text_Subset into the dataframe.
Subset the transcripts dataframe to only include turns by Gretchen/Lauren that contain a target word:
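A sketch of those two steps (variable names assumed):

```python
# Filter to Gretchen/Lauren turns containing target words, then trim each
# turn to -/+ 25 characters around the target word.
words_df = trim_turn(filter_for_words(words, transcripts))
pd.concat([words_df.head(10), words_df.tail(10)]).style
```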
1. Run the two functions just defined on the dataframe of all the transcripts, to filter to only include turns by Gretchen or Lauren that contain the target words, then trim the text of each turn around the target word.
2. Show the first and last 10 rows of the resulting dataframe. .style prints it as a nicer-looking table.
| Vowel | Word | Episode | Turn | Speaker | Text | Text_Subset |
|---|---|---|---|---|---|---|
| i | beat | 1 | 119 | Gretchen | no, but you like, you just beat the potato but you don't necessarily deep-fried it you could just pan-fry it. | , but you like, you just beat the potato but you d |
| i | beat | 12 | 86 | Gretchen | Yeah, we make it into something like what we have. But another example is, so the sound /ɪ/ as in 'bit,' is not a sound that French has. So you'll have French speakers saying -- well, it kind of has it, but not really -- so if you have something like 'beat' versus 'bit,' for French speakers those are both just kind of 'beat.' | you have something like 'beat' versus 'bit,' for F |
| i | beat | 18 | 68 | Gretchen | And so you have your, like, duh-DUH beat, your iamb, with weak-strong -- | have your, like, duh-DUH beat, your iamb, with wea |
| i | beat | 18 | 100 | Gretchen | And the other translators tend to render it in prose, or in, like, shortened lines, but without paying attention to that beat in the same sort of way. | paying attention to that beat in the same sort of |
| i | beat | 22 | 208 | Lauren | Yep. I've already used the word 'proximal' in this episode so you're gonna beat me. | episode so you're gonna beat me. |
| … | … | … | … | … | … | … |
| ʌ | fun | 82 | 127 | Gretchen | Yes, we did. Because it's sometimes when you know that you're gonna need a whole bunch of examples in a text or an episode, it's fun to theme them so that you're not just reaching for the same -- like I think we used examples like 'I like cake' a lot, or like, 'I eat ice cream.' A lot of our examples are about ice cream and cake, which is fun. I mean, we do like both of these things, but sometimes, for an episode that's gonna be really example heavy, using horses and farmyard animals and stuff is a fun way to make it a little more distinct from other episodes. | text or an episode, it's fun to theme them so that |
| ʌ | fun | 82 | 137 | Gretchen | That was a deliberate choice back in the day. That was a style guide thing. Undoing that, now that it's become this unconscious thing that people are still doing, is more challenging. The fun thing, I guess, about how a lot of example sentences in old school syntax papers use 'John' and 'Mary' and 'Bill' is that there is someone who took a bunch of example sentences from papers since the refer to the same people and stitched them together into a single narrative about the adventures of John and Mary and Bill. | is more challenging. The fun thing, I guess, about |
| ʌ | fun | 83 | 1 | Gretchen | Welcome to Lingthusiasm, a podcast that's enthusiastic about linguistics! I'm Gretchen McCulloch. I'm here with Dr. Pedro Mateo Pedro who's an Assistant Professor at the University of Toronto, Canada, a native speaker of Q'anjob'al, and a learner of Kaqchikel. Today, we're getting enthusiastic about kids acquiring Indigenous languages. But first, some announcements. We love looking up whether two words that look kind of similar are actually historically related, but the history of a word doesn't have to define how it's used today. To celebrate how we can grow up to be more than we ever expected, we have new merch that says, 'Etymology isn't Destiny.' Our artist, Lucy Maddox, has made 'Etymology isn't Destiny' into a swoopy, cursive design with a fun little destiny star on the dot of the eye, available in black, white, and my personal favourite, rainbow gradient. This design is available on lots of different colours and styles of shirts. We've got hoodies, tank tops, t-shirts in classic fit, relaxed fit, curved fit -- plus mugs, notebooks, stickers, water bottles, zipper pouches. You know, if it's on Redbubble, we might've put 'Etymology isn't Destiny' on it. We also have tons of other lingthusiastic merch available in our merch store at lingthusiasm.com/merch. I have to say, it makes a great gift to give to a linguistics enthusiast in your life or to request as a gift if you are that linguistics enthusiast. We also wanna give a special shoutout to our aesthetic redesign of the International Phonetic Alphabet. Last year, we reorganised the classic IPA chart to have colours and have little cute circles and not just be boring grey lines of boxes and to even more elegantly represent the principle that the location of the symbols and rows and columns represents the place and degree of constriction in the mouth. I think it looks really cool. It's also a fun little puzzle to sit there and figure out which of the specific circles around different things stands for what. We've now made this aesthetic IPA chart redesign available on lots more merch options, including several different sizes of posters from small ones you can put on a corkboard to large ones you can put up in your hallway. They look really, really good, especially if you have some sort of office-y space that needs to be decorated. Plus, it's on tote bags and notebooks and t-shirts. If you want everyone you meet to know that you're a giant linguistics nerd, you can take them to conferences and use them to start nerdy conversations with people. If you like the idea of linguistics merch but none of ours so far is quite hitting your aesthetic, or if there's an item that Redbubble sells that you think one of our existing designs would look good on, we've added quite a few merch items in response to people's requests over the years, so we'd love to know where the gaps still are and keep an eye on lingthusiasm.com/merch. Our most recent bonus episode was a behind-the-scenes interview with Sarah Dopierala, who you may recognise as a name from the end credits, about what it's like doing transcripts from a linguistics perspective and her life generally as a linguistics grad student. You can go to patreon.com/lingthusiasm to get access to all of the many bonus episodes and to help Lingthusiasm keep running. [Music] | y, cursive design with a fun little destiny star o |
| ʌ | fun | 84 | 167 | Gretchen | I find this a fun and interesting way of thinking about meaning because the way that I think we're sometimes introduced to meaning in a formal context is through dictionary definitions that give you a description of a cat that says something like, 'A cat is a furry creature with four legs, a long tail, purrs,' whatever other attributes you wanna assign to a cat. But in practice, I didn't learn the word 'cat' by looking it up in a dictionary or having a description provided to me. I learned the word 'cat' by having a bunch of cats pointed out to me. Sometimes, a cat has three legs or is one of those weird hairless cats. | I find this a fun and interesting way o |
| ʌ | fun | 84 | 173 | Gretchen | We did a whole bonus episode about meaning and the ways that we interact with it talking about the 'Is a pizza a sandwich?' meme, which I would highly recommend because I think it's a very fun bonus episode. It turns out that this is actually how meaning works for most people in context. You see a bunch of examples of cats or birds or chairs or whatever. You come up with a sense of what's in the cluster based on generalising from those examples rather than by having an exact list of parameters because some stuff is an exception to those parameters, and it's still a point in the cluster. | ause I think it's a very fun bonus episode. It tur |
Here’s how many tokens we have for each speaker and word:
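A sketch of the counting step the annotation below describes (names assumed):

```python
token_counts = (
    words_df.groupby(['Vowel', 'Word', 'Speaker'], observed=True)
    .size()
    .to_frame(name='Tokens')   # name the column of counts
    .reset_index()             # turn Vowel/Word/Speaker back into columns
)
token_counts.style.hide()      # print as a table without row numbers
```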
Count the number of items in each combination of Vowel, Word, and Speaker. Convert those results to a dataframe and call the column of counts Tokens. Change Vowel, Word, and Speaker from indices to columns. Print the results as a table, not including row numbers.
| Vowel | Word | Speaker | Tokens |
|---|---|---|---|
| i | beat | Gretchen | 10 |
| i | beat | Lauren | 8 |
| i | believe | Gretchen | 25 |
| i | believe | Lauren | 23 |
| i | people | Gretchen | 981 |
| i | people | Lauren | 679 |
| u | blue | Gretchen | 29 |
| u | blue | Lauren | 11 |
| u | through | Gretchen | 123 |
| u | through | Lauren | 120 |
| u | who | Gretchen | 512 |
| u | who | Lauren | 329 |
| æ | bang | Gretchen | 4 |
| æ | bang | Lauren | 2 |
| æ | hand | Gretchen | 48 |
| æ | hand | Lauren | 31 |
| æ | laugh | Gretchen | 7 |
| æ | laugh | Lauren | 5 |
| ɑ | ball | Gretchen | 15 |
| ɑ | ball | Lauren | 11 |
| ɑ | father | Gretchen | 9 |
| ɑ | father | Lauren | 7 |
| ɑ | honorific | Gretchen | 4 |
| ɑ | honorific | Lauren | 5 |
| ɔ | bought | Gretchen | 5 |
| ɔ | bought | Lauren | 4 |
| ɔ | core | Gretchen | 7 |
| ɔ | core | Lauren | 2 |
| ɔ | wrong | Gretchen | 49 |
| ɔ | wrong | Lauren | 21 |
| ə | among | Gretchen | 18 |
| ə | among | Lauren | 6 |
| ə | famous | Gretchen | 19 |
| ə | famous | Lauren | 18 |
| ə | support | Gretchen | 42 |
| ə | support | Lauren | 24 |
| ɛ | bet | Gretchen | 13 |
| ɛ | bet | Lauren | 9 |
| ɛ | guest | Gretchen | 22 |
| ɛ | guest | Lauren | 3 |
| ɛ | says | Gretchen | 93 |
| ɛ | says | Lauren | 40 |
| ɪ | bit | Gretchen | 267 |
| ɪ | bit | Lauren | 242 |
| ɪ | finish | Gretchen | 8 |
| ɪ | finish | Lauren | 11 |
| ɪ | pin | Gretchen | 18 |
| ɪ | pin | Lauren | 5 |
| ʊ | could | Gretchen | 444 |
| ʊ | could | Lauren | 174 |
| ʊ | foot | Gretchen | 13 |
| ʊ | foot | Lauren | 6 |
| ʊ | put | Gretchen | 250 |
| ʊ | put | Lauren | 116 |
| ʌ | another | Gretchen | 239 |
| ʌ | another | Lauren | 130 |
| ʌ | but | Gretchen | 1785 |
| ʌ | but | Lauren | 1049 |
| ʌ | fun | Gretchen | 187 |
| ʌ | fun | Lauren | 57 |
Gretchen and Lauren each say all of the words more than once, and most of the words have 5+ tokens to pick from.
1.4 Get Timestamps
The transcripts on the Lingthusiasm website have all the speakers labelled, but no timestamps, and the captions on YouTube have timestamps, but few speaker labels. To find where in the episodes the token words are, we’ll have to combine them.
We’re going to be getting the caption data and audio from Lingthusiasm’s “all episodes” playlist, using the pytube package.
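A sketch of loading the playlist with pytube (the playlist URL is a placeholder):

```python
# Placeholder URL for the "all episodes" playlist.
playlist = Playlist('https://www.youtube.com/playlist?list=...')
video_urls = playlist.video_urls  # links to each episode's video
```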
This function uses the YouTube playlist link to download captions for each video:
```python
def get_captions(videos):
    """Get and format captions from YouTube URLs."""
    df = pd.DataFrame()
    for url in videos:
        video = YouTube(url)
        video.bypass_age_gate()
        title = video.title
        number = int(title[:2])
        caps = caption_version(video.captions)
        try:
            caps = parse_captions(caps.xml_captions)
        except AttributeError:
            caps = pd.DataFrame()
        caps.insert(0, 'Episode', number)
        df = pd.concat([df, caps])
    df = df[['Episode', 'Text', 'Start', 'End']]
    df[['Start', 'End']] = df[['Start', 'End']].astype('float32')
    return df.sort_values(['Episode', 'Start'])
```
1. Start a dataframe for results.
2. Go through the list of video URLs and open each one as a YouTube object.
3. Need to include this to access captions.
4. The video title is an attribute of the YouTube object, and the episode number is the first word of the title.
5. Load the captions, which returns a dictionary with language names as keys and captions as values. Select which English version to use (function defined below).
6. Convert the XML data to a dataframe (function defined below). If this doesn’t work, make an empty placeholder dataframe.
7. Add an Episode column (integer).
8. Add results from the current video to the dataframe of all results.
9. Return the dataframe of all captions, after organizing the columns and sorting the rows by episode number then time.
Now, some helper functions to select which caption version to download and then reformat the data from XML to a dataframe.
Caption data is most commonly formatted as XML or SRT, both of which are text formats but neither of which is easily convertible to a dataframe.
```python
def caption_version(cur_captions):
    """Select which version of the captions to use (formatting varies)."""
    if 'en' in cur_captions.keys():
        return cur_captions['en']
    elif 'en-GB' in cur_captions.keys():
        return cur_captions['en-GB']
    elif 'a.en' in cur_captions.keys():
        return cur_captions['a.en']


def caption_times(xml):
    """Find timestamps in XML data."""
    lines = pd.Series(xml.split('<p')[1:], dtype='string')
    df = lines.str.extract(r'(?P<Start>t="[0-9]*")')
    df['Start'] = df['Start'].str.removeprefix('t="').str.removesuffix('"')
    df['Start'] = df['Start'].astype('int')
    return df


def caption_text(xml):
    """Find text and filter out extra tags in XML data."""
    lines = pd.Series(xml.split('<p')[1:], dtype='string')
    tags = r'"[0-9]*"|t=|d=|w=|a=|ac=|</p>|</s>|<s\s|>\s|>|</body|</timedtext'
    texts = lines.str.replace(tags, '', regex=True)
    texts = texts.str.replace('&#39;', '\'') \
        .str.replace('&quot;', "'") \
        .str.replace('&lt;', '[') \
        .str.replace('&gt;', ']') \
        .str.replace('&amp;', ' and ')
    texts = [clean_text(line) for line in texts.to_list()]
    return pd.DataFrame({'Text': texts}, dtype='string')


def parse_captions(xml):
    """Combine timestamp rows and text rows."""
    times = caption_times(xml)
    texts = caption_text(xml)
    df = times.join(texts)
    df = df[df['Text'].str.contains(r'[a-z]')]
    df = df.reset_index(drop=True)
    df['End'] = df['Start'].shift(-1) - 1
    return df
```
1. Prefer the en captions, and if those aren’t available, try en-GB, then a.en. The main difference is in the formatting, and, I think, whether they are user- or auto-generated.
2. The XML data is one giant string. Split it into a list using the <p tag, which results in one string per caption item (timestamps, text, and various other tags).
3. Each caption item has a start time preceded by t=. This regex gets t="[numbers]" and puts each result into a column called Start.
4. Remove t=" from the beginning of the Start column and " from the end, leaving a number that can be converted from a string to an integer.
5. The XML data is one giant string. Split it into a list using the <p tag, as above.
6. This is a list of the other variables that come with the caption text, including t= for start time and d= for duration. The formatting varies somewhat based on which version of the captions you get, and the reason this selects the en captions before the en-GB and a.en captions is that their formatting is simpler.
7. Take all the tags/variables out, leaving only the actual caption text.
8. Replace HTML entities, remove special characters, and collapse extra spaces.
9. Return as a dataframe, with Text specified as a string column. If you leave it as the default object type, pandas will throw warnings later.
10. Get dataframes with the caption times and caption texts, using the two functions just defined. Join the results into one dataframe (this works because both are indexed by row number).
11. Remove rows where the caption text is blank, then reset the index (row number).
12. Make a column for the End time, which is the Start time of the next row minus 1.
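A sketch of the loading step the next annotations describe:

```python
captions = get_captions(video_urls)
captions = captions[captions['Episode'] <= 84]  # only analyze episodes 1-84
pd.concat([captions.head(10), captions.tail(10)]).style
```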
1. Load and format captions for each video in the “all episodes” playlist, using the functions defined above.
2. Only analyze episodes 1-84.
3. Show the first and last 10 rows of the resulting dataframe. .style prints it as a nicer-looking table.
|     | Episode | Text | Start | End |
|---|---|---|---|---|
| 0 | 1 | [introductory music playing] | 160 | 3189 |
| 1 | 1 | L: Welcome to Lingthusiasm a podcast that's | 3190 | 6759 |
| 2 | 1 | enthusiastic about linguistics | 6760 | 9218 |
| 3 | 1 | I'm Lauren Gawne. | 9219 | 11319 |
| 4 | 1 | G: and I'm Gretchen McCulloch and today we're going to be talking about | 11320 | 12399 |
| … | … | … | … | … |
| 449 | 84 | your life who's curious about language. L: Lingthusiasm is created and produced by | 2298000 | 2301899 |
| 450 | 84 | Gretchen McCulloch and Lauren Gawne. Our Senior Producer is Claire Gawne, | 2301900 | 2305559 |
| 451 | 84 | our Editorial Producer is Sarah Dopierala, our Production Assistant is Martha Tsutsui-Billins, | 2305560 | 2311679 |
| 452 | 84 | and our Editorial Assistant is Jon Kruk. Our music is 'Ancient City' by The Triangles. | 2311680 | 2316359 |
| 453 | 84 | G: Stay lingthusiastic! [Music] | 2316360 | nan |
Save results to data/captions.csv:
```python
captions.to_csv('data/captions.csv', index=False)
```
1.5 Match Timestamps to Transcripts
Now match the text from the transcript (target word start -/+ 25 characters) to the caption timestamps, using the thefuzz package to find the closest match.
```python
def match_transcript_times(df_trans, df_cap):
    """Use fuzzy matching to match transcript turn to caption timestamp."""
    df_times = df_trans.reset_index(drop=False)
    to_cat = ['Vowel', 'Word', 'Episode']
    df_times[to_cat] = df_times[to_cat].astype('category')
    df_times['Text_Caption'] = pd.Series(dtype='string')
    for i in df_times.index:
        episode = df_times.loc[i, 'Episode']
        captions_subset = df_cap[df_cap['Episode'] == episode]
        text = df_times.loc[i, 'Text_Subset']
        text = match_exceptions(text, search_exceptions)
        if text is not None:
            match = process.extractOne(text, captions_subset['Text'])
            if match is not None:
                df_times.loc[i, 'Text_Caption'] = match[0]
                df_times.loc[i, 'Match_Rate'] = match[1]
                df_times.loc[i, 'Start'] = captions_subset.loc[match[2], 'Start']
                df_times.loc[i, 'End'] = captions_subset.loc[match[2], 'End']
    return df_times
```
1. Make a copy of the transcripts-containing-target-words dataframe, convert Vowel, Word, and Episode from indices to columns, and specify that those columns are categorical.
2. Make a column for results and specify it as a string type. Again, this will work without this line, but will result in warnings from pandas.
3. Subset the captions dataframe to only include the episode of the current row, so we have less to search through.
4. Text from the transcript (target word -/+ 25 characters) to search for in the captions.
5. There are a few cases where the search text needs to be tweaked or subset to get the correct match, and a few cases where a token needs to be skipped because the audio quality wasn’t good enough. These cases are defined in the next code chunk.
6. Find the row in the captions dataframe (subset to just include the current episode) that best matches the search text (target word -/+ 25 characters).
7. If a match is identified, a tuple is returned where the first item is the matching text, the second item is the certainty (where 100 is an exact match), and the third item is the row index (in the Text column of the captions dataframe, subset to just include the current episode).
8. Use the row number from the match to get the corresponding Start and End times.
9. The results include all of the target words found in the transcript, with columns added for Text_Caption, Start, and End.
Fuzzy string matching accounts for a lot of the differences between the transcripts (which are proofread/edited) and the captions (not all proofread/edited, with some speaker names and initials mixed in). But it wasn’t always able to account for the differences in splitting sections, since the captions aren’t split by speaker turns. So, there are some cases where matching the right timestamp requires a transcript text that’s subset further (the trim list below) or tweaked (the replace dictionary below). This isn’t the most elegant solution, but it works (and it was still faster than opening up 84 30-minute audio files to find and extract 200 3-second clips).
Then, after going through all the clips and annotating the vowels (see Part 2), there are some that get skipped (the skip list below) because of the audio quality (e.g., overlapping speech).
```python
search_exceptions = {
    'replace': {
        "all to a sing": "all to a single place",
        "and you have a whole room full":
            "speakers who have a way of communicating with each other that",
        "'bang' or something": "bang or something",
        "emphasising the beat gestures in a very": "B gestures",
        "by the foot and they really want":
            "Yeah they rented the books by the foot and they really want",
        "father's brother's wife": "father's brother's wife",
        "foot/feet": "foot feet",
        "is known as a": "repetition is known as a",
        "an elastic thing with a ball on the end.":
            "paddle thing that has an elastic thing with a ball on the end",
        "pin things": "trying to use language names as a way to pin things",
        "red, green, yellow, blue, brown, pink, purple":
            "L: Black, white, red, green, yellow, blue, brown",
        "see; red, green, yellow, blue, brown, purple, pink":
            "green, yellow, blue, brown, purple, pink,",
        "tied up with": "complicated or really difficult histories",
        "to cite": "believe I got to",
        "your father's brother, and you": "your father's brother",
        "your father's brothers or your":
            "your father's brothers or your mother's brothers",
    },
    'trim': [
        'a lot of very core', 'a more honorific one if you',
        'an agreement among', 'audiobook', 'auditorily', 'ball of rocks',
        'between those little balls', 'big yellow', 'bonus topics',
        'bought a cot', 'bought! And it was true', 'bought some cheese',
        'but let me finish', 'core idea', 'core metaphor',
        'core set of imperatives', 'core thing', 'finished with their turn',
        'finished yours', 'guest, Gabrielle', 'guest, Pedro',
        'honorific register', 'how could this plan',
        'imperative or this honorific', 'letter from her father',
        'like if I bought', 'on which country you bought', 'pangrams',
        'pater -- father', 'returning special guest',
        'see among languages are based', 'substitute',
        'that did it with the Big Bang', 'the idea is that honorifics',
        'wanna hear more from our guests', 'with your elbow or your foot',
        'You bet', 'you tap your foot by the treat',
    ],
    'skip': [
        'believe you left this', 'bit of a rise', 'By the foot!',
        'can pin down', 'existing thanks', 'featuring', 'fun and delightful',
        'fun to flip', 'fun with adults', 'group of things', 'his data?',
        'in November', 'local differences', 'people notice', 'She says no',
        'situation where', 'things wrong', 'what letters', 'who are around you',
    ],
}


def match_exceptions(text, exceptions):
    """Check if text to match is an exception and gets replaced or subset."""
    for k, v in exceptions['replace'].items():
        if k in text:
            return v
    for s in exceptions['trim']:
        if s in text:
            return s
    return None if any(s in text for s in exceptions['skip']) else text
```
1. Cases where the search text needs to be replaced to match correctly.
2. Cases where the search text needs to be subset to match correctly.
3. Cases where the search text needs to be skipped because of audio quality.
4. If a replace key string is a substring of the search text, return that key’s value.
5. If a trim string is a substring of the search text, return that string.
6. If a skip string is a substring of the search text, return None. If none of the exceptions apply, return the original search text.
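A sketch of the matching step the next annotations describe (variable names and the output path are assumptions):

```python
matched = match_transcript_times(words_df, captions)
matched = matched[matched['Match_Rate'].notna()]  # keep rows with a match
matched.to_csv('data/matched_times.csv', index=False)  # hypothetical filename
```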
1. Get timestamps for the selected turns (function defined two code chunks above).
2. Most rows identify a match; filter out the rows that don’t.
3. Save all results.
Now select a subset to annotate. Pick the 5 highest matches (to caption timestamps from transcript) for each word + speaker, and if there are multiple good matches, pick the ones from the later episodes since those have higher sound quality.
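A sketch of that selection (names assumed):

```python
tokens = (
    matched.sort_values(['Match_Rate', 'Episode'], ascending=False)
    .groupby(['Vowel', 'Word', 'Speaker'], observed=True)
    .head(5)                                   # best 5 matches per group
    .sort_values(['Vowel', 'Word', 'Speaker'])
    .reset_index(drop=True)
)
# Within-group counter, used later to name the audio clips.
tokens['Token'] = (
    tokens.groupby(['Word', 'Speaker'], observed=True).cumcount() + 1
)
```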
1. Sort with the highest match rates and later episodes at the top. Then group by Vowel, Word, and Speaker and select the 5 highest within each Vowel + Word + Speaker group, which gets the closest matches; if there are multiple good matches, this picks the ones from later episodes (higher audio quality). Then re-sort by Vowel, Word, and Speaker and convert those from indices to columns.
2. Add a new column counting rows within each Word + Speaker combination, which we’ll use later to name files.
3. Sort the columns.
We have 5 tokens of most words (except for the words with fewer than 5 tokens in the transcripts):
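A sketch of that final count (names assumed):

```python
(
    tokens.groupby(['Vowel', 'Word', 'Speaker'], observed=True)
    .size()
    .to_frame(name='Tokens')
    .reset_index()
    .style.hide()   # drop the row numbers
)
```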
Count number of rows for each Vowel + Word + Speaker combination. Convert those from indices to columns, so this is a dataframe instead of a series, and thus can be printed as a table by making it a style object. .hide() drops the row numbers.