Now to actually make the vowel plots! This document goes into detail about how I decided to make them the way I did and how to implement them in ggplot, but if you just want to see the final results, skip ahead to the finished plots.
However, the IPA symbols aren’t encoded correctly. They’ll render in RStudio, but not when Quarto renders the document to HTML, and not always when ggplot renders the plots. This isn’t what we want:
ɑ, æ, ɔ, ə, ɛ, i, ɪ, o, u, ʊ, ʌ
So, the next step is to enter the Unicode values manually (copied from this Wikipedia page):
vowels <- c(
  "i_lower"   = "\u0069", # i (close front unrounded)
  "i_upper"   = "\u026A", # ɪ (near-close front unrounded)
  "epsilon"   = "\u025B", # ɛ (open-mid front unrounded)
  "ash"       = "\u00E6", # æ (near-open front unrounded)
  "schwa"     = "\u0259", # ə (mid central)
  "horseshoe" = "\u028A", # ʊ (near-close near-back rounded)
  "u"         = "\u0075", # u (close back rounded)
  "o"         = "\u006F", # o (close-mid back rounded)
  "hat"       = "\u028C", # ʌ (open-mid back unrounded)
  "open_o"    = "\u0254", # ɔ (open-mid back rounded)
  "alpha"     = "\u0251"  # ɑ (open back unrounded)
)
These are ordered from front to back, then close to open (Figure 4).
Then match the Unicode for each IPA symbol to the words:
formants %<>%
  mutate(
    Vowel = case_when(
      Word %in% c("ball", "father", "honorific", "lot", "palm", "start") ~ vowels["alpha"],
      Word %in% c("bang", "bath", "hand", "laugh", "trap") ~ vowels["ash"],
      Word %in% c("bought", "cloth", "core", "north", "thought", "wrong") ~ vowels["open_o"],
      Word %in% c("among", "famous", "support") ~ vowels["schwa"],
      Word %in% c("bet", "dress", "guest", "says", "square") ~ vowels["epsilon"],
      Word %in% c("beat", "believe", "fleece", "people") ~ vowels["i_lower"],
      Word %in% c("bit", "finish", "kit", "near", "pin") ~ vowels["i_upper"],
      Word %in% c("force", "goat") ~ vowels["o"],
      Word %in% c("blue", "goose", "through", "who") ~ vowels["u"],
      Word %in% c("could", "cure", "put", "foot") ~ vowels["horseshoe"],
      Word %in% c("another", "but", "fun", "strut") ~ vowels["hat"],
    ) %>%
      factor(levels = vowels, ordered = TRUE)
  )
str(formants)
1
If the value in the Word column is ball, father, honorific, lot, palm, or start, then assign the alpha value from the vowels list.
2
Convert character to factor, then specify the order of the factors (same as in vowels list above) to make sure it stays consistent.
The Lingthusiasm font is Josefin Sans, which is available from Google Fonts.
I downloaded and installed it to my computer. There are a number of different ways to add new fonts without having to install them separately outside of RStudio, such as font_add_google() from the showtext package. However, that method was causing errors rendering the IPA symbols.
system_fonts() from the systemfonts package shows the list of fonts installed on my computer that R recognizes, and it finds Josefin Sans:
# A tibble: 3 × 2
name value
<chr> <chr>
1 path "C:\\Users\\betha\\AppData\\Local\\Microsoft\\Windows\\Fonts\\JosefinS…
2 name "JosefinSans-Thin"
3 family "Josefin Sans"
However, the fonts loaded by default just include Times New Roman, Arial, and Courier New:
windowsFonts()
$serif
[1] "TT Times New Roman"
$sans
[1] "TT Arial"
$mono
[1] "TT Courier New"
This tells R to load Josefin Sans into the set of available fonts, so text will render in Josefin Sans if family = "sans_alt", but stick with the default sans font otherwise (and not break the IPA symbols).
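As a rough sketch of what that registration step might look like: windowsFonts() and windowsFont() only exist on Windows builds of R, so the call is guarded here. The alias sans_alt matches the family name used in later chunks; the assumption is that Josefin Sans is already installed system-wide, and font_registered is a hypothetical helper variable.

```r
# Sketch: register Josefin Sans under the alias "sans_alt" (Windows only).
# Assumes the font has already been installed on the computer.
if (.Platform$OS.type == "windows") {
  windowsFonts(sans_alt = windowsFont("Josefin Sans"))
  font_registered <- "sans_alt" %in% names(windowsFonts())
} else {
  # macOS/Linux would need a different mechanism; nothing is registered here.
  font_registered <- FALSE
}
font_registered
```

The default serif, sans, and mono entries are left untouched, which is what keeps the IPA symbols rendering in the default sans font.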
Take the full data set, group it by Speaker then List then Vowel, and then calculate the means of F1 and F2 for each Speaker x List x Vowel.
2
All layers of the plot have F2 on the X axis, F1 on the Y axis, and are labelled by Vowel.
3
Write the vowel symbols (because label = Vowel) at the location of their means.
4
Make the text box background lingthusiasm green with no outline.
5
Make the text white, size 4.5 (note that this is on a different scale than the rest of the text sizes specified later in theme()), vertically and horizontally centered.
6
Set the size of the text boxes, using snpc (square normalized parent coordinates) so that they scale with the size of the plot but always stay square.
7
No margins inside the text boxes and a slight curve on the corners.
8
Split the plot to have Gretchen’s data in the top panels and Lauren’s data in the bottom panels, and the data from the Lingthusiasm episodes in the left panels and the data from the Wells lexical set recordings in the right panels.
9
Change the default theme to have a white background with no grid lines.
10
Change all the lines (axis lines, axis ticks, outline around panels, outline around panel labels) to be the lingthusiasm navy.
11
Make all the text navy Josefin Sans. Set the base size as 12, but make the text of the speaker panel labels bigger.
12
Set the title, and leave the other axis/legend labels as their default values of “F1”, “F2”, and “Vowel.”
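The steps above can be sketched roughly as follows. This is a reconstruction under assumptions, not the original chunk: the toy data frame stands in for formants (every number is invented), green_standin and navy_standin are placeholder colors for the Lingthusiasm palette, the box size and corner radius are guesses, and facet_grid() is one way to get the speaker-by-list panel layout.

```r
library(dplyr)
library(ggplot2)
library(ggtext)

# Toy stand-in for `formants`; all values are invented for illustration.
toy <- tibble(
  Speaker = rep(c("Gretchen", "Lauren"), each = 8),
  List    = rep(rep(c("Lingthusiasm Episodes", "Wells Lexical Set"), each = 4), times = 2),
  Vowel   = rep(c("i", "i", "u", "u"), times = 4),
  F1      = c(300, 320, 340, 360, 310, 330, 350, 370,
              290, 310, 330, 350, 305, 325, 345, 365),
  F2      = c(2300, 2200, 900, 1000, 2250, 2150, 950, 1050,
              2400, 2300, 1000, 1100, 2350, 2250, 1050, 1150)
)

# Placeholder colors; the post's actual hex codes are defined elsewhere.
green_standin <- "darkgreen"
navy_standin  <- "navy"

# Mean F1 and F2 for each Speaker x List x Vowel.
toy_means <- toy %>%
  summarise(.by = c(Speaker, List, Vowel), F1 = mean(F1), F2 = mean(F2))

means_plot <- ggplot(toy_means, aes(x = F2, y = F1, label = Vowel)) +
  geom_textbox(
    fill = green_standin, box.colour = NA,              # filled box, no outline
    colour = "white", size = 4.5, halign = 0.5, valign = 0.5,
    width = unit(0.1, "snpc"), height = unit(0.1, "snpc"),
    box.padding = unit(c(0, 0, 0, 0), "pt"), box.r = unit(3, "pt")
  ) +
  facet_grid(Speaker ~ List) +
  theme_classic() +
  theme(text = element_text(colour = navy_standin, size = 12)) +
  labs(title = "Vowel means (toy data)")
```

Printing means_plot draws one square label per vowel mean in each of the four panels.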
(Sidenote: saving the theme specifications so we don’t have to keep retyping theme.)
However, vowel plots typically have their axes reversed, so that the highest value of (F1, F2) is at the bottom left corner instead of the top right corner. This isn’t standard data visualization procedure, but it has a cool and useful result.
Just annotating the lines that changed from the previous chunk.
2
Add this to flip the X axis. Specify breaks because the default values aren’t even.
3
Add this to flip the Y axis. The limits (see how they’re reversed) are specified because the defaults were a bit too narrow, and like this the axis ticks/labels are spaced more evenly.
4
Add this to specify the colors and font sizes etc.
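A minimal sketch of the axis flip, using invented means for three vowels. The break and limit values are placeholders chosen for the toy data, not the ones used for the real plots.

```r
library(ggplot2)

# Invented vowel means (roughly in Hz), purely for illustration.
toy_means <- data.frame(
  Vowel = c("i", "u", "a"),
  F1 = c(300, 350, 800),
  F2 = c(2300, 900, 1300)
)

reversed_plot <- ggplot(toy_means, aes(x = F2, y = F1, label = Vowel)) +
  geom_text(size = 6) +
  # Flip the X axis and set explicit breaks (hypothetical values).
  scale_x_reverse(breaks = seq(2500, 500, by = -500)) +
  # Flip the Y axis; the limits are written high-to-low to match.
  scale_y_reverse(limits = c(900, 200)) +
  theme_classic()
```

With both scales reversed, the high front vowel lands in the top left corner and the open vowel at the bottom.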
Now the layout resembles the IPA vowel chart! Front vowels are on the left, and back vowels are on the right; close vowels are on the top, and open vowels are on the bottom.
3.3 Plot Individual Data Points
Just plotting the means for each vowel loses a lot of information, so let’s take a look at the underlying data.
Now, we’ll distinguish between vowels by color. First, make a legend that will be easier to read than the default by creating a string that prints each vowel in its corresponding color (using the ggtext package to render the HTML formatting).
Vowel column is the list of vowels (unicode codes). Color column is the hex codes from the Bold palette in rcartocolor, the color set we’ve been using so far. (Using the first 11 values from the full palette, so the last color isn’t gray.)
2
Encase with HTML code, so that hex code becomes a color argument for the vowel character.
3
Merge into 1 string, with each value separated by a comma + space.
4
Print, wrapping lines on each item.
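One way those four steps could look in base R plus rcartocolor. The variable names (vowel_symbols, vowel_colors, vowel_spans, vowel_legend) are hypothetical, and plain paste0() is used here rather than whatever string helpers the original chunk relied on.

```r
library(rcartocolor)

# IPA vowels as Unicode escapes (same code points as the `vowels` vector).
vowel_symbols <- c("\u0069", "\u026A", "\u025B", "\u00E6", "\u0259", "\u028A",
                   "\u0075", "\u006F", "\u028C", "\u0254", "\u0251")

# First 11 of the 12 "Bold" colors, so the trailing gray is dropped.
vowel_colors <- carto_pal(12, "Bold")[1:11]

# Wrap each vowel in an HTML span that carries its color.
vowel_spans <- paste0("<span style='color:", vowel_colors, "'>", vowel_symbols, "</span>")

# Collapse into a single string, separated by comma + space.
vowel_legend <- paste(vowel_spans, collapse = ", ")

# Print with one item per line for readability.
cat(gsub(", ", ",\n", vowel_legend, fixed = TRUE), "\n")
```

The resulting string can be passed to labs(subtitle = ...) and rendered with element_markdown().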
(Note that ggplot will throw a warning like Warning in text_info(label, fontkey, fontfamily, font, fontsize, cache): unable to translate '<U+0251>png215' to native encoding, but it renders correctly, so the warnings are turned off in those code chunks.)
Passing the full data set, not the means by Speaker + Vowel + List, to ggplot.
2
Instead of geom_text(), use geom_point() to draw a scatterplot layer (with size set to make the points slightly bigger than the default).
3
Use the Bold color palette from the rcartocolor package to color-code the vowels. (There are 11 vowels, but I specify 12 colors here so the grey gets skipped.)
4
Limits need to be slightly bigger than in the plots with vowel means, and then breaks adjusted so that the Y axis labels don’t overlap with each other between the two panels.
5
element_markdown() from ggtext will render the HTML string. Use default sans serif font because Josefin Sans doesn’t have all of the IPA symbols.
6
Add the color-coded list of vowels as a subtitle.
7
Turn the default legend off.
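Put together, the scatterplot layers might look something like this. It is a sketch on invented token-level data for three vowels; scale_color_manual() with a carto_pal() vector is an assumption about how the palette was applied, and only three colors are needed for the toy data.

```r
library(ggplot2)
library(ggtext)
library(rcartocolor)

set.seed(1)
# Invented data for three vowels; not the post's measurements.
vowel_levels <- c("i", "u", "\u0251")
toy <- data.frame(
  Vowel = factor(rep(vowel_levels, each = 10), levels = vowel_levels),
  F1 = c(rnorm(10, 300, 30), rnorm(10, 350, 30), rnorm(10, 750, 40)),
  F2 = c(rnorm(10, 2300, 100), rnorm(10, 900, 100), rnorm(10, 1200, 100))
)

pal <- carto_pal(12, "Bold")[1:3]
legend_html <- paste0(
  "<span style='color:", pal, "'>", vowel_levels, "</span>",
  collapse = ", "
)

points_plot <- ggplot(toy, aes(x = F2, y = F1, color = Vowel)) +
  geom_point(size = 2) +                      # slightly bigger than the default
  scale_color_manual(values = pal) +
  scale_x_reverse() +
  scale_y_reverse() +
  theme_classic() +
  theme(
    plot.subtitle = element_markdown(family = "sans"),  # render the HTML string
    legend.position = "none"                            # default legend off
  ) +
  labs(subtitle = legend_html)
```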
3.4 Plot Word Means
The data for each vowel consists of 3 different words. How different are they? First, let’s look at the Wells Lexical Set.
These next plots use the ggrepel package to make the word labels not overlap with each other or with the scatterplot points.
The panels are stacked vertically, because the word labels take up more space. fig-asp: 1.25 in this code chunk’s header makes it render tall enough.
One thing that makes this plot a bit hard to interpret is that it’s not immediately clear which vowel in the word is the one being plotted. So, let’s make the vowel bold relative to the rest of the word.
So far we’ve been using ggtext to format text, but that doesn’t work with ggrepel. The workaround, like with the IPA vowels, is to just enter the unicode characters directly.
These are the codes for the Mathematical Sans Serif capital letters, in regular and bold faces. They’re copy-pasted in here manually even though the pattern is predictable, because procedurally generating strings with the \u prefix is a pain.
Function takes word as a string and alphabet_reg as a named list.
2
Split the word into individual letters.
3
Start string for the converted word.
4
For each letter, use the fact that alphabet_reg is named with the regular letters to get the unicode string for the current letter. Concatenate the letter pulled from alphabet_reg to the converted string.
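A sketch of a function along those lines. Two things here are assumptions rather than the post's code: the lookup lists are generated with intToUtf8() instead of being pasted in by hand (the Mathematical Sans-Serif capitals start at U+1D5A0 and the bold ones at U+1D5D4), and the lists are named with lowercase letters.

```r
# Lookup tables: names are the regular lowercase letters, values are the
# Mathematical Sans-Serif capitals (regular and bold).
alphabet_reg  <- setNames(as.list(intToUtf8(0x1D5A0 + 0:25, multiple = TRUE)), letters)
alphabet_bold <- setNames(as.list(intToUtf8(0x1D5D4 + 0:25, multiple = TRUE)), letters)

to_unicode_caps <- function(word, alphabet_reg) {
  # Split the word into individual letters.
  chars <- strsplit(word, "")[[1]]
  # Start the converted string empty.
  converted <- ""
  # Look up each letter by name and append its Unicode counterpart.
  for (ch in chars) {
    converted <- paste0(converted, alphabet_reg[[ch]])
  }
  converted
}

kit_label <- to_unicode_caps("kit", alphabet_reg)
```

This only handles lowercase a to z; any other character has no entry in the list and is silently dropped, so the input words need to be plain lowercase.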
For each item in the Word column of the formants dataframe, call the function to_unicode_caps() (defined in previous code chunk) on it. Pass alphabet_reg as the second argument to to_unicode_caps().
2
Insert the words_unicode into the formants dataframe as a column called Word_Label, after the Word column.
3
Convert the items in Word_Label from lists containing 1 string to just strings.
Mutating the Word_Label column multiple times, because several words have multiple vowels to swap. Swapping one vowel at a time is shorter than swapping one category of word at a time.
2
First modification to Word_Label is all the words where “A” gets bolded.
3
If the value in the Word column is one of these items
4
Then pass the value of the Word_Label column to str_replace(). Replace the “a” from the regular-face set with the “a” from the bold-face set.
5
If the value in the Word column is not any of those words, keep the value of Word_Label the same.
6
Same logic for all the words where “E” gets bolded.
7
Same logic for all the words where “I” gets bolded.
8
Same logic for all the words where “O” gets bolded.
9
Same logic for all the words where “U” gets bolded.
10
There are a couple of exceptions. The first is “believe”, where the second instance of the vowel gets bolded, not the first: replace the consecutive “I” and “E” from the regular-face set with the “I” and “E” from the bold-face set.
11
“Goose” is the only word where both O’s need to be bolded.
12
“Fleece” needs the first two, but not the third E bolded.
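The pattern described in this list can be sketched on a handful of words. The lookup vectors reg and bold and the helper to_caps() are hypothetical stand-ins (lowercase names are assumed), and only one ordinary case plus two of the exceptions are shown.

```r
library(dplyr)
library(stringr)

# Lookup vectors (assumed layout: lowercase names, sans-serif capital values).
reg  <- setNames(intToUtf8(0x1D5A0 + 0:25, multiple = TRUE), letters)
bold <- setNames(intToUtf8(0x1D5D4 + 0:25, multiple = TRUE), letters)

# Compact stand-in for to_unicode_caps().
to_caps <- function(word) paste(reg[strsplit(word, "")[[1]]], collapse = "")

labels <- tibble(Word = c("trap", "kit", "goose", "believe")) %>%
  mutate(Word_Label = vapply(Word, to_caps, character(1), USE.NAMES = FALSE)) %>%
  mutate(
    # Ordinary case: bold the first "A" in the listed words.
    Word_Label = case_when(
      Word %in% c("trap") ~ str_replace(Word_Label, fixed(reg[["a"]]), bold[["a"]]),
      .default = Word_Label
    ),
    # "goose": both O's are bolded, so replace every match.
    Word_Label = case_when(
      Word == "goose" ~ str_replace_all(Word_Label, fixed(reg[["o"]]), bold[["o"]]),
      .default = Word_Label
    ),
    # "believe": bold the consecutive I + E, not the first E.
    Word_Label = case_when(
      Word == "believe" ~ str_replace(
        Word_Label,
        fixed(paste0(reg[["i"]], reg[["e"]])),
        paste0(bold[["i"]], bold[["e"]])
      ),
      .default = Word_Label
    )
  )
```

Because str_replace() only touches the first match, matching the two-letter sequence is what lets “believe” skip its first E.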
This was one of the points where I (from the northeast US) realized how little experience I have with Australian accents, because I wasn’t entirely sure how much of the messiness in Lauren’s back vowel data was because her back vowels are in different locations, or because I picked words where she uses a different vowel than Gretchen and I do.
3.5 Plot Vowel Boundaries
One of the most common types of vowel plots draws ellipses around the vowel boundaries.
Specify group = Vowel because all layers are going to be grouped by vowel, but wait to specify fill and color because those will vary by geom.
2
ggforce::geom_mark_ellipse() draws an ellipse around the points in each Vowel group. stat_ellipse() is also an option, but not all the vowels in the Wells Lexical Set have enough observations for it.
3
Set the outline around the ellipse to be color-coded by Vowel, but keep the fill of the ellipses white.
4
expand = 0 draws the ellipse exactly around the edges of the data, vs. the default value of expanding the ellipse out by 5mm.
5
If you give geom_mark_ellipse() a value for label, it will draw labels for each ellipse. But I want the labels centered on the mean for each vowel, not drawn to the side, so to do that I’m using geom_textbox() again (like for the first vowel means plots).
6
geom_textbox() needs the means for each Speaker*List*Vowel, otherwise it will draw a label over every data point. We can summarize the data already passed into the plot (formants) by specifying data = . %>% summarise(). So this geom (but not the ellipse geom) uses the mean of F1 and F2 grouped by Speaker, List, and Vowel. If you only specified Vowel here, it would plot the same means on each panel when we facet_wrap() by Speaker and List on the next line.
7
Set the label and fill (but not color) to vary by Vowel for just geom_textbox().
8
Set the text size, alignment, and color.
9
Remove the outlines around the boxes and make the fill color slightly transparent.
10
Set the box dimensions (see vowel means plots above).
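A trimmed-down sketch of the ellipse layer plus a per-layer summary, on invented data for one speaker and one list. To keep the dependencies down, geom_label() from ggplot2 stands in for geom_textbox() here, so the box styling from the list above is not reproduced; the point is the data = . %>% summarise() trick.

```r
library(dplyr)
library(ggplot2)
library(ggforce)

set.seed(42)
# Invented tokens for two vowels; not the post's measurements.
toy <- tibble(
  Speaker = "Gretchen",
  List = "Wells Lexical Set",
  Vowel = rep(c("i", "u"), each = 12),
  F1 = c(rnorm(12, 300, 25), rnorm(12, 350, 25)),
  F2 = c(rnorm(12, 2300, 90), rnorm(12, 950, 90))
)

ellipse_plot <- ggplot(toy, aes(x = F2, y = F1, group = Vowel)) +
  # Outline colored by vowel, white fill, no expansion beyond the data.
  geom_mark_ellipse(aes(color = Vowel), fill = "white", expand = 0) +
  # This layer only: replace the plot data with the per-group means.
  geom_label(
    data = . %>% summarise(.by = c(Speaker, List, Vowel), F1 = mean(F1), F2 = mean(F2)),
    aes(label = Vowel, fill = Vowel),
    color = "white", alpha = 0.8
  ) +
  facet_grid(Speaker ~ List) +
  scale_x_reverse() +
  scale_y_reverse() +
  theme_classic() +
  theme(legend.position = "none")
```

Grouping the summary by Speaker and List as well as Vowel is what keeps each panel's labels at that panel's own means once more speakers and lists are added.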
Ellipse plots are hard to make work for this data set, since there’s just too much going on in the Lingthusiasm Episode data. It would work better with cleaner/more controlled data (like the Wells Lexical Set recordings), or if you weren’t trying to compare all of the vowels. For now, I think this just shows how complex speech comprehension is! There isn’t a straightforward way to visualize boundaries in the naturalistic data, even though we’re able to perceive boundaries without much trouble.
For more info on ellipse vowel plots and how to tweak their appearance, I like this tutorial.
Need to specify device = png (not leave default or device = "png") to get Josefin Sans font to render correctly.
3.6 Stylized Versions
The plots so far would work for a scientific presentation/paper, but the original goal was to make plots that are a little more artistic. Now that we understand what vowel data is going into the plots and how the vowel plots correspond to the IPA vowel chart, we can play around with removing some of the axis and label information for a more minimalist look. Starting at this step, without making the complete plot first, is possible, but it would be easy to end up with errors.
Basically, if you want to remove axes for scientific presentations: don’t. If you want to remove axes for artistic license: make sure you understand what you’re removing first.
Word Means for Wells Lexical Set
The first pair of stylized plots just includes the Wells Lexical Set data.
ggrepel::geom_label_repel(), which draws the label boxes offset from the points, doesn’t currently support having the label text a different color than the box outline, so the trick with this one is to call it twice—once to draw everything in navy, and a second time to draw just the text in green on top of the navy text.
Just include Gretchen’s data from the Wells Lexical Set list.
2
Calculate the mean of F1 and F2 for each word.
3
Like the rest of the plots, F1 goes on the Y axis, F2 goes on the X axis, and Word is the label.
4
Draw a point at each word mean, and make it the Lingthusiasm navy and slightly larger than the default.
5
First layer of geom_label_repel() to get the navy lines.
6
Make the text color and the outline around the box navy.
7
Josefin Sans font and larger text size.
8
Always draw a line between the label and the point, and make the line a bit thicker than default.
9
Increase the padding around the text and the amount of curve on the corners.
10
Increase the amount the labels are repelled from the points (play around with different values until it looks good), and set a seed so the output is consistent.
11
Second layer of geom_label_repel() to get the green text.
12
Make the text color (and also the outline for now) the Lingthusiasm green.
13
Josefin Sans font and larger text size. Needs to be the same as the previous layer so there’s no navy text visible below the green text.
14
Set segment.size and label.size to 0 so that this layer doesn’t draw a green box outline or line connecting to the point.
15
Same box padding and shape, repel value, and seed so that the labels are in the same place.
16
Reverse the axes, and set the limits to look right before removing the labels in the next step.
17
Remove the axis lines, ticks, numbers, and title.
18
Add a layer writing “Gretchen.”
19
Set the font to Josefin Sans, the color to Lingthusiasm green, and the size to be about double the size of the word labels.
20
The location is on the scale of the axes.
21
Use patchwork to add a layer for the Lingthusiasm logo. This layer needs to be last, and the image needs to be loaded as a raster (see Lingthusiasm Theme section at the beginning).
22
Locations for the logo. (0, 0) is the bottom left corner of the plot, and (1, 1) is the top right corner. The distances between the top/bottom and left/right are equal, and the units are snpc (square normalized parent coordinates), so the logo is square and 10% of the height of the plot.
23
Save the plot. The text/line sizes and locations of the annotation layers are set to look right at this size and aspect ratio. Instead of rendering the plot in this chunk like before, the saved image is inserted directly below, to keep the sizing exact.
Change to include Lauren’s data instead of Gretchen’s.
2
To get a plot where the margins around the data are even, the limits for Lauren’s data are slightly different than the limits for Gretchen’s.
3
The location of the name layer also needs to be slightly different for Lauren.
4
And finally so does the location of the logo (but the size/aspect ratio stay the same).
Vowel Means for Lingthusiasm Episode Words
The second pair of stylized plots uses the data from the Lingthusiasm episodes.
I want to draw a trapezoid around the vowel space, with the axes the same for Gretchen’s and Lauren’s data so you can compare them. geom_path() connects a series of points, so we need the four corners, plus the starting corner again. Here’s a visualization of the trapezoid coordinates, relative to the vowel means:
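A sketch of how such a preview could be built. The corner coordinates in vowel_polygon below are hypothetical placeholders, as are the three vowel means; the real values depend on the speakers' data.

```r
library(ggplot2)

# Hypothetical trapezoid corners in (F2, F1) space. The fifth row repeats
# the first so that geom_path() closes the shape.
vowel_polygon <- data.frame(
  x = c(2600, 800, 800, 1900, 2600),  # F2
  y = c(250, 250, 900, 900, 250)      # F1
)

# Invented vowel means to show where the labels fall inside the outline.
toy_means <- data.frame(
  Vowel = c("i", "u", "\u0251"),
  F1 = c(300, 350, 800),
  F2 = c(2300, 950, 1100)
)

trapezoid_preview <- ggplot() +
  geom_path(
    data = vowel_polygon, aes(x = x, y = y),
    linewidth = 2, lineend = "round"
  ) +
  geom_text(data = toy_means, aes(x = F2, y = F1, label = Vowel), size = 6) +
  scale_x_reverse() +
  scale_y_reverse() +
  theme_classic()
```

Keeping the axes visible at this stage makes it easier to adjust the corners before the axis lines are stripped out in the stylized version.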
Now that we have the layout of the trapezoid, plot Gretchen’s data. geom_textbox() and geom_label_repel() can only draw squares/rectangles, so to get circles like the Lingthusiasm logo, I use geom_mark_circle() to draw a circle, then geom_text() to write the vowel on top of it.
Just include Gretchen’s data from the Lingthusiasm episode word list, and calculate the mean F1 and F2 for each Vowel.
2
Instantiate the plot, but set the values for aes() separately in each geom below.
3
Draw the trapezoid, using the data from vowel_polygon not the vowel means passed into the plot. This draws a line connecting each of the coordinates listed in the x and y columns.
4
Make the line navy, thicker than default, and rounded at the corners.
5
Draw a circle around each vowel mean. Specify group = Vowel not label = Vowel, so that there is a circle for each vowel, but no labels.
6
Make the radius of the circle 4.8% of the plot’s height (again using square normalized parent coordinates, just small enough to not overlap), and increase the number of points used to draw the circle from 100 so it looks smoother.
7
Fill the circle with Lingthusiasm green, make it not transparent (the default alpha is 0.3 not 1), and remove the black outline.
8
Draw the vowel label at each mean.
9
Make the text white, large enough to just fit inside the circle, and nudge it up just a bit (because most of the IPA symbols are sized as lowercase).
10
Set the axis limits to be 1% larger than the borders of the trapezoid.
11
Move the title of the Y axis from the left to the right.
12
Remove the lines, text, and ticks from the X and Y axes and the title from the X axis.
13
Make the Y axis title size 30, aligned to the bottom not the middle, navy, and Josefin Sans.
14
Set the Y axis title to be Gretchen instead of F1.
Do the same thing for Lauren’s data:
lauren_vowels_ep <- formants %>%
  filter(Speaker == "Lauren" & List == "Lingthusiasm Episodes") %>%
  summarise(.by = Vowel, F1 = mean(F1), F2 = mean(F2)) %>%
  ggplot(aes(x = F2, y = F1, group = Vowel)) +
  geom_path(
    data = vowel_polygon,
    aes(x = x, y = y, group = NA),
    color = lingthusiasm_navy, linewidth = 2, lineend = "round"
  ) +
  geom_mark_circle(
    radius = unit(0.048, "snpc"), n = 1000,
    fill = lingthusiasm_green, alpha = 1, color = NA
  ) +
  geom_text(aes(label = Vowel), color = "white", size = 11, nudge_y = 4) +
  scale_x_reverse(
    limits = c(max(vowel_polygon$x), min(vowel_polygon$x)),
    expand = c(0.01, 0.01)
  ) +
  scale_y_reverse(
    limits = c(max(vowel_polygon$y), min(vowel_polygon$y)),
    expand = c(0.01, 0.01),
    position = "right"
  ) +
  theme_classic() +
  theme(
    axis.line = element_blank(),
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_text(
      size = 30, hjust = 1,
      color = lingthusiasm_navy, family = "sans_alt"
    )
  ) +
  labs(y = "Lauren")

ggsave(
  lauren_vowels_ep,
  path = "plots", filename = "lauren_vowels_ep.png",
  width = 7.25, height = 5, unit = "in", device = png
)
1
The only thing that needs to change from Gretchen’s version is the filter on the data.
2
And the axis label.
To combine the two, I’m going to flip Gretchen’s horizontally, so it looks like two speakers facing each other.
paired_vowels_ep <- gretchen_vowels_ep +
  scale_x_continuous(
    limits = c(min(vowel_polygon$x), max(vowel_polygon$x)),
    expand = c(0.01, 0.01)
  ) +
  scale_y_reverse(
    limits = c(max(vowel_polygon$y), min(vowel_polygon$y)),
    expand = c(0.01, 0.01),
    position = "left"
  ) +
  theme(
    axis.title.y = element_text(hjust = 0),
    plot.margin = margin(t = 10, l = 10, r = 30, b = 10)
  ) +
  lauren_vowels_ep +
  theme(plot.margin = margin(t = 10, l = 30, r = 10, b = 10)) +
  lingthusiasm_tagline +
  plot_layout(
    design = c(
      area(t = 1, l = 1, b = 5, r = 4),
      area(t = 1, l = 5, b = 5, r = 8),
      area(t = 4, l = 4, b = 5, r = 5)
    )
  )

ggsave(
  paired_vowels_ep,
  path = "plots", filename = "paired_vowels_ep.png",
  width = 15, height = 5, unit = "in", device = png
)
1
Use the patchwork package to combine plots. Start with Gretchen’s plot on the left.
2
Replace the x axis scale with the non-reversed version scale_x_continuous, flip the limits accordingly, and keep the expansion the same.
3
Keep the y axis scale reversed, but put the axis title back on the left.
4
Shift the Y axis title (“Gretchen”) to be aligned to the bottom.
5
Add some space to the right side of the plot.
6
Add Lauren’s plot.
7
Add space to the left side of the plot. Based on how patchwork works, this only applies to Lauren’s plot added last, not Gretchen’s.
8
Add the Lingthusiasm tagline image.
9
Specify the locations of the three pieces on a grid to have Gretchen’s plot on the left, Lauren’s on the right, and the Lingthusiasm logo on the bottom.
10
Gretchen’s plot: full height, left side.
11
Lauren’s plot: full height, right side.
12
Lingthusiasm tagline logo: 20% of height, overlapping each of the other plots by 25%.