1  Experiment 1

Associating Pronouns with a Person


1.1 Motivation

In addition to learning that the grammatical representation of they allows it to corefer with specific singular referents, learning to use they/them pronouns may require a change in how speakers access pronouns during language production. Extending existing models of pronoun production from grammatical gender to social gender (Ackerman, 2019; McConnell-Ginet, 2014) would predict that pronouns are accessed based on morphosyntactic gender marking associated with a person’s name (e.g., Schmitt et al., 1999) or based on semantic/conceptual features of a person (e.g., Antón-Méndez, 2010). In order to use they/them for a person instead of the expected he/him or she/her, speakers may instead need to recall episodic information about the person’s stated pronouns or which pronouns other speakers use to refer to them. It is clear from other contexts, such as referring to pets with he and she, that people can learn which gender-marked pronouns to use in contexts with few gender cues based on name or appearance. This suggests that learning specific singular they should be feasible, but may involve retrieving a person’s pronouns from memory, rather than inferring them based on cues such as their name.

The first experiment investigated how people learn to associate pronouns with a person when someone’s pronouns cannot be inferred from the gender association of their first name. Participants read a series of vignettes which introduced 12 characters, each of whom was associated with a name, pronouns (he/him, she/her, or they/them), a pet, and a job. Memory for pronouns was tested in a multiple-choice recognition task, and production of pronouns was tested in a written sentence completion task.

1.2 Experiment 1A

1.2.1 Methods

The design and analysis plan were preregistered on the Open Science Framework. Materials, de-identified data, and analysis code are available at this dissertation’s GitHub repository.

Participants

102 undergraduate students from Vanderbilt University completed the study for partial course credit. The study was conducted online and took approximately 15 minutes. To characterize the sample, participants were asked about their age (Mage = 19.49, SDage = 1.38), gender (79 female, 22 male, 1 nonbinary), and English experience (86 “native (learned from birth)”, 15 “fully competent in speaking, listening, reading, and writing, but not native”, 1 “limited but adequate competence in speaking, reading, and writing”). Following best practices for TGD-inclusive study design, the question about gender was a free-response box (Ansara & Hegarty, 2014; Cameron & Stinson, 2019; Zimman, 2017; NASEM, 2022).

Materials & Procedure

Participants were introduced to 12 characters, each of whom had 4 associated facts: name, pronouns, job, and pet. There were 6 typically-masculine names (Andrew, Brian, Daniel, James, Kevin, Michael) and 6 typically-feminine names (Amanda, Emily, Jessica, Laura, Melissa, Stephanie), such that 4 characters had he/him pronouns and masculine names, 4 characters had she/her pronouns and feminine names, 2 characters had they/them pronouns and masculine names, and 2 characters had they/them pronouns and feminine names. While people who use they/them pronouns have a variety of names, the choice of gendered names here means that whether a character used the gendered pronouns associated with their name or they/them could not be predicted on the basis of the name. Thus, participants had to learn each name-pronoun pairing. Characters also had 1 of 12 jobs and 1 of 3 pets. The pet fact had the same distributional characteristics as the pronouns, but was not associated with prior expectations based on the name or with strong sociopolitical beliefs, making pet learning a reference point for comparison. Participants were randomly assigned to 1 of 3 lists. Name-pronoun pairs were counterbalanced such that each name appeared with the expected he/him or she/her in 2 lists and they/them in 1 list.
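The counterbalancing scheme described above can be illustrated with a minimal sketch. The specific grouping of names into pairs is an assumption made for illustration only; the source states only that each name appeared with the expected pronouns in 2 lists and with they/them in 1 list.

```python
from collections import Counter

MASCULINE = ["Andrew", "Brian", "Daniel", "James", "Kevin", "Michael"]
FEMININE = ["Amanda", "Emily", "Jessica", "Laura", "Melissa", "Stephanie"]
N_LISTS = 3


def build_lists():
    """Return one {name: pronouns} mapping per counterbalancing list."""
    lists = []
    for k in range(N_LISTS):
        assignment = {}
        for names, expected in ((MASCULINE, "he/him"), (FEMININE, "she/her")):
            for i, name in enumerate(names):
                # Hypothetical grouping: consecutive pairs of names; pair k
                # takes they/them in list k and the expected pronouns elsewhere.
                assignment[name] = "they/them" if i // 2 == k else expected
        lists.append(assignment)
    return lists


lists = build_lists()
for k, assignment in enumerate(lists, start=1):
    # Each list should contain 4 he/him, 4 she/her, and 4 they/them characters.
    print(k, Counter(assignment.values()))
```

Under this sketch, every list contains 2 they/them characters with masculine names and 2 with feminine names, so the name alone never identifies which characters use they/them.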

Participants read descriptions of each character in the frame [Name] uses [pronouns]. [Name] works as a [job] and has a [pet]. After a brief distractor task consisting of simple math questions, two tasks measured learning of pronouns. In the memory task, participants were given the character’s name and answered multiple-choice questions about their pronouns, job, and pet (e.g., What pronouns does Emily use? What is Emily’s job? What kind of pet does Emily have?). The 3 questions about each character always appeared together, the order of the 12 characters was pseudo-randomized, and the order of the choices for each question was randomized. In the production task, participants saw sentence fragments for each character (in a randomized order) that included their name and job (e.g., After Emily got home from working as a teacher…). Participants were asked to complete the sentence in a way that made sense to them, to measure which pronouns (if any) they used to refer to the character (e.g., …they made dinner). Finally, participants were shown 3 characters, 1 for each pronoun, and asked what they would say if they were introducing the character to someone. This task measured whether participants used a pronoun in reference to the character or specifically mentioned which pronouns the character used, as a pilot for more open-ended prompts.

1.2.2 Predictions

Given that people who use they/them pronouns report high rates of both unintentional errors and intentional misgendering (Cordoba, 2020; Goldberg et al., 2019; James et al., 2016; McLemore, 2018; The Trevor Project, 2020), we predict that he/him and she/her will be remembered and produced more accurately than they/them. This outcome could be observed for one or more reasons: Participants may be unfamiliar with singular they, or familiar with comprehending it but unused to producing it. Singular they is also less frequent than he and she, and as a result may be more difficult to use, even for speakers already familiar with it. Additionally, if participants avoid using they/them as an option, instead choosing the pronouns typically associated with the character’s name, accuracy for they/them would also be lower than accuracy for he/him and she/her. Alternatively, the relative novelty of they/them may improve memory, as distinctive information tends to be remembered better (von Restorff, 1933; Wallace, 1965). Under this account, accuracy remembering and producing they/them would be higher than for he/him and she/her.

We hypothesize that learning to use singular they requires a change from inferring a person’s pronoun (he or she) based on semantic/conceptual features of a person (e.g., Antón-Méndez, 2010) or based on morphosyntactic gender associated with a person’s name (e.g., Schmitt et al., 1999), and instead recalling episodic information about a person’s stated pronouns. In the context of this experiment, the gender association of the character’s name cannot predict whether that character uses he/she or they, meaning that the only way to consistently produce the correct pronouns is to remember the information from the character introductions. As a result, we predict that correctly remembering that a character uses they/them in the multiple-choice task should predict correctly producing they in the sentence completion task. Alternatively, pronoun choice in the production task may not be influenced by episodic memory for which pronouns a character uses. This would occur if, in language production, a speaker attempts to infer the character’s pronouns based on their name rather than retrieving them from memory, or if a speaker chooses to not produce singular they. In this scenario, accuracy in the memory task would not predict accuracy in the production task.

1.2.3 Results

Responses were analyzed with logistic mixed-effects regression models fit using lme4 in R (Bates et al., 2015; R Core Team, 2023), with the first model analyzing memory accuracy (Table 1.1), the second analyzing production accuracy (Table 1.2), and the third analyzing the relation between memory and production accuracy (Table 1.3). The fixed effect of pronoun was coded with orthogonal Helmert contrasts; the first contrast compared they/them to he/him and she/her, and the second contrast compared he/him to she/her. In the model that included memory accuracy as a fixed effect, memory accuracy was mean-centered and effects coded (wrong = -.5, right = +.5). The maximal models included by-participant and by-item (defined as the 12 names) random intercepts and slopes for pronoun (Baayen et al., 2008; Barr et al., 2013). The buildmer package (Voeten, 2023) was used to select the maximal converging models; all three final models included by-participant intercepts, the memory and production models also included by-participant slopes, and the memory model also included by-item intercepts.
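As a small check on the contrast coding, the sketch below writes out the two pronoun contrasts and verifies that they are centered and orthogonal. It assumes that the rounded values reported in the tables (-.66, +.33) stand for exact thirds; the variable names are illustrative and are not taken from the analysis code.

```python
from fractions import Fraction

# Contrast 1: they/them vs. the mean of he/him and she/her.
# Contrast 2: he/him vs. she/her.
# Assumption: the tabled -.66 / +.33 are rounded values of -2/3 and +1/3.
contrasts = {
    "they/them": (Fraction(-2, 3), Fraction(0)),
    "he/him": (Fraction(1, 3), Fraction(-1, 2)),
    "she/her": (Fraction(1, 3), Fraction(1, 2)),
}

c1 = [codes[0] for codes in contrasts.values()]
c2 = [codes[1] for codes in contrasts.values()]

# Centered: each contrast sums to zero across the three pronouns.
column_sums = (sum(c1), sum(c2))

# Orthogonal: the dot product of the two contrasts is zero.
dot_product = sum(a * b for a, b in zip(c1, c2))

# A one-unit step on contrast 1 separates they/them from he/him (or she/her),
# so its coefficient is a log-odds difference between the two groups.
unit_step = contrasts["he/him"][0] - contrasts["they/them"][0]

print(column_sums, dot_product, unit_step)
```

Because both contrasts are centered, the model intercept reflects the unweighted average across the three pronoun conditions rather than any single reference level.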

Memory

For accuracy in the multiple-choice memory task (Table 1.1), participants responded more accurately than inaccurately across all three pronoun conditions (β = 0.77, z = 7.36, p < .001). He/him and she/her (M = 0.77) were remembered significantly more accurately than they/them (M = 0.44) (β = 1.64, z = 9.54, p < .001). Errors were asymmetric: not remembering that a character used they/them was more common than incorrectly attributing they/them to a character (Figure 1.1).

Table 1.1: Experiment 1A: Model results for the effect of Pronoun on Memory Accuracy.
Experiment 1A: Memory
Memory Accuracy

Predictors                                      Log-Odds  SE      z        p
(Intercept)                                     0.768     0.104   7.359    <0.001
Pronoun: They (-.66) vs He (+.33) + She (+.33)  1.638     0.172   9.544    <0.001
Pronoun: He (-.5) vs She (+.5)                  0.129     0.217   0.596    0.551

Random Effects
τ00 Participant                                 0.443
τ00 Name                                        0.008
τ11 Pronoun (They vs He + She) | Participant    0.634
τ11 Pronoun (He vs She) | Participant           0.193
ρ01 Pronoun (They vs He + She) | Participant    -0.425
ρ01 Pronoun (He vs She) | Participant           0.122
N Participant                                   102
N Name                                          12
Observations                                    1224

Figure 1.1: Experiment 1A: [A] Pronoun accuracy in the multiple-choice memory task. By-participant means are shown as points; error bars indicate 95% CIs calculated over the by-participant means. [B] Distribution of memory responses, with the correct pronoun on the X axis and the selected pronoun by color.
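To relate the log-odds estimates in Table 1.1 to the accuracy scale, the following sketch converts the fixed effects into predicted probabilities for each pronoun. It assumes exact thirds for the rounded contrast codes and sets all random effects to zero, so the values describe a typical participant and item and need not match the raw condition means exactly.

```python
import math


def inv_logit(x):
    """Convert log-odds to a probability."""
    return 1 / (1 + math.exp(-x))


# Fixed-effect estimates copied from Table 1.1.
INTERCEPT = 0.768
B_THEY_VS_HE_SHE = 1.638
B_HE_VS_SHE = 0.129

# Contrast codes (assumption: -.66 / +.33 are rounded thirds).
CODES = {
    "they/them": (-2 / 3, 0.0),
    "he/him": (1 / 3, -0.5),
    "she/her": (1 / 3, 0.5),
}

predicted = {
    pronoun: inv_logit(INTERCEPT + B_THEY_VS_HE_SHE * c1 + B_HE_VS_SHE * c2)
    for pronoun, (c1, c2) in CODES.items()
}

for pronoun, probability in predicted.items():
    print(f"{pronoun}: {probability:.2f}")
```

The predicted probability for they/them falls below .50 while the predictions for he/him and she/her fall well above it, in line with the observed means reported in the text.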

Recall that just as each character was associated with one of three pronouns, each character was also associated with one of three pets. For they/them characters, pronoun accuracy did not differ significantly from pet accuracy (M = 0.39), which has similar distributional characteristics (β = 0.21, z = 1.35, p = .18). Accuracy for the characters’ 12 possible jobs (M = 0.21) was not at floor, suggesting that overall, the task was not too difficult for participants (Figure A.1). The pet and job questions are discussed in more detail in the appendix (Section A.1.1).

Production

Responses were coded by whether the sentence continuation used he/him, she/her, they/them, or no pronouns to refer to the character (Figure 1.2). Responses that did not include a pronoun referring to the character either used the character’s name as the subject of the continuing clause (e.g., After Emily got home from working as a teacher…Emily made dinner), or were ungrammatical continuations with no subject (e.g., …made dinner). Because these responses were infrequent (2% of trials) and evenly distributed between pronoun conditions, they are included in the analysis (Table 1.2) as incorrect responses. Participants responded more accurately than inaccurately across all three pronoun conditions (β = 1.33, z = 6.73, p < .001). He/him and she/her (M = 0.84) were produced significantly more accurately than they/them (M = 0.29) (β = 4.14, z = 8.80, p < .001). They/them was incorrectly produced as he/him or she/her each about one third of the time, while he/him and she/her were each incorrectly produced as they/them about an eighth of the time. Approximately 60% of participants produced singular they at least once, regardless of accuracy.
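The coding scheme for production responses can be sketched as follows. This is a hypothetical automated approximation, not the procedure used in the study: it labels a continuation by the first pronoun form it contains and does not check whether that pronoun actually refers to the character, which a human coder would need to verify.

```python
import re

# Hypothetical word lists for each pronoun set.
PRONOUN_FORMS = {
    "he/him": {"he", "him", "his", "himself"},
    "she/her": {"she", "her", "hers", "herself"},
    "they/them": {"they", "them", "their", "theirs", "themself", "themselves"},
}


def code_response(continuation):
    """Return the pronoun set of the first pronoun found, or 'none'."""
    for word in re.findall(r"[a-z]+", continuation.lower()):
        for label, forms in PRONOUN_FORMS.items():
            if word in forms:
                return label
    return "none"


def is_correct(continuation, character_pronouns):
    """Score a response; continuations without a pronoun count as incorrect."""
    return code_response(continuation) == character_pronouns


print(code_response("they made dinner"))
print(code_response("Emily made dinner"))
print(is_correct("made dinner", "they/them"))
```

Treating responses without a pronoun as incorrect mirrors the analysis decision described above, where such responses made up about 2% of trials.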

Table 1.2: Experiment 1A: Model results for the effect of Pronoun on Production Accuracy.
Experiment 1A: Production
Production Accuracy

Predictors                                      Log-Odds  SE      z        p
(Intercept)                                     1.330     0.198   6.726    <0.001
Pronoun: They (-.66) vs He (+.33) + She (+.33)  4.142     0.471   8.802    <0.001
Pronoun: He (-.5) vs She (+.5)                  -0.157    0.472   -0.333   0.739

Random Effects
τ00 Participant                                 1.006
τ11 Pronoun (They vs He + She) | Participant    12.336
τ11 Pronoun (He vs She) | Participant           0.317
ρ01 Pronoun (They vs He + She) | Participant    0.657
ρ01 Pronoun (He vs She) | Participant           -0.945
N Participant                                   102
Observations                                    1224

Figure 1.2: Experiment 1A: [A] Pronoun accuracy in the written sentence completion task. By-participant means are shown as points; error bars indicate 95% CIs calculated over the by-participant means. [B] Distribution of pronoun production responses, with the correct pronoun on the X axis and the produced pronoun by color. [C] Number of times each participant produced singular they (correct = 4), regardless of accuracy.

Memory Predicting Production

Extending the second model to test the effects of pronoun and memory accuracy on production accuracy (Table 1.3) showed that participants were more likely to produce the correct pronoun when they had correctly remembered it (β = 1.25, z = 7.40, p < .001). Memory accuracy interacted with pronoun for the comparison between they/them and he/him + she/her (β = -0.82, z = -2.44, p < .05), such that memory improved production more for they/them characters than for he/him + she/her characters (Figure 1.3). Remembering but not producing they/them was more common than producing it but not remembering it.

Table 1.3: Experiment 1A: Model results for the effects of Pronoun and Memory Accuracy on Production Accuracy.
Experiment 1A: Memory Predicting Production
Production Accuracy

Predictors                                      Log-Odds  SE      z        p
(Intercept)                                     0.751     0.102   7.331    <0.001
Pronoun: They (-.66) vs He (+.33) + She (+.33)  2.607     0.176   14.781   <0.001
Pronoun: He (-.5) vs She (+.5)                  0.223     0.212   1.054    0.292
Memory Accuracy: Wrong (-.5) vs Right (+.5)     1.245     0.168   7.405    <0.001
Pronoun (They vs He + She) * Memory Accuracy    -0.820    0.336   -2.436   0.015
Pronoun (He vs She) * Memory Accuracy           -0.064    0.429   -0.149   0.881

Random Effects
τ00 Participant                                 0.347
N Participant                                   102
Observations                                    1224

Figure 1.3: Experiment 1A: [A] Production accuracy split by memory accuracy. Error bars indicate 95% CIs calculated over trials. [B] Distribution of combined memory and production accuracy.
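The direction of the interaction in Table 1.3 can be made concrete by computing the effect of memory accuracy separately for each pronoun group. The sketch below is a worked illustration from the reported fixed effects only; it assumes exact thirds for the rounded pronoun codes and ignores random effects.

```python
import math

# Fixed-effect estimates copied from Table 1.3.
B_MEMORY = 1.245
B_INTERACTION = -0.820

# Codes on the they/them vs. he/him + she/her contrast
# (assumption: -.66 / +.33 are rounded thirds).
CODE_THEY = -2 / 3
CODE_HE_SHE = 1 / 3


def memory_effect(pronoun_code):
    """Log-odds change in production accuracy from wrong to right memory.

    Memory accuracy is coded -.5 / +.5, so its coefficient already spans
    the full wrong-to-right difference.
    """
    return B_MEMORY + B_INTERACTION * pronoun_code


odds_ratio_they = math.exp(memory_effect(CODE_THEY))
odds_ratio_he_she = math.exp(memory_effect(CODE_HE_SHE))

print(f"they/them odds ratio: {odds_ratio_they:.2f}")
print(f"he/him + she/her odds ratio: {odds_ratio_he_she:.2f}")
```

Because the interaction coefficient is negative and they/them carries a negative code, the memory effect is larger for they/them characters, which is the pattern described in the text.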

1.3 Experiment 1B

In the first experiment, people learned about a set of 12 characters whose pronouns—they/them or the expected he/him or she/her—could not be predicted from their name or other cues. These results demonstrate that, after only a brief exposure, people can learn that a character uses they/them pronouns and retrieve this information from memory to produce singular they in reference to them. Remembering that the character used they/them was a strong predictor of accurate pronoun production, with 48% of production trials correct when the memory trial had been correct, but only 15% of production trials correct when the memory trial had been incorrect. These results are congruent with a model where learning to use singular they requires a change from inferring a person’s pronouns (he or she) based on semantic/conceptual features of a person or based on morphosyntactic gender associated with a person’s name, and instead recalling episodic information about a person’s stated pronouns.

While memory accuracy did predict production accuracy, it was not a guarantee. For they/them characters, memory was significantly more accurate than production (β = -0.80, z = -4.77, p < .001). Conversely, for he/him and she/her characters, memory was significantly less accurate than production (β = 0.56, z = 4.19, p < .001) (Section A.1.1). I interpret these findings as demonstrating that memory for which pronouns a person uses and accuracy when producing a person’s pronouns are distinct but interacting processes. However, the order of the memory and production tasks may also have played a role. Participants had only one brief exposure to information about the characters, then solved math problems for around five minutes, then answered 36 multiple-choice questions (3 for each of the 12 characters) about the characters’ jobs, pets, and pronouns. By the time participants completed the production task, they may have begun to forget which characters used they/them pronouns, instead relying more on inferences from the characters’ names. This would also result in higher production accuracy for he/him + she/her and lower production accuracy for they/them, but not necessarily because of different task demands between remembering which pronoun the character uses and selecting the correct pronoun to produce.

To ensure that the pattern of results in Experiment 1A was not determined by task order, a replication study reversed the order of the two tasks. While participants in Experiment 1A completed the memory task before the production task, participants in Experiment 1B completed the production task before the memory task. In all other respects, Experiment 1B was identical to Experiment 1A.

1.3.1 Methods

101 Vanderbilt undergraduates completed the study for partial course credit (Mage = 19.28, SDage = 1.18). 72 were women and 29 were men, as reported in a free-response box. 82 rated their English ability as “native (learned from birth)”, 17 as “fully competent, but not native”, and 2 as “limited but adequate competence.” The only difference from Experiment 1A was the reversed order of the memory and production tasks.

1.3.2 Results

Responses were analyzed using the same logistic mixed-effects model specifications as in Experiment 1A (Bates et al., 2015; R Core Team, 2023; Voeten, 2023), although the maximal random effects structures that converged differed slightly in this data set. Figure 1.4 highlights results for the they/them characters; results for all three pronouns, parallel to the figures in Experiment 1A, are included in the appendix (Figure A.3, Figure A.4). To compare these results to those of Experiment 1A, an additional trio of models with the combined data included Experiment as a mean-centered fixed effect (Section A.1.3).

Figure 1.4: [A] By-participant mean memory accuracy for they/them characters, split by task order. [B] By-participant mean production accuracy for they/them characters, split by task order. [C] Production accuracy for they/them characters, split by memory accuracy and task order. Error bars indicate 95% CIs.

Memory

In the multiple-choice memory task (Table A.3), participants responded more accurately than inaccurately across all three pronouns (β = 0.95, z = 10.03, p < .001). He/him and she/her (M = 0.79) were remembered more accurately than they/them (M = 0.48) (β = 1.55, z = 10.88, p < .001). Comparing pronoun accuracy to accuracy for the 3 pets, which were designed as a control condition for pronouns, indicated that for he/him + she/her characters, pronoun accuracy was higher than pet accuracy (β = 1.74, z = 13.70, p < .001). For they/them characters, the difference between pronoun and pet accuracy was not significant (β = 0.28, z = 1.85, p = .06) (Table A.6). Memory for the characters’ 12 jobs (M = 0.29, SD = 0.45) was again above floor and was numerically higher than in Experiment 1A (Figure A.2). When comparing memory accuracy between experiments (Table A.7), neither the main effect of experiment (β = 0.18, z = 1.34, p = .18) nor its interactions with pronoun (β = -0.04, z = -0.22, p = .83) were significant. Overall, completing the memory task last had little effect on memory accuracy.

Production

In the sentence completion task (Table A.4), responses that did not include a pronoun referring to the character were again infrequent (3%) and were included in the analysis as incorrect responses. Participants used the correct pronoun to refer to the character more often than not across pronoun conditions (β = 1.11, z = 8.97, p < .001). He/him and she/her (M = 0.84) were produced more accurately than they/them (M = 0.39) (β = 2.48, z = 14.79, p < .001). Overall accuracy was not significantly different from Experiment 1A (β = 0.13, z = 1.29, p = .20). However, the interaction between pronoun and experiment was significant (β = -0.45, z = -2.21, p < .05), such that the difference in accuracy between they/them and he/him + she/her was smaller when the production task came first than when it came second (Table A.8).

Memory Predicting Production

Testing the effects of pronoun condition and memory accuracy on production accuracy (Table A.5) showed that, in addition to the effects described above, participants were more likely to produce the correct pronoun if they also remembered it (β = 1.49, z = 9.27, p < .001). Memory accuracy interacted with pronoun for the comparison between they/them and he/him + she/her (β = -1.16, z = -3.58, p < .001), such that memory improved production more for they/them characters than for he/him + she/her characters. Remembering but not producing they/them was again more common than producing but not remembering it. Comparing between experiments (Table A.9), memory accuracy did not interact with task order (β = 0.27, z = 1.21, p = .22).

Figure 1.5 shows each participant’s mean difference in accuracy between the two tasks. For they/them characters, participants in both experiments were more accurate in the memory task. Conversely, for he/him + she/her characters, participants in both experiments were more accurate in the production task. Both patterns are consistent with participants forgetting that a character uses they/them or choosing not to use singular they, then defaulting to the pronouns associated with the character’s name. In a linear regression with pronoun and task order predicting the by-participant mean task differences (Table A.10), there were no significant effects of task order (β = 0.00, z = 0.07, p = .95) or its interaction with pronoun (β = 0.08, z = 1.55, p = .12).

Figure 1.5: Experiments 1A & 1B: Differences between memory accuracy and production accuracy for each participant, split by character pronoun and task order.
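The by-participant task differences plotted in Figure 1.5 could be computed along the following lines. The trial records are invented for illustration, and the sign convention (memory minus production) is an assumption; the source does not state which direction was used.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trial records:
# (participant, pronoun group, memory correct, production correct)
trials = [
    ("p1", "they/them", 1, 0),
    ("p1", "they/them", 1, 1),
    ("p1", "he/him + she/her", 0, 1),
    ("p1", "he/him + she/her", 1, 1),
]


def task_differences(records):
    """Mean memory accuracy minus mean production accuracy,
    computed separately for each participant and pronoun group."""
    grouped = defaultdict(list)
    for participant, group, memory, production in records:
        grouped[(participant, group)].append((memory, production))
    return {
        key: mean(m for m, _ in pairs) - mean(p for _, p in pairs)
        for key, pairs in grouped.items()
    }


differences = task_differences(trials)
for key, value in differences.items():
    print(key, value)
```

With this convention, positive values indicate better memory than production (the pattern reported for they/them characters) and negative values indicate the reverse.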

Reliability

To estimate the internal reliability of the memory and production tasks, I used the Bayesian mixed-effects model approach described in Staub (2021). The trials from Experiments 1A & 1B were split in half, so that each half included 2 he/him, 2 she/her, and 2 they/them characters for each participant. Pronoun was coded as 2 variables: the first comparing they/them (-.66) to he/him (+.33) and she/her (+.33) in even trials, with odd trials coded as 0, and the second comparing they/them to he/him + she/her in odd trials, with even trials coded as 0. Four Bayesian mixed-effects models were fit with the brms package (Bürkner, 2017): one each for memory and production accuracy in Experiments 1A and 1B. Each model included the odd and even trial pronoun variables as fixed effects and by-participant slopes. All models were fit using the default priors and 4 chains, each chain with 4000 iterations, of which 2000 were warm-up.
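The split-half coding of the pronoun variable can be sketched as below. How trials were assigned to the even and odd halves is not specified in the text, so the alternation within each pronoun used here is an assumption, as is the use of exact thirds for the rounded codes.

```python
from collections import Counter

# Assumption: the reported -.66 / +.33 are rounded thirds.
THEY_CODE = -2 / 3
HE_SHE_CODE = 1 / 3


def split_half_codes(pronouns):
    """For one participant's trials, return (half, even_code, odd_code) rows.

    Trials alternate between halves within each pronoun so that both halves
    contain the same number of he/him, she/her, and they/them characters.
    """
    seen = Counter()
    rows = []
    for pronoun in pronouns:
        half = "even" if seen[pronoun] % 2 == 0 else "odd"
        seen[pronoun] += 1
        code = THEY_CODE if pronoun == "they/them" else HE_SHE_CODE
        even_code = code if half == "even" else 0.0
        odd_code = code if half == "odd" else 0.0
        rows.append((half, even_code, odd_code))
    return rows


pronouns = ["he/him"] * 4 + ["she/her"] * 4 + ["they/them"] * 4
rows = split_half_codes(pronouns)
for pronoun, row in zip(pronouns, rows):
    print(pronoun, row)
```

Each trial contributes to exactly one of the two pronoun variables, so the by-participant slopes for the two variables are estimated from disjoint halves of the data, and their correlation serves as the reliability estimate.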

The random slope estimates represent the relative accuracy of they/them compared to he/him + she/her for each participant (Figure A.5). For the memory task, the correlation between slopes was low in both 1A, r = 0.34 [-0.65, 0.92], and 1B, r = 0.45 [-0.49, 0.95], indicating poor internal reliability. For the production task, the correlation between slopes was high in 1A, r = 0.91 [0.77, 0.99], but moderate in 1B, r = 0.64 [0.34, 0.88]. Internal reliabilities below 0.80 for the memory task in both experiments and for the production task in 1B indicate that analyzing individual differences would not be warranted.

1.4 Discussion

Experiment 1 investigated how people associate pronouns with a person, in a context where producing the correct pronoun required recalling stated information about the character’s pronouns, instead of choosing the pronoun based on the gender association of the name or an inference about the gender of the character. Participants learned about a set of 12 characters whose pronouns—they/them or the expected he/him or she/her—could not be predicted from their name or other cues. Memory for pronouns was measured in a multiple-choice task, and production of pronouns was measured in a written sentence completion task. After only a brief exposure to the characters, participants correctly recalled that the character used they/them pronouns in about half of trials and correctly produced singular they in reference to the character in about a third of trials. Participants originally completed the memory task before the production task, and a follow-up experiment presented the production task first, to rule out the possibility that differences between memory and production accuracy were caused by task order. In both experiments, remembering that the character used they/them pronouns strongly predicted—but did not guarantee—producing singular they. These results provide support for a model where speakers can select pronouns by retrieving information from episodic memory about a person’s stated pronouns, instead of selecting pronouns based on morphosyntactic gender information from the name or a gender inference about the person.

One limitation is that Experiment 1 was conducted with undergraduate students at a private university, a sample that represents a particular subset of English speakers located in the US: primarily ages 18–22 and highly educated. Prior data allow us to infer that the participants are more likely to be socially liberal, to have prior knowledge about LGBTQ+ topics, and to consider singular they acceptable (Minkin & Brown, 2021; Parker et al., 2019). However, more extensive demographic and language experience data were not collected in the first experiment, because an investigation of individual differences would first need to establish the internal reliability of the measures before they could be associated with other person-specific variables.

There is an increasing awareness in psycholinguistics, and in cognitive psychology more broadly, of the importance of demonstrating sufficient internal reliability as a first step in conducting individual differences research (Cronbach, 1957; Hedge et al., 2017; Staub, 2021). This often proves difficult, because reliably measuring effects of experimental manipulations and reliably measuring individual differences are mathematically at odds with each other. Individual differences measures aim to characterize between-participant variability (e.g., the extent to which someone endorses the gender binary and gender essentialism), and reliability means that the measure ranks participants consistently. This requires a low degree of within-participant variability: participants’ responses would be consistent both within trials in the same experiment and if they completed the experiment again at a later date. Conversely, most cognitive psychology measures aim to characterize changes in behavior in different experimental contexts (e.g., singular they takes more time to read than plural they), which is within-participant variability. In this context, reliability means that the effect replicates in different experiments. This requires a low degree of between-participant variability, where most participants show the same pattern of responses. Hedge et al. (2017) describe this incompatibility as the “reliability paradox,” because “robust experimental paradigms…are likely to be sub-optimal for correlational studies for the same reasons that they produce robust experimental effects.”

If a measure shows low internal reliability, where each participant’s responses are not consistent across halves of the same experiment or across two sessions of the experiment, testing for individual differences would be possible, but not warranted (Hedge et al., 2017). One of the primary issues is that the reliability of the measures constrains the maximum correlation that one can expect to observe (Spearman, 1904), meaning that correlations with a low-reliability task will be underestimated. This is the position we find ourselves in with Experiment 1. Analyzing the data using Staub’s (2021) split-half mixed-effects model approach showed that the internal reliability of the memory task (r = 0.34 in 1A, r = 0.45 in 1B) was too low to warrant individual differences analysis, and the reliability of the production task was high enough in Experiment 1A but not in 1B (r = 0.91 in 1A, r = 0.64 in 1B). While increasing the number of trials per participant can increase reliability (Liceralde & Watson, 2023), this is not feasible with a memory task, as adding more characters to learn about would make the experiment too difficult for participants. Overall, this means that while people clearly do vary in their accuracy remembering and producing singular they, it is not clear how much of this variability arises from theoretically interesting factors (e.g., language and gender beliefs), how much arises from general psychological factors (e.g., ability in memory tests, attention to the experiment), and how much is random noise. Instead of pursuing individual differences questions with the character learning task, we turn to investigating how presenting information about singular they can support accurate memory and production.
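The constraint that reliability places on observable correlations can be illustrated with Spearman's attenuation formula, under which an observed correlation equals the true correlation multiplied by the square root of the product of the two measures' reliabilities. The sketch below treats the split-half slope correlations reported above as reliability estimates and assumes, optimistically, that the other measure in the correlation is perfectly reliable; the function name is illustrative.

```python
import math


def max_observable_correlation(reliability_x, reliability_y=1.0):
    """Upper bound on an observed correlation under Spearman's attenuation
    formula, reached when the true correlation is 1."""
    return math.sqrt(reliability_x * reliability_y)


# Split-half reliability estimates reported for Experiments 1A and 1B.
RELIABILITY = {
    ("memory", "1A"): 0.34,
    ("memory", "1B"): 0.45,
    ("production", "1A"): 0.91,
    ("production", "1B"): 0.64,
}

# Assumption: the second measure (e.g., a questionnaire) has reliability 1.0.
ceilings = {key: max_observable_correlation(r) for key, r in RELIABILITY.items()}

for (task, experiment), ceiling in ceilings.items():
    print(f"{task} {experiment}: ceiling = {ceiling:.2f}")
```

Any realistic second measure would have reliability below 1, which would lower these ceilings further and make a true relationship with the memory task even harder to detect.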