Learning to comprehend and produce singular *they* (Publication No. 30722219)

Bethany Gardner

Appendix A — Supplementary Analyses

A.1 Experiment 1

A.1.1 Experiment 1A: Additional Results

In addition to whether the characters used he/him, she/her, or they/them, participants were also asked about the characters’ jobs and pets (Figure A.1). Accuracy matching the 12 jobs to the 12 characters was lower (M = 0.21, SD = 0.41) than for the 3 pets (M = 0.41, SD = 0.49) and 3 pronouns (M = 0.66, SD = 0.47), but not at floor. Neither job nor pet accuracy varied based on the character’s pronouns. Accuracy for the characters’ pets (cat, dog, or fish) was designed as a comparison to pronoun accuracy, and the two were compared in a model including the Character’s Pronouns (contrast coded as in the main analyses) and Question Type (pronoun vs pet, mean-center effects coded) as fixed effects. The most complex model that converged included by-participant random slopes for Question Type (Table A.1). Averaging across the three character pronouns, participants were significantly more accurate for pronoun questions than pet questions (β = 1.17, z = 11.19, p < .001). The interaction between Character Pronoun (they/them vs he/him + she/her) and Question Type was significant (β = 1.45, z = 7.55, p < .001), reflecting that the character pronouns affected accuracy for the pronoun question, but not the pet question. Probing this interaction indicated that for they/them characters, there was no significant difference in accuracy between pronouns and pets (β = 0.21, z = 1.35, p = .18), but that for he/him + she/her characters, pronoun accuracy was higher than pet accuracy (β = 1.65, z = 12.97, p < .001).

Figure A.1: Experiment 1A: By-participant mean accuracy in the multiple-choice memory task for each character’s pronouns, pet, and job, with colors indicating the character’s pronouns. Error bars indicate 95% CIs calculated over the by-participant means.
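The error bars described in the caption can be illustrated with a minimal sketch. It assumes a normal-approximation interval (mean ± 1.96 × SE) computed over by-participant means; the exact CI method used for the figures is not stated here, and the function name and toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def ci95_over_participants(trials):
    """Return (mean, lower, upper) for a 95% CI over by-participant means.

    trials: list of (participant_id, correct) pairs, with correct in {0, 1}.
    Assumption: normal approximation (1.96 * SE); the thesis may have used
    a t-based or bootstrapped interval instead.
    """
    by_participant = {}
    for pid, correct in trials:
        by_participant.setdefault(pid, []).append(correct)
    # First average within each participant, then summarize across participants.
    participant_means = [mean(v) for v in by_participant.values()]
    m = mean(participant_means)
    se = stdev(participant_means) / sqrt(len(participant_means))
    return m, m - 1.96 * se, m + 1.96 * se


# Hypothetical toy data: 4 participants, 4 trials each.
toy_trials = (
    [("p1", c) for c in (1, 1, 0, 1)]
    + [("p2", c) for c in (1, 0, 0, 0)]
    + [("p3", c) for c in (1, 1, 1, 1)]
    + [("p4", c) for c in (0, 1, 0, 1)]
)
m, lower, upper = ci95_over_participants(toy_trials)
print(m, lower, upper)
```

Averaging within participants first means each participant contributes one value to the interval, regardless of how many trials they completed.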

Table A.1: Experiment 1A: Model results for the effects of Character Pronoun and Question Type (character’s pronoun or pet) on Memory Accuracy. Character Pronoun is contrast-coded as in the main analysis.
Experiment 1A: Memory for Pronouns vs Pets

	Memory Accuracy
Predictors	Log-Odds	SE	z	p
(Intercept)	0.199	0.073	2.731	0.006
Question Type: Pet (-.5) vs Pronoun (+.5)	1.168	0.104	11.189	<0.001
Character Pronoun: They (-.66) vs He (+.33) + She (+.33)	0.884	0.096	9.216	<0.001
Character Pronoun: He (-.5) vs She (+.5)	0.010	0.113	0.088	0.930
Question Type * Character Pronoun (They vs He+She)	1.449	0.192	7.548	<0.001
Question Type * Character Pronoun (He vs She)	0.128	0.226	0.566	0.571
Random Effects
τ₀₀ _Participant	0.323
τ₁₁ _{Question Type \| Participant}	0.243
ρ₀₁ _Participant	0.231
N _Participant	102
Observations	2448
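As a rough check on how the contrast codes relate the coefficients in Table A.1 to the simple effects reported in the text, the effect of Question Type within each pronoun group can be approximated as the main effect plus the contrast code times the interaction. This is only arithmetic on the rounded table values. The probing reported in the text may have come from refitted models, and the function name is hypothetical.

```python
def simple_effect(main_effect, interaction, moderator_code):
    """Effect of one predictor at a given contrast code of the other predictor."""
    return main_effect + moderator_code * interaction


# Log-odds coefficients copied from Table A.1.
B_QUESTION_TYPE = 1.168  # Pet (-.5) vs Pronoun (+.5)
B_INTERACTION = 1.449    # Question Type * Character Pronoun (They vs He+She)

# Contrast codes for Character Pronoun as printed in the table.
THEY_CODE = -0.66
HE_SHE_CODE = 0.33

they_effect = simple_effect(B_QUESTION_TYPE, B_INTERACTION, THEY_CODE)
he_she_effect = simple_effect(B_QUESTION_TYPE, B_INTERACTION, HE_SHE_CODE)
print(round(they_effect, 2), round(he_she_effect, 2))  # 0.21 1.65
```

These values line up with the reported simple effects of β = 0.21 for they/them characters and β = 1.65 for he/him + she/her characters.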

The memory and production tasks were compared directly by fitting a model predicting accuracy in both tasks, with Task as a mean-center effects coded fixed effect (Table A.2). The main effect of Task was not significant (β = 0.11, z = 1.04, p = .30), but the interaction between Pronoun (They vs He + She) and Task was significant (β = 1.38, z = 6.36, p < .001). Probing this interaction indicated that memory was more accurate than production for they/them characters (β = -0.80, z = -4.77, p < .001). Conversely, memory was less accurate than production for he/him and she/her characters (β = 0.56, z = 4.19, p < .001).

Table A.2: Experiment 1A: Model results for the effects of Pronoun and Task (memory vs production) on Accuracy.
Experiment 1A: Comparing Memory and Production Accuracy

	Accuracy
Predictors	Log-Odds	SE	z	p
(Intercept)	0.891	0.092	9.723	<0.001
Pronoun: They (-.66) vs He (+.33) + She (+.33)	2.469	0.204	12.104	<0.001
Pronoun: He (-.5) vs She (+.5)	0.155	0.172	0.900	0.368
Task: Memory (-.5) vs Production (+.5)	0.110	0.105	1.041	0.298
Pronoun (They vs He+She) * Task	1.380	0.217	6.357	<0.001
Pronoun (He vs She) * Task	0.152	0.268	0.568	0.570
Random Effects
τ₀₀ _Participant	0.468
τ₁₁ _{Pronoun (They vs He + She) \| Participant}	2.737
τ₁₁ _{Pronoun (He vs She) \| Participant}	0.356
ρ_{01 Pronoun (They vs He + She) \| Participant}	0.091
ρ_{01 Pronoun (He vs She) \| Participant}	0.098
N _Participant	102
Observations	2448

A.1.2 Experiment 1B: Additional Results

Table A.3: Experiment 1B: Model results for the effect of Pronoun on Memory Accuracy, when the memory task was completed after the production task.
Experiment 1B: Memory

	Memory Accuracy
Predictors	Log-Odds	SE	z	p
(Intercept)	0.949	0.095	10.028	<0.001
Pronoun: They (-.66) vs He (+.33) + She (+.33)	1.552	0.143	10.882	<0.001
Pronoun: He (-.5) vs She (+.5)	0.096	0.179	0.540	0.589
Random Effects
τ₀₀ _Participant	0.380
N _Participant	101
Observations	1212
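The log-odds in these tables can be put on the probability scale with the inverse logit. The sketch below uses the fixed effects from Table A.3 and assumes a participant with a random intercept of 0 and the He vs She contrast held at 0 (that is, averaging over he/him and she/her characters). It is an illustration of how to read the coefficients, not a reproduction of the thesis analysis code.

```python
from math import exp


def inv_logit(log_odds):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + exp(-log_odds))


# Fixed effects copied from Table A.3 (log-odds).
INTERCEPT = 0.949
B_THEY_VS_HE_SHE = 1.552  # They (-.66) vs He (+.33) + She (+.33)

# Predicted accuracy for each pronoun group, with the He vs She contrast at 0.
p_they = inv_logit(INTERCEPT + (-0.66) * B_THEY_VS_HE_SHE)
p_he_she = inv_logit(INTERCEPT + 0.33 * B_THEY_VS_HE_SHE)
print(round(p_they, 2), round(p_he_she, 2))  # 0.48 0.81
```

Because the random intercept is fixed at 0, these are predictions for a typical participant rather than population-averaged accuracies, so they need not match the raw means exactly.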

Table A.4: Experiment 1B: Model results for the effect of Pronoun on Production Accuracy, when the memory task was completed after the production task.
Experiment 1B: Production

	Production Accuracy
Predictors	Log-Odds	SE	z	p
(Intercept)	1.108	0.123	8.974	<0.001
Pronoun: They (-.66) vs He (+.33) + She (+.33)	2.476	0.167	14.788	<0.001
Pronoun: He (-.5) vs She (+.5)	0.002	0.224	0.009	0.993
Random Effects
τ₀₀ _Participant	0.779
τ₀₀ _Name	0.008
N _Participant	101
N _Name	12
Observations	1212

Table A.5: Experiment 1B: Model results for the effects of Pronoun and Memory Accuracy on Production Accuracy, when the memory task was completed after the production task.
Experiment 1B: Memory Predicting Production

	Production Accuracy
Predictors	Log-Odds	SE	z	p
(Intercept)	0.767	0.080	9.538	<0.001
Pronoun: They (-.66) vs He (+.33) + She (+.33)	1.991	0.161	12.349	<0.001
Pronoun: He (-.5) vs She (+.5)	-0.040	0.209	-0.194	0.846
Memory Accuracy: Wrong (-.5) vs Right (+.5)	1.491	0.161	9.274	<0.001
Pronoun (They vs He + She) * Memory Accuracy	-1.156	0.322	-3.584	<0.001
Pronoun (He vs She) * Memory Accuracy	0.240	0.418	0.575	0.565
Observations	1212

Figure A.2: Experiment 1B: By-participant mean accuracy in the multiple-choice memory task for each character’s pronouns, pet, and job, with colors indicating the character’s pronouns. Error bars indicate 95% CIs calculated over the by-participant means.

Table A.6: Experiment 1B: Model results for the effects of Character Pronoun and Question Type (character’s pronoun or pet) on Memory Accuracy, when the memory task was completed after the production task.
Experiment 1B: Memory for Pronouns vs Pets

	Memory Accuracy
Predictors	Log-Odds	SE	z	p
(Intercept)	0.324	0.072	4.505	<0.001
Question Type: Pet (-.5) vs Pronoun (+.5)	1.255	0.103	12.217	<0.001
Character Pronoun: They (-.66) vs He (+.33) + She (+.33)	0.818	0.096	8.532	<0.001
Character Pronoun: He (-.5) vs She (+.5)	0.075	0.124	0.600	0.549
Question Type * Character Pronoun (They vs He+She)	1.474	0.192	7.688	<0.001
Question Type * Character Pronoun (He vs She)	0.022	0.231	0.095	0.924
Random Effects
τ₀₀ _Participant	0.256
τ₀₀ _Name	0.005
τ₁₁ _{Question Type \| Participant}	0.177
ρ₀₁ _Participant	0.416
N _Participant	101
N _Name	12
Observations	2424

A.1.3 Comparing Experiments 1A & 1B

Figure A.3: Experiments 1A & 1B. Memory accuracy, distribution of memory responses, production accuracy, and distribution of production responses, comparing between task orders. Points indicate by-participant means, and error bars indicate 95% CIs calculated over the by-participant means.

Figure A.4: Experiments 1A & 1B. Distribution of combined memory and production accuracy, then production accuracy split by memory accuracy, comparing between task orders. Error bars indicate 95% CIs calculated over trials.

Table A.7: Experiments 1A & 1B: Model results for the effects of Pronoun and Task Order on Memory Accuracy.
Comparing Experiments 1A (Memory First) & 1B (Production First): Memory

	Memory Accuracy
Predictors	Log-Odds	SE	z	p
(Intercept)	0.866	0.070	12.310	<0.001
Pronoun: They (-.66) vs He (+.33) + She (+.33)	1.582	0.101	15.715	<0.001
Pronoun: He (-.5) vs She (+.5)	0.088	0.130	0.674	0.500
Order: Memory First (-.5) vs Production First (+.5)	0.177	0.132	1.341	0.180
Pronoun (They vs He+She) * Order	-0.043	0.196	-0.217	0.828
Pronoun (He vs She) * Order	0.023	0.247	0.094	0.925
Random Effects
τ₀₀ _Participant	0.406
τ₀₀ _Name	0.005
N _Participant	203
N _Name	12
Observations	2436

Table A.8: Experiments 1A & 1B: Model results for the effects of Pronoun and Task Order on Production Accuracy.
Comparing Experiments 1A (Memory First) & 1B (Production First): Production

	Production Accuracy
Predictors	Log-Odds	SE	z	p
(Intercept)	0.905	0.052	17.425	<0.001
Pronoun: They (-.66) vs He (+.33) + She (+.33)	2.370	0.102	23.194	<0.001
Pronoun: He (-.5) vs She (+.5)	0.112	0.137	0.819	0.413
Order: Memory First (-.5) vs Production First (+.5)	0.134	0.104	1.291	0.197
Pronoun (They vs He+She) * Order	-0.453	0.204	-2.215	0.027
Pronoun (He vs She) * Order	-0.187	0.274	-0.683	0.495
Observations	2436

Table A.9: Experiments 1A & 1B: Model results for the effects of Pronoun, Memory Accuracy, and Task Order on Production Accuracy.
Comparing Experiments 1A (Memory First) & 1B (Production First): Memory Predicting Production

	Production Accuracy
Predictors	Log-Odds	SE	z	p
(Intercept)	0.729	0.056	12.928	<0.001
Pronoun: They (-.66) vs He (+.33) + She (+.33)	2.208	0.113	19.506	<0.001
Pronoun: He (-.5) vs She (+.5)	0.078	0.146	0.536	0.592
Memory Accuracy: Wrong (-.5) vs Right (+.5)	1.354	0.113	12.005	<0.001
Order: Memory First (-.5) vs Production First (+.5)	0.075	0.113	0.667	0.505
Pronoun (They vs He + She) * Memory Accuracy	-0.932	0.226	-4.115	<0.001
Pronoun (He vs She) * Memory Accuracy	0.121	0.293	0.413	0.680
Pronoun (They vs He+She) * Order	-0.435	0.226	-1.923	0.055
Pronoun (He vs She) * Order	-0.238	0.293	-0.813	0.416
Memory Accuracy * Order	0.274	0.226	1.214	0.225
Pronoun (They vs He+She) * Memory Accuracy * Order	-0.448	0.453	-0.989	0.323
Pronoun (He vs She) * Memory Accuracy * Order	0.239	0.585	0.408	0.683
Observations	2436

Table A.10: Experiments 1A & 1B: Model results for the effects of Pronoun and Task Order on the difference between memory accuracy and production accuracy for each participant.
Comparing Experiments 1A (Memory First) & 1B (Production First): By-Participant Differences Between Memory & Production

	Difference Score
Predictors	Estimates	SE	z	p
(Intercept)	-0.002	0.012	-0.206	0.837
Order: Memory First (-.5) vs Production First (+.5)	0.002	0.024	0.068	0.946
Pronoun: They (-.66) vs He (+.33) + She (+.33)	-0.181	0.026	-7.079	<0.001
Pronoun: He (-.5) vs She (+.5)	-0.001	0.029	-0.040	0.968
Pronoun (They vs He+She) * Order	0.079	0.051	1.548	0.122
Pronoun (He vs She) * Order	0.027	0.058	0.464	0.643
Observations	609
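The dependent variable in Table A.10 is a by-participant difference score; the 609 observations correspond to 203 participants × 3 pronouns. A minimal sketch of one way to compute such scores follows. The direction of subtraction (memory minus production) is an assumption, and the function name and toy data are hypothetical.

```python
from collections import defaultdict


def difference_scores(trials):
    """Mean memory accuracy minus mean production accuracy per cell.

    trials: iterable of (participant, pronoun, memory_correct, production_correct),
    with both accuracy values in {0, 1}.
    Returns {(participant, pronoun): difference}.
    Assumption: difference = memory - production.
    """
    sums = defaultdict(lambda: [0, 0, 0])  # memory sum, production sum, trial count
    for pid, pronoun, mem, prod in trials:
        cell = sums[(pid, pronoun)]
        cell[0] += mem
        cell[1] += prod
        cell[2] += 1
    return {key: (m - p) / n for key, (m, p, n) in sums.items()}


# Hypothetical toy data for one participant.
toy_trials = [
    ("p1", "they", 1, 0), ("p1", "they", 1, 1),
    ("p1", "they", 0, 0), ("p1", "they", 1, 0),
    ("p1", "he", 1, 1), ("p1", "he", 0, 1),
    ("p1", "he", 1, 1), ("p1", "he", 1, 1),
]
scores = difference_scores(toy_trials)
print(scores)
```

In the toy data, the they/them score is positive (memory better than production) and the he/him score is negative (production better than memory).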

Figure A.5: Experiments 1A & 1B: Correlations between by-participant random slopes for the effect of Pronoun in each half of the data, for the memory and production tasks.

A.2 Experiment 2

A.2.1 Pet & Job Questions

As in Experiments 1A & 1B, memory for the characters’ 12 jobs was analyzed in order to make sure the task did not show floor effects, and memory for the characters’ 3 pets was analyzed as a less marked comparison to pronouns (Figure A.6). Averaged across conditions, accuracy for jobs (M = 0.37) was numerically higher than in Experiments 1A (M = 0.21) and 1B (M = 0.29). Accuracy for pets was also higher in Experiment 2 (M = 0.54) than in Experiments 1A (M = 0.41) and 1B (M = 0.43). Job and pet accuracy did not vary based on the characters’ pronouns or the PSA and Biography conditions.

Pronoun and pet questions were compared in a model including the Character’s Pronouns (contrast coded as in the main analyses), the Question Type (mean-center effects coded), and the PSA and Biography conditions as fixed effects. The initial model included interactions between Character Pronoun, PSA, and Biography as in the main analyses; the interaction between Character Pronoun and Question Type; by-participant and by-item intercepts; and by-participant and by-item slopes for Character Pronoun and Question Type. In addition to the subset of interactions between fixed effects listed above, the most complex model that converged included by-participant intercepts, by-item intercepts, and by-item slopes for Question Type (Table A.11). Participants were significantly more accurate for pronoun questions than pet questions (β = 1.03, z = 15.59, p < .001), and the interaction between Character Pronoun (they/them vs he/him + she/her) and Question Type was significant (β = 1.50, z = 12.84, p < .001). Probing this interaction indicated that there was no significant difference in accuracy between pronouns and pets for they/them characters (β = 0.05, z = 0.48, p = .63), only for he/him + she/her characters (β = 1.53, z = 18.57, p < .001). This resembles the pattern of results in Experiment 1.

Figure A.6: Experiment 2: Mean accuracy in the multiple-choice memory task for pronouns, pets, and jobs, with colors indicating the character’s pronouns. By-participant means are shown as points; error bars indicate 95% CIs calculated over the by-participant means.

Table A.11: Experiment 2: Model results for the effects of Character Pronoun, PSA, Biography, and Question Type (pronoun vs pet) on Memory Accuracy.
Experiment 2: Memory for Pronouns vs Pets

	Memory Accuracy
Predictors	Log-Odds	SE	z	p
(Intercept)	0.838	0.089	9.400	<0.001
Question Type: Pet (-.5) vs Pronoun (+.5)	1.034	0.066	15.592	<0.001
Character Pronoun: They (-.66) vs He (+.33) + She (+.33)	0.752	0.058	12.915	<0.001
Character Pronoun: He (-.5) vs She (+.5)	0.070	0.087	0.804	0.421
PSA: Unrelated (-.5) vs PSA: Gendered Language (+.5)	0.142	0.165	0.861	0.389
Biography: He/She (-.5) vs Biography: They (+.5)	-0.123	0.165	-0.747	0.455
Question Type * Character Pronoun (They vs He+She)	1.495	0.116	12.841	<0.001
Question Type * Character Pronoun (He vs She)	0.189	0.155	1.220	0.222
Character Pronoun (They vs He+She) * PSA	-0.247	0.116	-2.135	0.033
Character Pronoun (He vs She) * PSA	-0.026	0.138	-0.188	0.851
PSA * Biography	0.166	0.330	0.503	0.615
Character Pronoun (They vs He+She) * Biography	-0.011	0.116	-0.095	0.925
Character Pronoun (He vs She) * Biography	-0.008	0.138	-0.057	0.954
Pronoun (They vs He+She) * PSA * Biography	0.084	0.231	0.363	0.717
Pronoun (He vs She) * PSA * Biography	0.357	0.276	1.293	0.196
Random Effects
τ₀₀ _Participant	1.883
τ₀₀ _Name	0.011
τ₁₁ _{Question Type \| Name}	0.015
ρ₀₁ _Name	0.099
N _Name	12
N _Participant	320
Observations	7680

A.2.2 Additional Figures

Figure A.7: Experiment 2: Distribution of memory and production responses.

A.2.3 Additional Model Results

Table A.12: Experiment 2: Model results for the effects of Pronoun, PSA, Biography, and Memory Accuracy on Production Accuracy.
Experiment 2: Memory Predicting Production

	Production Accuracy
Predictors	Log-Odds	SE	z	p
(Intercept)	0.522	0.055	9.542	<0.001
Pronoun: They (-.66) vs He (+.33) + She (+.33)	3.130	0.110	28.527	<0.001
Pronoun: He (-.5) vs She (+.5)	-0.342	0.133	-2.560	0.010
Memory Accuracy: Wrong (-.5) vs Right (+.5)	0.830	0.104	8.002	<0.001
PSA: Unrelated (-.5) vs PSA: Gendered Language (+.5)	0.075	0.104	0.719	0.472
Biography: He/She (-.5) vs Biography: They (+.5)	-0.017	0.104	-0.163	0.871
Pronoun (They vs He + She) * PSA	-1.896	0.219	-8.677	<0.001
Pronoun (He vs She) * PSA	0.120	0.258	0.464	0.643
Pronoun (They vs He + She) * Memory Accuracy	-0.402	0.219	-1.838	0.066
Pronoun (He vs She) * Memory Accuracy	0.255	0.259	0.987	0.324
PSA * Memory Accuracy	0.240	0.207	1.156	0.248
PSA * Biography	0.377	0.207	1.820	0.069
Biography * Memory Accuracy	-0.114	0.207	-0.551	0.582
Pronoun (They vs He + She) * Biography	-0.304	0.219	-1.391	0.164
Pronoun (He vs She) * Biography	0.174	0.258	0.675	0.500
Pronoun (They vs He + She) * PSA * Memory Accuracy	-0.237	0.437	-0.542	0.588
Pronoun (He vs She) * PSA * Memory Accuracy	0.480	0.516	0.931	0.352
PSA * Biography * Memory Accuracy	0.425	0.415	1.025	0.305
Pronoun (They vs He + She) * PSA * Biography	1.198	0.437	2.740	0.006
Pronoun (He vs She) * PSA * Biography	0.446	0.516	0.863	0.388
Pronoun (They vs He + She) * Biography * Memory Accuracy	0.830	0.437	1.899	0.058
Pronoun (He vs She) * Biography * Memory Accuracy	-0.390	0.516	-0.756	0.450
Pronoun (They vs He + She) * PSA * Biography * Memory Accuracy	-1.527	0.876	-1.744	0.081
Pronoun (He vs She) * PSA * Biography * Memory Accuracy	-1.203	1.033	-1.164	0.244
Random Effects
τ₀₀ _Name	0.004
N _Name	12
Observations	3840

Table A.13: Experiment 2: Model results for the effects of PSA and Biography on whether each participant produced singular *they* at least once. Participants were coded with a 1 if they produced singular *they* at least once in the written sentence completion task, regardless of accuracy, and with a 0 if they did not.
Experiment 2: Production of Singular They

	Produce They/Them
Predictors	Log-Odds	SE	z	p
(Intercept)	-0.362	0.117	-3.086	0.002
PSA: Unrelated (-.5) vs PSA: Gendered Language (+.5)	0.927	0.235	3.946	<0.001
Biography: He/She (-.5) vs Biography: They (+.5)	0.006	0.235	0.025	0.980
PSA * Biography	-0.816	0.470	-1.737	0.082
Observations	320
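The binary outcome in Table A.13 can be derived from trial-level responses as sketched below. The response label "they" and the input layout are hypothetical; the caption only specifies that a participant is coded 1 if singular *they* appeared at least once, regardless of accuracy.

```python
def produced_they_at_least_once(responses):
    """Code each participant 1 if they ever produced they/them, else 0.

    responses: iterable of (participant, pronoun_produced) pairs.
    Accuracy is deliberately ignored, following the caption of Table A.13.
    """
    coded = {}
    for pid, pronoun in responses:
        produced = int(pronoun == "they")
        coded[pid] = max(coded.get(pid, 0), produced)
    return coded


# Hypothetical toy responses from three participants.
toy_responses = [
    ("p1", "he"), ("p1", "they"), ("p1", "she"),
    ("p2", "he"), ("p2", "she"),
    ("p3", "they"), ("p3", "they"),
]
coded = produced_they_at_least_once(toy_responses)
print(coded)  # {'p1': 1, 'p2': 0, 'p3': 1}
```

This yields one row per participant, which matches the 320 observations in the table.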

A.3 Experiment 3

A.3.1 Norming Study

Table A.14: Image Norming: Results. Counts and proportions of they/them, he/him, she/her, and no pronoun responses for each image in the norming study.

Norming Study: Pronouns Produced
	Image Code	they/them	he/him	she/her	none
Images Included
	gs04	19	0	6	5
	gs06	20	0	5	5
	gs12	24	0	4	2
	gs09	24	4	0	2
	gs08	22	5	0	3
	gs11	21	7	0	2
Images Not Included
	gs03	20	0	7	3
	gs05	20	0	6	4
	gs02	19	1	6	4
	gs10	23	1	3	3
	gs07	23	1	2	4
	gs01	20	6	1	3
Totals
	Proportion	0.71	0.07	0.11	0.11
	Count	255	25	40	40
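The totals in Table A.14 can be recomputed from the per-image counts, which also shows that each image received 30 responses. The sketch below copies the counts from the table; the variable names are illustrative.

```python
# Counts per image from Table A.14, in the order:
# (they/them, he/him, she/her, none)
NORMING_COUNTS = {
    # Images included
    "gs04": (19, 0, 6, 5), "gs06": (20, 0, 5, 5), "gs12": (24, 0, 4, 2),
    "gs09": (24, 4, 0, 2), "gs08": (22, 5, 0, 3), "gs11": (21, 7, 0, 2),
    # Images not included
    "gs03": (20, 0, 7, 3), "gs05": (20, 0, 6, 4), "gs02": (19, 1, 6, 4),
    "gs10": (23, 1, 3, 3), "gs07": (23, 1, 2, 4), "gs01": (20, 6, 1, 3),
}
CATEGORIES = ("they/them", "he/him", "she/her", "none")

# Sum each response category across the 12 images.
totals = [sum(column) for column in zip(*NORMING_COUNTS.values())]
n_responses = sum(totals)
proportions = [round(t / n_responses, 2) for t in totals]

print(dict(zip(CATEGORIES, totals)))
print(dict(zip(CATEGORIES, proportions)))
```

The recomputed counts (255, 25, 40, 40) and proportions (0.71, 0.07, 0.11, 0.11) agree with the Totals rows of the table.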

A.3.2 Additional Survey Results

Table A.15: Experiment 3: Additional Demographics. Race/ethnicity has a higher total, as participants could select more than one option.

Experiment 3: Additional Participant Demographics
English Experience		162
	Native (learned from birth)	151
	Fully competent in speaking, listening, reading, and writing, but not native	8
	Prefer not to answer / Missing data	3
Education		162
	Less than high school	2
	High school graduate	23
	Some college	24
	2-year degree	16
	4-year degree	63
	Professional degree	23
	Doctorate	7
	Prefer not to answer / Missing data	4
Race/Ethnicity		176
	American Indian or Alaska Native	4
	Asian	9
	Black, African American, or African	21
	Hispanic, Latino, or Spanish	19
	Middle Eastern or North African	2
	Native Hawaiian or Pacific Islander	0
	White	112
	I use a different term	2
	Prefer not to answer / Missing data	7
Total Participants		162

The difference in naturalness ratings between singular they coreferring with generic and quantified referents [Indefinite] and with masculine, feminine, and gender-neutral names [Proper Name] was tested using a linear mixed-effects model, with Referent Type mean-center effects coded. The model also included by-item intercepts and by-participant slopes. Ratings were mean centered such that a score of 0 indicated the center of the Likert scale, so the significant intercept means that both types of sentences were rated as more natural than unnatural (β = 1.07, t = 10.23, p < .001). The significant effect of Referent Type indicated that singular they was rated as more natural with indefinites than with proper names (β = 0.98, t = 4.89, p < .001) (Table A.16).

Table A.16: Experiment 3: Naturalness Ratings. Model results for the effect of Referent Type (generic, each, every vs masculine, feminine, gender-neutral names) on sentence naturalness ratings for singular *they*.
Experiment 3: Naturalness Ratings for Singular *They*

	Naturalness Ratings
Predictors	Estimates	SE	t	p
(Intercept)	1.067	0.104	10.234	<0.001
Referent Type: Proper Name (-.5) vs Indefinite (+.5)	0.981	0.200	4.893	<0.001
Random Effects
σ²	1.294
τ₀₀ _Participant	0.938
τ₀₀ _Item	0.022
τ₁₁ _{Referent Type \| Participant}	3.229
ρ₀₁ _Participant	-0.555
ICC	0.577
N _Item	6
N _Participant	159
Observations	954
Marginal R² / Conditional R²	0.073 / 0.608
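With Referent Type coded -.5 and +.5, the intercept in Table A.16 is the average of the two referent types, so it is worth checking that each type separately sits above the scale midpoint. The sketch below recovers the two estimated cell means from the fixed effects. It ignores random effects and uses the rounded table values.

```python
# Fixed effects copied from Table A.16 (centered rating units).
INTERCEPT = 1.067
B_REFERENT_TYPE = 0.981  # Proper Name (-.5) vs Indefinite (+.5)

# Estimated mean rating for each referent type.
proper_name_mean = INTERCEPT + (-0.5) * B_REFERENT_TYPE
indefinite_mean = INTERCEPT + 0.5 * B_REFERENT_TYPE

print(round(proper_name_mean, 2), round(indefinite_mean, 2))
```

Both estimates are positive (about 0.58 for proper names and 1.56 for indefinites), consistent with both sentence types being rated on the natural side of the scale, and their difference equals the Referent Type coefficient.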

Table A.17: Experiment 3: Gender Beliefs. Question texts, distributions, means, and SDs for items in the gender beliefs scale (Nagoshi et al., 2008).

Experiment 3: Gender Binary & Gender Essentialism Beliefs
Item	M	SD	Distribution
I avoid people on the street whose gender is unclear to me.	0.74	1.32
I am uncomfortable around people who don't conform to traditional gender roles, e.g., aggressive women or emotional men.	1.14	1.66
I would be upset if someone I'd known a long time revealed to me that they used to be another gender.	1.40	1.81
I think there is something wrong with a person who says that they are neither a man nor a woman.	1.97	2.25
I believe that a person can never change their gender.	2.08	2.29
When I meet someone, it is important for me to be able to identify them as a man or a woman.	2.31	2.11
I don't like it when someone is flirting with me, and I can't tell if they are a man or a woman.	2.35	2.16
A person's genitalia define what gender they are, e.g., a penis defines a person as being a man, a vagina defines a person as being a woman.	2.52	2.41
I believe that the male/female dichotomy is natural.	3.50	2.17
Total	18.02	14.38
0: Strongly Disagree – 6: Strongly Agree

A.3.3 Additional Pronoun Accuracy Results

Figure A.8: Experiment 3: [A] Distribution of pronouns produced for each of the 6 character images (Drucker, 2019). [B] Mean accuracy for each of the 18 name + image + pronoun combinations, summarized across Nametag & Introduction conditions. Error bars indicate 95% CIs calculated over the by-item means.

A.3.4 Participant Covariate Analyses

Figure A.9: Experiment 3: Distribution of mean-centered, rescaled participant covariates tested for model inclusion.

Table A.18: Experiment 3: Model results for the effects of Pronoun, Nametag, Introduction, Gender Beliefs, and Familiarity with Pronoun-Sharing on Accuracy.
Experiment 3: Participant Covariates

	Production Accuracy
Predictors	Log-Odds	SE	z	p
(Intercept)	11.431	1.398	8.175	<0.001
Pronoun Pair: T\|HS (-.66) vs HS\|T (+.33) + HS\|SH (+.33)	4.009	1.102	3.638	<0.001
Pronoun Pair: HS\|T (-.5) vs HS\|SH (+.5)	-0.900	0.730	-1.233	0.218
Nametag (-.5; +.5)	1.332	1.136	1.173	0.241
Intro (-.5; +.5)	0.462	1.188	0.389	0.697
Gender Beliefs (mean-centered, higher binary endorsement +)	-10.067	2.927	-3.439	0.001
Familiarity with Pronoun Sharing (mean-centered)	-3.232	1.655	-1.953	0.051
Pronoun (T\|HS vs HS\|T + HS\|SH) * Nametag	2.734	0.940	2.908	0.004
Pronoun (HS\|T vs HS\|SH) * Nametag	-0.326	1.189	-0.274	0.784
Pronoun (T\|HS vs HS\|T + HS\|SH) * Intro	-3.445	1.223	-2.816	0.005
Pronoun (HS\|T vs HS\|SH) * Intro	-1.450	1.416	-1.024	0.306
Nametag * Intro	-1.428	2.286	-0.625	0.532
Pronoun (T\|HS vs HS\|T + HS\|SH) * Gender Beliefs	6.504	1.841	3.533	<0.001
Pronoun (HS\|T vs HS\|SH) * Gender Beliefs	-4.167	2.573	-1.620	0.105
Intro * Gender Beliefs	8.753	5.573	1.571	0.116
Pronoun (T\|HS vs HS\|T + HS\|SH) * Familiarity	-2.048	1.127	-1.817	0.069
Pronoun (HS\|T vs HS\|SH) * Familiarity	1.067	1.452	0.735	0.462
Nametag * Gender Beliefs	3.090	4.881	0.633	0.527
Intro * Familiarity	-0.945	3.261	-0.290	0.772
Pronoun (T\|HS vs HS\|T + HS\|SH) * Nametag * Intro	6.163	1.651	3.733	<0.001
Pronoun (HS\|T vs HS\|SH) * Nametag * Intro	0.663	2.248	0.295	0.768
Pronoun (T\|HS vs HS\|T + HS\|SH) * Gender Beliefs * Intro	13.319	4.284	3.109	0.002
Pronoun (HS\|T vs HS\|SH) * Gender Beliefs * Intro	2.599	4.359	0.596	0.551
Pronoun (T\|HS vs HS\|T + HS\|SH) * Nametag * Gender Beliefs	-11.034	4.642	-2.377	0.017
Pronoun (HS\|T vs HS\|SH) * Nametag * Gender Beliefs	-9.794	5.381	-1.820	0.069
Pronoun (T\|HS vs HS\|T + HS\|SH) * Intro * Familiarity	6.643	2.567	2.588	0.010
Pronoun (HS\|T vs HS\|SH) * Intro * Familiarity	4.790	3.032	1.580	0.114
Random Effects
τ₀₀ _Participant	34.933
τ₀₀ _Character	3.143
N _Participant	158
N _Character	18
Observations	4640

A.4 Experiment 4

A.4.1 Additional Survey Results

Table A.19: Experiment 4: Gender Beliefs. Question texts, distributions, means, and SDs for items in the gender beliefs scale (Nagoshi et al., 2008).

Experiment 4: Gender Binary & Gender Essentialism Beliefs
Item	M	SD	Distribution
I avoid people on the street whose gender is unclear to me.	0.67	1.30
I think there is something wrong with a person who says that they are neither a man nor a woman.	0.70	1.42
I am uncomfortable around people who don't conform to traditional gender roles, e.g., aggressive women or emotional men.	0.90	1.65
I believe that a person can never change their gender.	1.03	1.77
I would be upset if someone I'd known a long time revealed to me that they used to be another gender.	1.37	2.16
A person's genitalia define what gender they are, e.g., a penis defines a person as being a man, a vagina defines a person as being a woman.	1.70	1.95
When I meet someone, it is important for me to be able to identify them as a man or a woman.	1.87	1.81
I don't like it when someone is flirting with me, and I can't tell if they are a man or a woman.	2.63	2.14
I believe that the male/female dichotomy is natural.	3.07	2.00
Total	13.93	12.56
0: Strongly Disagree – 6: Strongly Agree

As in Experiment 3, the difference in naturalness ratings between singular they coreferring with generic and quantified referents [Indefinite] and with masculine, feminine, and gender-neutral names [Proper Name] was tested using a linear mixed-effects model (Table A.20). Referent Type was mean-center effects coded, and the model also included by-item intercepts and by-participant slopes. Ratings were mean centered such that a score of 0 indicated the center of the Likert scale, so the significant intercept means that both types of sentences were rated as more natural than unnatural (β = 1.07, t = 3.89, p < .01). There was no significant difference between proper name and indefinite referents (β = 0.19, t = 0.36, p = .73).

Table A.20: Experiment 4: Model results for the effect of Referent Type (generic, each, every vs masculine, feminine, gender-neutral names) on sentence naturalness ratings for singular *they*.
Experiment 4: Naturalness Ratings for Singular *They*

	Naturalness Ratings
Predictors	Estimates	SE	t	p
(Intercept)	1.072	0.276	3.892	0.004
Referent Type: Proper Name (-.5) vs Indefinite (+.5)	0.189	0.519	0.364	0.726
Random Effects
σ²	1.420
τ₀₀ _Participant	0.807
τ₀₀ _Item	0.247
τ₁₁ _{Referent Type \| Participant}	2.200
ρ₀₁ _Participant	-0.317
ICC	0.530
N _Item	6
N _Participant	30
Observations	180
Marginal R² / Conditional R²	0.003 / 0.532

A.4.2 Match Judgments

Table A.21: Experiment 4: Test Trial Match Rates. Model results for the effect of Pronoun Pair on the likelihood of judging the story to match the scene in test trials (1 = match = correct).
Experiment 4: Match Judgment Rates in Test Trials

	Match
Predictors	Log-Odds	SE	z	p
(Intercept)	3.087	0.254	12.176	<0.001
Pronoun Pair: They\|HeShe (-.66) vs HeShe\|They (+.33) + HeShe\|SheHe (+.33)	0.330	0.245	1.345	0.179
Pronoun Pair: HeShe\|They (-.5) vs HeShe\|SheHe (+.5)	0.255	0.275	0.926	0.355
Random Effects
τ₀₀ _Participant	1.536
τ₁₁ _{Pronoun Pair (T\|HS vs HS\|T + HS\|SH) \| Participant}	0.341
τ₁₁ _{Pronoun Pair (HS\|T vs HS\|SH) \| Participant}	0.120
ρ_{01 Pronoun Pair (T\|HS vs HS\|T + HS\|SH) \| Participant}	0.277
ρ_{01 Pronoun Pair (HS\|T vs HS\|SH) \| Participant}	-0.637
N _Participant	30
Observations	2877

Table A.22: Experiment 4: Wrong Pronoun Trial Match Rates. Model results for the effect of Pronoun on judging the story to match the scene in wrong pronoun trials (1 = match = incorrect). There were no wrong pronoun trials for the HeShe|They condition, so the They|HeShe (they/them character referred to with he/him or she/her, whichever the competitor did not use) and the HeShe|SheHe (he/him and she/her characters referred to with they/them) conditions were mean-center effects coded.
Experiment 4: Match Judgment Rates in Wrong Pronoun Trials

	Match
Predictors	Log-Odds	SE	z	p
(Intercept)	-0.373	0.328	-1.137	0.255
Correct Pronoun: They (-.5) vs He + She (+.5)	-0.076	0.362	-0.211	0.833
Random Effects
τ₀₀ _Story	0.190
τ₀₀ _Participant	2.206
τ₁₁ _{Pronoun \| Story}	0.140
ρ₀₁ _{Pronoun \| Story}	-0.507
N _Participant	30
N _Story	35
Observations	238

Table A.23: Experiment 4: Test Trial Match RT. Model results for the effect of Pronoun Pair on reaction times for test trial match judgments.
Experiment 4: Match Judgment RT in Test Trials

	RT (ms)
Predictors	Estimates	SE	t	p
(Intercept)	3888.739	32.056	121.311	<0.001
Pronoun Pair: They\|HeShe (-.66) vs HeShe\|They (+.33) + HeShe\|SheHe (+.33)	-506.668	40.874	-12.396	<0.001
Pronoun Pair: HeShe\|They (-.5) vs HeShe\|SheHe (+.5)	32.961	43.864	0.751	0.452
Random Effects
σ²	0.014
τ₀₀ _Story	126
τ₀₀ _Participant	305041
τ₁₁ _{Pronoun Pair (T\|HS vs HS\|T + HS\|SH) \| Story}	397848
τ₁₁ _{Pronoun Pair (HS\|T vs HS\|SH) \| Story}	592385
τ₁₁ _{Pronoun Pair (T\|HS vs HS\|T + HS\|SH) \| Participant}	114829
τ₁₁ _{Pronoun Pair (HS\|T vs HS\|SH) \| Participant}	151199
ρ_{01 Pronoun Pair (T\|HS vs HS\|T + HS\|SH) \| Story}	-0.324
ρ_{01 Pronoun Pair (HS\|T vs HS\|SH) \| Story}	-0.961
ρ_{01 Pronoun Pair (T\|HS vs HS\|T + HS\|SH) \| Participant}	-0.833
ρ_{01 Pronoun Pair (HS\|T vs HS\|SH) \| Participant}	0.092
N _Participant	30
N _Story	60
Observations	2842

A.4.3 Additional Figures

Figure A.10: Experiment 4: Name Window. Looks to the named characters at the beginning of the story, with the first name beginning at 0ms and the second name beginning ~1000ms later.

Figure A.11: Experiment 4: Preview Window. Looks to each character during the 1000ms preview time before audio started, split by the character’s pronouns.

A.4.4 Additional Eyetracking Results

The Order of Mention effect was estimated separately for each Pronoun Pair condition by running three models, each with one Pronoun Pair condition coded as 0 and the other two Pronoun Pair conditions coded as 1. The maximal random effects structures that converged (Bates et al., 2015; R Core Team, 2023; Voeten, 2023) only included a subset of those in the main model (Table 4.2). For the by-participant effects, each model kept all three slopes (AR(1), Order, and Trial Number). For the by-item effects, the HeShe|They reference model included both slopes (Order, Trend), the HeShe|SheHe reference model only included slopes for Trend, and the They|HeShe reference model only included intercepts.
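The three reference-group codings can be sketched as follows. When a 0/1 coded predictor interacts with Order, the Order coefficient estimates the Order effect for the condition coded 0, which is why each model isolates one Pronoun Pair condition. The function and variable names here are hypothetical, and the models themselves were fit in R rather than Python.

```python
CONDITIONS = ("HeShe|SheHe", "HeShe|They", "They|HeShe")


def reference_coding(condition, reference):
    """Return 0 for the reference condition and 1 for the other two."""
    if condition not in CONDITIONS or reference not in CONDITIONS:
        raise ValueError("unknown Pronoun Pair condition")
    return 0 if condition == reference else 1


# One coding scheme per model, each with a different reference group.
codings = {
    reference: {c: reference_coding(c, reference) for c in CONDITIONS}
    for reference in CONDITIONS
}

for reference, scheme in codings.items():
    print(reference, scheme)
```

Each scheme assigns 0 to exactly one condition, so the three models together give one Order estimate per Pronoun Pair condition.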

In HeShe|SheHe trials (Table A.24), neither the main effect of Order (β = 0.05, z = 0.47, p = .64) nor its interaction with Pronoun Pair (β = 0.06, z = 0.70, p = .49) was significant. In HeShe|They trials (Table A.25), the main effect of Order was significant as anticipated, with participants more likely to be looking at target characters who had been named first in the story than target characters who had been named second (β = 0.22, z = 2.20, p < .05). The interaction between Order and Pronoun Pair was also significant (β = -0.20, z = -2.20, p < .05). Probing this interaction indicated that listeners were less likely to be looking at the target in HeShe|They trials than in HeShe|SheHe and They|HeShe trials when the target was mentioned second (β = 0.17, z = 2.51, p < .05), but not when the target was mentioned first (β = -0.04, z = -0.62, p = .54). In They|HeShe trials (Table A.26), neither the main effect of Order (β = 0.00, z = -0.03, p = .97) nor its interaction with Pronoun Pair (β = 0.14, z = 1.48, p = .14) was significant.

Table A.24: Experiment 4: Model results for the effects of Pronoun Pair and Order on the likelihood of looking at the target character (=1) or not (=0). The HeShe|SheHe condition is coded as 0, and the HeShe|They and They|HeShe conditions are coded as 1.
Experiment 4: Looks to the Target Character (HeShe\|SheHe As Reference Group)

	Looks to Target
Predictors	Log-Odds	SE	z	p
(Intercept)	-4.480	0.089	-50.111	<0.001
AR(1) (0, 1)	9.971	0.109	91.859	<0.001
Pronoun Pair: HeShe\|SheHe (0) vs HeShe\|They (1) + They\|HeShe (1)	-0.347	0.047	-7.454	<0.001
Order: Target Mentioned Second (-.5) vs First (+.5)	0.046	0.098	0.467	0.641
Trend (mean-centered)	0.049	0.038	1.309	0.191
Pronoun Pair * Order	0.065	0.093	0.698	0.485
Random Effects
τ₀₀ _Story	0.005
τ₀₀ _Participant	0.196
τ₁₁ _{Trend \| Story}	0.004
τ₁₁ _{AR(1) \| Participant}	0.268
τ₁₁ _{Order \| Participant}	0.110
τ₁₁ _{Trial Number \| Participant}	0.027
ρ₀₁ _{Trend \| Story}	0.018
ρ₀₁ _{AR(1) \| Participant}	-0.296
ρ₀₁ _{Order \| Participant}	0.084
ρ₀₁ _{Trial Number \| Participant}	0.508
N _Participant	30
N _Story	60
Observations	296022
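
Because the estimates in these tables are log-odds, it can help to convert them to probabilities. The sketch below does this for the intercept and AR(1) rows of Table A.24. It assumes that the AR(1) predictor codes whether the preceding sample was a look to the target (0 or 1), and it ignores the random effects, so the values describe only the fixed-effect prediction for the reference condition (HeShe|SheHe) with Order at 0 and Trend at its mean. The function name is hypothetical.

```python
import math


def inv_logit(log_odds):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Fixed-effect estimates (log-odds) copied from Table A.24.
intercept = -4.480
b_ar1 = 9.971

# Predicted probability of a look to the target when AR(1) = 0 and when AR(1) = 1.
p_ar1_zero = inv_logit(intercept)
p_ar1_one = inv_logit(intercept + b_ar1)

print(round(p_ar1_zero, 3), round(p_ar1_one, 3))
```

The two probabilities are roughly 0.011 and 0.996, which shows how strongly the AR(1) term dominates the sample-by-sample predictions relative to the other fixed effects.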

Table A.25: Experiment 4: Model results for the effects of Pronoun Pair and Order on the likelihood of looking at the target character (=1) or not (=0). The HeShe|They condition is coded as 0, and the HeShe|SheHe and They|HeShe conditions are coded as 1.
Experiment 4: Looks to the Target Character (HeShe\|They As Reference Group)

	Looks to Target
Predictors	Log-Odds	SE	z	p
(Intercept)	-4.768	0.088	-54.437	<0.001
AR(1) (0, 1)	9.988	0.110	90.880	<0.001
Pronoun Pair: HeShe\|They (0) vs HeShe\|SheHe (1) + They\|HeShe (1)	0.062	0.046	1.329	0.184
Order: Target Mentioned Second (-.5) vs First (+.5)	0.218	0.099	2.196	0.028
Trend (mean-centered)	0.044	0.038	1.152	0.249
Pronoun Pair * Order	-0.204	0.093	-2.205	0.027
Random Effects
τ₀₀ _Story	0.006
τ₀₀ _Participant	0.189
τ₁₁ _{Order \| Story}	0.025
τ₁₁ _{Trend \| Story}	0.005
τ₁₁ _{AR(1) \| Participant}	0.274
τ₁₁ _{Order \| Participant}	0.106
τ₁₁ _{Trial Number \| Participant}	0.032
ρ₀₁ _{Order \| Story}	0.951
ρ₀₁ _{Trend \| Story}	0.130
ρ₀₁ _{AR(1) \| Participant}	-0.289
ρ₀₁ _{Order \| Participant}	0.097
ρ₀₁ _{Trial Number \| Participant}	0.544
N _Participant	30
N _Story	60
Observations	296022

Table A.26: Experiment 4: Model results for the effects of Pronoun Pair and Order on the likelihood of looking at the target character (=1) or not (=0). The They|HeShe condition is coded as 0, and the HeShe|They and HeShe|SheHe conditions are coded as 1.
Experiment 4: Looks to the Target Character (They\|HeShe As Reference Group)

	Looks to Target
Predictors	Log-Odds	SE	z	p
(Intercept)	-4.901	0.089	-54.975	<0.001
AR(1) (0, 1)	9.983	0.110	90.621	<0.001
Pronoun Pair: They\|HeShe (0) vs HeShe\|SheHe (1) + HeShe\|They (1)	0.273	0.046	5.884	<0.001
Order: Target Mentioned Second (-.5) vs First (+.5)	-0.003	0.099	-0.034	0.973
Trend (mean-centered)	0.047	0.037	1.269	0.205
Pronoun Pair * Order	0.138	0.093	1.479	0.139
Random Effects
τ₀₀ _Story	0.005
τ₀₀ _Participant	0.192
τ₁₁ _{AR(1) \| Participant}	0.277
τ₁₁ _{Order \| Participant}	0.110
τ₁₁ _{Trial Number \| Participant}	0.030
ρ₀₁ _{AR(1) \| Participant}	-0.301
ρ₀₁ _{Order \| Participant}	0.101
ρ₀₁ _{Trial Number \| Participant}	0.493
N _Participant	30
N _Story	60
Observations	296022

Table A.27: Experiment 4: Model results for the effects of Target Pronoun and Order on the likelihood of looks to the target character (=1) or not (=0), during the window starting 200 ms after pronoun onset and ending at 1210 ms, the earliest shape word onset.
Experiment 4: Target Pronoun

	Looks to Target
Predictors	Log-Odds	SE	z	p
(Intercept)	-4.723	0.084	-56.372	<0.001
AR(1) (0, 1)	9.999	0.108	92.291	<0.001
Target Pronoun: They (-.66) vs He (+.33) + She (+.33)	0.275	0.047	5.864	<0.001
Target Pronoun: She (-.5) vs He (+.5)	0.039	0.053	0.728	0.467
Order: Target Mentioned Second (-.5) vs First (+.5)	0.085	0.077	1.099	0.272
Trend (mean-centered)	0.047	0.037	1.291	0.197
Target Pronoun (They vs He + She) * Order	0.139	0.094	1.478	0.140
Target Pronoun (She vs He) * Order	0.200	0.107	1.867	0.062
Random Effects
τ₀₀ _Story	0.005
τ₀₀ _Participant	0.191
τ₁₁ _{AR(1) \| Participant}	0.278
τ₁₁ _{Order \| Participant}	0.111
τ₁₁ _{Trial Number \| Participant}	0.029
τ₁₁ _{Order * Trial Number \| Participant}	0.014
ρ₀₁ _{AR(1) \| Participant}	-0.300
ρ₀₁ _{Order \| Participant}	0.103
ρ₀₁ _{Trial Number \| Participant}	0.488
ρ₀₁ _{Order * Trial Number \| Participant}	-0.339
N _Participant	30
N _Story	60
Observations	296022

Table A.28: Experiment 4: Trend Interactions. Model results for the interactions between Trend, Pronoun Pair, and Order on the likelihood of looking at the target character (=1) or not (=0).
Experiment 4: Interactions with Trend

	Looks to Target
Predictors	Log-Odds	SE	z	p
(Intercept)	-4.711	0.083	-56.631	<0.001
AR(1) (0, 1)	9.974	0.109	91.708	<0.001
Pronoun Pair: They\|HeShe (-.66) vs HeShe\|They (+.33) + HeShe\|SheHe (+.33)	0.284	0.061	4.624	<0.001
Pronoun Pair: HeShe\|They (-.5) vs HeShe\|SheHe (+.5)	0.264	0.073	3.594	<0.001
Order: Target Mentioned Second (-.5) vs First (+.5)	0.085	0.078	1.082	0.279
Trend (mean-centered)	0.058	0.037	1.587	0.113
Pronoun Pair (They\|HeShe vs HeShe\|They + HeShe\|SheHe) * Order	0.132	0.094	1.403	0.161
Pronoun Pair (HeShe\|They vs HeShe\|SheHe) * Order	-0.188	0.107	-1.755	0.079
Pronoun Pair (T\|HS vs HS\|T + HS\|SH) * Trend	-0.109	0.079	-1.392	0.164
Pronoun Pair (HS\|T vs HS\|SH) * Trend	-0.135	0.089	-1.519	0.129
Order * Trend	0.032	0.073	0.438	0.661
Pronoun Pair (T\|HS vs HS\|T + HS\|SH) * Order * Trend	-0.165	0.157	-1.054	0.292
Pronoun Pair (HS\|T vs HS\|SH) * Order * Trend	-0.115	0.176	-0.654	0.513
Random Effects
τ₀₀ _Participant	0.193
τ₁₁ _{AR(1) \| Participant}	0.268
τ₁₁ _{Order \| Participant}	0.118
τ₁₁ _{Trial Number \| Participant}	0.030
τ₁₁ _{Pronoun Pair (T\|HS vs HS\|T + HS\|SH) \| Participant}	0.038
τ₁₁ _{Pronoun Pair (HS\|T vs HS\|SH) \| Participant}	0.065
ρ₀₁ _{AR(1) \| Participant}	-0.285
ρ₀₁ _{Order \| Participant}	0.063
ρ₀₁ _{Trial Number \| Participant}	0.463
ρ₀₁ _{Pronoun Pair (T\|HS vs HS\|T + HS\|SH) \| Participant}	0.441
ρ₀₁ _{Pronoun Pair (HS\|T vs HS\|SH) \| Participant}	0.272
N _Participant	30
Observations	296022

Table A.29: Experiment 4: AR(1) Interactions. Model results for the interactions between AR(1), Pronoun Pair, and Order on the likelihood of looking at the target character (=1) or not (=0).
Experiment 4: Interactions with AR(1)

	Looks to Target
Predictors	Log-Odds	SE	z	p
(Intercept)	-4.708	0.085	-55.706	<0.001
AR(1) (0, 1)	9.972	0.109	91.872	<0.001
Pronoun Pair: They\|HeShe (-.66) vs HeShe\|They (+.33) + HeShe\|SheHe (+.33)	0.322	0.060	5.378	<0.001
Pronoun Pair: HeShe\|They (-.5) vs HeShe\|SheHe (+.5)	0.276	0.066	4.170	<0.001
Order: Target Mentioned Second (-.5) vs First (+.5)	0.081	0.083	0.976	0.329
Trend (mean-centered)	0.052	0.037	1.409	0.159
Pronoun Pair (They\|HeShe vs HeShe\|They + HeShe\|SheHe) * Order	0.070	0.120	0.582	0.561
Pronoun Pair (HeShe\|They vs HeShe\|SheHe) * Order	-0.016	0.132	-0.119	0.906
Pronoun Pair (T\|HS vs HS\|T + HS\|SH) * AR(1)	-0.111	0.098	-1.131	0.258
Pronoun Pair (HS\|T vs HS\|SH) * AR(1)	-0.002	0.112	-0.014	0.989
AR(1) * Order	0.022	0.093	0.233	0.815
Pronoun Pair (T\|HS vs HS\|T + HS\|SH) * Order * AR(1)	0.179	0.195	0.917	0.359
Pronoun Pair (HS\|T vs HS\|SH) * Order * AR(1)	-0.447	0.223	-2.001	0.045
Random Effects
τ₀₀ _Story	0.006
τ₀₀ _Participant	0.197
τ₁₁ _{Trial Number \| Story}	0.012
τ₁₁ _{AR(1) \| Participant}	0.270
τ₁₁ _{Order \| Participant}	0.113
τ₁₁ _{Trial Number \| Participant}	0.027
ρ₀₁ _{Trial Number \| Story}	0.048
ρ₀₁ _{AR(1) \| Participant}	-0.293
ρ₀₁ _{Order \| Participant}	0.088
ρ₀₁ _{Trial Number \| Participant}	0.485
N _Participant	30
N _Story	60
Observations	296022