Gendered Performance Differences in Introductory Physics: A Study from a Large Land-Grant University

Studies examining gender differences in introductory physics show a consensus when it comes to a gender gap on conceptual assessments; however, the story is not as clear when it comes to differences in gendered performance on exams. This study examined gendered differences in student performance in two introductory physics course sequences to determine whether they were persistent throughout each course. The population for this study included more than 10,000 students enrolled in algebra- and calculus-based introductory physics courses between spring 2007 and spring 2019. We found a small but statistically significant difference in final letter grades for only one out of four courses, algebra-based mechanics. By looking at midterm exam grades, statistically significant differences were noted for some exams in three out of four courses, with algebra-based electricity and magnetism being the exception. In all statistically significant cases, the effect size was small or weak, indicating that performance on exams and final letter grades was not strongly dependent on gender. Additionally, a questionnaire was administered in fall 2019 to more than 1,600 students in both introductory sequences to measure students' perceptions of performance, class contributions, and inclusion. We observed differences between students' perception of their performance and contribution when grouped by gender, but no difference on perception of inclusion.


I. INTRODUCTION
Over the last half century, the number of US students majoring in STEM fields has more than doubled [1]. As this enrollment has increased, so has the attention paid to who is obtaining degrees across different disciplines, particularly when it comes to underrepresented groups. While some STEM disciplines, such as biology, have relative parity between males and females attaining degrees, other disciplines have a persisting gender gap [2]. The National Center for Science and Engineering Statistics found that in 2016, women earned 20.9% of all engineering bachelor's degrees and 19.3% of all physics degrees [3].
Out of all STEM disciplines, physics is often considered to be the field that is the least welcoming for women to join [4,5]. Even for students not majoring in physics, STEM majors have to take physics as part of their academic program. For physics and engineering majors, introductory physics courses are among their early experiences in college. Such experiences can be crucial for student success within their majors [6].
Such experiences are especially important for female students, as many of them leave physical science and engineering tracks during the first two years of college [7]. Female students are likely to be under the pressure of gender stereotypes and societal biases [8,9], and they often find themselves underrepresented in their physics classes. Some authors argue that stereotype threat influences female student performance in introductory physics classes [10,11] and that use of an intervention based on value affirmation can help improve the situation [12]. Perhaps related to stereotyping, the atmosphere in physics classrooms can influence female students' physics self-efficacy, self-identity, and motivation, all of which can have an impact on student success and retention [13-17]. A number of studies have reported on the difference in physics self-efficacy between male and female students, including in courses which use research-supported instructional methods [18-22]. As an example, Marshman et al. reported that female students had significantly lower self-efficacy than male students throughout a two-semester introductory physics course sequence [23]. They go on to note that the physics self-efficacy of female students was negatively impacted by both traditional instruction courses and flipped classroom courses.
* Send correspondence to: etanya@tamu.edu
A vast literature exists that explores the gendered differences in student performance on concept inventory tests in introductory physics courses. The majority of existing studies report a persistent gender gap with males performing significantly better than females on introductory mechanics concept inventory assessments [24][25][26][27][28][29], with some authors arguing that removing gender-biased context can reduce the gap [27,28,30]. The gender gap has been found in conceptual inventories of electromagnetism as well, although to a lesser extent and with more variation across studies [24,28,29,31,32].
The results of prior studies on the gendered differences in student performance based on course grades and examinations are less consistent: while a number of studies indicate that male students outperform female students on exams and course grades [12,33,34], other groups found no significant gendered difference in student performance [25,26,29,31,32,35,36]. One study, comprising 4,000 students across 7 semesters at the University of Colorado Boulder, reported a small but significant difference in course grades, correlated with differences in background factors for males and females [33]. Factors beyond the course, including prior knowledge, math background, and attitudes towards science, have been seen to correlate with gendered differences in performance [33,37]. One study of an electricity and magnetism course by Andersson and Johansson argues that the gendered difference in course grades disappears when controlled for the program in which a student is enrolled [38]. Tai and Sadler show that females outperformed males in algebra-based courses while males outperformed females with the same background in calculus-based courses [39]. Several studies performed on a large number of students taking introductory physics classes report no significant gendered difference in student performance on course exams and course grades but a gender gap in concept inventories [25,26,31]. Some studies of gendered differences in undergraduate physics have reported reduction or elimination of the gap through the use of carefully selected instructional strategies in introductory physics [21,40,41]. However, other groups have found no effect of applying selected pedagogies or controlling the prior knowledge factors on gendered performance [24,29,42,43].
This work focused on expanding the studies of academic gendered differences through large enrollment courses at Texas A&M University (TAMU). This is a large, land-grant institution, which yearly serves more than 20,000 undergraduate STEM majors across multiple colleges [44]. STEM majors at TAMU complete their introductory physics sequence through either calculus-based courses (e.g. physics, engineering, chemistry, math) or algebra-based courses (e.g. life science majors, pre-meds, and environmental science). The engineering program comprises more than half of all STEM majors at TAMU, so the calculus-based sequence enrollment is much larger than the algebra-based sequence. Student demographics, academic goals, and attitudes towards physics may differ significantly between calculus-based and algebra-based courses. Most students in the calculus-based courses were in their freshman year, i.e. right after high school. The algebra-based courses are typically taken by upper-level students in their sophomore to senior year, who do not have physics or physics-related disciplines as the focus of their studies and careers. Furthermore, there is a much larger proportion of women in algebra-based courses.
Our study aimed to examine the evolution of gendered differences in algebra- and calculus-based introductory physics courses at TAMU by looking at both exams and final letter grades. We collected and analyzed data on students' test performance since 2007. While working with our data set, we also wanted to take a snapshot of current students' feelings to see how their perceptions aligned with historic performance. A short questionnaire was distributed to all students enrolled in calculus-based and algebra-based mechanics and electricity and magnetism in the fall of 2019.

II. METHODS
From here forward, "significant" will be used as shorthand for "statistically significant". Statistical significance was taken to be at p < 0.05. In addition, tables will use "Mech." and "E&M" for mechanics and electricity and magnetism, respectively; "Alg." or "Calc." stand for algebra-based or calculus-based.

A. Course Data
To examine gendered student performance within introductory courses, course-level data were requested from faculty who taught one or more of these courses since 2007. Participating instructors provided students' first names, numerical scores for all midterm and final exams, and the final letter grade for the course. After collecting data from faculty, a database of approximately 13,000 students was obtained. This database contains information for students enrolled in the algebra-based sequence between 2011-2019 and the calculus-based sequence between 2007-2017. This study was structured in such a way that the only data collected were the course-level information provided by faculty. For this reason, connecting outcomes to non-academic factors was not possible with this study.
Since course-level data only included student names, gender was identified using an online tool, GenderizeIO [45]. This application program interface returns a probability of gender based on the input of a first name. Gender probability was considered identifiable for this study if it was 90% or higher. This percentage was chosen as it allowed reasonable certainty of gender without drastically reducing the size of our data set. This cut eliminated about 17% of our raw sample. The number of students identified as male or female from each of the four introductory courses examined in this study is shown in Table I.
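The 90% cut described above can be sketched as a simple filter. This is only an illustration, not the authors' code: the function name `assign_gender`, the record layout, and the example records are hypothetical, and the field names `gender` and `probability` are assumed to mirror what the name-lookup service returns.

```python
from typing import Optional

THRESHOLD = 0.90  # probability cut stated in the text


def assign_gender(record: dict, threshold: float = THRESHOLD) -> Optional[str]:
    """Return 'male' or 'female' when the lookup probability meets the cut, else None."""
    gender = record.get("gender")
    probability = record.get("probability") or 0.0
    if gender in ("male", "female") and probability >= threshold:
        return gender
    return None


# Hypothetical lookup results, invented for illustration only.
lookups = [
    {"name": "alex", "gender": "male", "probability": 0.87},
    {"name": "maria", "gender": "female", "probability": 0.98},
    {"name": "james", "gender": "male", "probability": 0.99},
    {"name": "xq", "gender": None, "probability": 0.0},
]

labels = [assign_gender(r) for r in lookups]
kept = [g for g in labels if g is not None]
dropped_fraction = 1 - len(kept) / len(lookups)
```

Records that return `None` would be excluded from the analysis, which is the step that removed about 17% of the raw sample in this study.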

B. Comparing Grades
Differences based on gender were examined by looking at students' final course grades and their scores on the midterm and final semester exams. Differences in populations were examined using t-tests between transformed data, as well as analysis of variance (ANOVA) applied to raw scores. Comparisons were made based on student gender, instructor, and year in the course. Some instructors gave multiple years of data, so these criteria allowed for individual lecture section distributions to be examined against each other. Though the exam distributions can be skewed for both raw scores and z-scores (see Figure 1 for raw Exam 1 scores from calculus-based mechanics), t-tests were the most appropriate statistical analysis due to the large sample size from each course [46]. Effect sizes were calculated using Cohen's d with a Hedges correction [47]. We consider d < 0.2 to be weak, 0.2 < d < 0.5 to be small, 0.5 < d < 0.8 to be medium, and d > 0.8 to be large effect sizes.
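As a rough illustration of the effect-size calculation, the sketch below computes Cohen's d with a pooled standard deviation and applies the common approximate Hedges correction factor, 1 - 3/(4(n1 + n2) - 9). Whether the authors used this approximate form or the exact correction is not stated, so treat the formula choice as an assumption. The function names are hypothetical, and assigning the boundary values (exactly 0.2, 0.5, 0.8) to the larger category is also an assumption, since the text leaves them unspecified.

```python
import math
from statistics import mean, variance


def hedges_g(a, b):
    """Cohen's d with pooled SD, multiplied by the approximate Hedges correction."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    d = (mean(a) - mean(b)) / math.sqrt(pooled_var)
    correction = 1 - 3 / (4 * (n1 + n2) - 9)  # small-sample bias correction
    return d * correction


def classify(d):
    """Label an effect size using the thresholds given in the text."""
    size = abs(d)
    if size < 0.2:
        return "weak"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Invented scores, for illustration only.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
g = hedges_g(group_a, group_b)
label = classify(g)
```

For samples in the thousands, as in this study, the correction factor is very close to 1, so the corrected and uncorrected values are nearly identical.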
To look specifically at the relation between gender and exam performance, raw scores from individual lecture sections were mapped to new distributions using a z-score transformation. A z-score takes a raw numerical score ($x_i$), subtracts the average ($\bar{x}$), and scales by the standard deviation ($\sigma$), according to the relation:

$$z_i = \frac{x_i - \bar{x}}{\sigma}$$

A positive z-score indicates how much higher a raw score was compared to the average in units of standard deviation. A negative z-score indicates the same but for a raw score below the average [47]. This transformation of scores was performed for a more even comparison of exam distributions across multiple years and instructors. For instance, Professor A teaching a course in year X might have a higher average and smaller deviation than when the same instructor teaches the same course in year Y. As an illustration of the z-score transformation, we can look at raw exam scores from calculus-based mechanics. In 2007, the mean score was 58 points and the standard deviation was 21 points. Students scoring a 58 were mapped to a z-score of 0, while students scoring a 37 were mapped to a z-score of -1. This was done for all students, using the individual lecture section averages and standard deviations. Raw scores were used to examine individual lecture sections. When comparing across lecture sections, raw scores were transformed into z-scores so distributions may be more adequately and fairly compared to one another. Final course grades were treated on a 4 point scale (A-F) with no plus or minus grades per TAMU's grading policy.
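The per-section z-score transformation described above can be sketched as follows. The function names and the section keys are hypothetical, and the use of the population standard deviation (`pstdev`) rather than the sample standard deviation is an assumption, since the text does not say which was used.

```python
from statistics import mean, pstdev


def z_from_summary(x, avg, sd):
    """z-score of one raw score given a section's mean and standard deviation."""
    return (x - avg) / sd


def z_scores_by_section(sections):
    """Transform raw scores to z-scores separately within each lecture section."""
    transformed = {}
    for section, scores in sections.items():
        avg = mean(scores)
        sd = pstdev(scores)  # assumption: population SD
        transformed[section] = [z_from_summary(x, avg, sd) for x in scores]
    return transformed


# The worked example from the text: mean 58, standard deviation 21.
z_at_mean = z_from_summary(58, 58, 21)   # 0.0
z_below = z_from_summary(37, 58, 21)     # -1.0

# Invented raw scores for two hypothetical sections.
result = z_scores_by_section({"section_a": [40, 60], "section_b": [70, 90]})
```

Because each section is standardized against its own mean and spread, the two invented sections end up with identical z-score distributions even though their raw averages differ by 30 points.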

C. Student Perceptions
In fall 2019, a short anonymous questionnaire was administered to explore how students felt about their performance and inclusion. This questionnaire was given to compare how students' perceptions of their performance aligned with historic data. Students were asked to self-identify their race and gender. Response choices for students identifying as transgender or non-binary were available on the questionnaire. Only responses from students identifying as male or female (99%) were analyzed for this study. The questionnaire was composed of three questions where students responded using a 5-point Likert scale with responses that were negative (very and slightly), neutral, and positive (very and slightly):
1. "I felt included by my peers and instructors within this physics course."
2. "I believe that I performed in this course."
3. "I felt that my contributions to discussions over physics material were valued during this course. This includes discussions both in class and outside of class but relating to completing assignments or preparing for exams."
The questionnaire was administered in the 12th and 13th weeks of a 15-week semester, which occurred mid-November in fall 2019. We chose these weeks as they were far enough into the semester that students would have formed an opinion of their inclusion but early enough that the surveys would not take away from finals preparation. The questionnaire was given during recitation, where students were in smaller groups and did not have their instructor in the room. The brevity of the questionnaire was dictated by the recitation format and an attempt to maximize the response rate.

III. ANALYSIS AND RESULTS
Differential performance for male and female students was examined for all exams and final course grades for four introductory physics courses. Results are separated by courses in the algebra-based sequence and the calculus-based sequence. The calculus-based sequence consists of three midterm exams administered throughout the semester with a comprehensive final (identified as Exam 4). The algebra-based sequence consists of four midterm exams administered throughout the semester with a comprehensive final (identified as Exam 5).
Questionnaire responses were converted into an ordinal 5-point scale, with higher numbers equating to more positive responses: better feelings of inclusion, greater perceived performance, and stronger feelings of making valued contributions.
For calculus-based mechanics, data were provided by 14 instructors, for a total of 49 lecture sections. Eleven male instructors provided data from 34 lecture sections, and three female instructors provided data from 15 lecture sections. As seen in Table II, there was no significant difference observed when a t-test was applied to final letter grades based on student gender for the pooled data from all instructors and sections.
Gendered performance on course exams was compared using t-tests on transformed data from all instructors (Table III). As a combined sample, male students score at least slightly higher compared to female students on all exams. Significant differences were observed for the first, third, and final exams of the course. The effect size of these differences is small (0.2 < d < 0.4) for the first exam, and weak (d < 0.2) for the third and final exams.
For calculus-based E&M, data were provided by 10 instructors, for a total of 27 lecture sections. Six male instructors provided data from 10 lecture sections, and four female instructors provided data from 17 lecture sections. As seen in Table II, there was no significant difference observed when a t-test was applied to final letter grades based on student gender for the pooled data from all instructors and sections.
As with the calculus-based mechanics course, exams were compared using t-tests on transformed data from all instructors, Table IV. Similar to the first semester course, male students score at least slightly higher compared to female students on all exams. In this case, none of these differences were significant.
As a validation of results found using t-tests on transformed data, three-way ANOVA was applied to raw scores for all exams from both calculus-based courses. Results were in agreement with t-tests applied to transformed data. Where significant differences were observed using t-tests, ANOVA showed gender to be a significant factor on its own or in combination with one or both of the other factors of professor and year. When examining individual lecture sections, significant differences due to gender were observed for fewer than 20% of lecture sections. Combined with the results above, we note a persistent gender difference in calculus-based mechanics on exams only for pooled data, producing no significant difference in final course grades. No gendered differences were noted in calculus-based E&M for either exams or final course grades.
For algebra-based mechanics, data were provided by 4 instructors, covering a total of 13 lecture sections. Three male instructors provided data from 11 lecture sections, and one female instructor provided data from 2 lecture sections. Comparing the final letter grades based on student gender for pooled data from all instructors and sections showed a significant difference, Table II. This significant difference is not observed across individual instructors, nor is it consistently observed for lecture sections. Only one lecture section exhibited a significant difference for letter grades.
Gendered performance on course exams for transformed data from all instructors for algebra-based mechanics is shown in Table V. On the first exam of the semester, male students were observed to have a higher average score compared to female students, though the reverse is seen to be true for the other four exams of the course. Significant differences were observed for the third and fourth exams. The effect size for both of these exams is weak (d < 0.2).
For algebra-based E&M, data were provided by 2 instructors, covering a total of 8 lecture sections. Two male instructors provided all of the data for this course. No significant difference in final letter grades based on student gender for pooled data from all instructors and sections was observed for this course, Table II.
As with the algebra-based mechanics course, exams were compared using t-tests on transformed data from all instructors, Table VI. Male students were observed to have a higher average score on four out of five exams with female students outscoring male students on the third exam. A significant difference was observed only for the first exam. The effect size for this difference is weak (d < 0.2).
As a validation of results found using t-tests on transformed data, three-way ANOVA was applied to raw scores for all exams from both algebra-based courses. Results were in agreement with t-tests applied to transformed data. Where significant differences were observed using t-tests, ANOVA showed gender to be a significant factor on its own or in combination with one or both of the other factors of professor and year. When examining individual lecture section level data, significant differences due to gender were observed for fewer than 15% of lecture sections. Combined with the results above, we note a gender difference in algebra-based mechanics on the third and fourth exams only for pooled data, producing a small but significant difference in final course grades. A gender difference is observed in algebra-based E&M only for the first exam, producing no difference in final course grades.
The impact of instructor gender on differences in student performance by gender was also examined. This analysis could not be applied to the algebra-based E&M course as all data were from male instructors. Data were separated into two groups by instructor gender, and comparisons were made based on student gender. Differences in average z-scores for each course exam are shown in Table VII, while comparisons between letter grades are shown in Table VIII.
For algebra-based mechanics, data were obtained from three male instructors for 11 lecture sections (N=1075) and from one female instructor for 2 lecture sections (N=192). Significant differences in student performance based on gender were observed during the course only for the fourth exam with male instructors. This difference has a weak effect size (d < 0.2). A significant difference in final letter grades was also observed for male instructors, with a weak effect size (d < 0.2).
For calculus-based mechanics, data were obtained from eleven male instructors for 34 lecture sections (N=4,227) and from three female instructors for 15 lecture sections (N=1,222). Significant differences in student performance based on gender were observed for instructors of both genders on the first midterm exam, with a small effect size for each (0.2 < d < 0.4). Other significant differences were noted for the third and fourth exams for male instructors only. These latter two differences have weak effect sizes (d < 0.2). No significant difference was noted in final letter grades when grouping students by instructor gender.
For calculus-based E&M, data were obtained from six male instructors for 10 lecture sections (N=917) and from four female instructors for 17 lecture sections (N=1,876). A significant difference in student performance based on gender was observed for the fourth exam of the semester for male instructors. This difference has a small effect size (0.2 < d < 0.4). A significant difference was also observed in letter grades for female instructors, with a weak effect size (d < 0.2).
We also analyzed the raw score data using ANOVA. Agreement on significance using t-tests and ANOVA was found for comparisons based on instructor gender except for two instances. These instances were calculus-based mechanics on the first exam for female instructors and calculus-based E&M on the final exam for male instructors. Further examination of the impact of instructor gender was performed using Tukey HSD, which found significance in agreement with t-tests [47].

D. Student Perception Questionnaire
A short questionnaire consisting of three questions that were analyzed independently as well as demographic information was distributed to the students taking introductory physics classes in the fall of 2019, as described in Section II C. We received over 1,600 completed surveys with a response rate of 63%.
On the question about students' perception of their performance (2. "I believe that I performed in this course."), they were given five answer choices. We converted these answers into a 5-point scale with "Well Below Average" corresponding to a 1 and "Well Above Average" to a 5.
In the calculus-based mechanics course, male students rated their performance higher than their female classmates to a significant degree. Specifically, they rated themselves higher by a third of a point (See Table IX).
We found that female students rated their performance equally to male students only in algebra-based mechanics (See Table IX). When compared to historic data, this is the one course examined where we found a statistically significant difference in final letter grades with female students outperforming male students.
For the other three courses, male students rated their performance as significantly higher. For both E&M courses, however, male students performed the same as female students at this point in the course according to historic data.
Next, we looked at how students rated their feelings of inclusion. More specifically, we asked the following:

1. "I felt included by my peers and instructors within this physics course."
Finally, we asked students whether they felt that their contributions were valued.
3. "I felt that my contributions to discussions over physics material were valued during this course. This includes discussions both in class and outside of class but relating to completing assignments or preparing for exams."
For both of these questions, the students were given the following answer choices:
A. Strongly Disagree
B. Disagree
C. Neither Agree nor Disagree
D. Agree
E. Strongly Agree
Similar to the question on performance perception, we converted these answers into a 5-point scale with "Strongly Disagree" corresponding to a 1 and "Strongly Agree" to a 5.
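The conversion of agreement responses to a 5-point scale can be sketched as a lookup table. The mapping follows the answer choices listed above. The function name `code_responses` and the sample responses are hypothetical and invented for illustration.

```python
from statistics import mean

# Mapping taken from the answer choices in the text.
AGREEMENT_SCALE = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neither Agree nor Disagree": 3,
    "Agree": 4,
    "Strongly Agree": 5,
}


def code_responses(responses):
    """Convert a list of agreement labels to their ordinal 1-5 codes."""
    return [AGREEMENT_SCALE[r] for r in responses]


# Invented responses, not data from the study.
sample = ["Agree", "Strongly Agree", "Neither Agree nor Disagree"]
coded = code_responses(sample)
average = mean(coded)
```

Averages of these codes, grouped by gender, are the kind of quantity compared in the tables of questionnaire results.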
We found that despite a difference in performance perception, there was no statistically significant difference in feelings of inclusion for any of the courses (See Table X). That is, despite female students believing they were underperforming in three courses, they still believed they were being included equally.
When analyzing whether students felt that their contributions were valued, we found a statistically significant difference in algebra-based mechanics (See Table XI). In this course, female students rated valuations of their contributions about a fourth of a point higher on average. This corresponds to the only course where female students historically performed better on final letter grades. We found no significant differences in the other courses, indicating male and female students had similar feelings about how their contributions were valued.
While we analyzed each question separately, we also calculated the Spearman rank-order correlation coefficient between each pair of survey questions. We found that the correlations ranged from a weak effect to a moderate effect (0.0 < ρ < 0.6). This includes the cases where we sorted by gender, by course, and by gender and course simultaneously.
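For readers unfamiliar with the Spearman coefficient, the sketch below computes it as the Pearson correlation of rank-transformed data, assigning tied values their average rank (ties are common in 5-point Likert data). This is a generic textbook formulation written for illustration, not the authors' analysis code. The function names and the example responses are hypothetical.

```python
import math
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman rank-order correlation between two equally long sequences."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented coded responses to two questions, for illustration only.
inclusion = [1, 2, 3, 4, 5]
contribution = [2, 4, 6, 8, 10]
rho = spearman_rho(inclusion, contribution)
```

A rank-based coefficient suits ordinal Likert codes because it depends only on the ordering of responses, not on the spacing between scale points.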

IV. DISCUSSION
We compared student outcomes for final course grades and exams based on gender for over 10,000 TAMU students over more than a decade. We examined these data to determine whether such differences were persistent throughout each course. The data were collected from instructors who taught courses from algebra-based or calculus-based introductory physics sequences. Prior studies of gendered student performance based on exam grades show mixed results, with some authors reporting that male students outperform their female counterparts [12,33,34], whereas other authors did not find a statistically significant difference between the genders [25,26,29,31,32,35,36]. While this study has the advantage of a large data set of exam scores collected over a decade and provides new knowledge on gendered performance in introductory physics classes, it has limitations: only course-level data collected from faculty were used in this study. Therefore, we did not analyze the impact of non-academic factors that have been seen to potentially account for 20% to 70% of the gender difference [33,37].
To describe our results, we use "significant" as a shorthand for "statistically significant", defined as p < 0.05.
Of the four introductory courses comprising the algebra-based and calculus-based sequences, only the algebra-based mechanics course exhibited a significant difference in final letter grades. This difference is small, with female students outperforming male students by 0.15 GPA points and has a weak effect size. This is in agreement with a prior study by Tai and Sadler who also reported female students performing better than male students in algebra-based mechanics [39]. In algebra-based mechanics, significant performance differences based on gender were found on two out of five exams throughout the course. These were midterm exams in the second half of the course, and the differences had weak effect sizes.
In calculus-based mechanics, there was no significant difference in final letter grades. There was, however, a significant difference on three out of four course exams, with a small effect size for the first exam, and weak effect sizes for the third and fourth exams. Previous studies have found inconsistent results from calculus-based mechanics courses. Some studies conducted at other public, state universities have found no significant gap in final letter grades [25,29,36]. Kost et al. found a small but statistically significant difference in overall course grades [33]. Also, Tai and Sadler, cited above for the algebra-based course, reported males outperforming females in calculus-based mechanics [39].
In the second semester E&M course for both sequences, no significant difference in final letter grades was found. The only significant difference observed was for the first exam in the algebra-based course, which had a weak effect size. The results are similar to those of Kost-Smith et al. [32], who also found no statistically significant difference in course grades in a calculus-based electricity and magnetism course.
When examining the impact of instructor gender on the student gender performance gap, fewer significant differences were observed for female instructors in comparison with male instructors. Where significant differences were seen, the effect sizes were small or weak, similar to the effect sizes when data from all instructors was pooled. It should be noted that the overall number of instructors included in this study was small (∼20 unique instructors) and a larger sample of instructors would help to determine the consistency of this result.
In addition to analyzing historic exam scores, we took a snapshot of current students' perception of their performance as well as their feelings of inclusion and contribution in the introductory physics course sequences. We administered a questionnaire to all students who took introductory physics classes in the fall of 2019. We received over 1,600 completed questionnaires with a response rate of 63%.
The responses indicated that female students rated their performance lower than their male classmates. The one exception is algebra-based mechanics, where female students rated their performance equal to male students. Although our question about performance perception didn't directly measure students' physics self-efficacy, our results are consistent with the previous studies reporting that female students display lower self-efficacy than male students in introductory physics classes, even when controlled for their performance level [13,18-21,23]. We found no significant gendered difference in students' perceptions of inclusion across all courses. Also, there were no significant differences in students' perceptions about the value of their contributions, except for algebra-based mechanics, where female students rated the value of their contributions higher than male students did. The gender-neutral perceptions of inclusion and of the value of students' contributions are positive news, taking into account that the questionnaire was distributed towards the end of the semester when students had enough time to form an opinion. In algebra-based courses, enrollment tends to include more female students than male students, and students are typically upper-level undergraduates. As a result, female students in these classes may experience less stereotype threat [11], which could partially explain why they report equal or better perception of inclusion and valuation of their contribution as compared to their male counterparts. In algebra-based mechanics, the instruction during the recitations was provided by upper-level undergraduate students, about 70% of whom happened to be female in fall 2019 when the questionnaire was distributed. This could contribute to female students rating the value of their contributions higher than their male classmates [15].
It is also worth noting that female students rated their performance as equal to male students and the value of their contributions higher than male students in the algebra-based mechanics class, where female students historically outperformed male students based on our analyses of more than a decade of exam data.

V. CONCLUSION
This study explored the evolution of gendered differences in student performance throughout two-semester introductory physics courses for two sequences, calculus-based and algebra-based. The performance indicators for a large pool of over 10,000 students spanning a period from spring 2007 to spring 2019 have been analyzed. Data on midterm exams, final exams, and final course grades were collected from instructors teaching these courses during that period of time. Where differences in final letter grades were found, there were no persistent differences on exams. Where persistent differences were found on exams, there were no differences for final letter grades. In algebra-based mechanics, female students outperformed male students by a small but statistically significant margin. In all statistically significant cases, the effect size was small or weak, indicating that performance on exams and final letter grades was not strongly dependent on gender. One should keep in mind that the instructors who taught these classes used different instruction methods that were evolving during the period of data collection. The student demographics, background, and preparation levels were also evolving during the years of study [44]. Since this is a long-term historic study of course-level data, we were not able to control for factors such as student background in math and physics.
A questionnaire was distributed to students taking both calculus-based and algebra-based sequences during the fall 2019 semester. The goal was to take a snapshot of current students' feelings to see how their perceptions aligned with historic performance. We collected students' feedback on their perception of their performance, feelings of inclusion, and the value of their contributions. The analyses of student responses revealed no difference in the feeling of inclusion in any of these courses. For one course, algebra-based mechanics, female students rated their contributions as valued more compared to male students. For the same course, female students reported their performance perception to be as high as their male counterparts. For the other three courses, male students reported higher perceptions of performance than female students.
There are several future studies that could stem from this one. In the next iteration of this work, it would be beneficial to connect course-level data with university records of students' prior preparation and knowledge. Additionally, an enhanced survey on student perceptions could be linked with course performance to allow for regression analyses relating inclusion and contribution to success on exams. Since calculus-based mechanics is usually taken the earliest of these four courses, it would be useful to perform a study like this one on calculus 1 and introductory chemistry. This would help us better understand if gendered performance differences among physical science and engineering majors change over time.

ACKNOWLEDGMENTS
This study was supported by the Texas A&M University College of Science. We would like to thank the Texas A&M University Department of Physics and Astronomy faculty who provided us with data for this study.