Damage caused by women’s lower self-efficacy on physics learning

Z. Yasemin Kalender, Emily Marshman, Christian D. Schunn, Timothy J. Nokes-Malach, and Chandralekha Singh
Department of Physics and Astronomy, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, USA
Department of Physics, Community College of Allegheny County, Pittsburgh, Pennsylvania 15212, USA
Learning Research and Development Center, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, USA


I. INTRODUCTION
In the disciplines of science, technology, engineering, and mathematics (STEM), there has been some effort to enhance the participation and advancement of women, yet the historical pattern of overall unequal gender representation remains in many STEM disciplines. Over the past decades, some STEM fields, such as biology and chemistry, have shown great improvement in the number of degrees earned by women [1]. However, other STEM fields, like physics, have seen little progress in increasing representation of women and people of color in the discipline. For instance, the percentage of bachelor's and Ph.D. degrees in physics earned by women in the U.S. is approximately 20% [2]. Even more asymmetric participation occurs for postdoc and academic leadership positions in physics [3].
Education researchers have considered several reasons to explain the gender gap in physics participation [4][5][6][7][8][9][10][11][12][13]. These reasons include societal stereotypes and biases pertaining to physics being a discipline for brilliant men [14,15], and related issues such as biased learning tools [4,6], noninclusive teaching methods and physics department climate [5], and motivational factors [7]. Although there has been much interest and research in improving the pedagogy of physics teaching and reforming the content of the physics curriculum, there is relatively less focus on investigating whether these physics learning environments are equitable and how students' motivational factors in calculus-based introductory physics courses (which are foundational courses for many physical science and engineering majors) are related to male and female students' learning of physics. In particular, these motivational factors can lead to differences in performance of male and female students, and can at least partly explain why women do not pursue physics as often as men do.
One of the central motivational factors in educational studies is students' self-efficacy, which refers to individuals' own beliefs about how well they expect to do in a particular subject or task [16]. Prior work in many areas of education has found that self-efficacy predicts students' retention and academic performance even after controlling for knowledge [7,17-19]. Therefore, to understand gender disparity in physics and to create equitable physics learning environments, self-efficacy is an important factor to examine.
This study examines the role of self-efficacy in explaining the gender gap in college level calculus-based introductory physics courses. At the college level, there are typically fewer than 30% women in calculus-based introductory physics classrooms, compared to algebra-based physics courses in which women are often in the numerical majority [10]. Therefore, it is especially important to understand the role of self-efficacy in explaining gender-based performance gaps in calculus-based physics courses where women are underrepresented. In the following sections, we present an overview of the literature on self-efficacy research in physics learning and its relation to prior knowledge, academic preparation, and performance differences by gender.

A. Self-efficacy and academic performance
In learning science and educational research, self-efficacy is a commonly used construct that was first proposed by Bandura [20]. It is one of the central factors pertaining to students' beliefs about their capability to perform well in a particular domain [20]. While the impact of societal stereotypes and biases in a discipline on students' self-efficacy can be profound, this motivational factor has been found to shape and be shaped by students' interests, as well as their effort and engagement in class [16]. In particular, self-efficacy can influence students' self-regulation processes, such as goal setting, time management skills, and self-judgement [21]. Students with high self-efficacy become more task centered [22], and they are more likely to exhibit advanced level learning strategies, such as self-monitoring and self-regulation [22]. Likewise, the higher the students' self-efficacy in a particular learning activity, the more perseverance and resilience they are likely to show when faced with adversity [21].
The role of self-efficacy becomes particularly salient when students tackle difficult problems. During problem solving, students with high self-efficacy interpret the struggle as an opportunity for developing their skills while those with low self-efficacy may view the challenge as a large hurdle and further evidence of their lack of competence in the subject [20]. When encountering challenging activities, students with low self-efficacy become less interested, spend less effort and time, and eventually disengage from the class [23]. These behaviors act as a barrier to learning and development.
There is also a strong link between students' self-efficacy and academic performance where low self-efficacy can put students in a negative feedback loop with regard to its impact on performance (which can further lower self-efficacy and negatively impact performance, etc.). In particular, studies in middle and high school have shown that self-efficacy can predict student performance in science courses when controlling for prior knowledge and academic skill differences [24][25][26]. Relatedly, at the college level, nonphysical science majors' self-efficacy belief was also shown to be a predictor of conceptual understanding and course achievement in physics [27]. In this study, we examine the relationship between self-efficacy and conceptual understanding and course achievement for physical science and engineering majors.
According to Bandura's social cognitive theory, there are several factors that contribute to the development of self-efficacy: mastery experiences (achievement or failure on a previous task), vicarious learning experiences (e.g., observations of how others perform on similar tasks), social persuasion experiences (e.g., cultural norms or biased social messages about who can succeed in a particular domain), and physiological states (e.g., anxiety) [20,28-31]. For instance, having support and encouragement from instructors can positively influence students' self-efficacy and motivate them to engage with difficult learning activities, whereas experiencing stress and doubt due to classroom norms and societal stereotypes might increase disadvantaged students' anxiety and negatively affect their self-efficacy and performance. In this study, we investigate the extent to which students' prior experiences and achievements in math and physics can predict students' self-efficacy and their future physics performance.

B. Self-efficacy, gender, and performance in physics
Many prior studies have shown a prevailing gender gap in students' self-efficacy levels in science and math courses, and in their overall academic achievements. In particular, female students have consistently reported lower self-efficacy than male students in many STEM courses [7,27,32-40]. Cheryan et al. investigated causes of gender imbalance in some STEM fields and found that self-efficacy can be a strong predictor of unequal gender participation during class activities and enrollment in STEM fields such as physics [40]. In another study, female students were found to feel less efficacious in physics learning than male students regardless of the type of instruction (i.e., evidence-based active-engagement vs traditional) [34].
Similarly, previous research has identified a large self-efficacy gender gap for equally performing female and male students for all achievement groups (low, medium, high) [12,13]: women who obtained A's in physics on average had self-efficacy levels similar to men who obtained C's.
One of the most well-researched consequences of gender-based beliefs about ability is stereotype threat. In this phenomenon, stigmatized groups such as women in physics have a fear of confirming stereotypical expectations about their gender and they end up performing poorly in physics. In particular, this fear can create further anxiety and can impact the marginalized group's performance (e.g., anxiety can rob students of their cognitive resources while solving problems), which becomes a self-fulfilling prophecy [41]. Although not tested directly in the current study, gender stereotypes provide a well-studied explanation for why physics self-efficacy concerns could lead to differences in learning outcomes.
Previous studies have documented large gender differences in physics performance across various institutions [7][8][9][10][11][12][13][42][43][44][45][46][47]. In college level calculus-based physics courses, women often score lower than men on exams [43] and on standardized conceptual physics tests [44]. Interactive engagement teaching methods have been proposed to address the gender gaps [48]. While some prior work found a reduced gender gap in active-engagement courses [47], other studies reported that the performance gap remained [27] and even became larger in calculus-based introductory physics courses despite the use of interactive teaching methods [46].
Interestingly, the performance gap on standardized conceptual physics assessments has been found to exist on pretests (at the beginning of the course before instruction), which could explain part of the differences in post-test after instruction [44,45]. Therefore, some researchers have suggested that the gender differences in college level physics performance stem from societal stereotypes and biases accumulating over a student's lifetime and the differences between female and male students' high school experiences and preparations [8,49].
Developing robust mathematical skills can help students in college-level physics courses [49]. For example, the number of mathematics courses taken in high school is a strong independent predictor of students' college achievements in introductory science courses [49]. Likewise, research suggests that high school math grades and SAT math scores can predict college physics course success [50][51][52]. In one study, high school preparation in math was found to be the strongest predictor of students' physics grades in college [8]. Mathematics as a foundation to physics is particularly relevant because there have also been gender differences in math performance [53,54]. Despite female students' high math performance during elementary and middle school, male students score higher on high school math assessments [53,54]. This shift in math achievement during high school might be due to environmental factors such as lack of encouragement for girls in taking more advanced math classes or a belief that math is only for boys due to societal stereotypes and biases [55]. More importantly, gender performance gap in precollege math can further impact women's performance and self-efficacy beliefs in college science courses [56,57].

C. Theoretical framework and research questions
In this study, our primary goal is to explore the mediational mechanism of self-efficacy in explaining gender differences in Physics 2 learning outcomes, while also integrating academic performance (SAT math and Physics 1 course grade) and initial Physics 2 knowledge (standardized conceptual test scores as pretest scores) into the path analysis. We use structural equation modeling (SEM) as an analysis method to unpack the mediational relation between gender and learning outcomes through motivational constructs. SEM is an extension of multiple regression that allows several linear regression models to be tested simultaneously as a single model within the path analysis; SEM has a number of benefits that are discussed in the methods section. We hypothesize that gender differences in learning outcomes (post-standardized conceptual test scores or course grades) will be mediated by prior knowledge and self-efficacy. Moreover, we also explore the contributions of SAT math and prior knowledge in a standardized conceptual test as additional possible mediators of gender differences in learning outcomes (see Fig. 1). Therefore, our first research question is: To what extent can gender differences in students' physics learning outcomes be explained by differences in physics self-efficacy at the beginning of the course? Here we contrast the relative roles of self-efficacy, prior knowledge, and SAT math in explaining gender gaps in learning outcomes.
Another important related issue involves the sources of gender differences in physics self-efficacy. Previous work has documented the various ways in which men and women have different exposure to physics within both in-school and out-of-school learning experiences [58], as well as differential preparation in mathematics. These precollege differences sometimes result in an initial physics knowledge gap and overall mathematics performance gap between men and women when they enter college. These experience differences could also underlie the self-efficacy differences. Therefore, we also posit a second research question: To what extent are gender differences in physics self-efficacy based on prior physics knowledge differences and measures of pre-college academic performance? Here we consider Physics 1 grade and SAT math (a common academic skill measure that strongly influences acceptance in selective STEM programs) as measures of prior knowledge. While those factors are plausible drivers of self-efficacy beliefs, the connections to gender in this population are unclear. In particular, given selective participation of female and male students in physical science and engineering majors, it is not clear in advance whether there are gender differences in Physics 1 grade or SAT math among this population.
Considering all these factors and building on the success of self-efficacy studies in predicting students' achievement and retention, we focus on the impact of self-efficacy across gender on students' college level calculus-based introductory physics performance. Physics is one of the pillar courses taken during the first year of college and it is fundamental to almost all STEM degrees. Positive experiences in first-year physics courses are especially important since students typically decide to stay or exit the major at the end of the first year. Therefore, affirmative first-year experiences in physics courses can play an important role in sustaining female students' interest and self-efficacy in STEM majors [57].

II. METHODOLOGY
Data were collected from introductory calculus-based physics courses over the course of two consecutive years. Our focus is on introductory level Physics 2 courses that encompass topics in introductory electricity and magnetism, which are very challenging topics even for physical science and engineering students partly because they have had relatively little exposure to this specific content in high school. Nationally, many more high schools offer mechanics than electricity and magnetism Advanced Placement (AP) courses, with corresponding large differences in student enrollments (e.g., 2:1 in calculus-based AP courses and 3:1 in algebra-based courses in 2019) [59]. Within our sample, the ratio of students who took only mechanics in high school to those who took electricity and magnetism was 3:1. We examine two different measures of learning outcomes: performance on a research-based standardized conceptual test and course grades.

A. Participants' demographic information and class context
Participants were 1467 students enrolled in calculus-based introductory physics courses, which primarily enroll students who intend to major in engineering or physical sciences. The demographic data (i.e., gender, ethnicity, age) were obtained from the university data warehouse that also kept extensive records about students' pre-college test scores (e.g., SAT) and university grades (e.g., Physics 1 and 2). When motivational and conceptual survey responses were collected, they were sent to an honest broker to be linked with students' demographic information from the university records. Completion of this process gave researchers access to students' survey results merged with their gender and ethnic-racial information as a deidentified dataset.
In terms of demographics, 32% of the students were reported by the university as female; less than 1% of the students had not given gender information and were therefore excluded from this analysis. Although we recognize that gender is a complex sociocultural and multidimensional construct, unfortunately, the data obtained by university records only included binary response options. In future studies, we hope to incorporate gender measurement with multiple options, which can allow us to measure masculinity and femininity in more nuanced ways. Students were predominantly White (78%), with the remaining students coming from a number of other ethnic or racial backgrounds: Asian (12%), African American (4%), Multiracial (3%), Hispanic (2%) and Others (1%). Also, 90% of the students in this course were first-year students with a mean age of 19. Students in the sample were enrolled in nine sections of Physics 2 courses that were taught by five male White or Asian instructors, having varying levels of teaching experience. To improve generalizability of findings, five of the included sections were taught using a traditional lecture-style format and the other four were taught using a flipped class format (i.e., video lectures watched before class, followed by in-class problem solving work).
FIG. 1. Conceptual framework connecting gender to learning outcomes (conceptual post-test and Physics 2 course grade) via key college academic experience (Physics 1 grade), attitude (self-efficacy in Physics 1 and 2 at the beginning of the semester in each case), and prior knowledge or skill variables (SAT Math and conceptual pretest CSEM scores). Each arrow corresponds to a linear regression relation between two variables within the path analysis using SEM.
The course topics included electrostatics, magnetostatics, resistance, capacitance, inductance and simple electric circuits, Faraday's law of electromagnetic induction, Ampere-Maxwell's law, Maxwell's equations, electromagnetic waves, and wave optics. There were 24 sections of weekly recitations attached to these lecture sections and they were led by graduate teaching assistants (TAs), with women students being a minority in most sections and never a strong majority. All of the TAs were male and slightly less than half were international students. Attendance in recitations was mandatory in that students were given quizzes each week contributing to their final grade.

Physics self-efficacy
We previously developed and validated a self-efficacy survey that was built from prior survey instruments [60][61][62][63]. Our instrument was iteratively refined and validated with exploratory factor analysis (EFA), and individual student interviews [10][11][12][13]. The individual student interviews used a think-aloud protocol to make sure that students interpreted the questions as intended. Conducting EFAs ensured that items measured self-efficacy coherently and separately from other motivational constructs. Furthermore, we also checked the inter-reliability between the self-efficacy items. In particular, the self-efficacy survey included 6 items and inter-reliability was measured by Cronbach's alpha, where alpha > 0.7 is considered good [64]. Self-efficacy questions assessed students' belief in their ability to understand concepts in physics and their self-perceptions of how they perform certain physics-related activities in and out of the classroom. Table I presents the self-efficacy items and response options for various questions. The main reason for varying response options is to anchor responses in more objective aspect-specific ways and encourage respondents to slow down while responding to read each item (see Table I). Students responded on a scale from 1 (low) to 4 (high), with higher scores indicating higher levels of self-efficacy. Self-efficacy scores were calculated by taking the average responses across the items. For example, a student who answered "all of the time" to the first question, "all areas" to the second question, and "no" to the other four self-efficacy questions would have an average self-efficacy score of (4 + 4 + 2 + 3 + 2 + 2)/(6 total questions) = 2.83, where the fourth term is 3 because that item is reverse coded. This score is between positive and neutral (2.5 score) self-efficacy.
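The scoring rule in the worked example can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' scoring script: the function name is hypothetical, and the reverse-coding rule (5 minus the raw response on the 1 to 4 scale) is an assumption inferred from the example, in which a raw "no" response of 2 becomes 3 on the reverse-coded item.

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 4  # response scale described in the text


def self_efficacy_score(responses, reverse_coded=()):
    """Average the item responses after flipping any reverse-coded items.

    Assumption: a reverse-coded response x is rescored as
    SCALE_MIN + SCALE_MAX - x (that is, 5 - x on a 1-4 scale).
    """
    adjusted = []
    for index, value in enumerate(responses):
        if not SCALE_MIN <= value <= SCALE_MAX:
            raise ValueError(f"response {value} is outside the 1-4 scale")
        if index in reverse_coded:
            value = SCALE_MIN + SCALE_MAX - value
        adjusted.append(value)
    return mean(adjusted)


# Raw responses from the worked example; the fourth item (index 3)
# is treated as the reverse-coded one.
raw_responses = [4, 4, 2, 2, 2, 2]
score = self_efficacy_score(raw_responses, reverse_coded={3})
print(round(score, 2))  # 2.83
```

Passing the reverse-coded positions as an argument keeps the rescoring step explicit, so the same function could be reused for a survey with a different set of reversed items.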
We also performed item response theory (IRT) analyses to check the response option distances for survey constructs [65]. These analyses revealed roughly equivalent distance between response items. In particular, the parametric graded response model (GRM) with software STATA was used to test the measurement precision of our response scale [66]. GRM calculates the location parameter for each response and calculates the difference between the locations. The numerical values for the location differences for item responses should be roughly similar in order to support the use of means across ratings [65][66][67]. The distances for the response options are 2.00 and 2.31, which indicate that it is appropriate to use the averaging of the survey items the way we have done [65][66][67]. In addition, simple means were highly correlated with IRT factor scores, further justifying the use of means.

Conceptual test
The Conceptual Survey of Electricity and Magnetism (CSEM) [68] was administered to measure students' conceptual understanding of introductory electricity and magnetism, in contrast to their ability to solve quantitative problems that are typically used in regular course exams (and which can sometimes be solved algorithmically without understanding the underlying concepts). The CSEM has been extensively validated as a measure of conceptual understanding of core physics concepts and principles within the course topic areas of electricity and magnetism [68], and it has also been successfully used for comparing different teaching methods on a standardized basis [12,44]. The CSEM test consists of 32 multiple-choice questions. The test was administered at the beginning (pre) and end (post) of the course. We calculated the proportion correct in pre- and post-tests. Typical mean scores on the CSEM for calculus-based physics students (out of 1) were approximately 0.32 on pretest and 0.47 on post-test [12,68]; in other words, the test was very difficult for these students. As is appropriate for scales based upon dichotomous items (e.g., correct or incorrect), we use Armor's θ values to report reliability [69]. The θ values were 0.76 for pretest and 0.84 for post-test, indicating good reliability [69].
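For readers unfamiliar with Armor's θ, the sketch below shows how it is commonly computed: from the number of items and the largest eigenvalue of the inter-item correlation matrix (the first principal component). The function names and the eigenvalue used in the example are hypothetical and are not taken from the study's data; the text does not describe how the eigenvalue was obtained.

```python
def proportion_correct(n_correct, n_items=32):
    """Score a test as the fraction of items answered correctly (CSEM has 32)."""
    if not 0 <= n_correct <= n_items:
        raise ValueError("n_correct must be between 0 and n_items")
    return n_correct / n_items


def armor_theta(n_items, largest_eigenvalue):
    """Armor's theta: (p / (p - 1)) * (1 - 1 / lambda_1).

    p is the number of items and lambda_1 is the largest eigenvalue of the
    inter-item correlation matrix.
    """
    if n_items < 2:
        raise ValueError("theta needs at least two items")
    if largest_eigenvalue <= 0:
        raise ValueError("the largest eigenvalue must be positive")
    return (n_items / (n_items - 1)) * (1 - 1 / largest_eigenvalue)


# Hypothetical eigenvalue, for illustration only (not a study result).
example_theta = armor_theta(n_items=32, largest_eigenvalue=4.0)
print(round(example_theta, 3))
```

With 32 items the leading factor p/(p - 1) is close to 1, so θ is driven almost entirely by how much variance the first principal component captures.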

Course grades
Students' course grades were also used as a measure of their learning outcome. The final course grade was largely determined by students' midterm and final exam scores. Weekly homework, students' class participation, concept quizzes, attendance, and recitation quizzes also contributed to the course grade. The final course grades (both Physics 1 and 2) were obtained from the university records. The conversion between the letter grade and corresponding grade point is given in Table II. While a student's course grade is a measure influenced by attendance, TA and peer support of homework completion, and uneven test quality, this measure is better aligned to the full content covered in each course (compared to a standardized test such as CSEM) and also represents an important learning outcome for students (including whether they must repeat the course).

Pre-college test scores
The university provided a wide range of scores that are used to determine admission to the university, including high school GPA, standardized assessment scores for mathematical and verbal ability (SAT), and standardized assessment scores for advanced coursework. In our model, we use Scholastic Assessment Test (SAT) math scores, which ranged from 400 to 800, as a predictor variable; the test is designed to predict first-year university performance. Prior research suggests that students may overgeneralize the implications of their performance on the SAT, believing that lower math SAT scores imply lower ability for physical sciences [52].

C. Procedures
Motivational and conceptual tests were administered during recitation. Both surveys were administered by the responsible recitation TAs at the beginning of physics courses. The motivational survey was given before students took the conceptual test. The self-efficacy survey was completed by most students in a couple of minutes (embedded in a larger motivational survey taking between 10 and 15 min), and the students worked through the conceptual physics assessments in the remaining class time (approximately 35-40 min).
Instructors were encouraged to give a small amount of course credit to students for completing the surveys. The instructor or teaching assistant responsible for giving the motivational and conceptual physics surveys was given the following script to announce before administering the surveys to the students to encourage students to take the assessments seriously: "We are surveying you on your understanding and beliefs about physics in order to improve the class. Your responses will not be evaluated for grades except to make sure the responses were done seriously, rather than randomly."

D. Analysis
An initial examination compared female and male students' scores in predictors and outcomes for statistical significance using t tests and for effect sizes using Cohen's d [70]. Further, we calculated the correlations between the key constructs for two reasons: highly correlated constructs (>0.90) would signal that they measure nondistinguishable dimensions, whereas low correlations (<0.20) would indicate that the interrelation between the constructs was so low as to not require a direct link in the model (so could be excluded as a variable if not connected to any other variable).
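As an illustration of this initial screening step, the sketch below computes Cohen's d with a pooled standard deviation and applies the two correlation thresholds quoted above. The function names and sample scores are hypothetical; the pooled-variance form of d is one common convention and is assumed here, since the text does not state which variant was used.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)


def screen_correlation(r, redundant_above=0.90, negligible_below=0.20):
    """Classify a correlation with the two thresholds given in the text.

    Assumption: the thresholds are applied to the magnitude of r.
    """
    magnitude = abs(r)
    if magnitude > redundant_above:
        return "nondistinguishable constructs"
    if magnitude < negligible_below:
        return "no direct link required"
    return "distinct but related constructs"


# Hypothetical self-efficacy scores, for illustration only (not study data).
group_one = [3.0, 3.2, 2.8, 3.4, 2.6]
group_two = [2.6, 2.8, 2.4, 3.0, 2.2]
effect_size = cohens_d(group_one, group_two)
print(round(effect_size, 2))
print(screen_correlation(0.46))
```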
To test the hypothesized path between the variables, we used structural equation modeling (SEM) as a statistical tool by using R (lavaan package) with a maximum likelihood estimation method [71]. SEM is an extension of multiple regression and has multiple advantages compared to other methods. First, by conducting several multiple regressions simultaneously between variables in one estimation model instead of running them in sequential steps separately, we can calculate the overall goodness of fit and contrast different structural accounts. Second, SEM enables calculation of interrelated dependence between variables within a single analysis, which has greater statistical power and better controls for indirect correlations through third variables compared to multiple regression models. Third, variables in the model can be independent variables (input) and dependent variables (output) at the same time, allowing calculation of both direct and indirect effects through multiple pathways. Finally, SEM has an option to handle missing data by using a full information maximum likelihood "ML" estimation feature, which usually improves both power and generalizability because students missing only some data are not dropped. SEM involves several commonly used fit parameters to test the goodness of the fit: comparative fit index (CFI), which compares the fit of the proposed model to the null model; Tucker-Lewis index (TLI), which is similar to CFI but takes model complexity into account (TLI is stricter than CFI); root mean square error of approximation (RMSEA), which refers to residuals and measures how closely the model fits the data; and standardized root mean square residual (SRMR), which is the standardized difference between the observed correlation and the predicted correlation. There are commonly used thresholds for deciding whether the fit is acceptable or not: CFI and TLI > 0.90; SRMR and RMSEA < 0.08.
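The acceptance thresholds listed above can be expressed as a small check. The sketch below is only an illustration of the decision rule; the function name and the fit values are hypothetical and are not results from this study, and in practice the indices would be read from the fitted lavaan model.

```python
# Threshold rules as stated in the text: CFI and TLI > 0.90; RMSEA and SRMR < 0.08.
FIT_RULES = {
    "CFI": lambda value: value > 0.90,
    "TLI": lambda value: value > 0.90,
    "RMSEA": lambda value: value < 0.08,
    "SRMR": lambda value: value < 0.08,
}


def check_fit(indices):
    """Return a pass/fail flag for each fit index in the threshold table."""
    return {name: rule(indices[name]) for name, rule in FIT_RULES.items()}


# Hypothetical fit indices, for illustration only (not study results).
hypothetical_fit = {"CFI": 0.97, "TLI": 0.93, "RMSEA": 0.05, "SRMR": 0.03}
report = check_fit(hypothetical_fit)
acceptable = all(report.values())
print(report, acceptable)
```

Reporting each index separately, rather than a single pass or fail, makes it easier to see which aspect of the model fit is responsible when a candidate model is rejected.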
Before using mediation as a statistical method, we did moderation analysis to check whether any of the relations between variables show differences across gender or course type (flipped vs traditional). We used the R software package "lavaan" to conduct multigroup SEM. We initially tested for measurement invariance. In other words, we looked at whether the intercepts or residual variances of the observed variables (e.g., self-efficacy, SAT math, etc.) are equal by gender. The analysis involves introducing certain constraints in steps and testing the model differences from the previous step. In each step, we compare the model to both the previous step and the freely estimated model, that is, the model where all parameters are freely estimated for each gender or course type group.
Since we did not find significant moderation by gender or course type (see Appendix), we tested the proposed theoretical model as a mediation analysis, examining the resulting structural paths between constructs. In creating a final acceptable model, we began with the saturated model as shown in Figure 1 (i.e., included all possible regression pathways), and then dropped the connections of variables that were non-significant predictors to obtain a model that produced an acceptable fit to the data and contained only statistically significant regression paths.
Finally, within the path models, the indirect effects of gender to the outcome variables were found by multiplying the coefficients of the particular predictor that connected gender and learning outcome. If the predictor had more than one path between gender and learning outcome, we summed each path's contribution.
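The product-and-sum rule for indirect effects can be sketched as follows. Each path is written as the list of standardized coefficients along it, the coefficients on a path are multiplied, and the products are summed over all paths connecting gender to the outcome. The function name and coefficient values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from math import prod


def indirect_effect(paths):
    """Sum, over all mediating paths, the product of the coefficients on each path."""
    return sum(prod(path) for path in paths)


# Hypothetical standardized coefficients, for illustration only:
# path 1: gender -> self-efficacy -> outcome
# path 2: gender -> self-efficacy -> intermediate variable -> outcome
paths_through_self_efficacy = [
    [-0.30, 0.20],
    [-0.30, 0.50, 0.20],
]
total_indirect = indirect_effect(paths_through_self_efficacy)
print(round(total_indirect, 3))  # -0.09
```

Here the first path contributes -0.30 × 0.20 = -0.06 and the second contributes -0.30 × 0.50 × 0.20 = -0.03, giving a total indirect effect of -0.09.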

A. Correlations
Zero-order pairwise Pearson correlations are given in Table III. Pearson's r values signify the strength of relationship between the variables, uncontrolled for other correlated variables. Investigating the correlations among the predictors (self-efficacy 1 and 2, SAT math, Physics 1 grade, the CSEM pre), we find that there were medium-level correlations around 0.40, showing that the predictors are not so correlated as to be impossible to separate in the regression analyses, but also sufficiently intercorrelated that simple Pearson correlations with outcomes can be artificially higher than the true direct relationships. The strongest correlation was between the students' self-efficacy in Physics 1 and self-efficacy in Physics 2 with r = 0.59 (see Table III). The next highest correlation was between students' CSEM pre and Physics 1 Grade (r = 0.48) followed by the correlation between Physics 1 grade and Physics 2 self-efficacy (r = 0.46). But the r = 0.46 correlation represents roughly 20% shared variance, so self-efficacy is not identical to performance measured by this test or necessarily free from biases based upon stereotypes and social interactions. Furthermore, Physics 1 grade was moderately correlated with SAT math test scores, suggesting that prior experience with math is quite important for college level introductory physics courses [43].
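The "roughly 20% shared variance" figure follows from squaring the correlation coefficient. The sketch below computes a Pearson r from paired data and converts the three correlations quoted in this paragraph to shared variance; the function names are hypothetical, and the small paired dataset used to exercise `pearson_r` is made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Zero-order Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariation / sqrt(ss_x * ss_y)


def shared_variance(r):
    """Proportion of variance two variables share, r squared."""
    return r ** 2


# Correlations quoted in the text.
for label, r in [
    ("self-efficacy 1 vs self-efficacy 2", 0.59),
    ("CSEM pre vs Physics 1 grade", 0.48),
    ("Physics 1 grade vs Physics 2 self-efficacy", 0.46),
]:
    print(label, round(shared_variance(r), 2))  # 0.35, 0.23, 0.21

# Made-up paired data, only to show pearson_r in use.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```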
The last two rows of Table III present the correlation values between the learning outcomes (the CSEM post and Physics 2 course grade) and the predictors discussed above. The CSEM post-test was most closely correlated with students' Physics 1 grade and CSEM pre-test results. Self-efficacy in Physics 1 and in Physics 2 had the next highest correlations with the CSEM post. For the Physics 2 course grades, students' Physics 1 grade has the highest correlation value (r = 0.65) followed by CSEM post (r = 0.38), CSEM pre (r = 0.34) and self-efficacy in Physics 2 (r = 0.30).
The correlation between the two outcome variables (CSEM post and Physics 2 grades) was sizeable but far from identical, supporting the need to separately analyze the relationship of the predictors to the two outcomes. Further, the pattern of simple correlations of Physics 2 course grades and CSEM post with the predictor variables was also different, further suggesting that separate analyses are warranted.

B. Gender differences in predictors and outcomes
Statistically significant gender differences in favor of male students were found on most of the variables (see Table IV). A large gender gap occurred in students' initial self-efficacy reports for both Physics 1 and Physics 2 [70]. While men reported a mean score of approximately 3 in self-efficacy beliefs in both physics courses, which corresponded to a positive self-efficacy, women more typically reported a neutral level of self-efficacy (approximately 2.6) in physics at the beginning of the Physics 1 and 2 courses, despite all students being physical science or engineering majors. Further, the gender differences in the standardized performance measures were smaller, with medium differences in CSEM (pre and post), and small differences in SAT math. Thus, while there are preexisting differences based on high school experiences, the largest gender difference appeared to be one of perceived, rather than actual, physics skills or knowledge. The gender gap was smaller in students' course grades than in CSEM performance and it was not statistically significant (see Table IV). Table V shows that female and male students were otherwise very similar in terms of percent underrepresented minorities, intended major, age, and overall GPA at the time of the course, ruling out other demographic differences which might explain performance differences.

C. SEM path model
We used mediation analysis to understand the extent to which gender effects in students' learning outcomes in physics were mediated by differences in students' initial self-efficacy, prior knowledge in physics (as measured by CSEM pre-test and Physics 1 grade), and the pre-college academic measure in math.

Using CSEM as a learning outcome
After iterations to remove nonsignificant links, the final mediation model produced good fit parameters: CFI = 0.99 (>0.95), TLI = 0.99 (>0.95), RMSEA = 0.02 (<0.08), and SRMR = 0.013 (<0.08). In this model, self-efficacy in Physics 1 and 2, CSEM pre-test, and Physics 1 grade had direct effects on students' CSEM post scores (see Fig. 2), and there were no direct connections with gender or SAT math. Students' initial Physics 1 grade had the strongest effect (β = 0.40***) on the conceptual test results. CSEM pre was the second strongest variable that had a direct effect on the CSEM post-test (β = 0.25***). Finally, self-efficacy scores both in Physics 1 and Physics 2 were the last variables that predicted CSEM post scores. In particular, self-efficacy 1 and 2 remained significant predictors of the learning outcome even after controlling for precollege academic skills and prior knowledge differences in Physics 1 courses. More interestingly, we found that students' initial gender differences in Physics 1 courses impact their Physics 1 grade, which later impacts students' Physics 2 self-efficacy with a much larger regression coefficient (β = 0.43***). The only direct connections to gender involved a relationship with self-efficacy in Physics 1 (β = 0.24***), self-efficacy 2 (β = 0.12***), and a much smaller relationship with SAT math (β = 0.06*). This finding suggests a substantial and powerful impact of students' self-efficacy on learning outcomes even after adjusting for prior knowledge differences: the gender gap in self-efficacy mediated the gendered differences in pre- and post-physics test performance.
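The fit criteria quoted in parentheses above can be checked mechanically. The following is a minimal sketch that compares the reported indices against those cutoffs; the helper name `check_fit` and the dictionary layout are assumptions made for illustration and are not taken from the analysis software used in the study.

```python
import operator

# Fit indices reported for the CSEM post model.
reported_fit = {"CFI": 0.99, "TLI": 0.99, "RMSEA": 0.02, "SRMR": 0.013}

# Cutoffs quoted in the text: CFI and TLI above 0.95, RMSEA and SRMR below 0.08.
cutoffs = {
    "CFI": (operator.gt, 0.95),
    "TLI": (operator.gt, 0.95),
    "RMSEA": (operator.lt, 0.08),
    "SRMR": (operator.lt, 0.08),
}

def check_fit(fit, cutoffs):
    """Return, for each index, whether the value satisfies its cutoff."""
    return {name: compare(fit[name], bound) for name, (compare, bound) in cutoffs.items()}

fit_ok = check_fit(reported_fit, cutoffs)
print(fit_ok)
```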

Using Physics 2 course grade as a learning outcome
For the course grade, a similar model proved to fit the data well, and in fact provided an even stronger fit: CFI = 0.99, TLI = 0.99, RMSEA = 0.01, and SRMR = 0.01. However, there were some structural differences (see Fig. 3). Unlike what we observed in the first model with CSEM post, Physics 1 grade was the only direct predictor of Physics 2 course grade, with a strong connection (β = 0.75***). The initial gender differences in self-efficacy and SAT math predicted students' Physics 1 grade, which in turn had the strongest direct effect on how students perform in Physics 2 (as measured by grades). The conceptual test (CSEM) mean score in the post-test was 47%. Therefore, Physics 1 grade suppresses the relation between CSEM pre and Physics 2 grade even though there was initially a correlation between the two with r = 0.34.

D. Total indirect effect of SAT math and Physics 1 self-efficacy
Since gender is not directly connected to either CSEM post or Physics 2 grade in the final path models, it is possible to examine the relative contribution of gender to outcomes via SAT math, pre Physics 1 SE, and pre Physics 2 SE, since these were measured before students started to interact with college level Physics 2 topics. Therefore, we calculated the total mediated effects between gender and learning outcome (Physics 2 grade and CSEM post) via these three variables. The indirect effect of gender on a learning outcome through a given predictor is the sum, over all paths that flow through that predictor, of the products of the β values along each path. We have only counted the paths that had an indirect effect larger than 0.01, although the pattern is identical when all paths are included. For instance, one of the pre Physics 1 SE mediation paths was gender → pre Physics 1 SE → CSEM post. Therefore, we multiplied all the coefficients along this path (0.24 × 0.08 = 0.019). Another path between gender and CSEM post flowed through the variables Physics 1 grade and CSEM pre, so the calculation involved the path: gender → pre Physics 1 SE → Physics 1 grade → CSEM pre → CSEM post. We again multiplied all the standardized coefficients for this mediation route as 0.24 × 0.09 × 0.48 × 0.25 = 0.002. Since this value is very small (smaller than 0.01), this path was not added to the final indirect path calculation. After we calculated all the paths greater than 0.01 that include pre Physics 1 SE in a similar way, we summed them to find the total indirect effect of pre Physics 1 SE. We repeated a similar process for the calculation of SAT math and pre Physics 2 SE as well.
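The path-product procedure described above can be sketched in a few lines. This is an illustration only: it uses just the two example paths and standardized coefficients quoted in the text, so the resulting total is not the Table VI value, and the function names `indirect_effect` and `total_indirect_effect` are hypothetical.

```python
from math import prod

# Two gender -> CSEM post routes through pre Physics 1 SE, with the
# standardized coefficients quoted in the text. The full model has more paths.
paths = {
    "gender -> pre Physics 1 SE -> CSEM post": [0.24, 0.08],
    "gender -> pre Physics 1 SE -> Physics 1 grade -> CSEM pre -> CSEM post":
        [0.24, 0.09, 0.48, 0.25],
}

THRESHOLD = 0.01  # paths whose product does not exceed this are left out

def indirect_effect(coefficients):
    """Indirect effect along one path: the product of its coefficients."""
    return prod(coefficients)

def total_indirect_effect(paths, threshold=THRESHOLD):
    """Sum the per-path products, keeping only those above the threshold."""
    effects = (indirect_effect(c) for c in paths.values())
    return sum(e for e in effects if e > threshold)

per_path = {name: indirect_effect(c) for name, c in paths.items()}
total = total_indirect_effect(paths)

for name, effect in per_path.items():
    status = "kept" if effect > THRESHOLD else "dropped"
    print(f"{name}: {effect:.4f} ({status})")
print(f"total over kept paths: {total:.4f}")
```

With only these two paths, the short route contributes about 0.019 and the long route falls below the 0.01 threshold, so it is excluded from the sum.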
The total mediated effect calculations were conducted separately for both outcome variables (CSEM post and course grade) and are shown in Table VI. For the model that has CSEM post as a learning outcome, pre Physics 1 SE had a total indirect effect three times the size of SAT math's total indirect effect. The total indirect effect of pre Physics 2 SE was about half that of pre Physics 1 SE. Therefore, gender was mainly mediated through self-efficacy differences (both in Physics 1 and Physics 2), which further impact students' CSEM post scores. For the second model, where we used Physics 2 grade as a learning outcome, gender's indirect effect was mainly mediated through pre Physics 1 SE, followed by SAT math at half the size. Pre Physics 2 SE had no indirect effect on Physics 2 grade.

A. Summary
Our most important finding is that the direct connection between gender and conceptual test results becomes nonsignificant in the SEM model for the conceptual test outcome. Further, the analysis of indirect effects revealed that the gendered patterns in conceptual test performance and course grades were mainly associated with students' self-efficacy, with a smaller role for SAT math. Further, mathematics skills and prior physics preparation appear to be correlated with the large differences in self-efficacy. In particular, prior mathematics and physics learning appears to play a small direct role in shaping later physics learning outcomes but plays an indirect role in shaping physics learning outcomes via undermining or supporting student self-efficacy, which then itself influences learning.

B. Implications
Research suggests that self-efficacy is related to students' learning or performance even after controlling for their prior academic performance differences [12,16,35]. There are several mechanisms that explain the strong impact of self-efficacy on students' motivation, academic achievement, goal orientation, and academic outcome expectations [16][17][18][19][72]. Students with high self-efficacy can engage in more challenging tasks without anxiety, which keeps the cognitive load under control, and they are more likely to persist when they face failure in such activities.
In addition to being driven by prior preparation differences, which often result from inequities including societal stereotypes and biases, students' self-efficacy is related to their interactions with peers and classroom experiences [57]. Therefore, in a male-dominated classroom environment such as in calculus-based introductory physics, a woman might experience a lower sense of belonging and a higher level of anxiety with low self-efficacy [73]. In addition, nonsupportive instructional pedagogies, lack of recognition from instructors and teaching assistants, and classroom interactions with peers can further decrease women's self-efficacy in physics. With that in mind, the instructor's focus on equity and inclusion, and approaches to recognizing students in poorly gender-balanced classrooms, become even more vital in supporting women's self-efficacy and promoting learning for all students in the classroom [74]. Since working in an equitable and supportive learning environment will be less likely to trigger stereotype threat, instructors' implementation of explicitly inclusive active-engagement strategies might help women feel more confident and competent in physics. These equitable and inclusive strategies that provide a supportive environment in which women feel recognized and valued might also bolster women's interest in taking more physics-related courses [8,74,75].
Conceptual tests that consist primarily of novel (students are unlikely to have encountered such questions before) and difficult questions are one factor that can activate or elevate stereotype threat and can lower women's self-efficacy and performance. They stand in contrast to traditional exams that comprise more familiar, quantitative questions and that give partial credit for students' solutions rather than only for correctness. As we have shown in this study, the gender gap in conceptual test performance is mediated by students' self-efficacy.
The gender gap in physics course performance can also increase in active engagement classrooms in which equity and inclusion are not treated as central constructs [46]. Therefore, the reforms towards active-engagement courses in physics not focusing on equity and inclusion will not generally be sufficient on their own to address performance gender gaps. In particular, attention to factors such as equity and inclusivity, the extent to which women feel valued and recognized and details of the support for classroom discussions will be critical in order to benefit all students equally. For example, during classroom activities, instructors must make sure that all students' opinions are valued and respected by all of the group members and all students feel free to communicate without feeling anxious or judged. In group activities during labs or lectures, female students typically have tasks that require a low level of cognitive engagement with the subject matter, such as notetaking or simply reporting the work [73]. Male students might dominate the conversations in these group discussions, which may cause female students' self-efficacy to drop even more. Therefore, in such active-engagement activities, instructors need to assign each student a role and later rotate the student's role and ensure that all students have a sense of belonging and contribute to the task equally.
One primary cause of the gender-biased beliefs in physics is the field-specific intelligence attributions. As Leslie et al. found, women recede away from the domains that are thought to require innate ability and brilliance for success in the field [14]. Physics is one of the exemplar fields that illustrates the negative correlation between the number of women and high expectations for brilliance [14]. These biases provoke fixed intelligence mindset attributions regarding how success is achieved via innate ability. Individuals discerning intelligence as a fixed characteristic then perceive struggles as a threat to their ability and failure as an indication of a lack of ability [76]. By contrast, fostering a growth mindset view encourages students to view struggle as a stepping stone that enhances learning, enabling students to become more enthusiastic about spending effort to develop their skills in physics. There are several classroom interventions designed to create better student engagement with growth mindset [77][78][79][80][81]. Some of these interventions have focused particularly on minority groups as they aim to normalize students' struggles in academic life and increase their sense of belonging and self-efficacy [77].
Failure to support women's self-efficacy, especially during their first-year college experiences, will not only have measurable short-term impact, but is likely to lead to long-term effects, such as gendered patterns of retention in STEM domains. For instance, in the absence of equitable and inclusive learning environments, initial low self-efficacy of women can increase their anxiety in exams [73] and cause them to perform worse than they otherwise would [82]. We also found that tests such as the CSEM or SAT math had gender gaps, whereas we did not find a similar gap in students' course grades. While conceptual physics tests are composed of multiple-choice questions and are given in a short period of time (e.g., CSEM tests are given in recitation sections and typically last 40-45 min), students' course grades are composed of multiple assessments such as homework, exam grades, participation, etc. Furthermore, physics exams mainly have quantitative problems in both multiple-choice and open-ended formats that give students a chance to obtain partial credit for their solution steps and are made by instructors so that students are more familiar with the types of questions posed. Therefore, we believe that assessments that involve especially difficult and unfamiliar tasks in high-stakes situations can be one factor that elevates stereotype threat, especially when these types of assessments are given in a short time. Future work should assess the effect of self-efficacy on exam performance alone as one important component of grades, since other measures of grades may mask gender differences on midterm and final exams.
The gender gap in physics self-efficacy favoring male students can have a negative impact on female students' choices of academic career. Some engineering fields are mostly male dominated and contain more physics-focused topics throughout their curricula, while other engineering degrees appear to be more gender balanced and have less focus on physics materials [73]. Because of the first-year college experiences in Physics 2 courses, women might switch out from physics-intensive engineering majors, such as mechanical or electrical engineering, despite having initial interest in these majors. Equally importantly, due to pervasive societal stereotypes and biases, women's career choices might rely on fixed mindset and ability-related negative beliefs, such as beliefs about women having low ability in physics. Therefore, supporting women in male-dominated fields also necessitates promoting and supporting positive recognition and endorsement of their competence from mentors, academic advisors, and course instructors as well as their family members [74]. In particular, academic encouragement, support, and recognition have the potential to enhance their self-efficacy and interest, and help them develop positive identities in physics-related fields [83][84][85][86][87].

C. Future directions
Future work should involve designing, implementing, and evaluating equitable and inclusive instructional strategies to address the issue of the large gender gap in physics self-efficacy. As discussed earlier, there are some interventions that have been found to improve women's self-efficacy [79][80][81]88,89] in nonphysics contexts, which may also promote equity and inclusion in the physics classrooms.

APPENDIX: MODERATION ANALYSIS
Using moderation analysis, we first tested for "strong" or "scalar" measurement invariance by fixing intercepts to equality across gender and course type groups. Neither model was statistically significantly different from the freely estimated model. For the gender group, we found chi-square difference (Δχ²) = 4.00, degree of freedom difference (Δdf) = 4, and nonsignificant p value = 0.20, so strong invariance holds. For course type, strong invariance also holds with Δχ² = 4.28, Δdf = 3, and p value = 0.23. The next step was to test for "strict" measurement invariance, where we fixed intercepts and residuals to equality. Strict invariance also holds when we compared to the scalar model (gender Δχ² = 7.5, Δdf = 4, p = 0.11; course type Δχ² = 1.17, Δdf = 1, p = 0.27) and the free model (gender Δχ² = 13.4, Δdf = 8, p = 0.09; course type Δχ² = 5.4, Δdf = 4, p = 0.24). Therefore, since strong and strict measurement invariance hold for this model, we continued on to perform other group comparisons.
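Each invariance comparison above is a chi-square difference test: the p value is the upper-tail probability of a chi-square distribution with Δdf degrees of freedom evaluated at Δχ². The following is a minimal standard-library sketch of that step, not the software used for the reported analysis; `chi2_sf` is a hypothetical helper that relies on the closed-form tail expressions valid for integer degrees of freedom.

```python
from math import erfc, exp, gamma, sqrt

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable, integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^i / i!.
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return exp(-half) * total
    # Odd df: complementary error function plus a finite half-integer sum.
    total = erfc(sqrt(half))
    for i in range(1, (df - 1) // 2 + 1):
        total += exp(-half) * half ** (i - 0.5) / gamma(i + 0.5)
    return total

# Two comparisons quoted in the text for the CSEM post model.
p_scalar_course = chi2_sf(4.28, 3)  # course type, scalar vs free model
p_strict_gender = chi2_sf(7.5, 4)   # gender, strict vs scalar model

print(f"course type, scalar: p = {p_scalar_course:.3f}")
print(f"gender, strict:      p = {p_strict_gender:.3f}")
```

A p value above the conventional 0.05 level indicates that the constrained model does not fit significantly worse than the less constrained one, which is the sense in which invariance "holds" here.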
Next, we ran a multigroup SEM where all regression estimates were fixed to equality for female and male students, and we compared this model to the previous three models (free, scalar, and strict). There was no statistically significant difference between the models, so we report the model where regression pathways are equal for men and women. The model fit parameters were good for both the gender comparison (RMSEA = 0.027, SRMR = 0.04, CFI = 0.989, TLI = 0.985) and the course type comparison (RMSEA = 0.028, SRMR = 0.034, CFI = 0.994, TLI = 0.990). The multigroup SEM results suggest that regression pathways showed very small differences across gender (e.g., from self-efficacy 1 to self-efficacy 2, female students had an unstandardized regression coefficient of 0.58 whereas male students had 0.53) or across course type (e.g., from self-efficacy 1 to self-efficacy 2, flipped classrooms had an unstandardized regression coefficient of 0.59 whereas traditional courses had 0.62). We did not find model differences when we compared this regression model across groups to the freely estimated model (for gender Δχ² = 27.29, Δdf = 17, p = 0.05; for course type Δχ² = 17.08, Δdf = 12, p = 0.16), to the scalar model (for gender Δχ² = 21.39, Δdf = 13, p = 0.06; for course type Δχ² = 12.02, Δdf = 9, p = 0.17), and to the strict model (for gender Δχ² = 13.8, Δdf = 9, p = 0.12; for course type Δχ² = 11.02, Δdf = 8, p = 0.15).
We followed a similar approach when we tested the moderation effect for the second model, where we used Physics 2 grade as a learning outcome. For gender, we found Δχ² = 9.22, Δdf = 4, and nonsignificant p value = 0.056. For course type, strong invariance also holds with Δχ² = 10.42, Δdf = 5, and p value = 0.06. The next step was to test for "strict" measurement invariance, where we fixed intercepts and residuals to equality. Strict invariance did not hold when we compared to the scalar model and the free model for both course type and gender moderation. However, strict invariance rarely holds. Therefore, we continued on to perform regression model comparisons. There was no statistically significant difference between the models, so we report the model where regression pathways are equal for men and women. The model fit parameters were good for both the gender comparison (RMSEA = 0.015, SRMR = 0.042, CFI = 0.997, TLI = 0.996) and the course type comparison (RMSEA = 0.018, SRMR = 0.038, CFI = 0.997, TLI = 0.996). The multigroup SEM results suggest that regression pathways showed very small differences across gender (e.g., from self-efficacy 1 to self-efficacy 2, female students had an unstandardized regression coefficient of 0.54 whereas male students had 0.59) or across course type (e.g., from SAT math to Physics 1 grade, flipped classrooms had an unstandardized regression coefficient of 0.59 whereas traditional courses had 0.62). We did not find model differences when we compared this regression model across groups to the freely estimated models (for gender Δχ² = 18.62, Δdf = 12, p = 0.13; for course type Δχ² = 24.46, Δdf = 13, p = 0.02), to the scalar model (for gender Δχ² = 9.39, Δdf = 9, p = 0.40; for course type Δχ² = 14.04, Δdf = 8, p = 0.08), and to the strict model (for gender Δχ² = 3.58, Δdf = 5, p = 0.61; for course type Δχ² = 5.76, Δdf = 7, p = 0.56).