Exploring the Gender Gap in the Conceptual Survey of Electricity and Magnetism

The “gender gap” on various physics conceptual evaluations has been extensively studied. Men’s average pretest scores on the Force Concept Inventory and Force and Motion Conceptual Evaluation are 13% higher than women’s, and post-test scores are on average 12% higher than women’s. This study analyzed the gender differences within the Conceptual Survey of Electricity and Magnetism (CSEM) in which the gender gap has been less well studied and is less consistent. In the current study, data collected from 1407 students (77% men, 23% women) in a calculus-based physics course over ten semesters showed that male students outperformed female students on the CSEM pretest (5%) and post-test (6%). Separate analyses were conducted for qualitative and quantitative problems on lab quizzes and course exams and showed that male students outperformed female students by 3% on qualitative quiz and exam problems. Male and female students performed equally on the quantitative course exam problems. The gender gaps within CSEM post-test scores, qualitative lab quiz scores, and qualitative exam scores were insignificant for students with a CSEM pretest score of 25% or less but grew as pretest scores increased. Structural equation modeling demonstrated that a latent variable, called Conceptual Physics Performance/Non-Quantitative (CPP/NonQnt), orthogonal to quantitative test performance was useful in explaining the differences observed in qualitative performance; this variable was most strongly related to CSEM post-test scores. The CPP/NonQnt of male students was 0.44 standard deviations higher than female students. The CSEM pretest measured CPP/NonQnt much less accurately for women (R² = 4%) than for men (R² = 17%). The failure to detect a gender gap for students scoring 25% or less on the pretest suggests that the CSEM instrument itself is not gender biased. The failure to find a performance difference in quantitative test performance while detecting a gap in qualitative performance suggests the qualitative differences do not result from psychological factors such as science anxiety or stereotype threat.
DOI: 10.1103/PhysRevPhysEducRes.13.020114


I. INTRODUCTION
The difference in the performance of male and female students on many of the conceptual evaluations commonly used in physics education research (PER) is well documented and pervasive. Madsen, McKagan, and Sayre provided an overview and analysis of the "gender gap" [1]. Most research has focused on instruments measuring conceptual knowledge of Newtonian mechanics including the Force Concept Inventory (FCI) [2] and the Force and Motion Conceptual Evaluation (FMCE) [3]. For example, in a large study (N = 5500) Docktor and Heller reported that male students outperformed female students by 15% on the FCI pretest and 13% on the post-test [4] even though there was no difference in course grade.
Electricity and magnetism evaluations such as the Conceptual Survey of Electricity and Magnetism (CSEM) [5] and the Brief Electricity and Magnetism Assessment (BEMA) [6] are less well studied. In aggregate, these instruments have demonstrated a gender gap of 3.7% on the pretest and 8.5% on the post-test [1]. The gender gap on these instruments is less consistent, with Pollock [7] reporting a negative gender gap. Results from the current study were similar to the majority of electricity and magnetism studies, and showed that women scored 6% lower on average than men on the CSEM post-test and 5% lower on the pretest.
This research adds to the extensive literature on gender gaps in performance on PER conceptual instruments by providing a study featuring a large sample performed at an institution with a less well academically prepared population than many other studies. It also adds to the literature on gender gaps in electricity and magnetism, which have not received the same level of attention as gender gaps in mechanics. This study furthers the understanding of the gender gap by comparing gender gaps observed in the CSEM to student performance on both quantitative and qualitative problems assigned in the course studied. These additional problems were assigned in both higher stakes in-semester examinations and lower stakes quizzes, allowing the analysis of the effect of testing conditions on the gender gap.
Section II provides a review of the literature on several possible sources of CSEM gender differences. In Sec. III, the data collected, coding of problems, and classroom context are discussed. In Sec. IV, the overall results and the results disaggregated by pretest score are presented; structural equation modeling results using the latent variable of Conceptual Physics Performance/Non-Quantitative (CPP/NonQnt) are also presented. In Secs. V-VIII, the results are discussed in light of prior findings, implications for instruction, and future work.

II. BACKGROUND
Factors that might be related to the gender gap include prior preparation in physics, performance on standardized tests, cognitive differences in learning, math and/or science anxiety, and stereotype threat. Additionally, high school course-taking patterns and other sources of conceptual prior knowledge such as informal learning experiences are subject to broad patterns of gender socialization. These patterns have been found to be significant in other male-dominated fields such as computer science [8], and if an informal background is similarly helpful in this area, we might expect to see differences especially on qualitative problems. This work will treat gender as a binary variable despite the call by Traxler et al. for a more nuanced treatment of gender and sex [9]. More complete demographic data were not available to explore additional dimensions of the role of gender in the course studied here.

A. Prior knowledge
Differences in prior preparation in physics between male and female students are well documented. Using data drawn from a nationally representative sample [10], a 2015 National Center for Education Statistics report showed that women enroll in high school physics classes at a lower rate than men, with male students receiving high school physics credit at a 5.6% higher rate than female students [11]. Women take chemistry and advanced biology at significantly higher rates than men. The ACT, the company that administers one of the two major U.S. college entrance examinations, reports (N = 1 009 232) that in 2016 21% of women and 30% of men met the ACT College Readiness in Science, Technology, Engineering, and Mathematics (STEM) benchmark [12]. Taking physics in high school has been shown to increase physics grades in college [13,14] and, therefore, might improve scores on conceptual evaluations.
Antimirova, Noack, and Milner-Bolotin reported that taking high school physics predicted more variation in FCI pretest score than in FCI post-test score but did not find gender predictive of FCI post-test score [15]. Kost, Pollock, and Finkelstein looked at prior physics knowledge by binning students by their FMCE pretest scores and compared FMCE post-test scores between men and women. They found no difference between men and women in any of the pretest bins [16]. In this work, a "bin" will be defined as a range of pretest scores; scores are "binned" if divided into groups by ranges of scores. In contrast, Kohl and Kuo binned on CSEM pretest scores and found gender differences in normalized gain in most of the pretest score bins [17]. Kost-Smith, Pollock, and Finkelstein found that male students outperformed female students by 1.5% on the BEMA pretest, a gap that grew to 6% on the post-test [18].
Kost-Smith, Pollock, and Finkelstein also explored using the FMCE post-test score from the previous class as a measure of prior knowledge. They separated students into five FMCE post-test bins. A higher proportion of women than men were found in the lower FMCE post-test score bins while more men than women were found in the higher bins [18]. Bates et al. also found that the lowest performing quartile of students on the FCI pretest consisted of approximately half of the female student population. Most of these female students remained in the lowest performing quartile on the FCI post-test [19].

B. Gender in standardized testing and grades
Gender gaps between male and female student performance on standardized examinations such as the Scholastic Aptitude Test (SAT) or Graduate Record Examination (GRE) have also been documented. The Educational Testing Service's (ETS) Gender Study (1997) provided a nuanced analysis showing gender differences varied by subject and that differences were not uniform within the same subject (male students were better at some mathematics skills, female students at other skills) and that a large gender gap between male and female students that had existed in math and science in 1960 had largely closed by 1990 [20]. The female advantage in language skills had not closed. More recently, the College Board reported that in 2006 male and female students scored approximately equally on the SAT verbal/critical reasoning subtest; however, male students scored 536 on the mathematics subtest while female students scored 502 [21]. This difference represented approximately one-third of a standard deviation. The difference had been approximately constant for the previous decade. The ETS concluded that "Gender differences are not easily explained by single variables such as course-taking patterns or types of tests. They not only occur before course-taking patterns begin to differ and across a wide variety of tests and other measures, but they are also reflected in different interests and out-of-school activities, suggesting a complex story of how gender differences emerge" [20].
The differences observed in standardized test performance are counter to a generally consistent higher performance on course grades by women [20]. Voyer and Voyer provide an overview of this body of research in a meta-analysis of studies involving over one million students at all academic levels K-20 [22]. The female academic advantage was strongest in language classes, Cohen's d = 0.37, and weakest in science, d = 0.15, and mathematics, d = 0.07; however, for classes where female students outnumbered male students, the advantage in math was reduced to d = 0.03 and in science to d = 0.01. The female advantage in mathematics and science grades also became smaller with time from middle school through college. By Cohen's effect size conventions, d = 0.2 represents a small effect, d = 0.5 a medium effect, and d = 0.8 a large effect [23]. Cohen suggests that results of statistical analyses must be interpreted in terms of practical as well as statistical significance [24]. More recent analysis has suggested that Cohen's original effect size criteria should be adjusted for educational research with medium effects as d = 0.4 and large effects as d = 0.6 [25].
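As an illustration of how the effect sizes quoted above can be computed and labeled, the following is a minimal sketch of Cohen's d using the pooled sample standard deviation. The function names, the "negligible" label for values below the small-effect value, and the use of the conventional values as lower cut points are assumptions made for this example and are not taken from the studies cited.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def effect_label(d, small=0.2, medium=0.5, large=0.8):
    """Label |d| by treating the conventional values as lower cut points.

    Defaults are Cohen's conventions; pass medium=0.4, large=0.6 for the
    adjusted educational-research criteria. "negligible" is an assumed label.
    """
    size = abs(d)
    if size >= large:
        return "large"
    if size >= medium:
        return "medium"
    if size >= small:
        return "small"
    return "negligible"


# Hypothetical scores, chosen only to make the arithmetic easy to follow.
d = cohens_d([2, 4, 6], [1, 3, 5])
print(d, effect_label(d))
```

Under this labeling, the language-class advantage of d = 0.37 would count as small by Cohen's original conventions and would remain just below the adjusted medium criterion of 0.4.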
The gender gap on standardized examinations may be related to the gender gap on conceptual evaluations. Kost, Pollock, and Finkelstein used regression analysis to show that combining the FMCE pretest score along with math placement exam score, Colorado Learning Attitudes about Science Survey [26] pretest, and the semester the physics course was taken, explained 70% of the gender gap in the FMCE post-test [16]. A similar model explained 62% of the gender gap in BEMA post-test scores [18]. Men also outperform women on the FCI post-test when using the SAT math score as a covariate [27]. The gender gap has been shown to be the greatest for students with high reasoning skills (Lawson scores) [28].

C. Cognitive factors
Differences in physics prior knowledge may imply that male students are more likely to be relearning the material than female students. This could have differing effects on the pretest and the post-test. The relation of relearning a complex task to learning it for the first time has been extensively studied [29], was central to the development of early theories of memory [30], and, more recently, has been shown to have a physiological origin [31]. In foundational experiments, Ebbinghaus demonstrated that the more thoroughly a task is initially learned, the more quickly it can be relearned [30]. Patterns of learning and forgetting have also been measured within a physics class, finding substantial fluctuations in student knowledge levels on the same topic within a semester [32].
A large body of literature exists exploring the differences in numerous cognitive abilities between men and women [33]. The evidence for superior male spatial reasoning abilities [34,35] and superior female verbal abilities [36,37] is fairly robust, but these constructs are multidimensional and advantages are not uniform across all subfacets. Conceptual physics problems often involve a mixture of verbal, graphical, and logical reasoning. Cognitive researchers have not yet investigated whether there is a gender-based cognition advantage for either sex in the processes needed to solve conceptual physics problems. Some evidence for a cognitive effect on physics performance has been demonstrated; a program of spatial training was shown to result in improved test performance in introductory mechanics [38]. As such, if cognitive differences are the origin of the gender gap, targeted training may alleviate the differences. Spatial training has proven effective in improving spatial reasoning and shows promise for improving retention of women to STEM [39]. For a review of current research on cognitive sex differences see Miller and Halpern [40].

D. Science and math anxiety
Mathematics anxiety can cause students of both genders to perform more weakly on quantitative assessments. Differences in math anxiety by gender have been investigated [41,42]. The difference in mathematics anxiety between boys and girls had approximately the same effect size, d = 0.28, as the difference in mathematics self-efficacy, d = 0.33. These differences were substantially larger than the differences in mathematics performance, d = 0.11 [42]. Mathematics anxiety has been shown to be negatively correlated with performance, r = −0.27 [41], a relation that is independent of gender. The effect size conventions for correlations suggest r = 0.1 as a small effect, r = 0.3 a medium effect, and r = 0.5 a large effect [23].
The phenomenon of science anxiety and its relationship to gender has also been explored [43][44][45][46]. Mathematics and science majors have the lowest levels of science anxiety when compared to nonscience majors [47]; however, within these mathematics and science majors, female students were more anxious than male students.
Within the physics classroom, students with more communication apprehension achieved lower gains on the FCI [48]. Physics students that see their instructors as allowing more autonomy had lower anxiety about taking a physics course and demonstrated higher performance [49].

E. Testing conditions
Testing conditions may also influence the gender gap. Conceptual evaluations are often given under low stakes testing conditions where students receive credit for good faith efforts. It is possible that male and female students react differently to testing conditions and that their performance would be changed if the evaluation was given as part of a higher stakes in-semester examination. Significant differences in exam performance for "low stakes" and "high stakes" applications of the same instrument have been demonstrated with small effect sizes [50]. Higher exam stakes have been shown to be positively correlated with student motivation and performance [51]. The relation of interest and effort on low-stakes science and mathematics test performance has also been demonstrated [51] with interest positively correlated with performance. Unfortunately, these studies have controlled for gender rather than investigated differences by gender. Other testing conditions such as the time limit placed on the examination have not been shown to have a significant effect on performance [20].

F. Stereotype threat
Women are substantially underrepresented in physics [52] and in the engineering disciplines that provide the majority of the enrollment in many calculus-based physics classes [53]. The National Science Foundation reported that in 2014, while women received 57% of all bachelor degrees in the U.S., they received only 19% of those awarded in physics and 20% of those awarded in engineering [54]. As a substantially underrepresented population, the performance of women in physics classes may be influenced by stereotype threat. The effect of stereotype threat on academic performance has been investigated as an explanation of differences in performance of men and women in STEM disciplines [55][56][57]. Shapiro and Williams define stereotype threat as "a concern or anxiety that one's performance or actions can be seen through the lens of a negative stereotype - a concern that disrupts and undermines performance in negatively stereotyped domains" [58]. Studies have shown that stereotype threat does indeed have a negative effect on both women's performance and women's interest in STEM fields [58]. Picho, Rodriguez, and Finnie's meta-analysis examined over 15 years of research, specifically about female performance in mathematics under stereotype threat [59]. The research showed an overall negative effect on the quantitative performance of female students, d = −0.24; however, this effect was greater for middle school and high school students compared to college students. Gunderson et al. investigated how parents' and teachers' gender-related math attitudes can have a negative effect on women when choosing a STEM or math-related career [60]. Within physics, Koul, Lerdpornkulrat, and Poondej demonstrated a three-way interaction between gender typicality, gender contentedness, and gender stereotypes on physics self-concept [61]. Women who had strong math gender stereotypes and a combination of high gender typicality and gender contentedness had a negative physics self-concept.

G. Instrumental and other effects
Multiple authors have suggested that some items within the FCI [2] exhibit a gender bias [62][63][64][65]; however, these results have been inconsistent. The CSEM is substantially less well studied than the FCI and similar studies have not yet been carried out. The item contexts in the CSEM are often fairly abstract (point charges, field maps) unlike the more concrete contexts of the FCI (rockets, planes) and may be less susceptible to gender bias. Differences in the gender gap by item have also been identified in in-semester physics assessments [66] and in problems used in physics competitions [67].
Finally, other factors that may contribute to the gender gap on physics conceptual inventories include method of instruction and the use of a standardized instrument. Multiple studies have shown interactive engagement instructional methods are beneficial in reducing the gender gap on conceptual evaluations [68][69][70] and improving success in physics classes [71]; however, the reduction of the gender gap has not been replicated in all settings [72]. The use of a standardized instrument may cause mismatches in coverage between the instrument and the class tested, presenting students with problems on which they have received little instruction. This could produce gender differences either through differences in prior knowledge or through differences in the psychological response to being asked to solve problems one should not be expected to answer correctly. The psychological response could interact with stereotype threat.
The results of this study do not advance any claim that gendered patterns reported here are fixed, inherent, or apply equally to any individual student (regardless of their gender identity). In most calculus-based physics courses, 80% or more of the students identify as male. In settings with skewed demographic samples, it is important to ask whether reported learning gains are equally distributed among students, or whether they primarily accrue to students from traditionally overrepresented groups in physics.

H. Research questions
This research seeks to answer the following research questions:
RQ1: Does student performance on the CSEM show evidence of a gender gap in the course studied?
RQ2: How does the difference in male and female performance on the CSEM compare with those observed in other problems assigned in the course? Are differences consistent between qualitative and quantitative problems? Are differences consistent between low and high stakes testing conditions?
RQ3: Are these differences dependent on the student's CSEM pretest score?
RQ4: If a single latent variable is constructed to measure the difference in qualitative and quantitative performance, how does this variable differ by testing conditions? How does this variable differ for male and female students?

A. Context for research
The research was conducted in the second-semester, calculus-based physics course at the University of Arkansas, a large midwestern land-grant university serving approximately 25 000 students in the United States. The institution had a Carnegie classification of highest research activity through the period studied. The institution, however, had lower national stature and featured engineering and science graduate programs that ranked lower than those found in many PER studies [73]. At the time of the submission of this work, the undergraduate engineering program was ranked 105th [74]; this ranking was fairly consistent for all semesters studied. Engineering students form the majority of the students (80%) in the class studied. Much of the PER research cited in the introduction was performed at more highly ranked institutions. For example, the University of Colorado-Boulder's undergraduate engineering program was ranked 32nd and Colorado School of Mines 44th at the time the data were accessed [74]. As such, the students studied should be somewhat less academically prepared than those in many previous studies of gender differences in physics. The course studied covered electricity, magnetism, and optics. Most students taking the course were enrolled in engineering or physical science degree programs and elected the course because it was required for their major.
While there was some spring-to-fall fluctuation of overall class size, the gender composition of participating students was fairly consistent for the 10 semesters studied. The class size and the percentage of male and female students is shown for each semester in Table I. Women were substantially underrepresented in the course for all semesters studied.
Students were required to attend two 50-min lectures each week and two 2-h laboratory sessions. Lectures were presented traditionally with attendance managed with an in-class quiz. Homework was due before each lecture session. Homework assignments were divided into an open-response assignment collected on paper before each lecture and a multiple-choice assignment entered electronically before each lecture. Four in-semester examinations and a final examination were used to assess student learning. Laboratory sessions featured a mixture of TA-led demonstrations, small group problem solving, inquiry-based explorations, and traditional laboratories. Students were given a quiz during each laboratory session, a lab quiz, to assess their understanding of the previous homework assignment. The CSEM was used to measure student conceptual understanding gains and was given as a lab quiz pre- and postinstruction; both were graded for credit just as any other lab quiz. All course assignments featured a mixture of conceptual and quantitative problems. The course was presented with few modifications during the period studied. The course was considered effective by the physics department, producing strong learning gains on the CSEM and high course evaluation scores for the lead lecturer and teaching assistants, and encouraging many students to elect physics as a major, leading to a strong growth in the number of physics majors graduated [75].
The course studied was designed to be both an excellent learning experience for students and a stable research environment for PER. The same lead instructor presented all lectures, designed all assignments, and oversaw TA training during the time studied. As such, much of the variation present in many courses was minimized.

B. Conceptual survey of electricity and magnetism
This work will compare the pretest and post-test responses on the CSEM to answers to other multiple-choice physics problems in the class studied. The CSEM is a 32-problem multiple-choice instrument containing qualitative problems in electricity and magnetism. The test requires approximately 45 min to administer. Each problem has 5 possible responses. The problems cover a range of electromagnetic topics including the electrostatic behavior of point charges, electric potential, the magnetostatic behavior of electric currents, and induction. The test does not cover Gauss' or Ampere's law, electric circuits, or electromagnetic waves.

C. Identifying nonquantitative problems
Each problem presented in the course was classified as either quantitative or nonquantitative using a rubric developed for a National Science Foundation project (DUE-0535928). This rubric was developed to allow reliable classification while also identifying all problems presented in popular PER conceptual evaluations as nonquantitative. The identification of nonquantitative problems was complicated by the existence of conceptual inventory problems requiring some mathematics (for example, if the distance between two point charges is doubled, how does the electric force change?) or problems that were only superficially quantitative (for example, an object with radius 4 cm and volume charge density 3 μC/m³ is stationary at the origin, what is the magnetic field at a point 10 cm along the positive y axis?). The last example contains numbers but requires no calculation and could be converted into a problem that would be identified as quantitative by modifying it to require numeric calculation (for example, an object with radius 4 cm and volume charge density 3 μC/m³ is stationary at the origin, what is the electric field at a point 10 cm along the positive y axis?). The rubric was constructed and tested on problems found in popular textbooks. Three raters applied the rubric to problems found in seven textbooks achieving 96% agreement. One rater then used the rubric to classify all problems presented in the course studied.

D. Evaluation environment
The class required students to complete a variety of assignments: homework, quizzes completed in lecture (lecture quizzes), quizzes completed in the laboratory (lab quizzes), and in-semester examinations. Lecture quizzes and homework were often completed cooperatively and, therefore, could not be used as individual measures of understanding. Lab quizzes and in-semester examinations were administered so that each student worked individually. In-semester examinations were composed of both open-response and multiple-choice problems; only the multiple-choice test problems were analyzed in this study. The multiple-choice test problems were fairly evenly divided between qualitative (nonquantitative by the above rubric) and quantitative problems. The average of the qualitative multiple-choice test problems is denoted as test qualitative or "TestQual." The average of the quantitative multiple-choice test problems is denoted as test quantitative or "TestQuant." Lab quizzes were composed primarily of conceptual problems designed to evaluate the students' understanding of the previous homework assignment (not the lab they had just completed). They were taken on computers in the lab room during the lab session. The average of the qualitative lab quiz problems is denoted by lab qualitative or "LabQual." There were insufficient numbers of quantitative lab quiz problems for analysis. The CSEM pretest and post-test were administered and graded as lab quizzes, and therefore, the lab qualitative average measured a second set of qualitative problems given under the same testing conditions as the CSEM.
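The per-student averages described above (TestQual, TestQuant, and LabQual) amount to the percentage of correct responses within each problem collection. A minimal sketch follows; the record layout of (collection, is_correct) pairs and the function name are hypothetical and chosen only for illustration, since the data format used in the study is not described here.

```python
from collections import defaultdict


def collection_averages(responses):
    """Percent correct per problem collection for one student.

    `responses` is an iterable of (collection, is_correct) pairs, a
    hypothetical layout assumed for this example.
    """
    totals = defaultdict(lambda: [0, 0])  # collection -> [n_correct, n_answered]
    for collection, is_correct in responses:
        totals[collection][0] += int(is_correct)
        totals[collection][1] += 1
    return {
        name: 100.0 * n_correct / n_answered
        for name, (n_correct, n_answered) in totals.items()
    }


# Made-up responses for a single student.
student = [
    ("TestQual", True), ("TestQual", False),
    ("TestQuant", True),
    ("LabQual", True), ("LabQual", True), ("LabQual", False), ("LabQual", False),
]
print(collection_averages(student))
```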
This study will, then, evaluate the average score for male and female students on five collections of problems: the CSEM pretest, CSEM post-test, qualitative lab quiz problems, qualitative test problems, and quantitative test problems. These problems were administered to students in two testing environments: the lab quiz environment and the in-semester examination environment.
All problems were given postinstruction and were specifically designed for the course (except CSEM problems). As such, all test and lab quiz problems were problems the instructor believed had been covered during the course. The tests formed approximately 70% of the course grade, were administered in large lecture theaters, and were therefore a moderately high pressure experience. Lab quizzes formed only 5% of the course grade, were administered in lab, and were believed to be a much lower pressure experience.
In the class studied, four in-semester examinations were administered; only the first three are included in this study. The last three weeks of the class and the fourth in-semester examination were devoted to ray optics, which is not covered by the CSEM. All ray optics problems were removed from the analysis so that the coverage of the analyzed lab quiz and test problems was the same general coverage as the CSEM. No CSEM problem was used in the non-CSEM lab quizzes, the in-semester tests, or any other material or assignment in the class.

E. Sample
The data were collected from the Fall 2007 semester to the Spring 2012 semester. During this time, 1851 students completed the class for a grade (77% male and 23% female). Students who did not complete all problems on the CSEM pretest or post-test were eliminated, leaving N = 1407 students that formed the sample for the analysis which follows. Multiple-choice responses to all CSEM pretest, post-test, qualitative lab quiz, and test problems were collected from these students, which resulted in a data set containing 199 483 responses: CSEM pretest 45 024, CSEM post-test 45 024, qualitative lab quiz 70 749, qualitative test 18 993, and quantitative test 19 693.
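The response counts above can be cross-checked with simple arithmetic: each of the 1407 students answered all 32 CSEM problems on both the pretest and the post-test, and the five collections should sum to the stated total. The sketch below only re-derives numbers already given in the text; the variable names are illustrative.

```python
N_STUDENTS = 1407   # students who completed every CSEM pretest and post-test problem
CSEM_ITEMS = 32     # problems on the CSEM

# Responses per CSEM administration: every student answered every problem.
csem_responses = N_STUDENTS * CSEM_ITEMS

# Response counts per problem collection, as reported in the text.
reported = {
    "CSEM pretest": 45_024,
    "CSEM post-test": 45_024,
    "qualitative lab quiz": 70_749,
    "qualitative test": 18_993,
    "quantitative test": 19_693,
}

total_responses = sum(reported.values())
print(csem_responses, total_responses)
```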

F. Bonferroni correction
This work will report multiple statistical tests and as such inflation of the type I error rate should be considered. The large sample size also makes interpretation of significance tests problematic and effect sizes will be reported when possible. A Bonferroni correction adjusts the significance levels for the number of statistical tests by dividing the p value by the number of statistical tests performed. This work will employ 15 statistical tests. A Bonferroni correction would adjust significance levels with p < 0.05 becoming p < 0.0033, p < 0.01 becoming p < 0.000 67, and p < 0.001 becoming p < 0.000 067. Few results will be changed by this correction. Most tests produced significance levels of p < 0.001; these results were also significant at the p < 0.000 067 level. Uncorrected p values will be reported. Tests that would be modified by the correction will be noted as they are presented in the text. The structural equation modeling analysis and the many statistical tests implied by the analysis were treated as independent and not included in this correction. Table II summarizes the overall averages separated by gender for each problem collection. On average, male students outperformed female students on each set of qualitative problems including the CSEM pretest (5%), the CSEM post-test (6%), the laboratory quizzes (3%), and the in-semester tests (3%). Male and female students performed equally on in-semester quantitative test problems.

IV. RESULTS
The gender differences were examined using t tests. Significant differences between male and female students were found on the CSEM pretest [t(729) = 8.59 …]. Effect sizes ranged from small for the qualitative test average and lab quiz average to small to medium for the CSEM pretest and post-test scores. There was no significant difference between male and female students on the quantitative test problems. If corrected for the number of statistical tests performed using a Bonferroni correction, the difference between male and female students on qualitative test problems would not be significant, and the difference on the qualitative lab quiz problems would be significant at the p < 0.05 level.
The data set was reduced from the 1851 students who completed the course for a grade to the 1407 student sample for this study by the restriction to students who completed all problems on both the pretest and post-test. If this restriction is relaxed, the pretest and post-test averages change little. For the 1788 students who answered any problem on the pretest, the mean pretest percentage was 27.7% [men 28.6%; women 24.4%], which was very similar to the scores of the 1613 students who answered all pretest problems, 27.8% [men 28.9%; women 24.4%]. These values are also very similar to the results in Table II for students who answered all pretest and post-test questions. Blank questions were treated as incorrect in this analysis.
A. The effect of the pretest score
Prior conceptual knowledge was measured by giving the CSEM as a pretest. A density distribution of male and female pretest scores is presented in Fig. 1. Table II and Fig. 1 show that male students have a higher pretest average, but also that the male pretest distribution is skewed with a substantial number of men receiving high pretest scores. The post-test density distribution is plotted in Fig. 2.
To explore the effects of these differences in pretest scores on students' performance postinstruction, the sample was divided into subgroups. The CSEM is a 32-problem, 5-response evaluation and, therefore, a student should answer 6.4 problems correctly on average if he or she guesses randomly. To produce groups that contained enough female students for analysis, students were grouped into pretest score ranges (bins) 0-6, 7-8, 9-10, and 11-12. Too few female students scored 13 or above on the pretest for analysis. Figure 3 presents the average score within each pretest range for male and female students for each problem collection. For pretest scores between 0 and 8 (bins 0-6 and 7-8), a t test found no significant difference between male and female students in the number of correct responses for any problem collection; therefore, no gender gap exists for pretest scores of 25% or less. Although a small gap of approximately 2% was observed in the CSEM post-test scores for students scoring 25% or less on the pretest, this difference was not significant. The gender gap in the CSEM post-test grew rapidly with the pretest score. A similar, but weaker, relationship between pretest score and gender gap was found in both the qualitative test and lab quiz problem scores. No significant gender gap was found for quantitative test problems; female students outperformed male students particularly at the lowest levels of preparation. The equal quantitative test averages resulted from a greater number of male students with higher levels of preparation who were not plotted in Fig. 3.
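The guessing expectation and the pretest bins described above can be written down compactly. This is a minimal sketch under the assumption that bins are inclusive integer ranges of raw pretest scores; the names `pretest_bin` and `percent_score` are hypothetical helpers introduced here for illustration.

```python
# Illustrative sketch of the pretest binning described in the text.
N_ITEMS = 32    # CSEM problems
N_CHOICES = 5   # responses per problem

# Expected number correct under purely random guessing.
expected_guess_score = N_ITEMS / N_CHOICES

# Inclusive raw-score ranges used to group students (from the text).
BINS = [(0, 6), (7, 8), (9, 10), (11, 12)]


def pretest_bin(score):
    """Return a label such as '7-8', or None for scores of 13 or above,
    which were not analyzed because too few women scored in that range."""
    for low, high in BINS:
        if low <= score <= high:
            return f"{low}-{high}"
    return None


def percent_score(score):
    """Convert a raw CSEM score to a percentage."""
    return 100 * score / N_ITEMS
```

A raw score of 8 corresponds to 25%, which is why the two lowest bins together cover the "25% or less" group discussed in the text.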

B. Latent variable analysis
The qualitative outcomes measured by CSEM post-test score, lab quiz average, and qualitative test average showed similar behavior when plotted against CSEM pretest score (Fig. 3). All have small differences at the lowest pretest scores, but a growing difference between male and female outcomes becomes apparent as the pretest score increases. This pattern of increasing gender difference in performance was not observed in the quantitative test results. The similarity of the qualitative results suggested that the difference in qualitative and quantitative performance may be explained by a common latent variable. This variable should be related to the prior conceptual knowledge required for higher pretest scores and any cognitive ability that aids in the solution of qualitative problems but does not contribute to the solution of quantitative problems. As Meltzer noted [76], pretest scores combine prior knowledge with academic ability. We called the latent variable Conceptual Physics Performance/Non-Quantitative or CPP/NonQnt. CPP/NonQnt was operationalized as the part of conceptual performance not explained by overall physics quantitative performance as measured by quantitative test average. CPP/NonQnt measures the part of the effect of prior knowledge and conceptual ability that does not result in improved quantitative performance.
Structural equation modeling (SEM) was used to extract CPP/NonQnt and to assess whether it is a productive variable for understanding the differences in conceptual performance observed. First, to control for general physics ability, the quantitative test average was used as the independent variable in regressions against the qualitative dependent variables: CSEM pretest score, CSEM post-test score, qualitative lab quiz average, and qualitative test average. A latent variable, CPP/NonQnt, was then introduced and used to predict the qualitative variables. CPP/NonQnt was required to be orthogonal to the quantitative test average. The "lavaan" package in the R statistical software system was then used to fit the model and the result is shown in Fig. 4 [77].

Further, the null model for the chi-squared test of perfect model fit is not well aligned with the research question which explores the efficacy of a single latent CPP/NonQnt variable; this assumption is expected to be only approximately true as CPP/NonQnt must certainly be a multidimensional construct. The weaknesses of the chi-squared test at large N as well as its sensitivity to the features of the underlying distribution and the size of the model correlations [77] have led to a number of additional statistics with superior performance and extensive research into combinations of statistics [78]. This continues to be an active area of research and general rules for SEM fit are still under development [79]. These fit statistics, called approximate fit indices, suggested a good model fit [78]. A wide variety of indices exist; among the most used are the root mean square error of approximation (RMSEA), the standardized root mean square residual (SRMR), and the comparative fit index (CFI). Hu and Bentler [78] found that a combination of two fit statistics dramatically improves the probability of retaining a correct model or rejecting an incorrect model. They suggest RMSEA < 0.05, SRMR < 0.09, and CFI > 0.96 for an acceptable model fit. For the model shown in Fig. 4, RMSEA = 0.039, SRMR = 0.012, and CFI = 0.997, all well within the range of good model fit. The 90% confidence interval of the RMSEA was 0.005 to 0.075. A RMSEA less than 0.05 is considered good fit and greater than 0.10 poor fit; the confidence interval excludes the region of poor fit [78]. All regression coefficients, factor loadings, and variances were significant (ps < 0.001). As such, the model fit statistics suggest the latent variable, CPP/NonQnt, produced a model that improves upon a model without the latent variable.
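The comparison of the reported fit indices against the cutoffs quoted above is simple enough to express as a check. The sketch below is not part of the original analysis (which was carried out in R); it only encodes the cutoffs and values stated in the text, and the helper name `meets_cutoffs` is hypothetical.

```python
# Illustrative check of the reported fit indices against the cutoffs
# quoted in the text. Values and cutoffs are copied from the text.
CUTOFFS = {
    "RMSEA": ("<", 0.05),
    "SRMR": ("<", 0.09),
    "CFI": (">", 0.96),
}

REPORTED = {"RMSEA": 0.039, "SRMR": 0.012, "CFI": 0.997}
RMSEA_CI_90 = (0.005, 0.075)   # 90% confidence interval from the text
POOR_FIT_RMSEA = 0.10          # RMSEA above this is considered poor fit


def meets_cutoffs(fit, cutoffs=CUTOFFS):
    """Return a dict mapping each index name to True if it meets its cutoff."""
    results = {}
    for name, (direction, bound) in cutoffs.items():
        value = fit[name]
        results[name] = value < bound if direction == "<" else value > bound
    return results


# True when the whole confidence interval lies below the poor-fit region.
ci_excludes_poor_fit = RMSEA_CI_90[1] < POOR_FIT_RMSEA
```

Note that the upper end of the RMSEA interval (0.075) lies above the good-fit cutoff of 0.05 but below the poor-fit boundary of 0.10, which is the sense in which the interval "excludes the region of poor fit."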
The distribution of male and female CPP/NonQnt is shown in Fig. 5; a density plot of each distribution is also included. The CPP/NonQnt calculated by SEM was normalized by subtracting the mean and dividing by the standard deviation. Because CPP/NonQnt is normalized, the difference between male and female students shown in Fig. 5 may be interpreted as a Cohen's d effect size; the difference between the male and female CPP/NonQnt, 0.44, therefore represents a small to medium effect size.
The binning used in Sec. IV A was repeated in Fig. 6, which demonstrated a growing difference in CPP/NonQnt with the CSEM pretest score, as well as an approximately linear relation between male pretest scores and CPP/NonQnt. The relation of pretest score to CPP/NonQnt for women was approximately flat for pretest scores of 7 or more. Correlation analysis was used to explore this qualitative difference. The Pearson correlation coefficient, r, between pretest score and CPP/NonQnt was smaller for women, r = 0.20 [t(321) = 3.57, p < 0.001], than for men, r = 0.41 [t(1082) = 14.86, p < 0.001]. As such, pretest score explained 17% of the variance in CPP/NonQnt for men, but only 4% for women. The correlation between CPP/NonQnt and pretest score for female students would not be significant if corrected for the number of statistical tests performed using a Bonferroni correction.
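The link between the correlation coefficients, the t statistics, and the variance explained can be illustrated with the standard relations t = r sqrt(df) / sqrt(1 - r^2) and R^2 = r^2. The sketch below uses only the rounded values quoted in the text; the function names are hypothetical, and because r is reported to two decimals, recomputed t values will differ slightly from those reported.

```python
import math


def r_squared(r):
    """Fraction of variance explained by a Pearson correlation r."""
    return r * r


def t_from_r(r, df):
    """t statistic for testing r against zero with df = n - 2."""
    return r * math.sqrt(df) / math.sqrt(1.0 - r * r)


def r_from_t(t, df):
    """Invert the relation above to recover r from a reported t and df."""
    return t / math.sqrt(t * t + df)


if __name__ == "__main__":
    # Rounded values from the text: (label, r, df, reported t)
    for label, r, df, t in (("women", 0.20, 321, 3.57), ("men", 0.41, 1082, 14.86)):
        print(label,
              f"R^2 = {100 * r_squared(r):.0f}%",
              f"t from rounded r = {t_from_r(r, df):.2f}",
              f"r from reported t = {r_from_t(t, df):.3f}")
```

Recovering r from the reported t values gives numbers that round to 0.20 and 0.41, so the quoted statistics are mutually consistent to the precision reported.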
The differences in CPP/NonQnt were compared for the students with the lowest pretest scores. Combining students with pretest scores of 0 to 8, male and female students had significantly different CPP/NonQnt [t(400) = 2.4, p = 0.018]; however, this would not be significant if the p threshold was corrected for the number of statistical tests performed with a Bonferroni correction.
While the plots in Figs. 3 and 6 are similar, their interpretation is quite different. Figure 6 and the correlation analysis suggest that the CSEM pretest scores should be interpreted differently for male and female students, with the same pretest score indicating higher CPP/NonQnt for male students. Figure 7 presents a plot of the CSEM post-test percentage for men and women for each CPP/NonQnt quartile; the quartiles were calculated by aggregating male and female scores. Male and female students' post-test scores were indistinguishable in each quartile. As such, the growing gender gap observed for all sets of conceptual problems is identified as a result of the differences in the degree to which the CSEM pretest accurately measures CPP/NonQnt for men and women.
If the overall distribution of CPP/NonQnt aggregating male and female students is divided into quartiles, 15% of female students and 28% of male students fall in the highest quartile as shown in Table III. A t test comparing women in the 1st quartile and women in the 2nd and 3rd quartile did not demonstrate a significant difference; therefore, lower and moderately prepared female students are statistically indistinguishable by pretest scores. These students represent 85% of all female participants.

C. Distribution analysis
The observation that pretest scores were more correlated with CPP/NonQnt for male students than for female students (that is, that pretest scores measure CPP/NonQnt differently for men and women) warrants further investigation. Figure 1 shows the density distribution of CSEM pretest scores for both male and female students. The pretest scores were very low, and as such, it should be expected that some of the students, who have little knowledge of the material, were guessing. To attempt to understand the differing correlations for men and women, a sequence of models combining binomial distributions representing guessing behavior and normal distributions representing prior knowledge were fit to the distribution of male and female students' pretest scores as shown in Figs. 8(a) and 8(b), respectively.
The dashed lines in Fig. 8 show the result of fitting only a binomial distribution, $B(x; p = 0.2)$, representing pure guessing with probability of success $p = 0.2$ and pretest score $x$. The pure guessing model was a relatively good fit for female pretest scores. While the fit was not perfect for men, the mean and standard deviation were not that …

Fig. 8 also plots the result of fitting the model shown in Eq. (1), which mixes a binomial distribution with a normal distribution:

$$P(x) = p_b\,B(x; p = 0.2) + p_n\,N(x; \mu_n, \sigma_n), \tag{1}$$

where $p_b$ is the fraction of students who are guessing, $p_n$ is the fraction of students demonstrating some prior knowledge, and $N(x; \mu_n, \sigma_n)$ is a normal distribution with mean $\mu_n$ and standard deviation $\sigma_n$. Fitting Eq. (1) with $p_b + p_n = 1$ yielded $p_b = 0.40$, $p_n = 0.60$, $\mu_n = 8.83$, and $\sigma_n = 2.99$ for the male students. For the female students the fit resulted in $p_b = 0.23$, $p_n = 0.77$, $\mu_n = 6.67$, and $\sigma_n = 2.36$. The curve representing Eq. (1) substantially improves the fit to the male distribution of pretest scores, Fig. 8(a); however, this model did little to improve model fit over the binomial distribution for female students. The mean extracted for the normal distribution for women, 6.67, was very close to the mean of the binomial guessing distribution, 6.40.

The difference between the binomial fit and the binomial-plus-normal fit for male students suggests that the CSEM can discriminate between male students who exhibit some prior knowledge and those who are guessing. However, for female students the CSEM pretest could not discriminate between those with some prior knowledge and those who were guessing. This analysis explains the qualitative differences in the male and female plots in Fig. 6 and the differences in the correlation of CPP/NonQnt and pretest score. The somewhat lower preparation of women shifts their distribution of pretest scores slightly so that it was less distinguishable from guessing than the male pretest score distribution.
As such, the pretest scores of female students provide less information about the incoming knowledge state of the student because of the similarity of pretest results of students with moderate prior knowledge to those with no prior knowledge. This result is almost certainly dependent on the student population; a student body with higher average levels of prior preparation might produce different results.
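To make the binomial-plus-normal mixture of Eq. (1) concrete, the sketch below evaluates it with the fitted parameters quoted in the text. It is an illustration rather than the authors' fitting procedure: it does not perform a fit, it assumes the normal density is simply evaluated at integer scores, and all function and variable names are hypothetical.

```python
import math

N_ITEMS = 32    # CSEM problems
P_GUESS = 0.2   # chance of a correct guess on a 5-response problem


def binomial_pmf(x, n=N_ITEMS, p=P_GUESS):
    """Probability of exactly x correct answers under pure guessing."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def mixture(x, p_b, mu_n, sigma_n):
    """Eq. (1): guessing fraction p_b plus prior-knowledge fraction 1 - p_b."""
    p_n = 1.0 - p_b   # constraint p_b + p_n = 1 from the text
    return p_b * binomial_pmf(x) + p_n * normal_pdf(x, mu_n, sigma_n)


# Fitted parameters as quoted in the text.
MALE = dict(p_b=0.40, mu_n=8.83, sigma_n=2.99)
FEMALE = dict(p_b=0.23, mu_n=6.67, sigma_n=2.36)

# Mean and standard deviation of the pure guessing distribution.
guess_mean = N_ITEMS * P_GUESS
guess_sd = math.sqrt(N_ITEMS * P_GUESS * (1 - P_GUESS))

if __name__ == "__main__":
    print(f"guessing: mean = {guess_mean:.2f}, sd = {guess_sd:.2f}")
    for x in range(0, 17, 2):
        print(x, f"{mixture(x, **MALE):.3f}", f"{mixture(x, **FEMALE):.3f}")
```

The guessing distribution has a mean of 6.4 and a standard deviation of about 2.26, close to the normal component fitted for women (6.67 and 2.36), which is consistent with the statement that the two components are hard to distinguish for female students.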

V. DISCUSSION
This study sought to answer four research questions; these will be addressed in the order proposed.
RQ1: Does student performance on the CSEM show evidence of a gender gap in the course studied? A gender gap of 5% was found in the CSEM pretest and 6% on the post-test. Both these gaps represented small to medium effect sizes. These gaps were consistent with the gaps observed in a large study (N = 2000) [18] of the BEMA, but inconsistent with the negative gender gap observed by Pollock (N = 168) [7]. The growth of the gender gap from pretest to post-test was consistent with Kohl and Kuo, but of a smaller magnitude [17]. The failure of this study to reproduce the negative gender gap in Pollock could be the result of the less well academically prepared population in this study or differences in instruction.
RQ2: How does the difference in male and female performance on the CSEM compare with those observed in other problems assigned in the course? Are differences consistent between qualitative and quantitative problems? Are differences consistent between low and high stakes testing conditions? Table II shows the gender differences found in CSEM pretest and post-test scores were also present in the other qualitative problems presented in the class; however, the differences were smaller for the other problems (3% for both lab quiz and qualitative test problems). Both these differences represented a small effect size. Male students outperformed female students on qualitative problems in both the low stakes lab quiz environment and the higher stakes in-semester test environment at about equal rates, suggesting that neither the testing rules (low or high stakes) nor the stress of the testing situation were the cause of the gender gap. There was no significant gender gap in the students' quantitative test performance, which provides evidence that the gender gaps observed in the qualitative performance were not the result of general differences in physics ability between male and female students. The CSEM was given in the lab quiz environment, and as such, the larger CSEM post-test gap cannot be attributed to the testing environment.
RQ3: Are these differences dependent on the student's CSEM pretest score? Figure 3 shows that the gender gap was very small at the lowest levels of pretest score. No statistically significant difference in CSEM post-test, qualitative lab quiz average, or qualitative test average was found for students with CSEM pretest scores of 25% or less. The gender gap grew with pretest score for all qualitative problem collections. The growth of the gender gap was most pronounced in the CSEM post-test. This result was completely different from that observed by Kost-Smith, Pollock, and Finkelstein [18], where the gender gap disappeared if students were binned by FMCE post-test scores. It was also inconsistent with the CSEM normalized gain results of Kohl and Kuo who found a fairly consistent gender gap, except in the lowest pretest bin [17]. The growth of achievement gaps with increasing student ability has been well documented [33]; however, the failure to observe any gap in quantitative test scores suggests the growing gender gap observed for qualitative problems had an origin other than in cognitive differences. The students in Kost-Smith, Pollock, and Finkelstein should be substantially more academically prepared than those in this study; in fact, Kost-Smith, Pollock, and Finkelstein [18] report a very small pretest gap. Their failure to observe the growth of the gender gap with pretest score could possibly be explained by a somewhat better prepared female student population which pushed the pretest scores into a range where they were equally predictive of CPP/NonQnt for men and women. The distribution analysis indicates that a small shift in pretest score (Fig. 8) …

Average male CPP/NonQnt was 0.44 standard deviations higher than female CPP/NonQnt. If the distribution of CPP/NonQnt was divided into quartiles, 13% more male students were in the highest quartile and 17% more female students were in the lowest quartile. This overrepresentation of women in the lowest CPP/NonQnt quartiles is consistent with other research binning students by pretest scores [18,19].
The CSEM pretest score was more weakly correlated with CPP/NonQnt for female students, r = 0.20, than for male students, r = 0.41. Analysis of the pretest probability distribution suggested that this resulted from the somewhat lower level of female prior knowledge shifting the pretest distribution of moderately prepared women closer to the pure guessing distribution. If CPP/NonQnt rather than pretest score is employed to bin students, no post-test gender gap exists (Fig. 7).
The growing gender gap with pretest score for all qualitative problem collections is well explained by the differential predictive power of CSEM pretest scores for men and women. This also explains the variability in the pretest binning results as the CSEM is applied to academic populations with different levels of preparation. The different correlation of the CSEM pretest scores with CPP/NonQnt for men and women, however, cannot explain the gender differences in the averages of the CSEM pretest, post-test, qualitative lab quizzes, and qualitative tests.
In Sec. I, many potential causes for the gender gap observed in the average scores on conceptual instruments in physics were reviewed. This study was not experimental and cannot conclusively eliminate many of these causes, but a pattern of averages of the different problem collections makes many of these explanations difficult to support. Psychological explanations involving differing responses to testing by gender through math anxiety [41,42], science anxiety [45], or stereotype threat [58] cannot explain why these reactions would occur for qualitative test problems but not quantitative problems on the same test. The failure to find evidence for stereotype threat for this student population further explains the inability to reliably reproduce the effects of interventions to eliminate stereotype threat [80-82] and the failure to detect a relationship between the fraction of women in a class and gender gaps [1]. It seems likely that if efforts to reduce stereotype threat were implemented in the class studied, the gender gap would not be affected.
The observed differences are also difficult to explain by the intrinsic gender fairness of the CSEM instrument. The gender fairness of some FCI items has been questioned [64,65], but no research exists for the CSEM. At pretest scores of 25% or less, no significant gender gap was found. Students who scored less than 25% on the pretest performed more weakly on other class assessments, but the effect was fairly small. It is possible that an intrinsic CSEM gender bias that impacts only the highest performing pretest students exists. This possibility is made less likely by the observation of approximately similar gender gaps in qualitative lab quiz and test scores which did not use CSEM problems.
It is also difficult to resolve the results of this study with an explanation involving cognitive differences between men and women in the ability to solve qualitative physics problems. Cognitive differences vary strongly with the kind of cognitive task [20]. It is possible that men are intrinsically, either through biology or socialization, superior at the combination of verbal, logical, and graphical skills required to solve qualitative physics problems. This explanation seems unlikely; quantitative physics problems like those given in the class studied also require verbal, logical, and graphical reasoning skills, but no gender gap was observed in quantitative problem solving. The quantitative test problems represented a spectrum from problems solvable by substituting numbers into the correct formula to challenging applications of Gauss' law where abstract symbolic and graphical reasoning were required. Further, while male superiority in spatial reasoning [34,35] could impact some qualitative items, one would expect that female superiority at verbal reasoning [36] would be the most important cognitive aspect which differed between qualitative and quantitative problems. As such, one would expect female students to have a cognitive advantage over male students on conceptual problems. No evidence of cognitive abilities differentiated by gender and unique to conceptual physics problems currently exists; however, research into this aspect of cognition is sparse.
There is at least one explanation for which the observed pattern of averages would be expected. The CSEM pretest is a test of prior knowledge of electricity and magnetism; the problems cannot be answered intuitively without knowing the physical laws. Naturally, a student's academic ability also plays a role, but even a very highly performing student would do poorly on the CSEM if they had no knowledge of the physics tested. The gender gap could be explained by the differences in the physics course-taking patterns of male and female high school students [10] and differences in informal learning experiences. Both the large CSEM post-test gap and the weaker relation between CPP/NonQnt and qualitative quiz and test averages than with CSEM post-test score could be explained by women overcoming the differences in background while in the class, but men having an advantage on a standardized instrument where coverage was not fully aligned with the class. The large CSEM post-test factor loading in the SEM model could also result if the opportunity to relearn the material instead of learning it for the first time was important in post-test results [30]. Further research should be able to test this conjecture. This interpretation is not fully supported by the work of Kost-Smith, Pollock, and Finkelstein [18] who did not find the years of high school physics taken as a productive variable in predicting post-test scores; however, their analysis used pretest score as an independent variable and, as the authors suggest, high school physics may already have been accounted for in this variable.
Either formal or informal prior physics learning experiences could affect physics performance in many ways. These experiences may produce higher pretest scores, but they may also allow students to master conceptual material more easily by relearning instead of learning for the first time [30]. They may produce higher post-test scores on standardized instruments by filling in holes in coverage. They may also produce more complex interactions such as allowing students retaining misconceptions to confront them again from a different perspective.
This study contributed additional support to previous work showing that mastering quantitative and qualitative problem solving requires different learning processes. Students in this sample performed differently on quantitative and qualitative problems given in the same testing environment. The prevalence of poor conceptual performance in noninteractive classes [83] as well as specific experiments investigating the effect of quantitative problem solving on conceptual learning suggest conceptual and quantitative learning are somewhat different processes [84].

VI. IMPLICATIONS FOR INSTRUCTION
The observation that CSEM pretest scores predict CPP/NonQnt and outcomes on qualitative assignments differently for male and female students suggests that pretest scores should be used with caution for instructional decisions such as establishing lab groups or assigning remedial material. The observation that pretest scores are more highly correlated with CPP/NonQnt for men than for women also suggests that the CSEM pretest may be less valid for women than for men [85,86]; that is, a pretest score provides less information about female students than male students. This conclusion is supported by the analysis of the pretest distributions in Sec. IV C.
The persistence of gender gaps for all qualitative problem collections within the course presents a substantial challenge for instruction. Higher levels of CPP/NonQnt benefit students at all points in the course; however, the differences in CPP/NonQnt observed in men and women imply this benefit is not equally distributed for students of different genders. Whether differences in CPP/NonQnt arise from documented differences in high school course taking or less well understood differences in informal education or cognitive processing, women on average have a disadvantage in a physics class when presented with a qualitative problem. CPP/NonQnt loads as strongly on qualitative test average as it loads on pretest score; therefore, differences in CPP/NonQnt have lasting negative effects for women even postinstruction. It is possible that some optional or adaptive remedial strategies could allow women to close the conceptual gap with men. For example, additional qualitative homework problems could be recommended as exam study aids to the entire class. More practice in this area would benefit most students, but could disproportionately help those with lower CPP/NonQnt, which would include many women in this sample but also students who had less high school preparation or less access to informal learning experiences.
The reality is that students in introductory physics courses have extremely variable levels of preparation. The differences identified in CPP/NonQnt between men and women present additional instructional challenges because of a potential interaction between self-efficacy [87] and CPP/NonQnt where male students seem to learn the material more easily because of prior preparation in physics. This could cause women, already with lower self-efficacy toward science [88], to fail to develop self-beliefs consistent with their accomplishments and ability; these women may choose to leave science or engineering careers. This effect has been found in computer science, a field with comparably poor performance in attracting and retaining women [8]. Self-efficacy has been demonstrated to be important in retention [89] and is one of the strongest psychological correlates with academic performance [90]; therefore, it is important as instructional strategies mix students with differing prior knowledge that appropriate support is provided for students who come to the class with less prior knowledge.

VII. LIMITATIONS AND FUTURE DIRECTIONS
This study was performed at a single institution and, therefore, its results may be specific to the student population or instructional strategy at that institution. The analysis was correlational rather than experimental; additional work is required to understand the relation of CPP/NonQnt to high school preparation, informal learning experiences, and college class taking. Furthermore, additional research is needed to explore whether differences exist in conceptual physics ability differentiated from general physics ability. This study provided evidence that the CSEM as an instrument is not gender biased, but additional item level analysis is needed to determine if the 2% difference in the post-test at lowest levels of preparation results from specific items in the CSEM. While this gap was not significant, the shape of the post-test curves in Fig. 3 and the 2% difference between the post-test and qualitative lab quiz and test averages suggest it warrants further investigation.
The observation that differences in conceptual performance are not related to differences in performance on quantitative problems requires further research. It is unclear if the results of this study would be altered if the pretest and post-test were quantitative and qualitative test performance was used as the control.
The lead instructor of the course was male for all semesters studied. Some research suggests a significant, but weak relationship between the instructor's race or gender and the persistence of students in STEM for students of the same race or gender [91]. Instructor gender effects were also observed in one of the course sections in Kost-Smith, Pollock, and Finkelstein's study in which female students outscored male students on participation and homework, but male students scored higher on exams for most semesters studied [18]. In the only lecture section taught by a female instructor, gender differences in exam scores were insignificant. Additional research is needed to determine if the results of the current study would be modified if the lead instructor were female.

VIII. CONCLUSIONS
In this study, gender differences in the CSEM were examined and a 5% gender gap on the pretest was found; the gender gap was 6% on the post-test. This gender gap was also analyzed in other assignments throughout the course: qualitative lab quiz problems, qualitative test problems, and quantitative test problems. The gender gap that was present in the CSEM was also present for the other qualitative problem collections studied. Male students outperformed female students by 3% on both qualitative lab quiz problems and qualitative test problems suggesting that testing environment was not an important source of the gender gap. Male and female students performed equally on quantitative test problems and, therefore, the gender gaps were not a result of general differences in physics ability. The equal performance of men and women on the quantitative test questions also suggests the differences observed in the qualitative questions do not result from psychological factors such as math or science anxiety or stereotype threat. The gender gap for all qualitative problem collections was insignificant for students with a pretest score of 25% or less. The failure to identify a gender gap in either the CSEM pretest or post-test for the least prepared students suggests that there is not an intrinsic gender bias in the CSEM instrument. The gender gap grew with CSEM pretest scores. Structural equation modeling showed that a latent variable called Conceptual Physics Performance/Non-Quantitative, which captured the part of qualitative physics performance not explained by quantitative test average, was productive in explaining the variance in the four qualitative problem sets studied: CSEM pretest, CSEM post-test, lab quiz, and in-semester examination. Male pretest scores were more highly correlated with CPP/NonQnt than female pretest scores and, as such, the pretest is more predictive of CPP/NonQnt for men than for women.