Partitioning the gender gap in physics conceptual inventories: Force Concept Inventory, Force and Motion Conceptual Evaluation, and Conceptual Survey of Electricity and Magnetism

Very little of the gender gap on multiple choice conceptual inventories can be explained by differences in academic performance between men and women.


I. INTRODUCTION
The "gender gap," the difference between the scores of men and women on commonly used physics conceptual inventories, such as the Force Concept Inventory (FCI) [1], the Force and Motion Conceptual Evaluation (FMCE) [2], and the Conceptual Survey of Electricity and Magnetism (CSEM) [3], has been thoroughly investigated. On average, men outperform women by 12% on the mechanics conceptual inventories and by 8.5% on electricity and magnetism conceptual inventories [4].
Many factors have been explored to explain the differences observed in the performance of men and women on conceptual physics evaluations. These factors may be broadly classified as factors related to general academic achievement, prior physics or mathematics preparation, and factors not related to achievement or preparation. Factors related to academic achievement include academic performance measured by course grades and tests of specific cognitive reasoning skills. A substantial body of research has demonstrated differences in academic course grades [5,6] with a consistent advantage to women. Extensive research has examined the differences between men and women on specific cognitive tasks [7][8][9][10] with women generally scoring higher on verbal reasoning tasks and men generally scoring higher on spatial reasoning tasks. These differences can be very fine-grained, with differences measured on related tasks in the same discipline [6]. Within physics, multiple academic, cognitive, and preparation measures have been used to explain gender differences, including the Lawson test of scientific reasoning and the years of high school calculus as well as conceptual physics pretest score (see Madsen, McKagan, and Sayre, Table I for a summary [4]).
Factors not related to academic achievement and preparation have also been extensively examined; these include psychosocial factors and instructional factors. Psychosocial factors that have been shown to be related to gender differences in academic performance include science anxiety [11][12][13], mathematics anxiety [14,15], and stereotype threat [16]. Psychosocial factors have also been investigated as an explanation of performance differences in physics classes [17,18]. Classroom instructional mode and environment have also been explored as possible explanations of gender differences. The results of these studies have been inconsistent with some studies showing active-learning instruction produces decreased gender differences [19][20][21] while other studies show no effect of reformed instruction on gender differences [22][23][24].
For a more detailed discussion about the many sources that may influence the overall gender gaps on physics conceptual inventories, see Henderson et al. [25].
Performance differences between men and women on the individual items on physics conceptual inventories have been less thoroughly investigated. Much of the research in this area has focused on the FCI; Classical Test Theory [26][27][28] and Item Response Theory [27,29,30,31] have been used to examine the validity of the instrument. In addition, Differential Item Functioning (DIF) analysis has demonstrated that some of the items in the FCI are unfair to either men or women [30,32,33]. As previously described in Traxler et al. [33], "An item is defined as being 'fair' if men and women of equal ability have the same chance of answering the item correctly." While less research has been performed on the FMCE and the CSEM, individual items on the FMCE [34][35][36][37] and the CSEM [38,39] have also been examined; however, most of this work was not differentiated by gender. Only one study has reported item-level gender fairness for the FMCE or the CSEM [40]. For a more complete discussion of item-level research on the FCI, FMCE, and CSEM, see Traxler et al. [33] and Henderson et al. [40].
In general, many factors have been shown to influence the overall gender gap, but physics education researchers have yet to come to an agreement as to the origin of these gender differences. This work presents an analysis which evaluates the relative importance of academic performance, instrumental fairness, and prior preparation on gender differences in the FCI, the FMCE, and the CSEM. It uses samples described in three previous studies [25,33,40] and attempts to shed additional light on the gender differences identified in these studies by modifying the conceptual instruments as proposed in those studies and by using relations between the student populations and instructional environments in the individual samples.

II. RESEARCH QUESTIONS
This study sought to answer the following research questions: RQ1: How much of the gender gap in physics conceptual post-test scores can be attributed to differences in general academic performance measured by ACT/SAT scores or physics test averages? RQ2: How much of the gender gap can be attributed to instrumental fairness? RQ3: How much of the gender gap can be attributed to differences in prior conceptual preparation in physics measured by pretest scores? By answering these questions, this study forms a partition of the gender gap that may allow more targeted development of instructional interventions that will allow all students to succeed equally in physics classes. We end with some suggestions for instructors and researchers and a reminder that concept inventory gaps are only one element of the gender dynamics of a classroom.
We acknowledge that the model of a binary classification simplifies the complexity of gender identity [41]; however, this model is used throughout much of the physics education research (PER) literature that examines the gender differences found in the physics conceptual inventories. In addition, we were limited to the gender descriptions collected at the institutions studied. Future studies should explore the following results for other marginalized groups.

III. BACKGROUND
This section summarizes the results of three previous studies that examined the samples presented in this paper.

A. Study 1
In study 1, Henderson et al. examined the gender gap on the CSEM and found that men outperformed women by 5% on the pretest and 6% on the post-test [25]. This study also examined other qualitative and quantitative multiple-choice questions assigned in the course. A gender gap of 3% was also measured for qualitative lab quiz questions and qualitative test questions; however, men and women performed equally on the quantitative test questions. This result suggested that the gender gap in this sample could not be explained by psychological mechanisms such as science anxiety or stereotype threat. Why would a student experience stereotype threat on the qualitative questions on a test but not the quantitative questions on the same test?
Through a structural equation modeling analysis, a latent variable called conceptual physics performance/nonquantitative (CPP/NonQnt) was extracted. CPP/NonQnt represented the amount of conceptual performance that could not be explained by quantitative performance. The correlation between CPP/NonQnt and CSEM pretest score was larger for men (r = 0.41) than for women (r = 0.20), suggesting that the CSEM pretest was less predictive of CPP/NonQnt for women than for men. Study 1 presented a partial explanation of this effect by exploring the distribution of pretest scores. Women have 5% lower pretest scores on average than men; this produced a small shift in the distribution of pretest scores, moving women slightly closer to the binomial distribution of pure guessing scores. As such, it was much more difficult to distinguish moderately prepared women from unprepared women than it was to distinguish moderately prepared men from unprepared men.
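The pure-guessing distribution mentioned above can be made concrete with a minimal sketch. It assumes the CSEM has 32 items with five answer choices each; neither number is stated in this paper, so both are labeled as assumptions in the code, as are all function names.

```python
from math import comb, sqrt


def guessing_pmf(n_items, n_choices):
    """Binomial probabilities of k correct answers when every item is guessed."""
    p = 1.0 / n_choices
    return [comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
            for k in range(n_items + 1)]


def guessing_mean_sd_percent(n_items, n_choices):
    """Mean and standard deviation of the guessing score, as percentages."""
    p = 1.0 / n_choices
    mean = 100.0 * p
    sd = 100.0 * sqrt(n_items * p * (1 - p)) / n_items
    return mean, sd


# ASSUMPTION: 32 items, 5 choices per item (not given in the text).
N_ITEMS, N_CHOICES = 32, 5

pmf = guessing_pmf(N_ITEMS, N_CHOICES)
mean_pct, sd_pct = guessing_mean_sd_percent(N_ITEMS, N_CHOICES)

# Probability that a pure guesser answers at least 10 of 32 items correctly.
p_tail = sum(pmf[10:])
print(f"guessing mean = {mean_pct:.1f}%, sd = {sd_pct:.1f}%, P(k >= 10) = {p_tail:.3f}")
```

A group whose pretest mean sits only slightly above the guessing mean overlaps heavily with this distribution, which is the sense in which moderately prepared and unprepared students become hard to tell apart.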
The sample that was investigated in study 1 will be labeled "CSEM-1" in the current study. The number represents the institution from which the sample was collected.

B. Study 2
In study 2, Traxler et al. explored the validity and intrinsic bias of the FCI using Classical Test Theory and Item Response Theory [33]. The analysis identified many of the items on the FCI as problematic due to item difficulty and discrimination values outside the accepted range for well-functioning items. Study 2 also investigated item fairness employing both a graphical analysis and using DIF analysis. In the graphical analysis, five items stood out as significantly unfair to women: items 14, 21, 22, 23, and 27. DIF analysis showed that eight items were substantially unfair controlling for the student's overall post-test score, two of which were unfair to men.
To construct a fair, valid FCI, Study 2 iteratively removed unfair items until no items in the instrument showed bias. This process produced a 19-item instrument which, in turn, reduced the original gender gap by 50%. Study 2 analyzed three samples from three different institutions. The largest sample from study 2 was also analyzed in the current study and is labeled "FCI-1."

C. Study 3
In study 3, Henderson et al. [40] replicated the fairness analysis of study 2 for the FMCE and the CSEM using two large FMCE samples and two large CSEM samples. Overall, there were fewer items in the FMCE and the CSEM that demonstrated substantial unfairness to either men or women. For the first FMCE sample in study 3, one item was substantially unfair to women, item 27_29. Study 3 used the modified scoring suggested by Thornton et al. [42] where some items were eliminated and some groups (clusters) were scored as a block. The notation 27_29 represents items 27, 28, and 29. In the second FMCE sample in study 3, two items were substantially unfair to women, item 27_29 and item 40.
For the CSEM, only one item, item 20, was substantially unfair, and only for one of the two samples analyzed. This item was unfair to men. Study 3 utilized four different samples, all of which were further investigated in the current study. Sample 1 and sample 3A from study 3 are labeled "FMCE-2" and "FMCE-3," respectively, in the current study. Sample 2 and sample 3B are labeled "CSEM-1" and "CSEM-3," respectively.
In the current work, modified conceptual inventories were constructed which eliminated invalid or unfair items for all conceptual inventory pretests and post-tests for the samples used in these three studies. Hierarchical linear regression (HLR) was used to analyze the gender gaps controlling for academic performance, measured by test average or ACT/SAT math percentile, and prior physics preparation, measured by pretest scores. This allowed a "partitioning" of the gender gap to determine which factors were most important to the observed gender differences and whether the relative importance of the factors was consistent across instruments and institutions.

IV. METHODS
This work reports results for the FCI, the FMCE, and the CSEM; each of the analyzed samples was described previously in more detail in studies 1 to 3. Readers seeking more information about institution characteristics, sample characteristics, or instructional environment should consult these works.

A. Samples
This study utilized five samples collected at three different institutions. The institutions are denoted as University 1, University 2, and University 3. The samples are denoted as FCI-1, FMCE-2, FMCE-3, CSEM-1, and CSEM-3 where the number represents the institution at which the sample was collected.
University 1: University 1 is a large southern land-grant university serving approximately 25 000 students. University level demographics for the undergraduate student population of University 1 were 76% White students, 4% African-American students, 9% Hispanic students, 4% students reporting two or more races, and other groups each with 3% or less [43].
University 2: University 2 is a large western land-grant university serving approximately 34 000 students. The demographic composition of the undergraduate population of University 2 consisted of 68% White students, 12% Hispanic students, 7% international students, 6% Asian students, 5% students reporting two or more races, and other groups each with 2% or less [43].
University 3: University 3 is a large eastern land-grant university serving approximately 30 000 students. The undergraduate demographic composition of University 3 consisted of 79% White students, 7% international students, 4% African-American students, 4% Hispanic students, 4% students reporting two or more races, and other groups each with 1% or less [43].
Sample FCI-1: Sample FCI-1 was collected at University 1. Data were collected in the introductory, calculus-based mechanics course, where the FCI was given as a pretest and post-test. Sample FCI-1 contains 3663 matched pretest and post-test pairs (77% men, 23% women). Sample FCI-1 is a subset of the sample investigated in study 2 where it was referenced as sample 1. The sample is smaller than that of the previous study because test average data were not available for all students. Students enrolled in this course participated in two 50-min lectures and two 2-h laboratory sessions each week. Throughout the period studied, the design of this course was stable; the course was overseen by the same instructor with attendance managed with a quiz. The laboratory sessions included multiple research-based techniques including small-group problem solving, hands-on inquiry-based explorations, and TA-led demonstrations.
Sample FMCE-2: Sample FMCE-2 was collected at University 2. FMCE pretest and post-test data were collected in the introductory, calculus-based mechanics course. Sample FMCE-2 contains 2551 matched pretest and post-test pairs (72% men, 28% women). Sample FMCE-2 is a subset of the sample analyzed previously in study 3 where it was referenced as Sample 1. The sample contains fewer records than that of the previous study because ACT/SAT scores were not available for all students. The course was presented with three 50-min lectures and one 50-min tutorial section each week. Four university faculty members taught the lecture sections using peer instruction with clickers. Within the tutorial sections, students worked the University of Washington Tutorials in Introductory Physics [44]. There was no laboratory associated with this course.
Sample FMCE-3: Sample FMCE-3 was collected at University 3. FMCE pretest and post-test data were collected in the introductory, calculus-based mechanics course. Sample FMCE-3 contains 3719 matched pretest and post-test pairs (79% men, 21% women). Sample FMCE-3 is identical to sample 3A in study 3. The instructional environment for sample FMCE-3 varied over the period studied. During all semesters studied, a learning assistant (LA) program [45] was implemented in the laboratory where research-based materials were presented in the laboratory. During the first half of the study, the course studied presented four 50-min lectures and one 2-h laboratory session each week with LAs provided to all labs. Many lecture instructors taught the class during this period. In the second half of the study, the class was revised to three 50-min lectures and one 3-h laboratory session each week with LAs provided to a subset of the laboratory sessions because of funding issues. The new structure was led by two co-instructors that implemented the same policies and employed peer instruction using clickers.
Sample CSEM-1: Sample CSEM-1 was collected at University 1. CSEM pretest and post-test data were collected in the introductory, calculus-based electricity and magnetism course. Sample CSEM-1 contains 1767 matched pretest/post-test pairs (77% men, 23% women). Sample CSEM-1 is a subset of the samples investigated in study 1 and study 3; in study 3 it was referenced as sample 2. The sample is smaller than that used in the previous studies because test average data were not available for all students. The instructional environment for sample CSEM-1 was similar to that of sample FCI-1. The course was led by one instructor and the instructional environment remained stable over the time period studied.
Sample CSEM-3: Sample CSEM-3 was collected at University 3. CSEM pretest and post-test data were collected in the introductory, calculus-based electricity and magnetism course. Sample CSEM-3 contains 2439 matched pretest and post-test pairs (81% men, 19% women). Sample CSEM-3 is identical to sample 3B in study 3. The instructional environment for sample CSEM-3 was similar to that of sample FMCE-3.
Many students matriculated from the mechanics course to the electricity and magnetism course at all the institutions studied. As such, the student populations of samples FCI-1 and CSEM-1 were similar, as were the student populations of samples FMCE-3 and CSEM-3.

B. Corrected conceptual inventories
For this analysis, the conceptual inventory pretest and post-test scores for each of the samples were modified. The modifications removed problematic items from the pretest and both problematic items and unfair items from the post-test as identified in studies 2 and 3. The scores after these modifications are called "corrected" scores and the instruments, corrected instruments. To construct valid pretest scores for each instrument, items that were identified as problematic on the respective pretests for either men or women were eliminated. These items had difficulty or discrimination outside of the range suggested by Classical Test Theory. To correct the post-test scores, small to moderate and large DIF items identified by DIF analysis were removed, thus removing item-level unfairness from the instrument. Problematic post-test items as identified by Classical Test Theory were also removed. Study 2 did not find the same pattern of substantially unfair items in the FCI-1 pretest that was found in the post-test; as such, pretest scores were not corrected for fairness. Table I summarizes the included items on each of the valid pretests and the fair or valid post-tests.
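The scoring change described above amounts to recomputing each percentage over the retained items only. The following is a minimal sketch under that reading; the six-item instrument, the flagged items, and the function name are hypothetical and are not taken from Table I.

```python
def percent_score(responses, retained_items):
    """Percentage of the retained items that were answered correctly.

    responses: dict mapping item number -> 1 (correct) or 0 (incorrect).
    retained_items: item numbers kept in the (corrected) instrument.
    """
    if not retained_items:
        raise ValueError("the instrument must retain at least one item")
    n_correct = sum(responses[item] for item in retained_items)
    return 100.0 * n_correct / len(retained_items)


# HYPOTHETICAL data: a six-item instrument in which items 2 and 5 were flagged.
student = {1: 1, 2: 0, 3: 1, 4: 1, 5: 0, 6: 0}
flagged = {2, 5}

all_items = sorted(student)
retained = [item for item in all_items if item not in flagged]

uncorrected = percent_score(student, all_items)  # 3 of 6 items correct
corrected = percent_score(student, retained)     # 3 of 4 retained items correct
print(uncorrected, corrected)
```

Because the denominator changes with the retained item set, corrected and uncorrected percentages are not directly comparable item for item; only the group differences on each scale are compared.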

C. Measures
Gender was coded dichotomously as the variable Gen with women coded as zero and men coded as one. General academic performance was represented by the variable APerf%. For FCI-1 and CSEM-1, APerf% was measured with the in-semester physics test average. The tests were approximately 70% quantitative and 30% qualitative and represented about 70% of the student's grade. For FMCE-2, FMCE-3, and CSEM-3, the ACT or SAT mathematics percentile score was used as the measure of academic performance. These percentile scores are represented by the variable ACTM% because the majority of the students took the ACT. When both scores were available, they were averaged. We acknowledge that physics test average and ACT/SAT mathematics score measure different facets of general academic achievement and that it would have been optimal if ACT/SAT mathematics scores had been available for all students. For a subset of Samples FMCE-3 and CSEM-3, both ACT/SAT scores and physics test averages were available, allowing a comparison of the use of the two variables to measure general academic performance. While not identical, the partitions of the gender gap produced for students where both variables were available were very similar, suggesting both variables measure academic performance similarly. This analysis is summarized in Sec. V C and presented in detail in the Supplemental Material [46]. Pretest and post-test scores were converted to percentages and are represented by the variables Pre% and Post%.
All statistical analysis was performed in the "R" statistical software system [47].

V. RESULTS
Descriptive statistics for all samples are presented in Table II. Mean percentage score and standard deviation are reported for both the original, uncorrected instrument, and for the valid or fair corrected instrument.

A. Binning analysis
Many previous works investigating gender differences in conceptual inventory scores have employed binning, dividing students into subgroups with small ranges of pretest scores and calculating subgroup (bin) averages [20,21]. In all samples, there were pronounced differences in the distribution of men and women in the pretest bins. The percentage of women in a pretest bin decreased as the average score of the bin increased. A table of the distribution of men and women in each bin for the uncorrected instruments is presented in Table III; a similar table for the corrected instruments is presented in the Supplemental Material [46]. Figures 1 and 2 plot the binned pretest scores against the post-test scores for all instruments; both the corrected and uncorrected plots are shown. Overall, except for some minor changes, the post-correction plots are very similar to the pre-correction plots. A linear regression line has been added for men and women to each plot; the regression was performed only including bins containing at least 30 students. Except for the FMCE-3 sample (which is problematic because of the very low number of retained items), the regression lines are striking in that they are nearly parallel. This suggests gender differences as a function of corrected pretest score could be investigated by simple linear models. The corrected CSEM-1 regression lines are more parallel than the uncorrected lines. The corrected FCI-1 regression line has a larger slope than the uncorrected line.
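The binning procedure can be sketched as follows. The bin width of 10 percentage points and the unweighted fit to the bin means are assumptions made for illustration; the text specifies only the 30-student minimum. The data are synthetic and the function names are hypothetical.

```python
from collections import defaultdict


def bin_means(pre, post, width=10.0, min_n=30):
    """Group students by pretest percentage; return (mean pre, mean post, n) per bin.

    Bins with fewer than min_n students are dropped, following the
    30-student threshold described in the text.
    """
    top_bin = int(100 // width) - 1
    bins = defaultdict(list)
    for x, y in zip(pre, post):
        index = min(int(x // width), top_bin)  # a score of 100% joins the top bin
        bins[index].append((x, y))
    rows = []
    for index in sorted(bins):
        members = bins[index]
        n = len(members)
        if n >= min_n:
            rows.append((sum(x for x, _ in members) / n,
                         sum(y for _, y in members) / n,
                         n))
    return rows


def fit_line(points):
    """Unweighted least-squares line through (x, y) points; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# SYNTHETIC data: three well-populated bins and one sparse bin that is dropped.
pre = [15.0] * 40 + [45.0] * 40 + [75.0] * 40 + [95.0] * 5
post = [35.0] * 40 + [65.0] * 40 + [95.0] * 40 + [0.0] * 5

rows = bin_means(pre, post)
slope, intercept = fit_line([(x, y) for x, y, _ in rows])
print(rows, slope, intercept)
```

Running the same two functions separately on the men's and women's records gives one line per group, whose slopes can then be compared for parallelism.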

B. Partitioning the gender gap
The overall gender gap, $\delta G$, the difference in mean post-test score of men and women observed in the uncorrected post-test scores, could be produced by many factors. Hierarchical linear regression (HLR) analysis was used to determine the relative importance of each factor. The results of these regressions allow the partitioning of the overall gender gap $\delta G$ in the uncorrected instrument into
• $\delta G_{\mathrm{pop}}$, the population gap: the gap resulting from differences in general academic performance between men and women measured by either ACT/SAT mathematics percentile score or test average;
• $\delta G_{\mathrm{fair}}$, the fairness gap: the amount of the gap explained by correcting the instrument for fairness;
• $\delta G_{\mathrm{prep}}$, the preparation gap: the part of the gap resulting from differences in physics conceptual preparation using the corrected pretest to measure preparation; and
• $\delta G_{\mathrm{equal}}$, the gap of men and women with equal academic performance and equal physics conceptual preparation on the valid or fair corrected instrument.
This combined model can be written as

$$\delta G = \delta G_{\mathrm{pop}} + \delta G_{\mathrm{fair}} + \delta G_{\mathrm{prep}} + \delta G_{\mathrm{equal}}. \tag{1}$$

The terms in Eq. (1) were calculated through two HLRs, one using the uncorrected post-test score as the dependent variable, the other using the corrected post-test score. The $\delta G$ parameters are related to the regression coefficient of a dichotomous variable (Gen) coded as 0 for women and 1 for men. The dependent variable in the uncorrected regressions is the uncorrected post-test percentage, Post%, and the uncorrected pretest percentage, Pre%, is used as an independent variable. The dependent variable in the corrected regressions is the corrected post-test percentage, $\mathrm{Post}^{C}\%$, with the corrected pretest percentage, $\mathrm{Pre}^{C}\%$, as an independent variable. The superscript "C" was used to indicate corrected pretest percentage, post-test percentage, and regression coefficients calculated with these quantities. Three regressions were carried out for each dependent variable; the uncorrected regression equations are given by

$$\mathrm{Post\%} = \beta_{11} + \beta_{12}\,\mathrm{Gen}, \tag{2a}$$

$$\mathrm{Post\%} = \beta_{21} + \beta_{22}\,\mathrm{Gen} + \beta_{23}\,\mathrm{APerf\%}, \tag{2b}$$

$$\mathrm{Post\%} = \beta_{31} + \beta_{32}\,\mathrm{Gen} + \beta_{33}\,\mathrm{APerf\%} + \beta_{34}\,\mathrm{Pre\%}. \tag{2c}$$

The variable APerf% measures general academic performance (ACT/SAT math score or physics test average), Pre% is the pretest percentage score, and Post% the post-test percentage score. The regression coefficients are $\beta_{ij}$, where $i$ represents the model and $j$ the term in the model. A similar set of regressions was carried out for the corrected pretest $\mathrm{Pre}^{C}\%$ and post-test $\mathrm{Post}^{C}\%$ with regression coefficients denoted by $\beta^{C}_{ij}$. The variables used to partition the gender gap are summarized in Table IV. In Table V, for example, $\delta G^{C}_{A,\mathrm{prep}}$ is the Gen regression coefficient in model FCI-1-3C, which uses Gen, APerf%, and the corrected pretest percentage $\mathrm{Pre}^{C}\%$ as independent variables.
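The three nested regressions can be sketched in a few lines. The analysis in this paper was carried out in R; the version below is a plain-Python illustration with a hand-written least-squares solver, not the authors' code. It returns only the Gen coefficient of each model and does not compute standard errors, standardized coefficients, or R². The data and function names are hypothetical.

```python
def ols(rows, y):
    """Least-squares coefficients for y = X b, found by solving the normal equations."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [u - factor * v for u, v in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def gen_coefficients(gen, aperf, pre, post):
    """Gen coefficients of the three nested models of Eqs. (2a)-(2c)."""
    model1 = ols([[1.0, g] for g in gen], post)
    model2 = ols([[1.0, g, a] for g, a in zip(gen, aperf)], post)
    model3 = ols([[1.0, g, a, p] for g, a, p in zip(gen, aperf, pre)], post)
    return model1[1], model2[1], model3[1]


# SYNTHETIC data: women coded 0, men coded 1, as in the text.
gen = [0, 0, 0, 0, 1, 1, 1, 1]
aperf = [50.0, 60.0, 70.0, 80.0, 50.0, 60.0, 70.0, 80.0]
pre = [20.0, 30.0, 20.0, 30.0, 40.0, 50.0, 40.0, 50.0]
post = [10.0 + 2.0 * g + 0.5 * a + 0.3 * p for g, a, p in zip(gen, aperf, pre)]

b12, b22, b32 = gen_coefficients(gen, aperf, pre, post)
print(b12, b22, b32)
```

In the synthetic data the men's pretest scores are higher by construction, so the Gen coefficient shrinks once Pre% enters the model; that shrinkage is the quantity the partition assigns to preparation. Running the same function on corrected scores gives the corresponding corrected coefficients.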
The coefficients were then used to decompose the overall gender gap $\delta G$. The part of the overall gap that can be attributed to differences in the academic performance of the men and women in the samples (the population difference) is $\delta G_{\mathrm{pop}} = \delta G - \delta G_{A}$. The amount of the gender gap attributable to the overall fairness and validity of the instrument is $\delta G_{\mathrm{fair}} = \delta G_{A} - \delta G^{C}_{A}$, comparing the corrected and uncorrected instrument for students of the same academic performance. The amount of the gender gap attributable to physics preparation differences of students with the same academic performance on the corrected instruments (the preparation gap) is $\delta G_{\mathrm{prep}} = \delta G^{C}_{A} - \delta G^{C}_{A,\mathrm{prep}}$. The remaining gap (the fair, equally prepared and performing gap), $\delta G^{C}_{A,\mathrm{prep}} = \delta G_{\mathrm{equal}}$, is the gender difference attributable to equally prepared students with the same academic performance on the corrected instrument.
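The decomposition is a telescoping sum, so the four parts always add back to the overall gap. A minimal sketch follows. The example inputs for $\delta G_{A}$ and $\delta G^{C}_{A}$ are back-calculated here from the partition reported in Sec. V D for the restricted CSEM-1 sample; they are not quoted from the text or from any table. The function name is hypothetical.

```python
def partition_gap(gap, gap_a, gap_a_corrected, gap_a_prep_corrected):
    """Split the overall gender gap into its four parts.

    gap:                  overall gap on the uncorrected instrument
    gap_a:                Gen coefficient controlling for APerf% (uncorrected)
    gap_a_corrected:      Gen coefficient controlling for APerf% (corrected)
    gap_a_prep_corrected: Gen coefficient controlling for APerf% and the
                          corrected pretest (corrected instrument)
    """
    return {
        "pop": gap - gap_a,
        "fair": gap_a - gap_a_corrected,
        "prep": gap_a_corrected - gap_a_prep_corrected,
        "equal": gap_a_prep_corrected,
    }


# 4.90 and 6.94 are reported in Sec. V D; 5.75 and 7.35 are BACK-CALCULATED
# from the reported partition and are assumptions of this sketch.
parts = partition_gap(4.90, 5.75, 7.35, 6.94)
for name, value in parts.items():
    print(f"{name:>5}: {value:+.2f}")
print(f"  sum: {sum(parts.values()):+.2f}")
```

A negative part, such as the population gap here, means that controlling for that factor widened the gap instead of narrowing it.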

Table VI presents the resulting partition of the gender gap. Both standardized values (in units of the standard deviation) using the standardized β coefficients and unstandardized values using the B coefficients are presented. The set of regressions used to calculate $\delta G$ for the FCI in sample FCI-1 is shown in Table V; regressions for the other samples are presented in the Supplemental Material [46]. The table presents the regression coefficient B and its standard error SE, the standardized regression coefficient β, the variance explained by the model $R^2$, and the additional variance explained by a nested model $\Delta R^2$. Figure 3 presents a visual representation of the partitioning of the gender gap shown in Table VI. To create this representation, first the sum of the absolute value of each $\delta G$ forming the partition was calculated to form the total absolute gender gap $|\delta G_T|$. The percentage of each partition was then calculated; for example, the percentage of the population gap was calculated as $100\% \times |\delta G_{\mathrm{pop}}| / |\delta G_T|$. This somewhat circuitous calculation was needed to account for the negative gender gaps.
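The percentage calculation can be written as a short function. The example values are the restricted CSEM-3 partition reported in Sec. V D, used here only for illustration; they are not the Table VI entries, so the resulting shares need not match Fig. 3. The function name is hypothetical.

```python
def absolute_shares(parts):
    """Share of each part in the total absolute gap, in percent."""
    total = sum(abs(value) for value in parts.values())
    if total == 0:
        raise ValueError("all parts are zero; shares are undefined")
    return {name: 100.0 * abs(value) / total for name, value in parts.items()}


# Restricted CSEM-3 partition from Sec. V D (illustration only).
parts = {"pop": 0.08, "fair": -0.44, "prep": 2.31, "equal": 4.06}

shares = absolute_shares(parts)
for name, share in shares.items():
    print(f"{name:>5}: {share:.1f}%")
```

Dividing by the sum of absolute values keeps every share between 0% and 100% even when a part is negative; dividing by the signed total would not.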

C. Comparison of academic performance measures
This study used both test average and ACT/SAT math percentile score as measures of academic performance. For a subset of the FMCE-3 and CSEM-3 samples, both variables were available allowing a comparison of the differences between these measures. The subsets contained 963 men and 271 women for FMCE-3 and 654 men and 171 women for CSEM-3. While both measures did not produce identical results, the resulting partition of the gender gap was very similar. As such, comparisons of the partition using different academic performance measures shown in Table VI should be valid. The detailed comparison is presented in the Supplemental Material [46].

D. $\delta G_{\mathrm{equal}}$
The gender difference for equally prepared students with equal general academic performance, $\delta G_{\mathrm{equal}}$, could depend on many factors; psychosocial factors and features of the instructional environment have been advanced to explain gender differences not related to academic performance or preparation. Psychosocial explanations of academic gender differences include stereotype threat, science anxiety, and math anxiety. Instructional factors include whether the courses used research-based practices. Both factors are reviewed in the introduction and more thoroughly in study 1. The causes of $\delta G_{\mathrm{equal}}$ almost certainly vary by student population and university environment; however, additional data and analysis provided in study 1 for the CSEM-1 sample make it difficult to support psychosocial factors as the cause of $\delta G_{\mathrm{equal}}$ for this sample. Study 1 also reported results for both quantitative and qualitative multiple-choice items that were not part of the CSEM including quizzes given in the laboratory (lab quizzes) and qualitative and quantitative multiple-choice test questions. While a 3% gender difference with an advantage toward men was found in qualitative lab quizzes and qualitative test questions, no gender differences were found in quantitative test questions. The CSEM was given and graded as a lab quiz and there was a 6% gender difference on the post-test in this sample. The course instructor reported that both the qualitative and quantitative test items required a mix of verbal, logical, and graphical reasoning for their solution.
The failure to observe a gender difference in the quantitative test items while observing gender differences in the qualitative test questions strongly suggests that psychosocial factors do not explain the gender differences. It is very hard to see how stereotype threat, for example, would function for qualitative items but not quantitative items on the same test. This suggests, for Sample CSEM-1, that $\delta G_{\mathrm{equal}}$ should be zero. If $\delta G_{\mathrm{equal}} = 0$, then we must revisit our assumption that academic performance, test fairness, and prior preparation have been correctly controlled. The DIF analysis used to produce the fair instruments is the standard method of ensuring fairness. It also seems very likely that the physics test average is an accurate measure of academic performance for this sample. As such, the assumption that the CSEM pretest score accurately measured prior preparation in physics must be reexamined. There is some support for challenging this assumption; study 1 showed that female pretest scores were much more weakly correlated with a latent variable measuring the student's qualitative performance not explained by his or her quantitative performance than male pretest scores.
Many theoretical objections can be raised for the assumption that CSEM (or FCI and FMCE) pretest scores are an accurate measure of prior preparation. The CSEM measures a limited subset of concepts in electricity and magnetism; this limited coverage may generate inaccurate results. The CSEM has very limited coverage of Newtonian mechanics and energy; these concepts are often used in conceptual electricity and magnetism problems. As such, the CSEM may not measure the student's mechanics preparation accurately. Further, and possibly most importantly, the CSEM pretest estimates the state of student knowledge early in the class and therefore only measures prior preparation that is directly retained. (Sec. VII has additional comments on concept inventories as measures of student knowledge or learning.) A pretest cannot measure the well-documented advantage to the student of relearning material rather than learning it for the first time [48][49][50].
To explore whether $\delta G_{\mathrm{equal}}$ was the result of prior preparation not captured by pretest scores, additional measures of prior preparation were needed. For samples FCI-1, CSEM-1, FMCE-3, and CSEM-3, a subset of students completed both the mechanics and electricity and magnetism classes. For these students, either FCI or FMCE post-test results were also available, as well as CSEM scores. For sample CSEM-1, there were 1073 students for whom an FCI post-test score was available (826 men, 247 women). Reproducing the partitioning of the gender gap shown in Table VI for this restricted sample yielded $\delta G = 4.90$, $\delta G_{\mathrm{pop}} = -0.85$, $\delta G_{\mathrm{fair}} = -1.60$, $\delta G_{\mathrm{prep}} = 0.41$, and $\delta G_{\mathrm{equal}} = 6.94$. If the FCI post-test score is used to measure prior preparation along with the CSEM pretest score [adding it as an independent variable to Eq. (2c)], the last two terms change to $\delta G_{\mathrm{prep}} = 1.84$ and $\delta G_{\mathrm{equal}} = 5.51$. In this sample, adding FCI post-test scores as an additional measure of preparation reduced the equal gap by 1.43, or 21%. Figure 4 updates the CSEM results in Fig. 3 using the mechanics post-test results as an additional measure of prior preparation. For the CSEM-3 sample, there were 1788 students for whom an FMCE post-test score was also available (1413 men, 375 women). Reproducing the partitioning of the gender gap for this restricted sample yielded $\delta G = 6.01$, $\delta G_{\mathrm{pop}} = 0.08$, $\delta G_{\mathrm{fair}} = -0.44$, $\delta G_{\mathrm{prep}} = 2.31$, and $\delta G_{\mathrm{equal}} = 4.06$. If the FMCE post-test score is used to measure prior preparation along with the CSEM pretest score, the last two terms change to $\delta G_{\mathrm{prep}} = 4.42$ and $\delta G_{\mathrm{equal}} = 1.95$. Adding FMCE post-test scores as an additional measure of preparation reduced the equal gap by 2.11, or 52%.
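The arithmetic in this paragraph can be checked directly from the reported values. The sketch below recomputes the reduction of the equal gap and confirms that adding a second preparation measure only moves gap from $\delta G_{\mathrm{equal}}$ to $\delta G_{\mathrm{prep}}$, leaving their sum unchanged. The function and variable names are hypothetical; all numbers are those quoted above.

```python
def equal_gap_reduction(equal_before, equal_after):
    """Absolute and percentage reduction of the equal gap."""
    drop = equal_before - equal_after
    return drop, 100.0 * drop / equal_before


# (prep, equal) with the CSEM pretest only, then with the mechanics
# post-test added as a second preparation measure.
reported = {
    "CSEM-1": {"before": (0.41, 6.94), "after": (1.84, 5.51)},
    "CSEM-3": {"before": (2.31, 4.06), "after": (4.42, 1.95)},
}

results = {}
for sample, values in reported.items():
    prep_before, equal_before = values["before"]
    prep_after, equal_after = values["after"]
    drop, percent = equal_gap_reduction(equal_before, equal_after)
    # prep + equal is the Gen coefficient controlling for APerf% on the
    # corrected instrument, which the added measure does not change.
    conserved = abs((prep_before + equal_before) - (prep_after + equal_after)) < 0.005
    results[sample] = (drop, percent, conserved)
    print(f"{sample}: drop = {drop:.2f}, percent = {percent:.0f}%, sum conserved: {conserved}")
```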

VI. DISCUSSION
The primary results of this paper are captured in Fig. 3 and Table VI. For all samples, very little of the measured gender differences could be attributed to academic performance differences between men and women. Correcting the instruments for fairness explained differing amounts of the gender differences; fairness accounted for 30% of the gender difference in the FCI-1 sample, smaller but significant amounts of the gender differences in the FMCE (17% in FMCE-2 and 9% in FMCE-3), and little of the gender differences in the CSEM (2%-6%). This was fairly consistent with the size of the fairness effects calculated by the DIF analysis in studies 2 and 3. Prior preparation measured by pretest score consistently explained 30%-40% of the gender differences in all samples except sample CSEM-1. This left from 26% to 90% of the gender difference (δG equal) unexplained. These percentages were calculated using the same method as was used to construct Fig. 3, by dividing the absolute value of each δG i by the sum of the absolute values of all δG i.
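The percentage calculation described above, in which each term is compared with the sum of the absolute values of all terms, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `percent_shares` is hypothetical, and the input values are the restricted CSEM-3 partition (pretest score only) from Sec. V rather than the full-sample values of Table VI, so the resulting shares need not match the percentages quoted for the full samples.

```python
def percent_shares(terms):
    """Share of each partition term, as a percentage of the sum of absolute values.

    Absolute values are used so that terms of opposite sign do not cancel
    in the denominator.
    """
    total_magnitude = sum(abs(value) for value in terms.values())
    return {name: 100 * abs(value) / total_magnitude for name, value in terms.items()}


# Restricted CSEM-3 sample, pretest score as the only preparation measure
# (values quoted in Sec. V; used here purely for illustration).
csem3_restricted = {"pop": 0.08, "fair": -0.44, "prep": 2.31, "equal": 4.06}

shares = percent_shares(csem3_restricted)
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```

By construction the shares always sum to 100%, even when some partition terms are negative.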
The 30% fairness gap measured in sample FCI-1 is smaller than the 50% reduction in the gender gap for the fair instrument presented in study 2. This difference results from the correction of the gender gap for academic performance, which was performed in the current study but not in study 2. The women in sample FCI-1 were higher performing than the men; correcting for this increased the gap.
There was a substantial difference in δG prep (the gender difference resulting from prior preparation) between samples CSEM-1 and CSEM-3. This may partially be the result of the different prerequisite requirements of the classes. For CSEM-1, Calculus 2 is a co-requisite, while for CSEM-3, Calculus 2 is a prerequisite. As such, students in CSEM-3 are generally later in their academic career than students in CSEM-1. This suggests that some of the prior preparation difference may result from the student's experiences in college classes other than physics.
Analysis of a matched sample where both mechanics and electricity and magnetism pretest and post-test scores were available provided evidence that a substantial part of δG equal could be explained by additional prior preparation measures. Study 1 further suggests that, to the extent that δG equal results from psychosocial factors, it should be zero for the CSEM-1 sample. This suggests that pretest scores, or at least CSEM pretest scores, do not provide an accurate measure of prior preparation. It is possible that, as new measures are developed, a substantial part of δG equal will also be shown to result from prior preparation.
FCI-1 and CSEM-1 have similar student populations and instructional environments. The large difference in δG equal between the samples provides further evidence that δG equal in the CSEM-1 sample cannot be explained by psychosocial effects. The instructional similarities between the two samples also suggest that δG equal is not the result of instructional differences.

VII. IMPLICATIONS
This paper is one of a series that examined item-level gender fairness in several introductory physics conceptual inventories [25,33,40]. One of the original motivating questions for this research was how much of the widely discussed gender gap on these tests results from psychometric problems in the tests themselves. This question had been probed for the FCI ([33], Sec. I), but otherwise was largely open. In the course of investigating this larger issue, we have tried several methods to tease apart the sources of this gap: reduced or valid subsets of instruments, linear modeling incorporating pretest scores and other preparation measures, and the framework in this paper of population, fairness, preparation, and equal partitions. We found that sometimes a small number of anomalous and problematic items could be identified [33], but often prior preparation was a larger contributor, and in some samples no combination of these factors could explain even half the gap.
Physics faculty use concept inventories for a number of reasons. Instructors may seek to gauge the quality of their teaching by comparing pre- and post-test scores, or may want to know how well students have learned the major concepts of the course. Researchers may want a standardized measure of the effectiveness of new curricula or interventions. Departments may suggest or require that instructors collect the data for official accountability or accreditation plans, or may simply value the practice of measuring teaching effectiveness. In all cases, it is important to remember that instruments must normally be calibrated before their readings make sense. This is as true for conceptual inventories as it is for probes on an oscilloscope.
It is also important to recall that multiple-choice tests, even those grounded in research on student thinking, are only one possible way to assess learning. They use questions to probe constructs such as "conceptual understanding of Newton's laws," which are defined with varying levels of theoretical clarity [51] and which may or may not emerge as expected by the test designers [52]. By the nature of their format, the information they provide is mediated by students' skill with taking standardized multiple-choice tests. Other researchers might prioritize different physics concepts or different ways of operationalizing them in test constructs [53], and instructors may value skills along other dimensions, such as constructing graphs or problem solving. These limits notwithstanding, conceptual inventories are used in many classrooms as the basis for claims about student learning. It is thus important to understand the complex range of factors that contribute to a single numerical score.
For instructors, the inventories we have examined (the FCI, FMCE, and CSEM) may completely encompass the content they want to assess. If they do not, other options exist that may be more appropriate (Madsen et al. [54] give a recent list). If instructors are giving points based on the number of correct items, they should first check for items that are invalid for their population. Though we found several consistently problematic items on the FCI, we also saw variation in the corrected instruments across our samples (Table I). The best way for instructors to understand the fairness of an instrument for their local population of students is to check the data. Calculating item difficulty using Classical Test Theory is straightforward, does not have the large sample size requirements of IRT models, and can still flag problematic items on pretest or post-test. The Classical Test Theory difficulty is simply the mean item score, allowing the graphical fairness analysis presented in studies 2 and 3 to be performed by any instructor with relatively small sample sizes.
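As an illustration of how little machinery the Classical Test Theory difficulty requires, the following minimal sketch computes the mean item score separately for men and women and lists the items whose difficulty difference departs from the typical difference. The response data are invented for the example, and the function names, the use of the median as the "typical" difference, and the threshold of 0.2 are all assumptions made here; this is not the graphical procedure of studies 2 and 3, only a numerical stand-in for the same idea.

```python
from statistics import median


def item_difficulty(responses):
    """Classical Test Theory difficulty: the mean score on each item.

    `responses` holds one list per student, with 1 for a correct answer
    and 0 for an incorrect answer on each item.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(student[i] for student in responses) / n_students for i in range(n_items)]


def flag_items(men, women, threshold=0.2):
    """Indices of items whose men-minus-women difficulty gap is atypical.

    The threshold is a hypothetical choice; a local analysis should pick
    it with the sample size in mind.
    """
    gaps = [m - w for m, w in zip(item_difficulty(men), item_difficulty(women))]
    typical_gap = median(gaps)
    return [i for i, gap in enumerate(gaps) if abs(gap - typical_gap) > threshold]


# Invented responses to a four-item instrument (rows are students).
men = [[1, 1, 1, 0], [1, 0, 1, 1], [1, 1, 1, 0], [0, 1, 1, 1]]
women = [[1, 1, 0, 0], [1, 0, 0, 1], [0, 1, 0, 1], [1, 1, 1, 0]]

print("men:", item_difficulty(men))
print("women:", item_difficulty(women))
print("items to inspect:", flag_items(men, women))
```

An item flagged this way is only a candidate for closer inspection; with small classes, sampling noise alone can produce large differences on individual items.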
For researchers, calibration is also essential if conceptual inventory scores are being used to make or bolster claims about effective materials. The process of calculating item difficulty, reducing instruments to valid items, and comparing new and old percentage scores (Table II) is one way to check whether problematic items are influencing claims about student learning. Researchers may also be one of the drivers of departmental data collection, and may have some influence over how those data are used. If a department collects gender data and is interested in examining gender gaps, PER faculty should advocate for best practices in interpreting those data.
Finally, physicists do not teach with the aspiration that their students should be able to score higher on multiple-choice tests. Conceptual inventories are valuable as one measure of learning, but if faculty are checking their data for gender gaps, it is equally worth interrogating other aspects of the classroom. For example, instructors might also benefit from taking an Implicit Attitudes Test to check for their own gender biases. Peer teaching observations can include noting whether the instructor differentially calls on students by gender, and pass or drop rates can likewise be checked for gender gaps. Concept inventories, by their structure, position learning as an individual outcome that is entirely located in the students. This is a useful approximation insofar as it reflects what a single person will carry forward from the classroom. However, that focus on the individual and not on the learning environment also filters out other possible sources of a gender gap. Faculty who find the question valuable enough to ask in one context ("How do my students differ in their scores by gender?") should consider how it fits into a coherent plan to evaluate the gender dynamics in their classroom, and what remediation strategies exist for other aspects of classroom culture.
This research demonstrated the need to use the fair conceptual instruments proposed by Traxler et al. [33] and Henderson et al. [40], particularly the fair FCI. This research also showed that a substantial part of the gender differences in each sample could be explained by prior preparation. If δG equal is eventually shown to also result from prior preparation, then the majority of gender differences in all samples would be the result of differences in prior preparation in physics. This suggests that physics classes must deploy instructional strategies to address these differences. These may include adaptive conceptual training that allows all students to work toward a mastery goal, rather than delivering the same assignments to all students, thus giving more conceptual practice to all students who need it.
The partition of the gender gap presented above also has serious implications for future research and the interpretation of past research. Prior preparation differences explained a substantial part of the gender differences in most samples; academic performance differences explained smaller, but still significant amounts of the differences in some samples. The amount these two factors contributed to the gender gap varied greatly between samples. As such, researchers investigating differences in student performance for any reason must collect an appropriate set of control variables including standardized test scores and measures of student prior preparation. The results of this work also seem to indicate that conceptual pretest scores may not provide an accurate characterization of physics prior preparation; more accurate measures should be developed.

VIII. FUTURE
The gender gap has been an active area of research for over a decade; it seems unacceptable that so much of the gender difference remains unexplained. We intend to refine our measurement of prior preparation with the inclusion of a broad set of high school course-taking measurements to determine how much of δG equal can be explained by high school physics and mathematics preparation.

IX. CONCLUSION
This work partitioned the gender difference in post-test performance on the FCI, FMCE, and CSEM into four components: academic performance, instrumental fairness, physics-specific preparation, and a fourth segment representing other effects. The percentage of the gender gap accounted for by each segment varied strongly between the five samples. Fairness accounted for 30% of the gender differences in the FCI, 17% and 9% of differences in the FMCE, and 2%-6% of differences in the CSEM. For four of the five samples, differences in prior preparation measured by pretest scores accounted for approximately 40% of the gender gap. The amount of the gender gap which was accounted for by other effects varied widely between samples. Further correcting for prior preparation using the post-test score in the previous class reduced the size of the gender differences resulting from other effects, sometimes dramatically. This suggests that a CSEM pretest score does not completely capture the effect of prior preparation on conceptual performance.