A comparison of Hake's g and Cohen's d for analyzing gains on concept inventories

Measuring student learning is a complicated but necessary task for understanding the effectiveness of instruction and issues of equity in college STEM courses. Our investigation focused on the implications for claims about student learning that result from choosing between two commonly used methods for analyzing shifts in concept inventory scores. The methods are Hake's gain (g), which is the most common method used in physics education research and other discipline-based education research fields, and Cohen's d, which is broadly used in education research and many other fields. Data for the analyses came from the Learning Assistant Supported Student Outcomes (LASSO) database and included test scores from 4,551 students on physics, chemistry, biology, and math concept inventories from 89 courses at 17 institutions across the United States. We compared the two methods across all of the concept inventories. The results showed that the two methods led to different inferences about student learning and equity because g is biased in favor of high pretest populations. Recommendations for the analysis and reporting of findings on student learning data are included.


I. INTRODUCTION
The methods for measuring change or growth and interpretations of results have been hotly discussed in the research literature for over 50 years [1]. Indeed, the idea of simply measuring a single state (let alone change) in an individual's understanding of a concept, conceptualized as a latent construct, is fraught with both philosophical and statistical issues [2]. Despite these unresolved issues, education researchers use measurement of growth for quantifying the effectiveness of interventions, treatments, and innovations in teaching and learning. Gain scores and change metrics, often referenced against normative or control data, serve as a strong basis for judging the efficacy of innovations. As researchers commonly measure change and report gains and effects, it is incumbent on researchers to do so in the most accurate and informative manner possible.
In this work, we collected data at scale and compared it to existing normative data to examine several statistical issues related to characterizing change or gain in student understanding. The focus of our analyses in this investigation is on student scores on science concept inventories (CIs). CIs are research-based instruments that target common student ideas or prior conceptions. These instruments are most often constrained response (multiple choice) and include these common ideas as attractive distractors from the correct response. A multitude of CIs are in use across Biology, Chemistry, and Physics (our target disciplines) and in other fields (e.g., Engineering and Math). While CIs are common, the strength of their validity arguments varies widely and some lack normative data. All the CIs used in our work have at least some published research to support their validity, and they align with our proposed uses for the scores.
A principal tool in quantitative research is comparison, which leads to the frequent need to examine different instruments and contexts. Complications arise in these cross contextual comparisons because the instruments used may have different scales and the scores may greatly vary between populations. For example, some CIs are designed to measure learning across one semester while others are designed to measure learning across several years.
Instructors could use both instruments in the same course but they would, by design, give very different results. To compare the changes on the two instruments, researchers need to standardize the change in the scores, which researchers commonly do by dividing the change by a standardizing coefficient. Unlike in Physics Education Research (PER) and other Discipline Based Education Research (DBER) fields, the social science and education research fields typically use a standardizing coefficient that is a measure of the variance of the scores.
The most common gain measurement used in PER and other DBER fields when analyzing CI data is the average normalized gain, g, shown in Equation 1 [3]. In this equation, the standardizing coefficient is the maximum change that could occur. Hake adopted this standardizing coefficient because it accounted for the smaller shift in means that could occur in courses with higher pretest means. He argued that g was a suitable measure because it was not correlated with pretest mean, whereas posttest mean and absolute gain were correlated with pretest mean and were not suitable measures. He also argued that this normalization allowed for "a consistent analysis over diverse student populations with widely varying initial knowledge states" [3, p. 66] because the courses with lecture-based instruction all had low g, courses with active-engagement instruction primarily had medium g, and no courses had high g. Hake then used this reasoning to define high-g (g > 0.7), medium-g (0.7 > g > 0.3), and low-g (g < 0.3) courses.
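As a concrete illustration, the course-level g of Equation 1 and Hake's descriptive bands can be computed as follows. This is a minimal sketch with means expressed as percentages; the function names are ours.

```python
def hake_gain(pre_mean: float, post_mean: float) -> float:
    """Average normalized gain: the gain achieved divided by the
    maximum gain possible, with means expressed as percentages."""
    if pre_mean >= 100:
        raise ValueError("g is undefined when the pretest mean is perfect")
    return (post_mean - pre_mean) / (100 - pre_mean)


def hake_category(g: float) -> str:
    """Hake's [3] descriptive bands: high (g > 0.7), medium, low (g < 0.3)."""
    if g > 0.7:
        return "high"
    if g > 0.3:
        return "medium"
    return "low"


# A course moving from a 40% pretest mean to a 70% posttest mean
# realizes half of its possible gain: g = (70 - 40) / (100 - 40) = 0.5.
print(hake_gain(40.0, 70.0), hake_category(0.5))
```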
Researchers have extensively investigated the utility and limitations of d, while the research investigating g is limited. In contrast to Hake's earlier finding [3], Coletta and Phillips [5] found that g was correlated with pretest means. Willoughby and Metz [6] found that inferences based on g suggested gender inequities existed in college STEM courses even though several other measures indicated that there were no gender inequities in those courses. Furthermore, researchers use several different methods for calculating g, which can lead to discrepant findings [7,8]. Researchers have also identified issues with d. For example, d exaggerates the size of effects when measuring changes in small samples [9]. Cohen's d is based on the t-statistic and on the assumptions of normality and homoscedasticity in the test scores used to generate it [9]. CI data frequently fail to meet the assumptions of normality and homoscedasticity because of floor effects, ceiling effects, and outliers. We expect that any problems this creates for d also apply to g. However, we are not aware of any research on these assumptions pertaining to g.

II. PURPOSE
Both d and g have limitations. Our purpose in this investigation was to empirically compare concept inventory gains using both g and d to investigate the extent to which they lead to different inferences about student learning. In particular, our concern was that g favors high pretest populations, which leads to skewed measures of student learning and equity. This particularly concerned us because g is the de facto measure of student learning in PER and other DBER fields, despite there being few investigations of its validity and known problems with its efficacy. We compared g to d because d is gaining use in DBER and is the comparable de facto measure in the much larger fields of sociology, psychology, and education research, where researchers have extensively studied its validity, utility, and limitations.

III. BACKGROUND ON MEASURING CHANGE
In this section, we provide a foundation for our motivations and work. First, we discuss the development and use of CIs in undergraduate science education research. We then discuss statistical issues related to measuring change before reviewing the uses of the average normalized gain in analyzing scores from CIs. Finally, we discuss Cohen's d and its use in the context of best practices for presenting data and findings.
A. Rise in the use of CIs to measure student knowledge
CIs provide "data for evaluating and comparing the effectiveness of instruction at all levels" (Hestenes et al. [10]). They typically consist of banks of multiple-choice items written to assess student understanding of canonical concepts in the sciences, mathematics, and engineering. Researchers generally develop CIs through an iterative process: they identify core concepts with expert feedback and use student interviews to identify common preconceptions and provide wording for distractors. CIs exist for core concepts in most STEM fields; see [11] for a thorough review and discussion. Though it is unclear exactly how many CIs exist, one of the most widely used, the Force Concept Inventory [10], had been cited more than 2,900 times on Google Scholar at the time this manuscript was prepared.
Researchers often use CIs as the outcome measures for evaluative studies to determine whether an instructional intervention affects learning relative to a control condition. To facilitate this use, researchers administer CIs pre- and post-instruction and compare the gains observed in treatment groups to the gains observed in a control condition. CIs tend to measure conceptual understanding at a big-picture level. This means that if students conceive of science learning primarily as memorizing definitions and formulas (consistent with a more traditional conception of teaching and learning), they are unlikely to do well on most CIs. Several studies have used CIs to compare the impact of research-based pedagogies to more traditional pedagogies [3,12,13] and to investigate equity in STEM courses by comparing the knowledge and learning of majority and underrepresented minority students [14][15][16][17]. These types of investigations motivate instructors to adopt active learning in courses throughout the STEM disciplines [18].
Because researchers often compare scores for different CIs administered to different populations, they often use a change metric that is standardized and free from the original scale of the measurement. This change metric is often g for DBER studies, but some DBER studies have used d. One particular case that focused our current investigation on comparing g and d was the use of g and test means to conclude that "in the most interactively taught courses, the pre-instruction gender gap was gone by the end of the semester" [17, p. 1], a finding that Rodriguez et al. [16] later called into question when their "analysis of effect sizes showed gender still impacted FCI scores and that the effect was never completely eliminated" [16, p. 6].

B. Some issues in measuring change
Discussions in the measurement literature on quantifying change can be sobering. A classical and often cited work in this area is that of Cronbach and Furby [2], which raised issues of both the reliability and validity of gain scores. Based on Classical Test Theory, the authors argue that the prime issue of reliability has to do with the systematic relationship between error components of true scores derived from independent, but "linked," observations. Consider a common situation in CI use in which the same test is given as both a pre- and posttest. One could argue that the observations (pre and post) are independent measurements since they are taken at different time points, but they are actually linked since the measurements are from the same group of students. Because those students had responded to the same instrument at the pretest administration, their posttest scores are likely correlated with their pretest scores due to a shared error component between the two scores. One can correct for this (often overstated) correlation due to the shared error components, but the correction is not always straightforward. Bereiter [1] calls this the "over correction under correction dilemma." Cronbach and Furby discuss this dilemma at length and offer various methods to disattenuate the correlation. However, the authors seem to see these correction methods as a workaround for the real issue of linked observations. In their summary discussion, Cronbach and Furby actually state that "investigators who ask questions regarding gain scores would ordinarily be better advised to frame their questions in other ways" (p. 80). Despite their persistent statistical issues, gain measurements are widely used in education research due to their great utility. Acknowledging these issues while leveraging the utility requires researchers to be diligent and transparent in their methods and presentation.
Another issue with gain scores has to do with the actual scale of the scores, which Bereiter refers to as the "physicalism-subjectivism dilemma." The issue here is related to the assumption of an interval scale on the construct of interest when using raw scores, or when using gain scores that are normalized on that same scale (as in using g). In other words, the gain metric (g) is scaled in terms of the measure itself (e.g., Newtonian Thinking as measured by a CI) and is assumed to be intervally scaled. A potential solution is to change the scaling to something that "seems to conform to some underlying psychological units" [1, p. 5]. In this case, the scaling factor (or "standardization coefficient") is based not on the scale of the measure (e.g., raw scores on a CI) but on a standard unit such as the variance of the score distribution. In this way, the gain metric is transformed out of the scale of the measure (e.g., Newtonian Thinking) and into a construct-independent, standardized scale (e.g., based on variance). Transforming the scale can make cross-scale comparisons possible and may also highlight potential inequities brought on by remaining in the scale of the measure itself. This latter approach is how the dilemma is addressed when using the effect size metric (discussed further below). For a more detailed discussion of these issues related to measuring change in Classical Test Theory, see [19].

C. The Average Normalized Gain
Hake [3] developed the average normalized gain (g) as a way to normalize average gain scores in terms of how much gain could have been realized. Hake interpreted g from pre-post testing "as a rough measure of the effectiveness of a course in promoting conceptual understanding" (p. 5). His work was seminal in PER and led to the broad use of g throughout DBER. The breadth of its uptake led to at least three different methods for calculating g being in common use. The original method, proposed by Hake, calculates g from the group means and is shown in Equation 1. A second method that is more commonly used [13] is to calculate the normalized gain for each individual student (g_I) to characterize that student's growth, and then to average the normalized gains of all the individuals to calculate g for the group. Bao [7] provides an in-depth discussion of the affordances of these two methods, but Bao and Hake both state that in almost all cases the two values are within 5% of one another.
Marx and Cummings [8] proposed a third method, normalized change (c), in response to several shortcomings of g_I. These shortcomings included a bias toward low pretest scores, a non-symmetric range of values (−∞ to 1), and a value of −∞ for any posttest score when the student achieves a perfect pretest score. These limitations inhibit the ability to calculate the average normalized gain for a class by averaging the individual student normalized gains. Instead, they offer a set of rules for determining c based on whether the student gains from pre- to posttest, worsens, or remains at the same score. Their metric results in values of c ranging from −1 to +1. However, c is still sensitive to the distribution of pre- and posttest scores in a way that might be "related to certain features of the population" [7], as it is still normalized on the same scale as the measure itself, an issue raised by Bereiter [1] and discussed above.
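The three calculation methods can be sketched as follows. The c rules follow Marx and Cummings [8] as summarized above; the function names are ours, and students for whom c is undefined are returned as None so they can be dropped before averaging.

```python
from statistics import mean


def course_g(pre_scores, post_scores, max_score=100.0):
    """Hake's course-level g, computed from the group means (Equation 1)."""
    pre, post = mean(pre_scores), mean(post_scores)
    return (post - pre) / (max_score - pre)


def individual_g(pre, post, max_score=100.0):
    """Per-student normalized gain g_I: unbounded below but capped at 1."""
    return (post - pre) / (max_score - pre)


def normalized_change(pre, post, max_score=100.0):
    """Marx and Cummings' normalized change c, bounded between -1 and +1.
    Returns None where c is undefined (zero or perfect on both tests)."""
    if post > pre:
        return (post - pre) / (max_score - pre)
    if post < pre:
        return (post - pre) / pre
    if pre in (0.0, max_score):
        return None
    return 0.0
```

For example, a student who drops from 70% to 40% has g_I = −1.0 but c = −30/70 ≈ −0.43, illustrating the asymmetry in g_I that motivated c.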
One particular concern with gain metrics, and with g specifically, has to do with the possibility that these metrics can be biased for or against different groups of students. As Willoughby and Metz [6] found, "males had higher learning gains than female students only when the normalized gain measure was utilized. No differences were found with any other measures, including other gain calculations, overall course grades, or individual exams." One might expect this to be the case when the pretest score is part of the standardization coefficient, since pretest scores are likely correlated with previous education, and therefore with opportunity and even socioeconomic status. Indeed, Coletta and Phillips [5, p. 1] "found a significant, positive correlation between class average normalized FCI gains and class average preinstruction scores." This finding aligns with Marx and Cummings' [8] conclusion that g is biased by pretest scores; however, they found it was biased in the opposite direction.

D. The Effect Size Metric
One of the most widely used standardized effect size metrics is Cohen's d, which normalizes (i.e., scales) the difference in scores in terms of the standard deviation of the observed measurements. In essence, it is the difference between Z (standard) scores. This results in a "pure" number free from the original scale of measurement [4]. As a result, d meets the need for ". . . a measure of effect size that places different dependent variable measures on the same scale so that results from studies that use different measures can be compared or combined" (Grissom and Kim [9]).
As a consequence of using the standard deviation, d assumes that the populations being compared are normally distributed and have equal variances. Accordingly, the standard deviation used to calculate d can be that of either sample, since the variances are assumed to be equal; in practical applications, however, the pooled standard deviation is typically used. Using either the equal pre- and posttest standard deviations or the pooled standard deviation assumes that the samples (pre- and posttest) are independent, and therefore does not take into account or correct for the correlation between measurements made at pre and post (the "dilemma" discussed above from Bereiter). A calculation of d that accounts for the dependence between pre- and posttest scores is given in [20]. Dunlap et al. [21] present an example of calculating a t-statistic between two means when assuming the samples are independent, and again when assuming dependence for the same sample means. When running dependent analyses the "...correlation between the measures reduces the standard error between the means, making the differences across the conditions more identifiable" [21, p. 171]. Thus, taking the dependence in the data into account results in a larger t-statistic because the difference in the means is divided by a smaller standard error. Cohen's d can be calculated directly from either the dependent or the independent t-statistic, which is why the dependent form of d is larger than the independent form.
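The contrast between the two forms can be sketched as follows, assuming equal-sized pre and post samples. The dependent form here divides the mean gain by the standard deviation of the difference scores, which is one common formulation; the function names are ours.

```python
import math
from statistics import mean, stdev


def cohens_d_independent(pre, post):
    """d with a pooled standard deviation; for equal sample sizes the
    pooled SD reduces to the root mean square of the two SDs."""
    s_pool = math.sqrt((stdev(pre) ** 2 + stdev(post) ** 2) / 2)
    return (mean(post) - mean(pre)) / s_pool


def cohens_d_dependent(pre, post):
    """d for paired scores: the mean gain over the SD of the difference
    scores, which shrinks as the pre-post correlation grows."""
    diffs = [b - a for a, b in zip(pre, post)]
    return mean(diffs) / stdev(diffs)


# Strongly correlated pre-post scores make the dependent form much larger.
pre = [10.0, 20.0, 30.0, 40.0]
post = [22.0, 28.0, 41.0, 49.0]
print(cohens_d_independent(pre, post), cohens_d_dependent(pre, post))
```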
In practice, many researchers use the independent form of d given in Equation 2 even with dependent pre-post data. Dunlap et al. [21] support this practice, arguing that the correlation does not change the size of the difference between the means but only makes the difference more noticeable by reducing the standard error.
Morris and DeShon [22] agree that using the independent calculation for d with dependent data is an acceptable practice so long as all researchers are aware of the issue and any effect sizes being compared are calculated in the same way.

IV. RESEARCH QUESTIONS
Given our purpose of comparing g and d, our specific research questions were:
1. To what extent did the relationships between g and d, and their relationships to the descriptive statistics used to calculate them, indicate that they were biased toward different populations of students?
2. To what extent did disagreements between g and d about the learning for groups of students with different pretest scores confirm any biases identified while investigating the first research question?
Based on previous research, we expected differences in the degree to which d and g indicated that a phenomenon (e.g., learning gains or equity) was present. We expected the gain characterized by each metric to vary by student population due to differences in pretest scores.

V. METHODS
A. Data collection and processing
Our general approach to data collection and processing was to collect the pre and post data with an online platform. We then applied filters to the data to remove pretests or posttests that were spurious. Instead of only analyzing the data from students who provided both a pretest and a posttest, we used Multiple Imputation (MI) to include all the data in the analyses. Online data collection enabled collecting a large dataset, filtering removed spurious and outlier data that was unreliable, and using MI maximized the size of the sample analyzed and the statistical power of the analyses.
We used data from the Learning About STEM Student Outcomes (LASSO) platform that was collected as part of a project to assess the impact of Learning Assistants (LAs) on student learning [23,24]. LAs are talented undergraduates hired by university and two-year college faculty to help transform courses [25]. LASSO is a free platform hosted by the LA Alliance. We processed the data from the LASSO database to remove spurious data points and to ensure that courses had sufficient data for reliable measurements. We filtered our data with a set of filters similar to those used by Adams et al. [26] to ensure that the data used to validate the Colorado Learning Attitudes about Science Survey (CLASS) were reliable.
Their filters included the number of items completed, the duration of online surveys, and a filter question that directed participants to mark a specific answer. In our experience, Adams and colleagues' discussion of filtering the data is rare among physics education researchers.
Just as Von Korff et al. [13] found that few researchers explicitly state which g they used, we found that few researchers explicitly address how they filtered their data. For example, the authors of several studies [27][28][29] that used the CLASS made no mention of whether they used the filter question to filter their data, nor did they discuss any other filters they may have applied. The lack of discussion of filtering in these three studies is not unique to these authors; rather, it reflects common practice in the physics education literature.
We included courses that had partial data for at least 10 participants to meet the need for a reliable measure of means without excluding small courses from our analyses. We removed spurious and unreliable data at the student and course level if any of the following conditions were met.
• A student took less than 5 minutes to complete that test. We reasoned that this was a minimum amount of time required to read and respond to the test questions.
• A student answered less than 80% of the questions on that test. We reasoned that these tests did not reflect students' actual knowledge.
• A student's absolute gain (posttest score minus pretest score) was 2 standard deviations below the mean absolute gain for that test. In these cases, we removed the posttest scores because we reasoned that it was improbable for students to complete a course and unlearn the material to that extent.
• A course had greater than 60% missing data. Low response rates may have indicated abnormal barriers to participating in the data collection that could have influenced the representativeness of the data from those courses.
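The student-level filters above can be sketched as follows. This is a simplified illustration: the record fields are hypothetical rather than the actual LASSO schema, each record stands for one student's matched pre/post attempt (in practice the duration and completion filters apply to each test separately), and the course-level missing-data filter is omitted.

```python
from statistics import mean, stdev


def apply_filters(students, min_minutes=5, min_completion=0.8):
    """Filters 1-3: drop rushed tests, mostly blank tests, and posttests
    whose absolute gain falls more than 2 SD below the mean gain."""
    kept = []
    for s in students:
        if s["duration_min"] < min_minutes:          # Filter 1: under 5 minutes
            continue
        if s["fraction_answered"] < min_completion:  # Filter 2: under 80% answered
            continue
        kept.append(dict(s))

    gains = [s["post"] - s["pre"] for s in kept
             if s["pre"] is not None and s["post"] is not None]
    cutoff = mean(gains) - 2 * stdev(gains)
    for s in kept:
        if s["pre"] is not None and s["post"] is not None:
            if s["post"] - s["pre"] < cutoff:
                s["post"] = None                     # Filter 3: drop only the posttest
    return kept
```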
Filter 1, taking less than five minutes, removed 364 students from the data set. Filter 2, completing less than 80% of the questions, removed 10 students from the data set. Filter 3, a negative absolute gain 2 standard deviations below the mean, removed 0 students but did remove 43 posttests. Removing the courses with more than 60% missing data removed 27 courses and 1,116 students from the analysis.
To address missing data, we performed multiple imputation (MI) with the Amelia II package in R [30]. The most common method for addressing missing data in PER is to use listwise deletion to analyze only the complete cases, discarding data from any student who did not provide both a pretest and a posttest. We do, however, know of at least one study in PER that used MI [31]. We used MI because it has the same basic assumptions as listwise deletion but reduces the rate of type I errors by using all the available information to better account for missing data [32]. This leads to much better estimates than traditional methods such as listwise deletion [33], which, "while they have provided relatively simple solutions, they likely have also contributed to biased statistical estimates and misleading or false findings of statistical significance" [34, p. 400]. Extensive research indicates that in almost all cases MI produces superior results to listwise deletion [35,36].
MI addresses missing data by (1) imputing the missing data m times to create m complete data sets, (2) analyzing each data set independently, and (3) combining the m results using standardized methods [34]. The purpose of MI is not to produce specific values for missing data but rather to use all the available data to produce valid statistical inferences [35].
Our MI model included variables for CI used, pretest and posttest scores and durations, first time taking the course, and belonging to an underrepresented group for both race/ethnicity and for gender. The data collection platform (LASSO) provided complete data sets for the CI variables and the student demographics. As detailed in Table I, either the pretest score and duration or the posttest score and duration was missing for 42% of the students. To check if this rate of missing data was exceptional, we identified 23 studies published in the American Journal of Physics or Physical Review that used pre-post tests.
Of these 23 studies, 4 reported sufficient information to calculate participation rates [27][28][29]37]. The rate of missing data in these 4 studies varied from 20% to 51% with an average of 37%. The 42% rate of missing data in this study was within the normal range for PER studies using pre-post tests.
Based on the 42% rate of missing data, we conducted 42 imputations because this conservative number provides better results than a smaller number of imputations [38]. We analyzed all 42 imputed data sets and combined the results by averaging the test statistics (e.g., means, correlations, and regression coefficients) and using Rubin's rules to combine the standard errors for these test statistics [39]. Rubin's rules combine the standard errors from the analyses of the MI data sets using both the within-imputation variance and the between-imputation variance, with a weighting factor for the number of imputations used. For readers seeking more information on MI, Schafer [39] and Manly and Wells [35] provide useful overviews. All assumptions were satisfactorily met for all analyses.
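The pooling step can be sketched as follows; pool_rubin is our name for it, and its inputs are the per-imputation estimate and standard error of a single statistic.

```python
import math


def pool_rubin(estimates, std_errors):
    """Combine m imputed-data results with Rubin's rules: average the
    point estimates, then combine the SEs from the within-imputation
    variance and the between-imputation variance, the latter weighted
    by (1 + 1/m) for the number of imputations used."""
    m = len(estimates)
    q_bar = sum(estimates) / m                         # pooled point estimate
    within = sum(se ** 2 for se in std_errors) / m     # mean squared SE
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total_variance = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total_variance)
```

The pooled standard error is always at least as large as the average within-imputation standard error, reflecting the extra uncertainty introduced by the missing data.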

B. Investigating the effect size measures
To identify and investigate differences between the effect size measures, we used correlations and multiple linear regressions (MLR) to investigate the relationships between the effect size measures and the test means and standard deviations. Correlations informed the variables we included in the MLR models.
We calculated Cohen's d for each course using the independent samples equation, Equation 2. We used this form because it is the most commonly used in the physics education research literature and because we expected the choice between the independent and dependent forms to have little to no impact on the analyses [22], as discussed in Section III. D.
To test for biases in the effect sizes when applied to CI data, we calculated effect size measures for male and female students separately in the aggregated data set. We separated these two groups because male students tend to have higher pretest and posttest means on science concept inventories than female students [14,40]. Thus, gender provided a straightforward method of forming populations with different test means and standard deviations. Gender also allowed us to frame our analysis in terms of equity of effects. We defined equity as the case in which a course does not increase pre-existing group mean differences. Under this definition, for a course to be equitable the effect on the lower pretest group must be equal to or larger than the effect on the higher pretest group.
For this analysis we calculated the effect sizes for males and females separately. For each effect size measure we then calculated the difference between the male and female effect sizes, for example Δg = g_male − g_female. If the males in a course had a larger effect size than the females, then that course was inequitable (Δg > 0). This created four categories into which any two effect size measures would locate each course: two categories of agreement, where both effect sizes indicated either equity or inequity, and two categories of disagreement, where one indicated equity and the other inequity. If one of the effect size measures was biased toward larger effects for higher pretest mean populations, then we expected one type of disagreement to occur more frequently than the other.
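The four-category scheme can be sketched directly from the male-female differences; the category labels and function name are ours.

```python
def equity_category(delta_g, delta_d):
    """Classify a course from the signs of delta_g = g_male - g_female and
    delta_d = d_male - d_female; a positive difference means the metric
    reports a larger effect for the (typically higher-pretest) male students,
    i.e., inequity under our definition."""
    g_label = "inequitable" if delta_g > 0 else "equitable"
    d_label = "inequitable" if delta_d > 0 else "equitable"
    if g_label == d_label:
        return ("agree", g_label)
    return ("disagree", "g says " + g_label + ", d says " + d_label)
```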
To easily identify differences in the number of courses in the disagreement categories and the size of those disagreements, we plotted the data on a scatter plot. We tested the statistical difference in the distributions using a chi-square test of independence with categories for each effect size measure and for whether they indicated equity or inequity.
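The chi-square statistic for the resulting 2x2 table can be computed without any statistics library; here the rows are the two effect size measures and the columns are the equity/inequity counts.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic (no continuity correction) for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat
```

With the counts reported in our results (g indicated a larger effect on males in 33 of 43 courses, d in 22 of 43), the table [[33, 10], [22, 21]] yields a statistic of approximately 6.10, consistent with the reported value.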

C. Simplifying the analysis
The multiple methods for calculating normalized gain for a course complicated our purpose of comparing normalized gain and Cohen's d. Therefore, we compared normalized gain calculated using each of the three common methods, which are described in Section III. C, for each course. We calculated g using the average pretest and posttest scores for the course.
We calculated the course-level g_I and c by averaging the individual student values for each course.
Correlations between all three measures were large and statistically significant, as shown in Table 1. The scatter plots for these three measures are shown in Figure 1. These results indicated that all three measures were very similar. Therefore, we only used the normalized gain calculated using course averages, g, in our subsequent analyses.
The filters we applied to the data likely minimized the differences between g_I and the other two forms of normalized gain. As Marx and Cummings [8] point out, g_I is asymmetric.
Students with high pretest scores can have very large negative values for g_I, as low as approximately -32, but can only have positive values up to 1. We focused on filtering out spurious and unreliable data that would have likely produced many large negative g_I values for individual students and resulted in larger differences between g_I and the other two normalized gain measures. Nonetheless, the notable differences between the three normalized gain metrics all occurred where g_I was much lower than the other two metrics.

VI. RESULTS
A. Relationships between the effect size measures and their descriptive statistics
Because these two measures serve the same purpose, the 44% of their variance that they did not have in common was a large amount. Further investigating the correlations between the effect sizes and their related descriptors, shown in Table 2, revealed large differences between d and g. The correlations between d and both pretest mean and pretest standard deviation were small to very small and were not statistically reliable. In contrast, g was moderately to strongly correlated with both pretest mean and pretest standard deviation. These correlations between g and the pretest statistics (0.43 and 0.44, respectively) indicated that approximately one fifth of the variance in normalized gains was accounted for by the score distributions that students had prior to instruction. These relationships were strong evidence that g was positively biased in favor of populations with higher pretest scores.
To inform the size of this bias, we ran several models using MLR with g as the dependent variable and independent variables for d, pretest mean, and pretest standard deviation. We used g as the dependent variable because this was consistent with the correlations indicating that g, but not d, was positively biased by pretest means. The linear equation for the final model is given in Equation 5. Our focus in these MLRs was on the additional variance explained by each variable in the models, which we measured using the adjusted r^2, rather than on the individual regression coefficients. The four models for the MLR are shown in Table III. Because adding the pretest mean and pretest standard deviation explained substantially more variance than Model 1 (d alone), and the correlations indicated that pretest mean and pretest standard deviation were much more strongly related to g than to d, these results indicated that g was biased in favor of groups with higher pretest means.
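The added-variance comparison can be sketched with ordinary least squares: fit g on d alone, then on d plus a pretest statistic, and compare adjusted r^2 values. The data below are synthetic placeholders, not our course-level values.

```python
import numpy as np


def adjusted_r2(y, X):
    """Adjusted r^2 for an OLS fit of y on the columns of X plus an
    intercept: 1 - (1 - r^2)(n - 1)/(n - p - 1)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    residuals = y - X1 @ beta
    ss_res = float(residuals @ residuals)
    centered = y - y.mean()
    ss_tot = float(centered @ centered)
    r2 = 1.0 - ss_res / ss_tot
    n, k = X1.shape            # k counts the intercept plus the p predictors
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)


# Synthetic courses where g depends on both d and pretest mean: the
# two-predictor model should explain more variance than d alone.
rng = np.random.default_rng(0)
d = rng.normal(size=50)
pre = rng.normal(size=50)
g = 0.5 * d + 0.4 * pre + rng.normal(scale=0.1, size=50)
print(adjusted_r2(g, d.reshape(-1, 1)), adjusted_r2(g, np.column_stack([d, pre])))
```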
B. Testing the bias in g using populations with different pretest scores

Results from MLR Model 2 indicated that a class's pretest mean explained 27% of the variance in a class's g value that was not explained by d. If g is biased in favor of high pretest groups, as the MLR and correlations indicated, then we expected the disagreements between g and d to skew toward g favoring the high pretest population. To visualize potential bias in g, we plotted the difference in d on the x axis and the difference in g on the y axis in Figure 3. The course marker color shows whether male or female students' pretest means were higher. Almost all of the markers (41 of 43 courses) indicated that male students started with higher pretest means, making the data well suited to examining, within our focus on equity, whether g favored the higher pretest population. In total, g showed a larger effect on males in 33 out of 43 courses whereas d indicated a larger effect on males in 22 out of 43 courses. Figure 3 illustrates this bias in g in the difference between Quadrants II and IV. A chi-squared test of independence indicated that these differences were statistically reliable, χ²(1) = 6.10, p = 0.013. This difference confirmed that g was biased in favor of the high pretest population.
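The quadrant comparison can be sketched as a one-degree-of-freedom test against an equal-split null. The counts below are hypothetical, chosen only to show the mechanics rather than to reproduce the reported χ²(1) = 6.10:

```python
from math import erfc, sqrt

# If disagreements between g and d were unbiased, courses should fall in
# Quadrant II and Quadrant IV about equally often. We test the observed split
# against that equal-split null with a chi-squared statistic (df = 1).
def chi2_equal_split(a, b):
    """Chi-squared test of two counts against a 50/50 expectation."""
    exp = (a + b) / 2
    stat = (a - exp) ** 2 / exp + (b - exp) ** 2 / exp
    p = erfc(sqrt(stat / 2))   # survival function of chi-squared with df = 1
    return stat, p

# Hypothetical disagreement counts for the two quadrants:
quadrant_ii, quadrant_iv = 16, 4
stat, p = chi2_equal_split(quadrant_ii, quadrant_iv)
print(f"chi2(1) = {stat:.2f}, p = {p:.4f}")
```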

VII. DISCUSSION
To simplify our comparison of the statistical merits of using g and d to measure student learning, we first determined what differences there were between the three methods of calculating g. Our analysis showed that the three methods for calculating normalized gain scores were highly correlated (r ≥ 0.93). The high level of correlation between the normalized gain values indicated that it made little difference which method we used. This result was encouraging given that many researchers report g scores without discussing which method of calculation they used [13]. The scatter plots (Figure 1) for the three measures of g indicated that the large disagreements between the measures occurred in cases in which g_I was much lower than both g and c. This discrepancy is consistent with the negative bias in possible g_I scores that led Marx and Cummings [8] to develop c. The filters we used to remove unreliable data likely increased the agreement we found between g_I and the other normalized gain measures. However, there were several courses where g_I was noticeably lower than g and c. These disagreements indicate two potential problems in the existing literature. Some studies using g_I may have underestimated the learning in the courses they investigated due to the oversized impact of a few large negative g_I values. Alternatively, some studies may have filtered out data with large negative g_I values but not explicitly stated that this filtering occurred. Both situations are consistent with Von Korff and colleagues' [13] statement that few researchers explicitly state which measure of normalized gain they used. Either situation, or a combination of the two, makes it difficult for researchers to rely on and to replicate the work of those prior studies.
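The three calculations can be sketched as follows, using hypothetical percent scores. The definition of c follows our reading of Marx and Cummings' normalized change, in which gains are scaled by the room to grow and losses by the room to fall:

```python
# Three normalized gain calculations for a class, on hypothetical percent scores.
def course_g(pre, post):
    """Hake's course-level g computed from class means."""
    pre_m, post_m = sum(pre) / len(pre), sum(post) / len(post)
    return (post_m - pre_m) / (100 - pre_m)

def mean_g_i(pre, post):
    """Mean of the individual normalized gains g_I."""
    return sum((q - p) / (100 - p) for p, q in zip(pre, post)) / len(pre)

def mean_c(pre, post):
    """Mean normalized change c: losses are scaled by the pretest score."""
    vals = []
    for p, q in zip(pre, post):
        if q > p:
            vals.append((q - p) / (100 - p))
        elif q < p:
            vals.append((q - p) / p)
        else:
            vals.append(0.0)
    return sum(vals) / len(vals)

pre = [40, 50, 60, 97]
post = [70, 75, 80, 55]   # one high-pretest student scores much lower
# A single large negative g_I (-14 here) drags the g_I average far below g and c:
print(course_g(pre, post))   # approximately 0.22
print(mean_g_i(pre, post))   # -3.125
print(mean_c(pre, post))     # approximately 0.27
```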
Our comparisons of g and d revealed several meaningful differences that indicated that g was biased in favor of high pretest populations. The correlation between g and d was strong (r = 0.75, p < 0.001) but was markedly smaller than the correlations between the three different methods of calculating g (r ≥ 0.93). This correlation of 0.75 meant that g and d shared only 56% of their variance. MLRs indicated that pretest mean and standard deviation explained most of the difference between g and d; d, pretest mean, and pretest standard deviation accounted for 92% of the variance in g. Given that g was correlated with these pretest statistics much more strongly than d, we concluded that g was biased in favor of populations with high pretest means. We recommend that researchers avoid using all forms of normalized gain and instead report Cohen's d and the descriptive statistics used to calculate it, including the correlation between pretest and posttest scores.
This bias of g in favor of populations with high pretest means is problematic. The dependence of g on pretest means privileges populations of students who come into a class with more disciplinary knowledge or who perform better on multiple choice exams. This bias disproportionately affects students from traditionally underrepresented backgrounds, such as women in physics. When comparing the learning of males and females in our dataset, g identified males as learning more in 33 of 43 courses (77%) while d only identified males as learning more in 23 of 43 courses (53%), reducing the rate by nearly one third (Figure 3). This difference in measurement indicated that g should not be used for investigations of equity because it overestimated student inequities. Researchers are better served by using statistical methods that analyze individual students' posttest scores while controlling for their pretest scores and other variables of interest. All researchers should ensure that they report sufficient descriptive statistics for their work to be included in meta-analyses.

VIII. CONCLUSION AND INFERENCES
The bias in g can harm efforts to improve teaching in college STEM courses by misrepresenting the efficacy of teaching practices across populations of students and across institutions. Students from traditionally underrepresented backgrounds are disproportionately likely to have lower pretest scores, putting them at a disadvantage when instructors make instructional or curricular decisions about an intervention's efficacy based on g. For example, g likely disadvantages instructors who use it to measure learning in courses (e.g., non-major courses) or at institutions (e.g., 2-year colleges) that serve students who have lower pretest means. This is particularly important for faculty at teaching-intensive institutions, where evidence of student learning can be an important criterion for tenure and promotion.
Comparing the impact of interventions across settings and outcomes in terms of gain scores requires some form of normalization. Normalized learning gain (g) and Cohen's d both employ standardization coefficients to account for the inherent differences in the data. Hake developed g to account for classes with higher pretest means potentially having lower gains due to ceiling effects. By focusing on ceiling effects, g implicitly assumes that any population with a higher pretest score will have more difficulty in making gains than lower pretest populations. This assumption contradicts one of the most well-established relationships in education research: that prior achievement is a strong predictor of future achievement. Thus, g's adjustment for potential ceiling effects appears to overcorrect for the problem and results in g being biased in favor of populations with higher pretest means.
Using standard deviation as the standardization coefficient in Cohen's d helps to address ceiling effects in that measure. When ceiling effects occur, the data compresses near the maximum score. This compression causes the standard deviation to decrease, which increases the size of d for the same raw gain. Cohen's d also corrects for floor effects by this same mechanism. Instruments that have floor and ceiling effects are not ideal for research because they break the assumption of equal variances on the pre- and posttests and because they are poor measures for high or low achieving students. Instruments designed based on Classical Test Theory, such as the CIs used in this study, mainly consist of items that discriminate between average students and have few items to discriminate between high-performing or low-performing students. Cohen's d may mitigate the limitations of these instruments for measuring the learning of high or low pretest populations of students by accounting for the distribution of test scores. When the standard deviation is smaller, as with floor or ceiling effects, the probability of change is lower (i.e., learning is harder), so Cohen's d is larger in these cases for the same size change in the means.
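This mechanism can be sketched with hypothetical summary statistics, using a pooled standard deviation for the two measurements (the numbers below are illustrative only):

```python
from math import sqrt

# Cohen's d with a pooled standard deviation, assuming equal group sizes
# (pre and post come from the same students). Hypothetical numbers only.
def cohens_d(pre_mean, post_mean, pre_sd, post_sd):
    pooled_sd = sqrt((pre_sd ** 2 + post_sd ** 2) / 2)
    return (post_mean - pre_mean) / pooled_sd

# The same 20-point raw gain yields a larger d when the scores are
# compressed near the ceiling (smaller standard deviations):
print(cohens_d(50, 70, 20, 20))   # wide spread: d = 1.0
print(cohens_d(70, 90, 10, 6))    # compressed near the ceiling: larger d
```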
In addition to reporting Cohen's d, researchers should include descriptive statistics to allow scholars to use their work in subsequent studies and meta-analyses. These descriptive statistics should include means, standard deviations, and sample sizes for each measure used, and correlations between the measures. We include correlations on this list because the dependent nature of CI pre-post testing is not taken into account by the change metrics we have presented in this paper. As discussed in the background section, this correlation (i.e., linking) results in a shared error component that can exaggerate the size of the difference.
While it is not a common practice in education research, there are effect sizes and statistical methods that can account for the dependence of pre-post tests in published data when the correlations are reported.
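One such approach can be sketched from summary statistics alone: when the pre-post correlation r is reported, the standard deviation of the change scores can be recovered and used as the standardizer for a paired effect size. The function name and the numbers below are illustrative, not a specific published estimator:

```python
from math import sqrt

# Effect size for dependent (paired) pre-post measurements, recovered from
# summary statistics. Hypothetical values; the correlation r must be reported.
def d_dependent(pre_mean, post_mean, pre_sd, post_sd, r):
    # SD of the change scores: sqrt(s1^2 + s2^2 - 2*r*s1*s2)
    sd_diff = sqrt(pre_sd ** 2 + post_sd ** 2 - 2 * r * pre_sd * post_sd)
    return (post_mean - pre_mean) / sd_diff

# The stronger the pre-post linking, the smaller the SD of the differences,
# and the larger the paired effect size for the same mean change:
print(d_dependent(50, 70, 20, 20, 0.0))
print(d_dependent(50, 70, 20, 20, 0.6))
```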
The bias of g is also an issue for researchers who wish to measure the impact of interventions on student learning. The efficacy of interventions ranging from curricular designs to classroom technologies has been evaluated and scaled up based on measures of student learning. For these investigations, it is important to have a measure of student learning that is not excessively dependent on the knowledge that students bring to a class. By using the pooled standard deviation, rather than the maximum possible gain as defined by the pretest, as a standardization coefficient, d avoids the bias toward higher pretest means while accounting for instrument-specific difficulty of improving a raw score. We recommend researchers use d rather than g for measuring student learning. Besides being the more reliable statistical method for calculating student learning, the use of d by the DBER community would align with the practices of the larger education research community, facilitating more cross-disciplinary conversations and collaborations.