Missing data and bias in physics education research: A case for using multiple imputation

Physics education researchers (PER) commonly use complete-case analysis to address missing data. For complete-case analysis, researchers discard all data from any student who is missing any data. Despite its frequent use, no PER article we reviewed that used complete-case analysis provided evidence that the data met the assumption of missing completely at random (MCAR) necessary to ensure accurate results. Not meeting this assumption raises the possibility that prior studies have reported biased results with inflated gains that may obscure differences across courses. To test this possibility, we compared the accuracy of complete-case analysis and multiple imputation (MI) using simulated data. We simulated the data based on prior studies such that students who earned higher grades participated at higher rates, which made the data missing at random (MAR). PER studies seldom use MI, but MI uses all available data, has less stringent assumptions, and is more accurate and more statistically powerful than complete-case analysis. Results indicated that complete-case analysis introduced more bias than MI and this bias was large enough to obscure differences between student populations or between courses. We recommend that the PER community adopt the use of MI for handling missing data to improve the accuracy in research studies.


I. INTRODUCTION
Physics education research (PER) commonly handles missing data by using complete-case analysis (a.k.a. listwise deletion, casewise deletion, and matched data) [1,2]. Complete-case analysis removes any individuals who are missing any data from the analysis. This method is advantageous because it is easy to carry out, but it has the drawbacks of lowering statistical power and potentially biasing the results [3-6].
Complete-case analysis produces reliable results so long as the missing data is missing completely at random (MCAR) [3]. For MCAR, the missingness is completely independent of any observed or missing data [7]. So long as the data meets the MCAR assumption, complete-case analysis will not result in biased estimates; it will, however, lose statistical power due to discarding partial student data. If the data is not MCAR, then complete-case analysis may lead to biased findings. We are not aware of any studies in PER that have explicitly tested the MCAR assumption. Van Ness et al. [8] and [9] provide examples of these tests in epidemiology and health research. The few studies that have explicitly compared participants and non-participants using course grades [2,10-12] all indicate that students with higher course grades are more likely to provide complete data. These results indicate that data in PER studies that use concept inventories or attitudes surveys (e.g., the Force Concept Inventory [13] or Colorado Learning Attitudes about Science Survey [14]) do not meet the MCAR assumption because students with higher course grades tend to do better on these instruments [2]. Therefore, as illustrated in Fig. 1, the distribution of the collected data and the missing data likely differ. This difference may create biased results. For example, on concept inventories the mean scores will be higher if the data mostly comes from students who earned As and Bs than if it comes from all of the students.
As participation rates drop, the skew in representation toward students who receive higher grades typically increases [2]. This increased skew in participation tends to raise the size of the difference between the collected and missing data, leading to a greater likelihood of bias in any subsequent analyses. We are not aware of any studies in PER that have investigated this potential bias, how large this bias may be, nor what impact it could have on understanding student learning in college physics courses.
Multiple imputation (MI) [15] provides a consistently superior alternative to complete-case analysis. Research shows that MI has greater statistical power and less biased results than complete-case analysis [3,5,16,17]. This superior performance results from MI not relying on the assumption that the data is MCAR and from MI using all of the available data to build accurate and reliable models. A search of the Sage journals for the term 'multiple imputation' during the preparation of this manuscript indicated that education researchers use MI, as the search identified 2,876 research articles on education that referenced MI. A similar search of the Physical Review database for the term 'multiple imputation' identified only four studies in PER that referenced the term. Of these four studies, only two used MI [1,18], and we know of only one other article in the PER literature that used MI [2].

II. RESEARCH QUESTION
In this article, we compare the bias introduced by using either complete-case analysis or MI to analyze concept inventory data with participation skewed toward higher performing students. We designed the study to cover a broad range of variables we identified as pertinent to concept inventory data. The results inform how likely complete-case analysis is to bias results in the PER literature and the possible size of those biases. By comparing complete-case analysis and MI, we hope to raise awareness in the PER and discipline-based education research communities about methods for handling missing data in quantitative studies.
To compare the accuracy of complete-case analysis and MI, we examined the following research question:
• When controlling for the relationships between grade, concept inventory scores, grade distributions in a course, and participation rates, to what extent do complete-case analysis and MI produce biased results for posttest scores?
If the results indicate that complete-case analysis provides inaccurate results compared to MI, these results could motivate researchers to use MI in their studies. The results could also provide reviewers and editors with a resource to push against the use of complete-case analysis and to push for improved reporting and transparency about data collection and analysis in future studies.

A. Missing data in PER studies
To inform the common research practices around reporting and handling missing data, we reviewed the published literature in the American Journal of Physics and in Physical Review - Physics Education Research. We identified 28 studies that reported pretest and posttest scores for concept inventories in introductory physics courses. We did not include studies that used either pretest or posttest scores but did not report descriptive statistics for student performance. Of these 28 studies, seven provided adequate descriptive statistics to calculate the participation rates and one [19] stated the range of participation rates across the courses sampled in the study, as shown in Table I. The participation rates ranged from a low of 30% to a high of 80%.
Twenty-three of the studies we reviewed used complete-case analysis. For studies that did not report how they handled missing data, we inferred from the matched number of pretests and posttests that the researchers used complete-case analysis. Five studies calculated descriptive statistics using all available data. These 28 studies do not include the three studies in PER that used MI, which we discussed earlier. We excluded these three articles from the 28 studies that we reviewed because two of them (Nissen et al. [1] and Dou et al. [18]) did not report pretest and posttest scores on concept inventories, and we discuss the third article [2] below.
Only three of the eight studies that reported participation rates, shown in Table I, provided average grade data for the participants and non-participants. All three studies disaggregated the data by gender. The participants in these three studies had much higher grades than the students who did not participate in the study, with a B- on average for participants and a C on average for non-participants. These differences in grades indicate that the missing data in these studies does not meet the assumption of MCAR required for complete-case analysis. The underrepresentation of low-performing students raises the possibility that the results reported in these studies were positively biased.
Nissen et al. [2] investigated the differences in performance and participation on paper and computer-based assessments. They modeled the participation rates of 1,310 students in 25 sections of 3 different introductory physics courses. Results indicated that students with lower grades participated at much lower rates on both computer-based and paper-based assessments as shown in Figure 2. Their model accounted for four different practices that instructors used to motivate students to participate in the computer-based assessments. The differences in participation across student grades existed no matter what practices instructors used to motivate their students to participate.
Higher participation rates for higher achieving students occurred in all of the studies that we reviewed that reported information on participation and indicate that concept inventory data is not MCAR. This consistent failure to meet the assumptions necessary for complete-case analysis to produce accurate results, combined with the almost exclusive use of complete-case analysis, raises the possibility that results in PER studies that use pre-post concept inventories are positively biased to varying extents.
FIG. 2. Participation rates for computer-based tests (CBT) and paper-and-pencil tests (PPT) from Nissen et al. [2]. Participation on the PPT pretest is not shown because it closely clustered around 100% for all grades. Recommended practices measured four actions instructors could take to motivate students to participate in the CBTs.

B. Types of missing data
The statistical methods underlying complete-case analysis assume the data is MCAR. MI makes no explicit assumption about the missingness of the data; however, many software packages' implementations of MI assume missing at random (MAR) data. Rubin [7] coined three terms to classify the relationships between the mechanisms of the missingness and the missing and observed values themselves.
• Missing completely at random (MCAR): All of the cases have the same probability of being missing. There is no relationship between the probability of a case being missing and any values in the dataset. This assumption can be partially tested [23].
• Missing at random (MAR): The missingness is independent of the value of the missing data but is conditionally dependent on other observed variables that can explain all of the missingness. For example, a researcher has blood pressure, age, and cardiovascular disease data. They are concerned that the blood pressure data is not missing at random because older people with cardiovascular disease are more likely to report their blood pressure than young healthy people. So long as the age and cardiovascular disease data can explain the missingness in the data then the data is MAR.
• Missing not at random (MNAR): The missingness depends on both the observed and unobserved data. For example, wealthy and poor people may choose not to report their income for fear of being stigmatized due to their income. Since the reported variable is related to the likelihood of reporting and no other variable can explain the missingness, the data is MNAR.
In real world data, the boundary between MAR and MNAR cannot be firmly established because doing so requires observing the unobserved data. Instead, researchers must make reasonable arguments to evaluate the mechanism of missingness. Simulation studies like the one we present in this manuscript allow researchers to build models with data that is known to be missing based on one of the three missingness classifications.
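The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the course, the score model (40 + 10 × grade points plus noise), and every response probability are hypothetical values chosen for the example, not quantities from this study. It draws a complete set of scores and then compares the mean of the responding students under each mechanism.

```python
import random
from statistics import mean

random.seed(1)

# Hypothetical course: grade points (0 = F ... 4 = A) and a score that rises with grade.
grades = [random.choice([0, 1, 2, 3, 4]) for _ in range(5000)]
scores = [40 + 10 * g + random.gauss(0, 10) for g in grades]

def observed_mean(keep_prob):
    """Mean score of the students who respond, given each student's response probability."""
    kept = [s for s, p in zip(scores, keep_prob) if random.random() < p]
    return mean(kept)

# MCAR: every student has the same chance of responding.
mcar = observed_mean([0.5] * len(scores))

# MAR: the chance of responding depends only on the observed grade.
mar = observed_mean([0.1 + 0.2 * g for g in grades])

# MNAR: the chance of responding depends on the score itself.
mnar = observed_mean([min(max(s / 100, 0.0), 1.0) for s in scores])

print(f"true mean {mean(scores):.1f}")
print(f"MCAR {mcar:.1f}  MAR {mar:.1f}  MNAR {mnar:.1f}")
```

Under these assumptions only the MCAR respondents reproduce the true mean. The MAR and MNAR respondents both overestimate it, but in the MAR case the grade variable carries the information needed to correct the estimate.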
Bhaskaran and Smeeth [24] provide a brief article explaining MAR. They argue [24, p. 1337], "... the terminology describing missingness mechanisms is undeniably confusing. In particular, 'missing at random' is often conflated with 'missing completely at random', leading researchers to mistakenly conclude that any systematic patterns or mechanisms underlying the missing data contraindicate the use of multiple imputation." We adapted the following scenario from Bhaskaran and Smeeth's article to present MAR in a common context for PER. Their article provides a more thorough discussion of MAR.
We present the following scenario as an example of MAR. A research team collected concept inventory data, but they are concerned that the data is MNAR because the students who participated had much higher grades than the students who did not participate. Fig. 1 illustrates this scenario. The researchers can use the grade data to argue that the data is MAR because the missingness in the concept inventory data can largely be explained by the students' grades, as illustrated by Fig. 3. In the case of MAR data, splitting the data in Fig. 1 by grade results in Fig. 3 and shows similar distributions between collected and missing data for each grade. The distribution of missing data for the A students looks similar to the complete data for the A students and so on for each group of students. The researchers can argue that within each group of students (A, B, C, D, and F) the primary factors related to their participation were not related to their performance (e.g., traffic, illness, or a death in the family) and the groups with lower participation had more of these unrelated events overall. The difference in the aggregated data, Fig. 1, resulted from the difference in the proportion of students that participated for each grade, which is illustrated by the height of the histograms in Fig. 3.

C. The persistence of complete-case analysis
Despite the known and proven bias caused by ignoring missing data when it is not MCAR, many research fields continue to use complete-case analysis. Cheema [5] points out that complete-case analysis and other error-prone methods for handling missing data are common in education research. King et al. [25] found that 94% of political scientists used complete-case analysis, resulting in losing one third of their data on average. In biomedical research, few studies accurately report the amount of missing data or how they handled it, and those that do most commonly report using complete-case analysis [26-29]. The work in biomedical research indicates that researchers can consistently critique the use of complete-case analysis with little improvement in a field's practices.

D. Imputation of missing data
Imputation is a principled technique for handling missing data [4] that PER seldom uses. Imputation fills in the missing data with plausible values, such that a researcher can analyze the now complete data set without concern for missing data. Imputation methods fall into two broad categories: deterministic and probabilistic. We focus on probabilistic imputation methods in this article, but provide a brief review of deterministic methods for contrast.
Deterministic options for imputation include mean imputation and last observation carried forward. Mean imputation replaces the missing values with the mean value for that variable. Researchers use last observation carried forward with longitudinal data to replace the missing data with the last observed value for all subsequent measurements. Both are problematic because they (1) do not preserve the relationships between variables and (2) as with any single imputation approach, do not account for the error incurred by the imputation process itself. These deterministic methods treat the missing values as if they were known, which can lead to inappropriately small variances and an erroneously increased chance of statistically significant findings [30].
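A short numerical illustration of the shrinking variance follows. It is a sketch with hypothetical data (1,000 scores drawn from a normal distribution, 400 of which are treated as missing), not an analysis from this study.

```python
import random
from statistics import mean, stdev

random.seed(2)

scores = [random.gauss(60, 15) for _ in range(1000)]   # hypothetical complete data
observed = scores[:600]                                # pretend the last 400 are missing

# Mean imputation: every missing value is replaced by the observed mean.
fill = mean(observed)
imputed = observed + [fill] * (len(scores) - len(observed))

print(f"SD of observed scores:    {stdev(observed):.1f}")
print(f"SD after mean imputation: {stdev(imputed):.1f}")
```

The mean is unchanged, but the standard deviation drops because 400 identical values were added. Any standard error computed from the filled-in data set is therefore too small.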
In this article, we demonstrate the use of multiple imputation (MI) [4] because it is a probabilistic approach that is broadly applicable to addressing missing data across a wide range of applications [3] and because research finds that MI is more statistically powerful and more accurate than other methods for handling missing data [5,17]. The idea behind MI is graphically presented in Fig. 4. The first step applies an imputation procedure containing a random component (such as predictive mean matching, which is described below) to a dataset with missing data M times to generate different imputed values for each piece of missing data and generate M complete data sets.
Step two calculates the desired estimate from the analysis, such as a mean or regression coefficient, on each data set separately using standard analytical methods. The final step pools the estimates using simple combining rules, also known as Rubin's Rules [31], which are described later in equations 1-5. These pooled results then properly reflect the variation in the original estimates and the variation introduced by the imputation process itself.
The plausibility of the imputed values generated in the first step relies entirely on the model used for the imputation. Simplistic imputation models that do not use information contained in related variables will impute values that are not an accurate reflection of what the missing data could have been. For example, imputation models need to account for whether the data is longitudinal or if there is reason to suspect the data is MNAR, and the models need to include known correlations and relationships between variables or measures. In short, MI is only as good as the imputation model being used to create the imputed values.
Many software programs have built-in or add-on methods to perform both the imputation and pooling steps of MI. In this paper we used the MICE [32] package in RStudio V. 1.1.456 [33]. The MICE package uses predictive mean matching as the default model to impute missing data for continuous variables. Predictive mean matching uses the following process [34].
1. Using the portion of the data with no missing values, build a linear model ($b$) by calculating the least squares estimates of the regression coefficients $\hat{\beta}$, the model residuals $\hat{\epsilon}$, and the variance of the residuals $\hat{\sigma}$.
2. Create a new linear model ($b^{(m)}$) by randomly drawing values for the regression coefficients from a probability distribution centered on $\hat{\beta}$ with variance derived from $\hat{\sigma}$ and $\hat{\epsilon}$.
3. Use $b$ to generate predictions $\hat{y}_i$ for all cases with fully observed data, and $b^{(m)}$ to generate predicted values $\hat{y}^*_i$ for the cases with missing data.
4. For each case with missing data, identify a set of k predictions ($\hat{y}_i$) that are close to the predicted value for that record, $\hat{y}^*_i$. This creates a donor pool of values, where k should vary between 3 and 10 depending on the size of the complete data set. The MICE package uses k=5.
5. Randomly choose one value ($\hat{y}_i$) from the donor pool to impute the missing value.
6. Repeat steps 2-5 for each of the M imputations.
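The six steps above can be sketched for a single predictor as follows. This is a simplified illustration and not the MICE implementation: the function name `pmm_impute`, the centring of the predictor (which lets the intercept and slope be drawn independently), the chi-square draw for the residual variance, and the random tie-breaking among equally close donors are all assumptions made to keep the example short. The value imputed in step 5 is the observed score of the chosen donor. The example data are hypothetical.

```python
import random
from statistics import mean

def pmm_impute(x, y, k=5, rng=random):
    """One predictive-mean-matching imputation of y (None = missing) from predictor x."""
    obs = [i for i, v in enumerate(y) if v is not None]
    mis = [i for i, v in enumerate(y) if v is None]
    n = len(obs)

    # Step 1: least-squares fit on the complete cases (x is centred).
    xbar = mean(x[i] for i in obs)
    ybar = mean(y[i] for i in obs)
    sxx = sum((x[i] - xbar) ** 2 for i in obs)
    slope = sum((x[i] - xbar) * (y[i] - ybar) for i in obs) / sxx
    sse = sum((y[i] - ybar - slope * (x[i] - xbar)) ** 2 for i in obs)

    # Step 2: draw a perturbed model b^(m) around the fitted one.
    sigma2 = sse / rng.gammavariate((n - 2) / 2, 2)   # SSE / chi-square(n - 2)
    icpt_m = rng.gauss(ybar, (sigma2 / n) ** 0.5)
    slope_m = rng.gauss(slope, (sigma2 / sxx) ** 0.5)

    # Step 3: predictions from b for observed cases, from b^(m) for missing cases.
    pred_obs = {i: ybar + slope * (x[i] - xbar) for i in obs}
    completed = list(y)
    for j in mis:
        pred_j = icpt_m + slope_m * (x[j] - xbar)
        # Step 4: donor pool = the k observed cases with the closest predictions
        # (ties are broken at random).
        donors = sorted(obs, key=lambda i: (abs(pred_obs[i] - pred_j), rng.random()))[:k]
        # Step 5: impute the observed value of one randomly chosen donor.
        completed[j] = y[rng.choice(donors)]
    return completed

# Hypothetical data: scores rise with grade, and low grades are missing more often (MAR).
random.seed(3)
grade = [random.choice([0, 1, 2, 3, 4]) for _ in range(300)]
score = [40 + 10 * g + random.gauss(0, 10) for g in grade]
score_mis = [s if random.random() < 0.1 + 0.2 * g else None for g, s in zip(grade, score)]

M = 10
imputations = [pmm_impute(grade, score_mis) for _ in range(M)]   # step 6

complete_case = mean(s for s in score_mis if s is not None)
mi_estimate = mean(mean(imp) for imp in imputations)
print(f"true {mean(score):.1f}  complete-case {complete_case:.1f}  MI {mi_estimate:.1f}")
```

Because the imputed values are always copied from observed cases, predictive mean matching cannot produce scores outside the observed range, which is one reason it is a convenient default for bounded test scores.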
Following analysis of each complete dataset, researchers, with the aid of statistical software, pool the individual results from across the M imputations using Rubin's Rules to generate valid estimates and intervals of the quantities of interest. To explain Rubin's Rules, let $\delta$ be the parameter whose estimate we desire to obtain from an analysis (e.g., a mean, correlation, or regression slope). Given M imputed data sets, M estimates of $\delta$, $(\delta_1, \delta_2, \ldots, \delta_M)$, are generated and used to calculate the following quantities.
• The overall estimate of the parameter is the average of the individual point estimates.
• The within-imputation variance is the average of the individual variances.
• The between-imputation variance is the variance of the estimates.
• The total variance is a weighted sum of the within- and between-imputation variances.
• And, 95% intervals are calculated using the total variance.
The resulting variance of the combined estimate then accounts for both the within and between data set variances. The predictive mean matching process incorporates randomness in steps 2 and 5. The amount of variance introduced in these steps depends on the variability and size of the data being modeled. If the linear regression in step 1 provides an excellent fit with small standard errors for the coefficients, then step 2 will add little variability. Similarly, step 5 adds little variability if the data set is large because a large number of similar values will be available to choose from. By pooling the within and between imputation variances, Rubin's Rules provides standard errors for the estimates based on all of the available information that account for the uncertainty introduced by the missing data.
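As a concrete illustration of the pooling quantities listed above, the following sketch implements the standard textbook form of Rubin's Rules. The function name `pool_rubin`, the variable names, and the three example estimates are hypothetical. The total variance uses the usual weighting, within + (1 + 1/M) × between, and the degrees of freedom follow Rubin's original approximation; the notation may differ from the equations referenced in this article.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool M point estimates and their squared standard errors with Rubin's Rules."""
    m = len(estimates)
    pooled = mean(estimates)                  # overall estimate
    within = mean(variances)                  # within-imputation variance
    between = variance(estimates)             # between-imputation variance (divisor M - 1)
    total = within + (1 + 1 / m) * between    # total variance
    # Degrees of freedom for the t reference distribution of the pooled estimate.
    if between == 0:
        df = math.inf
    else:
        df = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    return pooled, within, between, total, df

# Hypothetical posttest means (in %) and squared standard errors from M = 3 imputations.
pooled, within, between, total, df = pool_rubin([61.0, 62.0, 63.0], [1.0, 1.2, 0.8])
print(f"estimate {pooled:.1f}, standard error {math.sqrt(total):.2f}, df {df:.1f}")
```

A 95% interval is then the pooled estimate plus or minus the appropriate t quantile (for the computed degrees of freedom) times the square root of the total variance. The t quantile is left out here because the Python standard library does not provide one.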

E. Comparisons of methods for handling missing data in education research
Pampaka et al. [16] compared complete-case analysis to MI for handling missing data using a dataset that originally had large portions of missing data that they were able to fill in with subsequent data collection. This design allowed them to compare the results for MI and complete-case analysis of the missing data to the true values for the dataset with no missing data. The total dataset included 1,374 students, but complete-case analysis reduced the data to 495 students. Students who received an A were three times more likely to provide data than students who received a C, indicating that the data was not MCAR. Both the complete case and MI models provided similar relationships between the variables to those in the true models. However, MI produced smaller standard errors than complete-case analysis. They concluded that MI provided a much closer approximation of the true values than complete-case analysis.
Cheema [5] used a simulation study and two real datasets to provide guidance for researchers in designing studies to account for sample size, proportion of missing data, method of analysis, and method for handling missing data. The analysis compared four methods for handling missing data: multiple imputation, complete-case analysis, mean imputation and maximum likelihood estimation. Cheema compared the four analytical methods across three sample sizes and two levels of missingness. The two levels of missing data were 1% to 10% and 11% to 20%; very few studies in the PER literature report such low levels of missing data. This design created a decision tree with 24 possibilities. Multiple imputation was the most effective method in 15 cases and maximum likelihood estimation in 7 cases. Similar to Pampaka et al. [16], Cheema found that imputation methods increased the statistical power of the studies with samples less than 200 by large enough amounts to warrant the use of imputation methods.
These two studies illustrate how MI tends to have greater statistical power than complete-case analysis. The trend toward greater statistical power for MI follows from MI using all of the available data and not discarding any data. These studies did not identify bias in the results from either complete-case analysis or MI.

IV. METHODS
We compared the accuracy of estimates from MI and complete-case analysis using simulated course data for grades and pretest and posttest concept inventory scores. Our analysis focused on course level mean posttest scores as the estimate of interest ($\mu_{\text{post}}$). While we focused on posttest means, we also analyzed mean pretest scores ($\mu_{\text{pre}}$) because many effect sizes and analytical methods use both pretest and posttest scores. Data simulation included a random component that allowed us to generate complete data, create missing values, and calculate $\mu$ many times to generate a distribution of $\mu$ values. Running the analyses many times informed how consistently the measures and methods for handling missing data performed.
Using data from STEM courses (sources detailed below), we identified typical grade distributions and performance models for the relationships between course grades and concept inventory scores. We simulated student level data for grades and concept inventory scores using these models, which served as the true values ($\mu$) in this paper. After introducing missing data using participation models based on prior research [2], we calculated estimates ($\hat{\mu}$) using complete-case analysis and MI. This design allowed us to assess the effect of the simulation model parameters and the method of handling missing data on the accuracy of the estimates.

A. Simulating the complete data to generate true results
We simulated the course data by simulating data for each of the five course grade subsets (A, B, C, D, and F) and then combining the five subsets into a single dataset. To generate the concept inventory scores, we used a truncated normal distribution, which limited the scores to between 0% and 100%. The normal distribution required inputs for mean ($\mu$), standard deviation ($\sigma$), and sample size ($N$). The mean for each grade came from five performance models based on three physics courses investigated by Nissen et al. [2]. The standard deviation came from a model of the relationship between the mean and standard deviation for 197 concept inventories. The sample size for each grade subset came from the total course size and three grade distributions we developed based on the grade distributions from 192 STEM courses. We used the five performance models and three grade distributions to cover a range of relationships that could occur in PER studies.

Determining means using the relationships between concept inventory scores and course grades
To generate realistic concept inventory scores, we examined the relationship between course grade and concept inventory scores using data from Nissen et al. [2]. We disaggregated the students in each course by their course grade and calculated the mean concept inventory score for each group of students in each course. We transformed the grades to the numeric values, A=4, B=3, C=2, D=1, and F=0, that the institution used to calculate student grade point average (GPA). Fig. 5 presents the means for each course grade and linear regression fit lines for the pretests and posttests for the three courses. Table II includes the intercept, slope, and $r^2$ for each linear regression. Based on the scatter plots in Fig. 5 and the $r^2$ value exceeding 0.5 for 5 of the 6 models, we concluded that a linear model adequately described the relationship between mean concept inventory scores and course grades.
The mean concept inventory scores represented the average value for each grade about which the models simulated the individual scores. To cover a broad range of performance levels, we built models for five different performance levels that were informed by the linear models from the three courses studied by Nissen et al. [2]. The models differ from the results in Table II because our goal was to cover a broad range of possible relationships rather than to replicate the relationships that we found. Table III contains the model parameters for the pretest model and the five posttest models. Only one model generated pretest scores for all courses and is shown in Equation 6. We started with an average model and modified it to create two high-performance models and two low-performance models by varying either the slope or the intercept in the model. The intercept established the mean concept inventory score for the subgroup that earned an F. The slope established the size of the difference between each grade. These five models covered a range of relationships to inform how varying the slope and intercept related to the bias introduced by using MI or complete-case analysis and to provide more robust and generalizable results.
We used 197 means and standard deviations from either pretests or posttests to build a quadratic model for the relationship between mean and standard deviation. This data came from both the literature and concept inventories collected with the LASSO platform [35]. A quadratic model fit the data because the standard deviation should approach 0 at both of the boundaries of the test scores (0% and 100%). Equation 7 describes the fit line. We determined that the quadratic fit line was adequate because the adjusted $r^2$ for the fit line was 0.34, all coefficients were statistically significant with p < 0.001, and visualizations indicated that the quadratic fit line was appropriate.
$$\sigma = 16.6 + 14.6\,\mu - 32.2\,\mu^{2} \qquad (7)$$

Determining sample size based on grade distributions in STEM courses
We simulated courses with approximately 1,000 students total. While this size is larger than typical courses, it allowed us to use fewer replications at the course level simulations to quantify any bias introduced by MI or complete-case analysis. To determine the number of students that earned each grade in our simulated courses, we analyzed grade distributions from 192 STEM courses at California State University, Chico to build three different grade distributions: low, average, and high. We combined the drop, withdraw, and fail students into a single F group. To build the low-grade distribution, we averaged the grade distributions from 13 courses with less than 10% As and greater than 30% Fs. We built the average grade distribution by averaging all 192 grade distributions. To build the high-grade distribution, we averaged the grade distributions from 6 courses with greater than 20% As and greater than 20% Bs. Fig. 6 shows the three grade distributions. We reasoned that these three distributions covered the range of grade distributions found in most STEM courses.

Simulated course data
The five performance models and three grade distributions created a total of 15 different simulated courses. For each of these 15 courses, we simulated 20 datasets (replications) with approximately 1,000 students each. This process resulted in 300 different datasets. Fig. 7 provides an example of data generated for one course using the high slope model with an intercept of 43 and a slope of 10 for the posttest scores and an average grade distribution. For the high slope model, each grade higher meant that the average posttest concept inventory score increased by 10 percentage points. Students with F grades had a 43% posttest score on the concept inventory on average and this rose to 53% for Ds, 63% for Cs, 73% for Bs, and 83% for As. The diamonds in Fig. 7 represent the mean test scores for the subgroups and illustrate the linear relationship between grade and both pretest and posttest means. The density plots for the pretests (top of Fig. 7) and posttests (right of Fig. 7) illustrate the variance of the generated scores about the means. The density plots for posttest scores covered a larger range of means and illustrate how the quadratic equation for standard deviation concentrated the scores into a narrower range as the mean score neared 100%. Table IV provides the true average values for the complete data for pretest and posttest means and the absolute gain across the simulated courses.
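The posttest simulation for one course can be sketched as follows. This is a simplified illustration rather than the code used in the study: the function names are hypothetical, the grade counts are made-up values (the actual distributions appear in Fig. 6), the truncated normal is drawn by simple rejection sampling, and Eq. 7 is applied under the assumption that the mean enters as a proportion and the standard deviation comes out in percentage points.

```python
import random
from statistics import mean

GRADE_POINTS = {"F": 0, "D": 1, "C": 2, "B": 3, "A": 4}

def sd_from_mean(mu):
    """Eq. 7, assuming mu is a proportion and the result is in percentage points."""
    return 16.6 + 14.6 * mu - 32.2 * mu ** 2

def simulate_grade(mean_pct, n, rng=random):
    """n scores from a normal distribution truncated to [0, 100] by rejection sampling."""
    sd = sd_from_mean(mean_pct / 100)
    scores = []
    while len(scores) < n:
        s = rng.gauss(mean_pct, sd)
        if 0 <= s <= 100:
            scores.append(s)
    return scores

def simulate_course(intercept, slope, counts, rng=random):
    """Return (grade, posttest score) pairs for one simulated course."""
    course = []
    for grade, n in counts.items():
        mean_pct = intercept + slope * GRADE_POINTS[grade]
        course += [(grade, s) for s in simulate_grade(mean_pct, n, rng)]
    return course

random.seed(4)
# Hypothetical grade counts summing to 1,000 students.
counts = {"A": 250, "B": 300, "C": 250, "D": 100, "F": 100}
course = simulate_course(intercept=43, slope=10, counts=counts)   # high slope model
for g in "ABCDF":
    print(g, round(mean(s for gg, s in course if gg == g), 1))
```

With these inputs the standard deviation shrinks from roughly 17 percentage points for the F group to under 7 for the A group, which mirrors the narrowing of the density plots described above as the mean nears 100%.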

B. Models for missing data
We used the participation models for posttests from Nissen et al. [2] to create five levels of missing at random data based on course grades in the simulated posttest data for each of the 15 simulated courses described in Table IV. Fig. 2 shows the five models for missing data with the value for 'recommended practice' distinguishing between the five models. Within each grade, we randomly deleted posttest scores based on the participation model. As an example, for participation model 2 (i.e., recommended practice = 2) we deleted 96% of posttest scores for Fs, 83% for Ds, 51% for Cs, 18% for Bs, and 4% for As. The randomization for deleting the data was done independently for each of the five participation models and each of the 20 simulated classes within the 15 simulated courses described in Table IV. Removing data for posttest scores represents a typical dropout situation and had limited impact on the complete-case analysis because complete-case analysis removes both pretest and posttest scores when either is missing. These methods for generating missing data provided participation rates, the percentage of students who took both the pretest and posttest, that covered the range of 30% to 80% reported in the literature and presented in Table I.
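The deletion step for participation model 2 can be illustrated with the deletion rates quoted above. Everything else in this sketch is a hypothetical stand-in: the grade counts, the pretest model (30 + 5 × grade points), and the untruncated normal noise with a standard deviation of 10 are simplifications chosen for brevity. Only the posttest means follow the high slope model.

```python
import random
from statistics import mean

GRADE_POINTS = {"F": 0, "D": 1, "C": 2, "B": 3, "A": 4}
# Share of posttest scores deleted per grade under participation model 2 (from the text).
DELETE_RATE = {"F": 0.96, "D": 0.83, "C": 0.51, "B": 0.18, "A": 0.04}

def delete_posttests(course, rng=random):
    """Set the posttest to None with the grade-specific probability (MAR given grade)."""
    return [(g, pre, None if rng.random() < DELETE_RATE[g] else post)
            for g, pre, post in course]

def complete_case_means(course):
    """Pretest and posttest means over students who have both scores."""
    kept = [(pre, post) for _, pre, post in course if post is not None]
    return mean(p for p, _ in kept), mean(q for _, q in kept)

random.seed(5)
counts = {"A": 250, "B": 300, "C": 250, "D": 100, "F": 100}   # hypothetical counts
course = []
for g, n in counts.items():
    gp = GRADE_POINTS[g]
    for _ in range(n):
        pre = 30 + 5 * gp + random.gauss(0, 10)     # hypothetical pretest model
        post = 43 + 10 * gp + random.gauss(0, 10)   # high slope posttest means
        course.append((g, pre, post))

true_pre = mean(pre for _, pre, _ in course)
true_post = mean(post for _, _, post in course)
observed = delete_posttests(course)
cc_pre, cc_post = complete_case_means(observed)
participation = sum(post is not None for _, _, post in observed) / len(observed)

print(f"participation rate {participation:.0%}")
print(f"pretest:  true {true_pre:.1f}, complete-case {cc_pre:.1f}")
print(f"posttest: true {true_post:.1f}, complete-case {cc_post:.1f}")
```

Although only posttest scores are deleted, the complete-case pretest mean also shifts upward because the pretest scores of non-participants are discarded with them.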

C. Measuring accuracy using bias
To assess the extent to which complete-case analysis and MI provided biased estimates for posttest scores, we measured the accuracy of the results using bias. We calculated bias as the average difference between the mean estimated from the data with missing values, μ̂ (for both complete-case analysis and MI), and the true posttest mean, μ, as shown in Eq. 8. In Eq. 8, n represents the number of replicated courses, which we set at 20 for each of the 15 simulated courses. A bias greater than zero indicated that the estimates were larger than the true values.
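Eq. 8 itself does not appear in this section. A form consistent with the verbal definition above, writing $\hat{\mu}_i$ for the estimated posttest mean in replication $i$ (the index $i$ is notation assumed here, not taken from the source), would be:

```latex
\mathrm{bias} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{\mu}_i - \mu \right)
```

With $n = 20$ replications per simulated course, a positive value means the method overestimated the true posttest mean on average.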

V. RESULTS
We first present the bias in the pretest scores across the three grade distributions. Second, we present the bias in the posttest scores for the 15 simulated courses. Last, we present a comparison of two simulated courses to illustrate the potential impact of the bias introduced by complete-case analysis and MI on research results.
We used the same model of the relationship between grade and test scores to simulate the pretest data for all five of the performance models because we expected the bias for the estimates of pretest scores to be smaller than that for the posttest scores. Fig. 8 presents the pretest bias introduced by complete-case analysis. The participation models only inserted missing data in the posttests. Complete-case analysis created missing pretest data by discarding the pretest scores from students who did not participate in the posttest. MI discards no data, and there were no missing pretest scores, so it introduced no bias into the analysis of the pretest scores. Complete-case analysis introduced small amounts of bias (< 2.2 percentage points) into the course means for all simulated datasets. The bias introduced by complete-case analysis for the pretest tended to increase as the participation rate decreased and tended to be higher for lower grade distributions.
The posttest bias from both complete-case analysis and MI, shown in Fig. 9, tended to be positive; that is, both methods tended to overestimate the true values. Complete-case analysis resulted in more bias than MI. Complete-case analysis always produced positive biases, with a minimum value of 0.7 percentage points and a maximum value of 12.8 percentage points. This bias of 12.8 percentage points meant that complete-case analysis estimated the posttest mean to be 70.2% on average for the simulated course with the high slope model and low grade distribution, while the true average value was 57.4%. In contrast to complete-case analysis, MI produced negative biases for 19 of the 75 measurements, with a minimum value of -0.3 percentage points and a maximum value of 1.9 percentage points. These results indicate that both methods tend to overestimate the true posttest scores, but that the overestimation was much larger for complete-case analysis. This overall trend of larger bias from complete-case analysis than from MI held for each combination of performance model, grade distribution, and participation rate. Even at the lowest level of participation, the MI analysis tended to produce less bias than complete-case analysis did at the highest level of participation, as illustrated by the boundary between the two graphs in Fig. 9.
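The mechanism behind this pattern can be illustrated with a small simulation. The sketch below is not the MI procedure used in the study: it substitutes a simple approximate Bayesian bootstrap hot-deck within each grade, and it assumes equal numbers of students per grade, an unbounded normal score distribution, and a standard deviation of 15 points. All function names are hypothetical. It uses the high slope posttest model and the participation model 2 deletion rates given earlier.

```python
import random
import statistics

GRADES = ["F", "D", "C", "B", "A"]
# Posttest deletion rates for participation model 2 (from the text).
DELETION_RATE = {"F": 0.96, "D": 0.83, "C": 0.51, "B": 0.18, "A": 0.04}

def simulate_course(rng, n_per_grade=200, intercept=43, slope=10, sd=15):
    """Simulate (grade, posttest) pairs; equal grade counts and sd=15 are assumptions."""
    return [(g, rng.gauss(intercept + slope * k, sd))
            for k, g in enumerate(GRADES) for _ in range(n_per_grade)]

def make_missing(course, rng):
    """Delete posttests at grade-specific rates, so missingness is MAR given grade."""
    return [(g, None if rng.random() < DELETION_RATE[g] else y) for g, y in course]

def impute_once(observed, rng):
    """Build one completed dataset by hot-deck imputation within each grade."""
    all_observed = [y for _, y in observed if y is not None]
    completed = []
    for grade in GRADES:
        own = [y for g, y in observed if g == grade and y is not None]
        donors = own or all_observed  # fall back if a grade has no observed scores
        n_missing = sum(1 for g, y in observed if g == grade and y is None)
        # Resample the donors first so that uncertainty about the grade's
        # score distribution carries into the imputed values.
        pool = rng.choices(donors, k=len(donors))
        completed.extend(own)
        completed.extend(rng.choices(pool, k=n_missing))
    return completed

def mi_mean(observed, rng, m=20):
    """Average the posttest mean over m imputed datasets (pooled point estimate)."""
    return statistics.fmean(
        statistics.fmean(impute_once(observed, rng)) for _ in range(m))

rng = random.Random(1)
course = simulate_course(rng)
observed = make_missing(course, rng)

true_mean = statistics.fmean(y for _, y in course)
cc_mean = statistics.fmean(y for _, y in observed if y is not None)
cc_bias = cc_mean - true_mean
mi_bias = mi_mean(observed, rng) - true_mean
print(f"complete-case bias: {cc_bias:+.1f}  MI bias: {mi_bias:+.1f}")
```

Complete-case analysis averages over a sample dominated by A and B students, so its estimate sits well above the true mean. The imputation step conditions on grade, which is the variable that drives the missingness, and so restores the missing low-grade students to the average.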
The bias introduced by both MI and complete-case analysis tended to decrease as the participation rate increased. This trend occurred for complete-case analysis of all 15 of the simulated courses but was less consistent for MI analysis of the simulated courses. These results illustrate the value of maximizing participation rates for achieving accurate estimates of concept inventory means.
Differences in bias for complete-case analysis across the five performance models indicated that varying slope had a stronger impact on bias than varying intercept. As shown in Fig. 9, the largest bias occurred for the high slope simulated courses (long-dashed line with empty squares) and the lowest bias occurred for the low slope simulated courses (dotted line with filled squares). The maximum bias for the high-slope simulated courses was 12.8 percentage points whereas the maximum bias for the high-intercept simulated courses (dashed lines with empty triangles in Fig. 9) was 7.4 percentage points. This difference in bias was not caused by a difference in posttest scores, as the bias was larger in the high-slope simulated courses but the mean posttest score was lower (57.4% for the 12.8 versus 66.6% for the 7.4). Similarly, comparing the low slope and low intercept simulated courses with high grade distributions shows that the maximum bias for the low slope course was lower (0.7 versus 1.2 percentage points) whereas the posttest mean was higher for the low-slope simulated courses (50.7% for 0.7 versus 41.9% for 1.2). These relations indicated that the absolute value of the posttest mean was not the primary factor in the amount of bias introduced by complete-case analysis. Rather, the relationships within the datasets and the total amount of missing data best explained the bias.
[Fig. 10 caption, partial: Performance in both courses was average. The traditional course had a low grade distribution and low participation rates. The transformed course had a high grade distribution and a high participation rate. We did not include error bars to focus on the effects of bias and because they are very small due to the large sample sizes for the simulated data.]
The bias for MI, in contrast to that for complete-case analysis, did not vary consistently with either the performance model or the grade distribution. The much lower overall bias for MI may obscure differences in bias across the simulated courses. However, Fig. 9 shows that the clear differences in bias for complete-case analysis across the simulated courses did not exist for MI.
To compare how the bias introduced by complete-case analysis and MI could skew comparisons, in Fig. 10 we compared two simulated courses that represent a plausible comparison between courses with similar performance but different grade distributions and participation rates. Using the average performance model for both courses simplified comparing the results because the performance of students who earned the same grade was the same across the two courses. We varied the participation and grade distributions between the two courses to align with comparisons between traditional and transformed courses that occur in the PER literature (e.g., Brewe et al. [21]). The traditional course had a low grade distribution and low participation rates; the transformed course had a high grade distribution and a high participation rate. The true values indicated that students in the transformed course learned more conceptual knowledge on average than the students in the traditional course. This difference follows from the higher grade distribution in the transformed course and the same performance model in both courses. The larger gains in the transformed course remained when we analyzed the data with MI. However, complete-case analysis nearly eliminated the difference in gains on the concept inventory. This decrease in the difference between the courses occurred because little data was collected in the traditional course from students with low grades, and thus the analysis positively biased the gain. In contrast to the true results and the results after analysis with MI, the results from the complete-case analysis do not support the claim that students learned substantially more in the transformed course than in the traditional course.

VI. DISCUSSION
Complete-case analysis can introduce large amounts of bias into the estimates for concept inventory scores when researchers apply it to data that is not MCAR. The bias introduced by complete-case analysis in the simulated data ranged from 0.7 to 12.8 percentage points for the posttest means and fell below 2.2 percentage points for the pretest means. The 28 articles we reviewed, which included 158 courses, reported gains from 5% to 56% with an average of 23%. Twenty-three of these studies used complete-case analysis, none reported using a principled method for handling missing data (e.g., MI), and none indicated that the missing data in the study was MCAR. Consequently, our results indicate that part of the gains reported in those studies likely resulted from the improper use of complete-case analysis. In some of those studies, complete-case analysis may have exaggerated the gains by anywhere from one third to a full doubling. The introduced bias may have also skewed any comparisons made in those studies, particularly comparisons across courses with different participation rates.
We cannot say exactly how much of these reported gains resulted from bias introduced by complete-case analysis. Our results indicate that the amount of bias complete-case analysis introduces depends on both the participation rate and the relationships within the data. To determine the bias in prior studies that used complete-case analysis without meeting the assumptions for its reliable use, researchers will need to analyze the data directly. However, physics education researchers seldom publish the data or analytical code used in their studies. The PER community can improve transparency and accountability by supporting researchers in publishing or publicly sharing the datasets from their research. Going forward, sharing data would allow the research community to double-check the impact that the methods for handling missing data have on the conclusions that researchers draw from their data.
The bias introduced by complete-case analysis could obscure differences across courses and undermine both research and evaluation work. For example, we compared a simulated traditional course with a simulated transformed course. The simulated transformed course had lower DWF rates, higher grades, and greater conceptual learning. Bias introduced by using complete-case analysis obscured the differences in conceptual learning between the two simulated courses. In a comparison of real courses, a critic of the transformed course with lower DWF rates could use the similar results from the complete-case analysis of the concept inventory scores to claim the transformed course had lower grading standards, arguing that otherwise the transformed course would have outperformed the other course on the concept inventory. Using MI to account for the missingness in the data introduced less bias into the results and preserved the true result that, overall, students learned more in the transformed course. Researchers and educators need accurate results to inform the design and implementation of research-based teaching materials. If researchers continue to use complete-case analysis without accounting for the impact of missing data, they risk wasting time and resources either discarding useful interventions or pursuing false leads.

VII. CONCLUSION
Researchers, reviewers, and editors can take several steps to improve the handling of missing data in quantitative studies. During the data collection process, researchers should take reasonable actions to minimize the amount of missing data. However, education researchers often cannot avoid some missing data in their studies. Researchers should use multiple imputation or another principled method for handling missing data. Researchers using complete-case analysis should present evidence that their data is MCAR. However, principled methods for handling missing data, such as MI, are not a panacea. Rather, principled methods are only one component of the diligence necessary to address missing data. Before analyzing the data and deciding on an appropriate method for handling the missing data, researchers should examine the amount of missing data, the patterns in the missing and complete data, and the mechanisms behind those patterns. When implementing MI to address missing data, researchers should check that their data meets the assumptions of the MI algorithm. Many MI software packages include tools to check these assumptions. Studies should state the participation rates in their data collection, describe the methods they used to address missing data, discuss patterns in the missing data, and discuss how the missing data may influence analytical results. These steps will improve the quality, reliability, and replicability of quantitative studies on student outcomes in physics.
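One simple way to begin examining patterns in the missing data is to tabulate the share of missing posttest scores within each course grade. The sketch below uses a handful of hypothetical records and a hypothetical helper function; it is an informal descriptive check, not a formal test of the MCAR assumption.

```python
from collections import defaultdict

def missing_rate_by_group(records, group_key, value_key):
    """Return the fraction of records with a missing value within each group."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for record in records:
        totals[record[group_key]] += 1
        if record[value_key] is None:
            missing[record[group_key]] += 1
    return {group: missing[group] / totals[group] for group in totals}

# Hypothetical records; a real analysis would load the collected course data.
records = [
    {"grade": "A", "post": 85.0}, {"grade": "A", "post": 90.0},
    {"grade": "C", "post": 60.0}, {"grade": "C", "post": None},
    {"grade": "F", "post": None}, {"grade": "F", "post": None},
]
rates = missing_rate_by_group(records, "grade", "post")
print(rates)  # {'A': 0.0, 'C': 0.5, 'F': 1.0}
```

Missingness rates that differ markedly across grades, as in this toy example, suggest that the data is unlikely to be MCAR and that complete-case analysis may produce biased estimates.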