Framework for evaluating statistical models in physics education research

Across the field of education research there has been an increased focus on the development, critique, and evaluation of statistical methods and data usage due to recently created, very large datasets and machine learning techniques. In physics education research (PER), this increased focus has recently been shown through the 2019 Physical Review PER Focused Collection examining quantitative methods in PER. Quantitative PER has provided strong arguments for reforming courses by including interactive engagement, demonstrated that students often move away from scientistlike views due to science education, and has injected robust assessment into the physics classroom via concept inventories. The work presented here examines the impact that machine learning may have on physics education research, presents a framework for the entire process including data management, model evaluation, and results communication, and demonstrates the utility of this framework through the analysis of two types of survey data.


I. INTRODUCTION
Influenced by the development of machine learning, the statistics community has had discussions on the importance of a model's predictive capability, in addition, or in contrast, to its explanatory power [1][2][3].This argument has been extended to the often used statistical inference models in social science [4].This conversation has moved into the physics education research (PER) community through a recent critical examination of quantitative methods used in PER [5], which included calls to modernize statistical modeling [6].The framework presented in this paper builds upon that work to produce a method for evaluating statistical studies in PER that is in line with calls for updating current practices in social science fields, such as computational social science and psychology [7].
Traditionally PER has used linear models such as ordinary least squares regression, mixed effects regressions, and logistic regression (e.g., Refs.[8][9][10]).Linear models assume that direct combinations of features represent a direct observation [11].These models focus on determining unbiased estimates of model coefficients in comparison to predictive performance.In the words of Shmueli et al. [3], traditional PER models focus on explanatory power rather than predictive power.Models with poor predictive power, as measured by goodness of fit tests, are sometimes still assumed to be transferable and the results generalizable if model parameters are considered statistically significant (e.g., demonstrating via p values and effect sizes)."Hidden" effects are used to explain why there is low explained variance in many models (e.g., Ref. [12]).Using models in this way is not uncommon in other sciences.However, there are calls within physics and other fields to update model assessment and evaluation to include deeper measures of prediction (e.g., Refs.[7,13]).
Much of the recent work in the statistics and machine learning communities has focused on prediction rather than explanation.In the former, community data collected are put into a model that is evaluated on its predicted output.The internal structure of the model is not assumed to be representative of the system being studied [1]; that is, there is not necessarily a correspondence between model structure and system structure and dynamics.The statistics community still considers the model central and highly values its interpretability.Model performance is evaluated using an independent test set of data [2].These data are independent; they are completely separated from the data used for model building [14].Thus, the models are trained using a "training dataset" and then evaluated using a "testing dataset."Models with poor predictive power are rejected in favor of more predictive models [14].By evaluating the predictiveness, instead of the statistical significance, a relationship between two variables can emerge beyond a "statistical association" [7].
The framework that is presented in this paper is designed to provide a method for establishing generalizable results from educational data mining studies [15].Its goal is to aid in exploring research questions such as (i) who will change majors in physics to other majors [16], (ii) which students in physics courses need extra help to succeed [17], and (ii) what patterns exist in student preparation that may impact learning at the university level [18].Producing generalizable results through use of statistical model evaluation will provide an additional tool to the PER tool box.This tool will allow researchers to compare results across similar university environments.
It is important to note that this framework is not designed to test causal models [15], nor is it designed for evaluating latent variable modeling, such as item response theory.While this framework is a good starting point for evaluating these kinds of models too, causal models require more strict verification that is grounded in the theory being tested and cannot rely on solely statistical model evaluations.While this is true in educational data mining studies as well, in an educational data mining study, theory only suggests what variables could be used in a model [15,19].
In this paper, we review how statistical modeling has been done in PER and then describe the current discussions in the statistics and machine learning communities on model representation and evaluation.We then provide evidence that the current quantitative PER work could be advanced by using robust model assessment in line with other quantitative fields.Then, we offer a framework for the PER community that considers data management, model evaluation, and results communication holistically.This framework is grounded in work presented in the recently published PRPER focused collection [5], as well as in modern views from the statistics and machine learning communities.The framework we present is model agnostic, that is, it will work for traditional linear models, as well as more complex "machine learning" models.Finally, we present two applications ("case studies") of the framework: (i) using the Learning About STEM Student Outcomes (LASSO) dataset [20], and (ii) using the data from the Colorado Learning Attitudes about Science Survey for Experimental Physics (E-CLASS).These two case studies demonstrate the need for assessing the predictive ability of models beyond what is typically done in quantitative PER.

II. BACKGROUND
Physics education research (PER) is an interdisciplinary subfield of physics that investigates many aspects of physics education [21].PER uses a wide variety of studies and types of research, including theory building, empirical investigations, qualitative research, and statistical modeling.Quantitative PER has historically focused on investigating how instruction impacts student outcomes such as conceptual learning [22] and attitudinal shifts [23].Papers in PER that use statistical modeling typically explore data using logistic or linear regressions.Typically, they do not use Cox style regressions (i.e., time-to-event models) or machine learning models (with few exceptions, e.g., Refs.[16,17,24]), although use of these methods is increasing.Early work in PER also generated its own statistics based on empirical observations such as the Hake normalized gain [22].

A. To explain or predict?
A common goal across all science is the understanding and explanation of how a system works.For example, physics has historically had a close connection between the explanation of how a system works and a prediction of the time evolution of the system.Newton's second law provides an explanation of what causes an object to move, as well as excellent predictions as to where an object will end up after some time t.In education research, we rarely have access to such causal relationships that produce excellent explanations of a system, as well as excellent predictions.However, it is sometimes assumed that a model that produces excellent explanations is also predictive [3].This can conflate the goals of explanation and prediction, which are different aims.This is seen across both the culture of how models are viewed and the practices and choices of the data used for modeling.An explanatory model might rely on a specifically designed survey to, for example, assess a particular attitude towards experimental physics [23] with data from particular types of courses.Whereas a predictive early warning model of student failure in a physics course [17] would not be useful without current data from ongoing courses that are actively streamed into the model.
Breiman et al. [1] identified two separate cultures that use statistical models: one culture is grounded in the theory that the stochastic models that generate the data must be approximated, the other culture is grounded in the theory that the model that generates the data should remain unknown.Breiman et al. [1] argues that the data generating process is normally complex and statistical models, such as ordinary least squares regression or multi-level regressions, cannot approximate it.In typical machine-learning reasoning, he supports algorithmic methods, in the form of a black-box, which might produce better and different understanding of the data.In this context, a black-box model is one where the way the model arrives at the solution is unknown.Taking the concept to the extreme, he goes on to state: "the goal [of a statistical model] is not interpretability, but accurate information" arguing that by sacrificing accuracy for interpretability, a researcher sacrifices understanding between the dependent and independent variables."While the advantages of an interpretable statistical model over a black-box solution are evident, in terms of transparency, portability, and generalizability, Breiman's statements should be taken into consideration when highlighting the importance of a model's predictive ability.We argue that a black box that does not give any further insight into the problem is not terribly useful.
Focusing on prediction can lead to the discovery of novel causal relationships, the generation of new hypothesis, and the refining of existing explanatory models [3].Much of recent literature across social science and statistics has called for a change in the types of models used to predict social systems, the theory and frameworks that motivates the choice of models, and the evaluation methods used to demonstrate prediction.Hofman et al. [4] recommend that "current practices for evaluating predictions must be better standardized," arguing that current methods in social science focus too much on explanation and too little on prediction.Hofman et al. [4] recognize that there is a fear that complex models may lose interpretability, but points to innovations in recent literature that overcome a loss of interpretability in complex models (e.g., Ref. [25]).The work in PER has not yet fully engaged with this discussion, but work is beginning to discuss the importance of different considerations in conducting quantitative PER [15].

B. Quantitative analysis in PER
Quantitative methods in PER have evolved over time.There is a robust and growing community of researchers in physics education that are thinking about the entire process of quantitative research from data collection and using more sophisticated models to evaluating those models and presenting results.In this section, we discuss a few salient components of the current state of quantitative PER methods and where improvements might be made.

Data collection
The PER community has investigated how data preparation and organization impacts observational and experimental results.Both Ding et al. [26] and Springuel et al. [27] recognize that education data typically are not gathered as interval or ratio data (as is typical in a traditional physics study).Thus, the encoding of data can directly impact the results reported.Springuel et al. [27] identifies several types of data that are common in PER and describes the limitations of using each type of data in research.These types include asymmetric and symmetrical nominal data, ordinal data, interval data, and ratio data.These types are presented on a scale as to how informative they are and what they are able to do.For example, ordinal data contain a finite number of ordered responses (e.g., a Likert scale).In contrast, ratio data create a scale of data from 0 to 1 that represents the presence or absence of some observation.
Missing data can also be an issue in quantitative PER studies [28].Most statistical models are not capable of handling missing data directly.Data must be either removed or imputed.Rubin [29] describes three types of missing data: (i) missing completely at random, (ii) missing at random, and (iii) missing not at random.Data missing completely at random is when data are missing and there are no mechanisms that can be demonstrated that explain how the data came to be missing.The other kinds of missing data assume that there is some mechanism that may cause the data to be missing (e.g., low performing students are less likely to complete both a pre-and postconcept inventory).When these data are missing in low amounts, it is reasonable to do row wise deletion of data that is missing [27].Row wise deletion means that data are removed from the dataset for an entire row if any of the columns in a row have missing data.
Nissen et al. [28] recommended the use of multiple imputation [30] to fill in the missing data in cases with larger amounts than discussed by Springuel et al. [27].Josse et al. [31] demonstrate that in addition to using various imputation methods (including multiple imputation), adding a covariate column that explicitly includes whether data are imputed or not can increase model performance without biasing data, especially when missing data are related to the target variable.
Finally, communication of data context has also been observed to be an issue in PER.Knaub et al. [32] found that approximately half of papers published in Physical Review: PER do not describe basic descriptive information about the students being studied such as their demographics, average grades, or institutional descriptors [33].

Statistical modeling
Recent efforts have been made to modernize statistical modeling in PER both in methods and in theory.For example, Aiken et al. [16] recommend using out-of-sample data to evaluate classification models.Out-of-sample data are data that are randomly selected from the dataset under study that is not used to train the model.Once the model is trained, it is then used to assess model performance (e.g., using mean squared error).Van Dusen and Nissen [6] recognize that much of the data in social science and in PER can have a hierarchical structure that is not picked up by typical linear regressions with only fixed effects.They recommend using multilevel models to calculate the contributions of the hierarchical structure that include random effects.Theobald et al. [34] make similar recommendations extending the types of regressions that should be considered in PER (e.g., proportional odds, multinomial, etc.) and provide a toolkit to determine the most effective generalized linear model that a researcher should use to analyze their data.This is in line with much of the work of Bryk and Raudenbush in the 1980s culminating in their textbook [11].Bryk and Raudenbush [35] identified that "the real need [in education research] is for statistical models that provide explicit representation of the multiple organizational levels typically encountered in educational research."Bryk and Raudenbush [35] identified that learning happens across organizational levels such as an individual student, the classroom, the institution, and even the different times a measurement is taken.

C. The role of theory and domain expertise in statistical model evaluation
The goal of this paper is to present a framework for evaluating statistical models in PER to allow for better generalizability of results.However, we are not suggesting that the only reliable model evaluation comes from whether the statistical model fits the data input into the model or not.It is also important that the data input decisions, and the results that are interpreted, be done in the context of educational theory.Integrating theory and domain knowledge is currently an open topic in the fields of statistics and machine learning.Within PER, domain knowledge and theory can include, as an example, theories of equity and inclusion [36], theories of learning, such as the role of interaction [22], impact of social interaction and social network [37], neurological and cognitive theories [38], and theories of community integration, such as the communities of practice theory [39].Statistical modeling allows for a process of rigorous testing of these theories.In reverse, theory allows for strong explanations about when a model fails (e.g., a lack of social network data can impact the overall predictability when examining university student drop out [40]).

III. A FRAMEWORK FOR EVALUATING STATISTICAL MODELS IN PER
The framework presented in this section is aligned with current calls in social science to modernize statistical model evaluation, so that predictive models presented are more robust.Figure 1 is a graphical representation of this framework.Each step in Fig. 1 is discussed below.The framework is shown as a flow diagram to represent each integral step in building a statistical model.This flow diagram represents explicitly each step in the statistical modeling process.It is not meant to represent an ordered process of steps that are performed in succession until completion.In some cases, it is useful to iterate this process to produce the best models.However, care should be taken to not iterate until results are found (i.e., "p hacking" [41]).However, this does not mean that successive tests of different theories and statistical models cannot inform the statistical modeling process and ultimately the results of a research study.Additionally, this process can be transparent through sharing of analysis in an Appendix B, code repositories, and other Supplemental Material [42].
Prior to the collection of any data, researchers must identify their theoretical constraints and research questions.
These dictate what data are to be collected at all.For example, if researchers are interested in student learning they might consider a cognitive model of learning and thus they would collect data that would indicate that (e.g., aptitude scores, scores on other cognitive measures, prior course work, amount of sleep, etc.).That is, measurements that all effect cognition.Instead if researchers are interested in identity formation and if someone adopts an identity or not, they could collect other data such as time spent on activities related to that identity, a student's rating of their own competency, etc.Both are welcome in the analytical framework presented in this paper and underlie all of the data and analysis.

A. Raw data
Raw data (see Fig. 1) collected in educational studies should be described in adequate detail.This includes discussing the sample data that are analyzed in the paper and the larger population group that the sample is drawn from.It also means discussing how the data were collected (e.g., in-class, online, for credit, optional, etc.).We recommend following the questions presented in Knaub et al. [32] such as "what information regarding the sample is useful for the audience," as well as emphasizing explicit sample descriptions and limitations [33].

B. Data encoding, transformations, missing data
Data collected for any research study is dictated by the research study design and ultimately the research questions being asked.How these data are encoded can impact the results of a model.For example, a student variable could be "passed course" or it could be the student's grade.Encoding and transformation procedures applied to data should be described in adequate detail (see Fig. 1).We do not recommend to state explicitly what each data type is for various kinds of datasets.Instead, we recommend authors motivate the methods used to transform or encode each variable in a model from the type of underlying data.For example, if data are to be transformed using summary data, such as with a z score [43], then the summary statistics (e.g., mean and standard deviation for a z score) should be calculated only for data labeled as training data.These summary statistics can then be used when calculating the z score for the test statistic.Otherwise, data leakage could occur, inflating the model's prediction ability [7].Data leakage is what happens when data or information from the test set are included in the training dataset.When this occurs, the statistical model can overfit on patterns in the test dataset and present artificially higher predictive results.Additionally, in this step, missing data should be addressed.If applicable, we recommend using an imputation method of the authors' choice (e.g., multiple imputation [44] or K-nearest neighbors [45]) and including a missingness indicator for each column with missing data that has been imputed [31].In some cases, there may be too much missing data and therefore imputing the data may bias the results [46].There is no universal solution to missing data.For this reason as well, it is extremely important to report the treatment of missing data in the analysis.

C. Train and test split
This process separates data randomly into training and testing data (see Fig. 1).The training data are used to "teach" the statistical model the patterns in the data.The testing data are used to measure how well the model performs, and should be completely independent from the training data to avoid overestimation of the performance.It is important in this step to acknowledge that some data might be stratified in ways that are inappropriate to completely separate data at random (e.g., randomly picking data across time windows in a study that is attempting to predict time to some event).For example, Aiken et al. [16] present a statistical model that predicts if a student enrolled in the physics department will graduate with a physics degree or switch to an engineering degree.In that study, students' high school GPAs had risen over time.One way to adjust model training could be to randomly split data per year instead of randomly across all data regardless of the enrollment year.
When the number of observations is too low to justify a split into a training set and a testing set, more elaborate methodologies can be implemented, for example, cross validation [47].Cross validation splits the data into several ('K') subsets of approximately the same size: each subset is in turn used as a test set, in which the performances of multiple models trained on the remaining K-1 subsets.The K performance measures obtained are then averaged to obtain a final evaluation of the model.This process allows for the stability of the model to be tested and permit one to use each data point both in the training and test processes.
In addition to data splitting methods for model evaluation, many modern models have hyperparameters that need to be "tuned" to produce a best fit model.Hyperparameters are parameters that describe the structure of the model as opposed to the structure of the data.This could include the depth of a decision tree or the value of a regularization term.When the model fitting procedure requires fixing the value of hyperparameters, such as the number of iterations and the step size for boosting (see Sec. IVA 3 for more details), these must be tuned by using a third set, called a validation set.In this case, data are split into three groups: training, validation, and testing.The models are trained using the training set for each value of the hyperparameters, the best values are selected using the validation set, and finally the performance of the model is evaluated on the test set.Even if there is no need to tune an hyperparameter, this threefold split can be used to improve the quality of the analysis (model selection and model fitting are performed on independent observations), and therefore a stronger claim of generalizability.Using a validation dataset typically requires having a very large dataset.The split training and validation is mimicked by applying cross validation on the larger set.Cross validation is the same train and test separation process, but abstracted to K separate datasets known as "folds."Then each fold is compared to the other folds to produce average estimates.When cross validation is used instead of the training and test set, a further cross-validation procedure is implemented within the K-1 folds used in turn as the training set.

D. Statistical models
Here, we emphasize that models is plural.The best statistical model might be tree based (e.g., random forests [48] or gradient boosted trees [49]), or it might be an ensemble of models [2], or it might even be a model more familiar to PER researchers, such as a linear regression.The goal is to test models of varying complexity and pick a model that has the highest predictive ability with the lowest complexity both in terms of number of features used and how the model itself works.This is already being done in PER in some studies, typically by comparing models with few features to models with many features and reporting insample goodness-of-fit tests.For example, Salehi et al. [18] compared different models that predict final exam scores to assess the impact of preparation; Aycock et al. [50] compared models that predict a student's sense of belonging in physics due to different experiences.Additionally, more sophisticated models can increase our understanding even when we have lower amounts of data on student subpopulations.For example, using a dataset that had a small racial and ethnic minority population (15%) Aiken et al. [40] presented gradient boosted logistic regression models that performed much better than traditionally solved maximum likelihood models when predicting undergraduate student time to graduation.Statistical model choice can impact not only the overall predictability of a model, but also our understanding of a system even when our data on all populations are not equal.It is important to note that the choice of the models to investigate must be done before the data analysis step, but rather in the design of the experiment.Trying all possible methods until one "fits" the data are a way of overfitting (looking for significance) and should be avoided.

E. Evaluate model performance
A model's predictive ability should be evaluated by its ability to predict new data that the model has not seen before [2,7,14].Using out-of-sample data provides protection against overfitting and places the responsibility of model reliability in its ability to represent the natural world, as opposed to statistical significance established from insample data.Standard quantities may be used to evaluate model quality, such as mean squared error or mean absolute error in the case of continuous response variables; area under the receiver operator characteristic curve and the F1 score for binary responses (e.g., for logistic regression); or the Integrated Brier score and C-index for time-to-event data (survival analysis).
A good model fits the training data well, does not diverge much in fit when applied to test data, and has a low complexity.High complexity models typically have a large number of features and are algorithmically complex [2].They may also suffer from overfitting, with the characteristic situation of small training error and large test error.

F. Explanation
Once a final model has been determined (see Fig. 1), the model can then be used to provide explanations for the prediction.This can include traditional statistics such as odds ratios and effect sizes that are based on in-sample calculated model coefficients.Explanation can also use test data to calculate partial dependence [49] or determine explained variance using recursive feature elimination (e.g., Refs.[16,24]).

G. Using the framework
Ultimately, this framework is only a guide that can be applied and adjusted for each project.For example, a longitudinal study with multiple measurements might require substantial changes in how data are managed and how a model is evaluated.However, several constants should be true for any study that includes predictive statistical modeling.The encoding of data, transformation of data, and handling of missing data should be explicitly articulated.Statistical model evaluation should use out-ofsample data to evaluate model performance.Multiple types of statistical models should be compared and evaluated before presenting results from the model.In the following sections, we demonstrate how to use this framework with two case studies.

IV. ILLUSTRATING THE FRAMEWORK
We have applied this framework to the study of two separate, but common studies in PER.The first study uses the LASSO dataset [20].The LASSO dataset is a multiinstitution, multicourse dataset that includes pre-and postconcept inventory responses along with other variables.Student postscores are predicted using regression models.The second study uses the Colorado Learning about Science Survey for Experimental Physics [51] to examine differences between students in introductory physics laboratory courses and lab courses that are "beyond the first year" (BFY) of college.
In both cases, the data that are presented have a complex structure, which justifies trying complex models.By complex, we mean that the data may contain nonlinearities, hierarchical effects, and other peculiarities that will lead to poor functional approximation using simple methods such as ordinary least squares.For the LASSO study, first, we try the traditional ordinary least squares solution to linear regression.Second, we attempt to fit a hierarchical linear model using the restricted maximum likelihood method [11].Finally, we use a gradient boosted trees model [52].
For the E-CLASS study, the problem is framed as a binary outcome variable.Instead of predicting the postscore like before, we predict whether the students are in an introductory course or in a BFY course.In this case, we attempt to solve the logistic regression problem in three ways: (i) with a maximum likelihood estimate method and no hierarchical effects, (ii) a maximum likelihood method with hierarchical effects, and (iii) a gradient boosted method without hierarchical effects.We present these two separate studies as demonstrations of the framework for statistical models that attempt to predict continuous outcome variables and binary outcome variables.
In the following section, we briefly discuss how each algorithm approaches the solution.The methods for each study, such as how the LASSO data or E-CLASS data are transformed, are then discussed in their respective sections.

A. Statistical models
In this study, we use three different methods to derive a regression model.The model is used to estimate or predict the value of a response y, denoted as ŷ, based on some covariates or predictors x.In the LASSO study, ŷ is the post-test score.In the E-CLASS study ŷ is a binary variable denoting whether the student is in a BFY course or in an introductory physics lab course.In the LASSO study, the first method we use is a traditional ordinary least squares linear regression.The second is a two-level hierarchical model [11].The third is a gradient boosted trees model using the Xgboost implementation [52].In the second study (E-CLASS), we instead use logistic regression models since the outcome variable is whether the student is in an introductory course or a BFY course.Below is a short description of how each method approaches the solution.Then, we describe how the models are evaluated.

Ordinary least squares regression
Ordinary least squares (OLS) regression is a popular method used across all fields of science having its roots in the 1800s [53].Ordinary least squares estimates a vector of coefficients for the model by minimizing the residual sum of squares (RSS) [2], where y i and x i are the value of y and X for the observation i ði ¼ 1; …; NÞ, respectively, and fðxÞ ¼ β0 þ P p j¼1 X j βj .This vector, β ¼ ð β1 ; …; βp Þ, can be obtained directly from the unique solution (not derived here) β ¼ ðX T XÞ −1 X T y, while β0 ¼ ð1=NÞ P N i¼1 y i .In this case, ϵ is assumed to be the irreducible error, i.e., the variability that cannot be explained by the model.It is also assumed that this error is normally distributed with a mean 0 and standard deviation The coefficients of an ordinary least squares model represent how a model variable can impact the outcome variable in two ways: negative or positive (the coefficient sign), the predicted shift in the outcome variable based on a one unit increase in the covariate (the coefficient magnitude).

Hierarchical linear model
Hierarchical linear regression models (also called mixed effects models) are another popular method that has seen much use in social science [11] and in PER (e.g., Refs.[54,55]).Mixed effects models allow the inclusion of random effects into the classical linear model, usually something related to clusters of observations.For example, the effect of attending a Catholic school versus a public school may have on math learning [11].These regressions typically have nested equations that represent the different organizational effects described by Ref. [35].The simplest version of a random effects model is where X is the input data, β is the fixed effects, Z is input data with random effects, u is a random effect (E½u ¼ 0, Var½u ¼ σ), and ϵ is the random error.These models cannot be solved analytically like ordinary least squares and iterative methods must be used instead.For a PER treatment of these types of models see Van Dusen and Nissen [6] and Burkholder et al. [56].

Gradient boosting
The idea of boosting is to estimate the function fðXÞ that relates the covariates to the response in an iterative way.At each iteration, called the boosting step, the model is improved by fitting a base estimator hðXÞ to the negative gradient of a loss function (e.g., the squared loss).The loss function is a function that penalizes the discrepancies between the outcome of a model and the truth.In the simplest case, this means to iteratively fit a model to the residuals of the model computed in the previous step.At each step, the improvement is kept small (a hyperparameter, called boosting step size, guarantees this) to allow a better exploration of the covariates and, more importantly, to avoid overfitting.Examples of boosting algorithms are the model gradient boosting [57] implemented in the R package mboost and the eXtreme gradient boosting [52], implemented in xgboost.The latter, in particular, uses, as default, statistical trees [2] as a base estimator.
A full description of gradient boosting can be found in Ref. [2].The specific optimizations that Xgboost enhances gradient boosted trees with are described in Ref. [52].

B. Model evaluation
It is important to evaluate model performance in order to gauge the confidence we should place in models.Models that have poor fit, such as when R 2 is close to zero, may produce statistically significant results that are not repeatable or are simply due to increases in the size of modern datasets [58].Model performance for the LASSO study is evaluated visually by comparing predicted distributions (Fig. 3) and through the following metrics (Table III): the mean square error (MSE) and the mean absolute error.Model performance for the E-CLASS study is evaluated by comparing receiver operator characteristic (ROC) curves (Fig. 5).All metrics are computed on a test set.
Here, we do not explore information criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).These criteria try to evaluate the prediction error by adding a term to the training error that is supposed to estimate the optimism, i.e., how much the training error underestimates the test error [2].While accurate for simple models, these criteria tend to be hard to compute for more advanced methods, such as boosting, due to the difficulties in estimating the effective degrees of freedom [59].R 2 is also not applicable here, as it often has different meaning for different models.Think, for example, to a random effect model where the random effect controls some of the variance of such an effect.Since R 2 represents the explained variance of the model, and the random effect model controls some of the variance through the random effect, then the R 2 will not capture the same explained variance behavior.Despite being a reliable measure in the linear regression model, R 2 should not be used for comparing models different in their nature.
Additionally, we examine the out-of-sample residuals for each model (Fig. 4).Residuals are the difference between the true value and the predicted value from a model.While most summary statistics provide average views of how good the model is in predicting the out-of-sample data (e.g., mean squared error), residuals plots allow for the examination of local behavior across the range of the outcome variable (i.e., the post-test score).
An assumption of the ordinary least squares regression model is that residuals should be normally distributed and not correlated.When the residuals are not normally distributed and/or they are correlated, this indicates the model missed some behavior in the data.That is, there is an effect that is occurring in the data causing, for example, heteroskedasticity that is not captured by the model's functional approximation of the data.While this assumption is relaxed for other types of models, the presence of correlation within residuals can be useful in analyzing model outputs.Thus, visual inspection of the residuals is often important to determine if models miss underlying structure in the data or overfit on underlying structure.In practice, it is common that residuals have some correlation in real world data [2].
The gradient boosted models presented in this paper use decision tree estimators instead of linear models.Decision tree models do not return coefficients that are easily interpretable like in the case of linear models, instead they return the gain in information due to the use of that variable in the tree.For a gradient boosted model that is a collection of many trees, the average contribution of all the instances of the variable is used to describe its "importance" in the model prediction.For an xgboost model, this value is unitless [52].Gradient boosted models are evaluated in the same way as linear models using out-of-sample fit statistics.

V. CASE STUDY 1: PREDICTING POST-TEST SCORES IN CONCEPT INVENTORY DATA
Using the LASSO dataset [20], our first investigation looks at predictive regression models of postscores on the Force Concept Inventory and the Force and Motion Conceptual Evaluation.The LASSO dataset is a collection of pre-and postconcept inventory responses collected from multiple institutions typically in introductory physics courses.The data in this paper cover three years of data collection from 2016 to 2018.The LASSO dataset is connected with the national learning assistants' program that aims to include undergraduate students in the instruction of courses [60].Because this program is teaching focused, LASSO data are likely to be biased by the institutions that select to participate.Institutions that are more likely to invest time and money in teaching and learning are more likely to participate in the LASSO program.Institutions that invest in teaching and learning are likely to see larger learning gains than those that do not [22].However, the LASSO dataset is a large dataset of validated concept inventory responses, thus, it is a good test bed for the methodology presented in this paper.

Data management
The 2016-2018 LASSO dataset has 13 916 responses for 28 institutions (see Table I).We restrict these data to responses for only the Force Concept Inventory [61] and the Force and Motion Conceptual Evaluation [62] (see Figure 2).This focuses our work on introductory mechanics concepts.We then remove all courses with less than 20 students enrolled and all courses with less than 20 responses.This increases the registered number of students per course for the training and testing datasets.We also remove data for any course that has less than 50% students responding to the pretest or the post-test.Additionally, for students with multiple responses, we use only the student's first response.This reduces the dataset to have 6688 students total across 20 institutions and 90 courses.
In this paper, we have restricted the total variables to 30, including the outcome variable, the post-test score.The variables include two data types: (i) data regarding the student, and (ii) data regarding the course the student attended (see Table II).Student data include the pretest score, background information for students, such as gender and race or ethnicity, the number of pre and post-test questions answered, what year the student is in while taking the course (1st, 2nd, or 3rd or greater), and the time the student spends taking the pre-and post-test.The course variables include the type of concept inventory (FMCE or FCI), the number of students enrolled in the course, the number of learning assistants assigned to the course, the student-to-learning assistant ratio, and whether the course is offered at a community college or not.
The LASSO dataset has missing data [28].Pretest scores, post-test scores, the number of questions answered for preand post-tests, and the time it takes to complete a pre-and post-test all may have missing data.To address this issue, we used the multivariate imputation by chained equations (MICE) method [44] as recommended by Ref. [63] for use on the LASSO data.MICE estimates the missing data per variable that has missing data via a regression algorithm using all other columns.This is repeated for each column with missing data until all columns are filled.This is then TABLE I. Self-reported gender, race, and ethnicity of students in the LASSO dataset.Students may select more than one race option on the survey.In this table the race categories are presented as a sum of the reported within category.Thus the total sum of reported races is greater than the total number of students (N ¼ 6688) due to students identifying as more than one race.iteratively fit across a user chosen maximum number of iterations.In this paper, MICE is used to impute the missing data for pretest scores, post-test scores, the number of questions answered for pre-and post-tests, and the time it takes to complete a pre-and post-test.For each variable that includes imputed data, an additional column is created to identify whether the data are imputed or not.This additional column can, in the case when data are not missing completely at random, provide models with information that can aide prediction [31].

Model evaluation
In the LASSO study, we use an OLS model, a mixed effects model, and a boosted model.The OLS model uses all of the variables found in Table II.The mixed effects model additionally considers a random effect on the intercept and on the slope.The boosted model uses the xgboost package in python [52] and uses decision trees as estimators.It uses 20 000 decision tree estimators.20 000 was chosen as it provides a large number of estimators to learn the data without having any one estimator overwhelming the fit.We choose these three models as they each attempt to solve a specific problem and the OLS and mixed effects models are becoming commonly used in PER [6].OLS models are used to estimate mean effects on outcome variables (in this case, the concept inventory post score).Mixed effects models are used when there is an assumed groupwise variance such that the slopes and/or intercepts per group may be different [6,35].Gradient boosting models have been developed more recently [64] and can often better fit data in comparison to maximum likelihood models, such as the mixed effects models presented in this paper [65].
The models are then evaluated using the visual predicted post-score distribution (Fig. 3), the residuals visual (Fig. 4), the mean squared error, and the mean absolute error (Table III).A visual examination of the model fit through predicted post-score distributions and residuals provides a visual representation of the fit statistics (such as mean square error).This visual examination can prevent us from being "fooled" by good fit statistics [66].The fit statistics provide a summary value that can be compared directly across models.In this case, the lower the mean squared error and the mean absolute error, the better fit the model.

B. Concept inventory data results communication
Using the LASSO dataset, we have trained three separate models: (i) an ordinary least squares model, (ii) a mixed effects model, and (iii) a gradient boosted model.Overall, the gradient boosted model has a lower mean absolute error and lower mean squared error than the linear models (Table III).The mixed effects model has slightly better fit statistics in comparison to the ordinary least squares model.
Visually, the out-of-sample postscore distribution predicted by the gradient boosted model and the mixed effects model is close in shape to the true values (Fig. 3).The ordinary least squares and hierarchical models overpredict scores in the 30%-60% range and underpredict higher scores.
The residuals for all models show some correlation with the true values of the post-test score.The ordinary least

VI. CASE STUDY 2: COMPARING STUDENTS IN INTRODUCTORY VS BEYOND FIRST YEAR COURSES
Not all models in PER attempt to predict a continuous dependent variable like we have shown in the previous analysis.In some cases, the dependent variable is binary or even multiclass.For this work, it is common to use logistic regression instead of the traditional regressions.In this study, we use a subset of the data E-CLASS to present an example of a binary classification task in PER and how the framework can be used for such a task.This subset focuses on the future plans questions asked of students in the E-CLASS post survey.

A. E-CLASS methods 1. Data management
The E-CLASS is a survey of student epistemology and expectations surrounding experimental physics.We take a situated theory of epistemological development [67] approach when considering how the survey is used and what research questions can be asked.This theory looks at how students' epistemologies develop in particular socialcultural contexts based on reflections of their experiences.The survey is validated for all levels of university and college students and can be applied across many different lab experiences.To facilitate this process, the E-CLASS survey is administered through a central administration system [23].As of January 2020, it had been used in 599 FIG. 4. Residuals plots for the LASSO study.For the OLS model and the xgboost model, the residuals are normally distributed with the majority being within AE10 points of the true value (see Appendix Fig. 6).The mixed effects model does not produce normally distributed residuals.TABLE III.Out-of-sample model fit statistics for ordinary least squares, mixed effects, and gradient boosted models for the LASSO data study.In all cases, the gradient boosted models have "better" fit statistics in comparison to the linear models.This better fit is unlikely due to the boosted model over fitting the data.This better fit is evidenced by the linear models producing significantly different post score distributions using the out-ofsample data (see Fig. 3).Students were able to reply to the gender question as a text response instead of picking an option.In this case, these responses were reduced to a third "other" category.In some cases students had replied to the survey more than once, e.g., they took the survey in more than one course.In this case we only used their first response.In some cases there was no description of the type of institution the student attended in the course information survey that instructors filled out.Responses were removed in this instance.Additionally, responses were removed for students who attended 2 year colleges since there are no BFY courses in these institutions.This reduced the overall dataset of matched pre-and postresponses of N ¼ 19445 to N ¼ 10148 (see Table IV) students.In this study, we focus on the differences between students in introductory and BFY courses.

Ordinary
In this study, we do not use any results from the attitudinal survey itself.Instead, we use demographic information and some of the responses to career plan questions that the students report at the end of the survey.This includes the student's gender, eight questions (see Appendix) concerning the student's preferred career trajectory, the student's race or ethnicity, and the type of institution the student attends (4 year college, master's granting, or Ph.D. granting).We then use these variables to predict whether the students are in introductory or BFY courses.

Model evaluation
We apply traditional maximum-likelihood logistic regression, a mixed-effects logistic regression, and a gradient-boosted logistic regression to compare students in introductory courses to those in BFY courses.We compare the predicted output using the receiver operating curve (ROC) [68].This curve represents the ratio of truepositive to false-positive predictions of the model given the entire scale of the decision boundary (0 to 1).For example, a threshold of 0.2 would indicate that any student with a predicted probability of < 0.2 would be classified as a student in an introductory course.Students above this line would be classified as BFY students.For the hierarchical model, we add a random term to handle the effect of the institution type.
In this study, we used the advanced form of the "traintest-split" method presented in Sec.III C known as K-fold cross validation [2].Using 5 folds, we calculate each model and then report the average area under the ROC (AUC) curve (Fig. 5).By using cross validation, we can develop a better understanding of our confidence in the comparisons of the reported AUCs.This is necessary because (a) they end up being low according to acceptable AUC ranges created by Hosmer Jr et al. [68], and (b) they are close in value to each other.

B. E-CLASS results communication
Using the E-CLASS dataset, we trained three models to predict whether a student would be in an introductory course or a BFY course when responding to the survey.Using only demographics, responses to the seven-question career-plan question block, and information about the type FIG. 5. Receiver operator characteristic curves for the E-CLASS models predicting whether a student is in an introductory course or a BFY course.The more area under a curve, the better its performance.Overall, the boosted model (AUC ¼ 0.798) performs approximately 2% better than the maximum likelihood model (AUC ¼ 0.773) and considerably better than the mixed effects model (AUC ¼ 0.581).However, the overall scores are still below an excellent discrimination (AUC > 0.8) as identified by Hosmer Jr et al. [68].In this figure, the curves represent the average ROC curve for all 5 cross validation folds. of institution, we calculated the ROC curves shown in Fig. 5 from the model classifications (true positive rate and false positive rate).Overall the gradient boosted logistic regression model performs the best (AUC ¼ 0.798), while the mixed effects logistic regression model performed the worst (AUC ¼ 0.581).The fixed effects logistic regression model performs similar (AUC ¼ 0.773), but slightly worse, as compared to the gradient boosted model.All three models are below an "excellent discrimination" (AUC > 0.8) as identified by Hosmer Jr et al. [68].It is not clear if this can be expected.One must consider the typical demographics of physics lab courses in the US.Introductory lab courses often have students in wideranging majors, from physics, to engineering, to those in the life sciences, while BFY courses are typically dominated by physics majors.Students in physical sciences and engineering in the US tend to be overwhelmingly white and male, while those in the life science majors have more women.Additionally, there are issues of systemic sexism and racism in colleges and universities that can decrease the percentage of women and nonwhite students continuing into upper-division courses [69].It is not clear a priori if these general differences in course population are seen in the data we have or can be identified by the models.

VII. DISCUSSION
The framework that is presented in this paper is a culmination of a number of contributions towards statistical modeling in PER.It highlights the need to understand the types of data we collect, how we can represent them, and how we can transform the data [27].The framework is attentive to the missing data, which is common in social science [28].The framework allows for the introduction and evaluation of many different kinds of statistical models that can elucidate understanding of educational data [6,16,70].Finally, the framework aligns statistical model evaluation in PER with calls in social science fields to place a much greater emphasis on evaluating model prediction [7].

A. Data management
Describing datasets used in PER includes how the data are managed.This includes explicit description of how the data are reduced (e.g., dropping data that could be used to identify students) or describing how data are transformed (e.g., how the durations in the LASSO study were transformed via a logarithm).Describing this process allows for interpretation and reproducibility of transforming the raw data into data for a statistical model.
It is common in the social sciences that data are missing and that we need to impute that data in some way [28,46,71,72].The reason data are missing can be due to both completely random reasons, such as a question simply not being answered on a survey, to purposeful reasons, such as a subpopulation under study is less likely to participate in a repeated measurement like a pre-and post-test.
In the LASSO data study, we imputed missing data.Examining the feature importances, the gradient boosted model used many of the "is imputed" variables more effectively for prediction than it did the actual data (see Appendix).Approximately half (47%) of the rows in the LASSO dataset had at least one column that was imputed.It could be that having so many data being imputed biases the model much like not imputing data at all can bias results [28].That is, when so much data are being imputed, the model learns the process the imputation engine follows and that can artificially inflate the model fittedness.Thus, the statistical model then represents what the imputation model did, instead of the nature of the data collected.It may be that there are too many rows with missing data.This is similar to recommendations from biomedical clinical research, which state not to impute data when data are missing at least 40% of the time [46].It could also be that while the overall LASSO dataset is large, groupwise per course and institution, it has low N values.That is, in both the student groups (e.g., gender, race) and the institutional groups (e.g., single courses) there is a low number of students that represent each subpopulation.Thus, the actual amount of data available to impute student data from the same learning community is low.In this case, it may be wise to examine other methods such as weighting variables instead of imputing variables that have shown to be effective in low-N scenarios [72].

B. Model evaluation
We have presented two research studies in which we demonstrated the utility of examining model fittedness using the statistical model framework presented in this paper.This framework takes an agnostic approach to understanding model success.It is agnostic to whether models are considered direct representations of the data or are simply algorithms that fit data.There is much contention in statistics as to which viewpoint is "better."Bryk and Raudenbush [35] argued for directly representing the organizational levels of an educational environment through random effects in statistical models.This argument forms the basis for the use of most hierarchical mixed effects models.These models have been demonstrated to reject previous results such as the interaction of aptitude and teaching style has on math performance [73].Ultimately, we must pick the best models that can explain relationships we are interested in, being careful of not searching for the model that explain it by chance.Picking these models must rely on a comparative assessment of many models performance [7].In many cases, there may be a "multiplicity of good models" [1].
In the LASSO study, we assume there is an institutional random effect due to them being reported in the literature (e.g., Ref. [20]).Therefore, the ordinary least squares method performs poorly and there is an assumed random effect due to the institution.That is, there is not only a per institution effect that adds some amount to a student's total post-test score, but also this effect has a wide variance that would make the effect variable per student all other variables being equal.
For the E-CLASS study, we used the receiver operator characteristic curve (Fig. 5).The ROC curve represents the ratio of true-positive results to false-positive results.Thus, it helps us estimate model performance in the context of what the model labels as the true value (in our case, being a BFY student).What we see in this study is that if we had relied on an assumed structure of the data as dictated by the mixed effects model, we would arrive at much different results than either the OLS model or the boosted model.

Visualizing model results
We have also emphasized in this paper the visual examination of model fits to data, in addition to the fit statistics.It is not a new claim that we must not only examine model fit statistics, but also visually examine the model's predictions themselves.The quartet of Anscombe [66] highlighted the need to examine the data visually.Using a quadrant of four different datasets but with identical fit statistics, Anscombe demonstrated that visual examination of a model's fit is sometimes the only way to determine our trust in the model itself.In this paper, we have demonstrated the same need when statistical modeling is applied within physics education research.In the LASSO case, the gradient-boosted model arrived at a better understanding than the linear models of the dataset.The residuals of the model were examined to better understand the model fit.Using the residuals we can determine how closely the predicted scores are to the actual scores.

Boosting
We have presented three models for each study: (i) a traditional linear model, (ii) a mixed effects model, and (3) a boosted model.In each case, the boosted model is better at fitting the data without overfitting.The boosted method likely solves the regression better because the model spends more time learning the peculiarities of each dataset [65].Gradient boosted models are typically fit using a large number of iterations of adding new learners to the overall base model.The number of iterations is actually an important tuning parameter for boosting, and many properties depend on its correct choice [59].
The models presented here use 20 000 iterations, which produces a post-test distribution closer to the observed (test) data and an overall better fit.No one learner in the boosted model uses all of the variables nor does it use all of the data.By combining the collection of many weak learners, the boosted model is able to capture complex structures of the data, i.e., nonlinear effects, and prevents overfitting (the typical sign of overfitting a model is when the model predicts training data very well and performs poorly on test data [2]).In particular, the last point is related to the discussion of Sec.VII B on the role of a model: if one wants the results to be generalized outside the specific data set at hand, i.e., have a low prediction error, a boosted model may lead to a much better model than those obtained with traditional approaches.
Ultimately, it is up to the researcher to decide what models they use and why.Researchers who are most concerned with understanding local effects and not producing generalized models are likely to find ordinary least squares an adequate solution for their purposes.However, by restricting the use of which models are available, it might lead to models that require lower confidence in the results or even erroneous characterization of the most important features.Models may also not be able to be generalized beyond a local context.Newer methods, such as boosting, also have the ability to increase understanding of subpopulations that traditionally are poorly fit by statistical models due to lack of data [40].Ultimately, there is not one "true" model, but many different models that can inform us on the dynamics of students in educational systems in physics.

C. Results communication
It is common in PER to report statistical model results as the coefficients and random effects (if any) of the model.We argue that we can present results in greater detail by also using visualizations of the predicted output.
Models with misestimated variable rankings may lead to departmental policy decisions that are not reflective of the confidence researchers have in the model.Given that 24% of the reviewed papers (Sec.II) presenting statistical models had no examination of model fittedness, this could have a profound effect on results that have been presented in physics education research.
It is often the goal in PER that we are interested in as much the magnitude of an effect as is the direction of that effect.For example, it is helpful to know not only how much a student's pretest score affects their post-test score, but also whether that effect is negative or positive.In this paper, we have not presented these types of results.This is done in order to focus on the examination of prediction to establish model confidence.Modern methods such as boosting can produce these estimates as well depending on the type of learner the boosted model uses (such as a linear learner).These boosted models can also use decision trees (this is what is used in the E-CLASS study).Feature importances of tree-based models do not tell us positive or negative effects only the magnitude of the effect.Instead, tree-based models can utilize methods such as partial dependence [64] or shapely additive values [74] that can provide a more complete picture of each variables effect on the outcome.Both of these model explanation methods explore the entire range of a particular variable and its effect on an outcome variable such as a post-test score.These measurements are able to explain nonlinear effects (impossible for model coefficients) and are model agnostic.Explaining these methods is beyond the scope of this paper.

D. The role of theory and domain knowledge in statistical modeling
This paper has focused primarily on developing a framework of statistical modeling in PER.The studies presented as examples here are classic studies in PER: survey and concept inventory research.While there is theory motivating these types of studies, the theories were not present in this paper.This is not a tacit recommendation to avoid theory.Theory dictates the research design and the boundary conditions upon which we accept or reject model results.This holds true as well in the reverse.In some cases, theory may fail and the statistics we have used provide an argument for that failure.In either case, it is important to clearly articulate the theory that motivates a study, as much as it is important to articulate a statistical model evaluation process.

E. Recommendations for researchers
Researchers who are building statistical models to answer physics education research questions can build more effective research by following a few recommendations: 1. Describe the dataset, how it was collected, the types of data that are utilized in the study, the context of the data, whether it is from a specific institution or many, and any processes used to impute the data or otherwise alter it [27,32] produce the best fitted models [1].These models should include not only different groups of variables, but also different algorithms such as gradient boosting [64] or random forests [48].While inline with current calls in other fields to standardize model evaluation (e.g., Ref. [7]), these recommendations are not exhaustive.In many cases, there is likely a better fit statistic, visualization, or methodological approach than presented in this paper for the specific research question a physics education researcher is exploring.The recommendations presented here are suggested to be starting points for statistical modeling in PER.

VIII. CONCLUSION
We presented a framework for statistical modeling using predictive models that are presented in physics education research.The framework focuses on explicit articulation of the data management procedures, the model evaluation methods, and the way results are defined and communicated.It presents results that suggest explicit articulation of data handling through transformations, imputations, etc., is important to understand the statistical model outcomes.We highlight the need to assess statistical models using out-ofsample data.Out-of-sample prediction must be used as an assessment of the effectiveness of predictive models.Assessing prediction provides confidence in statistical model results that are otherwise lacking in view of modern statistics [7].The framework is grounded in recent efforts to critically evaluate statistical methods in PER highlighting this need to assess prediction [5].We present two complementary studies that justify this need for demonstrating that without assessing prediction, researchers can arrive at erroneous results.In our two example case studies, the boosted models are consistently better than the linear models.While this is consistent with theoretical assessment of boosting from statistics literature (e.g., Ref. [65]), the goal of this paper is not to present boosting to the PER community as a panacea.Instead it is to demonstrate the effectiveness of the presented statistical modeling framework.

ACKNOWLEDGMENTS
This project was supported by the Michigan State University College of Natural Sciences including the STEM Gateway Fellowship and the Lappan-Phillips Foundation, the Association of American Universities, and the Norwegian Agency for International Cooperation and Quality Enhancement in Higher Education (DIKU), which supports the Center for Computing in Science Education.This project has also received support from the INTPART project of the Research Council of Norway (Grant No. 288125) and the National Science Foundation (Grant No. PHY-1734006).We would also like to thank Coline Bouchayer who created Fig. 1.

APPENDIX A: E-CLASS QUESTIONS
The future plans questions from the E-CLASS postsurvey used in this study include the following:

E-CLASS study
The E-CLASS study presented three separate models: (i) a logistic regression (Table VIII), (ii) a mixed effects logistic regression (Table IX), and (iii) a gradient boosted logistic regression (Table X).The feature coefficients and importances are presented below.

FIG. 1 .
FIG.1.A framework for building and evaluating statistical models in physics education research.

FIG. 3 .
FIG. 3. A comparison of the out-of-sample predicted distributions of post-test scores from the LASSO dataset.Curves represent the kernel density estimates of the predicted post-test scores.The gradient boosted model is closer to the true distribution than the ordinary least squares model and the mixed effects model.

TABLE II .
Data types at different levels (student or course level).An asterisk (*) indicates that there is also a column in the dataset indicating if the data was imputed or not.No columns for the E-CLASS study were imputed.

TABLE IV .
Self-reported gender, race, and ethnicity of students in the E-CLASS dataset separated by level of course (first year and beyond first year).

TABLE V .
OLS coefficients for the OLS model presented in the LASSO study.

TABLE VII .
[52]ure importances for gradient boosted model.The feature importance in this table represents the gain[52].

TABLE VI .
Mixed effects estimates for the LASSO study.

TABLE VIII .
The logistic regression results for the E-CLASS study.

TABLE IX .
Mixed effects results for E-CLASS study.

TABLE X .
[52]ure importances for the gradient boosted model for the E-CLASS study.The feature importance in this table represents the gain[52].