Comment on "Rubric-based holistic review represents a change from traditional graduate admissions approaches in physics"

A recent paper used complicated machine-learning methods to suggest that a new rubric-based graduate admissions approach was significantly different from the previous approach in an unspecified way. Simple inspection of the distributions of metrics shown in the paper shows that the rubric approach succeeded in the often-stated specific goal of increasing admissions of applicants with low undergraduate grade point averages and GRE scores. Nevertheless, the argument that this change is a promising way to improve graduate outcomes is based on a misreading of the prior literature. A method used in some of the analysis for dropping data points before running the machine-learning algorithm is likely to bias those results.

6/16/23

Young et al. (1) have recently evaluated whether adopting a rubric-based approach for graduate physics admissions at Michigan State University has changed who gets admitted. The primary motivation for the change, clearly laid out in many of their references and prior publications, e.g. (2-5), is to obtain domestic students more representative of the demographics of the United States by removing barriers to admission of applicants with low Graduate Record Exam (GRE) scores or with low undergraduate grade point averages (UGPA). Obtaining higher PhD completion rates through better selection of students is described as a central goal of such changes.(1) Although effects of the rubric approach on demographics were not directly evaluated due to lack of demographic data, more abstract statistical methods were used to evaluate whether the admissions procedures had changed much.(1) In this Comment I make three main points.
1. Although the paper only tentatively concludes, "Overall, the results of this initial investigation are suggestive that our admission process did change…"(1), simple statistical tests show conclusively that the rubric system specifically succeeded in substantially increasing the fraction of low-scoring applicants who were admitted.
2. Although the paper argues, based on prior literature, that this change will improve graduate outcomes, a more careful reading of that literature, including papers they cite, suggests the opposite.

3. A data-editing technique that is advocated and used in some of the analyses is likely to give biased results and thus should generally be avoided.

Evaluating effects on admissions of the procedure change
Young et al. primarily use two methods to check whether the new procedure is admitting different students than the old procedure did. Neither generic method uses the prior knowledge that the point of the changes is to open up admissions for students who would previously have been rejected due to low scores. Thus these methods are not especially sensitive to the feature of interest.

The simplest method used is a Kolmogorov-Smirnov test (6), which compares two cumulative distribution functions (CDFs) to test the null hypothesis that they represent random samples from the same population. This test was unable to reject the null at alpha=0.05 for the CDFs of the UGPA or of either the Physics or the Quantitative GRE.(1) Although the CDFs are not explicitly given, visual inspection of the smoothed probability density functions of the distributions of accepted applicants shown in Figures 5-7 indicates a much larger left tail (low scores) for each metric after the procedural change, precisely the effect that the changes were intended to produce a priori.(1) The data points are fairly visible, allowing quantitative reconstruction of the left tails. To systematically compare low-scorer admissions in the two groups, we may check what fraction of the old-system accepted students scored below the tenth percentile (17th lowest) of the rubric-admitted students. This cutoff includes enough students to matter but does not get into the meat of the distribution, where individual points become too hard to discern in the figures. As nearly as I can judge from the figures, the total number of these low old-system points for the three metrics shown was 29, i.e. 3.9% of the total rather than 10%. The one-sided Fisher exact p-value for the increase (i.e. the probability that an increase at least that big would happen by pure chance) is less than 0.001. For the Quantitative GRE, Physics GRE, and UGPA taken separately the one-sided Fisher exact p-values were less than 0.001, 0.10, and 0.04, respectively. These counting results confirm the visual impression from the smoothed PDFs that more low-scorers were admitted in the new system, as hoped, although for the Physics GRE taken separately the change would not be large enough to meet the conventional alpha=0.05 significance standard. It seems that most of the low-scorers admitted via rubrics would not have been admitted in the old system, so the new system almost certainly succeeded at opening up the admissions process substantially in the intended direction.
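The contrast between the generic distributional test and a targeted tail count can be sketched in a few lines. The numbers below are purely hypothetical, chosen only to mimic the situation of a roughly 10% low-score tail added to one cohort; they are not the reconstructed MSU data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical admitted-student scores on one metric.
# "old": no low-score tail; "new": ~10% extra mass of low scorers.
old = rng.normal(165, 5, 250)
new = np.concatenate([rng.normal(165, 5, 153), rng.normal(140, 3, 17)])

# Generic two-sample Kolmogorov-Smirnov test on the full distributions:
# typically fails to reject here despite the added tail.
ks_stat, ks_p = stats.ks_2samp(old, new)

# Targeted comparison: counts above and below the new cohort's tenth
# percentile, tested with a one-sided Fisher exact test.
cut = np.quantile(new, 0.10)
table = [[np.sum(old < cut), np.sum(old >= cut)],
         [np.sum(new < cut), np.sum(new >= cut)]]
# alternative="less": tests that the old system's odds of admitting
# a low scorer are smaller than the new system's.
odds, fisher_p = stats.fisher_exact(table, alternative="less")
print(ks_p, fisher_p)  # the tail-focused test is far more sensitive
```

The point of the sketch is that a test aimed at the specific hypothesized change (more low scorers) can detect an effect that a generic whole-distribution test misses.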

Modeling results and methods
The bulk of Young et al.'s (1) technical discussion looks at differences between random forest models of the different admissions procedures to see if changes show up in the models. Some model differences were observed. A model of the old system produced an area-under-curve (AUC) of 0.76 on test samples after training on the other data, about halfway between entirely unpredictive (AUC=0.5) and completely predictive (AUC=1.0). The model of the new system gave a testing AUC=0.63 on the full set of applicants, only about a quarter of the way toward full prediction. Thus omitted predictors of admission seem to be important for both models, especially the new one.
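AUC has a simple probabilistic reading that makes these numbers concrete: it is the probability that a randomly chosen positive case (here, an admitted applicant) receives a higher model score than a randomly chosen negative case. A minimal sketch with synthetic scores (the distributions are hypothetical, chosen so that the separation yields a mid-range AUC):

```python
import numpy as np

rng = np.random.default_rng(1)

def auc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative,
    counting ties as one half (the Mann-Whitney form of the AUC)."""
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return np.mean(pos > neg) + 0.5 * np.mean(pos == neg)

# Identical score distributions -> AUC near 0.5 (entirely unpredictive);
# a one-standard-deviation separation -> AUC near 0.76, roughly the
# value reported for the old-system model.
admitted = rng.normal(1.0, 1.0, 500)
rejected = rng.normal(0.0, 1.0, 500)
print(auc(admitted, rejected))  # theoretical value ~0.76 for this separation
```

On this reading, an AUC of 0.63 means the model's score orders a random admitted/rejected pair correctly only 63% of the time, not much better than a coin flip.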
For a large subset of the newer applicants a variety of rubric scores were available, each treated as a categorical variable. Surprisingly, a model based on these same detailed metrics as evaluated and used by the admissions committee only increased the testing AUC of the rubric model to 0.67. Other factors beyond this extensive list seem to have been considered.

Prior work from the same MSU group gives indications of some important missing factors in the models.(5) Logistic regression models of admission controlling for UGPA and Physics GRE showed strong positive direct predictive coefficients for females and under-represented minorities.(5) These demographic factors were not included in the current models. Young et al. explain that Michigan law forbids the direct use of such factors: "To comply with this law, our university's admissions system collects limited demographic data and our department chose not to record the information that was available when evaluating applicants."(1)

In order to deal with the models' low AUCs, the paper explores a modified analysis in which some points are removed before the random forest model is used. The paper states "…there are cases where applicants have similar physics GRE scores and GPA, yet one applicant is accepted while the other is not. Given that cases such as these might add challenges to modeling the data, removing such applicants might allow us to better characterize the general trends in the data."(1) This data-editing is problematic because removing points where a UGPA-GRE model is not predictive makes models that include those factors look better at predicting than they actually are. Such editing can obscure the need for other modeling factors that account for outcome differences that are not explained by UGPA or GREs. It undermines the key reason given for using the random forest method: "We choose to apply a classification machine learning model under this computational framing, specifically random …due to the lack of assumptions on the data and random forest's feature importances."(1) Similar problems could arise whenever such a data-dropping method is used before modeling a complicated real-world process, as opposed to before modeling a simple few-parameter process obscured by exogenous noise.
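The direction of this bias is easy to demonstrate with a toy simulation. All quantities below are hypothetical: the outcome depends on a recorded score plus an unrecorded factor, and discarding cases with similar scores but different outcomes (a crude stand-in for the editing rule quoted above) makes the recorded score look more predictive than it actually is:

```python
import numpy as np

rng = np.random.default_rng(2)

# Outcome depends on a recorded score x AND an unrecorded factor u,
# so x alone is only partly predictive.
n = 2000
x = rng.normal(0.0, 1.0, n)
u = rng.normal(0.0, 1.0, n)
y = (x + u > 0).astype(int)

def score_auc(x, y):
    # AUC of the bare "higher x means outcome 1" ranking.
    pos, neg = x[y == 1], x[y == 0]
    return np.mean(pos[:, None] > neg[None, :])

# "Data editing": drop every case whose x falls in a narrow bin that
# contains both outcomes (similar scores, different results).
bins = np.digitize(x, np.linspace(-3, 3, 40))
keep = np.ones(n, dtype=bool)
for b in np.unique(bins):
    in_bin = bins == b
    if 0.0 < y[in_bin].mean() < 1.0:
        keep[in_bin] = False

full_auc = score_auc(x, y)
edited_auc = score_auc(x[keep], y[keep])
print(full_auc, edited_auc)  # the edited data make x look more predictive
```

The edited sample retains only cases the score already "explains," so the apparent predictive power of the score rises even though nothing about the underlying process has changed.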
Prior literature on graduate performance predictors
Three publications are cited to support the claim that traditional predictors have little value.(3,4,7) One is a thesis (7) that found no significant predictors of outcomes. Since that study was based on a single program, it was subject to range restriction and collider stratification bias. With N=54, it had little chance of finding any effects. One other study, not cited, also found no conventionally statistically significant evidence of GRE predictive power in a design that should minimize collider stratification bias, but it concerned less quantitative fields and had N=32.(8) Those results contrast with an uncited larger study (N=138) lacking collider stratification bias that showed correlation coefficients between 0.55 and 0.70 between GRE scores and measures of psychology graduate students' performance in learning and using quantitative social science methods.(9)

Two papers cited (3,4) are specifically on whether GREs help predict completion of physics programs. Although those papers did claim at most points (the Supplement to (4) is an exception) that the GREs were not significantly predictive, subsequent papers have shown that their analyses relied on myriad statistical errors, including a biased imputation method for missing data, collider stratification bias, improper treatment of range restriction, improper treatment of collinearity, improper use of null hypothesis significance testing on small subsets, and other errors.(10,11) Each error tended to obscure the predictive power of GREs.(10,11) Reanalysis using conventional methods showed that, holding UGPA constant, the Quantitative and Physics GREs together provided a factor of 3 or more in predicting the odds of graduation between the 90th percentile and the 10th percentile in the large domestic cohort studied.(11)

Young et al. (1) justify the use of the complicated random forest technique to look for changes in the admissions criteria by citing a paper (12) comparing various techniques for predicting which students did well or poorly in a graduate data science program. Young et al. (1) do not, however, mention the substantive results of that paper. According to the random forest results shown in its Table 4, three of the top four predictors of who would do poorly (bottom 20th percentile) were GREs.(12) Three of the top five predictors of who would do well (top 20th percentile) were GREs.(12) Thus while this reference does offer support for the utility of random forest predictive models in some circumstances, its substantive results are evidence against the underlying premise that dropping or de-emphasizing GREs will improve graduate performance in fields involving quantitative analysis of data.(12)

Regardless of details of the models, the analysis of Young et al. (1) does show that at MSU the Physics GRE was given much more weight in admissions decisions than the Quantitative GRE. If that turns out to be typical of other departments, it could help explain why the Quantitative GRE was more predictive than the Physics GRE in multiple regression fits (3,4), since the more heavily weighted admissions variable picks up more collider stratification bias.
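Collider stratification bias, invoked repeatedly above, can itself be illustrated with a toy simulation (all quantities hypothetical). When admission selects on a test score plus an unmeasured quality, conditioning on the admitted pool largely erases the score-outcome correlation that exists among all applicants:

```python
import numpy as np

rng = np.random.default_rng(3)

n = 200_000
score = rng.normal(0.0, 1.0, n)      # measured predictor (e.g., a test score)
quality = rng.normal(0.0, 1.0, n)    # unmeasured factor admissions also weighs
admitted = (score + quality) > 1.5   # selective admission: the collider
graduated = ((score + quality + rng.normal(0.0, 1.0, n)) > 0).astype(float)

r_all = np.corrcoef(score, graduated)[0, 1]
r_admitted = np.corrcoef(score[admitted], graduated[admitted])[0, 1]
print(r_all, r_admitted)  # the score-outcome correlation shrinks sharply
```

Conditioning on admission induces a negative correlation between the score and the unmeasured quality among admitted students, which masks the score's real predictive value; the more heavily a variable was weighted in admissions, the stronger this masking becomes.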