Skills-focused lab instruction improves critical thinking skills and experimentation views for all students

Instructional labs are fundamental to an undergraduate physics curriculum, but their possible learning goals are vast with limited evidence to support any particular goal. In this study, we evaluate the efficacy of labs with different goals and structures on students' critical thinking skills and views about experimentation, using an extensive database of survey responses from over 20 000 students at over 100 institutions. Here, we show that labs focused on developing experimentation skills improve students' critical thinking skills and experimentation views compared to labs focused on reinforcing lecture concepts. We further demonstrate the positive impacts of skills-based labs over concepts-based labs on these outcomes across students' gender and race or ethnicity. Our analysis also shows that activities to support students' decision making and communication explain over one-half and one-third of the effect of skills-based labs on students' critical thinking skills and experimentation views, respectively, while modeling activities have only a small effect on performance.


I. INTRODUCTION
Instructional labs make up an important part of the undergraduate physics curriculum, with the opportunity to engage students in the practices of experimental physics and develop their technical, communication, and critical thinking skills [1]. There are myriad ways that labs can be structured, consistent with a wide range of learning goals [2–6]. However, literature on lab instruction demonstrates a lack of consensus on the desired goals of lab instruction [2,7], with many labs focusing on demonstrating or reinforcing canonical theories or phenomena, rather than seeking to develop students' skills [8]. The lack of consensus on goals for labs has been attributed to a lack of evidence about the efficacy of labs for either goal [2,9,10]. A growing body of research has begun evaluating the efficacy of different lab programs on students' critical thinking skills and views, but a comprehensive study across multiple curricula is needed.
Research has suggested that labs with an explicit goal of reinforcing concepts also taught in lecture have little to no impact on students' conceptual understanding [11–16].
In this paper, we probe an open area of research that looks to evaluate the efficacy of labs with different goals. We examine how labs from over 100 institutions impact students' critical thinking skills and experimentation views, and, more importantly, evaluate whether the impact is consistent for students of different genders and races or ethnicities. Additionally, we evaluate what types of instructional activities may explain the impacts on students' outcomes. Ultimately, the analysis demonstrates the universally positive impact of skills-based labs compared with concepts-based labs on these assessments for all demographics of students, due in part to their increased use of activities that target student decision making and communication.
A. How does lab type impact subpopulations of students?
Several previous studies have looked at the effects of labs with different goals on overall student performance. For example, research has generally found benefits of labs aiming to develop students' experimentation skills, looking at whole cohorts of students. Many studies in physics education research, however, have found (or suggested) that instructional outcomes may differ for different subpopulations of students [22–24]. In labs, specifically, researchers have found that men held more positive attitudes and beliefs about experimental physics than women, on average, and that the size of this difference was largely maintained following instruction [25]. Similar differences have been found on measures of student performance on lab-specific outcomes, such as one study that found men outperformed women on a data handling diagnostic, on average, and that the difference widened following instruction [26]. Analysis of students' grades, however, found no consistent difference between men's and women's grades in lab courses, despite consistent differences in lecture courses [27].
Some of these differences may be a result of how different students experience or are able to participate in the instruction. For example, previous work found that men and women participate in lab activities and roles differently [26,28–33]. This division of tasks, however, may vary depending on the goal or type of lab instruction [30].
In a more open-ended lab that aimed to develop students' experimentation skills, researchers found that there were more roles available to students and that the division of roles along gender lines was more distinctive than in a traditional lab aiming to reinforce concepts [30].
Alternatively, the differences (or lack thereof) may be inherent to the pedagogical structures of laboratory courses in general. For example, the assessment and grading structures in lab courses differ significantly from those of lecture courses [27], which likely impacts the ways students approach the courses [34]. The prolific use of group work [8], often such that students are assessed at the group level, may also be a factor. Altogether, this literature raises the question: How do labs with different purposes impact the attitudes and experimentation skills of students from different subpopulations?
B. What pedagogical features of different lab types lead to outcomes?
Given the apparent benefits of labs aiming to develop students' experimental physics skills, the next question is: why? Given that it is the implementation, rather than the goal, of the lab that impacts students' learning, through what mechanisms do labs with different goals impact students' performance?
To answer this question, we explored the types of skills the instructors indicated focusing on through their lab activities. For example, lab courses may focus on data analysis and uncertainty, experimental design, modeling, or communication skills, among others [3]. Lab courses in this dataset almost ubiquitously included activities associated with data analysis and uncertainty [8], restricting us from drawing any conclusions about the impacts of these skills on student outcomes. Three other types of skills, however, were more variable in the dataset.
First, many courses focused on developing students' skills around making decisions about an experiment, such as to choose a research question or design aspects of the experimental procedures. Pedagogically, supporting students' decision making requires opportunities and support for student agency [35]. Labs that support student agency and choice have been shown to improve students' engagement [36], attitudes and beliefs [14,37,38], and engagement in experimental physics practices [15,19,35].
Second, a fundamental aspect of experimental physics is learning to generate and test knowledge about physics through modeling activities. A focus on modeling has been seen in high school physics [39], introductory physics instruction [40], and upper-division physics labs [41]. A modeling focus has been shown to engage students in behaviors that align with more expertlike thinking and reasoning in the lab [42,43] and support students' attitudes towards physics [44].
Finally, communication skills are a significant focus of many lab courses [45], as students learn to use lab notebooks [46], write lab reports or mock journal articles [47,48], or present their work [48]. Communication skills may also relate to teamwork skills [48], which may in turn develop classroom community and foster students' sense of belonging [49].

C. Research questions
To measure students' experimentation views and critical thinking skills, we analyzed data from two previously validated assessments: the Colorado Learning Attitudes about Science Survey for Experimental Physics (E-CLASS) [50] and the Physics Lab Inventory of Critical thinking (PLIC) [21]. Together, these two instruments have been used in hundreds of courses at over 100 different institutions, with responses from over 20 000 students around the world. Our research questions in this study are threefold:
RQ1 How do labs with different purposes affect the E-CLASS and PLIC scores for different subpopulations of students?
RQ2 What pedagogical features are characteristic of labs with different intended purposes?
RQ3 How do pedagogical features of labs affect students' E-CLASS and PLIC scores?
We distinguish three different types of labs based on their overall purpose: (a) labs that aim to reinforce material from lecture (concepts-based labs), (b) labs that aim to develop students' experimentation skills (skills-based labs), and (c) labs that aim to do both (mixed labs).
To answer RQ1, we evaluate students' outcomes through a lens of equity of individuality [24,51–53]: "Equity of individuality is achieved when an intervention improves the outcomes of students from marginalized groups" (Ref. [24], p. 40). With this definition of equity, we do not explicitly examine whether achievement gaps exist between groups of students, nor whether these potential gaps are closed, maintained, or widened following instruction. We instead examine whether labs with different purposes provide positive outcomes for all groups of students.
For RQ2 and RQ3, we consider pedagogical features that relate to decision-making, modeling, and communication activities that students engage in during labs. For RQ2, we examine how instructors' intended goals for their labs align with the prevalence of these activities in their labs, an indication of their pedagogical choices. For RQ3, we evaluate the effects of these pedagogical features on students' E-CLASS and PLIC scores to ascertain the role of these features in developing students' experimentation views and critical thinking skills in physics labs. By addressing these two research questions, we begin to explore how instructors' intended goals for their labs manifest in their pedagogical choices and the impacts of these pedagogical choices.

II. METHODS
In this section, we provide an overview of the data and analysis methods used in this study. Additional details can be found in Appendix A.

A. Data sources
We measured students' critical thinking skills and views about experimental physics through two instruments.
The E-CLASS aims to measure students' personal views about experimental physics in their lab class by evaluating the degree to which students agree or disagree with statements about experimental physics [54]. The instrument consists of 30 five-point Likert items and students are scored based on how well their responses align with responses from expert physicists on a collapsed three-point scale: students receive one point on an item if their answer aligns with the majority of experts (e.g., the student selects "agree" or "strongly agree" when most experts selected "strongly agree" or "agree") and −1 points if their response is opposite to that selected by the majority of experts (e.g., the student selects "agree" or "strongly agree" when most experts selected "strongly disagree" or "disagree"). Neutral responses receive zero points. This scoring scheme provides a range of possible scores on the E-CLASS from −30 to 30 [25,54].
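The collapsed three-point scoring described above can be illustrated with a minimal sketch. The 1 to 5 response coding, the `expert_direction` encoding, and all function names are assumptions made for illustration; they are not taken from the E-CLASS materials.

```python
# Sketch of the collapsed three-point scoring scheme described in the text.
# Assumption: responses are coded 1 (strongly disagree) through 5 (strongly agree).

def collapse(response: int) -> int:
    """Collapse a five-point Likert response to -1 (disagree), 0 (neutral), or +1 (agree)."""
    if response not in (1, 2, 3, 4, 5):
        raise ValueError(f"expected a response from 1 to 5, got {response}")
    if response <= 2:
        return -1
    if response >= 4:
        return 1
    return 0


def score_item(student_response: int, expert_direction: int) -> int:
    """Score one item: +1 if aligned with the expert majority, -1 if opposite, 0 if neutral.

    expert_direction is +1 when most experts agree with the statement
    and -1 when most experts disagree (hypothetical encoding).
    """
    if expert_direction not in (-1, 1):
        raise ValueError("expert_direction must be +1 or -1")
    return collapse(student_response) * expert_direction


def score_survey(responses, expert_directions) -> int:
    """Sum item scores; with 30 items the total ranges from -30 to 30."""
    if len(responses) != len(expert_directions):
        raise ValueError("responses and expert_directions must have the same length")
    return sum(score_item(r, d) for r, d in zip(responses, expert_directions))
```

With 30 items, a student who agrees with the expert majority on every item scores 30, and a student who selects only neutral responses scores 0.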
The PLIC aims to measure students' critical thinking skills in the context of experimental physics, defined here as the decision making involved in interpreting data, drawing accurate conclusions from data, comparing and evaluating models and data, evaluating methods, and deciding how to proceed in an investigation [21,55]. The PLIC consists of 10 multiple-response items and, as with the E-CLASS, students are scored according to how well their responses align with those from expert physicists.
Scores on each item can range from zero to one with partial credit awarded for selecting response choices that were picked by at least 10% of experts. Possible scores on the PLIC, then, range from zero to ten.
When administering the E-CLASS or PLIC, instructors provided details about their class through a course information survey (CIS) [17,56]. The CIS asks, for example, about the course level, the number of hours students spent in lab each week, the number of instructional staff, and the main purpose of the lab (either to reinforce physics concepts, develop lab skills, or both about equally). We acknowledge that one cannot isolate skills or concepts in labs entirely; models and practices are inextricably linked in experimental physics. These broad categorizations of the lab purpose, however, inform whether, primarily, the practice is in service of theory or the theory is in service of practice [11]. Because different instructors may view these characterizations differently, we also explore more tangible pedagogical variables that may more specifically characterize these categories of lab purposes. The survey also asks how often students engage in various activities in the lab, such as designing procedures, building apparatus, or working in groups. We used this last set of items to measure the amount of decision making, modeling activities, and communication activities in the labs, discussed in further detail in Appendix B.
We used data collected from only first-year courses, as labs at the beyond-first-year level tended to be more homogeneous and aligned with developing students' lab-related skills; only four beyond-first-year courses in our E-CLASS and PLIC datasets were labeled as concepts-based by instructors. We used responses to the E-CLASS from 16 409 students enrolled in 230 classes and 56 institutions, and responses to the PLIC from 4988 students enrolled in 77 classes and 28 institutions. We define a class here as a combination of course (e.g., Physics 101) and semester (e.g., Fall 2019), so a single course may administer the E-CLASS or PLIC in multiple semesters and count as multiple classes. In Table I, we provide the breakdown of institutions and classes in our datasets across institution type and the main purpose of the lab associated with the class. Both the E-CLASS and PLIC are administered to students prior to lab instruction (pretest) and following the conclusion of lab instruction for the semester (posttest). We collected students' self-reported gender, race or ethnicity, and academic major information at the end of the surveys. Table II gives a breakdown of the students in our datasets by students' self-identified gender and race or ethnicity and by the type of lab in which the student was enrolled. For both instruments, students had the option of not disclosing any demographic information, in which case we categorized their gender or race or ethnicity as unknown. We kept these students in our dataset to maintain statistical power and recognize all students who completed the assessments whether or not they were comfortable disclosing demographic information. We also chose not to collapse demographic characteristics (such as by grouping students into majority or underrepresented minority categories) to more accurately evaluate the possible effects of lab instruction on different subgroups of students [57].

B. RQ1: Effects of different lab types on scores for different subpopulations of students
To address RQ1, we performed mixed-model regression analysis to evaluate the impact of lab type on student outcomes for different subpopulations of students. We used two-level linear mixed models with institutions as random effects to account for students being nested within institutions in our datasets. The institutional random effects help to account for potential systematic differences between institutions, such as prior preparation or instruction in nonlab components of the courses (for a review of linear mixed models for physics education research datasets, see Ref. [58]). Unconditional models (models with no fixed effects, but with institution as a random effect) indicated that 5.1% of the variation in E-CLASS post-test scores and 7.1% of the variation in PLIC post-test scores could be explained by institution-level differences alone.
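The share of variation attributable to institutions in an unconditional two-level model is commonly summarized by the intraclass correlation, the between-institution variance divided by the total variance. The sketch below shows that arithmetic only; the function name and the variance values in the usage note are hypothetical, and the source does not state that its percentages were computed with this exact formula.

```python
def intraclass_correlation(var_between: float, var_within: float) -> float:
    """Fraction of total variance attributable to the grouping level.

    var_between: variance of the institution-level random intercepts.
    var_within: residual (student-level) variance.
    """
    if var_between < 0 or var_within < 0:
        raise ValueError("variance components must be non-negative")
    total = var_between + var_within
    if total == 0:
        raise ValueError("total variance must be positive")
    return var_between / total
```

For example, hypothetical variance components of 5.1 (between institutions) and 94.9 (within institutions) would give a ratio of about 0.051, that is, 5.1% of the variation lying at the institution level.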
Separately for the E-CLASS and PLIC, we fit models with post-test score as the dependent variable and main effects for lab type, pretest score, gender, race or ethnicity, and major. We additionally included interaction terms between lab type and gender, and lab type and race or ethnicity. This model allowed us to investigate whether the effect of lab type on students' post-test scores differed across student demographic groups. We used a lens of equity of individuality [24,51], whereby we evaluated the post-test scores in each intervention (i.e., lab type) separately for each demographic group. Using pretest score as a main effect variable serves to account for differences in students' incoming preparation.
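One way to write down a model of the kind described above, as a sketch in our own notation (the symbols are assumptions, not the authors'), is for student $i$ at institution $j$:

```latex
\begin{equation}
\begin{aligned}
y_{ij} ={}& \beta_0 + \beta_1\,\mathrm{pre}_{ij}
  + \boldsymbol{\beta}_L^{\top}\mathbf{L}_{ij}
  + \boldsymbol{\beta}_G^{\top}\mathbf{G}_{ij}
  + \boldsymbol{\beta}_R^{\top}\mathbf{R}_{ij}
  + \boldsymbol{\beta}_M^{\top}\mathbf{M}_{ij} \\
 &+ \boldsymbol{\beta}_{LG}^{\top}\left(\mathbf{L}_{ij}\otimes\mathbf{G}_{ij}\right)
  + \boldsymbol{\beta}_{LR}^{\top}\left(\mathbf{L}_{ij}\otimes\mathbf{R}_{ij}\right)
  + u_j + \varepsilon_{ij},
\qquad u_j \sim \mathcal{N}(0,\tau^2),\quad
\varepsilon_{ij} \sim \mathcal{N}(0,\sigma^2).
\end{aligned}
\end{equation}
```

Here $y_{ij}$ is the post-test score, $\mathrm{pre}_{ij}$ the pretest score, and $\mathbf{L}$, $\mathbf{G}$, $\mathbf{R}$, and $\mathbf{M}$ are indicator vectors for lab type, gender, race or ethnicity, and major. The two product terms are the lab type by gender and lab type by race or ethnicity interactions, and $u_j$ is the institution-level random intercept.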

C. RQ2: Pedagogical features that characterize different types of labs
We used a confirmatory factor analysis (CFA) to construct a measurement model for the amount of decision making, modeling, and communication activities in labs using a combined dataset of instructors' responses to both the E-CLASS and PLIC CIS. To avoid overweighting courses where instructors administered both the E-CLASS and PLIC in the same course or either assessment in multiple classes across semesters, we included only one entry for each unique course in our dataset. We made an exception if an instructor provided different responses to the CIS for different classes, indicating that pedagogical features of their lab had changed. The dataset used in this analysis included 157 unique courses (or classes where the instructor provided different responses to the CIS in different semesters). The full measurement model used and its development are presented in Appendix B.

TABLE II. Demographic breakdown of students in the E-CLASS and PLIC datasets across lab type. Racial or ethnic groups were not considered mutually exclusive and so numbers may not sum to the total number of students in the dataset.

                              E-CLASS                           PLIC
                      All    (by lab type)            All    (by lab type)
All                16 409   3209   3823   9377       4988   1838   2229    921
Gender
  Man                9236   1604   2234   5398       2762   1065   1168    529
  Nonbinary           172     29     48     95         51     11     29     11
  Woman              6626   1489   1482   3655       2140    753   1013    374
  Unknown             375     87     59    229         35      9     19      7
Race or ethnicity
  American Indian     152     17     22    113         63     22     29

To address RQ2, we used Thurstone's regression method [59] and our measurement model to compute factor scores for the amount of decision making, modeling, and communication activities in each of the 157 unique courses in our dataset. We standardized factor scores to have mean zero and standard deviation 1.
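The final standardization step can be sketched as follows. This covers only the rescaling to mean zero and unit standard deviation, not the factor-score computation itself; the use of the population standard deviation is an assumption, since the text does not say which convention was used.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Rescale scores to mean 0 and standard deviation 1.

    Assumption: the population standard deviation (pstdev) is used.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        raise ValueError("cannot standardize scores with zero variance")
    return [(s - mu) / sigma for s in scores]
```

After this step, a difference of 1 in a factor score corresponds to one standard deviation across the courses in the dataset, which is how the factor-score differences between lab types are reported.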

D. RQ3: Effects of pedagogical variables on students' scores
To address RQ3, we further computed factor scores for all classes in both the E-CLASS and PLIC datasets using the same procedure as above. Eight classes (corresponding to two unique courses that used the PLIC multiple times) with 158 students total were missing information on the PLIC CIS such that we could not compute their factor scores, and so were removed from the dataset. We used the complete factor scores data in two-level linear mixed models similar to those described in Sec. II B (separately for the E-CLASS and for the PLIC) with post-test score as the dependent variable and controlling for the main effects of pretest scores, gender, race or ethnicity, and major. Unlike the models described in Sec. II B, we did not include lab type as a main effect and we did not include interaction effects between any variables.
Because of a large correlation between the factor scores for decision making and communication activities (r = 0.93), we fit two separate models for each of the E-CLASS and PLIC datasets. In one model, we included decision making and modeling factor scores, while in the other model we included modeling and communication factor scores. These models allowed us to determine the effect of increased decision making and communication on students' scores controlling for the amount of modeling in labs, and vice versa. We could not, however, disentangle the effects of decision making and communication on students' scores, given the significant correlation.

III. RESULTS

A. RQ1: Effects of different lab types on scores for different subpopulations of students
Full results from the fitted linear mixed models for the E-CLASS and PLIC are presented in Appendix C. In this section, we summarize the results using plots of expected post-test scores from marginal effects. Marginal effects represent the expected outcomes from our fitted models and indicate how the outcome measure (i.e., E-CLASS and PLIC post-test scores) changes with particular independent variables (i.e., lab type, gender, and race or ethnicity). In these marginal effects plots, students' pretest scores are held fixed at the mean value for each independent variable being investigated. All other variables, other than the ones plotted, are held fixed at their proportions. For example, 13.8% of students in our E-CLASS dataset intended to major in math or computer science. When calculating marginal effects from this model, we set the Math and CS variable, which is generally a binary variable (1 if the student is a math or CS major and 0 otherwise), equal to 0.138. These marginal effects should, then, be interpreted as expected post-test scores averaged across all other variables. Figure 1 shows the expected post-test scores for the E-CLASS and the PLIC. The first panel in each row shows the effect of lab type on aggregate. Overall, on both assessments, students in skills-based labs score an average of 0.2 standard deviations higher at post-test than students in concepts-based labs, controlling for students' pretest scores, and self-reported major, gender, and race or ethnicity. Mixed labs sit between the two extremes.
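The idea of holding a binary covariate at its sample proportion can be shown with a small sketch of a linear prediction. All coefficient values, the intercept, the pretest mean, and the variable names are hypothetical and chosen only to show the arithmetic; only the 0.138 proportion comes from the text.

```python
def expected_posttest(intercept, coefs, values):
    """Linear prediction: intercept plus the sum of coefficient * covariate value."""
    missing = set(coefs) - set(values)
    if missing:
        raise KeyError(f"missing covariate values: {sorted(missing)}")
    return intercept + sum(coefs[name] * values[name] for name in coefs)


# Hypothetical fixed-effect estimates (not from the fitted models in the paper).
intercept = 8.0
coefs = {"pretest": 0.6, "skills_lab": 1.0, "math_cs_major": -0.5}

# Pretest held at a hypothetical mean; the binary major indicator is held at
# its sample proportion (0.138) rather than at 0 or 1.
base = {"pretest": 17.0, "math_cs_major": 0.138}

concepts = expected_posttest(intercept, coefs, {**base, "skills_lab": 0.0})
skills = expected_posttest(intercept, coefs, {**base, "skills_lab": 1.0})
difference = skills - concepts  # equals the skills_lab coefficient in this additive sketch
```

Because this sketch has no interaction terms, the difference between the two predictions is just the lab-type coefficient. In a model with interactions, the same procedure would be repeated for each demographic group.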
The second and third panels of Fig. 1 show that, when there is sufficient precision to distinguish the groups, the pattern is consistent across student gender and race or ethnicity. That is, students from all subpopulations score higher on the instruments when participating in skills-based labs compared to concepts-based labs, again with mixed labs in the middle. The size of the effect differs for different subpopulations, however, and the sample sizes in some subpopulations are too small to distinguish the scores between lab type.

B. RQ2: Pedagogical features that characterize different types of labs
We found differences in the average factor scores across all three pedagogical variables between skills-based and concepts-based labs. For agency factor scores, this difference was about 1.44 ± 0.17 standard deviations; for modeling factor scores, this difference was about 0.51 ± 0.23 standard deviations; and for communication factor scores, this difference was about 1.58 ± 0.16 standard deviations.
Smoothed density plots of the fraction of labs of each type with varying amounts of agency, modeling activities, and communication activities are shown in Fig. 2. We find that skills-based labs typically engage students in more decision making and communication activities than concepts-based labs, while mixed labs typically fall somewhere between these extremes. The amount of modeling activities in all three types of labs follow similar distributions, with mixed and skills-based labs supporting slightly more modeling than concepts-based labs, on average.
We also found that these pedagogical variables were not independent. The presence of modeling activities in labs was moderately correlated with an increase in decision making (r = 0.41) and communication activities (r = 0.23), while increased student decision making was highly correlated with increased communication activities (r = 0.93). These results imply that pedagogical choices made to support student decision-making in physics labs are closely associated with choices to support opportunities for student communication.

FIG. 1. Error bars represent one standard error (68% confidence interval). We have not shown marginal effects on the PLIC for students who identified as nonbinary, an unknown gender, American Indian or Alaska Native, or Native Hawaiian or Pacific Islander due to error bars that exceed the range of the plots. These students were included in our models, however, and full results can be found in Appendix C.

C. RQ3: Effects of pedagogical variables on students' scores
Figure 3 shows the expected post-test scores as a function of the amount of decision-making, modeling, and communication activities in the labs for both the E-CLASS and the PLIC (see Appendix C for the full results of the fitted linear mixed models). Post-test scores are higher in labs with more activities related to decision making and communication, with a small effect from modeling activities.
We also estimated the fraction of the observed effect of skills-based labs (compared to concepts-based labs) that can be explained by these pedagogical variables (either decision making, modeling, or communication activities). Full results and calculations can be found in Appendix C. We find that decision-making and communication opportunities in labs accounted for 34%–41% of the observed effect of skills-based labs on students' scores on the E-CLASS and 58%–76% of the effect on PLIC scores (corresponding to standardized effect sizes of 0.06–0.09). In contrast, the presence of modeling activities in labs accounted for less than 7% of the difference in scores between skills-based and concepts-based labs on both the E-CLASS and PLIC (standardized effect sizes of 0.03 ± 0.01 and 0.02 ± 0.03 for the E-CLASS and PLIC, respectively).

IV. DISCUSSION
Our analysis validates previous work [17] demonstrating the overall effectiveness of skills-based labs in developing students' views about experimental physics, now with a much broader dataset. We also demonstrate similar effects on students' critical thinking skills in the context of experimental physics. By simultaneously comparing student scores by lab type and demographic variables, we illustrate that this effect is consistently positive regardless of students' gender and race or ethnicity.
Although all students benefited from skills-based labs, there was still a differential impact on subpopulations of students. The difference in scores on the E-CLASS between students in skills-based versus concepts-based labs was much larger for women than men, consistent with previous work [17]. One might infer that concepts-based labs are particularly detrimental to women's views or that skills-based labs are particularly beneficial to women's views. Future work, such as through interviews and video analysis of students in labs, should further evaluate how women experience skills-based labs compared with concepts-based labs, particularly given the effects on their participation seen previously [26,28–33], and how this relates to their views of experimental physics more broadly.
This differential impact on subgroups of students in different types of labs, however, did not exist with the PLIC. The difference between students' PLIC scores in skills-based and concepts-based labs was the same regardless of students' gender. This result suggests that skills-based and concepts-based labs affect men's and women's critical thinking skills similarly, despite differences in how they affect their views. The impact of mixed labs, however, on students' PLIC scores differed for men and women in ways that cannot be explained by the dataset. Future work should evaluate possible explanations for this differential impact.
The sizes of the gaps between lab types for students of different reported races or ethnicities were more variable on both assessments, as was the precision with which we could measure them. Future work should continue to evaluate the impacts of different types of labs on students with different racial or ethnic identities.
We also find evidence that skills-based labs typically incorporated more activities to support decision making and communication skills than concepts-based or mixed labs, which correlated with students' PLIC and E-CLASS performance. Modeling activities, on the other hand, had a smaller difference in prevalence between lab types and a smaller effect on students' post-test scores. The results indicate that the measures of decision making and communication activities explain most, but not all, of the variability between scores based on lab type.
The additional variability between lab type and student scores may come from limitations in our ability to measure the pedagogical activities in the courses. First, our analyses initially relied on instructors' classifications of their courses as aiming to reinforce conceptual understanding, develop lab skills, or both about equally. Individual instructors may have used different criteria for characterizing their course along these lines, as evidenced by the variability with which the courses incorporate the three pedagogical features studied. In addition, the analysis captured only the instructors' perceptions of their instruction, not necessarily what actually took place, and captured only a finite (and not exhaustive) set of activities related to these pedagogical variables. For example, the analysis does not include the role of grading, instructor feedback during class time, or the role of interactions between students, all of which may impact the enactment of these pedagogies. Analysis of instructors' course materials (e.g., syllabi, explicit learning goals, lab instructions) and the enactment of those materials in classrooms could more accurately (albeit less efficiently) determine the actual instruction carried out in the labs.
Alternatively, the additional variability may come from additional types of activities included in the instruction beyond the three variables explored here. For example, the role of grading in labs may be a critical variable. Prior work found no differences in men's and women's physics lab grades, despite consistent differences in their lecture course grades [27]. The authors attributed this result to differences in the testing and grading structures in labs compared with lectures. That is, lecture course grades are typically weighted heavily towards high-stakes individual testing and exams, while lab grades typically rely on lower stakes assessments and group activities. Our study, however, provides additional nuance to this explanation: labs focused on concepts or skills both typically involve group work and we are unaware of systematic differences in testing strategies based on lab type. Future work should seek to identify additional types of activities that differ significantly between the three types of labs and with lectures to explain the remaining effects.

V. LIMITATIONS AND CONCLUSIONS
Our study is limited in several ways. First, our analysis focused exclusively on two physics lab assessments, which probe student views about experimental physics and critical thinking skills. We cannot say that the results can be generalized beyond these constructs or these assessments. The interpretations about students' views about experimental physics and critical thinking skills are only valid insofar as the assessments validly measure these constructs. Future work should further probe these ideas using additional data sources.
While the datasets are much larger and more diverse than those of typical PER studies [22], simultaneously including data from an array of institution types and student characteristics, the analysis is still limited by sample size, in terms of both the number of unique classes and the number of students who identified with select demographic categories. The data are also not uniformly weighted between these variables, meaning some variables were more precisely measured than others and some results may be biased towards particular institutions or course types. The limited number of students in several demographic groups limited the reliability of estimates of the effect of lab type for those groups of students. Results presented in Appendix D 1 indicated that we could not have achieved much better precision for these estimates by using simpler models. Future work, therefore, should further test these results with larger and more diverse samples.
Overall, we found that labs focused on skills improve (or produce equivalent) PLIC and E-CLASS scores for all students compared with labs aiming to reinforce concepts or do both, in part due to the increased focus on student decision-making and communication. The results have important implications for improving student learning and experiences in labs, as well as representation in physics. Given that students with more expertlike views tend to persist in physics [60], it is plausible that a focus on experimentation skills over reinforcing concepts (or, alternatively, providing increased focus on student decision making and communication in labs) could retain more women and students from backgrounds historically excluded and marginalized in physics. Future work should evaluate this possibility explicitly, acknowledging that the different types of labs may affect other aspects of student learning and experiences in different ways.

ACKNOWLEDGMENTS
We acknowledge support from NSF PHY-1734006 and NSF DUE-1611482.

APPENDIX A: DATA COLLECTION AND PROCESSING
Data were collected with the E-CLASS between August 2016 and December 2019 and data were collected with the PLIC between August 2017 and December 2020. Both the E-CLASS and PLIC were administered online as part of an automated administration system [56]. Individual instructors determined how to administer the instruments in their classes, but the automated system sent regular reminders to instructors updating them on how many of their students had completed the assessment. In Sec. A 1, we summarize how data were filtered to arrive at the dataset used in the main text [61]. In Sec. A 2, we provide additional details on the student-level variables used in our analyses: gender, race or ethnicity, and major.

1. Data filtering
In total, 36 538 students in first-year classes submitted valid E-CLASS responses and 10 387 students in first-year classes submitted valid PLIC responses. We considered a response to the E-CLASS to be valid if the student (1) clicked submit at the end of the survey, (2) consented to participate in the study, (3) responded to at least one question, and (4) responded correctly to the filtering question used to eliminate responses from students who were not reading the questions. We considered a response to the PLIC to be valid if the student (1) clicked submit at the end of the survey, (2) consented to participate in the study, (3) indicated that they were at least 18 years of age, and (4) spent at least 30 s on at least one of the four pages of the assessment. For both assessments, if a student submitted more than one valid pretest or post-test response, we kept only the first valid response submitted. Students also sometimes took the assessments multiple times as part of different classes. We treated these instances, denoted as student records in Table III, as independent events and used student records as the unit of analysis. We refer to student records as students in the main text.
We removed entire classes from our datasets when the instructor did not indicate the main purpose of their lab. We additionally removed classes from our datasets when the assessment was not administered both as a pretest and as a post-test. Both class-level filters were necessary to answer our research questions. We applied only one filter at the student level, removing students who did not have matched pretest and post-test responses in the datasets. We matched students within classes by student ID for both the E-CLASS and PLIC. For the PLIC, we additionally matched students within classes by the combination of first and last name. Students provided this information at the end of both assessments.
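The class-level and student-level filters described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: the `Response` record, its field names, and the `order` field used to pick the first valid submission are all hypothetical, and the sketch infers that a class administered both tests from the presence of at least one valid response of each kind.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Response:
    class_id: str    # hypothetical class identifier (course + semester)
    student_id: str  # hypothetical student identifier
    phase: str       # "pre" or "post"
    order: int       # hypothetical submission order; lower means earlier

def match_pre_post(responses, classes_with_lab_purpose):
    """Return {(class_id, student_id): (pre, post)} for matched student records."""
    # Class-level filter 1: the instructor indicated the main purpose of the lab.
    kept = [r for r in responses if r.class_id in classes_with_lab_purpose]

    # Class-level filter 2: both a pretest and a post-test were administered
    # (assumed here to mean at least one valid response of each kind exists).
    phases_by_class = {}
    for r in kept:
        phases_by_class.setdefault(r.class_id, set()).add(r.phase)
    kept = [r for r in kept if {"pre", "post"} <= phases_by_class[r.class_id]]

    # Keep only the first valid submission per class, student, and phase.
    first = {}
    for r in sorted(kept, key=lambda r: r.order):
        first.setdefault((r.class_id, r.student_id, r.phase), r)

    # Student-level filter: require matched pretest and post-test responses.
    matched = {}
    for (class_id, student_id, phase), r in first.items():
        post_key = (class_id, student_id, "post")
        if phase == "pre" and post_key in first:
            matched[(class_id, student_id)] = (r, first[post_key])
    return matched
```

Because the dictionary key pairs the class with the student, a student who took an assessment in two different classes yields two separate student records, consistent with the unit of analysis described above.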
The student-level filtering aims primarily to improve the validity of the analysis by removing, for example, incomplete responses, responses from students who did not complete the course, or responses from individuals randomly clicking through the instrument. The filtering, however, may introduce systematic biases in the dataset, such as skewing towards students with higher grades and scores on the assessments [62]. The data collected from the PLIC between March 2020 and December 2020 are also likely biased based on who was able to participate due to the challenges of the COVID-19 pandemic. Imputation methods are recommended for mitigating such biases [63], but our data sources do not include sufficient information to accurately identify and impute the missing data.
Table III summarizes how the class and student-level filters affected our datasets. Our matched datasets for the E-CLASS and PLIC contained 16 409 and 4988 students, respectively. We did not have accurate information about how many students were enrolled in each class, but we calculated an upper bound on the response rates by assuming that all students enrolled in classes in our datasets completed at least one of the pretest or post-test. With this assumption, 49.2% is an upper bound on the response rate for the E-CLASS and 52.4% is an upper bound on the response rate for the PLIC. These response rates are in line with typical response rates reported in the literature [63].
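The reason this assumption yields an upper bound can be written out explicitly. The symbols below are introduced only for illustration and are not the authors' notation: $N_{\text{matched}}$ is the number of matched students, $N_{\text{enrolled}}$ is the (unknown) true enrollment, and $N_{\text{any}}$ is the number of students who completed at least one of the pretest or post-test.

```latex
\text{response rate}
  = \frac{N_{\text{matched}}}{N_{\text{enrolled}}}
  \le \frac{N_{\text{matched}}}{N_{\text{any}}},
\qquad \text{because } N_{\text{enrolled}} \ge N_{\text{any}} .
```

Any enrolled student who completed neither test increases the denominator on the left but not on the right, so the true response rate can only be lower than the reported value.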

2. Student-level variables
Students optionally provided demographic information at the end of the E-CLASS and PLIC. For the E-CLASS, students could select from three options when identifying their gender: woman, man, or other (with text box). For the PLIC, students could select from five options when identifying their gender: woman, man, nonbinary (with text box), prefer to self-describe (with text box), or "prefer not to disclose." We distinguished students' self-identified genders using the terms man, woman, and nonbinary (which included students who selected "other" on the E-CLASS or "nonbinary" or "prefer to self-describe" on the PLIC) to more closely align with nonbinary and fluid definitions of gender identity [64]. Students could select multiple racial or ethnic identities from a list of seven that we provided: American Indian or Alaska Native, Asian, Black or African American, Hispanic or Latino, Native Hawaiian or other Pacific Islander, White, or other race or ethnicity. Again, students could select prefer not to disclose or skip the question entirely. We did not treat these racial or ethnic identities as mutually exclusive; rather, we included each of them as a separate independent variable in our analyses, so students could belong to multiple groups. The choices for students' race or ethnicity follow the Department of Education IPEDS definitions of race [65].
TABLE III. Number of institutions, classes, students, and student records included in the PLIC and E-CLASS datasets following each round of the data filtering process. We define a class as a combination of course (e.g., Physics 101) and semester (e.g., Fall 2019), so a single course may administer the E-CLASS or PLIC in multiple semesters and count as multiple classes. Students that took an assessment in different classes are counted multiple times in the student records tally. In the main text, we refer to student records as students and use these data in our analyses.
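Treating race or ethnicity as non-mutually-exclusive categories amounts to giving each category its own 0/1 indicator variable. The sketch below illustrates that encoding, using the seven options listed in the text. The function name and the dictionary layout are assumptions made for illustration, and the sketch does not cover how undisclosed or skipped responses were coded.

```python
# Category labels follow the list of seven options given in the text.
RACE_ETHNICITY_OPTIONS = (
    "American Indian or Alaska Native",
    "Asian",
    "Black or African American",
    "Hispanic or Latino",
    "Native Hawaiian or other Pacific Islander",
    "White",
    "Other race or ethnicity",
)

def encode_race_ethnicity(selected):
    """Return one 0/1 indicator per category; several may equal 1 at once."""
    chosen = set(selected)
    unrecognized = chosen - set(RACE_ETHNICITY_OPTIONS)
    if unrecognized:
        raise ValueError(f"unrecognized options: {sorted(unrecognized)}")
    return {option: int(option in chosen) for option in RACE_ETHNICITY_OPTIONS}

# A student who selected two identities belongs to both groups.
row = encode_race_ethnicity(["Asian", "White"])
```

Each indicator can then enter a regression model as its own independent variable, so a student's record contributes to the estimate for every group they selected.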
The E-CLASS and PLIC provided different options for students when selecting their intended major, but both assessments allowed students to select prefer not to disclose or to skip the question. We collapsed students' intended major on the E-CLASS into seven categories: engineering, life science, math and computer science, physics (including astronomy, astrophysics, and engineering physics), other science (including chemistry, geology, and geophysics), nonscience, and open or undeclared. The original version of the PLIC provided students with five options when selecting their intended major and we kept those original groups with one exception: we combined physics and engineering physics into one group, physics, consistent with our E-CLASS groups. We thus used four groups for students' intended major on the PLIC: engineering, physics, other science, and other. We, again, labeled a student's major as unknown when this information was not provided by the student.

APPENDIX B: MEASUREMENT MODEL FOR PEDAGOGICAL FEATURES OF LABS
In a previous analysis using a similar dataset, Holmes and Lewandowski identified (using both exploratory and confirmatory factor analysis) a group of items on the CIS that measured the amount of decision-making and modeling in labs [35]. We began with these items and this factor structure when constructing a measurement model for the amount of decision-making and modeling in labs in our analysis. We extended the model examined in Ref. [35] by including an additional factor for the amount of communication activities in labs using additional items from the CIS. All of the CIS items used in this analysis are listed in Table IV.
We first performed a confirmatory factor analysis using the three-factor model presented in Table IV. We found that this measurement model did not adequately describe the data (comparative fit index [CFI] = 0.793; root mean square error of approximation [RMSEA] = 0.134; standardized root mean square residual [SRMR] = 0.100). Examining the standardized factor loadings, we found that item C4 did not load strongly onto the hypothesized communication factor (standardized factor loading < 0.35). We also found that item D1 had large residual correlations with three other items, including two that were part of the hypothesized communication factor (which was not part of the original model developed in Ref. [8]). In our revised model, we removed items D1 and C4. We also added covariance terms between modeling items with parallel language (i.e., M1 and M2, M1 and M3, M2 and M4, and M3 and M4) and with large modification indices in the original model (the average modification index for these four terms was 23.5). We found that this revised model fit our data adequately (CFI = 0.909; RMSEA = 0.100; SRMR = 0.074). The factor loadings from this model are shown in Table IV.
TABLE IV. CIS items designed to measure the amount of decision making, modeling, and communication activities in labs. Items on the CIS asked instructors how often students engaged in the listed activities: never, rarely, sometimes, often, always. Two items (D1 and C4) were dropped from our revised measurement model for reasons discussed in the text. The factor loadings presented were calculated after standardizing the latent variables. Columns: Factor, Code, Item, Loading.
We present complete results of fitted models from Sec. III A in Table V. Standardized effect sizes were obtained by standardizing continuous variables (i.e., pretest and post-test scores) and refitting the models. In the main text, we reported expected outcomes from our fitted models (i.e., marginal effects), which indicate how post-test scores varied with particular independent variables (i.e., lab type, gender, and race or ethnicity).

RQ3: Effects of pedagogical variables on students' scores
We present complete results of fitted models from Sec. III C in Tables VI and VII. Standardized effect sizes were obtained by standardizing continuous variables (i.e., pretest and post-test scores) and refitting the models. In the main text, we reported expected outcomes from our fitted models (i.e., marginal effects), which indicate how post-test scores varied with particular independent variables (i.e., pedagogical features).
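To illustrate what standardizing the continuous variables and refitting does to the coefficients, the following sketch uses ordinary least squares on synthetic data. This is a deliberate simplification: the models in this paper are linear mixed models with a nested structure, and all data, variable names, and coefficient values below are invented for illustration.

```python
import numpy as np

def ols(y, X):
    """Least-squares coefficients, with an intercept as the first entry."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def zscore(v):
    """Center a variable and scale it to unit sample standard deviation."""
    return (v - v.mean()) / v.std(ddof=1)

rng = np.random.default_rng(0)
n = 200
pre = rng.normal(10.0, 2.0, n)      # synthetic pretest scores
skills = rng.integers(0, 2, n)      # synthetic indicator: 1 = skills-based lab
post = 2.0 + 0.6 * pre + 0.5 * skills + rng.normal(0.0, 1.5, n)

# Fit once on the raw scale, then refit with only the continuous
# variables standardized (the indicator variable is left as 0/1).
raw = ols(post, np.column_stack([pre, skills]))
std = ols(zscore(post), np.column_stack([zscore(pre), skills]))
```

In this simplified setting, the refitted coefficient on the indicator equals the raw coefficient divided by the standard deviation of the post-test scores, so it expresses the group difference in units of post-test standard deviations.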
In Sec. III C, we also reported estimates for the fraction of the effect of lab type that could be explained by pedagogical features of the labs. We calculated these estimates by combining the results of Secs. III A and III B with the results presented in Tables VI and VII above. In Sec. III B, we calculated the differences in average factor scores between skills-based and concepts-based labs ($E[\mathrm{FS_{diff}}]$ in Table VIII). Multiplying these differences by the standardized effects of the pedagogical variables in Tables VI and VII ($\beta_{\mathrm{std}}$), we obtained an estimate for the expected difference in post-test scores (in units of standard deviations) between skills-based labs and concepts-based labs, considering only the pedagogical variables. Dividing this value by the marginal effect of skills-based labs (compared to concepts-based labs) on students' standardized post-test scores from Sec. III A (0.21 ± 0.05 for the E-CLASS and 0.18 ± 0.10 for the PLIC), we obtained an estimate of the fraction of the marginal effect of lab type that can be explained by each of the pedagogical variables. These results are presented in Table VIII.
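The arithmetic behind these estimates is a product followed by a ratio, as the short sketch below shows. The function name is hypothetical, and the factor-score difference and standardized effect used in the example are made-up placeholders rather than values from Table VIII. Only the marginal effect of 0.21 (E-CLASS) is taken from the text.

```python
def fraction_explained(fs_diff, beta_std, me_skills):
    """Share of the lab-type marginal effect attributable to one pedagogical variable."""
    if me_skills == 0:
        raise ValueError("marginal effect must be nonzero")
    # Expected post-test difference (in SD units) from this variable alone.
    expected_diff = fs_diff * beta_std
    return expected_diff / me_skills

# fs_diff and beta_std are placeholders for illustration only;
# 0.21 is the E-CLASS marginal effect quoted in the text.
example = fraction_explained(fs_diff=0.8, beta_std=0.15, me_skills=0.21)
```

The sketch ignores the uncertainties on each input, so it reproduces only the point estimate and not any interval around it.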

APPENDIX D: MODEL DIAGNOSTICS
In this appendix, we examine qualities of our linear mixed models from Sec. III A. In Sec. D 1 we examine how our model choices impacted precision in our estimated effects, while in Sec. D 2 we check visually how well our models satisfied assumptions of linear mixed models and discuss implications of violations of these assumptions.

1. Variance inflation factors
We prioritized accuracy over precision in this study by simultaneously controlling for several variables to better estimate the effect of lab type on students' scores. In this section, we present variance inflation factors that quantify the degree to which we decreased precision in our estimates by taking this approach. Variance inflation can have significant impacts on p values and elevate false negative rates, which is why we expressly avoided placing significant weight on p values in our discussion.
Variance inflation factors (VIFs) can be interpreted as the ratio of the standard error of a coefficient in a model to the standard error of that coefficient if only that variable, and no others, were included in the model. A VIF of two would indicate that the standard error of a coefficient in the model was double its standard error in a model with only that variable. We used a generalized VIF (GVIF) here that corrects for the degrees of freedom of a variable [66] and is more useful for linear mixed models. We checked the GVIFs for all variables included in each of our linear mixed models. The results are shown in Table IX.
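As a rough illustration of these quantities, the sketch below computes classical VIFs and a generalized VIF from the correlation matrix of a numeric design matrix. It assumes the determinant-based GVIF of Fox and Monette with the degrees-of-freedom adjustment GVIF^(1/(2 df)), which may not be the exact procedure behind Table IX. It also works on an ordinary design matrix instead of a fitted linear mixed model, and all data are synthetic.

```python
import numpy as np

def vif(X):
    """Classical VIF per column: the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def gvif(X, cols):
    """Generalized VIF for the group of columns `cols` (determinant form)."""
    cols = list(cols)
    others = [j for j in range(X.shape[1]) if j not in cols]
    if not others:
        raise ValueError("need at least one column outside the group")
    corr = np.corrcoef(X, rowvar=False)
    det_group = np.linalg.det(corr[np.ix_(cols, cols)])
    det_others = np.linalg.det(corr[np.ix_(others, others)])
    return det_group * det_others / np.linalg.det(corr)

def adjusted_gvif(X, cols):
    """GVIF^(1/(2*df)), where df is the number of columns in the group."""
    cols = list(cols)
    return gvif(X, cols) ** (1.0 / (2 * len(cols)))

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=500)  # synthetic, correlated with x1
x3 = rng.normal(size=500)                   # synthetic, roughly independent
X = np.column_stack([x1, x2, x3])
```

For a single column the generalized form reduces to the classical VIF, and the adjusted value is its square root, which puts terms with different degrees of freedom on a scale comparable to a ratio of standard errors.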
The GVIFs for most of the main effects variables were above 2, suggesting limited precision on the estimates of those effects. Our analysis was particularly concerned with the interaction terms between lab type, gender, and race or ethnicity, and the large VIFs on the main effects do not suggest any issues with the interpretation of the interaction terms. Models with interaction terms are generally susceptible to inflated variances, but we found only minimal variance inflation in the interaction terms here. Only the interaction terms for lab type with Asian, White, and Unknown race had GVIFs larger than two. The small GVIFs for the other interaction terms indicated that we could not have achieved substantially better precision for the interaction terms even if we had used a simpler model. The larger GVIFs on these three interaction terms were not problematic for interpreting our data because, even with inflated variances, we were able to measure these terms more precisely than the others.
TABLE VIII. Estimate of the fraction of the observed effect of skills-based labs (compared to concepts-based labs) that can be explained by the pedagogical variables. $E[\mathrm{FS_{diff}}]$ is the difference in average factor scores for skills-based and concepts-based labs. $\beta_{\mathrm{std}}$ is the effect of the pedagogical variable on students' post-test scores. $E[\mathrm{FS_{diff}}] \times \beta_{\mathrm{std}}$ gives the expected difference in students' post-test scores between skills-based and concepts-based labs with average factor scores, and dividing this value by the marginal effect of skills-based labs (compared to concepts-based labs), $\mathrm{ME_{Skills}}$, gives an estimate of the proportion of the effect of skills-based labs that can be attributed to each pedagogical variable. Note that the decision-making and communication variables were modeled separately and so the effects of these variables are not additive; there is considerable overlap in the variance explained by these variables.

2. Visual check of model assumptions
Linear mixed models have the same modeling assumptions as multiple linear regression, in addition to assuming that there exists a nested structure to the data. The two most important assumptions that we evaluate here are the homoskedasticity of residuals with fitted values (i.e., the spread of residuals is approximately the same for all predicted values of the dependent variable) and the normality of residuals.
We evaluated the above assumptions visually using Fig. 4. Plots of residuals against fitted values did not display any obvious trends that would lead us to conclude that the assumption of homoskedasticity was violated egregiously. All quantile-quantile plots of standardized residuals displayed some departure from normality at the left tails of the distribution. This departure from normality at the tails of the distribution is not uncommon and does not generally affect the interpretation of p values or the estimated effects [67].
FIG. 4. Model diagnostic plots for the linear mixed models. There is no noticeable trend in the residuals for either model. There are generally departures from normality in the residuals at the tails of the distribution, however, which is not uncommon and does not generally affect the interpretation of p values [67].
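The two visual checks can also be summarized numerically. The sketch below is an illustrative aid and not the procedure used to produce Fig. 4: it computes the points of a normal quantile-quantile plot and the residual spread within bins of the fitted values. The plotting positions `(i - 0.5) / n`, the number of bins, and the synthetic data are all assumptions made for this example.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical normal quantiles, sorted standardized residuals)."""
    r = np.asarray(residuals, dtype=float)
    z = np.sort((r - r.mean()) / r.std(ddof=1))
    probs = (np.arange(1, len(z) + 1) - 0.5) / len(z)
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theoretical, z

def residual_spread_by_fitted(fitted, residuals, n_bins=4):
    """Standard deviation of residuals within bins ordered by fitted value."""
    order = np.argsort(fitted)
    groups = np.array_split(np.asarray(residuals, dtype=float)[order], n_bins)
    return np.array([g.std(ddof=1) for g in groups])

rng = np.random.default_rng(1)
fitted = rng.uniform(0.0, 1.0, 400)      # synthetic fitted values
residuals = rng.normal(0.0, 0.1, 400)    # synthetic, homoskedastic residuals
theoretical, observed = qq_points(residuals)
spread = residual_spread_by_fitted(fitted, residuals)
```

Points that fall along the line `observed = theoretical` indicate approximately normal residuals, and similar values across the entries of `spread` are consistent with homoskedasticity.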