Evolution of response time and accuracy on online mastery practice assignments for introductory physics students

We have investigated the temporal patterns of algebra (N = 606) and calculus (N = 507) introductory physics students practicing multiple basic physics topics several times throughout the semester using an online mastery homework application called science, technology, engineering, and mathematics (STEM) fluency aimed at improving basic physics skills. For all skill practice categories, we observed an increase in measures of student accuracy, such as a decrease in the number of questions attempted to reach mastery, and a decrease in response time per question, resulting in an overall decrease in the total time spent on the assignments. The findings in this study show that there are several factors that impact a student's performance and evolution on the mastery assignments throughout the semester. For example, using linear mixed modeling, we report that students with lower math preparation for the physics class start with lower accuracy and slower response times on the mastery assignments than students with higher math preparation. However, by the end of the semester, the less prepared students reach similar performance levels to their more prepared classmates on the mastery assignments. This suggests that STEM fluency is a useful tool for instructors to implement to refresh students' basic math skills. Additionally, gender and procrastination habits impact the effectiveness and progression of the student's response time and accuracy on the STEM fluency assignments throughout the semester. We find that women initially answer more questions in the same amount of time as men before reaching mastery. As the semester progresses and students practice the categories more, this performance gap between men and women diminishes.
In addition, we find that students who procrastinate (those who wait until the final few hours to complete the assignments) spend more time on the assignments despite answering a similar number of questions compared to students who do not procrastinate. We also find that student mindset (growth vs fixed mindset) was not related to a student's progress on the online mastery assignments. Finally, we find that STEM fluency practice improves performance beyond the effects of other components of instruction, such as lectures, group-work recitations, and homework assignments.
DOI: 10.1103/PhysRevPhysEducRes.19.020111


I. INTRODUCTION
Nearly 55 years ago, Bloom published an article claiming 90% of students can master material an instructor teaches them if the material can be broken down into smaller units and fit to meet the student's needs [1]. He claimed that by tailoring the speed at which material is delivered to students, instructors allow students who are comfortable with the material to move on to other topics, while students who are struggling with a topic can spend more time learning the material until the topic is mastered. In the decades following, a number of studies have shown the effectiveness of mastery learning across a number of fields [2], and in the field of university-level physics education, more recent studies have also documented the benefits of mastery learning [3][4][5][6][7][8].
However, the general notion that mastery learning is an effective instructional strategy oversimplifies the vast complexity of the domain of science, technology, engineering, and mathematics (STEM) learning, the different methods used to demonstrate and investigate mastery, and the target student populations [9]. In other words, there are numerous potentially interacting factors that likely modulate the effectiveness of mastery learning. The natural heterogeneity of topics, methods, and populations in educational settings compels us here to focus on cases that are applicable to important and common STEM educational contexts. Specifically, in this paper, we investigate the extent to which a number of educationally relevant factors, described below, affect mastery learning for students using an online mastery learning application, called STEM fluency, designed for building fluency in basic skills and knowledge necessary for success in algebra-based and calculus-based university-level introductory physics for students in large research universities in the United States.
Rather than focusing on larger grain outcomes like exam scores, final grades, or retention, in this study, we will peer more into the "black box" of the process of learning and investigate the progression of mastery of several topics assigned multiple times throughout the semester. Therefore, we will study two longitudinally measured outcomes aimed at measuring fluency. The first outcome measures the student's response accuracy, which is a commonly studied indicator of mastery of the material. We expect to see improvements in the correctness of students' responses over the course of the semester if students are benefitting from using the STEM fluency application. The second outcome is a measure of speed, such as the time taken for completion of the assignment, and is less commonly studied. We expect to also see improvements in the time that it takes students to answer questions if students are benefitting from the STEM fluency assignments. Task completion time is well known to be a relevant measure of cognitive performance [10,11], especially for tests with time constraints [12]. Completion time has also been used to characterize or predict performance, such as rapid guessing on low-stakes tests [13], copying on homework assignments [14], or course performance for online assignments [15]. Further, response time is a common measure of cognitive load [16,17], which is important for our context since one of the goals of the online learning application, STEM fluency, is to reduce the cognitive load of basic skills, allowing students to solve more complex problems [8].
Task completion time has also been measured as an important factor in modeling learning. This idea has emerged, for example, from work by Newell and Rosenbloom (1980), who demonstrated a very general power-law decrease in task completion times as a function of trial number for a wide variety of tasks. Response times have especially been used to measure performance or model student mastery of a given "knowledge component" in intelligent tutoring systems [18,19] and in other learning contexts [20], including problem-solving [21]. One must always keep in mind, though, the possible confound of increasing speed due to retesting effects [22].
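For reference, a power-law decrease of completion time with practice is commonly written in the form sketched below. The symbols $T_n$ (completion time on trial $n$) and the fitted constants $a$ and $b$ are illustrative assumptions and are not notation used elsewhere in this paper. Taking logarithms shows why such data appear as a straight line on log-log axes.

```latex
T_n = a\, n^{-b}, \qquad b > 0
\quad\Longrightarrow\quad
\log_{10} T_n = \log_{10} a - b\,\log_{10} n
```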

A. Factors affecting mastery learning
Mikula and Heckler have demonstrated via pretesting and post-testing that STEM fluency was effective in improving student accuracy and speed for a variety of vector math skills, and they also investigated which kinds of feedback were most beneficial in this online mastery learning context [8]. In this study, we will expand to a wider variety of content topics and investigate the extent to which several other important factors affect the progression of mastery learning and performance.
First, we will study a design-level factor. To begin, it is important to note that since we collect data on student performance during practice, each mastery practice assignment for a given topic in this study is also an "in situ" assessment of accuracy and speed on that topic. Therefore, in order to first establish whether STEM fluency practice adds educational value, we will compare performance on mastery assignments between "fully trained" students and "partially trained" students for several skill categories. For a given category, fully trained students complete four mastery assignments, with the first assignment starting less than a week after the first lecture on the relevant topic (but before the first homework on the topic is due) and the remaining three assignments starting 1-10 weeks after the first. Partially trained students complete only the third and fourth mastery assignments for that category (as shown in Fig. 1). Therefore, investigating whether there are differences in accuracy and speed between conditions in the third mastery assignment will allow us to compare the effect of completing two STEM fluency assignments to no STEM fluency training, keeping in mind that both conditions may have gains due to other course components such as lectures and homework assignments. Further, this design will allow us to compare students practicing two vs four times to determine the extent to which more practice on a given topic provides significantly more benefit. These results will help us to better calibrate how many times students on average should practice a given topic.
We will also study the dependency of the evolution of performance on several educationally important student-level factors. The first factor is prior preparation. In this study, we will use ACT (or SAT equivalent) math score as a proxy for prior preparation, at least for math preparation. While we acknowledge some larger potential issues with demographic biases [23], the ACT math score is well documented to be predictive of introductory physics grades [23,24]. Since we are studying the STEM fluency application that is designed to improve the mastery and fluency of basic skills, we are especially interested in the extent to which there is an interaction between preparation and training. Specifically, do students with lower ACT scores have higher gains in performance than students with higher ACT scores? If so, this may support the intention of such practice to especially help underprepared students.
FIG. 1. Study 2 design showing the differences in training between fully and partially trained students. Practice 1 begins less than one week after the first lecture on the topic begins. The subsequent practices occur sequentially, each 1-10 weeks after the first.
The second student-level factor studied is gender. Studies have documented gender differences in homework scores with women tending to score higher than men [23][24][25]. More specifically, in a study by one of the authors that is set in the same institution, courses, and assignments as in this study, Simmons and Heckler have documented that women achieve higher scores (i.e., completion rates) on the STEM fluency mastery assignments [23]. Given these gender differences and the critical gender disparities in physics enrollments, it is important to investigate whether there are gender differences in the evolution of performance in these mastery assignments. For example, controlling for ACT score, do men and women start at the same level of accuracy and speed? Is there a difference in gains in accuracy and speed?
The third student-level factor is related to procrastination. Felker and Chen found that rewarding students with extra credit for submitting assignments early encourages low-performing students who typically submit assignments late to complete assignments earlier than they otherwise would have and spend more time studying for the class [26]. This could be an important factor when designing mastery assignments or when developing interventions related to students' procrastination. Additionally, in a recent study, we used submission time as a proxy for measuring student behavior [27]. For example, students who submitted assignments closer to the deadline earned lower grades in the course and completed fewer assignments than students who submitted assignments early. Further, there were differences in procrastination by gender. For this study, we are interested in determining if students who procrastinate have smaller gains in their performance on mastery assignments, as opposed to students who do not procrastinate as much. We might expect that students who procrastinate do not complete all the training sessions and spend less time on assignments than students who submit assignments early, thus impacting the evolution of the progress a student makes on the mastery assignments.
The fourth student-level factor involves the construct of mindset, which can be considered as a theory of "challenge-seeking and resilience" [28]. In this study, we are specifically interested in the claim that students with a growth (as opposed to fixed) mindset persist to overcome challenges and are more resilient to failure [29], though it is important to note that some researchers have found evidence that does not support this claim [30]. Applying this idea to the context of our study, we are interested in determining whether there is a relation between mindset (growth vs fixed) and performance on the assignments, especially for students who initially struggle with the assignments. For example, considering students who do poorly on the initial mastery assignments, do students with a growth mindset perform better on subsequent assignments compared to students with a fixed mindset? We will also consider how overall performance on all mastery assignments is related to the growth mindset, though correlations between mindset and measures of academic achievement, such as exam scores, have been found to be positive but very weak and with wide variation [31,32].
The final student-level factor involves the course grade. While the course grade may be an outcome rather than a predictive variable, for purposes of gaining further insight into how different students evolve in mastery training, we will also present descriptive statistics in the form of graphs for students with different final course grades. For example, do students with lower final grades also have initially lower performance on mastery assignments on basic skills, and do they evolve differently than students with high grades?
In summary, this study investigates the evolution of accuracy and speed in online mastery learning of several basic introductory physics skills on the timescale of weeks. We also investigate several educationally important factors that may affect mastery learning, including a design-level factor and several student-level factors. More specifically, our research questions are as follows: RQ1
This investigation comprises two studies that were conducted at a large public research university located in the United States Midwest during the first semester of a two-semester calculus-based and algebra-based introductory physics course. The studies included participants from two semesters: Study 1, conducted in the Autumn of 2019, investigated RQ1 and one factor in RQ2, while study 2, conducted in the Spring of 2021, investigated RQ1-RQ4 and employed quantitative statistical modeling. Participants were included if they consented to the use of their data in this study; consent was requested on the first assignment. As shown in Table I, about 52% of all students enrolled were included in study 1 and 58% of all students enrolled were included in study 2. This is well below full participation because approximately 70% of students completed the first assignment and about 80% of those students consented to participate (0.70 × 0.80 ≈ 56%). While this participation rate does introduce some potential selection effects in our data, the sample was somewhat representative of the population, as seen in Table I. Specifically, the mean grades were 0.2-0.4 grade points (0.2-0.4 standard deviations) higher, the mean ACT scores were 0.2-0.4 points (0.05-0.1 standard deviations) higher, and the female participation rate was 2-7 percentage points higher for the study sample compared to all students enrolled in the course.
The course structure included a lecture section with traditional lectures (lecturing most of the time, with occasional lecture demonstrations and occasional question-and-answer with students), a recitation section comprised of group work and/or quizzes, and a traditional lab section. The Autumn 2019 classes were in person, and, due to the pandemic, the Spring 2021 classes were virtual (including online Zoom lectures and recitation group work via Zoom rooms), but all other aspects of the course were identical. The graded course components included a set of nonexam components, such as weekly homework, participation, and lab grades (30% of total grade) and exam/quiz components (70%). The weekly STEM fluency assignments were included as a nonexam component and accounted for 3% of the student's grade.
During each semester, online STEM fluency units were assigned weekly. The first and last units were pretest and post-tests on topics that are not included in this study, and the remaining units were mastery assignments. Each mastery assignment consisted of 3-5 categories to complete. To complete or "master" a category, students were required to correctly answer three or four questions in a row in that category. In study 1, there was a mix of assignments, some requiring three questions in a row and others requiring four questions. We realized after study 1 that this additional variation in the number of questions in a row required a more complicated data analysis; therefore, in study 2, four questions were required for all assignments.
The students were given feedback on the correctness of their responses immediately after they submitted their answers to each question. If they answered a question incorrectly, they were given the option to try to answer two more times, after which they could choose to view the correct answer. However, if they incorrectly answered a question on the first try for a given category, the counter indicating the number of correct questions in a row for that category would be reset to zero and students would have to answer four more questions in a row to master that category. A student received full credit for each category they mastered and zero credit for categories not mastered, and the grade depended on the proportion of completed categories. An investigation of login and logout time stamps revealed that the vast majority of students completed each assignment in one sitting and typically took 10 to 30 min to complete. The weekly assignment window opened on Tuesdays at noon and closed on Sundays at 11:59 pm.
In this version of STEM fluency, the questions were all in multiple-choice format. The questions and responses were carefully and iteratively designed based on an evidence-based process described by Mikula and Heckler [8] involving feedback from student performance and prior research on student difficulties. While common distractors were often included as answer options, the questions were not designed to be especially difficult or "tricky." Rather, they were designed to be a straightforward and effective practice of specific skills with careful variation in a range of relevant practice dimensions such as representation format, physical context, magnitude, sign and direction of parameters, and which variables are known and unknown. On average, students took between 30 s and 2 min to answer each question.
Each week students practiced four or five different categories, with a total of about 12 categories per semester. We investigated only a portion of the categories covered throughout the semester; namely, for each study we selected a priori those that were well developed, assigned multiple times in the semester, and spanned a range of topics. We investigated five categories each in studies 1 and 2, with three of the categories overlapping between the studies. The lack of complete overlap occurred because of uncontrollable differences in assignment schedules between semesters, and we were also interested in increasing the number of categories studied. Brief descriptions of each practice category are provided in Table II and example items of each category are presented in Appendix B. The practice categories were constructed over a period of several semesters using an evidence-based process described by Mikula and Heckler [8].
The design of study 1 included all students assigned the practice categories indicated in Table II. All students in both courses were assigned the categories either 2, 3, or 4 times spaced throughout the semester. The timing of the assignments is detailed in the figures in the results (Sec. III A).
Study 2 included two conditions, as shown in Table III. All students received full training in some practice categories and partial training in other categories, depending on the condition. As described in Sec. I A, partial training consisted of two practice trials starting at least two weeks after the relevant unit and full training consisted of four practice trials beginning just after instruction starts. The last two practice trials for the full training coincided with the two partial training trials, as shown in Fig. 1. We assigned both full and partial training to students to help counterbalance total training time (across categories) and to allow for a within-student analysis of the effect of training. Students were assigned to one of the two conditions based on their instructor's lecture section. If an instructor taught two lecture sections, the students in the first lecture section received condition 1 categories while the students in the other lecture section received condition 2 categories. This was done to help control for instructor-level factors that may affect student improvement on the assignments.

B. Performance data
There is some freedom and ambiguity in choosing which performance parameters to use when considering useful outcome measures related to accuracy and speed. Choices include the raw performance data collected during the mastery assignments, consisting of the number of questions attempted ($Q_{\mathrm{att}}$) to achieve mastery, the number of questions answered correctly ($Q_{\mathrm{cor}}$), and the total completion time ($T$) [or the logarithm of the completion time, $\log_{10}(T)$] for each student and each category in each assignment. Other possible choices derived from these measurements include the proportion correct ($Q_{\mathrm{cor}}/Q_{\mathrm{att}}$) and the mean response time per question ($T/Q_{\mathrm{att}}$). The choice of variable to investigate certainly depends on the research questions of interest, and it can also depend on the design of the practice assignments. To provide a sense of how these
For an outcome measure related to accuracy, we chose different measures for study 1 and study 2. As mentioned earlier, study 1 had varying numbers of correct questions in a row required for mastery of different practice categories in different assignments. Therefore, for study 1, we used the proportion correct ($Q_{\mathrm{cor}}/Q_{\mathrm{att}}$) as the measure of accuracy. For study 2, the number of questions correct required in a row was the same for all practice categories and all assignments; therefore, we chose to use the total number of questions attempted $Q_{\mathrm{att}}$ to achieve mastery, which we also view as an informative measure when the study design allows for it. It also complements the results of study 1. There were several additional reasons for this choice. First, comparing $Q_{\mathrm{att}}$ and $Q_{\mathrm{cor}}$, we found that for any given category, $Q_{\mathrm{att}}$ was essentially empirically interchangeable with $Q_{\mathrm{cor}}$, because their correlations with each other were typically around r = 0.97 (see Tables VIII-X in Appendix A). Second, we chose $Q_{\mathrm{att}}$ because it is readily interpretable and is relevant to the assumed general student goal of minimizing the number of attempted questions needed to complete the assignment. Further, we chose $Q_{\mathrm{att}}$ instead of the proportion correct $Q_{\mathrm{cor}}/Q_{\mathrm{att}}$, because the latter can be ambiguous in terms of the number of questions answered in the mastery practice context. To understand this, consider that the goal of mastery practice is to achieve a set number, say four questions correct in a row. This could be achieved, for example, by answering 4 out of 6 questions correctly or 8 out of 12 questions correctly, given that only the last four questions were answered correctly in a row in both cases. In the second case, $Q_{\mathrm{att}}$ is twice as large, but the proportion correct is the same.
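The ambiguity of the proportion correct can be illustrated with a minimal sketch. The helper `mastery_counts` and the two response sequences are hypothetical and are not taken from the STEM fluency application. The sketch assumes each entry records only whether a question was answered correctly on the first try, and that an incorrect first try resets the streak, as described for the mastery counter.

```python
from typing import Sequence, Tuple


def mastery_counts(first_try_correct: Sequence[bool],
                   streak_needed: int = 4) -> Tuple[int, int]:
    """Return (q_att, q_cor) at the point where mastery is reached.

    Hypothetical helper: an incorrect first try resets the streak to zero.
    """
    streak = 0
    q_cor = 0
    for q_att, correct in enumerate(first_try_correct, start=1):
        if correct:
            streak += 1
            q_cor += 1
        else:
            streak = 0
        if streak == streak_needed:
            return q_att, q_cor
    raise ValueError("mastery was not reached in this sequence")


# Two illustrative sequences that both end with four correct in a row.
short_run = [False, False, True, True, True, True]
long_run = [True, False, True, False, True, False, True, False,
            True, True, True, True]

for name, run in (("short", short_run), ("long", long_run)):
    q_att, q_cor = mastery_counts(run)
    print(name, q_att, q_cor, round(q_cor / q_att, 3))
```

Both sequences give the same proportion correct (two thirds), while the number of questions attempted differs by a factor of two, which is the distinction that motivates using the number of attempted questions when the required streak is fixed.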
For an outcome measure related to speed, again we chose different and complementary measures for study 1 and study 2. For study 1, we chose the mean response time per question $T/Q_{\mathrm{att}}$ for a given category, where $T$ is the total time to complete the practice category, since the required number of questions correct in a row for a given category varied by assignment. For study 2, we chose the logarithm of the completion time $\log_{10}(T)$, which we viewed as the preferred measure to use when possible. Specifically, the completion time is readily interpretable, and we suppose that students are more likely to aim to minimize their total time spent on a STEM fluency assignment instead of minimizing how fast they can answer individual questions, which is related but not identical in a mastery assignment. We use $\log_{10}(T)$ for purposes of better data analysis. Specifically, the completion time distributions for each category were skewed right, with skewnesses ranging from 3 to 5, which is too large for the assumption of normally distributed residuals in our model fits to hold. The transformation to $\log_{10}(T)$ results in a more symmetric, normal-like distribution, resulting in better model fits. Note that we will keep in mind the fact that the logarithm is a nonlinear function, because this affects our interpretation of the model results, especially when considering interactions.
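The effect of the logarithmic transformation on skewness can be sketched as follows. The data here are synthetic: the completion times are assumed to be log-normally distributed purely for illustration, and the parameters (a median of 120 s, a log-scale spread of 1) are not taken from the study.

```python
import numpy as np


def sample_skewness(x) -> float:
    """Moment-based sample skewness (third standardized moment)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(np.mean(centered**3) / np.mean(centered**2) ** 1.5)


rng = np.random.default_rng(0)
# Synthetic completion times in seconds (assumed log-normal; not study data).
times = rng.lognormal(mean=np.log(120.0), sigma=1.0, size=5000)

skew_raw = sample_skewness(times)
skew_log = sample_skewness(np.log10(times))
print(f"skewness of T:        {skew_raw:.2f}")
print(f"skewness of log10(T): {skew_log:.2f}")
```

For right-skewed timing data of this kind, the raw times show a large positive skewness while the log-transformed times are close to symmetric, which is the property that makes the transformed variable better suited to models that assume normally distributed residuals.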
A common feature of timing data is a right-skewed distribution with a long tail. The very long tails in our context indicate that some students took a very long time to answer the questions (on the order of a thousand seconds per question, i.e., 15 to 20 min per question). One possible explanation for these long response times is that students sometimes leave the assignment open on their computer when they are not actively working on it but are instead engaged in other activities. To account for this tail, we trimmed the top 2.5% of our timing data from each practice category trial, removing that entire entry for that category trial for those students, including the questions attempted, questions correctly answered, their response time, and completion time [33]. For example, a student could have data removed for category 1 and practice trial 2 but still have data included for category 1 and practice trial 3, and for all trials of another category. An examination of the time distributions revealed that this cutoff effectively removed the extreme times. The time distributions also revealed that some students had average response times shorter than one would reasonably expect to need to read the entire question and determine an answer (typically less than 4 to 6 s per question). We believe a portion of these response times were due to students randomly guessing, and since these guesses are not an accurate portrayal of how long it takes a student to complete these assignments, we removed the bottom 2.5% of our timing data. We will assume that the data are "missing at random," and the linear mixed modeling used for our analysis is valid for such missing data [34]. This assumption seems reasonable for the upper time cutoff, but the lower time cutoff may introduce some bias, and this is a potential source of a small bias in the results of this study. However, to check, we reran the models in study 2, including the data below the lower cutoff, and found no qualitative differences (e.g., in significance in slopes and interactions) and very minor quantitative differences from the results reported here.
Procrastination was measured using submission time, which we define as the time between when the completed STEM fluency assignment was submitted by the student and the assignment deadline time. The smaller the submission time, the more the student procrastinated. For each student, an average submission time was calculated by adding the submission times for each assignment and dividing by the total number of assignments submitted. Note that other aspects of submission times were investigated in more detail by the authors in a previous study [27].
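The percentile trimming of the timing data described earlier in this subsection can be sketched as below. This is a minimal illustration under stated assumptions: the function name `trim_mask`, the use of linearly interpolated percentiles, and the synthetic times are all hypothetical, and the actual cutoff computation used in the study may differ.

```python
import numpy as np


def trim_mask(times, lower_pct: float = 2.5, upper_pct: float = 97.5):
    """Boolean mask that keeps entries between the two percentile cutoffs."""
    times = np.asarray(times, dtype=float)
    lo, hi = np.percentile(times, [lower_pct, upper_pct])
    return (times >= lo) & (times <= hi)


# Synthetic completion times (s) for one category in one practice trial,
# one value per hypothetical student.
times = np.arange(1.0, 201.0)
keep = trim_mask(times)
print(int(keep.sum()), "of", times.size, "entries kept")
```

Applying the mask to the full record for that category and trial (questions attempted, questions correct, and times together) mirrors the described practice of removing the entire entry rather than only the time value, while leaving the same student's other trials and categories untouched.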
To measure the mindset of our students, we administered a student's personal physics mindset beliefs survey. This mindset survey was only administered to the calculus students because a different motivational survey study was being conducted at the same time in the algebra course. In order to determine the predictive power of mindset, the survey was administered at the beginning of the first mastery assignment. The personal physics mindset scale contained four items pertaining to the student's beliefs about physics intelligence (see Appendix A), with two items from the Dweck mindset scale [35], and two additional items more aimed at understanding physics and problem-solving in physics. For our dataset, the scale had a level of reliability of Cronbach's α = 0.8, indicating that the scale was internally consistent. A high score on the personal physics mindset scale corresponds to a student with a fixed mindset, while a student with a low personal physics mindset score is considered to have a growth mindset.
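As a reminder of what the reliability coefficient measures, the standard formula for Cronbach's α can be sketched as below. The response matrix is invented for illustration (six hypothetical students, four items) and is unrelated to the survey data in this study.

```python
import numpy as np


def cronbach_alpha(scores) -> float:
    """Cronbach's alpha for a (respondents x items) array of item scores."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of sum scores
    return float(n_items / (n_items - 1)
                 * (1.0 - item_variances.sum() / total_variance))


# Hypothetical responses: 6 students x 4 items (not data from this study).
responses = np.array([
    [1, 2, 1, 2],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [4, 4, 5, 4],
    [5, 4, 5, 5],
    [2, 1, 2, 2],
])
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Values near 1 indicate that the items vary together across respondents, which is the sense in which the four-item scale is described as internally consistent.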

C. Models
In order to provide more precise quantitative answers to our research questions, in study 2, we employed linear mixed modeling to build and analyze statistical models of the data. At a broad level, these models are somewhat similar to ordinary multiple regression models in that we will estimate regression coefficients to test and quantify relationships, but because of the relatively complex structure of the data in study 2, linear mixed modeling is needed [34]. For example, not only are there within-student repeated measures (practice trials), but students and practice categories are cross-classified clustered data; namely, practice categories are clustered within students and vice versa. Linear mixed modeling also allows for missing data in the cases where some of the students missed some of the assignments or, as discussed earlier, where entries were trimmed out because of outlier response times on specific categories in an assignment. Students were modeled as a random effect to account for expected variation in student abilities. To account for variation in performance by practice category, we modeled practice categories as fixed factors since there were only five categories, which is too few to reliably model as a random factor. Below, we describe the models used in study 2. Note that the ordering and numbering of the models were chosen to improve the clarity and comparability of the data tables summarizing the results of all of the models.

Model 1
Model 1 investigates RQ3 and RQ4. This first model compares the effects of full versus partial training on the number of questions attempted during training. To compare full vs partial training, only data from practice trials 3 and 4 were used for this model.
In this model, $(Q_{\mathrm{att}})_{ijk}$ corresponds to the number of questions attempted for category $i$, student $j$, and practice trial $k$. The coefficient $\gamma_{00}$ represents the overall mean intercept of our model and $\gamma_{\mathrm{cat},i}$ represents the fixed-effect estimate for the average questions attempted for all students in category $i$. Note that "work" is the reference category. The variable $(\mathrm{Trial})_{k}$ is coded as 0 for practice trial 3 and 1 for practice trial 4 (see Fig. 1). The coefficient $\gamma_{\mathrm{trial4}}$ represents the mean difference in $Q_{\mathrm{att}}$ between practice trials 3 and 4. The variable $(\mathrm{Train})_{ij}$ is coded 0 for student $j$ receiving partial training on category $i$ and 1 for full training. Therefore, $\gamma_{\mathrm{train}}$ represents the mean effect of full vs partial training on $Q_{\mathrm{att}}$ in practice trials 3 and 4. Note that each student receives partial training in some categories and full training in others, depending on their training condition (Table III). For model 1, the coefficient $\gamma_{\mathrm{interaction}}$ represents the estimate for the practice trial-by-training interaction and indicates the extent to which full training affects the change in performance between practice trials 3 and 4 compared to partial training. The term $u_{0j}$ represents the random effect of student $j$, and $r_{ijk}$ is the random error associated with trial $k$, student $j$, and category $i$.
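The equation for model 1 does not appear in this text. A form consistent with the term definitions above would be the following; it is a reconstruction assembled from those definitions, so the exact arrangement of terms in the original equation is an assumption.

```latex
(Q_{\mathrm{att}})_{ijk} = \gamma_{00} + \gamma_{\mathrm{cat},i}
  + \gamma_{\mathrm{trial4}}\,(\mathrm{Trial})_{k}
  + \gamma_{\mathrm{train}}\,(\mathrm{Train})_{ij}
  + \gamma_{\mathrm{interaction}}\,(\mathrm{Trial})_{k}\times(\mathrm{Train})_{ij}
  + u_{0j} + r_{ijk}
```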

Model 2
The second model investigates RQ1 and RQ2. Specifically, this model tests how a student's prior preparation, as measured by ACT math score, is related to the number of questions attempted during practice and how $Q_{\mathrm{att}}$ evolves throughout the semester for the fully trained students receiving four practice trials for every category. To investigate evolution during the semester in model 2, we analyzed only the performance on the first and last practice trials for the fully trained students who were assigned four practice trials.
Several terms are the same as for model 1 described above. The variable $(\mathrm{Init\ Fin})_{k}$ indicates the first or last practice trial on practice category $i$ for student $j$. This variable is coded as either 0 for initial practice or 1 for final practice. Therefore, $\gamma_{\mathrm{init\ fin}}$ represents the change in $Q_{\mathrm{att}}$ from the initial to the final practice trial. The variable $(\mathrm{ACT})_{j}$ is the mean-centered ACT math score of student $j$. Therefore, $\gamma_{\mathrm{ACT}}$ represents an estimate of the extent to which $Q_{\mathrm{att}}$ depends on the ACT score. For model 2, the coefficient $\gamma_{\mathrm{interaction}}$ is an estimate of the $(\mathrm{Init\ Fin})_{k} \times (\mathrm{ACT})_{j}$ interaction, namely, how the change in performance depends on ACT score.
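Under the definitions above, model 2 can plausibly be written in the same form as model 1, with the trial and training terms replaced by the initial-versus-final indicator and the mean-centered ACT score. As with model 1, this is a sketch assembled from the term definitions in this section, not a verbatim copy of the published equation.

```latex
\begin{equation*}
(Q_{\mathrm{att}})_{ijk} = \gamma_{00} + \gamma_{\mathrm{cat},i}
  + \gamma_{\mathrm{init\ fin}}\,(\mathrm{Init\ Fin})_{k}
  + \gamma_{\mathrm{ACT}}\,(\mathrm{ACT})_{j}
  + \gamma_{\mathrm{interaction}}\,(\mathrm{Init\ Fin})_{k}\times(\mathrm{ACT})_{j}
  + u_{0j} + r_{ijk}
\end{equation*}
```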

Model 3
To further study RQ2, model 3 investigates the extent to which student performance (i.e., $Q_{\mathrm{att}}$) and the evolution of student performance on mastery assignments are related to gender. Gender is considered here only as a binary term (male or female), since this is how it was recorded in the university database from which it was obtained for this study.
Several terms are the same as for model 2 described above. The variable $(\mathrm{Gender})_{j}$ is coded as 0 for male and 1 for female for student $j$. For model 3, the coefficient $\gamma_{\mathrm{interaction}}$ is an estimate of the $(\mathrm{Init\ Fin})_{k} \times (\mathrm{Gender})_{j}$ interaction, namely, how the change in initial-to-final performance depends on gender.

Model 4
Model 4 investigates the extent to which student performance and the evolution of student performance on mastery assignments are related to procrastination, as measured by submission time (defined in the previous subsection). This model also investigates RQ2.
Several terms are the same as for model 2 described above. The variable $(\mathrm{Sub\ Time})_{ij}$ is the amount of time, in hours, before the deadline that the assignment with category $i$ was submitted by student $j$. Therefore, low submission times mean that the student procrastinated, since the assignment was submitted only shortly before the deadline. For model 4, the coefficient $\gamma_{\mathrm{interaction}}$ is an estimate of the $(\mathrm{Init\ Fin})_{k} \times (\mathrm{Sub\ Time})_{ij}$ interaction, namely, how the change in performance depends on submission time.

Model 5
Model 5 tested how a student's personal physics mindset impacted the student's performance and evolution of performance, which is relevant for RQ2. Because we are specifically interested in determining if the mindset is predictive for students who struggle, for model 5, we limit the population to students who have mean scores above the median proportion correct for the initial practice trials because a student who is struggling on the assignments is less accurate and will answer more questions.
Several terms are the same as for model 2 described above. The variable $(\mathrm{Mindset})_{j}$ is the physics mindset score for student $j$. For model 5, the coefficient $\gamma_{\mathrm{interaction}}$ is an estimate of the $(\mathrm{Init\ Fin})_{k} \times (\mathrm{Mindset})_{j}$ interaction, namely, how the change in performance depends on physics mindset.

Models 6-10
For models 6-10, the equations are identical to models 1-5 except that the outcome variable $(Q_{\mathrm{att}})_{ijk}$ is replaced with the outcome variable $\log_{10}(T)_{ijk}$, corresponding to the logarithm base 10 of the total time that student $j$ spends on category $i$ in practice trial $k$.

A. Study 1-Trends in the evolution of performance
We begin by presenting graphical representations of the evolution of mean response time per question and proportion correct for several categories practiced 3-4 times spaced throughout the semester (see Fig. 2). Figure 2 prompts several observations. First, there are notable decreases in time per question and increases in proportion correct for each category. Second, there is significant variation in response time and accuracy between categories, ranging over an order of magnitude in time and a factor of 2 in accuracy. Third, for the calculus-based course, the accuracies show signs of plateauing for all but the lowest-accuracy category. Likely related to this observation is that these same categories also show signs of reaching an asymptote in the decrease of time per question. Essentially, students are maintaining the same accuracy and time per question over multiple practices. For the algebra-based course, however, there is far less indication of plateauing in accuracy for any category. Rather, there are signs of continued substantial improvement in accuracy, perhaps indicating the benefit of more practice. The same trend appears for the time per question: there are no signs of reaching an asymptote in the decrease in time per question. These observations are relevant to RQ4 regarding the benefits of continued practice, which appear to depend on the initial level of performance and the population. Finally, the calculus-based students tend to be a little faster and more accurate than the algebra-based students. This difference is at least qualitatively consistent with the observation of plateauing for the former but not the latter.
To gain more insight into subpopulations of students, Figs. 3 and 4 display the evolution from first to last practice trial grouped by students receiving different final grades. There are several notable features of these graphs. First, all groups are improving on average, and for some practice categories there is some indication that performance gaps are narrowing. Overall, however, it appears that students receiving an A grade began and ended as the fastest and most accurate students, and those receiving a D grade began and ended as the slowest and least accurate. Note that the final grade is a post hoc grouping variable rather than a predictive one. To get a better sense of whether initially less-prepared students evolve differently than better-prepared students, in study 2 we investigate ACT math score as a predictive covariate using model 2.

B. Study 2-Factors predicting evolution: Trends and quantitative models
To determine the potential predictive power of the various factors discussed in the introduction on the evolution of performance, Tables IV-VII present the results of models 1-10, and Figs. 5-8 provide visual information that gives more insight into the model results. Overall, and consistent with the results of study 1, these results show clear decreases in the number of questions attempted to achieve mastery, $Q_{\mathrm{att}}$, and in completion time, $T$, between the first and last practice trials for both courses. Below, we discuss the results for each factor.

Partial vs full practice
The results from model 1 indicate that, compared to partial training, the full training condition had a small but statistically significant beneficial impact on the number of questions attempted and the total time to mastery, as measured by $\gamma_{\mathrm{train}}$ in Tables IV-VII (see also Figs. 5 and 6). Recall that $\gamma_{\mathrm{train}}$ represents the estimated mean difference in performance between conditions in trials 3 and 4, where students in the full training condition had two training practices before trials 3 and 4, and students in the partial training condition did not have any training practice before trials 3 and 4. Specifically, on average, the students in the full training condition completed trials 3 and 4 with 2.69 fewer questions attempted and about 103 s faster (per trial) in the algebra-based course, and with 1.37 fewer questions attempted and about 140 s faster in the calculus-based course. To get a sense of effect size, consider that the residual standard deviation $\sigma_{r}$ for $Q_{\mathrm{att}}$ is about 17 and 9 for the algebra-based and calculus-based courses, respectively. In terms of speed, full training results in roughly 5-10 s less time per question compared to partial training. Across the four tables, only one interaction estimate ($\gamma_{\mathrm{interaction}}$) was significant, indicating that the evolution from practice trial 3 to 4 was the same for partial and full training, with the exception of the total time for the calculus-based students, where the partially trained students sped up a little faster than the fully trained students.
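As a rough illustration of the effect-size comparison above, the following sketch divides the quoted training effects on the number of questions attempted by the quoted residual standard deviations. The inputs are the rounded values from the text rather than the table entries, and the dictionary and function names are hypothetical.

```python
# Back-of-envelope effect sizes for the full-vs-partial training contrast.
# Inputs are the rounded values quoted in the text, not the table entries.
GAMMA_TRAIN_Q = {"algebra": 2.69, "calculus": 1.37}  # fewer questions attempted
SIGMA_R_Q = {"algebra": 17.0, "calculus": 9.0}       # residual SD of Q_att


def standardized_effect(course: str) -> float:
    """Training effect on Q_att in units of the residual standard deviation."""
    return GAMMA_TRAIN_Q[course] / SIGMA_R_Q[course]


for course in ("algebra", "calculus"):
    print(f"{course}: {standardized_effect(course):.2f} residual SD")
```

On these rounded inputs, the training effect comes to roughly 0.15 residual standard deviations in both courses, in line with the description of the effect as small.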
In summary, students who practiced directly after instruction and the week after became faster and more accurate than the students who only practiced several weeks after instruction. The fact that the fully trained students were faster and more accurate on the third practice trial indicates that STEM fluency practice benefitted students above and beyond benefits from other components of the course, such as the lectures, group-work recitations, and homework assignments. We can see this effect by looking at the third practice trial and comparing students without any STEM fluency training before the third practice trial (i.e., the "partially trained" students) to the "fully trained" students, who completed two STEM fluency assignments before the third practice trial. For many of the practice categories, the partially trained students never caught up to the fully trained students. Figures 5 and 6 display the mean total times and $Q_{\mathrm{att}}$ by category for each practice trial for these two groups; they graphically confirm the model 1 findings and can provide deeper insights into the results. For example, for the rotational unit conversion practice category, students in the full training condition were faster and had lower $Q_{\mathrm{att}}$ than those in the partial training condition in practice trial 3 for both courses. Differences between conditions on practice trial 3 vary by category, though it is not clear why there is such variation.

TABLE IV. Study 2 model coefficients for algebra physics classes with the number of questions attempted ($Q_{\mathrm{att}}$) for a given practice category as the outcome variable. Values in parentheses are standard errors. Note the ACT scores are mean-centered. Bolded numbers are significant at the p < 0.01 level (and often significantly lower). An * denotes the cell is significant at the p < 0.05 level.

ACT math score
There are three main results from models 2 and 7, which investigate how ACT score might be related to the evolution of accuracy and speed. The first, perhaps as expected, is that the number of attempted questions $Q_{\mathrm{att}}$ significantly decreases with increasing ACT score, as estimated by $\gamma_{\mathrm{ACT}}$. It is important to keep in mind that model 2 uses the mean-centered ACT score; thus a score of zero is at the mean (see Table I), and scores below the mean change the sign of the effect. Therefore, for algebra-based students, for example, the estimate of $\gamma_{\mathrm{ACT}}$ has a magnitude of 1.4, meaning that for every ACT point above the mean, $Q_{\mathrm{att}}$ decreases by 1.4 questions, and for every ACT point below the mean, $Q_{\mathrm{att}}$ increases by 1.4 questions. Again, to get a sense of effect size, every ACT point changes $Q_{\mathrm{att}}$ by about 0.1 residual standard deviation.

The results of model 7 indicate that the completion time also significantly decreases with increasing ACT score. For example, for algebra-based students with a mean ACT score, the time to complete a category is $T = 724$ s on average. But for a student with an ACT score one point above the mean, $T = 692$ s, or 32 s faster. The results for $Q_{\mathrm{att}}$ and $T$ could naturally be related. One hint toward this possibility is the fact that while ACT score is moderately to weakly correlated with both (r ≈ 0.1-0.3) for any practice category, it is not significantly correlated with the time per question (r < 0.10); see Tables VIII to X in Appendix A.

TABLE V. Study 2 model coefficients for calculus physics classes with the number of questions attempted ($Q_{\mathrm{att}}$) for a given practice category as the outcome variable. Values in parentheses are standard errors. Note the ACT scores are mean-centered. For model 5, only students scoring below the median on the initial trials are included. Bolded numbers are significant at the p < 0.01 level (and often significantly lower). An * denotes the cell is significant at the p < 0.05 level.

Finally, there is a significant ACT-by-practice trial interaction for $Q_{\mathrm{att}}$, as indicated by the estimate of the interaction term in model 2. For example, following Table IV, consider that $Q_{\mathrm{att}}$ decreases by 6.2 questions from the initial to the final practice trial for students with the mean ACT score in the algebra-based course. However, the interaction implies that this decrease is moderated by the ACT math score, such that for students with an ACT score one point above the mean the decrease narrows to 5.4 questions, and for students one point below the mean the decrease widens to 7.0. In other words, students with lower ACT scores improve more in terms of questions attempted than students with high ACT scores. Roughly the same effects, with similar magnitudes, on $Q_{\mathrm{att}}$ and completion time are found for calculus-based students. Figure 7 graphically displays the interaction effect on $Q_{\mathrm{att}}$ for both courses.
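The arithmetic in this example can be made explicit. Writing $a$ for the mean-centered ACT math score and $D(a)$ for the initial-to-final decrease in the number of questions attempted in the algebra-based course, the quoted values are consistent with a linear moderation of about 0.8 questions per ACT point. The value 0.8 is inferred here from the rounded numbers in the text and is not read from Table IV; $D$ and $a$ are notation introduced only for this illustration.

```latex
\begin{align*}
D(a)  &\approx 6.2 - 0.8\,a \\
D(+1) &\approx 6.2 - 0.8 = 5.4 \\
D(-1) &\approx 6.2 + 0.8 = 7.0
\end{align*}
```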
For both courses, there was no interaction in terms of the logarithm of completion time. However, as mentioned earlier, there is an important point to keep in mind when interpreting these results, owing to the nonlinearity of the logarithmic function: while there is no interaction in logarithmic time, there could still effectively be an interaction in linear time, so caution must be used in interpreting the result of the model. For example, consider the estimates for model 7 for algebra-based students in Table VI. Students scoring one point above or one point below the mean ACT completed a category on average in about 692 or 758 s, respectively, a difference of 66 s. In the final practice trial, those times become 309 and 338 s, respectively, a difference of 29 s. In other words, logarithmically there was no interaction (no closing of the gap), but linearly the gap was cut roughly in half, a reduction of 37 s, thus indicating some level of interaction between ACT score and improvement in completion time, with the time gap closing between high- and low-ACT students.
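The distinction between logarithmic and linear time can be checked directly from the quoted times. The sketch below uses only the four rounded completion times given in the text for algebra-based students; the dictionary layout and function names are hypothetical choices for this illustration.

```python
import math

# Mean completion times in seconds, as quoted in the text for algebra-based
# students scoring one ACT point below or above the mean (rounded values).
TIMES = {
    "initial": {"act_minus_1": 758.0, "act_plus_1": 692.0},
    "final": {"act_minus_1": 338.0, "act_plus_1": 309.0},
}


def linear_gap(trial: str) -> float:
    """Difference in seconds between the lower-ACT and higher-ACT students."""
    t = TIMES[trial]
    return t["act_minus_1"] - t["act_plus_1"]


def log_gap(trial: str) -> float:
    """Same gap on the log10 scale, i.e., the scale of the model outcome."""
    t = TIMES[trial]
    return math.log10(t["act_minus_1"] / t["act_plus_1"])


for trial in ("initial", "final"):
    print(f"{trial}: linear gap = {linear_gap(trial):.0f} s, "
          f"log10 gap = {log_gap(trial):.3f}")
```

The log10 gap is essentially unchanged between the two trials (about 0.04 in both), which is what a null interaction on the log scale means, while the gap in seconds falls from 66 to 29 because both groups' times shrink by about the same factor.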

Gender
The factor of gender, as reported in the university database, was also found to be significant, even accounting for ACT math score, as estimated by $\gamma_{\mathrm{female}}$ in model 3. Specifically, the results of model 3 in Tables IV and V indicate that, on average for the first practice trial, the number of attempted questions $Q_{\mathrm{att}}$ for a practice category is 4.8 questions higher for women than for men in the algebra-based course, and 2.7 questions higher for women in the calculus-based course. Given that $Q_{\mathrm{att}}$ is around 20 questions in the first practice trial, this indicates a difference of about 15%-25% in questions attempted between genders. However, the results of model 8 indicate that there were no such significant differences in completion time. This implies that women tend to answer the questions slightly more rapidly.
There is also a significant gender-by-practice trial interaction for the algebra-based students in model 3. An inspection of Table IV indicates that $Q_{\mathrm{att}}$ decreased by 4.4 questions between practice trials 1 and 4 for male students; for female students, however, $Q_{\mathrm{att}}$ decreased by 7.9 questions. In short, though female students began with a significantly higher $Q_{\mathrm{att}}$ than males, this gap essentially reduced to zero by the fourth practice trial. This interaction was not significant for calculus-based students, but the point estimate for the interaction trended in a similar way. There was no interaction effect on completion time. Figure 8 graphically displays the interaction effect for both courses.

Submission time
The results of models 4 and 9 indicate that procrastination, as measured by submission time, does not predict any differences in the number of attempted questions $Q_{\mathrm{att}}$. Specifically, in Tables IV and V, for model 4, the estimates for $\gamma_{\mathrm{SubT}}$ and $\gamma_{\mathrm{interaction}}$ are not significantly different from zero. However, procrastination does predict differences in how students evolve during their practice in terms of completion time, even accounting for ACT scores. Recall that submission time is measured in hours and indicates the amount of time before the deadline that the assignment was submitted. While Table VI indicates that there was no relation between submission time and completion time (i.e., the time it takes to complete the assignment) for the first practice trial for students in the algebra-based course, there was a submission time-by-trial interaction predicting completion time. Specifically, on average, in their first practice trial, all students completed one practice category in about 692 s regardless of submission time. Students with the mean ACT score who procrastinated and submitted near the deadline decreased their completion time to about 389 s on average per practice category by the last practice trial, but students with the mean ACT score who submitted their assignments 72 h (on average) before the deadline decreased their completion time to about 279 s per category. That 110 s difference in final completion time between procrastinators and nonprocrastinators is substantial considering the original completion time. In short, students in the algebra-based course who procrastinate improved their completion times significantly less than students who do not procrastinate, even controlling for ACT scores.
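To put the algebra-based numbers above on a common footing, the sketch below computes the fractional reduction in completion time for each group and the per-hour slope on the log10 scale that the quoted times would imply. It assumes that "near the deadline" corresponds to a submission time of about 0 h and that first-trial times do not depend on submission time, as the text states; the implied slope is a back-of-envelope figure from rounded values, not the interaction coefficient reported in Table VI. All names are hypothetical.

```python
import math

# Rounded completion times (s) quoted in the text for algebra-based students
# with the mean ACT score.
T_FIRST = 692.0               # first practice trial, any submission time
T_LAST_NEAR_DEADLINE = 389.0  # last trial, submitted near the deadline (~0 h)
T_LAST_72H_EARLY = 279.0      # last trial, submitted ~72 h before the deadline
HOURS_EARLY = 72.0


def fractional_reduction(t_first: float, t_last: float) -> float:
    """Fraction by which completion time dropped from first to last trial."""
    return 1.0 - t_last / t_first


# Implied change in log10(T) on the last trial per hour of earlier submission,
# assuming a submission time of 0 h for the near-deadline group.
implied_slope = math.log10(T_LAST_72H_EARLY / T_LAST_NEAR_DEADLINE) / HOURS_EARLY

print(f"procrastinators:    {fractional_reduction(T_FIRST, T_LAST_NEAR_DEADLINE):.0%}")
print(f"nonprocrastinators: {fractional_reduction(T_FIRST, T_LAST_72H_EARLY):.0%}")
print(f"implied log10 slope per hour: {implied_slope:.4f}")
```

On these rounded inputs, students submitting near the deadline cut their completion time by a little under half, while those submitting about 72 h early cut it by about 60%.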
For the calculus-based students, Table VII indicates that submission time does predict an overall significant difference in the logarithm of completion time. For example, for the first practice trial, students with the mean ACT score submitting near the deadline completed a practice category in about 589 s on average, but students with a mean ACT score who submitted their assignments 72 h (on average) before the deadline completed a category in about 510 s. In other words, students in the calculus-based course who procrastinate complete each category about 79 s slower than students who do not procrastinate, even controlling for ACT scores. As discussed earlier with the ACT scores, while the interaction term for the logarithm of time is not significant, there is still a reduction in the completion time gap, to 48 s on the last practice trial, between procrastinators (355 s) and nonprocrastinators (307 s).

Mindset
The results of models 5 and 10 indicate that the mindset scores do not predict performance or evolution of performance, as estimated by $\gamma_{\mathrm{mindset}}$. As stated earlier, the analysis of models 5 and 10 only includes those students scoring above the median proportion correct on the initial practice trials.

IV. DISCUSSION AND CONCLUSION
In a series of studies, we have characterized the evolution of accuracy and speed of students responding to questions on online mastery-based assignments, repeated throughout the semester, covering basic introductory physics skills. To summarize, let us discuss how our results address our research questions, starting with RQ1 and RQ4. Following expected patterns of accuracy and response time learning curves typically found in studies of learning [18], both algebra and calculus students on average systematically improved their accuracy and decreased their response time per question on a range of physics topics and categories over multiple repeated spaced practices throughout the semester. While calculus students were slightly faster and more accurate than the algebra students, the STEM fluency assignments were still effective and beneficial to both classroom populations in improving student fluency and performance on the assignments. We noticed differences in the shapes of the accuracy and response time curves in study 1: some categories reached saturation (i.e., the students' speed and accuracy plateaued after two trials), while other practice categories, like work, did not reach saturation even after full training. On average, this saturation happened mainly in the calculus-based population, suggesting that for some of the categories studied, one could decrease the number of practice trials without sacrificing gains in performance.
Considering RQ2, we found that several student-level factors were associated with differences in initial performance and evolution. Perhaps most notably, while students with low ACT math scores were initially less accurate and slower than students with high ACT scores, this gap decreased by the final practice trial. This suggests that STEM fluency mastery assignments are a useful tool for instructors to help students refresh important basic skills, and that they help students with lower levels of preparation to catch up.
Regarding differences between genders, women initially spend the same time as men on assignments but answer more questions to achieve mastery, even controlling for ACT scores. By the final practice trial, both men and women increased in accuracy, but for algebra-based students the gap closed: women improved more than men, such that both ended up with similar accuracies. For the calculus-based students, there was no significant decrease in the gap. For both courses, men and women decreased the time they spend on the assignments by about the same amount.
Combining the results from this study (that the performance gap between men and women diminishes after spaced practice) with previous work (that women complete more STEM fluency assignments [23] and procrastinate less on the assignments than men [27]), all controlling for ACT score, leads us to wonder why we see a distinct difference in study habits and evolution of performance between women and men, namely, that women initially perform more poorly but appear to work harder and catch up. This suggests a potentially interesting line of inquiry for future work: developing a coherent framework to explain these differences between the two groups.
We were somewhat surprised to find that student mindset is not predictive of the number of questions attempted or completion time for students who struggle initially with the assignments (RQ2). Despite mixed results reported on mindset [28][29][30], we were expecting that mindset would predict performance on mastery assignments. Specifically, we were expecting to see that students who initially had relatively low accuracy on the mastery assignments but had a growth mindset would improve more than students with a fixed mindset because they would be more resilient to failure, but this was not the case. Ours was a superficial investigation of the factor of mindset, and before we can draw any firm conclusions about whether or not mindset is important in this context, further research is needed to measure this construct more carefully (beyond a four-item scale) and to devise a more careful theoretical argument identifying which behaviors it might influence.
In terms of submission time, controlling for ACT scores, students who did not procrastinate reduced their assignment completion time more than students who did procrastinate. This is true even though the number of questions attempted to achieve mastery did not depend on procrastination. In other words, the nonprocrastinators sped up, or became more fluent, more than the procrastinators did. We hypothesize that this could occur because the nonprocrastinators are more committed to learning, resulting in their performance improving. Another possibility is that the procrastinators have put themselves in a stressful environment by submitting the assignments late, which results in a lack of improvement in performance. Future work could look at the individual question level for heavy procrastinators to see how the response time per question evolves in the final hours before the deadline, and whether heavy procrastinators exhibit rapid-guessing behavior, meaning that they are simply responding rapidly to questions before time expires.
Finally, models 1 and 6 in study 2 provided evidence that, on average across several practice categories, STEM fluency practice improves both accuracy and speed beyond any gains accrued from traditional lectures (RQ3). Figures 5 and 6 reveal that this added benefit depends on the category and the course, though it is not immediately evident why there is such variation. These results, along with the overall STEM fluency learning curves that match general expectations from past learning research, help to further validate the STEM fluency materials and design [8] as a useful learning tool, though naturally the effectiveness is modulated by numerous factors such as those studied here. Example factors of interest for future studies include the timing of spaced and interleaved practice, which has been studied in numerous contexts and could be applied to mastery learning of basic skills in an introductory physics context [35][36][37][38][39].
There are a few limitations to keep in mind when interpreting this work. First, there may be selection effects in our results, since about half of our students consented to participate in our study, and our sample is skewed toward higher mean grades (0.2-0.4 standard deviations) and ACT scores (less than 0.1 standard deviation). Therefore, the sample slightly underrepresents low-performing students. Further, study 2 took place during the COVID-19 pandemic, which could have impacted a variety of factors in our study. For example, it could have impacted the motivation of students completing these assignments, though our observations indicate similar overall trends in improvement in accuracy and speed in both studies. Additionally, the studies here concerned brief (15-30 min) online STEM fluency assignments that are designed to be low-difficulty practice sessions. Because these assignments are distinct from traditional "back-of-the-textbook" homework questions, our ability to generalize this work to other assignments is limited. Future work could look at how practicing with STEM fluency assignments might impact students' exam performance on problems that cover topics practiced in the STEM fluency assignments. Future work should also investigate whether STEM fluency practice helps students on homework topics similar to the topics covered in the STEM fluency assignments. This would allow us to discuss the impact of STEM fluency assignments on other important components of the course.

FIG. 2. Study 1 data. (a) Algebra students' mean response time for multiple essential skills categories at multiple practice times. (b) Algebra students' mean accuracy for multiple essential skills categories at multiple practice times. (c) Calculus students' mean response time for multiple essential skills categories at multiple practice times. (d) Calculus students' mean accuracy for multiple essential skills categories at multiple practice times. [Note: lines are drawn only to help pair data points from the same category. Some categories included practice sessions in between the initial and final practice.]

FIG. 3. Study 1. (a) Algebra-based students' average response time for each category. The average response times are measured the first and last time students saw the categories and are grouped by the grade the students earned in the course. (b) Algebra-based students' mean accuracy for each category. The mean accuracy is measured the first and last time students saw the categories and is grouped by the grade the students earned in the course.
students, for example, the estimate of $\gamma_{\mathrm{ACT}}$ has a magnitude of 1.4, meaning that for every ACT point above the mean, $Q_{\mathrm{att}}$ decreases by 1.4 questions, and for every ACT point below the mean, $Q_{\mathrm{att}}$ increases by 1.4 questions. Again, to get a sense of effect size, every ACT point changes $Q_{\mathrm{att}}$ by about 0.1 residual standard deviation.

FIG. 6. Study 2 (a) total completion time and (b) questions attempted for each training group evaluated at the same time for the calculus physics class.

FIG. 8. Study 2 estimated marginal mean number of questions attempted at the initial and final practice trials across all practice categories, split by gender, for (a) algebra-based students and (b) calculus-based students. Error bars are 1 SE. The lines are drawn only to help pair data points from the same category. Note that some categories included practice sessions in between the initial and final practice.

FIG. 9. Image to accompany work done by a constant force sample question.

FIG. 11. Image to accompany the vector components sample question.

TABLE I. Data on all enrolled students and study participants.

TABLE II. Description of the categories students practiced throughout the semester and the semesters during which we collected student performance data on each category (sample questions for each category are found in Appendix C).
Surviving table entries:
Given two arrows representing the magnitude and direction of forces on an object, choose the correct expression of the net force in the x or y direction.
Rotational kinematics; Rot Kin; 1; Using rotational kinematics equations to solve for α, θ, ω, or t.

TABLE III. Study 2 design indicating the full and partial training practice categories for each condition.

TABLE VI. Study 2 model coefficients for algebra physics classes with log base 10 of total completion time for a given practice category as the outcome variable. Values in parentheses are standard errors. Note the ACT scores are mean-centered. Bolded numbers are significant at the p < 0.01 level (and often significantly lower). An * denotes the cell is significant at the p < 0.05 level.

TABLE VII. Study 2 model coefficients for calculus physics classes with log base 10 of completion time for a given practice category as the outcome variable. Values in parentheses are standard errors. Note the ACT scores are mean-centered. For model 10, only students scoring below the median on the initial trials are included. Bolded numbers are significant at the p < 0.01 level (and often significantly lower). An * denotes the cell is significant at the p < 0.05 level.