Group Dynamics in Inquiry-based Labs: Gender Inequities and the Efficacy of Partner Agreements

Recent studies provide evidence that social constructivist pedagogical methods such as active learning, interactive engagement, and inquiry-based learning, while pedagogically more effective, can enable inequities in the classroom. By conducting a quantitative empirical examination of gender-inequitable group dynamics in two inquiry-based physics labs, we extend results of previous work. Using a survey on group work preferences and video recordings of lab sessions, we find patterns of gendered role-taking similar to those noted in prior studies. These results are not reducible to differences in students' preferences. We find that an intervention which employed partner agreement forms, with the goal of reducing inequities, had a positive impact on students' engagement with equipment during a first-semester lab course. Our work will inform implementation of more effective interventions in the future and emphasizes challenges faced by instructors who are dedicated to both research-based pedagogical practices and efforts to promote diversity, equity, and inclusion in their classrooms.


I. INTRODUCTION
Group work is a common feature of many university physics lecture and lab courses. Research-based pedagogical practices, like active learning, interactive engagement, and inquiry-based learning, employ varying degrees of group work. Beyond its role as a pedagogical tool, group work has been identified as a learning goal itself. The American Association of Physics Teachers has designated it a component of scientific collaboration and a learning goal for lab courses in particular [1]. However, despite its potential benefits, group work introduces complexities into a course which can produce unintended effects.
Pedagogical practices which incorporate group work, such as active learning, interactive engagement, and inquiry-based learning, enjoy broad support within the physics education community. This support is based on a breadth of empirical studies that show them to be pedagogically more effective than traditional lecture or lab courses [2,3]. However, while these practices may be best for learning overall, there is also evidence that they can enable certain inequities. For example, Quinn et al. [4] observed that incorporating inquiry-based instructional practices into laboratory courses, compared directly with traditional labs, can result in an increase of gendered role-taking. Other studies have found this gendered division of labor to be to women's disadvantage [5-7]. In the context of lecture courses, Gordon et al. [8] found that a flipped classroom had a negative impact on learning and achievement for low-income, systemically non-dominant race/ethnicity, and first-generation students when compared with an interactive lecture. These recent works corroborate the results of other studies, which show more generally that, absent proactive efforts by an instructor, systemically non-dominant groups engage less in active-learning components of lecture courses [10-15]. Collectively, these empirical observations suggest that many popular research-based pedagogical methods can be especially susceptible to unintended inequitable outcomes. As we will review in Section II, there are also sound theoretical reasons to expect this. This situation presents instructors with a serious tension: pedagogical methods which research suggests are best for overall student learning can be worse for diversity, equity, and inclusion.
Of course, this is by no means inevitable. In some contexts, research-based remediation strategies have been developed and successfully implemented [11,12]. It is an important goal to build upon this work, especially to include methods designed specifically for inquiry-based lab courses and lab courses more generally.
Given the integral role of group work in these pedagogical methods, it should not be surprising that one major source of these inequities lies in problematic group dynamics. The value of group work, its potential for inequities, and frameworks for promoting fair and effective group work have been important topics in physics education research for several decades, dating back to a multiyear study at the University of Minnesota [16,17]. More recent research has explored this issue in depth in the context of laboratory courses, finding cross-cultural evidence for gendered division of labor [18], documenting the tendency of women to adopt secretary or project manager roles [7,19], and assessing how the frequency of inter-group interactions is affected by lab design and group gender composition [20]. However, these studies are limited to three institutions, and it remains unclear what strategies can be used to resolve the inequities they identify.
Group dynamics in lab courses have been the subject of study beyond university physics courses. For example, important studies in university STEM courses documenting similar dynamics to the above have been conducted in engineering [12,21], chemistry [22], and biology classes [23,24]. In particular, Donovan et al. [15] studied different methods of group formation in a college biology class. These dynamics may contribute to higher attrition rates of women in STEM, given the role of college in the leaky pipeline [25]. Systematic investigations of inequitable group dynamics in a pre-college setting are less common. A study by Greenfield [26] found that girls were just as likely to manipulate equipment and record data as boys from elementary to early secondary school. Meanwhile, a study by Jovanovic et al. found boys manipulated equipment more than girls in grades 5-8 [5]. Regardless, a broad study by Burkam and Smerdon [27] emphasizes the importance of equitable equipment use for supporting women's performance in STEM.
In this work, we present a quantitative empirical examination of inequitable group dynamics and of a possible remediation strategy. The context of our work is two introductory physics lab courses for non-majors at a major public university. Importantly, our work is at the intersection of the aforementioned results [4,8], given the recent redesign of the courses at our institution, which implements much of the inquiry-based framework advocated for in other studies [28,29]. The size and diversity of these lab courses make them useful testing grounds. The account in The Inequality Machine: How College Divides Us by Paul Tough [30] suggests the observed student body may be of unique interest given its diversity, including dimensions such as race, socioeconomic status, and major, which are explicitly mentioned as limitations or factors of interest in Quinn et al. [4,31]. In this study, we do not empirically assess any link between inquiry-based course design and inequities, but rather treat it as motivation for studying how course elements can be scaffolded to ameliorate them.
We choose to focus this work on the gendered aspect of group dynamics for three reasons. One, a number of previous works have also focused on gender [e.g., 4,32], so this allows direct comparison between student populations. This is important since there is evidence that the effects of a given curriculum can depend on a student's demographics and other characteristics [33,34]. Two, gender represents a subdivision of students with large populations in two categories. Three, if gender inequities are observed, they may signal broader inequities and provide impetus for follow-up studies examining other dimensions of identity and associated inequities.
The first goal of this work is to extend previous results from Cornell University [4,32] to the context of inquiry-based labs for non-physics STEM majors at a large public, research-intensive institution in the southern U.S., The University of Texas at Austin. Additionally, by examining student-reported preferences for group work and actual observed behaviors in a single study, we are also able to unify the previous results [4,32] by explicitly controlling for preferences in modeling role division in lab activities. The second goal of this work is to provide an assessment of an intervention rooted in social constructivism that is meant to remediate the anticipated inequitable dynamics and which involved students completing partner agreement forms.
This paper is organized as follows: In Section II we provide theoretical rationale relevant to the study. In Section III we explain the instructional context. In Section IV we explain the motivation and implementation of our partner agreement intervention. The next three sections present the methods (Sec. V), results (Sec. VI), and analysis and discussion (Sec. VII) of the two major data sets collected: the preferences survey and the video observations. We conclude by synthesizing our results and their implications both for instruction and for future research on the dynamics of group work in physics lab courses in Section VIII.

A. Constructivism, Instructivism, and Equity
Many research-based instructional practices common in the physics education research literature fall under the umbrella of the learning theory known as constructivism. Constructivism posits that knowledge is constructed by a learner through the active linking together of previous ideas or pieces of information [35]. It contrasts with "traditional" teaching methods, which are based on instructivist or behaviorist views of learning [36] wherein knowledge is a collection of facts or skills that need to be transmitted from teacher to students [37]. Modern pedagogical practices such as active learning [2], inquiry-based learning [28,38], interactive engagement [39], and peer-assisted learning [40,41] frequently incorporate students working in groups and are generally rooted in a more specific type of constructivism called social constructivism. Social constructivism emphasizes that knowledge construction occurs through social interaction with people [42]. As reviewed in the introduction, there is empirical evidence that pedagogical practices based on social constructivism can enable inequities absent proactive remediation efforts on the part of the instructor [4,8,10-15,32]. Here we define, and will subsequently focus on, equity as structuring course policies that directly consider students' identities and backgrounds, so that current and past structural injustices are addressed to help students learn the course material (similar to [43,44]). Equality, on the other hand, is having course policies that treat all students the same regardless of their identities or backgrounds.
There are several theoretical reasons why we might expect some pedagogies grounded in social constructivism to have the capacity to inadvertently reinforce inequitable outcomes. A few well-studied examples which apply specifically to inquiry-based labs include:
• Stereotype threat: Inquiry-based labs often require students to engage in open-ended exploration and problem-solving, which can lead to increased performance pressure. Stereotype threat, the concern of confirming negative stereotypes about one's group, can be heightened in such situations [45]. This pressure can disproportionately affect systemically non-dominant groups, including women, and impact their performance and confidence in the lab setting.
• Confidence and self-efficacy: Research suggests that women, on average, may exhibit lower confidence and self-efficacy in STEM fields compared to men [46-48]. Inquiry labs can involve higher levels of autonomy, uncertainty, and risk-taking, which may lead to, or be impacted by, decreased confidence among students who are less familiar with this type of learning environment. Lower levels of confidence and self-efficacy can influence participation and engagement [49].
• Identity-based participation patterns: Lab classes in general involve working in groups and group problem solving. Since identity-based inequities exist in society, for example involving gender, race, class, and other categories, they can play a role in governing the division of labor in group work. This division of labor can perpetuate traditional (e.g., gender) roles and create inequitable participation opportunities and experiences [50,51].
• Classroom climate and peer/instructor bias: Group work involves interaction with peers, and inquiry labs typically require proactive support from instructors. Biases, even unconscious ones, may influence interactions with and between students and, even inadvertently, affect the experiences and performance of systemically non-dominant groups [52,53].
In the case of lecture courses this issue has been studied extensively [8,10-15,54] and some specific remediation strategies have been proposed, studied, and found to be effective [12,14]. All lab classes which involve group work, similarly, can produce inequities. Based on these theoretical considerations, we expect that inquiry-based labs would be especially susceptible to these problems, and this was found in the aforementioned study [4], which directly compared traditional and inquiry-based labs. In particular, from both theoretical and empirical angles, we expect group work to be a locus of inequitable dynamics in lab courses in general and inquiry-based labs in particular.

B. Group Work Practices
To understand how students work together in groups, and how instructors might structure group work to ensure pedagogically sound and equitable outcomes, it is useful to consider three distinct group work practices: instructivist, collaborative, and cooperative learning. A visual depiction of our framework, explained below, is provided in Figure 1.
Instructivist approaches involve highly structured group work. In theory, this maximizes instructor control over group dynamics. This may have the benefit of allowing the instructor to preclude inequitable outcomes by carefully structuring groups and how groups operate. On the other hand, by at least some measures, instructivist methods can be pedagogically inferior [2,3] to other methods, such as those rooted in social constructivism, because they do not incorporate active involvement on the part of the student. Especially if effective group work constitutes a learning goal of a course, not just an instructional tool, we would prefer to employ a different framework for organizing group work in class.
Collaborative learning and cooperative learning, meanwhile, are rooted in social constructivism. In collaborative learning, instructors direct students to work together in groups freely, without assigning roles or structuring group work, provided they achieve a certain goal or outcome by dividing tasks effectively amongst themselves. Cooperative learning is more structured, and involves scaffolding group work so that students work together more cohesively, preferably because the goal or outcome is not achievable by individuals working independently and merely collating their work at the end. According to Davidson [55], cooperative learning has specific characteristics: a common task or learning activity suitable for group work, small-group interaction, norms for cooperative and mutually helpful behavior among students, individual accountability and responsibility (with a possibility of including group accountability), and positive interdependence. It is important to differentiate the structure provided to students in instructivist and cooperative learning: the former dictates group structure and dynamics, while the latter scaffolds students' self-regulation.
Collaborative learning has the feature of being aligned with social constructivism, but it is highly plausible, and has been argued in the literature [56], that the free-form nature of collaborative group work may enable social and cultural factors to preserve, reproduce, or produce inequities. Group work in inquiry-based labs without explicitly structured group dynamics resides within this framework.
Cooperative learning methods offer both alignment with social constructivism, which is better supported empirically [2,3] and matches the inquiry-based learning framework of our lab courses, and specific structures meant to ensure healthy group dynamics. We argue it is therefore an appealing framework for resolving inequities observed in group work in inquiry-based labs.
We will rely on instructivist, collaborative, and cooperative learning to frame prescriptions for group work as well as our proposed intervention, as explained in the corresponding section (Sec. IV).

III. INSTRUCTIONAL CONTEXT
We investigated two introductory physics lab courses, which took place during the Fall 2022 semester at The University of Texas at Austin. While the two courses are sequential and together constitute a two-semester introductory sequence, we studied students from the two courses simultaneously; we did not track a cohort of students through both courses. Each course is a single credit hour taken by students from one of three introductory corequisite lecture sequences: algebra-based physics, calculus-based physics for life science majors, and calculus-based physics for engineering majors. In some cases, students with prior credit are not enrolled in a corequisite lecture course. This setup mixes students from all tracks into the same lab sections and provides an important dimension of diversity in these lab courses.
The first course, which we will refer to here as Physics I Lab, covers standard topics in mechanics. The second, which we will refer to as Physics II Lab, continues with optics, electromagnetism, and some modern physics. Both courses are designed to implement the Structured Quantitative Inquiry framework [59-61], which has some similarities to the Investigative Science Learning Environment [62] and Scientific Community [63] approaches. The Structured Quantitative Inquiry framework provides students with genuine investigative freedom, supported by research-based scaffolding (i.e., invention activities [64]), and requires students to make fully quantitative comparisons of models with data.
Both courses are very large (∼1,000 students per course) and diverse along several dimensions. At the university level, approximately 20% of students are first-generation college students [65] and a similar percentage are eligible for Pell grants [66,67], a federal financial aid program open to students with significant financial need. This institution is also designated a Hispanic Serving Institution [68]. The students in both lab courses are a representative cross-section of this student body, as shown in the demographic breakdown from an anonymous survey in Table I.
Lab sections are taught by graduate teaching assistants (TAs), occasionally with assistance from undergraduate learning assistants (LAs). Each course is supervised by a faculty instructor of record and two to three graduate assistant instructors, or "head TAs." Head TAs collectively are responsible for helping with curriculum development, running weekly instructional meetings, resolving grade disputes, and supporting the other TAs.
Each course includes nine lab sessions. Each session is three hours long and meets once a week. During a given lab session, students usually work in groups of two, but occasionally three. In rare instances, students work in a larger group or on their own, but that practice is discouraged. Lab activities typically involve designing an experiment to test how well a given model describes a physical system. Student groups collectively turn in a single set of informal, but structured, "lab notes" at the start of the following lab session which document their procedure, analysis, results, and conclusions. This gives students a week to work on analysis and writing outside of class. Because of this, most students spend class time collecting and analyzing data and leave the write-up for outside of class.
Students are allowed to pick their own partners throughout the semester. They change partners/groups every three labs, so that they work with three distinct partners or groups in a given semester. This is occasionally complicated by absences or students dropping the class, in which case groups may be slightly shuffled.
In addition to lab assignments, students individually complete pre-lab activities and a final capstone quiz or project. The pre-lab activities are completed before lab sessions as quizzes on the Canvas Learning Management System and are graded upon completion. In Physics I Lab, the final assignment is a Lab Practical Quiz that is meant to test students' mastery of essential measurement methods, analysis tools, and familiarity with equipment. In Physics II Lab, the final assignment involves students proposing and executing their own experiment on a topic of their choice and turning in a scientific poster. The inclusion of these end-of-semester individual assignments, which are worth 20% of students' final grade, is meant to motivate individual responsibility, an important best practice for group work [69].

IV. INTERVENTION
Given the similar pedagogical framework and cultural context, we anticipated observing inequitable group dynamics similar to those in previous work [4]. As such, we designed an intervention aimed at remediating expected inequities. Below we explain the motivation, form, and implementation of this intervention.
Some prescriptions for improving the equity of group work exist in the literature. For example, a highly structured approach to designing and managing groups is advocated by Heller et al. [16,17]. In this framework, students are assigned specific roles which are regularly rotated, are prompted to write reflections on their experiences with their group, and are given group assignments that avoid placing women in the minority. In a summary of effective group work practices for college courses, Rosser's suggestions include ensuring rotation of instructor-defined roles throughout the semester and avoiding isolating women in groups [70]. In the language of Section II, these prescriptions may be thought of as implementations of instructivist approaches with the corresponding benefits and drawbacks.
We seek alternative solutions for a few reasons. First, we believe that teaching students to work effectively in groups is an important learning goal for lab courses in itself. This is consistent with the recommendation of the American Association of Physics Teachers, which categorizes working in small groups as a component of scientific collaboration [1]. We expect that a highly structured, top-down approach to group management rooted in an instructivist approach does not provide students with a sufficiently active role to learn to resolve problematic group dynamics. In the spirit of the "structured inquiry" philosophy, which aligns with social constructivism and has demonstrated effectiveness for teaching students experimental physics [71], we seek solutions which enhance and scaffold students' active role in shaping their group work. This allows students to learn to resolve inequities and establish more effective group dynamics. Second, research has shown that highly structured approaches to group work are often met with resistance from students [72]. Third, we prefer to avoid explicit role assignment and rotation because evidence suggests it is better for student learning for them to share, not split, work, even if the splitting is equitable with respect to gender [73]. Fourth, although avoiding isolating women in groups may be sufficient to prevent inequities in collective problem solving [17], it is unclear if this is also sufficient to prevent inequitable divisions of labor (e.g., in equipment use).
We therefore designed an intervention that was meant to give students an active role in preempting and resolving problematic group dynamics themselves. In the language of Section II, we aimed to encourage cooperative group work. This intervention has three components:
• Individual Reflections: Students were given a one-time, individual writing assignment, to be completed outside of class, before any lab sections met. The assignment asked students to reflect on their values and experiences with group work and to write about them. This component of the intervention is inspired heavily by the values affirmation intervention [74,75], which has been shown to reduce gender achievement gaps on high-stakes exams by combating stereotype threat. The primary purpose of this exercise was to prime students' awareness of what was important for them in group work, so that they would be better equipped to recognize when their lab experience did not conform to their values for group work and learning. We expected that by providing students with an opportunity to reflect on their values, we would induce more dialogue between group members in lab and through partner agreement or reflection forms. We also speculated it would reduce stereotype threat or other identity-based issues which could play a role in group dynamics, as explained in Section II.
• Partner Agreement Forms: Each time students were put into a new group, they were tasked with collectively filling out a partner agreement form. This form required students to introduce themselves and to have an explicit conversation about how work would be split or shared. It was deliberately worded not to bias students towards any particular way of sharing or splitting work, while giving them an opportunity to express the preferences that were primed by the individual reflection assignment. This component of the intervention was motivated in part by results suggesting reduced inequities due to explicit conversations about equipment usage [76]. It also provides a space for norm setting and the development of positive interdependence as components of cooperative group work, as explained in Section II.
• Partner Reflection Forms: Each time students returned to the same group, they were tasked with collectively filling out a reflection form. This gave them an opportunity to discuss their experience working together the previous week. This element was borrowed from the aforementioned framework from Heller et al. [17], since it fits with our preferred approach. Importantly, our partner reflection forms differed in that they were done collectively, not individually.
The individual reflection assignment, partner agreement form, and partner reflection form can be found in the Supplemental Materials. Students were given participation credit for completing the individual reflection assignment. They were not given any points for completing the partner agreement or reflection forms, to minimize differences in grading across all sections. Instead, TAs required students to complete these at the start of class before proceeding with lab activities. It is worth noting that since students were not obliged to invest significant effort in these activities, some students may have chosen not to make maximal use of them. This was deliberate, since whether or not students proactively make use of course structures to obtain equitable and preferred modes of group dynamics is part of what we are testing in this study.
The control sections were not given the Individual Reflections, Partner Agreement forms, or Partner Reflection forms, but were otherwise treated identically to the intervention sections. It is possible that students in the control sections may have learned of some aspects of the intervention going on in the sections to which it was applied; indeed, we did not hide the study from the students. However, we do not expect this to have substantially affected our results. Small differences between lab sections are common due to differences in TA style, and students do not perceive these differences as out of the ordinary.
Partner agreement forms have been implemented and studied before, typically in courses with a project, and these studies have found that group contracts often improve communication [72,77-79]. However, few studies discuss role-taking. Students in one study were assigned a role at the start of the class [77]. In another study by Chang and Brickman on group work in an introductory biology class [72], students were instructed to rotate roles explicitly as well as to write and follow group contracts. However, students did not assign or rotate roles explicitly and they often disregarded their group contracts. These implementations differ from ours, as we wanted students to share their roles in a context where their work varied week-to-week.
We assess the success of the intervention by its effects on the equity of group dynamics as observed in the video recordings. We expect that the intervention will eliminate or mitigate gendered role-taking without introducing new inequities, as compared to the control sections.

V. METHODS
A. Preferences Survey
Prior to the first week of lab, students were asked to indicate their preferences for different lab activities, forms of role distribution, and leadership styles [32]. Here, the text and responses of these questions are reproduced.
"Which of the following experiment tasks do you prefer taking on? (Select all that apply)"
a.) Setting up the apparatus and collecting data.
All three preferences questions were closed response. Multiple preferences could be selected on the activity preferences question. Only one preference could be selected on the role distribution and leadership preferences questions. These questions appeared in one of the mandatory pre-lab quizzes that students completed electronically before each lab (see Section III). We had 1,871 completed responses.
We examined survey results for differences across gender, controlling for course, track, and the interaction of course and track in our model. The three lecture tracks act as a proxy for student majors and therefore for motivations, which may in turn influence preferences. Additionally, student preferences in Physics II Lab may differ from student preferences in Physics I Lab because students have gained more familiarity with the course structure and the lab's style of group work. These changes in preferences between courses can also vary with lecture track, as some students' preferences may evolve differently to better match their priorities. These various conditions are controlled for in the logistic regression model for role preferences given in Equation 1, where R is the response variable, which is the binary preference selected by students.
For the activities preference survey, because students could select as many or as few responses as they chose, we modeled each of the five responses with a separate binary logistic regression, as in Equation 1. We identified gendered differences in the activities preferences using the regression estimates. For the role distribution and leadership preferences questions, because students could select only one response, we treated each question as a multinomial logistic regression controlling for the same factors as shown in Equation 1. For an introduction to multinomial logistic regression and its uses, see work by Theobald et al. [80]. For these role distribution and leadership preferences questions, we used pairwise comparisons of means to identify gendered differences in specific answer choices.
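As an illustration of the structure of the binary model described above, the following Python sketch dummy-codes one student's gender, course, lecture track, and the course-by-track interaction, and maps a coefficient vector to the probability of selecting a given task. It is a sketch under stated assumptions, not the analysis code used in this study: the function names, the track labels, the choice of reference levels, and the binary coding of gender are all hypothetical, and no coefficient values are taken from the data.

```python
import math

# Hypothetical track labels; the first entry is an assumed reference level.
TRACKS = ("algebra", "life_science", "engineering")


def design_row(is_woman, is_physics2, track):
    """Dummy-code one student as
    [intercept, gender, course, track dummies, course x track dummies]."""
    if track not in TRACKS:
        raise ValueError(f"unknown track: {track}")
    gender = 1.0 if is_woman else 0.0      # assumed binary coding
    course = 1.0 if is_physics2 else 0.0   # Physics I Lab assumed as reference
    track_dummies = [1.0 if track == name else 0.0 for name in TRACKS[1:]]
    interaction = [course * t for t in track_dummies]
    return [1.0, gender, course] + track_dummies + interaction


def selection_probability(coefs, row):
    """Inverse logit of the linear predictor: P(R = 1) for one student."""
    eta = sum(b * x for b, x in zip(coefs, row))
    return 1.0 / (1.0 + math.exp(-eta))


def gender_odds_ratio(coefs):
    """exp(gender coefficient): odds of selecting the task for women
    relative to men, holding course and track fixed."""
    return math.exp(coefs[1])
```

In this coding, a gendered difference in a task preference corresponds to a gender coefficient that differs from zero, or equivalently an odds ratio that differs from one; one such model would be fit per answer choice on the activities question.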

B. Video Observations
To evaluate how students divide tasks in the setting of a laboratory course, we conducted observations of recorded sections. Out of 93 lab sections covering both Physics I and Physics II Labs during the Fall 2022 semester, we video recorded four sections. These sections included one control and one intervention section for Physics I Lab, and one control and one intervention section for Physics II Lab. All sessions for the semester were recorded for these sections.
All four recorded sections were taught by "head TA" assistant instructors, as opposed to TAs. These sections were chosen for analysis under the assumption that they would be the most uniform subset of sections, as well as most adherent to the course objectives, minimizing instructor effects compared to using novice TAs. The four sections took place at the same time and weekday, and included two sections of the Physics I Lab and two sections of the Physics II Lab, with a control and partner agreement intervention section for each.
Labs started with a brief lecture from the TAs that ranged from roughly 10 to 30 minutes. During this period, students listened and took notes and did not start on lab work until the lecture concluded. We did not code student activities during this period. Once the TA finished their lecture, we began coding student activity. Every five minutes, a researcher coded what each student was doing at that time according to our coding scheme described below. When it was unclear what a student was doing at these five-minute increments, we checked the video up to 30 seconds before and after to make a determination.
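The sampling procedure described above can be illustrated with a short Python sketch. The function names, the choice to place the first coded time point exactly at the end of the lecture, and the example durations are all assumptions for illustration; the text only specifies the five-minute spacing and the 30-second review margin.

```python
def coding_times(lecture_end_min, session_end_min, step_min=5):
    """Minutes (from the start of the recording) at which each student is coded.

    Assumption: the first time point falls at the end of the TA lecture.
    """
    return list(range(lecture_end_min, session_end_min + 1, step_min))


def review_window(time_min, margin_s=30):
    """Span of video, in seconds, that may be reviewed around an unclear time point."""
    center_s = time_min * 60
    return (center_s - margin_s, center_s + margin_s)


# Hypothetical session: lecture ends at minute 20 of a 180-minute recording.
times = coding_times(20, 180)
window = review_window(times[1])
```

Each student then contributes at most one code per time point, which is what makes code counts comparable across students in the same group.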
We used a coding scheme similar to Quinn et al. [4] (see Table II for a description of each code). The 'Other' code covered a broad range of activities including students being off-task (e.g., using their phone, leaving the room, and talking to peers) as well as on-task (e.g., thinking and discussing with peers, LAs, or TAs). Students who were looking at, but not touching, a computer screen they had been scrolling on or typing at within 30 seconds, or who were looking at a piece of paper and holding a pencil without actually writing, were coded as the closest relevant activity rather than 'Other.' Additionally, when a student was holding equipment, but not using it to explicitly conduct the experiment, we coded it as 'Equipment.' A student who watched another student do an activity was coded as 'Other.'

TABLE II. Coding scheme used for video observations. The 'Laptop,' 'Calculator,' and 'Paper' codes were later collapsed.

Code | Description
Equipment | Student was handling the equipment. This includes handling objects that are not necessarily lab equipment (e.g., a phone) when it was explicitly obvious the materials were being used to conduct the experiment (e.g., timing a pendulum's period).
Desktop | Student was operating a lab desktop computer.
Laptop | Student was using a personal computer. This includes iPad or tablet use, but excludes cell phone use.
Calculator | Student was using a calculator. This includes cell phone use when it was explicitly obvious the phone was being used as a calculator.
Paper | Student was using pen and paper.
Other | All other student activities.

In our analysis, we chose to combine the 'Laptop,' 'Calculator,' and 'Paper' codes because the distinction between these activities was insubstantial in two important ways. First, students often used their personal computers to do calculations and take notes. Second, all three activities were associated with analysis or report-writing and required technical understanding but not physical engagement with lab materials. They were thus functionally similar to one another, but distinct from conducting the experiment itself. The 'Desktop' code was not included in the grouping of 'Laptop,' 'Calculator,' and 'Paper' because the desktop computer had mixed uses. Students often used the desktop computers to collect data, so desktop use could be linked to the 'Equipment' code. Nevertheless, desktop computers were also often used to read lab instructions, conduct data analysis, or write lab notes, which could align desktop use more with the 'Laptop,' 'Calculator,' and 'Paper' codes. Due to this conflict, we left 'Desktop' as an independent code.

Coders and Inter-rater Reliability
To establish inter-rater reliability, three researchers coded 23 students in a single 3-hour recorded class session. For that purpose, we chose the second lab session in the Physics I Lab course. We chose that lab session because students frequently use a diverse set of materials and methods, which would make any difficulties with using the coding scheme apparent. For example, many students use their phones as timers in this lab session; coding this as equipment use requires more careful observation than coding more obvious equipment use, such as using a scale, in later labs.
We obtained a Fleiss' Kappa [81] of 0.80; Kappas over 0.75 signify excellent agreement [82]. When we combined the 'Laptop,' 'Calculator,' and 'Paper' codes, our Fleiss' Kappa increased to 0.84. After coding, the researchers discussed their disagreements and resolved any disputed coded segments by coming to consensus. There were no trends in which codes caused more disagreements. The researchers then coded separate sections. Two researchers (MD and AL) each coded one section of the Physics I Lab and one researcher (EH) coded both sections of the Physics II Lab.
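For readers unfamiliar with the statistic, the following Python sketch shows how Fleiss' Kappa can be computed when every coded segment is labeled by the same number of raters. The function name, the input layout, and the toy labels are assumptions for illustration; this is not the script used in the study.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' Kappa for a list of items, each a list with one label per rater.

    Assumes every item has the same number of raters and that more than one
    category is actually used (otherwise the denominator is zero).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]

    # Overall proportion of assignments that fall in each category.
    p_cat = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]

    # Observed agreement for each item, then averaged over items.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Agreement expected by chance.
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy data: four coded segments, three raters, two codes.
toy = [
    ["Equipment", "Equipment", "Equipment"],
    ["Equipment", "Equipment", "Other"],
    ["Other", "Other", "Other"],
    ["Equipment", "Other", "Other"],
]
kappa = fleiss_kappa(toy, ["Equipment", "Other"])
```

Merging categories, as was done for 'Laptop,' 'Calculator,' and 'Paper,' amounts to relabeling the inputs before calling such a function.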

Video Observations Quantitative Analysis
While our data was coded in segments, students spent varying amounts of time in the lab. Since our research questions relate to how students work in groups, we normalized observations to a student's group. We refer to this type of data presentation as a student's "group fraction." A student's group fraction for a coded activity in one class session is the fraction of codes we have of that activity out of the group's total number of codes for that activity in one class session. In a former study, Day et al. referred to this as "normalized participation" [6]. For a given student and class session, the group fraction for an activity ($g_{\mathrm{activity}}$) is the number of codes of that student for that activity ($N_{\mathrm{activity}}$) divided by the sum of the number of codes of that activity ($N_{\mathrm{activity}}$) over all the students in that group, as given in Equation 2,

$$ g_{\mathrm{activity}} = \frac{N_{\mathrm{activity}}}{\sum_{\mathrm{group}} N_{\mathrm{activity}}}. \tag{2} $$

For example, if in one class session we coded a student using equipment twice and their partner using it eight times, the former had an equipment group fraction of 0.2 and the latter 0.8. If we did not observe a group doing an activity for an entire class session, no student in that group was assigned a group fraction for that activity. While this is infrequent for codes such as 'Equipment,' the appearance of codes such as 'Desktop' varied by group and lab.
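The group fraction can be computed directly from per-student code counts, as in this minimal Python sketch. The function name and dictionary layout are assumptions for illustration; the numbers reproduce the two-student example from the text.

```python
def group_fractions(code_counts):
    """Group fraction per student for one activity, one group, one class session.

    code_counts maps each student in the group to the number of times that
    student was coded doing the activity. If the activity was never observed
    for the group, no student is assigned a group fraction (empty result).
    """
    total = sum(code_counts.values())
    if total == 0:
        return {}
    return {student: n / total for student, n in code_counts.items()}


# One student coded with equipment twice, their partner eight times.
fractions = group_fractions({"student_a": 2, "student_b": 8})
```

By construction the fractions within a group sum to one, which is why an increase for one student necessarily lowers the fraction of a groupmate.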
A summary of our student population and the total number of class observations can be found in Table III. Most student genders were obtained in an optional supplemental survey given to students who agreed to participate in video recordings. Students were able to self-identify their gender in this survey. A total of 14 students did not fill out this survey, so we supplemented our data with university enrollment information. None of the students who filled out the survey identified as non-binary/other gender. Therefore, our data set only considers men and women.
To analyze our data, we used hierarchical linear modeling. Hierarchical linear modeling is a form of data analysis that accounts for nested structures of data. We use this form of analysis because we have repeated observations of students in each group. For an introduction to hierarchical linear modeling, see work by Van Dusen and Nissen [83]. For our hierarchical linear model, we treat our data as a two-level model where level-1 data is a group fraction from one class session and level-2 data is a student in a particular group. This allows us to have repeated observations of a student for each group with which they work. We do not consider students across the whole semester as our level-2 data because we want to know how students act within each individual group. Additionally, we do not use a three-level model that takes groups into account as this would violate the assumption of independence of observations, as an increase in one student's group fraction necessitates a decrease in another student's group fraction.
For level-1 data, our 'Equipment' group fraction for the ith observation of the jth student, $g_{\mathrm{equipment},ij}$, is modeled as in Equation 3, where $\beta_{0j}$ is the intercept term, GroupSize is the number of group members associated with the observation, and $r_{ij}$ is the residual term. The level-2 data, which represents a student for each group they join, is modeled as in Equation 4.

(Table footnote a: Note that one student preferred not to share their gender.)

The 'Laptop,' 'Calculator,' and 'Paper' model is very similar, but the EquipPref term is replaced by two separate terms: one for a preference for notes and one for a preference for analysis (and each with a term interacting with Int). A visual and statistical check ensuring the assumptions of our hierarchical linear model are met can be found in Appendix B. Additional models accounting for group gender composition were examined as part of this study to see if this had a measurable effect on outcomes when accounting for preferences. We investigated this because previous research has suggested that women take on different lab roles when in mixed-gender groups [4,7]. Model-quality statistics, AIC and BIC, indicated that this additional dimension did not improve on Equation 4. This may be due to lacking a sufficient number of observations from each context. As this more complex model was not a statistical improvement, we do not include its discussion here, but note it may be valuable to examine in future studies.
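To make the two-level structure concrete, a generic form consistent with the description of the 'Equipment' model can be sketched as follows. The $\gamma$ coefficient labels, the fixed group-size slope $\beta_1$, the Woman-by-Int interaction, the shorthand Track term for the lecture-track indicators, and the random intercept $u_{0j}$ are assumptions for illustration; only $\beta_{0j}$, GroupSize, $r_{ij}$, EquipPref, and Int are named in the text.

```latex
\begin{align*}
\text{Level 1:}\quad
  g_{\mathrm{equipment},ij} &= \beta_{0j} + \beta_{1}\,\mathrm{GroupSize}_{ij} + r_{ij} \\
\text{Level 2:}\quad
  \beta_{0j} &= \gamma_{00}
    + \gamma_{01}\,\mathrm{Woman}_{j}
    + \gamma_{02}\,\mathrm{Int}_{j}
    + \gamma_{03}\,\mathrm{Woman}_{j}\times\mathrm{Int}_{j} \\
  &\quad
    + \gamma_{04}\,\mathrm{EquipPref}_{j}
    + \gamma_{05}\,\mathrm{EquipPref}_{j}\times\mathrm{Int}_{j}
    + \gamma_{06}\,\mathrm{Track}_{j}
    + u_{0j}
\end{align*}
```

In a model of this shape, the baseline intercept $\gamma_{00}$ corresponds to a man in the control section who did not indicate an equipment preference, and the remaining coefficients are shifts relative to that baseline.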

VI. RESULTS
In this section we briefly present key results from the preferences survey and video observations. We leave further analysis and interpretations to Section VII.

A. Preferences Survey
The expected fraction of student preferences for certain roles in the lab, controlling for different courses and tracks of physics that students were enrolled in, is shown in Fig. 2. We find that women and men indicate different preferences at the beginning of the semester. Women more often prefer notes (p < 0.001) and managing (p < 0.001), while men more often prefer equipment (p = 0.024), analysis (p = 0.001) or have no preferred role (p = 0.004). The full numerical results of our regression models are included in Appendix C.
The expected fraction of student preferences for role distributions in lab, controlling for course and track, is shown in Fig. 3. From a pairwise comparison of means, we find that men are more likely than women to report having no preference (p = 0.001).
The expected fraction of student preferences for leadership in lab, controlling for course and track, is shown in Fig. 4. From a pairwise comparison of means, we observe that women more often reported a preference for taking turns in leadership (p < 0.001) while men more often reported no preference (p = 0.004).

B. Video Observations
The results of our regression model for 'Equipment' group fraction are shown in Figure 5 and Table IV. In Physics I Lab without partner agreements, we find that women are responsible for less equipment usage than men (β = −0.226 ± 0.061, p < 0.001) accounting for group size, lecture track, equipment preference, and random effects. However, in Physics II Lab without partner agreements, we do not observe a gendered difference in equipment usage (β = 0.066 ± 0.088, p = 0.452).
The partner agreements had differing effectiveness across the two courses. In Physics I Lab, compared to men in the control section, men in the partner agreements section had a lower average 'Equipment' group fraction (β = −0.283 ± 0.080, p < 0.001), while women had a higher average 'Equipment' group fraction (β = 0.222 ± 0.076, p = 0.004). In Physics II Lab, compared to men in the control section, we did not observe a statistically significant difference in 'Equipment' group fraction for men or women who used partner agreements.
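As a rough plausibility check on how the reported coefficients, standard errors, and p-values relate, the Python sketch below computes a two-sided p-value from an estimate and its standard error. It assumes a normal (Wald) approximation, which is our simplification; the study's software may have used a t-based test with model-specific degrees of freedom, so small differences from the reported values are expected.

```python
import math


def wald_p_value(estimate, std_error):
    """Return (z, two-sided p) for a coefficient under a normal approximation."""
    z = estimate / std_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Coefficients and standard errors quoted in the text above.
z_women_phys1, p_women_phys1 = wald_p_value(-0.226, 0.061)
z_women_phys2, p_women_phys2 = wald_p_value(0.066, 0.088)
z_women_int, p_women_int = wald_p_value(0.222, 0.076)
```

Under this approximation the Physics I Lab gender coefficient sits well below the 0.001 threshold, while the Physics II Lab coefficient is smaller than its own standard error and is far from significance.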
Notably, indicating a preference for using equipment in Physics I Lab did not lead to a statistically significant increase in 'Equipment' group fraction for students with or without partner agreements. In Physics II Lab, similarly, indicating a preference for equipment did not lead to a statistically significant increase in 'Equipment' group fraction for students with or without partner agreements.
Figure 6 and Table V show the results of our regression model for 'Laptop,' 'Calculator,' and 'Paper' group fraction. In both Physics I Lab and Physics II Lab without partner agreements, we do not see any statistically significant difference in 'Laptop,' 'Calculator,' and 'Paper' group fraction between men and women. We also see no statistically significant effects from partner agreements on men's or women's 'Laptop,' 'Calculator,' and 'Paper' group fraction. Similarly, we see no statistically significant effects from student preferences on 'Laptop,' 'Calculator,' and 'Paper' group fraction in either course with or without partner agreements.

FIG. 6. Results from multilevel regression in (a) Physics I Lab and (b) Physics II Lab for 'Laptop,' 'Calculator,' and 'Paper' group fraction. These results are controlling for group size, lecture track, and random effects. The base term for each course is the group fraction of men in the control section and enrolled in the algebra-based lecture track who did not indicate a preference for notes or analysis. The error bars represent the standard error of the regression coefficients. There were no statistically significant differences. The full model output can be found in Table V.

VII. ANALYSIS & DISCUSSION
In this section we build on the results presented briefly in the previous section. We analyze and interpret these results and models in the context of our theoretical lens and research questions, presented earlier in the paper.

A. Preferences Survey
When surveyed at the beginning of a semester, the most popular lab activity among students was equipment use. Men indicated a preference for equipment usage slightly more often than women. The difference in the expected fractions between men and women is similar in magnitude to the difference found by Holmes et al. [32], although they did not conduct tests of statistical significance on their data set. Men were also more likely to prefer the analysis role or to indicate having no role preference. Women more often expressed a preference for note-taking and managing at the start of the semester. This may be related to the "Hermione" and "secretary" archetypes from Doucette et al. [7], suggesting that previously observed gendered division of labor may be driven in part by student preferences.
For student preferences in role distributions, men and women in our courses have similar preferences, although men are more likely to have no preference. Generally, students nearly equally prefer working on different tasks or working together on the same tasks. In the language of Section II, this could suggest that students are similarly likely to prefer collaborative or cooperative modes of group work. This is notably different from the findings of Holmes et al. [32], where both men and women preferred working on the same task together in a laboratory class targeted towards physics majors. This comparatively strong preference among students in our study for splitting the work may be because these students, not being physics majors as was the case in Holmes et al. [32], prioritize efficiently completing lab work over content mastery. Another difference with previous results appears in the leadership preferences. In Holmes et al. [32], students were unreceptive to a singular leader and were comparatively more likely to prefer having no leader. A large fraction of our study's students want some form of leadership, whether that is a rotating or singular leader.
The observations of the last paragraph have important implications, since student preferences inevitably intermix with course structures and interventions to produce outcomes. Differences in student preferences between populations suggest best practices for group work may require some institution-specific or course-specific tailoring.

Control and Partner Agreements
In the non-intervention sections, we observed men being responsible for more equipment usage in their groups than women in Physics I Lab, but not in Physics II Lab. However, we did not find gendered differences in how men and women used laptops, calculators, and paper in their groups. Students primarily used their laptops for data analysis and note-taking, while they near-exclusively used calculators for analysis and paper for notes. This suggests that, in terms of roles, men were more likely to be a group's equipment user. We cannot claim that men or women are more or less likely to be note-takers or data analysts.
Men being more likely to be their group's equipment user echoes the results of previous studies that have examined student roles in physics labs. In observations of similarly structured inquiry-based physics labs, Quinn et al. [4] found that men were more often responsible for equipment usage; however, they also found that women used laptops more. Another study of the labs at the same university found that equipment usage was similarly gendered for in-person courses, but that online courses with fixed groups across the semester resolved this inequity [76].
We found differences compared with a study by Day et al. [6], who analyzed role distribution among mixed-gender pairs of students. While their coding system differed from ours in that they had codes just for equipment, computer, and everything else (also called 'Other'), they found that men and women used equipment a similar amount. However, they observed that men more frequently used computers and women were more frequently coded 'Other.' Since students submitted notes on paper in that course, this suggests that men in the course did more data analysis while women did more note-taking.
Recall that in Section IV we provided a consideration for assessing the effectiveness of the intervention. It is effective if it resolves or at least mitigates any gender-inequitable division of roles and does not introduce any new gendered inequities. By this metric, the results are positive regarding the effectiveness of the partner agreement intervention.
Our results suggest that the partner agreements led to more equitable equipment usage among students in Physics I Lab. For 'Laptop,' 'Calculator,' and 'Paper' in Physics I Lab, the partner agreements did not alter the gendered distribution of labor. Similarly, when we consider Physics II Lab, we see no statistically significant shifts for 'Equipment' group fraction or 'Laptop,' 'Calculator,' and 'Paper' group fraction.
While partner agreements do not explicitly prompt students to consider gender-equitable labor, their effectiveness for 'Equipment' usage in Physics I Lab has a possible explanation. In Physics I Lab, students are required to complete a practical quiz at the end of the course which tests, among other things, skills with equipment. When completing partner agreements, students may be more motivated to ensure everyone gets equal experience with the equipment or more motivated to self-advocate for a role in equipment use.
Zhang et al.'s findings on the effects of formal contracts and competence trust on group work might offer an additional lens on the outcomes observed here [84]. Competence trust is how confident a student is in their partners' experience and ability to complete a task at a high level [85]. Zhang et al.'s findings show maximal benefit to group work when group contracts are present and groups have mild competence trust, where neither partner is perceived to be notably more or less capable by the other. In the courses studied here, competence trust may be activity-specific; students may inherently perceive some others as more or less competent in Physics I Lab due to their experiences with and perceptions of the subject, equipment, or processes. In Physics II Lab, however, students may perceive each other as being more uniformly capable for two reasons. They may perceive everyone as less capable due to the sophisticated, and unfamiliar, electronics in Physics II Lab. Alternatively, they may perceive everyone as more capable because they all completed Physics I Lab and have gained familiarity with group work in a university physics lab context. Similarly, students in Physics II Lab may have more college experience or higher maturity levels, which could impact students' perceptions of each other's capabilities and influence the expression of their own interests.
Despite students not being given a grade incentive, the changes in equipment usage from the intervention suggest students are engaging with and using the partner agreements. It is possible that students use the conversations prompted by the partner agreements as a tool to complete group work more efficiently, but the results noted here show a more equitable outcome, at least for equipment usage, as a result of their inclusion in the course.

Role Preferences
Across the Physics I and Physics II Labs, we found that student preferences at the start of the semester did not correlate with their frequency of engaging in observed lab activities. That is, students who indicated a preference for using equipment were not later observed to use equipment more often. Interestingly, this behavior was consistent across both control and partner agreement sections. This common trend of student preferences not affecting actual roles suggests students are dividing roles for other reasons. Previous research has found that students often informally take on lab roles [4,32]. Our results suggest that these informal role assignments are not simply due to students with proclivities towards certain work instinctively taking it up. In their study of a project-based physics lab, Stump et al. [19] found a managerial role allowed some women a form of self-expression which could promote identity development, engagement, and learning. If students are not taking on roles they want to do, it could potentially impact their learning outcomes and affect (e.g., attitudes).

C. Limitations
While student preferences for roles were probed in the survey, our observation protocol did not directly observe students within all of these roles. Notably, we did not directly observe note-taking and data analysis; we observed students using laptops and calculators and writing on paper. Because these activities encompass both note-taking and data analysis, we cannot fully glean how students divided roles. It is possible that some students tended to take on more secretarial or lead scientist roles; however, we could not observe these differences.
It is also important to note that neither the manipulation of equipment nor note-taking should be equated with the entirety of scientific practice. Both are necessary components, and both are learning goals of the course with corresponding graded assignments (like the lab notes for each lab or the end-of-semester lab practical and final projects). But other activities beyond the scope of our video coding scheme, such as problem-posing, discussions of experimental design, and data analysis, are also important components. We do not, therefore, have a complete portrait of the inequities in this lab course, even though the differences in equipment usage which are not reducible to preferences imply inequities for at least this aspect according to the definition adopted in Section II.
Since the interventions were applied to separate sections and implemented for an entire semester, the data compare sections with potential for different random effects. Hierarchical linear modeling accounts for random differences in the behavior of students across their sessions with one group; however, there are possible random differences between sections. We could not account for these because we only have one section per condition (i.e., course and control/intervention). We attempted to minimize the differences between sections by analyzing sections taught by head TAs occurring on the same weekdays at the same times; nevertheless, uncontrollable factors can cause otherwise identical sections to differ in significant ways, which could have altered the apparent intervention effectiveness. TA behavior and identity also differed between sections. This could have had an impact on the intervention's effectiveness. However, previous research finds little impact due to instructor gender, suggesting this is unlikely [86][87][88][89], although this remains an open question.
In this analysis we have neglected to discuss a "group manager" role. This is because observations were conducted on video that lacked audio. This put identifying a group manager outside the possible scope of this study. Other studies have analyzed the positive [19] and negative [7] effects that group management can have on a woman's experience in physics labs.
We have not controlled for the possibility that students may have chosen to work with students with whom they were previously acquainted, and we acknowledge that this could play a role in outcomes. Group formation procedures were the same between control and intervention sections, so this does not impact the assessment of our intervention in its context, but this is an important factor worth considering in other contexts or future studies. See Pulgar et al. [90] for one example of a study investigating this dynamic.
We have only considered groups of sizes two and three. Other lab courses may have larger groups. Since this affects the available number of roles per student, it could conceivably produce different outcomes.

VIII. CONCLUSIONS
In this section, we summarize our findings holistically in light of our research goals, draw some conclusions, indicate implications for instruction, and suggest avenues for future work.
The first goal of this study was to apply the same methods as previous work [4,32] in the context of our courses. Given the similarity in course framework, we expected we would observe similar inequities. We did, indeed, observe inequities [4] and they do not appear to be reducible to differences in lecture tracks or role preferences among our students [32]. Given the differences in student populations between those studies and ours, we take this as evidence that they are common to inquiry-based labs or lab courses more broadly. Importantly, by including both preferences and observations in a single study, and by including preferences as a component of our model, we unify and therefore strengthen the conclusion of these studies that preferences do not account for differences in observed lab activities. In fact, we found role preferences played very little part at all in determining actual role-taking in labs, which further suggests students may require scaffolded group work to ensure their participation reflects their interests and/or is more equitable.
This emphasizes the tension between best-practice pedagogical methods and efforts to promote diversity, equity, and inclusion. An important point of Quinn et al. [4] is that these inequities are not merely background effects of culture on all physics courses; they are consequences which can be inadvertently reinforced by particular choices of curriculum design.
Although we did see some relatively small pre-semester gendered differences in preferences that match Holmes et al. [32], we also found some differences in preferences between the two student populations. Our students are more likely to prefer dividing tasks and a single leader. Our observations, however, indicate that student preferences do not result in statistically significant differences in observed behavior. In the language of Section II, this suggests that students are willing to engage in cooperative group structures with a brief, recurring intervention that does not explicitly compel them to equitably divide group tasks.
In this study we have only examined inequities based on student gender. There may be inequities based on other demographic criteria such as race/ethnicity or students' academic backgrounds [8]. We plan to follow up with future work to explore these possibilities. Given the differences between Physics I Lab and Physics II Lab, it would also be interesting to conduct a future study in which the preferences survey is administered both pre-semester and post-semester to observe how lab activities may influence preferences. Additionally, a future study could examine preferences and observations within each group to explore if preferences have different effects at the start of the semester or the first sessions of new lab groups.
The second goal of this study was to implement and evaluate the impact of an intervention meant to reduce inequitable task division. This intervention was intended to work by scaffolding group work to enhance students' active role in shaping group dynamics, ideally producing something closer to cooperative rather than collaborative group work. Our results show that the intervention had positive results when students had motivation (e.g., a summative practical quiz) to self-advocate for an equal role in equipment usage, since observations showed improvements where inequitable task division existed.
While our results offer promising evidence that small interventions can help mitigate inequitable group dynamics, further study is needed to investigate the factors noted above and to examine whether these outcomes hold in other contexts. In particular, developing interventions to further instantiate students' self-interest in exploring all roles in a lab course would be beneficial. Using these interventions to broaden student awareness of best practices may improve their intrinsic motivation and promote equitable and effective learning experiences.
A motivation for this work was the combination of theoretical reasons and empirical evidence that inquiry-based labs may enable or inadvertently reinforce inequities. But we have not tested this relationship in this work and, importantly, our conclusions may apply more broadly to traditional labs as well. In fact, our results emphasize the importance of the distinction between inquiry-based labs as a general laboratory design strategy and the structure and scaffolding of group dynamics (among other aspects of the course). Although there may be reasons to believe that inquiry-based designs have the potential to enable inequities, our results show that this can be affected by how group dynamics are managed by instructors. Because inequities exist in society, instructors need to intentionally design lab experiences that scaffold and direct groups so as to create equitable experiences regardless of other curriculum choices. Indeed, the richer and more authentically scientific lab activities are, the more this is likely to be both necessary and valuable.

We present the full results of the preferences survey from Section VI A as well as the number of students for each category, Table VII.

To conduct logistic regression for the roles students preferred, we used the glm function in the stats package included with base R [93]. For role distribution and leadership distribution questions, we conducted multinomial logistic regression using the multinom function in the nnet package [94]. For all of these questions, we then used the effects package to display the probabilities of men and women selecting each answer [95]. For pairwise comparisons of means, we used the emmeans function from the emmeans package [96].
b.) Writing up the lab procedures and conclusions.
c.) Analyzing data and making graphs.
d.) Managing the group progress.
e.) No preference or none of the above.

"Which of the following approaches to group tasks do you prefer?"
a.) One where each person has a different task.
b.) One where everyone works on each task together.
c.) One where everyone takes turns with each task.
d.) No preference.
e.) Something else.

"Which of the following approaches to leadership do you prefer?"
a.) One where one student regularly takes on the leadership role.
b.) One where no one takes on the leadership role.
c.) One where the leadership role rotates between students.
d.) No preference.
e.) Something else.

FIG. 2. Expected fraction of men and women that preferred a given role controlling for course, track, and the interaction of course and track. Error bars represent 95% confidence intervals. The asterisks denote statistical significance where * indicates p < 0.05, ** indicates p < 0.01, and *** indicates p < 0.001. Students could select as many or as few roles as they wanted.

FIG. 3. Expected fraction of men and women that preferred a given method of role distribution controlling for course, track, and the interaction of course and track. Error bars represent 95% confidence intervals. The asterisks denote statistical significance where * indicates p < 0.05, ** indicates p < 0.01, and *** indicates p < 0.001. Students could select only one answer.

FIG. 5. Results from multilevel regression in (a) Physics I Lab and (b) Physics II Lab for 'Equipment' group fraction. These results control for group size, lecture track, and random effects. The base term for each course is the group fraction of men in the control section, enrolled in the algebra-based lecture track, who did not indicate a preference for equipment. Error bars represent the standard error of the regression coefficients. Asterisks denote statistical significance: * indicates p < 0.05, ** indicates p < 0.01, and *** indicates p < 0.001. The full model output can be found in Table IV.

FIG. 8. Visual check for the assumption of normality for hierarchical linear modeling.These plots show the residuals vs. the fitted values for a) 'Equipment' group fraction in Physics I Lab, b) 'Equipment' group fraction in Physics II Lab, c) 'Laptop,' 'Calculator,' and 'Paper' group fraction in Physics I Lab, and d) 'Laptop,' 'Calculator,' and 'Paper' group fraction in Physics II Lab.
Pedagogical practices considered in this work, depicted within their respective theoretical frameworks. For each pedagogical practice, arrows indicate known links to sustained inequities (solid lines) and plausible links to fewer inequities (dashed lines).

TABLE I. Self-reported, anonymous results from a survey of student demographic information: gender (including nonbinary/other options), racial/ethnic identity [57], and parents' highest level of education [58] across courses (≈ 70% response rate). Racial/ethnic groups were not considered mutually exclusive. Counts may not equal the total, as students may not have answered all background questions or may have preferred not to disclose.
0 is the intercept (Man, Physics I Lab, Algebra-based track); Woman indicates if a student is a woman; PhysIILab indicates if a student is enrolled in Physics II Lab; CalcEngr indicates if a student is enrolled in the calculus-based physics track for engineering majors; CalcLifeSci indicates if a student is enrolled in the calculus-based physics track for life science majors; and NoCoreq indicates if a student is not enrolled in a corequisite lecture course.

TABLE III. Student and observation demographic data from the four lab sections that were recorded. Student demographic data indicate the number of men and women in each section. An observation describes one student in one lab period; thus, observation demographic data indicate the number of men and women in each session across the semester. For example, the Physics I Lab control section has 8 unique students who are men; because of absences, we have 61 observations of these 8 students across the full semester.

TABLE IV. Results from linear regression for 'Equipment' group fraction. The table shows the regression coefficient, standard error, and p-value (in parentheses). The conditional (marginal) R² values for these models are 0.41 (0.16) for Physics I Lab and 0.52 (0.08) for Physics II Lab.

TABLE V. Results from linear regression for 'Laptop,' 'Calculator,' and 'Paper' group fraction. The table shows the regression coefficient, standard error, and p-value (in parentheses). The conditional (marginal) R² values for these models are 0.55 (0.14) for Physics I Lab and 0.62 (0.10) for Physics II Lab.
Table VIII provides the roles students preferred, Table IX provides the role distributions students preferred, and Table X provides the leadership distributions students preferred.

TABLE VII. Number of men and women enrolled in each course and track for Physics I Lab and Physics II Lab.

TABLE VIII. Results from the logistic regressions for student role preferences given in Equation 1, which controls for gender, course, track, and the interaction of course and track. The table shows the regression coefficient, standard error, p-value (in parentheses), and odds ratio (in brackets).

TABLE IX. Results from the multinomial logistic regression for student role distribution preferences, similar to Equation 1 and detailed in Section V A, which controls for gender, course, track, and the interaction of course and track. The table shows the regression coefficient, standard error, and p-value (in parentheses).
Take Turns vs. Take Turns vs. Take Turns vs. Take Turns vs. Different Tasks vs. Different Tasks vs. Different Tasks vs. Work Together vs. Work Together vs. No Preference
Intercept 1.475 ± 0.222 1.913 ± 0.216 0.538 ± 0.266 -3.461 ± 1.077