Effective student teams for collaborative learning in an introductory university physics course

We have studied the types of student teams that are most effective for collaborative learning in a large freshman university physics course. We compared teams in which the students were all of roughly equal ability to teams with a mix of student abilities, we compared teams with three members to teams with four members, and we examined teams with only one female student and the rest of the students male. We measured team effectiveness by the gains on the Force Concept Inventory and by performance on the final examination. None of the factors that we examined had significant impact on student learning. We also investigated student satisfaction as measured by responses to an anonymous evaluation at the end of the term, and found small but statistically significant differences depending on how the nine teams in the group were constructed.


I. INTRODUCTION
Physics education research (PER) has led to an increased emphasis on collaborative learning in reformed-pedagogy physics courses around the world. Some of the best-known examples are Peer Instruction [1], McDermott's Tutorials in Introductory Physics [2], and Laws's Studio Physics [3]. Therefore, questions about how to structure teams of students for collaborative learning to achieve the best possible outcomes are increasingly important.
Psychologists investigate important questions about the types of collaborative learning that are most effective [4], and their research takes a number of different approaches to these questions [5]. Within the PER community, there is a long tradition of another approach to these questions: videotaping, transcribing, and analyzing student interactions [6].
In this study we instead ask some comparatively simple questions about teams of students engaged in collaborative learning. First, should the teams of students be sorted by student ability, or instead should the teams contain a mix of strong, medium, and weak students? Second, is a team of three students better than a team of four students? Finally, previous studies suggest that a team of only one female student with the rest males should be avoided, because the male students will dominate the interactions in the team [7,8]. Is this true for our students?
Regarding heterogeneous and homogeneous groups, in a study and survey of previous work, Jensen and Lawson wrote in 2011, "In sum, the few college-level studies that have been done reveal no clear consensus regarding the better group composition" [9]. As far as we know, this situation has not changed since 2011.
The prediction that learning in teams can be more effective than learning as isolated individuals was first and best articulated by Vygotsky [10]. He introduced the concept of the zone of proximal development, which describes what a student can do with help, as compared to what he or she can do without help. In the context of collaborative learning and Peer Instruction, each student's teammates form the scaffolding needed to keep the student within their most effective zone of proximal development [11]. If an instructor designs each team to have a maximal spread of talents, this should improve the availability of appropriate scaffolding to the weaker students. However, if the ability levels of two collaborators are too far separated, the efficacy of the collaboration can actually decrease. This is often pointed to as the reason why Peer Instruction is more effective: two students of differing ability level are still closer to one another than a student and an instructor. This argues that perhaps an entirely opposite strategy for prescribing team composition is best, in which students are sorted into homogeneous teams by ability: in this way, the top students would interact with each other at a higher level without making any weaker students feel excluded, while the weaker students would also be interacting with each other at an appropriate level.
The Force Concept Inventory (FCI) is an important tool in PER. The FCI was introduced by Hestenes, Wells, and Swackhamer in 1992 [12], and was updated in 1995 [13]. A common methodology is to administer the instrument at the beginning of a course, the "precourse," and again at the end, the "postcourse," and to examine the gain.
In PER, gains on diagnostic instruments such as the FCI have long been used to measure the effectiveness of instruction. For example, when Mazur converted from traditional lectures to Peer Instruction at Harvard in 1991, the normalized gains on the FCI increased from 0.25 to 0.49, showing Peer Instruction to be a better way of teaching [14]. Fagen, Crouch, and Mazur also used normalized gains on the FCI to demonstrate the increased effectiveness of Peer Instruction compared to traditional lectures by over 700 instructors at a broad array of institution types across the United States and around the world [15]. Hake's seminal paper of 1998 also used gains on the FCI for over 6000 students to demonstrate that interactive engagement was more effective than traditional instruction [16]. Recently, Freeman et al. did a meta-analysis of 225 studies comparing lectures to active engagement in science, technology, engineering, and math (STEM) courses that confirmed that active engagement was more effective than traditional lectures [17]. The Freeman meta-analysis looked at test scores and dropout rates, showing that measuring effective pedagogy using a different metric than FCI gains leads to the same conclusion.
Using gains on the FCI to measure effective teaching is still common in PER, although the questions that are being asked are somewhat more sophisticated. For example, in 2011, Hoellwarth and Moelter used gains on the FCI and the related Force-Motion Concept Evaluation instrument [18] to show that for a particular implementation of Studio Physics there was no correlation with any instructor characteristics [19]. In 2015, Coletta used FCI gains as one tool to investigate scientific reasoning ability of students and also gender correlations with performance in introductory physics [20]. Also in 2015, we used FCI gains to compare the normal 12-week term of the course that is studied here to the compressed six-week version given in the summer [21].
Although our study is of students in a freshman university physics course for life science majors, we expect that our results are relevant for many courses that do collaborative learning, in other physics courses, in courses other than physics, and probably at the secondary as well as postsecondary level.

II. COURSE
We examined team effectiveness in our 1000-student freshman physics course intended primarily for students in the life sciences (PHY131). PHY131 is the first of a two-semester sequence, is calculus based, and the textbook is Knight [22]. Clickers, Peer Instruction, and Interactive Lecture Demonstrations [23] are used extensively in the classes. The session that is studied here was held in the fall of 2014.
In addition to the classes, traditional tutorials and laboratories have been combined into a single active learning environment, which we call practicals [24]. In the practicals students work in small teams on conceptually based activities using a guided discovery model of instruction, and whenever possible the activities use a physical apparatus or a simulation. Most of the activities are similar to those of McDermott and Laws, described in Refs. [2,3], respectively, although we also spend some time on uncertainty analysis and on experimental technique such as is found in traditional laboratories. The typical team has four students. The students attend a two-hour practical every week, and there are ten practicals in the term. It is the effectiveness of the teams in the practicals that is studied here.
A third major component of the course is a weekly homework assignment. We use MasteringPhysics [25], and the typical weekly assignment takes the students about one hour to complete. Although we use some of the tutorials provided by the software to help students' conceptual understanding, the principal focus of most homework assignments is traditional problem solving, both algebraic and numeric. We expect that most students do these assignments as individuals, although we do not discourage the students from working on them together in a study team.
We gave the Force Concept Inventory to students in PHY131. The students were given one-half a point, 0.5%, towards their final grade in the course for answering all questions on the precourse FCI, regardless of what they answered, and another one-half point for answering all questions on the postcourse FCI, also regardless of what they answered. Below, all FCI scores are in percent. The student's score on the precourse FCI was used to define whether he or she was strong, medium, or weak.
Interactive engagement is the heart of the learning in our practicals, and in this study, as we discuss, we have isolated the variable of team type by constructing the teams in different ways for different groups, based on the students' strength.
We used gains on the FCI from the precourse to the postcourse as one measure of the effectiveness of instruction that arises from student interactions in the practicals. We also compared final examination grades for different types of teams. Although the FCI and the final examination both measure student understanding of related content of classical mechanics, they do not measure exactly the same thing. First, the FCI is purely conceptual, although Huffman and Heller did a factor analysis of the FCI and concluded that it "may be measuring small bits and pieces of students' knowledge rather than a central force concept, and may also be measuring students' familiarity with the context rather than understanding of a concept" [26]. Our course, and therefore the final examination, was about classical mechanics through rotational motion and oscillations, but not waves. Table I shows the type of questions that were asked on the final examination. As can be seen, 64% of the exam tests traditional problem solving, which is not examined by the FCI.
We also looked at the end-of-term anonymous student evaluations for different ways of assigning students to teams.

III. METHODS
We studied only practical teams whose membership did not change from early in the term to the end of the term, which reduced the total number of students in our sample to 690. Each practicals group contains up to nine teams, and although most teams have four students, due to logistic constraints 15% of the teams had three students and four teams out of 178 that we studied had five students. We do not allow teams of fewer than three or greater than five students. Each group of about 36 students has two teaching assistants (TAs) present at all times.
This study caused us to make only two changes in the structure of the course for this year only. In past years the students were initially assigned to teams in the practicals randomly, and halfway through the term the teams were scrambled; the first meeting of the new teams began with an activity on teamwork [27]. This term we did not scramble the teams and, not entirely because of this study, we did not use the teamwork activity. Therefore, the composition of the teams typically only changed because dropouts required some redistribution of students within a group, or, rarely, because a team was felt by the TAs to be dysfunctional or even toxic.
The second change in the structure of the course was that we assigned students to teams based on their precourse FCI score. We had 30 groups, each typically consisting of about 36 students divided into nine teams of four students. We used two methods for assigning students to teams, which we call "spread" and "sorted." For the "spread" method, which we used for about half of the groups, we assigned team numbers 1, 2, 3, 4, 5, 6, 7, 8, and 9 to the top nine students based on FCI score, respectively. Then the next nine students were also assigned to teams 1 through 9, and so on. For the "sorted" method, which we used for the other half of the groups, we assigned the students with the top four FCI scores to team number 1, then the next four students to team number 2, and so on; all four of the students with the lowest FCI scores were assigned to team number 9. In total, 16 groups were "spread" and 14 groups were "sorted." In order to avoid biases by our TA instructors, we did not inform them that we had constructed the teams in this way.
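The two assignment methods can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `assign_teams`, the student identifiers, and the treatment of groups with fewer than 36 students (the last team is simply left short) are hypothetical and are not taken from the procedure actually used in the course.

```python
def assign_teams(fci_scores, method, n_teams=9):
    """Return {student_id: team_number} for one practicals group.

    fci_scores: dict mapping a (hypothetical) student id to precourse FCI %.
    method: "spread" deals the ranked students out to teams 1..n_teams in turn;
            "sorted" fills team 1 with the top students, then team 2, and so on.
    """
    # Rank students from highest to lowest precourse FCI score.
    ranked = sorted(fci_scores, key=fci_scores.get, reverse=True)
    # Team size for the "sorted" method (ceiling division); 4 for 36 students.
    team_size = -(-len(ranked) // n_teams)

    teams = {}
    for rank, student in enumerate(ranked):
        if method == "spread":
            teams[student] = rank % n_teams + 1
        elif method == "sorted":
            teams[student] = rank // team_size + 1
        else:
            raise ValueError("method must be 'spread' or 'sorted'")
    return teams


# Hypothetical group of 36 students with distinct scores, s0 the highest.
scores = {f"s{i}": 95 - 2 * i for i in range(36)}
spread = assign_teams(scores, "spread")
sorted_teams = assign_teams(scores, "sorted")
```

With the "spread" method the top-ranked student and the tenth-ranked student both land in team 1, whereas with the "sorted" method the four top-ranked students share team 1 and the four lowest-ranked students share team 9.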

A. Classifying students and teams
We classified students as "strong," "medium," and "weak" by their score on the precourse FCI. A strong student is one whose score was in the upper third of the class, a medium student in the middle third, and a weak student in the bottom third.
We defined a weak team as one for which all students were weak, a medium team as one with all medium students, and a strong team as one where all student precourse scores were strong. A "mixed" team had at least one strong student, one medium student, and one weak student. Note that some teams, such as one with one medium student and three weak ones, are not included in any of these types.
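These definitions can be written as a brief sketch. The function names are hypothetical, and the handling of tied scores at the boundaries between thirds is an assumption: here students are split purely by rank, which need not match how ties were treated in the actual classification.

```python
def strength_labels(scores):
    """Label each student 'strong', 'medium', or 'weak' by third of the class.

    scores: dict mapping a (hypothetical) student id to precourse FCI %.
    Assumption: ties at the boundaries are broken by rank order.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    labels = {}
    for rank, student in enumerate(ranked):
        if rank < n / 3:
            labels[student] = "strong"
        elif rank < 2 * n / 3:
            labels[student] = "medium"
        else:
            labels[student] = "weak"
    return labels


def team_type(member_labels):
    """Return 'strong', 'medium', 'weak', 'mixed', or None for one team."""
    kinds = set(member_labels)
    if len(kinds) == 1:
        return next(iter(kinds))  # every member has the same strength
    if kinds == {"strong", "medium", "weak"}:
        return "mixed"            # at least one student of each strength
    return None                   # e.g. one medium and three weak students


labels = strength_labels({"a": 90, "b": 80, "c": 70, "d": 60, "e": 50, "f": 40})
```

A team for which `team_type` returns `None` belongs to none of the four categories and so is excluded from comparisons between team types.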
In addition to the standard 30 questions on the FCI, on the precourse FCI we asked some further nongraded questions about the student's background, motivation for taking the course, and gender. The gender question, and the number and percentage of students in each category, was as follows:
What is your gender?
(A) male (405 students = 40%)
(B) female (603 students = 59%)
(C) neither of these are appropriate for me (9 students = 1%)
In our gender analysis, we ignored the nine students who chose C in the above question. Not all students answered this question, and those who did not therefore received no credit for taking the precourse FCI.

B. FCI
1045 students took the precourse FCI, which was almost all students in the course. 910 students took the postcourse FCI, again almost all students still enrolled in the course. The difference in these numbers is almost entirely because of students who dropped the course. In our analysis we only used FCI scores for "matched" students who took both the precourse and the postcourse FCI. This was 878 students. The 32 students who took the postcourse FCI but not the precourse FCI were late enrollees or missed the precourse for some other reason.
Figure 1(a) shows the precourse scores and Fig. 1(b) shows the postcourse scores for the matched students. The displayed uncertainties are the square root of the number of students in each bin of the histogram. Neither of these distributions is Gaussian, especially the postcourse one, so the mean is not an appropriate way of reporting the results. We will instead use the median of the scores. The uncertainty in the median is taken to be $\pm 1.58\,\mathrm{IQR}/\sqrt{N}$, where IQR is the interquartile range and N is the number of students [28]. This uncertainty is taken to indicate very roughly a 95% confidence interval, i.e., the equivalent of $2\sigma_m$ for normal distributions, where $\sigma_m = \sigma/\sqrt{N}$ is the "standard error of the mean" [29].
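As an illustration of this uncertainty estimate, the following minimal sketch computes a median and its uncertainty from a list of scores. The function name is hypothetical, and the quartile convention is an assumption: the Python standard library's default ("exclusive") method is used, and other conventions give slightly different interquartile ranges for small samples.

```python
import math
import statistics


def median_with_uncertainty(values):
    """Return (median, uncertainty) with uncertainty = 1.58 * IQR / sqrt(N)."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return statistics.median(values), 1.58 * iqr / math.sqrt(len(values))


# Hypothetical data: nine scores, 1 through 9.
med, unc = median_with_uncertainty(list(range(1, 10)))
```

For these nine values the quartiles are 2.5 and 7.5, so the interquartile range is 5 and the uncertainty in the median of 5 is 1.58 × 5/3.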
We used the gain on the FCI as one measure of the effectiveness of different types of teams. The standard measure of an individual student's gain, G, was introduced by Hake [16]. It is defined as the gain normalized by the maximum possible gain:
$$G = \frac{\text{postcourse\%} - \text{precourse\%}}{100 - \text{precourse\%}}. \tag{1}$$
Clearly, G cannot be calculated for precourse scores = 100. This applied to eight students in our course.
One hopes that the students' performance on the FCI is higher at the end of a course than at the beginning. The standard way of measuring the gain in FCI scores for a class or subset of students in a class is called the average normalized gain, to which we give the symbol $\langle g \rangle_{\mathrm{mean}}$, and was also defined by Hake [16]:
$$\langle g \rangle_{\mathrm{mean}} = \frac{\langle \text{postcourse\%} \rangle - \langle \text{precourse\%} \rangle}{100 - \langle \text{precourse\%} \rangle}, \tag{2}$$
where the angle brackets indicate means. However, as discussed, since the distribution of FCI scores is not Gaussian, the mean is not the most appropriate way of characterizing FCI results. We will instead report $\langle g \rangle_{\mathrm{median}}$, which is also defined by Eq. (2), except that the angle brackets on the right-hand side indicate the medians.
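The uncertainties in the median normalized gains reported are the propagated uncertainties in the precourse and postcourse FCI scores. Since both of these are uncertainties of precision, they should be combined in quadrature, i.e., the square root of the sum of the squares of the uncertainties in the precourse and postcourse scores. Therefore, from Eq. (2), for the median normalized gain:
$$\Delta(\langle g \rangle_{\mathrm{median}}) = \sqrt{\left[\frac{\Delta(\langle \text{postcourse\%} \rangle)}{100 - \langle \text{precourse\%} \rangle}\right]^2 + \left[\frac{(100 - \langle \text{postcourse\%} \rangle)\,\Delta(\langle \text{precourse\%} \rangle)}{(100 - \langle \text{precourse\%} \rangle)^2}\right]^2}, \tag{3}$$
where $\Delta(\langle \text{precourse\%} \rangle)$ and $\Delta(\langle \text{postcourse\%} \rangle)$ are the uncertainties in the medians of the precourse and postcourse FCI scores.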
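A minimal numerical sketch of these quantities follows. It assumes the standard Hake definition of the normalized gain and ordinary first-order propagation of independent uncertainties in quadrature; the function names and all input numbers are hypothetical and are not data from this study.

```python
import math


def individual_gain(pre, post):
    """Normalized gain G for one student; scores are in percent."""
    if pre >= 100:
        raise ValueError("G is undefined for a precourse score of 100")
    return (post - pre) / (100 - pre)


def class_gain(pre, d_pre, post, d_post):
    """Normalized gain of a group from its median (or mean) scores.

    pre, post: median precourse and postcourse scores in percent.
    d_pre, d_post: their uncertainties, propagated here in quadrature.
    """
    g = (post - pre) / (100 - pre)
    # Partial derivatives of g with respect to each input.
    dg_dpost = 1 / (100 - pre)
    dg_dpre = (post - 100) / (100 - pre) ** 2
    dg = math.hypot(dg_dpost * d_post, dg_dpre * d_pre)
    return g, dg


# Hypothetical medians: 50 ± 3 precourse and 75 ± 4 postcourse.
g, dg = class_gain(50, 3, 75, 4)

# A hypothetical strong student who scores 90 precourse and 60 postcourse
# has a large negative gain because the denominator is small.
g_drop = individual_gain(90, 60)
```

The second example shows why strong students can produce extreme negative values of G: a drop of 30 points from a precourse score of 90 gives G = −3.0.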

C. Final examination
968 students completed the final examination in the course. In our analysis of examination grades, we looked at only the 899 students who also completed the precourse FCI. The mean grade for these students was (47.63 ± 0.68)%, where the uncertainty is the "standard error of the mean." Although this mean was lower than we intended, the fact that it is close to 50% and also that the grades had a wide distribution (σ = 20%) means that the examination is close to perfect for discriminating between students [30]. We used final examination grades as another measure of the effectiveness of different types of teams.
If the mean grade on the final examination was usually 47%, as it was in 2014, this could have a dramatic effect on student attitudes towards the course, and that in turn could negatively impact the interpersonal dynamics of collaborative learning in the course. At the University of Toronto, grades of 60-69 are classified as "C" and grades of 70-79 are "B." Typically, the mean grade on the final examination in this course is between 65 and 70, which is consistent with other courses at the university. So 2014 was atypical, and its low examination mean could not retroactively affect the attitudes of students towards the course studied here.
Figure 2 shows a box plot of the final examination grades for different student strengths as determined by their precourse FCI scores. The "waist" on the box plot is the median, the "shoulder" is the upper quartile, and the "hip" is the lower quartile. The vertical lines extend to the largest value below, or the smallest value above, a heuristically defined outlier cutoff [31]. The dots represent data that are outside the cutoffs and are considered to be outliers. The "notch" around the median value represents the statistical uncertainty in the value of the median; notched box plots were first proposed in Ref. [28].
Because of the large overlap of the ranges of exam grades seen in the box plot, the Pearson correlation coefficient of precourse FCI scores and exam grades is only 0.62. Nonetheless, the box plot makes it clear that the grades are significantly correlated with the FCI scores. This gave us confidence that using the precourse FCI scores to classify students by ability is reasonable.
Because of the large overlap in the range of exam grades, using the precourse FCI as a placement tool for students is not appropriate. A similar conclusion, with more sophisticated analysis, was reached by Henderson in 2002 [32].

IV. RESULTS
In this section, we first discuss the normalized gains on the FCI, then discuss final examination grades, and finally present some data about the "sorted" and "spread" groups.
As mentioned, 690 students were in teams whose membership did not change throughout the term. Not all of these students fall into the various categories examined. For example, students in a team with two strong and two medium members are not in a strong, medium, weak, or mixed team. Similarly, those few students in a team with five members were not part of the sample comparing teams with three members to those with four members. Table IV includes the sample sizes for students who completed the final examination. Almost but not quite all students who completed the exam also completed both the precourse and postcourse FCI, and the sample sizes for the FCI data are all within ±5 students of the values in Table IV.

A. Gains on the FCI
As discussed, we defined a strong student as one whose precourse FCI score was in the upper third of the class, a medium student as one whose score was in the middle third, and a weak student as one whose score was in the bottom third. There were 273 strong students and for them $\langle g \rangle_{\mathrm{median}}$ = 0.500 ± 0.086; there were 339 medium students with $\langle g \rangle_{\mathrm{median}}$ = 0.467 ± 0.036; there were 266 weak students with $\langle g \rangle_{\mathrm{median}}$ = 0.409 ± 0.036. These values are equal within uncertainties.
Table II summarizes the median normalized gain for strong, medium, mixed, and weak teams. The different values of $\langle g \rangle_{\mathrm{median}}$ and the median of G for different types of teams are also all roughly equal within uncertainties. Recall that the stated uncertainties correspond to a 95% confidence interval; i.e., they are equivalent to twice the uncertainty given by the standard deviation for data that are normally distributed.
Also shown in Table II are the results for all students. For $\langle g \rangle_{\mathrm{median}}$ the value is for the 878 matched students, while for the median of G the value is for the 870 matched students who did not score 100% on the precourse FCI.
Examining the individual normalized gains G tells a similar story. Figure 3 is a box plot for the different team types. The vertical scale of the box plot has been chosen so that the ten values of G less than −0.95 are not displayed: these student outliers most likely put less effort into the postcourse FCI due to end-of-semester fatigue, or a cynical awareness that the participation points would be awarded regardless of their answers. These students were all in mixed or strong teams: there were no outliers for the weak or medium teams. The box plot shows that there are no significant differences in the values and distributions of G for the different types of teams, except perhaps for the outliers.
Table III shows the median normalized gains for strong, medium, and weak students in the 56 mixed teams.
Figure 4 is the box plot of the values of G for students in mixed teams. The vertical scale is chosen so that two strong students whose G values were −3.0 and −3.6, and are therefore outliers, are not shown.
Note that the students in Table II and Fig. 3, except for the mixed teams, are completely different from the students in Table III and Fig. 4. For example, all strong students in our sample were in either a strong team or a mixed team.
Although the gains in Table III are the same within uncertainties, the somewhat low value for strong students is due to the fact that, as can be seen in Fig. 4, some strong students seem to have put less effort into the postcourse FCI: we will later discuss this a bit more. The comparatively large uncertainty in the value is largely due to the large interquartile range.
We examined FCI gains for teams with three students and teams with four students. The values of $\langle g \rangle_{\mathrm{median}}$ were 0.57 ± 0.14 and 0.50 ± 0.05, respectively, which are the same within uncertainties. The box plot, which is not shown, also shows no significant differences in the distribution of values of G for the two groups.
For the 21 teams with one female student and the rest males, the median normalized gain for the female students was 0.35 ± 0.28. The large uncertainty in this value is due to the small number of female students in the sample, but the gains here are the same within uncertainties as all of the other categories we examined. Although it might be interesting to look at gender correlations with the type of team the female student was in, we lack the statistics for such a study to be possible.

B. Final examination grades
We have presented the overall mean final examination grade for the course: (47.63 ± 0.68)%. The 211 students in mixed teams had a mean exam grade of (46.8 ± 1.4)%.
Table IV summarizes the final examination grades for some other categories of students and teams.
Rows 1 and 2 of the table include all strong students in our sample. The mean on the final examination for strong students in strong teams minus the mean for strong students in mixed teams is (61.6 ± 1.8) − (64.2 ± 2.1) = −2.6 ± 2.8, which is zero within uncertainties: this is the value shown in the final column of the table. The later rows are similarly constructed. In all cases, the differences are zero within uncertainties.
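The quoted difference is consistent with combining the two uncertainties in quadrature, as the following short check illustrates. The helper name `difference` is hypothetical, and treating the two means as independent is the assumption behind the quadrature sum.

```python
import math


def difference(a, da, b, db):
    """Return (a - b, uncertainty) for independent values a ± da and b ± db."""
    return a - b, math.hypot(da, db)


# Strong students in strong teams versus strong students in mixed teams.
diff, unc = difference(61.6, 1.8, 64.2, 2.1)
print(f"{diff:.1f} ± {unc:.1f}")  # prints "-2.6 ± 2.8"
```

Because the magnitude of the difference is smaller than its uncertainty, the difference is zero within uncertainties.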
As discussed, there are some insignificant differences in the number of students N for various categories compared to the numbers given in the previous section. These are due to the fact that in the previous section the data are for students who took both the precourse and the postcourse FCI, while the final examination grades are for students who took the precourse FCI and completed the final examination, but not necessarily the postcourse FCI.

C. Sorted and spread groups
As discussed, the initial team assignments were done two ways: in one-half the groups we assigned students so that all members of each team had roughly the same precourse FCI scores, the "sorted" groups, and in the other half we distributed the students so that each team had a mixture of students with different precourse FCI scores, the "spread" groups. The values of $\langle g \rangle_{\mathrm{median}}$ are 0.536 ± 0.062 for the sorted groups and 0.467 ± 0.057 for the spread groups, which are the same within uncertainties.
At the end of the semester, 912 students filled out an anonymous paper-based evaluation during the practicals. We do not know which type of team the students who participated in the evaluation were in, but we do know whether they were in a sorted or a spread group.
Several questions on the evaluation asked about the TAs, but the first five questions asked specifically about student evaluations of the practicals themselves; these questions are shown in the Appendix. Note that for all five questions, a response of 5 is in general the most favorable. Figure 5 shows the distribution of the means of the five questions for the two types of groups. The displayed uncertainties are based on taking the uncertainty in the number of students in each bin to be equal to the square root of that number. For the 427 students in sorted groups who participated in the evaluations, the mean of the five questions is 3.779 ± 0.031, and for the 485 students in spread groups it is 3.609 ± 0.031; since from Fig. 5 the distributions are roughly normal, we have taken the uncertainty in the means to be the "standard error of the mean."
For data like the means of the five practical evaluation questions for the two types of groups, Student's t-test is well known for testing whether or not the two distributions are different [33]. It calculates the probability that the two distributions are statistically the same, the p value, which is sometimes referred to as just p. By convention, if the p value is <0.05, then the two distributions are considered to be different. For our evaluation data for sorted and spread groups, p = 0.000107, which is ≪0.05. However, there is a growing feeling that the p value is not enough for this type of analysis, and that effect sizes are a more appropriate way of comparing two distributions [34]. One such measure of effect size is the Cohen d [35]. It is defined as
$$d = \frac{\text{mean}_1 - \text{mean}_2}{\sigma_{\mathrm{pooled}}},$$
where
$$\sigma_{\mathrm{pooled}} = \sqrt{\frac{\sigma_1^2 + \sigma_2^2}{2}}$$
and $\sigma_1$ and $\sigma_2$ are the standard deviations of the two distributions. For our evaluation data, d = 0.257, which is heuristically characterized as "small." This characterization of the difference is consistent with what we see in Fig. 5. The 95% confidence interval range for d is 0.126-0.388; since this range does not include zero, the difference is statistically significant.
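The effect size can be roughly reproduced from the summary statistics quoted above, as in the following sketch. Several things here are assumptions rather than the analysis actually performed: the standard deviations are back-calculated from the rounded standard errors, the pooled standard deviation is taken as the root mean square of the two standard deviations, and the confidence interval uses a common large-sample approximation for the standard error of d. The function name `cohen_d` is hypothetical.

```python
import math


def cohen_d(mean1, sd1, n1, mean2, sd2, n2):
    """Return (d, (low, high)): Cohen's d and an approximate 95% interval."""
    # Assumed pooled standard deviation: root mean square of the two.
    pooled = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    d = (mean1 - mean2) / pooled
    # Large-sample approximation to the standard error of d (an assumption).
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d, (d - 1.96 * se, d + 1.96 * se)


n_sorted, n_spread = 427, 485
sem = 0.031  # quoted standard error of the mean for both types of group

# Recover the standard deviations from the standard errors: sd = sem * sqrt(n).
sd_sorted = sem * math.sqrt(n_sorted)
sd_spread = sem * math.sqrt(n_spread)

d, (low, high) = cohen_d(3.779, sd_sorted, n_sorted, 3.609, sd_spread, n_spread)

# Difference of the means in units of its combined standard error.
z = (3.779 - 3.609) / math.hypot(sem, sem)
```

Under these assumptions d comes out close to 0.26 with an interval of roughly 0.13 to 0.39, and the two means differ by a little under four combined standard errors; small discrepancies from the quoted values are expected because the inputs are rounded.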

V. DISCUSSION
As discussed, the literature shows clearly that pedagogy with an emphasis on students interacting with each other in small groups is effective in promoting conceptual understanding as measured by diagnostic instruments such as the FCI in a precourse and postcourse protocol. In this study, reformed pedagogy is used in both the classes and the practicals.
Courses that concentrate on conceptual understanding, as we do in our classes and practicals, have been shown by others [36,37] to have a small impact on the ability of students to solve conventional problems, although this depends on the type of problem. Since, as shown in Table I, conventional problems make up 64% of the grade on the final examination, any effect of different types of practical teams on exam grades would perhaps be small.
As mentioned in the Introduction, this study attempted to distinguish between two opposing strategies for prescribing the membership of small teams. One strategy is to spread students into mixed teams in order to maximize potential for Peer Instruction, providing the necessary range of scaffolding to keep weaker students in their most effective zone of proximal development. The other strategy is to sort students into more homogeneous teams, in order to improve the quality of peer interactions by ensuring that each student had teammates they could actually relate to. Our study produced a null result: the makeup of the teams had no measurable effect on student learning. The normalized gains on the FCI are the same within uncertainties for all types of students in all types of teams. We see indications that a few medium and especially strong students put less effort into the postcourse FCI: perhaps at the end of the term, when there are many course evaluations occurring in all their courses, they had survey fatigue, and since they did not receive any credit for answering the questions correctly, they did not take it seriously. Of course, this is not an issue for final examination grades. Strong, medium, and weak students as defined by their precourse FCI scores have different mean final examination grades, but those grades are the same for a given student strength regardless of what type of team the student was in.
It is possible to argue that our null results are specific to our particular course. For example, our classes use Peer Instruction and Interactive Lecture Demonstrations extensively: both of these techniques are proven to increase gains on instruments such as the FCI. In addition, we have the practicals, which are the main subject of this study. Perhaps the impact of the classes has lessened the sensitivity to effectiveness of different types of teams in the practicals. If so, a similar study to ours in a course whose classroom component is lecture based, or a course using a Studio Physics model, where the classes, laboratories, and tutorials are integrated, could show some differences in learning for different types of teams. We think this is unlikely, but PER has shown repeatedly that opinions that are not backed by data should be treated skeptically.
Another course-specific argument is that the course that is discussed here is intended primarily for students in the life sciences, and the median score on the precourse FCI was 50.0%. There is a separate course intended for physics majors and specialists; this course has about 150 students and the median precourse FCI score is typically about 73%. If there were only a single course with both life science and physics majors in it, the FCI scores used to define strong, medium, and weak students would change, and this would tend to increase the difference in ability between strong and weak students in a mixed team. Perhaps with a wider difference in ability, a difference in team effectiveness would be observed, although we think this too is unlikely.
Most teachers have probably noticed that the dynamics or "personality" of small groups of students can be very different for different groups, such as a small course in different years or different conventional tutorial groups for the same course. These variations seem to be greater than one might expect just from the statistics of small groups. Here we have found that student satisfaction with the practicals was higher for sorted groups by 4σ over spread groups, which is nonetheless characterized as a "small" difference using Cohen's d. In written comments on the evaluations, nine students in the spread groups complained about the dynamics of their team, while no students in the sorted groups made a similar complaint. Of course, correlations are not necessarily indicators of causation. And also, of course, the normalized gains on the FCI were the same within uncertainties for the two types of groups.
Lacking such data, we can nonetheless speculate on why there are small but statistically significant differences in student satisfaction depending on the makeup of the teams in the group. Perhaps in the spread groups the best students all ended up feeling a little alone in their teams, while the weaker students felt intimidated by the better students and were less willing to participate. In the sorted groups there may be more of a feeling that each student "belonged" with his or her partners. So the four best students in team 1 worked together to produce close to perfect work. Similarly, the four weakest students in team 9 worked hard as well, and there was always a student in each of the weaker teams who rose to the challenge to become the leader, asking the TA for help and keeping the team focused. Also, in our rooms teams 1 and 2 were directly across the aisle from teams 8 and 9, which created some interteam dialogue.
We tried defining student strength by performance on the first term test, although by this time 60% of the students had already had two 2-h practicals with the remainder having had three 2-h practicals. We attempted to examine the effectiveness of different types of teams using this definition by comparing student performance on the first term test to performance on the final examination, and also by the normalized gains on the FCI. Perhaps not surprisingly, neither attempt was successful: the data gave no meaningful information on team effectiveness, although they were not obviously inconsistent with the FCI data. We also attempted to increase the number of students being studied by relaxing the definition of types of teams and thereby reducing the uncertainties in our values: we defined a "strongish" team as one of strong students but with one medium student, and a "weakish" team as one of weak students but with one medium student. This attempt also yielded no meaningful data.
The students in the course studied here are mostly in the life sciences and only 16% of the students self-report that their main reason for taking the course is their own interest [38]. As mentioned, there is a separate freshman course for students intending to be physics majors or specialists, and they are, in general, much more motivated to learn the material. It would be interesting to do a study similar to this one with that course.

VI. CONCLUSIONS
There are no statistically significant differences in the normalized student gains on the FCI for students in strong, medium, weak, or mixed teams. Although final examination grades are different for different student strengths, there is no statistically significant correlation with the type of team the student was in. There are also no significant differences between students in a team of three compared to students in a team of four, as measured either by the normalized gains on the FCI or by the mean grades on the final examination. Also, the conclusion of previous studies that a team with a single female student should be avoided is not supported by our data. Although there is some indication that students in teams sorted by strength are more satisfied than students in mixed teams, the difference is small.
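For readers who wish to repeat the comparison with their own data, the following minimal Python sketch computes median normalized gains by team type. It assumes the standard definition of the normalized gain, (post − pre)/(100 − pre) with scores in percent; the function names and the records are hypothetical and are not our data.

```python
from collections import defaultdict
from statistics import median


def normalized_gain(pre, post):
    """Fraction of the possible improvement achieved; scores are in percent."""
    if pre >= 100:
        raise ValueError("gain is undefined for a perfect precourse score")
    return (post - pre) / (100 - pre)


def median_gain_by_team_type(records):
    """records: iterable of (team_type, precourse %, postcourse %) tuples."""
    gains = defaultdict(list)
    for team_type, pre, post in records:
        gains[team_type].append(normalized_gain(pre, post))
    return {team_type: median(g) for team_type, g in gains.items()}


# Hypothetical records, invented for illustration only.
records = [
    ("strong", 80, 90), ("strong", 70, 85),
    ("weak", 30, 65), ("weak", 40, 70),
    ("mixed", 50, 75),
]
print(median_gain_by_team_type(records))
```

In this invented example every student closes half of the gap to a perfect score, so the median gain is 0.5 for every team type even though the raw score changes differ. This is why the normalized gain, rather than the raw change, is used to compare students of different initial strength.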
Based on these results, instructors using team-based pedagogies should consider assigning the teams randomly, as it appears to be just as effective as sorted teams but requires significantly less effort to implement.
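The difference in effort between the two procedures can be seen in a minimal Python sketch. The student identifiers, scores, and function names are hypothetical; the sketch assumes one group of 36 students divided into nine teams of four, and it is not the procedure we used to construct our teams.

```python
import random


def random_teams(students, team_size=4, seed=None):
    """Shuffle the students and cut the shuffled list into teams."""
    rng = random.Random(seed)
    shuffled = list(students)
    rng.shuffle(shuffled)
    return [shuffled[i:i + team_size] for i in range(0, len(shuffled), team_size)]


def sorted_teams(scores, team_size=4):
    """Rank students by precourse score, highest first, and cut into teams."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [ranked[i:i + team_size] for i in range(0, len(ranked), team_size)]


# Hypothetical precourse scores (made up) for 36 students.
scores = {f"student_{i:02d}": (37 * i) % 101 for i in range(36)}

print(random_teams(scores, seed=1)[0])  # needs only the class list
print(sorted_teams(scores)[0])          # needs a precourse score for every student
```

Random assignment needs only the class list, whereas sorted assignment requires a precourse score for every student before the first session. If the class size is not a multiple of the team size, the last team in either function is smaller than the others.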
Although, as discussed, our data indicate that using precourse FCI scores as a way of classifying student strength is reasonable, if the University of Toronto used SAT scores it would be interesting to use them as a way of classifying students. Perhaps such an experiment would give information that ours did not. Also, as discussed, replicating our study in a course with traditional lectures plus interactive engagement tutorials, or a Studio Physics course, or in a course with both life science and physics majors could be worthwhile.
Finally, our null result seems to be contrary to the theory of zone of proximal development, which predicts that "spread" teams should have greater learning gains due to the increased effectiveness of scaffolding within teams. One possible way to investigate the amount and type of Peer Instruction occurring in different types of teams would be to use sociometric badges [39]. These devices, about the size of a small TV remote or a classroom clicker, are worn by members of a team and measure the amount of face-to-face interaction, body language and orientation, dynamics of conversation often without recording the actual words being said, and similar properties. The devices and analysis of the "big data" that they produce have successfully been used to measure the characteristics that lead to effective teams in a commercial context, where "effective" is defined as getting a piece of research done, or a product marketed, or similar such tasks. Using these devices and methodology to study effective teams for learning would be an interesting future project.

FIG. 2. Box plots of final examination grades for different student strengths as measured by their precourse FCI scores.

FIG. 3. Box plot of G for different types of teams.

FIG. 4. Box plot of G for different student strengths in mixed teams.

FIG. 5. Mean of the five evaluation questions about the practicals for the two types of groups. 1 is the least favorable, 5 the most favorable.

TABLE I. Types of questions on the final examination.

TABLE II. Median normalized gains for different team types and for all students.

TABLE III. Median normalized gains for different student strengths in mixed teams.

TABLE IV. Final examination grades for students in different types of teams.