Personality types and student performance in an introductory physics course

We measured the personality type of the students in a large introductory physics course of mostly life science students using the True Colors instrument. We found large correlations of personality type with performance on the Pre-Course Force Concept Inventory (FCI), both term tests, the Post-Course FCI, and the final examination. We also saw correlations with the normalized gain on the FCI. The personality profile of the students in this course is very different from the profile of the physics faculty and graduate students, and also very different from the profile of students taking the introductory physics course intended for physics majors and specialists.


I. INTRODUCTION
Classification of people into different personality types goes back to at least Hippocrates (460-370 BC). A more modern theory of personality types comes from C. G. Jung. Jung published the results of his almost 20 years of research in 1921 in Psychological Types [1]. Twenty years later Isabel Briggs Myers and her mother Katharine Cook Briggs slightly changed Jung's theory [2], and the resulting Myers-Briggs Type Indicator (MBTI) is now the most widely used psychological typing tool.
The MBTI is based on four "dichotomies": orientation (extraversion-introversion), cognitive perceiving function (sensing-intuition), cognitive judging function (thinking-feeling), and attitude of the functions (judgment-perceiving). Thus, there are $2^4 = 16$ different personality types in the MBTI taxonomy. Keirsey simplified the Myers-Briggs classification into four "temperament" groupings based on the dichotomies he considered to be most important [3]. He named the temperaments guardians, artisans, idealists, and rationals. Keirsey has many followers in the area of learning and teaching styles [4]. In 1979 Lowry introduced a metaphor for Keirsey's temperaments, using four colors in a system called True Colors [5]. There are many variations of the assessment instruments based on the Keirsey and the True Colors taxonomy. One version of the True Colors test instrument has correlated well with the MBTI [6]. Table I shows the Keirsey temperaments, their corresponding color metaphor, the Myers-Briggs classification, and a very brief summary of each type. Although many people have a dominant personality type, some have two or more types equally, and virtually nobody has a dominant type with no aspects of other types. In addition, some people with a dominant color have almost equal scores for one or more of the others. Below we will refer to the different types by their colors.
As discussed by Shen et al., the MBTI, Keirsey, or True Colors test instruments have been given to students in engineering, psychology, economics, pharmacy, dentistry, and more. For example, data on 3784 students in 8 engineering schools in the U.S. show that the dominant color was gold for 40.16% of the students, green for 33.79%, orange for 19.12%, and blue for 6.94% [7].
The True Colors instrument shown in the Appendix is given every year to 2nd year civil engineering students at McMaster University, Hamilton, Ontario. Initially we assigned a dominant color type based on the one with the largest score. For the 86 students in 2016 the results showed that the dominant colors were 31% green, 28% orange, 25% gold, and 16% blue.
The MBTI was given to 20 applied physics students at Central Queensland University, Queensland, Australia. Collapsing the 16 types into True Colors showed that the dominant color was green for 63% of the students, gold for 27%, blue for 10%, and orange for none [8].
However, the concept of personality type and its measurement can be overused and/or misused. There are troubling questions about the test instruments' statistical structure, reliability, robustness, and validity [9]. In addition, almost all people have a mixture of personality types, and focusing on just one dominant type can be highly misleading. Also, some misguided career counselors use the results to recommend that a person should choose a particular profession: any such attempt to put an individual into a particular "box" is oversimplified to the point of being wrong. In this study we attempt to restrict ourselves to using the personality type data to investigate if there are trends and correlations between personality type and performance by the students that can guide us to adjust the structure of our course and the types of pedagogy that we use in order to make it more effective for a larger number of our students.
There are some preliminary studies attempting to correlate brain activity with personality. For example, Koelsch, Skouras, and Jentschke used functional magnetic resonance imaging and other techniques, and found some small correlations of the data with the results of a questionnaire on personality type [10]. The questionnaire was the NEO instrument, based on a 5 factor model of personality [11]. They conclude by calling for improved questionnaires based on neurological data.
Here we have measured the personality type of the students in a 1000-student introductory physics course intended primarily for life science students, and correlated the personality type with performance on the Force Concept Inventory [12], both precourse and postcourse, and with the two term tests and the final examination. The course features interactive engagement forms of pedagogy throughout, and is described more fully elsewhere [13]. The fall 2016 session studied here uses the same structure and pedagogy as in the 2014 session described in Ref. [13], except that the textbook is Wolfson [14] instead of Knight [15].
We have also measured the personality type of the physics faculty, first-year physics majors, and physics graduate students. These results were compared to the results of the life science students.

II. METHODS
The MBTI test instrument consists of 93 questions, Keirsey has 71 questions, and NEO has two common versions having 90 and 240 questions. The length and resulting time necessary to answer all the questions make them unsuitable for use in our precourse assessment. Further, the MBTI and Keirsey instruments are both used by many major corporations for assessing their employees, and are available only for a fee. This also precludes them from use in our study. There are different versions of the True Colors test instrument, and some of them also require a fee for use.
For this study, we chose a True Colors instrument that is free and fairly short [16]. In its original format, it can be self-scored. The Appendix is a slightly modified version of that instrument. This version of the test instrument is referred to as the "Word Cluster" version.
In our precourse assessment, given during the first week of the term, we converted this instrument into multiple-choice format. The major changes from the original are that we have omitted 1 page of introductory material and another page at the end describing the characteristics of each color. In the table, we also changed "confrontational" to "confrontive." Each question asks the students how well one set of words describes them: after the question for Set A of Group I, there are 3 more similar questions about Sets B, C, and D of Group I. There are then four more sets of 4 questions each for the other four Groups of words. We scored "most" as 4 points, "a lot" 3 points, "somewhat" 2 points, and "least" 1 point, so the total number of points for all five groups should be $5 \times (1 + 2 + 3 + 4) = 50$. The result of the True Colors assessment is a numerical value for each of the 4 colors, ranging from 5, the lowest, to 20, the highest. As with all precourse assessments, the students are not given their results and are therefore not explicitly told about their measured personality type.
Each of the four colors of the True Colors assessment is measured five times, once for each of the five groups of word clusters. One measure of the reliability of the assessment instrument is to calculate Cronbach's α for the five measurements of each color [17]. Table II summarizes the results of the calculations. Values between 0.70 and 0.90 are heuristically described as "acceptable" and values much larger than 0.90 are often considered to be too good to be true [18]. This analysis supports the reliability of our modified multiple-choice version of the True Colors instrument.
Also included in the precourse assessment were the 14 questions from the first of two half-length FCI tests [19]. These two tests each take much less time to administer and are shown to be valid alternatives to the full FCI test instrument [20]. The postcourse assessment consisted only of the second half-length FCI test and was given to the students during the last week of the term.
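As an illustration of the reliability calculation, the following minimal sketch computes Cronbach's α from the standard textbook formula, treating the five word-cluster groups as five "items" for one color. The function name and the ratings are hypothetical, invented only for this example; they are not the course data, and this is not necessarily the software used for Table II.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for k items, each a list of scores for the same n respondents."""
    k = len(items)
    n = len(items[0])
    # Total score of each respondent across all k items.
    totals = [sum(item[i] for item in items) for i in range(n)]
    sum_item_var = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - sum_item_var / variance(totals))


# Hypothetical "blue" ratings (1-4) from the five groups for six students.
blue_items = [
    [1, 2, 4, 3, 2, 4],
    [1, 3, 4, 3, 1, 4],
    [2, 2, 3, 4, 1, 4],
    [1, 2, 4, 2, 2, 3],
    [2, 1, 4, 3, 1, 4],
]
alpha = cronbach_alpha(blue_items)
print(f"alpha = {alpha:.2f}")
```

Sample variances are used for both the items and the totals; using population variances throughout gives the same α because the normalization cancels in the ratio.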
Initially we assigned a color type for each student by choosing the color with the highest value. 978 students wrote the precourse assessment, which included the True Colors questions. This was almost all the students who were enrolled at that time. However, 12 students did not answer all the questions on personality types and are excluded from our analysis. For the 120 students who had 2 or more color types equally dominant, all combinations were represented. The two most prevalent ties were blue-gold (32 students of 120 = 27%) and gold-green (32 students of 120 = 27%); 2 students had all four types equally, and 11 students had three types equally dominant. Various weighting schemes were attempted to account for the ties and also for the values of the nondominant colors. Finally, we settled on an analysis based on the centroid of the color scores.
For example, if a student has scores of (blue, gold, green, orange) = (7, 15, 15, 13), we can plot the blue score in the first quadrant at $(x, y) = (7, 7)$, the gold score in the second quadrant at $(x, y) = (-15, 15)$, the green score in the third quadrant at $(x, y) = (-15, -15)$, and the orange score in the fourth quadrant at $(x, y) = (13, -13)$. Then we calculate the following centroid of the scores:
$$x_C = \frac{\text{blue} - \text{gold} - \text{green} + \text{orange}}{4}, \qquad y_C = \frac{\text{blue} + \text{gold} - \text{green} - \text{orange}}{4}.$$
For the example student, this gives $(x_C, y_C) = (-2.5, -1.5)$. Figure 1 illustrates this for the example student. The central red dot is the value of the centroid.
As shown in Table I, there is some justification for the color assignments to these specific quadrants. In the original Myers-Briggs classification, green and blue are both intuitive, but they are opposite in regards to thinking or feeling. Similarly, gold and orange are both sensing, but are opposite in regards to judging or perceiving. The arrangement of the quadrants proposed here allows us to examine the MBTI so-called FT dimension (feeling vs thinking) as well as the JP dimension (judging vs perceiving). These two dimensions lie on diagonal lines with slopes of 1 and -1, respectively, and pass through the origin. In Fig. 1, the student has a gold-green tie for highest value. What places the centroid into the green is not that the student had a high orange (13) but that the student had a low blue (7). Thus thinking is "beating" feeling (a green-blue difference of magnitude 8) by more than judging is beating perceiving (a gold-orange difference of magnitude 2).
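The centroid construction can be sketched in a few lines of Python. The function name is ours, chosen for illustration; the quadrant placement (blue first, gold second, green third, orange fourth) and the example scores are those described in the text, with the centroid taken as the mean of the four plotted points.

```python
def centroid(blue, gold, green, orange):
    """Centroid of the four color scores, each plotted on the diagonal of its quadrant."""
    points = [
        (blue, blue),        # first quadrant
        (-gold, gold),       # second quadrant
        (-green, -green),    # third quadrant
        (orange, -orange),   # fourth quadrant
    ]
    x_c = sum(x for x, _ in points) / 4
    y_c = sum(y for _, y in points) / 4
    return x_c, y_c


print(centroid(7, 15, 15, 13))  # (-2.5, -1.5)
```

The printed value reproduces the centroid quoted for the example student, which lies in the third (green) quadrant.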
We then assign a color type for each student based on their centroid values. Defining a cutoff value c, a blue student has $(x_C, y_C) = (>c, >c)$, a gold student has (<c, >c), a green student has (<c, <c), and an orange student has (>c, <c). Results such as grades on the first term test are surprisingly insensitive to the value of the cutoff c. For example, below we will show that blue students consistently exhibit the weakest performance on test and examination grades, and on FCI scores. Table III shows the mean value of the first term test for blue students as defined by various values of c.
Therefore, we chose c = 0.0 to assign a color type for each student.
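With the cutoff set to zero, the assignment reduces to reading off the quadrant of the centroid. The sketch below is one way to express that rule; the function name is hypothetical, and returning None for a centroid on an axis is our convention for students without a single color type.

```python
def color_from_centroid(x_c, y_c):
    """Color type for a centroid with cutoff c = 0; None if it lies on an axis."""
    if x_c == 0 or y_c == 0:
        return None  # on an axis (or at the origin): no single color type
    if x_c > 0:
        return "blue" if y_c > 0 else "orange"
    return "gold" if y_c > 0 else "green"


print(color_from_centroid(-2.5, -1.5))  # green
```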
There were a total of 108 students from our sample who had $x_C$ and/or $y_C = 0$, so did not have a single color type. Of these, 9 students had centroids at the origin. There were 37 students whose centroids were equally blue and gold, i.e., $(x_C = 0, y_C > 0)$. Similarly there were 29 students with equal gold-green centroids, 12 students with equal green-orange centroids, and 12 students with equal blue-orange centroids. It is fairly easy to show that if the color scores for 3 colors are equal with the fourth score different, the centroid will not lie on either axis, and similarly if 2 colors for opposite quadrants are equal with the other 2 colors different from each other the centroid will also not lie on either axis.
We did some analysis of results using three methods: the simple-minded assigning of color types by choosing the color score with the highest value, a one-dimensional weighting procedure to try to account for all 4 color scores [21], and the two-dimensional centroid method discussed here. All three showed qualitatively the same trends, but somewhat different quantitative values. We believe the centroid method is the most accurate of the three in assessing the impact of colors on student performance, and that is what is used in the remainder of this study.
The methodology described above was also applied to the results obtained by administering the True Colors instrument to the physics faculty, first-year physics majors, and physics graduate students.

A. Centroid calculations and color assignments
We gave the True Colors assessment to the physics faculty at the University of Toronto in 2016; 26 of 63 faculty (41%) responded to the anonymous survey. Just using the highest score, not the centroid, to assign a color, the dominant colors were 19 green, 1 gold, 1 blue, 1 orange, 3 green-gold ties, and 1 orange-gold tie. There is a separate first year course for our physics majors and specialists, and we also gave the True Colors assessment to those students. 29 of 208 students, 14%, responded to the anonymous survey. The dominant colors were 18 green, 6 gold, 2 orange, and 0 blue. There was 1 green-blue tie, 1 orange-blue tie, and 1 orange-green tie. We also gave the True Colors assessment to Toronto physics graduate students; 30 of about 200 students (15%) responded to the anonymous survey. The colors were 16 green, 3 gold, 4 blue, 3 orange, 1 green-gold tie, 1 orange-green tie, and 2 green-blue ties.
For the physics faculty, Fig. 2(a) shows the values of the centroids, and the open circle shows the mean value of the centroids. Figure 2(b) shows the centroids for the students in our 1st year course for physics majors and specialists, and Fig. 2(c) shows the centroids for our physics graduate students.
Figure 3 shows the centroids for all students in the 1000-student introductory physics course for life science students that we are studying here, and the mean value of centroids as the open circle. Also shown are histograms of the values of centroids.
Perhaps not surprisingly, the physics majors and specialists are much closer to the physics faculty than are the mostly life science students in the course of this study. The physics graduate students are similar to the physics faculty in that the mean centroid is also located in the green quadrant; however, it is shifted somewhat towards the orange quadrant.
Table IV shows the mean value of the centroids for the physics faculty, students in the course for physics majors and specialists, the physics graduate students, and for the students in the course being studied here. The stated uncertainties are the standard "error" of the mean $\sigma_m = \sigma/\sqrt{N}$ [22]. Table V shows the distribution of color types for the 1000-student course as determined by the value of the centroid. Students whose centroid fell on one of the axes did not have a single color type and were, therefore, not included.

B. Statistical tests and student performance
There were five assessments in the course: the precourse half FCI, the first term test, the second term test, the postcourse half FCI, and the final examination. We will examine these in order. Then the results of an ANOVA regression for all five assessments are discussed.

Precourse FCI scores
The precourse half FCI was given in the first week of the term. 978 students wrote the assessment. As discussed in, for example, Ref. [13], the distribution of FCI scores is not Gaussian, so the median is more appropriate than the mean to characterize the results. We report FCI scores in percent.
Table VI shows the median precourse FCI scores for all students and for students with defined color types. The uncertainties are $1.58 \times \mathrm{IQR}/\sqrt{N}$, where IQR is the interquartile range and N is the number of students in the sample [23]. This uncertainty is roughly taken to indicate a 95% confidence interval, i.e., it is equivalent to $2 \times \sigma_m$ for a normal distribution.
The highest median value is for green students and the lowest median value is for the blue students. This is a pattern that we will see for all the other assessments discussed below. For the precourse FCI scores, the difference between the orange and green students is not significant: $(57.1 \pm 4.8) - (50.0 \pm 5.4) = 7.1 \pm 7.2$. The gold-green difference is nonzero within uncertainties: $(57.1 \pm 4.8) - (50.0 \pm 2.8) = 7.1 \pm 5.6$. Further analysis of the various pairs of colors is in Sec. III.B.6 below.
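The quoted uncertainties on these differences are consistent with adding the two individual uncertainties in quadrature. The following sketch, with hypothetical function names, shows the median uncertainty formula and the quadrature combination under that assumption; the interquartile range is passed in directly because conventions for computing quartiles differ between software packages.

```python
import math


def median_uncertainty(iqr, n):
    """Uncertainty in a median: 1.58 * IQR / sqrt(N)."""
    return 1.58 * iqr / math.sqrt(n)


def difference(a, u_a, b, u_b):
    """Difference a - b with its uncertainty, assuming independent uncertainties."""
    return a - b, math.hypot(u_a, u_b)


# Green vs orange, then green vs gold, using the precourse medians quoted in the text.
for other_uncertainty in (5.4, 2.8):
    diff, u = difference(57.1, 4.8, 50.0, other_uncertainty)
    print(f"{diff:.1f} +/- {u:.1f}")
```

Rounded to one decimal place, the two results are 7.1 ± 7.2 and 7.1 ± 5.6, matching the values in the text.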
Remembering that the claimed uncertainty is roughly equivalent to $2 \times \sigma_m$ for a normal distribution, the difference between the green and blue students is about 6 "standard deviations" of its combined uncertainty; i.e., since (57.1 ± 4.8) − …
We used Cliff's δ to examine the effect size of the difference between the green and blue students. The Cliff δ for 2 samples is the probability that a value randomly selected from the first group is greater than a randomly selected value from the second group minus the probability that a randomly selected value from the first group is less than a randomly selected value from the second group. It is calculated as
$$\delta = \frac{\#(x_1 > x_2) - \#(x_1 < x_2)}{n_1 n_2},$$
where # indicates counting over all pairs of one value $x_1$ from the first sample and one value $x_2$ from the second, and $n_1$ and $n_2$ are the sizes of the two samples. The values of δ can range from −1, when all the values of the first sample are less than the values of the second, to +1, where all the values of the first sample are greater than the values of the second. A value of 0 indicates samples whose distributions completely overlap. Calculating the value of Cliff's δ for green and blue precourse FCI scores gave δ = 0.37, which is heuristically characterized as "medium." The 95% confidence interval range is 0.26-0.47; since this range does not include zero, the difference is statistically significant.
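The verbal definition of Cliff's δ translates directly into a count over all pairs of values. The sketch below is a brute-force version with a hypothetical function name; it is adequate for samples of a few hundred students but does not compute the confidence interval.

```python
def cliffs_delta(first, second):
    """Cliff's delta: P(first > second) - P(first < second) over all pairs."""
    greater = sum(1 for a in first for b in second if a > b)
    less = sum(1 for a in first for b in second if a < b)
    return (greater - less) / (len(first) * len(second))


print(cliffs_delta([3, 4, 5], [1, 2]))        # 1.0: every first value is larger
print(cliffs_delta([1, 2, 3], [1, 2, 3]))     # 0.0: identical samples
```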
The box plot is a particularly nice way of visually comparing distributions such as FCI scores and test grades. Figure 4 shows the box plot of the precourse FCI scores for the different personality types of the students. The "waist" on the box plot is the median, the "shoulder" is the upper quartile, and the "hip" is the lower quartile. The vertical lines extend to the largest (smallest) data point value less (greater) than a heuristically defined outlier cutoff [24]. The "notch" around the median value represents the statistical uncertainty in the value of the median.

First term test
The first term test was given early in the term, after 3 weeks of classes. 927 students wrote the test, which was …
Cliff's δ for the blue and green students is 0.32, which is heuristically characterized as "small." The 95% confidence interval is 0.19-0.43; since this does not include zero, the difference is statistically significant.
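For distributions which are Gaussian, such as test grades, an alternative to Cliff's δ is Cohen's d [26]. It is defined as
$$d = \frac{\bar{x}_1 - \bar{x}_2}{\sigma_{\text{pooled}}},$$
where $\bar{x}_1$ and $\bar{x}_2$ are the means of the two samples and $\sigma_{\text{pooled}}$ is their pooled standard deviation. Cohen's d is somewhat easier to interpret than Cliff's δ. Note that it uses the standard deviation, not the standard error of the mean.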
Comparing the blue and green student grades on the test gives d = 0.60, so the difference in the means is over one-half of the pooled standard deviation. This value is heuristically defined as a medium difference. The 95% confidence interval for d is 0.36-0.83; since this range does not include zero, the difference is statistically significant.
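A minimal sketch of Cohen's d follows. The pooled standard deviation here uses the common form weighted by degrees of freedom; that choice, the function name, and the grades in the example are assumptions made for illustration, since the text does not specify which pooling convention was used.

```python
import math
from statistics import mean, variance


def cohens_d(first, second):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n1, n2 = len(first), len(second)
    # Pooled variance weighted by degrees of freedom (an assumed convention).
    pooled_var = ((n1 - 1) * variance(first) + (n2 - 1) * variance(second)) / (n1 + n2 - 2)
    return (mean(first) - mean(second)) / math.sqrt(pooled_var)


# Hypothetical grades: means differ by 1, pooled standard deviation is 2.
print(cohens_d([2, 4, 6], [1, 3, 5]))  # 0.5
```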
Figure 5 shows the box plot of the test grades for the different personality types. The dots are data points that lie outside of the cutoffs, and are considered to be outliers.

Second term test
The second term test was given during the 9th week of classes. 716 students wrote the test, which was 80 min long. Once again, the overall mean on the test was lower than we intended: it was $51.77 \pm 0.81$. The same pattern we have seen for the precourse FCI and the first term test is true here: the green students outperformed the blue students. The difference between the green and blue students is about $6.5 \times \sigma_m$: $(60.4 \pm 1.8) - (42.9 \pm 2.0) = 17.5 \pm 2.7$. By a small amount, this difference is the largest of the three assessments we have examined so far.
The Cliff δ is also the largest. It is 0.47 (medium) with a 95% confidence interval of 0.33-0.59.
Cohen's d is also larger for this test than for the first one. It is 0.85 ("large") with a 95% confidence interval of 0.57-1.13.
The difference in test grades is confirmed by the box plot in Fig. 6.

Postcourse FCI and FCI gains
The postcourse half FCI was given during the last week of the term. 671 students wrote the assessment. Table IX summarizes the results.
Once again, the green students outperformed the blue students. The difference between the green and blue scores is $(85.7 \pm 5.3) - (57.1 \pm 7.1) = 28.6 \pm 8.9$. Calling Δm the difference in the median values, and u the uncertainty in Δm, the difference is about the same as observed for the precourse scores: $\Delta m/u = 28.6/8.9 \simeq 3.2$, which corresponds to about 6 standard deviations because u is roughly equivalent to $2 \times \sigma_m$. Cliff's δ for the blue and green scores is 0.37 (medium) with a 95% confidence interval of 0.22-0.50. These values are also comparable to the ones for the precourse.
One hopes that the students' performance on the FCI is higher at the end of the course than at the beginning. Comparing Table IX to Table VI, the postcourse scores are higher than the precourse ones for all categories of students. The box plot of postcourse scores, which is not shown, looks similar to that of the precourse ones, Fig. 4, except for the upward shift in values.
As in Ref. [13], we characterize the gains from the precourse to the postcourse by the median normalized gain:
$$G = \frac{\langle \text{Postcourse}\% \rangle - \langle \text{Precourse}\% \rangle}{100 - \langle \text{Precourse}\% \rangle},$$
where the angle brackets on the right-hand side indicate medians. We examined the gains for the 628 "matched" students who wrote both the precourse and the postcourse FCI. Table X summarizes the results. Once again, green students outperformed blue ones, with gold and orange students in the middle. The difference between the green and blue students is …
Figure 7 shows the box plot of G for different personality types. The vertical scale is chosen to not display the 34 students who either had a precourse score of 100 or a value of G < −0.5.
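The normalized gain can be sketched as follows, for scores in percent. The function names and example scores are hypothetical. Returning None for a precourse score of 100 is our convention, chosen because the gain is undefined when there is no room to improve.

```python
from statistics import median


def normalized_gain(pre, post):
    """Fraction of the possible improvement achieved; None if pre is already 100."""
    if pre >= 100:
        return None
    return (post - pre) / (100 - pre)


def median_normalized_gain(pre_scores, post_scores):
    """Normalized gain computed from the medians of matched pre and post scores."""
    return normalized_gain(median(pre_scores), median(post_scores))


print(median_normalized_gain([40, 50, 60], [70, 75, 80]))  # 0.5
```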
Cliff's δ for the values of G for blue and green students is 0.23, which is heuristically characterized as small. The 95% confidence interval is 0.09-0.38, so the difference is statistically significant.

Final examination
The final examination was 2 h long, and was written by 696 students. The overall mean grade was $64.4 \pm 0.7$; at the University of Toronto, this is a letter grade of C. Table XI shows the mean grades for different color types.
The same pattern that has been shown for all the other assessment instruments is evident here, with green students outperforming blue students. In this case the difference between the green and blue performance is $(70.0 \pm 1.6) - (58.4 \pm 1.8) = 11.6 \pm 2.4$, which is almost a 5 standard deviation difference.
Cliff's δ and Cohen's d for green and blue students are somewhat smaller than for the second term test, 0.36 and 0.64, respectively. Both of these are heuristically characterized as medium. The 95% confidence intervals are 0.21-0.48 and 0.36-0.91, respectively, so both statistics indicate a statistically significant difference.
The box plot, which is not shown, also shows no surprises.

ANOVA results
Above we used Cliff's δ and Cohen's d to compare performance for two of the four colors, blue and green. In addition, we did a one-way analysis of variance (ANOVA) of the means of all course assessments for all four colors. The results are summarized in Table XII. We found that the means of all course assessments had statistically significant differences when broken into groups of color types. Note that ANOVA assumes the values are normally distributed, which is not correct for FCI scores, so those values should be treated with particular caution.
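For readers unfamiliar with one-way ANOVA, the sketch below computes its F statistic from first principles: the ratio of the between-group mean square to the within-group mean square. The function name and the grades are hypothetical. Converting F to a p value requires the F distribution, which is not in the Python standard library, so that step is omitted here.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of groups of values."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical grades for three groups.
print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))  # 3.0
```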
From the results of the ANOVA, we used Tukey's honest significance test for a 95% confidence level to examine where those differences lie [27]. The results are summarized in Table XIII.
Because all assessments have a wide spread of values and there are only a small number of blue and orange students in our sample, it is difficult to interpret some of these values. Nonetheless there are some trends. With the exception of the FCI gain, the p values for the green-blue differences are all much less than the accepted threshold for statistical significance of 0.05. The orange-gold differences are not significant for any assessment.
Trying to draw further conclusions from the data is probably not appropriate without better statistics and a deeper analysis of the assessment instruments.

IV. DISCUSSION
Earlier, we provided some justification for how we assigned the colors to the four quadrants of an x-y plot. The data showing significant differences in student performance between blue and green students with the other colors in the middle provides another justification: surely these blue and green students should be in opposite quadrants of the plot. The fact that the assignments that we made contain a mnemonic (the color names are assigned to the quadrants in alphabetical order) is a coincidence. Other assignments, such as green-orange-blue-gold, should be equally valid so long as the green and blue scores are in opposite quadrants.
It should be made clear at the outset that just because we see correlations between color type and student performance does not mean we are suggesting a simple causal relationship. We are also not advocating for using color type to assess the suitability of a student for our course: a glance at, for example, the distribution of grades on the first term test in Fig. 5 makes it clear that there are high performing and low performing students for all color types. However, thinking about personality types provides a new perspective on our students and some of the difficulties they may have in doing well in our course.
For example, it is clear that there is an "impedance mismatch" between the strongly green physics faculty and graduate student TAs, and our students in the introductory physics for the life sciences course, who are mostly gold but with significant numbers of blue and orange color types. In order to accommodate the detail-oriented gold students, faculty should be sure to make expectations, deadlines, etc., extremely clear.
Similarly, to accommodate the blue students, we should emphasize the benefits of physics for the public in general and for health care in particular. It could also be useful for these students, who value intuition, to point out, as Livio wrote, "More than 20 percent of Einstein's original papers contain mistakes of some sort. In several cases, even though he made mistakes along the way, the final result is still correct. This is often a hallmark of great theorists: They are guided more by intuition than by formalism" [28].
To make the course more relevant to the orange students, it could be worthwhile to devote some time to making the risks of scientific inquiry clear by emphasizing that good scientists need the courage to be wrong. It could also be useful for these students to point out, as Gopnik et al. wrote, that "Science is a kind of institutionalized childhood" [29].
None of these recommendations are particularly revolutionary. However, putting these issues in the context of personality types may make them particularly compelling.
We need to beware of thinking statements such as "I/you/he/she am/are/is/is measured to have an orange personality type" are the same as "I/you/he/she am/are/is/is an orange personality type." As with all such psychological assessments, the result can be faked to one degree or another. For example, a person who is inherently a playful risk taker (orange) can consciously choose to answer the True Colors assessment questions to come out as a detail-oriented person (gold). Even without such a conscious decision, we all have a self-image, which perhaps we acquired from what we have been told by our parents, peers, or former teachers. In such a case we will unconsciously choose answers that conform to that self-image. And, of course, such self-images can be self-fulfilling prophecies. So a student who believes he or she is an intellectual idea person (green) will have the confidence necessary to do well in a physics course. This is one reason why we cautioned against interpreting the correlations we see between personality type and performance as indicating a simple causal relationship.
Steele and Aronson introduced the phrase "stereotype threat" in 1995 in the context of test performance of African-American students [30]. Since then the phrase has been applied to the gender issue in physics courses [31,32]. We are proposing that it is also appropriate in the case of a mismatch between a measured personality type and the ability to do well in a physics course.
A related perspective on the issue of color type is that physics is generally perceived by the public to be difficult and to require considerable raw intellectual talent. In terms of personality type, this is most similar to green. In 2015 Leslie, Cimpian, Meyer, and Freeland published a study of U.S. postsecondary institutions [33]. They looked at disciplines that are perceived as requiring different levels of intellectual talent. Those perceptions are negatively correlated with the percentage of female Ph.D. students in the disciplines: the greater the perception of required raw talent, the fewer females in the discipline. Evidence strongly suggests this is true not only in the STEM fields of science, technology, engineering, and mathematics, but also in the social sciences and humanities. A similar correlation was found in the percentage of African-American Ph.D. students, but not Asian-American Ph.D. students. Although there are no data on whether or not the perception that some disciplines require more raw talent than others is actually correct, they argue that in either case stereotype threat is a factor in participation rates. We think it is likely that physics faculty and graduate students in general are strongly green, as are the faculty and graduate students at the University of Toronto. It would be interesting to examine the color type of faculty and graduate students in other disciplines to see if the fields that are believed to require raw talent are also green compared to fields that are generally considered to be "easier."
In physics education research, the normalized gain on the FCI, $\langle g \rangle$, has played a crucial role for 25 years. It is widely taken to be a measure of the quality of instruction. Its value has been shown repeatedly to depend strongly on the type of pedagogy used, and therefore has played a leading role in the adoption of interactive engagement types of teaching. Although precourse and postcourse FCI scores have been shown to depend on a number of factors, the value of $\langle g \rangle$ turns out to be surprisingly insensitive to these factors. For example, previous results suggest the value of $\langle g \rangle$ does not depend on factors such as whether or not the student took a senior-level high school physics course or the student's motivation for taking our course [34]. Furthermore, a previous study suggests that values of $\langle g \rangle$ are consistent when comparing the normal 12-week term of the course studied here to the compressed 6-week version given in the summer [35]. In Ref. [13] we presented evidence that the value of $\langle g \rangle$ is statistically the same for teams of students with roughly equal strength compared to teams with a mixture of student strengths. Hoellwarth and Moelter showed that in a particular implementation of Studio Physics, $\langle g \rangle$ was independent of the instructor [36]. Wood, Galloway, and Hardy showed $\langle g \rangle$ was largely independent of whether or not the student is capable of suppressing an intuitive and spontaneous wrong answer in favor of a reflective and deliberative right one [37], a result that we have replicated [38]. Therefore, the fact that our results suggest a correlation between $\langle g \rangle$ and color type is particularly dramatic and troubling. Evidently our research-based pedagogy does not serve our blue students as well as it should.
When a performance gap is discovered for some factor, such as gender, socioeconomic background, Piagetian cognitive level or, here, personality type, one hopes to find ways to reduce it. Here we have used Cliff's δ and Cohen's d as one way to quantify the performance gap between blue and green students. Figure 8 illustrates this for the precourse FCI, the first term test, the second term test, the postcourse FCI, and the final examination in the course. This is the order in which the students completed them. The solid black line shows Cliff's δ and the dashed red line shows Cohen's d. Note that d is not calculated for FCI scores, since d assumes a normal distribution. It is clear that our current course does not reduce the gap.
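The two effect sizes can be illustrated with a minimal sketch, assuming the textbook definitions: Cliff's δ as the difference between the fraction of pairs in which one group scores higher and the fraction in which it scores lower, and Cohen's d as the difference in means divided by the pooled standard deviation. The function names and the sample scores are hypothetical; this is not necessarily the software or the variant used for the values reported here.

```python
import math


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs (x from xs, y from ys)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def cohens_d(xs, ys):
    """Cohen's d: difference in means over the pooled sample standard deviation."""
    nx, ny = len(xs), len(ys)
    mean_x, mean_y = sum(xs) / nx, sum(ys) / ny
    var_x = sum((x - mean_x) ** 2 for x in xs) / (nx - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (ny - 1)
    pooled_sd = math.sqrt(((nx - 1) * var_x + (ny - 1) * var_y) / (nx + ny - 2))
    return (mean_x - mean_y) / pooled_sd


# Hypothetical test grades in percent, for illustration only.
green = [70, 80, 90]
blue = [60, 70, 80]
delta = cliffs_delta(green, blue)  # 6 pairs higher, 1 lower, 2 tied out of 9
d = cohens_d(green, blue)          # means differ by 10, pooled sd is 10
```

Cliff's δ depends only on the ordering of the scores, which is why it can be used for the FCI data where the normality assumption behind d does not hold.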
The uncertainties in Fig. 8 need some explanation. Earlier, for each of the assessment instruments, we presented a value D for Cliff's δ or Cohen's d, and then the lower value L of the 95% confidence interval range and upper value U of the 95% confidence interval range. We can write this as D ± (D − L) = D ± (U − D). However, the uncertainties from the 95% confidence interval correspond to 2 × σ. In plots the displayed uncertainties are usually the standard deviation, not twice the standard deviation. Therefore, the displayed uncertainties in Fig. 8 are (D − L)/2 = (U − D)/2.
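The conversion from a 95% confidence interval to a displayed uncertainty can be written as a minimal sketch. The function name and the numerical values are hypothetical, and the sketch follows the approximation used in the text that the 95% interval corresponds to twice the standard deviation.

```python
def display_uncertainty(d, lower, upper):
    """Return the (minus, plus) error bars, each half of the 95% CI half-width."""
    if not lower <= d <= upper:
        raise ValueError("expected lower <= d <= upper")
    minus = (d - lower) / 2  # (D - L) / 2
    plus = (upper - d) / 2   # (U - D) / 2
    return minus, plus


# Hypothetical effect size with a symmetric 95% confidence interval.
minus, plus = display_uncertainty(0.5, 0.25, 0.75)
```

For a symmetric interval the two values are equal, so a single error bar of that size is drawn on each side of the point.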
Probably because the first term test was much too hard, the dropout rate for this session of the course, about 25%, was higher than usual. We attempted to correlate the color type with the dropouts, and did not see a large correlation. We also attempted to compare student learning teams composed of students with the same personality type to teams with a mix of personality types, but for a number of reasons this attempt failed.
There is a somewhat troubling issue with our data on personality types. For each of the five groups of four sets of words, the students are asked to choose which set is most like them, a lot like them, somewhat like them, or least like them; the example question shown in Sec. II is the question for set A of group I. There are then three more similar questions about sets B, C, and D of group I, followed by four more sets of four questions each for the other four groups of words. Since "most" is scored 4 points, "a lot" 3 points, "somewhat" 2 points, and "least" 1 point, the total number of points for all five groups should be 5 × (1 + 2 + 3 + 4) = 50. However, this was true for just under one-half of the students who did the precourse assessment (419 of 978); see Table XIV. For any assessment instrument, like this one, where students are given credit for answering all the questions regardless of what they answered, a disturbing issue is that some students will not take their answers seriously and will, for example, answer randomly or just choose A, B, C, or D in order or something similar. In Ref. [8] we showed some data for the postcourse FCI indicating that particularly the good students were not trying to give their best answers. A question inserted in the middle of the instrument can check that the students are at least reading all the questions, and it turns out that most students seem to take giving accurate answers fairly seriously. So, for the personality type questions, perhaps some students were somewhat confused or sloppy, or perhaps they decided that two word sets were equal in ranking. We have assumed that, despite these issues, the personality scores we measured reflect to some degree the personality types of the students. This assumption is supported by the results shown in Table II.
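The scoring and the consistency check on the total can be sketched as follows. The assignment of word sets to colors follows the scoring sheet of the instrument (orange: A, H, K, N, S; green: D, E, L, P, Q; blue: C, F, J, O, R; gold: B, G, I, M, T). The function names and the sample responses are hypothetical, and the sketch assumes that the 20 word sets are lettered A through T consecutively across the five groups.

```python
# Word sets belonging to each color, taken from the scoring sheet.
COLOR_SETS = {
    "orange": "AHKNS",
    "green": "DELPQ",
    "blue": "CFJOR",
    "gold": "BGIMT",
}
POINTS = {"most": 4, "a lot": 3, "somewhat": 2, "least": 1}
EXPECTED_TOTAL = 5 * (1 + 2 + 3 + 4)  # 50 points over the five groups


def color_scores(responses):
    """Sum the points for the five word sets belonging to each color."""
    return {
        color: sum(POINTS[responses[letter]] for letter in letters)
        for color, letters in COLOR_SETS.items()
    }


def total_is_valid(scores):
    """True when the four color scores add up to the expected 50 points."""
    return sum(scores.values()) == EXPECTED_TOTAL


# Hypothetical student who ranks every group in the same order.
ranking = ["most", "a lot", "somewhat", "least"]
responses = {letter: ranking[i % 4] for i, letter in enumerate("ABCDEFGHIJKLMNOPQRST")}
scores = color_scores(responses)

# A student who rates two sets in one group as "most" breaks the total.
tied = dict(responses, B="most")
tied_scores = color_scores(tied)
```

A total of 50 is necessary but not sufficient for a valid response, since offsetting mistakes in different groups can cancel; checking that each group uses each of the four rankings exactly once would be a stricter test.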

V. CONCLUSIONS AND FUTURE WORK
In the early days of the Royal Society of London in the 17th century, members regularly performed and reported on experimental measurements [40]. Many of these experiments were crucial in the development of the sciences of mechanics, the gas laws, optics, and more. However, some of those experiments in retrospect look silly. For example, Boyle investigated the difference in behavior of a butterfly, a bee, a hen-sparrow, and a mouse when placed in a partially evacuated chamber [41]. However, it is only in retrospect that those experiments seem silly: at the time people did not understand the issues and in this case oxygen had not even been discovered.
We are not making such grandiose claims for the experiments on personality type described here. However, like those early experimentalists, we are not sure just what we are measuring, or exactly how it relates to student learning and performance. Nonetheless, the correlations that we see between measured personality type and student performance make it obvious to us that assessments of personality type, however flawed, are measuring something relevant to physics education.
At this stage of our research, it is important to view our results primarily as observations. Some of us were initially skeptical about whether we would see any significant correlation between measured personality types and student performance, and have been very surprised by the size of the correlations that we have observed. An investigation into possible operational strategies for reducing the green-blue performance gap will form the second phase of our research into personality types. A few of our specific intentions are described below. Hopefully the strong correlations observed in this study will entice other researchers to investigate strategies for improved pedagogy based on an understanding of personality type.
We have shown that student performance on the precourse FCI, two term tests, and postcourse FCI, normalized gain on the FCI, and the final examination correlate with color type, with green students consistently outperforming the blue. For all but the normalized gain on the FCI, the difference in blue-green performance was 5σ or better; for the normalized gain it was somewhat less, at about 3σ.
We believe our observed correlations of personality with student performance are probably true in a much broader context than just students at the University of Toronto. There has been a study of Singapore university students that is similar to ours for two of their courses, one for a first year mechanics course with 110 students and the other for a second year quantum mechanics course with 80 students. Although their statistics are limited because of the small number of students, the results are consistent with ours [42]. A similar study performed on first year chemistry students at the University of Sydney used the Myers-Briggs Type Indicator and found a correlation between student performance and the Myers-Briggs FT dimension (feeling-thinking, i.e., blue-green), with students scoring high in "thinking" outperforming students who scored high in "feeling" [43].
Important questions that we have not addressed here involve what characteristics of these personality types are contributing to the performance gap we have observed, and how we can modify our pedagogy to address these differences. We intend to address these issues in at least three ways.
First, we will be forming two focus groups of students. One will be all blue students and the other all green students. We wish to probe the differences in the ways that these students interact with each other and the material of the course. An individual from outside the department will facilitate these focus groups.
Second, we intend to form the learning teams of 4 students two ways: one will be homogeneous in terms of measured personality type, and the other will be a mixture of different measured personality types. This is similar to our study of effective teams of Ref. [13], except there the teams were formed on the basis of the results of the precourse FCI, not the measured personality type.
Third, it may be that the Investigative Science Learning Environment (ISLE) provides a perspective on pedagogy that addresses the observed gap between the performance of blue and green students [44]. We will be explicitly modifying the activities we use for collaborative learning to incorporate the rubrics developed by ISLE. Our hope is that this will not only benefit all students, but will also reduce the blue-green performance gap.

FIG. 2. The centroids (the red dots) and the mean of the centroids (the open circle). (a) Physics faculty. (b) 1st year physics majors and specialists. (c) Physics grad students.

FIG. 3. The centroids of the students, the mean of the centroids (the open circle), and histograms of the centroid values.

FIG. 5. Box plots of grades on test 1 for different personality types.
FIG. 8. Comparing blue and green students. Cliff's δ (solid black) and Cohen's d (dashed red) for the precourse FCI (Pre), the first term test (T1), the second term test (T2), the postcourse FCI (Post), and the final examination (Final).

For example, referring to set A of Group I in the Appendix, one of the 20 questions in this format is Question 15: For set A in Group I: A. Set A is most like me. B. Set A is a lot like me. C. Set A is somewhat like me. D. Set A is least like me.

TABLE I. The four personality types.

TABLE III. Mean test 1 grades for blue students for different cutoff values.

TABLE IV. Mean centroid values.

TABLE VII. Mean test 1 grade for students with a dominant personality type.
Table VIII shows the test grades for different personality types. Students lacking a clearly defined color were excluded from the table.

TABLE VIII. Mean test 2 grade for students with a dominant personality type.

TABLE X. Median normalized FCI gains for matched students.
FIG. 7. The normalized gain for different personality types.

TABLE XI. Mean final exam grade for students with a dominant personality type.

TABLE XII. One-way analysis of variance (ANOVA).

TABLE XIII. p values for Tukey's honest significance test for pairs of colors.

TABLE XIV. Student total color points Fall 2016 PHY 131.

Total orange score: Sum of A, H, K, N, S __________
Total green score: Sum of D, E, L, P, Q __________
Total blue score: Sum of C, F, J, O, R __________
Total gold score: Sum of B, G, I, M, T __________
If any of the scores are less than 5 or greater than 20 you have made an error. Please go back and read the instructions.