Investigating students' seriousness during selected conceptual inventory surveys

Conceptual inventory surveys are routinely used in education research to identify student learning needs and assess instructional practices. Students might not fully engage with these instruments because of the low stakes attached to them. This paper explores three tests that can be used to estimate the percentage of students in a population who might not have taken such surveys seriously: the pattern recognition test, the easy questions test, and the uncommon answers test. These three tests are applied to sets of students who were assessed by the Force Concept Inventory, the Conceptual Survey of Electricity and Magnetism, or the Brief Electricity and Magnetism Assessment. The results of our investigation are compared to computer-simulated populations of random answers.


I. INTRODUCTION
Conceptual inventories (CIs) arose from the need to quantify students' understanding and their progress in class. CIs attempt to measure student comprehension of concepts by monitoring learning gains [1]. The physics education research that followed has driven modern teaching with a focus on developing novel methods to stimulate students' understanding, and has also redefined our learning goals [2]. Halloun and Hestenes raised the concern that traditional instruction only marginally affects students' understanding, while their common-sense beliefs usually contradict the laws of physics [3,4]. Following their work, the Force Concept Inventory (FCI) survey arrived as a first tool to measure students' mastery of the force concepts widely taught in the first semester of physics [5]. Since the FCI was introduced, CI surveys have gained widespread use in physics and astronomy education [6], as well as in many other STEM disciplines [7], including biology [8,9], chemistry [10,11], computer science [12,13], genetics [14,15], engineering [16][17][18], and mathematics [19,20].
As CIs became more useful to instructors, they started to be used as research-based assessment instruments (RBAIs) in education research [21]. RBAIs are carefully designed multiple-choice survey instruments intended to provide insight into students' attitudes and understanding. Over time, RBAIs have undergone several rounds of scrutiny and validation [22]. When RBAI data are collected regularly, they are valuable measuring tools because they provide standardized comparisons among institutions, instructors, and teaching methods over multiple implementations of the same course. They also allow us to track trends and investigate correlations over time [23,24]. The physics education research that has followed from the use of RBAIs has driven physics instructors toward developing and implementing novel methods for increasing students' understanding as well as toward redefining student learning goals [2]. PhysPort, an online resource for instructors interested in implementing research-based physics teaching practices in their classrooms, currently provides 92 RBAIs with diverse foci, including content knowledge, problem-solving, scientific reasoning, lab skills, beliefs and attitudes, and interactive teaching [25].
Among the RBAIs available on PhysPort are the Force Concept Inventory (FCI), the Brief Electricity and Magnetism Assessment (BEMA), and the Conceptual Survey of Electricity and Magnetism (CSEM). The FCI is a 30-question RBAI used to measure students' mastery of the mechanics concepts widely taught in a first-semester introductory physics course [5]. The FCI is among the most rigorously developed and reviewed RBAIs. In particular, the FCI has been investigated by Hestenes et al., who interviewed students and instructors to confirm that surveyed individuals correctly understood the wording and the pictographs [5,26], whereas Stewart et al. confirmed that test scores are not particularly context dependent [27]. Version H of the CSEM, published by Maloney, O'Kuma, Hieggelke, and Van Heuvelen in 2001 [28], is a 32-question RBAI used to measure student conceptual understanding of electricity and magnetism at an introductory undergraduate level. The BEMA is a 31-question RBAI also designed to assess conceptual understanding of electromagnetism.

A. Main concerns with RBAI
From the early days of RBAIs [29], researchers and instructors have raised concerns that students might not make a serious attempt to answer correctly the questions on a conceptual-inventory RBAI such as the FCI, CSEM, or BEMA [21,30-33]. In order for instructors and researchers to appropriately evaluate RBAI data, it is useful to know what proportion of students in a population are taking that RBAI seriously. We define serious students as those who chose answers with consideration, including educated and/or thoughtful guesses, throughout their entire assessment.
Stewart et al. [34,35] study the effect of guessing on both the FCI and CSEM tests. They show that linear models can be used to correct the pre-test and post-test results to account for guessing; however, the normalized conceptual gain allows for the comparison of different student populations and is shown to be invariant to linear transformation and therefore unaffected by guessing. In a different study, Yasuda et al. show that while question 5 scores on the FCI are marginally affected by erroneous reasoning, questions 6, 7, and 16 are more prone to guessing. These questions return a high percentage of false positives as students seem to reach the right answer while using erroneous conceptual reasoning [36,37].
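The invariance of the normalized gain can be verified directly. The following is a sketch under stated assumptions, not the derivation given by Stewart et al.: scores are expressed as fractions of the maximum score, the normalized gain takes its usual form, and the guessing correction is a linear map $s' = a s + b$ that sends a perfect score to a perfect score, so that $a + b = 1$. The symbols $a$, $b$, and $s$ are introduced here for illustration only.

```latex
\begin{align}
g  &= \frac{s_{\mathrm{post}} - s_{\mathrm{pre}}}{1 - s_{\mathrm{pre}}},
\qquad s' = a s + b, \quad a + b = 1, \quad a \neq 0, \\
g' &= \frac{(a s_{\mathrm{post}} + b) - (a s_{\mathrm{pre}} + b)}{1 - (a s_{\mathrm{pre}} + b)}
    = \frac{a\,(s_{\mathrm{post}} - s_{\mathrm{pre}})}{a\,(1 - s_{\mathrm{pre}})}
    = g .
\end{align}
```

The second equality uses $1 - b = a$. Under these assumptions the factor $a$ cancels, so the corrected and uncorrected scores give the same normalized gain.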
Wang et al. take the matter further by using item response theory [38] to build a three-parameter item response model, which they use to analyze student performance on FCI surveys [39]. They show that a student's proficiency is linearly correlated with that student's raw FCI score. Furthermore, the probability of answering correctly is modeled in terms of each question's discrimination potential, difficulty level, and guessing chance. For example, they show that low-proficiency students have less than a 5% chance of guessing the correct answer on questions 23 and 26, whereas question 16 yields about a 34% chance that low-proficiency students will guess correctly. On the other hand, the difficulty parameter predicts that questions 1 and 6 are the easiest, whereas questions 25 and 26 are the most difficult. As anticipated, each of the 30 questions in the FCI has a different guessing chance and difficulty level. This supports the hypothesis of our present work that, when students take the survey seriously, there is a better chance that they will select the correct answer for those questions [39].
Hake et al. considered that careful attention must be given to motivational factors to persuade students to take an RBAI seriously. Without much evidence at the time, the remark was made that surveyed students did take the [FCI] pre-test seriously [28]. Later work from Henderson, however, shows that about 2.8% of surveyed students may not take an RBAI seriously [29]. Henderson was concerned about whether students take the FCI seriously when it is not graded. To identify those students, answer patterns were examined for lack of seriousness from five different angles. By comparison, Pollock et al. ran a longitudinal study of students' conceptual understanding using the BEMA survey, and requested that students report how hard they tried. Three levels were identified: took it very seriously, took it seriously, and did not take it seriously. This study shows that over 50% of students took the RBAI very seriously, and only 3% indicated that they did not take it seriously [40].
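For reference, a three-parameter item response model is commonly written in the logistic form below. This is the standard textbook form; the exact parametrization used by Wang et al. (for example, whether a scaling constant multiplies the exponent) is not stated here and may differ. The symbols $\theta$, $a_i$, $b_i$, and $c_i$ are the conventional ones, not taken from the source.

```latex
P_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}}
```

Here $\theta$ is the student's proficiency, $a_i$ is the discrimination of question $i$, $b_i$ is its difficulty, and $c_i$ is its guessing parameter. As $\theta \to -\infty$, $P_i(\theta) \to c_i$, so $c_i$ is the probability that a very low-proficiency student answers question $i$ correctly. In this reading, the values quoted above correspond to $c_i < 0.05$ for questions 23 and 26 and $c_i \approx 0.34$ for question 16.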
We have developed a set of seriousness tests and applied them to the FCI, CSEM, and BEMA. It was our goal to develop seriousness tests that could give instructors and researchers an estimate for the proportion of students who did not take an RBAI seriously. Notably, it was not our goal to develop seriousness tests that could identify individual students, and we recommend that the seriousness tests described in this paper not be used in that manner. In subsequent sections we will describe how these seriousness tests were developed as well as those tests' effectiveness in accurately categorizing students as either taking an RBAI seriously or not.

II. DATA SOURCES FOR THE FCI, CSEM, AND BEMA
Data for this paper were obtained from PhysPort's collection of student data. After administering an RBAI, instructors can use the PhysPort Data Explorer to analyze the data from their students. Once the instructors have uploaded their students' responses, the data are stored in a database in PhysPort. We were able to use the data from this database to run our seriousness tests on both the pre- and post-test data for the FCI, CSEM, and BEMA. The database is larger than any data set that has been tested previously, with 64,076 assessment results for the FCI, 15,032 assessment results for the CSEM, and 8,708 assessment results for the BEMA. Table 1 presents the average and the standard deviation for each RBAI.
Along with the RBAI results from PhysPort, we created 20,000 simulated RBAI results each for the FCI, CSEM, and BEMA. Our simulated students guessed randomly on all questions. We generated this simulated data in order to model the responses we might expect from non-serious students. Because we could be certain that each simulated individual in the random data set was a random guesser, the seriousness tests needed to flag a significant fraction of this population in order to be considered successful. We did not expect our seriousness tests to identify every member of the simulated population as non-serious, however, because a seriousness test that achieves this would likely lead to misidentifying serious students as non-serious. It should also be noted that real students are almost never able to behave in a truly random manner on an RBAI, even when they are being non-serious. Their results might show tendencies toward certain answer choices, patterns on the answer sheet, or other trends. This means that students might exist who do not take an RBAI seriously, who also are not well-represented in the simulated population.
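A simulated population of this kind can be generated in a few lines. The sketch below is illustrative only and is not the code used for this paper: it assumes every question offers five choices labeled A to E and that each choice is equally likely, and the function name `simulate_random_responses` and the fixed seed are our own choices.

```python
import random

def simulate_random_responses(n_students, n_questions, n_choices=5, seed=0):
    """Return one answer string per simulated student.

    Assumption: every question has `n_choices` options and the simulated
    student picks each option with equal probability.
    """
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    letters = "ABCDEFGHIJ"[:n_choices]
    return [
        "".join(rng.choice(letters) for _ in range(n_questions))
        for _ in range(n_students)
    ]

# 20,000 simulated students on a 30-question survey such as the FCI.
simulated_fci = simulate_random_responses(20_000, 30)
print(len(simulated_fci), simulated_fci[0])
```

A survey whose questions have differing numbers of choices would need a per-question choice list instead of the single `n_choices` argument.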

III. THE SERIOUSNESS TESTS
We developed three seriousness tests that can be applied to FCI, CSEM, and BEMA responses in order to estimate the percent of students in a sample who did not take that RBAI seriously: the Pattern Recognition Test (PRT), the Uncommon Answers Test (UAT), and the Easy Questions Test (EQT). These seriousness tests are not designed, however, to identify individual students who did not take an RBAI seriously. In developing these tests, we made the assumption, based on the previous work from Henderson as well as from Pollock et al., that the majority of students take RBAIs seriously. As such, we expect the portion of the real population that a successful seriousness test identifies to be small.

The pattern recognition test
The Pattern Recognition Test (PRT) is based on the premise that students who do not take an RBAI seriously might instead leave certain patterns in their answers. Since computers are not good at picking up on these patterns, we defined the patterns ourselves, based on what we expected to find from non-serious test takers. The patterns that we searched for in the RBAIs were:
• more than 50% zeros or blank answers
• more than 50% one letter
When these patterns are present in a response, it is likely that the test taker was not taking the RBAI seriously for a significant portion of the test. The correct answers on none of the RBAIs evaluated follow any of these patterns. It should also be noted that non-serious students might sometimes produce response patterns outside of those listed above. We limited the patterns that we searched for, however, to avoid misidentifying serious students as non-serious.
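The two patterns listed above can be checked with a short function. This is a minimal sketch, not the implementation used for this paper: it assumes each response is a list of one-character strings and that blanks are stored as "0" or as an empty string, and the names `flagged_by_prt` and `BLANK_MARKS` are hypothetical.

```python
from collections import Counter

# Assumption: a blank answer is recorded as "0" or as an empty string.
BLANK_MARKS = {"", "0"}

def flagged_by_prt(answers, threshold=0.5):
    """Return True if more than `threshold` of the answers are blank,
    or if more than `threshold` of the answers are the same letter."""
    n_questions = len(answers)
    if n_questions == 0:
        return False
    cleaned = [a.strip().upper() for a in answers]
    n_blank = sum(1 for a in cleaned if a in BLANK_MARKS)
    if n_blank / n_questions > threshold:
        return True
    letter_counts = Counter(a for a in cleaned if a not in BLANK_MARKS)
    most_repeated = max(letter_counts.values(), default=0)
    return most_repeated / n_questions > threshold

mostly_a = ["A"] * 16 + list("BCDEBCDEBCDEBC")   # 16 of 30 answers are "A"
varied = list("ABCDE" * 6)                        # each letter appears 6 times
print(flagged_by_prt(mostly_a), flagged_by_prt(varied))
```

The comparison is strict, so a response in which exactly half of the answers share one letter is not flagged, in line with the "more than 50%" wording.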

The uncommon answers test
The Uncommon Answers Test (UAT) is based on the idea that students who do not take an RBAI seriously sometimes choose answers that were uncommonly chosen by the larger student population. There are nine questions on each of the RBAIs where two or three answer choices were preferred by most of the population. The common answers were most often the correct answer plus one or more of the incorrect answers. Evidently, these preferred choices are attractive to people who were reading carefully through all questions and were being thoughtful in their responses.
If a student chose an unpopular answer on several of these questions, it is likely that they were guessing rather than applying reasoning throughout the assessment. We identified uncommon answer choices based on how few students have picked those answers in the existing PhysPort data. Table 2 summarizes the questions and the less frequently chosen answers. We identified 9 questions with uncommon answers for each RBAI. For the FCI, fewer than 7% of the population chose one of the uncommon answers for each identified question. For the CSEM, fewer than 10% of students chose one of the uncommon answers for each identified question. For the BEMA, fewer than 6% of students chose one of the uncommon answers for each identified question. We counted survey takers who chose at least 4 uncommon answers for the FCI or CSEM, or at least 3 uncommon answers for the BEMA, as possibly non-serious.
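The counting rule described above can be sketched as follows. The thresholds (4 for the FCI and CSEM, 3 for the BEMA) come from the text, but the question numbers and letters in `PLACEHOLDER_UNCOMMON_KEY` are invented for illustration and are not the entries of Table 2. All function and variable names are hypothetical.

```python
# Thresholds stated in the text.
MIN_UNCOMMON = {"FCI": 4, "CSEM": 4, "BEMA": 3}

# PLACEHOLDER: invented question numbers and letters, NOT the real Table 2.
# Maps question number -> set of uncommonly chosen answers.
PLACEHOLDER_UNCOMMON_KEY = {
    2: {"E"}, 5: {"A"}, 9: {"D"}, 12: {"A", "E"}, 15: {"D"},
    18: {"C"}, 22: {"E"}, 24: {"B"}, 29: {"A"},
}

def count_uncommon(answers, uncommon_key):
    """Count the questions on which the student chose an uncommon answer.

    `answers` maps question number -> chosen letter.
    """
    return sum(1 for q, rare in uncommon_key.items() if answers.get(q) in rare)

def flagged_by_uat(answers, uncommon_key, rbai):
    """Return True if the response reaches the threshold for this RBAI."""
    return count_uncommon(answers, uncommon_key) >= MIN_UNCOMMON[rbai]

student = {2: "E", 5: "A", 9: "D", 12: "B", 15: "D"}  # 4 uncommon choices
print(count_uncommon(student, PLACEHOLDER_UNCOMMON_KEY),
      flagged_by_uat(student, PLACEHOLDER_UNCOMMON_KEY, "FCI"))
```

Unanswered questions are simply absent from `answers` and therefore never count as uncommon choices in this sketch.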

The easy questions test
The Easy Questions Test (EQT) was based on the idea that students who take a concept-inventory RBAI seriously will get most of the easier questions correct. A student making an effort on such an RBAI might still have one or two of even these questions incorrect, but they are unlikely to be incorrect for all the easy questions. It stands to reason that an answer set in which all the responses to the easy questions are incorrect is more likely to come from a student who did not take that assessment seriously.
We looked at the existing PhysPort data to determine which questions were easiest for students (Figure 1). For each RBAI, we chose the four questions with the highest scores, and calculated the percent of students who got a certain number of those questions correct. The students who answered all four easy questions incorrectly were considered as not having taken the assessment seriously. We note, however, that even a random guesser is likely to choose at least one correct answer in any set of five-choice questions. Overall, this means that the EQT will undercount the number of non-serious test-takers. Figure 2 shows the percent of students who answered all of the easy test questions incorrectly for an increasing number of easy questions. We can see that when we choose four or more easy questions, the percent of students who get all of the questions wrong stays relatively constant. For this reason, we chose the four easiest questions from each RBAI based on the proportion of correct responses to that question. The questions chosen for the EQT for each RBAI are shown in Table 3. For the FCI, an easy question has a score greater than 71%. For the CSEM, an easy question is one with a percent of correct responses above 58%. For the BEMA, there was not a clear division, and there was a large discrepancy between easy questions for the pre- and post-tests, so the questions were different and thus the percents were different. The easy questions for the pre- and post-tests of the BEMA had scores greater than 43% and 69%, respectively.
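The selection of easy questions and the all-incorrect rule can be sketched together. This is an illustrative outline under our own assumptions, not the code used for this paper: responses are dictionaries mapping question number to chosen letter, the toy answer key and data are invented, and the function names are hypothetical.

```python
def easiest_questions(responses, answer_key, n_easy=4):
    """Rank questions by fraction of correct responses; return the top n_easy."""
    fraction_correct = {
        q: sum(r.get(q) == correct for r in responses) / len(responses)
        for q, correct in answer_key.items()
    }
    return sorted(fraction_correct, key=fraction_correct.get, reverse=True)[:n_easy]

def flagged_by_eqt(answers, answer_key, easy_questions):
    """Return True if every easy question was answered incorrectly.

    A missing (blank) answer counts as incorrect in this sketch.
    """
    return all(answers.get(q) != answer_key[q] for q in easy_questions)

# Toy data, invented for illustration only.
toy_key = {1: "A", 2: "B", 3: "C"}
toy_responses = [
    {1: "A", 2: "B", 3: "C"},
    {1: "A", 2: "B", 3: "D"},
    {1: "A", 2: "C", 3: "D"},
    {1: "B", 2: "C", 3: "D"},
]
easy = easiest_questions(toy_responses, toy_key, n_easy=2)
flags = [flagged_by_eqt(r, toy_key, easy) for r in toy_responses]
print(easy, flags)
```

Repeating the last step for `n_easy` = 1, 2, 3, and so on gives the kind of curve described for Figure 2.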

IV. ANALYSIS AND DISCUSSION
Results of the PRT, UAT, and EQT are shown in Table 4 and in Figure 3. In Figure 3, each of the segments includes the percent of test takers caught by only that section. As an example, the percent of the actual population from the CSEM caught by the PRT is 1.3%, as shown in Table 4. This comes from combining 0.92% with each of the segments within the entire PRT circle. The percent of the actual population found to be nonserious by each of these tests is very small, ranging from less than 1% up to a few percent.
Applying the PRT to the actual population data identifies between 0.6% and 1.3% as non-serious for each of the different RBAIs. The PRT identified nearly zero non-serious survey-takers in the simulated population, however. This is unsurprising because the patterns searched for are non-random. Pattern recognition was thus excluded. By comparison, the UAT and EQT found a high proportion of non-serious responses in the simulated data (Figure 4 and Figure 5). For the FCI and CSEM, 52% and 68% of the respective simulated populations had four or more uncommon answers in their responses, and 48% of the BEMA simulated population had three or more uncommon answers. The EQT identified 41%, 41%, and 51% of the respective simulated populations for the FCI, CSEM, and BEMA as non-serious. In the actual populations, on the other hand, the UAT identified slightly more than 2% of students as non-serious for each RBAI, and the EQT identified between 3% and 4.6% of students as non-serious for each RBAI.
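The EQT rate for a population of random guessers can be checked by hand. The calculation below is a sketch that assumes each of the $n = 4$ easy questions offers $k = 5$ equally likely choices with exactly one correct answer; the symbols $n$ and $k$ are introduced here for illustration.

```latex
P(\text{all } n \text{ easy questions wrong})
  = \left(1 - \frac{1}{k}\right)^{n}
  = \left(\frac{4}{5}\right)^{4}
  = \frac{256}{625}
  \approx 0.41
```

This agrees with the 41% found for the simulated FCI and CSEM populations. The result depends on $k$, so it applies only where the five-choice assumption holds, which may not be the case for every BEMA question.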
Comparing the uncommon answers chosen for each data set and each RBAI in Figure 4, we see that the actual population chose fewer uncommon answers than the random simulation. Fewer than 50% of the real students selected any uncommon answers. Conversely, there was an average of three or four uncommon answers for the random population. Therefore, testing for the number of uncommon answers helped us differentiate serious students from random guessers.
1. Combining the results to determine the overall percent of non-serious students
The percent of the actual population identified by each of the three seriousness tests as non-serious is small, ranging from less than 1% up to a few percent. The center segments of Figure 3 show that very few real test takers were caught by all three tests for any of the RBAIs, with the FCI catching only 0.016% with all 3 tests.
FIG. 4. Distribution of the percentage of assessments that selected a number of uncommon answers for both the real and simulated populations for each RBAI. There were 9 questions for each RBAI where uncommon answers are rare. Nonserious test takers were those who chose at least 4 uncommon answers for the FCI and CSEM and at least 3 uncommon answers for the BEMA.
As a final comparison, we looked at the scores of any test taker in the actual population who was identified as non-serious by the PRT as well as either of the other two seriousness tests. When we look at the overlap of the PRT with either the EQT or the UAT, the percent of test takers not taking the assessment seriously was 1.5% to 1.6% of the population. The number of test takers in the actual population caught by the non-serious tests was 1,048 out of 63,896 assessments (1.6%) for the FCI, 228 out of 14,876 assessments (1.5%) for the CSEM, and 127 out of 8,642 assessments (1.5%) for the BEMA. These results were very similar, suggesting that our seriousness tests accurately determine the percent of students who did not take an RBAI seriously. Figure 6 shows graphs of the non-serious scores, identified in the combined manner described, vs. scores in the larger population. From this graph, we can see that the non-serious assessment scores are much lower than those from the actual population. This is further evidence that the scores identified by a combination of the PRT and either the UAT or the EQT were most likely truly not serious.
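The percentages quoted for the combined result follow directly from the reported counts. The short check below uses only the numbers given in the text; the variable names are ours.

```python
# (flagged assessments, total assessments) as reported in the text.
flagged_and_total = {
    "FCI": (1048, 63896),
    "CSEM": (228, 14876),
    "BEMA": (127, 8642),
}

# Percent of assessments flagged as non-serious for each RBAI.
percent_non_serious = {
    name: 100 * flagged / total
    for name, (flagged, total) in flagged_and_total.items()
}

for name, pct in percent_non_serious.items():
    print(f"{name}: {pct:.1f}%")
```

Rounded to one decimal place, these reproduce the 1.6%, 1.5%, and 1.5% stated above.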

V. CONCLUSION
Our results are in contrast to work mentioned in the introduction by Henderson, who found that about 2.8% of students did not take the FCI seriously [30], and to the results from Pollock et al., who found that 3% of students indicated that they did not take the BEMA seriously [40]. We find that fewer students were caught by our seriousness tests, and we conclude that the overall percentage of students who did not take the CIs seriously is only about 1.5% to 1.6%. Regardless, our results are in line with previous work that shows that the incidence of non-seriousness in RBAI results is very low.
In addition, our seriousness tests might undercount incidents of non-seriousness. We made deliberate choices to avoid misidentifying serious test-takers as non-serious, and those choices could have resulted in misidentifying some non-serious test-takers as serious. It is still likely, however, that the methods described will sometimes falsely identify a serious student as non-serious. Because of this, we do not recommend using any of these seriousness tests to identify individual students as serious or non-serious. We suggest that the three seriousness tests developed here could be used together as described to give reasonable estimates of the percent of non-seriousness for FCI, CSEM, and BEMA datasets, and that results similar to those just described indicate a low incidence of non-seriousness in a dataset. In addition, these seriousness tests might be applied to other concept-inventory RBAIs, although some details would need to be worked out for the UAT and EQT for each RBAI.