Effects of testing conditions on conceptual survey results

Pre-testing and post-testing is a commonly used method in Physics Education Research to assess student learning gains. It is well recognized in the community that timings and incentives in delivering conceptual tests can impact test results. However, it is difficult to control these variables across different studies. As a common practice, a pre-test is often administered either at or near the beginning of a course, while a post-test can be given either at or near the end of a course. Also, in conducting such tests there often is no norm as to whether incentives should be offered to students. Because these variations can significantly affect test results, it is important to study and document their impact. We analyzed five years of data that were collected at The Ohio State University from over 2100 students, who took both the pre-test and post-test of the Conceptual Survey of Electricity and Magnetism under various timings and incentives. We observed that the actual time frame for giving a test has a marked effect on the test results and that incentive granting also has a significant influence on test outcomes. These results suggest that one should carefully monitor and document the conditions under which tests are administered.


I. INTRODUCTION
Among various educational evaluation techniques, 1 pre-testing and post-testing is the most widely adopted method 2 in the physics education community. Because an increasing number of valid and reliable research-based conceptual tests have been developed in the physics domain since the Force Concept Inventory, 3 it is now fairly convenient for an instructor to choose a suitable test to measure student conceptual understanding before and after instruction and thus to gauge student learning gains in a particular class. 17,18 In conducting pre-tests and post-tests, one tacit default is that pre-tests are administered at or near the beginning of a course, and post-tests are given at or near the end of the course. 19 It is often expected that student performance will remain approximately the same over a few days at the beginning or end of the class. Additionally, whether or not students are granted incentives for completing these research-based tests varies from study to study. Within a single study, testing conditions, including timings and incentives, are usually well controlled. However, these conditions may change across different studies, making it difficult to compare results. In the existing literature, there is no research on whether and to what extent timings and incentives in delivering conceptual tests may impact test results. A good understanding of this issue is important to the Physics Education Research community, particularly as more researchers are starting to collaboratively address similar research questions in different settings.
Over the past five years, the Conceptual Survey of Electricity and Magnetism (CSEM) 20 has been administered as both the pre-test and the post-test in the calculus-based introductory electricity and magnetism (E&M) course at The Ohio State University (OSU). The administration of the CSEM took place at different timings and with various incentives. Continuous use of the CSEM at OSU has so far resulted in a collection of pre-test and post-test matched data from over 2100 students. Based on the analysis of these data, we report findings regarding the effects of test timing and incentives on student performance in the CSEM. The goal of this paper is to provide evidence documenting possible effects of test timings and incentives on student performance, so that researchers can take appropriate controls to address these issues in future studies. In the following, we first present relevant background on the introductory physics course offered and the student populations at OSU (Sec. II); then we report on the analysis of results yielded in different testing conditions. Specifically, we discuss pre-test results at three different timings (before any instruction, after a week of lectures, and after one lecture) and post-test results under four different incentives (no incentives, points for just taking the test, replacing a quiz if scoring high, and part of final examination) (Sec. III). Finally, we discuss implications of the results for future test administration (Sec. IV).

II. BACKGROUND
The calculus-based introductory E&M offered at OSU is the second quarter of the standard introductory physics course for science and engineering majors. Typically, students who attend the E&M classes are mostly freshmen or sophomores, and the majority of them have finished the first quarter of mechanics and met the prerequisite of scoring D or higher. 21 Materials covered in E&M are standard and include electrostatics, electric circuits, magnetism, and electromagnetic induction. Students meet three times a week for a 48 min lecture delivered by regular faculty in large lecture halls. Except for summer quarters, typically two different faculty members teach two parallel classes in each quarter.
Before the 2005 Fall quarter, lectures were given traditionally. In the 2005 Fall quarter, one of the authors (N.W.R.) began using electronic voting machines (also known as clickers) 22,23 in his lectures in combination with various interactive engagement pedagogies. Since then, the clickers have been continuously used in one (and only one) of the two parallel E&M classes of each quarter. For the period from which the data were extracted, students were also required to attend a 48 min recitation along with a separate 108 min laboratory session each week, both of which were taught by graduate teaching assistants.

III. TEST RESULTS AND ANALYSIS
In the past five years, data from over 2100 students were collected at OSU, and the majority of the data are pre-test and post-test matched. In this paper, we use matched data (N = 2198) for analysis to trace student performance under various testing conditions (timings and incentives). In the following, we discuss the test results in terms of two major time periods: the quarters from Fall 2003 to Fall 2005 and the subsequent quarters through Spring 2007.

A. Results of the comparison group (Fall 2003-Fall 2005)
From the 2003 Fall quarter through the 2005 Fall quarter, nine different instructors taught the calculus-based introductory E&M course at OSU. Two different textbooks 24 and two different homework delivery systems 25 were also adopted during this time period. These differences notwithstanding, the course materials covered in class were similar, the laboratories and recitations were essentially the same, and the training of teaching assistants also remained unchanged. In administering the CSEM, the pre-test was always given in the first laboratory of each quarter, which took place during the second school week. Students typically had attended three or more lectures before completing the pre-test. The CSEM post-test was always given in the last laboratory, which was usually conducted in the second-to-last week of a quarter (or, in a few cases, in the last week). No incentives were granted to students for taking either the pre-test or the post-test in any of these quarters. Because the test timings and the lack of incentives were the same for all these quarters, we combine these quarters and name them the "comparison" group. Results from these quarters set a baseline for comparison. We have excluded from our analysis one class of the 2005 Fall quarter, in which clicker questions were adopted in lecture. By so doing, we eliminated possible effects from this intervention.
Figure 1 shows the pre-test averages, post-test averages, absolute gains, and normalized gains (see Sec. III D for definitions) of the individual quarters in the comparison group. An ANOVA shows that there is no significant difference across these quarters in the pre-test scores [F(8, 1526) = 0.84, p = 0.5678], post-test scores [F(8, 1526) = 0.43, p = 0.9052], or gains [absolute gains: F(8, 1526) = 0.91, p = 0.5097; normalized gains: F(8, 1526) = 0.95, p = 0.4759], which supports combining these quarters into a single comparison group.
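As an illustration of the kind of test quoted above, the following minimal sketch computes a one-way ANOVA F statistic and its degrees of freedom. The function name and the scores are hypothetical and are not the OSU data; the p value is omitted because it requires the F distribution, which the standard library does not provide.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of scores around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical CSEM scores (out of 32) for three quarters; not the OSU data.
quarters = [
    [9, 10, 11],
    [10, 11, 12],
    [11, 12, 13],
]
f_stat, df1, df2 = one_way_anova_f(quarters)
print(f"F({df1}, {df2}) = {f_stat:.2f}")
```

Under this convention, the reported F(8, 1526) corresponds to nine groups (8 + 1) and 1535 students (1526 + 9).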

B. Pre-test results of the subsequent quarters (Winter 2006-Spring 2007) and comparisons with the comparison group
In subsequent quarters from Winter 2006 through Spring 2007, the CSEM pre-test was given under two different timings: on the first day of class (either in lecture or in recitation) or after one lecture. The 2006 Winter and 2007 Spring quarters belong to the former case, and the 2006 Spring quarter to the latter. As in the comparison group, no incentive was offered in any of these quarters. We combine the 2006 Winter and 2007 Spring quarters for analysis and label them the "no-instruction" group. (Here we did not exclude the data of the clicker classes, as no intervention had been introduced prior to the pre-test.) In the following, we discuss the pre-test results of the no-instruction group and the 2006 Spring quarter and compare these results with those of the comparison group. We show that pre-test results can be significantly affected by a week of lectures or sometimes even a single lecture. Table I shows the pre-test conditions and results.
Figure 2 displays the pre-test total scores of the comparison group and the no-instruction group. The comparison group outperformed the no-instruction group by 8%, which is equivalent to two and a half questions. A t-test suggests that the difference in pre-test score between the two groups is both significant (t = 13.63, p < 0.001) and sizable (effect size = 0.7). 26 This result indicates that if the CSEM pre-test is conducted a few days and lectures into a course, overall student performance is noticeably better than if the pre-test is conducted before any instruction takes place.
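The two statistics quoted here can be sketched as follows. This excerpt does not state which t-test variant or effect-size definition was used, so the sketch assumes a pooled-variance Student's t-test and Cohen's d; the function name and the scores are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_and_cohens_d(a, b):
    """Return (t, d) for two independent samples, using the pooled variance."""
    n_a, n_b = len(a), len(b)
    # Pooled sample variance, weighting each group by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    diff = mean(a) - mean(b)
    # Cohen's d: mean difference in units of the pooled standard deviation.
    d = diff / sqrt(pooled_var)
    # Student's t: mean difference in units of its standard error.
    t = diff / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, d

# Hypothetical pre-test scores (out of 32); not the OSU data.
later_pretest = [10, 11, 12, 13, 14]
first_day_pretest = [8, 9, 10, 11, 12]
t_stat, effect_size = pooled_t_and_cohens_d(later_pretest, first_day_pretest)
print(f"t = {t_stat:.2f}, effect size = {effect_size:.2f}")
```

The effect size does not grow with sample size, whereas t does, which is why a moderate effect size of 0.7 can accompany a large t value in a data set of this size.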
Our analysis further shows that the better performance of the comparison group in the CSEM pre-test comes mainly from their higher scores on the electricity questions (Q1-Q20). However, the comparison group did not outperform the no-instruction group on all the electricity questions. From Fig. 3, where the individual item scores are plotted, we find that the pre-test difference between the two groups lies mostly in the first nine questions. Note that these nine questions mainly deal with "electric charge and force," 27 which are exactly the topics discussed in the first several lectures of a quarter. For these questions, the average difference (Δ1) between the two groups is 20%, equivalent to two questions, which accounts for a large percentage of the difference detected in the total score. On the other hand, for the remaining electricity questions that address "electric field and force" or "electric potential and energy" 27 (topics not covered in the first week), the average difference (Δ2) is only 6%, equivalent to half a question. (One clarification worth making is that the curves in Fig. 3 are not intended to imply continuous data but rather to make the trend of item scores easier to see.)
The above results suggest that a week of lectures can have a significant effect on pre-test results. In fact, we find that sometimes even one lecture can markedly impact pre-test results, depending on what is covered in that lecture. In the 2006 Spring quarter, where the CSEM pre-test was administered in a recitation after the first lecture without incentives, one instructor (in the clicker class) unknowingly discussed several CSEM questions in his first lecture. As a result, the average pre-test score for that class turned out to be 10.8, similar to that of the comparison group (t = 1.16, p = 0.2442). Conversely, the other instructor (in the nonclicker class) spent nearly half of the class time addressing logistical issues and covered less material in the first lecture. Consequently, that class averaged only 9.0, noticeably lower than the comparison group (t = 6.53, p < 0.001). Clearly, depending on what is covered in one lecture, the impact of that lecture on pre-test results sometimes cannot be ignored.
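The item-level comparison above amounts to averaging per-item score differences over a block of questions and converting the result into an equivalent number of questions. A minimal sketch, using hypothetical fraction-correct values chosen only to reproduce the quoted 20% and 6% differences (the function name is also an assumption):

```python
def mean_difference(scores_a, scores_b, items):
    """Average per-item difference (a minus b) over 1-based item numbers."""
    return sum(scores_a[i - 1] - scores_b[i - 1] for i in items) / len(items)

# Hypothetical fraction-correct values for Q1-Q20; not the OSU item data.
comparison = [0.50] * 9 + [0.30] * 11
no_instruction = [0.30] * 9 + [0.24] * 11

delta_1 = mean_difference(comparison, no_instruction, range(1, 10))   # Q1-Q9
delta_2 = mean_difference(comparison, no_instruction, range(10, 21))  # Q10-Q20

# An average difference over n items corresponds to delta * n questions.
print(f"delta_1 = {delta_1:.0%}, about {delta_1 * 9:.1f} questions")
print(f"delta_2 = {delta_2:.0%}, about {delta_2 * 11:.1f} questions")
```

With nine items in the first block and eleven in the second, 20% corresponds to roughly two questions and 6% to roughly half a question, in line with the figures quoted in the text.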

C. Post-test results of the subsequent quarters and comparisons with the comparison group
Table II lists the post-test conditions (timings and incentives), with results for the comparison group and the individual quarters from 2006 to 2007. Since the post-test conditions were all different from 2006 to 2007, we discuss each quarter separately. In the analysis of the post-test results, we have retained the data only from the nonclicker classes to eliminate possible effects from the "clicker" intervention.
In the 2006 Winter quarter, the post-test was administered with the same timing as in the comparison group (in the last laboratory) but with an incentive. Students who completed the post-test received a small number of points regardless of how they performed on the test. Consequently, a large fraction of the class took the post-test, yielding pre-test and post-test matched data for nearly 90% of the class, higher than in the comparison group (average of 72%). However, the post-test average was only 15, slightly but not significantly lower than that of the comparison group (t = 0.71, p = 0.4771; see Fig. 4). Possibly, the kind of incentive offered in this quarter drew a larger fraction of the class to take the post-test, including lower-achieving students, which in turn yielded a slightly lower average than that of the comparison group.
In the 2006 Spring quarter, the post-test was administered during the last recitation, which took place several days after the last laboratory. Another type of incentive was offered: students were told that if they scored 90% or higher, they could substitute the CSEM score for their lowest quiz score. The participation rate in that quarter dropped significantly; the percentage of students taking both the pre-test and post-test was only 65%. However, the post-test average was noticeably higher than that of the comparison group (t = 4.58, p < 0.0001; effect size = 0.5). Using a scale that goes from 4 for grade A (excellent) to 0 for grade E (fail), we found that students who took both the pre-test and the post-test obtained an average of 2.63 in the course final grade, whereas those who missed at least one of these tests had an average of 1.86. The post-test incentive offered in the 2006 Spring quarter may have attracted mainly the more motivated and higher-achieving students to take the post-test, increasing the post-test score.
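The selection effect suggested here can be illustrated with a deliberately simplified sketch. It assumes, hypothetically, that only the strongest students in a class choose to take the test; real participation is not sorted this cleanly, and the scores and function name below are invented for illustration.

```python
from statistics import mean

# Hypothetical post-test scores (out of 32) for a ten-student class.
class_scores = [8, 10, 12, 13, 15, 16, 18, 20, 22, 26]

def participant_average(scores, fraction):
    """Average score if only the top `fraction` of the class takes the test."""
    ordered = sorted(scores, reverse=True)
    n_takers = max(1, round(len(ordered) * fraction))
    return mean(ordered[:n_takers])

for fraction in (1.0, 0.9, 0.6):
    avg = participant_average(class_scores, fraction)
    print(f"participation {fraction:.0%}: average {avg:.1f}")
```

In this toy class the measured average rises as participation falls, even though no student's score changes, which is why participation rates should be reported alongside test averages.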
In the 2007 Spring quarter, the CSEM post-test was incorporated into the final exam. Students were able to review course materials before taking the test and were motivated to answer the questions correctly. Our results show that the post-test average is 19.9, the highest among all the quarters, even with more than 90% of all students participating. Compared to the comparison group, the increase in the post-test score is both significant and large (t = 9.73, p < 0.0001; effect size = 0.9).
These analyses illustrate possible effects of testing timings and incentives on test outcomes. In particular, different incentives seem to attract different fractions of a class to complete the test, which may cause a noticeable fluctuation in the results.

D. Gains and normalized gains of the subsequent quarters and comparisons with the comparison group
In gauging the change of student performance after course instruction, absolute gain and normalized gain 28 are perhaps the most commonly used measures. In the following, we use both measures to demonstrate the effects of test timing and incentive on the CSEM test outcomes. Figure 5(a) displays the absolute gains of the comparison group and the subsequent quarters. By adjusting pre-test and post-test timings and/or incentives, we have observed absolute gains from as low as 12% (equivalent to 4 questions) up to 35% (equivalent to 11 questions). Normalized gains are given in Fig. 5(b). Normalized gains increased from 18.5% in the comparison group up to 48.2% in the 2007 Spring quarter.
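The definitions of the two measures did not survive in this section, so the following minimal sketch assumes the standard forms: absolute gain as post-test minus pre-test, and normalized gain as the absolute gain divided by the maximum possible gain. It works from class averages; whether the reported values were computed from class averages or by averaging individual student gains is not stated here. The function names and the input values are hypothetical.

```python
N_ITEMS = 32  # number of CSEM items, consistent with the conversions quoted in the text

def absolute_gain(pre_pct, post_pct):
    """Absolute gain in percentage points: post-test minus pre-test."""
    return post_pct - pre_pct

def normalized_gain(pre_pct, post_pct):
    """Absolute gain divided by the maximum possible gain (100 - pre)."""
    if pre_pct >= 100:
        raise ValueError("normalized gain is undefined for a perfect pre-test")
    return (post_pct - pre_pct) / (100 - pre_pct)

def as_questions(pct, n_items=N_ITEMS):
    """Express a percentage of the test as an equivalent number of questions."""
    return pct / 100 * n_items

# Hypothetical class averages in percent; not the OSU values.
pre, post = 30.0, 65.0
gain = absolute_gain(pre, post)
g = normalized_gain(pre, post)
print(f"absolute gain = {gain:.0f}% (about {as_questions(gain):.1f} questions)")
print(f"normalized gain = {g:.1%}")
```

Because the normalized gain divides by 100 minus the pre-test score, a pre-test score raised by early lectures both shrinks the numerator and shrinks the denominator, so the two measures respond differently to a shift in pre-test timing.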

IV. DISCUSSION AND IMPLICATIONS
Although one can administer a pre-test either at or near the beginning of a course, our results suggest that CSEM pre-test scores are sensitive to moving the test by even a few days and lectures. Similarly, when a post-test is administered also has a significant effect on test results. In addition, incentives can affect student performance; different incentives may attract different fractions of students to take the post-test, impacting test outcomes.
It follows that absolute or normalized gains may also vary greatly. In our analysis of the data that were collected in the past 12 quarters over five years, normalized gains for a traditionally taught course varied from 18.5% to 48.2%. Note that when pre-test and post-test conditions were maintained consistently in the comparison group, years of data showed a fairly stable normalized gain of 18.5% ± 0.6% (standard error). Table III summarizes the results.
Although it is widely accepted that different timings and incentives may impact test results, the extent of this impact is still largely unclear within the physics education community. To this end, we present the above results in the hope of alerting instructors and researchers to the potentially large effects of test timings and incentives on student performance and test outcomes. We encourage interested readers to further investigate how the analysis and results based on the CSEM data collected at OSU extrapolate to other institutions, student populations, and conceptual tests.

FIG. 1. The pre-test averages, post-test averages, absolute gains, and normalized gains for the individual quarters in the comparison group. (The error bars denote standard errors.)

FIG. 2. Pre-test total scores of the comparison group and the no-instruction group. (The percentages indicate average pre-test score percentages; the error bars denote standard errors.)

FIG. 4. Pre-test and post-test total scores of the comparison group and the subsequent quarters. Note that the pre-test scores of the subsequent quarters are similar, but the post-test scores are rather different under different test timings and incentives. (The error bars denote standard errors.)

FIG. 5. (a) Absolute gains of the comparison group and the subsequent quarters. Note that the pre-test scores for quarters from 2006 to 2007 are rather similar (see Fig. 4). (b) Normalized gains of the comparison group and the subsequent quarters. (The error bars denote standard errors.)

TABLE I. Pre-test conditions and results.

TABLE II. Post-test results and conditions.

TABLE III. A summary of the test conditions and test results.