Examining the effects of testwiseness in conceptual physics evaluations

Testwiseness is defined as the set of cognitive strategies a student uses to improve his or her score on a test regardless of the test's subject matter. Questions with elements that may be affected by testwiseness are common in physics assessments, even in those which have been extensively validated and widely used as evaluation tools in physics education research. The potential effects of several elements of testwiseness were analyzed for questions in the Force Concept Inventory (FCI) and Conceptual Survey of Electricity and Magnetism (CSEM) that contain distractors predicted to be influenced by testwiseness. This analysis was performed using data sets collected between Fall 2001 and Spring 2014 at one midwestern U.S. university (including over 9500 students) and between Spring 2011 and Spring 2015 at a second, eastern U.S. university (including over 2500 students). Student avoidance of "none of the above" or "zero" distractors was statistically significant. The effect of the position of a distractor on its likelihood of being selected was also significant. Several potential positive and negative testwiseness effects on student scores were also examined by developing two modified versions of the FCI designed to include additional elements related to testwiseness; these elements had little effect on student performance on the modified instruments post-instruction.


I. INTRODUCTION
Multiple-choice tests are a widely used means of evaluation and are very important in physics education research (PER) because of the extensive application of research-validated conceptual instruments. Multiple-choice instruments also suffer from numerous potential weaknesses inherent to multiple-choice items [1–4]. Some of these weaknesses may be exploited by students to improve their chances of selecting the correct answer regardless of their content knowledge [5]. Strategies for exploiting them include examining the length of the available options [6] and converging on the correct answer based on sets of similar options or analysis of other patterns of available options [3,7]. Testwiseness is the collective use of cognitive strategies to exploit weaknesses inherent to the format or characteristics of the test to achieve a higher score [5,8–10]. Testwiseness has been acknowledged in the literature for over sixty years as a potential factor affecting reliability [11].
These weaknesses have led to a considerable number of item writing rules [12,13] and several assessment-level rules [14]. These rules have been developed to aid in the construction of multiple-choice items and the structuring of multiple-choice assessments that minimize the effect of testwiseness. Despite the existence and dissemination of these rules through textbooks and articles, many items developed for and included in introductory-level texts violate one or more of these item writing rules [15]. Unfortunately, very little research confirming the validity and reliability of these item and test writing strategies exists [16]. Haladyna and Downing performed an extensive search for theoretical and empirical studies that supported their taxonomy of 43 multiple-choice item and assessment writing rules and found that for nearly half of the rules no supporting research could be identified [17]. One study suggested that, by understanding and exploiting the rules employed in structuring the answer key of the Scholastic Aptitude Test (SAT), students can increase their verbal SAT score by between 10 and 16 points, which is considerably more than the increase in score from participation in formal coaching programs [14,18].
Well-established item writing rules include key balancing, avoidance of related distractors, and avoidance of "none of the above" or "all of the above" distractors. Key balancing refers to the practice of selecting items that produce an instrument with an approximately uniform distribution of answer choices. Item location within the instrument can also be adjusted to avoid consecutive sequences of the same answer choice. Key balancing is important for two reasons: students expect a balanced key because of their experience with standardized testing, and student answering patterns may preferentially select distractors in certain locations. Students unsure of the correct answer may preferentially select the middle option (option "c" for a five-option item), "central bias"; the first option ("a"), "primacy"; or the last option ("e"), "recency." Students can also use strategies to limit the number of options from which they make a random selection by focusing either on the selection containing the most words or on pairs of selections that are parallel or opposite. Testwiseness has been shown to be important for the overall performance of the evaluation. Many standardized instruments try to produce a balanced key, that is, a set of multiple-choice options with correct answers randomly distributed across the available options [14].
The effect of the inclusion of a none of the above (NOTA) or an all of the above option has been extensively explored [19–21]. All of the above options are rare in the physics education research instruments examined; therefore, this paper will focus on NOTA options. Haladyna and Downing [17] found the use of a NOTA option to be the third most commonly investigated item-writing or test-writing rule among the 43 they identified. While some disagreement regarding the effect of NOTA on item validity exists [20], most sources argue that it should be avoided in the development of evaluation tools [13,17].
Another form of testwiseness involves students' bias towards selecting multiple-choice answers based on the option's position [22–25]. Attali and Bar-Hillel measured a predisposition to select the central answers both by instructors when generating multiple-choice questions and by students when guessing the answer to a question [26]. In contrast, other studies have found that primacy and recency appear to have a stronger effect on responses to multiple-choice items, resulting in the first and last alternatives being predominantly selected [27,28]. Further studies noted differences in the effect of position bias for various types of questions [29].
Another potential element of testwiseness involves commonalities that exist between sets of multiple-choice answers [5,25,30], which can occur when the correct answer is written first and the distractors are then written to share characteristics with the correct answer. This can result in a system of distractors that are similar or opposite to the correct answer [5,25,30].
Testwiseness strategies for improving scores on items where the student has little or no content knowledge are most effective when students understand the item and assessment writing rules discussed above. In most cases, students have no reason to believe that physics assessments conform to these rules; however, most students in today's physics classes have been exposed to years of standardized testing utilizing instruments that conform to good assessment construction practices. During informal discussions, many students reported awareness of features of physics evaluations that activate their testwiseness strategies. Some students report specific instruction in testwiseness strategies as part of their preparation for standardized examinations. Because of this conditioning, it seems possible that, if presented with a multiple-choice instrument that violated some of the common assessment construction rules, a student's pattern of responses might be altered, leading to changes in the distribution of item answers or overall evaluation scores.
Testwiseness-influenced changes in overall scores or item scores could have important implications for PER. Research in physics education often involves the use of multiple-choice conceptual instruments and compares pretest and post-test scores using a statistic, the normalized gain, that forms a composite of the two scores. Because testwiseness effects should be more prevalent on the pretest, they may influence the interpretation of the normalized gain. Testwiseness-affected distractors may affect item-level analyses and alter the interpretation of changes from pretest to post-test [1,17,20,26].
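To make the concern about the normalized gain concrete, the following minimal sketch uses the commonly cited definition g = (post − pre)/(100 − pre), which this text does not spell out and is assumed here. The class averages and the 2-point pretest inflation are hypothetical values chosen for illustration, not results from this study.

```python
def normalized_gain(pre_percent: float, post_percent: float) -> float:
    """Normalized gain: fraction of the possible improvement actually achieved.

    Assumes the common definition g = (post - pre) / (100 - pre),
    with both scores expressed as percentages.
    """
    if not 0.0 <= pre_percent < 100.0:
        raise ValueError("pretest score must be in [0, 100)")
    return (post_percent - pre_percent) / (100.0 - pre_percent)


# Hypothetical class averages (not data from this study).
true_pre, post = 40.0, 70.0
# Assumption: testwiseness inflates only the pretest, here by 2 points.
inflated_pre = true_pre + 2.0

g_true = normalized_gain(true_pre, post)          # gain without inflation
g_observed = normalized_gain(inflated_pre, post)  # gain computed from inflated pretest
print(f"g_true = {g_true:.3f}, g_observed = {g_observed:.3f}")
```

Under these assumptions, any pretest inflation lowers the computed gain even though the underlying learning is unchanged.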
Testwiseness plays the role of both a metacognitive strategy students employ for monitoring their performance during an examination and a cognitive strategy which is applied for specific items. Students report awareness of the overall key balancing of examination instruments, which takes the form of an awareness of highly unbalanced sets of responses (e.g., many more a's) or long sequences of the same responses. This awareness may raise unnecessary doubts in students that have consequences for the outcome of the examination.
This study complements and extends the existing literature on testwiseness and PER. With potential gains from utilizing testwiseness strategies rivaling the gains from other more traditional strategies, a more extensive examination of the effect of these strategies is warranted. The degree to which items in research-validated physics instruments may be vulnerable to testwiseness effects is investigated through a survey of widely used PER instruments. Testwiseness has been studied almost exclusively as a general effect common to all disciplines; little exploration of discipline-specific testwiseness has been conducted. The existence of testwiseness effects specific to instruments involving scientific reasoning is explored. Finally, the degree to which testwiseness influences the results of some of the most widely deployed PER instruments is examined.
This study addresses the following questions. (i) To what extent do the instruments used in physics education research conform to well-established item writing rules? (ii) Are there testwiseness effects that are specific to scientific assessment instruments? (iii) Does student application of testwiseness affect the outcome of the PER instruments at the item or instrument level?

A. Quantitative testwiseness effects
Testwiseness research has predominantly focused on performance on broad evaluations requiring little topical knowledge; as such, research into testwiseness effects specific to individual disciplines is rare. We seek to demonstrate the existence of testwiseness effects specific to quantitative disciplines. We will demonstrate that students show an aversion to selecting the distractor zero, "zero bias." An extensive review of the literature pertaining to testwiseness and item writing rules revealed only rules that were tangentially related to zero options (e.g., avoid specific determiners such as always, never, all, none) [15]. Zero distractors will be defined as "0" or zero options, as well as options which imply zero (e.g., "the object will not move" implies zero velocity). Zero bias will be analyzed along with the more extensively studied NOTA bias after an analysis of the extent to which these two effects are present in widely used PER evaluation instruments. NOTA distractors were most commonly identified as options which explicitly said none of the above or none of these (sometimes followed by additional explanation such as "The ball falls back to ground because of its natural tendency to rest on the surface of the earth" or "The elevator goes up because the cable is being shortened, not because an upward force is exerted on the elevator by the cable"). Options which similarly identified none of the other four options as correct were also included as NOTA distractors (such as "other", "not enough information is given to answer the question", or any statement including "cannot determine", sometimes followed by additional text such as "without knowing the forces −Q exerts on the two negative q's"). Options in which no force is exerted are also included as NOTA options, such as "none of the forces. Since the chair is at rest there are no forces acting upon it." The zero and NOTA options identified in the FCI and CSEM are shown in Table I.

B. Analysis of research-based assessments
To explore the degree to which item writing rules designed to reduce the effects of testwiseness are applied in the construction of instruments in PER, 12 introductory-level physics assessments were examined. Assessments were selected that spanned a variety of introductory physics topics including mechanics, electricity and magnetism, waves, and optics. The assessments include many of the most widely used instruments in PER. Each instrument was analyzed for several factors including the number of NOTA options, the number of zero options, and, for the assessments that included only five-option questions, the distribution of correct answers. Analysis of the distribution of correct answers for the remaining assessments was uninformative because each contained questions with varying numbers of answer choices.
Table II shows that all 12 assessments examined contain NOTA options with varying frequency. Across the 12 assessments analyzed, NOTA options were available in approximately one-third (33.6%) of all questions. All but one of the assessments examined contained at least one instance of a zero option. Zero options appear in slightly over one-third (34.5%) of all of the questions examined.
The distribution of correct answers for the five assessments that consisted of only five-option multiple-choice questions shows that many of these assessments appear to have been developed to have an approximately even distribution of correct answers. One notable exception was the FCI: option (b) was the correct answer for ten of its questions, while each of the other options was correct for only five questions on average.
These 12 assessments represent only a small sample of those available in physics education research but include some of the most widely used instruments in PER. Despite the popularity of these assessments, all ignore one or more commonly accepted item writing rules. The degree to which these features affect the student selection of incorrect answers and, therefore, the interpretation of the patterns of student answering is explored in the following sections.

C. Data sets
The analysis which follows utilizes four data sets collected at two universities: U1 and U2. Data sets 1 and 2 were both collected at a large midwestern U.S. land-grant university (U1) serving between 15 000 and 25 000 students. Data sets 3 and 4 were collected at a large eastern U.S. land-grant university (U2) serving approximately 30 000 students.
Data set 1 was collected from students in introductory calculus-based mechanics and electricity and magnetism classes that were administered the Force Concept Inventory (FCI) [31,43] and Conceptual Survey of Electricity and Magnetism (CSEM) [37], respectively. The data were collected from the Fall 2001 semester to the Spring 2014 semester. Each of these assessments was administered both as a pretest, prior to instruction, and as a post-test, after instruction, resulting in over 6000 responses to the FCI (6617 pre and 6241 post) over the course of 26 semesters and approximately 3000 responses to the CSEM (2992 pre and 3074 post) over the course of 19 semesters. The students received credit for a good-faith effort on the FCI pretests. The FCI post-test and the CSEM pre- and post-tests were graded for credit. For two semesters, Spring 2006 and Fall 2006, the students were asked to report for each question whether they were sure of their answer or were guessing. Additional analysis of this experiment was presented in Stewart and Stewart [44].
The FCI is composed of 30 questions which address 6 common Newtonian concepts. It is intended to force students to choose between Newtonian concepts and common sense alternatives [31]. The CSEM is composed of 32 questions which address 10 electricity and magnetism concepts as well as Newton's third law and how it applies to electricity and magnetism problems [37]. Both tools are designed to be administered as both a pretest and post-test to measure conceptual learning gains.
Data set 2 was collected from students in the calculus-based introductory electricity and magnetism class at U1 over the course of 10 semesters from the Fall 2007 semester to Spring 2012. This data set contains all multiple-choice questions given in the class including homework, lecture quizzes, laboratory quizzes, and in-semester examinations. These questions were developed by the teaching staff for use in the course and did not consist of questions from the FCI, CSEM, or any other standardized assessment. A total of 1851 students were included in this data set with 243 084 total responses to multiple-choice questions recorded. For the purposes of this study, only five-option multiple-choice questions were analyzed. These questions were a mix of qualitative (30%) and quantitative (70%) questions. All questions were given for credit post-instruction.
Data set 3 was collected in the introductory calculus-based electricity and magnetism course at U2. The CSEM was administered as a pretest and post-test from Spring 2011 to Spring 2015 to roughly 2000 students (2278 pre and 1753 post). Students received credit for a good-faith effort on both the pretest and post-test.
Data set 4 was collected in the introductory calculus-based mechanics and electricity and magnetism classes at U2 during the Spring 2015 semester. In both classes, each student was administered one of three versions of the FCI as a post-test at the end of the class; two of the versions were modified in an attempt to elicit testwiseness effects. The modifications are described in Sec. III. Students received credit for a good-faith effort on these examinations. A total of 475 students completed the assessment with approximately 160 students completing each instrument.

A. NOTA and zero bias
The analysis of the effects of NOTA and zero options began by examining two assessment instruments designed to evaluate student understanding of Newtonian mechanics (FCI) and electricity and magnetism (CSEM). While there were instances of these testwiseness-affected options being the correct answer, our analysis will focus on the questions for which all testwiseness-affected options were distractors. This will allow comparison of the strength of these testwiseness-affected distractors with the strength of other distractors. Any problems that contained both zero and NOTA distractors were also excluded, resulting in the removal of two questions from the CSEM analysis. After these removals, 4 NOTA and 4 zero distractors remained in the FCI and 10 NOTA and 13 zero distractors remained in the CSEM. For five-option multiple-choice questions, such as those used in both the FCI and CSEM, a student who answers incorrectly and chooses at random among the four available distractors selects each with a probability of 25%. Any deviation in the rate of selection of these distractors from that of a uniform distribution, which should occur if all distractors are equal in strength, should be indicative of the distractor's relative strength.
The distribution of incorrect student responses for questions in data sets 1 and 3 containing either NOTA or zero distractors is presented in Table III. Students selected the NOTA distractors in the FCI and the CSEM at a statistically significantly lower rate than the other distractors. While the selection of the NOTA distractor was highest for the CSEM post-test, it was still less than the 25% selection rate that is expected from random chance. Zero distractors in the FCI were also selected at a very low rate, less than 5%, for both the pretest and post-test. For the CSEM, these rates were considerably closer to random selection for both the pretest and post-test at both U1 and U2. The CSEM pretest zero distractor selection rates were substantially less than 25% at both institutions. The CSEM post-test selection rate for U2 was also substantially less than 25%, but the post-test rate for U1 was 23%. This result was still statistically significantly different from 25% [$\chi^2(1, N = 13\,422) = 29.51$, $p < 0.001$] but represents a small effect. The class instructor for the U1 course in which the CSEM was administered reported explicitly confronting zero bias in his lectures, thus potentially affecting the outcomes for the CSEM post-test.
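The comparison of a distractor's selection rate with the 25% chance rate can be sketched as a two-category goodness-of-fit test with one degree of freedom (testwiseness-affected distractor versus all other distractors). This is an illustrative reconstruction that assumes this two-category form; the function name and the counts below are hypothetical and are not the study's data.

```python
import math


def chi2_vs_chance(n_target: int, n_other: int, p_chance: float = 0.25):
    """Chi-squared goodness-of-fit test with 1 degree of freedom.

    n_target: incorrect responses that chose the distractor of interest.
    n_other:  incorrect responses that chose any other distractor.
    p_chance: selection rate expected if all distractors were equally strong.
    Returns (chi-squared statistic, p-value).
    """
    n = n_target + n_other
    exp_target = p_chance * n
    exp_other = (1.0 - p_chance) * n
    chi2 = ((n_target - exp_target) ** 2 / exp_target
            + (n_other - exp_other) ** 2 / exp_other)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: 230 of 1000 incorrect responses chose the zero distractor.
chi2, p = chi2_vs_chance(230, 770)
print(f"chi2(1, N = 1000) = {chi2:.2f}, p = {p:.3f}")
```

With a very large N, as in the U1 post-test, even a 2-point departure from 25% can reach significance, which is why the text distinguishes statistical significance from effect size.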
These results support previous work showing that NOTA options are weak distractors. Further, zero options are identified as weak distractors, demonstrating the existence of testwiseness effects specific to quantitative disciplines. Student bias against selecting either NOTA or zero options should be strongly considered when developing an assessment. The inclusion of either of these options as a distractor could result in students randomly selecting the correct answer more often than intended. While this cannot be demonstrated from the above analysis, testwiseness may also partially suppress the selection of the correct answer when the correct answer is NOTA or zero.

B. The effect of testwiseness on overall scores
Testwiseness may influence item scores by making it either more likely that a student randomly selects the correct answer when the testwiseness option is incorrect or less likely that the student selects the correct answer when it is testwiseness affected. Using the values in Table III as well as the distribution of questions in the FCI and CSEM with zero and NOTA as either a correct or incorrect answer, an estimate of the cumulative effect of zero and NOTA answers can be calculated for the average student. To calculate the estimate, we make the following assumptions: (i) student avoidance of a testwiseness-affected distractor will increase the probability of selecting each of the other available options equally, which will increase the likelihood of selecting the correct answer, and (ii) student avoidance of a testwiseness-affected option that is correct will produce a similar effect, decreasing the likelihood of selecting the correct answer. Applying these assumptions, the net effect on the FCI was an increase in score of 0.58% from NOTA distractors and of 0.55% from zero distractors. The calculated change in the CSEM was 0.20% from NOTA options and 0.05% from zero options. These small net effects for the FCI and CSEM indicate that for these instruments overall scores are not substantially affected by testwiseness; however, the effect on the interpretation of item-level results could be substantial.
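One way to see how assumptions (i) and (ii) translate into a score shift is the toy model below. It is not the authors' calculation, whose details are not given here: it treats a pure guesser on a five-option item, and the selection probability `q_affected`, the fraction of guessing students `guess_fraction`, and the item counts are all hypothetical inputs introduced for illustration.

```python
def item_shift(q_affected: float, affected_is_correct: bool,
               n_options: int = 5) -> float:
    """Change in a guesser's chance of answering one item correctly,
    relative to uniform guessing over n_options.

    q_affected is the probability that the guesser picks the
    testwiseness-affected option; the remaining probability is assumed to
    be shared equally by the other options.
    """
    uniform = 1.0 / n_options
    if affected_is_correct:
        # Assumption (ii): avoiding a correct option lowers the score.
        return q_affected - uniform
    # Assumption (i): avoiding a distractor raises every other option equally.
    return (1.0 - q_affected) / (n_options - 1) - uniform


def net_score_shift(n_items: int, n_affected_distractor: int,
                    n_affected_correct: int, q_affected: float,
                    guess_fraction: float) -> float:
    """Net change in instrument score, in percentage points."""
    total = (n_affected_distractor * item_shift(q_affected, False)
             + n_affected_correct * item_shift(q_affected, True))
    return 100.0 * guess_fraction * total / n_items


# Hypothetical inputs: 30 items, 4 with an affected distractor, none with an
# affected correct answer, affected option picked 5% of the time by guessers,
# and 30% of responses being guesses.
shift = net_score_shift(30, 4, 0, q_affected=0.05, guess_fraction=0.3)
print(f"net shift = {shift:.2f} percentage points")
```

Even with strong avoidance of the affected option, the shift stays well under one percentage point in this toy setting because only a few items and only guessing students contribute.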

C. Misconceptions
An extensive body of research has shown that some students bring strongly held misconceptions to physics classes [45,46] and that these misconceptions are often not removed by instruction [47,48]. The instruments employed by this study were constructed to contain distractors that represented the results of applying common misconceptions. As such, the low rate of selection of zero or NOTA options may result because the zero or NOTA option does not represent a common misconception. To explore this effect, the two semesters of data in data set 1 in which the students were asked to report whether they were sure of their answer or guessing were analyzed. The pretest and post-test results for the NOTA- and zero-affected questions are presented in Table IV.
For the FCI, those students who were guessing on the question selected the zero and NOTA distractors more frequently than students who reported being sure of their answer; however, for the guessing students the rate of selection of the NOTA and zero options was still substantially less than would be predicted by chance. This is exactly the pattern one would expect if some of the students who were sure of their answers were using a misconception not represented by the NOTA or zero distractor.
TABLE III. Total number and percentage of students selecting the testwiseness-affected distractor and other distractors in questions which have a NOTA distractor or zero distractor in data sets 1 and 3. For "other distractors," the number is the total number of selections for all three other distractors and the percentage is the average percentage for one of the three. Superscript * denotes p < 0.05, ** denotes p < 0.01, and *** denotes p < 0.001 based on a $\chi^2$ test of difference from a random distribution.

For the CSEM, the NOTA results are similar to those found in the FCI, with guessing students selecting the NOTA option less frequently than predicted by chance but more frequently than the sure students. The zero option results for the CSEM were less clear. For the pretest, the guessing students selected the zero option a statistically significant 5% less often than predicted by chance [$\chi^2(1, N = 1642) = 20.53$, $p < 0.001$]. The sure students' selection rate of the zero option for the pretest was not significantly different from that predicted by chance. Neither the sure nor the guessing students selected the zero option at a significantly different rate than that predicted by chance on the post-test. The pretest results seem to indicate that the zero option represents a common misconception on some CSEM problems; this would explain the difference in the zero option results between the FCI and the CSEM. The explicit confrontation of the zero option by the instructor may have modified the students' application of this testwiseness strategy.

D. Position bias
When the correct answer is unknown, a student may also select multiple-choice answers based on the position of the answer choice. This effect will be called "position bias." Position bias can interact with NOTA or zero bias because these options are often placed as the last option. The effect of position bias was examined in five-option, multiple-choice homework, quiz, and test questions collected in data set 2. Three classes of questions were selected for examination: those with a NOTA-affected option (e), those with a zero-affected option (e), and those with neither testwiseness effect present in any of the options. The first two classes of questions were selected for examination as a result of the prevalence of both NOTA and zero as option (e). NOTA appeared almost exclusively as option (e) with only a few problems containing a NOTA option as one of the other four options. Zero appears considerably more often than NOTA across the other four available options, but still appears as option (e) roughly twice as often as it appears in the sum of the other four options. The selection of the third class of questions, those with neither testwiseness effect, was made to determine students' distribution of selection in the absence of other testwiseness effects. With the predominance of NOTA and zero appearing as option (e), it was impossible to disentangle the distribution of student selection resulting from position bias from the effects of NOTA and zero bias across all questions. Examining questions without NOTA and zero options provides an opportunity to determine the effects of the position bias alone, as well as the cumulative effects present in the first two classes of questions.
For each of these three classes of questions, all instances of students answering incorrectly as well as the distractor that was selected were recorded. From this set of incorrect responses, instances of questions with correct answers in each of the possible positions [(a), (b), (c), (d), and (e)] were equally, and randomly, sampled to ensure that no bias was introduced because of a prevalence of correct answers in any one position. The distribution of incorrect responses is presented in Table V. If the position of the distractor had no bearing on the likelihood of it being selected, the incorrect answers should be evenly distributed across the five available options, resulting in an average of 20% of the students selecting each option.
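The equal, random sampling across correct-answer positions described above can be sketched as a stratified draw followed by a tally. The data layout, the function name, and the choice to draw as many responses per position as the smallest stratum contains are assumptions made for illustration; the per-position sample size actually used is not stated here.

```python
import random
from collections import Counter, defaultdict


def balanced_distractor_counts(incorrect, seed=0):
    """Tally selected distractors after equalizing correct-answer positions.

    incorrect: iterable of (correct_option, selected_option) pairs, one per
    incorrect response. The same number of responses is drawn at random from
    each correct-answer position (assumption: the size of the smallest
    stratum), then the selected options are counted.
    """
    by_key = defaultdict(list)
    for correct, selected in incorrect:
        by_key[correct].append(selected)
    n_per_key = min(len(selections) for selections in by_key.values())
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    counts = Counter()
    for selections in by_key.values():
        counts.update(rng.sample(selections, n_per_key))
    return n_per_key, counts


# Hypothetical toy data: (correct option, option the student selected).
responses = ([("a", "b")] * 5 + [("b", "c")] * 3 + [("c", "b")] * 4
             + [("d", "c")] * 6 + [("e", "a")] * 3)
n_per_key, counts = balanced_distractor_counts(responses)
print(n_per_key, dict(counts))
```

Because every position serves as the correct answer equally often in the balanced sample, each option is available as a distractor equally often, which is what justifies the 20% baseline.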
TABLE IV. Distractor distribution for students who are "sure" of their answer and students who are "guessing" in data set 1. Total number and percentage of students selecting the testwiseness-affected distractors and other distractors in questions which have a NOTA distractor or zero distractor. For the category "other distractors," the number is the total number of selections for all three other distractors and the percentage is the average percentage for one of the three. Superscript * denotes p < 0.05, ** denotes p < 0.01, and *** denotes p < 0.001 based on a $\chi^2$ test of difference from a random distribution.
Option (e) was selected less often than the other four options for all three classes of questions in Table V. The NOTA distractor continued to be selected at a substantially lower rate than other distractors in the (e) position. The students selected the zero option at approximately the same rate as other options in position (e) but at a significantly lower rate than would be predicted by chance. Options (b) and (c) were selected preferentially over the other options for all three classes of questions. This pattern of distractor selection can be explained with a synthesis of the effects of primacy and middle bias. Middle bias accounts for a predisposition towards selecting options (b), (c), and (d) with an aversion to (a) and (e), while primacy would make the early distractors more likely to be selected. The results presented in Table V demonstrate a statistically significant position bias in students post-instruction as determined by a $\chi^2$ test (p < 0.001 for each of the classes of questions). The failure to detect a zero bias in addition to the position bias may be a further indication that the instructor's efforts to confront zero bias were successful or may be a result of some of the zero answers forming common misconceptions in electricity and magnetism.
The distribution of correct answers was analyzed for the questions present in data set 2. The number of correct answers for each of the available options was tallied for the three classes of questions as shown in Table VI. The sum of all correct answers for each available option was examined to determine the overall trends of the instructor's selection of correct answers. Table VI demonstrates a relatively uniform distribution of correct answers for problems in data set 2. As such, the position bias identified in Table V cannot be explained by students modifying their responses based on experience with the instructor.

The strength of these biases for position-based selection makes the inclusion of well-vetted key balancing techniques valuable in the development of evaluation tools. A bias in the key either towards overly selected options or towards underselected options could result in either an increase or decrease in the average score.

E. FCI modified to introduce testwiseness effects
To examine the effect of parallel and opposite constructions and to further examine position bias, two modified versions of the FCI were created. These parallel and opposite constructions are options which are grammatically similar to preexisting options and have either similar or opposite meaning to the preexisting option. One modified version of the FCI (the FCI+) included 4 testwiseness treatments intended to increase student selection of the correct answers when utilizing testwiseness, while the second modified version (the FCI−) included 4 testwiseness treatments intended to decrease student selection of the correct answers when utilizing testwiseness. Each of these testwiseness treatments was used on 3 to 6 questions in its respective modified version of the FCI, and each modified question was affected by only one of these treatments. These treatments are summarized in Table VII.
The results of applying these modified versions of the FCI were collected in data set 4. Students were divided into three approximately equal groups and given either the FCI+, FCI−, or an unmodified FCI as a post-test at the end of the semester. The average student score for each treatment was obtained by averaging the student score on all questions which had been modified by the treatment. To determine how each of these treatments affected student scores, the average student score was also obtained for each corresponding set of questions present in the unmodified FCI as a control. The differences between the modified average scores and unmodified average scores are presented in Table VIII. The difference in averages for each effect was small and for many treatments opposite to that which would have been expected from the testwiseness literature; many of the negative treatments produced positive increases in the average. This experiment supports the conclusion that the testwiseness effects explored, except for NOTA and zero bias, are weak effects post-instruction when the students have substantial content knowledge.
TABLE V. Total number of students selecting each distractor for five-option multiple-choice questions in data set 2 under three conditions (E was a NOTA option, E was a zero option, or neither NOTA nor zero options were present). For each condition equal numbers of incorrectly answered questions were sampled with options A, B, C, D, and E as the correct answer. All distributions of distractor selection were significantly different from a random distribution based on a $\chi^2$ test (p < 0.001).

IV. DISCUSSION
This study investigated three research questions; these will be discussed in the order proposed.
To what extent do the instruments used in physics education research conform to well-established item writing rules? Many extensively researched instruments common to PER use distractors that may be preferentially avoided by students because of testwiseness, that is, test-taking strategies that do not require correct content knowledge. Table II shows some instruments with unbalanced keys, many instances of the NOTA option, which has been identified in the literature as problematic [13,17,20], and still more instances of the zero option, identified here as having potential testwiseness effects.
Are there testwiseness effects that are specific to scientific assessment instruments? The existence of NOTA bias identified in studies of nonscientific examinations [13,17,20] was confirmed as an effect in both quantitative and nonquantitative examinations of scientific understanding. Zero bias, while not as strong as NOTA bias in all cases, was identified as a testwiseness effect specific to fields requiring quantitative reasoning. While some part of zero and NOTA bias could be attributable to the application of misconceptions where the testwiseness-affected distractor does not represent the misconception, the effects were still substantial for students who do not report confidence in their answers and thus are not applying strongly held misconceptions.
Does student application of testwiseness affect the outcome of the PER instruments at the item or instrument level? An analysis of the overall effect of testwiseness from NOTA and zero options on the FCI and CSEM showed a small effect on final score, which suggests that testwiseness is not a validity threat to the use of the overall instrument. Item-level testwiseness effects were more substantial and should be considered in any item-level analysis of problem difficulty. Overall, these results for NOTA options agree with the findings of Haladyna and Downing [17] that NOTA options should be avoided, and extend this assertion to a physics environment.
Both NOTA and zero options were weak distractors when compared to the other distractors in the studied instruments. This could result in increased difficulty on questions in which either of these options is the correct answer, causing the misinterpretation of the scores on such problems. NOTA and zero aversion could also increase the likelihood of students randomly selecting the correct answer without use of the proper content knowledge when these options are used as distractors.
Analysis of data set 3 indicated that students appear to be affected by a combination of the effects of middle bias and primacy, with recency having little effect. This combination fully supports the work of Attali and Bar-Hillel [26] regarding middle bias. It is only partially supportive of the arguments of Blunch and Payne [27,28] regarding the importance of primacy and recency. Key balancing, as described by Bar-Hillel and Attali [14], can be used to reduce any unintended effects of position bias.
The identification of zero bias suggests that there may still be a number of science-specific elements of testwiseness that are yet to be explored. While it could not be examined in this study, it seems possible that students may also have an aversion to selecting other extreme values such as “infinity” or “the limit does not exist.”

V. IMPLICATION FOR INSTRUCTION
The analysis above suggests that the use of questions with NOTA or zero as one of the distractors produces an instrument with effectively fewer distractors, which can change the probability of students selecting the correct answer by chance. More importantly, it is quite possible that the use of a zero or NOTA option as the correct option may increase the effective difficulty of the problem without changing the physical concept tested. Based on this analysis, NOTA options should be eliminated from multiple-choice instruments. While zero options cannot be eliminated, they should be used sparingly. Instructors may also consider explicitly confronting students' zero bias.
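The change in chance performance described above can be made concrete with a toy guessing model. This is only an illustrative sketch and not a model used in the study: it assumes a guessing student picks each ordinary option with equal weight 1 and the NOTA or zero option with a reduced weight between 0 and 1. The function and parameter names are hypothetical.

```python
def chance_of_correct(n_options, avoided_weight, avoided_is_correct):
    """Probability that a pure guess is correct under a toy avoidance model.

    n_options          -- number of options on the item (e.g., 5)
    avoided_weight     -- relative weight of the NOTA/zero option
                          (1.0 = no aversion, 0.0 = never selected)
    avoided_is_correct -- True if the NOTA/zero option is the keyed answer
    """
    if not 0.0 <= avoided_weight <= 1.0:
        raise ValueError("avoided_weight must lie between 0 and 1")
    total_weight = (n_options - 1) + avoided_weight
    correct_weight = avoided_weight if avoided_is_correct else 1.0
    return correct_weight / total_weight


# No aversion: ordinary five-option guessing.
print(chance_of_correct(5, 1.0, avoided_is_correct=False))
# Full aversion to a NOTA/zero distractor: effectively a four-option item.
print(chance_of_correct(5, 0.0, avoided_is_correct=False))
# Partial aversion when the NOTA/zero option is the correct answer.
print(chance_of_correct(5, 0.5, avoided_is_correct=True))
```

Under these assumptions, aversion raises the chance score when the avoided option is a distractor and lowers it when the avoided option is the key, which mirrors the two cases discussed in the paragraph.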

VI. IMPLICATION FOR RESEARCH
The above analysis also suggests several potential ways in which use of NOTA, zero, and potentially other testwiseness-affected options may impact research. While these analyses demonstrate that testwiseness effects are lessened postinstruction, the effect on the pretest could modify normalized gain results [49]. One should also consider the potential effect of the inclusion of NOTA or zero options on item-level validity and reliability when developing an instrument for research purposes. The effect of the correct answer position should also be considered, and key balancing should be used to mitigate the effects of position bias. Overall, these results suggest that NOTA and zero options should be avoided, when possible, in the development of new instruments and considered in the evaluation of results obtained from existing instruments.
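The sensitivity of the normalized gain to a pretest-only effect can be shown with a short sketch. It assumes the standard definition of normalized gain, g = (post − pre)/(100 − pre), with scores in percent. The score values and the size of the pretest shift are hypothetical and chosen only for illustration.

```python
def normalized_gain(pre_percent, post_percent):
    """Normalized gain g = (post - pre) / (100 - pre), scores in percent."""
    if pre_percent >= 100:
        raise ValueError("pretest score must be below 100 percent")
    return (post_percent - pre_percent) / (100.0 - pre_percent)


# Hypothetical class averages (NOT data from the study).
pre, post = 30.0, 60.0
baseline_gain = normalized_gain(pre, post)

# Suppose testwiseness raises only the pretest average by 5 points.
pretest_shift = 5.0
shifted_gain = normalized_gain(pre + pretest_shift, post)

print(f"g without pretest shift: {baseline_gain:.3f}")
print(f"g with pretest shift:    {shifted_gain:.3f}")
```

In this hypothetical case, a shift confined to the pretest lowers the computed gain even though the post-test score is unchanged.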
The effect of testwiseness on item response patterns could affect research methodologies that use item-level rather than test-level data, including factor analysis and item response theory. Further, the observation that students are using cognitive strategies unrelated to their physics knowledge to answer some conceptual questions makes the relation between assessment results and student knowledge more tenuous. Testwiseness, as explored in this research, may be only one of many testing or problem-solving strategies that affect the interpretation of the conceptual instruments used in PER.

VII. LIMITATIONS AND FUTURE WORK
The work presented examined only the selective aversion to certain incorrect answers in situations where the correct answer was unknown. This suggests there may also be an aversion to selecting certain correct answers even when the correct content knowledge is present. This would represent a substantially more serious threat to validity and would be clinically more important; this effect will be the focus of future research. The existence of other testwiseness effects beyond zero bias that are specific to quantitative disciplines should also be explored. These effects may be more important in mathematics than in physics because of the wider range of extreme values (∞, “the limit does not exist”) that are available.
This work presented one experimental study, Sec. III E, which showed that the more subtle testwiseness effects were not important postinstruction, but more experimental work is needed.

VIII. CONCLUSION
This paper supported the existence of NOTA bias, a student's preferential selection of distractors other than the “none of the above” distractor. Zero bias was identified as a weaker, but still substantial, testwiseness effect. Students showed some position bias, selecting the central items in the answer choice list preferentially and avoiding the last distractor. Many popular PER conceptual instruments contain questions with NOTA or zero distractors. Some instruments, notably the FCI, have substantially unbalanced answer keys where the distribution of correct answers is not uniform. The effect of options that include grammatically similar structure to other options was shown to be weak postinstruction, when students have substantial content knowledge. However, if a significant pretest effect exists that is not present in a post-test, this could modify the normalized gain. As such, testwiseness effects should be considered and minimized in evaluation construction. Testwiseness should also be considered in item-level analysis where items contain a testwiseness-affected distractor or when the correct answer may be influenced by testwiseness.

TABLE II. Comparison of the number of questions with NOTA options and zero options, as well as the distribution of correct answers for all assessments that consist solely of five-option multiple-choice questions.

TABLE VII. Description of changes to the FCI+ and FCI− instruments in data set 4.

TABLE VIII. Average scores are for a selection of 3 to 6 questions present on the FCI or one of the two modified versions of the FCI in data set 4. The modified versions of the FCI had the options changed for many of their questions to elicit testwiseness effects that were intended to either improve or diminish student performance.
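The score differences reported in Table VIII follow from a simple comparison of averages over matched question sets. The sketch below shows one way such a difference might be computed; it assumes every student answered every question in the set, and the question numbers and responses are hypothetical rather than taken from data set 4.

```python
def mean_score(responses, questions):
    """Average fraction correct on the given questions, pooled over students.

    responses -- list of dicts mapping question number to 1 (correct) or 0
    questions -- question numbers modified by one treatment
    """
    total_correct = sum(student[q] for student in responses for q in questions)
    return total_correct / (len(responses) * len(questions))


# Hypothetical question set for one treatment (NOT the study's items).
treatment_questions = [4, 15, 28]

# Hypothetical responses from the modified-FCI group and the control group.
modified_group = [
    {4: 1, 15: 1, 28: 0},
    {4: 1, 15: 0, 28: 0},
    {4: 1, 15: 1, 28: 1},
]
control_group = [
    {4: 1, 15: 0, 28: 0},
    {4: 0, 15: 1, 28: 0},
    {4: 1, 15: 1, 28: 1},
]

difference = (mean_score(modified_group, treatment_questions)
              - mean_score(control_group, treatment_questions))
print(f"modified minus control: {difference:+.3f}")
```

A positive difference means the modified items were answered correctly more often than the same items on the unmodified instrument.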