Evaluating multiple-choice exams in large introductory physics courses

The reliability and validity of professionally written multiple-choice exams have been extensively studied for exams such as the SAT, the Graduate Record Examination, and the Force Concept Inventory. Much of the success of these multiple-choice exams is attributed to the careful construction of each question, as well as each response. In this study, the reliability and validity of scores from multiple-choice exams written for and administered in the large introductory physics courses at the University of Illinois, Urbana-Champaign were investigated. The reliability of exam scores over the course of a semester results in approximately a 3% uncertainty in students’ total semester exam score. This semester test score uncertainty yields an uncertainty in the students’ assigned letter grade that is less than 1/3 of a letter grade. To study the validity of exam scores, a subset of students were ranked independently based on their multiple-choice score, graded explanations, and student interviews. The ranking of these students based on their multiple-choice score was found to be consistent with the ranking assigned by physics instructors based on the students’ written explanations ($r > 0.94$ at the 95% confidence level) and oral interviews ($r = 0.94^{+0.06}_{-0.09}$).


I. INTRODUCTION
The Department of Physics at the University of Illinois, Urbana-Champaign began reforming its introductory physics sequence in the fall of 1996. [1] As part of the reform, midterm and final exams were converted from constructed-response to multiple-choice format. Prior to this reform, the physics exams had been relatively traditional exams in which students were asked to solve problems and were given credit based on the correctness of their written work. With classes as large as 1000 students, grading the exams and assigning partial credit in a consistent manner was a major endeavor. Even with trained graders using rubrics, inconsistencies arise among different graders as well as for a given grader between different students. Students often felt the allocation of partial credit was unfair, and a significant amount of time was spent dealing with student appeals. This likely produced further systematic effects as outspoken students were more likely to succeed in getting their exams regraded. The net effect of this exam format was that both professors and students were frustrated by the experience.
The difficulty of reliably grading large numbers of exams is not unique to physics and has been extensively studied by professional testing agencies. Much of the research has focused on comparing the multiple-choice format with the constructed-response format. Lukhele et al. from the Educational Testing Service found that, on a chemistry advanced placement (AP) examination, "a 75 min multiple-choice test is as reliable as a 185 min test built of constructed-response questions." [2] In the time to give a single constructed-response question, they could give many more multiple-choice questions and receive more information about the students. They also found that "to predict a particular student's score on a future test made up of constructed-response items," they "could do so more accurately from a multiple choice than from a constructed-response test that took the same amount of examinee time." Hence, many of the national exams such as AP exams and the graduate record examination (GRE) utilize the multiple-choice format.
Switching to the multiple-choice format solved the grading difficulties experienced with the constructed-response exams. Student complaints about grading essentially disappeared, with the occasional exception being exam questions that could legitimately be open to multiple interpretations. Still there remained considerable concern about the ability of multiple-choice exams to accurately assess students' understanding. [3,4] Although significant research has been performed for professionally constructed exams, little or no research exists on the validity or reliability of multiple-choice exams constructed by course instructors. Indeed, much of the success of the national exams is attributed to the careful construction and testing of each item to ensure its effectiveness. This procedure is unrealistic in physics departments where exams are generally created in a short period of time by one or more members of the faculty who have little or no formal training in exam construction. The goal of this study was to determine if multiple-choice exams created in the Department of Physics at the University of Illinois yield scores that are reliable and valid assessments of student understanding in introductory physics. A discussion of the construction and evaluation of the multiple-choice exams is given in Appendix A. To see all of the midterm multiple-choice exams used in the introductory courses in recent years, visit the Illinois Physics Education Research Group's web site at http://www.physics.uiuc.edu/Research/PER/ and click on the "Resources" link.
Exam construction experts measure the ability of an exam to assess student understanding based on the reliability and validity of the exam scores. Reliability refers to the reproducibility of students' scores, i.e., the extent to which one would expect a student's score to vary if the student were given another equivalent exam. Validity refers to the extent to which exam scores are representative of what the writer intends to measure. Reliability and validity are two dimensions that can be used to evaluate scores from an examination. Exam scores can be reliable, but not valid, if they are measured precisely and are repeatable but are not indicative of what one wants to measure, i.e., the scores are not accurate measurements. Exam scores can also be valid while not being reliable if they accurately reflect what the instructor intends to measure but there is a large amount of uncertainty in each measurement, i.e., the scores are not very precise.
Section II of this paper describes two methods for determining the reliability of exam scores and how this can be used to estimate the uncertainty in a student's exam score. Section III describes the study that was conducted to determine the validity of exam scores from one of the multiple-choice exams. Grading students' written work and how it can be implemented into a course apart from exams is discussed in Sec. IV. Section V summarizes the work and results presented.

II. RELIABILITY
Reliability is the extent to which a student's exam results are reproducible. To estimate the reliability of exam scores, students can be given two similar exams, both in content and in difficulty. [5] The distribution of the differences between the two sets of test scores for each student provides one estimate for quantifying the reliability of exam scores. A narrow distribution in the set of test score differences would suggest that the exam results are reliable, whereas a broad distribution would suggest that student exam scores are not reproducible.
Ideally, to perform this type of analysis, one would administer two separate but equivalent exams to each student throughout the semester. This, however, is not practical. Instead, one can take the complete set of exam items and split them into two equivalent sets, e.g., split by even and odd numbered questions or split by item difficulty.
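The item-difficulty split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: items are scored 0 or 1, difficulty is taken to be the fraction of students answering correctly, and items are dealt alternately to the two halves after ranking. The function names and the tiny data set are hypothetical and are not taken from the study.

```python
from statistics import mean, pstdev


def split_by_difficulty(responses):
    """Rank items by fraction correct and deal them alternately into two halves.

    responses: list of per-student lists of 0/1 item scores (assumed format).
    Returns two lists of item indices with closely matched difficulty.
    """
    n_items = len(responses[0])
    difficulty = [mean(student[i] for student in responses) for i in range(n_items)]
    ranked = sorted(range(n_items), key=lambda i: difficulty[i])
    return ranked[0::2], ranked[1::2]


def half_score_differences(responses, half_a, half_b):
    """Percent-score difference between the two half-exams for each student."""
    diffs = []
    for student in responses:
        score_a = 100 * mean(student[i] for i in half_a)
        score_b = 100 * mean(student[i] for i in half_b)
        diffs.append(score_a - score_b)
    return diffs


# Hypothetical responses: 4 students, 6 items (not data from the study).
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
half_a, half_b = split_by_difficulty(responses)
diffs = half_score_differences(responses, half_a, half_b)
print(half_a, half_b)
print(pstdev(diffs))  # width of the distribution of half-score differences
```

A narrow spread in `diffs` corresponds to the reproducible scores described in the text; with real data one would examine the full distribution rather than a single number.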
A split-exam analysis was used to determine students' semester exam score uncertainties for all four introductory physics courses at the University of Illinois at Urbana-Champaign. Students in these courses take 4 multiple-choice exams each semester: 3 midterms and a final. Each midterm consists of 25 to 30 questions and the final exam is approximately 50 questions. [6] To make two equivalent exams, calling them euphemistically "even" and "odd" exams, the semester set of exam items was split based on item difficulty. [7] Figure 1 shows the cumulative results for both the algebra-based and calculus-based courses between the years 1999 and 2003. This combines 32 different course semesters, 128 multiple-choice exams, 4250 questions, and 12 281 students (A+ to C− only). [8,9] A Gaussian fit to the data reveals a 3.1% standard deviation based on this splitting method. [10-12] This result is consistent with an error estimate based on a binomial distribution:

$$\text{test score uncertainty} = \sqrt{\frac{p(1-p)}{N}}, \tag{1}$$

where $p$ is the average test score and $N$ is the number of test questions. For our 32 courses, the average test score is 73% and the average number of questions in a semester is 133, giving an estimated percent uncertainty in a student's semester test score of 3.85%. The uncertainty in student scores is one important measure for quantifying the reliability of exam scores. However, this uncertainty must be normalized with the standard deviation of the class scores to obtain an estimate of reliability. For example, a 3% uncertainty in a student's semester test score would not give much information about that student if there were only a 3% difference separating the "A" students from the "D" students.
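As a quick numerical check of the binomial estimate, the short Python sketch below evaluates the uncertainty for the quoted values, an average score of 73% and 133 questions per semester. The function name is an illustrative choice, not something defined in the study.

```python
import math


def binomial_score_uncertainty(p, n_questions):
    """Fractional uncertainty of a test score under a simple binomial model.

    p: average fraction of questions answered correctly.
    n_questions: number of questions contributing to the score.
    """
    return math.sqrt(p * (1 - p) / n_questions)


sigma = binomial_score_uncertainty(0.73, 133)
print(f"{100 * sigma:.2f}%")  # prints 3.85%
```

The estimate treats every question as an independent trial with the same success probability, which is a simplification; the text compares it with the 3.1% width found from the split-exam analysis.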
When course grades are essentially dependent upon exam performance, a letter grade uncertainty can then be estimated from the student's exam score uncertainty. Figure 2 shows the correlation between exam score and course grade for students in an introductory physics course. Translating the uncertainty in exam score into a letter grade uncertainty (l.g.u.) is achieved by dividing the test score uncertainty by the slope of the best-fit line of the average total test score versus grade point average:

$$\text{l.g.u.} = \frac{\text{total test score uncertainty}}{\text{slope of avg. total test score vs GPA}}. \tag{2}$$

When this is done for the 32 different course semesters of our introductory courses, the letter grade uncertainty ranges from 1/4 to 1/3 of a letter grade, with the average uncertainty being 0.27. [14] Here, the mapping of letter grades to grade point average is an A = 4.0; an A− = 3.7; a B+ = 3.3; a B = 3.0; etc. Therefore, the letter grade uncertainties reported here are less than the difference between a letter grade of an A and an A− or the difference between an A− and a B+.
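The conversion from a test score uncertainty to a letter grade uncertainty can be illustrated with a small least-squares fit. In the sketch below the average scores per grade are hypothetical (generated from an assumed line with a slope of 13 percentage points per grade point), and the grade-point mapping is extended down to C− by the same pattern quoted in the text. Only the procedure, dividing the score uncertainty by the fitted slope, comes from the paper.

```python
def least_squares_slope(x, y):
    """Slope of the ordinary least-squares line of y versus x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def letter_grade_uncertainty(score_uncertainty, slope):
    """Letter grade uncertainty: score uncertainty divided by the fitted slope."""
    return score_uncertainty / slope


# Grade points for A, A-, B+, B, B-, C+, C, C- (mapping extended by assumption).
gpa = [4.0, 3.7, 3.3, 3.0, 2.7, 2.3, 2.0, 1.7]
# Hypothetical average total test scores (percent) for each grade.
avg_score = [92.0, 88.1, 82.9, 79.0, 75.1, 69.9, 66.0, 62.1]

slope = least_squares_slope(gpa, avg_score)
lgu = letter_grade_uncertainty(3.5, slope)  # 3.5% score uncertainty
print(f"slope = {slope:.1f} % per grade point, l.g.u. = {lgu:.2f}")
```

With these made-up numbers the result lands near a quarter of a letter grade, which is the same order as the values reported in the text; the actual slopes for the 32 course semesters are not given here.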
Mathematically, using true score theory, [15] this test score uncertainty can be understood using the correlation coefficient between the even and odd split tests, $r_{e,o}$. In brief, a student's exam score uncertainty can be found using Eq. (3): [16]

$$\text{exam score uncertainty} = \sigma_{\text{exam}} \sqrt{1 - r_{\text{exam}}}. \tag{3}$$
Here $\sigma_{\text{exam}}$ is the standard deviation in exam scores and $r_{\text{exam}}$ is the reliability coefficient for the exam. An exam's reliability coefficient tells how correlated students' scores would be between that exam and a similar exam. The value of $r_{\text{exam}}$ can be estimated by splitting the exam into two equivalent sets (e.g., even and odd questions) and using the correlation between these two split halves, $r_{e,o}$. The exam's reliability coefficient can then be found using the Spearman-Brown formula [16]

$$r_{\text{exam}} = \frac{2 r_{e,o}}{1 + r_{e,o}}. \tag{4}$$

Table I lists semester test reliability coefficients along with their predicted exam score uncertainty for the three methods of splitting the exam questions (A+ to C− students only). [17] Rather than providing justifications for specific splitting methods, one can take the conservative approach of estimating the reliability in terms of the Cronbach alpha, $\alpha$, which is the average of all exam reliability coefficients that could be obtained from the different splittings of a test: [18]

$$\alpha = \frac{N}{N-1}\left(1 - \frac{\sum_{i=1}^{N} \sigma_i^2}{\sigma^2}\right). \tag{5}$$

Here $N$ is the number of test questions, $\sigma_i^2$ is the variance in scores for the $i$th question, and $\sigma^2$ is the variance in the total test scores. [19] The average exam reliability coefficient for the 32 course semesters using Eq. (5) was 0.87 ± 0.01, which leads to an average semester test score uncertainty of 3.5% ± 0.1%. This uncertainty is consistent with the value obtained using the specific split-exam methods. Figure 3 shows a histogram of alpha coefficients in each of the 32 semester courses for all students (A to F).
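The three reliability formulas of this section are short enough to write out directly. The Python sketch below is an illustration rather than the analysis code used in the study: the function names are invented here, population variances are used throughout (an assumption, although the ratio in the Cronbach alpha is unchanged as long as the same convention is used in the numerator and denominator), and the example values 0.77 and 9.7% are hypothetical inputs chosen to be of the same order as the results quoted above.

```python
import math
from statistics import pvariance


def spearman_brown(r_half):
    """Full-exam reliability from the correlation between two half-exams."""
    return 2 * r_half / (1 + r_half)


def cronbach_alpha(item_scores):
    """Cronbach alpha from per-student lists of item scores."""
    n_items = len(item_scores[0])
    item_vars = [pvariance([s[i] for s in item_scores]) for i in range(n_items)]
    total_var = pvariance([sum(s) for s in item_scores])
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)


def score_uncertainty(sigma_exam, reliability):
    """Exam score uncertainty from the score spread and the reliability."""
    return sigma_exam * math.sqrt(1 - reliability)


# Hypothetical half-exam correlation of 0.77.
print(f"{spearman_brown(0.77):.2f}")
# Hypothetical class spread of 9.7% with a reliability of 0.87.
print(f"{score_uncertainty(9.7, 0.87):.1f}")  # about 3.5
```

Calling `cronbach_alpha` on a matrix of 0/1 item scores for one semester would give the kind of coefficient histogrammed in Fig. 3, assuming the item-level data are available in that form.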

III. VALIDITY
Determining the reliability of exam scores is relatively straightforward. Assessing the validity of exam scores is much more difficult as it requires comparing a student's exam results with an assessment of the student's physics knowledge. This comparison can be even more difficult for a multiple-choice format exam. It is well known that scores on multiple-choice format questions can depend strongly on the nature of the distractors, or even the position of the correct answer in the list. [3] Exams used in our introductory courses are used to assess the relative level of mastery of the specified curriculum, with the scores ultimately being used to assign letter grades to students. The goal of this study is to compare the scores students receive from a multiple-choice exam to those they would receive from a constructed-response exam. It is, therefore, instructive to know whether the distribution of assigned letter grades to students is consistent across exam formats. We use the scores from constructed-response exams, where the student's work can be examined by physics instructors, as the assessment of the student's physics knowledge. This procedure for assessing the validity of multiple-choice tests was developed early in the 20th century. [20] We attempt to improve this assessment of the student's physics knowledge by supplementing the written constructed-response questions with an interview with the student designed to help clarify any ambiguities that remained after reviewing the written work. The details of this study are presented below.

A. The study
In the spring of 1999, two similar multiple-choice finals were given for the introductory electricity and magnetism (E&M) course for physics and engineering majors. Of the two populations of students who took the final exams, a select subset of students who scored consistently on their first three midterms were invited to participate in the validity study and received $20 for their participation. Roughly equal numbers of students who received A's, B's, and C's on the first three midterm exams were accepted. This selection process was chosen to ensure a uniform distribution of student abilities for which the exam is designed to differentiate. [21] The number of students in the subset of those who took the first final was $N_1 = 16$ and of those who took the second final was $N_2 = 17$. In total there were 33 students who participated in the study, which was 9% of the total number of students enrolled in the course.
Immediately after completing the multiple-choice course final exam, the students were taken to another room. In this room they were asked to work through 20 questions selected from their final exam, this time showing all of their work. These questions covered five of the major topics discussed in this E&M course: electric fields, electric potential, Gauss' law, Coulomb's law, and Faraday's law. The students were allowed to see their final exam and use any notes they had made during the actual exam in completing this section. They also had the liberty to mark different answers on this 20 question follow-up form than they had marked originally during their multiple-choice test. Ideally it may have been better for the students to complete the constructed-response portion first and then complete the multiple-choice exam, because the students' written work could be influenced by the item choices present in each problem. However, this study was intended not to interfere with the structure of the course. Thus, the constructed-response exam was completed second.
Once each student completed the follow-up form, the student was interviewed by one of four physics instructors participating in the study. The interviewer reviewed the student's work and asked questions to assess the student's understanding of the material. Each interview had a duration of 10-20 min and was recorded onto audiocassettes.
The student's written explanations for each question were then independently graded by the same four physics instructors. The assigned grade was made on an integer scale between zero and three, with zero representing little or no knowledge of the physics involved in the problem and three representing full knowledge of the physics involved. The partial knowledge decision between a grade of one and two was made by determining whether or not credit would have been given had the grading been only credit or noncredit.
Once the independent grading had been completed, the instructors met in a committee to assign a grade to each question for each student. The objective of the committee was to assign grades based only upon the level of the student's understanding of the relevant physics. The committee based its score upon the independently graded scores, the recorded interviews, and the observations made by the interviewer. The committee also gave an integer score from 0 to 3 to each question for each student as described in the previous paragraph.
The 20 items from each final exam that were used in the validity study are listed in Appendix C. Full credit for a two-choice, three-choice, or five-choice question was two points, three points, or six points, respectively. [22] To account for this weighted system, the scores assigned to the students by each grader and the committee were also weighted to have a parallel structure with the weights assigned to the different types of multiple-choice questions. In total, then, there are three sets of scores for the students: (1) their multiple-choice score (MC) from their original final exam, (2) their average-grader score from the four instructors (AG), and (3) their committee score (CS). The results in Part B of this section will address correlations between the MC and the other two sets of scores. Large correlation coefficients imply that students' MC scores are consistent with their scores from their graded solutions and their interviews.

B. Validity results
The raw correlations between the MC scores and the AG scores for the two groups were 0.88 and 0.92. [23] The probability, or p value, of obtaining these correlation values randomly is $p < 10^{-11}$. Because the study only involved probing students with 20 questions, there is a statistical correction to the correlation coefficient that is made to predict what the correlation would be had they answered an infinite number of questions. To know this correlation between MC and AG, we must correct our raw correlations for attenuation: [16,20,24]

$$r^{\text{atten}}_{MC,AG} = \frac{r^{\text{raw}}_{MC,AG}}{\sqrt{r_{MC}\, r_{AG}}}.$$

Here $r_{MC}$ and $r_{AG}$ are the reliabilities of the multiple-choice questions and the average-grader scores, respectively, as explained in the Reliability section of this paper. Table II lists the different correlation values between MC and AG obtained for the two samples of students. Similar studies were performed to compare students' MC scores with their CS scores. Raw correlations between the two sets of scores were 0.78 and 0.83 for the first and second group, [25] respectively, with a probability of occurring randomly of $p < 10^{-7}$. Correcting for attenuation raises the correlation values to 0.91 and 1.00.
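The correction for attenuation described above is a one-line calculation, sketched here in Python. The reliabilities of the 20-question subsets are not quoted in this section, so the values 0.85 and 0.90 below are purely hypothetical, as is the function name.

```python
import math


def correct_for_attenuation(r_raw, reliability_x, reliability_y):
    """Disattenuated correlation between two measures.

    r_raw: observed correlation between the two sets of scores.
    reliability_x, reliability_y: reliability coefficients of each measure.
    """
    return r_raw / math.sqrt(reliability_x * reliability_y)


# Raw MC-AG correlation of 0.88 with hypothetical reliabilities.
r_corrected = correct_for_attenuation(0.88, 0.85, 0.90)
print(f"{r_corrected:.2f}")
```

Because the raw correlation and the reliabilities are all estimated from finite samples, the corrected value can come out slightly above 1; such estimates are commonly reported as 1.00.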

C. Validity sensitivity
To determine the sensitivity of these validity results, we performed a Monte Carlo analysis. In particular, simulations were run for different assumed values of the true correlation coefficients to determine the probabilities of the observed correlation coefficients to fluctuate as high as those found in our validity study. This analysis then can be used to set lower limits on the true correlation coefficients between the MC and the AG (or CS) scores.
In a given simulation, the MC scores are first generated according to the observed MC score distribution from our data. For each MC score an AG (or CS) score is then generated using the assumed true value of the correlation coefficient, $r_{\text{true}}$, and the observed AG (or CS) score distribution from our data. A single run generates a set of 33 pairs of observed MC and AG (or CS) scores. From this single run, an observed correlation coefficient, $r_{\text{obs}}$, can be calculated. We then repeat this process thousands of times to create a distribution of the observed correlation coefficients, $r_{\text{obs}}$, that are generated by a specific $r_{\text{true}}$. From this distribution, we can calculate the probabilities to use in a maximum likelihood analysis. [26]

TABLE II. Correlations between students' MC score and their AG score. These results imply that the students' multiple-choice scores are indeed consistent with their scores given by instructors grading their written solutions.

| Correlations between MC and AG | Symbol | Values for the two groups |
| --- | --- | --- |
| Raw correlation | $r^{\text{raw}}_{MC,AG}$ | 0.88, 0.92 |
| Corrected for attenuation | $r^{\text{atten}}_{MC,AG}$ | 1.00, 1.00 |

Figure 4 shows the maximum-likelihood fit using input values from the AG and CS scores. At the 95% confidence level, the true correlation $r_{\text{true}}$ between students' MC score and their AG score is found to be greater than 0.94. When using input data from the CS scores, the true correlation between MC and CS was found to be $0.94^{+0.06}_{-0.09}$. These results are encouraging and suggest that for our population of students, their MC scores are as valid as scores from constructed-response exams and oral interviews. However, there remains the possibility that a subpopulation that was not involved in the study may exist for whom MC scores would not be indicative of their relative physics knowledge based on written work and oral explanations. What our results do show is that, for those students who were involved in the validity study, their scores and relative rankings from the multiple-choice test were consistent with the scores from having their work graded and having the students interviewed.
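The Monte Carlo procedure of this subsection can be sketched as follows. This is a simplified stand-in, not the analysis that was run: it draws score pairs from a bivariate normal distribution with the assumed true correlation instead of sampling from the observed MC, AG, and CS score distributions, and it stops at the probability of an observed correlation at least as large as a chosen threshold rather than carrying out the maximum likelihood fit. All function names and default values are illustrative assumptions.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_r_obs(r_true, n_students, rng):
    """One simulated study: draw correlated score pairs, return observed r."""
    xs, ys = [], []
    for _ in range(n_students):
        x = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        # Construct y so that its correlation with x is r_true (normal model).
        y = r_true * x + math.sqrt(1 - r_true ** 2) * noise
        xs.append(x)
        ys.append(y)
    return pearson_r(xs, ys)


def prob_at_least(r_true, r_threshold, n_students=33, n_runs=2000, seed=0):
    """Fraction of simulated studies whose observed r reaches r_threshold."""
    rng = random.Random(seed)
    hits = sum(
        simulate_r_obs(r_true, n_students, rng) >= r_threshold
        for _ in range(n_runs)
    )
    return hits / n_runs


for r_true in (0.80, 0.90, 0.95):
    print(r_true, prob_at_least(r_true, 0.88))
```

Scanning `r_true` in this way shows how quickly the chance of observing a large correlation falls as the assumed true correlation decreases, which is the idea behind setting a lower limit on the true correlation.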

IV. DISCUSSION
The reliability and validity studies verify that the multiple-choice exams administered in the introductory physics courses at the University of Illinois are fulfilling their primary function of assessing student understanding and assigning the appropriate grade. One should be careful not to conclude from these results that seeing and grading student work is not important. In addition to changing the exams to multiple-choice format, the course reform included the transformation of the recitation sections into discussion sections. [27,28] The discussion sections have students working in groups of four on concepts and calculations. The emphasis of these sections is on showing work and justifying reasoning and strategies. Students receive feedback on this work from classmates as well as the teaching assistant. Each discussion section ends with a constructed-response quiz, which is graded by the teaching assistant based on the work shown.
It might appear that the reform just shifted the grading difficulty from exams onto quizzes. Certainly grades for the constructed-response quizzes suffer the same reliability shortcoming as the exams, perhaps even more so, as a single TA grades all of the quizzes for an individual student throughout the semester. However, the impact of the quizzes on the final grade is significantly less than that of the exams. We see the role of the quiz as more of a formative rather than an evaluative assessment. In addition, since quizzes are given every week, the grade on any individual problem has a very small impact on the student's course grade. The result is that both students and faculty are generally satisfied with the quiz format and grading.

V. CONCLUSIONS
This study demonstrates that physics instructors under real time constraints can produce multiple-choice exams which yield results that are both reliable and valid assessments of students' understanding of introductory physics. Statistics such as the Cronbach $\alpha$ provide a straightforward method for determining the reliability of exam scores, and hence the statistical uncertainty in any student's score. Integrating all questions over the course of a semester reveals that students' total exam score uncertainty is about 3%, which corresponds to a course grade uncertainty of roughly 1/4 of a full letter grade.
Assessing the validity of exam scores is much more difficult as it requires making an independent assessment of the student's physics knowledge with which to compare the exam results. Although this is not practical to do in general, a study of 33 students taking the calculus-based E&M course at the University of Illinois, who had scored consistently on their three midterm exams, showed that the multiple-choice exams gave a statistically equivalent assessment of their understanding compared to their written explanations and interviews. Indeed, the difference between these rankings was less than the statistical difference of 3% found in the reliability analysis. Although some "poor" questions inevitably make their way into exams, the large number of questions throughout the course provides sufficient information to accurately assess students' understanding.

APPENDIX A: TEST CONSTRUCTION AND EVALUATION
Although multiple-choice exams are easier to grade than free-response exams, they are also more difficult to create. On a constructed-response exam, poorly worded or easily misinterpreted questions can be compensated for during the grading. Multiple-choice exams do not have this flexibility. Hence, the preparation of good multiple-choice questions is essential to the reliability and validity of the exams. The team-teaching approach of the introductory physics courses helps ensure this quality.
The introductory physics courses at UIUC are taught by a team of three to four professors, depending on class size. One or two professors are responsible for lecturing, one professor is in charge of the laboratory teaching assistants, and another is in charge of the discussion teaching assistants. In addition to their other assignments, this team is also responsible for creating the exams. It should be noted that in the four introductory courses between the years of 1999 and 2003, more than 50 physics professors contributed to creating the exams. Of these professors, most know very little of the research that has been done on the creation of questions with good distractors.
Each professor is typically assigned a few topics on which to write problems. They are encouraged to define an interesting situation, and then ask several questions that pertain to the situation. These problems are then assembled into an exam, which is reviewed by each of the team members. Having several independent people review the exam typically results in significant improvements to the questions.
The types of questions that appear on the exams are qualitative, quantitative, graphical, symbolic, and scaling questions with two-, three-, or five-choice answers to choose from. [29] Sometimes these answers exhaust all possible answers and sometimes they do not. Some examples of questions used on the exams can be found in Appendix B. [30] Table III is a listing of the various types of questions that have appeared on midterm exams given in the calculus-based E&M course from the spring of 1997 through the fall of 2002. In the table, the questions are listed by their number of choices, their type (qualitative, quantitative, etc.), whether the choices exhaust all possible answers, and the percentage of exam questions of each type.
After the exam has been administered, a standard exam analysis is performed and made available to the professors. In addition to the average exam score and average question score, the Cronbach $\alpha$ is provided as well as a discrimination analysis for each question. Figure 5 shows a typical discrimination analysis. The class is broken up into groups of 50 students based on their exam score. Each group's average score on that question is plotted versus their average exam score. Questions with good discrimination have a steep slope; questions with little discrimination are relatively flat and often deserve a second look. Sometimes questions with low discrimination are simply "unique." For instance, a fact about one of the laboratories might have low discrimination. Sometimes, however, they reveal an ambiguous or misleading question. This is important feedback which helps improve future exams.

FIG. 5. (Color) This is a discrimination plot for question 23 on the first final exam shown in the Appendix. Each data point represents a bin of approximately 50 students. The exam score for each bin is the average for that bin on the remaining questions on the exam.
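The binning behind a discrimination plot such as Fig. 5 can be sketched in Python. The sketch assumes 0/1 item scores and, following the figure caption, ranks students by their average on the remaining questions before forming bins; the function name and the toy data are hypothetical.

```python
from statistics import mean


def discrimination_points(responses, item, bin_size=50):
    """Points for a discrimination plot of one question.

    responses: list of per-student lists of 0/1 item scores (assumed format).
    item: index of the question being examined.
    Returns a list of (average score on remaining questions, average score on
    the examined question), one pair per bin of students.
    """
    rows = []
    for student in responses:
        rest = [score for i, score in enumerate(student) if i != item]
        rows.append((mean(rest), student[item]))
    rows.sort(key=lambda row: row[0])  # order students by rest-of-exam score

    points = []
    for start in range(0, len(rows), bin_size):
        chunk = rows[start:start + bin_size]
        points.append((mean(r[0] for r in chunk), mean(r[1] for r in chunk)))
    return points


# Toy example: 4 students, 3 questions, bins of 2 students.
toy = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 1, 0]]
print(discrimination_points(toy, item=0, bin_size=2))
```

A question with good discrimination gives points that rise steeply from the low-scoring bins to the high-scoring ones, while a nearly flat set of points marks a question that deserves a second look.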
To conclude, we offer a conjecture as to why these multiple-choice exams, which contain questions that were constructed without the use of research-based distractors, can nonetheless be valid and reliable. The first point to make is that about 65% of the nonquantitative questions have choices that exhaust all possible answers. Clearly, the issue of research-based distractors is moot for these questions. Indeed, we see that the number of these exhaustive questions that have poor discrimination is about 50% less than that of the nonexhaustive ones. The second point to make is that instructors are encouraged to construct qualitative questions with answers which, if nonexhaustive, at least are couched in general terms and avoid specific explanations.

APPENDIX B: VALIDITY STUDY DATA
In Table IV we provide the raw data from our validity study. The table lists each student's multiple-choice, average-grader, and committee score. Plots of these data can be seen in Figs. 6 and 7.

Final 1

The following 20 test items are a subset of questions taken from the first version of the spring 1999 semester final exam in the calculus-based E&M course. These items were used in the validity study.
3. If the magnetic flux through a coil is zero at time $t_0$, the induced current in the coil must also be zero at time $t_0$.
(A) True (B) False

The next two questions pertain to the following situation: Three identical rectangular wire loops (b > a) are being moved in the plane of the page at speed v into a B field filled (shaded) region from a region of zero B field. The B field in the shaded region is spatially uniform and is normal to and pointing out of the plane of the page. When each loop is exactly halfway into the shaded region:

5. The direction (clockwise or counterclockwise) of the current being induced in loop 2 is the same as the direction of the current being induced in loop 1.
(A) True (B) False

6. The magnitude of the current being induced in loop 3 is greater than the magnitude of the current being induced in loop 1.
(A) True (B) False

10. W is the net work you would have to do to move the charges from configuration I to configuration II. Which one of the following is true?

(A) a current that is constant as long as ω is constant (B) a sinusoidally varying current of angular frequency ω (C) no current

The next three questions pertain to the figure below: A positive charge of magnitude q is placed at (x, y) = (a, 0) and a negative charge of magnitude 2q is placed at (x, y) = (−a, 0) as shown in the figure above. The numerical values are q = 3 C, a = 5 cm.

18. There will be no place on the x axis for −a < x < +a at which the net electric field due to these charges is zero.
(A) True (B) False

19. There will be no place on the x axis for x > +a at which the net electric field due to these charges is zero.
(A) True (B) False

20. What is the value of $E_y$, the y component of the electric field due to these two charges at point A defined as (x, y) = (0, −2a)? Be careful: all the answers can be attained using values given in the problem.

23. Calculate the electric potential at the origin, given that the potential at infinity is zero.

A spherical Gaussian surface (shown with the dotted line in the figure) is drawn concentric to the conducting sphere and shell at a radius R = 4 cm.

There is not enough information given to determine both the relative signs and the relative magnitudes of $Q_A$ and $Q_B$.

Final 2
The following 20 test items are a subset of questions taken from the second version of the spring 1999 semester final exam in the calculus-based E&M course. These items were used in the validity study.
3. A wire coil is located in an external magnetic field. If the magnetic flux through this coil is zero at time t_0, the induced current in the coil must also be zero at time t_0.
(A) True (B) False
4. Three identical copper loops are leaving a region of uniform magnetic field at the instant shown. The loops all have the same speed. Assume the magnetic field is uniform inside the region and zero outside. The induced current is clockwise in all three loops.
(A) True (B) False
8. W is the net work you would have to do to move the charges from configuration I to configuration II.
Which one of the following is true?
The next 3 problems pertain to the situation below: Consider two isolated, well-separated (i.e., neglect any effect of one sphere on the other) solid spheres of equal radii R, each carrying total positive charge Q. One sphere is conducting; the other sphere is insulating (with the charge distributed uniformly throughout the volume).
(A) a current that is constant as long as ω is constant (B) a sinusoidally varying current of angular frequency ω (C) no current

The next three questions pertain to the figure below:
A positive charge of magnitude 2q is placed at (x, y) = (a, 0) and a negative charge of magnitude q is placed at (x, y) = (−a, 0) as shown in the figure above. The numerical values are q = 3 μC, a = 5 cm.
20. There will be at least one place on the x axis for −a < x < +a at which the net electric field due to these charges is zero.
(A) True (B) False
21. There will be at least one place on the x axis for x > +a at which the net electric field due to these charges is zero.
(A) 27.
Midterm exams are written to be 60 min exams, but students are allotted 90 min to complete them. Students are allotted 3 h to take the final exam. For most students, time is not an issue.
7 For clarification, to get each student's even and odd scores, each of the four exams was first ordered by item difficulty. Then a student's even score is the sum of their scores on the even questions from exams 1 and 3 and the odd questions from exams 2 and 4. Likewise, a student's odd score is the sum of their scores on the odd questions from exams 1 and 3 and the even questions from exams 2 and 4.
8 This analysis considers only our A to C students because it is these students whose exam performance shows a strong linear correspondence to their assigned letter grade. That is, these students tend to receive 90% or more credit on the effort components of the course (e.g., homework, quizzes, and laboratories). Thus, their effort grade is not a distinguishing factor in the grade they receive in the course. This is not true, in general, for D and F students. Not only do these students do poorly on the exams, they also tend to do poorly on the effort components of the class. Therefore, the strong linear relationship between exam performance and assigned letter grade that is present for A to C students is not present for D to F students.
9 It should also be noted that over this same time span, more than 50 physics professors contributed to creating the exams used in the introductory courses.
10 In a second splitting method, the "even" test is literally the collection of the even-numbered questions from the first and third midterms and the odd-numbered questions from the second midterm and final. The reverse construction is made for the "odd" test. The uncertainty found using this method was 3.5%.
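The difficulty-ordered split described in note 7 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the analysis code used in the study: the function name `split_half_scores`, the 0/1 response matrices, and the simulated class in the demo are all hypothetical, and every item is assumed to carry equal weight, which the actual exams did not.

```python
import numpy as np

def split_half_scores(exams):
    """Return (even, odd) percent scores per student.

    exams: list of 2-D 0/1 arrays (students x items), one per exam.
    Items in each exam are ranked by difficulty; the "even" test takes the
    even-ranked items from exams 1 and 3 and the odd-ranked items from
    exams 2 and 4, and the "odd" test takes the complement.
    """
    n_students = np.asarray(exams[0]).shape[0]
    even_pts = np.zeros(n_students)
    odd_pts = np.zeros(n_students)
    even_n = odd_n = 0
    for k, resp in enumerate(exams):
        resp = np.asarray(resp, dtype=float)
        # Rank items from easiest to hardest by class-average score.
        order = np.argsort(-resp.mean(axis=0), kind="stable")
        ranked = resp[:, order]
        odd_ranked = ranked[:, 0::2]   # ranks 1, 3, 5, ...
        even_ranked = ranked[:, 1::2]  # ranks 2, 4, 6, ...
        if k % 2 == 0:                 # exams 1 and 3
            to_even, to_odd = even_ranked, odd_ranked
        else:                          # exams 2 and 4
            to_even, to_odd = odd_ranked, even_ranked
        even_pts += to_even.sum(axis=1)
        odd_pts += to_odd.sum(axis=1)
        even_n += to_even.shape[1]
        odd_n += to_odd.shape[1]
    return 100.0 * even_pts / even_n, 100.0 * odd_pts / odd_n

if __name__ == "__main__":
    # Hypothetical class: 500 students, four 25-item exams, logistic responses.
    rng = np.random.default_rng(0)
    ability = rng.normal(0.0, 1.0, size=(500, 1))
    exams = []
    for _ in range(4):
        difficulty = rng.normal(0.0, 1.0, size=(1, 25))
        p_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
        exams.append((rng.random((500, 25)) < p_correct).astype(int))
    even, odd = split_half_scores(exams)
    diff = even - odd
    print(f"mean difference: {diff.mean():.2f}%, spread (std): {diff.std(ddof=1):.2f}%")
```

The standard deviation of `diff` corresponds to the spread of the percent-difference distribution shown in Fig. 1; the simulated numbers are not meant to reproduce the values reported in the notes.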
A third splitting method is simply an alteration of the second splitting method. Here, the "even" test is questions 1, 4, 5, 8, 9, …, from the first and third midterms and questions 2, 3, 6, 7, …, from the second midterm and final. The reverse construction is made for the odd test. The uncertainty from this splitting was 3.6%. An offset to zero for each semester could be made so that all semesters had the same average percent difference in even and odd tests. This correction would account for the fact that students in different course semesters do not have the same even and odd tests. Adding this offset has the inherent effect of diminishing the standard deviations in the distributions to 3.2% for both the second and third methods of splitting the questions. This offset had little effect on the first splitting method.
J. R.
Common convention is to desire reliability correlation coefficients greater than 0.80 to ensure that a student's exam score uncertainty is less than half of the standard deviation in the class's exam score distribution.
L. J. Cronbach, Coefficient alpha and the internal structure of tests, Psychometrika 16, 297 (1951).
Because some of the exam items are grouped together under the same physical situation, splitting these items into separate split-half exams generally increases the correlation coefficient between the split-half exams and thus artificially increases the coefficient alpha. It may be more appropriate to treat those questions that are grouped together under the same prompt as testlets, and then to calculate alpha using testlet scores. To see what effect this might have on our alpha values, we examined four semester sets of exams: two from calculus-based mechanics and two from algebra-based mechanics. In each of the four semesters, the testlet alpha was indeed less than the item alpha, but never by more than 2% of the item alpha. This difference between the item and testlet alphas is less than the variation between semester item alphas.
T. P. Hogan, Relationship between free-response and choice-type tests of achievement: A review of the literature (ERIC Clearinghouse on Tests and Measurements, Princeton, NJ, 1981).
One justification for this selection process is that if only A and F students participated in the study, correlations between multiple-choice and constructed-response scores would be artificially high. We wanted to make sure there was an even distribution of students in the letter grade range from A to C. This is the range of most interest to us, since it is in this range that students' course grades depend predominantly on exam performance. Students in the D to F range do poorly on all components of the course, not just the exams. To ensure that there was an equal number of students in each grade category, we chose to select only those students who had scored consistently on their three midterm exams. If a student receives an "A" on one midterm but then receives a "C" on another, one does not know whether this student is really an A, B, or C student.
22 This weighting system was instituted to allow for partial credit. The five-option items are intended to be more difficult than two- and three-option items. Students can receive partial credit on a five-option item in one of the following ways: six points if only one option is chosen and is correct, three points if only two options are chosen and one of the chosen options is correct, two points if only three options are chosen and one of the chosen options is correct, and zero points for all other markings.
23 To address any concerns that these raw correlations are large because of the selection of students who participated in the study, there is a correction that can be made to estimate what the raw correlations would be if the students were a pure random sampling of the entire class. This correction for heterogeneity had little effect on our raw correlations: for group 1, r = 0.88 went to 0.90, and for group 2, r = 0.92 went to 0.89. We were able to test the validity of this correction from our reliability data and found
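The partial-credit rule for five-option items stated in note 22 is simple enough to write down as a small function. The sketch below is illustrative only: the name `five_option_points` and the representation of a student's marking as a set of option letters are assumptions, not part of the original grading system.

```python
def five_option_points(chosen, correct):
    """Points for one five-option item under the rule in note 22.

    chosen:  iterable of option letters the student marked, e.g. {"A", "C"}
    correct: the single correct option letter
    """
    chosen = set(chosen)
    if correct not in chosen:
        return 0  # the correct option was not among those marked
    # 1 marked -> 6 points, 2 marked -> 3 points, 3 marked -> 2 points;
    # any other marking (4 or 5 options) earns nothing.
    return {1: 6, 2: 3, 3: 2}.get(len(chosen), 0)


print(five_option_points({"D"}, "D"))       # 6
print(five_option_points({"B", "D"}, "D"))  # 3
```

Marking more options raises the chance of including the correct one but lowers the payoff, which is how the scheme trades certainty for credit.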

FIG. 1. (Color) Percent difference in the students' even and odd test scores. Items were split based on item difficulty.
FIG. 3. (Color) A histogram of alpha coefficients for the semester set of exams for each of the 32 courses for all students (A-F).

FIG. 4. (Color) These are maximum likelihood plots for the combined groups of students generated from the simulation data to determine the true correlation coefficient between students' MC and AG scores and the MC and CS scores.

FIG. 6. (Color) The raw data from the students involved in the validity study (N = 33). The graph is a plot of the students' average-grader score (AG) versus their multiple-choice score (MC).
(A) E_y = +5.79 × 10^6 N/C (B) E_y = +1.93 × 10^6 N/C (C) E_y = 0 N/C (D) E_y = −1.93 × 10^6 N/C (E) E_y = −5.79 × 10^6 N/C
The next four questions pertain to the following situation: A solid metal sphere of radius a has a net positive charge Q_a. The sphere is surrounded by a thin concentric conducting spherical shell of radius b. The shell has a net negative charge Q_b = −Q_a.
21. Various spherical Gaussian surfaces are drawn concentric to the conducting sphere and shell at different radii R. Which graph best describes the electric flux Φ through the entire Gaussian surface as a function of R? (Recall that the area vector for a closed Gaussian surface points outward.)
22. Let a = 2 cm, b = 5 cm, Q_a = +2 × 10^−9 C, and Q_b = −2 × 10^−9 C. Calculate the radial component of the electric field at R = 4 cm due to the conducting shell and sphere.
(A) E_r = −2.91 × 10^4 N/C (B) E_r = −1.55 × 10^4 N/C (C) E_r = 0 (D) E_r = +1.12 × 10^4 N/C (E) E_r = +2.35 × 10^4 N/C
When a positive point charge +Q_a is brought close to (but outside of) the conducting shell, the magnitude of the electric flux through the entire Gaussian surface at R = 4 cm
(A) increases (B) decreases (C) remains the same
The next three questions pertain to the following situation:
25. Calculate |U_I|, the magnitude of the potential energy for the configuration of charges shown as (I), given Q = 3 μC and d = 2 meters.
(A) 20.0 × 10^−3 J (B) 28.6 × 10^−3 J (C) 37.3 × 10^−3 J (D) 46.0 × 10^−3 J (E) 63.4 × 10^−3 J
26. The potential energy of the configuration of charges shown as (II) is
(A) U_II > 0 (B) U_II = 0 (C) U_II < 0
27. Compare U_I, the potential energy of configuration I, with U_II, the potential energy of configuration II.
(A) U_I > U_II (B) U_I = U_II (C) U_I < U_II
28. Shown below is a portion of a very thin infinite charged insulating sheet perpendicular to the x axis. The sheet has uniform positive charge density +σ_a = +10 μC/m^2. An infinite conducting slab, with thickness 1 cm, is also placed perpendicular to the x axis and 10 cm to the left of the insulating sheet as shown in the figure. The total surface charge density on the conducting slab, σ_L + σ_R, is −6 μC/m^2. What is the surface charge on only the right side of the conducting slab (the side closest to the sheet)? Be careful: all the answers can be attained using values given in the problem.
(A) σ_R = 0 (B) σ_R = −3.0 μC/m^2 (C) σ_R = −6.0 μC/m^2 (D) σ_R = −8.0 μC/m^2 (E) σ_R = −10.0 μC/m^2
37. An ac generator consists of a square coil with N = 25 turns and side dimension b = 3 cm in a spatially uniform magnetic field B = 0.45 T that points in the positive z direction. The coil rotates about the x axis at constant angular frequency ω = 666 rad/s. Calculate the magnitude of the peak EMF generated in the coil.
(A) EMF_0 = 4.71 V (B) EMF_0 = 5.36 V (C) EMF_0 = 6.74 V (D) EMF_0 = 7.82 V (E) EMF_0 = 9.45 V
41. Suppose we have two point charges located along the y axis: Q_A at y = +a and Q_B at y = −a. Which of the following statements about the signs and magnitudes of the charges must be true if the electric field produced by these two charges is equal to zero at (x, y) = (0, +2a)?
(A) Q_A and Q_B have the same sign and the magnitude of Q_A is less than the magnitude of Q_B.
(B) Q_A and Q_B have the same sign and the magnitude of Q_A is greater than the magnitude of Q_B.
(C) Q_A and Q_B have the opposite sign and the magnitude of Q_A is less than the magnitude of Q_B.
(D) Q_A and Q_B have the opposite sign and the magnitude of Q_A is greater than the magnitude of Q_B.
EVALUATING MULTIPLE-CHOICE EXAMS IN LARGE… PHYS. REV. ST PHYS. EDUC. RES. 2, 020102 (2006) 020102-9
9. If the potential at r = ∞ is zero, which of the following statements is true about the potential outside the radius of the sphere (i.e., for all r > R)?
(A) V(r > R)_conducting < V(r > R)_insulating (B) V(r > R)_conducting = V(r > R)_insulating (C) V(r > R)_conducting > V(r > R)_insulating
10. If the potential at r = ∞ is zero, which of the following statements is true about the potential at the center of the spheres?
(A) V(r = 0)_conducting < V(r = 0)_insulating (B) V(r = 0)_conducting = V(r = 0)_insulating (C) V(r = 0)_conducting > V(r = 0)_insulating
11. If the potential at r = ∞ is zero, what is the potential inside the conducting sphere?
(A) V(r < R)_conducting < 0 (B) V(r < R)_conducting = 0 (C) V(r < R)_conducting > 0
14. Two parallel conducting rails in a horizontal plane are connected by a resistor R. They are in a region of spatially uniform magnetic field that points out of the page as shown in the figure. A conducting bar in electrical contact with the rails is being pulled away from the resistor at constant speed v by an external agent. Which one of the following is true?
(A) A current flows through R in the direction of arrow 1.
(B) A current flows through R in the direction of arrow 2.
(C) No current flows through R as long as v is constant.
16. A copper ring is being rotated clockwise by an external agent at a constant angular speed around a point on the ring as shown in the figure below. The ring is in the plane of the page and its motion is also in the plane of the page. The region is filled with a spatially uniform magnetic field normal to and pointing out of the plane of the page. The current induced in the ring is
(A) in the clockwise direction (B) in the counterclockwise direction (C) zero
17. The circuit shown in the diagram lies fixed in the plane of the page except for the semicircle of wire in the top side of the circuit, which can be rotated around the axis defined by the top side of the circuit by the crank shown at the right. All this is in a region of space completely filled with a spatially uniform magnetic field normal to and pointing out of the plane of the page. When the crank is turned by an external agent at a constant angular frequency ω, what current flows in the resistor?
True (B) False
22. What is the value of E_y, the y component of the electric field due to these two charges at point A defined as (x, y) = (0, −2a)? Be careful: all the answers can be attained by using values given in the problem.
(A) E_y = +5.79 × 10^6 N/C (B) E_y = +1.93 × 10^6 N/C (C) E_y = 0 N/C (D) E_y = −1.93 × 10^6 N/C (E) E_y = −5.79 × 10^6 N/C
23. Shown below is a portion of a very thin infinite charged insulating sheet perpendicular to the x axis. The sheet has uniform positive charge density +σ_a. A cylindrical Gaussian surface (centered on the x axis) of length 2L_0 encloses a portion of the sheet. The radius of each end cap is R. (Recall that for a closed surface, the area vector points outward.) The electric flux through the left end cap (surface 2) of the Gaussian surface is
(A) positive (B) negative (C) zero
24. A positive point charge Q_0 is now placed 10 cm to the right of the sheet and on the x axis as shown in the figure below (ignore the X in the figure until the next problem). Assume the charge distribution on the sheet is unaffected by the point charge +Q_0. The absolute value of the flux through the left end cap (surface 2) will
(A) increase (B) decrease (C) remain the same
25. Let +Q_0 = +2 μC (+σ_0 is still +5 μC/m^2). What is the magnitude of the net electric field on the x axis a distance 10 cm to the left of the plane (at the X in the figure)? Be careful: all the answers can be attained using values given in the problem.
(A) E = 1.68 × 10^5 N/C (B) E = 2.82 × 10^5 N/C (C) E = 4.50 × 10^5 N/C (D) E = 7.32 × 10^5 N/C (E) E = 9.78 × 10^5 N/C
The next two questions pertain to the figure below: A thin conducting spherical shell of radius a = 3 cm has a net charge Q_a = +3 × 10^−9 C. The inner shell is surrounded by a thin concentric conducting spherical shell of radius b = 7 cm. The outer shell has a net charge Q_b = −3 × 10^−9 C.
26. Calculate the radial component of the electric field at r = 5 cm.
(A) E_r = +2.26 × 10^4 V/m (B) E_r = +1.08 × 10^4 V/m (C) E_r = 0 (D) E_r = −1.49 × 10^4 V/m (E) E_r = −2.79 × 10^4 V/m
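As a quick numerical check of item 26, Gauss's law for a spherically symmetric charge distribution gives the field between the two shells from the inner shell's charge alone. The sketch below is not part of the original exam or its answer key; the function name `radial_field` is hypothetical, and the value of the vacuum permittivity is the standard CODATA figure.

```python
import math

EPSILON_0 = 8.8541878128e-12  # vacuum permittivity in F/m (CODATA value)

def radial_field(q_enclosed, r):
    """Radial field (N/C) at radius r (m) from Gauss's law with spherical symmetry.

    Only the charge enclosed by the Gaussian sphere of radius r contributes.
    """
    return q_enclosed / (4.0 * math.pi * EPSILON_0 * r**2)

# At r = 5 cm (between a = 3 cm and b = 7 cm) only Q_a = +3e-9 C is enclosed.
e_r = radial_field(3e-9, 0.05)
print(f"E_r = {e_r:.3e} N/C")
```

The result is about 1.08 × 10^4 N/C (equivalently V/m), directed outward, which corresponds to option (B).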

28. Shown below is a portion of a very thin infinite charged insulating sheet perpendicular to the x axis. The sheet has uniform positive charge density +σ_a = +5 μC/m^2. An infinite conducting slab, with thickness 1 cm, is also placed perpendicular to the x axis and 10 cm to the left of the insulating sheet as shown in the figure. The total surface charge density on the conducting slab, σ_L + σ_R, is −3 μC/m^2. What is the surface charge on only the right side of the conducting slab (the side closest to the sheet)? Be careful: all the answers can be attained using values given in the problem.
(A) σ_R = 0 (B) σ_R = −1.5 μC/m^2 (C) σ_R = −3.0 μC/m^2 (D) σ_R = −4.0 μC/m^2 (E) σ_R = −5.0 μC/m^2
33. Suppose we have two point charges located along the y axis: Q_A at y = +a and Q_B at y = −a. Which of the following statements about the signs and magnitudes of the charges must be true if the electric field produced by these two charges is equal to zero at (x, y) = (0, −2a)?
(A) Q_A and Q_B have the same sign and the magnitude of Q_A is less than the magnitude of Q_B.
(B) Q_A and Q_B have the same sign and the magnitude of Q_A is greater than the magnitude of Q_B.
(C) Q_A and Q_B have the opposite sign and the magnitude of Q_A is less than the magnitude of Q_B.
(D) Q_A and Q_B have the opposite sign and the magnitude of Q_A is greater than the magnitude of Q_B.
(E) There is not enough information given to determine both the relative signs and the relative magnitudes of Q_A and Q_B.
37. An ac generator consists of a circular coil with N = 30 turns and radius R = 2 cm in a spatially uniform magnetic field B = 0.55 T that points in the positive y direction. The coil rotates about the x axis at constant angular frequency ω = 333 rad/s. Calculate the magnitude of the peak EMF generated in the coil.
(A) EMF_0 = 4.82 V (B) EMF_0 = 5.49 V (C) EMF_0 = 6.90 V (D) EMF_0 = 8.01 V (E) EMF_0 = 9.68 V
ing an aircraft carrier: Revising the calculus-based introductory physics sequence at Illinois (Forum on Education of the American Physical Society, 1997).
2 R. Lukhele, D. Thissen, and H. Wainer, On the relative value of multiple-choice, constructed response, and examinee-selected items on two achievement tests, J. Educ. Meas. 31, 234 (1994).
3 E. F. Redish, Teaching Physics with the Physics Suite (John Wiley and Sons, New York, 2003).
4 S. Tobias and J. B. Raphael, In-class examinations in college-level science: New theory, new practice, J. Sci. Educ. Technol. 5, 311 (1996).
5 G. J. Aubrecht and J. D. Aubrecht, Constructing objective tests, Am. J. Phys. 51, 613 (1983).

TABLE IV. A list of each student's multiple-choice (MC), average-grader (AG), and committee (CS) score for those who participated in the validity study. The data for each group are shown separately.

TABLE III. A list of the types of questions used on midterm exams in our calculus-based E and M course (1003 questions in total).
Taylor, An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements (University Science Books, Sausalito, CA, 1982).
A letter grade difference of 1.0 is equivalent to a letter grade difference of A to B or B to C. A letter grade difference of 1/3 is equivalent to the difference between an A and an A− or an A− and a B+.
H. Wainer and D. Thissen, in True Score Theory: The Traditional Method, edited by David Thissen and Howard Wainer (Lawrence Erlbaum Associates, Hillsdale, NJ, 2001), Chap. 2, pp. 23-72.
C. C. Peters and W. R. Van Voorhis, Statistical Procedures and Their Mathematical Bases (McGraw-Hill, New York, 1940).