Evaluating an electricity and magnetism assessment tool: Brief electricity and magnetism assessment

The Brief Electricity and Magnetism Assessment (BEMA), developed by Chabay and Sherwood, was designed to assess student understanding of basic electricity and magnetism concepts covered in college-level calculus-based introductory physics courses. To evaluate the reliability and discriminatory power of this assessment tool, we performed statistical tests focusing both on item analyses (item difficulty index, item discrimination index, and item point biserial coefficient) and on the entire test (test reliability and Ferguson’s delta). The results indicate that BEMA is a reliable assessment tool.


I. INTRODUCTION
Standardized, multiple-choice tests can be a useful tool in assessing what students learn in physics courses. A number of such tests have been developed; these tests, covering different physics domains including kinematics [1], force [2], motion [3], dc circuits [4,5], electricity and magnetism [6], and other topics, have increasingly been used by a wide range of physics instructors to measure some aspects of what students learn in both traditional and reform physics courses. BEMA (Brief Electricity and Magnetism Assessment) [7] was developed in 1997 by Chabay and Sherwood, aided by Fred Reif, to measure students' qualitative understanding and retention of basic concepts in electricity and magnetism. We report elsewhere on the use of BEMA to compare student performance at the end of both traditional and reform introductory electricity and magnetism (E&M) courses and to compare retention of these concepts over a period of up to five semesters after the end of the courses [8]. The test itself is not included here because the utility of a standardized test decreases if its contents are widely known and the questions become very familiar to the population who will be tested. Any instructor may obtain a copy of the test at http://www.compadre.org.
In this paper we report on the reliability of BEMA, as measured by statistical tests focusing both on individual items and on the test as a whole. Test reliability has two aspects: consistency and discriminatory power. A test is reliable if it is consistent within itself and consistent across time. If a test is shown to be reliable, one can have confidence that the same students would get the same score if they took the test more than once. In addition, on a reliable test, a large fraction of the variance in scores is caused by systematic variation in the population of test takers; students whose levels of understanding or mastery are different will achieve different scores on the test. Both of these aspects of test reliability can be assessed statistically. If a test is to be used in comparing the performance of different groups, the reliability of the assessment instrument is particularly important.
To be useful, a test must also be "valid." A test is valid if the skills or knowledge it measures are directly relevant to the stated domain of the test. Validity cannot be assessed statistically and is usually determined by a consensus of expert opinions. Though the issue of validity (the question of whether BEMA in fact assesses knowledge of E&M) is not one of the main topics of this paper, it is involved in the overall evaluation of BEMA. We will briefly address the validity of BEMA in Sec. II.
Aubrecht and Aubrecht [9] were among the first to describe the use of statistical methods to evaluate objective physics tests. The measures they employed included the item difficulty index, the item discrimination index, and test reliability. Subsequently others introduced additional statistical tests, including the item point biserial coefficient and Ferguson's delta. Although all of these statistical measures are available for some published assessment tools such as the TUG-K (Ref. 1) and DIRECT (Refs. 4 and 5), many authors have limited their focus to a few of these measures, such as the item difficulty index and test reliability. In Sec. III we will report on the results of applying all these statistical tests to BEMA and will explain briefly the nature and significance of each test.

II. BACKGROUND AND VALIDITY OF BEMA
BEMA is a 30-item multiple-choice test that covers the main topics discussed in both the traditional calculus-based E&M physics curriculum and the matter and interactions curriculum (Matter & Interactions II: Electric and Magnetic Interactions [10]). It was originally designed for a retention study measuring students' knowledge of E&M at times ranging from three months to five semesters after completing an introductory E&M course. Test items are mostly qualitative questions with a few semiquantitative questions, which require only simple calculations. All test items are intended to assess students' understanding of basic concepts in calculus-based introductory E&M courses. An example of a question from BEMA is shown in Fig. 1.

FIG. 1. An example of a BEMA question.
The test was designed to incorporate broad coverage of elementary E&M, rather than to probe any particular concept in great detail. Since the population of interest included students who had taken both traditional and reform introductory physics courses, only questions on topics common to both courses were included in the test. To establish the validity of the test, initial drafts of the test were critiqued by all eight faculty members at Carnegie Mellon University (CMU) who had taught undergraduate E&M at any level (introductory or intermediate E&M) within the past five years. If an instructor reported that a proposed question dealt with a topic not covered in the version of the introductory course he or she had taught, the question was eliminated; the final set of questions was approved by all professors consulted, who agreed that the test did deal with important aspects of E&M.
Soliciting expert opinions is a standard method of assessing the validity of a test. The term "validity," which is not a statistical construct, refers to the extent to which a test actually measures what it purports to measure. Validity can have several aspects [11]. "Face validity" can be determined by a surface-level, common-sense reading of an instrument; a test would lack face validity if it tested concepts not related to the subject matter. "Content validity" reflects the coverage of the subject matter: does a test cover enough aspects of a specific topic? Both of these aspects of validity are typically assessed by expert consensus, as was done with BEMA. (Other aspects of validity, not relevant here, are "construct validity," the extent to which the test is demonstrated to measure a theoretical construct or trait such as creativity, honesty, or intelligence, and "criterion-related validity," evidence that performance on one assessment instrument can be used to make inferences about performance in a different domain.) Pilot testing was done with a small group of volunteers including senior physics majors who had recently completed the junior-level intermediate E&M course. The initial version of the test contained both multiple-choice questions, whose alternatives were based on common errors made by students on written tests, and a small number of short-answer semiquantitative questions, which were later converted to multiple-choice questions by including common incorrect responses as alternative answers. (We thank Tom Foster for converting the short-answer questions to multiple-choice questions.)

III. STATISTICAL EVALUATION OF BEMA
BEMA was first administered to 189 paid volunteers at CMU in the spring of 1997 [7]. All of these students had completed either the traditional calculus-based introductory E&M course or the matter and interactions (M&I) version of E&M at some time before they took the test; most were science, computer science, or engineering majors. This was not an end-of-course assessment; the elapsed time between completion of the E&M course and the BEMA assessment varied from three months up to five semesters. Since this was a longitudinal study, BEMA was also administered to a control group of students who had just completed the first-semester physics course (classical mechanics and thermal physics) and were ready to take the second-semester E&M course. In the fall 2003 semester, BEMA was administered as both a pre- and post-test to a large number of students at North Carolina State University (NCSU) via WebAssign, a computer-based homework system. (WebAssign is a centrally hosted subscription service with users from many different institutions. For more information see http://www.webassign.net.) All students were taking either a traditional calculus-based E&M course or an M&I course in that semester. Two hundred and forty-five students took the post-test, and 191 students took both pre- and post-tests. Students were asked to take the tests seriously; there was no penalty for wrong answers.
Pretest performance on BEMA does not vary much among different populations, and pretest scores average around 23%. In this paper we use only post-instructional data for test statistics, since we are focusing on evaluation of BEMA, and not on a comparison of student pre- and post-instructional performance. Post-instruction averages, standard deviations, and standard errors for students at CMU, NCSU, and the combined groups from both CMU and NCSU are given in Table I. For comparison, the average score of senior physics majors at CMU was 80%.
Using the data from this combined sample, we performed five statistical tests: three measures focusing on individual test items (item difficulty index, item discrimination index, item point biserial coefficient) and two measures focusing on the test as a whole (test reliability and Ferguson's δ). In the following sections, each test is briefly explained and the results discussed. Sections III A-III C discuss statistical measures focusing on individual test items, while Secs. III D and III E discuss statistical measures focusing on the test as a whole.

A. Item difficulty index
The item difficulty index (P) is a measure of the difficulty of a single test question. It is calculated by taking the ratio of the number $N_1$ of correct responses on the question to the total number N of students who attempted the question:
$$ P = \frac{N_1}{N}. $$
This difficulty index P might more meaningfully have been called the "easiness index," since it is simply the proportion of correct responses on a particular question. The greater the P value is, the higher the percentage of respondents giving the correct answer and the easier this item is for this population. The range for the difficulty index P value is [0, 1]. If the P value is 0, then no one answered the question correctly; on the other hand, if the P value is 1, then everyone answered this question correctly. Under most circumstances, such extremes should be avoided in a test.
A noteworthy aspect of the difficulty index is that the P value depends on the particular population taking the test. As an example, consider the first question in BEMA. Of 189 CMU students, 168 students answered correctly, so the difficulty index for the first question is 0.89. Among 245 NCSU students who took the post-test, 194 students chose the correct answer, so the difficulty index for the first question is only 0.79 for this population of students.
There are a number of different possible criteria for acceptable values of the difficulty index for a test [12]. In evaluating BEMA, we choose a widely adopted criterion that requires the difficulty index value to be between 0.3 and 0.9 [13], a range which includes the optimum value of 0.5. A difficulty level of 0.5 on each question would lead to the highest values of the statistics discussed in the following sections. However, it is difficult to control every item in one test, especially when the number of items (K) in one test becomes large. An averaged difficulty index value ($\bar{P}$) of all the items ($P_i$) in a test is often used as an indication of the test difficulty:
$$ \bar{P} = \frac{1}{K} \sum_{i=1}^{K} P_i. $$
We can compare the $\bar{P}$ value with the criterion chosen to check if it meets a certain standard.
Figure 2 plots the difficulty index P values of each item in BEMA from the combined sample of 434 students. BEMA item difficulty index values range from slightly below 0.2 to slightly above 0.8, with most items being around 0.4-0.5, within the desired range. The averaged difficulty index $\bar{P}$ is 0.42, which also falls within the criterion range [0.3, 0.9].
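The difficulty index and its average can be reproduced with a few lines of Python. The sketch below is only an illustration: the function names `difficulty_index`, `mean_difficulty`, and `in_criterion_range` are hypothetical, and the only data used are the item 1 counts quoted above (168 of 189 at CMU, 194 of 245 at NCSU).

```python
from typing import Sequence


def difficulty_index(n_correct: int, n_attempted: int) -> float:
    """Item difficulty index P = N_1 / N (proportion of correct responses)."""
    if n_attempted <= 0:
        raise ValueError("n_attempted must be positive")
    if not 0 <= n_correct <= n_attempted:
        raise ValueError("n_correct must lie between 0 and n_attempted")
    return n_correct / n_attempted


def mean_difficulty(p_values: Sequence[float]) -> float:
    """Averaged difficulty index over the K items of a test."""
    if not p_values:
        raise ValueError("need at least one item")
    return sum(p_values) / len(p_values)


def in_criterion_range(p: float, low: float = 0.3, high: float = 0.9) -> bool:
    """Check P against the [0.3, 0.9] criterion adopted in the text."""
    return low <= p <= high


# Item 1 counts quoted in the text for the two populations.
p_cmu = difficulty_index(168, 189)
p_ncsu = difficulty_index(194, 245)
# Pooling the two groups gives the combined-sample value for item 1.
p_combined = difficulty_index(168 + 194, 189 + 245)

print(round(p_cmu, 2), round(p_ncsu, 2), round(p_combined, 2))
```

Pooling the raw counts, rather than averaging the two P values, weights each population by its size, which is why the combined value lies closer to the NCSU figure.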

B. Item discrimination index
The item discrimination index (D) is a measure of the discriminatory power of each item in a test. In other words, it measures the extent to which a single test item distinguishes students who know the material well from those who do not. On a test item with a high discrimination index, students with more robust knowledge will usually answer correctly, while students whose understanding is weaker will usually get the item wrong. (In contrast, a flawed test question might lead more thoughtful students to give answers that are judged wrong, while students who think less deeply give a correct answer.) If a test contains many items with high discrimination indices, the test itself can be useful in separating strong students from weak students in a specific test domain.
In order to calculate the item discrimination index (D), we divide the whole sample of students into two different groups of equal size, a high group (H) and a low group (L), based on whether an individual total score is higher or lower than the median total score of the entire sample. For a specific test item, one counts the number of correct responses in both H and L groups: namely, $N_H$ and $N_L$. If the total number of students who take the test is N, then the discrimination index D of this item can be calculated as
$$ D = \frac{N_H - N_L}{N/2}. $$
In educational and psychological studies, there are several different calculations of discrimination index often employed by researchers. The calculation described above (50%-50%) is the one which we adopted to calculate discrimination indices for BEMA items. Other researchers may use the top 25% as the high group and the bottom 25% as the low group (25%-25%), in which case the discrimination index D can be expressed as
$$ D = \frac{N_H - N_L}{N/4}. $$
The 50%-50% calculation can underestimate the discriminatory power of test items, since it takes all the students, especially the relatively unstable middle 50%, into account. The 25%-25% calculation uses only the most consistent individuals, reducing the probability of underestimating the discrimination index due to unstable performance, but necessarily discarding half of the available data. The possible range for the item discrimination index D is [−1, +1], where +1 is the best value and −1 is the worst value. In the extreme ideal case all students in the high group get the item correct and all students in the low group get it wrong, giving a discrimination index D of +1. In the worst case the situation is reversed: everyone in the low group answers correctly, and everyone in the high group gets it wrong. In this case the discrimination index D will be −1. These extreme cases are unlikely, but it is important to eliminate any items with negative discrimination indices. An item is typically considered to provide good discrimination if D ≥ 0.3 [14]. Items with a discrimination index lower than 0.3 (but greater than 0) are not necessarily bad, but a majority of the items in a test should have relatively high discrimination index values to ensure that the test is capable of distinguishing between strong and weak mastery of the material.
Figure 3 shows the discrimination index for each BEMA item. As one can see, most of the discrimination index D values for BEMA items vary from 0.2 to 0.6, with a majority of items (18 items) around 0.3-0.4. This shows that most BEMA items have quite satisfactory discriminatory power. We also calculated the averaged discrimination index $\bar{D}$ for all K items ($D_i$) in BEMA, which can be expressed as
$$ \bar{D} = \frac{1}{K} \sum_{i=1}^{K} D_i. $$
We found the average discrimination index $\bar{D}$ for BEMA to be 0.33. This satisfies the criterion that $\bar{D}$ ≥ 0.3. In order to illustrate the underestimation of the 50%-50% calculation, we also computed BEMA item discrimination indices using the 25%-25% method. The index values for all 30 items increased, and the averaged discrimination index $\bar{D}$ for BEMA using the 25%-25% calculation is 0.52.
Question 9 has the lowest D value and clearly stands out as different from all the other questions. This question asks students to select an algebraic expression for the conventional current in a pipe containing ionized salt water, given the drift speeds of sodium ions and chloride ions, the cross-sectional area of the pipe, and the number density of the ions. Almost no students get this question correct, probably because systems with more than one kind of mobile charge are not emphasized in introductory E&M courses.
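As a rough illustration of the 50%-50% and 25%-25% calculations, the following Python sketch computes D for a single item from per-student data. The function name `discrimination_index`, its `fraction` parameter, and the eight-student data set are hypothetical and are not taken from the BEMA analysis. Ties at a group boundary are broken by input order, which is an assumption, since the text does not say how tied scores were assigned to groups.

```python
from typing import Sequence


def discrimination_index(
    item_correct: Sequence[int],
    total_scores: Sequence[float],
    fraction: float = 0.5,
) -> float:
    """Return D = (N_H - N_L) / group_size for one item.

    item_correct[s] is 1 if student s answered the item correctly, else 0.
    total_scores[s] is that student's total test score.
    fraction=0.5 gives the 50%-50% method, fraction=0.25 the 25%-25% method.
    """
    n = len(total_scores)
    if len(item_correct) != n:
        raise ValueError("item_correct and total_scores must have equal length")
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5]")
    group_size = int(n * fraction)
    if group_size == 0:
        raise ValueError("sample too small for the requested fraction")

    # Rank students from lowest to highest total score.
    # Assumption: sorted() is stable, so tied students keep their input order.
    order = sorted(range(n), key=lambda s: total_scores[s])
    low_group = order[:group_size]
    high_group = order[n - group_size:]

    n_low = sum(item_correct[s] for s in low_group)
    n_high = sum(item_correct[s] for s in high_group)
    return (n_high - n_low) / group_size


# Hypothetical toy data: eight students, one item.
totals = [3, 5, 8, 10, 14, 17, 21, 25]
item = [0, 0, 1, 0, 1, 0, 1, 1]

d_50 = discrimination_index(item, totals, fraction=0.5)
d_25 = discrimination_index(item, totals, fraction=0.25)
print(d_50, d_25)
```

On this toy data the 25%-25% value exceeds the 50%-50% value, the same direction of change reported for the BEMA items. With an odd number of students and `fraction=0.5`, this sketch leaves the middle student out of both groups.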

C. Point biserial coefficient
The point biserial coefficient (sometimes referred to as the reliability index for each item) is a measure of consistency of a single test item with the whole test. It reflects the correlation between students' scores on an individual item and their scores on the entire test, and is basically a form of the correlation coefficient. The point biserial coefficient has a possible range of [−1, +1]. If an item is highly positively correlated with the whole test, then students with high total scores are more likely to answer the item correctly than are students with low total scores. A negative value indicates that students with low overall scores were the most likely to get a particular item correct and is an indication that the particular test item is probably defective.
To calculate the point biserial coefficient for an item, one needs to calculate the correlation coefficient between the item scores and total scores. A student's score on one item is a dichotomous variable which can have only two values: 1 (correct) or 0 (wrong). Scores for the whole test usually can be viewed as continuous (if the test has a relatively large number of items, say ≥ 20). The correlation coefficient between a set of dichotomous variables (score for an item) and a set of continuous variables (total scores for the whole test) is given by [15]
$$ r_{pbs} = \frac{\bar{X}_1 - \bar{X}}{\sigma_X} \sqrt{\frac{P}{1 - P}}. $$
Here $\bar{X}_1$ is the average total score for those students who score 1 for the test item (correctly answer this item), $\bar{X}$ is the average total score for the whole sample, $\sigma_X$ is the standard deviation of the total score for the whole sample, and P is the difficulty index for this item.
As an example, consider item 1 in BEMA. Among the combined sample of 434 students from CMU and NCSU, 362 students answered the question correctly, so P = 0.83. For those 362 students, the average total score ($\bar{X}_1$) is 13.52. For all 434 students in the combined sample, the average total score ($\bar{X}$) is 12.50. Together with the standard deviation ($\sigma_X = 6.04$) of the total score for the whole combined sample, we can calculate the point biserial coefficient for BEMA item 1 to be around 0.37.
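The item 1 calculation can be checked numerically. The Python sketch below evaluates the point biserial formula from the summary statistics quoted above and also shows how the coefficient could be computed from raw per-student data. The function names are hypothetical, and the raw-data version assumes the population standard deviation (dividing by N), a choice the text does not specify; with that choice the result coincides with the ordinary Pearson correlation between item score and total score.

```python
import math
from typing import Sequence


def point_biserial_from_summary(
    mean_correct: float, mean_all: float, sd_all: float, p: float
) -> float:
    """r_pbs = (X1_bar - X_bar) / sigma_X * sqrt(P / (1 - P))."""
    if not 0 < p < 1:
        raise ValueError("r_pbs is undefined when P is 0 or 1")
    if sd_all <= 0:
        raise ValueError("sd_all must be positive")
    return (mean_correct - mean_all) / sd_all * math.sqrt(p / (1 - p))


def point_biserial(
    item_correct: Sequence[int], total_scores: Sequence[float]
) -> float:
    """Point biserial coefficient of one item from raw per-student data."""
    n = len(total_scores)
    if len(item_correct) != n:
        raise ValueError("item_correct and total_scores must have equal length")
    n_correct = sum(item_correct)
    if n_correct in (0, n):
        raise ValueError("r_pbs is undefined when P is 0 or 1")
    p = n_correct / n
    mean_all = sum(total_scores) / n
    mean_correct = (
        sum(t for c, t in zip(item_correct, total_scores) if c) / n_correct
    )
    # Assumption: population standard deviation (divide by N, not N - 1).
    sd_all = math.sqrt(sum((t - mean_all) ** 2 for t in total_scores) / n)
    return point_biserial_from_summary(mean_correct, mean_all, sd_all, p)


# Summary statistics for BEMA item 1 as quoted in the text.
r_item1 = point_biserial_from_summary(
    mean_correct=13.52, mean_all=12.50, sd_all=6.04, p=0.83
)
print(round(r_item1, 2))
```

Using the rounded value P = 0.83 reproduces the quoted result of about 0.37; carrying more digits of P (362/434) shifts the result slightly upward.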
Ideally all items in a test should be highly correlated with the total score, but that is somewhat unrealistic for a test with a large number of items. The criterion widely adopted [16] for measuring the "consistency" or "reliability" of a test item is $r_{pbs}$ ≥ 0.2. Items with a point biserial coefficient lower than 0.2 can still remain in a test, but there should be few such items. One way to check whether a majority of items satisfy $r_{pbs}$ ≥ 0.2 is to calculate the average point biserial coefficient ($\bar{r}_{pbs}$) of all items in a test:
$$ \bar{r}_{pbs} = \frac{1}{K} \sum_{i=1}^{K} (r_{pbs})_i, $$
where K is the number of items and $(r_{pbs})_i$ is the point biserial coefficient for the ith item. The average point biserial coefficient for BEMA is 0.43, which is greater than the criterion value 0.2, so BEMA items overall have fairly high correlations with the whole test.
Figure 4 provides the point biserial coefficient values for each BEMA item. As one can see, almost all items have satisfactory $r_{pbs}$ values, indicating that almost all BEMA items are reliable and consistent. We again see that item 9 on the current in salt water is problematic.
Note that Fig. 4, plotting the point biserial coefficient, and Fig. 3, plotting the discrimination index, are quite similar. It is worth asking whether or not these two statistics actually measure the same property of an item. The answer is no; theoretically, these two statistics are different measures of an item and could in principle give different results. The item discrimination index is a measure of how powerful an item is in separating strong and weak students, while the point biserial coefficient is a measure of whether an item is consistent with the whole test. An item could have a fairly high discrimination index value, but could also show little consistency with the test as a whole. If this were the case, the item might actually be testing some topic that is different from the main subject matter of the rest of the test. On the other hand, an item could be consistent with the test as a whole (high point biserial correlation coefficient) but could offer little discriminatory information.
For example, suppose half of the students in a sample answer a question correctly, giving it an item difficulty index (P) of 0.5. If half of those who answer correctly (25%) have total scores in the top 25% (quartile) and the other half of them (25%) have total scores in the lower mid-25% (quartile), this item would have a fairly high point biserial coefficient ($r_{pbs}$), but zero discrimination index (D) according to the 50%-50% method. This zero discrimination index could be avoided by switching to the 25%-25% discrimination calculation, but this method has its own extreme cases. Suppose only the top 8% of test takers get a particular item correct. Then the point biserial coefficient ($r_{pbs}$) will still be fairly high, but the discrimination index (D) will be lower than 0.3. There are many other possible situations in which the two statistics may be different.
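The first scenario above can be made concrete with synthetic data. The Python sketch below builds a hypothetical sample of 100 students with distinct total scores from 1 to 100 (not BEMA data) and an item answered correctly only by the top quartile and the lower-middle quartile. The helper functions are illustrative; the point biserial helper assumes the population standard deviation.

```python
import math
from typing import List, Sequence


def discrimination_index(
    item: Sequence[int], totals: Sequence[float], fraction: float
) -> float:
    """D = (N_H - N_L) / group size, using the top and bottom `fraction`."""
    n = len(totals)
    size = int(n * fraction)
    order = sorted(range(n), key=lambda s: totals[s])  # lowest total first
    n_low = sum(item[s] for s in order[:size])
    n_high = sum(item[s] for s in order[n - size:])
    return (n_high - n_low) / size


def point_biserial(item: Sequence[int], totals: Sequence[float]) -> float:
    """r_pbs = (X1_bar - X_bar) / sigma_X * sqrt(P / (1 - P))."""
    n = len(totals)
    n_correct = sum(item)
    p = n_correct / n
    mean_all = sum(totals) / n
    mean_correct = sum(t for c, t in zip(item, totals) if c) / n_correct
    # Assumption: population standard deviation (divide by N).
    sd_all = math.sqrt(sum((t - mean_all) ** 2 for t in totals) / n)
    return (mean_correct - mean_all) / sd_all * math.sqrt(p / (1 - p))


# Hypothetical sample: 100 students with distinct total scores 1..100.
totals: List[int] = list(range(1, 101))
# Correct only for the lower-middle quartile (26..50) and top quartile (76..100).
item: List[int] = [1 if (26 <= t <= 50 or t >= 76) else 0 for t in totals]

d_50 = discrimination_index(item, totals, 0.5)
d_25 = discrimination_index(item, totals, 0.25)
r_pbs = point_biserial(item, totals)
print(d_50, d_25, round(r_pbs, 2))
```

For this constructed sample the 50%-50% discrimination index is exactly zero, the 25%-25% index is 1, and the point biserial coefficient is about 0.43, so the two kinds of statistic clearly respond to different features of the same response pattern.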

D. Kuder-Richardson reliability index
In contrast to the point biserial coefficient, which is a measure of single-item consistency or reliability, the Kuder-Richardson reliability index is a measure of the self-consistency of a whole test. If a test is administered twice (at different times) to the same sample of students, then we would expect a highly significant correlation between the two test scores, assuming that the students' performance is stable and that the test environmental conditions are the same on each occasion. The correlation coefficient between the two sets of scores is defined as the reliability index of the test. However, this approach does not actually provide a practical way of determining the reliability index of a test, since students may remember the test questions and study for the test, test conditions at different times may not be identical, etc. Another method of measuring the reliability index of a certain test is to calculate the correlation coefficient of students' scores on two parallel tests that have the same content, structure, number of items, etc., but with different question contexts. As we know, designing two truly parallel tests is very difficult, so this does not seem to be a feasible way of measuring the reliability index.
The question is whether there is any method one can employ to calculate the reliability index without administering one test twice or designing two tests, by using the information from just one test administered just once. For tests that are designed specifically for a certain knowledge domain with all items being parallel measures, the Spearman-Brown formula can be invoked to calculate the reliability index. This equation connects the reliability index with the correlation between any two parallel composites of a test. The parallel composites are subsets of the test containing the same number of components (test items). For example, a 100-item test can have two 50-item composites, or four 25-item composites, or five 20-item composites, and so on. Based on the stipulation that the means, variance, and standard deviation of parallel measures be the same, the Spearman-Brown formula for the reliability index $r_{test}$ can be expressed as [17]
$$ r_{test} = \frac{K r_{xx}}{1 + (K - 1) r_{xx}}, $$
where K is now the number of parallel composites and $r_{xx}$ is the correlation between any two parallel composites. Kuder and Richardson further developed this idea and proposed to divide a test into its smallest components: items. Simply put, each item is regarded as a single parallel test and is assumed to have the same means, variance, and standard deviation. Two theoretical perspectives, "true and error theory" and "domain theory," can be used independently to derive the Kuder-Richardson formula from the Spearman-Brown formula. Though the two theories focus on different features of a test ("true and error theory" deals with the performance of students and "domain theory" deals with sample tests formed from a test pool), they yield the same final expression (KR-20) for calculating the reliability index of a test [18-20]:
$$ r_{test} = \frac{K}{K - 1} \left( 1 - \frac{\sum_{i=1}^{K} \sigma_{x_i}^2}{\sigma_x^2} \right), $$
where K is once again the number of the test items, $\sigma_{x_i}$ is the standard deviation of the ith item score, and $\sigma_x$ is the standard deviation of the total score. This calculation takes into account the different variances of the different items, relaxing the strict assumption that all items have the same means, variance, and standard deviations. One does not have to have perfectly parallel items in a test to be able to use this formula.
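A minimal Python sketch of the KR-20 calculation is given here for a matrix of 0/1 item scores. The function name `kr20` and the five-student, four-item response matrix are hypothetical. The sketch assumes population variances (dividing by the number of students) for both the item scores and the total score, a convention the text does not state; under that convention the variance of a dichotomous item with proportion correct $P_i$ is $P_i(1 - P_i)$.

```python
from typing import Sequence


def kr20(responses: Sequence[Sequence[int]]) -> float:
    """KR-20 reliability index for dichotomously scored items.

    responses[s][i] is 1 if student s answered item i correctly, else 0.
    """
    n_students = len(responses)
    if n_students < 2:
        raise ValueError("need at least two students")
    k = len(responses[0])
    if k < 2:
        raise ValueError("need at least two items")
    if any(len(row) != k for row in responses):
        raise ValueError("every student must have a score for every item")

    # Variance of the total score (assumption: population variance).
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_students
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students
    if var_total == 0:
        raise ValueError("total scores have zero variance")

    # Sum of item variances; for 0/1 scores each variance is P_i * (1 - P_i).
    sum_item_var = 0.0
    for i in range(k):
        p_i = sum(row[i] for row in responses) / n_students
        sum_item_var += p_i * (1 - p_i)

    return k / (k - 1) * (1 - sum_item_var / var_total)


# Hypothetical toy data: five students, four items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(responses), 2))
```

In this toy matrix the item variances sum to 0.8 and the total-score variance is 2, so the index works out to (4/3)(1 − 0.4) = 0.8.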
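For a multiple-choice test where each item is only scored as "correct" or "wrong," the above formula can be written as
$$ r_{test} = \frac{K}{K - 1} \left( 1 - \frac{\sum_{i=1}^{K} P_i (1 - P_i)}{\sigma_x^2} \right), $$
where $r_{test}$ is the reliability index, $P_i$ is the difficulty index of the ith item, and $\sigma_x$ is the standard deviation of the total score.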

FIG. 3. (Color) BEMA item discrimination index from a combined sample of 434 students. The average discrimination index is 0.33 (50%-50% method).

TABLE I. Post-test results.

TABLE II. Summary of statistical test results.