Measuring Student Learning With Item Response Theory

We investigate short-term learning from hints and feedback in a Web-based physics tutoring system. Both the skill of students and the difﬁculty and discrimination of items were determined by applying item response theory (cid:1) IRT (cid:2) to the ﬁrst answers of students who are working on for-credit homework items in an introductory Newtonian physics course. We show that after tutoring a shifted logistic item response function with lower discrimination ﬁts the students’ second responses to an item previously answered incorrectly. Student skill decreased by 1.0 standard deviation when students used no tutoring between their (cid:1) incorrect (cid:2) ﬁrst and second attempts, which we attribute to “item-wrong bias.” On average, using hints or feedback increased students’ skill by 0.8 standard deviation. A skill increase of 1.9 standard deviation was observed when hints were requested after viewing, but prior to attempting to answer, a particular item. The skill changes measured in this way will enable the use of IRT to assess students based on their second attempt in a tutoring environment.


INTRODUCTION
This work stands at the intersection of two current trends in education: interactive learning environments and formative assessment of students during learning. The usefulness of learning environments that give relevant "when needed" guidance has been emphasized by many learning theories. For instance, cognitive apprenticeship asserts that teachers ͑experts͒ should provide "scaffolding" to students ͑novices͒ to help them solve items that they otherwise could not solve by themselves. 1 Advances in computer technology are enabling more students to experience individualized tutoring and simulation environments. An example is the widely used MASTERINGPHYSICS homework tutorial system, 2 used to collect the data in this experiment.
Recent National Academy studies, 3 among others, have called for more formative assessment, and for extending formative and summative assessment from testing environments to everyday learning environments. This presents a significant challenge for item response theory ͑IRT͒ because this widely used method of psychometric analysis assumes both that students have only one attempt at a given item, and furthermore that students do not learn while working through the items ͑i.e., while taking the test͒.
In this paper we propose and study a method to extend IRT to assess students in a learning environment where they are allowed multiple attempts at answering a particular item, often with learning activities in between. There are three major motivations for this study: quantifying the effectiveness of the learning activities, extending the range of applicability of IRT into the "multiple attempts with intervening learning" regime, and providing better formative assessment of students during an interactive learning process. Quantifying the effectiveness of the various tutoring material in the tutor program offers the hope that we can improve educational effectiveness by revising the tutoring material appropriately. Subsequently, one might revise the format or the grading algorithm to encourage students to follow along tutoring trajectories that maximize their learning outcome. Additionally, one might strengthen the content along the most effective or most popular learning routes. The extension of IRT to this new regime may also increase the reliability of computer-given standard tests, since giving students a second attempt after a wrong answer is straightforward in these environments and provides another gradable interaction at the cost of little additional student time. Finally, being able to reliably assess a student while ͑s͒he is being tutored can provide formative as well as summative assessment without the need for interrupting the learning process for testing or some other form of assessment.
This paper presents a study of the effect of various forms of learning assistance on problem-solving performance in the Web-based homework tutor MASTERINGPHYSICS, which uses a Socratic tutoring style. ͑We use "problem" in the colloquial sense-most are like end of chapter problems in a typical introductory physics textbook as shown in the top of Fig. 1͒. This Socratic tutor provides help in several ways as students work toward the correct solution. If the student submits a wrong answer ͑even in the free response mode studied here͒, the program makes a useful response to that particular answer about half of the time ͑see the bottom left of Fig. 1͒. A list of hints is available upon student request at any time during the solution process. These hints are one of three types: a list of steps, declarative statements, and procedural subtasks. Declarative hints provide a description of what should be done, taken note of, or used to solve the item at hand ͑see the bottom middle of Fig. 1͒, while procedural subtasks ask students to work on simpler subtasks that are helpful steps toward solving the current item ͑see the bottom right of Fig. 1͒. In addition, MASTERINGPHYSICS may also ask follow-up questions and give comments to help the student understand the import of the just-obtained solution. To enable constructivist learning, students can open hints and subtasks, and answer problems and items in any order.
Given the complexity of student interactions with the feedback and hint structure, some simplification is necessary. Our simplification is to lump interactions into one of two groups: ͑1͒ received ͑useful͒ wrong answer feedback or ͑2͒ engaged hints and subtasks. We then considered various paths from start to second attempt. Using IRT, we first determine the skill of students and the difficulty of items based solely on the first response of the students to each item ͑in the spirit of IRT, we exempt students requesting a hint before attempting an answer from this analysis͒. Second responses on each item were then analyzed to find the changes in skill of students who had followed a particular path through the hints and feedback.

METHOD
We first employed IRT using a two-parameter logistic model in which the probability for student s ͑who has skill s s ͒ to get an item i correct, denoted as P correct s,i , is assumed to be P correct In Eq. ͑1͒, each item is parametrized by a discrimination coefficient ␣ i and a difficulty coefficient d i ; 4-6 no allowance for guessing seems necessary for the free response answer type involved here. ͑To lower cognitive load for those unfamiliar with standard IRT notation, we have used s and d for skill and difficulty, respectively, rather than the conventional and b.͒ The difficulty coefficient d i on the skill axis is the point for which the predicted probability of correct response, P correct s,i , equals 1 / 2. The discrimination coefficient ␣ i is sometimes called a slope parameter because it is related to the slope of the linear portion of a logistic curve. The larger a discrimination coefficient is ͑in other words, the steeper a logistic curve is͒, the better the item discriminates ͑distin-guishes͒ between students of low vs high skill.
The commercial IRT program 7 BILOG-MG was used to build this model from the first responses to 15 homework problems with 58 total items by 142 students who took an introductory Newtonian physics course in the fall 2004 semester at Massachusetts Institute of Technology. All 58 items were free response questions ͑all requiring a symbolic re-sponse͒. Students were allowed to submit up to six responses, being provided with feedback and/or electing some ͑or no͒ intervening learning activities before each subsequent attempt at answering the main item. After the solution is obtained, the program may make follow-up comments or ask follow-up questions ͑which are treated like any other items͒.
Obviously, the variety of elective learning material implies that students' learning paths through the tutorial help can be quite complicated. Figure 2 shows a transition diagram specifying four common paths students take that involve some learning. Most tutoring paths involve a second attempt preceded by an incorrect first attempt followed by no useful feedback or hint ͑no feedback or hint͒; receiving only feedback ͑useful feedback only͒; and using hints and/or subtasks ͑using hint and/or subtask͒. We characterize students' paths through the hints only by whether they used declarative hints or opened any subtasks ͑completion of these is elec-tive͒. The last path considered is using hints and subtasks before their first attempt ͑start with hint and subtask͒. We are interested to see how each of these particular paths affects students' skill on their second attempt, and expect that it will quantify the utility of hints and feedback to students.
For each of the paths in Table I ͑the paths are shown graphically in Fig. 2͒, we compute the change in skill on each item ͑␦s i ͒ by fitting the end-of-path response data to Eq. ͑2͒, where the skill of students ͑s s ͒ and the item parameters of items ͑␣ i , d i ͒ have been previously found from first response data as indicated above. N correct2 in Eq. ͑2͒ is the number of students who got the item correct out of the N s who followed that particular trajectory ͑e.g. one of those listed in Table I͒. FIG. 1. ͑Color͒ Solving a "pushing a lawnmower" problem with MASTERINGPHYSICS ͑top͒; wrong answer feedback ͑bottom left͒, declarative hint ͑bottom middle͒, and procedural subtask ͑bottom right͒. This amounts to fitting these "second attempt" data to a similar logistic function as the first attempt, the only change being a shift toward lower or higher skill by the amount ␦s i .

JUSTIFICATION OF MODEL
A sample of binned data used in this procedure is shown in Fig. 3. The data points of Fig. 3 were obtained by binning observed second attempts of students who received some meaningful feedback from the tutor after their erroneous first attempt, but did not use any hints or subtasks before making their second attempt. The solid line shows the expected P correct from their skill determined as described from previous problems ͑before they used the meaningful feedback provided by the tutor͒, the dashed line represents the best logistic fit to their second attempts ͑after they received meaningful feedback to their particular incorrect first answer͒ with the discrimination parameter ␣ i fixed to the value obtained from the first response data, and the dotted line represents the best logistic fit to the second attempts when the discrimination parameter ␣ i is determined from the second response data.
The data in Fig. 3 are fitted by the logistic curve of Eq. ͑2͒. This shows that the overall skill of the students is the prime determinant of whether they can answer the posed question or not, even after receiving feedback. Since the feedback differs depending on the specific error, the learning value clearly is different for different students, which may well explain why the discrimination is lower after the feedback.
The fits here are typical and justify two fundamental points about the model: ͑1͒ The model fits all four data sets within error This procedure is equivalent to fitting the difficulty of an item using only attempts made after the particular learning trajectory, and then setting ␦s i equal to the decrease in difficulty. We prefer to represent the learning as a path-͑and skill͒-dependent increase in skill because it is more consistent with the postulates of IRT. The difficulty of a free response item remains independent of the student or his ͑her͒ recent preparation; the increased probability of responding correctly is most logically attributed to a skill change that is local for that item. ͑Certainly the overall skill of the student does not change by one or two standard deviations due to five minutes of tutoring on one item.͒ For a multiple choice item, the item would obviously be easier given that one of the attractive distractors has been eliminated on the first attempt; nevertheless, this change in difficulty of the item could be absorbed into ␦s i unless this skill change depends on the particular distractor chosen on the first attempt. ͑This paper uses only free responses to items requesting a symbolic response, so this point is moot.͒ The key point here is that we assume a skill increase due to the tutoring interactions that depends on both the type of trajectory and the effectiveness of the help available in that item along that particular trajectory.
In determining the error bars of data points in each bin, we assume that the number of students in each bin who got the item correct follows a binomial distribution. Under this assumption, the error bar of each data point is given as ͱp͑1−p͒/N s* where p is N correct2* / N s* , N correct2* is the number of students in the current bin who got the item correct, and N s* is the total number of students in that bin. Consequently, if every student who followed a particular route in a

Second Attempts After Useful Feedback Only
Skill − Difficulty P correct Expected P correct Observed P correct (w/ a fixed α) Observed P correct (w/ a free α) FIG. 3. ͑Color͒ Logistic fits to the second attempt of students after they received useful feedback from the tutor. The discrimination has decreased from 0.82 to 0.55. particular bin gets an item correct or wrong in the subsequent attempt, that bin is assigned a zero error. In such extreme cases, we used p =1/ ͑3N s* ͒ instead, as suggested by Kim et al., to assign nonzero error bars. 8

RESULTS AND DISCUSSION
The techniques described above were used to fit data along the other paths indicated in Table I ͑see Figs. 3-5͒. First of all, it is noteworthy that students making a second attempt to answer without receiving either useful feedback or requesting hints showed a skill of −1.02Ϯ 0.22 standard deviation below expectation ͑see Fig. 4͒. Surely receiving no useful information does not make these students less skillful. Therefore we attribute this decrease to "item-wrong bias." Any item is particularly easy or difficult for some students ͑this is a cause of testing error͒; the sample of students who fail to answer an item correctly on their first attempt is obviously biased toward students for whom that item is especially difficult. The observation that students with sufficient skill to do the item ͑i.e., with s − d Ͼ 0͒ trend well below initial ͑IRT-based͒ expectation on this second attempt may well indicate that many started with, and retain, some significant confusion.
Significantly, receiving useful feedback or using hint͑s͒ and/or subtask͑s͒ provided by the tutor increased the skill of the students who had initially answered incorrectly. 9 The effect size ranged from 0.34 to 0.62 standard deviation, with an average 0.44Ϯ 0.10 standard deviation. Their actual learning probably significantly exceeds this, since without useful tutoring this group should have ͑negative͒ item-wrong bias for the same reason as the sample just discussed. In addition, the data indicate that consulting hint͑s͒ and/or subtask͑s͒ seems to be more beneficial than receiving only feedback from the tutor ͑p = 0.12 using the two-parameter fit, but p = 0.02 if the fit constrains the discrimination͒. The IRT result that the discrimination is reduced can be stated more informatively as follows: while the second-attempters whose skill exceeds the difficulty of the item get back to the level initially expected of them, those with less skill ͑i.e., s − d Ͻ 0͒ benefit mosttheir probability of answering correctly is very significantly increased over initial expectation by the feedback.
We were surprised that students who elected to consult the hints ͑at some penalty to their score͒ before submitting their first answer 10 showed the largest increase in their skill, effect size 1.87Ϯ 0.16 standard deviation ͑see Fig. 5͒. One possible explanation is that these students have enough metacognitive ability to recognize that they need help solely from inspecting the item ͑rather than having to make a mistake to learn this͒ and that enables them to use hints and subtasks until they are reasonably certain that they have the solution. This interpretation is buttressed by the fact that students whose a priori chance of answering correctly is less than 50% ͑i.e. s − d Ͻ 0͒ show an increased probability of answering correctly on their second attempt by 0.3 or more, a factor of 2-5.
A majority of students who failed their first attempt and used any form of tutoring were able to solve items on their second attempt, as reflected in their skill increase by 0.43Ϯ 0.10 standard deviation. This may be quantified by noting that on first attempt for all items 60% were answered correctly. After tutoring, 59% of the second attempts will be correct, hence by now 84% of the students will have answered the item correctly. Those who have not continue using the tutoring until eventually ϳ95% of all students have answered correctly. In this work, student background information such as gender or student major was not considered. However, previous research shows that such contextual factors do not correlate with student performance on MASTER-INGPHYSICS homework. 11 Thus, it would be reasonable to assume that the learning effect reported in this work is also not associated with such contextual factors.

No Feedback or Hint
Skill − Difficulty P correct Expected P correct Observed P correct (w/ a fixed α) Observed P correct (w/ a free α)

FIG. 4.
͑Color͒ Logistic fits to the second attempt of students when students neither received feedback from the tutor nor used hint or subtask. This paper does not address the question of whether the skill increase is transferable to subsequent problems. However, research on knowledge transfer in this environment shows that students need 15.4% fewer hints and submit 11.4% fewer wrong answers when working a subsequent problem on the same topics as the previous one. 12,13 We estimate that this would be consistent with a learning effect of the increased student skill by about 0.5 standard deviation ͑but just on items concerning that detailed topic͒.

IMPLICATIONS
The generalization of IRT developed here has several potential applications beyond studying the effectiveness of hints and subtasks. These include refining tutorial content based on its performance and also assessing students in a learning environment.
Data like those presented here inform the tutorial authors about the efficacy of the tutoring material along each learning trajectory through each individual item because the skill changes found, ␦s i , have this specificity. With some study and experience, these should allow the content author to evaluate the effectiveness of the tutorial help along each path relative to tutoring in other items. For this purpose ͑aiding the content author͒ we note that our approach-assigning a skill increase to each trajectory through the tutoring-might better be replaced by an analysis in which each particular wrong answer response and each particular hint and subtask is regarded as a separate learning element. Then a multiple regression approach on a large enough sample could find the skill increase attributable to each hint and subtask, allowing even more specificity in evaluating the tutoring elements for each tutoring item.
The major assessment payoff of our generalized IRT is greater reliability, achieved by making inferences about a student's skill from each attempt at solution, rather than from only their first attempt ͑as conventional IRT does͒. Assume that a large number of students have calibrated a particular trajectory and thereby determined ␦s i for that trajectory.
Then for subsequent students, their second attempt at this item after having passed along that trajectory may be regarded as a "first" attempt at a new item of known difficulty ͑i.e., d new = d i − ␦s i ͒ and discrimination. Then Eq. ͑2͒ may be used in the same manner as Eq. ͑1͒ was used to determine or to update the student's skill. Our use of "update" here implies that, in a learning environment, a student's skill is changing over time, and hence it is desirable to have a way to determine a "current skill" rather than just a final skill ͑as IRT normally provides after a test͒.
It also seems possible to use our approach to increase the reliability of computer-administered examinations. If they require free responses, the above techniques will work ͑the only trajectory being a second attempt without benefit of specific feedback or hints͒. For a multiple choice exam, the item is arguably less difficult after the student has been informed that the distractor that he selected is incorrect. Nevertheless, if each particular distractor is regarded as a distinct learning trajectory, then Eq. ͑2͒ may be applied to determine ␦s i , allowing the skill of future students to be assessed along that trajectory. This is because Eq. ͑2͒ ͑like all of IRT͒ depends only on the difference between skill and difficulty; whether ␦s i is regarded as an increase in skill or a decrease in difficulty is immaterial.

CONCLUSIONS
We have presented a mathematical model and procedure to generalize IRT to measure students' skill change due to learning that occurs between successive attempts to answer a single item. We showed that this model, using a shifted logistic function, accounted within error for response after feedback. We showed that the skill change depended on the students' path through the tutoring available in an online tutorial environment, with effect sizes from −1.02 to +1.87. We argued that the surprisingly large −1.02 standard deviation skill change of students who attempted to solve the item without using any hint or feedback from the tutor did not represent any sort of "unlearning," but probably reflected item-wrong bias associated with being selected because their first response was incorrect. This certainly indicates that simply telling a student that they are wrong does not help them perform very much better on their second attempt.
Our findings make a strong positive statement about the overall effectiveness of the tutoring of MASTERINGPHYSICS in helping students who answer incorrectly on their first attempt. Students who either spontaneously or by request receive such help perform 0.84 standard deviation better on their second attempt. ͑More help is available for those who get it wrong the second time.͒ We have shown only dramatic correlations between the skill change and the learning route selected by the students ͑or selected by the computer for spontaneous feedback͒. We do not see that particular students strongly favor the most effective "tutoring before first attempt" path so it appears that the learning correlates with the path and not the student. However, the path selected may indicate the student's readiness to learn ͑e.g., we suggested that selecting tutoring first showed metacognition in action͒. Still, while some paths may be selected by those "ready to learn," the available paths did indeed help the students reach the correct solution.
Another significant finding is that less skillful students benefited far more from the tutoring than more skillful ones. Figure 3 shows that tutoring increased P correct significantly only for students attempting items whose difficulty exceeds their skill. Figure 3 shows improvements of ϳ1.9 to ϳ4.0 error bars ͑p values from ϳ0.02 to 1.3ϫ 10 −4 ͒ for all points in this region. Students who decide to use hints and subtasks first improve even more.
We argue that the strong improvement from tutoring measured here, and the fact that it helps weaker students more, shows that hints and feedback from MASTERINGPHYSICS effectively play the role of "as needed" scaffolding. This tutoring is available when needed, enables 59% of those students who answer incorrectly on their first attempt to find their way to the solution on their very next attempt, and helps about 2 / 3 of those answering incorrectly on their second attempt to the solution before exhausting their remaining four attempts.