Development and validation of a conceptual multiple-choice survey instrument to assess student understanding of introductory thermodynamics

We discuss the development and validation of the long version of a conceptual multiple-choice survey instrument, the Survey of Thermodynamic Processes and First and Second Laws-Long (STPFaSL-Long), suitable for introductory physics courses. This instrument is a longer version of the original short survey developed and validated earlier. The 19 contexts, including the exact wording of all of the problem situations posed, are identical in the two versions of the survey instrument; the long and short versions differ only in the multiple-choice options. In particular, in the longer version, no alternative conceptions are explicitly embedded in the four multiple-choice options students choose from, and the questions asked in a given context in one item of the shorter version were split into several items focusing, e.g., on different thermodynamic variables. After the development and validation of the longer version, the final version was administered in 12 different in-person classes (four different institutions) in which students answered the questions in class on paper scantron forms with the instructor as the proctor, and in 12 different in-person classes (five different institutions) in which students answered the questions online on Qualtrics within a two-hour period. This longer version was administered to introductory physics students in various traditionally taught calculus-based and algebra-based classes before and after traditional lecture-based instruction in relevant concepts. It was also administered to upper-level undergraduates majoring in physics and to Ph.D. students taught traditionally for benchmarking purposes.


A. Multiple-Choice Surveys
Central goals of college introductory physics courses for life science, physical science, and engineering majors include helping all students develop a functional understanding of physics and learn effective problem-solving and reasoning skills [1][2][3][4][5][6][7][8][9][10]. Validated conceptual multiple-choice physics survey instruments administered before and after instruction in relevant concepts can be useful tools to gauge the effectiveness of curricula and pedagogies in promoting robust conceptual understanding. Compared to free-response problems, multiple-choice problems can be graded efficiently, and the results are easier to analyze statistically across different instructional methods and/or student populations. However, multiple-choice problems also have some drawbacks. For example, students' selection of the correct answers does not necessarily reflect understanding of the underlying concepts. Also, students cannot be given partial credit for their responses.
Multiple-choice survey instruments have been used as one tool to evaluate whether research-based instructional strategies are successful in significantly improving students' conceptual understanding [11][12][13][14]. Apart from the Force Concept Inventory, the most well-known survey, other conceptual survey instruments at the introductory physics level in mechanics and electricity and magnetism have been developed, including survey instruments for kinematics represented graphically [15], energy and momentum [16], rotational and rolling motion [17,18], electricity and magnetism [19][20][21][22][23], circuits [24], and Gauss's law [25,26]. In thermodynamics, existing conceptual survey instruments include: 1) the Heat and Temperature Conceptual Evaluation (HTCE) [27], which focuses on temperature, phase change, heat transfer, and thermal properties of materials; 2) the Thermal Concept Evaluation (TCE) [28,29], which focuses on concepts similar to those in the HTCE; 3) the Thermal Concept Survey (TCS) [30], which focuses on temperature, heat transfer, the ideal gas law, the first law of thermodynamics, phase change, and thermal properties of materials; 4) the Thermodynamics Concept Inventory (TCI) [31], which focuses on concepts in engineering thermodynamics courses; and 5) the Thermal and Transport Concept Inventory: Thermodynamics (TTCI:T) [32], which also focuses on concepts in engineering thermodynamics courses. In addition, a multiple-response survey has recently been developed for upper-level thermal physics [33].
Despite the availability of these five conceptual survey instruments on introductory thermodynamics, there is no research-validated survey instrument that focuses on the basic concepts related to thermodynamic processes and the first and second laws covered in introductory physics courses. Therefore, we earlier developed and validated [34,35] a 33-item conceptual multiple-choice survey instrument on these concepts called the Survey of Thermodynamic Processes and First and Second Laws (STPFaSL-Short). Here we discuss the development and validation [36,37] of a version of the survey instrument called STPFaSL-Long, a longer version of the shorter survey developed and validated earlier. The survey instrument can be accessed from PhysPort and is included in the supplementary materials of Ref. [36]. The 19 contexts, including the exact wording of all of the problem situations posed in the short and long versions of the STPFaSL survey instrument, are identical, and the two versions differ only in the multiple-choice options. In particular, in the longer version of the survey, no alternative conceptions are explicitly embedded in the four multiple-choice options students choose from, and the questions asked in a given context in one item of the shorter version were split into several items focusing, e.g., on different thermodynamic variables. We note that the overlap of the STPFaSL content with the HTCE and TCE is minimal. Moreover, although there is overlap between the TCI, TTCI:T, and STPFaSL concepts, the contexts used in the TCI and TTCI:T are engineering focused and, therefore, these surveys are unlikely to be used by introductory physics instructors. Finally, the TCS is intended for introductory physics courses and covers some content in common with the STPFaSL, but the TCS is a much broader survey with a major emphasis on temperature, the ideal gas law, phase change, and thermal properties of materials, content that is not explicitly the focus of the short and long versions of the STPFaSL instrument.

B. Inspiration from prior investigations on student understanding of thermodynamics
Prior research has focused not only on the development and validation of multiple-choice surveys to investigate students' conceptual understanding of various thermodynamic concepts; many investigations have probed student understanding of thermodynamics without using multiple-choice surveys. Some of these investigations use conceptual problems to probe student understanding and ask students to explain their reasoning. These investigations were invaluable in the development of both versions: the STPFaSL-Short instrument [34,35] and the STPFaSL-Long instrument discussed here. For brevity, below we give only a few examples of studies that were used as a guide and from which open-ended questions were drawn in the earlier stages of developing the multiple-choice questions for both versions of the instrument.
Loverude et al. [39] investigated student understanding of the first law of thermodynamics in the context of how students relate work to the adiabatic compression of an ideal gas. For example, in one problem used in their study, students were asked to consider a cylindrical pump (a diagram was provided) containing one mole of an ideal gas. The piston fit tightly so that no gas could escape, friction between the piston and the cylinder was to be considered negligible, and the piston was thermally isolated from the surroundings. In one version of the problem, students were asked what would happen to the temperature of the gas, and why, if the piston were quickly pressed inward. Another type of problem posed in the same research provided a cyclic process on a PV diagram in which parts of the cycle were isothermal, isobaric, and isochoric. Students were asked whether the work done in the entire cycle was positive, negative, or zero and to explain their reasoning.
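For reference, the standard expert reasoning behind these two problems can be written compactly. This is a sketch of textbook first-law reasoning, not equations taken from the study itself; we use the convention that W denotes the work done by the gas:

```latex
% Quick adiabatic compression: the process is too fast for heat exchange, so Q = 0.
% The gas is compressed, so the work done BY the gas is negative, W < 0, and
\Delta U = Q - W = -W > 0
% For an ideal gas, U depends only on T, so the temperature increases.

% Cyclic process on a PV diagram: the net work done by the gas over the full
% cycle equals the signed area enclosed by the cycle on the diagram
% (positive for a clockwise cycle, negative for a counterclockwise one):
W_{\mathrm{cycle}} = \oint P\,dV
```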
Another investigation of students' reasoning about heat, work, and the first law of thermodynamics in an introductory calculus-based physics course by Meltzer [40] posed several conceptual problems, some of which involved PV diagrams. For example, one problem in this study involved two different processes represented on a PV diagram that started at the same point and ended at the same point. Students were asked to compare the work done by the gas and the heat absorbed by the gas in the two processes and to explain their reasoning.
In another investigation, focusing on student understanding of the ideal gas law from a macroscopic perspective, Kautz et al. [41] posed several conceptual problems. For example, in one problem for which a diagram was provided, three identical cylinders were filled with unknown quantities of ideal gases and closed with identical frictionless pistons. Cylinders A and B were in thermal equilibrium with the room at 20°C, and cylinder C was kept at a temperature of 80°C. The students were asked whether the pressure of the nitrogen gas in cylinder A is greater than, less than, or equal to the pressure of the hydrogen gas in cylinder B, and whether the pressure of the hydrogen gas in cylinder B is greater than, less than, or equal to the pressure of the hydrogen gas in cylinder C, and to explain their reasoning.
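The expert response to this piston problem hinges on mechanical equilibrium rather than on the ideal gas law alone. A sketch of that reasoning follows, assuming vertical cylinders with the pistons on top (the usual setup; the orientation is our assumption, not stated above):

```latex
% A frictionless piston of mass m and cross-sectional area A is in mechanical
% equilibrium, so the gas pressure is set by the piston and the atmosphere:
P_{\mathrm{gas}} = P_{\mathrm{atm}} + \frac{mg}{A}
% Because the cylinders and pistons are identical, P_A = P_B = P_C regardless
% of the gas species (nitrogen vs. hydrogen), the amount of gas, or the
% temperature; raising the temperature of cylinder C changes its volume,
% not its pressure.
```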
Another investigation, by Cochran et al. [42], focused on student conceptual understanding of heat engines and the second law of thermodynamics. For example, in one question students were given the diagram of a proposed heat engine (including the temperatures of the hot and cold reservoirs, the heat absorbed from the hot reservoir, the heat flow to the cold reservoir, and the work done in one cycle) and asked whether the device as shown could function, and why. In another investigation, Bucy et al. [43] focused on student understanding of entropy in the context of comparing ideal gas processes. For example, students were asked to compare the change in entropy of an ideal gas in an isothermal expansion with that in a free expansion into a vacuum, and to explain whether the change in entropy of the gas in each case is positive, negative, or zero, and why. In another investigation, by Christensen et al. [44], students' ideas regarding entropy and the second law of thermodynamics in an introductory physics course were studied. They found that students struggled to distinguish between the entropy of the system and that of the surroundings and had great difficulty with spontaneous processes. Another investigation, by Smith et al. [45], focused on student difficulties with concepts related to entropy, heat engines, and the Carnot cycle and on how student understanding can be improved.
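For the heat-engine feasibility question in the Cochran et al. study above, the expert check combines both laws. A compact sketch using standard textbook relations, with Q_H, Q_C, and W the per-cycle heat absorbed, heat rejected, and work done:

```latex
% First law (energy balance per cycle):
W = Q_H - Q_C
% Second law (total entropy cannot decrease over a cycle):
\Delta S_{\mathrm{univ}} = \frac{Q_C}{T_C} - \frac{Q_H}{T_H} \ge 0
% Equivalently, the efficiency cannot exceed the Carnot limit:
\eta = \frac{W}{Q_H} \le 1 - \frac{T_C}{T_H}
% A proposed device violating either condition cannot function as described.
```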

C. Rationale for developing STPFaSL-Long
Before we discuss the development and validation of the STPFaSL-Long instrument, we discuss the rationale for developing this long version when the STPFaSL-Short version already exists [34,35]. In particular, all the scenarios for the 19 contexts and the wording of the questions preceding the options students select are identical in both versions of the survey instrument, which addresses thermodynamic processes and the first and second laws covered in introductory physics courses. The two versions differ only in the multiple-choice options.
The rationale for developing the STPFaSL-Long version of the instrument came from discussions with some thermodynamics instructors after the STPFaSL-Short version was developed and validated. Some instructors noted that they prefer to give surveys in which no alternative conceptions are embedded in the questions, since they did not want students to be misled to an incorrect answer by gravitating toward alternative conceptions. They felt that, in the absence of the alternative conceptions, their students might perform significantly better because they would think about each option more carefully instead of being misled by alternative conceptions embedded in some of the questions in the short version. Moreover, in the short version, students were often asked about more than one thermodynamic variable in a single question. Since there are only five choices for each question, the short version provided only those choices that students had selected with the highest frequency in the written open-ended administration of these questions and in individual interviews during its development. However, some instructors noted that they would prefer a version in which all possible choices for each thermodynamic variable are provided in each situation, e.g., the internal energy of a gas in a given situation increases, decreases, remains the same, or there is not enough information.
Therefore, based upon the feedback from these instructors, we developed and validated the longer version of the instrument, in which no alternative conceptions are explicitly embedded in the four multiple-choice options students are asked to choose from. Moreover, the questions asked in a given context in one item of the shorter version were split into several items focusing, e.g., on different thermodynamic variables. To give a concrete example, item 1 in the short version asks students to choose the correct statement about the change in entropy of a gas that undergoes a reversible adiabatic expansion, with incorrect options involving alternative conceptions such as "Entropy of the gas increases because the gas expands" or "Entropy of the gas remains constant because the entropy of the gas does not change for a reversible process". The corresponding item in the long version, on the other hand, does not have alternative conceptions embedded in the options. It asks students to choose the correct statement about the change in entropy of the gas that undergoes a reversible adiabatic expansion, with options such as: the entropy of the gas increases, decreases, remains constant, or there is not enough information.
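For context, the expert answer to this item follows in one line from the definition of entropy change for a reversible process (a textbook sketch, not wording from the survey):

```latex
% For a reversible process, dS = dQ_rev / T; for an adiabatic process, dQ = 0
% at every point along the path, so
\Delta S_{\mathrm{gas}} = \int \frac{dQ_{\mathrm{rev}}}{T} = 0
% i.e., the entropy of the gas remains constant. The distractor "entropy does
% not change for a reversible process" overgeneralizes: in a reversible
% ISOTHERMAL expansion of an ideal gas, \Delta S_{\mathrm{gas}} = nR\ln(V_f/V_i) > 0.
```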
To give another concrete example, item 2 in the short version asks which one of the following statements must be true for the gas that undergoes a reversible adiabatic expansion process:
a. The internal energy must decrease, and the work done by the gas must be positive.
b. The internal energy must decrease, and the work done by the gas must be negative.
c. The internal energy must increase, and the work done by the gas must be positive.
d. The internal energy must increase, and the work done by the gas must be negative.
e. None of the above.
Since not all possible choices can be provided for the internal energy and the work done by the gas in the short version (e.g., there is no option with one or both of them remaining constant), the long version of the instrument splits this item into two separate items, one asking about the internal energy and the other about the work done by the gas, with each question providing all possible options (e.g., the internal energy must decrease, increase, remain constant, or there is not enough information; the work done by the gas must be positive, negative, zero, or there is not enough information).
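The expert reasoning behind this item is again a one-line application of the first law (a sketch using the convention that W is the work done by the gas):

```latex
% Reversible adiabatic expansion: Q = 0, and the expanding gas does positive
% work on the surroundings, W > 0. Then
\Delta U = Q - W = -W < 0
% so the internal energy must decrease while the work done by the gas is
% positive (choice a above; in the long version, the two split items have
% answers "decrease" and "positive," respectively).
```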
Based upon discussions with instructors, we believe that different instructors may prefer one version or the other based upon the features they value. For example, some instructors may prefer the short version, which has alternative conceptions embedded in some of the items and in which an item can ask about two thermodynamic variables with restricted options (as in the example with internal energy and work discussed above), since those restricted options correspond to the most common choices students selected in the written open-ended questions and individual interviews. However, as noted, other instructors may prefer the long version, with no alternative conceptions and with each item restricted to only one thermodynamic variable with all possible options provided for that variable. We note that when we administered both the long and short versions of the instrument to first-year, first-semester Ph.D. students in physics at an interval of approximately two weeks, the Pearson correlation coefficient between the two versions was 0.88, which is relatively high and suggests that both versions measure very similar things. Thus, instructors can use the version they prefer. Moreover, both versions take approximately the same amount of class time and can be administered in a 50-minute class period.
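The consistency check between the two versions reported above is an ordinary Pearson correlation between paired scores. As a hedged illustration of the computation (the function name and the score values below are invented for illustration; they are not the study's data):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical per-student percentage scores on the short and long versions.
short_scores = [55, 70, 62, 81, 47, 90, 66]
long_scores = [58, 72, 60, 85, 50, 88, 64]
r = pearson_r(short_scores, long_scores)  # close to 1 for highly consistent scores
```

A value near 1 indicates that students are ranked almost identically by the two versions, which is the sense in which the two instruments "measure very similar things."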

II. STPFaSL INSTRUMENT DEVELOPMENT AND VALIDATION
The development and validation process of the two versions of the survey instrument, STPFaSL-Long and STPFaSL-Short [34,35], was analogous to those for the earlier conceptual survey instruments developed by our group [16,18,21,22,25]. Our process is consistent with the established guidelines for test development and validation using Classical Test Theory (CTT) [37]. According to the standards for multiple-choice survey instrument design, a high-quality test instrument should have five characteristics: reliability, validity, discrimination, good comparative data, and suitability for the population [34,35]. Moreover, the development and validation of a well-designed survey instrument is an iterative process that should involve recognizing the need for the survey instrument, formulating the test objectives and scope for measurement, constructing the test items, performing content validity and reliability checks, and distribution [34,35].
Below, we describe the development, validation, and administration of the STPFaSL-Long version of the instrument by first summarizing how the identical contexts for both the short and long versions were developed and validated and how the items involving various contexts in the short version were split into several items without changing the wording of the scenarios posed, and then presenting data from the administration of the final long version of the survey instrument. As noted earlier, the contexts are identical for the two versions of the STPFaSL instrument, and the difference between them is in the answer options students must choose from.

A. Development of test blueprint, formulation of test objectives and scope
More details about the development of the STPFaSL-Short version of the survey instrument, which are also relevant for the development and validation of the long version discussed here, can be found in Refs. [34,35]. Summarizing the development common to both versions: before developing the instrument, we first developed a test blueprint to provide a framework for deciding the desired test attributes [34,35]. The test blueprint provided an outline and guided the development of the test items. Developing the blueprint entailed formulating the need for the survey instrument and determining its scope, format, and testing time, as well as the weights of different sub-topics consistent with the scope and objective of the test. The specificity of the test plan helped determine the extent of content covered and the complexity of the questions. As noted in the introduction, despite the existence of several thermodynamics survey instruments at the introductory level [27][28][29][30][31][32], there is no research-validated survey instrument that focuses on the basics of thermodynamic processes and the first and second laws of thermodynamics covered in introductory physics courses. Therefore, we developed and validated an instrument (with two versions: STPFaSL-Short and STPFaSL-Long) focusing on these content areas. With regard to the weights of different sub-topics, we surveyed introductory physics textbooks, consulted with seven faculty members, and looked at the kinds of questions they asked their students in homework, quizzes, and exams before determining the 19 contexts.
Both versions of the STPFaSL instrument consist of multiple-choice conceptual items on thermodynamic processes and the first and second laws covered in both calculus-based and algebra-based introductory physics courses; however, no calculus is required for students to complete the surveys. Both versions can be used to measure the effectiveness of traditional and/or research-based approaches for helping introductory students learn the thermodynamics concepts covered in the survey. Specifically, the survey instrument is designed to be a low-stakes test that measures the effectiveness of instruction in helping students in a particular course develop a good grasp of the concepts covered; it is not appropriate for high-stakes testing. The survey instrument can be administered before and after instruction in relevant concepts to evaluate introductory physics students' understanding of these concepts and to evaluate whether innovative curricula and pedagogies are effective in reducing the difficulties. With regard to testing time, the survey instrument (both short and long versions) is designed to be administered in one 50-minute class period, although instructors should feel free to give extra time to their students as they deem appropriate. As noted, the reason the longer version of the STPFaSL does not take significantly more time is that all contexts and their wording are identical to the shorter version and each item in the longer version concerns one thermodynamic variable (e.g., entropy, internal energy, heat transfer, work done by the system). Additionally, as noted, the answer choices in the long version do not explicitly include student alternative conceptions but rather focus on how each of these individual variables changes; the answer choices are "increases," "decreases," "remains the same," and "not enough information," or the questions are framed as true/false questions. One exception is item 36 in the longer version, for which not only the stem but also the four answer choices are identical to those of item 19 in the shorter version.
We focused the survey content on thermodynamic processes and the first and second laws at a level basic enough that the survey instrument is appropriate for both algebra-based and calculus-based introductory physics courses in which these thermodynamics topics are covered. We also made sure that the survey instrument has questions at different levels of cognitive achievement [34,35]. To formulate test objectives and scope pertaining to thermodynamic processes and the first and second laws, the survey instrument development started by consulting with seven instructors who regularly teach calculus-based and algebra-based introductory physics courses in which these topics are covered. We asked them about the goals and objectives they have when teaching these topics and what they want their students to be able to do after instruction in relevant concepts. In addition to perusing the coverage of these topics in several algebra-based and calculus-based introductory physics textbooks, we reviewed homework, quiz, and exam problems that these instructors at a large research university had typically given to their students in the past before determining the test objective and the scope of the test in terms of the actual content, and before starting the design of the questions for the instrument. The preliminary distribution of questions across topics was discussed, iterated several times, and finally agreed upon with seven introductory physics course instructors at a large research university.
The selection of topics for the questions included consultation with seven instructors who teach introductory thermodynamics (some of whom had also taught upper-level thermodynamics) about their goals and objectives and the types of conceptual and quantitative problems they expected their students in introductory physics courses to be able to solve after instruction. The wording of the questions took advantage of the existing literature regarding student difficulties in thermodynamics, input from students' written responses and interviews, and input from physics instructors who teach these topics. In addition to leveraging the findings of prior research on students' understanding of these concepts, administering some written open-ended questions and conducting individual interviews with students at a large research university helped us develop the survey items [34,35]. Moreover, as part of the development and validation of the survey, the concepts involved in the STPFaSL instrument and the wording of the questions were independently evaluated by four physics faculty members who regularly teach thermodynamics at the large research university (in addition to the feedback from members of the Physics Education Research (PER) group there) and iterated many times until agreed upon. Moreover, two faculty members from other universities who are experts in thermodynamics PER provided invaluable feedback several times to improve the quality of the survey questions.
As noted, the scope of both the short and long versions of the survey instrument is the same (identical 19 contexts and wording of the stems of the questions) except that the longer version splits each context in the shorter version into several items, each asking about only one thermodynamic variable with choices such as "increases," "decreases," "remains the same," or "not enough information," while some of the questions are framed as true/false questions. Thus, the longer version separates questions about different thermodynamic variables for the same context and scenario in order to make it easier for instructors to disentangle difficulties with different thermodynamic variables in a given scenario. In addition, the four choices provided for each question in the longer version do not include any student alternative conceptions, unlike the shorter version, which should appeal to instructors who prefer not to have alternative conceptions embedded in the choices provided.
As described in detail in Refs. [34,35], we interviewed individual students using a think-aloud protocol at various stages of the survey instrument development to obtain a better understanding of students' reasoning processes when answering the free-response and multiple-choice questions. Fine-tuning of the instrument based upon statistical analysis using classical test theory (discussed in the next section for the longer version) was conducted on different iterations of both the short and long versions of the survey instrument as the items were being refined. Within this interview protocol, students were asked to talk aloud while they answered the questions so that the interviewer could understand their thought processes. Individual interviews with students during development of the survey instrument were useful for an in-depth understanding of the mechanisms underlying common student difficulties and for ensuring that students interpreted the questions appropriately. Based upon the student feedback, the questions were refined and tweaked. During various stages of the development and validation process, 24 students in various algebra-based and calculus-based physics courses participated in the think-aloud interviews. Ten graduate students and undergraduates who had learned these concepts in an upper-level thermodynamics and statistical mechanics course were also interviewed while taking the shorter version (whose stems are identical to the longer version). In addition, 11 introductory physics students and 6 physics graduate students were interviewed while answering the STPFaSL-Long version of the instrument using a think-aloud protocol to understand how they reasoned about the different choices on each question of the longer version. The purpose of involving some advanced students in these interviews was to compare the thought processes and difficulties of the advanced students with those of introductory students for benchmarking purposes. This type of benchmarking has been valuable for illustrating the growth of student understanding in prior research [64]. We found that students' reasoning difficulties across different levels are remarkably similar except in a few instances; e.g., advanced students were more facile at reasoning with PV diagrams than introductory students.
The final version of the STPFaSL-Long survey instrument has 78 multiple-choice items, including 22 true/false questions, and each question has one correct choice and three incorrect choices. To aid instructors who administer the STPFaSL-Long in their classes, Figs. 1a-1c in Appendix A classify the broad categories of topics covered in each of the 78 items as Processes, Systems, Quantities & Relations, Representation, the First Law of Thermodynamics, and the Second Law of Thermodynamics. The Processes category includes items which require understanding of thermodynamic constraints, such as whether a process is reversible, isothermal, isobaric, or adiabatic; also included are problems involving irreversible and cyclic processes. We note that these different processes are not necessarily exclusive; e.g., one can consider an isothermal reversible process. The Systems category includes items involving knowledge of the distinction between a system and the universe and items involving subsystems or an isolated system. The Systems category also includes items in which a student could make progress by using the fact that the system is an ideal gas (e.g., for an ideal gas, the internal energy and temperature have a simple relationship which can be used to solve a problem). Quantities & Relations includes survey items specific to a quantity such as internal energy, work, heat, or entropy, and their quantitative relationships. For example, the relationship between work and the area under the curve on a PV diagram is tested in several problems. The Representation category includes items in which a process is represented on a PV diagram. Finally, the last two categories include items requiring the First Law and the Second Law of Thermodynamics. We classified questions about heat engines into the Second Law of Thermodynamics category (although heat engines involve both the first and second laws) due to the particular focus of these items.

B. Validation of the survey instrument
While developing and validating the STPFaSL survey instrument (both short and long versions), we paid particular attention to the issues of reliability and validity [37,67-70]. Test reliability refers to the internal consistency of the test, or the relative degree of consistency between the scores if an individual immediately repeats the test, and validity refers to the appropriateness of interpreting the test scores [37,67-70]. We note that the STPFaSL instrument is appropriate for making interpretations about the effectiveness of instruction in relevant concepts in a particular course and is not supposed to be used for high-stakes testing of individual students.
Face validity of a test refers to whether it appears to be a good measure of students' understanding of the concepts measured by the test, and content validity refers to whether the experts in the discipline who review the test agree that the items adequately cover all aspects of the construct being measured [37]. In the earlier subsections of Section II, we already discussed face validity and content validity, including how expert and student involvement (via open-ended questions and individual interviews) was used for this purpose. Also, although the survey instrument focuses on concepts that are typically covered in introductory thermodynamics and is appropriate for introductory physics courses, it was also administered to undergraduates in upper-level thermodynamics and statistical mechanics courses, in which these concepts are generally covered, and to first-year, first-semester physics Ph.D. students to obtain baseline data and to check one form of criterion validity, or concurrent validity [71,72]. In particular, we discuss the concurrent validity of the STPFaSL-Long survey instrument using a comparison between introductory and advanced students' performance (e.g., checking whether advanced students performed better than introductory students on the survey). Since the quantitative measures validating the STPFaSL-Short instrument were described earlier [34,35], below we focus only on the STPFaSL-Long instrument in terms of the quantitative measures used in classical test theory for a reliable survey instrument, including item analysis (using item difficulty and the point-biserial coefficient) and KR-20 [37,73,74].
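The CTT quantities named here can all be computed from a binary scoring matrix (1 = correct, 0 = incorrect). A minimal sketch follows, with the caveats that the function name and data layout are our own and that the point-biserial shown is the uncorrected form (the item is not removed from the total score):

```python
import math

def item_stats(responses):
    """Item difficulty, point-biserial discrimination, and KR-20 reliability.

    responses: list of per-student lists of 0/1 item scores.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_t = sum(totals) / n_students
    var_t = sum((t - mean_t) ** 2 for t in totals) / n_students  # population variance
    sd_t = math.sqrt(var_t)

    difficulties, point_biserials = [], []
    for j in range(n_items):
        col = [row[j] for row in responses]
        p = sum(col) / n_students  # item difficulty = fraction answering correctly
        difficulties.append(p)
        if 0 < p < 1 and sd_t > 0:
            # Mean total score of the students who answered item j correctly.
            mean_correct = sum(t for t, c in zip(totals, col) if c) / sum(col)
            r_pb = (mean_correct - mean_t) / sd_t * math.sqrt(p / (1 - p))
        else:
            r_pb = 0.0  # undefined when everyone (or no one) answers correctly
        point_biserials.append(r_pb)

    # KR-20: internal-consistency reliability for dichotomously scored items.
    k = n_items
    kr20 = (k / (k - 1)) * (1 - sum(p * (1 - p) for p in difficulties) / var_t)
    return difficulties, point_biserials, kr20
```

Commonly quoted CTT rules of thumb are item difficulties roughly between 0.2 and 0.8, point-biserial coefficients above about 0.2, and KR-20 above about 0.7 for group-level use.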

C. Proctored and unproctored administration of STPFaSL-Long
After the development and validation of STPFaSL-Long, data were collected using the final version administered in various classes. These include 12 different in-person classes (four different institutions) in which students answered the questions in class on paper scantron forms (proctored administration) and 12 other in-person classes (five different institutions) in which students answered the questions online on Qualtrics within a two-hour period from the time they started the survey (unproctored administration). Even though all classes were taught in person, STPFaSL-Long was administered unproctored via Qualtrics in half of the classes because many course instructors were reluctant to spend an entire class period to administer it. Since physics instructors often feel that they must cover a lot of content and cannot spare a class period for a survey, online administration even in in-person classes is commonly encountered in physics courses [75]. Since the data from different institutions for the same type of course (e.g., calculus-based introductory physics) are similar, average combined data from different institutions for the same course type are presented here. All data presented here are from primarily traditional lecture-based physics courses (in-class administration data are in Appendix B, Qualtrics data for students who took the survey online are in Appendix C, and the STPFaSL-Long survey instrument is in the supplementary materials) so that instructors in courses covering the same concepts but using innovative pedagogies can compare their students' performance with the averages provided here to gauge the relative effectiveness of their instructional design and pedagogies. We combined the upper-level undergraduates and first-year, first-semester physics Ph.D. students since the first-year graduate students had thermodynamics instruction only in their upper-level undergraduate courses.
In the in-person administration of the full 78-item survey instrument, students were given the entire class period (typically 50 minutes), whereas in the online administration via Qualtrics, students were given two hours from the time they started the survey, in one sitting, in a given week after instruction in relevant concepts. In consultation with several instructors, we settled on two hours for the online unproctored administration because students not taking the survey in class may be interrupted by various things, including visitors and phone calls, but we did not want to give students unlimited time, in order to limit the opportunity to look up answers or consult with others (the unproctored Qualtrics version explicitly noted that students were not supposed to consult any resources, including books, notes, the internet, or friends). Moreover, STPFaSL-Long was administered before instruction (as a pretest) in relevant concepts to investigate student understanding before instruction in some of the introductory physics courses at one university only, because other instructors did not think it was necessary and were reluctant to administer the survey both before and after instruction in their classes. For all pretest data and some of the in-person administered introductory posttest data, students were given only the first 48 items or the last 52 items of the 78-item survey. In all undergraduate courses, students were given extra-credit incentives to take the survey, consistent with the suggestions of the Effective Practices for Physics Programs Guide section on How to Select and Use Various Assessment Methods in your Program [76]. In particular, if students are not given any grade incentive, some do not take the survey seriously, but the survey should remain a low-stakes assessment consistent with our goals.

D. Overall performance, reliability, item difficulty and point biserial coefficient
After the development and validation of the STPFaSL-Long instrument, as noted, the final version was administered as a proctored assessment in 12 different classes (four different institutions) in which students answered the questions in class on paper scantron forms and as an unproctored assessment in 12 different classes (five different institutions) in which students answered the questions online on Qualtrics within a two-hour time window. All 24 classes were taught in person. The survey instrument was administered to introductory physics students in various traditionally taught calculus-based and algebra-based classes after traditional lecture-based instruction in relevant concepts. It was also administered to upper-level undergraduates majoring in physics, taught traditionally, and to Ph.D. students for benchmarking purposes and for one type of criterion validity (concurrent validity), which involved comparing their performance with that of the introductory students for whom the survey is intended. We find that although the survey instrument focuses on thermodynamics concepts covered in introductory courses, it is challenging even for advanced students. Moreover, in some introductory courses at one university, as noted, the pretest was administered before students learned about thermodynamics in that course and the posttest was administered after instruction in relevant concepts. We note that some introductory physics instructors teach thermodynamics at the end of the first-semester introductory physics course while others teach it at the beginning or in the middle of a second-semester course. We did not find any trends based upon these differences, so all data are included for a particular group of students.
While students who took the survey online via Qualtrics were told that they were not supposed to consult any resources, we had no method for verifying this. On the other hand, the in-class administration was proctored by the instructor of the course. Due to the proctored and unproctored nature of the two administrations, we feel that these data are best kept separate so that instructors using a similar mode of administration for a particular group of students can compare their course's performance with the averages provided here.
Table I shows in-class administration data pertaining to average student performance on the STPFaSL-Long instrument from introductory calculus-based and algebra-based courses for matched and unmatched students on the pretest (before instruction) and posttest (after instruction) from a university where the pretest was also administered. For calculus-based courses, some of the instructors administered only the pretest or the posttest, but not both. The matched and unmatched data in Table I are only for those instructors who administered both the pretest and the posttest. This table shows that the matched and unmatched data are very similar. Therefore, in the rest of the paper, we include all data available for a particular group. Moreover, we conducted a t-test and found that the scores of the calculus-based and algebra-based classes (both for the pretest and the posttest) are not statistically significantly different. However, we have kept these scores separate since many instructors/researchers are often interested only in data from calculus-based or algebra-based physics classes. One way to measure the reliability of a test instrument would be to prepare an ensemble of identical students, administer the test to them, and analyze the resulting distribution of item and overall scores. Since this is generally impractical, a method is instead devised to use subsets of the test itself and consider the correlation between different subsets. The Kuder-Richardson reliability index, or KR-20 [37,73,74], which is a measure of the self-consistency of the entire test instrument, can take a value between 0 and 1 (it divides the full instrument into subsets, and the consistency between the scores on different subsets is estimated). If guessing is high, KR-20 will be low. Table II shows the number of students in each group averaged across similar classes to whom the final version of the survey was administered, as well as the average performance of different groups on the entire survey instrument for in-class and online administrations using Qualtrics; KR-20 values for the posttests for each group that took the entire survey are also shown. The KR-20 values for all groups in Table II are reasonable for predictions about a group of students [38,74,75]. We note that the written student data are from 12 different in-person courses at four different large public institutions; students who took the entire survey completed it in class on scantron forms during a 50-minute class period. Not all introductory students answered all survey questions: some introductory students were given only the first 48 or last 52 questions during recitation sessions to ensure split-test reliability. On the posttest (pretest), out of the 492 (753) Int-calc students, 168 (505) were given the first 48 questions, 73 (248) were given the last 52 questions, and 251 (0) were given the full survey. On the posttest (pretest), out of the 550 (371) Int-alg students, 170 (173) were given the first 48 questions, 218 (198) were given the last 52 questions, and 162 (0) were given the full survey.
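For readers who wish to compute KR-20 for their own class data, a minimal sketch of the standard formula is shown below. The function name and example score matrix are our own illustrations, not part of the survey analysis pipeline.

```python
import numpy as np

def kr20(scores):
    """Kuder-Richardson 20 reliability for a binary (0/1) item-score matrix.

    scores: shape (n_students, n_items); 1 = correct, 0 = incorrect.
    KR-20 = k/(k-1) * (1 - sum(p_i * q_i) / var(total scores)),
    where p_i is the fraction of students answering item i correctly
    and q_i = 1 - p_i.
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                   # number of items
    p = scores.mean(axis=0)               # item difficulties
    q = 1.0 - p
    total_var = scores.sum(axis=1).var()  # population variance of total scores
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_var)
```

A perfectly consistent score matrix (every student answers all items the same way) yields a KR-20 of 1, while inconsistent response patterns drive the value down, which is why high guessing rates lower KR-20.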
The item difficulty of each multiple-choice question on the STPFaSL-Long instrument is simply the percentage of students who correctly answered the question, i.e., it is the average score on a particular item. Results in Table III in Appendix B show not only the item difficulty of each item on the instrument but also the prevalence of different incorrect choices for each item for each group for in-class administration. The corresponding data for online administration using Qualtrics are shown in Table IV in Appendix C.

TABLE II. The average performance and standard deviation (SD) of different groups on the STPFaSL-Long, the number (N) of students who participated in the survey in each group, as well as KR-20 for posttests when students took the entire survey. "Upper Post" consists of advanced undergraduate students who had learned the relevant concepts in an upper-level thermodynamics and statistical mechanics course and first-year physics Ph.D. students in their first semester of the Ph.D. program who had not taken the graduate-level thermodynamics and statistical mechanics course at their institution. The pretest (Pre) was administered before students learned relevant concepts in the course and the posttest (Post) was administered after relevant instruction in the calculus-based and algebra-based introductory physics courses. For in-person administration of the survey for introductory students only, some students were given the entire 78-item survey, indicated by (78). Some students were given only the first 48 items or the last 52 items of the 78-item survey; these groups are represented by (48) and (52), respectively.

The point biserial coefficient (PBC) is designed to measure how well a given item predicts the overall score on a test. It is defined as the correlation coefficient between the score for a given item and the overall score. The PBC can take values between -1 and 1; a negative value indicates that otherwise high-performing students score poorly on the item while otherwise low-performing students do well on it. The point biserial coefficients are shown in Figure 2 for in-person administration and Figure 3 for online administration in Appendix D. A commonly used criterion [37,73,74] states that it is desirable for this measure to be greater than or equal to 0.2, which is exceeded by 71 of the 78 items on the STPFaSL for the in-person implementation (and only one question falls below 0.2 for the online implementation, at 0.16). Five of the seven in-person items with a PBC less than 0.2 focus on entropy, on which the advanced students did not necessarily do better than the introductory groups.
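Item difficulty and the PBC as defined above can both be computed directly from a 0/1 score matrix. The sketch below (with an illustrative function name and made-up data, not the study's records) treats the PBC as the Pearson correlation between each item's scores and the students' total scores:

```python
import numpy as np

def item_stats(scores):
    """Item difficulty and point biserial coefficient for each item.

    scores: (n_students, n_items) matrix; 1 = correct, 0 = incorrect.
    Returns (difficulty, pbc), each an array of length n_items.
    """
    scores = np.asarray(scores, dtype=float)
    totals = scores.sum(axis=1)        # each student's overall score
    difficulty = scores.mean(axis=0)   # fraction answering each item correctly
    # PBC: Pearson correlation of each item's scores with the total scores
    pbc = np.array([np.corrcoef(scores[:, j], totals)[0, 1]
                    for j in range(scores.shape[1])])
    return difficulty, pbc
```

A negative PBC for an item then signals that students with high totals tended to miss that item, matching the interpretation given above.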

E. Note on items with point biserial coefficient below the threshold
Five of the seven questions below the threshold for the in-class administration (items 5, 8, 24, 48, and 64) are related to entropy, on which advanced students' average performance is comparable to that of introductory physics students. Our analysis shows that many advanced students performed poorly on these questions because they incorrectly thought that the entropy of a system always increases (e.g., in one cycle of a cyclic process). These findings are consistent with Meltzer's findings [52]. The other two items with PBC below the threshold are item 36 (basics of the Carnot engine) and item 70 (the magnitude of work done in an adiabatic process is the same as the magnitude of heat transfer in a constant-volume process if the change in internal energy is the same in both processes). On item 36, the average scores of all groups are at or below 40%, and on item 70, at or below 30%, which implies that these items are extremely difficult for students at all levels. The poor performance of all groups reduced the PBC. We believe that the issues covered in these two items are important ones that instructors should focus on to help students at all levels learn them better, so we retained these items.

F. Reliability via splitting the survey and administering each half to different students in introductory classes
We also tested reliability by splitting the survey instrument into two parts [37] without splitting related questions involving a given scenario, with 48 questions in one part and 52 questions in the other; 22 items were common between the two parts (these 22 common items were the last 22 questions in the 48-question part and the first 22 questions in the 52-question part). At one university, we randomly administered one part (48 questions) or the other (52 questions) in class to students in introductory physics courses. The performances of calculus-based introductory physics students who were administered the two versions on the common 22 questions were 58.5% vs. 58.8%, which is similar, providing further evidence of the reliability of the survey, particularly because the order in which students encountered these questions differed, i.e., for some students they were the last 22 questions while for others they were the first 22 questions. Also, the performances of these students on all 48 or 52 questions were 57.6% and 55.0%, respectively, showing that the first and second halves of the survey are comparable in difficulty.

G. Concurrent validity via administration to student groups at different levels
Earlier we discussed other forms of validity, e.g., face validity and content validity. Here we discuss one type of criterion validity, concurrent validity, based on administering the survey instrument to upper-level students and Ph.D. students (advanced students) [71,72]. Criterion validity shows how well a test correlates with an established standard of comparison called a criterion [37]. One measure of this type of validity, concurrent validity, can come from the expectation that introductory students will be outperformed by advanced students. As noted, a large number of students from introductory courses in which these concepts were covered were administered the final version of the survey instrument, and advanced students in thermodynamics and statistical mechanics courses were administered the survey for benchmarking and for establishing the concurrent validity of the instrument (see Table II).
The average data for calculus-based and algebra-based introductory students as well as advanced students tabulated in Table II show the expected trends for the proctored in-class administration of the survey. In particular, advanced students, with a 76% average, outperformed introductory physics students on the posttest (calculus-based courses averaged 57% and algebra-based courses averaged 55%). The average pretest performances of the introductory physics groups were 52% and 51% (lower than the posttest). These expected trends serve as a measure of concurrent validity, although the STPFaSL survey is difficult enough that no group performed at a high level.
The data for the unproctored online administration (see Table II) show that advanced students obtained an average of 68%, vs. 70% and 60%, respectively, for the calculus-based and algebra-based introductory physics courses after instruction in relevant concepts. Thus, the introductory physics groups performed better in the online administration, and the calculus-based introductory students performed exceptionally well, comparable to the advanced students. The difference between the average scores for the proctored in-person and unproctored Qualtrics administrations was -8% for advanced students, 5% for the algebra-based introductory physics students, and 13% for the calculus-based introductory physics students. In particular, in the online administration via Qualtrics, even though advanced students performed better than the algebra-based posttest group, the calculus-based posttest averages are comparable to the advanced students' averages. We note that in the unproctored administration, the concurrent validity holds only for the comparison between algebra-based introductory physics students and advanced students. It is not possible to pinpoint the reasons why calculus-based introductory students performed so well (comparable to advanced students) in the unproctored online administration of the survey. Introductory students may be more relaxed when taking the unproctored survey out of class, which may be one reason these groups performed better on average. However, we hypothesize that the better online performance may at least partly be due to a higher percentage of calculus-based introductory students consulting resources (which they were asked not to consult) in the unproctored Qualtrics survey compared to the advanced students. In particular, while all students were told that the online survey was a closed-book, closed-notes test and that they were not supposed to consult other resources, if some students in a particular group consulted external resources when answering the survey questions, that can affect the overall performance of the group [77,78]. These data are still useful for what we can expect from an online administration of the survey to each group of students in which student response counted for extra credit [76]. In particular, it is still very useful to have the unproctored Qualtrics data since many instructors are only willing to administer these surveys online, and they should be aware of these constraints when analyzing their own class data.

H. Validity via measuring correlation between long and short versions
Some first-year, first-semester graduate students were administered both versions of the STPFaSL survey at an interval of approximately two weeks. The Pearson correlation between the short and long versions for these students is 0.88, a relatively high correlation showing that the two versions measure very similar constructs, so instructors can use whichever version they prefer. This high correlation with the already validated shorter version [34,35] further supports the validity of the longer version.
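The correlation quoted above is the standard Pearson coefficient on paired scores. A minimal sketch is shown below; the score values are purely illustrative and are not the study's data.

```python
import numpy as np

# Hypothetical paired percentage scores for students who took both
# versions of the survey (illustrative values only, not the study's data).
short_scores = np.array([62.0, 75.0, 48.0, 81.0, 70.0, 55.0])
long_scores = np.array([60.0, 78.0, 50.0, 85.0, 66.0, 58.0])

# Pearson correlation between the two versions of the survey
r = np.corrcoef(short_scores, long_scores)[0, 1]
```

Values of r close to 1 indicate that students who score high on one version tend to score high on the other, supporting the claim that the two versions rank students similarly.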

I. A glance at student difficulties on the validated survey
Details about student difficulties found using the STPFaSL-Long instrument and comparison with the STPFaSL-Short or prior studies are beyond the scope of this paper and will be presented elsewhere. However, we note that since the STPFaSL-Long instrument has been administered to a large number of students at different institutions in in-person and online formats, quantitative conclusions can be drawn about the prevalence of the many conceptual difficulties students have with these fundamental concepts in thermodynamics (see Table III in Appendix B for the average performance of each group on each item for in-class administration and Table IV in Appendix C for online administration). Some of the conceptual difficulties displayed on the survey instrument include difficulty reasoning with multiple quantities simultaneously, difficulty systematically applying various constraints (for an isothermal, adiabatic, isochoric, reversible, or irreversible process, an isolated system, etc.), and difficulty due to oversimplification of the first law and overgeneralization of the second law. As noted, many of these difficulties were inspired by and incorporated into the survey instrument based upon those that have been documented (e.g., see Ref. ). Moreover, our findings with this validated survey instrument demonstrate the robustness of previous findings, e.g., in Ref. , about student difficulties with these concepts both in contexts that have previously been studied and in new contexts.

III. SUMMARY
We developed, validated, and administered the longer version of the STPFaSL instrument, a conceptual multiple-choice test focusing on thermodynamic processes and the first and second laws at the level covered in introductory physics courses. This survey instrument is a longer version of the shorter one developed and validated earlier in 2015. The 19 contexts, including the exact wording of all of the problem situations posed in the two survey instruments, are identical; the difference between the long and short versions lies only in the multiple-choice options. In particular, in the STPFaSL-Long survey, there are no alternative conceptions explicitly embedded in the four multiple-choice options students choose from, and the questions asked in a given context in one item of the shorter survey were split into several items focusing, e.g., on different thermodynamic variables. The concepts related to thermodynamic processes and the first and second laws covered in an introductory physics course were challenging even for the advanced students, who were administered the survey instrument to obtain baseline data and to evaluate concurrent validity. The STPFaSL instrument is designed to measure the effectiveness of traditional and/or research-based approaches for helping introductory students learn thermodynamics concepts. The average individual scores on the survey instrument from the traditionally taught classes at various institutions included in this study are low. We note that the average scores for other conceptual survey instruments in traditionally taught introductory classes are also low: e.g., for the Brief Electricity and Magnetism Assessment [19], the posttest scores for introductory students range from 23% to 45%, and for the Conceptual Survey of Electricity and Magnetism [20], the scores range from 25% to 47%. The low scores even after instruction indicate that the traditional instructional approach using lectures alone is ineffective in helping students learn these concepts. Instructors can choose the longer or shorter version of the STPFaSL survey instrument, depending upon their preference, to measure the effectiveness of instruction in these topics using a research-based pedagogy.

APPENDIX C: STUDENT PERFORMANCE ON EACH QUESTION FOR ONLINE ADMINISTRATION
Appendix C shows the distribution of answer choices for each group on each question when the survey was given online.

TABLE IV. Average percentage scores for each of the four choices for each item for online administration of the STPFaSL-Long instrument for each group after instruction in relevant concepts. Abbreviations for the various student groups: Upper (students in junior/senior-level thermodynamics courses and physics Ph.D. students in their first semester of a Ph.D. program who had also only taken the junior/senior-level thermodynamics course), Calc (students in introductory calculus-based physics courses), Algebra (students in introductory algebra-based physics courses). The four columns after the item number show the percentage of students who selected choices A-D for each item. The number of students in each group is the same as in Table II.

TABLE I: Data showing average student performance on the STPFaSL-Long instrument from introductory calculus-based and algebra-based courses for matched and unmatched students on the pretest (before instruction) and posttest (after instruction) from a university where the pretest was also administered.