Cheat sites and artificial intelligence usage in online introductory physics courses: what is the extent and what effect does it have on assessments?

As a result of the pandemic, many physics courses moved online. Alongside, the popularity of internet-based problem-solving sites and forums rose. With the emergence of Large Language Models, another shift occurred. One year into the public availability of these models, how has online help-seeking behavior among introductory physics students changed, and what is the effect of different patterns of online-resource usage? In a mixed-method approach, we investigate student choices and their impact on assessment components of an online introductory physics course for scientists and engineers. We find that students still mostly rely on traditional internet resources, and that their usage strongly influences the outcome of low-stake unsupervised quizzes. However, we also find that the impact of different help-seeking patterns on the supervised assessment components of the course is non-significant.


I. INTRODUCTION
The general assumption in teaching introductory physics courses is that we need to teach a few essential concepts, for example conservation laws, and, based on those, to derive basic equations that govern the motion of all objects in our universe, from atomic nuclei to galaxies. Almost all of us teaching professionals subscribe to the notion that it aids the students' learning processes to flesh out these basic concepts with exercises, in class and as homework. Mastering our subject requires time spent on wrestling with conceptual questions, and familiarity with basic equations is acquired by rearranging and combining them and solving for the unknown quantities.
"I think, however, that there isn't any solution to this problem of education other than to realize that the best teaching can be done only when there is a direct individual relationship between a student and a good teacher - a situation in which the student discusses the ideas, thinks about the things, and talks about the things. It's impossible to learn very much by simply sitting in a lecture, or even by simply doing problems that are assigned. But in our modern times we have so many students to teach that we have to try to find some substitute for the ideal." These sentences were written by Richard Feynman. In 1963 [1]! Some of us, the authors included [2,3], have spent enormous amounts of time constructing learning management systems, audience feedback systems, simulation software, and other apps to give students an opportunity to spend meaningful time on task. The ultimate goal may have been to construct a software system that acts as a personal tutor for each individual student. The rapid rise of Large Language Models (LLMs) and artificial intelligence (AI) systems gives hope that the substitute for the ideal that Feynman described might become reality. Alternatively, students might just use their new AI tools to sidestep their learning process and let the AI complete the assignments for them.
For the last 25 years, Michigan State University has been running online sections of the introductory physics courses [4]. For decades, even preceding online courses, the prevalent notion regarding course delivery modes had been "no significant difference" [5-7]. More recently, we confirmed this for the courses under investigation in this study when we found no significant difference between attendance choices with regard to learning success [8] and preparation for subsequent courses [9].
Since the courses first came online, the amount of additional resources available online has greatly increased, particularly during COVID-19. The solution to virtually any introductory-physics problem is available shortly after it is published [10], including on commercial sites like Chegg [11,12]; a typical example can be seen in Fig. 1. Students find these resources useful for homework or unsupervised online exam problems that have been "recycled" from earlier in the semester (e.g., homework problems making an encore appearance on exams).
Instructors have been fighting these internet sites and forums by editing the problem content, removing any references to problem numbers, randomizing the solutions, selecting problems from older editions of textbooks, or writing new problems every time [14]. All of these countermeasures rely on there being some time delay between publishing a problem and it appearing online (even contracted problem solvers need a little time) and on the sites having one static version of the problem (e.g., not being able to adapt to randomizations).
These same countermeasures against cheat sites will not work against artificial intelligence. Tools like ChatGPT [15] and Bard [16] deliver solutions ad hoc and on demand; they are immune to changes in wording or numbers, and they solve problems regardless of whether these were published days ago or appear for the first time on an online exam. This is illustrated in Fig. 2, where the original friction problem has been modified by introducing additional distractors and different randomized numbers. Not only does the availability of these solutions not depend on some "expert" having solved the problem before, but as opposed to the forum answer in Fig. 1, the response includes the actual numbers encountered by the student and the physics explanation is arguably better and more helpful.
FIG. 1. A problem written and copyrighted by one of the authors (GK) in LON-CAPA [13] (top panel) and a typical solution found on a problem-solving site (bottom panel). The problem numbers are randomized, so the student would need to identify their own numerical values and insert them. Depicted is one of 50 answers found on the site.
Chegg can serve as a proxy for the popularity of online problem-solving sites: the share price of Chegg (NYSE: CHGG) greatly increased when courses went online due to the onset of the pandemic in 2020, but with the appearance of ChatGPT in late 2022, it dropped below pre-pandemic levels [17,18]; see Fig. 3.
Physics may still have been spared from this, since Large Language Models as "calculators for words" are still notoriously bad at math, and other online resources may still be more reliable. However, it has been shown that even older versions of popular AI chatbots can (barely) pass the assessment components of introductory physics courses [19], and as Fig. 2 shows, newer versions make fewer calculation errors.
In our study, we investigate whether, one year into the public availability of powerful Large Language Models, online help-seeking behavior of students in an introductory physics course has shifted from traditional resources to AI. We also investigate the possible impact of different help-seeking patterns on the assessment components of such a course.

II. SETTING
Michigan State University is a public, large-enrollment (> 50,000 students) R-1 university. Almost 78% of the undergraduate population are from Michigan. The online courses in this study were taught asynchronously using a variety of multimedia components [4].
The study takes place in a calculus-based introductory physics course sequence for scientists and engineers, where both a first-semester mechanics and a second-semester E&M course were offered during Fall semester 2023. Each course offered several asynchronous video lessons every week and online homework using LON-CAPA [13]. The courses had 11 low-stakes weekly exams ("quizzes") [20], of which nine were conducted online and two had to be taken on campus under supervision. Faculty sanctioned the use of the textbook and the LON-CAPA materials during these exams; no other resources were allowed.
The course also included a high-stakes on-campus final exam. The final exams included five questions which were randomized duplicates of problems that had been assigned earlier in the semester. As a resource for the students, the course also offered an on-campus and online help room staffed by course faculty and staff.
At the end of the semester, a survey was given asking students to report how frequently they consulted artificial intelligence tools and other online resources during homework and online quizzes, and how often they conversed with fellow students and course faculty and staff while working on homework.

III. METHODOLOGY
A. Survey administration
The survey contained the following items, which for the numerical answers had sliders ranging from 0-100%:
• Homework: Estimate the percentage of your homework and lecture problems, for which the following is true:
  - You used AI tools like ChatGPT, Khanmigo, . . . to solve them (HwkAI).
  - You used Internet resources like help sites or forums to solve them (HwkInt).
  - You consulted other students to solve them (HwkPeer).
  - You consulted the TAs/prof to solve them (HwkFac).
• Online Exams: Estimate the percentage of your online exam problems, for which the following is true:
  - You used AI tools to solve them (OnlAI).
  - You used Internet resources like help sites or forums to solve them (OnlInt).
The survey was administered online during the last week of the semester, but results were not viewed or analyzed until the grades for the course had been turned in. A nominal participation credit was given for submitting the survey, regardless of whether or not the students agreed to be part of the study. The students were aware of this protocol as part of the informed consent, and data were only analyzed for students who agreed to participate. The study was approved under MSU-IRB-STUDY00009987.

B. Considered variables
We compiled a range of variables that capture various aspects of student performance and behavior, shown in Table I. Key performance metrics include Hwk (homework score), OnlExams (score from online exams), CamExams (score from on-campus exams with supervision), and Final (score from the final exam, also conducted on-campus with supervision). These scores are presented as percentages, reflecting the students' achievement in each respective assessment. Additionally, Sem5 represents the scores for five problems initially available online, Final5 the scores for the same problems when included in the final exam, and Diff5 the score difference between these two settings. Thus, Diff5 can be used as a proxy for retention of concepts between the semester and the final exam. Finally, DiffExams quantifies the score difference between online and on-campus quizzes, offering insight into performance variations across different assessment environments.
The dataset also encompasses variables related to the use of digital resources and student interactions. HwkAI and OnlAI denote the self-reported percentage of problems for which artificial intelligence (AI) tools were used during homework and online quizzes, respectively. Similarly, HwkInt and OnlInt represent the usage of other internet resources in these settings. Finally, HwkPeer and HwkFac quantify the self-reported extent of peer discussions and interactions with faculty during homework.
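The derived variables can be written out explicitly. The following minimal Python sketch assumes that Diff5 is the final-exam score minus the in-semester score on the duplicate problems, and that DiffExams is the online score minus the on-campus score. Both sign conventions, the `StudentRecord` class, and its field names are illustrative assumptions and not the scripts actually used in the study.

```python
from dataclasses import dataclass


@dataclass
class StudentRecord:
    """Hypothetical per-student record; all scores are percentages (0-100)."""
    onl_exams: float  # OnlExams: score on the unsupervised online exams
    cam_exams: float  # CamExams: score on the supervised on-campus exams
    sem5: float       # Sem5: score on the five problems during the semester
    final5: float     # Final5: score on the same five problems on the final

    @property
    def diff5(self) -> float:
        # Assumed sign: positive means better on the final than in the semester.
        return self.final5 - self.sem5

    @property
    def diff_exams(self) -> float:
        # Assumed sign: positive means higher scores online than on campus.
        return self.onl_exams - self.cam_exams


# Made-up numbers, for illustration only.
student = StudentRecord(onl_exams=92.0, cam_exams=61.0, sem5=100.0, final5=60.0)
print(student.diff5, student.diff_exams)
```

With these conventions, a strongly negative Diff5 together with a large positive DiffExams would describe a student whose unsupervised scores did not carry over to supervised settings.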

C. Statistical methods
Data were downloaded from the course management system, and survey results were merged using Python scripts. Calculations for this project were carried out using ChatGPT-4 Advanced Data Analysis [15] and R [21] (in particular qgraph [22] and CTT [23]).

IV. RESULTS
A. Response rate
The first- and second-semester courses were completed by 156 and 183 students, respectively. Of these, 90 and 131 students agreed to participate in the study, bringing the total to 221 participants.

B. Online versus on-campus exams
Figure 4 shows the score distributions for the nine exams that were conducted online and the two exams that were conducted on-campus under supervision. In a t-test, these distributions are significantly different ($p \approx 4.7 \times 10^{-47}$).
The conditions under which these exams were conducted led to vastly different outcomes, and an immediate assumption would be that this is related to the use of external resources during unsupervised assessments.
As an example, for nine of the ten questions on the last online exam, solutions could be found on Chegg within about one minute each.
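To make the comparison of the two score distributions concrete, the sketch below computes a two-sample t statistic in plain Python. The text does not state which variant of the t-test was used; Welch's unequal-variance form is assumed here, and the function name `welch_t` and the sample scores are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard-error contribution of each sample (variance / size).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a**2 / (n_a - 1) + se2_b**2 / (n_b - 1))
    return t, df


# Made-up percentage scores, for illustration only.
online = [95.0, 88.0, 100.0, 92.0, 85.0, 97.0]
on_campus = [62.0, 55.0, 71.0, 48.0, 66.0, 58.0]
t_stat, dof = welch_t(online, on_campus)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

The p-value then follows from the Student t distribution with `dof` degrees of freedom; in practice one would likely obtain it from a statistics library, for example `scipy.stats.ttest_ind` with `equal_var=False`.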

C. Usage of resources during unsupervised assessments
Figure 5 illustrates the average self-reported usage of AI and other internet resources, as well as self-reported consultation with peers and course personnel.
Overall, students report less usage of resources during exams than during homework, but not significantly so. The only one-sigma significant differences are between usage of AI and talking to faculty on the one hand, and using other internet resources while working on homework on the other; students have not yet adopted AI and stick with "traditional" problem-solving sites. On average, students use other internet resources for half of the homework problems. The distributions of these types of resource usage, however, are very different, see Fig. 6, which suggests that there are different classes of resource usage.
Using k-means clustering and the elbow method, we identified four different classes as indicated in Table II. Cluster 1, the smallest group, comprises students who appear to prefer human interaction to any online resources, and these students mostly adhere to rules for the online exams. Students in Cluster 2, the largest cluster, state that they make little use of resources overall, and that they most closely adhere to rules for the online exams. Students in Cluster 3 make heavy use of internet resources other than AI in both homework and exams, thus not following rules. Finally, students in Cluster 4 use all available resources and disregard exam rules.
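The clustering step can be illustrated with a small, self-contained sketch. It assumes that each student is represented by the six self-reported usage percentages (HwkAI, HwkInt, HwkPeer, HwkFac, OnlAI, OnlInt); the text does not spell out the feature set, so this is an assumption. The data are invented, and the deterministic farthest-point seeding is a simplification chosen for reproducibility, not necessarily what the authors' tools did.

```python
from math import dist


def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    return centroids


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
        for p in points
    ]


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids, inertia)."""
    centroids = farthest_point_init(points, k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster happens to be empty.
            updated.append(
                tuple(sum(col) / len(members) for col in zip(*members))
                if members else centroids[j]
            )
        if updated == centroids:
            break
        centroids = updated
    labels = assign(points, centroids)
    # Inertia: sum of squared distances to the assigned centroid.
    inertia = sum(dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, inertia


# Invented usage vectors (HwkAI, HwkInt, HwkPeer, HwkFac, OnlAI, OnlInt) in percent.
usage = [
    (0, 10, 5, 5, 0, 5), (5, 10, 10, 0, 0, 0),            # little use overall
    (0, 90, 10, 5, 0, 80), (5, 85, 5, 0, 0, 90),          # heavy internet use
    (80, 90, 50, 40, 70, 85), (90, 80, 60, 50, 80, 90),   # everything
    (0, 10, 80, 70, 0, 5), (5, 5, 90, 80, 0, 0),          # human interaction
]

# Elbow method: inspect how the inertia falls as k grows.
inertia_by_k = {k: kmeans(usage, k)[2] for k in range(1, 7)}
for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 1))
```

The "elbow" is the value of k beyond which adding clusters no longer reduces the inertia substantially; for this toy data the drop flattens after k = 4.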

D. Correlations of attributes
Figure 7 shows a Fruchterman-Reingold [22,24] representation of the correlation matrix between the variables. Indicated in light blue are the online, unsupervised assessments, in green the on-campus, supervised assessments, and in gray the differences in scores between selected subsets of assessments. The percentages of AI usage are indicated in beige, usage of internet resources in yellow, and discussions with humans in orange. Green edges denote positive correlations, red edges negative correlations, and the thickness their absolute strength. Due to the force-directed nature of Fruchterman-Reingold graphs, closely correlated vertices tend to cluster, while unrelated vertices tend to be further apart from each other.
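How a correlation matrix turns into the edges of such a graph can be sketched as follows. The helper names `pearson` and `correlation_edges`, the optional threshold `min_abs_r`, and the sample columns are hypothetical; the figures in the paper were produced with the R package qgraph, not with this code.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def correlation_edges(columns, min_abs_r=0.0):
    """Return (var_a, var_b, color, width) for every pair of variables.

    Green marks positive and red negative correlations; the width is |r|.
    """
    edges = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= min_abs_r:
            edges.append((a, b, "green" if r > 0 else "red", abs(r)))
    return edges


# Invented scores and usage percentages, for illustration only.
columns = {
    "OnlExams": [95.0, 88.0, 70.0, 92.0, 60.0],
    "CamExams": [70.0, 65.0, 40.0, 72.0, 35.0],
    "OnlInt": [10.0, 30.0, 80.0, 20.0, 90.0],
}
for edge in correlation_edges(columns):
    print(edge)
```

A force-directed layout of the resulting edge list could then be computed with, for example, `networkx.spring_layout`, which implements the Fruchterman-Reingold algorithm.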
It is apparent that the scores achieved online and those  significant correlations between the variables (p < 0.05).
For the heavy internet users (Cluster 3), the usage of internet resources other than AI during online exams (OnlInt) is significantly negatively correlated with the scores on the exams (OnlExams) (r = −0.28; p = 0.04), which may indicate that, relying on the internet, the students were not able to find quickly enough what they needed to correctly solve the problems, including replacement of the numbers by their values. For the users who made use of all resources everywhere (Cluster 4), the use of AI during online exams (OnlAI) is significantly positively related to Diff5 (r = 0.31; p = 0.04); this means that AI usage during online exams is positively correlated with doing better on the final exam instance of duplicate problems than on their first occurrence in the course.
Notably, within all usage classes, significant correlations between the supervised final exam and the unsupervised assessment components of the course remain. This means that even with the heaviest use of external resources, unsupervised assessment, while not a significant predictor, still retains formative relevance. Using Final as a proxy for learning success, we also investigated whether higher or lower performing quartiles of the students may have benefitted or been harmed by the use of online resources, but we found no difference in that regard between these populations.

E. Item analysis
The use of external resources is detrimental to the validity of assessments. Table IV shows the average item parameters for assessment items on homework and unsupervised online exams. The mean reflects the average percentage of correctly solved items, and the point biserial ("pBis") the discrimination of these items (ranging from −1 to 1, where negative values usually denote invalid assessment items), that is, how well the item distinguishes between students who generally have a good grasp of the concepts and those who do not.
The values do not differ significantly between different classes of online resource usage patterns. Overall, they are consistent with the low predictive power of these unsupervised assessments, but they still indicate that items can provide feedback to students and instructors.
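The two item parameters can be made concrete with a short sketch. It uses the textbook definition of the point biserial, namely the correlation between a dichotomous item score and the total score; whether the total includes the item itself, and how the CTT package handles this, is not stated in the text, so the sketch simply takes the totals as given. The function name `item_statistics` and the data are invented.

```python
from math import sqrt
from statistics import mean, pstdev


def item_statistics(item_correct, total_scores):
    """Return (item mean, point-biserial correlation) for one item.

    item_correct: 1 if the student solved the item, else 0.
    total_scores: overall score of the same students, in the same order.
    """
    p = mean(item_correct)  # fraction of students who solved the item
    solved = [t for c, t in zip(item_correct, total_scores) if c == 1]
    missed = [t for c, t in zip(item_correct, total_scores) if c == 0]
    if not solved or not missed:
        raise ValueError("point biserial is undefined if all students score alike")
    # r_pb = (M1 - M0) / s * sqrt(p * q), with the population standard deviation s.
    p_bis = (mean(solved) - mean(missed)) / pstdev(total_scores) * sqrt(p * (1 - p))
    return p, p_bis


# Invented data: four students, one item.
item_mean, p_bis = item_statistics([1, 1, 0, 0], [90.0, 80.0, 60.0, 50.0])
print(f"mean = {item_mean:.2f}, pBis = {p_bis:.2f}")
```

An item that nearly every student solves, as can happen when solutions are looked up, has little variance left and therefore tends toward a low point biserial regardless of how well the problem is written.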

F. Student comments about AI
Based on the replies to the open-ended question on the survey, many students recognize AI as a valuable tool for assisting in learning, particularly for understanding complex topics and guiding problem-solving. They particularly value its ability to quickly provide information without having to flip through textbook materials or scroll through video. They appreciate AI's ability to provide alternate explanations and solutions, which can be especially helpful when traditional teaching methods fall short. However, there is a consensus that AI should not replace genuine learning and effort. Students suggest that AI's role should be that of an assistant rather than a solution provider, and its usage should be context-dependent. For instance, in major-related courses, students advocate for minimal AI use to ensure a solid understanding of essential concepts. Conversely, in subjects that are less critical for them, they see AI as a more acceptable aid.
Concerns about academic integrity and the potential for AI to promote laziness and dependency are prominent. Students worry that reliance on AI for problem-solving or essay writing could lead to a superficial understanding of course material and hinder the development of critical thinking skills. They propose a balanced approach, where AI is used judiciously to enhance learning without becoming a crutch. This balance involves using AI for initial guidance or concept clarification while avoiding its use for directly solving assignments or exams. Some students commented that having some of the exams during the semester in person was beneficial. Ironically, though the temptation is of course understandable, style considerations suggest that several students filled out the free-response question using ChatGPT.
Finally, the practicality of regulating AI use in online education is a significant concern. Some students acknowledge that while AI tools like ChatGPT may not be sophisticated enough currently to solve complex academic problems accurately, they could still be misused. Several students find that ChatGPT is not yet trustworthy, but expect this to improve in the future. There is an acknowledgment that AI is a part of the evolving educational landscape, and rather than outright banning it, educators should find ways to integrate it responsibly into the curriculum. This integration could involve designing assessments that still require a deep understanding of the material, even with AI assistance, and teaching students how to use AI ethically and effectively as part of their learning toolkit.
FIG. 8. Fruchterman-Reingold [22,24] representation of the statistically significant correlations (p < 0.05) between the variables in Table I for the clusters in Table II. Note the rotation and handedness of these representations are random.

V. DISCUSSION
Students are making extensive use of external resources when working on unsupervised assessments. One year after AI tools became available, students have not made the jump from "traditional" problem-solving sites to AI. The reasons might be manifold: the free version of GPT (at the time of writing version 3.5) is much less powerful than version 4, which is only available to subscribers. A GPT subscription costs $20/month, while for example Chegg costs $14.95/month with promotion sales for half that price. Students may also be used to the "traditional" resources from high school and carry over their habits to college. For questions that contain figures or graphs, it is sufficient on forums to submit only the text in order to locate the question, while with AI tools, these illustrations need to be described in words [19] (this will change as the multimodal capabilities of these systems develop further). Finally, with "traditional" sites, students would find the exact problem with the expected answer, while all Large Language Models still hallucinate.
A surprising result of this study is the lack of correlations between self-reported resource usage and assessment outcomes. While the discrepancy between the score distributions of online and on-campus assessments (Fig. 4) could be explained by the usage of AI and other online resources (Fig. 5), one would have expected a correlation [25,26]: the more resources are used, the higher the discrepancy; this, however, is not the case. This null result may be due to students being worried about punitive measures in spite of the strict research protocol, or to students underestimating their reliance on external resources; underreporting of academic dishonesty by approximately one third has been reported before [26]. It is also surprising that in spite of all of the external resource usage, a significant correlation remains between unsupervised and supervised assessments; while the best and only statistically significant predictor of the score on the final exam is the score on the supervised in-semester exams, the other assessments have not lost their formative relevance.
From the results, it is clear that high-stakes exams like the final exam cannot be conducted in unsupervised settings. At the moment, the time it takes to look up solutions on the internet may still be a hindrance to overly relying on those resources (as was also suggested by the negative correlation between the extent of using the internet and performance on online exams among students who heavily rely on external resources). In the long run, however, AI will likely be reliable enough that answers to any introductory physics problem, including newly created ones, can be obtained instantaneously.
Simple usage of lockdown browsers such as Respondus [27] or Safe Exam Browser [28] is no remedy in an online setting, as students can simply use another machine or their phone to access sites such as ChatGPT [15], Gemini [29] or Chegg [11]. Instead, these lockdown browsers, which limit access to local disks and particular internet sites, are useful in supervised on-campus settings where students use their own devices (on-campus BYOD exams). If, for logistical reasons, high-stakes exams have to be conducted online, there is no alternative to intrusive proctoring systems that use cameras and microphones.
Students are well aware of the possible pitfalls associated with AI usage. While they argue that it will be part of their professional lives, they support a balanced approach to its use and in particular caution against over-dependence and over-reliance. They are also aware of reliability and trustworthiness issues, which agrees with earlier findings regarding students' ability to judge the quality of answers [30,31]. Overall, though, some statements about using these tools for learning purposes may have to be taken with the same grain of salt as the statement about the expert solution in Fig. 1 being "designed to help students [. . .] learn core concepts;" students may believe these statements, but still not act accordingly [32]. Remarks about courses that are "not critical for them" suggest that non-sanctioned external resource usage may be particularly strong when the goal is simply to pass the course [33].

VI. LIMITATIONS
Students self-selected into this study, and students who are self-aware that they are using external resources in non-constructive ways may have refrained from participating. As the survey and consent form were administered at the end of the semester, students who dropped out of the course earlier were not considered.

VII. CONCLUSION
Our findings reveal that despite the emergence of LLMs, students predominantly rely on traditional internet resources for unsupervised quizzes.However, this reliance does not significantly affect supervised assessments.
Our data indicate that the unsupervised online exams and supervised on-campus exams yield markedly different outcomes, presumably due to the use of external resources in unsupervised settings. Interestingly, there is no strong correlation between self-reported resource usage and exam performance, suggesting other factors at play or potential underreporting of resource usage. Despite heavy resource use, significant correlations between supervised and unsupervised assessments persist, indicating that unsupervised assessments, while having no significant predictive properties, retain formative value.
While students advocate for a balanced approach to AI use, emphasizing its role as an assistant rather than a solution provider, the study underscores the necessity of carrying out high-stakes exams in supervised settings to ensure academic integrity.
Only one quarter of our survey respondents claim to have utilized at least some minimal form of AI tools in the solution of their homework problems. AI systems, though, are rapidly becoming more competent, and they will quickly penetrate all education settings. Our present results do not show significant effects of using AI tools on student learning success. In this sense our study was conducted perhaps a bit early. But it is clear, nevertheless, that all of us need to rethink our course offerings, and particularly our assessment tools, due to the rise of AI tools. Now is the time to help shape AI environments into the perfect one-on-one tutor for students, similar to what Feynman envisioned, instead of a means to avoid learning physics.

FIG. 2. The problem from Fig. 1 in its original form (left panel) and after on-the-fly modification (right panel), correctly solved by GPT-4 on the first attempt.

FIG. 3. Historic share prices in USD of Chegg (NYSE: CHGG) as a proxy for the popularity of online problem-solving sites. Prices greatly increased when courses went online as a result of the pandemic, but fell again dramatically, coinciding with the emergence of Large Language Models.

FIG. 4. Comparison between score distributions for the nine low-stakes exams that were conducted online (red) and the two low-stakes exams that were conducted on-campus under supervision (blue).

• Your Opinion: Please tell us what you think about using AI in online classes; what should ideally be done; what should not be done?

TABLE I. Summary of variables in the dataset.

TABLE II. Number of members and mean values of variables in identified clusters of resource usage.

FIG. 7. Fruchterman-Reingold [22,24] representation of all correlations between the variables in Table I for all students.
Taking Final as a proxy for learning success in the course, after discarding the derived variable DiffExams and all variables related to the five duplicate problems, in a linear regression of all remaining assessment and survey variables, only CamExams emerges as a statistically significant predictor of the final exam score ($p \approx 7.3 \times 10^{-16}$); OnlExams comes close to statistical significance with p = 0.05. Table III shows the outcome of this linear regression, which has a weak correlation of $R^2 = 0.36$. Also within the usage classes (Table II), there are very few significant correlations between resource usage and assessment performance. Figure 8 shows the statistically significant correlations between the variables (p < 0.05).

TABLE III. Linear regression for Final.

TABLE IV. Average problem solving mean and average point biserial correlations for homework and online exams by clusters.