Relative impacts of different grade-scales on student success in introductory physics

In deciding on a student's grade in a class, an instructor generally needs to combine many individual grading judgments into one overall judgment. Two relatively common numerical scales used to specify individual grades are the 4-point scale (where each whole number 0-4 corresponds to a letter grade) and the percent scale (where letter grades A through D are uniformly distributed in the top 40% of the scale). This paper uses grading data from a single series of courses offered over a period of 10 years to show that the grade distributions emerging from these two grade scales differed in many ways from each other. Evidence suggests that the differences are due more to the grade scale than to either the students or the instructors. One major difference is that the fraction of students given grades less than C- was over 5 times larger when instructors used the percent scale. The fact that each instructor who used both grade scales gave more than 4 times as many of these low grades under percent scale grading suggests that the effect is due to the grade scale rather than the instructor. When the percent scale was first introduced in these courses in 2006, one of the authors of this paper had confidently predicted that any changes in course grading would be negligible. They were not negligible, even for this instructor.

I. INTRODUCTION
In higher education, the role of grades is paramount. Students enroll in courses, and the grade that they earn in each course indicates the degree to which they were successful in that course. Good grades indicate that students are successful, and those students are encouraged to continue in that course and on to more advanced courses in the same topical area. Poor grades on early assessments in a course may cause students to drop a course. Enough poor grades can cause a student to fail a course, which can in turn affect the student's time-to-degree, their retention in a major, or even their retention in college itself. Furthermore, college grade point averages (GPAs) are important because graduate schools and professional schools have minimum requirements for applicants, and some employers request GPA and/or transcript information, so course grades influence student career opportunities and pathways even after graduation. Because the stakes are high, it's important that educators take care to construct the meaning behind their grades and also to understand any implications of chosen grading techniques or philosophies.
Instructors administering grades on many individual assignments and exams must aggregate these data in assigning a single course grade that best describes each student's understanding and/or skill. While instructors can and do use different factors and criteria when determining a grade [1,2], two things are important to achieve in this regard: 1) that the student's grade is a meaningful representation of their achievements in the class, and 2) that the grading process is consistent such that students taking the same course can expect similar grades for similar successes and failures. Choosing how to award grades usually involves selecting a grade scale, or numerical method, for evaluating students' work. Perhaps the most commonly used grade scale is the percent-based grading scale (hereafter referred to as "percent scale"). A student being graded using the percent scale earns a 90, for example, if 90% of the work being graded is correct.
Recently, the percent scale has come under criticism [3][4][5][6] and alternative methods of grading have been introduced. Some of these alternative grade scales and methods have been shown to be quite successful [7], even when directly compared [8] to the percent scale. However, despite the wide range of criticisms [3][4][5][6], there has yet to be a significant study of the impact on students of the problematic issues associated with the percent scale. In this paper, we address this gap in the literature by examining some of the effects of the percent scale compared to an alternative grade scale. Specifically, we address common criticisms of the percent scale and determine the validity of these criticisms in one particular context as we explore the similarities and differences in students' course grades resulting from these two grade scales.

II. PERCENT SCALE
The percent scale (shown in Table I) is perhaps the most prevalent way of awarding numerical grades to students both in higher education and in the K-12 context in the United States. On the percent scale, 100 is the highest grade students can earn, zero is the lowest, the numbers represent the percent correct, and a value around 65% is the boundary between passing and failing (see Table I). In general, the basic concept is that the number of points awarded indicates the percent of the assignment that is 'correct.' So a grade of 85 would mean that 85% of the assignment is evaluated as correct. There is also a somewhat standard way to interpret this numerical grade as a letter grade. For example, above 90 is generally a grade of 'A', between 80 and 90 is usually considered a 'B', etc. Any grade below 60 or 65 on the percent scale is typically considered failing, and a zero is often awarded for missing work. There are slight variations of this scale depending on the instructor and/or institutional rules, but any scale where a grade is based on percent and students need to earn more than 50% of the points to earn a passing grade can be considered percent scale grading. Despite its popularity, there are some notable criticisms of this scale.

III. CRITICISMS OF THE PERCENT SCALE
One criticism of the percent scale is the portion of the scale devoted to failing grades. In a 2013 article, Guskey [3] points out that a larger portion of the scale is devoted to failure (60%) than success (40%). This means that failure (grades equivalent to 'F') can be measured in 60 different degrees, while each other letter grade (A, B, C and D) is limited to only ten degrees. Another way of stating this is that the grade space devoted to 'F' is 6 times larger than that of any other letter grade. That the majority of the scale is devoted to F is potentially a philosophical problem; in fact Guskey [3] asks, "What message does that communicate to students?" But the problems are mathematical as well because, as discussed by Connor and Wormeli [5], any of the F-grades below 50 tend to skew an averaging procedure [4,6].
The amount of the scale devoted to 'F' grades is particularly important [3][4][5][6] when considering awarding the lowest grade, a grade of 0%. Zero grades are often given to students who skip an assignment. There are conflicting viewpoints on whether missed assignments should be included in aggregate grades at all (see Chaps. 14-17 of Ref. [6]). If we take the viewpoint that there are some instances where earning a 0% on the percent scale is warranted, critics point out that it is very difficult for a student to balance out the effect of even a single zero. Table II shows us that it takes a total of 19 perfect scores to fully erase the impact of a single zero (i.e. eventually receive an 'A'), or a minimum of 6 perfect scores to earn a B. In previous work [9], we have shown that zeros are more likely to be awarded to first generation students and to students identifying as members of racial/ethnic groups that are underrepresented in physics than to other students. Furthermore, we provided evidence that these zeros are not clearly indicative of lack of understanding, and we noted that use of the percent scale makes this situation more of a problem for these students than another grade scale (CLASP4 in Table I). Research [6,10,11] has shown that while good grades are motivators of good work, poor grades may not always motivate students to work harder.
TABLE II. After receiving a score of zero, how many perfect scores (either 100% or 4.5) are needed to achieve the desired average grade? N% is the number needed when using a percent scale and N4 is the number needed when using the CLASP4 scale from Table I.
Another criticism of the percent scale is that because there are so many levels (100) it is prone to inaccuracy and inconsistency. Studies of teachers suggest that the grades different teachers assign to the same student work tend to be distributed over a range of width of order 10 when using the percent scale. This suggests that using fewer than 100 specific grades may lead to less teacher-to-teacher grade variability.
Finally there are a number of criticisms of the scale that are not mathematical in nature, for example the fact that low failing grades (grades below 50% for example) negatively motivate students, or reduce their self-efficacy [12,13]. These are also important arguments to consider, but our quantitative data does not address these issues and so in this paper we are primarily concerning ourselves with the mathematical implications of grade scale.

IV. ALTERNATIVE GRADE SCALES AND PRACTICES
There are a few different common alternatives to using the percent scale. For example, in order to mitigate the percent scale issue of devastating zeros, the concept of "minimum grading" was conceived [5]. Minimum grading is the practice of raising all very low grades to some 'minimum' grade (usually 50%) so that students are able to recover from missing work or from very poor performance on an assignment. Critics of minimum grading suggest that it may result in students passing even if they have not learned the material. They argue that minimum grading promotes student entitlement (they get something for nothing) and leads to social promotion. Some instructors feel strongly that it's unfair to give students 50% if they have completed less than 50% of the work. There are also concerns that minimum grading contributes to well-documented grade inflation [14] (a phenomenon that may include a higher average grade, more students being given 'A's, or both). Research on minimum grading [8] in one school district shows that neither of these things happened. Carey and Carifio [8] present an analysis of seven years of grading data collected from a school that implemented minimum grading. They used standardized test results to show that students who earned at least one minimum grade actually outperformed students who did not receive any minimum grades on standardized testing. This was true even though students receiving minimum grades on average had lower classroom grades. This indicates that minimum grading does not cause grade inflation and also that minimum grading still does not entirely make up for the fact that a percent grading scheme potentially under-measures the performance of the struggling students whom the practice of minimum grading was designed to help.
Unfortunately, even though this seems to be a reasonable way to address some concerns of the percent scale, many instructors don't like it [4], because they do not agree with the concept of giving students scores that they don't believe they have earned.
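As a rough illustration of how minimum grading changes an average, the sketch below applies a 50% floor to a short list of percent-scale scores. The scores and the function name `apply_minimum_grade` are hypothetical and chosen only for illustration; they are not taken from the studies cited above.

```python
def apply_minimum_grade(scores, floor=50.0):
    """Raise every score below `floor` up to `floor` (minimum grading)."""
    return [max(score, floor) for score in scores]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Hypothetical percent-scale scores: one missed assignment and two solid ones.
scores = [0.0, 80.0, 90.0]

raw_average = mean(scores)
floored_average = mean(apply_minimum_grade(scores))

print(f"raw average:            {raw_average:.1f}")
print(f"average with 50% floor: {floored_average:.1f}")
```

In this made-up case the single zero pulls the raw average below 60, while the floored average lands in the 70s, which is the kind of recovery the practice is intended to allow.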
Another way to address the percent scale issues is to use the concept of Standards-Based grading (discussed in the review by Brookhart et al. [7] and references therein). This method asks students to demonstrate proficiency in certain areas, with the instructor providing ordinal grades such as well below proficiency, approaching proficiency, proficient, and excellent, that describe the path to proficiency [15]. This approach has been used in college physics classes previously [16], but, because the method requires that students have multiple opportunities to attempt the same proficiency, it is difficult to accomplish with the large class sizes that are typical of introductory science courses.
A final alternative to the percent scale is the college 4.0 scale which is typically used to calculate GPA. Each integer in the scale (4, 3, 2, 1, 0) corresponds to a letter grade (A, B, C, D, F). While many college level instructors may use the percent scale and then convert to a letter grade (which has a numeric value tied to the 4.0 scale), others may use a 4.0 scale from the beginning of the course and then report the numbers in the form of letter grades. These numbers can be averaged into a single grade. Because the 4.0 scale allocates the same 'space' to each letter grade, the scale avoids most of the issues associated with the percent scale. For example, in Table II we see that after a student earns a zero, it only takes 2 perfect scores for them to earn a grade of "B", and a total of 8 for them to erase the zero completely. This is much more doable than the 6 and 19 perfect scores required by the percent scale to achieve the same goal. This means that a grade of 'zero' is less disastrous for students. This scale is mathematically similar to the practice of minimum grading, and instructors using standards-based grading also sometimes use the integers on a 4.0 scale to mean different levels of proficiency [15,16], so aspects of this scale exist in both of the other alternatives we describe. Because the 4.0 scale mitigates many of the criticisms of the percent scale, and does not have any new drawbacks like minimum grading and standards-based grading do, many suggest it as an alternative to percent scale grading.
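The counts quoted from Table II can be checked with a short calculation. The sketch below finds the smallest number N of perfect scores such that one zero plus N perfect scores averages at least a target grade. The target values are an assumption made here (the middle of each letter range: 95 and 85 on the percent scale, 4.0 and 3.0 on CLASP4, with perfect scores of 100 and 4.5); the function name is hypothetical.

```python
def perfect_scores_needed(target, max_score):
    """Smallest N such that the mean of one zero and N perfect scores
    is at least `target`, i.e. N * max_score / (N + 1) >= target."""
    if not 0 < target < max_score:
        raise ValueError("target must lie strictly between 0 and max_score")
    n = 1
    # Compare without dividing: N * max_score >= target * (N + 1).
    while n * max_score < target * (n + 1):
        n += 1
    return n


# Assumed targets: the middle of the A and B ranges on each scale.
percent_a = perfect_scores_needed(target=95, max_score=100)
percent_b = perfect_scores_needed(target=85, max_score=100)
clasp4_a = perfect_scores_needed(target=4.0, max_score=4.5)
clasp4_b = perfect_scores_needed(target=3.0, max_score=4.5)

print("percent scale: A needs", percent_a, "and B needs", percent_b)
print("CLASP4 scale:  A needs", clasp4_a, "and B needs", clasp4_b)
```

Under these assumed targets the calculation gives 19 and 6 for the percent scale and 8 and 2 for the 4-point scale, matching the numbers quoted in the text.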

V. RESEARCH QUESTIONS
The active learning introductory physics course CLASP at UC Davis has been in existence since 1995 [17]. Originally, all courses used the same grade scale, described as "CLASP4" in Table I. This grade scale was based on the standard college 4.0 scale. After many years, instructors began to move away from the CLASP4 grade scale and began to utilize a percent scale instead. Because the course materials over the years were extremely similar, and some instructors used both types of grade scale in different sections of the same course, the circumstances provide an ideal opportunity to compare the usage of examples of the two scales. In this paper we consider CLASP10 to be an example usage of the percent scale, and CLASP4 to be an example of a 4-point scale. We use these data to compare use of the percent scale, CLASP10, to the 4-point scale, CLASP4, to examine some common critiques of the percent scale, and to further explore similarities and differences between these two grade scales.
Specifically, regarding the controversies over percent scale grading, we ask:
1) Does the 4-point scale lead to course grades that are "inflated" compared to percent scale grading?
2) How does the distribution of course grades differ between the percent scale and 4-point grading?
3) How does the distribution of exam-item grades differ between the percent scale and 4-point grading?
4) How variable are the course grades for each grade scale by instructor and by class?
5) How does averaging grades under these different numerical scales compare with aggregating grades using the median instead of the average?
When considering these questions it's important to emphasize that our aim here is to uncover how different grade scales can impact student outcomes. We are not evaluating the philosophy behind either scale, nor making claims about what constitutes an "A" or an "F." Finally we will make no claims about the connection between assigned grade and student understanding.
In this article we examine 10 years of student grades in a large enrollment introductory physics college course. Instructors in this course sometimes used a percent scale (CLASP10), and sometimes used an alternative scale (CLASP4) based on the standard collegiate 4.0 scale. What we will show is that the fraction of students failing the course is much larger when instructors use the percent scale and that instructors assigned more F grades to their students' work when using the percent scale. We also find that the increase in students failing is associated mostly with the grade scale used when aggregating grades by averaging and not with the increase in individual assigned F's. These conclusions seem to be determined by the grade scale rather than the instructor. Furthermore, we find that using the percent scale leads to more class-to-class variability in grade distribution. Finally, we illustrate how using the arithmetic mean to determine course grades is similar to using the median under the 4-point grade scale but less similar under the percent scale.

A. Setting and Context
When calculating a GPA, most colleges use a 4.0 scale which equates each integer (4, 3, 2, 1, 0) to a corresponding letter grade (A, B, C, D, F). The designers of the Collaborative Learning Through Active Sense-making in Physics (CLASP) [17] curriculum wanted a grading system that was both transparent and non-competitive, and so they directly linked every single graded item (be it an exam question, or the exam itself) to a slightly modified version of this 4-point scale so that students could understand how their performance on a given question related to the expectations of the course instructors. The resulting CLASP4 grade scale, shown in Table I, is therefore a version of the 4.0 scale. CLASP4 is a continuous grade scale from 0 to 4.5 with each letter grade region, except for the F region, centered on the appropriate integer. Generally, a grade of 4.5 (the highest A+) was earned when the description/calculation correctly and completely applied an appropriate physical model to the situation described in the exam. Descriptions/calculations that were not correct or were not complete were assigned lower grades, with each grade depending on the instructor's judgment of the quality of the answer. An answer judged to be excellent but not perfect was given a score in the A− to A+ range between 3.5 and 4.5, an answer in the B− to B+ range was assigned a score between 2.5 and 3.5, an answer in the C− to C+ range a score between 1.5 and 2.5, etc. Under this grade scale a zero was almost universally reserved for students who did not answer the question at all, the top of the F range of grades was 0.5, and grades between 0 and 0.5 were used for students who gave an answer but whose answer showed almost no familiarity with or understanding of the subject. Multi-part exam problems often had a grade for each part (sub-exam level grades) and these were averaged, with the weight per part determined by the instructor, to determine the exam grade.
These exam grades were then averaged, with weight per exam determined by the instructor, to give the course grade on the same 4-point scale. From 1995 until 2005 essentially all CLASP instructors used this same basic grading method for quizzes, exams, and the class grades 1 .
In 2006 several instructors began experimenting with a 10-point grade scale, CLASP10, that was just a rescaled version of the standard percent scale. With the CLASP10 grade scale, answers in the A− to A+ range were given grades 9 to 10 (instead of the 3.5 to 4.5 of the 4-point scale), B− to B+ range were given grades 8 to 9, etc. for C and D grade ranges. Again the zero of this grade scale was reserved for students who did not answer the question but now the highest F grade given is 6.0. The result is that, relative to CLASP4 scale, CLASP10 had a much larger grading measure available for F's (0-6) even though the other grades have the same measure on each scale. For these reasons we will generally refer to CLASP10 as a "percent scale". The grades on a single exam problem were then, as under CLASP4, averaged to give the exam grade and the exams were similarly averaged to give a course score which then determined the course letter grade.

B. Data Set
Data were collected from the first two quarters of the three quarter CLASP series from course archives spanning the ten years 2003-2012. (The "CLASP A" content primarily covers energy and thermodynamics, while the "CLASP B" content focuses on mechanics.) During those years the structure and content of these classes was relatively constant. Over 75% of a student's time in one of these classes was spent working in discussion/laboratory sections (referred to as DLs in the CLASP curriculum) on activities that only changed slowly over those years. The platform for entering grades in this large enrollment class was centralized using a separate database file for every separate course offering. These course databases include exam scores for each student along with the individual grades that were given as well as the calculations that led to these exam scores. In addition, these databases sometimes included the calculations that led to the actual course grades. Over these ten years there were 133 of these classes, and we have found databases for 96 of them that are identifiably graded as described above using either CLASP4 grading or percent scale grading. This identification was determined by examining the maximum grades given to individual student answers. If the maximum was always 10 then we considered the class to have had percent grading and if the maximum grades were always 4.5 then we considered the class to have had CLASP4 scale. The resulting database contains 773,667 sub-exam level grades given to 15,757 individual students on each part of each exam. Fifty-seven of the included classes are CLASP4 grade scale (including 478,617 grades on individual answers) and the remaining 39 are percent grade scale (including 295,050 grades on individual answers). We also have access, from UC Davis administration, to the recorded course grades from all of these classes as recorded by the Registrar. 
Course grades were all recorded on the standard 4-point grade scale with letter grades A through D, each of which may include a + or −, and F. In each database that included the calculation of course (letter) grades these letter grades were determined using the cutoffs approximately as shown in Table I. Although we will group together the data from the first two courses (CLASP A and CLASP B) in this series of courses because the differences between the grade scales show up in both courses, we will specifically note each situation where the data from one course differ substantially from that of the other.
Each of our research questions requires a different set of comparisons to make. So rather than providing a list of justifications for the comparability of each set in this section, we instead share this information when it can be considered alongside the research question and resulting comparison.

VII. ANALYSIS & RESULTS
Throughout our analysis we compare the percent scale, CLASP10, to the 4-point scale, CLASP4, used by the instructors of CLASP. In addition to course grades we will also be reporting on individual scores that instructors gave to student answers on exams. These scores will always be referenced by the grade range in which they are contained. For instance, a 3.8 given to a student answer under CLASP4 grading will be considered to be in the A region just the same as a 9.3 given to a student answer under CLASP10 grading. We treat borderline grades (i.e. a 3.5 under CLASP4 grading or the equivalent 9.0 under CLASP10 grading) as being half in each of the bordering letter grade ranges. Most counts, averages, standard errors, etc. were calculated in Excel and most were double-checked with STATA software. We used STATA for the standard statistical tests, calculations, and regressions. The error estimates we give will be standard error of the mean or propagated from standard errors unless otherwise noted.
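The binning rule described above can be sketched in a few lines. This is only an illustration, not the Excel or STATA procedure actually used. The A, B and C ranges for CLASP4 and the A, B and F ranges for CLASP10 come from the Setting and Context section; the CLASP4 D range (0.5 to 1.5) and the CLASP10 C and D ranges (7 to 8 and 6 to 7) are inferred here from the "etc." in those descriptions, and the function name is hypothetical.

```python
# Letter-grade regions as (low, high) score bounds on each scale.
CLASP4_BOUNDS = {"F": (0.0, 0.5), "D": (0.5, 1.5), "C": (1.5, 2.5),
                 "B": (2.5, 3.5), "A": (3.5, 4.5)}
CLASP10_BOUNDS = {"F": (0.0, 6.0), "D": (6.0, 7.0), "C": (7.0, 8.0),
                  "B": (8.0, 9.0), "A": (9.0, 10.0)}


def letter_weights(score, bounds):
    """Return {letter: weight} for one score.

    A score strictly inside a region counts fully in that region; a score
    on the border between two regions counts half in each of them.
    """
    weights = {}
    for letter, (low, high) in bounds.items():
        if low < score < high:
            weights[letter] = 1.0
        elif score == low or score == high:
            weights[letter] = 0.5
    if not weights:
        raise ValueError(f"score {score} is outside the scale")
    if len(weights) == 1:
        # The scale end points (0 and the maximum) touch only one region.
        weights = {letter: 1.0 for letter in weights}
    return weights


print(letter_weights(3.8, CLASP4_BOUNDS))   # inside the A region
print(letter_weights(9.3, CLASP10_BOUNDS))  # inside the A region
print(letter_weights(3.5, CLASP4_BOUNDS))   # border between B and A
print(letter_weights(9.0, CLASP10_BOUNDS))  # border between B and A
```

Summing these weights over all exam-item scores in a class gives the count of grades in each letter region, with borderline scores split evenly between neighbors.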
A. Percent scale fails more students
When one separates the classes taught according to grade scale, either 4-point or percent scale, a clear trend in student fail rates is present. Figure 1 shows the fraction of course grades given that are less than C− as a function of the year. We choose a cutoff of C− because UC Davis allows students receiving less than a C− to repeat a course whereas those with C− or higher cannot. A similar way of noting that C− is an important cutoff is that courses graded Pass/Not-Pass give the Pass grade only to students who would have received a grade of C− or higher. Therefore, the figure shows the fraction of students who have a grade considered low enough to warrant repeating the course. From the averages shown on the figure we see that instructors using the percent grading scale gave 5.3 ± 0.4 times as many grades lower than C− as instructors using a 4-point scale. This ratio was 4.5 ± 0.5 in the first course in the series (CLASP A) and 6.9 ± 0.9 in the second course (CLASP B), so the two courses both showed this grade scale effect. A chi-square test for the entire data set shows this difference between grade scales is significant, $\chi^2(2, N = 22{,}865) = 513.4$, P < 0.001. Of course, it's conceivable that this difference is primarily an issue of student academic performance rather than of grade scale used. To explore this possibility we will use i) the variable EnterGPA, a student's GPA upon entering the course, as a predictor of that student's academic performance and ii) the variable GrScale (the grade scale) as a categorical variable in logistic regression models predicting the odds of a student receiving a grade less than C−. The logistic regression model including both of these variables is
$$\ln\left[\mathrm{odds}(<\mathrm{C}-)\right] = b_0 + b_{\mathrm{GPA}}\,\mathrm{EnterGPA} + b_{\mathrm{GrScale}}\,\mathrm{GrScale}$$
where $e^{b}$, for each coefficient $b$, is the appropriate odds ratio (note that the odds of receiving a grade less than C− is equal to the probability of receiving a grade less than C− divided by the probability of receiving a grade of C− or higher).
First, leaving out GrScale and including only EnterGPA as a predictor of the odds of receiving < C−, we find an odds ratio for EnterGPA of 0.095 ± 0.009 (z = −24.9, N = 20,950, P < 0.001, pseudo-R² = 0.12). This means that an increase in entering GPA by 1 (e.g. from 2.5 to 3.5) lowered the odds of receiving a grade < C− by over 90%. Now we include the categorical grade scale variable GrScale along with EnterGPA. The odds ratio for EnterGPA changes to 0.079 ± 0.008 and we find that a student graded under a percent scale had 6.5 ± 0.6 times higher odds of receiving less than C− than the same student under 4-point grading (z = −19.9, N = 20,950, P < 0.001 for the variable GrScale in this model), with pseudo-R² = 0.195 for this model. We conclude that student academic performance does not explain the large fraction of students with these low grades under percent grading. Finally, we should point out that the student withdrawal/drop rates are essentially independent of grade scale (0.76% ± 0.08% under 4-point grading and 0.74% ± 0.09% under percent grading) in this set of classes. What isn't shown in these data is why more students were failed when the percent scale was used.
In the remainder of this paper we analyze factors that contribute to this phenomenon.
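For readers less familiar with odds ratios, the sketch below shows the arithmetic that connects probabilities, odds, and the odds ratios reported above. The two odds ratios are the values quoted in the text; the 2% baseline probability is a hypothetical number chosen only to illustrate the conversion and is not a result from the data, and the helper names are made up for this example.

```python
def odds_from_probability(p):
    """Odds = probability of the event / probability of no event."""
    return p / (1.0 - p)


def probability_from_odds(odds):
    """Invert the odds definition to recover a probability."""
    return odds / (1.0 + odds)


def apply_odds_ratio(p, odds_ratio):
    """Probability after multiplying the odds of `p` by `odds_ratio`."""
    return probability_from_odds(odds_from_probability(p) * odds_ratio)


# Odds ratio for EnterGPA quoted in the text (model without GrScale).
gpa_odds_ratio = 0.095
percent_reduction = (1.0 - gpa_odds_ratio) * 100.0
print(f"one extra GPA point lowers the odds by {percent_reduction:.1f}%")

# Odds ratio for percent scale vs 4-point scale quoted in the text.
scale_odds_ratio = 6.5
baseline_p = 0.02  # hypothetical probability of < C- under 4-point grading
shifted_p = apply_odds_ratio(baseline_p, scale_odds_ratio)
print(f"hypothetical baseline {baseline_p:.3f} becomes {shifted_p:.3f}")
```

Because odds and probabilities are nearly equal when the probability is small, an odds ratio of 6.5 at a low baseline corresponds to a probability ratio somewhat below 6.5.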

B. Is Grade Inflation Happening?
Because instructors select either the percent scale or the 4-point scale, it's important to consider the instructors also in this analysis. One possibility is that instructors using the 4-point scale give generally higher grades to their students' answers than those given by instructors using a percent scale, a situation commonly referred to as "Grade Inflation". If 4-point scale instructors were simply inflating grades, we would expect to find fewer low grades but we would also expect to find more high grades under 4-point grading as well as a higher average grade. Figure 2 shows the complete course grade distribution for both grade scales. One possible sign of grade inflation in 4-point classes, more high grades, is easily seen to be missing. Instead, we find that students in percent scale classes are the ones receiving more A's (about 20% more than their peers in 4-point classes).
To look for a shift in the average grade, we computed the average course grade given by each grade scale using the UC Davis method for calculating GPA (A=4.0, A−=3.7, B+=3.3, B=3.0, etc.) except that we use A+=4.3 rather than the UC Davis A+=4.0. This amounts to choosing, for each course grade, a value roughly in the middle of the relevant CLASP4 range shown in Table I. We find the average grade given to a student graded under the percent scale was 2.846 (SD = 0.89) and under the 4-point scale was 2.915 (SD = 0.67) for a grade shift of 0.069 (0.01). These average grades (both between a B and a B−) are shown in Figure 2. The effect size of this grade shift is about 0.088 so it is a small effect. The difference of 0.07 GPA units is less than half of the class-to-class variation for either grade scale (standard deviation, over the individual classes, of average class grade is 0.19 for 4-point classes and 0.34 for percent scale classes). A t-test of the two distributions shows that this small difference in the two average grades is statistically significant (t = 6.6, df = 22863, P < 0.001) and including student GPA as a covariate does not change this conclusion. At this point we should note that the two courses that we have grouped together here showed different results for this comparison when considered individually. The first-quarter course (CLASP A) had essentially the same average grades for the two grade scales (2.858 ± 0.008 for 4-point grading and 2.874 ± 0.012 for percent grading). The second quarter course (CLASP B) had a lower average under percent grading (2.935 ± 0.009 for 4-point grading and 2.757±0.014 for percent grading). So there is conflicting evidence for any simply defined "grade inflation." 
We find that students graded using the percent scale are more likely than students graded using the 4-point scale to earn "A" grades, but that they are also more likely to fail the class. These two differences combine so that the average grade in courses graded using the 4-point scale is statistically significantly higher, although the difference is small. When we examine the distribution of grades for both courses in Figure 2, we see evidence that fewer students fail under the 4-point scale not because the distribution of course grades under that grade scale is shifted uniformly toward higher grades, but because the course grade distribution under 4-point grading was narrower than it was for percent scale grading.
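The size of the grade shift discussed in this section can be put in context with a standardized effect size. The sketch below computes a Cohen's d style ratio from the two means and standard deviations quoted above. Since the pooling method is not stated in the text, the equal-weight root-mean-square of the two standard deviations used here is an assumption, and the variable names are hypothetical.

```python
import math

# Average course grades and standard deviations quoted in the text.
mean_four_point, sd_four_point = 2.915, 0.67
mean_percent, sd_percent = 2.846, 0.89

# Assumption: pool the two standard deviations with equal weights.
pooled_sd = math.sqrt((sd_four_point ** 2 + sd_percent ** 2) / 2.0)

grade_shift = mean_four_point - mean_percent
effect_size = grade_shift / pooled_sd

print(f"grade shift: {grade_shift:.3f} GPA units")
print(f"pooled SD:   {pooled_sd:.3f}")
print(f"effect size: {effect_size:.3f}")
```

With this assumed pooling the ratio comes out close to the 0.088 reported in the text, well below the 0.2 conventionally labeled a small effect.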

C. Instructor Use of Grade Space
Since very low grades given under percent scale grading have a much larger effect on course grade than the lowest grades given under 4-point scale grading (Table II), we might expect that this is the main difference between the two grade scales and accounts for the difference in the student fail rates. However, there is another clear difference in our data that can also lead to more failing grades. We find that instructors allocate grades to the individual answers given by their students on exams differently under the two scales.
First, we note that students who leave a problem blank (or mostly blank) on an exam receive a zero, a grade that involves no instructor judgment of understanding or skill. Figure 3 shows the fractions of scores given on individual exam items for all 96 courses. We see that there is very little difference between the number of zeros earned by students under each scale. However, the fraction of F-grades that are not zero is distinctly dependent on the grade scale. Figure 3 shows that instructors using a percent scale are considerably more likely to judge individual student solutions on exam items as non-zero F's than those instructors using the 4-point scale. This amounts to shifting about 14% of the entire exam item grade weight from higher grades under 4-point grading down to non-zero F's under percent grading. Notice that none of this shifted grade weight comes from the A's, which were actually more common under percent grading, and likely little of it comes from the B's because the total fraction of (A's + B's) is about the same for the two scales, 56.4% for percent grading and 56.9% for 4-point grading. In order to analyze how these grade shifts affected individual students and whether they might be due to student academic performance, we compute the fraction of non-zero F's for each student. Averaging this student-level number over all students gives 0.0364 ± 0.0004 under 4-point grading and 0.1636 ± 0.0014 under percent grading. Cohen's d for this difference is 1.3 so this is, not surprisingly, a large effect. The student-level distribution of non-zero F's is, unfortunately, both non-normal and heteroskedastic but we can still use student GPA, EnterGPA, in a linear regression to see if it affects the extra fraction of non-zero F's seen for percent grading.
The regression model we use for a student's fraction of non-zero F's is linear in the predictors EnterGPA and GrScale (the grade scale used). First, leaving out GrScale and using only EnterGPA to predict the exam item fraction of F's with N = 20,837 gives a coefficient for EnterGPA of −0.056, t = −35, P < 0.001, R² = 0.056. Including GrScale with EnterGPA in our model we find that the percent scales had 12.6% ± 0.1% of extra grade weight in non-zero F's compared to 4-point scales, t = 99, P < 0.001 for the variable GrScale, and N = 20,837 and R² = 0.36 for this model. Controlling for student GPA did not change the fraction of shifted grade weight very much so we conclude that the extra non-zero F's under percent scale grading are not due to differences in the students in these courses. Finally, we note that the two courses that we have grouped together here had different amounts of grade weight shifted into non-zero F's. In the first course of the series (CLASP A) 8.5% ± 0.2% of grade weight was shifted into non-zero F's under percent grading and in the second course of the series (CLASP B) that number was 18.6% ± 0.3%.

D. Individual Instructors' Results
One possible explanation for the difference between the failing rates of the two grade scales is that different instructors choose the scale that serves their interest better. Therefore, it is useful to address any selection effects the choice of grade scale may have. Of the 60 instructors involved in these courses over the 10 years in our data set, seven instructors used both the 4-point scale and the percent scale at various times. This gives us seven comparisons between the two grade scales where the instructor is held constant. Table III shows that these seven instructors gave between 4 and 12 times more course grades less than C− under percent grading than under 4-point grading with an overall average of 5.4 ± 0.5 times as many grades less than C−, a number very similar to the overall result described previously. In addition, each of these instructors had between 4% and 20% more exam-item grade weight in the non-zero F region under percent grading than that particular instructor had under 4-point grading. These numbers are modified only very slightly if we attempt to account for class-to-class differences in the student academic performance by using their incoming GPA as a covariate. Specifically, the largest shift in Extra F Weight caused by controlling for student GPA is from 0.12 to 0.11 for Instructor 4 and, overall, we still find between 4% and 20% more sub-exam-level grade weight in the non-zero F region under percent grading. These data are evidence against either instructor or students as causes of the effects shown in Figures 1, 2, and 3.
Figure 1 shows that the fail-rates for the instructors using the percent scale are more variable by year than those for the 4-point scale. We can quantify this on a course-to-course basis by examining how the fail rate (i.e. fraction of students with grade < C−) varies over the courses offered. Figure 4 has a kernel density plot of course failure rate for each grade scale.
One finds not only that the failure rate is larger for the percent scale courses but that it is also much more variable over courses. It is possible that the students themselves were more variable in percent scale courses and we can check this using the students' incoming GPAs to find the distribution over classes, of the class-average GPA, for each grade scale. For 4-point classes we find a class-GPA averaged over classes of 2.97 with a standard deviation over classes of 0.08 and for the percent scale classes we find a class-GPA average of 2.99 with a standard deviation of 0.08. A t-test of these class-average GPAs gives t = 1.1, df = 93, P = 0.26 so the students under the two grade scales are, in terms of average GPA, indistinguishable from each other. We can also examine whether the fraction of students failing in a class varies with the class-average GPA. We find the linear regression coefficient is not significant for either grade scale (P = 0.74 for 4-point scales and P = 0.16 for percent scales). So the variability in failing fraction is not obviously associated with an underlying variability in the students' academic performance. Finally, since both of the distributions in Figure 4 include classes at the lower bound of zero students failing, we might worry that being pushed against this lower bound has artificially narrowed the distribution for the 4-point scale, a floor effect. We check this by using a higher grade cutoff. A cutoff of C still had classes from each scale with zero students below cutoff, so we use a cutoff of C+ and find that percent scale classes had an average fraction of 0.22 students with lower than C+ with a standard deviation over classes of 0.12 and 4-point graded classes had an average fraction of 0.12 students with lower than C+ and a standard deviation of 0.07. The percent scale classes are more variable under this measure also.
If the variability in fail-rates is not associated with the students then maybe it is an inter-instructor effect where each instructor tends to fail about the same number of students in each percent course but that different instructors fail very different numbers of students. We can turn, again, to instructors who used both 4-point and percent grading at various times during these 10 years. Figure 5 shows the fail rates for two of these instructors. Both of these instructors have considerable variation of the fail-rates in their classes. The fail-rates in their 4-point courses have much smaller variation than the fail-rates in their percent courses. Unfortunately, only four instructors taught at least two courses under each grade scale but each of them had a larger spread in fail rates under percent grading than under 4-point grading (the smallest increase in the spread of fail rates was a factor of 1.7 higher for percent scales and the highest was a factor of 5.5 higher). So, the variability is not obviously an instructor selection effect.

F. Separation of grade weight effects from extra F's
From Table II we see that the percent scale requires 3 perfect grades of a student wanting to raise the lowest F-grade (i.e. a zero) to a straight C. Similarly, we note that percent scale F-grades of 25 and 50 require 2 and 1 perfect grades respectively to average to a C. The highest percent scale F grades average in much the same way that F's do with a 4-point scale. Since the lower F-grades require more perfect grades to cancel them out, we can think of them as F's that carry more effective "weight" than any F from a 4-point scale. These "weighty" F's lead to the "skewing" of average grades discussed by Connor and Wormeli [5].
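The counts of perfect grades quoted above follow from a simple unweighted average. The sketch below reproduces them under stated assumptions that are not taken from Table II itself: a perfect grade is 100 and a straight C is 75 on the percent scale, and a perfect grade is 4.0 and a C is 2.0 on the 4-point scale. The function name is hypothetical.

```python
import math
from fractions import Fraction


def perfect_grades_needed(low, target, perfect):
    """Smallest n such that one grade `low` plus n grades of `perfect`
    have an unweighted mean of at least `target`."""
    low, target, perfect = Fraction(low), Fraction(target), Fraction(perfect)
    if low >= target:
        return 0
    # (low + n*perfect) / (n + 1) >= target
    #   <=>  n >= (target - low) / (perfect - target)
    return math.ceil((target - low) / (perfect - target))


# Percent scale (assumed: perfect = 100, straight C = 75).
percent_counts = [perfect_grades_needed(f, 75, 100) for f in (0, 25, 50)]
# 4-point scale (assumed: perfect = 4, C = 2).
four_point_count = perfect_grades_needed(0, 2, 4)
print(percent_counts, four_point_count)
```

With these assumed values the percent-scale F's of 0, 25, and 50 need 3, 2, and 1 perfect grades, while the lowest 4-point F needs only one.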
As noted earlier, some of the class database files that we have access to not only have all grades recorded but also include all of the calculations that led to the course grades. Courses offered in 2008 and later years do not have all of these calculations but eight of the thirteen percent graded classes offered 2006-2007 do have complete sets of grades and grade calculations. These eight classes gave 1839 course grades to 1272 students (567 students are in two of these particular classes). The eight classes give us a way to separate the effects of the heavy weighting of low F's from the effects of just giving more F's on the individual exam answers. Without changing the number of F's (or any other grade) that were given on exams in these eight classes we map the CLASP10 percent scale grades onto the CLASP4 scale as follows: i) for all grades larger than 6.0 we subtract 5.5 from the grade and ii) in order to be very conservative in our treatment of the F's we set all original CLASP10 grades less than or equal to 6.0 to a 0 in our 4-point re-scaling (that is, all original F's are set to 0). We then redo all the original weighted averages of individual grades into exam grades and exam grades into course grades, where the point cutoffs determining the letter grades are obtained by subtracting 5.5 from all of the original percent scale grade cutoffs. The original percent graded courses had a fail rate of 8.3% ± 0.6% and the 4-point re-scaled courses had a fail rate of 1.3% ± 0.3%. The 4-point re-scaled fail rate of 1.3% is consistent with the rest of the 4-point graded classes (see Figures 1 and 4) and the increase of a factor of 6.5 ± 1.4 in the fail rate under percent grading is consistent with both the overall ratio of 5.3 and the individual instructor ratios given in Table III.
These overall consistencies from a simple rescaling onto a 4-point scale suggest that the heavy effective weight that the low-F grades carry is the main factor increasing the failure rates and that the extra F's assigned by instructors using the percent scale for individual exam answers are not the main difference between the two course grade distributions.
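The rescaling rule described in this section can be sketched in a few lines. The mapping itself (subtract 5.5 above 6.0, otherwise 0) is taken from the text; the function name, the example grades, and the use of an unweighted mean are illustrative assumptions, since the actual courses used instructor-specific weighted averages.

```python
from statistics import mean

F_CUTOFF = 6.0   # CLASP10 grades at or below this value are F's
OFFSET = 5.5     # shift between the CLASP10 and CLASP4 scales


def clasp10_to_clasp4(grade):
    """Map one CLASP10 grade onto the 4-point scale: grades above 6.0 are
    shifted down by 5.5 and all F's are conservatively set to 0."""
    return grade - OFFSET if grade > F_CUTOFF else 0.0


# Hypothetical item grades for one student (not taken from the data set).
clasp10_grades = [8.5, 7.5, 2.0]
clasp4_grades = [clasp10_to_clasp4(g) for g in clasp10_grades]

# Unweighted means, used here only for illustration.
mean10 = mean(clasp10_grades)   # average on the original scale
mean4 = mean(clasp4_grades)     # average after rescaling
print(clasp4_grades, mean10, round(mean4 + OFFSET, 2))
```

For this made-up student the single low F pulls the CLASP10 average down to 6.0, whereas the rescaled average, shifted back by 5.5 for comparison, sits near 7.2 even though the F was set to the lowest possible 4-point grade.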

G. Aggregating Grades - Mean vs Median
An alternative to using the mean to aggregate student grades is to use the median [18]. This is perhaps most straightforward with letter grades because they are explicitly ordinal but could also be chosen for numerical grades that are referenced to letter grades when they are given by an instructor whose grading philosophy is that the grades they give are ordinal in nature. For the CLASP courses the numerical grades are chosen from a continuous number line and were always averaged but these individual grades are ultimately connected to letter grades so one could argue that taking a median might be a reasonable choice here in constructing course grades. In fact, the argument that it is best to consider even numerical grades to be ordinal and so use the median to aggregate grades has already been made by others [19]. For our grade data an interesting similarity between the two grade scales is that the overall median of all of the grades given to individual student answers under 4-point grading was 3.0 (exactly middle B) and the overall median under CLASP10 was 8.5 (also exactly a middle B). It may help us better understand the differences between the two grade scales to use the grade data we have discussed in this paper to compare, at the student level, the two methods, median and average, of determining course grades.
FIG. 6. The unweighted average of each student's set of grades is plotted against the unweighted median of these grades for each student from the classes for which all grade data were present. Both grade scales are shown. In each graph the straight line represents average = median and the curved line is a fit of all the grades to a quadratic polynomial done just to show the data trend. The low grade region of the CLASP10 grade scale (the percent scale) is cut off so that the A, B, C, and D ranges of the two scales occupy the same distance along each axis.
The actual course grades given during the 10 years of these data are always made up of complicated weightings of the individual grading judgments. This complicated weighting can even be student dependent because instructors would almost always drop each student's lowest quiz score, give the final exam more weight if the student performed better on that exam than on exams during the term, and adjust some students' grades slightly depending on their performance in their discussion/lab section. This makes it impossible to devise a unique process of using the median function to determine a final grade that can be directly compared to the actual final grades. Nevertheless, it may be useful in understanding the similarities and differences between the two grade scales to just construct a straight unweighted average of each student's individual grades and then compare those averages with a similarly unweighted median of each student's grades. We find that 49 of the courses graded under 4-scale grading (11,708 students) and 24 courses under 10-scale grading (5,862 students) have a complete set of grades and so can be used in this way. Figure 6 shows each student's unweighted average grade as a function of their unweighted median grade for both grade scales. The relationship between an average grade and a median grade, for a set of grades that have both an upper bound and a lower bound, is evident in the figure. Students with many grades at the upper bound can have a median grade equal to the upper bound but any other grades they have will lower their average grade, and exactly what other grades they have will determine how much lower their average grade is, so average grades will tend to be lower than median grades near the upper bound. Conversely, average grades will tend to be higher than median grades when the median falls near the lower grade boundary.
The result of those two general features is that average grades and median grades must be similar for some region in the middle of the grade space. For the grades given to the students in these courses, the average approximates the median in the B− to B region of grades under 4-point grading and at the F to D crossover region of grades under percent grading. This means that under percent grading the region of grades that includes 92% of student grades (median grade ≥ D) will tend to give lower grades when the average is used than when the median is used.
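The gap between average and median can be illustrated with a small made-up example. The two grade lists below are hypothetical, chosen so that the same letter grades (A, B, B, C, and a low F) appear on each scale; they are not drawn from the course data, and the unweighted mean and median mirror the simplification used for Figure 6.

```python
from statistics import mean, median

# Hypothetical item grades for one student on each scale.
# One grade point corresponds to one letter grade on both scales.
clasp10_grades = [9.5, 8.5, 8.5, 7.5, 2.0]   # percent-style scale, low F = 2.0
clasp4_grades = [4.0, 3.0, 3.0, 2.0, 0.0]    # 4-point scale, lowest F = 0.0

# How far the average falls below the median on each scale.
gap10 = median(clasp10_grades) - mean(clasp10_grades)
gap4 = median(clasp4_grades) - mean(clasp4_grades)
print(round(gap10, 2), round(gap4, 2))
```

In this example the median is a middle B on both scales, but the average falls about 1.3 grade points below it on the percent-style scale compared with about 0.6 on the 4-point scale, because the low F sits much farther from the other grades.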
Another effect that is obvious from Figure 6 is that, for any specific value of the median grade, percent grading gives a much larger spread in average grades. The standard deviation in the distribution of average grades is roughly twice as large for percent as for 4-point grading for any particular letter grade, determined by the median, from D− to A+. So, in the courses we are considering here, percent scale grading had both a larger systematic shift, of the average from the median, than 4-point grading and a larger random spread around that systematic shift.

VIII. DISCUSSION
Considering only the mathematical characteristics of each scale, it is perhaps not surprising that more students fail under the percent scale. After all, if lower grades are given more weight, it's likely that more students will fail. That said, there are several nuances in the overall grade distributions of each scale uncovered by our analysis that are worth considering by instructors who are considering using either scale.
The above results confirm many of the critiques of the percent scale discussed in Section III. A complete understanding of the grade scale entails understanding these critiques so that the percent scale can be used consistently and effectively. In fact, many instructors are aware of some of these critiques, and adjust percent scale grades accordingly (for example, they might consider a grade lower than 65% to be passing or they might add some number of points to all students' grades to increase the class average). The intention of this discussion is to examine the nuances of the use of the percent scale in one particular context in order to bring these characteristics to attention so that they may be discussed both by the research community and by instructors considering their grading philosophy.

A. Considering Partial Credit
If a student does not complete a problem correctly but shows some small part of understanding, instructors will often award an accordingly small amount of "partial credit." Figure 3 shows that a very large portion of grades earned on exam items graded using the percent scale is devoted to "Non-Zero F's." In fact, more students earn "Non-Zero F's" on the percent scale than any other individual grade besides "A". Mathematically, there is a big difference between averaging in a 40% and a 10%, but does the instructor see a correspondingly meaningful difference between these two different grades for failing? Perhaps, but is the difference between these two failing grades equivalent to the difference between a grade of D (65%) and an A (95%), which are also 30 points apart? If an instructor thinks that the numerical grades they assign are actually interval in nature, rather than just ordinal [18,20], and they are averaging the individual grades to determine some aggregate grade, then the same consideration as to whether to award a D or an A should go into determining whether a student earns an F (10%) or an F (40%). Specifically, it seems important that instructors avoid the mind frame of awarding percents lower than 50% and thinking that this is giving the student partial credit if they consider a grade of roughly 60% to be the border between passing and failing. For example, while 30% is obviously better than a zero it is still 30 points below failing, which is the same mathematical difference as between an A and an F. The data in our paper show the collective effects of the very low F-grades of the percent scale. And the fact that each instructor who used both grade scales gave more F's under the percent scale suggests that the grade scale itself might affect a teacher's grading judgments in cases when a student's answer does not show much understanding.

B. The Meaning of a Zero
The effect of awarding a zero is greater in the percent scale than in the 4-point scale, as shown by Table II, and therefore contributes to the grade weighting issues discussed in Section VII F. When a student leaves an exam problem blank for any reason, the instructor often awards them a zero for this. The instructor's justification for this is certainly logical in the sense that the student has provided zero evidence of understanding, and therefore has earned 0% of the possible points. The instructor might also be using the zero for a motivational purpose in the sense that students need to complete the work in order to earn points. However, many studies [11,21,22] suggest that this simple view of motivation is unwarranted. In addition, as seen in Figure 3, our data set shows that even though a zero carries much more weight on the percent scale than on the 4-point scale, the overall fraction of blank problems remains essentially constant. This is tentative evidence that the number of blank responses is not affected by choice of grade scale in this course and possible additional evidence against the motivational use of a zero. Furthermore, leaving a physics problem blank is a behavioral trait that may be more common [9] for women, students identifying as underrepresented minorities, and first generation college students. Therefore the practice of awarding zeros for missing work may well contribute to achievement gaps.
As pointed out by many authors [6,23], a zero that is earned because a student did not complete an assignment is not a measurement, it's actually missing data. In previous work [9] we showed that the number of problems left blank by a student is poorly correlated with other metrics of understanding, so that leaving a blank is by no means predictive of that student's overall understanding of physics. Averaging in zeros for missing data would be a terrible practice in one's research and so we might consider alternatives to this practice should this concern us in our teaching.

C. Valuing the Instructor's Evaluation
In figure 3 we showed that the instructors using the percent scale gave more non-zero "F" grades than instructors using the 4-point scale and, using Table III, argued that these extra F's were due to the grade scale rather than the instructor. However later, in Section VII F, we argued that the reason that more students failed the course when graded using the percent scale as compared to the 4-point scale, was due to the mathematics of averaging percent scale grades, and NOT because those instructors actually gave more "F" grades on individual exam items when using the percent scale. In some ways this second point seems to temper the first point but the two effects should really be considered together. For example, regarding the second point, an instructor could argue that giving a very low F grade and using it in an averaging process that gives it a large effective weight is entirely appropriate because that is what their student "earned." However the first point, that the percent scale seems to have guided seven out of seven instructors into giving more of these F grades, might give this instructor pause regarding their own ability to provide impartial absolute grading judgments of these poor student answers. The set of low F grades which average with higher effective weight are exactly the grades that may be caused by the grade scale itself and so be particularly hard to be confident about.
We further characterized the student-level skewing effects of these low F grades in Section VII G. In Figure 6 we used unweighted averages to show how the percent scale might allow for students to earn average grades that are consistently lower than their median grade and sometimes much lower. Of course the actual instructor weightings (like dropping the lowest grade) would likely reduce the scale of some of these effects but will also reduce similar but smaller effects of the 4-point scale and are not likely to change our conclusions that the low F grades available to instructors using the percent scale i) can have very large effects and ii) may only have been given to a student because of the grade scale used.

D. Considering Variation at the course level
Beyond the student-level effects discussed above, there also appear to be instructor-level/class-level effects associated with the skewing effect of the low F grades in a percent scale. We have already seen that the effective weight of these low F grades leads to the increase in the fail rate so the class-to-class fail rate variability that we saw in Figure 5 must be caused by an underlying class-to-class variability in the number of these low F's and/or various instructors' responses to the low average grades they see in front of them as they determine course grades for their students. If there is a set of possible grades that carry extra weight when averaging then the actual grades given to the students in that class may well be very sensitive to exactly how those powerful grades are used during the term, with this sensitivity giving rise to the class-to-class variability.

IX. CONCLUSIONS
In summarizing our results we again emphasize that we are not casting judgment on use of the percent scale in general or any grading practices in particular, but instead argue that it is essential for instructors to consider the biases of the percent scale when planning their course for the semester, so that they can ensure that their teaching philosophies match their grading philosophies.
The primary purpose of these analyses is to fully understand the impact of the percent scale as compared to another somewhat commonly used scale, the 4-point scale. In fact, even knowing these results, one of the authors has chosen to continue using a version of the percent scale in their small graduate courses. In this use case, work that does not meet expectations earns no lower than 60%, but zeros for missing assignments are used to ensure it's not possible to pass the course without completing the assignments. This example is shared NOT as an exemplary use of the scale (we do not have data to support such a claim), but rather to emphasize that the intention of this paper is not to discredit the percent scale, but rather to expose some characteristics of the percent scale that may have previously avoided consideration due to its widespread use.
With respect to our research questions, we draw the following conclusions.
RQ1) The 4-point scale does not appear to inflate grades in the traditional use of the word because the average grades of the two scales are close and students are actually over 30% less likely to earn A grades in courses using the 4-point scale.
RQ2) Although the average course grades under the two scales are close, the width of the distribution is much larger under percent grading. This led to many more students receiving failing course grades when the percent scale is used as compared to the 4-point scale. We found that the odds of failing are over 5 times higher under percent scale grading, P < 0.001. The overall fail rate varies by instructor and by individual class for the percent scale with an average fail rate of about 8%. Nevertheless, each instructor who used both scales at various times failed, overall, at least four times as many students under the percent scale.
RQ3) Instructors tend to give out more "F" grades on individual exam items when using the percent scale than when using the 4-point scale (13% to 14% of the entire grade weight was shifted down into the F-region under percent grading, effect size = 1.3). We have not seen this sort of effect reported in the literature. However, the number of extra F's is not found to be the main contributor to the higher course fail rate under percent scales. Rather, the extra "effective weight" of these low F's in the averaging process is the main contributor. The grade scale is more important in determining the fail rate than the instructor.
RQ4) The percent scale has a much more variable fail rate when compared to the 4-point scale. The class-to-class variation of the fail rate under percent grading is over seven times higher than under 4-point scale grading even though the variation in the students was negligible.
RQ5) The two grade-aggregating methods, mean and median, gave results more consistent with each other under 4-point scales than under percent scales. Under percent scale grading 88% of students had unweighted average grades lower than their unweighted median grade and 12% had higher averages. Conversely, for 4-point scales 56% of students had lower unweighted averages than unweighted medians and 44% had higher averages. In addition to the systematically lower averages under percent grading, the spread of the distribution of average grades for any particular median grade was about twice as large under percent grading as under 4-point grading.
These results are derived from a data set that comes entirely from two (sequential) courses offered over a period of ten years by one department at a single institution. This course has a fairly low fail rate regardless of which scale is used. Findings would likely vary across institutions. While we have made every effort to account for student and instructor selection effects, this is not a randomized controlled study. It is possible that there are unseen factors contributing to higher fail-rates in the courses graded using the percent scale. Future work will examine similar data sets at other institutions offering CLASP and other Introductory Physics for Life Sciences courses. Furthermore, a big question we have not addressed here is what happens to these students after completing this course. Are students who would have failed under the percent scale, but passed under the 4-point scale successful in future courses? We do not address this question in this paper, as our intent is not to prove one grading method is superior to another, but rather to uncover characteristics of the percent and 4-point scales that are important for instructors to consider when deciding on a grading practice for their courses. We do plan to investigate this question in forthcoming work.
Each college or university level instructor has their own opinion about the quality of a student's work but this judgment should represent an unbiased opinion of that work. Toward that end, it is useful for instructors to know the origins of possible biases so that they can account for these in assigning grades. Our results indicate that instructor use of the 4-point scale led to many more students passing their introductory physics course as compared with classes using the percent scale. This result was achieved without grade inflation. Our findings align with previous critiques of the percent scale, and indicate that instructors should consider the specific issues we highlight in this paper when using the percent scale.

X. ACKNOWLEDGEMENTS
None of these results would have been possible without the organized databases that Wendell Potter set up in 1997 and continued through his retirement in 2006 so, in memoriam, we owe him a debt of gratitude. We would also like to thank the education research groups at UC Davis and San Jose State for useful comments on the research and the manuscript. We also would like to thank Jayson Nissen for providing feedback on an earlier draft of this paper.