The impact of scaffolding and question structure on the gender gap

We address previous hypotheses about possible factors influencing the gender gap in attainment in physics. Specifically, previous studies claim that male advantage may arise from multiple-choice style questions, and that scaffolding may preferentially benefit female students. We claim that female students are not disadvantaged by multiple-choice style questions, and also present some alternative conclusions surrounding the scaffolding hypothesis. By taking both student attainment level and the degree of question scaffolding into account, we identify questions which exhibit real bias in favour of male students. We find that both multi-dimensional context and use of diagrams are common elements of such questions.


I. INTRODUCTION
The gender gap in attainment in physics is consistent and well documented. Across institutions, male students outperform their female counterparts in terms of undergraduate course performance [1][2][3], as well as outcomes on subject-specific concept inventories (FCI [1,4-7], BEMA [8][9][10], and CSEM [10,11]). At the UK Open University (OU), we observe a significant difference in attainment on the second level physics modules in favour of males, and furthermore this gap is persistent across multiple years of instruction.
While the existence of a real and significant gap is well established, the contributing factors are less well understood (see [12] for a review of 17 studies). Possible factors include background and preparation, of which many possible measures exist. Previous studies find that concept inventory pretest scores [9,13], SAT math scores [7,9,13], ACT math scores [9,13], and prerequisite course grades [9] vary significantly by gender. Sociocultural factors may also play a role, for example self-efficacy and CLASS scores [14] (a measure of learning attitudes about science). Finally, there is the issue of question construction, including type of question (constructed response, multiple choice, or other selected response), presentation (graphs, diagrams, words), and male-biased context (references to sports and cannons). Here we focus on identifying factors from the final category of question structure, as these are the most readily modified.
A recent study from the University of Cambridge [15] observes an interesting dependence on question structure in the form of scaffolding. Scaffolding refers to the degree to which a question guides the student through the problem-solving process. Previous studies support the use of scaffolding in aiding students' learning and conceptual understanding in physics [16][17][18]. However, [15] is the first study to our knowledge to identify a dependence on gender. It is therefore important to verify these findings across institutions and student populations, prior to taking action towards any instructional reform.
In light of the large and diverse student population of the Open University, we find ourselves well situated to address these issues. The goals of the present study are to
1. Identify elements of question structure which may be disadvantaging female students
2. Test the scaffolding hypothesis as a potential solution
Taking student ability (as measured by overall attainment levels) and question difficulty into account, we identify questions that pose significant male bias and those which do not. We discuss our findings in the context of current literature on the subject. Furthermore, we challenge the conclusions presented in [15], and offer some alternative conclusions.

II. CONTEXT
The present study examines gender differences in attainment observed in the second level (FHEQ Level 5) physics modules at the Open University. We first spend some time reviewing the structure of the Open University, the modules in question, and the student population.
The Open University approaches higher education in a non-traditional way in that there are no admission requirements, and modules are completed at a distance with substantial online elements. Students select and complete modules according to their needs and, if desired, combine them into a degree of 360 credits. Students are attracted to the open concept for a variety of reasons including flexibility, part-time options, returning to study later in life, and completing second degrees. We therefore expect that the student population is demographically diverse. Despite these differences in the student population, attainment gaps similar to those at other institutions have been identified. Of particular interest is a large gap in attainment at the second level, the first level at which physics is taught as a separate module, which does not exist at lower or higher levels.
The 60-credit second level physics modules (previously S207, now S217) include mechanics, thermodynamics, electricity and magnetism, quantum physics, and nuclear physics at an introductory to intermediate level. Although prerequisites are not enforced, it is expected that students will have completed the introductory level one science module, from which they will have gained some familiarity with some of these topics as well as appropriate mathematical preparation. The module population comprises a mixture of students intending to take further physics modules and those intending to take further science or mathematics modules outside of physics. Throughout the module, students complete interactive computer-marked assignments (iCMAs), which are short problems requiring numeric open responses or selected responses, in addition to tutor-marked assignments. Students receive feedback on their iCMA answers and are permitted to retry questions as many times as desired. The module ends with an exam which contains, among other components, long answer open response questions.
TABLE I. A cross-table of the ith stratum depicting the number of students in each group (male and female) to get a particular iCMA question correct or incorrect on the first attempt. The total number of students in the ith stratum is
In this study, we analyze iCMA questions to identify any gender bias, and look at exam long answer questions to address the scaffolding hypothesis. Data were collected over four recent presentations of the module: S207 in 2012-2014 and S217 in 2015. The total number of students completing the module in this time period was 5535: 4286 (77%) male and 1249 (23%) female.

A. The Mantel-Haenszel method
The Mantel-Haenszel method is a statistical technique used to identify differences between groups using a stratified data set [19]. The idea is that possible confounding variables will be captured by the stratification.
In this case, we wish to detect iCMA questions which exhibit significant male bias while accounting for student ability and question difficulty. Therefore we take our two groups to be male and female students, and stratify students according to ability as measured by their overall performance on iCMA questions. Table I shows a cross-table representing the number of students in each group answering an item correctly or incorrectly at the ith stratum. For each item, we calculate the odds ratio (the ratio of the odds of success between groups) at the ith stratum. A weighted average across all strata then provides the overall odds ratio for a particular question, referred to as the Mantel-Haenszel alpha, α_MH. For ease of comparison, this is often converted to a logarithmic scale, denoted α*_MH. The sign and magnitude of α*_MH indicate the direction and strength of bias within a question. Negative values indicate a bias in favour of males, meaning that male students have a greater probability of answering this question correctly compared to female students of equal ability. Likewise, positive values indicate a bias in favour of females. The absolute value of α*_MH indicates the strength of the bias, and is deemed to be significant if |α*_MH| ≥ 1 [20].
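For reference, the three quantities described in this paragraph can be written in their standard Mantel-Haenszel form. This is a sketch under stated assumptions, not a transcription of the original equations: the cell labels A_i, B_i, C_i, D_i and the total T_i are assumed names for the entries of Table I, and the scale factor -2.35 is the conventional delta-scale factor, chosen here because it reproduces the stated sign convention (negative values favour males when the odds ratio is taken as male over female).

```latex
% Assumed labelling of the ith cross-table (Table I):
%   A_i = males correct,    B_i = males incorrect,
%   C_i = females correct,  D_i = females incorrect,
%   T_i = A_i + B_i + C_i + D_i  (total students in stratum i)
\begin{align}
  \alpha_i        &= \frac{A_i / B_i}{C_i / D_i} = \frac{A_i D_i}{B_i C_i}, \\
  \alpha_{MH}     &= \frac{\sum_i A_i D_i / T_i}{\sum_i B_i C_i / T_i}, \\
  \alpha^{*}_{MH} &= -2.35 \, \ln \alpha_{MH}.
\end{align}
```

With these conventions, α_MH = 1 gives α*_MH = 0 (no bias), and the threshold |α*_MH| ≥ 1 corresponds to an overall odds ratio above roughly 1.53 or below roughly 0.65.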
As a second assurance of significance, each α_MH value is tested using a chi-squared distribution. In this case, the null hypothesis is that the odds ratio is equal to one at each stratum, and the alternative hypothesis is that at least one odds ratio is different from unity [19]. In this study, we flag questions as having significant bias if both conditions i) |α*_MH| ≥ 1 and ii) p ≤ .05 are satisfied.
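The procedure described in this subsection can be sketched in Python as follows. This is a minimal illustration and not the analysis code used in the study. Several details are assumptions: the function names, the equal-width ability bands in `build_strata`, the cell ordering (male correct, male incorrect, female correct, female incorrect), the -2.35 scale factor for the logarithmic form, and the use of the continuity-corrected Mantel-Haenszel chi-squared statistic with one degree of freedom.

```python
import math
from collections import defaultdict


def build_strata(records, n_strata=5, max_score=100.0):
    """Tally one item into ability bands (assumed equal-width bands).

    records: iterable of (gender, ability_score, correct), gender "M" or "F".
    Returns one (a, b, c, d) tuple per non-empty band, ordered by ability:
    a = males correct, b = males incorrect,
    c = females correct, d = females incorrect.
    """
    tables = defaultdict(lambda: [0, 0, 0, 0])
    width = max_score / n_strata
    for gender, score, correct in records:
        band = min(int(score // width), n_strata - 1)
        col = (0 if correct else 1) + (0 if gender == "M" else 2)
        tables[band][col] += 1
    return [tuple(tables[k]) for k in sorted(tables)]


def mantel_haenszel(strata):
    """Return (alpha_mh, alpha_star, chi2, p) for a list of 2x2 strata."""
    num = den = 0.0
    sum_a = sum_expected = sum_var = 0.0
    for a, b, c, d in strata:
        t = a + b + c + d
        if t < 2:
            continue  # a stratum with fewer than 2 students carries no information
        num += a * d / t
        den += b * c / t
        sum_a += a
        sum_expected += (a + b) * (a + c) / t
        sum_var += (a + b) * (c + d) * (a + c) * (b + d) / (t * t * (t - 1))
    if num <= 0 or den <= 0 or sum_var <= 0:
        raise ValueError("odds ratio undefined for these strata")
    alpha_mh = num / den                      # male-over-female common odds ratio
    alpha_star = -2.35 * math.log(alpha_mh)   # assumed scale: negative favours males
    chi2 = max(abs(sum_a - sum_expected) - 0.5, 0.0) ** 2 / sum_var
    p = math.erfc(math.sqrt(chi2 / 2.0))      # chi-squared survival function, 1 dof
    return alpha_mh, alpha_star, chi2, p


def is_flagged(alpha_star, p, threshold=1.0, p_max=0.05):
    """Both conditions from the text: |alpha*_MH| >= 1 and p <= .05."""
    return abs(alpha_star) >= threshold and p <= p_max
```

Requiring both conditions means that a large odds ratio estimated from few students is not flagged, and neither is a statistically significant but small effect, which mirrors the treatment of the female-biased items discussed in the results.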

B. Analysis and results
Applying the Mantel-Haenszel method to 56 iCMA questions flags 3 questions with significant bias, all in favour of male students. A further 2 questions in favour of female students were noted to be of interest, having significant p-values but |α*_MH| values below the threshold. These items were included as the only questions of some significance with female bias. Table II shows the α*_MH values with significance levels for each question of interest. Questions are labeled as M1, M2, M3 (those having male advantage) and F1, F2 (those having female advantage). We note that M1, M2, and M3 all display very strong levels of bias, with |α*_MH| values well above the threshold.
Fig. 1 shows the items displaying male bias. Notably, all questions require interpreting a diagram of more than one dimension, which we find to be consistent with current literature. Wilson et al. [21] studied the impact of question structure on the gender gap along five broad dimensions: content, process required, difficulty, presentation and context. They observed large gender gaps in favour of males for questions which involved the process of interpreting a diagram, which presented the question using a significant diagram, and which involved more than one spatial dimension. Studies which aim to identify gender gaps on FCI questions have observed the largest disparities on items 6 (path of ball leaving a channel), 12 (path of cannonball fired off a cliff) [22], 14 (path of object released from an airplane), and 23 (path of a rocket after thrust is turned off) [1]. Clearly all of these items involve predicting motion in two dimensions, and all are presented using a diagram. The observed gender gap on projectile-motion-like items is sometimes ascribed to male-biased context [23]. However, attempts to reword FCI items in a more traditionally female context have failed to improve female performance [24].
In light of this discussion we find the inclusion of item M3 particularly interesting. The content is thermodynamics, far removed from kinematics or predicting motion. The context is certainly not experienced or male-biased, and yet a large and significant gap is observed. The only identifiable common trait among all items is the need to interpret a multi-dimensional diagram.
Fig. 2 shows the items displaying female bias. As previously stated, these questions have significant p-values but do not have significant |α*_MH| values, implying that the bias is small. Nonetheless, these items are of interest as the only female-biased questions of some significance. Both items involve careful reading, a task suggested to have a female advantage [25]. Interestingly, item F2 is on the subject of predicting motion. This observation further supports the idea that male bias arises from the need to interpret a diagram or multi-dimensional context, rather than content related to predicting motion.
Other important observations arise from those questions which were not deemed to have significant bias. In particular, we address the widely held belief of male advantage on multiple choice style questions [26][27][28][29]. Of 20 iCMA questions presented in multiple choice format, none are observed to have a significant gender gap. These include questions similar in content to items M1, M2, and M3. Furthermore, when item gaps are ranked in order of significance, we find that multiple choice questions populate the less significant end of the spectrum. We conclude that there is no evidence to suggest a female disadvantage owing to multiple choice structured questions.

A. Scaffolding definition
Scaffolding is broadly defined to have occurred when an expert or more knowledgeable person helps a learner to accomplish tasks that would otherwise be unattainable [30]. A traditional example would be a teacher providing strategic guidance and feedback while a student completes a problem. In more recent years this definition has evolved to include interactive computer-assisted learning, as well as peer instruction and similar socialized learning environments [31].
Due to widespread usage of the term "scaffolding" in multiple circumstances, it is important to carefully define the term in the context of physics education research. In the present study, we consider scaffolding only as it may be applied to written exam questions. We define 6 general ways in which scaffolding can occur (elements), and further provide specific instances of each that are likely to be encountered in physics problems. Table III shows a complete itemization of the elements. Many elements are adapted from the guidelines outlined in [32], which combines theoretical foundations with prior work to define a common framework for scaffolding within computer-assisted assignments. The element of conceptual prompting is motivated by [16]. There it was shown that students will successfully apply physics concepts to problems if they are prompted to identify the concept immediately beforehand. Taken together, the elements listed in Table III define what is meant by scaffolding within this study.

B. Gains by gender
In a study on question structure and its impact on the gender gap, Gibson et al. [15] administered 2 versions of an exam. One exam used highly scaffolded questions, and one used traditional exam style questions. Between the low and high scaffolding versions, female students achieved a gain in exam score of 13.4% while male students achieved a gain of 9.0%. The study therefore concludes that scaffolding benefits all students, but that female students benefit preferentially. We observe no such preferential benefit, and argue that other factors may be at play.
Using the elements of scaffolding and individual items as a scoring system, all exam questions were assigned a "scaffolding score". Questions displaying 2 or fewer items were labeled as low scaffolding, and questions with 7 or more items were labeled as high scaffolding. All questions belonging to either group can be found in the appendix. Fig. 3 shows the performance of students on each question by gender, and Table IV shows the average performance as well as gains provided by increased scaffolding. The average gain is 6.6% for female students, and 5.2% for male students.
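The grouping rule described above can be illustrated with a short Python sketch. It assumes that a question's scaffolding score is simply the number of distinct Table III items (numbered 1 to 20) that the question displays; the function names and the handling of intermediate scores are hypothetical and only meant to make the thresholds concrete.

```python
LOW_MAX = 2    # "2 or fewer items" is labeled low scaffolding
HIGH_MIN = 7   # "7 or more items" is labeled high scaffolding
N_ITEMS = 20   # number of items listed in Table III


def scaffolding_score(items_present):
    """Count the distinct Table III items displayed by one exam question."""
    items = set(items_present)
    invalid = sorted(i for i in items if not 1 <= i <= N_ITEMS)
    if invalid:
        raise ValueError(f"unknown Table III item numbers: {invalid}")
    return len(items)


def scaffolding_group(score):
    """Return "low", "high", or None for questions in neither group."""
    if score <= LOW_MAX:
        return "low"
    if score >= HIGH_MIN:
        return "high"
    return None
```

For example, a question that only gives constants (item 5) and tells the student to ignore air resistance (item 9) scores 2 and falls in the low scaffolding group, while questions scoring between 3 and 6 are left out of the comparison.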
TABLE III. Itemization of the elements of scaffolding.
Use of representations and language to bridge expert-novice understanding
1. technical words are described in everyday language
2. mathematical symbols are explained in words
3. a diagram is used to give meaning to technical words or symbols
Reduction of cognitive overhead
4. includes a math (or other background) reminder
5. somehow automates a routine task (e.g. unit conversions given, constants given that could have been looked up)
6. no penalty for missing sig figs, wrong unit, wrong numeric value or other nonsalient component of the question
7. provides a diagram or graph that the student could have constructed with the available information
Insertion of expert knowledge
8. expert directed focus is used (e.g. key information is highlighted using bold or italicized text)
9. explicitly instructs student to make an expert assumption (e.g. "you may ignore air resistance")
10. the student is warned of a common mistake or relevant misconception
Ordered task decomposition (provide structure for complex tasks)
11. each part of the question contains only one expected output (numeric or otherwise)
12. an output (numeric or otherwise) is required in subsequent work
13. marks are awarded for interpreting outputs (no further calculation required)
14. question has a wide mark distribution (each part is worth less than 50% of the total awarded marks)
Conceptual prompting
15. asks student to define or explain an equation that they should use
16. asks student to identify a concept that they should make use of
17. asks student to draw a diagram before beginning the problem
Reduction of degrees of freedom
18. gives student the appropriate equation to use
19. prompts at how the question is expected to be solved (e.g. "using the principle of conservation of energy...")
20. explicitly instructs student on how to begin a task
Although not as clear, the data does at first glance seem to support the conclusions of [15]. Male students outperform female students on the low scaffolding questions by 2.9% (p = .087), and by only 1.4% on the high scaffolding questions (p = .42). However, neither result is statistically significant, and we should also consider how scaffolding benefits students performing at different levels. Intuitively, we expect that scaffolding cannot greatly benefit the highest achieving students (who likely know the information and do not have much room to improve) or the lowest achieving students (who are too unprepared for scaffolding to be of use). Students completing modules S207 and S217 are assigned a level (1-4) based on overall performance on the module (1 being the highest level of achievement). Table V shows the average score of students on the low and high scaffolding questions by level, as well as the number of male and female students in each level. As expected, scaffolding provides the greatest gains to the intermediate students. Weighting the gain at each level by the number of female and male students in that level gives an estimate of the expected gains by gender. Doing this, we estimate expected gains of 6.2% for female students, and 5.8% for male students. The expected gain is higher for female students as a consequence of the fact that fewer female students achieve level 1. The expected gains are not significantly different from the actual gains for either gender, and therefore we conclude that preferential female gain is simply an artifact of gain dependency on level.
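The level-weighted estimate of expected gains described above amounts to a weighted mean, which the following Python sketch makes explicit. The numbers used are hypothetical placeholders chosen for illustration only; they are not the Table V values, and the function name is an assumption.

```python
def expected_gain(gain_by_level, count_by_level):
    """Weighted mean of per-level scaffolding gains.

    gain_by_level:  {level: gain in percentage points}, same for both genders
    count_by_level: {level: number of students of one gender at that level}
    """
    total = sum(count_by_level.values())
    if total == 0:
        raise ValueError("no students in any level")
    weighted = sum(gain_by_level[level] * n for level, n in count_by_level.items())
    return weighted / total


# Hypothetical inputs: intermediate levels (2 and 3) gain most from scaffolding.
gains = {1: 2.0, 2: 8.0, 3: 8.0, 4: 2.0}
female_counts = {1: 10, 2: 40, 3: 40, 4: 10}
male_counts = {1: 40, 2: 30, 3: 20, 4: 10}

female_expected = expected_gain(gains, female_counts)
male_expected = expected_gain(gains, male_counts)
```

In this made-up example the group with fewer students at level 1 ends up with the larger expected gain even though the per-level gains are identical for both groups, which is the mechanism the paragraph proposes for the apparent female advantage.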

C. Questions of interest
Although scaffolding does not appear to preferentially benefit female students in general, we note some particular questions of interest. Fig. 4 shows one question from the low scaffolding group (L), and one question from the high scaffolding group (H). Both are 2-dimensional projectile motion questions, but display significant performance differences. Table VI shows the average performance on each question, and the difference between genders with significance levels. Of all exam questions, L exhibits one of the most significant differences in performance between genders, and H shows no significant difference. The scaffolding gains are comparable to those observed in [15] (13.4% for females, 8.8% for males). We conclude that scaffolding may play a role in reducing the gender gap in specific types of problems which were previously identified to contain a male bias, namely questions involving multi-dimensional context.
In summary, we have identified elements of question structure that promote male bias, and further address the scaffolding hypothesis as a potential solution. We conclude that neither the use of multiple-choice style questions nor the level of scaffolding can sufficiently explain the gender gap.
We have used a Mantel-Haenszel stratified analysis to account for student ability, and find iCMA questions with significant performance differences between genders. By flagging only those questions which display significant bias in both measures (|α*_MH| and p), we have reduced the possibility of flagging false positives. We therefore conclude that the 3 flagged questions exhibit real and significant male bias. All questions involve interpreting a diagram, and all involve multi-dimensional context. Our findings are in agreement with [21], and similar studies on the FCI [1,22]. Because multi-dimensional diagrams appear most frequently in mechanics problems, previous studies may have incorrectly attributed male bias to mechanics content. Further investigation with more types of questions will be required to separate the variables of content and presentation.
Scaffolding has recently been argued to preferentially benefit female students [15], and therefore have the potential to aid in reducing the gender gap. The study of Gibson et al. uses a smaller number of students and less varied exam content than the present study to reach this conclusion. In a similar analysis, we do not observe a dependence on gender, and argue that any perceived dependence is actually due to student achievement level. The advantage of [15] is that exam questions were designed specifically to measure scaffolding gains, whereas the present study collected data from actual exam responses. Therefore questions between the low and high scaffolding groups do not match onto each other exactly as in [15]. Future studies can make use of the elements of scaffolding to produce low and high scaffolding versions of the same question for use in experimental exams.
Even if scaffolding does not preferentially benefit female students in general, it may still play a role in reducing the gender gap. We make note of a pair of questions involving multi-dimensional context (2D projectile motion), for which the gap is reduced between low and high scaffolding versions. If male bias within a question can be reduced by increased scaffolding for novice students, then this provides a route to addressing gender gaps in attainment.

VI. ACKNOWLEDGMENTS
The authors acknowledge financial support from eSTEeM, the OU centre for STEM pedagogy, as well as the co-operation of the S207 and S217 module teams, and useful discussions with Richard Jordan.