Exploring gender differences in the Force Concept Inventory using a random effects meta-analysis of international studies

The Force Concept Inventory (FCI) is one of the research-based assessments (RBAs) established by the physics education research (PER) community to measure students' understanding of Newtonian mechanics. Earlier work has often reported gender gaps in mean FCI scores favoring male students, notably in North America (NA)-based studies. Nevertheless, these performance gaps remain inconclusive and unexplored outside the NA context. This paper aims to fill this gap by meta-analyzing the mean FCI scores by gender reported in the existing PER literature beyond the NA context. We analyzed the magnitude and direction of the difference in mean FCI scores between genders on the basis of primary international studies published over the last two decades. We also explored the moderating impact of study characteristics on the meta-analytic findings by performing a subgroup analysis with study region stratified into two subgroups (NA vs non-NA authors). Thirty-eight studies reporting the mean FCI scores by gender were included in the present meta-analysis. We employed the Hedges' g statistic to estimate the degree to which the mean FCI scores differ between male and female students in each study. Under a random effects model, we meta-analyzed the findings and conducted a subgroup analysis to answer the research questions. In summary, our meta-analysis indicated a significant, positive, and moderate gender difference in mean FCI scores in favor of male students in both NA- and non-NA-based regions, and the performance gaps were wider in the NA-based studies. Suggestions are discussed for promoting gender fairness in the FCI when interpreting its scores for teaching, learning, and forthcoming studies.


I. Introduction
Assessment is an educational domain that warrants continuous research within the scope of PER [1][2][3][4]. One main concern of this subset of PER is designing valid and reliable research-based assessments (RBAs) to gauge students' understanding of physics concepts. RBAs can also be administered as a useful measure to examine physics learning reforms within PER. The PER community has systematically archived its established RBAs, which are openly available from PhysPort [5] or on the LASSO platform [6]. Some of the widely known RBAs published by the community include the Force Concept Inventory (FCI) [7], the Force and Motion Conceptual Evaluation (FMCE) [8], the Conceptual Survey of Electricity and Magnetism (CSEM) [9], and the Brief Electricity and Magnetism Assessment (BEMA) [10]. Without diminishing the significance of the other RBAs, this paper focuses on the FCI as one of the most widely implemented RBAs within the community over decades.
The FCI is consistently popular as an RBA that assesses students' conceptual understanding of Newtonian mechanics within PER, notably in undergraduate physics education. Initially, it was designed as 29 multiple-choice conceptual questions [7], which were later revised into the current 30 items [11]. The FCI items were written based on findings from qualitative and quantitative studies about the taxonomy of students' knowledge states in understanding Newtonian mechanics [12]. Since then, many PER scholars have studied its measurement properties using multiple approaches [13][14][15][16]. They have also summarized the evidence in systematic quantitative studies [17,18]. These studies found the FCI to be an acceptable and useful conceptual metric for measuring students' understanding of Newtonian mechanics and for examining the effectiveness of innovative physics learning strategies developed by PER scholars [19,20]. Recently, this exploration has been enriched by more sophisticated analytical approaches, such as item response theory [21][22][23], cluster analysis [24], network analysis [25,26], and machine learning [27,28]. Moreover, some studies have attempted to validate the FCI constructs within the framework of factor analysis [11,29,30]. Based on their factor analysis findings, Scott and colleagues [31] found that the concept of "Newtonian force" is perfectly valid and that the division of this concept into subcategories by Hestenes and Halloun [7] is also perfectly valid.
The FCI is thus one of the most widely implemented RBAs behind the milestones of PER studies to date.
Viewed through a sociodemographic lens, the study of the FCI in terms of gender has attracted much attention in PER [21,32,33,34]. Many studies addressing gender differences¹ in the FCI have been conducted to date [35][36][37][38][39][40][41]. Surprisingly, after the FCI was adopted as a probe of conceptual understanding, PER scholars discovered gender discrepancies in mean FCI scores favoring male students [18,42,43,44]. This phenomenon deserves further study because the literature reports inconclusive findings regarding the magnitude and direction of the gender performance gaps in the FCI. Some studies suggest that female students underperformed on the FCI compared with male students [42,45], while female students outperformed male students elsewhere [46,47]. It is, however, unclear whether the FCI significantly affects students' performance by gender in favor of male students [18]. A meta-analysis was attempted in Ref. [18], but it is limited to summarizing PER literature from the North America (NA)-based region, where the original FCI authors are affiliated². As such, it is critical to review the published FCI scores by gender in the global PER literature beyond the origin of FCI development. This study contributes to filling the gap left by the former meta-analysis, which is confined to that limited environment.
Meanwhile, the FCI has been widely adopted by PER researchers from other parts of the world, across Europe [34,66,67], Asia [46,47,68-76], Australia [77], and Africa [78,79]. These research activities beyond the NA-based regions reflect the global reputation of the FCI. However, the non-NA context was never explored by the former meta-analysis. Hence, the meta-analysis attempted in Ref. [18] needs to be updated by reviewing more literature to accommodate the international diversity of the PER literature on the FCI gender performance gaps. The goal of this paper is to extend the former meta-analysis beyond the area of NA-affiliated PER authors.
One can argue that the cultural mechanisms reflected in international publications could moderate the gender differences in mean FCI scores. The gender performance gaps obtained in the NA-based environment were often reported in favor of male students [21,32,33,42,43,45,48-64]. In the NA literature, female students were also identified as an underrepresented group within physics departments. By contrast, these performance gaps might differ in the non-NA-based studies, as reported by Refs. [80][81][82], since the culture of physics students differs. In this study, we hypothesize that this underexplored phenomenon outside the NA context could have a moderating impact on the gendered FCI performance gaps. This hypothesis is drawn from relevant work by sociologists investigating gender differences through between-country comparisons [83,84]. In four national contexts, Chan et al. [83] explained that gender differences in physics might be shaped by social identities, social locations, and country contexts from three theoretical ideologies perceived by scientists' attitude toward gender.
Their explanation for gender differences in physics varied by social identity and social location in country-specific ways. In mathematics education research, gender differences favoring male students were also reported by Ayalon and Livneh [84]. They revealed that between-country differences in gender performance gaps might be moderated by the degree of educational system standardization implemented in each country. Educational system standardization, such as national examinations and less between-teacher instructional variation, was evidently a major factor in reducing the advantage of boys over girls in mathematics scores and in the odds of excelling. Drawing on these sociological findings, this paper does not merely report the common magnitude and direction of the gendered mean FCI scores combined by a meta-analysis. It also follows up with a subgroup analysis to examine whether different cultures, as represented by distinct study locations, moderate the gendered mean FCI scores summarized by the meta-analytic methodology.
A subgroup meta-analysis is motivated by the prevalence of high heterogeneity between the studies included in a meta-analysis. This heterogeneity should be acknowledged in the choice of meta-analysis model [85]. The FCI has been widely recognized within the PER community across international study locations. Regional characteristics of the samples, such as demographics, socioeconomic status, culture, educational systems, and systems of belief, may inevitably be associated with the gendered mean FCI scores summarized by our study. The earlier meta-analysis published by Madsen et al. [18] is limited in addressing the high heterogeneity of the international landscape. They summarized the gendered mean FCI scores as a weighted average without analyzing the potential heterogeneity of the pooled studies. Ref. [18] utilized a fixed effect model³, a meta-analysis model that assumes the true effect size of interest (the FCI gender gap) is exactly identical across the pooled studies [86]. This assumption holds only if the studies are similar and follow identical methodology. Therefore, the fixed effect model is limited in handling nonideal conditions in which a meta-analysis summarizes heterogeneous study results [87].
Ignoring heterogeneity leads to an overly precise summary result (that is, the confidence interval is too narrow) and may wrongly imply that a common effect size of interest exists when actually there are real differences in characteristics across studies [88]. To address this limitation, a random effects meta-analysis model is utilized to estimate the summary of gender differences in mean FCI scores based on the international studies.
This study sought to summarize the FCI performance gaps between genders, which have been found inconclusive and underexplored outside NA-based studies. Using an international selection of existing PER studies beyond the NA context, we first summarize the magnitude and direction of the difference in mean FCI scores between genders and subsequently investigate the degree to which it is moderated by geographical region (stratified into two subgroups, NA and non-NA countries). Two research questions guide this paper as follows.
RQ1. To what degree are there differences in the magnitude and direction of mean FCI scores between genders?
RQ2. How does study location (NA vs non-NA studies) moderate the magnitude and direction of mean FCI scores summarized by the present meta-analysis?
The implication of this paper, within its limitations, is to explore the degree to which gender differences are visible in the mean FCI scores reported by the body of published PER literature to date. It should be noted that this paper cannot establish conclusions at the item level. Whether the FCI gender gaps indicate a lack of measurement invariance must be validated through psychometric studies such as differential item functioning (DIF) analysis. Future studies using this method are suggested to enrich the current paper. Replication studies are also welcome to extend the present meta-analytic findings and to summarize the gendered FCI performance gaps based on more recent and wider international PER literature.

II. Method
This paper is a meta-analysis aimed at summarizing the gendered mean FCI scores reported by published PER studies. As in a typical systematic review, a literature search was first conducted to prepare the dataset (a set of literature). This tedious phase was iterated regularly during the study process using the "backward snowballing" technique [85]. To capture the non-NA context, the PER literature was identified from the international landscape of FCI administration beyond the NA environment. After the eligible literature had been collected, we summarized the mean FCI scores by gender using a meta-analytic review under the random effects model³. We then assessed the potential for publication bias, which might threaten the validity of the meta-analytic conclusion [89]. Subsequently, a subgroup analysis was conducted to examine the moderating effect of different study locations on the gendered mean FCI scores.

A. Literature search
Google Scholar is one of the available indexing databases for systematic review studies [90]. This database was chosen because we believed that some studies might be excluded by subscription-based indexing databases (e.g., Scopus and Web of Science). We gathered our PER literature by searching this database using the query "gender" AND "force" AND "concept" AND "inventory". We scrutinized each results page for publications between 2002 and 2024 consecutively. The international contexts beyond the circle of NA-affiliated PER scholars were prioritized. In this study, we encompassed the reported mean FCI scores by gender from authors across five continents (America, Europe, Asia, Australia, and Africa). Nevertheless, only English-language literature was analyzed in the present meta-analysis. Interested readers can replicate our study identification process and retrace the analyzed literature, which is listed in full in the reference section.

Figure 1. Flow diagram of the inclusion and exclusion criteria for study identification
The inclusion and exclusion criteria for identifying the eligible PER literature in this meta-analysis are illustrated in Figure 1. This workflow follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guideline [85]. With regard to RQ1, the first inclusion criterion was the aforementioned search query, which returned 74 studies between 2002 and 2024.
Subsequently, the screening criteria required that eligible studies explicitly mention the FCI as their measurement tool and describe its results by gender (male and female). Admittedly, most of the PER studies addressed other specific subjects rather than gender. Fortunately, they often reported the mean FCI scores by gender. We therefore extracted the quantitative findings, although we had to scrutinize the contents manually to select studies that clearly provided a gendered comparison of the mean FCI scores. The screening process yielded 49 articles as the prefiltered dataset of our study identification.
The final exclusion criterion was the mathematical requirement for the effect size calculation. The prefiltered literature had to report the statistical parameters needed to calculate the effect size of the FCI gender differences. As described earlier, this study conducted a direct comparison of the mean FCI scores between the two gender groups (male vs female). Therefore, a group comparison effect size was appropriate for the purpose of our meta-analysis. Three statistical parameters, namely the sample size ($n$), the mean FCI score ($\bar{x}$), and the standard deviation ($s$) of each gender group, had to be explicitly reported by the eligible literature [85]. See Equations (1) to (5) below for the formulas of the effect size calculation. Based on this final exclusion criterion, thirty-eight studies were eligible as our dataset (a set of literature) to be reviewed and summarized to answer the research questions in this paper.
Among the eligible PER literature, a pretest-posttest experimental design was the most prevalent research design. Some studies provided only pre- or post-instruction data in their findings (e.g., Normandeau et al. [60]). Without diminishing the significance of either the FCI pretest or posttest data, we decided to use the pre-instruction data, which usually provided a larger sample size in most of the published literature. As such, the gendered scores extracted from the pretest FCI data would be more robust than the post-instruction data with smaller sample sizes. We assumed that there is no significant difference in the students' preparation between the pretest and posttest sessions. Hence, the FCI was assumed to be administered equally under those two testing conditions. The excluded posttest data remain worth noting for understanding gender differences in the FCI, and we suggest that future studies analyze the post-instruction data. After this selection, we extracted the statistical parameters from the FCI pretest data to estimate the gendered mean FCI scores from the eligible studies.

B. Quantifying the magnitude of the FCI gender differences from each study: Hedges' 𝒈 statistic
The meta-analysis performed in this paper followed the standard procedure of meta-analytic methodology for research synthesis [85]. After the set of eligible literature had been selected ($n = 38$), we prepared a spreadsheet file to tabulate the sample size ($n$), the mean ($\bar{x}$), and the standard deviation ($s$) of the FCI score for the male and female groups reported by each study. To keep the tabulated FCI scores from the pooled literature on a common scale, all scores were expressed on the raw-score scale defined by the maximum raw FCI score. For studies reporting the FCI scores as a percentage, a linear transformation to the raw score was applied before the scores were used to estimate the gender differences in the mean FCI scores. After that, we quantified the mean differences using Cohen's $d$ [91] to compare the students' performance on the FCI between the male and female groups as follows.
$$d = \frac{\bar{x}_m - \bar{x}_f}{s_{pooled}}, \qquad s_{pooled} = \sqrt{\frac{(n_m - 1)s_m^2 + (n_f - 1)s_f^2}{n_m + n_f - 2}} \qquad (1)$$
Here, the subscript $m$ indicates the male group and $f$ denotes the female group. Based on Equation 1, the direction of the FCI gender difference is defined by the sign of the $d$ value. A positive $d$ value indicates that the male group obtained the higher mean FCI score.
Then, a measure of the discrepancy from the estimated true effect size between studies is given by the standard error of Cohen's $d$ ($SE_d$), which is estimated using Equation 2 as follows.
$$SE_d = \sqrt{\frac{n_m + n_f}{n_m n_f} + \frac{d^2}{2(n_m + n_f)}} \qquad (2)$$
Following Ref. [92], a correction factor is suggested for literature with a small sample size, as is the case for some studies in our pooled literature (e.g., Carleschi et al. [78]). Borenstein and colleagues [92] noted that the Cohen's $d$ estimate can be sample dependent when the sample size is small. Thus, they proposed adjusting Cohen's $d$ into Hedges' $g$ using the correction factor $J$. It can be calculated as follows.
$$J = 1 - \frac{3}{4\,df - 1} \qquad (3)$$
Here, $df$ is the degrees of freedom, which is equal to $n_m + n_f - 2$. Then, Hedges' $g$ and its standard error for the gender gaps in mean FCI scores are calculated using Equations 4 and 5 as follows.
$$g = J \times d \qquad (4)$$
$$SE_g = J \times SE_d \qquad (5)$$
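The effect size calculation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the analysis code used in the paper (which relied on R): it assumes the standard pooled-standard-deviation form of Cohen's d and the small-sample correction J = 1 - 3/(4 df - 1) given in Ref. [92]. The function names and the example numbers are hypothetical.

```python
import math


def percent_to_raw(percent, max_score):
    """Linearly convert a percentage score to the raw-score scale."""
    return percent / 100 * max_score


def hedges_g(n_m, mean_m, sd_m, n_f, mean_f, sd_f):
    """Return (g, se_g) for the male-minus-female difference in mean score.

    Assumes the standard formulas: Cohen's d with a pooled standard
    deviation, its standard error, and the correction factor J.
    """
    df = n_m + n_f - 2
    # Pooled within-group standard deviation
    s_pooled = math.sqrt(((n_m - 1) * sd_m**2 + (n_f - 1) * sd_f**2) / df)
    # Cohen's d: positive when the male group scores higher
    d = (mean_m - mean_f) / s_pooled
    se_d = math.sqrt((n_m + n_f) / (n_m * n_f) + d**2 / (2 * (n_m + n_f)))
    # Small-sample correction factor
    j = 1 - 3 / (4 * df - 1)
    return j * d, j * se_d


# Hypothetical study: the numbers below are invented for illustration only.
g, se_g = hedges_g(n_m=120, mean_m=15.0, sd_m=6.0, n_f=80, mean_f=12.0, sd_f=6.0)
print(f"g = {g:.3f}, SE = {se_g:.3f}")
```

For large samples the correction factor is close to 1, so g and d nearly coincide; the adjustment matters mainly for the small studies in the pooled literature.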
A spreadsheet file was eventually created to organize the list of authors, years, regions, the estimates of the FCI gender differences ($g$), and their standard errors ($SE_g$) as our main dataset for the meta-analysis. A random effects meta-analysis was estimated using the "metafor" package in the open-source R language [93]. The gender difference in mean FCI scores ($g$) in each study was then labeled based on Cohen's recommendation [91]. A $g$ value greater than 0.2 is labeled "weak", greater than 0.5 "moderate", and greater than 0.8 "strong" for the difference in the mean FCI scores between male and female students.
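The labeling rule can be expressed as a small helper. This is a sketch under two assumptions that the text does not state explicitly: the thresholds are applied to the absolute value of g, so that the label describes magnitude regardless of direction, and values at or below 0.2 receive a placeholder label of our own. The function name is hypothetical.

```python
def label_effect_size(g):
    """Label the magnitude of a Hedges' g value using Cohen's thresholds.

    Assumption: thresholds apply to |g|; the label for |g| <= 0.2 is a
    placeholder, since the text does not name that range.
    """
    size = abs(g)
    if size > 0.8:
        return "strong"
    if size > 0.5:
        return "moderate"
    if size > 0.2:
        return "weak"
    return "below weak threshold"


print(label_effect_size(0.61))  # the overall estimate reported in the results
```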
As described in Section I, the novelty of this study lies in employing the random effects model as our meta-analytic approach. This selection was made a priori by the researchers based on theoretical considerations and the nature of the heterogeneity, which was also evident from the meta-analysis results below [83,84]. Most meta-analyses are based on a collection of studies that may employ unique methods and characteristics [93]. It can be assumed that the studies involved in a meta-analysis vary not on just one of these characteristics but on several at the same time [89]. A random effects model relaxes the fixed effect assumption of a single common true effect size. Given the expected study diversity, a random effects model would often be more realistic [94]. In this study, a random effects model was estimated using the "rma.uni" function from the "metafor" package in the R language environment. A mathematical explanation of this model can be found in the Appendix.
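To make the idea of random effects pooling concrete, the following Python sketch computes a pooled estimate from per-study effect sizes and standard errors. It deliberately swaps in the closed-form DerSimonian-Laird estimator of the between-study variance for simplicity; it is not the estimator used by "rma.uni" (whose default, to our knowledge, is REML), so its numbers would not exactly reproduce the paper's results. The function name and the toy inputs are hypothetical.

```python
import math


def random_effects_dl(effects, std_errors):
    """Pool effect sizes under a random effects model.

    Uses the DerSimonian-Laird moment estimator for tau^2 (an assumption
    made for simplicity; not the paper's estimation method).
    """
    variances = [se**2 for se in std_errors]
    w = [1 / v for v in variances]  # fixed effect (inverse-variance) weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed effect estimate
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sum_w - sum(wi**2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)  # between-study variance, truncated at 0
    # Random effects weights add tau^2 to each sampling variance
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return {
        "estimate": pooled,
        "se": se,
        "ci95": (pooled - 1.96 * se, pooled + 1.96 * se),
        "tau2": tau2,
        "Q": q,
    }


# Toy input with two invented studies, for illustration only.
result = random_effects_dl([0.2, 0.8], [0.1, 0.1])
print(result)
```

When the studies disagree more than their sampling errors allow, tau^2 becomes positive, the weights become more equal across studies, and the confidence interval of the pooled estimate widens compared with a fixed effect analysis.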
When high heterogeneity is present within the pooled literature, it may be due to subgroups in the data that have different true effect sizes [89,94]. Therefore, we followed up with a subgroup analysis for the second research question to account for the unexplained heterogeneity in the meta-analysis findings. In this paper, the subgroup meta-analysis was intended to study whether different study locations moderate the identified gender differences in the mean FCI scores. It was conducted under the random effects model. Ref. [89] proposed that a subgroup analysis can be done if each subgroup contains at least 10 studies. To satisfy this requirement on the covariate distribution [95], we split the dataset into two large subgroups: the NA (n = 21) and non-NA (n = 17) subgroups². This grouping scheme is reasonable on two grounds.
First, we intended to extend the former meta-analysis in the literature, which is heavily focused on NA-based studies. Second, the results could be of interest because the FCI gender differences might present differently in the non-NA-based regions, which had never been explored before. A subgroup analysis is therefore worth reporting to examine the more diverse cultural contexts beyond the NA-based environment.

C. The $R^2$ index: an intuitive measure of explained heterogeneity
The comparison of the subgroups alone does not tell us how much heterogeneity is explained by the subgroup analysis. To estimate this, Spineli and Pandis [96] proposed an index that quantifies the proportion of heterogeneity explained by the covariates studied in a subgroup analysis. This index is termed the $R^2$ index, and it is defined as the ratio of the explained heterogeneity to the total heterogeneity as follows.
$$R^2 = \frac{\tau^2_{total} - \tau^2_{pooled}}{\tau^2_{total}} = 1 - \frac{\tau^2_{pooled}}{\tau^2_{total}} \qquad (6)$$
where $\tau^2_{pooled}$ is the pooled heterogeneity after the subgroup analysis is conducted and $\tau^2_{total}$ is the heterogeneity before the subgroup analysis is conducted (without splitting the included studies into subgroups under the random effects model). The $R^2$ index ranges between 0 and 1 (or 0%-100% if expressed as a percentage). The closer the $R^2$ index is to 1 (100%), the more strongly the studied covariates determine the summarized effect sizes in a meta-analysis.
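As a worked illustration, the index can be computed directly from the two heterogeneity estimates. The sketch below assumes the "one minus ratio" form of the index and truncates negative values at zero so that the result stays in the stated 0 to 1 range; the function and argument names are hypothetical. The inputs are the rounded values reported in the results section (0.0620 before and 0.0549 after the subgroup analysis), so the output of about 11.5% differs slightly from the reported 11.38%, presumably because the published heterogeneity values are rounded.

```python
def r2_index(tau2_total, tau2_pooled):
    """Proportion of heterogeneity explained by the subgroup covariate.

    tau2_total: heterogeneity before the subgroup analysis.
    tau2_pooled: pooled heterogeneity after the subgroup analysis.
    """
    if tau2_total <= 0:
        raise ValueError("tau2_total must be positive")
    # Truncate at 0 so the index stays within [0, 1]
    return max(0.0, 1 - tau2_pooled / tau2_total)


r2 = r2_index(tau2_total=0.0620, tau2_pooled=0.0549)
print(f"R^2 = {100 * r2:.2f}%")
```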

D. Diagnosing potential of publication bias
The next important question for a meta-analysis under the random effects model is diagnosing the potential for publication bias. Publication bias implies an imbalance between the significant and nonsignificant results reported by the pooled literature in our dataset. In other words, there might be unpublished results missing from our dataset (set of literature) [92]. Rosenthal [97] referred to these unpublished cases as the "file drawer problem", whereas Card [98] termed it publication bias. Publication bias may threaten the interpretation of the meta-analysis results. A well-written study could remain unpublished in academic journals for multiple reasons; for instance, the manuscript may not fit the editorial policy [99]. Although publication bias might be a serious threat to meta-analysis studies, systematic methods for assessing the symmetry of the eligible published studies can mitigate the risk. In this study, we used graphical and statistical methods to assess the publication bias that might arise from small studies. We created a funnel plot, and its symmetry was statistically tested using Egger's regression and Kendall's rank tests [98].
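The logic behind Egger's regression test can be sketched as follows. This sketch assumes the classical formulation, in which the standardized effect (g divided by its standard error) is regressed on precision (one over the standard error) by ordinary least squares; an intercept far from zero suggests funnel plot asymmetry. It returns only the intercept, slope, and t statistic of the intercept, not a p value, and it is not the routine used in the paper. The function name and the example data are hypothetical.

```python
import math


def egger_intercept(effects, std_errors):
    """Classical Egger regression: z_i = b0 + b1 * precision_i + error.

    Returns (intercept, slope, t_intercept). The t statistic is NaN when
    the fit is exact and its standard error is zero.
    """
    x = [1 / se for se in std_errors]                      # precision
    z = [y / se for y, se in zip(effects, std_errors)]     # standardized effect
    n = len(x)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
    slope = sxz / sxx
    intercept = z_bar - slope * x_bar
    # Residual variance with n - 2 degrees of freedom
    ssr = sum((zi - intercept - slope * xi) ** 2 for xi, zi in zip(x, z))
    s2 = ssr / (n - 2)
    se_intercept = math.sqrt(s2 * (1 / n + x_bar**2 / sxx))
    t = intercept / se_intercept if se_intercept > 0 else float("nan")
    return intercept, slope, t


# Invented data in which smaller studies (larger SE) report larger effects.
b0, b1, t0 = egger_intercept([0.3, 0.4, 0.6, 0.9], [0.05, 0.1, 0.2, 0.4])
print(f"intercept = {b0:.3f}, slope = {b1:.3f}, t = {t0:.2f}")
```

In a symmetric funnel, effect sizes do not depend on study precision, so the fitted line passes near the origin and the intercept stays close to zero.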

A. Descriptive results
We demonstrated that most of the included PER literature in the present meta-analysis was obtained from peer- [47,76]. Overall, the meta-analysis findings are reported as a forest plot in Figure 2. The Hedges' $g$ (the gender difference in mean FCI scores), the corresponding confidence interval for each study, and the aggregated estimate of the gendered mean FCI scores with its prediction interval under the random effects model are presented in Figure 2.
A forest plot is commonly used in meta-analysis to describe the effect size of each study and the combined effect size estimate from the analyzed set of studies. In this paper, the complete list of authors' names, years, and regions is placed in the leftmost section of Figure 2. Adjacent to these columns is the estimate of the gender difference in mean FCI scores (effect size), visualized by horizontal bars (95% confidence intervals) with a square at each center representing the magnitude of the difference in mean FCI scores between genders, on an axis of observed outcomes between -1 and 2. These squares are distributed around the dashed line located at the "zero" effect size, which marks no difference in mean FCI scores between genders. A square to the right of the dashed line indicates a higher mean FCI score for male students. By contrast, a square to the left of the dashed line indicates that females outperform males on the FCI test [46,47]. As we can see, studies with small sample sizes yield greater standard error estimates, which can be recognized by the wider horizontal bars. Accordingly, a larger square indicates a larger magnitude of the FCI gender gap in a study. Overall, the common $g$ value is reported at the bottom of Figure 2 with some heterogeneity statistics ($Q$, $I^2$, $\tau^2$). A diamond indicates the summarized value of the gender difference in mean FCI scores based on our meta-analysis under the random effects model. A prediction interval centered at the overall $g$ value is also provided to express the uncertainty of the summary estimate [88].

B. The magnitude and direction of mean FCI scores between gender (RQ1)
Thirty-two studies reported significant gender differences in the mean FCI scores in favor of male students. A gender difference is significant when the 95% confidence interval of a study does not include zero (e.g., Chrysostomou [79], Eaton [32], Rahim & Ayop [73]). Overall, the magnitude of the effect sizes ($g$) of the individual studies analyzed in this meta-analysis ranged between -0.… Cohen [91], eight studies indicated strong effect sizes, most of the pooled studies (n = 20) demonstrated moderate effect sizes, and ten studies (e.g., Doherty [49], Al Rsa'I et al. [68], Abdal-Razzaq [47], Bahtaji [46]) indicated weak effect sizes. Most of the literature clearly showed that the mean FCI score differs significantly between males and females within the international PER literature.
Figure 2 was produced under the random effects model of the pooled literature to compute the overall result or summarized effect size (Hedges' $g$). The average gender difference in the mean FCI scores is 0.61, 95% CI [0.51, 0.70], with male students performing higher on the FCI test than female students. This summarized gender difference was significant, $z = 13.0013$, $p < .05$. In addition, the prediction interval of the FCI gender difference did not include zero, suggesting that future studies could also yield significant results.
Based on Cohen's labels, the overall FCI gender difference is of moderate magnitude. We found a positive direction of the gendered mean FCI scores, indicating that male students obtained higher FCI scores than female students in both the NA and non-NA literature. This result suggests that male students still tend to achieve significantly higher FCI scores than female students. Most of the included studies (n = 36) report a positive Hedges' $g$. Male gender was associated with a greater likelihood of obtaining a higher FCI score. Two studies from non-NA authors [46,47] showed that their female students outperformed the male cohorts as probed by the FCI test.
At the bottom of Figure 2, some statistical measures of heterogeneity are reported. From the $Q$ statistic, it was evident that we should reject the null hypothesis, suggesting significant heterogeneity among the effect size estimates reported by the pooled studies. The test for heterogeneity ($Q = 215.36$, $df = 37$, $p < 0.05$) suggested considerable heterogeneity among the true effects within the pooled studies. In addition, the $I^2$ measure quantifies how much heterogeneity is present. A greater $I^2$ value indicates that the included literature is more heterogeneous. In this study, we found an $I^2$ value of 91.3%, which exceeds the cutoff of 75%, the suggested rule of thumb above which substantial heterogeneity across studies remains unexplained. The third measure of residual heterogeneity ($\tau^2$) showed a similar finding.
Our $\tau^2$ value was positive (0.0620), which likewise supports rejecting the null hypothesis. Therefore, we followed up our meta-analysis with a subgroup analysis, whose findings are reported below.

C. The moderating effect of the different study locations on the gendered mean FCI scores (RQ2)
We revealed that the heterogeneity should be unavoidable in our pooled literature based on the examination using multiple statistical approaches (,  2 ,  2 ).Accordingly, this heterogeneity might be accounted by the influence of some possible covariates.Thus, we followed up further analysis to examine the moderating impact driven by different study locations stratified by two subgroups (NA vs non-NA studies).A subgroup analysis was then attempted to further examine the moderating effect.Due to the need of covariate distribution for subgroup analysis as previously described in the method section [89,95], we treated the study locations dichotomously.Both groups were NA ( = 21) and non-NA ( = 17) based studies.The null hypothesis of this subgroup analysis was that there are no significant differences of the FCI gendered score between NA-and non-NA affiliated studies.
The results of the subgroup analysis are presented in Table 1 and Table 2. Table 1 summarizes the overall heterogeneity test of the subgroup analysis between the NA- and non-NA-based studies. In Table 1, the estimated amount of heterogeneity ($\tau^2$) from the subgroup analysis was 0.0549.
The residual heterogeneity ($\tau^2$) obtained from the random-effects model in Figure 2 was 0.0620, whereas the $\tau^2$ calculated after the subgroup analysis was 0.0549 (Table 1). The $\tau^2$ value therefore decreased once the moderator was included. This reduction indicates the degree of heterogeneity within the pooled literature that could be accounted for by the studied covariate (study location, NA vs non-NA based literature). The amount of heterogeneity accounted for can be quantified using the formula given in Equation 6. It yielded $R^2 = [1 - 0.0549/0.0620] = 11.38\%$. Using the bootstrapping approach suggested by Viechtbauer [93], one can gauge the precision of the $R^2$ index through its confidence interval. The 95% confidence interval of our $R^2$ index ranged from 0% to 45.32%. This value is still acceptable according to the literature [100], and it quantifies the degree to which the reduced heterogeneity can be explained by the covariate of study location (NA vs non-NA). In other words, 11.38% of the total amount of heterogeneity could be accounted for by including study location as a moderator in the model.
To assess whether the impact of study location was statistically significant between NA- and non-NA-based studies, we report several heterogeneity tests in Table 1. The notation $\tau^2$ denotes the estimated amount of residual heterogeneity, $Q_E$ denotes the test for residual heterogeneity, and $Q_M$, as indicated by the subscript, denotes the test for moderators. The null hypothesis was rejected ($Q_M = 189.7808$, $df = 2$, $p < 0.05$), suggesting that study location (NA vs non-NA) significantly moderated the gender differences in the mean FCI scores reported by the PER literature globally. Meanwhile, the subgroup analysis still revealed a significant test for residual heterogeneity ($Q_E = 186.1874$, $df = 36$, $p < 0.05$). This is reasonable, since it possibly indicates that other moderating factors not included in our study influence the gender differences in the mean FCI scores [93].
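The share of heterogeneity accounted for by the moderator can be checked with a short calculation. The sketch below is only an illustration: the function name is hypothetical, and truncating negative values at zero is an assumed convention rather than something stated in the text. With the rounded $\tau^2$ values quoted above it gives about 11.45%, which is close to the reported 11.38%; the small gap presumably reflects rounding of the two $\tau^2$ estimates.

```python
def pseudo_r2(tau2_without_moderator, tau2_with_moderator):
    """Proportional reduction in tau^2 after adding a moderator.

    Negative values are truncated at zero (an assumed convention).
    """
    if tau2_without_moderator <= 0:
        return 0.0
    return max(0.0, 1.0 - tau2_with_moderator / tau2_without_moderator)


# Rounded tau^2 values as quoted in the text (Figure 2 and Table 1)
r2 = pseudo_r2(0.0620, 0.0549)
print(f"R^2 = {100 * r2:.2f}%")  # prints "R^2 = 11.45%"
```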
Table 2 summarizes the gender differences in mean FCI scores for the two subgroups (NA vs non-NA studies). According to Table 2, both NA and non-NA regions showed significant estimates of the gender differences in the average FCI scores. The summary estimate of gender differences in the NA-based literature ($g = 0.6784$, 95% CI [0.56, 0.79]) was larger than that in the non-NA-based studies ($g = 0.5092$, 95% CI [0.37, 0.64]). The confidence intervals of both subgroups exclude zero, which means that both NA-based and non-NA-based studies show significant gender gaps in the mean FCI scores. The prediction intervals of both regions also exclude zero, indicating that future studies are expected to yield gender differences in the same direction. Both regions show positive effect sizes, indicating that the mean FCI scores mostly favor male students over female students in our meta-analysis. Moreover, following Cohen's recommendation [91], the gender differences in both subgroups are categorized as moderate. These labels match the overall result obtained before the subgroup analysis in Figure 2.
Note: CI = confidence interval, PI = prediction interval, SE = standard error.

D. Potential of publication bias
Publication bias can be diagnosed by inspecting the symmetry of the studies in a funnel plot (Figure 3). The plot shows how the individual studies scatter around the summary result (the dashed line). Three types of confidence intervals are provided in Figure 3, indicated by the white, yellow, and pink regions. Thirty-two studies located within the 95% confidence region reported significant gender differences in mean FCI scores, so significant results were reported more often than nonsignificant ones in our findings. The $y$-axis is the standard error, which reflects the sample size of the individual studies: the closer the standard error is to the peak (nearly zero), the larger the sample size. Most studies lie near the apex of the pseudo triangle of confidence intervals in Figure 3. The studies to the left of the dashed line (summary estimate) are roughly balanced by those to the right: 15 study points lie to the left of the summary estimate, and the remaining studies lie to the right. Hence, the funnel plot is visually symmetrical, and this balance suggests no evident publication bias in our meta-analysis results.
Figure 3. Funnel plot to diagnose the potential of publication bias from the meta-analysis results.
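The visual reading of the funnel plot described above can be mirrored by a simple tally. The sketch below counts how many studies fall to the left and right of the summary estimate and how many lie inside the pseudo 95% confidence limits (summary estimate plus or minus 1.96 standard errors). The function name and the toy data are hypothetical, and this tally is an informal aid rather than a formal test of funnel plot asymmetry.

```python
def funnel_summary(effects, std_errors, summary, z=1.96):
    """Tally studies around the summary estimate of a funnel plot."""
    left = sum(1 for y in effects if y < summary)
    right = sum(1 for y in effects if y > summary)
    # A study is inside the pseudo confidence region when its distance
    # from the summary estimate is at most z times its standard error.
    inside = sum(
        1 for y, se in zip(effects, std_errors) if abs(y - summary) <= z * se
    )
    return {"left": left, "right": right, "inside_95": inside}


# Hypothetical effect sizes and standard errors (not the paper's data)
g = [0.80, 0.35, 0.62, 0.10, 0.95]
se = [0.10, 0.14, 0.12, 0.17, 0.11]
print(funnel_summary(g, se, summary=0.60))
```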
geographical groupings (NA vs non-NA) significantly influences the magnitude of the FCI gender gaps. Hence, one can conclude that the gender gap in mean FCI scores is smaller in the non-NA-based environment than in the NA-based regions, while the direction of the gender differences is the same in both regions. Nevertheless, this conclusion lacks the evidence needed to establish whether the FCI functions differently between the two regions. This question is warranted for forthcoming studies.
Admittedly, the reasons for the different magnitude of the gender performance gap favoring male students in the non-NA context may be multifactorial. Many sociological mechanisms across countries are likely involved, and no single factor is likely to be sufficient to explain the gap.
Our hypothesis, drawn from sociological work, is that the different research contexts of the non-NA literature reflect different cultures. This international comparison may also be broadened by assuming that physics students in the NA and non-NA groups have different learning atmospheres [83], institutional supports [84], backgrounds and preparation, and all possible combinations of many small factors. For example, regarding cultural differences, females in NA-based affiliations are mostly reported as an underrepresented group in physics departments [18,42-44]. By contrast, the number of female students can be larger in the non-NA environment, as reported by Refs. [46,47,81,82]. As mentioned above, Abdal-Razzaq [47] reported a result that competes with most of the pooled studies. This master's thesis, which used a descriptive quantitative research design, found that female students could outperform male engineering students in a classical mechanics course and concluded that there are no significant gender differences in the FCI. The finding reported by Abdal-Razzaq [47] supports claims from the recent qualitative study conducted by Moshfeghyeganeh & Hazari [82] in the context of a Muslim-majority (MM) country. Essentially, both studies are situated in settings with a lower number of male students in the physics class.
Moshfeghyeganeh & Hazari [82] described the participation of female physicists in MM countries as somewhat different from what was found in the NA-based PER studies. Physics classes there were found to be female-dominated, as also described by other studies from Iranian authors [78,79]. They also noted that, typically in their country, as many as 60% of the participants in the physics department are women. Nevertheless, the small amount of heterogeneity accounted for by study location in our subgroup analysis suggests that the effect of study location should be interpreted with caution, since the nonsignificant studies may be outliers in our data. In fact, this meta-analysis focuses only on the average FCI score reported by each study. Therefore, examining other mechanisms, such as the social and economic systems within a single country, that affect the gender performance gaps measured by the FCI is limited here, or may be out of scope. Forthcoming studies may conduct a meta-regression analysis to examine socio-economic variables reported at the country level (for example, those reported in the Global Gender Gap Report from the World Economic Forum², which was used to divide the regional subgroups (NA vs non-NA) analyzed in this study).
The strengths of this meta-analysis are the inclusion of the most recent published PER studies and the comparison of meta-analytic results between NA- and non-NA studies under the random effects model, which closes a gap left by the former work [18]. Many factors can contribute to these gender performance differences in physics, and many studies have addressed this question to date [101-106]. We admit that many important questions remain that can be sources of uncertainty in our meta-analysis results.
Nevertheless, a meta-analysis is a quantitative research synthesis that is confined to the secondary data provided by the published literature. Several drawbacks must be acknowledged. First, high heterogeneity was still present in the subgroup analysis, which is difficult to avoid in meta-analyses of behavioral studies such as PER [89,94]. Second, some potential factors associated with gender differences described in the literature, such as physics identity [101,102], motivation and belief [103], efficacy [104], personal interest [105], and perceived stereotype [106], are neglected in our meta-analysis. In addition, as previously described in section II, only pre-instruction FCI data are analyzed in the present meta-analysis. Therefore, the findings should be generalized with caution as a global representation of FCI studies under different testing conditions. We invite interested readers to broaden the analysis of the current research using larger FCI data sets or to synthesize the gender performance differences measured by the other established RBAs within the community.

V. Conclusion
As a final remark, this meta-analysis has reviewed the gender differences in mean FCI scores based on internationally published PER studies from the last two decades. A random effects meta-analysis of the eligible PER literature has provided the general finding that a moderate gender difference in mean FCI scores is significantly visible in favor of male students globally. Both NA- and non-NA-based studies mostly indicate moderate and positive gender performance gaps in FCI scores on average. The covariate of study location significantly moderates the variation in the gendered average FCI scores. Our finding may be unsurprising, since the former meta-analysis reached a similar conclusion based on the NA literature. Meanwhile, this study has revealed information from the non-NA environment, which was underexplored in the former meta-analysis. Additionally, innovative physics learning strategies should be implemented to enhance effective student learning regardless of student gender. We admit that this meta-analytic review using the random effects model should not be the ultimate conclusion on gender differences in the FCI. A meta-analysis model is limited to systematically summarizing the average FCI score [85]. One can argue that it lacks the evidence needed to characterize gender differences from the perspective of educational measurement theory. Admittedly, there are other widely known statistical methods from classical test theory (CTT) and item response theory (IRT), such as differential item functioning (DIF), that can be more robust for establishing the construct validity of the underlying latent factor of the FCI gender differences [107]. However, such analyses require item-level data, which is beyond the scope of this study; this remains an important question for further studies.
Future research projects should explore other sociological or psychological factors that moderate the phenomenon and replicate our gender meta-analysis with the recent PER literature and the other published RBAs within the PER community.
Notes
1 We employ the term "gender differences" to denote the differences of mean FCI scores (or the FCI performance gaps) between male and female students.

Appendices
Fixed-effect and random-effects meta-analysis model
Meta-analysis is used to synthesize quantitative information from related studies and produce results that summarize a whole body of research [88]. In this study, we used a meta-analytical method to aggregate gender differences in FCI scores and obtain a summary estimate based on the internationally published PER literature. One important feature of meta-analysis is its ability to incorporate information about the quality and reliability of the primary studies by weighting larger, better reported studies more heavily [85].
The summarized mean is the weighted average of the study-level outcomes:
$$\hat{\mu} = \frac{\sum_{i=1}^{k} w_i y_i}{\sum_{i=1}^{k} w_i} \qquad (1)$$
where $i$ is the index of an independent study and $k$ is the number of studies. There are two popular statistical models to determine the weight ($w_i$) for meta-analysis, the fixed effect model and the random effects model [85,87]. These models form the basis for most meta-analyses.
Under the fixed effect model, we assume that there is one true effect size $\theta$ that underlies all the studies in the analysis, and thus the observed outcome $y_i$ for study $i$ is a function of the within-study error $\varepsilon_i$:
$$y_i = \theta + \varepsilon_i \qquad (2)$$
where $\varepsilon_i$ is the difference between the observed mean for study $i$ ($y_i$) and the common true mean ($\theta$), and $\varepsilon_i$ is normally distributed, $\varepsilon_i \sim N(0, (\sigma_i)^2)$. In a fixed effect meta-analysis, there is only one source of variation, the estimation error $\varepsilon_i$, or intrastudy error.
Alternatively, the random effects model supposes that each study samples a different true outcome $\theta_i$, such that the summarized outcome $\mu$ is the grand mean of a population of true effects. Hence, it is possible that all studies share a common effect size, but it is also possible that the effect size varies from study to study. Differences in the methods and sample characteristics may introduce variability (heterogeneity) among the true effects [93].
Random effects models allow us to estimate variation between studies due to differences in methodology, population, sample characteristics, or other factors, thereby allowing a more flexible assessment of similarities or differences between studies [88]. Thus, random effects models are better able to provide an accurate and representative summary of international FCI scores by gender, despite variations between the studies we used in this analysis. One way to model the heterogeneity is to treat it as purely random. The observed effect $y_i$ for study $i$ is then influenced by the intrastudy error $\varepsilon_i$ and the interstudy error $u_i$ [85]. This leads to the random effects model, given by
$$y_i = \mu + u_i + \varepsilon_i \qquad (3)$$
where $u_i$ is also normally distributed, $u_i \sim N(0, \tau^2)$, with $\tau^2$ representing the extent of heterogeneity, or between-study (interstudy) variance. Here $u_i$ is the difference between the grand mean ($\mu$) and the true mean for study $i$ ($\theta_i$), and $\varepsilon_i$ is the difference between the true mean of study $i$ ($\theta_i$) and the observed mean for study $i$ ($y_i$). Therefore, the true effects are assumed to be normally distributed with mean $\mu$ and variance $\tau^2$. If $\tau^2 = 0$, this implies homogeneity among the true effects (i.e., $\theta_1 = \theta_2 = \cdots = \theta_k \equiv \theta$), so that $\mu = \theta$ then denotes the true effect.
Study-level estimates for a fixed effect or random effects model are weighted using the inverse variance:
$$w_i = \begin{cases} \dfrac{1}{(\sigma_i)^2}, & \text{fixed effect} \\[2ex] \dfrac{1}{(\sigma_i)^2 + \tau^2}, & \text{random effects} \end{cases} \qquad (4)$$
Note that the two formulas in Equation 4 are identical except for the inclusion of $\tau^2$ in the random effects model. If the heterogeneity ($\tau^2$) is estimated to be zero, the random effects model collapses to the fixed effect model and the two formulas produce the same result.
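As an illustration of how the inverse-variance weights enter the summary estimate, the following minimal sketch computes the weighted mean and its standard error under both models. The function name and the two-study toy data are hypothetical; setting `tau2=0.0` reproduces the fixed effect weights, and any positive `tau2` gives the random effects weights.

```python
import math


def pooled_estimate(effects, variances, tau2=0.0):
    """Inverse-variance weighted mean and its standard error.

    tau2 = 0 gives the fixed effect model; tau2 > 0 adds the
    between-study variance to every study's sampling variance.
    """
    w = [1.0 / (v + tau2) for v in variances]
    mu = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    se = math.sqrt(1.0 / sum(w))
    return mu, se, w


# Hypothetical effect sizes and sampling variances (not the paper's data)
y = [0.2, 0.8]
v = [0.01, 0.04]
mu_fe, se_fe, w_fe = pooled_estimate(y, v)             # fixed effect
mu_re, se_re, w_re = pooled_estimate(y, v, tau2=0.05)  # random effects
print(f"fixed effect:   mu = {mu_fe:.2f}, SE = {se_fe:.3f}")
print(f"random effects: mu = {mu_re:.2f}, SE = {se_re:.3f}")
```

In this toy example the random effects weights are more nearly equal across studies, so the less precise study pulls the summary estimate further toward its own value and the standard error of the summary estimate becomes larger.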


Figure 2. Forest plot showing the results of 38 PER studies examining the gender differences in mean FCI scores. The results are estimated using Hedges' $g$ with 95% confidence and prediction intervals.

Furthermore, our study has implications for researchers and educators, who should be aware of the prevalence of gender performance gaps in the FCI test. When interpreting FCI scores, they should consider the magnitude and direction of the observed performance gaps. Studying the influence of moderating variables, such as the diverse contexts represented by different student backgrounds (preparation), can deliver useful information for designing more appropriate instruction to close the gaps. If the gender differences can be narrowed by revising the FCI constructs for male and female students, PER scholars can undertake further studies to develop a reformed version of the FCI items. A quantitative study intended to equate the reformed and original versions can be designed to validate the constructs. Such an equating study is recommended to examine the invariance of the item and person parameters across test versions, which can be estimated using the psychometric frameworks popular within PER, such as classical test theory (CTT) and item response theory (IRT).

Table 1. Heterogeneity test of the subgroup analysis among the NA and non-NA study locations

Table 2. Model results of the subgroup analysis by the inclusion of study locations (NA vs non-NA studies)

The two quantities of interest are the overall estimate and the measure of the variability in this estimate. Study-level outcomes $y_i$ are synthesized as a summarized mean $\hat{\mu}$ according to the study-level weights $w_i$.