Modernizing use of regression models in physics education research: a review of hierarchical linear modeling

Physics education researchers (PER) often analyze student data with single-level regression models (e.g., linear and logistic regression). However, education datasets can have hierarchical structures, such as students nested within courses, that single-level models fail to account for. The improper use of single-level models to analyze hierarchical datasets can lead to biased findings. Hierarchical models (a.k.a., multi-level models) account for this hierarchical nested structure in the data. In this publication, we outline the theoretical differences between how single-level and multi-level models handle hierarchical datasets. We then present analysis of a dataset from 112 introductory physics courses using both multiple linear regression and hierarchical linear modeling to illustrate the potential impact of using an inappropriate analytical method on PER findings and implications. Research can leverage multi-institutional datasets to improve the field's understanding of how to support student success in physics. There is no post hoc fix, however, if researchers use inappropriate single-level models to analyze multi-level datasets. To continue developing reliable and generalizable knowledge, PER should use hierarchical models when analyzing hierarchical datasets. The supplemental materials include a sample dataset, R code to model the building and analysis presented in the paper, and an HTML output from the R code.


I. INTRODUCTION
Student conceptual change is one of the earliest areas of study in Physics Education Research (PER) and remains a popular area of study [1].To measure conceptual change, PER researchers often analyze students pre and posttest scores on Research-Based Assessments (RBAs) [2], such as the Force Concept Inventory (FCI) [3].To identify the impact of a course transformation, PER researchers commonly compare student data collected across course contexts (e.g., traditional and transformed courses).Including data from multiple contexts not only allows for comparative analyses, it also improves a studies statistical power and the generalizability of findings.Many of the most cited publications in the PER literature rely on data collected from multiple courses and institutions [4][5][6][7].Aggregating data across courses and institutions introduces a hierarchical structure into the data where student data (level 1) are nested within course data (level 2).This nesting can include additional levels, such as departments (level 3) and institutions (level 4).When included in a multi-level regression model, the structure of a hierarchical dataset can provide additional information.When ignored, the structure of a hierarchical dataset will often violate the assumption of independence, which is central to many statistical analyses (e.g., multiple linear regression (MLR) and ANOVA) and can bias statistical models [8,9].Multilevel models are called different things within different disciplines, such as multilevel linear models in sociology, mixed-effects models and random-effects models in biometrics, random-coefficient regression models in econometrics, and covariance components models in the statistical literature [10].In this publication We will use the nomenclature of education researchers for multi-level models, hierarchical linear models (HLM).
The purpose of this article is to assist researchers in identifying and applying the regression analysis techniques best suited to their data and research questions.
We will accomplish this in four sections: (1) Motivationwe review PERs historical use of regression analyses, (2) Theory -we discuss the theoretical advantages and disadvantages of three common techniques for dealing with hierarchical data, (3) Application -we examine the practical implications of using MLR vs. HLM, (4) Tutorialwe provide a tutorial for developing HLM models using the HLM 7.03 application [11].

II. MOTIVATION
Development of regression models is a common method PER researchers use to investigate a wide range of phenomena from drop, fail, and withdrawal rates and likelihoods of passing a class to conceptual learning and attitudes development.These models isolate the impact of a variable of interest while controlling for other variables.For example, a model investigating research based teaching practices might control for student pretest scores and gender.The most recognizable of these models are linear regressions.To determine the different regression models commonly used in physics education research, we reviewed publications in Physical Review Physics Education Research.We performed our literature search on 3/27/18 using the "All Field" search tool on the Physical Review PER website [12].Our search of the PER literature included both the Physical Review PER and Physical Review Special Topics PER archives but did not include PER Conference proceedings since the conference proceedings do not have a full featured search functionality.We searched Physical Review PER for the terms "linear regression", "hierarchical linear model", "multi-level model", and "multilevel model."We performed an identical search of the broader education literature using the Sage Journal's "anywhere" search tool of their journals they classify as being "education" journals.Examples of Sage's education journals include Educational Researcher and American Education Research Journal.The results of the searches are shown in Table I.
There are a variety of regression models, but the most commonly used was a single-level linear regression called multiple linear regression (MLR).As Table I shows, there were 43 publications in Physical Review PER and 432 publications in Sage education journals that mentioned "linear regression".The fact that there were more publications that mentioned "linear regression" in Sage education journals was not surprising given that it included multiple journals.Our search for "hierarchical linear model" and its synonyms in Physical Review PER returned only 2 publications and in Sage education journals returned 196 publications.While Sage journals had ∼10x as many publications that mentioned "linear regression" it had ∼100x as many publications that mentioned "hierarchical linear model" and its synonyms.Of the two publications in Physical Review PER that mentioned "hierarchical linear model", the first mentioned HLM as a possible method of analysis but did not use it [13].The second publication stated that they performed both HLM and linear regression, but because they had similar results they only included the linear regression results in the publication [14].Independent of our literature search, we did identify two Physical Review PER articles [15,16] that used HLM but did not show up in our search because they referred to HLM it using the less common nomenclature of "hierarchical-model analysis" and "hierarchical linear regression."The more common use of HLM in Sage education journals could be explained by the fact that the broader field of education research had begun using HLM over 30 years earlier [17] and its use had increased as access to computing power and larger datasets has improved.
PER's non-use of HLM was not due to a lack of hierarchical datasets in the literature.There were examples of hierarchical datasets that included student data from multiple courses and institutions throughout the PER literature [4,6,[18][19][20].We propose that the reason the reason that HLM was not being used by the PER community to analyze hierarchical datasets was due to a lack of knowledge about the method and why the use of traditional methods had likely biased their findings.While resources existed for applying HLM in different disciplines [8,21,22], none existed in PER.In the next sections we address this limitation by outlining the theoretical differ- A foundation of most sampling designs is the assumption that each measurement is independent of other measurements.That is, each student is independent from the others, and all students have an equal chance of being selected.The assumption of independence is more likely to be true for data collected from a single course since it limits the contexts and there are less opportunities for the data to be hierarchical.Data can still be hierarchical within a single course, however, if there are data nested within individual or groups of students.For example, in a course with three unit exams and a final, the hierarchical dataset would have four exam scores nested within each student.Researchers, however, frequently incorrectly extend the assumption of independence to hierarchal datasets.While incorrectly making this assumption will not always lead to biased results, it calls into question the validity of the claims they can support [8,9,23].
Even if researchers are careful to collected an abundance of pertinent data about the courses, when data is collected across multiple courses there are inevitably features of the courses that they cannot explicitly account for in their model.For example, a researcher may know which introductory physics courses in their dataset were algebra-and calculus-based, if it was the Fall or Spring semester, who the instructors were, how much experience the instructors had teaching the course, and what the student to instructor/TA/LA ratio was, but there will be other factors that are harder to measure, such as whether a class was held at 8AM leading some students to show up late, their was a particularly nasty flu going around campus that semester, or whether the instructor was pre-occupied that semester preparing their tenure packet.These known and unknown course features interact to either improve or depress student performance within a course.The impact of the unique features of courses on the students in them leads to clustering of data across courses and violates violates the assumption of independence [23].Given that there are innumerable student, course, and institution level features that can influence student performance, when analyzing hierarchical datasets it is important to use a method accounts for the nesting of student data.Table II shows example variables that a researcher may want to account for within each hierarchical level.

B. Modeling
There are three primary methods that social-scientists have used to examine hierarchical datasets: HLM, disaggregation, and aggregation.To examine the implications of using these three methods, we will apply each of them to a sample multi-level dataset (shown in Table III) to predict student post-test performance on the FCI.The dataset includes information about the students (level-1) and the courses they are nested in (level-2).

Hierarchical Linear Modeling
Hierarchical Linear Modeling (HLM) does not have the assumption of independence (unlike disaggregation and aggregation) and was designed specifically to ana- lyze hierarchical datasets [8,24].HLM accomplishes this by creating unique regression models for each course to model student-level variables and examines difference between sections to model course-level variables.Analyzing the data from Table III with HLM shows a positive association (0.2) between student SAT Verbal and FCI posttest scores and a negative association (-0.2) between class mean SAT Verbal and FCI posttest scores (Figure 1b).In other words, the model predicts the highest FCI posttest scores for students with high SAT verbal scores in classes with low average SAT verbal scores.

Disaggregation
Disaggregation ignores all nesting in the dataset and treats level-2 data as if it were level-1 data.A singlelevel linear regression of the disaggregated data from our sample finds that student SAT Verbal scores have a small negative association (-0.067) with FCI posttest scores (Figure 1c).In other words, the model predicts that students with higher SAT verbal scores will have slightly smaller FCI posttest score.This slight negative association comes from combining the positive studentlevel and negative course-level associations between SAT verbal scores and FCI posttest scores.
Disaggregation was a common method for analyzing hierarchical datasets in the PER literature.For example, in Kost and colleague's examination of the gender gap in introductory physics courses [25], they examined data from 2,099 students in seven first-semester mechanics physics courses.They analyzed their data using multiple linear regression models to account for student and course features within a single-level model.To account for the fact that their dataset included data from multiple semesters, they included a variable for each semester which allowed the model to account for differences in average scores across semesters.While including a variable for each group in a dataset can improve the quality of a linear model examining hierarchical data, it is still an flawed method.Linear regressions of disaggregated data fail to allow for course-level variance in hierarchical datasets and violate the models' assumptions of independence [21,26,27].This can lead to bias in findings, an underestimation of the standard error, and artificially small p-values [8,23].

Aggregation
Aggregation collapses level-1 data into level-2 data.In the case of our example dataset, the student-level variables of SAT verbal scores and FCI posttest scores are averaged within each course to create class mean SAT verbal score and class mean FCI posttest scores.A singlelevel linear regression of the aggregated course-level data finds that course mean SAT Verbal scores have a negative association (-0.2) with course mean FCI posttest scores (Figure 1d).In this case the aggregation model identified the same course-level association between course mean SAT verbal score and FCI posttest scores as the HLM model, but these values will not always be identical when analyzing the more complex datasets.
The aggregation of student-level data into course-level data ignores differences between individuals and all information about student-level variability is lost.If a researcher is only interested in course-level processes, aggregation can be an appropriate method.If a researcher is interested in the impact of student-level variation, however, aggregation can cause several errors.One potential error is a shift in meaning.For example, our aggregation model that examines the course mean SAT score can say something about the impact of a course context, but does not address the impact of an individual students SAT Verbal score.Education researchers have shown that the majority of the variability in student scores comes from the student-level, not the course-level [8,24].Examining only the impact of course-level variables on student performance ignores most of the variability outcomes and can the distorted relationships between variables.

C. HLM Equations
HLM is the preferred method for handling multi-level data as it provides the advantages of both disaggregation and aggregation without introducing their shortcomings [8,24].To accomplish this, HLM creates a set of equations that nests level-2 equations within level-1 equations.To illustrate how this is done, we describe the steps to creating an HLM model to predict student FCI scores.The first step is creating a level-1 equation that includes student-level predictors that are of interest (e.g., SAT verbal score) or need to be accounted for (e.g., gender, major, and pretest score).The level-1 model for student FCI scores take the following form: where: β 0j = predicted FCI score for a student in course j with an SAT Verbal score of 0 (y-intercept); Unlike in MLR models where the coefficients are values, HLM creates a level-2 equation for each coefficient.The nesting of level-2 equations within the level-1 equation allows the model to account for variance at both levels simultaneously and provides an easy method for examining the interaction of the levels in the data.The level-2 models for student FCI score take the following form: Level-2 Equations where: The two levels of equations can be combined to create a single mixed model.The mixed model for student FCI scores take the following form: The combined model includes variables from level-1 (SATVerbal ij ), level-2 (CourseMeanSATVerbal j ), crosslevel interaction effects (CourseMeanSATVerbal j x SATVerbali j ), and multi-level error terms (u 0j + u 1j SATVerbal ij + r ij ).By accounting for variance at multiple levels, HLM models can test for dependencies among level-1 units (students) within each level-2 unit (course).In our example, HLM generates a unique coefficient for the association of SAT Verbal scores on FCI scores for each class.These coefficients are generated from a blend of information from each individual course's data and overall information across all schools.To create the weighted average of information from the individual course and the grant mean HLM uses Bayes "shrinkage estimators" that takes into account several factors about each course, the sample size within the course being the most important factor [28].
HLM can quantify the degree to which the variability in the data is at the student-level as opposed to the course-level.If almost all of the variability is at the student level then HLM and MLR tend to give similar results.If the variability is distributed across both levels then HLM and MLR may produce very different results because MLR will only account for the variability at one level.By developing a model with no predictor variables, called the unconditional model (shown below), HLM analysis will calculate the variance at both the student-level (variance(u 0j )=τ 2 =between group variance) and the class-level (variance(r ij )=σ 2 =within group variance).The percentage of the total variance (τ 2 + σ 2 ) attributed to the between group variance (τ 2 ) is known as the intraclass correlation coefficient (ICC).The ICC quantifies the share of the differences in student scores caused by differences in students versus differences in courses.

Unconditional Model
Level-1: If there is not significant variance at the class-level (ICC <5%), then it is likely that a single-level and multilevel models will provide similar findings and a MLR may be appropriate.While this rule of thumb is useful, it is also limited.The only way to know that HLM and MLR provide similar results is to run both analyses and compare them.

D. HLM Assumptions
While HLM has fewer assumptions than standard regression analysis it is still important that its assumptions are satisfied.Failure to meet model assumptions can lead to misrepresentations of the relations in the data and p-values [23].HLM's four assumptions: (1) linearity -the dependent variables should vary linearly with the explanatory variable, (2) independence of residualslevel-1 and level-2 residuals are uncorrelated, (3) random slopes -variables must be allowed to have random slopes, (4) homoscedasticity -residuals should be distributed in a homoscedastic (i.e., similarly across the range of outcome values) and normal manner [23].There are a variety of methods by which the assumptions of HLM can be checked, many of which involve visual inspection of residual plots [29].There are methods for fixing HLM models that violate assumptions.For example, if the assumption of linearity is violated performing a transformation of a variable (e.g., a squaring or log transformation) may fix it.For more information about the assumptions of HLM and how to check them, we recommend referencing Snidjers and Bosker's book on multilevel analysis [23].
Perhaps the most important differences in the assumptions of MLR and HLM is HLM does not assume independence of observations [24].The dependent nature of hierarchical datasets is what makes it possible for HLM to accurately isolate the impact of both lower-and higher-level variables.By allowing for variance at all levels, HLM can create a separate regression model for each course that accounts for its unique features and then combine the course models to create a set of coefficients that describe the larger dataset.Because it includes fewer assumptions than standard regression analysis HLM has several additional advantages, such as accommodating non-sphericity (i.e.,"variances of the differences between all combinations of related groups are not equal" [30]), missing data (at level-1), and small and/or discrepant group sample sizes [8] (Woltman et al., 2012).The primary disadvantage of HLM is that it requires more statistical power (i.e., larger sample sizes) than MLR.

IV. APPLICATION
In the prior sections, we used idealized data to highlight the theoretical differences between MLR and HLM.While this shows how HLM can provide more accurate models when analyzing hierarchical datasets, it fails to demonstrate whether using MLR and HLM lead to meaningfully different conclusions when analyzing real-world physics student data.To test how using these two analytical techniques may lead to different conclusions about physics student learning, we reanalyzed a dataset from a prior investigation [31] using both MLR and HLM.Because these analyses are meant to illustrate how using MLR versus HLM can lead to different conclusions, rather than focusing on the conclusions themselves, we will only offer an abbreviate methods section.For more details, please reference the original investigation [31].
The purpose of the original investigation was to disentangle the impact of Learning Assistants (LAs) and the collaborative learning activities that are often implemented with them simultaneously.Learning Assistants are undergraduate students who, through the guidance of weekly preparation sessions and a pedagogy course, facilitate discussions among groups of students in a variety of classroom settings that encourage active engagement [32].To do isolate the potential impact of LAs, data was compared across courses that used traditional instruction, collaborative learning without LAs, and collaborative learning with LAs.
In our reanalysis of the data using MLR and HLM, we interpret our models to answer the following question: (1) How are the implementation of collaborative learning with and without LAs associated with overall student learning, if at all?

Data Collection and Processing
Our study analyzes course and student Force Concept Inventory (FCI) [3] and the Force and Motion Conceptual Evaluation (FMCE) data from the Learning About STEM Student Outcomes (LASSO) platform.The LASSO platform is hosted on the LA Alliance website [33] and facilitates collecting large-scale, multi-institution data by hosting, administering, scoring, and analyzing pretest and posttest research-based assessments online.We removed assessment scores for students if they took less than 5 minutes on the assessment or completed less than 80% of the questions.A cut-off of 5 min was used because there was a clear break in the data between students who took less than or more than 5 minutes and we did not feel that the a student who was honestly trying to answer the questions could completed in less than 5 minutes.Courses with less than 40% student participation on either the pretest or posttest were entirely removed from the data.We removed 44 courses with less than 40% participation rates on either the pretest or posttest.Of the 44 courses removed, 8 had no pretests, 11 had no posttests, and 25 had less than 40% participation on the posttest.The filters for time and completion removed 269 pretest scores and 398 posttest scores and removed 258 students from the dataset who did not have either a pretest or posttest score.The final data set included 5,959 students in 112 courses at 17 institutions with missing data for 15% of the pretests and 30% of the posttests.
After cleaning the data, we used hierarchical multiple imputation (HMI) with the hmi and mice packages in R to address missing data.HMI is a principled method for maximizing statistical power by addressing missing data while taking into account the structure of the data [34][35][36][37].HMI addresses missing data by (1) imputing each missing data point m times to createm complete datasets, (2) independently analyzing each dataset, and (3) combining the m results using standardized methods [38].For more information on Missing Data, please reference our companion piece in this special issue [39].

Data Analysis
To ensure that the models were setup identically, we used the HLM 7.03 software to perform both the HLM and MLR analysis using identical variables and interactions.Figure 2 shows our workflow for the data collection and analysis.
To investigate student learning, we developed a set of models to predict student raw gains (posttest-pretest).The use of raw gains as the outcome variables is functionally equivalent to using the posttest score as the outcome variable in our models, but leads to findings that are more easily interpretable.The use of various change measures (e.g., raw gain, normalized learning gain, and cohen's d effect size) have been shown to have the ability to bias findings across student groups [40,41].As the mean scores for the pretest (36%) and posttest (55%) in our dataset indicate that the floor and ceiling effects were limited, no transformation of the data was required prior to analysis.In our analysis, pretest and raw gains are represented as a percentage correct.We developed our models for student raw gain scores through a series of additions of variables.Model 1 is the unconditional model, which predicts the student gains without level-1 or level-2 predictor variables.The unconditional model allowed us to calculate the intra class correlation (ICC) by comparing the course and student level variance in the HLM version of model 1.The intra class correlation is 13%.This shows that course-level effects account for 13% of the variation in student raw gains and falls above the heuristic threshold of 5%, showing that HLM is likely necessary.Model 2 builds on Model 1 by including the student (level-1) variables (student pretest score) and course (level-2) variables (course mean pretest score, collaborative learning w/o LAs, and collaborative learning w/ LAs).We explored including a level-2 variable to differentiate between the FCI and the FMCE.Since the inclusion of a level-2 variable for the concept inventory used made no appreciable impact on either the MLR or HLM model outputs, we removed it from the model to make them more easily interpretable.The normal distribution of error at level-1 and level-2 were confirmed visually using Q-Q plots.The HLM and MLR version of Model 2 equations are described below.

HLM Level-1 Equation (Model 2)
For ease of interpretation, student prescore is group mean centered so coefficients will predict the score for a student with the mean average score within their course.The coefficient for student prescore can predict scores for students with prescores that are above or below the course average.Doing this can hide differences between courses because it centers all of the courses around their own mean prescores.To account for this, we added a variable for the course mean prescore and grand mean centered it.Grand mean centering it set the coefficients to predict gains for students in courses with course mean prescores equal to the average course mean prescores across all of the courses.As with the student prescore coefficient, the course mean prescore coefficient can predict scores for students in courses with mean prescores that are above or below the average of all of the courses in the sample.All other variables are uncentered.While the pretest scores are not the focus of this study, they are included because they are strong predictors of student posttest scores and improve the models fit.We will not, however, substantively discuss them in our interpretation of the models.

B. Findings
Table IV shows the coefficients for both the MLR and HLM versions of model 2. Some of the coefficients remained consistent across the models while others varied by large amounts.For example, the impact of being in a class with LAs on the predicted gains (+6.85 for MLR and +7.24 for HLM) and their statistical significance (p<0.001 for MLR and p<0.01 for HLM) are similar in both models.The impact of collaborative learning without LAs on student gains, however, has notable differences in both the magnitude of its impact (+12.26 for MLR and +3.59 for HLM) on predicted gains and its statistical significance (p<0.001 for MLR and p>0.05 for HLM).
Figure 3 shows the predicted gains for students for both the MLR (a) and HLM (b) models.The differences in the raw gains across each course context are statistically significantly different from each other within both models.The raw gains for students in traditional and LA-supported courses do not vary much across the MLR and HLM models.The raw gain for students in collaborative learning courses without LAs, however, goes from being 2x the gains of traditional courses in the MLR model to only being 1.3x larger in the HLM model.The course context with the largest gains also switches from collaborative learning courses without LAs in the MLR model to those with LAs in the HLM model.

C. Discussion
Our comparison of MLR and HLM with real course data demonstrates that the theoretical differences between MLR and HLM can lead to important differences in the analysis of real physics student data.In this example, the MLR model predicts that courses that use collaborative learning without LAs have the largest student gains.In contrast, the HLM model predicts that courses that use collaborative learning with LAs have the largest gains.The conclusions of these two analyses and recommendations that one might make to instructors, administrators, and policy makers are very different based on the regression technique used.The analyses demonstrate the importance of using a model that does not have the assumption of independence when analyzing a hierarchical dataset.For those interested in a more full analysis of this data and the impact of learning contexts, we refer you to Herrera et al. [31].
When analyzing multi-level data (e.g., student data across multiple sections, courses, or semesters) it is very unlikely that the students are independent.Most likely there will be course-level differences that lead student outcomes within a course to be clustered, creating dependencies in the dataset.MLR models do not account for these dependencies because one of the assumptions of MLR is that the data are independent.HLM is designed to address these dependencies by leveraging the group correlations to identify level-1 effects (e.g., student features) level-2 effects (e.g., course features).Even if there are no level-2 predictor variables in the model, HLM can account for the impact of all of the unknown course differences (e.g., percent majors, time of day, curriculum, and instructor experience).To accomplish this, HLM allows each level-2 group to vary independently of each other level-2 group.While this model's flexibility allows it to handle more diverse datasets and create more ac-curate models it also creates its biggest drawback.The flexibility of HLM models comes from introducing more variables that use up degrees of freedom and reduces the statistical power of the test, therefor it requires larger datasets to have statistical power equivalent to MLR.HLM also requires that there is sufficient power at each level of the model.Heuristics for the minimum sample at level-2 vary by citation and can be as large as 50, but Maas and Hox [42] found that level-2 sample sizes as small as 10 can produce unbiased estimates.
V. CONCLUSION MLR models can be powerful tools for isolating the impact of variables of interest while accounting for other variables.MLR models, however, assume that each data point in the dataset are independent from all other data points.This assumption is often not true for datasets with hierarchical data, such as students within multiple courses, and can create biased findings.The failure to account for the hierarchical structure of the data in many physics education research studies calls into question the validity and reliability of their claims.
Hierarchical linear modeling (HLM) is a regression modeling method that leverages the hierarchical structure of datasets to create more accurate estimates of the impact of variables across levels.HLM have been in use since the mid 1980s [43] with commercial software designed to perform HLM being launched in the 1990s [44].While the field of physics education research has not adopted the use of HLM in any meaningful manner, the use of HLM in the broader field of education research started in the 1990s [45] and has become more commonly used in the intervening decades.HLM models require more data and calculations than MLR models but with the availability of large-scale databases (e.g., DataExplorer [46], E-CLASS [47], and LASSO [33]) and modern computer processing these barriers-to-use have been significantly reduced.These large databases are powerful tools for researchers and will hopefully lead to broader use of hierarchical datasets within the field of PER.It is important, however, that PER researchers adopt the use of appropriate quantitative methods (i.e., HLM) when analyzing hierarchical datasets.If the hierarchical datasets are analyzed with MLR, it may bias or skew the results and it could take PER researchers extensive effort to undo the harm caused by the findings and implications derived from the use of improper analysis techniques.While there is no simple post-hoc fix for existing studies that used inappropriate modeling techniques, we hope that this article will support future researchers use of HLM when analyzing hierarchical datasets.HLM models can be developed using several different pieces of software.Two of the most popular are R and HLM.Because building models in R requires knowledge of the R programming language which is not ubiquitous in PER, we will provide a guide to the more user friendly HLM 7.03.There are several features of the HLM software used in the analyses presented in the Application section of this paper (e.g., multiple imputation and hypothesis testing) that we do not discuss in this tutorial for brevities sake.Other HLM guides [48] can be referenced for additional analysis functionality.
The dataset used in this guide is one of the 10 multiply imputed datasets used in the analysis in the previous section.See Supplemental Material at [URL will be inserted by publisher] for the two data files used in the tutorial.The first file (lvl1.xls)contains the level-1 data (student pretest and gain) and a column with a course ID number that connects each student's data to their appropriate course.The second file (lvl2.xls)contains the level-2 data (course mean pretest and if the course used collaborative learning with or without LAs) and the same course identifier codes as the level-1 file.The files should match that pops up will ask you to select an ID variable (this is the variable that links students to courses) and in MDM variables (these are the variables you want to include in your model).Select crse id as the ID variable and the rest of the variables as in MDM (shown in figure 6).
Step then see a set of descriptive statistics about the dataset (Figure 7).To confirm that it imported all of the data, check that the level-1 and level-2 Ns are correct (in this case, 5,959 students and 112 courses).Step 9, click Done.
Once the MDM file is created, you can develop your models.The first model to run has the outcome variable with no predictors and is known as the unconditional model.To make the unconditional model, select GAIN → Outcome variable.This gives a basic model, shown in figure 8, that can assess the variation at both level-1 and level-2.
To run the model, select Run Analysis → Run the model shown.This will produce an html file titled hlm2.html with the results, shown in Figure 9.To prevent the file from being overwritten with our next model, rename the file unconditional.html.The output gives a range of information about the model, but we will focus on the Final estimation of fixed effects (with robust standard errors) and Final estimation of variance components.The average gain across all of the courses (γ 00 ) was 18.3 and is statistically significantly different from 0 (p<0.001).The variance components of u 0 (τ 2 = 61.0)and r (σ 2 = 406.2) can determine the share of the total variance (τ 2 + σ 2 ) that comes from each level.As discussed in the Theory section, the share of the variance from level-2 is known as the intra-class correlation (ICC) and can indicate if a multi-level model is appropriate for the data.Here, the ICC = 13%, showing that 13% of the variation in students scores is associated with course-level features.Given that course features account for a meaningful amount of the student variation, we will proceed with developing the multi-level model by adding the level-1 predictors.Select PRE SCOR → add variable group centered.Centering PRE SCOR tells the model to set the PRE SCOR variable intercept as the mean score for students in their course.This will make the model easier to interpret as the coefficients will be for a student with the mean pretest score in their course.Click on the variance variables (u) for the β 1 and β 2 equations such that they are not grayed out (Figure 10).This allows the model to have variance in the newly added level-1 variables.
Select Run Analysis → Run the model shown.This will produce a new html file titled hlm2.html with the results, shown in Figure 11.Rename the file Lvl1.html.Focusing on the same sections as the last model we see that the average predicted gain across all of the courses has not changed (γ 00 ) but now we see that for every point above the course mean on the pretest (γ 10 ) students will have predicted gains that are 0.44 lower.All of these coefficients are statistically significantly different from 0 (p<0.001).The variance of r (σ 2 ) has decreased from 406.2 in the unconditional model to 331.3 in the level-1 model.The decrease of 74.9 in σ 2 shows that the level-1 variables we introduced can explain 18.4% of the level-1 variation ((σ 2 unconditional -σ 2 level-1 )/σ 2 unconditional ).This means that looking at student prescore nearly 20% of the differences in student gains.We will now add the Level-2 predictors.With the β 0 equation highlighted in yellow select MEAN PRE → add variable grand centered, CL WO LA → add variable uncentered, and CL W LA → add variable uncentered.Grand centering MEAN PRE tells the model to set the MEAN PRE variable intercept as the mean score for all courses.This will make the model easier to interpret as the coefficients will be for courses with the a pretest score equal to the mean pretest score across all of the courses.The model should match Figure 12.
Select Run Analysis → Run the model shown.This will produce a new html file titled hlm2.html with the results, shown in Figure 13.Rename the file Lvl2.html.The coefficients in this model are similar to, but different from those in Table V because they were created using just one of the ten datasets used to create Table IV.Focusing on the same sections as the last model we can combine coefficients to predict the gains for students from different demographic groups in various course types.For example, to predict the gain for an average student in an average course that used collaborative learning without LAs you would add γ 00 (13.0) and γ 02 (1.7) to find a gain of 14.7.Table V illustrates how the coefficients can be combined to predict the scores for various students.This process can be done in the HLM software using the hypothesis testing feature.The advantage of doing it within HLM is it provides a standard error value for each predicted gain.
Examining the variance of u 0 (τ 2 ) has decreased from 63.0 in the level-1 model to 60.5 in the level-2 model.The decrease of 2.5 in model 2 shows that the level-2 variables we introduced can explain 4.0% of the level-2 variation ((τ 2 level-1 -τ 2 level-1 )/τ 2 level-1 ).This value shows that there are still predictor variables that could improve the models fit.It should be noted that models of student performance typically leave larger shares of variance unexplained than models in traditional STEM disciplines.This is due to the tremendous amount of diversity across students and learning contexts.

FIG. 1 .
FIG. 1.The relationships between SAT Verbal and FCI posttest scores in our sample dataset as modeled using 4 different methods: a) raw data (no model), b) hierarchical linear modeling (a multi-level regression that creates models within each course and between the courses) c) disaggregation (a single-level regression ignoring class-features), and d) aggregation (a single-level regression collapsing student-level data into course-level data).Figure is adapted from an example by Snijders and Bosker [23] and Woltman et al. [8].
FCI score for a student in course j with an SAT Verbal score of 0 (y-intercept); β 1j = regression coefficient for SATVerbal in course j (slope); CourseMeanSATVerbal j = mean SAT Verbal score for course j ; γ 00 = predicted FCI score for a student with an SAT Verbal score of 0 in a course with a mean SAT Verbal score of 0; γ 10 = regression coefficient for student SATVerbal ; γ 01 = regression coefficient for CourseMeanSATVerbal ; γ 11 = interaction effect between SATVerbal and CourseMeanSATVerbal ; u 0j = random effects of the j th course adjusted for student SAT Verbal scores on the intercept; u 1j = random effects of the j th course adjusted for student SAT Verbal scores on the slope.

FIG. 2 .
FIG.2.Workflow for collecting, preparing, and analyzing the data using MLR and HLM.In the circles, blue represents data, white represents missing data, and purple represents imputed data.Squares represent results.

FIG. 3 .
FIG. 3. Predicted gains for average students across course contexts as predicted by: a) MLR and b) HLM.Error bars are +/-1 standard error.We calculated the standard error using the hypothesis testing function in HLM 7.03.Error bars vary in size based on the number and spread of student scores in each course context.
Appendix A: HLM Tutorial

Figure 4 .
FIG. 5. MDM creation screen labeled with the 9 steps of creating an MDM file.
FIG. 7. Descriptive statistic output from the creation of an MDM file.

TABLE I .
Publications in Physical Review Physics Education Research and Sage Journals that mention specific regression terms, as of 3/27/18.

TABLE II .
Factors at each hierarchical level that may affect students concept inventory post-score.
ences between MLR and HLM and then applying them both to a hierarchical dataset from 112 physics courses to illustrate the impact that the selection of an inappropriate modeling technique can have on findings.III.THEORYA.Sampling

TABLE III .
Sample multi-level dataset.

TABLE IV .
MLR and HLM coefficients for model 2. Note variable labeling is consistent with HLM model.