The influence of violations of assumptions on multilevel parameter estimates and their standard errors

doi:10.1016/j.csda.2003.08.006

Computational Statistics & Data Analysis

Volume 46, Issue 3, 15 June 2004, Pages 427-440

https://doi.org/10.1016/j.csda.2003.08.006

Abstract

A crucial problem in the statistical analysis of hierarchically structured data is the dependence of the observations at the lower levels. Multilevel modeling programs account for this dependence and in recent years these programs have been widely accepted. One of the major assumptions of the tests of significance used in the multilevel programs is normality of the error distributions involved. Simulations were used to assess how important this assumption is for the accuracy of multilevel parameter estimates and their standard errors. Simulations varied the number of groups, the group size, and the intraclass correlation, with the second level residual errors following one of three non-normal distributions. In addition, asymptotic maximum likelihood standard errors were compared to robust (Huber/White) standard errors.

The results show that non-normal residuals at the second level of the model have little or no effect on the parameter estimates. For the fixed parameters, both the maximum likelihood-based standard errors and the robust standard errors are accurate. For the parameters in the random part of the model, the maximum likelihood-based standard errors at the lowest level are accurate, while the robust standard errors are often overcorrected. The standard errors of the variances of the level-two random effects are highly inaccurate, although the robust errors do perform better than the maximum likelihood errors. For good accuracy, robust standard errors need at least 100 groups. Thus, using robust standard errors as a diagnostic tool seems to be preferable to simply relying on them to solve the problem.

Introduction

Social research often involves problems that investigate the relationship between individual and society. The general concept is that individuals interact with their social contexts, meaning that individual persons are influenced by the social groups or contexts, and that the properties of those groups are in turn influenced by the individuals who make up that group. Generally, the individuals and the social groups are conceptualized as a hierarchical system, with individuals and groups defined at separate levels of this hierarchical system.

Standard multivariate models are not appropriate for the analysis of such hierarchical systems, even if the analysis includes only variables at the lowest (individual) level, because the standard assumption of independent and identically distributed observations is generally not valid. The consequences of using uni-level analysis methods on multilevel data are well known: the parameter estimates are unbiased but inefficient, and the standard errors are negatively biased, which results in spuriously ‘significant’ effects (cf. de Leeuw and Kreft, 1986; Snijders and Bosker, 1999; Hox, 1998, 2002). Multilevel analysis techniques have been developed for the linear regression model (Bryk and Raudenbush, 1992; Goldstein, 1995), and specialized software is now widely available (Raudenbush et al., 2000; Rasbash et al., 2000).

The assumptions underlying the multilevel regression model are similar to the assumptions in ordinary multiple regression analysis: linear relationships, homoscedasticity, and normal distribution of the residuals. In ordinary multiple regression, it is known that moderate violations of these assumptions do not lead to highly inaccurate parameter estimates or standard errors. Thus, provided that the sample size is not too small, standard multiple regression analysis can be regarded as a robust analysis method (cf. Tabachnick and Fidell, 1996). In the case of severe violations, a variety of statistical methods for correcting heteroscedasticity are available (Scott Long and Ervin, 2000). Multilevel regression analysis has the advantage that heteroscedasticity can also be modeled directly (cf. Goldstein, 1995, pp. 48–57).

The maximum likelihood estimation methods commonly used in multilevel analysis are asymptotic, which translates to the assumption that the sample size is large. This raises questions about the accuracy of the various estimation methods with relatively small sample sizes. The concern applies especially to the higher level(s), because the sample size at the highest level (the sample of groups) is always smaller than the sample size at the lowest level. A large simulation by Maas and Hox (2003) finds that the standard errors for the regression coefficients are slightly biased downwards if the number of groups is less than 50. With 30 groups, they report an operative alpha level of 6.4% while the nominal significance level is 5%. Similarly, simulations by Van der Leeden and Busing (1994) and Van der Leeden et al. (1997) suggest that when assumptions of normality and large samples are not met, the standard errors have a small downward bias.
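To make the notion of an operative alpha level concrete, the following is a minimal sketch of how such a rate could be computed from simulation output. It is not the procedure of Maas and Hox (2003); the function name `operative_alpha`, the true value 0.3, and the 10% understatement of the standard errors are hypothetical choices made only for illustration.

```python
import random


def operative_alpha(estimates, std_errors, true_value, z_crit=1.96):
    """Share of replications whose nominal 95% interval misses the true value."""
    misses = 0
    for est, se in zip(estimates, std_errors):
        if abs(est - true_value) > z_crit * se:
            misses += 1
    return misses / len(estimates)


# Hypothetical replications: the estimator truly varies with SD 0.10,
# while the reported standard errors are 10% too small (0.09).
rng = random.Random(2004)
true_value = 0.3
estimates = [rng.gauss(true_value, 0.10) for _ in range(20000)]

alpha_accurate = operative_alpha(estimates, [0.10] * len(estimates), true_value)
alpha_biased = operative_alpha(estimates, [0.09] * len(estimates), true_value)

print(f"operative alpha with accurate SEs:    {alpha_accurate:.3f}")
print(f"operative alpha with understated SEs: {alpha_biased:.3f}")
```

With accurate standard errors the rate should land near the nominal 5%, and a downward bias in the standard errors pushes it above 5%, which is the pattern the cited simulations describe.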

Sometimes it is possible to obtain more nearly normal distributions by transforming the outcome variable. If this is undesirable or even impossible, another method to obtain better tests and confidence intervals is to correct the asymptotic standard errors. One correction method that produces robust standard errors is the so-called Huber/White or sandwich estimator (Huber, 1967; White, 1982), which is implemented in several multilevel analysis programs (e.g., Raudenbush et al., 2000; Rasbash et al., 2000).
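The idea behind the sandwich estimator can be illustrated in the simplest possible setting: estimating a grand mean from grouped data. The sketch below is an assumption-laden illustration, not the estimator as implemented in any multilevel program. The function name `naive_and_robust_se`, the shifted exponential distribution for the group effects, and all numeric settings are hypothetical.

```python
import math
import random


def naive_and_robust_se(groups):
    """Return (naive SE, cluster-robust SE) for the grand mean.

    `groups` is a list of lists, one inner list of outcomes per group.
    """
    values = [y for g in groups for y in g]
    n_total = len(values)
    mean = sum(values) / n_total
    residuals = [[y - mean for y in g] for g in groups]

    # Model-based SE: treats all observations as independent.
    sigma2 = sum(e * e for g in residuals for e in g) / (n_total - 1)
    naive_se = math.sqrt(sigma2 / n_total)

    # Sandwich form: bread = 1/N, meat = sum over groups of (summed residuals)^2.
    meat = sum(sum(g) ** 2 for g in residuals)
    robust_se = math.sqrt(meat) / n_total
    return naive_se, robust_se


# Hypothetical data: 100 groups of 30, skewed (non-normal) group effects.
# Group variance 0.3 and residual variance 0.7 give an ICC of about 0.3.
rng = random.Random(1967)
groups = []
for _ in range(100):
    u_j = math.sqrt(0.3) * (rng.expovariate(1.0) - 1.0)  # mean 0, variance 0.3
    groups.append([u_j + rng.gauss(0.0, math.sqrt(0.7)) for _ in range(30)])

naive_se, robust_se = naive_and_robust_se(groups)
print(f"naive SE:  {naive_se:.4f}")
print(f"robust SE: {robust_se:.4f}")
```

Because the robust version sums residuals within each group before squaring, it picks up the within-group dependence that the naive formula ignores, so it should come out clearly larger for data of this kind.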

In this paper we look more precisely at the consequences of the violation of the assumption of normally distributed errors at the second level of the multilevel regression model. Specifically, we use simulation to answer the following two questions: (1) what group level sample size can be considered adequate for reliable assessment of sampling variability when the assumption of normally distributed residuals is not met, and (2) how well do the asymptotic and the sandwich estimators perform when the assumption of normally distributed residuals is not met.

Section snippets

The multilevel regression model

Assume that we have data from J groups, with a different number of respondents n_j in each group. On the respondent level, we have the outcome variable Y_ij. We have one explanatory variable X_ij on the respondent level, and one group level explanatory variable Z_j. To model these data, we have a separate regression model in each group as follows: $Y_{ij} = β_{0j} + β_{1j} X_{ij} + e_{ij}.$ The variation of the regression coefficients β_j is modeled by a group level regression model, as follows: $β_{0j} = γ_{00} + γ_{01} Z_{j} + u_{0j}$ and $β_{1j} = γ_{10} + γ_{11} Z_{j} + u_{1j}.$
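As a rough illustration of this model, the sketch below generates data in two steps: group-specific coefficients first, then individual outcomes. The slope equation includes a group-level residual u_1j, which is the form implied by the combined equation given in the simulation section. The function name `simulate_two_level`, the coefficient values, the standard deviations, the standard normal predictors, and the use of equal group sizes are all assumptions made for brevity.

```python
import random


def simulate_two_level(n_groups, group_size, gammas, sd_u0, sd_u1, sd_e, seed=0):
    """Generate rows (group, x_ij, z_j, y_ij) from a two-level regression model."""
    rng = random.Random(seed)
    g00, g10, g01, g11 = gammas
    rows = []
    for j in range(n_groups):
        z_j = rng.gauss(0.0, 1.0)        # group-level predictor (assumed standard normal)
        u0j = rng.gauss(0.0, sd_u0)      # intercept residual
        u1j = rng.gauss(0.0, sd_u1)      # slope residual
        beta0j = g00 + g01 * z_j + u0j   # group-specific intercept
        beta1j = g10 + g11 * z_j + u1j   # group-specific slope
        for _ in range(group_size):
            x_ij = rng.gauss(0.0, 1.0)   # individual-level predictor
            e_ij = rng.gauss(0.0, sd_e)  # individual-level residual
            rows.append((j, x_ij, z_j, beta0j + beta1j * x_ij + e_ij))
    return rows


# Hypothetical parameter values, chosen only to make the example run.
rows = simulate_two_level(
    n_groups=30, group_size=5,
    gammas=(1.0, 0.3, 0.3, 0.3),
    sd_u0=0.5, sd_u1=0.3, sd_e=1.0, seed=42,
)
print(len(rows), "rows; first row:", rows[0])
```

Swapping the `rng.gauss` draws for `u0j` and `u1j` with draws from a skewed or heavy-tailed distribution would turn this into a non-normal second-level scenario of the kind the paper studies.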

Maximum likelihood estimation

The usual estimation method for the multilevel regression model is maximum likelihood (ML) estimation (cf. Eliason, 1993). One important assumption underlying this estimation method is normality of the error distributions. When the residual errors are not normally distributed, the parameter estimates produced by the ML method are still consistent and asymptotically unbiased. However, the asymptotic standard errors are incorrect. Significance tests and confidence intervals can thus not be

The simulation model and procedure

We use a simple two-level model, with one explanatory variable at the individual level and one explanatory variable at the group level, conforming to Eq. (4), which is repeated here: $Y_{ij} = γ_{00} + γ_{10} X_{ij} + γ_{01} Z_{j} + γ_{11} X_{ij} Z_{j} + u_{1j} X_{ij} + u_{0j} + e_{ij}.$ Four conditions are varied in the simulation: (1) number of groups (NG: three conditions, NG=30, 50 and 100), (2) group size (GS: three conditions, GS=5, 30 and 50), (3) intraclass correlation (ICC: three conditions, ICC=0.1, 0.2 and 0.3; note that the ICC varies with the X

Convergence and inadmissible solutions

The estimation procedure converged in all 3×27,000=81,000 simulated data sets. The estimation procedure in MLwiN can and sometimes does lead to negative variance estimates. Such solutions are inadmissible, and common procedure is to constrain such estimates to the boundary value of zero. However, all simulated data sets produced only admissible solutions.
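The total of 81,000 data sets can be reconstructed from the design. The sketch below enumerates the 27 combinations of number of groups, group size, and intraclass correlation stated in the simulation section. The figure of 1,000 replications per combination is inferred here from 27,000 / 27, and the leading factor of 3 is presumed to correspond to the three non-normal distributions mentioned in the abstract. Neither is stated explicitly in this excerpt.

```python
from itertools import product

n_groups = (30, 50, 100)      # NG conditions
group_sizes = (5, 30, 50)     # GS conditions
iccs = (0.1, 0.2, 0.3)        # ICC conditions

conditions = list(product(n_groups, group_sizes, iccs))

# Inferred, not stated in the excerpt: 27,000 data sets over 27 conditions.
replications = 27_000 // len(conditions)
n_distributions = 3           # presumed: the three non-normal distributions

total = n_distributions * len(conditions) * replications
print(len(conditions), replications, total)  # prints "27 1000 81000"
```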

Percentage relative bias

The mean relative bias is calculated across all 27 conditions. We then test whether this relative bias differs from one, using an α of 0.001.
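One way to read this check is sketched below, assuming that relative bias is the ratio of an estimate to its true value, so that a value of one means no bias. The function name `relative_bias`, the normal approximation for the test, and the simulated estimates are hypothetical and may differ from what the authors actually did.

```python
import math
import random
import statistics


def relative_bias(estimates, true_value):
    """Return (mean relative bias, z statistic for H0: relative bias = 1)."""
    ratios = [est / true_value for est in estimates]
    mean_ratio = statistics.fmean(ratios)
    se = statistics.stdev(ratios) / math.sqrt(len(ratios))
    return mean_ratio, (mean_ratio - 1.0) / se


# Two-sided critical value for alpha = 0.001 under a normal approximation.
z_crit = statistics.NormalDist().inv_cdf(1 - 0.001 / 2)

# Hypothetical replications: one unbiased estimator, one that is 10% too low.
rng = random.Random(46)
true_value = 0.3
unbiased = [rng.gauss(true_value, 0.05) for _ in range(1000)]
too_low = [rng.gauss(0.9 * true_value, 0.05) for _ in range(1000)]

for label, sample in (("unbiased", unbiased), ("too low", too_low)):
    ratio, z = relative_bias(sample, true_value)
    flag = "differs from one" if abs(z) > z_crit else "no evidence of bias"
    print(f"{label}: relative bias = {ratio:.3f}, z = {z:.2f} ({flag})")
```

The strict α of 0.001 keeps the number of false alarms low when the same test is repeated over many parameters and conditions.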

Summary and discussion

Non-normally distributed residual errors on the second (group) level of a multilevel regression model appear to have little or no effect on the estimates of the fixed effects. The estimates of the regression coefficients are unbiased, and both the ML and the robust standard errors are accurate. There is no advantage here in using robust standard errors. This agrees with the common view that ML estimation methods are generally robust (cf. Eliason, 1993).

Non-normal distributed residual errors

Uncited reference

Bryk et al., 1996.

References (31)

Afshartous, D., 1995. Determination of sample size for multilevel model design. Unpublished paper. Annual Meeting of...
Browne, W.J., 1998. Applying MCMC methods to multilevel models. Unpublished Ph.D. Thesis, University of Bath, Bath,...
Browne, W.J., et al., 2000. Implementation and performance issues in the Bayesian and likelihood fitting of multilevel models. Comput. Statist.
Bryk, A.S., et al., 1992. Hierarchical Linear Models.
Bryk, A.S., et al., 1996. HLM. Hierarchical Linear and Nonlinear Modeling with the HLM/2L and HLM/3L programs.
Busing, F., 1993. Distribution characteristics of variance estimates in two-level models. Unpublished manuscript....
Carpenter, J., Goldstein, H., Rasbash, J., 1999. A non-parametric bootstrap for multilevel models. Multilevel Modelling...
Cohen, J., 1988. Statistical Power Analysis for the Behavioral Sciences.
Eliason, S.R., 1993. Maximum Likelihood Estimation.
Evans, M., et al., 1993. Statistical Distributions.
Goldstein, H., 1995. Multilevel Statistical Models. Edward Arnold, London; Halsted, New...
Gulliford, M.C., et al., 1999. Components of variance and intraclass correlations for the design of community-based surveys and intervention studies. Amer. J. Epidemiol.
Hox, J.J. Multilevel modeling: when and why.
Hox, J.J., 2002. Multilevel Analysis, Techniques and Applications.
Hox, J.J., et al., 2001. The accuracy of multilevel structural equation modeling with pseudobalanced groups and small samples. Struct. Equation Modeling.
