Cross-classified Multilevel Analysis of Individual Heterogeneity and Discriminatory Accuracy (MAIHDA) to evaluate hospital performance: the case of hospital differences in patient survival after acute myocardial infarction

Objective To describe a novel strategy, Multilevel Analysis of Individual Heterogeneity and Discriminatory Accuracy (MAIHDA), to evaluate hospital performance, by analysing differences in 30-day mortality after a first-ever acute myocardial infarction (AMI) in Sweden. Design Cross-classified study. Setting 68 Swedish hospitals. Participants 43 247 patients admitted between 2007 and 2009 with a first-ever AMI. Primary and secondary outcome measures We evaluate hospital performance by analysing differences in 30-day mortality after a first-ever AMI using a cross-classified multilevel analysis. We classified the patients into 10 categories according to a risk score (RS) for 30-day mortality and created 680 strata defined by combining hospital and RS categories. Results In the cross-classified multilevel analysis the overall RS adjusted hospital 30-day mortality in Sweden was 4.78% and the between-hospital variation was very small (variance partition coefficient (VPC)=0.70%, area under the curve (AUC)=0.54). The benchmark value was therefore achieved by all hospitals. However, as expected, there were large differences between the RS categories (VPC=34.13%, AUC=0.77). Conclusions MAIHDA is a useful tool to evaluate hospital performance. The benefit of this novel approach to adjusting for patient RS is that it allowed us to estimate separate VPC and AUC statistics to simultaneously evaluate the influence of RS categories and hospital differences on mortality. At the time of our analysis, all hospitals in Sweden were performing homogeneously well. That is, the benchmark target for 30-day mortality was fully achieved and there were no relevant hospital differences. Therefore, possible quality interventions should be universal and oriented to maintaining the high hospital quality of care.


Multilevel Analysis of Individual Heterogeneity and Discriminatory Accuracy (MAIHDA)
The cross-classified approach provides several advantages over the traditional hierarchical multilevel approach. First, the cross-classified MAIHDA is parsimonious, as it includes only one random parameter for the RS categories rather than the nine (i.e., 10 − 1) dummy variables of the fixed-effects approach. Second, the cross-classified approach provides separate VPCs and AUCs for RS categories and for the hospital, allowing their magnitudes to be contrasted. Thus, in contrast to the fixed-effects approach, it allows the importance of patient-mix vs. the hospital effects to be communicated on a common metric. In addition, hospital MAIHDA provides all the usual advantages of multilevel models. For instance, by providing reliability-weighted hospital averages (shrunken residuals), it reduces the concern of monitoring outcome measures based on small hospital caseloads, which otherwise may lead to extreme and unstable hospital rankings and, therefore, unreliable performance evaluation (16,17). Both hierarchical and cross-classified MAIHDA are nowadays easy to implement in available software such as MLwiN, which can be run from within both Stata (runmlwin) (18) and R (R2MLwiN) (19). Finally, for binary patient outcomes, such as 30-day mortality after AMI, multilevel analyses can be performed using a simple contingency table or matrix with strata defined by combinations of the hospitals and the RS categories capturing patient-mix. The only information required for the analysis is the total number of patients and the number of cases (deaths within 30 days) in each hospital-RS stratum. This aggregated approach maintains the joint distribution of the hospital-RS information and provides the same model results (parameter estimates, standard errors, fit statistics and predictions) as when analysing the underlying individual-level data.
The aggregated approach also allows computationally efficient (fast) estimation, as it allows analysing thousands of patients' outcomes using a dataset consisting of just a few hundred strata. A further benefit of the aggregated approach is that the data can be shared, since the aggregated presentation reduces ethical problems of confidentiality (statistical disclosure is not at risk). This, in turn, improves the transparency of the research, facilitates replication of the analysis, and encourages the sharing of data to compare hospital performance between different settings.
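As a sketch of this aggregated approach, the individual patient records can be collapsed to one row per hospital-RS stratum containing only the total number of patients and the number of deaths. The toy dataset and column names below are illustrative assumptions, not the study data:

```python
import pandas as pd

# Hypothetical individual-level records: one row per patient, with the
# hospital, the RS decile category, and the 30-day mortality outcome.
patients = pd.DataFrame({
    "hospital":  ["A", "A", "A", "B", "B", "B", "B"],
    "rs_decile": [1,   1,   2,   1,   2,   2,   2],
    "died_30d":  [0,   1,   0,   0,   1,   0,   1],
})

# Collapse to one row per hospital-RS stratum: the number of patients and
# the number of deaths are sufficient for the binomial multilevel model,
# and no individual-level information needs to be shared.
strata = (patients
          .groupby(["hospital", "rs_decile"])
          .agg(n_patients=("died_30d", "size"),
               n_deaths=("died_30d", "sum"))
          .reset_index())
print(strata)
```

Fitting the binomial model to this stratum-level table reproduces the individual-level results because the likelihood depends on the data only through these counts.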

The aim of this study
The aim of this study was to demonstrate a novel statistical approach (MAIHDA) to evaluate hospital performance and to orientate stakeholders to make assertive, data-informed decisions using a three-step framework. We do so by analyzing differences in 30-day mortality among patients admitted to Swedish hospitals with a first-ever AMI between 2007 and 2009.

Study Population
This is a cross-sectional study. We used information from the Swedish Patient Register (20) and the Cause of Death Register (21) (National Board of Health and Welfare), as well as from the Population Register (22) (Statistics Sweden). To ensure the anonymity of the subjects, the Swedish authorities transformed the personal identification numbers of the individuals (23) into arbitrary personal numbers before delivering the research databases to us, and we linked the databases using the anonymized identification number.

Ethical statement
This research was done without patient involvement. The Regional Ethics Review Board in southern Sweden (# 2012/637) as well as the data safety committees from the National Board of Health and Welfare and from Statistics Sweden approved the construction of the database used in this study.

Data accessibility
The original databases used in our study are available from the Swedish National Board of Health and Welfare and Statistics Sweden. In Sweden, register data are protected by strict rules of confidentiality (24) but can be made available for research after a special review that includes

Patient outcome
The study outcome was all-cause mortality (coded yes vs. no) within 30 days of admission to hospital due to AMI, regardless of the place of death.
Risk score for mortality

An inherent difficulty when investigating quality outcome indicators such as mortality is the threat of confounding arising from patient-mix. The geographical areas covered by the hospitals may vary in the demographic and disease characteristics of their patients. Furthermore, patients with a worse prognosis may be channeled to certain hospitals providing specialized care, and this selection of patients will further confound the evaluation of hospital differences. To reduce this form of confounding we computed an RS for 30-day mortality in the sample of AMI patients. Initially, we selected a priori 40 variables, including sex (man vs. woman) and age in years, and used them to compute our patient risk score (RS). We discretized the RS into 10 categories using the decile values of the RS distribution. We chose deciles to provide enough granularity of the continuous RS variable and enough categories to be included as a random effect in the multilevel model.
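The discretization step can be sketched as follows. The simulated scores and the use of pandas `qcut` are illustrative assumptions, not the study's actual RS construction:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical continuous risk scores; in the study the RS was derived
# from 40 a priori patient variables.
rs = pd.Series(rng.normal(size=1000))

# Discretize into 10 categories at the decile cut-points of the RS
# distribution, labelled 1..10.
rs_decile = pd.qcut(rs, q=10, labels=range(1, 11))

# Each decile contains (up to rounding/ties) 10% of the patients.
print(rs_decile.value_counts().sort_index())
```

With continuous scores and no ties, each of the 10 categories receives an equal share of patients, which is what makes the categories well suited for inclusion as a random effect.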

Statistical analyses
We analysed 30-day mortality among 43,247 patients admitted to 68 Swedish hospitals between 2007 and 2009 with a first-ever AMI. We classified the patients into 10 RS categories for 30-day mortality and created 680 strata defined by combining hospital and RS categories. In the first step (model 1), we applied a traditional hierarchical multilevel logistic regression model with patients clustered within hospitals. In a second step (model 2), and in order to adjust for patient-mix, we performed a cross-classified multilevel model of patient outcomes with both RS-category and hospital random effects (25). We estimated the variance partition coefficient (VPC) and the area under the ROC curve (AUC) to evaluate differences between RS categories and between hospitals on a common scale.

Estimation methods
We performed the estimations using Markov chain Monte Carlo (MCMC) methods, with diffuse (vague, flat, or minimally informative) prior distributions for all parameters. We used quasi-likelihood methods to provide starting values for all parameters. For each model, the burn-in length was 5,000 iterations. Visual assessments of the parameter chains and standard MCMC convergence diagnostics suggested that the monitored chains had converged. We ran each model for a further 10,000 monitoring iterations and used the resulting parameter chains to construct 95% credible intervals (CI) for all model predictions to communicate statistical uncertainty (Supplementary material S2).
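The use of a burn-in period and percentile-based credible intervals can be illustrated with a hypothetical parameter chain, simulated here rather than produced by MLwiN:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical MCMC output for one parameter: 5,000 burn-in iterations
# followed by 10,000 monitoring iterations.
chain = rng.normal(loc=0.5, scale=0.1, size=15_000)

burn_in = 5_000
monitored = chain[burn_in:]          # discard the burn-in draws

# Point estimate and 95% credible interval from the monitored chain.
estimate = monitored.mean()
ci_low, ci_high = np.percentile(monitored, [2.5, 97.5])
print(f"{estimate:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

The same percentile construction applies to any model prediction (e.g., a hospital's absolute risk) by transforming each MCMC draw before taking the 2.5th and 97.5th percentiles.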

Model 1: Unadjusted multilevel analysis
Model 1 was a multilevel logistic regression for patient mortality in which we only include a hospital random effect to account for the variation in mortality rates across hospitals. Let y_j denote the number of deaths in hospital j (j = 1, …, 68). The model can then be written as

y_j ~ Binomial(n_j, π_j)
logit(π_j) = β₀ + u_j    (Formula 1)

where n_j denotes the total number of patients in hospital j, π_j denotes the probability of death for patients in that hospital, β₀ denotes the intercept or precision-weighted grand mean, which in logistic regression represents the average of the hospital-specific shrunken (i.e., reliability-weighted) residual values, and u_j denotes the hospital-specific random effect for the 68 hospitals (26). The random effects are assumed to be normally distributed with zero mean and between-hospital variance σ²_u.

The purpose of this model was to evaluate unadjusted hospital differences in average mortality. We complemented this information by quantifying the size of the hospital GCE.

A) Ranking of the hospitals
To rank hospitals according to their unadjusted mortality rates, we predicted the absolute risk (AR_j) of 30-day mortality and its 95% credible interval (CI) in each hospital. To do so, we first transformed the predicted logit of 30-day mortality into proportions as follows:

AR_j ≡ π_j ≡ logit⁻¹(β₀ + u_j) = exp(β₀ + u_j) / (1 + exp(β₀ + u_j))    Formula 2
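Formula 2 can be sketched in code. The intercept and hospital residual used below are hypothetical values, not estimates from the study:

```python
import numpy as np

def inv_logit(x):
    """Transform a predicted logit into a proportion (Formula 2)."""
    return np.exp(x) / (1.0 + np.exp(x))

# Hypothetical intercept beta0 (grand mean on the logit scale) and one
# hospital's random effect u_j.
beta0, u_j = -3.0, 0.2

# Absolute 30-day mortality risk for that hospital.
ar_j = inv_logit(beta0 + u_j)
print(round(float(ar_j), 4))
```

Applying this transformation to every draw of β₀ + u_j from the MCMC chain yields the hospital's predicted risk together with its 95% credible interval.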

B) Measuring the hospital GCE.
We estimate the hospital GCE by means of two measures:

(i) The variance partition coefficient for the hospital level (VPC_H)
The VPC_H can be calculated based on the latent-response formulation of the model, an approach widely adopted today in applied multilevel work (27-31):

VPC_H = σ²_u / (σ²_u + π²/3)    (Formula 3)

where π²/3 ≈ 3.29 denotes the variance of a standard logistic distribution. We then multiply the VPC_H by 100 and interpret it as a percentage.

The VPC_H quantifies the share of the total individual differences in the latent propensity of 30-day mortality that lies at the hospital level. The VPC_H embraces the influence of the hospital context on the patient outcome without identifying any specific hospital information. However, the VPC_H may also reflect differences in patient-mix between hospitals. In the absence of confounding by patient-mix, the higher the VPC_H, the higher the hospital GCE; in other words, the more relevant the hospital context is for understanding individual variation in the latent risk of 30-day mortality.
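A minimal sketch of the latent-response VPC calculation; the between-hospital variance used here is a hypothetical value, not the study estimate:

```python
import math

def vpc_latent(sigma2_hospital):
    """VPC on the latent-response scale for a two-level logistic model:
    the hospital share of the total latent variance, with the
    individual-level variance fixed at pi^2/3 (~3.29)."""
    individual = math.pi ** 2 / 3          # variance of a standard logistic
    return sigma2_hospital / (sigma2_hospital + individual)

# Hypothetical between-hospital variance on the logit scale.
sigma2_u = 0.023
print(f"VPC_H = {100 * vpc_latent(sigma2_u):.2f}%")
```

Note how even a seemingly non-trivial variance on the logit scale translates into a VPC below 1% once the large individual-level latent variance enters the denominator.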

(ii) The area under the receiver operating characteristic curve for the hospital (AUC_H)
A well-known measure of discriminatory accuracy is the AUC (10,11,32). In our case, the hospital AUC_H measures how well the model-predicted probabilities based on the attended hospitals distinguish between the two outcome categories (death within 30 days vs. survival). The AUC_H is constructed by plotting the true positive fraction (TPF) against the false positive fraction (FPF) for different thresholds of the predicted probabilities. The AUC takes a value between 0.5 and 1, where 1 is perfect discrimination and 0.5 is equally as informative as flipping a coin (i.e., the hospital information has no discriminatory accuracy).
We calculated the AUC_H and, to account for the different number of patients in each hospital, we also calculated a weighted AUC_H in which every patient was weighted by the inverse of the number of patients at his/her hospital. In our study both the unweighted and the weighted AUC_H were almost identical, so we only present the unweighted AUC_H.
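The AUC can be sketched via its Mann-Whitney interpretation, hand-rolled here for transparency; the predicted probabilities below are hypothetical:

```python
import numpy as np

def auc(scores_pos, scores_neg):
    """AUC as the probability that a randomly chosen death (positive)
    receives a higher predicted risk than a randomly chosen survivor,
    with ties counted as 1/2 (Mann-Whitney formulation)."""
    pos = np.asarray(scores_pos)[:, None]
    neg = np.asarray(scores_neg)[None, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return wins / (pos.size * neg.size)

# Hypothetical model-predicted probabilities for 3 deaths and 4 survivors.
deaths    = [0.30, 0.20, 0.20]
survivors = [0.10, 0.20, 0.05, 0.15]
print(auc(deaths, survivors))
```

An AUC_H near 0.5 (as reported in this study) means that knowing only the hospital's predicted risk barely helps to distinguish deaths from survivors.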

Model 2: Patient-mix adjusted multilevel analysis
Model 2 was a cross-classified multilevel logistic regression (25) for patient mortality in which we include both hospital and RS-category random effects to simultaneously account for the variation in mortality rates across both hospitals and RS categories, and thereby adjust the hospital effects for patient-mix. The model can be written as

y_jk ~ Binomial(n_jk, π_jk)
logit(π_jk) = β₀ + u_j + v_k    (Formula 4)

where v_k denotes the random effect associated with RS category k (k = 1, …, 10); these are assumed to be normally distributed with zero mean and between-RS variance σ²_v. The predicted logit of 30-day mortality was transformed into proportions, analogously to Formula 2. The mortality rates from this model are standardized and represent the rate that each hospital would have experienced if all hospitals had treated the same patients, in our case a patient with an average RS value in the population (i.e., v_k = 0). The purpose of model 2 was to evaluate patient-mix adjusted hospital differences in average mortality risk. Therefore, and analogously to model 1, we ranked the hospitals according to their RS-adjusted mortality risk and complemented this information by quantifying the size of the hospital GCE net of the observed patient-mix influence.
As a measure of the patient-mix adjusted hospital GCE, we obtained the hospital VPC_H as

VPC_H = σ²_u / (σ²_u + σ²_v + π²/3)

Formula 6
The adjusted VPC_H and VPC_RS inform on the share of the total individual variance in the latent propensity of 30-day mortality that lies at the hospital level and at the RS-category level, respectively, net of the influence of the other factor. Both measures are estimated on the same scale and can therefore be directly compared to evaluate the relative relevance of hospital versus patient-mix information when it comes to understanding patient differences in the latent propensity of death. We also calculated the adjusted AUCs for the hospital level (AUC_H) and for the RS-category level (AUC_RS) by including their specific random effects when calculating the predicted probabilities.
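Formula 6 and the comparison of the two adjusted VPCs can be sketched as follows; the variance values are hypothetical, chosen only to illustrate the contrast between a small hospital share and a large RS share:

```python
import math

INDIVIDUAL = math.pi ** 2 / 3   # latent individual-level variance, ~3.29

def vpc_cross_classified(sigma2_hospital, sigma2_rs):
    """Adjusted VPCs from the cross-classified model (Formula 6): each
    factor's share of the total latent variance, net of the other."""
    total = sigma2_hospital + sigma2_rs + INDIVIDUAL
    return sigma2_hospital / total, sigma2_rs / total

# Hypothetical variances on the logit scale for hospitals and RS deciles.
vpc_h, vpc_rs = vpc_cross_classified(0.035, 1.70)
print(f"hospital VPC = {100 * vpc_h:.1f}%, RS VPC = {100 * vpc_rs:.1f}%")
```

Because both coefficients share the same denominator, they can be compared directly: here the RS categories account for a far larger share of the latent variance than the hospitals do.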

Software
All models were run in MLwiN 3.02 (33) called from Stata using the runmlwin command (18).
We note that MLwiN can equally be called from within R using the sister R2MLwiN package (19) and so our analysis can also be replicated by readers in that statistical package.

Evaluating hospital performance with MAIHDA
The present cross-classified MAIHDA framework extends one described in a previous publication aimed at the evaluation of geographical differences in health outcomes (34). The framework proposes three steps that need to be considered to achieve a complete analysis of hospital performance. However, more elaborate strategies are of course also possible, and the presented framework is open for modifications and extensions. The application of the framework in our study was as follows.

Step 1. Identifying a benchmark value and evaluating the adjusted hospital mortality rates against it.
When evaluating hospital performance, we need to identify a benchmark value expressing a tolerable average level of 30-day mortality in the population of AMI patients. However, the selection of a specific benchmark is often difficult and arbitrary. We can use an internal benchmark defined as the β₀ obtained in model 2 (Formula 4); that is, the mortality rate in a hospital with an average mortality (u_j = 0) treating patients with an average RS (v_k = 0). This choice is meaningful since comparing with a national average seems "fair", and being RS adjusted, tertiary care hospitals with more severe cases do not unfairly push the hospital effect towards a higher value as in the crude average rate. However, being an adjusted rate, the value does not necessarily resemble the crude rates and can only be used for relative comparisons. Other adjustments are possible, such as computing adjusted hospital mortality rates by holding v_k equal to the 90th percentile of the RS random-effect distribution. Another possibility would be to calculate the adjusted hospital rates that would arise if each hospital treated a nationally representative sample of patients. However, statistically this would be more complex, as it would require integrating out the RS random effect (via simulation), and so we do not pursue it here.
We could also use an external benchmark by applying an RS equation for 30-day mortality obtained in another country or in an international collaboration. This approach seems worthwhile for international comparisons, but it may also be inappropriate if the RS equation does not properly account for demographic and comorbidity differences in risk between populations, or if the diagnostic criteria are not fully standardized.
Finally, we could use an arbitrary rate based on previous evidence and proven expertise.
Considering the data published by the Organisation for Economic Co-operation and Development (OECD) (35) and a recent review article (36), we decided on an RS-adjusted 30-day mortality of less than 6% as a desirable target value for the purposes of this illustrative application.
We used caterpillar graphs to compare the RS adjusted hospital rates in relation to the benchmark value of 6%.

Step 2. Quantifying the size of the hospital differences using the VPC and the AUC
Currently there is no official guidance for assessing the magnitude of the VPC or the AUC in the context of studying hospital differences in RS-adjusted 30-day mortality, but a practical proposal is described elsewhere (34).
The proposed values are based on the authors' own experience, but further discussion is encouraged to arrive at a standard classification. Furthermore, different standards may ultimately be required and developed for different outcomes, in different contexts, and at different points in time. For instance, Hosmer and Lemeshow (37) suggested that AUCs of 0.70 to 0.80 are 'acceptable', 0.80 to 0.90 'excellent' and 0.90 or above 'outstanding', while an AUC of 0.50 indicates discrimination no better than chance, i.e., similar to tossing a coin to decide between death and survival.

Step 3. Interpreting results to evaluate performance
The two primary questions for the hospital performance evaluation were: (i) has the benchmark value been insufficiently, closely, or fully reached? (ii) are there substantial differences between the hospitals, or do they perform homogeneously? To answer both questions, we created a framework (Table 1) with 15 scenarios combining information on the benchmark value achievement and the size of hospital differences based on model 2.
In the best scenario (scenario A) the desired target level has been fully achieved overall (averaging across all hospitals the adjusted mortality rate is less than 6%), and hospital differences are effectively absent (the Hospital GCE is effectively absent). The conclusion would be that all hospitals have performed similarly well. In the worst scenario (scenario C) the desired target level has not been achieved overall, and between-hospital differences are again absent. The conclusion would be that all hospitals have performed similarly but poorly.
Observe that in both scenarios A and C, interventions targeted only at specific hospitals are not justifiable. Rather, any intervention should be universal (i.e., directed to all hospitals), as in both scenarios all hospitals are performing similarly. In scenario A, the intervention would be oriented to maintaining the overall high quality, while in scenario C the objective would be to improve the quality in all hospitals. In contrast, information on the RS value of the patients was much more relevant than information on the hospital where they were treated: the VPC_RS was very large (i.e., 38.4%) and the AUC_RS = 0.77. Figure 3 clearly depicts the differential discriminatory accuracies of the hospital and RS random effects. The same approach could be applied to other outcomes (41,42) or to process quality indicators of diabetes care such as albuminuria analysis (43).
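The two-question logic above can be sketched as a simple decision rule. This is a simplification of the 15-scenario framework in Table 1; the branching and wording are our illustrative assumptions, not the authors' exact table:

```python
def recommend_intervention(benchmark_achieved, hospital_gce_relevant):
    """Sketch of the decision logic behind the framework: whether the
    benchmark was reached, and whether the hospital GCE is relevant."""
    if not hospital_gce_relevant:
        # Scenarios A/C: hospitals perform homogeneously, so only
        # universal interventions are justifiable.
        if benchmark_achieved:
            return "universal: maintain overall high quality"
        return "universal: improve quality in all hospitals"
    # Relevant hospital differences justify targeted interventions.
    if benchmark_achieved:
        return "targeted: address the underperforming hospitals"
    return "targeted and universal: raise overall quality and reduce differences"

print(recommend_intervention(benchmark_achieved=True,
                             hospital_gce_relevant=False))
```

In the present study the inputs would be (benchmark achieved, GCE effectively absent), which maps to the universal "maintain quality" recommendation of scenario A.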

Discussion
The evaluation of institutional performance using the VPC is not new (12,39,44,45). Normand (46) states that, when evaluating hospital performance, if the VPC is zero there are no hospital quality differences; that is, the chance that a patient experiences an event after being treated is the same regardless of the hospital (p. 33). This idea is also explicitly expressed by the committee assigned to set statistical guidelines for assessing hospital performance in the USA (47).
The share of the total variance that is at the hospital level is crucial for evaluating performance.
However, this fundamental concept needs to be applied more extensively. Today, it is recognized that multilevel models (hierarchical or mixed-effects models) are the preferred methodology for provider profiling. However, the substantive analysis of components of variance still receives little attention, and most studies only consider multilevel modeling for its capacity to account for the clustering of patients within hospitals in order to obtain "correct standard errors" on regression coefficients and odds ratios. Some authors even conclude that hospital averages (odds ratios, observed/expected values) obtained from multilevel analyses give similar results to traditional logistic regression analyses. This is interpreted as an argument for keeping traditional logistic regression as the standard method for performing risk adjustment of hospital quality comparisons (48-50). However, we do not agree with this opinion. The first reason is that the fixed-effects approach does not explicitly inform on components of variance. The second reason is that the equivalence between traditional and multilevel regression results only occurs when the hospital GCE (i.e., the clustering) is low and the number of patients at the hospitals is very high (i.e., reliable estimation of hospital averages) (26). In other words, traditional non-multilevel analyses give similar results to the multilevel analysis only when the hospital differences are not relevant (i.e., low VPC) and the patient load is very large in every hospital (which is rarely the case). In addition, hospital-level variables appear paradoxically more statistically "significant" when the hospital level is less relevant (i.e., low VPC) (7). Information on the size of the hospital GCE is, therefore, fundamental for a sound analysis of hospital performance.
In this study we have applied the AUC to evaluate the hospital GCE. The AUC is a measure of discriminatory accuracy (DA) frequently used for gauging the performance of prognostic and screening markers in medicine (8,9) but it can also be used to quantify hospital GCE (10,34).
So far, many epidemiologists may not be familiar with the use of measures of components of variance like the VPC for binary outcomes (30). However, the AUC measure is well established in clinical and health care epidemiology, and the information it gives is relatively easy to interpret and communicate. From this perspective, the evaluation of hospital performance resembles a screening test, and so we must know the discriminatory accuracy of, for instance, a "league table" to make informed decisions.
We performed this adjustment with an innovative strategy that uses hospitals and decile groups of the RS in a cross-classified multilevel analysis. A key advantage of this approach is that it allowed a direct comparison of the importance of patient case-mix and hospital effects in explaining variation in patient outcomes across hospitals. We did not use existing patient-mix adjustment scores such as the Charlson Risk Score (51), the Elixhauser Score (52)

Contributors:
JM had the initiative of the study and acquired the data.
JM and M R-L wrote the original manuscript.
M R-L and RP-V performed the analyses in coordination with JM.
PA and GL provided advanced statistical support.

Keywords: Multilevel analysis, Health Evaluation, Decision Making, ROC curve, Variance analysis

Strengths and limitations of this study
We provide a new analytical tool for analysing hospital performance based on multilevel analysis of individual heterogeneity and discriminatory accuracy (MAIHDA).
Cross-classified MAIHDA disentangles the specific role of the patient-mix vs. the hospital when analyzing quality outcomes.
We used a risk score to adjust for differences in patient-mix across hospitals. However, it is not a perfect instrument to quantify the true severity and mortality risk of a patient.
MAIHDA allows analysts to identify whether targeted or universal interventions are most appropriate to improve the quality of care.
We provide a three-step strategy to achieve a complete analysis of hospital performance. However, more elaborate strategies are also possible.

Introduction

When evaluating institutional (e.g., hospital) performance in health care, traditional studies make two implicit assumptions. First, it is assumed that, over and above patient characteristics, the hospital context exerts a general, shared effect on all patients at the hospital. This general hospital-context effect is argued to reflect the influence of many factors, for instance, hospital administration, access to resources, specialized knowledge, implementation of methods for disease management, and adoption of guidelines and pathways for patient treatments. Second, it is often assumed that the general hospital-context effect can be measured by quantifying differences between hospital averages in certain quality indicators. Therefore, the focus of the analysis is based on the interpretation of tables, funnel plots, control charts, 'league tables' or similar, where hospitals are ranked according to different quality indicators such as their average 30-day mortality after acute myocardial infarction (AMI) (1). Occasionally such analyses are accompanied by an estimation of the reliability of the ranking ("rankability") (2), but more often than not the focus of analysis remains on hospital averages.

Multilevel Analysis of Individual Heterogeneity and Discriminatory Accuracy (MAIHDA)
Recently, MAIHDA has been proposed as a novel strategy for evaluating hospital performance (3). In contrast with most traditional studies, hospital MAIHDA simultaneously focuses on both hospital averages and patient heterogeneity around such averages. In MAIHDA, the fundamental statement is that patient and hospital variation should not be analysed separately.
Rather, we need to consider that the total individual outcome variance can be partitioned into variance components operating at different levels of analysis (4). From this perspective, hospital differences are not measured as the difference between hospital averages, but as the hospital general contextual effect (GCE), that is, the share of the total individual variance in patient outcomes that lies at the hospital level (5-7). This idea is also closely related to the notion of discriminatory accuracy developed for the evaluation of the performance of prognostic and screening markers in medicine (8,9). It is therefore possible to also use measures of discriminatory accuracy, such as the area under the receiver operating characteristic curve (AUC), to quantify the hospital GCE (10). See elsewhere for an extended explanation of the GCE concept (3,7,11). In this article, we argue that the systematic application of measures of variance and discriminatory accuracy is of fundamental relevance for meaningful performance evaluations (3,5,11-13).

Cross-classified MAIHDA
Hospital comparisons are usually adjusted for "patient-mix" using a risk score (RS). In traditional multilevel analysis of hospital performance, patient RS effects are modelled as fixed effects (e.g., by entering the set of RS categories as a series of dummy variables) while the hospital effects are modelled as random effects. In contrast, in the cross-classified MAIHDA approach both the RS and the hospitals are modelled as random effects. Readers familiar with the traditional application of multilevel modelling may query the treatment of RS categories as random effects. For example, while we can think of the hospitals as a sample drawn from the set of all possible hospitals, it proves harder to conceptualize the RS categories in this way as there is not a large population of RS categories from which they are drawn. This is, however, a philosophical, rather than a practical question. In fact, when studying hospitals in a country, the hospitals are never a sample of an infinite superpopulation of hospitals but a concrete set of facilities in a specific setting. Furthermore, many multilevel studies observe and analyze all the hospitals in a country in their data, and the total number of hospitals may not prove that large, yet here too the hospital effects will be treated as random effects. As discussed by Snijders and Bosker, when defining the random intercept model (14), p. 45, the random effects model can be applied even when the idea of an infinite superpopulation is less evident. This approach is currently being applied when performing intersectional MAIHDA in social epidemiology (15).
The cross-classified approach provides several advantages over the traditional hierarchical multilevel approach. First, the cross-classified MAIHDA is parsimonious as it includes only one random parameter for the RS categories rather than the categories-minus-one (here, nine) dummy variables of the fixed-effects approach. Second, the cross-classified approach provides separate VPCs and AUCs for RS categories and for the hospital, allowing their magnitude to be contrasted. Thus, in contrast to the fixed-effects approach, it allows the importance of patient-mix vs. the hospital effects to be communicated on a common metric. In addition, hospital MAIHDA provides all the usual advantages of multilevel models. For instance, by providing reliability-weighted hospital averages (shrunken residuals), it reduces the concern of monitoring outcome measures based on small hospital caseloads, which otherwise may lead to extreme and unstable hospital rankings and, therefore, unreliable performance evaluation (16,17). Both hierarchical and cross-classified MAIHDA are nowadays easy to implement in available software such as MLwiN, which can be run from both within Stata (runmlwin) (18) and within R (R2MLwiN) (19).
Finally, for binary patient outcomes, such as 30-day mortality after AMI, multilevel analyses can be performed using a simple contingency table or matrix with strata defined by combinations of the hospitals and the RS categories capturing patient-mix. The only information required for the analysis is the overall number of patients and the number of AMI cases in each hospital-RS stratum. This aggregated approach maintains the joint distribution of the hospitals-RS information and provides the same model results (parameter estimates, standard errors, fit statistics and predictions) as when analysing the underlying individual level data. The aggregated approach also allows computationally efficient (fast) estimation as it allows analysing thousands of patients' outcomes using a dataset consisting of just a few hundred strata. A further benefit of the aggregated approach is that the data can be shared since its aggregated presentation reduces ethical problems of confidentiality (statistical disclosure is not at risk). This in turn, improves the transparency of the research and facilitates the replication of the analysis and encourages the sharing of data to compare hospital performance between different settings.
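The aggregation step described above can be sketched in a few lines; the following is a minimal Python illustration (the hospital labels and toy records are hypothetical, not taken from the Swedish data):

```python
from collections import defaultdict

def aggregate_strata(records):
    """Collapse individual (hospital, rs_category, died) records into
    hospital-by-RS strata holding only the patient total and the death
    count -- the only information a binomial multilevel model needs."""
    strata = defaultdict(lambda: [0, 0])  # (hospital, rs) -> [n, deaths]
    for hospital, rs_category, died in records:
        cell = strata[(hospital, rs_category)]
        cell[0] += 1      # one more patient in this stratum
        cell[1] += died   # died is 0/1
    return {key: tuple(cell) for key, cell in strata.items()}

# toy individual-level data: (hospital, RS category, died within 30 days)
records = [
    ("H1", 1, 0), ("H1", 1, 1), ("H1", 2, 0),
    ("H2", 1, 0), ("H2", 2, 1), ("H2", 2, 0),
]
table = aggregate_strata(records)
```

Because the binomial likelihood depends on the data only through these stratum counts, fitting the model to `table` gives the same estimates as fitting it to `records`, while the aggregated table can be shared without disclosure risk.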

The aim of this study
The aim of this study was to demonstrate a novel statistical approach (MAIHDA) to evaluate hospital performance and to orientate stakeholders to make assertive data-informed decisions using a three-step framework. We do so by analyzing differences in 30-day mortality among patients admitted to Swedish hospitals with a first-ever AMI between 2007 and 2009.

Study Population
This is a cross-sectional study. We used information from the Swedish Patient Register (20) and from the Cause of Death Register. This research was done without patient involvement. The Regional Ethics Review Board in southern Sweden (# 2012/637) as well as the data safety committees from the National Board of Health and Welfare and from Statistics Sweden approved the construction of the database used in this study.

Data accessibility
The original databases used in our study are available from the Swedish National Board of Health and Welfare and Statistics Sweden. In Sweden, register data are protected by strict rules of confidentiality (24) but can be made available for research after a special review that includes approval of the research project by both an Ethics Committee and the authorities' own data safety committees. The Swedish authorities under the Ministry of Health and Social Affairs do not provide individual level data to researchers abroad. Instead, they normally advise researchers in other countries to cooperate with Swedish colleagues and analyze data in collaboration according to standard legal provisions and procedures.
However, in the approach we propose, it is technically possible to perform the analysis using a simple table defined by hospital and categories of risk score. The aggregated information, as well as the additional encryption of the hospital names, fully anonymizes the table, which prevents the backwards identification of individuals even when very few patients are in a single cell of the table. Therefore, to increase transparency and facilitate the replication of our analysis, we provide the table as a Stata dataset (Supplementary material S1) and a fully annotated Stata do-file to allow the replication of the analyses (Supplementary material S2). We also provide the table as a CSV file along with an R script (Supplementary material S3, S4).

Risk score for mortality
An inherent difficulty when investigating quality outcome indicators such as mortality is the threat of confounding due to differences in patient-mix across hospitals.

Statistical analyses
We analysed 30-day mortality among 43,247 patients admitted to 68 Swedish hospitals between 2007 and 2009 with a first-ever AMI. We classified the patients into 10 RS categories for 30-day mortality and created 680 strata defined by combining hospital and RS categories. In the first step (model 1), we applied a traditional hierarchical multilevel logistic regression model with patients clustered within hospitals. In a second step (model 2), and in order to adjust for patient-mix, we performed a cross-classified multilevel model of patient outcomes with both RS category and hospital random effects (25). We estimated the VPC and the AUC to evaluate differences between RS categories and between hospitals on a common metric.

Estimation methods
We performed the estimations using Markov Chain Monte Carlo (MCMC) methods, with diffuse (vague, flat, or minimally informative) prior distributions for all parameters. We used quasi-likelihood methods to provide starting values for all parameters. For each model, the burn-in length was 5,000 iterations. We ran the model for a further 10,000 monitoring iterations and used the resulting parameter chains from the MCMC to construct 95% credible intervals (CI) for all model predictions to communicate statistical uncertainty (Supplementary material S2).
Visual assessments of the parameter chains and standard MCMC convergence diagnostics suggested that the monitored chains had converged.
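The mechanics of reading a credible interval off a monitored chain can be sketched as follows; this is a generic illustration of the percentile method, not MLwiN's implementation:

```python
def credible_interval(chain, burn_in, level=0.95):
    """Discard the burn-in draws and take nearest-rank percentiles of
    the remaining monitored draws as a central credible interval."""
    monitored = sorted(chain[burn_in:])
    tail = (1.0 - level) / 2.0          # e.g. 0.025 for a 95% CI
    n = len(monitored)
    return (monitored[int(tail * (n - 1))],
            monitored[int((1.0 - tail) * (n - 1))])

# usage: 5,000 burn-in iterations, interval from the monitored draws
# ci = credible_interval(parameter_chain, burn_in=5000)
```

The same quantile operation applied to a chain of model predictions yields the 95% CIs reported for the hospital mortality estimates.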

Model 1: Unadjusted multilevel analysis
Model 1 was a multilevel logistic regression for patient mortality where we only include a hospital random effect to account for the variation in mortality rates across hospitals. Let y_j denote the number of deaths in hospital j (j = 1, …, 68). The model can then be written as

y_j ~ Binomial(n_j, π_j)
logit(π_j) = β_0 + u_j
u_j ~ N(0, σ²_u)

where n_j denotes the total number of patients in hospital j, π_j denotes the probability of death for patients in that hospital, β_0 denotes the intercept or precision-weighted grand mean, which in logistic regression represents the average of the hospital-specific shrunken (i.e., reliability-weighted) residual values, and u_j denotes the hospital-specific random effects for the 68 hospitals (26). The random effects are assumed to be normally distributed with zero mean and between-hospital variance σ²_u.

The purpose of this model was to evaluate unadjusted hospital differences in average mortality risk. For this aim, we (A) ranked the hospitals according to their mortality risk; and (B) complemented this information by quantifying the size of the GCE.

A) Ranking of the hospitals
To rank hospitals according to their unadjusted mortality rates, we predicted the absolute risk (AR_j) of 30-day mortality and its 95% CI in each hospital. To do so, we transformed the predicted logit of 30-day mortality into proportions (AR_j) as follows

AR_j = exp(β_0 + u_j) / (1 + exp(β_0 + u_j))
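This antilogit transformation, in a minimal Python sketch (the intercept and residual values shown are hypothetical, for illustration only):

```python
import math

def absolute_risk(logit_value):
    """Antilogit: map a predicted log-odds of 30-day mortality to a
    probability (absolute risk) between 0 and 1."""
    return math.exp(logit_value) / (1.0 + math.exp(logit_value))

# a hypothetical intercept plus one hospital's shrunken residual,
# both on the logit scale
beta0, u_j = -3.0, 0.2
ar_j = absolute_risk(beta0 + u_j)  # that hospital's absolute risk
```

Applying the same transformation to the lower and upper bounds of the logit-scale credible interval yields the CI on the probability scale.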

B) Measuring the hospital GCE.
We estimate the hospital GCE by means of two measures:

(i) The variance partition coefficient for the hospital level (VPC_H)
The VPC_H can be calculated based on the latent response formulation of the model, which is an approach widely adopted today in multilevel applied work (27-31):

VPC_H = σ²_u / (σ²_u + π²/3)

where π²/3 ≈ 3.29 denotes the variance of a standard logistic distribution. We then multiply the VPC_H by 100 and interpret it as a percentage.

The VPC_H quantifies the share of the total individual differences in the latent propensity of 30-day mortality that is at the hospital level. The VPC_H embraces the influence of the hospital context on the patient outcome without identifying any specific hospital information. However, the VPC_H may also reflect differences in patient-mix between hospitals. In any case, the VPC_H represents the hospital ceiling effect, or potential maximum influence, of the hospital attended.
In the absence of confounding by patient-mix, the higher the VPC_H, the higher the hospital GCE is. In other words, the more relevant the hospital context is for understanding individual variation in the latent risk of 30-day mortality.
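Under the latent-response formulation, the VPC computation reduces to the following sketch (the variance values fed in would come from the fitted model; those in the usage note are hypothetical):

```python
import math

def vpc_percent(level_variance, *other_level_variances):
    """Latent-response VPC: the share of the total latent variance
    attributable to one level, expressed as a percentage. For a
    logistic model the patient-level residual is fixed at pi^2/3."""
    total = (level_variance
             + sum(other_level_variances)
             + math.pi ** 2 / 3)  # ~3.29, standard logistic variance
    return 100.0 * level_variance / total
```

For example, a between-hospital variance exactly equal to π²/3 would give a VPC_H of 50%; in a cross-classified model, the RS-category variance is passed as an additional term in the denominator.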

(ii) The area under the receiver operating characteristic curve for the hospital level (AUC_H)
A well-known measure of discriminatory accuracy is the AUC (10,11,32). In our case, the hospital AUC_H measures how well the model-predicted probabilities based on the attended hospital distinguish between the two outcome categories (death within 30 days or survival). The AUC_H is constructed by plotting the true positive fraction (TPF) against the false positive fraction (FPF) for different thresholds of the predicted probabilities. The AUC takes a value between 0.5 and 1, where 1 is perfect discrimination and 0.5 is equally as informative as flipping a coin (i.e., the hospital information has no discriminatory accuracy).
We calculated the AUC_H and, to account for the different number of patients in each hospital, we also calculated the weighted AUC_H, where every patient was weighted by the inverse of the number of patients in their hospital.
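The AUC has an equivalent rank-based (Mann–Whitney) construction that avoids plotting the curve; the following is a generic sketch of that construction, not the exact routine used in the paper:

```python
def auc(predicted, died):
    """AUC as the probability that a randomly chosen death received a
    higher predicted risk than a randomly chosen survivor, counting
    ties as one half (Mann-Whitney formulation)."""
    pos = [p for p, y in zip(predicted, died) if y == 1]  # deaths
    neg = [p for p, y in zip(predicted, died) if y == 0]  # survivors
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

When the predicted probabilities are based only on the hospital attended, all patients in the same hospital share one value, so discrimination comes entirely from between-hospital differences; an AUC of 0.5 then means the hospital information carries no discriminatory accuracy.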

Model 2: Patient-mix adjusted multilevel analysis
Model 2 was a cross-classified multilevel logistic regression (25) for patient mortality in which we included both hospital and RS category random effects to simultaneously account for the variation in mortality across both hospitals and RS categories and, therefore, to adjust the hospital effects for patient-mix. The model can be written as

y_jk ~ Binomial(n_jk, π_jk)
logit(π_jk) = β_0 + u_j + v_k

where v_k denotes the random effect associated with RS category k (k = 1, …, 10); these effects are assumed to be normally distributed with zero mean and between-RS variance σ²_v. The purpose of model 2 was to evaluate patient-mix adjusted hospital differences in average mortality risk. Therefore, and analogously to model 1, we ranked the hospitals according to their RS adjusted mortality risk and complemented this information by quantifying the size of the hospital GCE net of the observed patient-mix influence. Visual inspection of the hospital and RS category predicted random effects showed that the random effect normality assumptions were satisfied (Supplementary material S2).
The adjusted VPC_H and VPC_RS inform on the share of the total individual variance in the latent propensity of 30-day mortality that is at the hospital level and at the RS category level, respectively, net of the influence of the other factor. Both measures are estimated on the same scale and can therefore be directly compared to evaluate the relative relevance of hospital versus patient-mix information when it comes to understanding patient differences in the latent propensity of death.
We also calculated the adjusted AUCs for the hospital level and for the RS category level (AUC_H and AUC_RS) by including their specific random effects when calculating the predicted probabilities.
While patients with relatively mild conditions may have similarly good outcomes regardless of where they are treated, outcomes of the most complex patients may be affected by hospital performance. We therefore fitted a cross-classified model including a random interaction effect between the hospital and the RS (Supplementary material S2). That is, we allowed the effect that a hospital has on its patients to vary according to the RS classification of those patients, and vice versa. However, the resulting interaction classification variance was very low, suggesting that hospital attended and patient RS have additive effects on the log-odds of 30-day mortality.
Consequently, we based our analysis on models 1 and 2. All models were run in MLwiN 3.05 (33) called from Stata using the runmlwin command (18).
We note that MLwiN can equally be called from within R using the sister R2MLwiN package (19) and so our analysis can also be replicated by readers in that statistical package.

Evaluating hospital performance with MAIHDA
The present cross-classified MAIHDA framework extends one described in a previous publication aimed at the evaluation of geographical differences in health outcomes (34). The framework proposes three steps that need to be considered to achieve a complete analysis of hospital performance. However, more elaborate strategies are of course also possible, and the presented framework is open to modifications and extensions. The application of the framework in our study was as follows.

Step 1. Identifying a benchmark value and evaluating the adjusted hospital mortality rates against it.
When evaluating hospital performance, we need to identify a benchmark value expressing a tolerable average level of 30-day mortality in the population of AMI patients. However, the selection of a specific benchmark is often difficult and arbitrary. We can use an internal benchmark defined as the β_0 obtained in model 2 (formula 4). That is, the mortality rate in a hospital with an average mortality (u_j = 0) treating patients with an average RS (v_k = 0). This choice is meaningful since comparing with a national average seems "fair" and, being RS adjusted, tertiary care hospitals with more severe cases do not unfairly push the hospital effect towards a higher value as they would in the crude average rate. However, being an adjusted rate, the value does not necessarily resemble the crude rates and can only be used for relative comparisons. Other adjustments are possible, such as computing adjusted hospital mortality rates by holding the RS random effect v_k equal to the 90th percentile of the RS random effect distribution.

Step 2. Quantifying the size of the hospital differences using the VPC and the AUC
Currently there is no official guidance for assessing the magnitude of the VPC or the AUC in the context of studying hospital differences in RS adjusted 30-day mortality, but a practical proposal is described in Table 1. This table also shows the corresponding AUC values according to the simulated relationship between the AUC and the VPC published elsewhere (34). The proposed values are based on the authors' own experience, but further discussion is encouraged to arrive at a standard classification. Furthermore, different standards may ultimately be required and developed for different outcomes in different contexts.

Step 3. Interpreting results to evaluate performance
The two primary questions for the hospital performance evaluation were: (i) has the benchmark value been insufficiently, closely, or fully reached? (ii) are there substantial differences between the hospitals, or do they perform homogenously? To answer both questions, we created a framework (Table 1) with 15 scenarios combining information on the benchmark value achievement and the size of hospital differences based on model 2.
In the best scenario (scenario A), the desired target level has been fully achieved overall (averaging across all hospitals, the adjusted mortality rate is less than 6%), and hospital differences are effectively absent (i.e., the hospital GCE is negligible). The conclusion would be that all hospitals have performed similarly well. In the worst scenario (scenario C), the desired target level has not been achieved overall, and between-hospital differences are again absent. The conclusion would be that all hospitals have performed similarly but poorly.
Observe that in both scenarios A and C, interventions only targeted to specific hospitals are not justifiable. Rather any intervention should be universal (i.e., directed to all hospitals) as in both scenarios all hospitals are performing similarly. In scenario A, the intervention would be oriented to maintaining the overall high quality while in scenario C, the objective would be to improve the quality in all hospitals.
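The decision logic behind scenarios A and C can be summarized in a small sketch (the function name, the return strings, and the VPC cutoff are illustrative assumptions, not part of the framework itself):

```python
def intervention_advice(adjusted_rate, benchmark,
                        hospital_vpc, vpc_cutoff=1.0):
    """Combine benchmark achievement with the size of the hospital GCE
    (here proxied by the hospital VPC in percent, with a hypothetical
    cutoff) into an intervention recommendation."""
    achieved = adjusted_rate <= benchmark
    relevant_differences = hospital_vpc > vpc_cutoff
    if relevant_differences:
        return "targeted: investigate between-hospital differences"
    if achieved:
        return "universal: maintain the overall high quality"
    return "universal: improve quality in all hospitals"

# scenario A: benchmark achieved overall, negligible hospital GCE
advice = intervention_advice(4.78, 6.0, 0.70)
```

With the values observed in this study (adjusted rate 4.78% against a 6% target and VPC_H = 0.70%), the logic lands on the universal maintain-quality recommendation of scenario A.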
The interpretation of the scenarios in the lower corners of Table 1 follows analogously. In our study, the hospital GCE was very small (VPC_H = 0.70%, AUC_H = 0.54), suggesting that hospital attended is not a driving factor in determining patient mortality. In contrast, information on the RS value of the patients was much more relevant than information on the hospital in which they were treated: the VPC_RS was very large (i.e., 34.13%) and the AUC_RS = 0.77. Figure 3 clearly depicts the differential discriminatory accuracies of the hospital and RS random effects.

Discussion
Analyzing 30-day mortality after AMI in Sweden, we illustrate the MAIHDA approach to auditing hospital performance. By considering both the size of the hospital GCE and the RS adjusted hospital 30-day mortality rates in relation to a pre-set benchmark value, we were able to perform a more nuanced evaluation of hospital performance compared with traditional methods focused exclusively on differences between hospital averages.
Following the framework presented in Table 1, the ranking of hospitals in our study (Figure 2) also shows that no hospital could be statistically distinguished, with any degree of certainty, from the overall average mortality.
We have found similarly low VPC values when investigating hospital differences in mortality after AMI admission in Ontario, Canada (38) and in Sweden (39), as well as in mortality after heart failure in Sweden (40) and in Denmark (3). The low hospital GCE suggests universal instead of targeted interventions, as all hospitals perform homogenously. There may be, however, other patient outcomes where the hospital GCE would be much larger, for instance, when auditing adherence to guidelines for statin prescription (41,42) or process quality indicators of diabetes care such as albuminuria analysis (43).
The evaluation of institutional performance using VPC is not new (12,39,44,45). Normand (46) states that when evaluating hospital performance if the VPC is zero there are no hospital quality differences, that is, the chance that a patient experiences an event after being treated is the same regardless of the hospital (p. 33). This idea is also explicitly expressed by the committee assigned to set statistical guidelines for assessing hospital performance in USA (47).
The share of the total variance that is at the hospital level is crucial for evaluating performance.
However, this fundamental concept needs to be applied more extensively. Today, it is recognized that multilevel models (hierarchical or mixed-effect models) are the preferred methodology for provider profiling. However, the substantive analysis of components of variance still receives little attention, and most studies only consider multilevel modeling for its capacity to account for the clustering of patients within hospitals in order to obtain "correct standard errors" on regression coefficients and odds ratios. Some authors even conclude that hospital averages (odds ratios, observed/expected values) obtained from multilevel analyses give similar results to traditional logistic regression analyses. This situation is interpreted as an argument for keeping traditional logistic regression as the standard method for performing risk adjustment of hospital quality comparisons (48)(49)(50). However, we do not agree with this conclusion (26). In other words, traditional non-multilevel analyses give similar results to the multilevel analysis only when the hospital differences are not relevant (i.e., low VPC) and the patient load is very large in every hospital (which is rarely the case). In addition, hospital level variables appear paradoxically more statistically "significant" when the hospital level is less relevant (i.e., low VPC) (7). Information on the size of the hospital GCE is, therefore, fundamental for a sound analysis of hospital performance.
In this study we have applied the AUC to evaluate the hospital GCE. The AUC is a measure of discriminatory accuracy frequently used for gauging the performance of prognostic and screening markers in medicine (8,9) but it can also be used to quantify hospital GCE (10,34).
Many epidemiologists may not yet be familiar with measures of components of variance, such as the VPC for binary outcomes (30). However, the AUC measure is well established in clinical and health care epidemiology, and the information it gives is relatively easy to interpret and communicate. From this perspective, the evaluation of hospital performance resembles a screening test, and we must therefore know the discriminatory accuracy of, for instance, a "league table" to make informed decisions.
We performed this adjustment using an innovative strategy that uses hospitals and decile groups of RS in a cross-classified multilevel analysis. This approach provides a new option that could be very useful in some cases. In other cases, the classical inclusion of the patient-mix information as fixed effects may be more suitable. Both approaches provide similar results in terms of the hospital VPC, AUC, and model fit (Supplemental material S2).
This study has some limitations that need to be discussed. Unfortunately, the database for which we have ethical allowance for this study does not provide information on the severity of AMI or on revascularization procedures (e.g., PCI, CABG). The inclusion of these variables could possibly improve the RS. However, we believe this improvement would be small and unlikely to affect our conclusions. Additionally, the RS is not a perfect instrument to quantify the true severity and mortality risk of a patient. Nevertheless, the RS categories we use are strongly associated with mortality, and the RS alone shows a high discriminatory accuracy. The RS may also reflect practice or coding patterns of hospitals. However, Sweden has a very homogenous health care system with centralized diagnostic rules, which may reduce the risk of differential diagnosis setting. Finally, to explore the potential loss of information due to the categorization of the RS into deciles, we performed a sensitivity analysis with 15 and 20 categories. The results were similar to those obtained in model 2 (data not shown).
In summary, we illustrate the MAIHDA approach to auditing hospital performance using a three-step strategy. We argue that it is necessary to consider both the size of the hospital GCE and the RS adjusted 30-day mortality in relation to a pre-set benchmark value. Our results indicate that, at the time of our analysis, all hospitals in Sweden were performing homogenously well.

Contributors:
JM had the initiative of the study and acquired the data.


Multilevel Analysis of Individual Heterogeneity and Discriminatory Accuracy (MAIHDA)
Recently, MAIHDA has been proposed as a novel strategy for evaluating hospital performance (3). In contrast with most traditional studies, hospital MAIHDA simultaneously focuses on both hospital averages and patient heterogeneity around such averages. In MAIHDA, the fundamental statement is that patient and hospital variation should not be analysed separately.
Rather, we need to consider that the total individual outcome variance can be partitioned into variance components operating at different levels of analysis (4). From this perspective, hospital differences are not measured as the difference between hospital averages, but as the hospital general contextual effect (GCE). That is, the share of the total individual variance in patient  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60   F  o  r  p  e  e  r  r  e  v  i  e  w  o  n  l  y   4 outcomes that is at the hospital level. This definition aligns with that for the variance partition coefficient (VPC) in multilevel modelling. The greater the GCE, the more important the hospital context is for explaining variation in individual outcomes (5)(6)(7). This idea is also closely related to the notion of discriminatory accuracy developed for the evaluation of the performance of prognostic and screening markers in medicine (8,9). It is therefore possible to also use measures of discriminatory accuracy such as the area under the Receiving Operator Characteristics curve (AUC), to quantify the hospital GCE (10). See elsewhere for an extended explanation of the GCE concept (3,7,11). In this article, we argue that the systematic application of measures of variance and discriminatory accuracy is of fundamental relevance for meaningful performance evaluations (3,5,(11)(12)(13).

Cross-classified MAIHDA
Hospital comparisons are usually adjusted for "patient-mix" using a risk score (RS). In traditional multilevel analysis of hospital performance, patient RS effects are modelled as fixed effects (e.g., by entering the set of RS categories as a series of dummy variables) while the hospital effects are modelled as random effects. In contrast, in the cross-classified MAIHDA approach both the RS and the hospitals are modelled as random effects. Readers familiar with the traditional application of multilevel modelling may query the treatment of RS categories as random effects. For example, while we can think of the hospitals as a sample drawn from the set of all possible hospitals, it proves harder to conceptualize the RS categories in this way as there is not a large population of RS categories from which they are drawn. This is, however, a philosophical, rather than a practical question. In fact, when studying hospitals in a country the hospitals are never a sample of an infinite super population of hospitals but a concrete set of facilities in a specific setting. Furthermore, many multilevel studies observe and analyze all the hospitals in a country in their data, and the total number of hospitals may not prove that large, yet here too the hospital effects will be treated as random effects. As discussed by Snijders and Bosker, when defining the random intercept model (14), p. 45, the random effects model can be applied even when the idea of an infinite superpopulation is less evident. This approach is currently being applied when performing intersectional MAIHDA in social epidemiology (15).
The cross-classified approach provides several advantages over the traditional hierarchical multilevel approach. First, the cross-classified MAIHDA is parsimonious as it includes only one random parameter for the RS categories rather than the dummy variables as in the -1 fixed effects approach. Second, the cross-classified approach provides separate VPCs and AUCs for RS categories and for the hospital, allowing their magnitude to be contrasted. Thus, in contrast to the fixed-effects approach, it allows the importance of patient-mix vs. the hospital effects to be communicated on a common metric. In addition, hospital MAIHDA provides all the usual advantages of multilevel models. For instance, by providing reliability weighted hospital averages (shrunken residuals), it reduces the concern of monitoring outcome measures based on small hospital caseloads which otherwise may lead to extreme and unstable hospital rankings and, therefore, unreliable performance evaluation (16,17). Both hierarchical and cross-classified MAIHDA are nowadays easy to implement in available software such as MLwiN that can be run from both within Stata (runmlwin) (18) and within R (R2MLwiN) (19).
Finally, for binary patient outcomes, such as 30-day mortality after AMI, multilevel analyses can be performed using a simple contingency table or matrix with strata defined by combinations of the hospitals and the RS categories capturing patient-mix. The only information required for the analysis is the overall number of patients and the number of AMI cases in each hospital-RS stratum. This aggregated approach maintains the joint distribution of the hospitals-RS information and provides the same model results (parameter estimates, standard errors, fit statistics and predictions) as when analysing the underlying individual level data. The aggregated approach also allows computationally efficient (fast) estimation as it allows analysing thousands of patients' outcomes using a dataset consisting of just a few  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60   F  o  r  p  e  e  r  r  e  v  i  e  w  o  n  l  y   6 hundred strata. A further benefit of the aggregated approach is that the data can be shared since its aggregated presentation reduces ethical problems of confidentiality (statistical disclosure is not at risk). This in turn, improves the transparency of the research and facilitates the replication of the analysis and encourages the sharing of data to compare hospital performance between different settings.

The aim of this study
The aim of this study was to demonstrate a novel statistical approach (MAIHDA) to evaluate hospital performance and to help stakeholders make assertive, data-informed decisions using a three-step framework. We do so by analysing differences in 30-day mortality among patients admitted to Swedish hospitals with a first-ever AMI between 2007 and 2009.

Study Population
This is a cross-sectional study. We used information from the Swedish Patient Register (20) and from the Cause of Death Register.

Data accessibility
The original databases used in our study are available from the Swedish National Board of Health and Welfare and from Statistics Sweden. In Sweden, register data are protected by strict rules of confidentiality (24) but can be made available for research after a special review that includes approval of the research project by both an Ethics Committee and the authorities' own data safety committees. The Swedish authorities under the Ministry of Health and Social Affairs do not provide individual-level data to researchers abroad. Instead, they normally advise researchers in other countries to cooperate with Swedish colleagues and analyse the data in collaboration according to standard legal provisions and procedures.
However, in the approach we propose, it is technically possible to perform the analysis using a simple table defined by hospital and categories of risk score. The aggregation, together with the encryption of the hospital names, fully anonymizes the table and prevents the backwards identification of individuals even when very few patients fall in a single cell. Therefore, to increase transparency and facilitate replication, we provide the table as a Stata dataset (Supplementary material S1) together with a fully annotated Stata do-file (Supplementary material S2). We also provide the table as a CSV file along with an R script (Supplementary material S3, S4).

The study outcome was all-cause mortality within 30 days of hospital admission for AMI (coded yes vs. no).

Statistical analyses
We analysed 30-day mortality among 43,247 patients admitted to 68 Swedish hospitals between 2007 and 2009 with a first-ever AMI. We classified the patients into 10 RS categories for 30-day mortality and created 680 strata defined by combining hospital and RS categories. In a first step (model 1), we applied a traditional hierarchical multilevel logistic regression model with patients clustered within hospitals. In a second step (model 2), and in order to adjust for patient-mix, we fitted a cross-classified multilevel model of patient outcomes with both RS category and hospital random effects (25). We estimated the VPC and the AUC to evaluate differences between RS categories and between hospitals on a common metric.

Estimation methods
We performed the estimations using Markov chain Monte Carlo (MCMC) methods with diffuse (vague, flat, or minimally informative) prior distributions for all parameters. We used quasi-likelihood methods to provide starting values for all parameters. For each model, the burn-in length was 5,000 iterations. We then ran the model for a further 10,000 monitoring iterations and used the resulting parameter chains to construct 95% credible intervals (CI) for all model predictions to communicate statistical uncertainty (Supplementary material S2).
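The credible-interval construction from the monitoring chains amounts to taking equal-tailed percentiles of the post-burn-in draws. A minimal sketch (the simulated chain below is purely illustrative, not a chain from the paper's models):

```python
import numpy as np

def posterior_summary(chain, level=0.95):
    """Posterior mean and equal-tailed credible interval from an MCMC
    chain: a 1-D array of monitoring-iteration draws, with the burn-in
    already discarded."""
    chain = np.asarray(chain)
    alpha = (1.0 - level) / 2.0
    lo, hi = np.percentile(chain, [100 * alpha, 100 * (1 - alpha)])
    return chain.mean(), (lo, hi)

# Example: a hypothetical chain of 10,000 draws of a variance parameter.
rng = np.random.default_rng(1)
draws = rng.gamma(shape=2.0, scale=0.015, size=10_000)
mean, (lo, hi) = posterior_summary(draws)
print(f"posterior mean {mean:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

The same percentile summary applies to any derived quantity (e.g., a hospital's predicted absolute risk) by pushing each MCMC draw through the transformation before summarizing.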

Model 1: Unadjusted multilevel analysis
The purpose of this model was to evaluate unadjusted hospital differences in average mortality risk. For this aim, we (A) ranked the hospitals according to their mortality risk; and (B) complemented this information by quantifying the size of the GCE.

A) Ranking of the hospitals
To rank hospitals according to their unadjusted mortality rates, we predicted the absolute risk (AR_j) of 30-day mortality and its 95% CI in each hospital. To do so, we transformed the predicted logit of 30-day mortality into proportions as follows:

AR_j = logit⁻¹(β₀ + u_j) = exp(β₀ + u_j) / (1 + exp(β₀ + u_j))    (Formula 2)

where β₀ is the overall intercept and u_j is the random effect of hospital j.
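Formula 2 is a plain inverse-logit transformation. A minimal sketch (the β₀ and u_j values below are illustrative numbers, not the paper's estimates):

```python
import math

def inv_logit(x: float) -> float:
    """logit^-1: map a value on the log-odds scale to a probability
    (Formula 2)."""
    return math.exp(x) / (1.0 + math.exp(x))

# Hypothetical intercept beta0 and one hospital's random effect u_j
# on the logit scale (illustrative values only):
beta0, u_j = -2.99, 0.10
ar_j = inv_logit(beta0 + u_j)        # that hospital's absolute risk
print(round(100 * ar_j, 2))          # absolute risk as a percentage
```

Applying the same transformation to each MCMC draw of β₀ + u_j yields the 95% credible interval of AR_j on the probability scale.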

B) Measuring the hospital GCE.
We estimate the hospital GCE by means of two measures:

(i) The Variance Partition Coefficient for the hospital level (VPC_H)
The VPC_H can be calculated based on the latent response formulation of the model, an approach widely adopted in applied multilevel work (27-31):

VPC_H = σ²_H / (σ²_H + π²/3)

where π²/3 ≈ 3.29 denotes the variance of a standard logistic distribution and σ²_H the between-hospital variance. We then multiply the VPC_H by 100 and interpret it as a percentage.

The VPC_H quantifies the share of the total individual differences in the latent propensity of 30-day mortality that lies at the hospital level. The VPC_H embraces the influence of the hospital context on the patient outcome without identifying any specific hospital information. However, the VPC_H may also reflect differences in patient-mix between hospitals. In any case, the VPC_H represents the hospital ceiling effect, or the potential maximum influence of the hospital attended.
In the absence of confounding by patient-mix, the higher the VPC_H, the higher the hospital GCE is; in other words, the more relevant the hospital context is for understanding individual variation in the latent risk of 30-day mortality.

(ii) The area under the receiver operating characteristics curve for the hospital (AUC_H)
A well-known measure of discriminatory accuracy is the AUC (10,11,32). In our case, the hospital AUC (AUC_H) measures how well the model predicted probabilities based on the attended hospital distinguish between the two outcome categories (death within 30 days or survival). The AUC_H is constructed by plotting the true positive fraction (TPF) against the false positive fraction (FPF) for different thresholds of the predicted probabilities. The AUC takes a value between 0.5 and 1, where 1 is perfect discrimination and 0.5 is equally as informative as flipping a coin (i.e., the hospital information has no discriminatory accuracy).
We calculated the AUC_H and, to account for the different number of patients in each hospital, we also calculated a weighted AUC (wAUC_H) in which every patient was weighted by the inverse of their hospital's caseload.
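The AUC described above can be computed with the rank-based (Mann-Whitney) estimator, which equals the area under the TPF-vs-FPF curve. A minimal sketch with toy data (the risks and outcomes are illustrative):

```python
def auc(probs, outcomes):
    """AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen case (death) receives a higher predicted
    probability than a randomly chosen non-case, counting ties as
    1/2. This equals the area under the ROC (TPF vs. FPF) curve."""
    cases = [p for p, y in zip(probs, outcomes) if y == 1]
    noncases = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum((c > n) + 0.5 * (c == n) for c in cases for n in noncases)
    return wins / (len(cases) * len(noncases))

# Toy example: predicted 30-day mortality risks and observed deaths.
p = [0.02, 0.05, 0.05, 0.08, 0.12]
y = [0,    0,    1,    0,    1]
print(auc(p, y))   # 0.75
```

When the predicted probabilities are based only on the attended hospital, all patients in a hospital share one risk value, so many ties occur and the AUC stays close to 0.5 unless hospitals differ substantially.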

Model 2: Patient-mix adjusted multilevel analysis
Model 2 was a cross-classified multilevel logistic regression (25) for patient mortality in which we included both hospital and RS category random effects to simultaneously account for the variation in mortality across hospitals and RS categories and thereby adjust the hospital effects for patient-mix. The model can be written as:

logit(π_jk) = β₀ + u_j + v_k    (Formula 4)

where v_k denotes the random effect associated with the RS categories (k = 1,…,10), assumed to be normally distributed with zero mean and between-RS variance σ²_RS. The purpose of model 2 was to evaluate patient-mix adjusted hospital differences in average mortality risk. Therefore, and analogously to model 1, we ranked the hospitals according to their RS adjusted mortality risk and complemented this information by quantifying the size of the hospital GCE net of the observed patient-mix influence. Visual inspection of the predicted hospital and RS category random effects showed that the normality assumptions were satisfied (Supplementary material S2). As measures of the patient-mix adjusted hospital GCE, the adjusted VPC_H and VPC_RS inform on the share of the total individual variance in the latent propensity of 30-day mortality that lies at the hospital and at the RS category level, respectively, net of the influence of the other factor. Both measures are estimated on the same scale and can therefore be directly compared to evaluate the relative relevance of hospital versus patient-mix information when it comes to understanding patient differences in the latent propensity of death.
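Because the two classifications share a single latent total variance, the adjusted VPCs are directly comparable. A minimal sketch of this common-scale computation (the variance values are hypothetical, for illustration only):

```python
import math

def vpcs_cross_classified(sigma2_h: float, sigma2_rs: float):
    """Adjusted VPCs (in %) for a two-way cross-classified logistic
    model: each classification's variance divided by the shared total
    sigma2_h + sigma2_rs + pi^2/3, so the two shares are directly
    comparable."""
    resid = math.pi ** 2 / 3.0
    total = sigma2_h + sigma2_rs + resid
    return 100.0 * sigma2_h / total, 100.0 * sigma2_rs / total

# Hypothetical between-hospital and between-RS variances on the
# logit scale (illustrative values only):
vpc_h, vpc_rs = vpcs_cross_classified(0.035, 1.75)
print(round(vpc_h, 2), round(vpc_rs, 2))
```

With variances of this order, the RS classification accounts for a far larger share of the latent variation than the hospital classification, which is the comparison the common metric is designed to make.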
We also calculated the adjusted AUCs for the hospital and RS category levels (AUC_H and AUC_RS) by including their specific random effects when calculating the predicted probabilities. While patients with relatively mild conditions may have similarly good outcomes regardless of where they are treated, outcomes of the most complex patients may be affected by hospital performance. We therefore also fitted a cross-classified model (model 3) including a random interaction effect between hospital and RS (Supplementary material S2). That is, we allowed the effect that a hospital has on its patients to vary according to the RS classification of those patients, and vice versa. However, the resulting interaction classification variance was very low. Consequently, we based our analysis on models 1 and 2.

Software
All models were run in MLwiN 3.05 (33) called from Stata using the runmlwin command (18).
We note that MLwiN can equally be called from within R using the sister R2MLwiN package (19) and so our analysis can also be replicated by readers in that statistical package.

Evaluating hospital performance with MAIHDA
The present cross-classified MAIHDA framework extends one described in a previous publication aimed at the evaluation of geographical differences in health outcomes (34). The framework proposes three steps that need to be considered to achieve a complete analysis of hospital performance. More elaborate strategies are of course also possible, and the presented framework is open to modification and extension. The application of the framework in our study was as follows.

Step 1. Identifying a benchmark value and evaluating the adjusted hospital mortality rates against it.
When evaluating hospital performance, we need to identify a benchmark value expressing a tolerable average level of 30-day mortality in the population of AMI patients. However, the selection of a specific benchmark is often difficult and arbitrary. We can use an internal benchmark defined as the β₀ obtained in model 2 (Formula 4): that is, the mortality rate in a hospital with an average mortality (u_j = 0) treating patients with an average RS (v_k = 0). This choice is meaningful, since comparing with a national average seems "fair" and, being RS adjusted, tertiary care hospitals with more severe cases do not unfairly push the hospital effect towards a higher value, as would happen with crude average rates. However, being an adjusted rate, the value does not necessarily resemble the crude rates and can only be used for relative comparisons.

Step 2. Quantifying the size of the hospital differences using the VPC and the AUC
Currently there is no official guidance for assessing the magnitude of the VPC or the AUC in the context of studying hospital differences in RS adjusted 30-day mortality, but a practical proposal is described in Table 1. The table also shows the corresponding AUC values according to the simulated relationship between the AUC and the VPC published elsewhere (34). The proposed values are based on the authors' own experience, but further discussion is needed.

Step 3. Interpreting results to evaluate performance
The two primary questions for the hospital performance evaluation were: (i) has the benchmark value been insufficiently, closely, or fully reached? and (ii) are there substantial differences between the hospitals, or do they perform homogeneously? To answer both questions, we created a framework (Table 1) with 15 scenarios combining information on benchmark achievement and the size of the hospital differences based on model 2.
In the best scenario (scenario A), the desired target level has been fully achieved overall (averaging across all hospitals, the adjusted mortality rate is less than 6%) and hospital differences are effectively absent (i.e., the hospital GCE is negligible). The conclusion would be that all hospitals have performed similarly well. In the worst scenario (scenario C), the desired target level has not been achieved overall, and between-hospital differences are again absent. The conclusion would be that all hospitals have performed similarly, but poorly.
Observe that in both scenarios A and C, interventions targeted only at specific hospitals are not justifiable. Rather, any intervention should be universal (i.e., directed to all hospitals), as in both scenarios all hospitals are performing similarly. In scenario A, the intervention would be oriented to maintaining the overall high quality, while in scenario C the objective would be to improve quality in all hospitals.
Between-hospital differences were very small (VPC_H = 0.70%, AUC_H = 0.54), suggesting that the hospital attended is not a driving factor in determining patient mortality. In contrast, information on the RS value of the patients was much more relevant than information on the hospital in which they were treated: the VPC_RS was very large (34.13%) and the AUC_RS = 0.77. Figure 3 clearly depicts the differential discriminatory accuracies of the hospital and RS random effects. The hospital variance (0.03 (0.02-0.06)), total AUC (0.78 (0.77-0.79)) and Bayesian DIC (2334.2) from model 3 were similar to those from model 2.

Discussion
Analyzing 30-day mortality after AMI in Sweden, we illustrate the MAIHDA approach to auditing hospital performance. By considering both the size of the hospital GCE and the RS adjusted hospital 30-day mortality rates in relation to a pre-set benchmark value, we were able to perform a more nuanced evaluation of hospital performance compared with traditional methods focused exclusively on differences between hospital averages.
Following the framework presented in Table 1, all hospitals performed homogeneously well. The hospital ranking (Figure 2) also shows that no hospital could be statistically distinguished, with any degree of certainty, from the overall average mortality.
We have found similarly low VPC values when investigating hospital differences in mortality after AMI admission in Ontario, Canada (38) and in Sweden (39), as well as for other outcomes (41,42) and for process quality indicators of diabetes care such as albuminuria analysis (43).
The evaluation of institutional performance using the VPC is not new (12,39,44,45). Normand (46) states that, when evaluating hospital performance, if the VPC is zero there are no hospital quality differences; that is, the chance that a patient experiences an event after being treated is the same regardless of the hospital (p. 33). This idea is also explicitly expressed by the committee assigned to set statistical guidelines for assessing hospital performance in the USA (47).
The share of the total variance that is at the hospital level is crucial for evaluating performance.
However, this fundamental concept needs to be applied more extensively. Today, it is recognized that multilevel (hierarchical or mixed-effect) models are the preferred methodology for provider profiling. However, the substantive analysis of components of variance still receives little attention, and most studies consider multilevel modelling only for its capacity to account for the clustering of patients within hospitals in order to obtain "correct standard errors" for regression coefficients and odds ratios. Some authors even conclude that hospital averages (odds ratios, observed/expected values) obtained from multilevel analyses give similar results to traditional logistic regression analyses, and interpret this as an argument that multilevel modelling is unnecessary (48)(49)(50). However, we do not agree with this opinion. The first reason is that the fixed-effects approach does not explicitly inform on components of variance. The second reason is that the equivalence between traditional and multilevel regression results only occurs when the hospital GCE (i.e., the clustering) is low and the number of patients in the hospitals is very high (i.e., when hospital averages are reliably estimated) (26). In other words, traditional non-multilevel analyses give similar results to multilevel analyses only when the hospital differences are not relevant (i.e., low VPC) and the patient load is very large in every hospital (which is rarely the case). In addition, hospital level variables appear paradoxically more statistically "significant" when the hospital level is less relevant (i.e., low VPC) (7). Information on the size of the hospital GCE is, therefore, fundamental for a sound analysis of hospital performance.
In this study we have applied the AUC to evaluate the hospital GCE. The AUC is a measure of discriminatory accuracy frequently used for gauging the performance of prognostic and screening markers in medicine (8,9) but it can also be used to quantify hospital GCE (10,34).
So far, many epidemiologists may not be familiar with the use of measures of components of variance like the VPC for binary outcomes (30). However, the AUC measure is well established in clinical and health care epidemiology and the information it gives is relatively easy to interpret and communicate. From this perspective, the evaluation of hospital performance resembles a screening test and so we must therefore know the discriminatory accuracy of, for instance, a "league table" to make informed decisions.
We performed this adjustment using an innovative strategy that uses hospitals and decile groups of RS in a cross-classified multilevel analysis. This approach provides a new option that could be very useful in some cases; in other cases, the classical inclusion of the patient-mix variables as individual-level covariates may be preferred.

This study has some limitations that need to be discussed. Unfortunately, the database for which we have ethical allowance for this study does not provide information on the severity of AMI or on revascularization procedures (e.g., PCI, CABG). The inclusion of these variables could possibly improve the RS. However, we believe this improvement would be small and unlikely to affect our conclusions. Additionally, the RS is not a perfect instrument for quantifying the true severity and mortality risk of a patient. Nevertheless, the RS categories we use are strongly associated with mortality, and the RS alone shows high discriminatory accuracy. The RS may reflect practice or coding patterns of hospitals; however, Sweden has a very homogeneous health care system with centralized diagnostic rules, which may reduce the risk of differential diagnosis setting. Finally, to explore the potential loss of information due to the categorization of the RS into deciles, we performed a sensitivity analysis with 15 and 20 categories. The results were similar to those obtained in model 2 (data not shown).

In summary, we illustrate the MAIHDA approach to auditing hospital performance using a three-step strategy. We argue that it is necessary to consider both the size of the hospital GCE and the RS adjusted hospital mortality rates in relation to a benchmark value.

Contributors:
JM initiated the study and acquired the data.
JM and M R-L wrote the original manuscript.
M R-L and RP-V performed the analyses in coordination with JM.
PA and GL provided advanced statistical support.
