The performance of estimators based on the propensity score
Introduction
Semiparametric estimators using the propensity score to adjust in one way or another for covariate differences are now well-established. They are used for estimating causal effects in a selection-on-observables framework with discrete treatments, or for simply purging the means of an outcome variable in two or more subsamples of differences due to observed variables. Compared to (non-saturated) parametric regressions, they have the advantage of including the covariates in a more flexible way without incurring a curse-of-dimensionality problem, and of allowing for effect heterogeneity. The former problem is highly relevant due to the large number of covariates that should usually be adjusted for. It is tackled by collapsing the covariate information into a single parametric function. This function, the so-called propensity score, is defined as the probability of being observed in one of two subsamples conditional on the covariates. The difference from parametric regression is that this parametric function is not directly related to the outcome (as it would be in regression) and thus, additional robustness to misspecification can be expected. These methods originate from the pioneering work of Rosenbaum and Rubin (1983), who show that balancing two samples on the propensity score is sufficient to equalize their covariate distributions.
Although many of these propensity-score-based methods are not asymptotically efficient (see for example Heckman et al., 1998a, 1998b; Hahn, 1998), they are the work-horses in the literature on programme evaluation and are now rapidly spreading to other fields. They are usually implemented as semiparametric estimators: the propensity score is based on a parametric model, but the relationship between the outcome variables and the propensity score is non-parametric. However, despite the popularity of propensity-score-based methods and recent advances in important Monte Carlo studies by Frölich (2004) and Busso et al. (forthcoming, 2009), the issue of which version of the many different estimators suggested in the literature should be used in a particular application is still unresolved. In this paper we address this question and add further insights to it. Broadly speaking, the popular estimators can be subdivided into four classes: parametric estimators (like OLS or probit or their so-called double-robust relatives, see Robins et al., 1992), inverse (selection) probability weighting estimators (similar to Horvitz and Thompson, 1952, or to the recently introduced tilting version by Graham et al., 2011, 2012), direct matching estimators (Rubin, 1974; Rosenbaum and Rubin, 1983), and kernel matching estimators (Heckman et al., 1998a, 1998b). However, many variants of the estimators exist within each class and several methods combine the principles underlying these main classes.
There are two strands of the literature that are relevant for our research question: First, the literature on the asymptotic properties of a subset of estimators provides some guidance on their small sample properties. In Section 3 we review this literature and discuss the various estimators. Unfortunately, asymptotic properties have not (yet?) been derived for all estimators used in practice, nor is it obvious how well they approximate small sample behaviour. Furthermore, these results are usually not informative for the important choice of tuning parameters on which many estimators critically depend (e.g., number of matched neighbours, bandwidth selection in kernel matching).
The second strand of the literature provides Monte Carlo evidence on the properties of the estimators of the effects. As one of the first papers investigating estimators from several classes simultaneously, Frölich (2004) found that a particular version of kernel-matching based on local regressions with finite sample adjustments (local ridge regression) performs best. In contrast, Busso et al. (forthcoming, 2009) conclude that inverse probability weighting (IPW) has the best properties (when using normalized weights for estimation). They explain the differences from Frölich (2004) by claiming that he (i) considers unrealistic data generating processes and (ii) does not use an IPW estimator with normalized weights. In other words, they point to the design dependence of the Monte Carlo results as well as to the requirement of using optimized variants of the estimators. Below, we argue that their work may be subject to the same criticism. This provides a major motivation for our study.
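The distinction between normalized and unnormalized IPW weights can be illustrated with a minimal sketch. It assumes that the propensity scores are already estimated and passed in as `p`; the function name `ipw_atet` and its arguments are hypothetical and are not taken from any of the cited papers.

```python
import numpy as np

def ipw_atet(y, d, p, normalize=True):
    """Sketch of an IPW estimate of the effect on the treated.

    y: outcomes, d: 0/1 treatment indicator, p: propensity scores
    (assumed to be estimated beforehand and strictly below one for controls).
    """
    y, d, p = (np.asarray(a, dtype=float) for a in (y, d, p))
    treated = d == 1
    control = ~treated
    # Odds weights p/(1-p) reweight the controls towards the treated.
    w = p[control] / (1.0 - p[control])
    if normalize:
        # Normalized version: control weights are forced to sum to one.
        w = w / w.sum()
    else:
        # Unnormalized version: scale by the number of treated only.
        w = w / treated.sum()
    return y[treated].mean() - np.sum(w * y[control])
```

With normalized weights the reweighted control outcome is always a proper weighted average; without normalization it can fall outside the range of the observed control outcomes when a few odds weights are large.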
We contribute to the literature on the properties of estimators based on adjusting for covariate differences in the following way: firstly, we suggest a different approach to conduct simulations. This approach is based on ‘real’ data. Therefore, we call our particular implementation of this idea an ‘Empirical Monte Carlo Study’. The basic idea is to use the empirical data to simulate realistic ‘placebo treatments’ among the non-treated. The various estimators then use the remaining non-treated in different ways to estimate the (known) non-treatment outcome of the ‘placebo-treated’. Selection into treatment, which is potentially of key importance for the performance of the various estimators, is based on a selection process directly obtained from the data. Moreover, we exploit the actual dependence of the outcome of interest on the covariates on which selection is based in the data rather than making assumptions on this relation when specifying the data generating process. Thus, this approach is less prone to the standard critique of simulation studies that the chosen data generating processes are irrelevant for real applications. Since our model for the propensity score mirrors specifications used in past applied work, it depends on many more covariates compared to the studies mentioned above. Although this makes the simulation results particularly plausible in our context of labour market programme evaluation in Europe, this may also be seen as a limitation concerning its applications to other fields. Therefore, to help generalize the results outside our specific data situation, we modify many features of the data generating process, like the type of the outcome variable as well as various aspects of the selection process.
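The placebo-treatment idea can be sketched as follows. This is only an illustration under stated assumptions: selection is taken to follow a probit-type index with coefficients `beta_hat` that are assumed to have been estimated on the real data beforehand, samples are drawn with replacement, and all function and argument names are hypothetical.

```python
import numpy as np

def simulate_placebo(x_controls, beta_hat, rng):
    """Assign placebo treatments to non-treated units.

    Assumption: probit-style selection, D* = 1{x'beta + u > 0}, u ~ N(0, 1).
    """
    index = x_controls @ beta_hat
    noise = rng.standard_normal(index.shape[0])
    return (index + noise > 0).astype(int)

def placebo_replication(y_controls, x_controls, beta_hat, estimator, n, rng):
    """One simulation draw: sample non-treated units, assign placebo
    treatments, and apply an estimator. Because nobody was actually
    treated, the true effect is zero and the estimate is the error."""
    idx = rng.choice(y_controls.shape[0], size=n, replace=True)
    y, x = y_controls[idx], x_controls[idx]
    d = simulate_placebo(x, beta_hat, rng)
    return estimator(y, d, x)
```

Repeating `placebo_replication` many times and summarizing the returned values (for example by their mean and root mean squared error) gives the bias and precision of `estimator` in a design whose outcome-covariate relation comes from the data rather than from an assumed model.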
Secondly, we consider standard estimators as well as their modified (optimized?) versions based on different tuning parameters such as bandwidth or radius choice. This leads to a large number of estimators to evaluate, but it also provides us with more information on important choices regarding the parameters on which the various estimators depend. Such estimators may also consist of combinations of estimators, like combining matching with weighted regression, which have not been considered in any simulation so far. Finally, we reemphasize the relevance of trimming to improve the finite sample properties of all estimators. The rule we propose is (i) a data driven trimming rule, (ii) easy to implement, (iii) identical for all estimators, and (iv) avoids asymptotic bias. We show that for almost all estimators considered, including the parametric ones, trimming based on this rule effectively improves their performance.
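A weight-based trimming step of this kind might look like the sketch below. It is one possible reading of "trimming observations with too large a weight", not the exact rule proposed in the paper: controls whose share of the total control weight exceeds a threshold are dropped and the remaining weights are renormalized. The name `trim_weights` and the threshold `max_share` are hypothetical.

```python
import numpy as np

def trim_weights(w, max_share):
    """Drop controls with a relative weight above max_share, then renormalize.

    Returns the renormalized weights (zero for trimmed units) and a boolean
    mask of the units that were kept.
    """
    w = np.asarray(w, dtype=float)
    share = w / w.sum()
    keep = share <= max_share
    if not keep.any():
        raise ValueError("max_share is too small: every control would be trimmed")
    trimmed = np.where(keep, w, 0.0)
    return trimmed / trimmed.sum(), keep
```

Because the step acts only on the implied control weights, the same function can in principle be applied to any estimator that can be written in weighted form, which is what makes a single rule usable across estimator classes.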
Overall, we find that (i) trimming observations that have ‘too large’ a weight is important for many estimators; (ii) the choices of the various tuning parameters play an important role; (iii) simple matching estimators are inefficient and have considerable small sample bias; (iv) no estimator is superior in all designs and for all outcomes; (v) particular bias-adjusted radius (or calliper) matching estimators perform best on average, but may have fat tails if the number of controls is not large enough; and finally, (vi) flexible, but simple parametric approaches do almost as well in the smaller samples, because their gain in precision frequently compensates (in part) for their larger bias which, however, dominates when samples become larger. Strictly speaking these properties relate to our particular data generating process (DGP) only. However, at least such a DGP is typical for an important application of matching methods, namely labour market evaluations.
The paper proceeds as follows: in the next section we describe our Monte Carlo design, relegating many details as well as descriptive statistics to online Appendices B and C, where the latter contains a description of the support features of our data. In Section 3 we discuss the basic setup of each of the relevant estimators and their properties, as well as the issue of trimming, while relegating the technical details of the estimators to the Appendix. The main results are presented in Section 4, while the full set of results is given in online Appendix D. Section 5 concludes and online Appendix E contains further sensitivity checks. The website of this paper (www.sew.unisg.ch/lechner/matching) will contain additional material that has been removed from the paper for the sake of brevity, in particular Appendices B, C, D, and E as well as the Gauss, Stata, and R codes for the preferred estimators.
Basic idea
A typical Monte Carlo study specifies the data generation process of all relevant random variables and then conducts estimation and inference from samples that are generated by independent draws from those random variables based on pseudo random number generators. The advantage of such a design is that all dimensions of the true data generating process (DGP) are known and can be used for a thorough comparison with the estimates obtained from the simulations. However, the disadvantage is that
Notation and targets for the estimation
The outcome variable, Y, denotes earnings or employment. The group of treated units (treatment indicator D = 1) are the participants in training in our empirical example. We are interested in comparing the mean value of Y in the group of treated (D = 1) with the mean value of Y in the group of non-treated (D = 0), the non-participants, free of any mean differences in outcomes that are due to differences in the observed covariates across the groups.
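In symbols, the target and the generic form of the estimators can be sketched as follows. The notation is an assumption made here for illustration, since the symbols did not survive in the extracted text: $Y$ is the outcome, $D$ the treatment indicator, $X$ the observed covariates, $N_1$ the number of treated units, and $\hat{w}_i$ the estimator-specific weights of the non-treated, assumed to sum to one.

```latex
\theta = E[Y \mid D=1] - E\big[\, E[Y \mid X, D=0] \,\big|\, D=1 \big],
\qquad
\hat{\theta} = \frac{1}{N_1}\sum_{i=1}^{N} d_i\, y_i \;-\; \sum_{i=1}^{N} (1-d_i)\,\hat{w}_i\, y_i .
```

The first term of $\hat{\theta}$ is the mean outcome of the treated; the second is a weighted mean of the non-treated outcomes, where the weights shift the covariate distribution of the non-treated towards that of the treated.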
Trimming
From Eq. (1) we see that all estimators can be written as the mean outcome of the treated minus the weighted outcome of the non-treated observations. By the nature of this estimation principle, the weights of the non-treated are not uniform (except in the case of random assignment in which they should be very similar even in the smallest sample). They depend on the covariates via the propensity score. If particular values of the propensity score are rare among the controls and common among the treated, such
Results
In this section, we first discuss several issues concerning the implementation of the various estimators (5.1). After that, the results are discussed, beginning with issues that concern all estimators simultaneously, like the impact of different features of the data generating process, the specification of the propensity score and the trimming (5.2). Then, we analyze implementational issues that are specific to the particular classes of estimators considered (5.3). Finally, we compare the best
Conclusion
This paper investigates the finite sample properties of all major classes of propensity-score-based estimators of the average treatment effect on the treated (ATET) that are used in applications. Moreover, within each class of estimators we investigate the performance of the estimators for a variety of possible versions and various values of the tuning parameters. Both features make this study the most comprehensive one in the field so far.
We propose a way to overcome one of the main criticisms
Acknowledgments
Michael Lechner is a Research Fellow of CEPR and PSI, London, CES-Ifo, Munich, IAB, Nuremberg, IZA, Bonn, and ZEW, Mannheim. Conny Wunsch is a Research Fellow of CES-Ifo, Munich, and IZA, Bonn. This project received financial support from the Institut für Arbeitsmarkt und Berufsforschung, IAB, Nuremberg (contract 8104). We would like to thank Patrycja Scioch (IAB), Benjamin Schünemann and Darjusch Tafreschi (both SEW, St. Gallen) for their help in the early stages of data preparation. An
References (101)
Large sample sieve estimation of semi-nonparametric models
Practical propensity score estimation: a reply to Smith and Todd
Journal of Econometrics (2005)
Nonparametric IV estimation of local average treatment effects with covariates
Journal of Econometrics (2007)
How do employment effects of job creation schemes differ with respect to the foregoing unemployment duration?
Labour Economics (2010)
New evidence on the effects of job creation schemes in Germany—a matching approach with threefold heterogeneity
Research in Economics (2004)
Long-run labour market and health effects of individual sports activities
The Journal of Health Economics (2009)
A method of moments interpretation of sequential estimators
Economics Letters (1984)
Inverse probability weighted estimation for general missing data problems
Journal of Econometrics (2007)
Semiparametric difference-in-difference estimators
Review of Economic Studies (2005)
Abadie, A., Imbens, G.W., 2002. Simple and bias-corrected matching estimators for average treatment effects, NBER...
Large sample properties of matching estimators for average treatment effects
Econometrica
On the failure of the bootstrap for matching estimators
Econometrica
Estimating the labor market effects of voluntary military service using social security data on military applicants
Econometrica
When to control for covariates? panel-asymptotic results for estimates of treatment effects
Review of Economics and Statistics
Mostly Harmless Econometrics: An Empiricist’s Companion
Assessing the performance of matching algorithms when selection into treatment is strong
Journal of Applied Econometrics
Doubly robust estimation in missing data and causal inference models
Biometrics
Unemployed and their case workers: should they be friends or foes?
The Journal of the Royal Statistical Society - Series A
A caseworker like me - does the similarity between unemployed and caseworker increase job placements?
The Economic Journal
How much should we trust differences-in-differences estimates?
Quarterly Journal of Economics
Alternative approaches to evaluation in empirical microeconomics
Journal of Human Resources
Evaluating the employment impact of a mandatory job search program
Journal of the European Economic Association
Sectoral heterogeneity in the employment effects of job creation schemes in Germany
Journal of Economics and Statistics
The employment effects of job creation schemes in Germany–a microeconometric evaluation
Identifying effect heterogeneity to improve the efficiency of job creation schemes in Germany
Applied Economics
Active labour market policy evaluations: a meta-analysis
Economic Journal
Dealing with limited overlap in estimation of average treatment effects
Biometrika
Causal effects in non-experimental studies: reevaluating the evaluation of training programmes
Journal of the American Statistical Association
Propensity score-matching methods for nonexperimental causal studies
Review of Economics and Statistics
Genetic matching for estimating causal effects: a general multivariate matching method for achieving balance in observational studies
Mimeo
Labor market institutions and the distribution of wages, 1973–1992: a semiparametric approach
Econometrica
Effects of misspecification of the propensity score on estimators of treatment effect
Biometrics
Design-adaptive nonparametric regression
Journal of the American Statistical Association
Evaluating nonexperimental estimators for multiple treatments: evidence from experimental data
Mimeo
Finite-sample properties of propensity-score matching and weighting estimators
Review of Economics and Statistics
Matching estimators and optimal bandwidth choice
Statistics and Computing
Nonparametric regression for binary dependent variables
Econometrics Journal
Bandwidth selection and the estimation of treatment effects with unbalanced data
Annales d’Économie et de Statistique
Microeconometric evaluation of the active labour market policy in Switzerland
The Economic Journal
An introduction to the augmented inverse propensity weighted estimator
Political Analysis
Inverse probability tilting for moment condition models with missing data
Review of Economic Studies
On the role of the propensity score in efficient semiparametric estimation of average treatment effects
Econometrica
Large sample properties of generalized methods of moments estimators
Econometrica
Full matching in an observational study of coaching for the SAT
Journal of the American Statistical Association
Cross-validation and the estimation of conditional probability densities
Journal of the American Statistical Association