The performance of estimators based on the propensity score

https://doi.org/10.1016/j.jeconom.2012.11.006

Abstract

We investigate the finite sample properties of a large number of estimators for the average treatment effect on the treated that are suitable when adjustment for observed covariates is required, like inverse probability weighting, kernel and other variants of matching, as well as different parametric models. The simulation design used is based on real data usually employed for the evaluation of labour market programmes in Germany. We vary several dimensions of the design that are of practical importance, like sample size, the type of the outcome variable, and aspects of the selection process. We find that trimming individual observations with too much weight as well as the choice of tuning parameters are important for all estimators. A conclusion from our simulations is that a particular radius matching estimator combined with regression performs best overall, in particular when robustness to misspecifications of the propensity score and different types of outcome variables is considered an important property.

Introduction

Semiparametric estimators using the propensity score to adjust in one way or another for covariate differences are now well-established. They are used for estimating causal effects in a selection-on-observables framework with discrete treatments, or for simply purging the means of an outcome variable in two or more subsamples from differences due to observed variables. Compared to (non-saturated) parametric regressions, they have the advantage of including the covariates in a more flexible way without incurring a curse-of-dimensionality problem, and of allowing for effect heterogeneity. The curse-of-dimensionality problem is highly relevant because of the large number of covariates that should usually be adjusted for. It is tackled by collapsing the covariate information into a single parametric function. This function, the so-called propensity score, is defined as the probability of being observed in one of two subsamples conditional on the covariates. The difference from parametric regression is that this parametric function is not directly related to the outcome (as it would be in regression), and thus additional robustness to misspecification can be expected. These methods originate from the pioneering work of Rosenbaum and Rubin (1983), who show that balancing two samples on the propensity score is sufficient to equalize their covariate distributions.

Although many of these propensity-score-based methods are not asymptotically efficient (see, for example, Heckman et al., 1998a, 1998b; Hahn, 1998), they are the work-horses in the literature on programme evaluation and are now rapidly spreading to other fields. They are usually implemented as semiparametric estimators: the propensity score is based on a parametric model, but the relationship between the outcome variables and the propensity score is non-parametric. However, despite the popularity of propensity-score-based methods and recent advances in important Monte Carlo studies by Frölich (2004) and Busso et al. (forthcoming, 2009), the question of which of the many estimators suggested in the literature should be used in a particular application is still unresolved. In this paper we address this question and add further insights to it. Broadly speaking, the popular estimators can be subdivided into four classes: parametric estimators (like OLS or probit or their so-called double-robust relatives, see Robins et al., 1992), inverse (selection) probability weighting estimators (similar to Horvitz and Thompson, 1952, or to the recently introduced tilting version by Graham et al., 2011, 2012), direct matching estimators (Rubin, 1974; Rosenbaum and Rubin, 1983), and kernel matching estimators (Heckman et al., 1998a, 1998b). However, many variants of the estimators exist within each class and several methods combine the principles underlying these main classes.

There are two strands of the literature that are relevant for our research question: First, the literature on the asymptotic properties of a subset of estimators provides some guidance on their small sample properties. In Section 3 we review this literature and discuss the various estimators. Unfortunately, asymptotic properties have not (yet?) been derived for all estimators used in practice, nor is it obvious how well they approximate small sample behaviour. Furthermore, these results are usually not informative for the important choice of tuning parameters on which many estimators critically depend (e.g., number of matched neighbours, bandwidth selection in kernel matching).

The second strand of the literature provides Monte Carlo evidence on the properties of the estimators of the effects. As one of the first papers investigating estimators from several classes simultaneously, Frölich (2004) found that a particular version of kernel-matching based on local regressions with finite sample adjustments (local ridge regression) performs best. In contrast, Busso et al. (forthcoming, 2009) conclude that inverse probability weighting (IPW) has the best properties (when using normalized weights for estimation). They explain the differences from Frölich (2004) by claiming that he (i) considers unrealistic data generating processes and (ii) does not use an IPW estimator with normalized weights. In other words, they point to the design dependence of the Monte Carlo results as well as to the requirement of using optimized variants of the estimators. Below, we argue that their work may be subject to the same criticism. This provides a major motivation for our study.
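To make the distinction between normalized and non-normalized IPW weights concrete, the following is a minimal sketch of an IPW estimator of the effect on the treated. It assumes the textbook weighting p/(1-p) for non-treated units and takes the propensity scores as given; the function name `ipw_atet` and its arguments are illustrative and not taken from the studies cited above.

```python
def ipw_atet(y, d, p, normalize=True):
    """Illustrative IPW estimate of the average treatment effect on the treated.

    y: outcomes; d: treatment indicators (0 or 1); p: propensity scores in (0, 1).
    Non-treated units are reweighted by p / (1 - p) so that they resemble the treated.
    """
    treated = [yi for yi, di in zip(y, d) if di == 1]
    controls = [(yi, pi / (1.0 - pi)) for yi, di, pi in zip(y, d, p) if di == 0]

    n_treated = len(treated)
    mean_treated = sum(treated) / n_treated
    weighted_sum = sum(yi * wi for yi, wi in controls)

    if normalize:
        # Normalized weights: the control weights are rescaled to sum to one.
        counterfactual = weighted_sum / sum(wi for _, wi in controls)
    else:
        # Non-normalized weights: divide by the number of treated instead.
        counterfactual = weighted_sum / n_treated
    return mean_treated - counterfactual


# Toy data (made up for illustration): two treated and two non-treated units.
y = [3.0, 5.0, 1.0, 2.0]
d = [1, 1, 0, 0]
p = [0.5, 0.5, 0.5, 0.2]
estimate_normalized = ipw_atet(y, d, p, normalize=True)
estimate_raw = ipw_atet(y, d, p, normalize=False)
```

With normalized weights the estimated counterfactual mean is a proper weighted average of the non-treated outcomes, so a constant outcome yields an estimated effect of zero; without normalization this need not hold in a finite sample.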

We contribute to the literature on the properties of estimators based on adjusting for covariate differences in the following ways: firstly, we suggest a different approach to conducting simulations. This approach is based on ‘real’ data. Therefore, we call our particular implementation of this idea an ‘Empirical Monte Carlo Study’. The basic idea is to use the empirical data to simulate realistic ‘placebo treatments’ among the non-treated. The various estimators then use the remaining non-treated in different ways to estimate the (known) non-treatment outcome of the ‘placebo-treated’. Selection into treatment, which is potentially of key importance for the performance of the various estimators, is based on a selection process directly obtained from the data. Moreover, we exploit the actual dependence of the outcome of interest on the covariates on which selection is based in the data rather than making assumptions on this relation when specifying the data generating process. Thus, this approach is less prone to the standard critique of simulation studies that the chosen data generating processes are irrelevant for real applications. Since our model for the propensity score mirrors specifications used in past applied work, it depends on many more covariates than the studies mentioned above. Although this makes the simulation results particularly plausible in our context of labour market programme evaluation in Europe, it may also be seen as a limitation concerning its application to other fields. Therefore, to help generalize the results outside our specific data situation, we modify many features of the data generating process, like the type of the outcome variable as well as various aspects of the selection process.
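The placebo-treatment idea can be sketched in a few lines. The code below is an illustration under simplifying assumptions, not the design used in the paper: the pool of non-treated observations is a toy list of (outcome, propensity score) pairs, samples are drawn with replacement, and the placebo treatment is a Bernoulli draw with the given score. All function names and default values are hypothetical. Because nobody in the pool was actually treated, the true effect is zero and every non-zero estimate is estimation error.

```python
import math
import random


def draw_placebo_sample(pool, n, rng):
    """Draw n non-treated units with replacement and assign a placebo treatment."""
    sample = [rng.choice(pool) for _ in range(n)]
    y = [yi for yi, _ in sample]
    p = [pi for _, pi in sample]
    # Placebo treatment: Bernoulli draw with the unit's propensity score.
    d = [1 if rng.random() < pi else 0 for pi in p]
    return y, d, p


def naive_difference(y, d, p):
    """Unadjusted difference in mean outcomes; ignores the propensity score."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    controls = [yi for yi, di in zip(y, d) if di == 0]
    return sum(treated) / len(treated) - sum(controls) / len(controls)


def simulate(pool, estimator, n=200, replications=200, seed=0):
    """Return (bias, RMSE) of an estimator; the true placebo effect is zero."""
    rng = random.Random(seed)
    errors = []
    for _ in range(replications):
        y, d, p = draw_placebo_sample(pool, n, rng)
        if 0 < sum(d) < n:  # both groups must be non-empty
            errors.append(estimator(y, d, p))
    bias = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return bias, rmse


# Toy pool (made up): the outcome rises with the propensity score,
# so selection into the placebo treatment is related to the outcome.
pool = [(float(k), k / 10.0) for k in range(1, 10)]
bias, rmse = simulate(pool, naive_difference)
```

In this toy pool the unadjusted comparison is biased upwards, which is the kind of selection bias that the estimators under study are meant to remove. Any estimator with the signature `estimator(y, d, p)` can be passed to `simulate` in place of `naive_difference`.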

Secondly, we consider standard estimators as well as their modified (optimized?) versions based on different tuning parameters such as bandwidth or radius choice. This leads to a large number of estimators to evaluate, but it also provides us with more information on important choices regarding the parameters on which the various estimators depend. Such estimators may also consist of combinations of estimators, like combining matching with weighted regression, which have not been considered in any simulation so far. Finally, we reemphasize the relevance of trimming to improve the finite sample properties of all estimators. The rule we propose is (i) a data driven trimming rule, (ii) easy to implement, (iii) identical for all estimators, and (iv) avoids asymptotic bias. We show that for almost all estimators considered, including the parametric ones, trimming based on this rule effectively improves their performance.
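As an illustration of what a weight-based trimming rule can look like, the sketch below drops non-treated observations whose share of the total weight exceeds a threshold and renormalizes the remaining weights. This is a simplified example only: the exact rule proposed in the paper is not reproduced in this excerpt, and the function name `trim_weights` and the default threshold `max_share=0.04` are assumptions made here.

```python
def trim_weights(weights, max_share=0.04):
    """Set overly large weights to zero and renormalize the rest to sum to one.

    weights: non-negative weights of the non-treated observations.
    max_share: hypothetical upper bound on any single observation's weight share.
    """
    total = sum(weights)
    shares = [w / total for w in weights]
    # Observations whose share exceeds the threshold are removed (weight zero).
    kept = [w if s <= max_share else 0.0 for w, s in zip(weights, shares)]
    kept_total = sum(kept)
    if kept_total == 0.0:
        raise ValueError("all observations were trimmed; raise max_share")
    return [w / kept_total for w in kept]


# Toy example (made up): the last observation carries 70% of the total weight.
trimmed = trim_weights([1.0, 1.0, 1.0, 7.0], max_share=0.5)
```

Because the rule operates on the weights alone, the same function can be applied to any estimator that can be written as a weighted mean of the non-treated outcomes.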

Overall, we find that (i) trimming observations that have ‘too large’ a weight is important for many estimators; (ii) the choices of the various tuning parameters play an important role; (iii) simple matching estimators are inefficient and have considerable small sample bias; (iv) no estimator is superior in all designs and for all outcomes; (v) particular bias-adjusted radius (or calliper) matching estimators perform best on average, but may have fat tails if the number of controls is not large enough; and finally, (vi) flexible, but simple parametric approaches do almost as well in the smaller samples, because their gain in precision frequently compensates (in part) for their larger bias which, however, dominates when samples become larger. Strictly speaking these properties relate to our particular data generating process (DGP) only. However, at least such a DGP is typical for an important application of matching methods, namely labour market evaluations.

The paper proceeds as follows: in the next section we describe our Monte Carlo design, relegating many details as well as descriptive statistics to online Appendices B and C, where the latter contains a description of the support features of our data. In Section 3 we discuss the basic setup of each of the relevant estimators and their properties, as well as the issue of trimming, while relegating the technical details of the estimators to the Appendix. The main results are presented in Section 4, while the full set of results is given in online Appendix D. Section 5 concludes and online Appendix E contains further sensitivity checks. The website of this paper (www.sew.unisg.ch/lechner/matching) will contain additional material that has been removed from the paper for the sake of brevity, in particular Appendices B, C, D, and E as well as the Gauss, Stata, and R codes for the preferred estimators. Supplementary material related to this article is provided as an Internet appendix.

Section snippets

Basic idea

A typical Monte Carlo study specifies the data generation process of all relevant random variables and then conducts estimation and inference from samples that are generated by independent draws from those random variables based on pseudo random number generators. The advantage of such a design is that all dimensions of the true data generating process (DGP) are known and can be used for a thorough comparison with the estimates obtained from the simulations. However, the disadvantage is that

Notation and targets for the estimation

The outcome variable, Y, denotes earnings or employment. The group of treated units (treatment indicator D=1) are the participants in training in our empirical example. We are interested in comparing the mean value of Y in the group of treated (D=1) with the mean value of Y in the group of non-treated (D=0), the non-participants, free of any mean differences in outcomes that are due to differences in the observed covariates X across the groups.
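One common way to write this target formally is sketched below. The symbol θ and the exact form are assumptions made here for illustration, since the corresponding equation is not part of this excerpt: the non-treated outcomes are averaged over the covariate distribution of the treated, and, following the balancing property of Rosenbaum and Rubin (1983), X may be replaced by the propensity score p(X).

```latex
\theta
  = E[\,Y \mid D=1\,]
  - E\big[\, E[\,Y \mid D=0,\, X\,] \;\big|\; D=1 \,\big]
  = E[\,Y \mid D=1\,]
  - E\big[\, E[\,Y \mid D=0,\, p(X)\,] \;\big|\; D=1 \,\big],
\qquad p(x) = P(D=1 \mid X=x).
```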

Trimming

From Eq. (1) we see that all estimators can be written as the mean outcome of the treated minus the weighted outcome of the non-treated observations. By the nature of this estimation principle, the weights of the non-treated are not uniform (except in the case of random assignment in which they should be very similar even in the smallest sample). They depend on the covariates via the propensity score. If particular values of p(x) are rare among the controls and common among the treated, such

Results

In this section, we first discuss several issues concerning the implementation of the various estimators (5.1). After that, the results are discussed, beginning with issues that concern all estimators simultaneously, like the impact of different features of the data generating process, the specification of the propensity score and the trimming (5.2). Then, we analyze implementational issues that are specific to the particular classes of estimators considered (5.3). Finally, we compare the best

Conclusion

This paper investigates the finite sample properties of all major classes of propensity-score-based estimators of the average treatment effect on the treated (ATET) that are used in applications. Moreover, within each class of estimators we investigate the performance of the estimators for a variety of possible versions and various values of the tuning parameters. Both features make this study the most comprehensive one in the field so far.

We propose a way to overcome one of the main criticisms

Acknowledgments

Michael Lechner is a Research Fellow of CEPR and PSI, London, CES-Ifo, Munich, IAB, Nuremberg, IZA, Bonn, and ZEW, Mannheim. Conny Wunsch is a Research Fellow of CES-Ifo, Munich, and IZA, Bonn. This project received financial support from the Institut für Arbeitsmarkt und Berufsforschung, IAB, Nuremberg (contract 8104). We would like to thank Patrycja Scioch (IAB), Benjamin Schünemann and Darjusch Tafreschi (both SEW, St. Gallen) for their help in the early stages of data preparation. An

References (101)

  • A. Abadie et al.

    Large sample properties of matching estimators for average treatment effects

    Econometrica

    (2006)
  • A. Abadie et al.

    On the failure of the bootstrap for matching estimators

    Econometrica

    (2008)
  • Abadie, A., Imbens, G.W., 2009. Matching on the estimated propensity score, NBER Working Paper...
  • J.D. Angrist

    Estimating the labor market effects of voluntary military service using social security data on military applicants

    Econometrica

    (1998)
  • J.D. Angrist et al.

    When to control for covariates? Panel-asymptotic results for estimates of treatment effects

    Review of Economics and Statistics

    (2004)
  • J.D. Angrist et al.

    Mostly Harmless Econometrics: An Empiricist’s Companion

    (2009)
  • B. Augurzky et al.

    Assessing the performance of matching algorithms when selection into treatment is strong

    Journal of Applied Econometrics

    (2007)
  • H. Bang et al.

    Doubly robust estimation in missing data and causal inference models

    Biometrics

    (2005)
  • S. Behncke et al.

    Unemployed and their case workers: should they be friends or foes?

    The Journal of the Royal Statistical Society - Series A

    (2010)
  • S. Behncke et al.

    A caseworker like me - does the similarity between unemployed and caseworker increase job placements?

    The Economic Journal

    (2010)
  • M. Bertrand et al.

    How much should we trust differences-in-differences estimates?

    Quarterly Journal of Economics

    (2004)
  • R. Blundell et al.

    Alternative approaches to evaluation in empirical microeconomics

    Journal of Human Resources

    (2009)
  • R. Blundell et al.

    Evaluating the employment impact of a mandatory job search program

    Journal of the European Economic Association

    (2004)
  • Busso, M., DiNardo, J., McCrary, J., 2009. Finite sample properties of semiparametric estimators of average treatment...
  • Busso, M., DiNardo, J., McCrary, J., 2009. New evidence on the finite sample properties of propensity score matching...
  • M. Caliendo et al.

    Sectoral heterogeneity in the employment effects of job creation schemes in Germany

    Journal of Economics and Statistics

    (2006)
  • M. Caliendo et al.

    The employment effects of job creation schemes in Germany–a microeconometric evaluation

  • M. Caliendo et al.

    Identifying effect heterogeneity to improve the efficiency of job creation schemes in Germany

    Applied Economics

    (2008)
  • D. Card et al.

    Active labour market policy evaluations: a meta-analysis

    Economic Journal

    (2010)
  • Chen, X., Hong, H., Tarozzi, A., 2008. Semiparametric efficiency in GMM models of nonclassical measurement errors,...
  • R.K. Crump et al.

    Dealing with limited overlap in estimation of average treatment effects

    Biometrika

    (2009)
  • R.H. Dehejia et al.

    Causal effects in non-experimental studies: reevaluating the evaluation of training programmes

    Journal of the American Statistical Association

    (1999)
  • R.H. Dehejia et al.

    Propensity score-matching methods for nonexperimental causal studies

    Review of Economics and Statistics

    (2002)
  • A. Diamond et al.

    Genetic matching for estimating causal effects: a general multivariate matching method for achieving balance in observational studies

    Mimeo

    (2008)
  • J. DiNardo et al.

    Labor market institutions and the distribution of wages, 1973–1992: a semiparametric approach

    Econometrica

    (1996)
  • C. Drake

    Effects of misspecification of the propensity score on estimators of treatment effect

    Biometrics

    (1993)
  • J. Fan

    Design-adaptive nonparametric regression

    Journal of the American Statistical Association

    (1992)
  • C.A. Flores et al.

    Evaluating nonexperimental estimators for multiple treatments: evidence from experimental data

    Mimeo

    (2009)
  • M. Frölich

    Finite-sample properties of propensity-score matching and weighting estimators

    Review of Economics and Statistics

    (2004)
  • M. Frölich

    Matching estimators and optimal bandwidth choice

    Statistics and Computing

    (2005)
  • M. Frölich

    Nonparametric regression for binary dependent variables

    Econometrics Journal

    (2007)
  • J. Galdo et al.

    Bandwidth selection and the estimation of treatment effects with unbalanced data

    Annales d’Économie et de Statistique

    (2008)
  • M. Gerfin et al.

    Microeconometric evaluation of the active labour market policy in Switzerland

    The Economic Journal

    (2002)
  • A.N. Glynn et al.

    An introduction to the augmented inverse propensity weighted estimator

    Political Analysis

    (2010)
  • Graham, B.S., Pinto, C., Egel, D., 2011. Efficient Estimation of Data Combination Models by the Method of...
  • B.S. Graham et al.

    Inverse probability tilting for moment condition models with missing data

    Review of Economic Studies

    (2012)
  • J. Hahn

    On the role of the propensity score in efficient semiparametric estimation of average treatment effects

    Econometrica

    (1998)
  • L.P. Hansen

    Large sample properties of generalized methods of moments estimators

    Econometrica

    (1982)
  • B. Hansen

    Full matching in an observational study of coaching for the SAT

    Journal of the American Statistical Association

    (2004)
  • P. Hall et al.

    Cross-validation and the estimation of conditional probability densities

    Journal of the American Statistical Association

    (2004)