The logged dependent variable, heteroscedasticity, and the retransformation problem

https://doi.org/10.1016/S0167-6296(98)00025-3

Introduction

The use of a log transformed dependent variable has become commonplace in applied microeconomic work. In some cases, such as in the analysis of wages, the log transform has become the standard, while in other applied areas it is just considered good practice. Once the estimates from such a model have been obtained, the usual practice is to interpret the response to a particular variable (e.g., price or income) as being the exponential of the coefficient of that variable in the model. In some cases, these estimates of the impact of the variable are corrected for the fact that one is using an estimate, rather than the true value of the coefficient (Kennedy, 1981, 1983, 1992). Very few of the applications make a full correction for the impact of heteroscedasticity on the estimated response (for examples, see Duan et al., 1983; Manning et al., 1987; McCuen et al., 1990; Newhouse et al., 1981; Puma and Hoaglin, 1990; Showalter, 1994). Although many analysts will use either a generalized least squares estimator or the Huber/White consistent estimate of the variance–covariance matrix of the estimated coefficients, few make a direct adjustment to the predicted response. Unlike regression models on the raw, or untransformed, scale, log model results are about geometric means, not arithmetic means. If an unlogged dependent variable is used, the estimated response is that for the arithmetic mean. In fact, there is a danger that log scale results may provide a very misleading, incomplete, and biased estimate of the impact of a covariate on the arithmetic mean. If one wants to comment on the arithmetic mean response to some variable from a model with a logged dependent variable, then one must include a term that captures any heteroscedasticity in the error term on the log scale that is attributable to that variable.

The following sections will explore the role of heteroscedasticity in log models. After considering rationales for the log transformation in Section 2, I examine the effect of heteroscedasticity in log models with a normally distributed error term in Section 3. This is first done with a simple comparison of population means. Then, the model is extended to allow for other covariates and for a non-normal error term. In Section 4, the case of a non-normal error term is considered. In Section 5, the model is relaxed to allow for other power (or Box–Cox) transformations of the dependent variable. Section 6 deals with other estimators where retransformation is problematic, with or without heteroscedasticity. Section 8 illustrates the impact of this phenomenon using an example from the Health Insurance Experiment.

Section snippets

Rationales

The rationale for using a log transformed dependent measure can come from a variety of concerns: (1) a desire for multiplicative or proportional responses to a covariate of interest; (2) a desire to generate an estimate that easily yields an elasticity (as in the case of the log–log model); (3) as a consequence of working from certain classes of utility, demand, production, or cost functions (as in the cases of the Cobb–Douglas and translog formulations); (4) as a consequence of estimating the

Expectation of y—the normal case

To see this, we need to write out the expectation of the untransformed dependent variable from a model using a logged dependent variable, that is:

$$\ln(y) = x\beta + \epsilon$$

where $E(\epsilon) = 0$ and $E(\epsilon \mid x) = 0$. Although the mean of the error term on the log scale does not depend on x, the error term may not exhibit constant variance; that is, $E(\epsilon^2) \neq c$, a constant. If the error term has an expected value of zero, then $E(\ln(y)) = x\beta$. However, in most cases the expectation of y is a bit more complex:

$$E(y \mid x) = e^{x\beta} E(e^{\epsilon}) \neq e^{x\beta}$$

In the general case, the
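As a minimal sketch of why the extra factor matters, assume (as this section's title does) that the log-scale error is normal with variance σ². The standard lognormal result then gives E(e^ϵ) = e^(σ²/2). The function names and all numbers below are hypothetical illustrations, not values from the paper.

```python
import math

def geometric_mean_response(xb):
    """Naive retransformation exp(x*beta) of the log-scale prediction."""
    return math.exp(xb)

def arithmetic_mean_response(xb, sigma2):
    """E(y|x) when the log-scale error is N(0, sigma2): exp(x*beta + sigma2/2)."""
    return math.exp(xb + 0.5 * sigma2)

# Hypothetical groups A and B: identical log-scale means, different log-scale variances.
xb_a, sigma2_a = 5.0, 1.0
xb_b, sigma2_b = 5.0, 2.0

# The log-scale comparison shows no difference between the groups ...
ratio_geometric = geometric_mean_response(xb_a) / geometric_mean_response(xb_b)
# ... but the arithmetic means differ by exp((sigma2_a - sigma2_b) / 2).
ratio_arithmetic = (arithmetic_mean_response(xb_a, sigma2_a)
                    / arithmetic_mean_response(xb_b, sigma2_b))

print(ratio_geometric, ratio_arithmetic)
```

In this illustration the coefficient comparison alone would report equal responses, while the arithmetic means differ only because the log-scale variance differs by group.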

Expectation of y—the non-normal case

If the error term is not normally distributed, there are two alternatives. If the error term is known to follow a specific distribution, then the expectation of the exponentiated error, $E(e^{\epsilon})$, can be derived directly. If the distribution is not known a priori, then one nonparametric alternative is the smearing estimator developed by Duan (1983), which was applied in a number of the Health Insurance Experiment papers (Duan et al., 1983, Duan et al., 1984; Manning et al., 1987; Newhouse et
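The smearing idea can be sketched as follows: estimate E(e^ϵ) by the sample mean of the exponentiated log-scale residuals, then multiply the naive retransformed prediction by that factor. This is a sketch only, using the Python standard library; the function names and the residual values are hypothetical, and a single pooled factor is appropriate only if the log-scale error is homoscedastic.

```python
import math

def smearing_factor(log_residuals):
    """Sample mean of exp(residual), an estimate of E(exp(error))."""
    return sum(math.exp(r) for r in log_residuals) / len(log_residuals)

def smeared_prediction(xb_hat, log_residuals):
    """Retransformed prediction: exp(x*beta_hat) times the smearing factor."""
    return math.exp(xb_hat) * smearing_factor(log_residuals)

# Hypothetical log-scale residuals from a fitted model (they sum to zero).
residuals = [-0.5, 0.0, 0.5]
factor = smearing_factor(residuals)

# Hypothetical fitted log-scale prediction for one observation.
prediction = smeared_prediction(5.0, residuals)
print(factor, prediction)
```

When the residual variance differs across groups, one possible extension (an assumption here, not a quotation of the paper's procedure) is to compute the factor separately from each group's residuals.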

Alternative transformations

This issue of retransformation and heteroscedasticity is not unique to the case of a logged dependent variable. Any power transformation of y will raise this issue. If the square root transformation is used, then

$$\sqrt{y} = x\delta + \nu$$

where ν has zero mean and is independent of x. Then the expected value of y is:

$$E(y \mid x) = (x\delta)^2 + \sigma_\nu^2$$

If the error term ν is heteroscedastic in x (or in treatment groups), then the effect of the heteroscedasticity on the arithmetic mean is additive, while for the log model, the effect is
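The additive form for the square root model follows from squaring the model equation and taking expectations. The short derivation below is a sketch; the notation $\sigma_\nu^2(x)$ is an assumption introduced here to make explicit that the variance of ν may depend on x.

```latex
\begin{aligned}
E(y \mid x) &= E\big[(x\delta + \nu)^2 \mid x\big] \\
            &= (x\delta)^2 + 2\,(x\delta)\,E(\nu \mid x) + E(\nu^2 \mid x) \\
            &= (x\delta)^2 + \sigma_\nu^2(x),
            \qquad \text{since } E(\nu \mid x) = 0 .
\end{aligned}
```

The variance term is added to $(x\delta)^2$, whereas in the log model the corresponding term $E(e^{\epsilon})$ multiplies $e^{x\beta}$.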

Related transformed model issues

Heteroscedasticity is not the only cause of re-transformation issues for logged (or other power transformed) models. There are a number of commonly used techniques for dealing with econometric problems that raise re-transformation issues, some of which exacerbate the issues raised earlier. These include transformations to do GLS, two stage least squares, and the LIML (Mills ratio) version of the Selection Model.

One commonly used method for dealing with heteroscedasticity in estimating β is to

Effect on an individual prediction

In some cases, the analytical goal is to determine the effect of a variable on a particular individual, not the mean over a population of interest. If the error term on the log scale is represented by $\epsilon_i = \sigma e_i$ (where $e_i$ is N(0,1)), then Eq. (9) comparing the effect of treatments A and B at the individual level is:

$$\frac{y_{A,i}}{y_{B,i}} = e^{(\mu_A - \mu_B) + (\sigma_A - \sigma_B) e_i}$$

Because the individual prediction depends on the individual's actual $e_i$, the formula in Eq. (20) differs from Eq. (9). If we were to average Eq. (18) over all
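To make the dependence on the individual's own error draw concrete, the sketch below evaluates the individual-level ratio for a few values of e_i. The function name and the parameter values are hypothetical; σ_A and σ_B are standard deviations of the log-scale error under each treatment, as in ϵ_i = σe_i.

```python
import math

def individual_ratio(mu_a, mu_b, sigma_a, sigma_b, e_i):
    """y_A,i / y_B,i = exp((mu_A - mu_B) + (sigma_A - sigma_B) * e_i)."""
    return math.exp((mu_a - mu_b) + (sigma_a - sigma_b) * e_i)

# Hypothetical parameters: treatment A has the higher log-scale mean
# and the smaller log-scale standard deviation.
mu_a, mu_b = 0.2, 0.0
sigma_a, sigma_b = 1.0, 1.5

# The same treatment contrast implies different ratios for different individuals.
ratios = {e_i: individual_ratio(mu_a, mu_b, sigma_a, sigma_b, e_i)
          for e_i in (-1.0, 0.0, 1.0)}
print(ratios)
```

With these illustrative values the ratio falls as e_i rises, so an individual with a large positive error draw sees a ratio below one even though μ_A exceeds μ_B.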

Example

Many of the Health Insurance Experiment (HIE) papers on expenditures and utilization relied on regression models with logged dependent variables. With standard deviations two to four times the mean, the log transformation was essential to finding estimates of the response of health care expenditures and utilization that were robust to the skewness in the data (Duan et al., 1983). In many of the analyses, the residual errors indicated the presence of appreciable heteroscedasticity by insurance

Conclusion—the effect of heteroscedasticity

In the case of least squares on an untransformed dependent variable, the possibility of heteroscedasticity should raise concerns about the efficiency of the OLS estimate of β, and about the consistency of the OLS estimate of the variance of the OLS estimate of β. Most of us have learned to use GLS estimators to obtain efficient estimates of β and the correct inference statistics for the variance of the estimate of β. Failing to do that, we know that the Huber/White estimate of the

Acknowledgements

This paper has benefitted from the comments made by Bryan Dowd, Mike Finch, Richard Frank, John Mullahy, Tamara Stoner, and an anonymous reviewer. The original interest in this area began in collaboration with Naihua Duan, Carl Morris, and Joseph Newhouse on the Health Insurance Experiment. The current analysis and conclusions presented here are the author's alone.

References (17)

  • Box, G., Cox, D., 1964. An analysis of transformations. J. R. Stat. Soc. Ser. B, pp....
  • Duan, N., 1983. Smearing estimate: a nonparametric retransformation method. J. Am. Stat. Assoc.
  • Duan, N., et al., 1983. A comparison of alternative models for the demand for medical care. J. Bus. Econ. Stat.
  • Duan, N., et al., 1984. Choosing between the sample selection model and the multi-part model. J. Bus. Econ. Stat.
  • Ettner, E.L., Frank, R.G., McGuire, T.G., Newhouse, J.P., Notman, E.H., 1996. Risk adjustment of mental health and...
  • Greene, W.H., 1993. Econometric Analysis, McMillan, New...
  • Johnson, N.L., Kotz, S., Balakrishnan, N., 1994. Continuous Univariate Distributions, 2nd edn., Vol. 1, Wiley, New...
  • Kennedy, P., 1992. A Guide to Econometrics, 3rd edn., MIT Press,...
There are more references available in the full text version of this article.
