The reliability of subjective well-being measures

https://doi.org/10.1016/j.jpubeco.2007.12.015

Abstract

This paper studies the test–retest reliability of a standard self-reported life satisfaction measure and of affect measures collected from a diary method. The sample consists of 229 women who were interviewed on Thursdays, two weeks apart, in Spring 2005. The correlation of net affect (i.e., duration-weighted positive feelings less negative feelings) measured two weeks apart is .64, which is slightly higher than the correlation of life satisfaction (r = .59). Correlations between income, net affect and life satisfaction are presented, and adjusted for attenuation bias due to measurement error. Life satisfaction is found to correlate much more strongly with income than does net affect. Components of affect that are more person-specific are found to have a higher test–retest reliability than components of affect that are more specific to the particular situation. While reliability figures for subjective well-being measures are lower than those typically found for education, income and many other microeconomic variables, they are probably sufficiently high to support much of the research that is currently being undertaken on subjective well-being, particularly in studies where group means are compared (e.g., across activities or demographic groups).

Introduction

Economists are increasingly analyzing data on subjective well-being (SWB). From 2000 to 2006, 157 papers and numerous books were published in the economics literature using data on life satisfaction or subjective well-being, according to a search of Econ Lit.1 Data on life satisfaction or happiness have been used as outcome measures in studies of the tradeoff between inflation and unemployment, the effect of cigarette taxes on welfare, the effect of German reunification on well-being, and the effect of lottery winnings on well-being.2 In addition, life and work satisfaction measures have appeared as explanatory variables in studies of labor turnover, productivity and health.3 If it could be measured accurately, or even approximately, subjective well-being would be a natural variable for economists to model and understand, because utility maximization is a central idea in economics, from either a normative or positive perspective.

Here we analyze the test–retest reliability of two types of measures of subjective well-being: a standard life satisfaction question and affective experience measures derived from the Day Reconstruction Method (Kahneman et al., 2004). Although economists have longstanding reservations about the feasibility of interpersonal comparisons of utility that we can only partially address here, another question concerns the persistence of subjective well-being measurements for the same set of individuals over time. Absent dramatic events, overall life satisfaction should not change much from week to week. Likewise, individuals who have similar routines from week to week should experience similar feelings over time. How persistent are individuals' responses to subjective well-being questions? To anticipate our main findings, both measures of subjective well-being (life satisfaction and affective experience) display a serial correlation of about .60 when assessed two weeks apart, which is lower than the reliability ratios typically found for education, income and many other common microeconomic variables (Bound et al., 2001, Angrist and Krueger, 1999). If measurement errors are white noise, a reliability ratio of .60 implies substantial attenuation if the variable is used as an explanatory variable in a regression. Measurement error when subjective well-being is used as a dependent variable would imply a loss of precision in resulting estimates. Nonetheless, the estimated degree of reliability of subjective well-being data is probably high enough to detect effects when they are present in most applications, especially if samples are large and the data are aggregated across people or activities.
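The attenuation claim can be illustrated with a small simulation. This is a minimal sketch under the classical (white noise) measurement error assumption, not the authors' analysis: the sample size, the true slope of 1.0, and all function and variable names are hypothetical choices made for illustration. Only the reliability ratio of .60 comes from the text.

```python
import random


def ols_slope(x, y):
    """Slope from a bivariate least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_attenuation(n=20000, beta=1.0, reliability=0.60, seed=0):
    """Compare the slope on error-free SWB with the slope on noisy SWB.

    Assumption: the error-free SWB has variance 1 and the noise variance
    is set so that var(signal) / (var(signal) + var(noise)) = reliability.
    """
    rng = random.Random(seed)
    noise_sd = ((1.0 - reliability) / reliability) ** 0.5
    swb_true = [rng.gauss(0.0, 1.0) for _ in range(n)]
    swb_observed = [s + rng.gauss(0.0, noise_sd) for s in swb_true]
    # Hypothetical outcome that depends on error-free SWB with slope beta.
    outcome = [beta * s + rng.gauss(0.0, 1.0) for s in swb_true]
    return ols_slope(swb_true, outcome), ols_slope(swb_observed, outcome)


slope_true, slope_noisy = simulate_attenuation()
print(f"slope using error-free SWB: {slope_true:.2f}")
print(f"slope using noisy SWB:      {slope_noisy:.2f}")
```

In this setup the slope on the noisy regressor should come out near 0.60 times the true slope, which is the sense in which a reliability ratio of .60 implies substantial attenuation.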

The life satisfaction question that we examine is nearly identical to that used in the World Values Survey, and similar to that used in many other well-being surveys. There is reason to expect, however, that life satisfaction measures such as this may not be as stable from week to week as might be assumed. These judgments are the result of a complex thought experiment, which often depends in part on transient influences (e.g., one's mood at the time; see Schwarz and Strack, 1999).

For measurements of the affective experience of daily life the gold standard is perhaps the Experience Sampling Method (ESM) (also called Ecological Momentary Assessment (EMA)), in which participants are prompted at random intervals to record their current circumstances and feelings (Csikszentmihalyi and Larson, 1987, Stone et al., 1999). This method of measuring affect minimizes the role of memory and interpretation, but it is expensive and difficult to implement in large samples. The Day Reconstruction Method (DRM) is a recent development in the measurement of affective experience, which reduces the cost of obtaining this information. Consequently, we use the DRM, in which participants are requested to think about the preceding day, break it up into episodes, and describe each episode by selecting from several menus (Kahneman et al., 2004). The DRM involves memory, but is designed to increase the accuracy of emotional recall by inducing retrieval of the specifics of successive episodes (Robinson and Clore, 2002, Belli, 1998). Evidence that the two methods can be expected to yield similar results was presented earlier for subpopulation averages (Kahneman et al., 2004). A critical advantage of the DRM is that it provides data on time use — a valuable source of information in its own right, which has rarely been combined with the study of subjective well-being.

In this paper we report reliability measures for a sample of 229 employed women who each filled out a DRM questionnaire for two Wednesdays, two weeks apart in 2005. We compare these reliability estimates to those of global well-being measures more typical in the literature, and we decompose the reliability of duration-weighted net affect into a component due to the similarity of activities across days and other factors. We also provide an application using the reliability estimates to correct observed correlations between self-reported well-being and other variables (e.g., income) for attenuation. We conclude with a discussion of the implications of measurement error for DRM studies and for well-being research more generally.
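As a rough illustration of what "duration-weighted net affect" means, the sketch below computes it for one hypothetical diary day. The text defines net affect only as duration-weighted positive feelings less negative feelings. The episode structure, the 0 to 6 rating scale, the averaging of ratings within an episode, and all names are assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Episode:
    """One diary episode (hypothetical structure for illustration)."""
    minutes: float
    positive: List[float]  # ratings of positive feelings, assumed 0-6 scale
    negative: List[float]  # ratings of negative feelings, assumed 0-6 scale


def net_affect(episodes: List[Episode]) -> float:
    """Duration-weighted mean of (positive minus negative) across episodes."""
    total_minutes = sum(e.minutes for e in episodes)
    if total_minutes <= 0:
        raise ValueError("episodes must cover a positive amount of time")
    weighted = sum(
        e.minutes * (mean(e.positive) - mean(e.negative)) for e in episodes
    )
    return weighted / total_minutes


# Hypothetical day with three episodes.
day = [
    Episode(minutes=30, positive=[2, 3], negative=[3, 1]),   # commute
    Episode(minutes=240, positive=[4, 4], negative=[1, 1]),  # work
    Episode(minutes=90, positive=[5, 5], negative=[0, 0]),   # dinner
]
print(round(net_affect(day), 2))
```

Because long episodes carry more weight, a respondent whose activities differ between the two survey days can show a different net affect even if her feelings within each activity are unchanged.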

Consider an observed variable, $y$, which is a noisy measure of the variable of interest, $y^*$. We can write $y_i = y_i^* + e_i$, where $y_i$ is the observed value for individual $i$, $y_i^*$ is the “correct” value, and $e_i$ is the error term. Under the “classical measurement error” assumptions, $e_i$ is a white noise disturbance that is uncorrelated with $y_i^*$ and homoskedastic. Classical measurement error will lead correlations between $y$ and other variables to be attenuated toward 0 in large samples.4 If we can measure $y_i$ at two points in time, and if the measurement errors are independent and have a constant variance over time, then the correlation between the two measures provides an estimate of the ratio of the variance in the signal to the total variance in $y$. We thus define the reliability ratio, $r$, as $r = \mathrm{corr}(y_i^1, y_i^2)$, where the superscripts indicate the measurement taken in periods 1 and 2. Under the assumptions stated, $\operatorname{plim} r = \dfrac{\mathrm{var}(y^*)}{\mathrm{var}(y^*) + \mathrm{var}(e)}$.
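The link between the test–retest correlation and the signal-to-total variance ratio can be checked numerically. The following is a minimal sketch that assumes classical measurement error (independent errors with constant variance in both periods) and an unchanged true value across periods. The variances, sample size and names are hypothetical, chosen so that the theoretical ratio is .60.

```python
import random


def pearson(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def simulate_test_retest(n=20000, var_signal=1.5, var_error=1.0, seed=1):
    """Return (sample test-retest correlation, signal-to-total variance ratio)."""
    rng = random.Random(seed)
    signal = [rng.gauss(0.0, var_signal ** 0.5) for _ in range(n)]
    # Two reports of the same true value, each with its own independent error.
    report_1 = [s + rng.gauss(0.0, var_error ** 0.5) for s in signal]
    report_2 = [s + rng.gauss(0.0, var_error ** 0.5) for s in signal]
    ratio = var_signal / (var_signal + var_error)
    return pearson(report_1, report_2), ratio


r_hat, r_theory = simulate_test_retest()
print(f"sample test-retest correlation: {r_hat:.3f}")
print(f"signal-to-total variance ratio: {r_theory:.3f}")
```

With a large simulated sample the two numbers should agree closely, which is what the probability limit of the reliability ratio states.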

In addition to summarizing the extent of random noise in subjective well-being reports, the signal-to-total variance ratio is of interest because, in the limit, it equals the proportional bias that arises when SWB is an explanatory variable in a bivariate regression. Furthermore, as we explain below, correlations between SWB and other variables are attenuated by random measurement error in SWB. An important application of SWB data involves estimating the correlations among life satisfaction, affect and other variables such as income (e.g., Argyle, 1999). We can use the reliability ratio to correct those correlations for attenuation, which would mean that many reported relationships are stronger than previously thought.
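One common way to correct a correlation for attenuation is the standard disattenuation formula, which divides the observed correlation by the square root of the product of the two reliability ratios. The sketch below applies it under classical measurement error in both variables. It is not necessarily the exact procedure used in the paper. The observed correlation of 0.20 is a hypothetical input, and the function name is illustrative. The reliabilities of .59 for life satisfaction and about .90 for income are the figures quoted in the text.

```python
import math


def disattenuate(r_observed, reliability_x, reliability_y=1.0):
    """Correct an observed correlation for classical measurement error.

    Assumes errors in x and y are uncorrelated with each other and with
    the true values. Each reliability is a signal-to-total variance ratio.
    """
    for rel in (reliability_x, reliability_y):
        if not 0.0 < rel <= 1.0:
            raise ValueError("reliability ratios must lie in (0, 1]")
    return r_observed / math.sqrt(reliability_x * reliability_y)


# Hypothetical observed correlation between life satisfaction and income.
corrected = disattenuate(0.20, reliability_x=0.59, reliability_y=0.90)
print(f"corrected correlation: {corrected:.3f}")
```

The corrected value is always at least as large in magnitude as the observed one, and the adjustment grows as either reliability falls.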

Of course, if the measurement error is not classical, the test–retest correlation can under- or over-state the signal-to-total variance ratio, depending on the nature of the deviation from classical measurement error. With only two reports of y, and without knowledge of the true value that y is meant to measure, it is not possible to assess the plausibility of the classical measurement error assumptions. If the errors in measurement are positively correlated over time, then the test–retest correlation will over-state the reliability of the data. Nevertheless, the test–retest correlation is a convenient starting point for summarizing the reliability of subjective well-being data.
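The overstatement under positively correlated errors can be made concrete with a simple variance decomposition. The sketch assumes, purely for illustration, that the error in each period is the sum of a persistent component shared across both periods and an independent transient component. All variance values and function names are hypothetical.

```python
def retest_correlation(var_signal, var_persistent_error, var_transient_error):
    """Test-retest correlation when part of the error persists across periods.

    The persistent error is common to both reports, so it adds to the
    covariance between them just as the true signal does.
    """
    total = var_signal + var_persistent_error + var_transient_error
    return (var_signal + var_persistent_error) / total


def true_reliability(var_signal, var_persistent_error, var_transient_error):
    """Share of total variance that is signal, counting all error as noise."""
    total = var_signal + var_persistent_error + var_transient_error
    return var_signal / total


# Hypothetical variances: signal 0.5, persistent error 0.1, transient error 0.4.
observed = retest_correlation(0.5, 0.1, 0.4)
actual = true_reliability(0.5, 0.1, 0.4)
print(f"test-retest correlation: {observed:.2f}")
print(f"signal-to-total ratio:   {actual:.2f}")
```

In this example the test–retest correlation exceeds the signal-to-total variance ratio by exactly the share of variance due to the persistent error, and the two coincide when that share is zero.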

There is a vast empirical literature on subjective well-being (see Kahneman et al., 1999 for a survey). Subjective well-being is most commonly measured by asking people a single question, such as, “All things considered, how satisfied are you with your life as a whole these days?” or “Taken all together, would you say that you are very happy, pretty happy, or not too happy?” Such questions elicit a global evaluation of one's life. Surveys conducted in many countries indicate that, on average, large increases in per capita national income over the last four decades have had little effect on reported global judgments of life satisfaction or happiness. Although reported life satisfaction and household income are positively correlated in a cross section of people at a given time, increases in income have been found to have mainly a transitory effect on individuals' reported life satisfaction (Easterlin, 1995).5 Moreover, the correlation between income and subjective well-being is notably weaker when a measure of experienced happiness is used instead of life satisfaction (Kahneman et al., 2006). Of course, such low correlations could be partly due to attenuation if measurement error is substantial.

Table 1 summarizes past estimates of the reliability of SWB measures. Single-item measures of SWB have been found to have moderate reliabilities, usually between .40 and .66, even when asked twice in the same session one hour apart (Andrews and Withey, 1976). Kammann and Flett (1983) found that single-item well-being questions under the instructions to consider “the past few weeks” or “these days” had reliabilities of .50 to .55 when asked within the same day. Interestingly, the only study we are aware of that looked at the reliability of an ESM measure of duration-weighted happiness found a correlation at the upper end of the range found for single-item global well-being measures (Steptoe et al., 2005). Overall, there has been surprisingly little attention paid to reliability, despite the wide use of these measures.

The Satisfaction with Life Scale (SWLS, Diener et al., 1985) is another commonly used global satisfaction measure. In contrast to the single-question measures, it consists of the average of five related items, each of which is rated on a 7-point scale from Strongly Disagree (1) to Strongly Agree (7). The items are: “In most ways my life is close to my ideal”; “The conditions of my life are excellent”; “I am satisfied with my life”; “So far I have gotten the important things I want in life”; and “If I could live my life over, I would change almost nothing”. A key reason that SWLS has proven more reliable than single-item questions (see Table 1) is that, because it combines multiple items, it benefits from error reduction through aggregation. Eid and Diener (2004) used a structural model to estimate reliability for a sample of 249 students, measured three times with four weeks between successive measurements. After controlling for the influence of situation-specific factors, they estimated that the imputed stability for life satisfaction was very high, around .90.
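The gain from aggregating items can be approximated with the Spearman-Brown formula, a standard psychometric result that is not taken from this paper. The sketch assumes the items are parallel (equal true-score variance and independent, equal-variance errors), which real scale items only approximate. The single-item reliability of .60 used in the example is a hypothetical input in the range reported for single-item measures.

```python
def spearman_brown(single_item_reliability, n_items):
    """Predicted reliability of an average (or sum) of n parallel items."""
    if not 0.0 <= single_item_reliability <= 1.0:
        raise ValueError("single-item reliability must lie in [0, 1]")
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    r = single_item_reliability
    return n_items * r / (1.0 + (n_items - 1) * r)


# Hypothetical: five parallel items, each with reliability .60.
print(f"{spearman_brown(0.60, 5):.3f}")
```

Under these assumptions, averaging five items raises reliability well above the single-item figure because independent errors partly cancel while the common signal does not.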

One reason for the modest reliability of subjective well-being measures compared with education and income, which typically have reliability ratios of around .90, could be the susceptibility of SWB questions to transient mood effects. For example, researchers have documented mood changes due to such subtle events as finding a dime before filling out a questionnaire, the current weather, or question order, which in turn influence reported life satisfaction (e.g., Schwarz, 1987). Eid and Diener (2004) used a structural model, which attempted to separate situational variability from random error and basic stability, and found that anywhere from 4% to 25% of the variance in various affect and satisfaction measures was accounted for by situation-specific factors. In an earlier study, Ferring et al. (1996) estimated the size of transient factors as between 12% and 34% of total variance. Since the experienced affect measure produced by the DRM is focused on reconstructing a specific event and the affect actually experienced during it, there is at least the possibility that such measures will be less vulnerable to current mood at the time of the interview.

We might expect DRM measures to be less reliable over time than life satisfaction, however, because a person's activities change from day to day, and affect is associated with activities. At the same time, DRM measures are averages of multiple responses, while global life satisfaction or happiness is often assessed with just one question. If ESM is any guide, the DRM may be at least as reliable as reported overall life satisfaction.


Method

We evaluate the test–retest reliability of the DRM by having the same respondents complete a DRM questionnaire two weeks apart regarding the same day of the week (Wednesday). The questionnaire, which is available from the authors on request, also contained standard global life satisfaction measures. The resulting data provide information for the same sample about the relative stability of the DRM compared to the types of global life satisfaction questions used in most well-being research.

For

Results

Table 2 presents the correlations between various measures for the same person in the first and second sessions, as well as 95% confidence intervals. We focus first on overall measures of affective experience. Perhaps the most surprising finding is that the reliabilities of Net Affect (r = .64) and Difmax (r = .60) are at least as high as that for life satisfaction (r = .59). Satisfaction with domains of life (work and home) is more reliable than satisfaction with life overall.8

Discussion

We analyzed the persistence of various subjective well-being questions over a two-week period for a sample of 229 working women. We found that both overall life satisfaction measures and affective experience measures derived from the DRM exhibited test–retest correlations in the range of .50–.70. While these figures are lower than the reliability ratios typically found for education, income and many other common microeconomic variables, they are probably sufficiently high to yield informative

References (43)

  • Blanchflower, David et al. (2004). Well-being over time in Britain and the United States. Journal of Public Economics.
  • Bound, John et al. Measurement error in survey data.
  • Alfonso, V.C., Allison, D.B., 1992. Further Development of the Extended Satisfaction with Life Scale. Unpublished...
  • Andrews, F.M. et al. (1976). Social Indicators of Well Being: Americans' Perception of Life Quality.
  • Angrist, J. et al. Empirical strategies in labor economics.
  • Argyle, M. Causes and correlates of happiness.
  • Belli, R. (1998). The structure of autobiographical memory and the event history calendar: potential improvements in the quality of retrospective reports in surveys. Memory.
  • Blais, M.R. et al. (1989). L'Echelle de satisfaction de vie: Validation Canadienne-Francaise du “Satisfaction With Life Scale” [French-Canadian Validation of the Satisfaction With Life Scale]. Canadian Journal of Behavioral Science.
  • Blanchflower, David et al. (2004). Money, sex, and happiness: an empirical study. Scandinavian Journal of Economics.
  • Clark, Andrew and Georgellis, Yannis. 2004. “Kahneman meets the quitters: peak-end behaviour in the labour market,”...
  • Csikszentmihalyi, M. et al. (1987). Validity and reliability of the experience-sampling method. Journal of Nervous and Mental Disease.
  • Deaton, A. (2007). Income, Aging, Health and Wellbeing around the World: Evidence from the Gallup World Poll.
  • Diener, E. et al. (1985). The satisfaction with life scale. Journal of Personality Assessment.
  • Easterlin, Richard A. (1995). Will raising the incomes of all increase the happiness of all? Journal of Economic Behavior and Organization.
  • Eid, M. et al. (2004). Global judgments of subjective well-being: situational variability and long-term stability. Social Indicators Research.
  • Ferring, D. et al. The ‘Skala zur Lebensbewertung’: scale construction and findings on reliability, stability, and validity.
  • Freeman, Richard (1978). Job satisfaction as an economic variable. American Economic Review.
  • Frey, B. et al. (2002). What can economists learn from happiness research? Journal of Economic Literature.
  • Frijters, Paul et al. (2004). Money does matter! Evidence from increasing real income and life satisfaction in East Germany following reunification. The American Economic Review.
  • Gardner, Jonathan, and Oswald, Andrew. 2001. “Does money buy happiness? A Longitudinal Study using data on windfalls,”...
  • Gruber, Jonathan, Mullainathan, Sendhil. 2002. “Do Cigarette taxes make smokers happier?” NBER Working Paper No....

    The authors thank Daniel Kahneman, Norbert Schwarz, Arthur Stone and two anonymous referees for helpful comments and the Hewlett Foundation, the National Institute on Aging, and Princeton University's Woodrow Wilson School for financial support.
