
Measurement of the severity of disability in community-dwelling adults and older adults: interval-level measures for accurate comparisons in large survey data sets
  1. José Buz1,
  2. María Cortés-Rodríguez2
  1. 1Department of Developmental Psychology, University of Salamanca, Salamanca, Spain
  2. 2Faculty of Sciences, Department of Statistics, University of Salamanca, Salamanca, Spain
  1. Correspondence to Dr José Buz; buz{at}


Objectives To (1) create a single metric of disability using Rasch modelling to be used for comparing disability severity levels across groups and countries, (2) test whether the interval-level measures were invariant across countries, sociodemographic and health variables and (3) examine the gains in precision using interval-level measures relative to ordinal scores when discriminating between groups known to differ in disability.

Design Cross-sectional, population-based study.

Setting/participants Data were drawn from the Survey of Health, Ageing and Retirement in Europe (SHARE), including comparable data across 16 countries and involving 58 489 community-dwelling adults aged 50+.

Main outcome measures A single metric of disability composed of self-care and instrumental activities of daily living (IADLs) and functional limitations. We examined construct validity through fit to the Rasch model and the known-groups method. Reliability was examined using person separation reliability.

Results The single metric fulfilled the requirements of a strong hierarchical scale; was able to separate persons with different levels of disability; demonstrated invariance of the item hierarchy across countries; and was unbiased by age, gender and different health conditions. However, we found a blurred hierarchy of ADL and IADL tasks. Rasch-based measures yielded gains in relative precision (11–116%) in discriminating between groups with different medical conditions.

Conclusions Equal-interval measures, with person-invariance and item-invariance properties, provide epidemiologists and researchers with the opportunity to gain better insight into the hierarchical structure of functional disability, and yield more reliable and accurate estimates of disability across groups and countries. Interval-level measures of disability allow parametric statistical analysis to confidently examine the relationship between disability and continuous measures so frequent in health sciences (eg, cholesterol, blood pressure, C reactive protein).

  • Rasch modelling
  • Geriatric assessment
  • Disability

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See:

Strengths and limitations of this study

  • This is the first study that provides a Rasch-based single metric of disability to be used for accurate comparisons of disability severity levels across groups/countries and their relationships with external variables.

  • We empirically assess the reliability of scores using Rasch modelling to address the misuse of estimating reliability by means of Cronbach’s α in highly skewed distributions with marked ceiling/floor effects.

  • The measurement of disability with reliable interval-level measures is a cost-effective and efficient approach to gain comprehensive data on persons with disabilities, thus providing important keys regarding how and when to promote prevention programmes, modify interventions or develop enabling environments.

  • The examination of differential item functioning (DIF) by medical conditions and physical symptoms is limited to three broad groups. The presence of DIF with more specific health conditions, as well as contextual and environmental variables, should be investigated in future studies.

  • Despite the advantages of a Rasch-based single metric of disability over separate scales with summative scores, our metric should be improved by adding more items of difficult tasks to adequately measure the lowest disability levels in the general population.


The measurement of the severity of disability is a critical element for studying the causes and consequences of ageing and for planning health programmes and services.1 To date, obtaining valid and reliable measures of disability from survey data remains a major challenge. Activities of daily living (ADLs) and instrumental activities of daily living (IADLs) scales have shown construct under-representation, lack of sensitivity to change, low discriminative power, presence of bias, and striking floor and ceiling effects in community-dwelling populations.2–8 To overcome some of these problems, aggregated measures of ADLs and IADLs have been constructed.2 ,9–16 In general, these studies have supported a single underlying dimension,2 ,10 ,12 ,14–16 but they have also underlined serious concerns regarding the purported hierarchy of functional disability, and evidence of differential item functioning (DIF) regarding age and gender.9 ,11 ,12 ,15 When ADL and IADL scales have been combined, the age-related and gender-related measurement bias was significantly attenuated.2 ,15 Moreover, conducting parametric statistics with summative scores from these scales violates the fundamental assumption of equal-interval scaling and increases the probability of type I and II errors.2 ,13 ,17 ,18 It has also been observed that summative scores of ADLs/IADLs underestimate mean disability in cross-cultural studies.2 Summative scores obtained from hierarchical scales wrongly assume that (1) all items are measuring the same disability continuum, (2) each item contributes equally to the final score and (3) scores are not dependent on samples and items.

Frequently reported large floor and ceiling effects in ADL and IADL scales in relatively healthy populations also represent evident threats to validity and reliability but, surprisingly, their effects have been largely ignored. For example, the examination of the reliability of scores in ADLs, IADLs and mobility scales with more precise and more appropriate statistics than Cronbach’s α has not been addressed.

A recognised advance in ensuring the quality of health-related instruments is the Rasch model, a parametric item response theory (IRT) model that transforms raw scores into interval-scaled measures, and allows the unequivocal confirmation of the formal item hierarchy.10 According to the model, the probability of endorsing an item is a logistic function of the difference between the person's ability (latent trait, θ) and the item difficulty (δ). Thus, persons with low disability have a lower probability of being limited in easy activities (eg, eating), whereas more disabled persons have a higher probability of being limited in more difficult activities (eg, shopping). This is usually presented as follows:

\[ P(X_{is}=1 \mid \theta_s, \delta_i) = \frac{e^{\theta_s-\delta_i}}{1+e^{\theta_s-\delta_i}} \]

where X_{is} refers to a correct response (X=1) made by participant s to item i; θ_s refers to the trait level of participant s; δ_i refers to the difficulty of item i; and e is the base of the natural logarithm (e≈2.71828).
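The logistic relationship described above can be sketched in a few lines. This is a minimal illustration that assumes the standard dichotomous Rasch form; the function name `rasch_probability` is ours and is not taken from the software used in the study.

```python
import math


def rasch_probability(theta: float, delta: float) -> float:
    """Probability of endorsing an item under the dichotomous Rasch model.

    theta: person location in logits; delta: item location in logits.
    """
    # Logistic function of the person-item difference (theta - delta).
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


# When person and item locations coincide, the probability is exactly 0.5.
p_equal = rasch_probability(0.0, 0.0)
# A person located 1 logit above the item endorses it with probability ~0.73.
p_above = rasch_probability(1.0, 0.0)
```

Only the difference θ−δ enters the function, which is why persons and items can be placed on one common logit scale.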

Persons and items are calibrated on a common interval-level scale (expressed in logits), so it is possible to assess how reliably persons and items can be hierarchically ordered from low to high levels of disability. A unique property of this model is specific objectivity, meaning that the estimation of item parameters is independent of the persons used (ie, person invariance), and that the estimation of the person parameters is independent of the particular items employed (ie, item invariance).18 Finally, for the Rasch model, missing data do not cause bias or lower the precision of disability measurements.

The aim of this study is to provide a single metric of disability using Rasch modelling with data drawn from the Survey of Health, Ageing and Retirement in Europe (SHARE) to be used for disability severity comparisons across groups or countries. In addition to ADL and IADL items, we incorporate mobility tasks in order to expand the construct, based on the cumulative evidence suggesting that mobility limitations are a precursor of disability in ADLs and IADLs and that they are less affected by floor effects.7 ,9 ,14 ,19 To the best of our knowledge, neither the precise severity level of aggregated ADL, IADL and mobility items has been estimated, nor has the ability of a single metric to separate persons with different levels of disability been established. We performed DIF analyses to examine whether the measures were invariant across age, gender, medical conditions, symptomatology and self-rated health. Finally, we adopted the method of known-groups validity to examine the gains in precision using interval-level measures relative to ordinal scores for discriminating between groups known to differ in disability.


Study design

Cross-sectional, population-based study.


Data were drawn from wave 4 (2010–2011) of SHARE including comparable data across 16 countries and involving 58 489 community-dwelling adults aged 50+. Representative samples from Austria, Belgium, the Czech Republic, Denmark, Estonia, France, Germany, Hungary, Italy, the Netherlands, Poland, Portugal, Slovenia, Spain, Sweden and Switzerland were obtained using probability samples. Methodological details of the survey are available elsewhere.20 ,21 We excluded participants aged under 50 years (n=1254), with missing information across all ADL/IADL/mobility items (n=339), or institutionalised (n=368), which resulted in a final sample of 56 528 participants. Calibrated sampling weights were used to adjust for the complex sampling design.


Disability is measured in SHARE by asking respondents whether they had ‘any difficulty’ (yes=1, no=0), because of a physical, mental, emotional or memory problem, in carrying out daily activities (ADLs, six items; IADLs, seven items) and functional limitations (10 Nagi-based questions). ADLs included bathing, dressing, eating, getting into/out of bed, using the toilet and walking across a room. IADLs included making meals, shopping, doing work around the house/garden, making telephone calls, using a map, medications and managing money. Mobility questions asked about kneeling, climbing one flight/several flights of stairs, walking 100 m, sitting for 2 hours, getting up from a chair, pulling large objects, lifting heavy weights, lifting hands above shoulders and picking up a small coin. The SHARE asked about any difficulty in physical functioning even with the help of assistive devices. No information about specific devices was gathered. Data were collected by the interviewer by means of Computer Assisted Personal Interviewing (CAPI). Showcards were used alongside CAPI.

Demographic and health variables: We included the following variables: (1) age, gender and years of education, using the UNESCO International Classification of Educational Degrees (ISCED-97); (2) self-reported illness diagnosed by a general practitioner (heart disease, hypertension, hypercholesterolaemia, stroke, diabetes, lung disease, asthma, arthritis, osteoporosis, cancer, ulcer, Parkinson disease, cataracts, hip fracture, other fractures, Alzheimer disease and benign tumour); (3) presence of long-term health problems that affect daily routines (yes/no); (4) self-reported physical symptoms (pain, angina or chest pain, breathlessness, persistent cough, swollen legs, sleeping problems, falling over and fear of falling, dizziness, stomach or intestine problems, incontinence and fatigue); and (5) self-rated health using a single question with answer categories ranging from 1=poor to 5=excellent.

Data analyses

Descriptive data

Demographic and health variables were examined using descriptive statistics. For subsequent analyses, we randomly split the sample into two subsamples: one for multigroup confirmatory factor analyses (MGCFA; n=28 788), and the other for Rasch-based analyses (n=27 740).

Multigroup confirmatory factor analysis

Before Rasch analysis was conducted, as recommended,22 tests of measurement invariance were performed to establish whether the general factor structure (configural invariance) and the factor loadings (metric invariance) were the same across countries. Once we tested that the goodness of fit of the unidimensional model in each country was adequate, we conducted two hierarchically nested invariance models with increasingly restrictive constraints. To estimate the parameters, we used the diagonally weighted least squares and the asymptotic covariance matrix. Model fit can be considered good with root mean square error of approximation (RMSEA) ≤0.05 and comparative fit index (CFI) >0.90. The comparison for nested models was based on ΔCFI≤0.01.23 High floor/ceiling effects in categorical data can produce attenuated estimates of the correlation among indicators, lead to ‘pseudofactors’ that are artefacts of extremeness, and produce incorrect test statistics and SEs. Therefore, we carried out the analysis excluding extreme scores. The final sample included 15 325 participants.

Rasch analysis

We adopted a parametric model (Rasch modelling for dichotomous responses) for this work because it was appropriate for our purposes and had several advantages: (1) person-free and item-free invariant parameters can be estimated, (2) interval-level measures that show how much (more or less) ability or difficulty exists between persons or items are provided and (3) the estimates of person and item parameters can be represented graphically on a common metric to easily examine the scale targeting, construct validity and predictive validity.

Fit to the Rasch model was evaluated by the mean square fit statistics (infit MnSq and outfit MnSq) and Rasch residual-based principal components analysis (PCA). Mean square fit statistics indicate how much misfit is revealed in the actual data. Infit is a weighted fit statistic in which relatively more impact is given to unexpected responses close to a person’s or item’s measure. Outfit is an unweighted statistic that gives more impact to unexpected responses far from a person’s or item’s measure. The expected value for MnSq is close to 1.0 with an accepted range of 0.6–1.4 for surveys. Values ≥2.0 indicate a severe misfit.24 In PCA, a strong measurement dimension for unidimensionality is achieved when the variance explained is >40%, and the eigenvalue of the first component of residuals is <2.0.25
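To make the distinction between the two fit statistics concrete, the following sketch computes infit and outfit mean squares for one person's dichotomous responses. It assumes the usual definitions (outfit as the mean squared standardised residual, infit as the information-weighted version); the item calibrations and response patterns are hypothetical, and this is not the WINSTEPS implementation.

```python
import math


def rasch_probability(theta, delta):
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def person_fit(responses, theta, deltas):
    """Return (infit MnSq, outfit MnSq) for one person's 0/1 responses."""
    sq_resid_sum = 0.0   # sum of squared residuals (x - P)^2
    var_sum = 0.0        # sum of model variances P(1 - P)
    z2_sum = 0.0         # sum of squared standardised residuals
    for x, delta in zip(responses, deltas):
        p = rasch_probability(theta, delta)
        variance = p * (1.0 - p)
        sq_resid = (x - p) ** 2
        sq_resid_sum += sq_resid
        var_sum += variance
        z2_sum += sq_resid / variance
    infit = sq_resid_sum / var_sum          # information-weighted
    outfit = z2_sum / len(responses)        # unweighted mean
    return infit, outfit


# Hypothetical item locations (logits) and a person located at 0 logits.
deltas = [-2.0, -1.0, 0.0, 1.0, 2.0]
expected_fit = person_fit([1, 1, 1, 0, 0], 0.0, deltas)    # model-consistent pattern
unexpected_fit = person_fit([0, 0, 1, 1, 1], 0.0, deltas)  # reversed pattern
```

The reversed pattern produces an outfit well above the 2.0 threshold, because its surprising responses occur on items far from the person's location.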

Reliability was estimated with the Rasch-based person reliability (PR) and the person separation (Gp). PR is more precise and less misleading than Cronbach’s α (KR-20) because (1) it provides a more detailed picture of the precision of measures, (2) statistics are estimated from linear measures and (3) it is not affected by extreme scores where error variance is the largest. Gp represents the scale's ability to separate the sample into different strata of disability (strata=(4Gp+1)/3). We also examined how precise the scale was at various ranges of the disability continuum to determine appropriate cut-off points by plotting the test information function (TIF) according to persons’ ability. TIF is defined as the reciprocal of the squared SE with which a person parameter is estimated, so score accuracy is high where SEs are low. PR≥0.70 (for group comparisons), Gp≥1.5, TIF≥4 and SE around 0.5 are desirable values.22 ,26
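The desirable values listed above are mutually consistent, which a short sketch can show. It assumes the standard Rasch relation between reliability and separation, G = sqrt(R / (1 − R)), which is not stated in the text, together with the strata formula given above and SE = 1 / sqrt(information); the function names are illustrative.

```python
import math


def separation_from_reliability(reliability: float) -> float:
    # Assumed standard relation: G = sqrt(R / (1 - R)).
    return math.sqrt(reliability / (1.0 - reliability))


def strata(separation: float) -> float:
    # Number of statistically distinct strata: (4G + 1) / 3.
    return (4.0 * separation + 1.0) / 3.0


def se_from_information(information: float) -> float:
    # SE of a person measure is the inverse square root of the information.
    return 1.0 / math.sqrt(information)


g_at_threshold = separation_from_reliability(0.70)  # about 1.53
strata_at_threshold = strata(1.5)                   # 7/3, about 2.33
se_at_threshold = se_from_information(4.0)          # 0.5
```

Under these relations, a reliability of 0.70 corresponds to a separation of roughly 1.5, and an information value of 4 corresponds to an SE of 0.5.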

The invariance of the item hierarchy across countries was evaluated by (1) intraclass correlation coefficients (ICCs) that indicated the overall agreement across the 16 countries and (2) a matrix of Spearman correlation coefficients that revealed the consistency between countries in the rank order of the item calibrations. Coefficients can be interpreted as follows: 0.6 or higher indicates moderate agreement; 0.7–0.8 indicates strong agreement and >0.8 indicates almost perfect agreement.27
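As an illustration of the rank-order comparison, the sketch below computes Spearman's coefficient for the item calibrations of two countries using only the standard library. The calibration values are hypothetical and are not taken from SHARE.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    # Spearman's rho is the Pearson correlation of the ranks.
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical calibrations (logits) for the same five items in two countries.
country_a = [-3.5, -1.0, 0.4, 2.1, 3.0]
country_b = [-3.2, -1.3, 0.9, 1.8, 3.3]
rho = spearman(country_a, country_b)  # identical rank order, so rho is 1.0
```

Because only ranks are compared, the coefficient reflects agreement in the ordering of items, not in the size of the gaps between them.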

The invariance of the item hierarchy across subgroups was examined with DIF analyses in five different groups: age (<75 vs 75+), gender (male vs female), medical conditions (none vs 1+; ≤1 vs 2+), physical symptoms (none vs 1+; ≤1 vs 2+) and self-rated health (excellent/very good/good vs fair/poor). We used the Mantel-Haenszel model (MH) and the DIF CONTRAST estimate that calculates the difference between the estimators of the item parameter of difficulty for each group. In large samples, differences higher than 0.64 and 0.50 logits for MH and DIF CONTRAST, respectively, and statistically significant (with Bonferroni correction), are considered substantial.24 ,28 To detect whether DIF may cause bias, we assessed its impact on the scale measures by examining differential test functioning.22 ,29 We estimated a Rasch model for each group separately and the expected score was plotted against the measured disability dimension using test characteristic curves (TCCs). The area between the curves reveals the magnitude of bias.15 ,30
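The two steps described above (item-level contrasts, then test-level impact) can be sketched as follows. This is a simplified illustration with hypothetical calibrations: it applies only the 0.50-logit size criterion and omits the Mantel-Haenszel statistic and the Bonferroni-corrected significance tests used in the study.

```python
import math


def rasch_probability(theta, delta):
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def dif_contrast(delta_group_a, delta_group_b):
    """Difference between the item difficulty estimates of two groups (logits)."""
    return delta_group_a - delta_group_b


def expected_score(theta, deltas):
    """Test characteristic curve: expected raw score at a given theta."""
    return sum(rasch_probability(theta, d) for d in deltas)


def area_between_tccs(deltas_a, deltas_b, lo=-6.0, hi=6.0, steps=240):
    """Approximate the unsigned area between two TCCs by the trapezoid rule."""
    width = (hi - lo) / steps
    gaps = [abs(expected_score(lo + k * width, deltas_a)
                - expected_score(lo + k * width, deltas_b))
            for k in range(steps + 1)]
    return width * (sum(gaps) - 0.5 * (gaps[0] + gaps[-1]))


# Hypothetical calibrations for three items in two groups.
younger = [-1.0, 0.0, 1.0]
older = [-1.0, 0.6, 0.4]          # items 2 and 3 shift in opposite directions
contrasts = [dif_contrast(a, b) for a, b in zip(younger, older)]
flagged = [abs(c) > 0.50 for c in contrasts]   # size criterion from the text
area_cancelling = area_between_tccs(younger, older)
# For comparison: the same two items shifting in the same direction.
area_same_direction = area_between_tccs(younger, [-1.0, 0.6, 1.6])
```

In this example two items exceed the contrast threshold, yet the area between the curves stays small because the shifts largely cancel, which is the situation in which item-level DIF does not translate into biased scale measures.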

Relative precision

The relative precision (RP) method was used to compare the performance of interval-level measures and summative scores in distinguishing disability severity levels among persons with different medical conditions. RP indicates how much more or less precise Rasch-based scores are relative to the ordinal scores. RP is calculated as the ratio of pairwise F statistics (the interval-level measure F statistic divided by the ordinal score F statistic).
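A minimal sketch of the RP calculation follows. It assumes an unweighted one-way F statistic for two groups (the study itself used general linear models in SPSS with sampling weights) and assumes that a percentage gain is expressed as (RP − 1) × 100; the function names and figures are illustrative only.

```python
def f_statistic_two_groups(group_a, group_b):
    """One-way ANOVA F statistic for two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    grand = (sum(group_a) + sum(group_b)) / (n_a + n_b)
    ss_between = n_a * (mean_a - grand) ** 2 + n_b * (mean_b - grand) ** 2
    ss_within = (sum((x - mean_a) ** 2 for x in group_a)
                 + sum((x - mean_b) ** 2 for x in group_b))
    df_between, df_within = 1, n_a + n_b - 2
    return (ss_between / df_between) / (ss_within / df_within)


def relative_precision(f_interval, f_ordinal):
    # Ratio of the interval-level F statistic to the ordinal-score F statistic.
    return f_interval / f_ordinal


def gain_percent(rp):
    # Assumed convention: RP of 1.5 corresponds to a 50% gain.
    return (rp - 1.0) * 100.0


f_example = f_statistic_two_groups([1, 2, 3], [3, 4, 5])   # equals 6.0
gain_example = gain_percent(relative_precision(30.0, 20.0))  # 50% gain
```

In practice the same two groups would be compared twice, once on Rasch measures and once on summative scores, and the two F statistics passed to `relative_precision`.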

Descriptive analyses and general linear models were conducted with SPSS V.21, MGCFA with LISREL V.8.80 and Rasch analyses with WINSTEPS V.3.70.


Demographic data

Table 1 shows the basic characteristics of participants in each country. The average age ranged from 64.5 to 69.2 years, with women representing ∼55% of the sample within each country. Although in the majority of the countries more than half of the respondents reported having long-term illness and approximately two chronic conditions and physical symptoms, their self-rated health was good.

Table 1

Demographic and health variables of participants aged 50+ in SHARE wave 4 (2010/11) by country

MGCFA analyses

As shown in table 2, the unidimensional solution showed a good model fit (RMSEA from 0.039 to 0.057) in all countries. All factor loadings were statistically significant (p<0.01) and salient. The subsequent configural and metric models showed good fit to the data and the restrictions imposed did not result in a significant drop in model fit.

Table 2

Goodness of fit indices for measurement invariance model comparisons across 16 countries

Rasch analyses

Fit of persons and items to the Rasch model: As recommended,31 the most misfitting persons (outfit MnSq>2.0) were removed because their inclusion distorted the person parameter estimates. We followed an iterative process by first removing the individuals with the highest outfit (MnSq=9.90, mainly as a result of unexpected responses by low and high disabled persons), and then by examining person estimates in each step. Separation and person reliability reached their highest values after excluding 1258 respondents. We did not find a pattern in the sociodemographic variables, health variables or across countries for those persons with idiosyncratic responses. The final sample included 26 482 respondents, including a low percentage of misfitting persons (2.8% with outfit MnSq ranging from 2.0 to 3.77). Statistics indicated a good model data fit for persons (mean infit MnSq=1.00, SD=0.31; mean outfit MnSq=0.71, SD=0.42) and for items (mean infit MnSq=0.98, SD=0.14; mean outfit MnSq=0.74, SD=0.42). The infit and outfit statistics for all the items were in an appropriate range. The low outfit MnSq (<0.60) statistics in ADL/IADL items indicated that they were too predictable. This overfit had no practical implications, except in situations of shortening scales, because these items did not degrade the measure. The PCA showed that the scale met the criterion for essential unidimensionality (44% of explained variance and eigenvalue of 1.7). Logits were transformed into more meaningful values from 0 (no disability) to 100 (highest disability; table 3).

Table 3

Normative measures for the disability scale across countries
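The text does not specify how logits were converted to the 0–100 metric. A linear rescaling between a chosen minimum and maximum logit is one common option and is assumed in the sketch below; the bounds of −6 and 6 logits are hypothetical.

```python
def rescale_to_0_100(logit, min_logit, max_logit):
    """Linearly map a logit onto a 0-100 scale (assumed transformation)."""
    return 100.0 * (logit - min_logit) / (max_logit - min_logit)


# Hypothetical bounds; the study's actual anchoring values are not reported here.
low = rescale_to_0_100(-6.0, -6.0, 6.0)    # 0.0
mid = rescale_to_0_100(0.0, -6.0, 6.0)     # 50.0
high = rescale_to_0_100(6.0, -6.0, 6.0)    # 100.0
```

A linear transformation of this kind preserves the interval property: equal differences in logits remain equal differences on the 0–100 scale.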

Person–item targeting and item hierarchy: The item locations ranged from 3.06 logits for the easiest task (taking medicines) to −3.56 logits for the most challenging tasks (stooping, kneeling, crouching), indicating an adequate spread of disability levels (see table 4). The mean level of disability among participants (θ=−2.77 logits) was lower than the average level of item difficulty (δ=0), indicating that the scale was ‘slightly off target’ (2<|θ−δ|<3) for the sample.18 Thus, items that spread outside the range of persons did not contribute much to the measurement. The person–item map (figure 1) showed that the easiest tasks (eg, eating, taking medicines) were off-target even for persons located at or close to the average level of persons. This indicated that better targeted items at the lower end of the scale were needed to adequately measure persons with the lowest disability levels. The addition of mobility tasks to ADLs and IADLs in a single metric yielded a lower percentage of persons with zero scores (floor effect=48.5%) than that resulting from separate scales (see table 1).

Table 4

Fit statistics and hierarchy of the disability items

Figure 1

Hierarchical structure of the disability scale. The person–item map displays the joint locations of person disability measures (left side) and item difficulty calibrations (right side). In the left column, the more disabled participants are located near the top of the figure (positive values), and the less disabled at the bottom (negative values). In the right column, the items difficult to endorse (easiest tasks) are located near the top of the map. Continuous lines with labels represent limits for levels of disability according to the reliability indices and the test information function, as described in the section on reliability of scores. The M and S on the vertical line between the two columns refer to mean and SD (S=1 SD, T=2 SD) statistics for persons and items measured in logits. According to the general formula, the probability of endorsing any item can be calculated by using the item difficulty (δ) and person ability estimates (θ). Thus, a respondent with the average ability of the sample (θ=−2.77, raw score=4) has a 69% probability of endorsing the item ‘stooping, kneeling or crouching’, whereas for the same person the probability of endorsing the item ‘preparing a hot meal’ is 1%. When the ability-difficulty difference |θ−δ| reaches 3 logits, the items are said to be ‘rather off-target’.24 ADL, activities of daily living; IADL, instrumental activities of daily living; MOB, mobility.
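As a check on the probabilities quoted in the caption, and assuming the logistic form of the Rasch model, the item locations implied by the two percentages can be recovered from the log-odds. The resulting δ values are implied by the arithmetic and are not figures quoted in the caption.

```latex
% Solving the Rasch model for the item location:
\delta_i = \theta_s - \ln\frac{P}{1-P}

% 'Stooping, kneeling or crouching', P = 0.69, \theta_s = -2.77:
\delta \approx -2.77 - \ln\frac{0.69}{0.31} \approx -2.77 - 0.80 = -3.57

% 'Preparing a hot meal', P = 0.01, \theta_s = -2.77:
\delta \approx -2.77 - \ln\frac{0.01}{0.99} \approx -2.77 + 4.60 = 1.83
```

The first value agrees in magnitude, to rounding, with the 3.56 logits given for ‘stooping, kneeling, crouching’ in the targeting paragraph, and its negative sign places that item at the easy-to-endorse (most challenging task) end of the map.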

Regarding the hierarchy of functional decline, mobility tasks were, as expected, more challenging than IADLs and ADLs. However, IADLs were not clearly more challenging than ADLs. Specifically, some ADLs were more challenging (eg, ‘dressing’ or ‘bathing’) than some IADLs (eg, ‘managing money’ or ‘preparing a hot meal’). Similarly, item location estimates for apparently similar activities (eg, ‘walking 100 m’ and ‘walking across a room’) were markedly different (−1.05 and 2.21 logits, respectively).

The rank ordering of the item difficulties was similar for all countries (Spearman correlation coefficients ranged from 0.88 to 0.99; table 5). The ICC for agreement in item hierarchy across all countries was high (ICC=0.94, 95% CI 0.90 to 0.97, p<0.001). Therefore, the scale demonstrated strong invariance of item hierarchy despite the environmental and cultural differences across countries.

Table 5

Spearman’s correlation coefficients of the single metric item calibrations across countries

Additionally, specific objectivity (generalisability) was empirically tested by randomly splitting the sample (n=13 870), calculating the difficulty estimates of the items, and conducting a linear regression analysis between the measures. The expected values for a perfect fit are 1, 0 and 1 for the correlation value, the intercept and the slope estimate, respectively. We found values of 0.997, 0.024 and 0.991, respectively, thus confirming specific objectivity.
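The split-half check can be sketched as an ordinary least-squares regression of one set of item estimates on the other. The calibration values below are hypothetical; only the procedure follows the description above.

```python
def simple_linear_regression(x, y):
    """Return (correlation, intercept, slope) for y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    correlation = sxy / (sxx * syy) ** 0.5
    return correlation, intercept, slope


# Hypothetical item difficulty estimates (logits) from two random half-samples.
half_1 = [-3.5, -1.1, 0.2, 2.0, 3.1]
half_2 = [-3.4, -1.2, 0.3, 2.1, 3.0]
r, intercept, slope = simple_linear_regression(half_1, half_2)
```

Values of r and the slope close to 1, with an intercept close to 0, indicate that both half-samples yield essentially the same item calibrations.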

Reliability: As is shown in table 6, the reliability of the person ability estimates is 0.74 (person separation=1.70). Therefore, the scale was able to separate persons in two (nearly three) levels of disability.24 This corroborates, in part, the aforementioned targeting problem regarding the person–item map. Visual analysis of TIF (figure 2) revealed that the score precision drops substantively as the scores approach the higher and lower ends. Thus, a cut-off of 11 (raw score) was the most appropriate to distinguish among disabled persons with low or high disability. Tentatively, cut-offs of 8 and 15 (raw score) could be used for low (1–8), moderate (9–14) and high (15+) levels of disability (see also figure 1).

Table 6

Reliability statistics for the single metric, and separate scales of self-care activities, instrumental activities and mobility limitations

Figure 2

Test information function representing how well each (dis)ability level is being estimated with the scale. The amount of information is maximum at the person ability location of 0 logits (raw score 11), and about 3.15 for the locations of ±1.5 logits (raw score=8 and 15, respectively). (Dis)ability cannot be estimated with precision when outside of this range.
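For dichotomous Rasch items, test information at a given location is the sum of P(1 − P) over items, and the SE of a person measure is its inverse square root. The sketch below assumes this standard form and uses hypothetical item calibrations, so it reproduces the shape of the curve and not the values in figure 2.

```python
import math


def rasch_probability(theta, delta):
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def information_at(theta, deltas):
    """Sum of item informations P(1 - P) at a given person location."""
    return sum(p * (1.0 - p)
               for p in (rasch_probability(theta, d) for d in deltas))


def standard_error(theta, deltas):
    # SE of the person measure is 1 / sqrt(information).
    return 1.0 / math.sqrt(information_at(theta, deltas))


# Hypothetical calibrations, symmetric around 0 logits (not the study's values).
deltas = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
info_centre = information_at(0.0, deltas)
info_edge = information_at(3.0, deltas)
```

Information peaks where item locations are densest and falls away towards the extremes, which is why precision drops for persons far from the centre of the item distribution.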

In contrast, ADL and IADL scores from separate scales showed insufficient reliability; person reliability, person separation and TIF indicated that these scores were not able to separate two distinct strata of persons with disability. SEs were about twice the desired value of 0.5. Gp≤1, and person reliability <0.50, imply that more than 50% of the differences between measures are due to measurement error.24 Mobility scores showed slightly better results. From an epidemiological point of view, this finding suggests that, statistically, cut-off scores such as ADL 1+ and IADL 1+ adequately represent the boundary between ‘non-disabled’ and ‘disabled’ persons, but additional cut-off scores are not appropriate.

Differential item functioning: DIF was found in four items as a function of age. Difficulty estimates were significantly greater for the younger respondents compared with the older respondents (75+) on ‘sitting 2 hours’ and ‘getting in/out of bed’, while ‘shopping’ and ‘managing money’ were more difficult for the older respondents compared with the younger respondents. Across gender, ‘lifting over 5 kilos’ showed a higher difficulty estimate for males, while ‘dressing’ and ‘preparing hot meals’ showed a higher difficulty estimate for females. No further DIF was found. TCCs for age and gender groups revealed that their expected and observed scores matched almost perfectly, indicating that items displaying DIF were not causing bias.

Relative precision: As can be seen in table 7, interval measures produced gains in RP in all of the medical conditions (above 50% in 9 out of 16 comparisons). Specifically, Rasch-based measures were two times more effective than summative scores for detecting differences in disability in persons ‘diagnosed vs non-diagnosed’ as having osteoporosis or benign tumour. Interval measures were also ∼70% better at discriminating between diagnosed and non-diagnosed hypertension, cholesterol, asthma or arthritis. Low gains were observed for medical conditions such as Alzheimer disease, Parkinson and hip fracture.

Table 7

Comparisons of the RP values of the two scoring methods for discriminating between groups differing in disability severity levels across medical conditions


Principal findings

Our study presents a hierarchical scale with equal-interval measures and person-invariant and item-invariant properties to measure disability severity in community-dwelling adults and older adults. We provide strong evidence regarding the hierarchical structure of functional disability, independent of country, age, gender, medical conditions, symptomatology and self-rated health.

Fit statistics, PCA and invariance analyses showed that the single metric of disability achieved the requirements of a strong hierarchical scale. Our findings support previous studies suggesting that ADL, IADL and mobility items contribute to a unidimensional construct of disability.14 ,15 ,32 In addition, the property of specific objectivity facilitates the generalisability of results. In this regard, we aim to address the most recent calls from public health studies33 for composite measures of disability that permit accurate comparisons of functional status across and within countries.

Differential item functioning

Our findings coincide with research showing DIF by age and gender.9 ,12 However, we did not find evidence of bias.15 It is important to note that our results are not completely comparable to previous studies that examined the ‘need for help’ instead of the ‘difficulty with’ daily activities. Plausibly, the ‘need for help’ is more dependent on social network availability, gender roles and culture, among other variables; hence, the existence of DIF can be expected. Furthermore, we also demonstrated that the scale was not biased by medical conditions, symptomatology and self-reported health. Therefore, researchers can use it confidently for comparisons of disability in adults and older adults with a wide variety of health conditions. This is an important contribution because previous studies have only focused on age and gender, and the impact of health-related variables has not been addressed.

We examined DIF in heterogeneous groups according to the number, but not the type, of self-reported diseases and symptomatology, and therefore did not explore the risk of bias associated with specific diseases or symptoms when performing different activities. Previously, a cross-cultural adaptation of the Functional Independence Measure (FIM) for patients with stroke showed that different calibrations for several items were necessary.34 Thus, future cross-cultural studies could assess DIF across subpopulations with specific medical conditions and settings in order to ensure the comparability of disability measures.

Contextual and environmental factors can also affect the calibration of items and distort outcome measures. As has been previously stated,35 differences in the estimates of disability are caused by theoretical perspectives, methodological issues (eg, wording or response categories) and environmental factors. The ‘difficulty with’ or the ‘need for help’ with specific activities may be largely mediated/moderated by environmental variables. In practical terms, it is possible that calibrations (δ) for some daily activities can change in different geographical or cultural contexts (eg, ‘dressing’ is probably more challenging in Finland than in Bora Bora). Other factors affecting the estimates of disability are related to the availability of personal and social resources (income, spouse, education, etc), or even the use of assistive devices,35 which is an issue that should be investigated in the future. Additionally, the analysis of DIF within IRT is a useful mechanism to evaluate the impact of these factors on disability estimates and make the appropriate adjustments (ie, different calibrations).

Relative precision

We demonstrated gains in RP for comparisons of disability severity using interval measures (average gain of 58%) in all of the medical conditions. These gains have occurred mainly through greater differences between groups in scores at the lower extreme of the distribution, where the relationship between raw scores and Rasch measures is non-linear (as at the upper end). This is an important issue because large survey population studies have to face the challenge of comparing groups/countries with low and/or similar disability levels. Rasch measures and summative scores showed similar precision when comparing diagnosed and non-diagnosed participants with Alzheimer disease, Parkinson disease and hip fracture. Neuropsychological diseases and fractures produce severe disability levels involving instrumental and self-care activities of daily living. Mean scores of disability of participants diagnosed with these medical conditions indicate that they are located near the middle of the distribution (eg, mean=10.89 for Alzheimer disease), where the relationship between raw scores and Rasch measures is linear. In these conditions, parametric analyses conducted with raw scores may yield an accurate comparison of groups. Although we have not measured change in scores over time, the advantages in the precision of Rasch measures are also applicable in longitudinal design studies.36

Scale targeting and hierarchical structure

Despite the aforementioned positive findings, there are some issues that cause concern. The first one is related to construct under-representation. The item–person map revealed that the scale is better targeted at more disabled people than at those less disabled. Paradoxically, epidemiological studies attempt to target relatively healthy respondents (at the low end of the distribution) in order to better plan health, social and long-term care services. Off-target scales negatively affect the precision of the item estimates, do not make for an efficient measurement and do not provide enough information along the desired population range.24 The expected positive effect of adding mobility limitations to our metric in order to expand the construct may have been cancelled out by the inclusion of relatively healthy adults aged 50+. Previous authors have demonstrated that the dimensionality of ADL/IADL items could vary depending on whether disabled or non-disabled people were included in the analysis.9 Therefore, we carried out an additional analysis, selecting persons aged 65 years and over (n=14 339; results not shown but available on request), to observe the impact on reliability and targeting. We found a reduction in the floor effect (from 48.5% to 35%), but a similar reliability (PR=0.77, separation=1.86) and targeting (mean person score=−2.26), which represented a non-significant improvement. Attempts to expand the construct of disability in a single metric for a community-dwelling population should include mental health functions, more infrequent and demanding tasks, physical performance measures, sensory and communication limitations, as well as pain, fatigue and tiredness35 to better target the general population.
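For readers less familiar with the separation statistics reported above, the following sketch shows the standard algebraic relationship between person separation reliability (R) and the person separation index (G), namely G = sqrt(R / (1 − R)) and R = G² / (1 + G²). The function names are illustrative, and this is not the software used for the analyses.

```python
import math

def separation_from_reliability(reliability):
    """Person separation index G implied by a separation reliability R."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must lie in [0, 1)")
    return math.sqrt(reliability / (1.0 - reliability))

def reliability_from_separation(separation):
    """Separation reliability R implied by a separation index G."""
    if separation < 0.0:
        raise ValueError("separation must be non-negative")
    return separation ** 2 / (1.0 + separation ** 2)

# Convert a few reliability values into separation indices.
for r in (0.26, 0.36, 0.77):
    print(f"R = {r:.2f} -> G = {separation_from_reliability(r):.2f}")
```

Because G grows slowly as R increases in the lower range, scales with reliabilities well below 0.5 have separation indices below 1, meaning that the spread of person measures is smaller than their measurement error.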

Another aspect to which some thought must be given is related to the hierarchical structure of disability. It has been widely accepted that ADL/IADL items can be ordered by the complexity of the neuropsychological organisation involved, with the decline in IADLs and ambulation preceding that in ADLs.15 ,16 ,37 ,38 In contrast, we provide additional evidence supporting a blurred hierarchical structure of functional decline when ADLs and IADLs are combined.9 ,14 ,16 ,39–41 We found a disordered hierarchy among activities of moderate difficulty,9 as well as among easy activities such as ‘toileting’ (δ=2.06) and ‘taking medications’ (δ=3.06). As has been suggested,4 the relative overlap of ADL and IADL items in aggregated scales may reflect different disability profiles resulting from the interaction of multiple factors, and therefore the purported strict hierarchy is only achieved in terms of general dimensions instead of specific activities or tasks. Studies with more homogeneous samples, for example, with specific chronic diseases or physical impairments, may reveal the existence of different formal hierarchies.


Although the reliability of scores of the single metric was adequate, we found that the very low reliability of ADL and IADL scores (as separate scales) had important consequences for the measurement of disability. For example, low reliability attenuates effect sizes and increases the chance of type II errors. As a consequence, researchers may not find the expected differences across groups or some results could be misleading. The discrepancies observed in table 6 between Cronbach’s α (0.78 for ADLs and IADLs) and the Rasch reliability (PR=0.26 and 0.36, respectively) reflect the negative impact of several factors on the classical approach to reliability. All of these factors are present in the ADL and IADL scales: low number of items, skewed distributions, marked floor effect, low TIF and high SEs. If the measurement requirements are violated, coefficient α yields spuriously high estimates of reliability that do not reveal the poor metric quality of the scores.42
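The attenuation of effect sizes can be sketched with the classical test theory result that an observed standardised mean difference equals the true difference multiplied by the square root of the score reliability. The true effect size of 0.50 used below is a hypothetical value chosen for illustration; the reliabilities are the ones quoted in this paragraph.

```python
import math

def attenuated_effect_size(true_d, reliability):
    """Observed standardised mean difference given the outcome's reliability."""
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must lie in (0, 1]")
    return true_d * math.sqrt(reliability)

TRUE_D = 0.50  # hypothetical true effect size, for illustration only

scenarios = [
    ("Cronbach's alpha, ADL/IADL", 0.78),
    ("Rasch reliability, ADL", 0.26),
    ("Rasch reliability, IADL", 0.36),
]

for label, rel in scenarios:
    observed = attenuated_effect_size(TRUE_D, rel)
    print(f"{label}: reliability = {rel:.2f}, expected observed d = {observed:.2f}")
```

If the Rasch reliabilities are the more realistic estimates, a researcher relying on coefficient α would overestimate how much of a true group difference the separate scales can be expected to recover.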

The alternative non-parametric approach

While we addressed the issue of cross-cultural validity within the framework of a highly restrictive parametric model, non-parametric IRT models (eg, Mokken scaling) have been successfully applied to evaluate the measurement invariance of disability scales.3 ,14 ,43 Non-parametric models relax some of the strong assumptions of measurement that are required for Rasch analysis. This can lead to more general conclusions and is more conservative; for example, when researchers are interested in retaining more items from a pool yielding higher reliability and better coverage of the latent trait.44 For this reason, Mokken has been widely used for scale development and psychometric studies of scales with a small number of items. Moreover, Mokken yields ordinal-level measures that can be enough to order items, persons or both in most cases, especially when persons are performing at or near the midpoint of the range of the scale, or can also be used to determine whether change in an individual’s health status has occurred. In contrast, interval-level measures allow estimates of how much more (or less) change has occurred, produce gains in precision over ordinal scores in discriminating between groups, and are ideally suited for studying longitudinal change.18 ,30 ,36 ,45 The conjoint representation of persons and items on a common metric in Rasch modelling provides an easy evaluation of the reliability of scores (by means of item targeting), the construct validity (by means of the item-difficulty hierarchy) and the predictive validity (by means of the person-ability hierarchy).18 ,24 ,30

Final recommendations

Our findings raise an important question regarding the choice of scales: is it better to use a single metric of disability instead of separate ADL, IADL and mobility scales? If a researcher aims to estimate the prevalence of disability using the traditional cut-off score ADL 1+, IADL 1+ or mobility 1+, and wants to report findings based on descriptive and non-parametric statistics, then separate scales can be adequate. In this case, each scale could even be replaced by a single question (with a binary response format), including all the activities that the respondent might have difficulty with. Alternatively, difficulties in daily activities and functional limitations can be summed, and the aforementioned statistics can also be performed. However, researchers have to face several related issues: (1) the well-known problem of construct under-representation even with aggregated ADL/IADL scales, (2) the presence of large floor effects (around 80–90%) that seriously threaten construct validity and (3) the inability of the scales to statistically separate persons with different levels of disability, which implies that additional cut-offs are not supported empirically. Moreover, some ADLs (eg, dressing) are more challenging than some IADLs (eg, preparing a hot meal), so the inferences regarding the hierarchy of functional disability of respondents can be misleading.

Finally, we recommend using interval-level measures from the single metric in situations where researchers are interested in (1) comparing disability severity using summative scores for parametric statistics, especially with markedly skewed distributions or expected minimal differences between groups, or (2) estimating change scores in longitudinal studies. In this way, researchers can be reasonably confident that any of the differences in disability detected between countries, age groups, gender, medical conditions, symptomatology and self-rated health are likely to be true differences. Furthermore, the availability of interval measures to conduct parametric statistical analysis without violating fundamental measurement requirements represents a promising field to explore the relationship between disability and a wide range of linear measures in health sciences (eg, blood pressure, cholesterol, C reactive protein, grip strength, etc).



  • Contributors JB conceived the study and created the data set from SHARE wave 4. JB and MC-R performed analyses and wrote the paper.

  • Funding The SHARE data collection has been primarily funded by the European Commission, through FP5 (QLK6-CT-2001-00360), FP6 (SHARE-I3: RII-CT-2006-062193, COMPARE: CIT5-CT-2005-028857, SHARELIFE: CIT4-CT-2006-028812) and FP7 (SHARE-PREP: N°211909, SHARE-LEAP: N°227822, SHARE M4: N°261982). Additional funding from the German Ministry of Education and Research, the U.S. National Institute on Aging (U01_AG09740-13S2, P01_AG005842, P01_AG08291, P30_AG12815, R21_AG025169, Y1-AG-4553-01, IAG_BSR06-11, OGHA_04-064) and from various national funding sources is gratefully acknowledged. JB and MC-R are independent from the SHARE funding organisations.

  • Competing interests None declared.

  • Ethics approval SHARE has been approved by the Ethics Committee of the University of Mannheim and the Ethics Council of the Max-Planck-Society for the Advancement of Science.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement No additional data are available.
