Introduction

Until recently, there has been a lack of standardised patient-reported outcome measures for young, active patients with hip and groin disability [16]. The Copenhagen Hip and Groin Outcome Score (HAGOS) was published in 2011 [15], and the international Hip Outcome Tool (iHOT12) was published in 2012 [4]. The iHOT12 has been cross-culturally adapted and validated into a Swedish version, the iHOT12-S [7]. The HAGOS is based on and designed in a manner similar to the Knee injury and Osteoarthritis Outcome Score (KOOS) and comprises 37 items in six subscales: symptoms (7 items), pain (10 items), function in daily living (5 items), function in sport and recreation (8 items), participation in physical activities (2 items) and hip- and/or groin-related quality of life (5 items).

Health-related patient-reported outcomes (HR-PROs) are widely used to evaluate the effectiveness of treatment or to compare different interventions in clinical trials. They are questionnaires completed by patients to measure perceptions of their general health or their health in relation to a specific illness or condition. Before an HR-PRO can be used for research or in a clinical setting, it must be standardised, validated and tested for reliability [3]. In 2010, the Consensus-Based Standards for the Selection of Health Status Measurement Instruments (COSMIN) published a checklist which could be used to develop and evaluate HR-PROs [13, 15]. The checklist is designed to be used as a guide in the development of HR-PROs and to evaluate the quality of studies measuring the properties of HR-PROs.

The purpose of this study was to cross-culturally adapt and validate the Swedish version of the HAGOS, in accordance with the COSMIN checklist.

Materials and methods

The adaptation of the HAGOS to Swedish was performed in several steps, as proposed by Beaton et al. [1].

The original version was translated into Swedish by three of the authors (two orthopaedic surgeons and one physiotherapist) who are fluent in Swedish, well acquainted with the Danish language and experienced in working with patients with hip and groin disability. The three translations were then synthesised into a Swedish version by an expert panel of three orthopaedic surgeons and one physiotherapist. The synthesised version, the result of consensus among the panel, was back-translated into Danish by a native Danish-speaking person, and the translation was subsequently compared with the original version by the same panel. Minor differences between the original and back-translated versions were resolved by consensus among the panel.

A pilot test to check the acceptability of the synthesised version was performed on 10 healthy individuals without any history of hip or groin problems. They were encouraged to make comments with their answers. This was done to ensure that the questions would not be experienced as obtrusive and that non-health care professionals could understand the questions. After the pilot test, minor modifications were made to the synthesised translation, according to consensus among the panel, which mainly involved replacing professional words with more lay terms. Face validity, the degree to which the instrument looks as though it adequately reflects the measured construct [9], was deemed acceptable according to consensus among the expert panel.

The reliability, validity and responsiveness of the final version, the HAGOS-S, were assessed according to the COSMIN checklist [10] in a clinical study. Five hundred and two patients requiring hip arthroscopy for femoro-acetabular impingement (FAI) based on radiological and clinical criteria completed the questionnaire on their first visit to an experienced hip surgeon. Only those patients requiring hip arthroscopy for FAI were included. At the time of the study, 360 patients (92 % response rate) had completed the questionnaires at 4 months post-operatively. A group of 26 patients completed the HAGOS-S pre-operatively on two separate occasions within 3 weeks for test–retest reliability.

All the patients evaluated their overall hip function on a global perceived effect (GPE) visual analogue scale (VAS) from 0 (extremely poor hip function) to 100 (perfect hip function). A change of 20 points or more on the GPE scale was regarded as representing a clinically relevant change in patient symptoms [3, 5, 10]. For inclusion in the test–retest evaluation, a patient’s condition had to be regarded as clinically stable between the two test occasions. It was therefore decided a priori that only patients with a change of fewer than 20 points between test and retest on the GPE VAS could be included in this analysis.

The patients completed the Swedish versions of the EQ-5D [5] and the iHOT12-S [7] to be correlated with the HAGOS-S for construct validity. Their physical activity level was assessed with the Hip Sports Activity Scale (HSAS) [11]. The patients were also asked to use the HSAS to estimate their physical activity level when they were teenagers and before their symptom debut.

The study was approved by the Regional Ethical Review Board, Gothenburg, Sweden, ID: 472-10. All patients gave their informed consent.

Statistical analysis

Statistical analysis was performed using the Statistical Package for the Social Sciences (SPSS) version 21. Most data were ordinal, so nonparametric statistics were used. The level of significance was set at p < 0.05. The questionnaires were web based, leaving the patients no option but to answer all the questions. As a result, no individual items were missing.

Reliability

The reliability of an HR-PRO is the degree to which it is free from measurement error [9]. To evaluate the reliability of an HR-PRO, its internal consistency, test–retest reliability and measurement error must be assessed.

Internal consistency is the degree of interrelatedness between the items [9]. Internal consistency was measured for the six subscales of the HAGOS-S from the baseline values and was deemed good if Cronbach’s alpha was between 0.70 and 0.95 [14].
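As an illustration of the internal consistency criterion above, the following is a minimal sketch of Cronbach’s alpha for one subscale. It is not the SPSS procedure used in the study; the function names and the small data set are hypothetical, and sample variances (n − 1 denominator) are assumed.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for one subscale.

    item_scores: list of respondents, each a list of k item scores
    (hypothetical input layout, one row per patient).
    """
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    # Variance of each item across respondents.
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of the summed subscale score across respondents.
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def within_good_range(alpha, low=0.70, high=0.95):
    """Criterion stated in the text: alpha between 0.70 and 0.95."""
    return low <= alpha <= high


# Hypothetical two-item example with three respondents.
example = [[1, 2], [2, 1], [3, 3]]
alpha = cronbach_alpha(example)
```

With real data, each HAGOS-S subscale would be passed separately, since alpha is reported per subscale rather than for the whole questionnaire.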

Test–retest reliability is defined as the proportion of the total variance in the measurements which is due to true differences between patients [9]. The intraclass correlation coefficient (ICC 3,1; two-way mixed-effects model, absolute agreement) was calculated for each of the six HAGOS-S subscales. An ICC of >0.70 was deemed acceptable [14]. A Wilcoxon signed-rank test for paired data was performed to assess whether there were significant differences in scores between the test occasions.
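The ICC can be sketched from the mean squares of a two-way ANOVA. The example below uses the usual single-measure, absolute-agreement formula and is only an illustration under that assumption; the study itself obtained the ICC from SPSS, and the function name and data are hypothetical.

```python
def icc_absolute_agreement(ratings):
    """Single-measure, absolute-agreement ICC from two-way ANOVA mean squares.

    ratings: list of n patients, each a list of k scores
    (here k = 2 would be test and retest).
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for patients (rows), occasions (columns) and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # A systematic shift between occasions (ms_cols) lowers absolute agreement.
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical data: every retest score is exactly one point higher.
shifted = [[1, 2], [2, 3], [3, 4]]
icc = icc_absolute_agreement(shifted)
```

The shifted example shows why absolute agreement is the stricter choice for test–retest data: the ranking of patients is identical on both occasions, yet the coefficient falls below 1 because the scores themselves differ.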

Measurement error is the systematic and random error of the score, not attributed to the construct that is being measured [9]. Measurement error was expressed as the standard error of measurement (SEM) using the formula SEM = SD × √(1 − ICC), with SD as the standard deviation of scores from all patients at baseline [17]. The smallest detectable change (SDC), a change in a score that exceeds the measurement error, was calculated at individual level as SEM × 1.96 × √2 and at group level as SEM × 1.96 × √2/√n [2].
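The SEM and SDC calculations described above can be written out as a short sketch. The function names and the input values (SD = 10, ICC = 0.84) are hypothetical and chosen only to make the arithmetic easy to follow; n = 26 matches the size of the test–retest group.

```python
from math import sqrt


def sem(sd, icc):
    """Measurement error: SD multiplied by the square root of (1 - ICC)."""
    return sd * sqrt(1 - icc)


def sdc_individual(sem_value):
    """Smallest detectable change for a single patient: SEM x 1.96 x sqrt(2)."""
    return sem_value * 1.96 * sqrt(2)


def sdc_group(sem_value, n):
    """Smallest detectable change for a group of n patients."""
    return sdc_individual(sem_value) / sqrt(n)


# Hypothetical subscale with SD = 10 and ICC = 0.84.
example_sem = sem(10, 0.84)                     # about 4.0
example_sdc_ind = sdc_individual(example_sem)   # about 11.1
example_sdc_grp = sdc_group(example_sem, 26)    # about 2.2
```

Because the group-level value is the individual-level value divided by √n, the two always differ by a fixed factor for a given sample size (about 5.1 for n = 26).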

Validity

Construct validity is the degree to which the scores of a PRO instrument are consistent with a priori hypotheses, based on the assumption that the instrument validly measures the construct that is going to be measured [9]. A principal component factor analysis with varimax rotation and the eigenvalue set at >1 was performed to assess the structural validity of each of the six HAGOS-S subscales. The factor analysis presents the eigenvalue and the variance explained in per cent to indicate the relative strength of the factor. Hypothesis testing was performed using Spearman’s correlation coefficient for nonparametric data, comparing the scores from the HAGOS-S with the EQ-5D-S and iHOT12-S scores.

A priori hypotheses were formulated. With the HAGOS and iHOT12 developed for similar patient groups and measuring essentially the same constructs, we expected high correlations (Spearman r > 0.50) between the six HAGOS-S subscales and the iHOT12-S average score. We expected a moderate correlation (Spearman r > 0.30) between the subscales of the HAGOS-S and the subscales of the EQ-5D-S, but a higher correlation was expected between the HAGOS-S and the mobility, usual activities and pain/discomfort subscales of the EQ-5D-S than with the self-care and anxiety/depression subscales.

Responsiveness

The responsiveness of a PRO instrument is its ability to detect change over time [9]; in the present study, this was the change between the pre-operative assessment and the 4-month follow-up. Responsiveness was assessed using Spearman’s correlation coefficient, standardised response mean (SRM) and effect size (ES). Correlations between the GPE and the six subscales of the HAGOS-S were measured. The SRM was calculated as the mean change in score divided by the SD of the change. The ES was calculated as the mean change in score divided by the SD of the baseline score [13]. The patients were divided into three groups: those reporting worsening of hip function between pre-operatively and the 4-month follow-up (at least 20 points lower GPE score), those reporting no change in function (0–19 points higher or lower GPE score) and those reporting improved function (at least 20 points higher GPE score). A priori hypotheses were formulated for responsiveness. We hypothesised that the change in the score on the HAGOS-S subscales would correlate with the GPE score with a Spearman correlation coefficient of >0.3. We furthermore hypothesised that the SRM and ES would be higher for those reporting improved hip function (at least 20 points higher GPE score) and lower for those reporting worsening of hip function (at least 20 points lower GPE score).
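The SRM, ES and GPE grouping defined in this section can be illustrated with a minimal sketch. The function names and scores are hypothetical, sample standard deviations are assumed, and the 20-point threshold is the one stated in the text.

```python
from statistics import mean, stdev


def srm(baseline, follow_up):
    """Standardised response mean: mean change divided by SD of the change."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


def effect_size(baseline, follow_up):
    """Effect size: mean change divided by SD of the baseline scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def gpe_group(gpe_change, threshold=20):
    """Classify a patient by change in GPE score (follow-up minus baseline)."""
    if gpe_change >= threshold:
        return "improved"
    if gpe_change <= -threshold:
        return "worsened"
    return "unchanged"


# Hypothetical subscale scores for three patients.
pre = [40, 50, 60]
post = [50, 70, 90]
example_srm = srm(pre, post)
example_es = effect_size(pre, post)
```

In practice the SRM and ES would be computed separately within each GPE group, so that the values for improved, unchanged and worsened patients can be compared against the hypotheses.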

Interpretability

Interpretability is defined as the degree to which it is possible to assign qualitative meaning to an instrument’s quantitative scores or change in scores [9]. It includes the distribution of total scores and change in scores, floor and ceiling effects and an estimation of the minimal important change (MIC) and/or minimal important difference (MID). Floor and ceiling effects were defined as being present if more than 15 % of patients reported lowest (0) or highest (100) possible scores [15]. The MIC was calculated as 0.5 × SD both at baseline and at 4 months [12].
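The floor and ceiling criterion and the half-SD rule for the MIC can be sketched as follows. The function names and the score list are hypothetical; the 15 % threshold and the 0 to 100 score range are taken from the text, and a sample standard deviation is assumed.

```python
from statistics import stdev


def floor_ceiling(scores, lowest=0, highest=100, threshold=0.15):
    """Share of patients at the lowest and highest possible score.

    An effect is flagged when the share exceeds the threshold (15 %).
    """
    n = len(scores)
    floor_share = sum(1 for s in scores if s == lowest) / n
    ceiling_share = sum(1 for s in scores if s == highest) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share > threshold,
        "ceiling_effect": ceiling_share > threshold,
    }


def mic_half_sd(scores):
    """Minimal important change estimated as half a standard deviation."""
    return 0.5 * stdev(scores)


# Hypothetical subscale scores for ten patients.
scores = [0, 0, 30, 40, 50, 60, 70, 80, 90, 100]
result = floor_ceiling(scores)
```

In this hypothetical sample, two of ten patients score 0, so a floor effect is flagged, whereas the single patient scoring 100 does not exceed the threshold.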

Results

Baseline characteristics are presented in Table 1. A total of 502 patients completed the HAGOS-S questionnaire at baseline. At the time of the study, 391 patients had reached 4 months post-surgery and 360 (92 %) were available for follow-up. Twenty-six patients completed the questionnaire pre-operatively on two separate occasions with a mean interval of 14 (range 9–20) days (SD 3.3).

Table 1 Baseline characteristics

Reliability

Descriptive statistics and test–retest reliability measurements are presented in Table 2. The ICC ranged from 0.81 to 0.87. No statistically significant difference between the test and retest scores was found. The SDC for the six subscales ranged from 7.8 to 16.1 at individual level and from 1.5 to 3.2 at group level.

Table 2 Descriptive statistics and test–retest reliability of the six HAGOS-S subscales (n = 26)

The internal consistency for the six subscales ranged from a Cronbach’s alpha of 0.77–0.89 (Table 3).

Table 3 Cronbach’s alpha (Cα) for internal consistency (n = 502) and factor analysis (n = 502) with eigenvalue (EV) and degree of variance explained in per cent (%) for the six HAGOS-S subscales

Validity

An exploratory factor analysis of each of the six subscales separately revealed that every subscale loaded on one strong factor, with an eigenvalue over 1.0, that explained a large degree of the variance (Table 3).

For the evaluation of the HAGOS-S construct validity, Spearman’s correlation coefficients were calculated between the HAGOS-S and EQ-5D-S and the HAGOS-S and iHOT12-S, respectively (Tables 4, 5). All six subscales of the HAGOS-S showed significant correlations with all questions and the total of the iHOT12-S, the EQ-5D-S total score and the EQ-5D-S VAS score.

Table 4 Spearman’s correlation coefficients at baseline for HAGOS-S subscales and EQ-5D-S subscales, total score and VAS (n = 495)
Table 5 Spearman’s correlation coefficients at baseline for the 12 items and average score for the iHOT12-S and the six subscales of the HAGOS-S (n = 495)

Responsiveness

Spearman’s correlation coefficient between the score on the six HAGOS-S subscales and the GPE scale ranged from 0.40 to 0.62, indicating moderate correlations. The results of the SRM and ES calculations and GPE correlations are presented in Table 6. As hypothesised, the ES and SRM were lower for those reporting a worsening of hip function and higher for those reporting improved hip function at 4 months.

Table 6 Responsiveness of the HAGOS-S measured against different change scores on the GPE scale

Interpretability

Floor and ceiling effects, present if more than 15 % of the patients reported highest or lowest scores on an individual item, were not found. The distribution of the scores at baseline and at 4 months, together with the MIC, is presented in Table 7.

Table 7 The HAGOS-S score at baseline and at 4 months with frequencies of lowest (floor effect) and highest (ceiling effect) scores

Discussion

The principal finding of the present study was that the HAGOS-S is a valid, reliable and responsive HR-PRO for patients with femoro-acetabular impingement undergoing hip arthroscopy.

During translation and adaptation, the authors carefully followed a standardised process described in the literature. This should make the adapted version highly comparable with the original version. During the evaluation of the adapted version, the authors carefully followed the COSMIN checklist to ensure the assessment of every psychometric property.

With the development of the COSMIN checklist, health care specialists have a standardised instrument to evaluate the quality of studies measuring PRO instrument properties. The authors have used the COSMIN checklist during the design and reporting of the present study. We found the checklist easy to follow, but, as it does not as yet conclude what constitutes adequate measurement qualities, criteria proposed in the literature were used during calculations in the present study.

Study population

The HAGOS was developed for young, active patients with hip disorders, but it has been validated on a population between 18 and 60 years of age. In the present study, we included some 500 patients, some younger and some older (15–75 years), and only patients with FAI. Some floor and ceiling effects were observed. We believe, however, that the HAGOS-S can also be utilised for older patients and for patients with other hip disorders, but future studies are needed to clarify this.

Reliability

All subscales showed very good homogeneity, with an internal consistency between 0.77 and 0.89, as measured with Cronbach’s alpha.

With an ICC between 0.81 and 0.89 for the six subscales, the test–retest reliability of the HAGOS-S was found to be very good and in agreement with the ICC reported in the original publication [15].

In order to express the patients’ clinical change in hip status, it was decided in the present study to use a VAS to determine whether significant changes in patient symptoms had occurred. A change of 20 mm or more was considered clinically important. Minimal important changes on a pain VAS have been found to range from 13 to 30 mm [3, 5, 10, 12].

The SDC for the six subscales of the HAGOS-S at individual level was at a clinically acceptable level (between 7.8 and 16.1), and the HAGOS-S could therefore be recommended for use in individual patients. A change of 20 points as used in this study for a clinically relevant change in GPE can thus also be recommended as a clinically relevant change at individual level in the HAGOS-S. The low SDC values at group level (between 1.5 and 3.2) strongly indicate that the HAGOS-S is very useful for group comparisons.

Validity

Significant correlations were found between the HAGOS-S subscales and the EQ-5D-S total score, ranging from r_s = 0.40 to r_s = 0.60. Significant correlations were also found between the HAGOS-S and the iHOT12-S average score, ranging from r_s = 0.37 to r_s = 0.68, which was as hypothesised, apart from the HAGOS-S subscale of physical activity. Significant correlations were found between the HAGOS-S subscales and EQ-5D-S subscales, ranging from r_s = −0.10 to r_s = −0.57. As hypothesised, somewhat lower correlations were found for the EQ-5D-S subscales of self-care (average r_s = −0.19) and anxiety/depression (average r_s = −0.29) compared with the subscales of mobility, usual activities and pain/discomfort (average r_s = −0.47, −0.35, −0.40, respectively). The latter three subscales thus correlated more highly with the HAGOS-S than hypothesised.

The factor analysis revealed that the six HAGOS-S subscales had one strong factor per subscale, which is in accordance with the original HAGOS [15].

Responsiveness

The GPE score correlated strongly with the HAGOS-S subscales, ranging from r_s = 0.40 to r_s = 0.68. As hypothesised, the SRM and ES were lower for patients reporting little clinical change in hip status and higher for patients reporting a larger clinical change in hip status, indicating good responsiveness of the HAGOS-S. Clinically, most of the patients had recovered well (although not completely) after 4 months. Larger ES and SRM can thus be expected at 4 months compared with 12 months, for example.

Interpretability

Floor and ceiling effects were detected in the HAGOS-S. At baseline, 31.5 % of the patients obtained the lowest score on the subscale of participation in physical activities. The two questions in the subscale ask: “Are you able to participate in your preferred physical activities for as long as you would like?” and “Are you able to participate in your preferred physical activities at your normal performance level?” with the alternatives: Always, Often, Sometimes, Rarely, Never. It is not surprising that many patients with hip and/or groin disability choose the alternative “Never”. At 4 months, however, fewer patients (21.9 %) chose the alternative “Never”. Future studies will show whether this apparent floor effect is present in the long term. At 4 months, there is a ceiling effect (16.9 %) in the function in daily living subscale, indicating that the sensitivity of this subscale can be limited in this patient population.

Taken together, the HAGOS-S with its six subscales can be recommended for measuring both improvement and deterioration over time in the study population.

When developing the COSMIN checklist, no consensus was reached about the method that should be used to measure the MIC [10]. The MIC is supposed to measure the minimal change in score that the patient regards as important. The rule of thumb that the MIC can be estimated as half an SD was proposed by Norman et al. [12], and, as long as no consensus is reached on the methods by which the MIC should be measured, the authors find this simple rule as good as any other. Applying this rule to the data gave an MIC of 9–17 for the HAGOS-S subscales at baseline and the 4-month follow-up. In the present study, the SDC at individual level is slightly higher than the MIC for some of the HAGOS subscales and slightly lower for other subscales at individual level both at baseline and at the 4-month follow-up. Results at individual level should therefore be interpreted with caution.

Each of the six HAGOS subscales can be used independently to identify changes in certain aspects of patients’ symptoms and for certain subpopulations at both group and individual level.

The present data are in agreement to a very large extent, in terms of reliability, validity and responsiveness, with the original study [15] of patients with hip and groin disability. Kemp et al. [8] recently evaluated, on a small subpopulation of 50 patients undergoing hip arthroscopic surgery, the reliability, validity, responsiveness and interpretability of five HR-PROs [Copenhagen Hip and Groin Outcome Score (HAGOS), Hip Disability and Osteoarthritis Outcome Score (HOOS), Hip Outcome Score (HOS), International Hip Outcome Tool (iHOT-33) and modified Harris hip score (MHHS)]. They concluded that some of the psychometric properties of the HAGOS were reduced, based on the fact that the HAGOS subscale related to activities of daily living showed a ceiling effect, which is in agreement with the present study. It is, however, not surprising that, as patients get better, they become symptom free in activities of daily living before they are symptom free in more sport-related activities. In a recent study, Hinman et al. [6] searched for the best HR-PRO for 30 patients with femoro-acetabular impingement in terms of test–retest reliability. They were able to demonstrate that the majority of the questionnaires, including the HAGOS, were reliable and precise enough for use at group level, which is in agreement with the present study, which also showed that the HAGOS can be used at individual level. Taken as a whole, this study shows that the HAGOS is a highly relevant measurement for patients with unspecific hip and groin pain, as well as for patients with femoro-acetabular impingement, undergoing hip arthroscopy.

Conclusion

The HAGOS-S showed good reliability, validity and responsiveness and can be used both for research and clinically at individual and group level in active patients with hip and/or groin pain.