Objective The purpose of this systematic review was to critically appraise and synthesise the psychometric properties of Global Rating of Change (GROC) scales for assessment of patients with neck pain.
Design Systematic review.
Data sources A search was performed in four databases (MEDLINE, EMBASE, CINAHL, SCOPUS) until February 2019.
Data extraction and synthesis Eligible articles were appraised using Consensus-based Standards for the selection of health Measurement Instruments checklist and the Quality Appraisal for Clinical Measurement Research Reports Evaluation Form.
Results The search obtained 16 eligible studies and included in total 1533 patients with neck pain. Test–retest reliability of global perceived effect (GPE) was very high (intraclass correlation coefficient=0.80 to 0.92) for patients with whiplash. Pooled data of Pearson’s r indicated that GROC scores were moderately correlated with neck disability change scores (0.53, 95% CI: 0.47 to 0.59). Pooled data of Spearman’s correlations indicated that GROC scores were moderately correlated with neck disability change scores (0.56, 95% CI: 0.41 to 0.68).
Conclusions This study found excellent quality evidence of very good-to-excellent test–retest reliability of GPE for patients with whiplash-associated disorders. Evidence from very good-to-excellent quality studies found that GROC scores are moderately correlated to an external criterion patient-reported outcome measure evaluated pre-post treatment in patients with neck pain. No studies were found that addressed the optimal form of GROC scales for patients with neck disorders or compared the GROC to other options for single-item global assessment.
PROSPERO registration number CRD42018117874.
- neck pain
- global assessment
- psychometric properties
- systematic review
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Strengths and limitations of this study
We rated the quality of individual studies and the overall risk of bias using two standardised approaches.
Our focus on neck pain increased the specificity of the results, but the findings are not necessarily applicable to other musculoskeletal conditions.
Conceptual concerns about global ratings of change being affected by recall bias are not adequately addressed by psychometric evidence.
No studies addressing the optimal form of global rating were found.
Neck pain is the fourth leading cause of disability and approximately half of the adult population with neck pain will experience a clinically important episode once in their lifetime.1–3 The annual prevalence of neck pain is estimated between 15% and 50%, with women having a higher prevalence rate than men.2 3 Neck pain has been associated with many other comorbidities such as headaches, dizziness, anxiety, depression, back pain and arthralgias.3–6 Several different methods for classifying neck pain have been described, using indicators such as duration (acute, subacute or chronic), degree of interference (low, moderate, severe) or most likely structure at fault (eg, neuropathy vs mechanical).7
As part of a patient-centric approach to care, clinicians will commonly evaluate response to intervention by asking the patient directly whether they feel better, worse or the same since the prior encounter. While direct questioning can provide a qualitative indicator of change in status, many best practice guidelines endorse use of some form of quantified patient-reported outcome (PRO) as an adjunct to oral self-report. PROs are available to quantify several different constructs in people with neck pain, including pain severity, disability and neck function.8 Any PRO intended to provide an estimate of change over time should be responsive to subtle shifts in the patient’s condition. To facilitate interpretation of change scores, a common property of many such tools is the minimum clinically important difference (MCID), which is a change threshold that corresponds to the minimum shift in scale values that most patients would indicate corresponds to an important change in their overall condition. A well-recognised approach to establishing an MCID for a PRO is to compare the magnitude of change against an anchor, most commonly a Global Rating of Change (GROC) scale. These scales allow patients or study participants to indicate whether their condition has gotten worse, better or stayed the same and to quantify the magnitude of that change. As they have been adopted as a sort of ‘standard’ against which change in other tools is compared, the GROC can also be used on its own as an omnibus generic indicator of change.8
Despite being accepted as a standard measure, there is considerable variation in how the GROC has been constructed and implemented in research in neck pain. GROC scales consist of ordered categories which may have different ranked levels (some have 15 levels, some 11 levels and others have 7 levels). The common structure across these is the use of a middle ‘0’ score corresponding to ‘no change’, with negative values indicating magnitudes of worsening while positive values indicate improvement.9 Variations of the GROC (in name or structure) include the Global Perceived Effect (GPE), Patient Global Impression of Change (PGIC), Transition Ratings and Global Scale.9
A well-established component of health outcomes is having a tool with strong psychometric properties of validity, reliability and responsiveness to be able to monitor change. While recent research8 has examined the psychometric properties of the most commonly reported PROs for neck disorders, to date there has been no systematic review to summarise the measurement properties of GROC scales themselves in patients with neck disorders. Therefore, this systematic review aims to critically appraise and synthesise the psychometric properties of the GROC scales in patients with neck disorders.
Patient and public involvement
There was no patient or public involvement in the design or planning of this study.
Study design and protocol registration
We conducted a systematic review to evaluate the psychometric properties of GROC scales in patients with neck disorders.
We included studies in this systematic review if the following criteria were met10–12:
Design: psychometric testing, randomised/cohort studies.
Participants: >50% of the study’s patient population with neck conditions/disorders.
Intervention/comparison: studies that reported on the psychometric properties (reliability, validity, responsiveness) of GROC, GPE and PGIC.
Outcomes: GROC, GPE and PGIC.
Articles were written in English language only.
Studies with no data on the GROC scale’s psychometric properties, and conference abstract/posters were excluded from this systematic review.
To identify studies on the psychometric properties (reliability, validity, responsiveness) of the GROC, GPE and PGIC, we searched the MEDLINE, EMBASE, SCOPUS and CINAHL databases from inception till February 2019, using a combination of keywords. Furthermore, we identified additional studies by examining the reference list of each of the selected studies. The full list with keyword strategy is presented in online supplementary appendix 1.
Two investigators (PB and GN) performed the systematic electronic searches independently in each database. The same investigators then identified and removed duplicate studies. In the next stage, they independently screened the titles and abstracts, and any article marked as include or uncertain was obtained in full text. In the final stage, the same two investigators independently performed the full-text reviews to assess final article eligibility. In case of disagreement, a third reviewer, the most experienced member (JM), facilitated a consensus through discussion.
The fourth author (RF) performed the data extractions. The extracted data were then crosschecked by another author (PB). Data extraction included the author, year, study population/condition, setting, sample size, age, properties evaluated, retest interval and the intervention protocol (if used to assess responsiveness parameters).13 14 For reliability estimates, standard error of measurement (SEM), intraclass correlation coefficient (ICC), minimal detectable change and 95% CIs were extracted.13 14 As a common benchmark, ICC<0.40 was interpreted as poor, 0.40≤ICC<0.75 as fair-to-good and ICC≥0.75 as excellent reliability.15 For validity estimates, correlation coefficients (Pearson’s/Spearman’s) and their 95% CIs were extracted.13 14 Evans’ guidelines were used to interpret the strength of correlation: 0.00–0.19 ‘very weak’, 0.20–0.39 ‘weak’, 0.40–0.59 ‘moderate’, 0.60–0.79 ‘strong’ and 0.80–1.00 ‘very strong’.16 For responsiveness estimates, the effect size, standardised response mean, clinically important difference and/or MCID, including the method of MCID estimation (anchor-based or distribution-based), and 95% CIs were extracted.13 14 To assist clinical decision-making, effect sizes were interpreted against the standard benchmarks proposed by Cohen: trivial (<0.20), small (≥0.20 to <0.50), moderate (≥0.50 to <0.80) or large (≥0.80).17 When insufficient data were presented, PB contacted the authors by email and requested further data.
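The three sets of interpretation benchmarks described above can be expressed as simple threshold lookups. The following is a minimal sketch, not the authors' code; the function names are hypothetical, and treating the published bands as half-open intervals (so that, for example, 0.195 falls in 'very weak') is an assumption made here to close the gaps between the printed ranges.

```python
def interpret_icc(icc: float) -> str:
    """Reliability benchmark: <0.40 poor, 0.40 to <0.75 fair-to-good, >=0.75 excellent."""
    if icc < 0.40:
        return "poor"
    if icc < 0.75:
        return "fair-to-good"
    return "excellent"


def interpret_correlation(r: float) -> str:
    """Evans-style strength bands applied to the absolute value of r.

    Assumption: bands are treated as half-open intervals [lower, upper).
    """
    magnitude = abs(r)
    if magnitude < 0.20:
        return "very weak"
    if magnitude < 0.40:
        return "weak"
    if magnitude < 0.60:
        return "moderate"
    if magnitude < 0.80:
        return "strong"
    return "very strong"


def interpret_effect_size(es: float) -> str:
    """Cohen-style benchmarks applied to the absolute effect size."""
    magnitude = abs(es)
    if magnitude < 0.20:
        return "trivial"
    if magnitude < 0.50:
        return "small"
    if magnitude < 0.80:
        return "moderate"
    return "large"


# Example using pooled values reported in this review.
print(interpret_correlation(0.53))  # Pearson pooled estimate
print(interpret_correlation(0.56))  # Spearman pooled estimate
print(interpret_icc(0.92))
```

Applied to the pooled estimates reported in this review, both correlations fall in the 'moderate' band.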
Consensus-based standards for the selection of health measurement instruments (COSMIN)
COSMIN assesses the risk of bias for the psychometric properties reported on a property-by-property basis. The risk of bias in estimates of psychometric properties was assessed by two authors (PB and RF) using the new COSMIN checklist.18 If disagreement was present, a third person (JM) assisted in resolving the discrepancy. Each study was rated on the COSMIN 4-point scale as ‘very good’, ‘adequate’, ‘doubtful’ or ‘inadequate’ for each of the checklist criteria for relevant measurement properties (eg, reliability, responsiveness, and so on). In line with COSMIN, the overall score for each measurement property was determined using the worst score counts method, wherein the lowest score for the checklist criteria of the relevant property was taken as the overall score.19 We then assessed the result of individual studies on a measurement property against the updated criteria for good measurement properties. This involved the evaluation of results of the included studies as either sufficient (+), insufficient (−), or indeterminate (?).18
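The worst score counts rule described above amounts to taking the minimum over an ordered rating scale. A minimal sketch follows; the function name and the example ratings are hypothetical and are not taken from the included studies.

```python
# Ordered from worst to best, following the 4-point scale named in the text.
COSMIN_ORDER = ["inadequate", "doubtful", "adequate", "very good"]


def worst_score_counts(item_ratings: list[str]) -> str:
    """Return the lowest rating among the checklist items for one measurement property."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    unknown = [r for r in item_ratings if r not in COSMIN_ORDER]
    if unknown:
        raise ValueError(f"unrecognised ratings: {unknown}")
    return min(item_ratings, key=COSMIN_ORDER.index)


# Hypothetical ratings for the reliability items of a single study.
print(worst_score_counts(["very good", "adequate", "very good"]))
```

Under this rule a single 'doubtful' or 'inadequate' item determines the overall rating for the property, regardless of how the other items were rated.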
Quality appraisal for clinical measurement research reports evaluation form
A summary score for the overall quality of individual studies was appraised independently by two authors (PB and RF) using a structured clinical measurement-specific appraisal tool.13 14 In case of disagreement, a third person (JM) was consulted to resolve the conflict. The evaluation criteria of this tool included 12 items: (1) thorough literature review to define the research question; (2) specific inclusion/exclusion criteria; (3) specific hypotheses; (4) appropriate scope of psychometric properties; (5) sample size; (6) follow-up; (7) the authors referenced specific procedures for administration, scoring and interpretation of procedures; (8) measurement techniques were standardised; (9) data were presented for each hypothesis; (10) appropriate statistics-point estimates; (11) appropriate statistical error estimates; and (12) valid conclusions and recommendations.13 14 An article’s total quality score was calculated as the sum of scores for each item, divided by the number of items and multiplied by 100%.13 14 The overall quality of appraised articles was then classified as poor (0%–30%), fair (31%–50%), good (51%–70%), very good (71%–90%) or excellent (>90%).13 14
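The quality score calculation above can be sketched as follows. This is illustrative only: the function names are hypothetical, and the 0 to 2 scoring range per item (`max_item_score=2`) is an assumption not stated in the text; with `max_item_score=1` the function reduces to the formula exactly as written. Assigning fractional percentages to the band whose upper bound they do not exceed is also an assumption.

```python
def quality_percent(item_scores: list[float], max_item_score: float = 2) -> float:
    """Sum of item scores as a percentage of the maximum attainable score.

    Assumption: each item is scored from 0 to max_item_score.
    """
    if not item_scores:
        raise ValueError("at least one item score is required")
    if any(s < 0 or s > max_item_score for s in item_scores):
        raise ValueError("item score outside the assumed range")
    return 100.0 * sum(item_scores) / (len(item_scores) * max_item_score)


def quality_category(percent: float) -> str:
    """Map a percentage to the descriptive quality bands used in the review."""
    if percent <= 30:
        return "poor"
    if percent <= 50:
        return "fair"
    if percent <= 70:
        return "good"
    if percent <= 90:
        return "very good"
    return "excellent"


# Hypothetical study: full marks on 11 of 12 items, partial marks on one.
scores = [2] * 11 + [1]
pct = quality_percent(scores)
print(round(pct, 1), quality_category(pct))
```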
Synthesis of results
A qualitative synthesis was conducted to report findings on test–retest reliability statistics. A meta-analysis of Pearson’s and Spearman’s correlations was performed in R V.3.6.1 with the metafor package.20 The meta-analyses were conducted using a random-effects model, and the correlation coefficients were converted to Fisher’s z values. Heterogeneity was deemed substantial if I2 values were more than 50%.21 A meta-regression was planned to explore the sources of unexplained heterogeneity by considering the following factors: (1) neck pain with or without radicular symptoms, (2) acute or chronic, (3) age and (4) sex. Forest plots were created using means and 95% CIs for correlation coefficients. We summarise the main results of the included articles based on the neck disorders, reported psychometric estimate and the study quality ratings.
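The pooling approach described above (z transformation of correlations, random-effects weighting, I2 for heterogeneity) can be illustrated with a short sketch. This is not the review's analysis code, which was run in R. The between-study variance estimator used in the review is not stated, so this sketch substitutes the DerSimonian-Laird estimator; the variance 1/(n−3) is the usual large-sample approximation for a z-transformed Pearson correlation; and the input correlations and sample sizes are hypothetical.

```python
import math


def pool_correlations(rs: list[float], ns: list[int]) -> dict:
    """Random-effects pooling of correlations on the Fisher z scale.

    Assumption: DerSimonian-Laird estimate of between-study variance (tau^2).
    """
    if len(rs) != len(ns) or len(rs) < 2:
        raise ValueError("need matching lists with at least two studies")
    zs = [math.atanh(r) for r in rs]       # Fisher z transform
    vs = [1.0 / (n - 3) for n in ns]       # sampling variance of z
    w = [1.0 / v for v in vs]              # fixed-effect weights
    sw = sum(w)
    z_fixed = sum(wi * zi for wi, zi in zip(w, zs)) / sw
    q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, zs))  # Cochran's Q
    df = len(rs) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in vs]  # random-effects weights
    z_re = sum(wi * zi for wi, zi in zip(w_re, zs)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    lower, upper = z_re - 1.96 * se, z_re + 1.96 * se
    # Back-transform the pooled estimate and its CI to the correlation scale.
    return {
        "r": math.tanh(z_re),
        "ci": (math.tanh(lower), math.tanh(upper)),
        "i2": i2,
        "tau2": tau2,
    }


# Hypothetical study-level correlations and sample sizes.
result = pool_correlations([0.45, 0.55, 0.60, 0.50], [80, 120, 60, 150])
print(result)
```

Back-transforming with `tanh` keeps the pooled estimate and its confidence limits inside the −1 to 1 range, which is the main reason for pooling on the z scale rather than on raw correlations.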
Our search yielded 8837 articles. After removal of duplicates, 6027 studies remained and were screened using their title and abstract; leaving 29 articles selected for full-text review. Of these, 16 studies were considered eligible.22–37 The flow of the study selection process is presented in figure 1.
The 16 eligible studies were conducted between 2006 and 2017 and included 1533 participants with neck pain/disorders (mean of 96 participants per study).22–32 34–37 Study size ranged from 29 to 200 participants. A summary description of all the studies included is displayed in table 1. Concurrent validity was evaluated in 14 studies by comparing the difference of pain intensity, disability and function scores with the score of GROC scales. Two studies26 31 examined the test–retest reliability of a 7-point and an 11-point GPE scale for patients with whiplash-associated disorders (WADs). One study24 examined whether occurrences of within-session and between-session changes were significantly associated with functional outcomes, pain and self-report of recovery in patients at discharge who were treated with manual therapy for mechanical neck pain.
COSMIN risk of bias rating and quality appraisal of the included studies
Regarding the risk of bias, all studies were rated as very good (table 2). The quality of the studies ranged from 88% to 96% (table 3). The most common flaws were (1) lack of/inadequate sample size calculations; (2) missing data (ie, inadequate follow-up) and (3) inconsistencies between the data presented and hypothesis stated.
Reported GROC scales
The most commonly reported GROC scale (n=6 studies) was a 15-point scale with the most frequent anchors being ‘−7 (a very great deal worse) to 0 (about the same) to +7 (a very great deal better)’. A 7-point scale was reported in five studies, 11-point and 5-point scales were reported in two studies and a 9-point scale in one study. The anchors in those scales varied greatly and are presented in table 1. Only six studies26 31–33 35 36 reported full details regarding the specific questions asked of the patients with neck disorder when a GROC scale was administered. Those questions that were reported are presented in table 4.
Two studies were included that examined test–retest reliability of GPE for patients with WAD. Kamper et al 26 examined the (time interval) test–retest reliability of an 11-point GPE scale in 134 patients with chronic WAD and reported an ICC of 0.99 (95% CI: 0.99 to 0.99) at baseline, 0.96 (0.95 to 0.97) at 6 weeks and 0.92 (0.89 to 0.94) at 12 months (table 5). Ngo et al 31 assessed the test–retest reliability of a 7-point scale of GPE in patients with acute WAD at 3 to 5 days. The ICC and 95% CIs were used to determine the test–retest reliability of the two versions of the perceived recovery questions using their original 7-item responses. Ngo et al also computed weighted kappa coefficients and 95% CIs using quadratic weights to determine whether the distribution of responses influenced the reliability as measured by the ICC. An ICC for general recovery of 0.70 (0.60 to 0.80) and an ICC for neck pain questions of 0.80 (0.72 to 0.87) were found. At 6 weeks, the weighted kappa was 0.70 (0.42 to 0.98) for general recovery and 0.80 (0.51 to 1.0) for the neck pain questions (table 5).
We found 14 studies that examined concurrent validity measures between GROC and another PRO.22 23 25 27–30 32 34–38 Pearson’s and Spearman’s correlation coefficients between GROC and another PRO ranged from very weak to very strong. The validity measures are presented and summarised in table 6.
Meta-analysis and meta-regression of correlations between disability change scores and GROC scores
Five studies23 25 34 37 38 of very good-to-excellent quality reported Pearson correlation coefficients between neck disability change scores and GROC scores, and these were pooled. We found that GROC was positively correlated with disability change scores (r=0.53, 95% CI: 0.47 to 0.59, I2=0%). Six studies27–30 32 36 of very good-to-excellent quality reported Spearman correlation coefficients between neck disability change scores and GROC scores, and these were pooled. We found that GROC was moderately correlated with disability change scores (rho=0.56, 95% CI: 0.41 to 0.68, I2=85%). The forest plots with correlation coefficients and 95% CIs are presented in figures 2–3. Our meta-regression showed that age was a significant factor influencing Fisher’s Z scores (β=−0.034, 95% CI: −0.05 to −0.01, p=0.001). The model explained 68% of the variance (R2=0.68) (figure 4).
Area under the curve (AUC)—sensitivity and specificity
Cook et al 24 found that between-session NPRS pain changes were associated with greater than 3-point change on the GROC at 96 hours (AUC=0.76). The pain change associated with GROC was more specific (specificity=79.2%, range: 62.2–91.1) than sensitive (sensitivity=65.6%, range: 57.9 to 74.6). Those with a 36.7% between-sessions change in pain were also 7.3 times more likely to report an improvement of greater than 3-point change on the GROC than those who did not achieve a 36.7% change in pain (table 5).
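As a worked check on the figures above, sensitivity and specificity can be combined into likelihood ratios and a diagnostic odds ratio. The sketch below is illustrative; the function name is hypothetical, and whether the cited study derived its "7.3 times" figure this way is an assumption, although the reported sensitivity and specificity do yield a diagnostic odds ratio that rounds to that value.

```python
def diagnostic_summary(sensitivity: float, specificity: float) -> dict:
    """Likelihood ratios and diagnostic odds ratio from sensitivity and specificity."""
    if not (0 < sensitivity < 1 and 0 < specificity < 1):
        raise ValueError("sensitivity and specificity must lie strictly between 0 and 1")
    lr_pos = sensitivity / (1 - specificity)   # positive likelihood ratio
    lr_neg = (1 - sensitivity) / specificity   # negative likelihood ratio
    return {"lr_pos": lr_pos, "lr_neg": lr_neg, "dor": lr_pos / lr_neg}


# Point estimates reported for the between-session pain change criterion.
summary = diagnostic_summary(sensitivity=0.656, specificity=0.792)
print({name: round(value, 2) for name, value in summary.items()})
```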
This review has synthesised the current research from 16 studies that aimed to evaluate the psychometric properties of GROC scales for patients with neck disorders, with the goal of providing evidence for clinicians and researchers concerning their use within clinical practice and research. Of the 16 included studies, only two26 31 reported test–retest reliability statistics, for the 7-ranked and 11-ranked categories of GPE scales and for patients with WAD only. We were able to pool data from 11 studies regarding concurrent validity of GROC scales and neck disability change scores at one time point after the interventions. Themes influencing interpretation of the GROC were explored in a study33 that evaluated the factors that contribute to how patients respond to a question on GPE. This study found that treatment process, biomechanical performance, self-efficacy and the nature of the condition may influence the responses on GPE, which is consistent with what we would expect for patients with neck pain. This suggests that change is a complex multifactorial global concept. A strength of GROC is that it is intended as a global assessment, and it can be assumed that it reflects the aspects of change important to the individual patient.
Reliability can be defined as the degree to which a measure produces consistent results with the least amount of random error when the status of the population remains unchanged. The GPE displayed excellent test–retest reliability (ICC>0.90) over intervals of 6 weeks and 12 months for patients with WAD. Conducting an assessment with a long test–retest interval (eg, 12 months) can be challenging as there is a higher risk of individuals with WAD being symptomatically unstable.9 Determining if patients are symptomatically stable can be achieved by administering another PRO such as the Single Assessment Numeric Evaluation (SANE)39; however, the 7-ranked and 11-ranked categories of GPE scales still demonstrated good stability properties at long test intervals (ie, 6 weeks and 12 months).26 Therefore, the measurements of the reliability parameters of the GPE may be very useful during longer test intervals in clinical trials.
The psychometric property of validity is defined as the degree to which a PRO measures what it is intended to measure. Pooled data from 11 studies overall suggest that post-treatment changes of validated disability outcome measures were moderately (Pearson’s r=0.53, 95% CI: 0.47 to 0.59; Spearman’s rho=0.56, 95% CI: 0.41 to 0.68) correlated with change in perceived effect (figures 2–3). This finding suggests that GROC scores taken at one point in time were related to scores in pain and disability in patients with neck disorders, as measured by standardised measures taken at two points in time. We identified one study24 that found a 36.7% change in pain for within-session and between-session changes was associated with a 50% reduction in the NDI and an improvement of >3 levels on a 15-ordinal level GROC scale for patients with neck pain. This quantified predictive change value may have clinical utility for use in clinical practice.
Previous studies9 40 have indicated serious concerns about the conceptual validity of the global rating of change. The review by Kamper et al 9 clearly showed that GROC was related to final status more than change and was least related to baseline health status. This result undermines the premise of what the global rating of change actually measures. For this reason, we conclude that the pooled correlation of approximately 0.50 across 11 studies between the GROC and other patient-reported outcome measure (PROM) change scores (eg, NDI scores) may reflect a relationship between follow-up status and change rather than supporting the contention that GROC actually measures change. This would also explain why only 25% of the variation in GROC scores was explained by PROM change scores measured at two points in time. In all studies, participants completed the GROC scale at one time point after the intervention, and hence recall bias is a cause for concern. However, another potential factor for moderate correlations is that the PROMs that have been used as the comparator with GROC scores may not reflect priorities that are important to patients. That is, the field has largely been driven by assumptions that the GROC is a ‘gold standard’ for evaluating true change in a respondent’s condition or status, and that all items on the comparator PROM are of equal importance to all people with that condition. The work presented herein challenges the valorisation of the GROC as a gold standard for change, and prior work has challenged the notion that all PROM items are equally important.9 41 42 It is therefore possible that the very constructs being evaluated require greater critical discourse before authors can say, with confidence, that one scale functions well or poorly based on its associations with another scale.
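The 25% figure in the paragraph above corresponds to the coefficient of determination, obtained by squaring the correlation. The step is shown here for clarity; it holds strictly for a linear (Pearson) correlation, and applying it to a rounded pooled value is an approximation.

```latex
r \approx 0.50
\quad\Longrightarrow\quad
r^{2} \approx (0.50)^{2} = 0.25
\quad\text{(about 25\% of variance shared)}
```

By the same arithmetic, roughly 75% of the variation in GROC scores is left unexplained by the PROM change scores.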
Since no studies compared a retrospective global assessment such as the GROC with a pre–post single-item global PROM (for example, the SANE), we do not know the extent to which these two factors contributed to the moderate correlation.
A unique aspect of this study was that it focused on global rating of change scales in a neck pain patient population. Our study appraisal suggests that future studies concerning GROC should include adequate sample sizes, maintain a rigorous follow-up and report appropriate statistical error estimates, since these were often inadequate. Various critical appraisal tools exist, and the perspectives and ratings may differ across instruments. COSMIN is just one methodology that can be used to synthesise or evaluate outcome measures and other methods might be equally valid or provide different perspectives. We used two different critical appraisal tools to evaluate quality from two perspectives. The COSMIN risk of bias assessment reflects the level of confidence in the conclusions and pooled estimates. The quality appraisal tool focuses on design issues in the studies and reflects gaps in research designs that should be considered in interpretation of current research and improved in future studies. Substantial heterogeneity was detected (I2>50%) in pooled Spearman’s correlation coefficients, which is a concern when pooling data. Sources of the observed heterogeneity were identified in our meta-regression results. Our univariate meta-regression analysis indicated that age across the studies explained 68% of the variance (figure 4). Other factors such as type of neck pain (with or without radicular symptoms), acute or chronic status and sex did not explain the remaining heterogeneity (not statistically significant). In our meta-regression, we used a patient-level characteristic to explain the observed heterogeneity and therefore, our model may be vulnerable to aggregation bias. Furthermore, the scope of our literature search was focused on identifying full-text papers written only in English.
While this study included 16 studies, only 2 of these reported reliability statistics for GROC scales for patients with chronic WAD. Therefore, the applicability of our study is mostly limited to patients with chronic WAD. For validity measurements, GROC scales were mostly investigated by correlation analyses to evaluate the external responsiveness of another PRO measure over a specific time point. From our meta-analysis, we can be confident that the GROC scores were moderately correlated with neck disability change scores. However, more robust psychometric design studies that test the measurement properties of GROC scales as the primary outcome of investigation are needed. Future studies should aim to test to what extent the range of items (eg, 7-level scale vs 11-level scale) and the anchors (eg, much worse vs much better) may affect the measurement properties of GROC scales for patients with neck disorders. Also, it is important to indicate that most outcome measures are ordinal and assume that additive scores of ordinal items can be treated as interval level. This potentially could lead to scaling problems even in the face of strong psychometric properties. The main protection we have is to create new scales or retrofit existing scales based on Rasch analysis. Also, we acknowledge that the majority of work done on the GROC scales has been performed using statistical approaches that are most appropriate to linear rather than ordinal data.
This study found excellent quality evidence of very good-to-excellent test–retest reliability of GPE for patients with WAD. Evidence of very good-to-excellent quality studies found that GROC scores are moderately correlated to an external criterion PROM, measured pre–post treatment in patients with neck disorders. Studies addressing the optimal form of GROC scales for patients with neck disorders or comparing the GROC to other options for single-item global assessment of change were not found.
Collaborators CATWAD co-authors: Michele Sterling, Anne Söderlund, Michele Curatolo, James M Elliott, David M Walton, Helge Kasch, Linda Carroll, Hans Westergren, Gwendolen Jull, Eva-Maj Malmström, Luke B Connelly, Joy C MacDermid, Mandy Nielsen, Pierre Côté, Tonny Elmose Andersen, Trudy Rebbeck, Annick Maujean, Sarah Robins, Kenneth Chen, Julia Treleaven.
Contributors PB contributed significantly to the conception and design of the study, data extraction, critical appraisal, interpretation of data and drafting of the manuscript. GN and RF were involved in literature search, critical appraisal and interpretation of data and drafting. GN was involved in critical appraisal and drafting. JM was also involved in the conception and design of the study, drafting and revised the manuscript for important intellectual content. JM and CATWAD were involved in the drafting and review of the manuscript. All authors have given their final approval on the manuscript to be published.
Funding This work was supported by the Canadian Institutes of Health Research (CIHR) with funding reference number (FRN: SCA-145102).
Competing interests None declared.
Patient consent for publication Not required.
Provenance and peer review Not commissioned; externally peer reviewed.
Data availability statement No data are available.