Original Article
Testing the Newcastle Ottawa Scale showed low reliability between individual reviewers

https://doi.org/10.1016/j.jclinepi.2013.03.003

Abstract

Objectives

To assess inter-rater reliability and validity of the Newcastle Ottawa Scale (NOS) used for methodological quality assessment of cohort studies included in systematic reviews.

Study Design and Setting

Two reviewers independently applied the NOS to 131 cohort studies included in eight meta-analyses. Inter-rater reliability was calculated using kappa (κ) statistics. To assess validity, within each meta-analysis, we generated a ratio of pooled estimates for each quality domain. Using a random-effects model, the ratios of odds ratios for each meta-analysis were combined to give an overall estimate of differences in effect estimates.
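
As context for the agreement statistic used above, the following is a minimal sketch of how Cohen's kappa and an approximate 95% confidence interval can be computed for two raters' judgments on a single NOS item; the data and the large-sample standard-error formula are illustrative assumptions, not the study's actual analysis code.

```python
import numpy as np

def cohens_kappa(rater1, rater2):
    """Cohen's kappa with an approximate large-sample 95% CI for two
    raters' categorical judgments (e.g., star awarded vs. not awarded)."""
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    n = len(r1)
    categories = np.union1d(r1, r2)
    p_o = np.mean(r1 == r2)  # observed proportion of agreement
    # Chance agreement: sum over categories of the product of the marginals.
    p_e = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in categories)
    kappa = (p_o - p_e) / (1 - p_e)
    # Simple large-sample standard error (an approximation).
    se = np.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
    return kappa, (kappa - 1.96 * se, kappa + 1.96 * se)

# Hypothetical ratings for 10 studies: 1 = star awarded, 0 = not awarded.
k, ci = cohens_kappa([1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
                     [1, 0, 0, 1, 0, 1, 1, 1, 1, 1])
print(f"kappa = {k:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```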

Results

Inter-rater reliability varied from substantial for length of follow-up (κ = 0.68, 95% confidence interval [CI] = 0.47, 0.89) to poor for selection of the nonexposed cohort and demonstration that the outcome was not present at the outset of the study (κ = −0.03, 95% CI = −0.06, 0.00; κ = −0.06, 95% CI = −0.20, 0.07). Reliability for overall score was fair (κ = 0.29, 95% CI = 0.10, 0.47). In general, reviewers found the tool difficult to use and the decision rules vague even with additional information provided as part of this study. We found no association between individual items or overall score and effect estimates.
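
The "no association" finding corresponds to pooled ratios of odds ratios close to 1. As a sketch of the pooling step described in the methods, the code below implements DerSimonian-Laird random-effects combination of log ratios of odds ratios; the eight input values and standard errors are hypothetical, not the study's data.

```python
import numpy as np

def pool_random_effects(log_ror, se):
    """DerSimonian-Laird random-effects pooling of log ratios of odds ratios."""
    y, s = np.asarray(log_ror, float), np.asarray(se, float)
    w = 1.0 / s**2                                   # inverse-variance weights
    mu_fixed = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - mu_fixed) ** 2)              # Cochran's Q
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)          # between-study variance
    w_re = 1.0 / (s**2 + tau2)                       # random-effects weights
    mu = np.sum(w_re * y) / np.sum(w_re)
    se_mu = np.sqrt(1.0 / np.sum(w_re))
    return np.exp(mu), (np.exp(mu - 1.96 * se_mu), np.exp(mu + 1.96 * se_mu))

# Hypothetical log ratios of odds ratios (one per meta-analysis) and SEs.
ror, ci = pool_random_effects(
    [0.05, -0.10, 0.20, 0.00, -0.05, 0.15, -0.20, 0.08],
    [0.10, 0.15, 0.20, 0.12, 0.18, 0.25, 0.22, 0.14])
print(f"pooled ROR = {ror:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

A pooled ratio of odds ratios whose confidence interval spans 1 indicates no systematic difference in effect estimates between quality strata.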

Conclusion

Variable agreement and lack of evidence that the NOS can identify studies with biased results underscore the need for revisions and more detailed guidance for systematic reviewers using the NOS.

Introduction

The internal validity of a study reflects the extent to which the design and conduct of the study have minimized the impact of bias [1]. One of the key steps in a systematic review is the assessment of internal validity (or risk of bias, RoB) of all studies included for evidence synthesis. This assessment serves to identify the strengths and limitations of the included studies; investigate and explain heterogeneity of findings across a priori defined subgroups of studies based on RoB; and grade the quality or strength of evidence for a given outcome.

With the increase in the number of published systematic reviews [2] and development of systematic review methodology over the past 15 years [1], close attention has been paid to methods of assessing internal validity of individual primary studies. Until recently, this has been referred to as “quality assessment” or “assessment of methodological quality” [1]. In this context, “quality” refers to “the confidence that the trial design, conduct, and analysis has minimized biases in its treatment comparisons” [3]. To facilitate the assessment of methodological quality, a plethora of tools has emerged [3], [4], [5], [6]. Some of these tools are applicable to specific study designs, whereas other more generic tools may be applied to more than one design. The tools usually incorporate items associated with bias (e.g., blinding, baseline comparability of study groups) and items related to reporting (e.g., was the study population described, was a sample size calculation performed) [1].

There is a need for inter-rater reliability testing of quality assessment tools to enhance consistency in their application and interpretation across different systematic reviews. Furthermore, validity testing is essential to ensure that the tools being used can identify studies with biased results. Finally, there is a need to determine inter-rater reliability and validity to support the use of individual tools that are recommended by those developing methods for systematic reviews.

We undertook this project to assess the reliability and validity of the Newcastle Ottawa Scale (NOS). The NOS is a quality assessment tool for use on nonrandomized studies included in systematic reviews, specifically cohort and case–control studies. The tool was produced by the combined efforts of the Universities of Newcastle, Australia, and Ottawa, Canada [7], and was first reported at the Third Symposium for Systematic Reviews in Oxford, United Kingdom, in 2000 [8]. It has been endorsed for use in systematic reviews of nonrandomized studies by The Cochrane Collaboration [1].

The NOS includes separate assessment criteria for case–control and cohort studies covering the following domains: the selection of participants, comparability of study groups, and the ascertainment of exposure (for case–control studies) or outcome of interest (for cohort studies). A star rating system is used to indicate the quality of a study, with a maximum of nine stars [8]. Each criterion receives a single star if appropriate methods have been reported. The selection domain is subdivided to evaluate the selection of the exposed and nonexposed cohorts, the ascertainment of exposure, and whether the study demonstrated that the outcome of interest was not present at the start of the study. Comparability is the only domain that may receive two stars: one if the most important confounders have been adjusted for in the analysis and a second if any other adjustments were made. The outcome domain comprises three items: the appropriateness of the methods used to evaluate the outcome, the length of follow-up, and the degree of loss to follow-up [7].
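
To make the star allocation concrete, the sketch below encodes the nine possible stars for a cohort study as a simple data structure; the field names are illustrative paraphrases of the NOS items, not the scale's official wording.

```python
from dataclasses import dataclass

@dataclass
class NOSCohortAssessment:
    """One cohort study's NOS star allocation (True = star awarded)."""
    # Selection domain: four items, one star each.
    selection_exposed: bool
    selection_nonexposed: bool
    ascertainment_exposure: bool
    outcome_absent_at_start: bool
    # Comparability domain: the only domain worth up to two stars.
    adjusts_main_confounder: bool
    adjusts_additional_factors: bool
    # Outcome domain: three items, one star each.
    outcome_assessment_adequate: bool
    followup_long_enough: bool
    followup_adequate: bool

    def total_stars(self) -> int:
        """Total stars awarded, out of a maximum of nine."""
        return sum([
            self.selection_exposed, self.selection_nonexposed,
            self.ascertainment_exposure, self.outcome_absent_at_start,
            self.adjusts_main_confounder, self.adjusts_additional_factors,
            self.outcome_assessment_adequate, self.followup_long_enough,
            self.followup_adequate,
        ])

study = NOSCohortAssessment(
    selection_exposed=True, selection_nonexposed=True,
    ascertainment_exposure=True, outcome_absent_at_start=False,
    adjusts_main_confounder=True, adjusts_additional_factors=False,
    outcome_assessment_adequate=True, followup_long_enough=True,
    followup_adequate=False)
print(study.total_stars())  # 6 of 9 stars
```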

The developers of the NOS have examined face and criterion validity, inter-rater reliability, and evaluator burden for the NOS. Face validity has been evaluated as strong by comparing each individual assessment item to its stem question. Criterion validity showed strong agreement with the Downs and Black assessment tool [9] on a series of 10 cohort studies evaluating hormone replacement therapy in breast cancer, with an intraclass correlation coefficient (ICC) of 0.88. Inter-rater reliability for the NOS on cohort studies was high, with an ICC of 0.94. Evaluator burden, as assessed by the time required to complete the NOS evaluation, was significantly lower than for the Downs and Black tool (P < 0.001) [10]. The authors state that further assessment of the construct validity and of the relationship between the external criterion of the NOS and its internal structures is under consideration [7]. These studies have been presented as abstracts.
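
The developers' figures are intraclass correlation coefficients. The abstracts do not specify which ICC form was used, so the sketch below assumes the common single-measure, two-way random-effects ICC(2,1) of Shrout and Fleiss, computed on a subjects-by-raters matrix of total NOS scores.

```python
import numpy as np

def icc2_1(ratings):
    """ICC(2,1): single-measure, two-way random effects, absolute agreement.
    `ratings` is an n-subjects x k-raters matrix of scores."""
    y = np.asarray(ratings, float)
    n, k = y.shape
    grand = y.mean()
    # Mean squares from the two-way ANOVA decomposition.
    msr = k * np.sum((y.mean(axis=1) - grand) ** 2) / (n - 1)   # subjects
    msc = n * np.sum((y.mean(axis=0) - grand) ** 2) / (k - 1)   # raters
    sse = np.sum((y - y.mean(axis=1, keepdims=True)
                    - y.mean(axis=0, keepdims=True) + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))                              # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical total NOS scores for five studies rated by two reviewers.
print(f"ICC(2,1) = {icc2_1([[7, 8], [5, 5], [9, 8], [6, 7], [4, 4]]):.2f}")
```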

The objectives of this study were to further assess the reliability of the NOS for cohort studies between individual raters and assess the validity of the NOS by examining whether effect estimates vary according to quality.

Section snippets

Methods

This article is part of a larger technical report conducted for the Agency for Healthcare Research and Quality (AHRQ). We followed a protocol that was developed a priori with input from experts in the field. Further details on methodology and results are available in the technical report (http://effectivehealthcare.ahrq.gov/index.cfm/search-for-guides-reviews-and-reports/).

Description of reviewers

Sixteen reviewers from the two Evidence-based Practice Centers (EPCs) assessed the studies using the NOS. Individuals had varying levels of relevant training and experience with systematic reviews in general. The length of time they had worked with their respective EPC ranged from 4 months to 10 years. Thirteen reviewers had formal training in systematic reviews. Four reviewers had a doctoral degree; 10 reviewers had a master's degree; 1 reviewer had a medical degree and master's degree; and 1 reviewer had an undergraduate

Discussion

This is the first study to our knowledge that has examined inter-rater reliability and construct validity of the NOS by researchers who were not involved in the development of the tool. We found wide variation in the degree of inter-rater agreement across the domains of the NOS, ranging from poor to substantial. The domain about the length of follow-up had substantial agreement; this finding was not surprising. The domain asked “Was the follow-up long enough for the outcome to occur?” Given the

Conclusions

More specific guidance is needed to apply and interpret quality assessment tools. We identified specific items within the NOS where agreement is low. This information provides direction for more detailed guidance. Low agreement between reviewers has implications for incorporation of quality assessments into results and grading the strength of evidence. The low agreement, combined with no evidence that the NOS is able to identify studies with biased results, underscores the need for revisions and more detailed guidance for systematic reviewers using the NOS.

Acknowledgments

The authors gratefully acknowledge the following individuals from the University of Alberta (U of A) EPC and University of Ottawa (U of O) EPC for assisting with quality assessments: Susan Armijo Olivo (U of A), Christine Ha (U of A), Chantelle Garritty (U of O), Kristin Konnyu (U of O), Dunsi Oladel-Rabiu (U of A), Larissa Shamseer (U of O), Kavita Singh (U of O), Elizabeth Sumamo (U of A), Jennifer Tetzlaff (U of O), Lucy Turner (U of O), Fatemeh Yazdi (U of O). We thank Annabritt Chisholm

References (31)

  • D. Moher et al. Assessing the quality of randomized controlled trials: an annotated bibliography of scales and checklists. Control Clin Trials (1995)
  • H. Bastian et al. Seventy-five trials and eleven systematic reviews a day: how will we ever keep up? PLoS Med (2010)
  • P. Juni et al. The hazards of scoring the quality of clinical trials for meta-analysis. JAMA (1999)
  • S. West et al. Systems to rate the strength of scientific evidence. Evid Rep Technol Assess (Summ) (2002)
  • S.A. Olivo et al. Scales to assess the quality of randomized controlled trials: a systematic review. Phys Ther (2008)
  • Wells G, Shea B, O'Connell J, Robertson J, Peterson V, Welch V, et al. The Newcastle-Ottawa scale (NOS) for assessing...
  • Wells G, Shea B, O'Connell J, Robertson J, Peterson V, Welch V, et al. The Newcastle-Ottawa scale (NOS) for assessing...
  • S.H. Downs et al. The feasibility of creating a checklist for the assessment of the methodological quality both of randomised and non-randomised studies of health care interventions. J Epidemiol Community Health (1998)
  • Wells G, Brodsky L, O'Connell D, Robertson J, Peterson V, Welch V, et al. Evaluation of the Newcastle-Ottawa Scale...
  • S. Ip et al. Breastfeeding and maternal and infant health outcomes in developed countries. Evid Rep Technol Assess (Full Rep) (2007)
  • F.A. McAlister et al. Cardiac resynchronization therapy and implantable cardiac defibrillators in left ventricular systolic dysfunction. Evid Rep Technol Assess (Full Rep) (2007)
  • P.L. Santaguida et al. Diagnosis, prognosis, and treatment of impaired glucose tolerance and impaired fasting glucose. Evid Rep Technol Assess (2005)
  • M. Egger et al. How important are comprehensive literature searches and the assessment of trial quality in systematic reviews? Empirical study. Health Technol Assess (2003)
  • T.A. Furukawa et al. Association between unreported outcomes and effect size estimates in Cochrane meta-analyses. JAMA (2007)

Funding disclosure and disclaimer: This manuscript is based on a project conducted by the University of Alberta Evidence-based Practice Center under contract to the Agency for Healthcare Research and Quality (AHRQ), Rockville, MD (Contract No. 290–2007–10021). The findings and conclusions in this manuscript are those of the authors, who are responsible for its contents; the findings and conclusions do not necessarily represent the views of AHRQ. No statement in this manuscript should be construed as an official position of AHRQ or of the US Department of Health and Human Services.
