Original research
Tools for assessing quality of studies investigating health interventions using real-world data: a literature review and content analysis
  1. Li Jiu1,
  2. Michiel Hartog1,
  3. Junfeng Wang1,
  4. Rick A Vreman1,
  5. Olaf H Klungel1,
  6. Aukje K Mantel-Teeuwisse1,
  7. Wim G Goettsch1,2
  1. 1Division of Pharmacoepidemiology and Clinical Pharmacology, Utrecht Institute for Pharmaceutical Sciences, Utrecht University, Utrecht, Netherlands
  2. 2National Health Care Institute, Diemen, Netherlands
  1. Correspondence to Dr Wim G Goettsch; w.g.goettsch{at}; Dr Junfeng Wang; j.wang5{at}


Objectives We aimed to identify existing appraisal tools for non-randomised studies of interventions (NRSIs) and to compare the criteria that the tools provide at the quality-item level.

Design Literature review through three approaches: systematic search of journal articles, snowballing search of reviews on appraisal tools and grey literature search on websites of health technology assessment (HTA) agencies.

Data sources Systematic search: Medline; Snowballing: starting from three articles (D’Andrea et al, Quigley et al and Faria et al); Grey literature: websites of European HTA agencies listed by the International Network of Agencies for Health Technology Assessment. Appraisal tools were searched through April 2022.

Eligibility criteria for selecting studies We included a tool if it addressed quality concerns of NRSIs and was published in English (unless from grey literature). A tool was excluded if it was only for diagnostic, prognostic, qualitative or secondary studies.

Data extraction and synthesis Two independent researchers searched, screened and reviewed all included studies and tools, summarised quality items and scored whether and to what extent a quality item was described by a tool, for either methodological quality or reporting.

Results Forty-nine tools met the inclusion criteria and were included in the content analysis. Concerns regarding the quality of NRSIs were categorised into 4 domains and 26 items. The Research Triangle Institute Item Bank (RTI Item Bank) and STrengthening the Reporting of OBservational studies in Epidemiology (STROBE) were the most comprehensive tools for methodological quality and reporting, respectively, as they addressed (n=20; 17) and sufficiently described (n=18; 13) the highest number of items. However, none of the tools covered all items.

Conclusion Most of the tools have their own strengths, but none of them could address all quality concerns relevant to NRSIs. Even the most comprehensive tools could still be complemented on several items. We suggest decision-makers, researchers and tool developers consider the quality-item-level heterogeneity when selecting a tool or identifying a research gap.

OSF registration number OSF registration DOI

  • Systematic Review

Data availability statement

All data relevant to the study are included in the article or uploaded as supplementary information.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:



Strengths and limitations of this study

  • This literature review identified 49 appraisal tools for non-randomised studies of interventions, through both a systematic approach (ie, database search) and non-systematic approaches (ie, snowballing and grey literature search).

  • Our study compared the sufficiency of descriptions in appraisal tools at the quality-item level, for either methodological quality or reporting.

  • We only searched websites of health technology assessment agencies for grey literature, so tools mentioned only by clinical guideline or regulatory organisations might have been overlooked.

  • The usefulness of categorising a quality item’s description as ‘sufficient’ or ‘brief’ for each tool, based on whether an explanation was provided for the criteria, has not been tested in previous studies.


Introduction

Real-world data (RWD) generally refer to data collected during routine clinical practice, but their definition varies across settings.1 According to Makady et al, one definition of RWD is data collected without interference with treatment assignment.1 RWD that fit this definition are normally analysed in non-randomised studies of interventions (NRSIs), which estimate the effectiveness of a health intervention without randomising intervention groups.2 3

NRSIs provide evidence on the clinical and cost-effectiveness of health interventions for decision-making, in clinical and health technology assessment (HTA) settings.4–9 For example, NRSIs could inform clinicians on which diagnostic or treatment strategies to adopt.4 5 Also, with NRSIs, HTA agencies could gain more certainty on the validity of evidence from randomised controlled trials (RCTs) when deciding which health intervention to reimburse and which pricing strategy to adopt.6 7 In addition, HTA stakeholders could exploit NRSIs to evaluate highly innovative or complex interventions for which RCTs may be considered infeasible or unethical.8 9 Generally speaking, NRSIs have become increasingly useful, as they complement, and sometimes replace, RCTs when RCTs are scarce or infeasible to conduct.2 10

However, the usefulness of NRSIs is often questioned due to quality concerns, in terms of risk of bias (RoB) and reporting. According to the Cochrane Handbook, NRSIs carry a higher RoB than RCTs and are vulnerable to various types of bias, such as confounding, selection and information bias.11 Also, the Professional Society for Health Economics and Outcomes Research (ISPOR) published a report in 2020 stating that insufficient reporting on how an NRSI was generated was a major barrier for decision-makers to adopt NRSIs.12

To address the quality concerns of NRSIs and to build decision-makers’ confidence, NRSIs need to be rigorously appraised, which rationalises the development and use of appraisal tools. According to systematic reviews of appraisal tools for NRSIs, tens of tools have been developed over the past five decades.13–15 The growing number of tools has brought a new challenge to users: how to select the best tool. To address this challenge, previous reviews have summarised quality items (ie, groups of criteria or signalling questions on methodological quality or reporting) and compared whether existing tools addressed these items.13–15 Example items include ‘measurement of outcomes’, ‘loss to follow-up bias’, ‘inclusion and exclusion criteria of target population’ and ‘sampling strategies to correct selection bias’.13 In addition, these reviews provided some general recommendations on tool selection, such as referring to multiple tools for quality appraisal.14 However, information is still lacking on the extent to which the tools address each quality item and on the heterogeneity of tools at the quality-item level.
To take outcome measurement as an example, the Academy of Nutrition and Dietetics Quality Criteria (ANDQ) checklist mentions that outcomes should be measured with ‘standard, valid and reliable data collection instruments, tests and procedures’ and ‘at an appropriate level of precision’.16 In contrast, the Good ReseArch for Comparative Effectiveness (GRACE) checklist considers ‘valid and reliable’ measurement to be ‘objective rather than subject to clinical judgement’,17 while the Risk Of Bias In Non-randomised Studies—of Interventions (ROBINS-I) checklist interprets the ‘standard’ way as ‘comparable across study groups’ and ‘valid and reliable’ as low detection bias without ‘systematic errors’ in outcome measurement.18 In summary, the heterogeneity in the level of detail with which tools address a quality item, and the heterogeneity in the content and format of signalling questions, can pose a challenge when tools are selected, or even merged.

Hence, our study aimed to summarise and compare, through a content analysis, the signalling questions or criteria that the tools provide at the quality-item level. This research was performed as part of the HTx project.19 The project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825162.



Methods

To ensure the credibility of the review and the content analysis, we registered a study protocol in the OSF registry (registration DOI: ) on 30 June 2022. The OSF registry is an online repository that accepts registrations of all types of research projects, including reviews and content analyses.20

Patient and public involvement

Patients and/or the public were not involved in the design, conduct, reporting or dissemination plans of this research.


Definitions

In our study, appraisal tools refer to tools, guidelines, instruments or standards that provide guidance on how to report or assess any quality concern of NRSIs. NRSIs, according to the Cochrane Handbook, refer to any quantitative study estimating the effectiveness of an intervention without using randomisation to allocate patients to intervention groups.2 According to Makady et al, data collected in such NRSIs belong to the second category of RWD, that is, those collected without interference with treatment assignment, patient monitoring or follow-up, or selection of the study population.1

Search strategy

To identify appraisal tools for NRSIs from various potential sources, we adopted three approaches. A diagram illustrating how the three approaches complemented each other is shown in online supplemental appendix 1.

Database search

In the first approach, we conducted a systematic review to identify articles on appraisal tools through a database search in Medline. Since D’Andrea et al had already conducted a systematic review identifying appraisal tools for all types of non-randomised studies published before November 2019,13 we updated their review by searching for articles published between November 2019 and April 2022, using their search strings.


Snowballing

In the second approach, we searched for published reviews on appraisal tools for NRSIs. To identify all published reviews, we adopted the snowballing approach described by Wohlin.21 Snowballing refers to using the citations of articles to identify additional articles, and it is considered a good extension of a database search.21 To implement the approach, three researchers (LJ, MH and JW) first conducted a pilot search of articles using Google Scholar, reviewed full texts, judged eligibility through a group discussion and identified three reviews (ie, those by D’Andrea et al,13 Quigley et al14 and Faria et al15). Next, the three reviews were used as a starting set and uploaded to the website Connected Papers, which provides an online tool for snowballing.22 For each uploaded review, Connected Papers analysed approximately 50 000 articles and returned the 40 articles with the highest similarity, based on factors such as overlapping citations. After judging the eligibility of the returned articles, eligible articles were uploaded to Connected Papers for a second round of snowballing.

Grey literature

In the third approach, we searched for grey literature on the websites of European HTA agencies. Our rationale was that some appraisal tools may exist only as grey literature, such as agency reports and technical support documents. The list of European HTA agencies was derived from the International Network of Agencies for Health Technology Assessment.23 On each agency website, two researchers (MH and LJ) independently searched for grey literature using four concepts: ‘quality’, ‘RoB’, ‘appraisal’ and ‘methodology’. For each concept, only the first 10 hits, sorted by relevance where that option was available, were included (ie, a maximum of 40 hits per website).

Eligibility criteria for articles and grey literature to identify relevant tools

An article or grey literature document was included if it described one or more appraisal tools. It was excluded if it only described tools for RCTs, or only tools for diagnostic, prognostic, qualitative or secondary studies (eg, systematic reviews and cost-effectiveness analyses). We only included articles identified through the database search and snowballing if they were published in English, while grey literature in any language was included, as many HTA agencies publish only in their national languages. Relevant documents obtained through this approach were translated using Google Translate.

The process of identifying studies and appraisal tools

Two researchers (MH and LJ) independently screened all titles and abstracts of the identified hits, then reviewed the full texts with Rayyan24 and Excel. After identifying the eligible studies, one researcher (MH) extracted the names of the tools and downloaded them by tracking study citations. A pilot search with Google was conducted to ensure we downloaded the most up-to-date versions. Next, two researchers (MH and LJ) independently reviewed the full texts and judged the eligibility of the tools. An appraisal tool was included if it (1) was designed for non-randomised studies, (2) was used for assessing either methodological quality or reporting and (3) was developed or updated after 2002. A tool was excluded if it was designed for non-randomised studies of exposures not controlled by investigators (eg, diets). All discrepancies were resolved through discussion among the three researchers (MH, LJ and JW).

One researcher (MH) extracted tool characteristics using a prespecified Excel form. The data items included publication year, tool format (eg, checklist or rating scale), targeted study design (eg, all NRSIs, cohort studies, etc), target interventions (eg, all or surgical interventions), originality (ie, whether a tool was developed based on an existing tool) and scope. The scope referred to whether the tools were designed for assessing methodological quality (eg, RoB and external validity) and/or for ensuring adequate reporting of research details that could be used for assessing methodological quality.25

For the content analysis, we adopted both deductive and inductive coding techniques.26 First, we derived a list of candidate quality items from the three reviews that formed the starting set for the snowballing.13–15 Then, in a pilot coding process, we reviewed all identified appraisal tools and judged whether each candidate quality item was described. After the pilot coding, we summarised signalling questions or criteria that were not covered by the candidate items and coded them as new items. After updating the list of candidate items, three researchers (JW, LJ and MH) finalised the items in four group meetings. During the meetings, we merged items with overlapping content, split items containing too much content and renamed items so that they were self-explanatory.

To score whether and to what extent a quality item was described by a tool, we again reviewed all identified tools. If an item was described by a tool in one or several signalling questions, we judged whether the question(s) related to methodological quality, reporting or both, independently of what the original studies claimed. Additionally, we judged whether an item was described sufficiently or briefly. A description was scored as ‘brief’ if the corresponding signalling question(s) did not explain how to improve or assess methodological quality, or did not specify the elements needed for reporting. For example, ‘outcomes should be measured appropriately’ and ‘outcome measurement should be adequately described’ are ‘brief’ descriptions if no additional explanation is provided. The scoring process was conducted independently by two researchers (LJ and MH) using NVivo V.12, and all discrepancies were resolved through discussion between the two.


Results

Tool selection

As shown in figure 1, we identified 1738 articles after removing duplicates and excluded 1645 articles after subsequently reviewing titles, abstracts and full texts. From the 27 eligible studies, we identified 417 appraisal tools. After removing duplicates and reviewing full texts, we included 49 tools that met our criteria. References of the included studies and appraisal tools are shown in online supplemental appendices 2 and 3, respectively.

Figure 1

Flow chart for the inclusion and exclusion of appraisal tools for non-randomised studies of interventions

Characteristics of appraisal tools

As shown in table 1, 18 (37%) tools were published between 2002 and 2010, while 31 (63%) tools were published thereafter. Among these, 30 (61%), 6 (12%) and 5 (10%) tools were designed for addressing methodological quality, reporting and both, respectively, while 7 (14%) tools did not report their intended use. About three-quarters of the tools were designed for all types of NRSIs, while the others were designed for one or several NRSI types, such as cohort (16%) and case–control studies (16%). Regarding sources, 44 (90%) tools were described in articles that developed a tool, in grey literature (eg, online checklists or reports), or in both, while the other five tools were extensions of existing tools, created when researchers conducted systematic reviews of non-randomised studies. Finally, 9 (18%) tools were designed for specific interventions or diseases, while all other tools were generic in nature.

Table 1

Characteristics of the 49 included appraisal tools for non-randomised studies of interventions

Quality domains and items

We identified 44 criteria describing study quality from the three previous reviews.13–15 After merging criteria with similar content (eg, ‘Follow-up’ and ‘Loss to follow-up’) and incorporating items into those with wider meanings (eg, ‘Loss to follow-up bias’ into ‘Loss to follow-up’), we obtained a list of 18 items. After the pilot coding, we summarised criteria of appraisal tools not covered by the 18 items into another eight items. Following the general order of conducting an NRSI (eg, study design, then data analysis), these 26 items were categorised into four domains: Study design, Data quality, Data analysis and Results presentation. As shown in figure 2 and table 2, all domains and most items were addressed by existing tools, but for each item, the number of tools with sufficient descriptions was relatively small. For three items on methodological quality and nine items on reporting, fewer than five tools addressed them, and none of the tools described them sufficiently.

Figure 2

The extent to which the appraisal tools addressed quality items on methodological quality or reporting.

Table 2

Overview of the 4 domains and 26 quality items, with numbers and proportions of appraisal tools that addressed or sufficiently described them

Figure 2 illustrates whether and to what extent the identified tools addressed the quality items in terms of methodological quality or reporting. The 26 columns represent the 26 quality items shown in table 2. The ranking of appraisal tools based on the number of items addressed or sufficiently described, overall or segmented by quality domain, is shown in online supplemental appendices 4–6. Regarding methodological quality, the Research Triangle Institute Item Bank (RTI Item Bank)27 addressed (n=20) and sufficiently described (n=18) the highest number of items. In addition, the tools ranked in the top 10 on both criteria (number of items addressed and number sufficiently described) included the Methodology Index for Non-randomized Studies (MINORS),28 the tool by Faillie et al,29 ROBINS-I,18 ANDQ,16 the Comparative Effectiveness Research Collaborative Initiative Questionnaire (CER-CI)30 and the Joanna Briggs Institute’s Critical Appraisal Tool (JBI).31 These tools addressed at least 10 items and sufficiently described at least 5 items. In the Study design domain, the RTI Item Bank27 sufficiently described the most items (n=7), while in the Data quality domain, the RTI Item Bank27 and MINORS28 ranked top two, each sufficiently describing at least 5 of the 10 items. In the Data analysis domain, only Faillie et al29 and Handu et al32 sufficiently described all three included items. In the Results presentation domain, the two relevant items were sufficiently described by Faillie et al,29 Handu et al32 and ANDQ.16 Regarding reporting, STrengthening the Reporting of OBservational studies in Epidemiology (STROBE)33 addressed (n=17) and sufficiently described (n=14) the highest number of items.
Also, the tools ranked in the top 10 on both criteria included Transparent Reporting of Evaluations with Non-randomized Designs (TREND),34 the tool by Genaidy et al,35 REporting of studies Conducted using Observational Routinely-collected Data (RECORD),36 the European Network of Centres for Pharmacoepidemiology and Pharmacovigilance (ENCePP),36 the International Society for Pharmacoepidemiology (ISPE),37 the tool by Tseng et al38 and the Joint Task Force between the International Society for Pharmacoepidemiology and the International Society for Pharmacoeconomics and Outcomes Research (ISPE-ISPOR).39 These tools addressed at least seven and sufficiently described at least three quality items, respectively. In all four quality domains, STROBE33 sufficiently described the (equally) most items compared with the other tools. Besides, in the Study design domain, ENCePP36 and RECORD40 sufficiently described at least 4 of the 11 items, while in the Data quality domain, TREND34 and Genaidy et al35 sufficiently described at least 4 of the 10 items. In the Data analysis and Results presentation domains, STROBE was the only tool that sufficiently described two of the three items, while 7 and 12 other tools, respectively, sufficiently described only one item.

Methodological quality

Among the four domains, Study design was the domain most ignored by appraisal tools, as only 4 of its 11 items were described in sufficient detail by more than four tools. More specifically, no tool described methodological quality on Ethical approval or Study objective in sufficient detail. For example, the guidelines manual of the National Institute for Health and Care Excellence (NICE) stated that: “The study addresses an appropriate and clearly focused question”.41 The tool did not explain what makes a question appropriate and clearly focused.

In addition, although one-third of the tools discussed what a good study design was, only three tools defined what ‘good’ meant.42–44 For example, the NHS Wales Questions to Assist with the Critical Appraisal of a Cross-Sectional Study (NHS Wales) stated that the choice of study design should be appropriate to the research question and should ensure the reliability of the study results.44 Outcome selection was also ignored by most tools, as only three tools (ie, the RTI Item Bank,27 MINORS28 and the tool by Faillie et al29) described it sufficiently. Similarly, only the RTI Item Bank,27 the tool by Genaidy et al35 and NICE41 sufficiently described the item Outcome definition. For example, Genaidy et al35 stated that a definition was clear only if ‘definitions of all outcome variables were clearly described’, and partially clear if not all variables were clearly described but ‘sufficient information was provided for the reader to understand the intent’.35 Other items that were rarely addressed or insufficiently described included Intervention definition and Data source. The respective tools with sufficient descriptions included SURE,45 ROBINS-I,18 MINORS,28 CER-CI,30 GRACE17 and the tool by Faillie et al.29


The Data quality domain was ignored by most tools, as 4 of its 10 items were sufficiently addressed by fewer than three tools. In particular, the items Intervention measurement and Length of follow-up were not sufficiently addressed by any tool: JBI was the only tool stating that the method of measuring interventions should be clearly reported,31 while the 19 tools addressing Intervention measurement focused only on methodological quality. Other items that were rarely or insufficiently addressed included Outcome blinding and Loss to follow-up. Regarding Outcome blinding, only three tools provided sufficient descriptions, that is, MINORS, TREND and ISPE.28 34 37 Similarly, only the tool by Genaidy et al,35 TREND and STROBE sufficiently described Loss to follow-up.32 35 36


Discussion

We conducted a review of appraisal tools for NRSIs and assessed whether, and how sufficiently, these tools addressed quality concerns, in terms of methodological quality or reporting, across 4 quality domains and 26 items. Our study identified 49 tools and showed that the RTI Item Bank and STROBE were the most comprehensive, with the highest numbers of items addressed and sufficiently described for methodological quality and reporting, respectively. However, none of the tools addressed concerns in all items, even briefly. The items least addressed for methodological quality included Outcome selection, Outcome definition and Ethical approval, and for reporting included Intervention selection, Intervention measurement and Length of follow-up.

To our knowledge, this is the first study to compare the sufficiency of descriptions in appraisal tools at the quality-item level. Previous reviews also compared appraisal tools, but from different perspectives. D’Andrea et al identified 44 tools evaluating the comparative safety and effectiveness of medications, and only assessed whether or not these tools addressed methodological quality in eight domains.13 In another review, Ma et al elaborated for what types of study design a tool was suited.46 For example, for cohort studies, they encouraged the use of five tools, while discouraging the use of another two. However, they did not clarify why some tools were more suitable than others. Quigley et al identified 48 tools for appraising the quality of systematic reviews of non-randomised studies, listed the five most commonly used tools and assessed whether they addressed 12 quality domains, such as ‘appropriate design’ and ‘appropriate statistical analysis’.14 Although the tools were compared using different criteria, some results were consistent across all studies. For example, both D’Andrea et al13 and our study found that intervention measurement, outcome measurement and confounding were frequently addressed by existing tools. Also, Ma et al46 and Quigley et al14 both recommended ROBINS-I, MINORS and JBI, and all of these tools ranked in the top 10 for addressing and sufficiently describing methodological quality in our study. With detailed information on the sufficiency of descriptions at the quality-item level, we add value to previous reviews by listing the quality concerns that such commonly recommended tools could not adequately address.

We also found some discrepancies in the tools identified or recommended. For example, of the 44 tools identified by D’Andrea et al,13 27 were published between 2003 and 2019, while in our study, 47 tools published in that period were identified. This discrepancy could be explained by additional tools identified through other reviews, by tools from grey literature and by differences in eligibility criteria (eg, exclusion of non-pharmacological interventions, or of tools assessing only one or a few specific types of bias). Another discrepancy was that some tools that ranked top in our study were less recommended by previous reviews, such as the RTI Item Bank27 and the tool by Faillie et al29 for methodological quality, and the tool by Genaidy et al35 for reporting. This might be explained by the novel criterion (ie, how sufficiently quality items were addressed) we used to evaluate these tools.

We found that, with information on how sufficiently a tool describes a quality item, tool users might broaden their view of the quality concerns of non-randomised studies to be considered. For example, if ROBINS-I18 is used for assessing methodological quality, the quality concerns known to users will be RoB in eight domains (eg, confounding and selection bias). However, as shown in figure 2, quality concerns in 16 items (eg, Intervention selection and Outcome definition) may not be sufficiently described in ROBINS-I but are in other tools, such as the RTI Item Bank,27 the NICE checklist41 and the tool by NHS Wales.44 Similarly, if users check the ENCePP36 and ISPE37 tools, in addition to STROBE, for reporting quality concerns, they may more comprehensively understand concerns on Ethical approval, Outcome definition, Study objective and Data source. The tool users who may benefit from such information are not only researchers who conduct non-randomised studies and decision-makers who assess study quality, but also tool developers who may identify a research gap.

While the needs of tool users may vary, our research could serve them all to some extent. For example, it is important for researchers to ensure sufficient reporting of the strengths and weaknesses of an NRSI, as such information will ultimately be used for determining the eligibility of their studies for decision-making.32 47 For HTA agencies, NRSIs can be used to extrapolate long-term drug effectiveness and to identify drug-related costs, and a deep and consistent understanding among agencies of how to assess NRSI quality is important for promoting the use of RWD.48 For regulators, a comprehensive understanding of how to evaluate NRSI quality may promote a structured pattern of using RWD to support drug regulation.49 While researchers focus more on reporting, and decision-makers (eg, HTA agencies) place more emphasis on methodological quality, we suggest all users pay attention to the linkage between methodology and reporting for each quality item, as illustrated in our research, as it could help them understand the necessity of investigating each item.

Another finding of our research was that whether and to what extent a quality concern was addressed by a tool partly depended on the tool’s purpose. For example, the GRACE checklist was designed as a ‘screening tool’ to exclude studies that did not meet basic quality requirements,17 and ROBINS-I focused on RoB rather than on all methodological quality issues, such as the appropriateness of study objectives or of statistical analyses for patient matching.18 Some tools, such as JBI Cohort,31 were specific to one type of study design. While they addressed fewer than half of the quality items defined in our research, they have proven robust in many studies.14 Additionally, for several quality items we found some heterogeneity in the content of signalling questions or criteria among the tools with sufficient descriptions. For example, to assess the methodological quality of sensitivity analyses, CER-CI30 stated that key assumptions or definitions of outcomes should be tested, while the tool by Viswanathan et al50 emphasised the importance of reducing uncertainty in individual judgements. Given the heterogeneity of tools, we suggest users follow a two-step approach when selecting a tool. First, users may narrow the scope of candidate tools based on their own needs, for example, by excluding tools for a different study design. This step could be achieved by referring to synthesised results and recommendations from existing reviews.13 14 Second, users could use the overview we provide (figure 2) to see which tool(s) could offer complementary insights that their first-choice tool lacks.

Furthermore, we found that appraisal tools designed for specific interventions have the potential to be transferred to general interventions. In our research, the tools described by Tseng et al38 and Blagojevic et al51 and ANDQ16 were originally designed for a surgical intervention, for knee osteoarthritis and for the field of diabetes, respectively. All of these tools ranked in the top 15 in our study for addressing either methodological quality or reporting (online supplemental appendices 4–6), and many of their criteria could be generalisable. For example, Tseng et al38 stated that interventions could be adequately described with specifically referenced articles (online supplemental appendix 7).38 Although such tools could be transferred, they often used disease- or intervention-specific concepts in their criteria, which might need to be adjusted before being applied more widely.

Moreover, we noticed that some quality items, such as Study objective, Ethical approval and Sensitivity analysis, were addressed less frequently than other items. This might be explained by the fact that some items were more related to a certain user need than others. For example, a tool addressing concerns about RoB may focus less on Study objective, which is relatively difficult to link directly to a well-defined type of bias. Still, since these quality items are related to NRSI quality and are rarely sufficiently described, particular efforts to investigate them may be needed in future tool development. In contrast, while some quality items, such as Length of follow-up and Intervention measurement, have been frequently addressed, they are not necessarily relevant to all types of user needs. For example, as shown in table 2 and online supplemental file appendix 7, 14 tools highlighted that the follow-up should be sufficiently long to detect an association between intervention and outcome, but none of these tools linked Length of follow-up to RoB. Therefore, we recommend that tool developers clarify not only the purpose of their tools but also the relevance of their signalling questions to specific user needs (eg, RoB assessment). We also advise that future research investigate the relationships between quality items and user needs in more detail.

Our study has a number of limitations. One limitation is that some tools identified by our study were originally developed for purposes beyond assessing the methodological quality or reporting of NRSIs, so our study could not cover the full potential of these tools. For example, the GRADE framework was mainly designed to address the certainty of evidence, such as indirectness (ie, whether interventions were compared directly), and to make relevant clinical practice recommendations. While it mentions RoB (eg, publication bias), its main purpose is to illustrate how to grade the quality of evidence rather than to function as a quality appraisal tool per se. In other words, GRADE allows users to use any additional tools to assess NRSI quality.52 Also, the GRADE checklist was designed for both RCTs and NRSIs, so some criteria might be relatively brief compared with specifically designed tools, such as the RTI Item Bank.27 Finally, GRADE can be used to estimate and score the quality of the full body of evidence, not only of individual primary studies. Therefore, tool users who assess NRSIs beyond methodological quality or reporting should consider criteria in addition to those mentioned in our study when selecting a tool. Another limitation is that some tools were predecessors of others, but we did not exclude them if they met the inclusion criteria. For example, the ROBINS-I tool was developed from A Cochrane Risk Of Bias Assessment Tool: for Non-Randomized Studies of Interventions (ACROBAT-NRSI),53 and some of their signalling questions differed. Such information on tool lineage may also be considered for tool selection, if available from the tools.
Another limitation is that we only searched HTA agency websites for grey literature, and the hits returned by the snowballing approach depended on the starting set of articles, so tools mentioned only by clinical guideline or regulatory organisations, or tools missed by previous reviews, might have been overlooked. Also, only one researcher (MH) traced versions of tools, by following the reference lists of the identified studies and by visiting the websites of the online tools. Consequently, the most up-to-date version of a tool might be missing, and the extent to which a quality item was described by a tool might be underestimated. As existing appraisal tools are continuously improved and new tools are being developed (eg, the HARmonized Protocol Template to Enhance Reproducibility (HARPER) and Authentic Transparent Relevant Accurate Track-Record (ATRAcTR)),54 55 an online platform that automatically identifies appraisal tools and summarises tool information would be promising. Such platforms have already been established for tools assessing observational studies of exposures not controlled by investigators (eg, dietary patterns).56 Another limitation is that we categorised the criteria of a quality item as ‘sufficient’ or ‘brief’ for each tool, based on whether an explanation was provided for the criteria. Though consensus was reached among the authors, and all tool criteria were independently reviewed by two researchers, tool users might question the feasibility of such categorisation when selecting a tool. Additionally, as we categorised quality items based on the order of conducting an NRSI (ie, from study design to results presentation), we did not provide specific suggestions on how to select tools based on bias categories.
For example, motivational bias, which occurs when judgements are influenced by the desirability or undesirability of events or outcomes, may affect the reporting and measurement of patient outcomes and of adherence to healthcare interventions.57 58 Although the items Conflict of interest and Outcome measurement are relevant to motivational bias, we did not investigate their relationships. Hence, we recommend that future research bridge our quality items to all potential categories of bias and then test whether a tool selected based on such categorisation, together with recommendations from previous reviews, truly satisfies tool users. It is also worth noting that the target audience of this review and content analysis includes decision-makers who assess the general quality of an NRSI, NRSI performers who may report the quality of their studies, and developers of relevant appraisal tools. However, when users focus on a specific type of concern (eg, causal effect or data quality), methodological guidance investigating that specific issue, as well as tools beyond the healthcare field (eg, in social science), already exists59 60 and may be consulted. In addition, tools for diagnostic studies, prognostic studies and secondary studies were beyond the scope of our study, and relevant users may refer to other studies, such as Quigley et al,14 for further information. Moreover, some frameworks specifically designed for assessing data quality, for example, in terms of data structures and completeness, have been published, and some of their instructions may also be considered as criteria for assessing NRSI quality.61–65 While evaluating these frameworks is beyond the scope of this study, we recommend that tool developers refer to them when defining relevant criteria or signalling questions in the future.


Most appraisal tools for NRSIs have their own strengths, but none of them addresses all quality concerns relevant to these studies. Even the most comprehensive tools could be complemented with items from other tools. With information on how sufficiently a tool describes a quality item, tool users might broaden their view of the quality concerns of non-randomised studies to be considered and might select a tool that more completely satisfies their needs. We suggest that decision-makers, researchers and tool developers consider this quality-item-level heterogeneity when selecting a tool or identifying a research gap.

Data availability statement

All data relevant to the study are included in the article or uploaded as supplementary information.


This research was previously published as an abstract at ISPOR Europe 2022 (November 2022, Vienna, Austria): Jiu L, Hartog MK, Wang J, et al. OP18 applicability of appraisal tools of real-world evidence in health technology assessment: a literature review and content analysis. Value Health. 2022 Dec 1;25(12):S389.


Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.


  • Contributors LJ designed the study protocol, identified appraisal tools, conducted the content analysis and wrote the manuscript; MH identified appraisal tools, collected data on appraisal tools and conducted the content analysis; JW designed the study protocol, solved the discrepancies on identification of appraisal tools and edited the manuscript; RAV designed the study protocol and edited the manuscript; OK provided assistance on coding of quality items and edited the manuscript; AM-T edited the manuscript; WG edited the manuscript, and was responsible for the overall content as the guarantor.

  • Funding This research was performed as part of the HTx project. The project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825162.

  • Competing interests None declared.

  • Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.