Objectives To develop an algorithm that aims to provide guidance and awareness for choosing multiple study designs in systematic reviews of healthcare interventions.
Design Method study: (1) To summarise the literature base on the topic. (2) To apply the integration of various study types in systematic reviews. (3) To devise decision points and outline a pragmatic decision tree. (4) To check the plausibility of the algorithm by backtracking its pathways in four systematic reviews.
Results (1) The results of our systematic review of the published literature have already been published. (2) We recaptured the experience from our four previously conducted systematic reviews that required the integration of various study types. (3) We chose length of follow-up (long, short), frequency of events (rare, frequent) and types of outcome as decision points (death, disease, discomfort, disability, dissatisfaction) and aligned the study design labels according to the Cochrane Handbook. We also considered practical or ethical concerns, and the problem of unavailable high-quality evidence. While applying the algorithm, disease-specific circumstances and aims of interventions should be considered. (4) We confirmed the plausibility of the pathways of the algorithm.
Conclusions We propose that the algorithm can help to bring seminal features of a systematic review with multiple study designs to the attention of anyone who is planning to conduct a systematic review. It aims to increase awareness and we think that it may reduce the time burden on review authors and may contribute to the production of a higher quality review.
- MEDICAL EDUCATION & TRAINING
- STATISTICS & RESEARCH METHODS
Strengths and limitations of this study
We developed an algorithm to provide guidance for allocating various study designs to specific research questions. This can be viewed as a response to the lack of comprehensive guidance in major published methods documents.
The terms used for defining the critical decision points of the algorithm, such as length of follow-up, frequency of events and types of outcome need to be interpreted in the context of the disease.
Disease-specific circumstances and aims of interventions always have to be taken into account during application of the algorithm.
We could follow and confirm the appropriateness of the pathways when applying the algorithm to four selected systematic reviews.
The plausibility check of the algorithm was based on systematic reviews that were already completed. This approach is far removed from everyday working conditions and may be biased by subjective expectations of the authors. We encourage independent evaluation of the algorithm.
When evaluating healthcare interventions, different categories of intervention such as medicinal versus non-medicinal therapy and different categories of outcomes such as intended effects, adverse events or health-related quality of life may sometimes be best addressed by multiple study designs. Some designs have features that match the requirements of specific parts of a research question particularly well. Exclusively using data from randomised controlled trials (RCTs) to evaluate whether an intervention might work has a number of limitations.1 For example, RCTs may not be appropriate to estimate the incidence of rare (adverse) events. Other study designs, for example registry analyses, may incorporate the data of many more participants. These analyses may therefore complement the information on rare but important events. We have gathered some examples of research questions that cannot be investigated in RCTs, or only with great difficulty (table 1). A practical concern may arise with low numbers of patients with a rare disease. It might be difficult to conduct an RCT to evaluate patients with acquired severe aplastic anaemia. An ethical concern may arise with the treatment of severe diseases or life-threatening treatments. It might be obsolete to conduct an RCT to evaluate a new experimental treatment of patients with pancreatic cancer.
The methods of conducting systematic reviews of healthcare interventions are major components of ‘evidence-based’ medicine (EBM). In 2000, Sackett et al2 defined EBM as the integration of ‘best’ external research evidence with individual clinical expertise and with patient values and expectations. This definition may be visualised by three overlapping circles in a Venn diagram.3 The area of intersection, where all three resources meet, represents EBM. To distinguish more valid from less valid information, the ‘levels of evidence’ specify a hierarchical order for various research designs based on their internal validity. The highest level of valid data, that is, the ‘best’ evidence, however, is not always available. In table 2, we present the classification of some of the major study designs for intended effects of therapy. There appears to be some variation in the hierarchy among some authors and institutions issuing ‘evidence-based’ guidelines or systematic reviews. All authors agree that RCTs have the highest ‘level of evidence’ with respect to minimising the risk of bias. The prospective non-randomised controlled clinical trial (CCT) has an experimental design and its internal validity should be regarded as lower than that of a randomised trial but higher than that of an observational study. Prospective cohort studies have a potential for a lower risk of bias than retrospective cohort studies because they have a lower risk of recall bias and confounding.11 In cohort studies groups are defined by exposure whereas in case–control studies groups are defined by outcome status. Both points are acknowledged in the ‘hierarchy of evidence’ by some but not all authors. Case series and case reports are descriptions of one or more individual cases. Some authors combine both designs in one category while others place case series at a higher level.
In figure 1 we show a study design classification tree that includes the main features that make study designs distinct from one another, in line with the reports of the Centre for Evidence-Based Medicine (CEBM), the National Institute for Health and Care Excellence (NICE) and the Centre for Reviews and Dissemination (CRD).7 ,12 ,13 Examples of distinguishing study characteristics are a concurrent versus a historical control group, or whether or not the participants are allocated to the treatment groups by the investigator. Within-group comparison has also been referred to as a before-and-after study in a previous CRD report.
We observed a lack of a clear and comprehensive guidance to optimise the choice of study designs in systematic reviews. A simple and clear algorithm could remind authors instantly about important issues that should be considered. Consequently, we aimed to develop such an algorithm. The algorithm could raise awareness of the prospects of multiple study designs, especially for less experienced authors. It could bring attention to issues that could be overlooked but are pertinent to healthcare. We expect that it may reduce the time burden for review authors and may facilitate the production of higher quality reviews.
The objectives of the study may be subdivided into four items, which are used to structure the text.
First, we systematically reviewed the literature about the advantages and disadvantages of integrating multiple study designs in systematic reviews. Criteria for considering and search methods for identification of publications and data collection and analysis are described in a precursor article that is firmly connected to the present paper.14 These results form the information base for the topics of the current paper.
Second, we have conducted systematic reviews that required the integration of multiple study designs. We reflected on what could be learnt from this experience and what could also be helpful information for future authors. Experience gathered in these papers was woven into checking the plausibility of the algorithm.
Third, we wanted to know if major not-for-profit publishers of systematic reviews have included guidelines on the integration of multiple study designs in their manuals. We non-systematically searched the internet sites of 12 selected high-profile institutions and extracted the relevant statements. The access dates are provided with the reference. We then conceived a decision tree that should contain the major characteristics of clinical studies, provide easy-to-follow pathways and end on recommended study designs. The resulting algorithm should depict the necessary information in a clear and straightforward way and combine all parts on a single page. We observed that the length of follow-up and the frequency of events are essential components of every outcome assessment and we introduced those items as binary decision points. Furthermore, we were convinced on theoretical grounds that both components are critical in the process of choosing the appropriate study design. Fletcher15 classified outcomes into the following five simple categories: death, disease, discomfort, disability and dissatisfaction (table 3) and we judged that this classification fits well into the algorithm. We did not consider economic outcomes. We selected study design labels from the Cochrane Handbook to maintain a common language and a reference for their descriptions.16 We combined similar design concepts of the Cochrane Handbook that appeared redundant for the purpose of the algorithm. For example, the experimental design comprises mainly two types of design, the randomised controlled trial (RCT) and the non-randomised controlled trial (CCT).17 We used the term ‘CCT’ to combine the quasi-randomised controlled trial, the non-randomised controlled trial and the controlled before-and-after study. We used the term ‘cohort studies’ to combine the prospective and the retrospective cohort study designs. 
We used the term ‘case series’ to combine the case series study design and the uncontrolled before-and-after comparison design. The term ‘registry analyses’ is not listed in the Cochrane Handbook, though it may be classified as a retrospective subtype of cohort studies. Registries generally collect data that are confined to a specific disease, a specific intervention or a specific outcome. One example is the registry-based studies of the European Society for Blood and Marrow Transplantation, which registers data from patients who have received a transplant of bone marrow or haematopoietic stem cells.18 Therefore, we wanted to accentuate this type of data selection and analysis and introduced the term ‘registry analyses’ into the algorithm. We did not consider the historically controlled design because changes over time are expected to cause serious systematic differences between treatment groups. We also did not consider the cross-sectional study design due to the lack of observation over time.
Fourth, we checked the plausibility of the algorithm by backtracking its pathways with four previously conducted systematic reviews. The first author of the present paper was also the leading author of these systematic reviews and could apply his knowledge about all the details of the history of these systematic reviews. The second author checked whether the results of the plausibility check appeared sensible. The structured research question of each systematic review and the simulated pathways for choosing multiple study designs were described in detail. For this purpose, we structured the information such as the inclusion criteria by using the PICOTS-SD typology: participants (P), interventions (I), comparators (C) and outcomes (O), timing (T), setting (S), and study design (SD).7 ,19–21
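The PICOTS-SD typology described above can be pictured as a simple record with one field per element. The following is a minimal sketch only: the class name `PicotsSd` and the helper `missing_elements` are hypothetical names introduced for illustration, and the example values are taken from the short description of the negative pressure wound therapy review in this paper, not from table 4. Elements that this text does not state are deliberately left empty rather than guessed.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class PicotsSd:
    """One structured research question (hypothetical helper, not from the paper)."""
    participants: str = ""
    interventions: str = ""
    comparators: str = ""
    outcomes: List[str] = field(default_factory=list)
    timing: str = ""
    setting: str = ""
    study_designs: List[str] = field(default_factory=list)

    def missing_elements(self) -> List[str]:
        # Names of elements that are still empty, in PICOTS-SD order.
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Values as stated in the text for example 4; timing, setting and
# study designs are not given there, so they stay empty in this sketch.
npwt = PicotsSd(
    participants="patients with wounds",
    interventions="negative pressure wound therapy (NPWT)",
    comparators="standard wound dressing",
    outcomes=["complete wound closure", "severe adverse events"],
)

print(npwt.missing_elements())
```

A record of this kind makes it easy to see at the planning stage which elements of the research question have not yet been specified.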
We identified ‘49 studies that compared the effect sizes between randomised and non-randomised controlled trials, which were statistically different in 35%’.14 We concluded: ‘The risk of presenting uncertain results without knowing for sure the direction and magnitude of the effect holds true for both non-randomised and randomised controlled trials. The integration of multiple study designs in systematic reviews is required if patients should be informed on the many facets of patient relevant issues of healthcare interventions’.14
The following four systematic reviews were used. The PICOTS-SD frames of these papers are shown in table 4.
Example 1: Non-rhabdomyosarcoma soft tissue sarcomas.22 This systematic review evaluated autologous haematopoietic stem cell transplantation (autoHSCT) following high-dose chemotherapy (HDCT) versus standard-dose chemotherapy (SDCT) in patients with non-rhabdomyosarcoma soft tissue sarcomas (NRSTS).
Example 2: Acquired severe aplastic anaemia.24 This systematic review evaluated allogeneic haematopoietic stem cell transplantation (alloHSCT) from matched sibling donors (MSD) versus immunosuppressive therapy (IST) in patients with acquired severe aplastic anaemia (SAA).
Example 3: Localised prostate cancer.25 This systematic review evaluated permanent interstitial low-dose-rate brachytherapy (LDR-BT) versus radical prostatectomy (RP) versus external beam radiotherapy (EBRT) and no primary therapy (NPT) in patients with localised prostate cancer.
Example 4: Negative pressure wound therapy.27 This systematic review evaluated negative pressure wound therapy (NPWT) versus standard wound dressing in patients with wounds.
We looked at a sample of 12 high profile not-for-profit publishers of systematic reviews detailed in table 5. Of these, 10 have published guidance about their methodological procedure for preparing the systematic reviews.7 ,12 ,13 ,16 ,28–33 A range of other books or guidance documents on systematic reviews exist.8 ,34–38 We extracted the major statements of their methods guidance with respect to choosing the appropriate research design in online supplementary table S1. We did not identify an algorithm or a comprehensive guidance focusing on finding the appropriate research design in any of these methods guidance documents. We propose an algorithm, which is shown in figure 2. The algorithm has four decision points.
First, it should be decided whether the outcomes are typically evaluated at an early or late time point after start of treatment. The cut-off between a short and long follow-up depends on the type of disease, intervention and outcome and we list some examples that range from 30 days to 5 years (table 6).
Second, it should be decided whether the events of interest are regarded as rare or frequent. The cut-off between a rare and frequent event depends on the type of disease, intervention and outcome and we list some examples in table 7. A rare disease may be defined according to the Office of Rare Diseases Research (ORDR): “In the United States, a rare disease is generally considered to be a disease that affects fewer than 200 000 people”.49
Third, the type of outcome of interest needs to be considered. We list some examples of outcomes, which may depend on length of follow-up and frequency of events (table 8). Additional examples for outcomes of respiratory tract disease are shown in table 9.
Fourth, the recommended study design for inclusion in a systematic review is assigned. We used the following study design labels: RCT, CCT, nested case–control study, cohort study, case–control study, case series, case report and registry analysis. These and alternative study design labels are described in table 2.
Practical or ethical concerns may emerge as reasons to over-ride the earlier decisions or to switch to a more appropriate study design. We remind the reader at the bottom of figure 2 to reconsider the chosen path. This part is introduced to facilitate a flexible handling of the algorithm. Examples are shown in table 8. Ethical concerns are primarily associated with objections against experimental allocation.
We conducted a plausibility check of the algorithm’s pathways by backtracking four of our own previously published systematic reviews. We marked the pathways by boxes that are filled in with a coloured background or that have a coloured frame.
Example 1: Non-rhabdomyosarcoma soft tissue sarcomas. In the first version of this systematic review, we assumed a long follow-up and frequent events regarding death by disease or complication: NRSTS in figure 3.23 Thus, an RCT would be the best choice for all outcomes but we did not find any RCT and we did not find any comparative study. Instead, we identified only single-arm studies. We estimated overall survival and described adverse events but were unable to draw conclusions on the benefit of the intervention of interest. In a planned update 2 years later, we were able to identify a single RCT.22 Using different study types provided the advantage to report estimates of overall survival in the first version when RCTs were lacking. The advantage affected also the update version because the reporting of adverse events exceeded the scope of a single RCT considerably.
Example 2: Acquired severe aplastic anaemia. In this systematic review, we assumed a long follow-up and frequent events regarding death by disease or complication: SAA in figure 3.24 Thus, an RCT would be the best choice for all outcomes. As we did not identify any RCT, we included other comparative study designs. We tried to lower the risk of bias imposed by the non-randomised design. Eligible studies needed to be prospective non-randomised controlled trials, to meet the requirements of ‘Mendelian randomisation’, and to be confined to human leucocyte antigen (HLA)-matched sibling donors. Using different study types enabled the evaluation in view of lacking RCTs. It should be noted that these study data were generated more than 10 years ago and may not be applicable to the current medical care status.
Example 3: Localised prostate cancer. In this systematic review, we assumed a long follow-up and rare events concerning death: PCa: OS in figure 3.25 Localised prostate cancer as opposed to advanced prostate cancer is believed to be associated with a very good overall survival regardless of the intervention. While invasive interventions may not improve overall survival, they may impair the health-related quality of life considerably. For example, radical prostatectomy may promise to completely remove the malignant tumour but may also disrupt erectile function in a considerable proportion of patients. According to the algorithm, a cohort study would be appropriate to estimate long-term overall survival. Concerning patient-reported outcomes such as discomfort, disability and dissatisfaction, we assumed a short follow-up and frequent events: PCa: HRQL in figure 3. According to the algorithm, an RCT would be the best choice to evaluate the patient-reported outcomes. As only a single RCT was available, data on discomfort, disability and dissatisfaction were sparse and the inclusion of CCTs expanded the results considerably.26 Evaluating overall survival needed a different approach than evaluating patient-reported outcomes. The obvious reluctance of patients and physicians alike to participate in RCTs supported the consideration of other study designs, although restrictive inclusion criteria were necessary to ensure a minimal level of quality.
Example 4: Negative pressure wound therapy. In this systematic review, we assumed a short follow-up and frequent events concerning complete wound closure: NPWT: Closure in figure 3, and we assumed a long follow-up and rare events concerning the outcome of severe adverse events: NPWT: AE in figure 3.27 According to the algorithm, an RCT or a CCT would have been appropriate to evaluate the successful treatment of the disease. In 2009, the US Food and Drug Administration issued a report on six deaths and 77 other complications that were reported within a 2-year period in connection with NPWT.50 Many of the deaths occurred in outpatient care or care homes and were caused by bleeding complications. The consideration of registry analyses and case reports was very helpful in drawing attention to possibly dangerous and life-threatening events.
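The pathways backtracked in the four examples can be summarised as a small lookup from the decision points (length of follow-up, frequency of events, type of outcome) to study designs. The sketch below is an illustration under stated assumptions, not a transcription of figure 2: it encodes only the pathways reported in the text of the examples, the function name `suggest_designs` and the `override` parameter are hypothetical, and the outcome labels are free-text labels chosen here for readability. The `override` parameter stands for the reminder to reconsider the chosen path when practical or ethical concerns arise.

```python
from typing import Dict, List, Optional, Tuple

# (follow-up, frequency of events, outcome) -> designs named in the text.
# Assumption: only pathways described in the four worked examples are listed.
Pathway = Tuple[str, str, str]

BACKTRACKED_PATHWAYS: Dict[Pathway, List[str]] = {
    # Examples 1 and 2 (NRSTS, SAA): death by disease or complication.
    ("long", "frequent", "death"): ["RCT"],
    # Example 3 (PCa: OS): long-term overall survival.
    ("long", "rare", "death"): ["cohort study"],
    # Example 3 (PCa: HRQL): discomfort, disability, dissatisfaction.
    ("short", "frequent", "patient-reported outcomes"): ["RCT"],
    # Example 4 (NPWT: Closure).
    ("short", "frequent", "complete wound closure"): ["RCT", "CCT"],
    # Example 4 (NPWT: AE): designs the review reports as helpful.
    ("long", "rare", "severe adverse events"): ["registry analysis", "case report"],
}


def suggest_designs(
    follow_up: str,
    frequency: str,
    outcome: str,
    override: Optional[List[str]] = None,
) -> List[str]:
    """Return the designs recorded for a pathway (hypothetical helper)."""
    if follow_up not in ("short", "long"):
        raise ValueError("follow_up must be 'short' or 'long'")
    if frequency not in ("rare", "frequent"):
        raise ValueError("frequency must be 'rare' or 'frequent'")
    if override:
        # Practical or ethical concerns: the reviewer's own choice wins.
        return list(override)
    key = (follow_up, frequency, outcome)
    if key not in BACKTRACKED_PATHWAYS:
        raise KeyError(f"pathway {key} is not covered by the worked examples")
    return list(BACKTRACKED_PATHWAYS[key])


print(suggest_designs("long", "rare", "death"))
print(suggest_designs("long", "frequent", "death", override=["CCT"]))
```

The second call mirrors example 2, in which an RCT would have been the best choice but other comparative designs were included because no RCT was identified.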
In a separate paper, we concluded that “the integration of multiple study designs in systematic reviews is required and that the risk of presenting uncertain results without knowing for sure the direction and magnitude of the effect holds true for both nonrandomized and randomized controlled trials”.14 Our results appear to be in agreement with other authors. A Cochrane review compared RCTs versus historically or concurrently controlled non-randomised trials in 2007.51 The authors concluded that, on average, the non-randomised controlled trials tend to result in larger estimates of effect than RCTs. The latest update of this Cochrane review in 2011 amended the research question and compared RCTs versus concurrently controlled non-randomised trials and excluded historically controlled ones.52 The authors concluded that “the results of randomized and non-randomized controlled trials sometimes differed”, namely, “in some instances nonrandomized studies yielded larger estimates of effect and in other instances randomized trials yielded larger estimates of effect”. It appears that the early firm statement expressing larger estimates in the non-randomised controlled trials changed to a less decided message.
We reported our experience gained during the conduct of four of our systematic reviews. These systematic reviews required the inclusion of multiple study designs to accomplish the planned evaluation of healthcare interventions. They cover only a few selected topics. Experiences or conclusions derived from these papers are far from being representative and should not be generalised. They were conducted by the person who is also first author of the present work. Consequently, the inferences based on the four papers and reported in the present paper may be subjective. Thus, further research by other authors and concerning other topics is recommended.
We did not identify an existing algorithm or a comprehensive guidance focused on finding the appropriate research design. Therefore, we developed an algorithm, which aims to guide systematic reviewers in the reasonable inclusion of various study designs in their planned systematic reviews of healthcare interventions. The proposed algorithm cannot be applied without considering disease-specific circumstances and aims of interventions. The terms used for defining the critical decision points of the algorithm such as short versus long follow-up need to be interpreted in the context of the disease and may be unclear and not useful if used as general terms. We provided examples to show that short versus long follow-up can vary considerably depending on the disease. Similarly, we provided examples to show that the definition of rare versus frequent events has to be interpreted in the context of the type of intervention or exposure as well as the type of event. The outcomes include hard and soft outcomes, physician-reported and patient-reported outcomes and it is likely that the outcomes can match the purpose of the algorithm. Nevertheless, the types of outcomes have been arbitrarily chosen from a handbook of clinical epidemiology. An alternative selection of other outcomes might also be acceptable for the understanding and usefulness of the algorithm. The labelling of study designs and the descriptions of study design features are not consistently used. 
Hartling 2010, while testing a tool on study design classification, reported that reviewers disagreed considerably on fundamental design characteristics, such as whether the design was experimental or observational and whether there was a control group involved or not.53 Lopez-Alcade 2011 reported that “Cochrane review groups did not use common study design labels and did not explicitly describe all study design features suggested by the Cochrane Handbook”.54 We are confident that the algorithm is a tool that helps to bring seminal features of a systematic review to the attention of anyone who is planning to conduct a systematic review. It has the potential to help reviewers reorientate themselves to major features of the studies eligible for an evaluation of a healthcare intervention. The benefit is the provision of awareness, and it is certainly not a new regulation. The intention is to provide a guide and a decision support tool that might be used fully or partially by persons who are going to prepare a systematic review. While preparing a systematic review, it may be important at an early time point to identify the relevant and the most appropriate study designs necessary to find answers for a variety of prespecified outcomes. It might also be of interest for persons who evaluate the quality of systematic reviews and might want to check whether all study designs that should have been considered were in fact considered. Therefore, we think that it may reduce the time burden on review authors and contribute to the production of a higher quality review.
The plausibility check is a crude approach to judge whether the theory-based algorithm could be sensibly applied in practice. Thus, further research could facilitate a more objective and statistically measurable testing of the usefulness of the algorithm. It is recommended to let various systematic reviewers backtrack the algorithm independently and to apply the algorithm to more systematic reviews with different topics. We could follow and confirm the appropriateness of the pathways for all described examples of systematic reviews. In one example, the algorithm selected RCTs as the best choice but due to the lack of RCTs it was decided to rely on non-randomised studies. This example showed that it is important to build flexibility into the algorithm, which enables the systematic reviewer to extend or change the inclusion criteria to other study designs in case unexpected conditions emerge or practical concerns exist. While conducting systematic reviews, we observed the critical importance of case reports and registry analyses for the evaluation of serious adverse events. We mentioned above the Food and Drug Administration (FDA) report on adverse events after NPWT, which covered 2 years from 2007 to 2009. In a recent update, the FDA included two additional years covering a total of 4 years from 2007 to 2011.55 The adverse events increased to 12 deaths and 174 injuries. With respect to the added cases, bleeding was again the major cause of the most serious adverse events and the majority of adverse events occurred at home or in long-term care facilities. All RCTs on NPWT were conducted in hospitals and were unable to provide this information. In France between 1998 and 2004, 21 drugs were reported to be withdrawn from the market for safety reasons. 
The withdrawal of 19 of 21 drugs was based on case reports and only 1 withdrawal was supported by an RCT.56 In the European Community between 2002 and 2011, case reports contributed to the withdrawal of 18 of 19 drugs.57
We are confident that the algorithm can help to bring seminal features of a systematic review to the attention of anyone who is planning to conduct a systematic review. It aims to provide awareness and we think that it may reduce the time burden on review authors and may contribute to the production of a higher quality review.
Contributors FP and JK conceived and designed the experiments. FP performed the experiment, analysed the data and contributed analysis tools. FP and JK wrote the manuscript.
Funding This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement No additional data are available.