Observational designs and methods are important for evaluating the safety of medicines postapproval. They may be the only means to study large populations under routine clinical conditions or to evaluate a medicine’s association with rare events or long-latency outcomes. As such, the use of epidemiological designs and methods to assess the safety of a new medicine after it enters the market is a logical complement to pre-approval safety data collected from randomized controlled trials (RCTs). In the past 10 years, requirements for epidemiological studies as a condition of regulatory approval for new medicines have increased.[1,2] It is now generally recognized that an evaluation of a drug’s overall safety profile is strengthened by data on its use among real-world populations.[3–7]

However, despite these advances, the role of findings from observational studies for regulatory and clinical decision-making is frequently contested. An important reason is that epidemiological studies of medication exposures and their effects have difficulty measuring and controlling for confounding by indication for drug use (and/or severity of disease).[8–10] As with any form of confounding, an investigator can, in theory, control for confounding by indication if its causes or their surrogates can be identified and reliably measured. In practice this is not easily done. Concerns about uncontrolled confounding by indication are central to debates about the interpretation of observational study findings and have complicated the assessment of many potential medicine-adverse outcome associations.[11–22]

Nonetheless, the options for investigating real-world safety, or the safety of a medicine as it is actually prescribed by physicians and used by patients once on the market, are limited.[23] Postapproval phase IV RCTs are ethically and logistically difficult to conduct, are not appropriate for addressing many safety issues (e.g. rare or longer-term outcomes) and are unlikely to provide information on the safety of medicines in real-world populations. Thus, studies that are observational and follow patients with minimal interference are often the only means by which to answer clinically meaningful questions about a marketed medicine’s safety and to study patients in settings that generalize to real-world medicine use.

The ideal design for a post-approval safety study is one that minimizes the potential for bias yet is still relevant to real-world clinical practice. A design that, in principle, merges the ideal characteristics of the RCT (randomization) with those of an observational epidemiology study (follow-up with minimal intervention) is the large simple trial (LST).[24–26] LSTs are characterized by large sample sizes, often in the thousands; broad entry criteria consistent with the approved medication label; randomization based on equipoise, i.e. neither physician nor patient believes that one treatment option is superior; minimal data requirements such as a questionnaire or case report form usually collecting data in only a few pages with questions limited to key variables that are typically collected at routine clinical care visits; objectively measured endpoints (e.g. death, hospitalization); follow-up that minimizes interventions or interference with normal clinical practice; follow-up of all patients regardless of whether they discontinue randomized medication; and intent-to-treat (ITT) analysis examining the entire population of randomized subjects according to the treatment group to which they were initially randomized.

Although the LST design shares a key design component, randomization, with an RCT, it is distinguished by its intent to minimize interference with usual medical care. Notably, since both designs employ randomization to allocate patients to treatment groups, they are defined as interventional studies. In the RCT, the intent is restriction and control to create experimental conditions at baseline and over the course of the study. In contrast, in the LST the intent is to create balance of baseline characteristics but to then follow patient outcomes using observational methods. Practically, this means that endpoint definition, physician and patient recruitment, drug delivery, data collection, allowance for treatment discontinuation, concomitant drug use, and patient and site monitoring in an LST are operationally different from those of an RCT. Table I highlights the key aspects that differentiate these designs.

Table I. Design characteristics of a large simple trial (LST) compared with those of a randomized controlled trial (RCT)

Calls for the use of the LST design have, until recently, primarily focused on the need for studies of clinically important therapeutic or preventative effects of interventions to inform clinical and health policy decisions.[26,27] LSTs have been employed to study a range of real-world benefits, from interventions for the treatment and prevention of cardiovascular outcomes[28–32] to a comparison of antiretroviral treatment strategies for HIV-positive patients.[33] In the 1990s, two research groups described its use for safety, each using a single LST they had completed as a case study to infer general lessons for future studies.[34,35] More recently, because of its unique design characteristics, the LST has been put forward in regulatory guidance from the US FDA[36] and the Institute of Medicine’s (IOM) Report on Drug Safety[37] as having the potential to further characterize a drug’s post-approval safety profile. To evaluate whether the design is in fact advantageous in practice, we conducted a review of the published literature and the ClinicalTrials.gov registry. The aim was to identify studies that used the design to study safety outcomes and to determine if those studies had successfully answered their original hypotheses. The intent of our analyses was to draw conclusions about the utility of the LST design, and the conditions under which it is most appropriate, for comparative safety research.

The specific objectives of this review were to (i) identify all ongoing or completed LSTs with a primary safety endpoint described in the public domain; (ii) analyse and summarize the key design aspects and results of these studies, including whether the study successfully addressed the original research question(s); and (iii) describe any design characteristics that were noted by study authors to complicate the conduct of the study and interpretation of results.

1. Data Sources and Search Strategy

We conducted a systematic review of the published medical literature using PubMed and the ClinicalTrials.gov registry. The process used to identify studies is described in the following sections (a more detailed description of the criteria used is available in Appendix I, Supplemental Digital Content 1, http://links.adisonline.com/DSZ/A52). In addition to the systematic search of the literature and ClinicalTrials.gov registry, we reviewed bibliographies of LST review articles and book chapters, and conducted an informal survey of international experts in drug safety. Many of the responders to the informal survey are past or current investigators of the safety LSTs identified in this review (a list of the process and respondents is available in Appendix II, online SDC 1).

1.1 Identifying Large Simple Trials (LSTs)

For the purpose of this review, we defined ‘LST’ as any randomized study with simplified study procedures permitting comparative assessment of medicines under real-world or routine clinical conditions. Consistent with the focus on studies of real-world, post-approval safety, we sought to identify studies that principally used observational follow-up post-randomization, i.e. had few requirements post-randomization that differed from a physician’s prescribing practice or patient’s usual care. Formally and colloquially, these studies are referred to by many names, including LSTs/studies, large simplified trials/studies, large streamlined trials/studies, naturalistic trials/studies, practical clinical trials and pragmatic clinical trials. Thus, our searches used such terms as ‘simple’, ‘simplified’, ‘streamline’, ‘pragmatic’, ‘practical’, ‘naturalistic’, ‘trial’, ‘study’ and ‘randomization’ to capture any randomized study with simplified procedures that allowed routine care conditions. The prospective, randomized, open-label, blinded endpoint evaluation (PROBE) term was not included in the search string because of its inconsistent use in the literature. Furthermore, while many LSTs meet the criteria of a PROBE design most PROBE studies are not LSTs.

1.2 Identifying Studies with a Primary Safety Endpoint

The goal of this review was to identify whether the LST design had been used for the primary purpose of testing differences in safety outcomes. Therefore, the review excluded simple trials in which the primary hypothesis/endpoint was an efficacy or effectiveness endpoint. We used search terms such as ‘safety’ and ‘risk’ to limit the retrieved studies to those that included a safety endpoint or evaluated the comparative safety profile of the health interventions being studied in the trial. Clearly, randomized efficacy and effectiveness studies collect data on safety, as required by regulations, and may include a secondary endpoint of safety. However, in our review of abstracts and published manuscripts we sought to distinguish studies designed and powered to test a pre-specified primary safety hypothesis from those that included one or more secondary safety hypotheses or no specified safety hypothesis.

2. Systematic Literature Search

2.1 PubMed Search Strategy

PubMed is a service of the US National Library of Medicine that includes over 20 million citations from MEDLINE, life science journals and online books. We searched this database, including all citations posted through 31 December 2010. RCTs, drug toxicity and adverse events were searched as MeSH terms and subheadings (where relevant). Terms specific to LSTs were included using the ‘TW’ search field tag, which searches the entire record including the article title, abstract and all indexing terms associated with the article. The search string was as follows: (‘Randomized Controlled Trial’ [Publication Type]) AND (drug toxicity [MeSH] OR adverse effects [MeSH]) AND (‘large scale’ [TW] OR simple [TW] OR simplistic [TW] OR pragmatic [TW] OR practical [TW] OR streamline [TW] OR naturalistic [TW]).
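As an illustration only, the Boolean structure of the search string above can be assembled programmatically, which makes it easier to see the three AND-ed concept groups (study design, safety, LST terminology). The sketch below is not the tooling used for this review; the constant and function names are hypothetical, and straight double quotes stand in for the typographic quotes printed above.

```python
# Hypothetical helper names; the terms and field tags are copied from the
# search string reported in the text.
DESIGN_CLAUSES = ['"Randomized Controlled Trial" [Publication Type]']
SAFETY_CLAUSES = ["drug toxicity [MeSH]", "adverse effects [MeSH]"]
LST_TEXT_WORDS = [
    '"large scale"', "simple", "simplistic", "pragmatic",
    "practical", "streamline", "naturalistic",
]


def or_group(clauses):
    """Join clauses with OR and wrap the result in parentheses."""
    return "(" + " OR ".join(clauses) + ")"


def build_pubmed_query():
    """Combine the three concept groups with AND, tagging LST words with [TW]."""
    lst_clauses = [f"{word} [TW]" for word in LST_TEXT_WORDS]
    groups = (DESIGN_CLAUSES, SAFETY_CLAUSES, lst_clauses)
    return " AND ".join(or_group(group) for group in groups)


if __name__ == "__main__":
    print(build_pubmed_query())
```

Keeping each concept group in its own list means a term can be added or dropped without re-typing the whole string.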

2.2 ClinicalTrials.gov Search Strategy

ClinicalTrials.gov (http://www.ClinicalTrials.gov) is a registry of federally and privately supported clinical trials conducted in the US and around the world. Clinical trials in this registry are defined broadly as any research study in human volunteers to answer specific health questions; thus, the registry includes interventional and observational studies. The first version of the system was publicly available in February 2000.[38] Organizations that sponsor and implement clinical studies are responsible for submitting accurate and timely information about their studies to ClinicalTrials.gov, which is then reviewed by the National Library of Medicine (NLM). The database is updated daily. At the time of the registry search, ClinicalTrials.gov contained more than 100 000 trials sponsored by the National Institutes of Health, other federal agencies, and private industry. Studies listed in the database were conducted in all 50 states of the US and 174 countries.

We searched the ClinicalTrials.gov registry, including all studies posted February 2000 through 31 December 2010. The registry was searched in a similar fashion to the medical literature, albeit with variations because of its specific search requirements. The search string was as follows: (‘simple’ OR ‘simplified’ OR ‘streamline’ OR ‘pragmatic’ OR ‘practical’ OR ‘naturalistic’) AND ‘randomized’ AND ‘safety study’. In addition, the search was restricted to ‘Interventional Studies’ and ‘Phase IV’ using the registry selection options.

3. Criteria for Evaluating Abstract Summaries and Published Papers

After conducting the searches, identification of studies for inclusion in the safety LST analysis was conducted in two stages. Each stage had specific inclusion and exclusion criteria. In Stage 1, abstracts and summaries were reviewed and studies were identified for full paper review. In Stage 2, full papers were reviewed and studies were identified for inclusion in the analysis.

3.1 Stage 1: Selecting Abstracts and ClinicalTrials.gov Summaries for Full Paper Review

Each abstract and study summary was reviewed independently by two co-authors. The study was included in Stage 2 if at least one reviewer identified the abstract or study summary as meeting the specified inclusion/exclusion criteria or if there was insufficient information to evaluate the abstract or study summary. The inclusion and exclusion criteria are described in Appendix I (online SDC 1) and summarized below.

3.1.1 Inclusion Criteria

  1. Investigators describe a specific primary safety endpoint, set of safety endpoints or a composite safety or risk score.

  2. Study randomizes patients to a specific health intervention (treatment, vaccine, device, treatment strategy) or an appropriate usual-care comparison group.

  3. Phase IV study.

  4. Conducted at more than one site and includes at least 100 patients.

3.1.2 Exclusion Criteria

  1. Studies comparing the (general) adverse event profiles of two or more interventions without a specific endpoint of interest.

  2. Studies whose primary endpoint is a Quality of Life or general well-being measure.

  3. Health education, surgical procedure or dosing studies.

  4. Phase I–III studies.

  5. Pilots, feasibility assessments, or secondary analyses and extension studies of RCTs.

3.2 Stage 2: Selecting Published Papers for Inclusion in Safety LST Analysis

In Stage 2 of the review, the published articles for the abstracts and summaries identified in Stage 1 were reviewed independently by two co-authors. If both reviewers identified the study as meeting the specified inclusion/exclusion criteria, the study was included in the analysis. If only one reviewer identified the study as meeting the specified inclusion/exclusion criteria, the two reviewers discussed the study to reach consensus.
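The reviewer decision rules for the two stages (sections 3.1 and 3.2) differ in how disagreement is handled, and a short sketch may make the contrast explicit. The names below are hypothetical. Two details are assumptions not stated in the text: that 'insufficient information' is recorded as a per-reviewer verdict in Stage 1, and that a study rejected by both reviewers in Stage 2 is excluded.

```python
from enum import Enum


class Verdict(Enum):
    MEETS = "meets criteria"
    FAILS = "does not meet criteria"
    UNCLEAR = "insufficient information"


def stage1_advances(reviewer_a, reviewer_b):
    """Stage 1 is inclusive: an abstract or summary goes to full-paper review
    if either reviewer judges it eligible or cannot judge it (assumption:
    'insufficient information' is modelled as a reviewer verdict)."""
    return any(v in (Verdict.MEETS, Verdict.UNCLEAR)
               for v in (reviewer_a, reviewer_b))


def stage2_decision(meets_a, meets_b):
    """Stage 2 is stricter: both reviewers must agree for direct inclusion,
    and a split vote is resolved by discussion."""
    if meets_a and meets_b:
        return "include"
    if meets_a or meets_b:
        return "discuss to consensus"
    return "exclude"  # assumption: implied by the text, not stated


if __name__ == "__main__":
    print(stage1_advances(Verdict.FAILS, Verdict.UNCLEAR))
    print(stage2_decision(True, False))
```

The asymmetry is deliberate in designs of this kind: the first stage favours sensitivity so that borderline abstracts are not lost, and the second favours specificity.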

We first verified that the study met the inclusion and exclusion criteria set in Stage 1. If so, we reviewed the publication for evidence that the investigators intended to design and conduct a study reflective of real-world practice. Typically, this would be accomplished by studying the indicated population; using study procedures that do not deviate from normal practice, or do so minimally; and using observational follow-up methods to assess outcomes. The inclusion criteria are summarized below and are described in detail in Appendix I (online SDC 1).

3.2.1 Inclusion Criteria

  1. Eligibility per approved product label or Summary of Product Characteristics so that patients reflect those seen in a typical practice setting.

  2. Measurements and laboratory tests in the study are generally considered standard clinical practice for the treatment under evaluation and/or for the comparison group.

  3. Physician visits do not significantly exceed those expected for the indicated population.

  4. Data collected that are not directly related to the outcome assessment are limited.

  5. Outcome assessment is primarily dependent on a combination of patient or physician report, simple home or outpatient tests/measurements that are consistent with standard practice and home care, or physician and medical/hospital records.

Each study document (published paper or ClinicalTrials.gov summary) selected for inclusion was then reviewed to extract the following data: study name, year of publication or year posted on ClinicalTrials.gov, study drug (generic name), comparator, sample size and the verbatim description of the primary endpoint(s). We then extracted key design information for each study to demonstrate the extent to which the study mimicked routine care and used observational follow-up. For completeness, we also extracted information about whether the study performed intent-to-treat (ITT) analysis and/or a time on treatment (exposed person-time) analysis, and summarized the study’s results, including risk estimates and 95% confidence intervals (CIs) when these were reported.

4. LSTs with a Primary Safety Endpoint

Thirteen LSTs with primary safety endpoints were identified.[39–53] Ten studies were discovered by the systematic search of the published medical literature (n = 7)[41–43,45–47,51,52] and the ClinicalTrials.gov registry (n = 4),[44,49,50,53] with one study identified in both searches;[42–44] two were identified through prior knowledge of the authors,[39,40] and one through the survey of drug safety experts.[48]

The stages of the review process for the PubMed and ClinicalTrials.gov searches are depicted in figure 1.

Fig. 1. Results of the medical literature and ClinicalTrials.gov registry search and review process.

The PubMed search of literature from 1949 through 2010 retrieved 1323 abstracts. Of these, 1258 were excluded and 65 were determined to meet the criteria for full-paper review. The most common reasons for excluding abstracts were that safety was not a primary endpoint in the study; the abstract was selected because the search term referred to something other than study design or procedures; the study did not assess drug treatment effects (e.g. numerous short-term, single centre ‘simple’ surgical procedure studies); the study was a phase I healthy subject study, phase II dosing study or phase III extension study and thus did not use approved, marketed medicines; or the study design was not randomized. In addition, the search retrieved many abstracts describing phase IV RCT or LST designs where the primary purpose of the study was to assess an efficacy or effectiveness endpoint but where data on the general safety profile were also collected. Of the 65 full papers reviewed, 58 studies were excluded because the study was a methodological study (n = 2); phase I–III study (n = 5); RCT with a primary efficacy endpoint (n = 34); RCT extension follow-up study (n = 4); LST with a primary effectiveness endpoint (n = 2); RCT with a primary safety endpoint (n = 9); or LST evaluating safety profile without a specific endpoint (n = 2). Three excluded studies met some of the criteria used to define a safety LST. Two studies were LSTs and evaluated safety but did not specify a primary endpoint.[54,55] The third study[56] evaluated a safety endpoint and followed patients annually for 7 years but required extensive laboratory, psychometric and neurological assessments at baseline and subsequent visits and, thus, did not study subjects under routine clinical care conditions. Overall, seven studies were identified from the published medical literature.

The ClinicalTrials.gov search returned 60 study summaries. Four studies met the study selection criteria: the VOLUME Study (ClinicalTrials.gov ID NCT00359801) [see table II for definition of study acronyms], the ZODIAC Study (ClinicalTrials.gov ID NCT00418171), the SCOT Large Simple Safety Trial (ClinicalTrials.gov ID NCT00447759), and the GiSAS pragmatic RCT (ClinicalTrials.gov ID NCT01052389).[44,49,50,53] The ZODIAC Study had already been identified in the medical literature search. Of the 60 studies identified, 56 were excluded because the identified search term referred to something other than the study design or procedures (e.g. ‘simple blind’, ‘simple verbal scale’, ‘simplified treatment’) [n = 34] or because the study did not include safety as the primary endpoint (n = 22).

Table II. Study acronyms

Two studies[39,40] known to the authors were not identified in the published medical literature search or in the ClinicalTrials.gov registry, and only one study[48] of four potential LSTs[48,57–59] identified by the survey of drug safety experts met the study selection criteria. None of these studies were indexed using the broad search terms we used to describe LSTs. Upon further review, we discovered that the terms to which they were indexed were very general. In order to have retrieved these studies we would have had to use a search strategy retrieving more than 60 000 abstracts for review. Finally, all four studies were conducted or began enrolment prior to initiation of the ClinicalTrials.gov registry; therefore, we did not expect them to be catalogued there.

In summary, a total of 13 LSTs were identified for analysis through the published medical literature, ClinicalTrials.gov registry, survey of drug experts and prior knowledge of the authors.
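Because the counts in this section come from several sources with one overlap, it can help to check the arithmetic explicitly. The sketch below simply re-adds the figures reported above; the variable and function names are hypothetical, and nothing here is new data.

```python
# All numbers are taken from the text of this section.
PUBMED_ABSTRACTS_RETRIEVED = 1323
PUBMED_ABSTRACTS_EXCLUDED = 1258
FULL_PAPER_EXCLUSIONS = {
    "methodological study": 2,
    "phase I-III study": 5,
    "RCT with a primary efficacy endpoint": 34,
    "RCT extension follow-up study": 4,
    "LST with a primary effectiveness endpoint": 2,
    "RCT with a primary safety endpoint": 9,
    "LST evaluating safety profile without a specific endpoint": 2,
}
REGISTRY_SUMMARIES_RETRIEVED = 60
REGISTRY_EXCLUSIONS = {
    "search term unrelated to study design or procedures": 34,
    "safety not the primary endpoint": 22,
}
FOUND_IN_BOTH_SEARCHES = 1      # ZODIAC
PRIOR_AUTHOR_KNOWLEDGE = 2
EXPERT_SURVEY = 1


def tally():
    """Recompute the study-selection counts reported in the text."""
    full_papers = PUBMED_ABSTRACTS_RETRIEVED - PUBMED_ABSTRACTS_EXCLUDED
    from_pubmed = full_papers - sum(FULL_PAPER_EXCLUSIONS.values())
    from_registry = (REGISTRY_SUMMARIES_RETRIEVED
                     - sum(REGISTRY_EXCLUSIONS.values()))
    # A study found by both searches is counted once.
    systematic = from_pubmed + from_registry - FOUND_IN_BOTH_SEARCHES
    total = systematic + PRIOR_AUTHOR_KNOWLEDGE + EXPERT_SURVEY
    return {
        "full_papers_reviewed": full_papers,
        "from_pubmed": from_pubmed,
        "from_registry": from_registry,
        "from_systematic_search": systematic,
        "total": total,
    }


if __name__ == "__main__":
    for label, count in tally().items():
        print(f"{label}: {count}")
```

Run as written, the totals agree with the text: 65 full papers, 7 and 4 studies from the two searches, 10 after removing the overlap, and 13 overall.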

5. Comparison of Design Characteristics of the LSTs

The design elements of the 13 studies are summarized in table III.

Table III. Large simple trials (LSTs) with a primary safety endpoint by year of publication or ClinicalTrials.gov posting date

Upon review of these studies, we found they spanned a period of about 20 years, were conducted in multiple disease areas, and evaluated drugs and vaccines or treatment strategies (treatment duration, type, or dose and number of vaccine injections). Two trials, SCOT and GiSAS, are ongoing.[50,53] Three trials were conducted to examine the safety of prescription and over-the-counter NSAIDs,[40,41,50] two among patients being treated for diabetes mellitus[46,49] and three to assess the safety of atypical antipsychotics for patients with schizophrenia.[43,45,53] Only one study used placebo added to usual care as the comparator.[48] Most of the trials (n = 9) enrolled patients from general or community practice sites,[40–44,46–48,50,53] although four studies were conducted in hospitals.[39,45,51,52] Two studies tested the safety of health interventions among paediatric patients.[40,47]

Nine studies recruited and enrolled patients from one country only,[39–41,45–48,52,53] whereas four studies,[43,49–51] which had large actual or intended sample sizes, included sites from multiple countries. The actual or intended sample sizes of all studies ranged from just over 100 patients to more than 80 000 randomized patients. One study, VOLUME,[49] intended to enrol sites from more than 20 countries but did not reach its target sample size because of the market withdrawal of one of the study drugs, inhaled insulin, for commercial reasons. Despite having enrolled more than 25 000 patients, the SMART trial also terminated early. The primary outcome event rate was lower than expected and revised estimates of the sample size were approximately 60 000 patients.[48] A direct relationship between the sample size of the trial and the number of countries participating was not evident.

The primary safety endpoint for each study ranged from outcomes that required scale or instrument measurement during the course of the study, such as a persistent decline in pulmonary function (spirometry)[49] and tardive dyskinesia (symptom scales)[45] to ‘hard’ events such as death.[43,48,50] The majority of studies used broad inclusion criteria, had few procedures required and allowed flexible dosing. Inclusion in the studies was primarily determined by clinical opinion, i.e. the physician’s interpretation of the drug’s approved labelling with minimal or no additional exclusion criteria. Except for the COSMIC,[46] Pediatric Vaccine,[47] and Latent TB [51] studies, the primary analysis for the LSTs was ITT. These analyses included patients in the analysable population even if they discontinued or switched their assigned treatment. Although information was lacking from some studies, we found that in most studies patients were followed for periods that exceeded 30 days after last trial visit. This is in contrast to the approach typically used in many clinical trials where outcome follow-up ends 30 days after the last visit.

6. Results and Conclusions of the Completed LSTs

Each study’s rationale, main objective, primary results and conclusions are presented in table IV. The data collection method(s) used to obtain information about study endpoints, the duration of follow-up and the percentage of participants’ loss to follow-up are also reported in table IV.

Table IV. Rationale, results and conclusions of the completed large simple trials

Table IV indicates there are similarities across the studies, particularly in the reasons cited for conducting the study. The objective of each study was to address clinical uncertainty about the safe use of a health intervention. Six studies were postapproval commitments to the FDA or an EU regulatory agency to study the real-world safety of a medicine or vaccine after its approval and launch onto the market or following the emergence of safety concerns postmarketing.[43,46–50] In every study, the authors noted there were insufficient or no data available on the real-world safety of the interventions they were studying. Many investigators also reported that reliable estimates of the background incidence of the endpoint(s) were unavailable. When estimates of incidence were reported, they were typically inferred from phase III trials or spontaneous reporting rates for medicines in the same class[39,40,43,46] but in a couple of cases were based on real-world information because the medicine was already marketed.[41,50] Ten[39–44,46–49,51,52] of the 11 completed studies noted power calculations for the study, reporting the event rate or sample size needed to detect a statistically significant difference between treatment arms. Eleven studies, based on the available information, were designed as superiority trials.[39–47,50–53] Two studies, VOLUME and SMART, used a non-inferiority design.[48,49]

All studies addressed an important clinical and public health question. Every completed study provided results that meaningfully addressed the investigators’ original objectives and were relevant to clinical decision making. Even VOLUME and SMART, which were stopped early and were under-powered, provided relevant descriptive and comparative data, respectively.[48,49] In the majority of studies, the absence of an increase in safety events associated with the use of the study intervention provided support for the safe use of a new medication or treatment strategy as an equivalent therapeutic option.[39–44,46,47,51] In the TB Study,[51] the findings led to a considered re-evaluation of the safety of the standard of care for treatment of latent TB and a call for further research on the comparative efficacy and safety of the first- and second-line treatments.[60]

The studies varied in two notable ways: the duration of follow-up and the percentage of patients lost to follow-up. In the four short-duration studies,[39–41,47] where follow-up ranged from 1 day to 1 month, loss to follow-up was minimal. Longer duration studies had greater attrition.[42–46,48–51] Patients with schizophrenia and asthma were the most difficult groups to follow.[43,48] In studies where patients were randomized to a medicine and were followed for longer than 1 month, we found a range in adherence to the treatment under study versus the comparator: 90% versus 77% in COSMIC,[46] 64% versus 73% in ZODIAC,[43] 89% versus 95% in the Tardive Dyskinesia Study,[45] 73% versus 72% in SMART,[48] and 78% versus 60% in the TB Study.[50] Concern about the bias introduced by discontinuation of study medications, and thus the validity of ITT analyses, may explain why person-time analyses, where rates for on-treatment periods are calculated and compared, were performed in many studies. For example, despite having higher than expected adherence rates for the patient populations being studied, both ZODIAC and the TB studies reported results from person-time analyses.[43,50] These on-treatment analyses are also potentially biased, however. Because patients are no longer compared as randomized, they are subject to the same biases as a non-randomized study and, thus, require adjustment for potential confounders.
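To make the distinction between the two analyses concrete, the sketch below contrasts an ITT risk (events among all randomized patients, whatever happened to their treatment) with an on-treatment incidence rate (events during exposed person-time only). It is a simplified illustration under stated assumptions: the patient records are hypothetical and are not drawn from any of the reviewed trials, the field and function names are invented, follow-up is assumed complete, and no confounder adjustment is attempted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Patient:
    arm: str                  # arm assigned at randomization
    days_on_drug: int         # days the randomized medicine was actually taken
    event_day: Optional[int]  # day of the outcome event, None if no event


def itt_risk(patients, arm):
    """ITT: proportion of all patients randomized to `arm` who had the event,
    regardless of discontinuation or switching."""
    group = [p for p in patients if p.arm == arm]
    events = sum(p.event_day is not None for p in group)
    return events / len(group)


def on_treatment_rate(patients, arm):
    """Person-time analysis: events occurring while on the randomized drug,
    divided by days at risk while on that drug."""
    events = 0
    person_days = 0
    for p in patients:
        if p.arm != arm:
            continue
        on_drug_event = p.event_day is not None and p.event_day <= p.days_on_drug
        events += on_drug_event
        # Time at risk stops at the event or at discontinuation.
        person_days += p.event_day if on_drug_event else p.days_on_drug
    return events / person_days


# Hypothetical cohort for illustration only.
COHORT = [
    Patient("A", 100, None),
    Patient("A", 30, 60),    # event after discontinuation: ITT only
    Patient("A", 100, 50),   # event while on treatment: both analyses
    Patient("A", 100, None),
]

if __name__ == "__main__":
    print(itt_risk(COHORT, "A"))
    print(on_treatment_rate(COHORT, "A"))
```

In this toy cohort the event that follows discontinuation is counted by the ITT analysis but not by the on-treatment analysis, which is one way the two approaches can diverge when discontinuation is related to the outcome.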

Although most studies successfully addressed their research objectives, challenges complicating the interpretation of the findings from safety LSTs were noted, including low event rates, heterogeneous effects within the study population, lengthy enrolment periods and study durations, and the use of concomitant medications. Despite the large sample size of many of the LSTs, the absolute number of events reported in studies was often not very large, particularly if death was the outcome. For example, in the Pediatric Ibuprofen Study, despite its large size, the investigators note they could not exclude serious events occurring at a rate <1/10 000.[40] The COSMIC Study did not observe any cases of lactic acidosis, which was its primary regulatory purpose.[46] In ZODIAC,[43] even a ‘hard’ outcome such as sudden death was still infrequent enough that the study was designed and powered to detect differences in non-suicide mortality, despite being conducted in a patient population known to have a higher rate of cardiovascular events than the general population.[61] A lower than expected event rate in the SMART trial led to a re-calculation of its required sample size to 60 000 patients, double the initially intended size. Furthermore, the SMART trial’s finding that African Americans were at a higher risk of respiratory and asthma-related events, likely due to genetic or behavioural factors, highlights the problems of conducting an LST where treatment effects vary within or across the study population.[48] When an investigator expects subgroup differences, the LST design is likely not the preferred study option.[24,26]

The length of the study, particularly as it relates to the evolving benefit-risk balance of the medicines under study, and allowance of concomitant medication use may complicate interpretation of study results. During the course of the ZODIAC Study, concerns were raised about the potential for antipsychotics to be associated with the development of diabetes and metabolic syndrome. Although this eventually resulted in class labelling for atypical antipsychotics, physician perception that there were differential risks among antipsychotics was anecdotally cited as a reason for decreased enrolment at investigator sites. Finally, in the Pediatric Vaccination Study,[47] the authors noted that parents in France commonly use prophylactic antipyretics to prevent fever after vaccination. Since the study was meant to be reflective of real-world practice, the investigators did not prevent their use. Instead, they tried to distinguish between curative and preventative use of fever reducers in post hoc analyses. This challenge highlights the difficulty of estimating the incidence of outcomes if the definition of the endpoint (in this case fever) is preventable through other interventions used in real-world practice. However, from a pragmatic perspective this is acceptable, particularly if it means that the intervention as it is actually used, with other medicines, is safe.

7. Conclusions

Very few completed or ongoing LSTs evaluating safety outcomes as a primary endpoint were identified in our review. Among those that have been conducted, we found that their design elements were similar over time. Three LSTs evaluated the safety of prescription or non-prescription NSAIDs. This is likely due to regulatory and public health concerns about their safety given their widespread use. Although the LST design may be most appropriate when studying ‘hard’ outcomes such as death or hospitalization, we found that researchers also used the design to compare ‘soft’ outcomes (e.g. incidence of physician-reported symptomatic hypotension and patient-reported complaints) or outcomes that require regular measurement (e.g. pulmonary function). This finding was unexpected since the seminal literature on the use of LSTs for examining the therapeutic or preventative effects of an intervention suggests they are best suited to studying ‘hard’ outcomes to avoid assessment and reporting bias.[24–26] Other unexpected differences we noted were the use of person-time on treatment secondary analyses, which result in potentially biased comparisons between treatment groups (i.e. because participants discontinue their medications for reasons that may be related to potential outcomes), and sample sizes that were smaller than thousands of patients.

The results of the search demonstrate that there is significant variation in the terms used to describe LSTs or denote the intent to minimize interference with routine care. Despite potential differences in nomenclature, we hypothesized that the intent of ‘simple’ or ‘practical’ trials would be the same: to minimize random error by using a large sample size, to distribute known and unknown confounders across baseline treatment groups by randomization, and to preserve routine care and maximize generalizability of findings by following patients using observational methods. This was found to be true, although the earlier studies indicated a tendency toward the use of elements associated with a controlled clinical trial (e.g. double-blinding) whereas more recent LSTs had the most characteristics associated with observational epidemiology. Nonetheless, consistent with our selection criteria, all LSTs included in this review utilized observational methods of follow-up, but did so with varying degrees of intervention to meet the study objectives, primarily in the form of scheduled visits or required laboratory and diagnostic tests. For example, to permit analytic comparison across treatment groups, the VOLUME Study required more intervention (e.g. requirement for spirometry among inhaled insulin and usual diabetes care users to make valid comparisons of pulmonary safety) despite its interference with usual care (i.e. usual diabetes care does not routinely include pulmonary function tests).[49] ZODIAC, on the other hand, evaluated only mortality and hospitalization rates, and the study protocol did not mandate interventions post-randomization.[42,43]

A good LST has been defined as one that asks an important public health question and does so reliably.[24,26] Comparison of the LSTs’ original objectives with their results indicates that the design yields important clinical insights. These studies’ findings have provided data critical to evidence-based comparisons of treatments as used in the real world and, in some cases, led to the reevaluation of medical practice.[50,60] Despite the success of these studies, it is unreasonable to expect the findings of an LST to always result in changes to guidelines on clinical practice or policy decisions. Rigorous safety evaluation typically requires critical interpretation of results from multiple sources, and the results of a single safety LST may not be sufficient to immediately alter medical practice.

Finally, while medication adherence in these studies was relatively high, it differed by treatment arm in some studies. ITT analyses are most valid when there is maximum, and equal, adherence to the assigned study medication across treatment arms. Bias as a result of non-adherence is a justifiable concern for any randomized study, including LSTs. When appreciable non-adherence occurs, and is non-differential, the true effect will be underestimated by ITT. In a safety LST this would result in a bias toward the null, decreasing the likelihood of observing an adverse effect of a medicine. Furthermore, if non-adherence is differential, then despite randomization, both the ITT and person-time analyses will be biased, requiring adjustment for confounders. In these analyses, because patients choose to discontinue or switch medicines for unknown reasons, the benefits of randomization are lost and the groups being compared can no longer be assumed to have an equal distribution of known and unknown confounders. Research on how to address non-adherence in randomized studies with observational follow-up is growing; for example, one recently published study has described the use of inverse probability weighting to adjust for incomplete adherence to assigned treatment.[62]
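
The direction of the ITT bias described above can be illustrated with a small calculation. The sketch below is not taken from any of the reviewed studies. It assumes, for illustration only, that each arm has a fixed outcome risk while patients take the assigned medicine, that patients who stop share a common off-drug risk, and that stopping is unrelated to prognosis. The function names and the example risks are hypothetical.

```python
def observed_arm_risk(on_drug_risk, off_drug_risk, adherence):
    """Expected outcome risk in one randomized arm.

    Assumption (illustrative only): a fraction `adherence` of the arm takes the
    assigned medicine and has `on_drug_risk`; the remainder has `off_drug_risk`,
    and who stops is unrelated to prognosis.
    """
    if not 0.0 <= adherence <= 1.0:
        raise ValueError("adherence must be between 0 and 1")
    return adherence * on_drug_risk + (1.0 - adherence) * off_drug_risk


def itt_risk_difference(risk_a, risk_b, off_drug_risk, adherence_a, adherence_b):
    """Expected ITT risk difference (arm A minus arm B) under the assumptions above."""
    return (observed_arm_risk(risk_a, off_drug_risk, adherence_a)
            - observed_arm_risk(risk_b, off_drug_risk, adherence_b))


if __name__ == "__main__":
    # Hypothetical risks: 2% on medicine A, 1% on medicine B, 1% off drug.
    for adherence in (1.0, 0.9, 0.7, 0.5):
        diff = itt_risk_difference(0.02, 0.01, 0.01, adherence, adherence)
        print(f"adherence={adherence:.1f}  expected ITT risk difference={diff:.4f}")
```

Under these assumptions, equal adherence in both arms scales the true risk difference by the adherent fraction, which is the bias toward the null noted above. Passing different values for `adherence_a` and `adherence_b` shows that the expected ITT difference then also depends on the off-drug risk, so it no longer reflects the medicines alone.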

The finding that LSTs are rarely used to evaluate safety endpoints may be due to a lack of experience with the design among drug safety professionals but it is more likely a result of the operational, financial and scientific hurdles of implementing the design. Substantial resources are required to accrue large sample sizes, collect multiple forms of outcome data, manage hundreds of participating investigators and sites, and ensure appropriate scientific and ethical oversight. The lack of research infrastructure for conducting research at sites or with physicians inexperienced with randomized trials, the complex regulations governing interventional study implementation, and prohibitive financial costs have been described previously as barriers to conducting LSTs studying efficacy.[63,64] This complexity suggests that the design might be more accurately described as a ‘simplified’ rather than a ‘simple’ trial. Efforts have been made to simplify LSTs by recruiting patients using electronic healthcare databases, as is being done in SCOT with the Scottish Medicines Monitoring Unit (MEMO) database,[50] but others have found that patient recruitment in this way was not practical.[65] Financial considerations are also important. Twelve of the LSTs identified in this review were funded by the pharmaceutical or consumer products industry,[39-50,52,53] six of which were identified as post-approval commitments to the FDA or an EU regulatory agency.[43,46-50] Whether there is a lack of interest in funding from public research institutes or a societal expectation that costs of researching the safety of health interventions should be borne by private industry is unclear.

Lesko and Mitchell,[40] in their overview of lessons learned from the Pediatric Ibuprofen Study, point out that a simple testable safety hypothesis, a motivated patient and physician population, and the ability to follow up patients to assess outcomes are critical to the feasibility of LSTs for safety.[66] Many other factors influence whether an LST is feasible. For example, evaluating differences in very rare (e.g. Stevens-Johnson syndrome) or long latency (e.g. cancer induction) outcomes is not practical with the LST design. Even hard outcomes such as death are difficult to study in a randomized study if the incidence of the outcome is low or the observed rate in the study is lower than expected when initial sample size calculations are performed. Because the design uses randomization, an appropriate comparator acceptable to physicians and patients is also a critical factor in the success of an LST. Equipoise must exist, and if a new medication or vaccine has a real or perceived benefit over available treatments, the LST design will not be feasible. The lack of equipoise may also become a challenge in situations where the study enrolment period is very lengthy. As knowledge of the benefits and risks of treatments evolves, leading to less clinical uncertainty about the type of patients that might benefit from a particular study medication, investigators may find it increasingly difficult to enrol patients.
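
To make the dependence of feasibility on outcome incidence concrete, the following sketch applies the standard normal-approximation formula for comparing two independent proportions. It is an illustration only and does not reproduce the sample size calculation of any reviewed study. The function name, the default two-sided alpha of 0.05 and power of 80%, and the example risks are assumptions.

```python
from math import ceil
from statistics import NormalDist


def patients_per_arm(risk_comparator, relative_risk, alpha=0.05, power=0.80):
    """Approximate patients needed per arm to detect a given relative risk.

    Uses the unpooled normal approximation for two independent proportions:
    n = (z_{1-alpha/2} + z_{power})^2 * [p0(1-p0) + p1(1-p1)] / (p1 - p0)^2
    All inputs are illustrative assumptions, not values from the review.
    """
    risk_treated = risk_comparator * relative_risk
    if not (0.0 < risk_comparator < 1.0 and 0.0 < risk_treated < 1.0):
        raise ValueError("risks must lie strictly between 0 and 1")
    if risk_treated == risk_comparator:
        raise ValueError("relative_risk must differ from 1")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    variance = (risk_comparator * (1.0 - risk_comparator)
                + risk_treated * (1.0 - risk_treated))
    return ceil((z_alpha + z_power) ** 2 * variance
                / (risk_treated - risk_comparator) ** 2)


if __name__ == "__main__":
    # Hypothetical comparator risks, each with a doubling of risk to detect.
    for risk in (0.01, 0.001, 0.0001):
        print(f"comparator risk={risk}: {patients_per_arm(risk, 2.0)} patients per arm")
```

Under this approximation, each tenfold reduction in the comparator risk increases the required sample size roughly tenfold for the same relative risk. The same arithmetic shows why a lower-than-expected event rate observed during a study can leave it underpowered.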

The scientific question underlying a safety LST must be relevant clinically and of interest to physicians practicing in private and community health settings. It is inclusion of those sites that makes the findings generalizable. Interestingly, studies that seek to compare the general safety profile of medicines or vaccines, in lieu of a specific research hypothesis for which the study is powered, do seem to be feasible. Three studies[54,55,57] that were excluded from our review because they collected and compared all serious adverse events rather than a pre-defined safety endpoint were able to successfully enrol many patients. Based on our review, there is insufficient evidence to restrict the design to studies with short-term follow-up or to patient populations with lower rates of therapy discontinuation. Finally, data collection directly from patients, rather than physicians, is not necessarily a barrier either, as the PAIN Study demonstrated.[67,68]

While the LST is potentially a design to study the safety of any health intervention, its use with novel medicines, vaccines, diagnostics and biologics is likely to be limited. The feature of randomization alone limits its application, since in some disease areas there may be no logical or ethical comparator (e.g. first-in-class medicine or vaccine, rare diseases, oncology). Biopharmaceutical development is increasingly focused on smaller indicated patient populations. This will result in insufficient exposure to reach the required event rate or, when feasible, it may take too long to accrue exposure to be acceptable to decision makers. Even in situations where the prevalence of a condition is much greater (e.g. schizophrenia or asthma) it may take many years to accrue the necessary patient population, which to some may be too long to address important safety questions.[69] Regardless of how the design is used in the future, well-designed observational studies, whether they involve primary data collection or use secondary data sources such as large, electronic healthcare databases, will continue to be an important and, in some cases, the only resource to investigate many post-approval safety questions.

Despite the challenges of using the LST design for comparative safety evaluation, it is particularly suited to research questions in which confounding by indication or severity is likely to be pronounced and difficult to measure or control, but where assessment under routine care is important for decision making. LST designs are similar to observational studies in that they can, in principle, be effectively used to study the safety of health interventions in patient populations not typically exposed in clinical trials, such as the elderly, very young or those with multiple comorbidities; determine if physicians prescribe according to their interpretation of the product label or clinical experience; and understand the safety of a health intervention as it is used with multiple concomitant prescriptions or over-the-counter medications under routine medical care. With the increasing demand for real-world evidence to guide public health and clinical decisions, the design’s advantages and disadvantages for future comparative safety research should be carefully considered by researchers.