Abstract
Objectives To investigate differences between target and actual sample sizes, and what study characteristics were associated with sample sizes.
Design Observational study.
Setting The large trial registries of clinicaltrials.gov (starting in 1999) and ANZCTR (starting in 2005) through to 2021.
Participants Over 280 000 interventional studies, excluding studies that were withheld, were terminated for safety reasons or were expanded access studies.
Main outcome measures The actual and target sample sizes, and the within-study ratio of the actual to target sample size.
Results Most studies were small: the median actual sample sizes in the two databases were 60 and 52. There was a decrease over time in the target sample size of 9%–10% per 5 years, and a larger decrease of 18%–21% per 5 years for the actual sample size. The actual-to-target sample size ratio was 4.1% lower per 5 years, meaning more studies (on average) failed to hit their target sample size.
Conclusion Registered studies are more often under-recruited than over-recruited and worryingly both target and actual sample sizes appear to have decreased over time, as has the within-study ratio of actual to target sample size. Declining sample sizes and ongoing concerns about underpowered studies mean more research is needed into barriers and facilitators for improving recruitment and accessing data.
- statistics & research methods
- clinical trials
- epidemiology
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Strengths and limitations of this study
All analyses were repeated using two trial registries.
The registries had very large sample sizes with little missing data.
The registry data are completed by researchers and have some data entry errors and poor reporting.
There were changes over time in the types of studies registered, so differences in sample size over time should be interpreted in light of these changes.
Introduction
Sample size is a key element of most research study designs. Researchers should aim to collect a large enough sample to answer their research question with good statistical power, for example, recruiting a sufficient number of patients to demonstrate a hypothesised difference in efficacy between two treatments. However, researchers do not want to collect more data than necessary as this wastes time and resources.
The target sample size should be estimated at the study design stage. Researchers then collect data until that target is achieved or until they run out of time or money. This sounds straightforward, but in practice many studies struggle to recruit their target sample size and difficulties with recruitment are a common reason why trials end early.1–3 Recruiting sufficient participants is crucial to a trial’s validity, and in recognition of the difficulties around trial recruitment there is large and ongoing research effort aimed at increasing recruitment and retention.4 5
Inadequate sample sizes mean studies are underpowered and so true associations may be missed or estimated with large uncertainty. Theoretical work has shown how underpowered studies contribute to the ongoing problem of poor quality research.6 7 Generally, larger sample sizes are needed to tackle the pervasive problem of studies with low power,8 although small samples are often appropriate for pilot or feasibility studies.
Sample size calculations depend on a range of assumptions that should reflect current knowledge. The practical application of these assumptions has been criticised in terms of a general lack of understanding of uncertainty, and the approach of reverse-engineering assumptions to get a desired target sample size.9 10
In this paper, we examine sample sizes using two large trial registries containing information on health and medical studies. We examined the difference between the target and actual sample size, what study characteristics were associated with sample size, and whether sample sizes have declined over time. The aim is to contribute to the ongoing work on improving study designs and the quality of research.11
Methods
Trial registries
Trial registries were introduced to counter the serious problem of unreported trials.12 Trials cannot now be published in any high-profile medical journal without a prospective registration, hence there has been good uptake of trial registries, although they have not eliminated the problem of unreported trials or poorly reported trials.13–16 For our purposes, the high uptake of registries provides a large and comprehensive data set to study sample sizes.
Trial registries contain details on the study characteristics, including the study design, disease(s), outcome(s), key dates and funding. Researchers are responsible for posting and updating their studies.
We downloaded data from two large trial registries:
Australian New Zealand Clinical Trials Registry (ANZCTR) started in 2005.
clinicaltrials.gov, run by the US National Library of Medicine, started in 1999 and became publicly available in 2000.
ANZCTR was chosen because of the authors’ familiarity with the region, and clinicaltrials.gov was chosen because it is the largest international registry. Both registries make their data available for research.
Ethics approval
All the data are publicly available and do not involve human participants, hence this study did not require ethics approval.
Inclusion and exclusion criteria
We included interventional studies and did not include observational studies. This is because these two study types are unlikely to be comparable and many study characteristics (eg, blinding) are not applicable to observational studies. Interventional studies are those where participants were prospectively assigned to one or more health-related interventions in order to study the intervention's effects.
We excluded a small number of retrospectively registered trials from ANZCTR before the registry started in 2005, and a small number that were missing the date the study was submitted to ANZCTR (details below).
We excluded studies from clinicaltrials.gov that had a status of ‘withheld’ because the available data for these studies were limited. We excluded studies that were terminated for safety reasons as they may have achieved their objective using a smaller sample size than planned. We excluded expanded access studies because we were not certain that these were comparable to interventional studies. We excluded studies where the type of sample size was not stated, as we had to know whether the sample size was the target or the actual. We excluded two studies that used a dummy sample size, for example, ‘9 999 999’. To avoid double-counting, we excluded clinicaltrials.gov studies if they had an ANZCTR number. We preferred data from ANZCTR as it had more detailed information on sample size. The exclusions are shown in online supplemental figure 1.
We included all the available studies that met our inclusion/exclusion criteria; we did not use a sample size calculation or formal hypothesis testing for this study.
Data for both registries were downloaded on 1 February 2021 in XML format and then read into R (V.4.0.3).17 Updated sample size data for clinicaltrials.gov were downloaded on 5 March 2021. All the code to replicate the data extraction and analyses, and the data, are openly available on GitHub (https://github.com/agbarnett/registries).18 Results are reported using the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines for observational studies.19
Statistical methods
Models of sample size
Both registries had two measures of sample size: the target and actual. We used multiple regression to estimate what study characteristics were associated with the target and actual sample size. See online supplemental table 1 for the list of available study characteristics which differed by registry. For the models of actual sample size, we included the study status (eg, ‘completed’) as an independent variable, but we did not include study status for the models of target sample size as study status occurred after the target sample size and so any association could not be causal.
The clinicaltrials.gov database does not include a variable for whether studies are longitudinal. Hence, we searched each study’s description for ‘longitudinal’ in order to extract this study design variable. We also searched for ‘adaptive’ or ‘platform’ trial to examine whether these study designs impacted sample sizes.20
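As a minimal sketch of this kind of text search (the data frame and column names below are assumptions for illustration, not the authors' actual code), the design flags could be derived along these lines:

```r
# Hypothetical sketch: flag study designs by searching the free-text description.
# `studies` and `description` are assumed names, not taken from the paper's code.
library(dplyr)
library(stringr)

studies <- studies %>%
  mutate(
    text_lower        = str_to_lower(description),
    longitudinal      = str_detect(text_lower, "longitudinal"),
    adaptive_platform = str_detect(text_lower, "adaptive trial|platform trial")
  )
```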
Sample size had a strong positive skew with a small number of very large studies. To improve model fit and reduce the influence of a few very large studies, we log-transformed sample size (base e). We therefore present the effects of the study characteristics as the percent change in the geometric mean instead of the absolute difference in sample size.
Some study characteristics had a strong positive skew with a small proportion of very large numbers, for example, the number of primary outcomes (median 1, maximum 214 for clinicaltrials.gov). To reduce the potential for a few large studies to overly influence the results, we log-transformed these variables using base 2, hence the parameters represent the percent change in sample size when the variable is doubled.
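To illustrate how these transformed coefficients convert to percent changes, here is a minimal sketch with assumed variable names (not the authors' model code):

```r
# Illustrative only: outcome on the natural log scale, a skewed covariate on
# the log2 scale, and conversion of model coefficients to percent changes.
dat$log_sample_size <- log(dat$actual_sample_size)   # outcome, base e
dat$log2_primary    <- log2(dat$n_primary_outcomes)  # skewed covariate, base 2

fit <- lm(log_sample_size ~ log2_primary + phase + submitted_year, data = dat)

# Percent change in the geometric mean sample size per unit increase in a
# covariate; for log2-transformed covariates this is the change per doubling.
percent_change <- 100 * (exp(coef(fit)) - 1)
```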
Most of the variables we used were mandatory, meaning researchers had to complete them and hence there was little item-missing data. The largest amount of missing data was 2%, for study purpose. For non-mandatory categorical variables with missing data, we included ‘Missing’ as its own category. Our reasoning was that investigators likely did not complete a question if they felt it was not relevant to their study and hence ‘Missing’ should be akin to ‘Not applicable.’ This avoided excluding studies with small amounts of missing data. Details on the item-missing data are in online supplemental appendix 1.
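A small sketch of this recoding (the column names are assumptions used for illustration):

```r
# Recode missing values in non-mandatory categorical variables as 'Missing'
library(dplyr)
library(tidyr)

dat <- dat %>%
  mutate(across(c(allocation, masking),
                ~ replace_na(as.character(.x), "Missing")))
```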
We used the elastic net method to select the key variables from the larger subset of all variables.21 We used 10-fold cross-validation to select the ideal penalty and hence which variables were included in the final model. We chose a parsimonious model by selecting the penalty within one SE of the minimum cross-validated mean squared error.
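As a sketch of this selection step under stated assumptions (the predictor matrix, outcome and the elastic net mixing parameter alpha are illustrative choices, not values reported in the paper), the cross-validated fit with the one-SE rule could look like:

```r
# Elastic net selection with 10-fold cross-validation and the one-SE rule.
# X, y and alpha are illustrative assumptions.
library(glmnet)

X <- model.matrix(~ . - 1, data = study_characteristics)  # dummy-coded predictors
y <- log(dat$actual_sample_size)                          # log-transformed outcome

set.seed(2021)  # illustrative seed for the cross-validation folds
cv_fit <- cv.glmnet(X, y, family = "gaussian", alpha = 0.95, nfolds = 10)

# Coefficients at the largest penalty within one SE of the minimum
# cross-validated error; non-zero coefficients define the final model
coef(cv_fit, s = "lambda.1se")
```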
We checked the variance inflation factors of the final models to detect collinearity, using a threshold of five. We checked the residuals of the final models to verify that they were unimodal and approximately symmetric, with no large outliers.
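A brief sketch of these checks, assuming a final linear model object (here called `final_fit`) refitted with the selected variables:

```r
# Collinearity and residual checks; `final_fit` is an assumed lm object.
library(car)

vif(final_fit)  # values above 5 taken to indicate collinearity

hist(resid(final_fit), breaks = 50,
     main = "Residuals of final model", xlab = "Residual (log scale)")
```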
Target versus actual sample size
We calculated the sample size ratio of the actual divided by target and created a histogram of the ratio. We described the range of this ratio using the central 50% and 90% of studies.
To estimate what study characteristics were associated with the sample size ratio, we used the same elastic net method as for the models of sample size. The ratio had a strong positive skew, so it was log-transformed (base e) for the modelling.
We used a Bland–Altman plot of the actual-to-target sample size ratio against the average sample size ((actual + target)/2). The aim was to see whether the ratio narrowed for small and/or large sample sizes. We log-transformed (base e) the ratio because of the strong positive skew in sample sizes. Because of the very large sample size, a standard Bland–Altman scatter-plot using individual studies was too cluttered, hence we used a tile plot to summarise studies in bins.
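A minimal sketch of such a tile plot (column names are assumptions, and studies with no participants are dropped here purely so the log ratio is defined):

```r
# Bland–Altman style tile plot of the log ratio against the average sample size
library(ggplot2)

dat$ratio   <- dat$actual_sample_size / dat$target_sample_size
dat$average <- (dat$actual_sample_size + dat$target_sample_size) / 2
plot_dat    <- subset(dat, ratio > 0)  # simplification: drop zero-recruitment studies

ggplot(plot_dat, aes(x = average, y = log(ratio))) +
  geom_bin2d(bins = 50) +  # tiles summarise counts of studies per bin
  scale_x_log10() +
  labs(x = "Average sample size", y = "log(actual / target sample size)")
```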
We used the Bland–Altman limits of agreement to show the range of observed ratios that covers 95% of the data. However, the standard limits assume that the ratio is constant for all sample sizes, which did not appear valid for these data. Hence, we used a Bayesian model and allowed the mean and variance of the limits of agreement to vary by the average sample size, using a fractional polynomial approach with eight candidate powers22 (see online supplemental appendix 2). We fitted 64 (8 × 8) separate models to cover all combinations of powers for the mean and variance, and selected the best model using the deviance information criterion (DIC)23 (see online supplemental figure 2). Because the ratio distribution had long tails, we used a t-distribution with 4 degrees of freedom instead of a Normal distribution, and this gave a far better fit to the data (DIC improvement of over 4000). For the clinicaltrials.gov data, we fitted these Bayesian models using a random sample of 10 000 studies (8% of the total) because of the time needed for the Markov chain Monte Carlo estimation.
The Bayesian models were fitted using the JAGS software (V.4.3.0).24 We used vague Normal priors for all parameters. We used two chains thinned by three with a burn-in and sample of 2000. We visually checked the convergence and mixing of the chains (see online supplemental figure 3).
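The general form of such a model is sketched below. This is a heavily simplified illustration, not the authors' full model: it uses a single illustrative power term rather than the 8 × 8 fractional polynomial search, and all data and variable names (and the scaling of the average sample size) are assumptions.

```r
# Simplified sketch of a t-distributed Bland–Altman model in JAGS where the
# mean and variance of the log ratio depend on the average sample size.
library(rjags)

model_string <- "
model {
  for (i in 1:N) {
    mu[i]  <- alpha[1] + alpha[2] * pow(avg[i], p)     # mean varies with size
    tau[i] <- exp(beta[1] + beta[2] * pow(avg[i], p))  # precision varies with size
    y[i] ~ dt(mu[i], tau[i], 4)                        # t with 4 df for long tails
  }
  for (j in 1:2) {
    alpha[j] ~ dnorm(0, 1.0E-4)  # vague priors
    beta[j]  ~ dnorm(0, 1.0E-4)
  }
}
"

jags_data <- list(y   = log(dat$ratio),
                  avg = dat$average / 1000,  # scaled for numerical stability
                  N   = nrow(dat),
                  p   = 0.5)                 # one illustrative power only

model <- jags.model(textConnection(model_string), data = jags_data, n.chains = 2)
update(model, 2000)  # burn-in
samples <- coda.samples(model, c("alpha", "beta"), n.iter = 6000, thin = 3)
```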
Patient and public involvement
No patients or members of the public were involved in the design, conduct or reporting of this study.
Results
The number of included studies and reasons for exclusion are shown in online supplemental figure 1. The final analyses had 17 510 studies from ANZCTR and 272 160 from clinicaltrials.gov.
Some basic characteristics of the included studies are in table 1. The median target sample size was 66 for ANZCTR and 78 for clinicaltrials.gov. The median actual sample size was 60 for ANZCTR and 52 for clinicaltrials.gov. Additional summary statistics on the two databases are in online supplemental appendix 1.
Target versus actual sample size
The number of studies with a target and actual sample size was 5712 in ANZCTR and 121 603 in clinicaltrials.gov.
The histograms of the ratios of the actual-to-target sample sizes are in figure 1. Many studies hit their target and a large proportion were also just below their target. The histograms are asymmetric around 1, with a larger ‘shoulder’ of studies missing their target compared with studies exceeding the target.
For ANZCTR, the central 50% of studies ranged from 22% below target to exactly on target, and the central 90% from 53% below to 13% above target. For clinicaltrials.gov, the central 50% of studies ranged from 43% below to 2% above target, and the central 90% from 86% below to 23% above target.
The Bland–Altman plot of the sample size ratio against the average sample size is in figure 2. Many studies with an average sample size between 10 and 200 hit their target sample size. The estimated limits of agreement narrowed for larger sample sizes in both databases.
For the ANZCTR data, the 95% limits of agreement for the sample size ratio were 0.58–1.38 for an average sample size of 50, narrowing slightly to 0.64–1.39 for an average sample size of 500. There are a small number of studies that are far above or below the limits of agreement, particularly studies in the 5–500 sample size range that were well below their target.
The 95% limits of agreement were generally wider for the clinicaltrials.gov data. The 95% limits of agreement were 0.37–1.84 for an average sample size of 50, narrowing to 0.63–1.54 for an average sample size of 500. The diagonal strip of studies in the bottom-left of the figure are studies with a small target sample size that recruited no participants.
Models of the actual-to-target sample size ratio
We used multivariable regression to estimate what study characteristics were associated with the actual-to-target sample size ratio. The estimates are shown in figure 3, expressed as percent changes, and in online supplemental table 2.
Larger target sample sizes were associated with a lower actual-to-target ratio (5.7% lower per doubling of the target sample size), meaning studies with larger targets fell proportionally further short of them. The actual-to-target ratio also decreased over time (4.3% lower per 5 years).
Studies with more arms and more secondary outcomes were associated with a higher ratio, as were studies that included healthy volunteers.
Studies sponsored by the National Institutes of Health (NIH) or US Federal agencies (including the Food and Drug Administration) had average actual-to-target ratios that were 7.0%–10.3% lower, whereas industry funded studies had an 18.1% higher average ratio.
In terms of study design, studies with some type of masking had a slightly higher actual-to-target ratio, whereas single group studies had a slightly lower ratio.
Compared with completed studies, studies that stopped early had a 73.2% smaller ratio (95% CI –73.5 to –72.9) and withdrawn studies had a 99.9% smaller ratio (95% CI –99.9 to –99.9).
One reason the actual sample size can be smaller than the target is an adaptive design, which may require fewer patients than originally planned. However, there were only 168 (<0.1%) adaptive trials in the clinicaltrials.gov data, hence this variable is unlikely to impact the overall results; it was not selected by the elastic net.
Models of sample size for ANZCTR
Here we examine the non-paired data on the target and actual sample size. The estimated percent differences in sample sizes for the ANZCTR database are shown in table 2 and plotted in figure 4.
Some associations were as expected. More funders, and hence more resources, meant larger sample sizes. Studies with no age limits were larger than those with any limits. Sample sizes generally increased for later phases. Bioequivalence studies had an over 30% larger sample size, as demonstrating equivalence generally needs more participants than demonstrating efficacy. Studies that allowed healthy volunteers were larger, likely because this increases the available pool of participants. Factorial designs were over 20% larger than parallel studies, to account for the additional comparisons, while cross-over studies were over 60% smaller because the key comparison is within participants. Prevention studies were over 25% larger than treatment studies, and screening studies were over 130% larger. Public health studies were over 60% larger.
Surprisingly, more primary outcomes were associated with smaller sample sizes, although more secondary outcomes were associated with larger sample sizes.
Compared with studies in both genders, actual sample sizes for studies in men only were around 16% smaller, whereas those for studies in women only were 19% larger.
Many of the associations for the actual sample size mirrored those from the target sample size. A notable difference was that the decreasing trend in sample size was much larger for the actual sample size, at –21% per 5 years for the actual sample size compared with –10% for the target sample size.
The models of actual sample size included the study status which was a strong determinant of sample size when studies were stopped early or withdrawn.
Models of sample size for clinicaltrials.gov
The estimated percent differences in sample sizes for the clinicaltrials.gov database are shown in table 3 and plotted in figure 5.
As expected, studies were larger if they had funding. Studies were also larger if they had more arms or more conditions. Surprisingly, studies with more primary outcomes had smaller sample sizes, although the reduction was small, at under 4% per doubling in outcomes.
The target sample size decreased over time by 7% per 5 years, and the actual sample size by 18% per 5 years.
As per the results for the ANZCTR database, women only studies were larger, and men only studies were smaller than studies with both men and women.
Health services research studies were over 150% larger than treatment studies and screening studies were over 250% larger.
Studies using masking were smaller than studies using none, possibly because they are less prone to confounding. Somewhat surprisingly, non-randomised studies were around 20% smaller than randomised studies, even though they would be more prone to confounding and hence would likely need a larger sample size. Adaptive or platform trials were over 50% larger. Longitudinal studies were over 24% larger.
Not surprisingly, studies that were suspended, terminated or withdrawn had greatly reduced sample sizes. Those with an unknown study status had larger sample sizes compared with completed studies.
Model checks
The cross-validations for the elastic net selections are plotted in online supplemental figure 4. Only one variable category (an allocation category of ‘Missing’) was removed due to collinearity; it was collinear with an assignment category of ‘single group’. The residuals for the final models are plotted in online supplemental figure 5 and are unimodal and approximately symmetric.
Discussion
For the ratio of actual-to-target sample size, although the modal value was on target, the distribution was asymmetric, with more studies below than above their target (figure 1). This reflects the many challenges of achieving the target sample size, including difficulties with ethics and governance, difficulties finding and recruiting participants, and running out of time or funding. Larger studies were generally closer to their target sample size (figure 2), but not by much.
Results from both databases showed a strong decrease in sample size over time. Interestingly, the target sample size decreased by 7%–10% per 5 years, whereas the actual sample size decreased by 18%–21%, consistent with a growing difficulty in recruiting research participants. This finding was reinforced by the decline in the actual-to-target sample size ratio over time. Smaller actual sample sizes mean studies may be underpowered, with flow-on effects for the statistical power and uncertainty of meta-analyses.25
A recent observational analysis of the health literature shows a clear decrease in average effect sizes over time from 1990 to 2015.26 We would expect larger average sample sizes over time to study these smaller effects with adequate statistical power. Our finding of smaller sample sizes (both actual and target) has implications for statistical power and strongly suggests that the problem of underpowered studies is ongoing. A study of the Cochrane database of systematic reviews from 1975 to 2014 estimated that the percentage of sufficiently powered studies increased from 5% in 1975–1979 to 9% in 2010–2014.8 Another study of clinical trials from the Cochrane database estimated an increase in adequately powered studies over time with an OR of 1.02 per year.27 Our results suggest this small previous increase in power may now be at risk given the average decrease in sample sizes.
In both databases, there were more studies that were women only than men only (10% women only vs 5% men only), and in both databases, the women only studies were larger. This difference may partly be due to initiatives to fund women’s health research to make up for the historical shortage of women in trials.28 To examine other differences, we examined the top 10 words in the brief titles of the clinicaltrials.gov database in studies in women and men only (see online supplemental table 3). ‘Cancer’ and ‘breast’ were the two most common words in studies in women only, and ‘study,’ ‘prostate’ and ‘cancer’ were the top three words in studies in men only. Hence, the difference in sample sizes could be due to differences in the primary outcomes and effect sizes for these two cancers.
A previous study of clinicaltrials.gov data examining clinical trials registered between 2007 and 2010 found that 62% had 100 or fewer participants.29 Another study of clinicaltrials.gov found that actual sample sizes for completed studies declined between 2000 and 2019.30 A study of 114 trials found that only 31% achieved their target sample size.31 A study of NIH funded clinical trials found that the proportion enrolling more than 500 or 1000 participants was relatively stable between 2005 and 2015.32 Studies examining why trials are terminated early have found that problems recruiting patients are the most common reason.1–3 33–35 Trial characteristics that predicted smaller actual than target sample sizes were phase 2 compared with phase 3, more eligibility criteria, active control compared with placebo, fewer sites and public funding compared with industry funding.33 These results match ours for study phase and industry funding, although we found active control did slightly better than placebo (figures 4 and 5).
Strengths and limitations
We analysed two databases and found generally consistent results in terms of what study characteristics were associated with sample size, which increases the robustness of our results.
A key strength is the large sample size available from the trial registry data. There are strong incentives for researchers to register trials before any participants are recruited, which means the registry data should be representative of the target population of all trials. However, there have been documented problems with trials not being updated to include the results and recruitment status.15 36 The implication for our study is that actual sample sizes will be missing and there could well be an under-reporting bias for studies where the actual sample size was well below the target. Hence, our results may present a somewhat optimistic picture of the actual-to-target sample size ratio.
The databases record many trial features with little missing data. The completeness of studies on clinicaltrials.gov has increased over time, with over 90% completion since 2007 for key fields such as allocation, masking, gender, enrolment and study arms.37 A study of the completeness of clinicaltrials.gov records for phase 2–3 studies posted by pharmaceutical companies found that incomplete data were generally below 3%.38
The registry data rely on researchers to correctly enter and update their study's details, and there are likely to be data entry errors and poor reporting. For example, we found a study where an age limitation was mentioned in the descriptive text but not in the age limit field. We also found some cluster-randomised studies where the anticipated sample size was the number of clusters and the actual sample size was the number of participants (we excluded these six studies).
Data on the actual amount of funding for each study would have been useful so that an actual dollar value could have been modelled instead of the simpler variables of number of funders and funding class.
Conclusion
Registered studies are more often under-recruited than over-recruited and disappointingly both target and actual sample sizes appear to have decreased over time. If true, this is concerning and deserves attention from both researchers and funders, to examine the causes of and solutions to the problem. This could include understanding barriers to recruitment, the use of evidence-based recruitment processes39 and incentives to increase the use of multicentre studies. We recommend ongoing implementation of evidence-based interventions to increase sample size and further monitoring of sample sizes.
Data availability statement
Data are available in a public, open access repository. All data and code are openly available from the GitHub repository: https://github.com/agbarnett/registries.
Ethics statements
Patient consent for publication
Ethics approval
This study does not involve human participants.
Acknowledgments
Thanks to the National Library of Medicine and ANZCTR for making the registry data available for research. Thanks to Nicholas De Vito for help with the clinicaltrials.gov data. Thanks to Andrew Althouse and Noah Haber for helpful comments on the first draft.
References
Footnotes
Twitter @aidybarnett
Contributors Study design and data interpretation: AGB and PG. Data analysis and initial manuscript drafting: AGB. Critical review of early and final versions of the manuscript: PG. AGB is the guarantor.
Funding This work was supported by National Health and Medical Research Council (https://www.nhmrc.gov.au/) grant number APP1117784. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Competing interests Paul Glasziou is a member of the ANZCTR advisory committee.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.