Overtesting and undertesting in primary care: a systematic review and meta-analysis
  1. Jack W O’Sullivan1,
  2. Ali Albasri1,
  3. Brian D Nicholson1,
  4. Rafael Perera1,
  5. Jeffrey K Aronson1,
  6. Nia Roberts2,
  7. Carl Heneghan1
  1. 1 Centre for Evidence-Based Medicine, Nuffield Department of Primary Care Health Science, University of Oxford, Oxford, UK
  2. 2 Bodleian Health Care Libraries, University of Oxford, Oxford, UK
  1. Correspondence to Dr Jack W O’Sullivan; jack.osullivan{at}


Background Health systems are currently subject to unprecedented financial strains. Inappropriate test use wastes finite health resources (overuse) and delays diagnoses and treatment (underuse). As most patient care is provided in primary care, it represents an ideal setting to mitigate waste.

Objective To identify overuse and underuse of diagnostic tests in primary care.

Design Systematic review and meta-analysis.

Data sources and eligibility criteria We searched MEDLINE and Embase from January 1999 to October 2017 for studies that measured the inappropriateness of any diagnostic test (measured against a national or international guideline) ordered for adult patients in primary care.

Results We included 357 171 patients from 63 studies in 15 countries. We extracted 103 measures of inappropriateness (41 underuse and 62 overuse) from included studies for 47 different diagnostic tests.

The overall rate of inappropriate diagnostic test ordering varied substantially (0.2%–100%).

17 tests were underused >50% of the time. Of these, echocardiography (n=4 measures) was consistently underused (between 54% and 89%, n=4). There was large variation in the rate of inappropriate underuse of pulmonary function tests (38%–78%, n=8).

Eleven tests were inappropriately overused >50% of the time. Echocardiography was consistently overused (77%–92%), whereas inappropriate overuse of urinary cultures, upper endoscopy and colonoscopy varied widely, from 36% to 77% (n=3), 10%–54% (n=10) and 8%–52% (n=2), respectively.

Conclusions There is marked variation in the appropriate use of diagnostic tests in primary care. Specifically, the use of echocardiography (both underuse and overuse) is consistently poor. There is substantial variation in the rate of inappropriate underuse of pulmonary function tests and the overuse of upper endoscopy, urinary cultures and colonoscopy.

PROSPERO registration number CRD42016048832.

  • epidemiology
  • quality in health care

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See:

Strengths and limitations of this study

  • Generates rate of undertesting and overtesting for specific diagnostic tests against national or international guidelines.

  • Only includes data from real clinical encounters rather than surveys or hypothetical clinical vignettes.

  • Quantified inappropriate ordering of all types of diagnostic tests rather than just laboratory.

  • Systematic reviews are restricted to published literature; thus, rates of inappropriate ordering are not available for all tests available to primary care physicians.

  • Included studies measure appropriateness of testing in a particular healthcare setting against a particular guideline, thus reflect test ordering in a specific healthcare setting.

Introduction

Reaching a diagnosis in primary care is exceedingly complex. Undifferentiated symptoms, a low prevalence of serious disease, a high degree of symptom overlap between serious and benign conditions, patients with multiple complaints and psychological or social distress manifesting somatically all complicate the process.1 In around 40% of primary care consultations, a diagnosis cannot be established from the history and physical examination alone,2 and tests are therefore often needed.1 3

Primary care consultations make up most of the care provided in healthcare systems (90% of consultations in the UK,4 55% of consultations in the USA5) and inappropriate diagnostic testing in primary care therefore has enormous resource implications. Given the calls for £22 billion in efficiency savings from the UK’s National Health Service6 and the $660 billion US Medicare deficit predicted by 2023,7 ensuring the appropriateness of primary care diagnostic testing is crucial to the sustainability of healthcare systems.8

Diagnostic tests in primary care can be inappropriately underused or overused. Underuse of tests, the failure to order a test when indicated, can lead to diagnostic errors and delays in diagnosis and the delivery of effective treatment, leading to adverse patient outcomes and further healthcare costs.9 10 Overuse of tests, the delivery of tests with no clear benefit or when potential harms outweigh potential benefits, subjects patients to direct harms, such as radiation exposure, as well as potential adverse outcomes (eg, contrast nephropathy),11 incidental findings12 and overdiagnosis.13 Overuse is also a waste of finite healthcare expenditure, diverting resources from beneficial tests and treatments.14–16

Many drivers encourage inappropriate underuse and overuse of diagnostic tests in primary care. Greater access to tests,17 the medicolegal consequences of undertesting,18 few if any disincentives to overinvestigate14 and clinical performance measures19 may all contribute to overuse. Increasing primary care workload,4 time constraints19 and difficulty keeping up-to-date with rapidly increasing evidence20 may contribute to both inappropriate underuse and overuse.

Guidelines set the standard of care across most healthcare settings.21 22 Furthermore, they provide a medicolegal framework,23 inform healthcare policy and improve both care outcomes and processes of care.24 Despite some recognised limitations, including varying quality of guidelines,25–27 guidelines are often used as markers of healthcare appropriateness.28–31 Zhi et al,29 for instance, used guidelines as a measure of appropriateness to estimate underuse and overuse of laboratory testing. They estimated that 45% (95% CI 34% to 56%) of secondary care laboratory testing is underused and 21% (95% CI 16% to 25%) is overused.

Despite the increasing use of healthcare resources,32 rising healthcare expenditure,6–8 increasing demands placed on primary care4 and the apparent drivers of inappropriate testing,1 4 14 17–20 it is not clear how often diagnostic tests are inappropriately overused or underused in primary care. We therefore conducted a systematic review to quantify the frequency of inappropriate ordering of all types of diagnostic tests from primary care in relation to their respective guidelines and identify tests that are frequently overused and underused.

Methods

This study was conducted and is reported in line with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses33 and Meta-analysis of Observational Studies in Epidemiology statements.34

Protocol and registration

The protocol has been published and is available online (open access) via the International Prospective Register of Systematic Reviews (PROSPERO) database (registration ID: CRD42016048832).

Search strategy

We searched Embase (OvidSP) and MEDLINE (OvidSP) databases from January 1999 to October 2017 for studies of any design measuring how often diagnostic test guidelines were followed in primary care (see online supplementary file 1: search strategy). Our search strategy can be summarised as: ‘Ambulatory Care AND adherence AND guideline AND diagnostic tests AND inappropriate’. We also searched these databases for conference abstracts published after 2015 to capture data not yet published, and we searched the WHO International Clinical Trials Registry Platform and the reference lists of included studies.

Eligibility criteria

We included studies of any design if they measured the rate of inappropriate ordering (overuse) or not ordering (underuse) of diagnostic tests ordered from primary care against national or international guidelines. We considered all diagnostic tests ordered in adults. We also included studies that measured diagnostic tests ordered from primary care but performed in secondary care (eg, upper endoscopy). We included the control arms of randomised controlled trials (RCTs) if they offered exclusively usual care and the preintervention periods of studies that used interrupted time series designs (before and after studies).

We excluded studies if they met any of the following criteria: >20% of participants were children (under 18 years old); the diagnostic tests were not ordered by general practitioners; the tests were screening or monitoring tests; or the study was published before 1999 (only studies from 1999 onwards were considered, to ensure that results would more closely reflect current practice). We defined a screening test as a test on an asymptomatic person, or on a symptomatic person without signs or symptoms related to that test.35 36 We defined monitoring tests as ‘a test for a patient with an established diagnosis, for which the test is used to measure progression of the disease’.37 We also excluded studies if they did not give a measure of appropriateness or if appropriateness was measured against local guidelines, such as a guideline specific to a hospital or region, rather than international or national guidelines.

Study selection and data extraction

Two reviewers (JWO and AA or BDN) independently screened titles, abstracts and full texts for eligibility. The same reviewers assessed risks of bias and extracted the following data from included studies: patient demographics, eligibility criteria, name and type of diagnostic test, duration of study (days), guideline name and recommendation, total number of tests performed and the number of tests ordered when the specific guideline recommended not ordering (inappropriate overuse) or the number of tests not ordered when the guideline recommended ordering it (inappropriate underuse). The last two data points (overuse and underuse) represent ‘measures of inappropriateness’. When studies measured inappropriateness of multiple tests, we extracted data on each test and presented them as individual measures of inappropriateness. When studies measured tests across different periods, we extracted measures for each time point and considered each one as an individual measure of inappropriateness.

We assessed the quality of included studies using a modified version of the Hoy risk of bias tool.38 This tool has been validated to assess the internal and external validity of prevalence studies.38 Our modified version of this tool kept the same domains but adjusted the wording of the tool to reflect prevalence of inappropriate testing rather than prevalence of disease. Our tool (and results) is available in online supplementary file 2: risk of bias.

Statistical analysis

The primary outcome was the prevalence of inappropriate diagnostic testing. Inappropriate testing was measured in two ways:

  1. Overuse: a diagnostic test was ordered when the relevant guideline recommended not ordering it, for instance, imaging for non-red flag low back pain (LBP).

  2. Underuse: a diagnostic test was not ordered when the relevant guideline recommended ordering it, for instance, spirometry to confirm or refute the diagnosis of chronic obstructive pulmonary disease (COPD).

We expressed measures of inappropriateness as percentages (%), where the numerator represents the total number of times a guideline recommendation was not followed, and the denominator represents the total number of times a guideline recommendation could have been followed. For instance, the number of times imaging was inappropriately ordered for non-red flag headache as a percentage of the total number of patients who presented with non-red flag headache. As our included data are percentages, we calculated Clopper-Pearson 95% CIs for each individual measure of appropriateness. We conducted sensitivity analyses with high risk of bias studies excluded.
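The calculation described above can be sketched as follows. This is a minimal Python illustration and not the authors' code (the analyses were run in R); the function names `binom_cdf`, `clopper_pearson` and `inappropriateness` are hypothetical. The exact limits are found here by bisection on the binomial tail probabilities, which is equivalent to the usual beta-quantile formulation of the Clopper-Pearson interval.

```python
import math


def binom_cdf(k, n, p):
    """Return P(X <= k) for X ~ Binomial(n, p), summing the pmf in log space."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


def clopper_pearson(x, n, alpha=0.05, tol=1e-10):
    """Exact (Clopper-Pearson) CI for x guideline deviations in n opportunities."""
    if n <= 0 or not 0 <= x <= n:
        raise ValueError("need n > 0 and 0 <= x <= n")

    def solve(f):
        # f is increasing in p with f(0) < 0 < f(1); bisect for its root.
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if f(mid) < 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower limit solves P(X >= x | p) = alpha/2;
    # upper limit solves P(X <= x | p) = alpha/2.
    lower = 0.0 if x == 0 else solve(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2)
    upper = 1.0 if x == n else solve(
        lambda p: alpha / 2 - binom_cdf(x, n, p))
    return lower, upper


def inappropriateness(x, n):
    """Percentage of opportunities in which the recommendation was not followed,
    with its exact 95% CI (all three values in percent)."""
    lower, upper = clopper_pearson(x, n)
    return 100 * x / n, 100 * lower, 100 * upper
```

As a purely hypothetical example, if imaging had been ordered for 50 of 200 patients presenting with non-red flag headache, `inappropriateness(50, 200)` would return a point estimate of 25% together with its exact interval.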

Where the same guideline and recommendation were used by multiple studies (eg, five studies measured inappropriate underuse of spirometry testing in patients with COPD39–43 using the Global Initiative for Chronic Obstructive Lung Disease (GOLD) guideline), we pooled the measures and assessed heterogeneity. We combined measures of inappropriateness using a random-effects meta-analysis with 95% CIs (Clopper-Pearson), because each measure of appropriateness contributed relatively evenly to pooled estimates. We performed double arcsine transformation on prevalence data to stabilise the variance44 and pooled the data using the inverse variance method.45 We assessed heterogeneity using the I² statistic.46 We did not combine measures of overuse and underuse, as they have different denominators: overuse involves the total number of tests ordered, whereas underuse involves the total number of times a test should have been ordered. We performed analyses using R V.3.3.2 (R project).
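The pooling steps described above can be illustrated with a short sketch. It is written in Python rather than the R used for the analyses, and it makes several assumptions that the text does not specify: the between-study variance is estimated with the DerSimonian-Laird method, the pooled value is back-transformed with the simple approximation p = sin²(t/2), and the interval is a normal-approximation interval on the transformed scale rather than a Clopper-Pearson interval. All function names are hypothetical.

```python
import math


def double_arcsine(x, n):
    """Freeman-Tukey double arcsine transform and its approximate variance."""
    t = (math.asin(math.sqrt(x / (n + 1)))
         + math.asin(math.sqrt((x + 1) / (n + 1))))
    return t, 1.0 / (n + 0.5)


def back_transform(t):
    """Approximate inverse: p = sin^2(t / 2), with t clamped to [0, pi]."""
    t = min(max(t, 0.0), math.pi)
    return math.sin(t / 2) ** 2


def pool_random_effects(studies):
    """Pool (events, total) pairs; return pooled proportion, 95% CI and I^2 (%)."""
    transformed = [double_arcsine(x, n) for x, n in studies]
    y = [t for t, _ in transformed]
    v = [var for _, var in transformed]

    # Fixed-effect (inverse variance) estimate, needed for Cochran's Q.
    w = [1.0 / vi for vi in v]
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    df = len(studies) - 1

    # DerSimonian-Laird between-study variance (an assumed estimator).
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # I^2: share of total variation attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # Random-effects weights add tau^2 to each within-study variance.
    w_re = [1.0 / (vi + tau2) for vi in v]
    pooled = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    return {
        "pooled": back_transform(pooled),
        "ci": (back_transform(pooled - 1.96 * se),
               back_transform(pooled + 1.96 * se)),
        "i2": i2,
    }
```

Calling `pool_random_effects` with a list of hypothetical `(events, total)` pairs, one per measure of inappropriateness against the same guideline recommendation, returns the pooled proportion, its interval and I². Overuse and underuse measures would be pooled separately, as they have different denominators.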

Results

Study selection and characteristics

We included 63 studies from 14 716 references identified from independent searches by two authors (JWO and AA or BDN) (see figure 1). Of the 63 included studies, 55 were observational studies, 6 were before-and-after studies and 2 were RCTs. The two RCTs investigated the effect of implementing an intervention to reduce inappropriate testing. These studies were conducted in 15 countries and included 357 171 patients (see online supplementary file 3: table 1). Online supplementary file 4: table 1 shows the 103 measures of inappropriateness extracted from included studies for 47 different diagnostic tests measured against 77 guideline recommendations (41 measured underuse and 62 measured overuse). Guideline recommendations came from 42 different guideline organisations from 15 countries.

Figure 1

PRISMA flow diagram. GP, general practitioner; PRISMA, Preferred Reporting Items for Systematic Reviews and Meta-Analyses.

Fourteen studies measured inappropriateness of more than one diagnostic test for the same condition (eg, chest X-ray, electrocardiography and transthoracic echocardiography to confirm or refute a diagnosis of heart failure). Two studies47 48 measured inappropriateness across multiple time periods. No studies measured both underuse and overuse of the same test.

Included studies measured inappropriateness in one of three ways:

  1. Patients with specific symptoms were assessed (prospectively or retrospectively) to see if they had received an inappropriate diagnostic test (overuse) or had not received the appropriate diagnostic test (underuse) in line with the relevant guideline recommendation (eg, records for patients with non-red flag LBP to see if they received imaging49). Eighteen studies used this method.

  2. Patients who had undergone a diagnostic test were identified (via hospital or national databases), and an assessment of whether the test was inappropriate (as per the defined guideline recommendations) via individual patient data was made (overuse). For instance, patients who had an upper endoscopy.50 Twenty-two studies used this method.

  3. Patients with a diagnosis were identified via hospital or national databases and assessed to see whether they had received the appropriate diagnostic test (as per the defined guideline) to confirm or refute the diagnosis via individual patient data (underuse). For instance, assessing if patients with a diagnosis of COPD had spirometry to confirm or refute the diagnosis.39 Twenty-three studies used this method.

Risk of bias

Two-thirds of the studies (n=44) were graded as being at low risk of bias, 15 (24%) at moderate risk and 4 (6%) at high risk (see online supplementary file 2: risk of bias). Moderate or high risk studies were at an increased risk of non-response bias (>20%), non-objective collection of data and/or unclear intervals between symptom onset and diagnostic test use. Supplementary file 2: risk of bias outlines risk of bias scores in detail.

Percentage of diagnostic tests ordered in line with specific guideline recommendations

There was large variation in the rate of inappropriate diagnostic test ordering. The 103 diagnostic test guideline recommendations were not followed 0.2%–100% of the time (see online supplementary file 4 table 1); this wide variation was largely sustained (0.2%–99.94%) when a further analysis was conducted excluding studies judged to be at high risk of bias. The prevalence of underuse varied from 8.2% to 100%, whereas overuse varied from 0.2% to 94.2%. This variation was also essentially maintained on exclusion of high-risk studies (underuse 9.8%–99.9%, overuse 0.2%–94.2%).

Underused tests

Online supplementary file 4 table 1 shows that 17 tests were underused more than 50% of the time. Echocardiography was the most frequently studied (n=4, twice in the UK and once each in Poland and Brazil). In patients with heart failure, echocardiography was underused between 54% and 89% of the time (n=3), and in patients with atrial fibrillation 56% of the time (n=1).

For some tests, there was large variation in the rate of underuse (figure 2). Underuse of pulmonary function tests (PFTs) to confirm or refute COPD, measured against the GOLD, National Institute for Health and Care Excellence (UK) and Danish National Board of Health guidelines, varied from 26% to 78% (n=8). None of the studies that studied echocardiography or PFTs were considered high risk of bias and thus results did not change on further analysis excluding high-risk studies.

Figure 2

Rates of underuse. ACC, American College of Cardiology; AFib, atrial fibrillation; AHA, American Heart Association; CDC, Centers for Disease Control and Prevention; CXR, chest X-ray; ESC, European Society of Cardiology; FBC, full blood count; FNA, fine needle aspiration; GOLD, Global Initiative for Chronic Obstructive Lung Disease; NICE, National Institute for Health and Care Excellence; PFTs, pulmonary function tests; TB, tuberculosis; TSH, thyroid stimulating hormone; UTI, urinary tract infection.

Overused tests

Eleven tests were overused more than 50% of the time (figure 3). Echocardiography was consistently overused, for instance in ‘routine perioperative evaluation of ventricular function with no symptoms or signs of cardiovascular disease’, whereas other tests (urinary cultures, upper endoscopy and colonoscopy) were overused at varying rates. The overuse of echocardiography was studied in the UK51 and the Netherlands.52 The rate of overuse ranged from 77% (Netherlands) to 92% (UK). The overuse of urinary cultures for uncomplicated urinary tract infections was studied in the USA,53 54 Spain55 and Sweden.56 The rate of overuse varied from 57% to 77% in the USA, compared with approximately 50% in Sweden and 36% in Spain. Overuse of upper endoscopy was studied widely (n=11) in Australia,57 58 Saudi Arabia,59 60 UK,61 Italy,62–64 USA50 65 and Malaysia.66 The rate of overuse varied markedly, from 7.5% to 54% (figure 3, online supplementary file 3 table 1). Similarly, the inappropriate overuse of colonoscopy varied substantially, from 8% in Australia58 to 52% in Malaysia.67 None of the above studies were considered high risk of bias and thus results did not change on further analysis excluding high-risk studies.

Figure 3

Rates of overuse. GORD, gastro-oesophageal reflux disease; GP, general practitioner; LBP, low back pain; NHMRC, National Health and Medical Research Council; NICE, National Institute for Health and Care Excellence; U/S, ultrasound; UTI, urinary tract infection.

Our results also suggest that the inappropriate overuse of CT and MRI scans for non-red flag headache (a headache without symptoms suggesting a malignant underlying pathology) more than doubled over 10 years in the USA (2000: 6.7% (95% CI 5.4% to 8.2%), 2010: 14% (95% CI 12% to 16%)) (see online supplementary file 4 table 1).48 Conversely, the rate of inappropriate overuse of radiology tests for non-red flag LBP was consistently low, with all but two measures (n=18 measures) showing inappropriate overuse less than 25% of the time (see online supplementary file 4 table 1). One of these two studies68 estimated overuse to be about 50% but was conducted in 2001, so the lower rates in later studies may reflect improvement over time. The other study is current but used a small sample size.69 None of these studies were considered high risk of bias and thus results did not change on further analysis excluding high-risk studies.

Variation of inappropriateness against the same guideline recommendation

Eleven different guideline recommendations were studied more than once. There was significant heterogeneity (I² >50%) in nine of these pooled measures. Significant heterogeneity may have occurred for several reasons: (1) vastly different populations (for instance, one study measured the inappropriateness of upper endoscopy in Saudi Arabia60 using the American Gastroenterological Association recommendations, whereas another study used the same recommendations in the USA70); (2) contrasting healthcare systems71 72; (3) relevance and applicability of one country’s national guideline to another country73; (4) a low number of measures for meta-analysis46; and/or (5) genuine, substantial variation in the rate of inappropriate ordering.

Discussion

There is marked variation in the rate of underuse and overuse of diagnostic tests from many primary care settings across the world. This variation suggests improvement can be made in the rate of appropriate diagnostic test ordering.

Primary care use of echocardiography is consistently poor. Echocardiography is inappropriately underused for some clinical situations, for example, confirming a diagnosis of heart failure, and inappropriately overused in others, for example, perioperative assessment. This was consistent across the countries where appropriateness of echocardiogram has been studied. This is of concern given the expertise and resource requirements to perform the test and the increasing availability of direct access ordering for primary care physicians.

For four tests, we found marked variation in the rate of inappropriate use. Underuse of PFTs varied by >50%, whereas overuse of urinary cultures, upper endoscopy and colonoscopy all varied by around 40%.

Radiology tests for both non-red flag LBP and non-red flag headache were infrequently overused, but the rate of overuse of imaging for non-red flag headache showed a concerning trend, more than doubling from 2000 to 2010 (see online supplementary file 4 table 1).

Implications and future research

Two principal conclusions can be drawn from our results: (1) ordering of echocardiograms from primary care appears to require improvement and (2) markedly varying rates of inappropriate use for PFTs (underuse), colonoscopy (overuse), upper endoscopy (overuse) and urinary cultures (overuse) suggest that ordering can be improved.

Future research should focus on determining the reasons for deviation from guidelines, assessing the quality of guidelines supporting diagnostic test use and conducting systematic reviews that quantify inappropriate use of screening and monitoring tests. Furthermore, investigators undertaking primary studies of inappropriate use should develop objective data extraction methods for assessing patient notes and clearly define the interval within which they will consider a test to have been ordered for a particular symptom or disease.

Strengths in relation to other studies

Compared with other studies of inappropriate use of healthcare resources, we used data from real clinical encounters. This allowed a more robust assessment of diagnostic test inappropriateness, where other studies used surveys and hypothetical clinical vignettes.19 74 75 Furthermore, we quantified the appropriateness of all types of diagnostic tests, rather than focusing on a specific test or specific disease (such as only laboratory tests29). Our paper is the first systematic review of studies that measured inappropriateness of all diagnostic tests ordered from primary care. Zhi et al 29 quantified the mean rates of overuse and underuse of laboratory tests in secondary care and focused on quantifying an overall rate of overuse and underuse. They estimated that overuse and underuse of laboratory tests was around 21% and 45%, respectively.29 We chose not to quantify an overall rate of overuse and underuse because we feel the results would not be representative: we would be combining data from multiple different healthcare settings, and the data capture only the studied selection of diagnostic tests available in primary care.

Our use of guideline recommendations as the metric of appropriateness allowed a direct measure of diagnostic test appropriateness. Other studies that have assessed temporal and geographical variation in the use of diagnostic tests76 77 have noted substantial differences in diagnostic practices across different regions, irrespective of disease prevalence and patient characteristics.77 These studies, however, could not quantify what percentage of the temporal increase in the use of a diagnostic test is inappropriate and what percentage of variation between regions is inappropriate. We have quantified the percentage of inappropriate testing.

Although beyond the scope of our review, ultimately, interventions should be implemented to improve test use. A 2015 systematic review78 concluded that ‘Interventions such as educational strategies, feedback and changing test order forms may improve the efficient use of laboratory tests in primary care’. Thus, doctors, academics and policy makers can use our results to identify diagnostic tests in their particular healthcare settings that may benefit from intervention.

Limitations

The use of guidelines to quantify appropriateness of diagnostic tests could be considered a limitation of this study. Guidelines are often criticised for varying quality25–27 79 and panel members’ conflicts of interests.80 However, clinical practice guidelines have been shown to improve both care outcomes and processes of care,24 allow assessment of care on a population level, inform health policy,81 82 set the standard of care across many healthcare settings21 22 and provide a medicolegal framework.23 One major medical insurance company advises that ‘doctors must be prepared to explain and justify their decisions and actions, especially if they depart from guidelines produced by a nationally recognised body’.23 Furthermore, guidelines have been used to measure appropriateness of the use of tests in other published peer-reviewed studies.29 There will always be times when it is appropriate to depart from guidelines, but dramatic, consistent variation from guidelines requires investigation and is unlikely to be caused entirely by the quality of guidelines.

Furthermore, our study includes only a selection of diagnostic tests and is thus not an all-encompassing reflection of clinical practice. The data reflect the use of a specific test, sometimes for a particular clinical situation, in a particular country’s healthcare system. Thus, policy makers and those interested in improving the quality of primary care diagnostic test use can use our results as a resource to identify tests in their healthcare setting that require improvement and/or investigation to decipher why such deviation from guidelines exists. Our conclusions from this paper, however, are not generalisable to all primary care settings nor all primary care diagnostic tests.

Lastly, caution must be taken when comparing results that measured inappropriateness using different denominators. The results from studies that measured inappropriateness using patients who had undergone a diagnostic test as a denominator should be interpreted differently to studies that used patients with a diagnosis or symptoms as a denominator (and vice versa).

Conclusion

There is marked variation in the underuse and overuse of diagnostic tests in primary care across the world. From the available data, echocardiograms are ordered particularly poorly, while the substantial variation in appropriate ordering of PFTs, colonoscopy, upper endoscopy and urinary cultures suggests a need for improvement.


We would like to thank Kate Roche and Jason Hendry for comments on the draft and figures. We also thank the peer reviewers for their constructive feedback.




  • Contributors Conception and design: JWO, RP and CH. Search strategy: NR and JWO. Screening, extraction and risk of bias: JWO, AA and BDN. Analysis and interpretation of the data: JWO, RP, JA and CH. Drafting of the article: JWO (all authors critically reviewed and approved manuscript). Statistical expertise: RP. Clinical expertise: JWO, BDN, JA and CH. JWO is the guarantor.

  • Funding This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Patient consent Not required.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement Data extracted from the included studies in this review are available on request from the corresponding author.