Objectives To quantify and analyse the quality of evidence that is presented in national guidelines.
Setting Levels of evidence used in all the current valid recommendations in the Scottish Intercollegiate Guidelines Network (SIGN) guidelines were reviewed and statistically analysed.
Outcome measures The proportion of level D evidence used in each guideline and a statistical analysis.
Method Data were collected from published guidelines available online to the public. SIGN methodology entails a professional group selected by a national organisation to develop each of these guidelines. Statistical analysis of the relationship between the number of guideline recommendations and the quality of evidence used in its recommendations was performed.
Results The proportion of level D evidence increases with the number of recommendations made. This correlation is significant with Kendall's τ=0.22 (approximate 95% CI 0.008 to 0.45), p=0.04; and Spearman ρ=0.22 (approximate 95% CI 0.02 to 0.57), p=0.04.
Conclusions Practice guidelines should be brief and based on scientific evidence. Paradoxically the longest guidelines have the highest proportion of recommendations based on the lowest level of evidence. Guideline developers should be more aware of the need for brevity and a stricter application of evidence-based principles could achieve this. The findings support calls for a review of how evidence is used and presented in guidelines.
- Health Services Administration & Management
This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 3.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/3.0/
Strengths and limitations of this study
This is the first objective evidence of inconsistencies in approach by a national guideline developer.
This supports commentators' suggestion that, even without good evidence, a group will prefer consensus.
This adds to the current debate about how guidelines might be developed in the future.
The study is limited to only one set of national guidelines, that is, those of the Scottish Intercollegiate Guidelines Network (SIGN).
Reasons for the differences in quality of evidence preferred by the guideline development groups are unclear.
The Scottish Intercollegiate Guidelines Network (SIGN) was founded in 1993. It is a national body, professionally led and publicly funded. SIGN's founding principles proposed direct links between evidence and recommendations, offering a brief and succinct quick-reference guide for clinicians.1 The guidelines were expected to present brief, evidence-based clinical advice. They have instead developed into long and authoritative texts used by managers and politicians to inform policy. A formal arrangement between SIGN and the National Institute for Health and Care Excellence (NICE) has existed since 2003. Both have a responsibility to consider cost-effectiveness and to provide input to the Quality and Outcomes Framework (QOF).
The WHO recognises that current grades of recommendation (box 1) may be ambiguous2 and encourages guideline developers to use a system which includes a category ‘Use only in the context of research’ where doubt exists.
Box 1 Grades of recommendation
A: At least one meta-analysis, systematic review, or randomised controlled trial (RCT) rated as 1++, and directly applicable to the target population; or a body of evidence consisting principally of studies rated as 1+, directly applicable to the target population, and demonstrating overall consistency of results
B: A body of evidence including studies rated as 2++, directly applicable to the target population, and demonstrating overall consistency of results; or extrapolated evidence from studies rated as 1++ or 1+
C: A body of evidence including studies rated as 2+, directly applicable to the target population and demonstrating overall consistency of results; or extrapolated evidence from studies rated as 2++
D: Evidence level 3 or 4; or extrapolated evidence from studies rated as 2+
Guideline developers have conflict of interest policies, but these are reported to be challenging to apply. Where doubt exists, groups of specialists may feel that consensus is more defensible than acknowledging uncertainty.3
Even with the best evidence, concerns are expressed about the relevance of guidelines in treating patients with multiple morbidities,4 and the emergence of the phenomenon of reversal,5 ,6 where established practice, sometimes evidence based, is shown to be suboptimal or harmful. This study looks at the quality of evidence used for SIGN guidelines, and describes a significant trend for some groups to emphasise poorly evidenced recommendations.
SIGN guidelines were accessed online in September 2013. SIGN guidelines were chosen because they are internationally respected, the authors were familiar with their format and they contribute to national government policy. Guidelines marked ‘Current’ or ‘Current 3–7 years, some recommendations, may be out of date.’ were included. Those marked ‘Withdrawn’, ‘Recommendations being updated’ or ‘Need for update being considered’, and those with no recommendations, were excluded.
SIGN guideline 50 clearly describes an established process for developing guidelines.7 It explains how the process is planned, how it is implemented and by whom. This process is independent of this study, but is stated to be an objective process. SIGN guidelines have four grades of recommendation outlined in box 1. Table 1 describes the level of evidence SIGN uses to support the recommendation grading. SIGN guideline development groups vary in size depending on the scope of the topic under consideration, but generally comprise between 15 and 25 members. SIGN states they are aware of the many psychosocial factors, including the problems of overcoming professional hierarchies that can affect small group processes.
Three investigators (JRL, AGB and ABB) independently enumerated the level of evidence used by each guideline. They discounted any duplication implicit in text-embedded key recommendations and also implementation recommendations. There were no discrepancies. A statistical analysis of the correlation between the proportion of level D evidence and the total number of recommendations was performed for the 42 guidelines.
The 42 guidelines consisted of 2559 pages (including references), ranging from 26 to 161 (median 59.5) pages. The longest guideline, number 116 was 61 pages longer than the next largest. The number of recommendations per page ranged from 0.2 to 1.8 (median 0.7). The number of recommendations per guideline is presented in table 2.
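The page figures above can be cross-checked with simple arithmetic. The short Python sketch below uses only the totals reported in this paragraph; the variable names are illustrative, and the length of the second-longest guideline is derived from the stated 61-page gap rather than taken from the source data.

```python
# Figures reported in the text.
total_pages = 2559      # pages across all guidelines, including references
n_guidelines = 42
longest = 161           # pages in the longest guideline (number 116)
gap_to_next = 61        # pages by which it exceeds the next longest

# Derived quantities (not stated directly in the text).
mean_pages = total_pages / n_guidelines
next_longest = longest - gap_to_next
ratio_to_next = longest / next_longest
ratio_to_mean = longest / mean_pages

print(f"mean length: {mean_pages:.1f} pages")
print(f"next longest: {next_longest} pages")
print(f"longest / next longest: {ratio_to_next:.2f}")
print(f"longest / mean: {ratio_to_mean:.2f}")
```

On these figures the longest guideline is roughly 60% longer than the next longest and a little over two and a half times the mean length.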
Of the 1999 recommendations, 480 (24.0%) were level A, 491 (24.6%) were level B, 318 (15.9%) were level C and 710 (35.5%) were level D. Thus 51.4% were poorly evidenced (C and D) and over a third (D) depend almost entirely on ‘expert opinion’. The number of level A recommendations per guideline ranged from 0 to 57 (median 9), level B from 2 to 62 (median 8.5), level C from 0 to 26 (median 6) and level D from 0 to 60 (median 14.5). Four guidelines had no level A evidence.
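As a check on the percentages quoted above, the following Python sketch recomputes each grade's share from the reported counts. It uses only the four totals given in the text; the dictionary and variable names are illustrative.

```python
# Recommendation counts by grade, as reported in the text.
counts = {"A": 480, "B": 491, "C": 318, "D": 710}

total = sum(counts.values())

# Percentage of all recommendations at each grade, to one decimal place.
percent = {grade: round(100 * n / total, 1) for grade, n in counts.items()}

# Combined share of recommendations graded C or D.
poorly_evidenced = round(100 * (counts["C"] + counts["D"]) / total, 1)

print(total)             # 1999
print(percent)           # {'A': 24.0, 'B': 24.6, 'C': 15.9, 'D': 35.5}
print(poorly_evidenced)  # 51.4
```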
The proportion of level D evidence increases with the number of recommendations made. This correlation is significant with Kendall's τ=0.22 (approximate 95% CI 0.008 to 0.45), p=0.04; and Spearman ρ=0.22 (approximate 95% CI 0.02 to 0.57), p=0.04.
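For readers unfamiliar with the two rank correlation coefficients, the sketch below shows how Kendall's τ (the tau-b variant, which adjusts for ties) and Spearman's ρ can be computed in plain Python. The five data points are hypothetical and chosen only for illustration; they are not the study's per-guideline data, and the sketch does not reproduce the confidence intervals or p values reported here. Which tie correction or software the authors used is not stated, so tau-b is an assumption.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b: (concordant - discordant) pairs, adjusted for ties."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: contributes to no term
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt(
        (untied + tied_x_only) * (untied + tied_y_only)
    )


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# HYPOTHETICAL example data, not taken from the study:
# recommendations per guideline and proportion graded D.
n_recommendations = [10, 20, 30, 40, 50]
proportion_d = [0.1, 0.3, 0.2, 0.5, 0.4]

tau = kendall_tau_b(n_recommendations, proportion_d)
rho = spearman_rho(n_recommendations, proportion_d)
print(f"tau-b = {tau:.2f}, rho = {rho:.2f}")
```

Both coefficients depend only on the ordering of the values, which makes them suitable when, as here, no linear relationship between guideline size and the proportion of level D recommendations is assumed.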
This study reveals that expert groups who produce long guidelines rely on poor evidence more heavily than others. Although it looks only at SIGN, the study highlights a problem that has escaped the notice of national guideline developers, a wide range of professionals and the public to whom these guidelines are applied. National guidelines are useful and important, and there is a debate about how evidence is best presented. Guidelines define standards of care, help busy clinicians and allow managers and politicians to develop governance. An American study (using 3 not 4 levels of evidence) similarly found that 48% of recommendations were ‘based on expert opinion, case studies or standards of care’;8 we show comparable results for current SIGN guidelines. Where patients are involved in clinical decisions, honestly declaring uncertainty has merit. In the absence of good scientific evidence, recommending a course of action without understanding the circumstances of the individual to whom it is applied seems both risky and, assuming a right to patient choice, unwarranted. Developers of other guidelines that rely heavily on poor evidence should evaluate the proportion of poorly evidenced recommendations and seek explanations for such trends.
This study did not examine why longer guidelines use poorer evidence. Groups of experts, indulging in ‘group think’, may view their own opinion as more authoritative than science can support.9 It has been postulated that there is security in “just doing what everyone else is doing—even if what everyone else is doing isn't very good.”3 Reliance on expert opinion has a poor track record. Blinded by certainty, expert groups defining established practice have, in the past, perpetuated radical mastectomy instead of conservative surgery, class 1C antiarrhythmics,10 pulmonary artery catheters in heart failure11 and electronic fetal monitoring in low-risk pregnancies; even then, practice can take a decade to reverse.12
Even good evidence is subject to the phenomenon of reversal where new evidence contradicts current practice. Reversal can affect around 13–16% of publications.5 ,6 This may partly explain why the implementation of even the most soundly evidence based national guidelines fails to improve outcome.13–15 There is potential harm16 ,17 from guidelines in real clinical settings, for example, increasing radiation dose without benefit18 or increased risks of anticoagulation.19
SIGN 116 (diabetes) is a notable outlier. It is more than 50% longer than the next longest guideline and about 2.5 times the average length, yet it has the fourth lowest use of level D recommendations. There are a number of hypotheses as to why this group reports differently. SIGN guidelines inform QOF policy. Diabetes is the largest clinical UK QOF indicator and is associated with substantial payment incentives. The need for objective evaluation of performance drives a use of surrogate outcomes without appropriate clinical endpoints.20 Diabetes guidelines have suffered several noteworthy reversals. Examples include the recommendation to reduce glycosylated haemoglobin, which resulted in increased use of rosiglitazone (still mentioned in the current document); both are associated with harm, including mortality.21 ,22 Aspirin recommendations have also been changed from previous guidelines. Is it possible that the repeated use of surrogate outcomes arises from group dynamics driven by a powerful external agenda?
Many doctors whose expertise crosses several guidelines23 ,24 express concerns about guideline development groups. The inappropriate exclusion of disease groups from general population data is common. Smoking cessation advice is applicable to the general population almost without exception, yet the evidence to stop smoking was graded as level B on three occasions and as level C and level D once each. Interpreting evidence inconsistently in this way may imply group dysfunction. Differently constituted groups, or greater oversight, might avoid such problems.
In 1993, SIGN's stated intention was for its guidelines to be evidence based, brief and succinct. Brevity increases value as a quick reference guide. Removing or reducing poorly evidenced recommendations would reduce size by more than a third overall and, in some guidelines, by up to two-thirds. The two-volume Oxford Textbook of Primary Medical Care (2005) is a relatively brief 1420 pages, more than 1000 fewer than the 2559 pages of guidelines. Evidence-based medicine is described as “the use of mathematical estimates of the risk of benefit and harm, derived from high-quality research on population samples, to inform clinical decision-making in the diagnosis, investigation or management of individual patients.”25 The relevance of guidelines to daily practice, the reliability of evidence and whether the application of evidence will improve outcomes are important questions.
These results may reflect how professional groups deal with uncertainty. If so, this is not good for individual patients faced with the same uncertainties (whether aware of it or not), nor is it good for scientists who actively seek unanswered questions by challenging established practice, an area in which medicine has a poor record from Semmelweis to the present day.
The significant increase in level D recommendations in larger guidelines is unlikely to have arisen by chance. A wider debate is needed about how guideline groups can create greater clarity about the reliability of the evidence used.26 Reducing the use of poorly evidenced recommendations has the potential to create shorter, more reliable and more usable clinical support. The GRADE working group was formed in 2000.27 SIGN moved to a new grading system in 2001,28 and from 2013 to a new system based on GRADE principles. Whether these changes will resolve the challenges that underpin the inconsistencies we have outlined remains to be seen.
The authors would like to thank Heather Barrington, Statistical Adviser; Bridget Bird, Administrative Assistant; Research and Development Support Unit, Dumfries and Galloway, and Anne B Baird (ABB), Sandhead Surgery, Sandhead, Wigtownshire.
Contributors AGB and JRL were involved in revising the raw data and agreed on a statistical approach to discover whether the trend was significant or not, and were also involved in writing and researching the evidence.
Funding This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None.
Provenance and peer review Not commissioned; externally peer reviewed.