Effectiveness of interventions to reduce ordering of thyroid function tests: a systematic review

Objectives To evaluate the effectiveness of behaviour changing interventions targeting ordering of thyroid function tests. Design Systematic review. Data sources MEDLINE, EMBASE and the Cochrane Database up to May 2015. Eligibility criteria for selecting studies We included studies evaluating the effectiveness of behaviour change interventions aiming to reduce ordering of thyroid function tests. Randomised controlled trials (RCTs), non-randomised controlled studies and before and after studies were included. There were no language restrictions. Study appraisal and synthesis methods 2 reviewers independently screened all records identified by the electronic searches and reviewed the full text of any deemed potentially relevant. Study details were extracted from the included papers and their methodological quality assessed independently using a validated tool. Disagreements were resolved through discussion and arbitration by a third reviewer. Meta-analysis was not used. Results 27 studies (28 papers) were included. They evaluated a range of interventions including guidelines/protocols, changes to funding policy, education, decision aids, reminders and audit/feedback; often intervention types were combined. The most common outcome measured was the rate of test ordering, but the effect on appropriateness, test ordering patterns and cost were also measured. 4 studies were RCTs. The majority of the studies were of poor or moderate methodological quality. The interventions were variable and poorly reported. Only 4 studies reported unsuccessful interventions but there was no clear pattern to link effect and intervention type or other characteristics. Conclusions The results suggest that behaviour change interventions are effective particularly in reducing the volume of thyroid function tests. 
However, due to the poor methodological quality and reporting of the studies, the likely presence of publication bias and the questionable relevance of some interventions to current day practice, we are unable to draw strong conclusions or recommend the implementation of specific intervention types. Further research is thus justified. Trial registration number CRD42014006192.


Strengths and limitations of this study
▪ The current systematic review was conducted following the methods recommended by the Cochrane Collaboration. We worked to a prespecified protocol and consider our findings to be robust.
▪ This is the first review focusing specifically on the effectiveness of interventions designed to reduce unnecessary ordering of thyroid function tests.
▪ The evidence suggests that, in general, such interventions are effective in reducing the volume, changing the pattern of ordering, improving compliance with guidelines or reducing the cost of thyroid function tests ordered. Whether such changes reflect more appropriate test ordering remains unclear as measures of appropriateness were rarely reported.
▪ However, the poor quality of evidence, the significant heterogeneity in study design and the likely presence of publication bias and selective reporting did not allow strong conclusions and more specific recommendations to be made and precluded pooling the result from the individual studies.

INTRODUCTION
Thyroid dysfunctions including hypothyroidism and hyperthyroidism are among the most common medical conditions with prevalence 3.82% (3.77-3.86%) and incidence 259.12 (254.39-263.9) cases per 100 000/year in Europe. 1 Both undertreatment and overtreatment of these conditions may have serious consequences for the patient's health and, therefore, correct and timely diagnosis and monitoring are important. 2 3 The diagnosis of thyroid dysfunctions, however, is challenging as they present with common and non-specific symptoms: a range of laboratory investigations, such as thyroid-stimulating hormone (TSH), free thyroxine (FT4) and free tri-iodothyronine (FT3) are readily available to rule them in or out. In the UK alone 10 million thyroid function tests (TFTs) are ordered each year at an estimated cost of £30 million. 4 Although national guidelines for the use of TFTs exist, 4 a recent audit of general practitioners' (GPs) ordering patterns conducted by our group in the South West of England found that there is a sixfold variation in the rates of test requests between different practices. The study also demonstrated that only about 24% of this variation could be accounted for by variation in the prevalence of hypothyroidism and socioeconomic deprivation. 5 The National Health Service (NHS) Atlas of Variation in Diagnostic Services published in November 2013 6 reported even more extreme variation in the annual rate of TFTs ordered by GPs per practice population across different primary care trusts in England. In this report, the estimated annual rate for TSH ordered by GPs ranged from 6.2 to 355.8 per 1000 practice population (57-fold variation). The reported numbers for FT4 and FT3 were 14.6-231.1 (16-fold) and 0.42-17.0 (40-fold) per 1000 practice population, respectively (p. 122).
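The fold variations quoted from the NHS Atlas can be reproduced from the reported ranges. The sketch below is illustrative only: it assumes that "fold variation" means the highest rate divided by the lowest, rounded to the nearest whole number, and the function and variable names are ours rather than anything defined in the Atlas.

```python
def fold_variation(lowest: float, highest: float) -> float:
    """Ratio of the highest to the lowest ordering rate (assumed definition)."""
    if lowest <= 0:
        raise ValueError("lowest rate must be positive")
    return highest / lowest


# Annual tests ordered by GPs per 1000 practice population (lowest, highest),
# as quoted in the text above.
atlas_ranges = {
    "TSH": (6.2, 355.8),
    "FT4": (14.6, 231.1),
    "FT3": (0.42, 17.0),
}

# Round to whole numbers, matching the way the figures are quoted.
fold_by_test = {
    name: round(fold_variation(low, high))
    for name, (low, high) in atlas_ranges.items()
}

for name, fold in fold_by_test.items():
    print(f"{name}: about {fold}-fold variation")
```

Under this assumed definition the three ranges give 57-fold, 16-fold and 40-fold variation, in line with the figures quoted.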
A qualitative study we conducted identified a wide range of mechanisms that might be responsible for the variation, including the presence of inappropriate test ordering. 7 Given the continuous rise of thyroid test requests, 8 9 which is disproportionate to the increase in the incidence and prevalence of thyroid conditions, 9 and the fact that these investigations make up a significant proportion of all laboratory tests ordered in primary care, 10 there is a need to help clinicians avoid inappropriate thyroid testing. Such testing not only increases laboratory workload and wastes scarce resources but may also have a negative impact on patients' health through further unnecessary tests and inappropriate treatment. 11 The effectiveness of interventions designed to reduce the number of unnecessary medical tests has already been evaluated in a number of systematic reviews. [12][13][14][15][16] Owing to their broad scope, however, the results are too general and of little help when it comes to designing interventions that target specific test ordering behaviour. The effect of the same intervention may vary considerably across different tests, even when they belong to the same diagnostic modality. 10 17-19 We conducted a systematic review investigating the effect of behavioural interventions on the ordering of TFTs. We believed a more narrowly focused approach with respect to target behaviour might produce more applicable results and thus better inform the development and implementation of interventions specifically designed to improve TFT ordering.

METHODS
In conducting the review, we followed the recommendations of the Cochrane Collaboration. 20 MEDLINE, EMBASE and the Cochrane Database of Systematic Reviews were searched using a predefined search strategy (see online supplementary appendix 1). The original search covered the period up until November 2013 and was updated on 1 May 2015. Also, the bibliographies of the included studies and other relevant publications were scrutinised for additional articles. Studies were selected independently by two reviewers (ZZ and RA) with all disagreements resolved through discussion and, if necessary, arbitration by a third reviewer (CH or BV). In the first round, all electronically identified citations were screened at title and abstract level. Full-text copies of potentially relevant articles were retrieved for full-text screening. Studies were included in the review if they met the following prespecified criteria:
▸ Evaluated the effectiveness of interventions designed to reduce the number of inappropriately ordered TFTs (regardless of whether they were the only targeted tests or not).
▸ Were randomised controlled trials (RCTs), non-randomised controlled studies or single-group before and after studies (including both those with trend before and after and those with just one time point before and after).
▸ The outcomes were one or more of the following: change in the total number of TFTs, the number of inappropriately ordered tests, the test-related expenditure or health benefits to individual patients (eg, the number of unnecessary tests or treatments avoided).
▸ Reported the specific effect that the intervention had on the targeted TFTs. Studies that targeted TFTs along with other tests and reported only the average effect (across all tests) were excluded.
We included all studies that used the rate of inappropriately ordered TFTs as an outcome measure regardless of the definitions they used.
Appropriateness of test ordering is usually judged against local protocols or guidelines that may vary from place to place or change over time. We accepted all definitions even when they were outdated or did not fit in with the current UK guidelines. We did not use the setting and the targeted clinicians' characteristics as inclusion criteria but explored, as far as possible, their potential impact on the study outcomes. The methodological quality of the included studies was assessed independently by ZZ and RA using the Effective Public Health Practice Project tool which allows the assessment of all study designs with the same rubric. 21 The method of synthesis was narrative; meta-analysis was not used because of the anticipated clinical heterogeneity, particularly in terms of the interventions. The framework for the analysis was based on an existing typology of behaviour change intervention types: 12 15
▸ Educational interventions;
▸ Guideline and protocol development and implementation;
▸ Changes to funding policy;
▸ Reminders of existing guidelines and protocols;
▸ Decision-making tools, including test request forms and computer-based decision support;
▸ Audit and feedback.
All work conformed with a protocol defined and published ahead of the review being started (PROSPERO, registration number CRD42014006192).

RESULTS
The initial electronic searches produced 1282 hits of which, after removing duplicates, 869 were screened at title and abstract level and 99 were selected for full-text screening. Twenty-five of these papers, together with two additional papers identified through backward citation searching, met our prespecified criteria and were included in the review. 10 17-19 22-44 The update search identified another 131 records of which, after screening the titles and abstracts, 7 were selected for full-text screening and 1 met the inclusion criteria. 45 It should be noted that two papers 46 47 were excluded because in these studies TFTs were allocated to the control arm and, therefore, were not affected by the interventions. Thus, the total number of papers included in the review was 28, of which 2 reported on the same study, the second reporting a long-term follow-up. 30 31 The selection process and the reasons for full-text exclusion are detailed in figure 1.
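The selection arithmetic above can be checked with a short tally. This is a minimal sketch using only the counts stated in the text; the class and field names are hypothetical and are not taken from figure 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchRound:
    """Counts for one search round (field names are illustrative)."""
    records_screened: int          # screened at title and abstract level
    full_text_screened: int        # retrieved for full-text screening
    included_from_search: int      # met the inclusion criteria
    included_from_citations: int = 0  # found via backward citation searching

    @property
    def papers_included(self) -> int:
        return self.included_from_search + self.included_from_citations


original = SearchRound(
    records_screened=869,
    full_text_screened=99,
    included_from_search=25,
    included_from_citations=2,
)
update = SearchRound(
    records_screened=131,
    full_text_screened=7,
    included_from_search=1,
)

total_papers = original.papers_included + update.papers_included

# Two papers (refs 30 and 31) report on the same study, so they count once.
papers_reporting_same_study = 2
total_studies = total_papers - (papers_reporting_same_study - 1)

print(f"papers included: {total_papers}, studies included: {total_studies}")
```

The tally gives 28 papers and 27 studies, matching the totals reported in the text.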

Study characteristics
The characteristics of the included studies are summarised in table 1 and the evaluated interventions are presented in table 2. Ten studies were conducted in the USA, six in the UK and the rest in Australia (n=3), France (n=3), Canada (n=2), the Netherlands (n=1), Sweden (n=1) and New Zealand (n=1). All studies were published in English, except for one in Dutch, which was partly translated by a native Dutch speaker with a background in healthcare research. 38 The papers were published between 1979 43 and 2014: 45 seven of them were published before 1991, nine between 1991 and 2000, and 12 after 2000. Fourteen studies were conducted in a hospital setting including general and psychiatric hospitals, medical assessment units, emergency departments and a supraregional liver unit, with the remainder in primary care or community settings (table 1).
Education, guidelines/protocols and audit/feedback were the most common types of intervention employed. Reminders and decision tools were less commonly used and changes to funding were assessed in only two studies (table 2). Only three studies reported evaluation of computer-based test ordering, two of which were quite old, published in 1988 35 and 1994, 32 respectively. The most recent one 45 evaluated only a limited aspect of computerised test ordering: the display of the costs of the tests being ordered.
The median duration of the interventions was 12 months (IQR 6-12 months, range 2 days to 36 months) and four studies examined test ordering after the intervention had ended. 26 30 31 33 35 Description of the interventions was usually limited and often insufficient to allow replication. Where detail was provided, it revealed significant variability in the design and implementation of interventions superficially belonging to the same category. For instance, educational interventions varied in terms of content, intensity and frequency, method of delivery, who delivered and who received the intervention as well as other characteristics which are likely to affect their effectiveness and appropriateness for different contexts and purposes. Most interventions were targeted at both senior and junior doctors. In four studies, only junior doctors were included; 24 26 27 32 four studies included other medical staff, such as nurses, physician assistants and laboratory technicians, 17 23 30 33 and in one study, the intervention was specifically directed at nurses 41 (table 1).
In terms of targeted tests, 10 studies focused exclusively on TFTs, 22 23 38 four reported an average result without specifying the individual tests, 18 24 34 41 and the remaining studies targeted other combinations (table 1). As the studies spanned a long period of time, different generations of tests were used and the guidelines against which the appropriateness of test ordering was judged varied. However, in most studies, the recommended testing strategy was based on TSH as a single first-line test for suspected thyroid dysfunction and for monitoring patients on thyroid replacement hormones.
The effectiveness of the evaluated interventions with respect to TFTs is summarised in table 3 with additional information provided in table 4. The effect of the interventions was measured using a range of outcomes. Thus, 22 studies measured changes in the volume of test ordering expressed either as the absolute number of tests ordered for a period of time or normalised by the number of registered patients, visits or a similar parameter. To capture the effect on the pattern of test ordering, seven studies reported separate results for different TFTs 27 28 36 37 40 42 44 and three studies measured the change in the ratios of two different tests (for instance, whether the ratio 'TSH:all TFTs' had increased as a result of new guidelines recommending TSH as a single first-line test). 30 31 36 38 More direct evaluation of the appropriateness of testing was carried out by measuring adherence to protocols or guidelines, 25 26 32 33 37 43 with two of these studies reporting underutilisation as well as overutilisation. 25 26 Five studies reported effectiveness in terms of expenditure 34-36 39 44 and one study reported an estimate of the number of tests avoided as a result of the intervention. 39 In one study, researchers made an effort to investigate whether the evaluated intervention (in this case, a test-ordering protocol) had had any adverse effects on patient outcomes by conducting an audit of 4000 case notes and concluded that "No adverse patient outcomes relating to underutilisation of investigations attributable to the protocol were identified." (ref. 34, p. 133).

Study quality
The results from the methodological quality assessment are presented in table 5. In terms of study design, four were RCTs, 10 23 25 35 five studies were non-randomised controlled studies, 17 18 33 36 42 two were interrupted time series 39 45 and the remaining were single-group studies with just one time point measurement before and after. Most of the studies were of poor or moderate quality; the main issues being selection bias, lack of blinding and failure to control for confounders.

Effectiveness of the interventions

Single-mechanism interventions
Fourteen studies 10 17-19 22 24 25 27-31 33 35 45 evaluated the effectiveness of the following single-mechanism interventions (tables 2-4): educational programmes (one controlled and two before and after studies), 17 30 33 guidelines and protocols (four before and after studies), 19 22 24 28 reminders (two RCTs and one controlled study), 10 25 33 decision-making tools (two RCTs, one interrupted time series and one before and after study), 25 27 35 45 and audit and feedback (two controlled and two before and after studies). 17 18 29 33 The study by Schectman et al 33 was a two-stage study, with a before and after design in the first part and a comparative design in the second. The majority of the evaluated single-mechanism interventions were effective in decreasing test-related expenditure, 35 decreasing the volume of test ordering, 17 18 22 24 27-29 33 changing the pattern of TFT ordering 19 27 28 30 31 or increasing compliance 25 33 in accordance with the recommended practice.
Two of these studies reported data on test ordering once the intervention was discontinued. Mindemark and Larsson 31 investigated the effect of the 2-day educational programme originally evaluated by Larsson et al 30 in a before and after study. Eight years after the programme was delivered, they found that the ratios between pairs of different TFTs were similar to those measured at the end of the original study (1 year after the delivery of the programme). Only the ratio 'TSH:all TFTs' showed a slight but statistically significant decrease, which the authors explained by the recommendation given to participants to analyse TSH in elderly patients who had not been tested in the previous 2-3 years (table 4). Although impressive, the observed result is difficult to attribute to the educational programme alone as other contextual factors are likely to have contributed to the persistence of the effect.
Tierney et al 35 reported that 6 months after the intervention (display of computer-generated probability estimates evaluated in an RCT) the difference between the intervention and control groups had disappeared and the main outcome, charges per scheduled visit, had returned to baseline (table 4).
Three studies reported unsuccessful interventions: an RCT of good methodological quality demonstrated that a reminder in the form of a memorandum pocket card was unsuccessful in increasing compliance with the recommended thyroid testing strategy; 25 a poor quality before and after study showed that monthly feedback given to consultants for a period of 1 year was unable to decrease the ordering rates of a number of laboratory tests; 29 and an interrupted time series of moderate methodological quality demonstrated that displaying the cost of tests at the time of ordering was moderately effective in a small number of tests but did not affect the ordering of TFTs. 45 The first two studies were conducted in a hospital setting and the third was a primary care study.
The authors of the before and after study explained the failure of the intervention by the prevailing institutional culture; the fact that the clinical units ordering the largest proportion of tests showed little concern and attributed their requesting pattern to clinical workload and the nature of their patients; and by the fact that the feedback was provided to consultants only, on their request, while many of the tests were ordered by junior doctors unaffected by the intervention. 29 The authors of the interrupted time series study surveyed all intervention and non-intervention physicians to investigate their perceptions regarding the intervention and healthcare costs in general. They found that while nearly all participants endorsed the need for cost containment and found the display of costs informative, 50% of them reported that the displays 'rarely' or 'never' impacted their decision to order the tests. 45
Berwick and colleagues, who compared two different types of feedback (on cost and on yield) with test-specific education, reported mixed results. Across all tests, only feedback on cost showed a statistically significant effect, whereas for the TFTs, test-specific education had the largest effect. With regard to reducing variability in test ordering, the two forms of feedback, but not education, led to a positive change. The statistical significance of the results specific to TFTs, however, was not reported. 17

Multifaceted interventions
Sixteen studies evaluated interventions that relied on more than one mechanism to change test ordering behaviour (tables 2-4). 10 23 25 26 32-34 36-44 Ten of them combined the introduction of guidelines or protocols with audit and feedback, 23 education, 41 43 redesign of a test request form, 39 42 changes to funding policy, 39 44 education plus audit and feedback, 34 36 reminders, 37 or education plus a test request form. 40 Of the remaining six studies, three evaluated the combination of audit and feedback with education, 26 reminders 10 or a problem-oriented test request form; 38 one study evaluated the effectiveness of a computer-based protocol management system enhanced with audit and feedback and education; 32 one compared an educational memorandum followed by a reminder with the same educational memorandum followed by a reminder and feedback; 33 and one compared a combination of pocket memory card (reminder) and redesign of test request form with the same single-mechanism interventions and usual practice defined as simple diffusion of guidelines. 25 With the exception of one study, 23 all reported that the evaluated multifaceted interventions were effective in decreasing the volume of test ordering, 10   accordance with the recommended practice, 26 33 37-40 42-44 avoiding unnecessary testing 39 and/or decreasing test-related expenditure. 34 36 39 Two before and after studies reported data on test ordering once the intervention was discontinued. 26 33 Dowling and colleagues evaluated the effectiveness of education plus feedback and measured the rate of TSH ordered per patient visit and the indicated and non-indicated TSH per visit. Despite the initial statistically significant effect, 5 months after the intervention was discontinued, all indicators showed some decline. Schectman et al 33 evaluated the effect of an educational memorandum followed by a reminder, or by a reminder and feedback, on compliance and the mean number of TFTs ordered per patient. The interventions increased compliance and led to a significant decrease in test ordering, which persisted 6 months after the interventions but had increased again at 1 year follow-up (table 4).
The study that reported an unsuccessful intervention was an RCT of good methodological quality and was conducted in primary care. It targeted the use of five frequently ordered laboratory tests including TSH and FT4 and evaluated the effectiveness of a combination of guidelines and feedback. The authors explained the failure of the intervention by the following: the feedback was provided for 1 year only, the participating practices did not volunteer to take part in the study and the guidelines might not have been sufficient to predispose physicians to change their test ordering behaviour. 23 Owing to the significant heterogeneity and poor methodological quality of the studies, we deemed pooling the results inappropriate and were unable to use statistical methods to investigate the impact of various study and intervention characteristics on the reported outcomes. Visual inspection of the data suggests, however, that differences such as intervention type, study design, setting and year of publication have little or no impact on the reported effectiveness. We created a spreadsheet which can be used by the readers to explore this themselves by sorting the results according to different study characteristics (see online supplementary appendix 2).

DISCUSSION
Main findings
This systematic review of behaviour change interventions designed to modify the ordering of TFTs found 27 studies. Several intervention types were evaluated including education, guidelines and protocols, audit and feedback, decision-making tools, changes to funding policy and reminders, either alone or in combination, in either primary or hospital care, and targeting clinicians of different seniority. Most of the studies were of poor or moderate quality and many of the interventions were poorly described.
In these studies, it appears that behaviour change interventions were, in general, effective in reducing the volume, changing the pattern of test ordering, improving compliance with guidelines or reducing the cost of TFTs ordered. Whether such changes reflect more appropriate test ordering, however, was unclear as in the majority of studies measures of appropriateness were undefined, and thus unreported. No study investigated directly any impact on patient outcomes. Only five studies observed the effect of the intervention on test ordering for more than 12 months 28 34 36 37 44 and only four reported the persistence of effect once the intervention had ended, 26 30 31 33 35 of which three reported return to baseline or decline within 1 year. 26 33 35 Although not the subject of an a priori subgroup analysis, multifaceted interventions did not appear to be more effective than single-mechanism ones. The specific type of intervention(s) appeared less important than the interaction between various intervention-specific variables and the implementation of the intervention in a specific context. For instance, feedback could be very successful if there was strong institutional support for change 26 or completely ineffective if the changes it was trying to introduce clashed with the dominant institutional culture. 29 Even within a single study, the same intervention performed differently in different clinical circumstances (eg, inpatients vs outpatients) 18 or different interventions seemed to be effective for different types of tests. 17 45 As Mindemark and Larsson 31 put it "… the most decisive factor for the success of a strategy in optimizing test ordering is not the nature of the intervention itself, but rather its design and implementation in a given setting." (p. 485) The effectiveness of the interventions depended to some extent on the outcomes the researchers had chosen to measure.
These outcomes reflected specific assumptions about what constitutes inappropriate test ordering and how this could be changed. 'Appropriateness' is, to a large extent, a value judgement, incorporating elements of importance of the diagnosis, benefits from early diagnosis (and the converse, harms from delays in diagnosis), burden and unpleasantness of the test, plus economic considerations. These aspects were rarely, if ever, explicitly reported as part of the rationale for the intervention and selection of the outcome measure. It was simpler to examine the volume of testing or the shift in the pattern of TFTs ordered, which most studies did, but clinically, this is a blunt measure of appropriateness.
For instance, in one study a one-off educational event encouraged GPs to use TSH as a single first-line test 30 and the intervention was reported to have a long-term effect. 31 The following ratios were used as outcome measures: 'TSH:all TFTs' (which was expected to increase) and 'T3:TSH' and 'FT4:TSH' (which were expected to decrease as a result). The intervention therefore did not address inappropriate ordering of TSH, and the chosen outcomes could not capture a possible 'shift' where doctors inappropriately ordered more TSHs while ordering fewer T3 and FT4 tests. An interrupted time series analysis which investigated the effect of a series of interventions over a period of several years clearly demonstrates such a possibility. 39
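The limitation of ratio-based outcomes can be illustrated with a small worked example. The test counts below are entirely hypothetical, chosen only to show the arithmetic, and are not data from any included study; the function name is likewise ours.

```python
def ordering_ratios(tsh: int, ft4: int, t3: int) -> dict[str, float]:
    """Ratio outcomes of the kind described above, from raw test counts."""
    all_tfts = tsh + ft4 + t3
    if tsh <= 0 or all_tfts <= 0:
        raise ValueError("test counts must be positive")
    return {
        "TSH:all TFTs": tsh / all_tfts,
        "FT4:TSH": ft4 / tsh,
        "T3:TSH": t3 / tsh,
    }


# HYPOTHETICAL annual test counts for one practice, before and after an
# intervention promoting TSH as the single first-line test.
before = {"tsh": 1000, "ft4": 600, "t3": 400}
after = {"tsh": 1300, "ft4": 300, "t3": 100}

ratios_before = ordering_ratios(**before)
ratios_after = ordering_ratios(**after)

for name in ratios_before:
    print(f"{name}: {ratios_before[name]:.2f} -> {ratios_after[name]:.2f}")

# All three ratios move in the 'expected' direction, yet the number of TSH
# tests has risen, which the ratios alone cannot reveal.
tsh_change = after["tsh"] - before["tsh"]
print(f"change in TSH tests ordered: {tsh_change:+d}")
```

In this made-up scenario every ratio improves while 300 more TSH tests are ordered, so whether the extra TSH tests were appropriate would have to be assessed separately.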

Strengths and limitations of study
We conducted the current systematic review using the methods recommended by the Cochrane Collaboration working to a prespecified protocol and consider our findings to be robust. This is the first review focusing specifically on the effectiveness of interventions designed to reduce unnecessary ordering of TFTs. The identified evidence is directly relevant to this particular test ordering behaviour and could be used to guide the design and implementation of future intervention programmes as well as the development of research projects that could address the identified gaps in knowledge. The main limitation is that the quality of evidence did not allow strong conclusions and more specific recommendations to be made. Furthermore, the disparate methods, populations of study, interventions and outcome measures made pooled synthesis of results impossible. Thus, we have chosen to present the results as a narrative synthesis. Similarly, although we strongly suspect that publication bias and selective reporting of outcomes may be operating, particularly for the nonrandomised study designs, we could neither investigate nor attempt to quantify the potential impact. We think it is likely that publication bias has exaggerated the results, but is highly unlikely to completely account for the overall beneficial pattern observed. That the more rigorous designs gave less marked and even negative results (tables 3 and 4 and online supplementary appendix 2) adds a note of caution however.

Comparison with other studies
We are not aware of another systematic review with a similar focus. Other systematic reviews have examined the effectiveness of behavioural interventions designed to influence test ordering as a whole. [12][13][14][15][16] Our review focused on a specific diagnostic scenario: the use of laboratory tests to diagnose and monitor thyroid dysfunction.

Clinical interpretation of the results
Superficially, based on the predominant pattern of favourable results from the included studies, there would seem to be little doubt that where there is evidence that TFT ordering needs to be modified, the interventions employed in the included studies could be used to effectively reduce volume, improve compliance, change the pattern of testing or reduce cost. However, we believe some caution is required. The most fundamental issue is that in many included studies, the detail about the nature of the intervention is insufficient for implementation; the situation is particularly acute where the target behaviour is appropriateness, because the value of achieving compliance is completely dependent on the definition of appropriateness being used and how it was derived. The problem concerning insufficient definition of the intervention has been noted before and is, we believe, particularly relevant here. 48 Although additional details may exist outside the published paper and be obtained through personal contact with the investigators, much information about the interventions is likely to be unavailable, particularly for older studies.
Similarly, the current applicability of many of the interventions can be questioned. The circumstances operating in many of the studies may not be similar to the challenges today. A simple example from the study by Thomas et al, 10 clearly applicable to UK primary care, is that the rates of test ordering were in the region of 800 per 10 000 practice patients, whereas the equivalent rates in a recent study in the South West were 2500 per 10 000. 5 The importance of this is accentuated by the fact that interventions do not seem to have been designed in the light of investigations to understand the origin of the difficulties underlying the behaviour they were attempting to address. Thus, in primary care, it is widely accepted that the reason why GPs order tests is frequently not for medical reasons, 49 yet most interventions we encountered, such as education and guidelines, assume that lack of medical knowledge is the underlying difficulty. Our own investigations of GPs' reasons why TFT ordering rates might vary identified many factors, such as quality of computer systems, communication with hospital systems, general attitude to risk, involvement of other members of the primary care team in test ordering and patient expectations, all issues which are unlikely to have been addressed by any of the interventions we observed. 7 Our concerns about openness to bias of the included studies and possibility of publication and outcome reporting bias reinforce our circumspection about whether the evidence reviewed is good enough to implement.
Given that the majority of TFTs are ordered by GPs in the UK 5 and that guidelines for the use of these tests in primary care already exist (though they appear to be ineffective), interventions that raise clinicians' awareness of these guidelines, 'translate' them into easy-to-follow rules and embed them in decision aids are a potentially effective combination. This review suggests that most interventions succeed (albeit in the limited way described above), so it is probable that an intervention can be designed that would work in UK primary care. Other factors will still be relevant, as highlighted by the qualitative study, such as lack of communication, problems with storage and retrieval of previous results, and lack of local protocols that structure the ordering of TFTs in accordance with the existing guidelines. 7

Conclusions and policy implications
The systematic review we have conducted indicates that behaviour change interventions can modify TFT ordering. While this is a starting point for implementation, we do not believe the evidence base is complete and strongly recommend further research. As well as overcoming the limitations highlighted concerning bias, improving details of the interventions to be implemented and improving applicability to current challenges, new research can also address questions barely touched on by the existing evidence base. Such questions include the effectiveness of interventions like computerised test ordering systems (order comms) in primary care, how to maintain the effect on test ordering over several years and cost-effectiveness. The scale of the problem is important in this regard. While a few of the included studies had large effects, most had only small effects, which would not be large enough to have an impact on, for instance, the sixfold variation in test ordering in primary care observed in the South West 5 or the even more extreme variation observed across the UK. 6 The current study also demonstrates that even though reviewing the evidence with the target behaviour clearly in mind is a more productive approach than looking at similar behaviour change interventions applied to a wide variety of targets, such an approach has limitations. For instance, many studies targeting a wide range of test ordering behaviour reported only an average effect, even when it was clear that the effect on the ordering of different tests was quite different. This limited the number of studies available for inclusion and probably accounted for the small number of studies evaluating specific interventions such as those based on computerised test ordering systems. Moreover, even when the studies reported test-specific effects, they rarely investigated the reasons for this variation and failed to provide an explanation of the observed differences.
Given the poor methodological quality of many of the included studies, this made it difficult to draw reliable conclusions and make recommendations. This suggests that although reviews similar to this one, looking at the effectiveness of behaviour change interventions on modifying the ordering of other routine tests, would be helpful, a novel approach may be necessary. Such an approach could focus on similar test ordering behaviours rather than similar tests and could incorporate a wider range of evidence, able not only to demonstrate the effectiveness of different interventions but also to provide insight into the mechanisms behind specific behaviour modifications.