
Original research
Consequences of early thyroid ultrasound on subsequent tests, morbidity and costs: an explorative analysis of routine health data from German ambulatory care
  1. Susann Hueber1,
  2. Valeria Biermann2,
  3. Johanna Tomandl1,
  4. Lisette Warkentin1,
  5. Angela Schedlbauer1,
  6. Harald Tauchmann3,
  7. David Klemperer1,
  8. Maria Lehmann4,
  9. Ewan Donnachie5,
  10. Thomas Kühlein1
  1. 1 Institute of General Practice, Universitätsklinikum Erlangen, Erlangen, Bayern, Germany
  2. 2 Chair of Health Management, Friedrich-Alexander-Universität Erlangen-Nürnberg, Nürnberg, Bayern, Germany
  3. 3 Professorship of Health Economics, Friedrich-Alexander-Universität Erlangen-Nürnberg, Nürnberg, Bayern, Germany
  4. 4 Institute for Medical Informatics, Biometry and Epidemiology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Bayern, Germany
  5. 5 Bavarian Association of Statutory Health Insurance Physicians, Munich, Germany
  1. Correspondence to Dr Susann Hueber; susann.hueber{at}


Objectives This study aims to evaluate whether the use of thyroid ultrasound (US) early in the work-up of suspected thyroid disorders triggers cascade effects of medical procedures and to analyse effects on morbidity, healthcare usage and costs.

Study design Retrospective analysis of claims data from ambulatory care (2012–2017).

Setting Primary care in Bavaria, Germany, 13 million inhabitants.

Participants Patients having received a thyroid stimulating hormone (TSH) test were allocated to (1) observation group: TSH test followed by an early US within 28 days or (2) control group: TSH test, but no early US. Propensity score matching was used adjusting for socio-demographic characteristics, morbidity and symptom diagnosis (N=41 065 per group after matching).

Primary and secondary outcome measures Using cluster analysis, groups were identified regarding frequency of follow-up TSH tests and/or US and compared.

Results Four subgroups were identified: cluster 1: 22.8% of patients, mean (M)=1.6 TSH tests; cluster 2: 16.6% of patients, M=4.7 TSH tests; cluster 3: 54.4% of patients, M=3.3 TSH tests, 1.8 US; cluster 4: 6.2% of patients, M=10.9 TSH tests, 3.9 US. Overall, reasons that explain the tests could rarely be found. An early US was mostly found in clusters 3 and 4 (83.2% and 76.1%, respectively, were part of the observation group). In cluster 4 there were more women, thyroid-specific morbidity and costs were higher and the early US was more likely to be performed by specialists in nuclear medicine or radiologists.

Conclusion Presumably unnecessary tests in the field of suspected thyroid diseases seem to be frequent, contributing to cascade effects. Neither German nor international guidelines provide clear recommendations for or against US screening. Therefore, guidelines on when and when not to apply US are urgently needed.

  • Thyroid disease

Data availability statement

Data are available upon reasonable request. The data that support the findings of this study are available from the Bavarian Association of Statutory Health Insurance Physicians but restrictions apply to the availability of these data, which were used under licence for the current study and are not publicly available. Data may be obtained from the authors upon reasonable request and with permission of the Bavarian Association of Statutory Health Insurance Physicians.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:


Strengths and limitations of this study

  • The analysis of claims data reflects healthcare in the real-world setting.

  • Results of inpatient care were not available. Therefore, our results are restricted to the outpatient sector.

  • Accuracy of diagnoses in health insurance data is limited.


Introduction

In recent years, a steep increase in the incidence of thyroid nodules and thyroid cancer has been observed while at the same time cancer-specific mortality rates remained low.1–3 This epidemiological pattern of divergence between cancer incidence and mortality can best be explained by overdiagnosis.4–6 The term overdiagnosis describes the detection of findings labelled as diseases that, if not detected, would not have caused problems or symptoms.7 8 Overdiagnosis often results in overtreatment, a medical treatment with a negative benefit–harm balance for the patient.7 The detection of a nodule by a first thyroid ultrasound (US) might trigger further testing and follow-up, frequently ending with invasive treatment such as thyroidectomy.9–11 It is believed that the increasing incidence of thyroid cancer is mainly due to the detection of small low-risk tumours, discovered through the zealous and often inappropriate use of imaging technology such as thyroid US.12–15

About 90% of the detected carcinomas are classified as small papillary thyroid cancers, a lesion with a very favourable prognosis, explaining why the sharp increase in nodule detection has resulted in increased thyroid cancer rates but unchanged cancer mortality rates.9 12 Although papillary cancer as a biological entity is rarely threatening, the diagnosis is very threatening psychologically. Moreover, it has been shown that no longer calling papillary thyroid tumours cancer can reduce patients’ anxiety and help them consider less invasive management options.16 Several researchers and experts advocate renaming papillary microcarcinomas to emphasise the fact that these tumours have a very good prognosis (among other studies by Brito et al, Nickel et al and Rosai et al).12 17 18 The German ‘Choosing Wisely’ offshoot ‘Klug entscheiden’ states that at least the elderly should not be screened for thyroid cancer.19 While the benefit of such screening for patients is limited at best, they are put at risk of experiencing major harms due to overdiagnosis and overtreatment.20 21

The US Preventive Services Task Force recently recommended against screening for thyroid cancer.20 There are no clear recommendations for or against US screening in the guidelines of the European and the American Thyroid Association.22 23 The guideline of the German Society for General Practitioners takes a critical view of the early use of thyroid US in patients with suspected functional thyroid disorders such as hypothyroidism.24 However, US devices are widely available in Germany, even in primary care practices. It is therefore very likely that US is used very early in the diagnostic process, even when it is probably not indicated. As shown in Korea, administering US without clinical justification contributes to the increase of overdiagnosis of thyroid cancer without reduction in thyroid cancer mortality.10 US may lead to cascade effects, defined as chains of diagnostic and therapeutic events that are difficult to influence once they have been started.25 Cascade effects can be observed as subsequent and most likely unnecessary diagnostic tests and treatments after an initial non-indicated US examination. As pointed out by Mandrola and Morgan, there has been little systematic research on cascade effects.26 Yet, in a survey of specialists in internal medicine, 90% of respondents experienced unwanted cascade effects after incidental findings.27

Investigating overdiagnosis and overtreatment was the main objective of the PRO PRICARE research network (PReventing Overdiagnosis in PRImary CARE). One of the network’s projects focused on the investigation of cascade effects arising from the early use of US in the work-up of suspected thyroid diseases using routine healthcare data. A first analysis of costs arising from the early and presumably non-indicated thyroid US revealed some moderate effects on thyroid-specific costs.28 The effects on morbidity and patients’ healthcare usage have not been analysed. Therefore, the study presented here evaluates whether the use of thyroid US early in the work-up of suspected thyroid disorders triggers cascade effects of medical procedures. We aimed to describe how patients’ healthcare usage, morbidity and costs develop after the early use of thyroid US. In addition, we wanted to identify groups of patients differing in the frequency of diagnostic tests and to compare these groups regarding socio-demographic characteristics, morbidity and costs.


Methods

Study design and data basis

A retrospective observational cohort study using a quasi-experimental design was conducted. Claims data of patients with statutory health insurance held by the Bavarian Association of Statutory Health Insurance Physicians (Kassenärztliche Vereinigung Bayerns, KVB) were used. The reporting of the study is based on the German GPS (Good Practice Secondary Data Analysis)29 and RECORD statement (REporting of studies Conducted using Observational Routinely-collected health Data) recommendations.30 Ethical approval was not required, as German law allows anonymous data to be analysed for research purposes without patients’ consent.

In Germany, physicians accredited with statutory health insurances send their reimbursement claims for provided ambulatory medical services to their corresponding regional Association of Statutory Health Insurance Physicians. This procedure applies to the reimbursement of all services for statutorily insured patients, but not for privately insured patients or patients covered by a selective contract, for example, general practitioner (GP)-centred care (Hausarztzentrierte Versorgung). GP-centred healthcare is a special form of healthcare contract in Germany in which a GP is assigned a gate-keeping function. Billing is done directly with the health insurance company and not through the Association of Statutory Health Insurance Physicians. Data of these patients were therefore missing or incomplete in the data set. Data are delivered on a quarterly basis. The dates of the services billed are given to the exact day, whereas those of the diagnoses are only given per quarter. For each case, the data set contained an anonymous unique patient identifier, information regarding age in 5-year intervals, sex and diagnoses encoded according to the International Classification of Diseases (German modification V.10, ICD-10-GM). Unlike the international version, the German modification allows physicians to add the level of diagnostic certainty, such as suspected, confirmed, excluded or status post. The data set also contained details about physicians’ age (in 5-year intervals) and sex, practice location (rural, urban, large city and administrative district), practice type (single-handed or group practice, number of physicians per practice) and specialist training. Further information was provided on regional deprivation coded as Bavarian Index of Multiple Deprivation (BIMD,20).

Data sample

Data captured between January 2012 and December 2017 were analysed (maximum 24 quarters). To analyse the onset of cascades, only data from patients with no prior history of thyroid examinations and/or thyroid disorder were selected. Data of patients who were 18 years or older and received a first thyroid stimulating hormone (TSH) test (as captured by the corresponding billing code) were included. To assess the influence of US applied early after the initial TSH test, the data set was divided into two groups: a so-called ‘observation group’ and a ‘control group’. Patients in the ‘observation group’ were defined as having received thyroid US early in the diagnostic process, that is, the initial TSH test was followed by a thyroid US within 28 days. The control group consisted of patients who received a TSH test but either no US at all or a US later than 28 days after the TSH test. This definition was based on the guideline of the German College of General Practitioners and Family Physicians (DEGAM).24 The guideline states that in case of a first abnormal TSH test and with no relevant medical history, a second TSH test should be carried out as a control. In case of a second abnormal TSH test, further laboratory tests (fT4, fT3, antibodies) are recommended. A thyroid US is recommended only in case of a palpably enlarged goitre, palpable thyroid nodules or lymph nodes, or in case of hyperthyroidism in the absence of serological markers for thyroiditis or Graves’ disease, because then it might be caused by an autonomous adenoma.22 31 Palpable nodules occur in 0.5%–5.1% of patients and lymph nodes account for 1.6% of US of the neck.32 33 As all these findings are rare, it can be assumed that the vast majority of early US examinations have to be regarded as non-indicated, constituting an unjustified grey screening.
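The allocation rule described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `allocate` and the example dates are hypothetical, and it is an assumption here that a US on the same day as the TSH test, and one on day 28 itself, both count as early.

```python
from datetime import date, timedelta

EARLY_WINDOW = timedelta(days=28)  # window stated in the text


def allocate(tsh_date, us_dates):
    """Return 'observation' if any thyroid US falls within 28 days after
    the initial TSH test, otherwise 'control'."""
    for us_date in us_dates:
        # Assumption: day 0 and day 28 are both inside the window.
        if timedelta(0) <= us_date - tsh_date <= EARLY_WINDOW:
            return "observation"
    return "control"


# Hypothetical patients, for illustration only
examples = {
    "early US": allocate(date(2012, 3, 1), [date(2012, 3, 20)]),
    "late US": allocate(date(2012, 3, 1), [date(2012, 5, 2)]),
    "no US": allocate(date(2012, 3, 1), []),
}
```

Because service dates in the claims data are given to the exact day, a day-level comparison of this kind is possible, whereas diagnoses could only be compared per quarter.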

Both groups were matched for socio-demographic characteristics (age, sex and place of residence by area code), symptom diagnosis (eg, fatigue, see online supplemental appendix A1) and morbidity using diagnosis-related risk classes for insured persons with comparable treatment needs, the so-called InBA grouper. InBA stands for ‘Institut des Bewertungsausschusses’ (English: Institute of the Evaluation Committee), an official organ of the German Federal Ministry of Health. The groupers were developed to measure morbidity in outpatient care; the number of applicable risk classes serves as an indicator of multimorbidity. The matching process was applied to 665 126 patients and yielded 68 862 patients each in the observation and control group. Table 1 shows the mean values of the covariates in the year 2012 for the observation group as well as the control group before and after propensity score matching, respectively. We evaluated matching quality by means of the standardised bias in per cent.34 The matching successfully established similarity in terms of the means of our observable variables across our observation and control groups. It reduced the standardised bias of all covariates substantially; even the largest standardised bias was only 0.85%. All post-matching standardised biases were thus far below the rule of thumb threshold of 5%.35
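As a rough illustration of the balance check, the sketch below computes a standardised bias in per cent for one covariate. It assumes the commonly used definition (difference in group means divided by the square root of the average of the two sample variances); the text does not spell out the formula, and the ages shown are made up.

```python
from statistics import mean, variance


def standardised_bias(treated, control):
    """Standardised bias in per cent: difference in means divided by the
    square root of the mean of the two sample variances (assumed definition)."""
    pooled_sd = ((variance(treated) + variance(control)) / 2) ** 0.5
    return 100 * (mean(treated) - mean(control)) / pooled_sd


# Made-up ages, for illustration only
age_observation = [44, 47, 51, 39, 47]
age_control = [45, 46, 50, 40, 46]
bias = standardised_bias(age_observation, age_control)

# A covariate would count as balanced under the 5% rule of thumb if:
balanced = abs(bias) < 5
```

In practice the same calculation would be repeated for every matching covariate, before and after matching.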


Table 1

Mean values of or percentage of patients with covariates of propensity score matching for socio-demographic characteristics (age, sex), symptom diagnosis (see Appendix A1) and morbidity

In case of hypothyroidism or hyperthyroidism, performing a thyroid US might be appropriate. Therefore, patients in the observation group diagnosed with hypothyroidism or hyperthyroidism (ICD codes E02, E03, E05 and E89) in the same accounting quarter as the initial TSH test, together with their matching partners in the control group, were excluded (n=15 638). As stated above, billing data of patients registered in GP-centred healthcare (Hausarztzentrierte Versorgung) might have been incomplete in our data set; these patients were therefore excluded (n=39 956). The data selection process is depicted in figure 1.
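The sample sizes reported in this section can be cross-checked with simple arithmetic. The short sketch below assumes that both exclusion counts refer to patients across the two groups combined; the variable names are illustrative.

```python
matched_per_group = 68_862     # patients per group after matching
excluded_thyroid_dx = 15_638   # hypo-/hyperthyroidism in uptake quarter, incl. matched partners
excluded_gp_centred = 39_956   # GP-centred care with possibly incomplete billing data

# Assumption: exclusion counts are totals over both groups
final_sample = 2 * matched_per_group - excluded_thyroid_dx - excluded_gp_centred
per_group = final_sample // 2
```

Under that assumption the result is 82 130 patients in total and 41 065 per group, which agrees with the final sample reported for the analysis.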

Figure 1

Flow chart showing the data selection and data preparation process. GP, general practitioner; ICD, International Classification of Diseases; MRI, magnetic resonance imaging; TSH, thyroid stimulating hormone.

Data analysis

To assess whether, and to what extent, cascades of medical tests and procedures were attributable only to the early use of US, data from the observation and control groups were combined and a cluster analysis was performed. Cluster analysis was used to identify homogeneous groups of patients who received a similar number of TSH tests and US. Cluster variables were the number of quarters with at least one TSH test and/or one thyroid US after the uptake quarter. Both variables were dummy coded. Due to the data structure and the explorative study design, TSH test and US frequencies were counted as the number of quarters (3-month billing periods) with the respective examination, although a patient could have had multiple tests in one quarter. In the following, one examination represents one quarter with the respective measure. A hierarchical agglomerative cluster analysis was performed (distance measure: Canberra distance, fusion algorithm: complete linkage) using the statistical software package R (V.3.5.1, R Foundation for Statistical Computing, Vienna, Austria). A dendrogram was used to define a cluster solution.
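To make the clustering step more concrete, here is a minimal pure-Python sketch of agglomerative clustering with Canberra distance and complete linkage. It is not the authors' R code. The patient tuples are hypothetical, the naive algorithm is only suitable for tiny examples, and counting terms with a zero denominator as 0 is an assumption; the R implementation used in the study may handle such terms differently.

```python
from itertools import combinations


def canberra(x, y):
    """Canberra distance; terms with a zero denominator count as 0 (assumption)."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom:
            total += abs(a - b) / denom
    return total


def complete_linkage(points, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    whose farthest members are closest, until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Complete linkage: cluster distance is the largest pairwise distance
            d = max(canberra(points[a], points[b])
                    for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical patients: (quarters with TSH test, quarters with thyroid US)
patients = [(1, 0), (2, 0), (5, 0), (4, 0), (3, 2), (3, 1), (11, 4), (10, 4)]
groups = complete_linkage(patients, n_clusters=4)
```

In the study the number of clusters was read off a dendrogram rather than fixed in advance; passing `n_clusters=4` here simply mirrors the four-cluster solution that was eventually chosen.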

Clusters were then compared in terms of socio-demographic characteristics, morbidity, healthcare usage and costs. As socio-demographic characteristics we used age, sex, place of residence and area-level deprivation. Place of residence was coded using district types according to settlement structure as defined by the Federal Institute for Research on Building, Urban Affairs and Spatial Development (Bundesinstitut für Bau-, Stadt- und Raumforschung, BBSR).36 Four types of residence were differentiated: sparsely populated regions (<100 inhabitants per km²), rural areas (100–150 inhabitants per km²), urban areas (>150 inhabitants per km²) and towns (≥100 000 inhabitants). The BIMD was used to distinguish regions regarding socio-demographic, socioeconomic and environmental characteristics.37 Morbidity was analysed using the Charlson Comorbidity Index (CCI) indicating overall morbidity.38 Thyroid-specific morbidity included functional and structural diseases (see online supplemental appendix A2 for diagnoses). Diagnoses with a prevalence below 0.5% are not reported in detail. A patient was coded as having a disease when a diagnosis was coded in at least two quarters (the so-called M2Q criterion).39 Healthcare usage was measured using the number of thyroid-specific diagnostic tests performed during the study period (see online supplemental appendix A3), medical specialties involved in the first year of the study period, medical specialties conducting thyroid US in the uptake quarter and the number of all visits to office-based physicians. Costs were calculated as thyroid-specific and overall ambulatory costs. Thyroid-specific costs referred to the thyroid-specific diagnostic tests listed in online supplemental appendix A3. To estimate whether follow-up diagnostic procedures can be attributed to overtesting and overtreatment, we assessed whether, at the beginning of the observation period, there were indicators that those tests were needed. To this end, diagnoses coded during the first 6 months of the study period that would justify ordering TSH tests and/or US were defined. We distinguished between diagnoses justifying one, two or regular TSH tests and thyroid US. One example is type 1 diabetes, which justifies regular TSH tests to exclude the concomitant occurrence of various endocrinological conditions. Malignant or benign neoplasms of the thyroid gland were also included as plausible diagnoses. The full list of diagnoses is given in online supplemental appendix A4.
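The M2Q criterion mentioned above lends itself to a very small sketch. The function name, patient identifiers and quarters below are hypothetical, and it is an assumption that any two distinct quarters suffice (they need not be consecutive).

```python
def meets_m2q(diagnosis_quarters):
    """M2Q criterion: the diagnosis was coded in at least two different quarters."""
    return len(set(diagnosis_quarters)) >= 2


# Hypothetical records: patient -> (year, quarter) entries in which one
# particular thyroid diagnosis was coded
coded_quarters = {
    "patient_a": [(2013, 1), (2013, 1)],  # coded twice, but in the same quarter
    "patient_b": [(2013, 1), (2014, 3)],  # coded in two distinct quarters
    "patient_c": [],                      # never coded
}
has_disease = {pid: meets_m2q(q) for pid, q in coded_quarters.items()}
```

Working on quarters rather than exact dates matches the claims data, in which diagnoses are only available per quarter.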

For all categorical variables, χ² tests were used to analyse differences. If analyses were performed with two groups, we used Phi as effect size; otherwise, Cramer’s V was used. For all continuous variables, analysis of variance was used to evaluate differences between clusters. As Levene’s test of variance homogeneity was significant, Welch’s analysis of variance was used. Estimated omega squared was used as effect size to determine the magnitude of statistically significant differences. P values were adjusted according to the Bonferroni method separately for the number of statistical tests in each section of analyses (socio-demographic characteristics, morbidity, healthcare usage and costs).
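The effect sizes named here can be recomputed from the test statistics reported in the results. The sketch below uses standard textbook formulas, which are an assumption since the text does not state them; the table dimensions (2 × 4 for sex by cluster), the alpha level of 0.05 and the count of four tests in the socio-demographic section are likewise inferred rather than stated.

```python
from math import sqrt


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V; reduces to Phi for a 2x2 table."""
    return sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))


def omega_squared(f_value, df_between, n):
    """Estimated omega squared from an F statistic (assumed estimator)."""
    effect = df_between * (f_value - 1)
    return effect / (effect + n)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


N = 82_130  # final sample size

# Sex by cluster: chi-squared (3) = 811.321
v_sex = cramers_v(811.321, N, n_rows=2, n_cols=4)

# Age by cluster: Welch's F (3, ...) = 151.254
w_age = omega_squared(151.254, df_between=3, n=N)

# Assumed: alpha = 0.05 and four tests in the socio-demographic section
threshold = bonferroni_threshold(0.05, n_tests=4)
```

With these inputs, `v_sex` rounds to 0.099, `w_age` to 0.005 and `threshold` is 0.0125, in line with the values reported for the socio-demographic comparison.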

Patient and public involvement

The study was planned and conducted by an interdisciplinary scientific team. The aim is to describe the real-world situation in healthcare. The findings from this explorative study are to be used to plan further studies that deal with the decision-making processes on the physician and patient side regarding diagnostics and therapy. Patients and citizens were not involved in the planning of the study reported here.


Results

Characteristics of observation and control group

The final sample included 82 130 patients, 41 065 in each group. Around 60% of them were women, mean age was 46 years and more than one-third had at least one comorbidity. Regarding place of residence, about a quarter each lived in sparsely populated regions, rural areas, urban areas and towns. Socio-demographic characteristics are depicted in table 2.

Table 2

Socio-demographic characteristics of the sample; separately for control versus observation group (N=41 065 each)

Cluster analysis and comparison of subgroups

Description of clusters

A four-cluster solution showing reasonable quality criteria was found. The four clusters varied in the number of TSH tests and thyroid US conducted within the study period. Almost a quarter of patients received on average approximately two TSH tests but no US (cluster 1: 22.8% of patients, M=1.6 TSH tests). The second cluster comprises 16.6% of the patients, who received almost five TSH tests but no US (M=4.7 TSH tests). Just over half of the patients were grouped in the third cluster, receiving approximately three TSH tests and two thyroid US (cluster 3: 54.4% of patients, M=3.3 TSH tests, M=1.8 US). Only a few patients were assigned to the fourth cluster; these patients were examined frequently, receiving almost 11 TSH tests and 4 US (cluster 4: 6.2% of patients, M=10.9 TSH tests, M=3.9 US). In the clusters containing patients who were examined most frequently, a US usually took place at the beginning: more than 80% of patients in cluster 3 and more than 75% in cluster 4 were part of the observation group receiving thyroid US early. Figure 2 illustrates the four clusters and the main differences between them.

Figure 2

Main differences and characteristics of the four clusters. TSH, thyroid stimulating hormone; US, ultrasound.

Socio-demographic characteristics

Regarding age of patients, the clusters differed significantly from each other (mean age: cluster 1=44.9 years; cluster 2=48.8 years; cluster 3=46.4 years; and cluster 4=47.0 years; Welch’s F (3, 19 212.456) = 151.254, p<0.001). The effect size indicated that the difference is of minor importance (estimated ω²=0.005). The proportion of women was highest in cluster 4 (proportion of women: cluster 1=54%; cluster 2=61%; cluster 3=61%; and cluster 4=76%; χ² (3) = 811.321, p<0.001; Cramer’s V=0.099). A slightly higher number of patients of cluster 4 lived in larger cities and in areas with a higher index of deprivation (proportion of people living in larger cities: cluster 1=26%; cluster 2=28%; cluster 3=27%; and cluster 4=31%; χ² (9) = 69.254, p<0.001; Cramer’s V=0.017; proportion of people living in areas with the highest deprivation index: cluster 1=18%; cluster 2=18%; cluster 3=18%; and cluster 4=21%; χ² (12) = 65.168, p<0.01; Cramer’s V=0.016). After Bonferroni adjustment, all tests remained statistically significant (adjusted significance level=0.0125). However, all effect sizes were small, indicating that the differences between groups regarding socio-demographic variables were of minor importance.


Morbidity

In clusters 2 and 4, more patients had a comorbidity at the beginning of the observation period (proportion of patients with a CCI=0 in the uptake quarter: cluster 1=65.8%; cluster 2=54.2%; cluster 3=61.5%; and cluster 4=54.8%; χ² (9) = 625.474, p<0.01; Cramer’s V=0.08). During the observation period, a higher increase in morbidity was also observed for clusters 2 and 4 (differences in CCI between the beginning and end of the study period: cluster 1=0.06; cluster 2=0.28; cluster 3=0.13; and cluster 4=0.33; Welch’s F (3, 18 794.027) = 205 370, p<0.001; ω²=0.002). Thyroid-specific morbidity was significantly higher in cluster 4 (all p<0.001). However, only small effect sizes were found for malignant and benign neoplasms (ICD codes C73 and D34), and abnormal results of thyroid function studies (ICD code R94.6) (all φ<0.12). After Bonferroni adjustment, all tests remained statistically significant (adjusted significance level=0.0045).

Healthcare usage

Patients in cluster 4 received significantly more thyroid-specific diagnostic blood tests and imaging procedures (all p<0.001). However, effect sizes indicate that the differences were more pronounced for scintigraphy, fT3 and fT4 tests, but less so for CT, MRI and puncture. During the first year, patients in cluster 4 were more likely to be in the care of specialists (radiologists: cluster 1=22%; cluster 2=29.8%; cluster 3=27.8%; and cluster 4=40.1%; χ² (3) = 723.699, p<0.001; Cramer’s V=0.094; specialists in nuclear medicine: cluster 1=2.7%; cluster 2=4.2%; cluster 3=15.5%; and cluster 4=35.6%; χ² (3) = 5576.441, p<0.001; Cramer’s V=0.261). Compared with patients in cluster 3, patients in cluster 4 were more likely to receive the first, presumably non-indicated thyroid US from specialists and less likely to receive it from GPs and internists working as GPs (radiologists: cluster 3=2.5% and cluster 4=5.4%; χ² (3) = 141.951, p<0.001; φ=0.053; specialists in nuclear medicine: cluster 3=9.0% and cluster 4=16.8%; χ² (3) = 320.016, p<0.001; φ=0.08; GPs: cluster 3=38.9% and cluster 4=31.1%; χ² (3) = 144.029, p<0.001; φ=0.054; internists as GPs: cluster 3=31.6% and cluster 4=26.7%; χ² (3) = 50.088, p<0.001; φ=0.032). Patients in clusters 2 and 4 visited a doctor most frequently (mean number of visits to office-based physicians: cluster 1=35.54; cluster 2=57.94; cluster 3=47.13; and cluster 4=80.59; Welch’s F (3, 18 773.865) = 4287.105, p<0.001; ω²=0.135). After Bonferroni adjustment, all tests remained statistically significant (adjusted significance level=0.0026).


Costs

Patients in cluster 4 caused significantly higher overall ambulatory costs than patients of all other clusters (mean ambulatory costs per quarter: cluster 1=€2188.98; cluster 2=€3887.33; cluster 3=€2917.57; and cluster 4=€5156.20; Welch’s F (3, 17 660.486) = 705.497, p<0.001; ω²=0.025) and significantly higher thyroid-specific costs (mean thyroid-specific costs per quarter: cluster 1=€6.47; cluster 2=€19.74; cluster 3=€36.70; and cluster 4=€125.19; Welch’s F (3, 18 024.487) = 14 881.88, p<0.001; ω²=0.352). After Bonferroni adjustment, all tests remained statistically significant (adjusted significance level=0.025).

All results are depicted in table 3.

Table 3

Comparison of clusters regarding socio-demographic variables, morbidity, healthcare usage and costs

Estimation of overtesting and overtreatment

In cluster 1, one or two TSH tests were administered, and 60% of patients had a diagnosis at the beginning of the study period that justified running those tests according to our definition. In cluster 2, five TSH tests were performed on average, with only 12% of patients having a diagnosis at the beginning of the observation period justifying regular TSH tests. In cluster 3, an average of three to four TSH tests and one to two thyroid US were performed, with only about one-third of patients having diagnoses justifying these tests (proportion of patients having a diagnosis at the beginning justifying regular TSH tests: 33.6% and one or two US: 36.5%). In cluster 4, an average of 11 TSH tests and 4 US were administered. Diagnoses justifying these tests were found in 62.8% of patients (for TSH tests) and 54% of patients (for US). In summary, at the beginning of the study period only a minority of patients in clusters 2 and 3 had plausible diagnoses for regular TSH testing, and only about one-third of patients in cluster 3 had conditions that may have justified thyroid US. Results are depicted in table 4.

Table 4

Proportion of persons with at least one diagnosis that justifies the one-time or regular performance of diagnostic tests


Discussion

Based on real-world data, we found that a substantial number of patients might have been affected by unnecessary tests and treatment in the area of (suspected) thyroid diseases. As only data from patients without prior history of thyroid disease and/or testing were included, this number corresponds to incident cases. For many patients, a plausible diagnosis that would justify the tests could not be found. More than 75% of those with frequent follow-up procedures (clusters 3 and 4) were part of the observation group receiving an early and presumably non-indicated thyroid US. Patients in cluster 4, who received the highest number of follow-up examinations, were more often under the care of specialists in nuclear medicine or radiologists from the beginning and had also received their first US there. Thyroid-specific morbidity was significantly higher in cluster 4 and the costs caused by those patients cannot be explained by increased thyroid-specific costs alone. In cluster 4, there were more women than in the other clusters. At the beginning, more patients in clusters 2 and 4 had a concomitant disease. Morbidity also increased more strongly in these two groups during the course of the study and those patients saw a physician most frequently.

The role of overdiagnosis and overtreatment in thyroid cancer incidence has already been described.5 The inappropriate use of US and thyroid laboratory tests has been investigated and discussed as an influencing factor.10 40–42 The present results also show an association between the ordering of US early in the diagnostic work-up and an increased frequency of subsequent tests and treatments. Physicians stated that US is also used for clinically unsupported reasons (eg, patient request or abnormal results of thyroid function test), and it has been shown that specialists are more likely to order US after patient requests.43 Our results suggest that follow-up examinations are more frequent when the first US was performed by a specialist, although this only affected a small group. Patients who did not end up in a loop of frequent testing were more likely to have their first US administered by a GP. People in Germany have the right to freely choose their physician (§ 76, SGB V, German Social Security Code). They can consult any physician, whether GP or specialist, who is licensed to provide care within statutory health insurance. Thus, a consultation with a specialist is initiated either by the patient or by the GP through referral. Referrals from the GP to a specialist for a US examination do not always seem to be justified either,44 and might well be induced by patient requests, with GPs not wanting to deny them for fear of disappointing their patients. A German guideline for protection against the overuse and underuse of healthcare states that in principle specialists tend to overuse, whereas GPs are prone to underuse.45 Based on our results, one can assume on the one hand that specialists themselves trigger more diagnostic measures causing overdiagnosis and overtreatment. On the other hand, patients having their first US at a specialist presumably differ from patients of GPs with regard to comorbidities and risk for thyroid disorders. This means that they might have a higher probability of abnormalities and thus receive more follow-up examinations with good reason.

Patients with the highest number of examinations (cluster 4) showed the highest morbidity. The proportion of patients with hypothyroidism and non-toxic goitre was particularly high. There are several possible reasons why those patients could have been examined more frequently, such as suspected malignancy with watchful waiting (instead of treatment) or diagnostic clarification in patients with unexplained symptoms. Patients’ demands and expectations could also play a role. The proportion of women in cluster 4 was considerably higher than in the other clusters. Women are more likely to consult a doctor46–48 and the prevalence of thyroid nodules in women is significantly higher,49 with the latter possibly being the result of the former for many of them. Thus, the high proportion of women alone can probably explain part of the frequent follow-up examinations. The higher morbidity index at the beginning of the study suggests that patients in cluster 4 are more likely to belong to risk groups. This could be another reason for more frequent, although not always well justified, follow-up examinations. The comparison with the patients in clusters 2 and 3 is particularly interesting: cluster 2 patients also had higher morbidity rates and frequent visits to the doctor, but less frequent thyroid-specific diagnostic tests. Presumably these patients had fewer risk factors, diseases or symptoms indicating an association with the thyroid gland. Our analyses of the justification for TSH testing and US (table 4) suggest that while quite a number of patients in cluster 4 received testing with good reason, more overuse occurred in clusters 2 and 3.

Receiving thyroid US early (and probably not indicated) was associated with more thyroid-specific diagnostic tests, higher expenditures and more thyroid-specific diagnostic labels (morbidity), especially non-toxic goitre. This clearly corresponds to the increase of measures after a trigger event that Mold and Stein described back in 1986 as cascade effects in the clinical care of patients.25 In line with our results, another recent study has shown that the increasing morbidity in terms of diagnostic labels is partially associated with the density of endocrinologists and the use of US.50 We assume that part of the increased morbidity can be attributed to overdiagnosis, as the majority of patients diagnosed with thyroid nodules would never have experienced any harm related to this finding and the corresponding diagnosis. Thus, they do not benefit from it. However, negative consequences of those cascades have been described in numerous studies: invasive treatments associated with complications,51–53 psychological stress for patients54 and rising costs for the healthcare system.11 55 This effect was shown before, when the prevalence of clinically non-palpable thyroid nodules and small low-risk tumours increased after the implementation of US screening.3

The results showed that the early use of US was associated with higher costs. Beyond patient needs and protection, this is also a matter of avoidable costs for the limited resources of our healthcare system. A study from the USA estimates annual costs of overtreatment or low-value care at US$75.7 billion to US$101.2 billion.55 In the USA, from 1996 to 2006 the total number of thyroidectomies increased by 39% and estimated economic costs of inpatient thyroid surgery tripled.11 Since it is estimated that up to 90% of thyroid cancer diagnoses are attributable to overdiagnosis in South Korea and Italy and up to 60% in Nordic countries,5 6 it is not hard to see how many financial resources are wasted. However, the analysis of Hafner et al showed that only minor increases in costs are associated with follow-up tests arising from early thyroid US.28 Further robust and reliable data are therefore needed for Germany to evaluate the additional costs of overdiagnosis of thyroid diseases.

In line with other studies, our results support the view that inappropriate use of diagnostic tests is widespread and is likely to trigger subsequent cascades of further tests and treatments.40 41 56 Routine data make it possible to accurately estimate the frequency of diagnostic and therapeutic procedures and relate them to diagnoses, costs and socio-demographic characteristics of caregivers and patients. However, clinical decision-making is far more complex than can be described by these variables alone. Whether or not to administer a test is also influenced by the fears and expectations of the patient, his or her personality, family history, the experience and preferences of the physician and, of course, the clinical results of previous tests.57 None of this is visible in routine data and therefore questions remain. To develop a full picture, additional studies will be needed that focus on the influencing factors related to physicians, patients and the healthcare system. This requires the involvement of all groups of people involved in care. Standardised outcome sets measured and reported by all research in this clinical area may support this.58

Limitations


Data from 2012 to 2017 have been analysed. One cannot exclude that clinical management has changed since then. Also, results and conclusions are derived from Germany and their generalisability to other countries may be limited. Claims data have the disadvantage that they do not contain the results of clinical tests, the patients’ medical history or symptoms. Our data did not include the reasons for encounter, as could be coded, for example, using the International Classification of Primary Care.59 Information is therefore missing on whether the reasons for encounter encompassed presenting symptoms justifying TSH testing (such as hoarseness or local pain), family history, or requests for prescriptions, referrals or investigations. Information is also missing about patients’ ideas, concerns and expectations when consulting their physician. Our data set did not include information on inpatient visits, so that key outcomes such as thyroid surgery could not be considered in our analysis. Also, information on medication was not available.

Conclusion


This study supports the hypothesis that early and unnecessary testing occurs in thyroid care. Unnecessary diagnostic tests can trigger further diagnostic cascades. Once a diagnostic cascade has started, it seems hard to stop. To avoid starting such cascades, initiatives like ‘Choosing Wisely’ campaigning against thyroid screening by US should be supported. Recommendations of medical societies against screening for thyroid nodules should be enforced and should explicitly address all forms of haphazard screening. To stop cascades that are already under way, clear recommendations for the diagnostic work-up of thyroid nodules are required. There is no German guideline for the management of thyroid nodules yet. Therefore, evidence-based guidelines are urgently needed to guide when to apply thyroid US and when not.

Data availability statement

Data are available upon reasonable request. The data that support the findings of this study are available from the Bavarian Association of Statutory Health Insurance Physicians but restrictions apply to the availability of these data, which were used under licence for the current study and are not publicly available. Data may be obtained from the authors upon reasonable request and with permission of the Bavarian Association of Statutory Health Insurance Physicians.

Ethics statements

Patient consent for publication

Ethics approval

Not applicable.

Acknowledgements


We would like to thank the Kassenärztliche Vereinigung Bayerns for providing the data and especially Roman Gerlach and Martin Tauscher for their expertise and support. We would also like to thank the student assistants Lukas Worm, Lena Pachsteffl and Carolin Nürnberger, who contributed to the data preparation, and Fenno Brunken for project management.



  • Contributors SH, TK, AS, HT and ED initiated and designed the study. SH, VB, JT, LW, AS, HT, ML, ED and TK carried out data analysis. SH acted as guarantor. All authors contributed to interpreting the results. SH was the primary contributor in writing the manuscript, with major contributions from TK, JT, HT and LW. All authors contributed to writing of the manuscript, commented on the draft and approved the final version of the manuscript.

  • Funding This research was conducted within the research network PRO PRICARE, Preventing Overdiagnosis in Primary Care and supported by the German Federal Ministry of Education and Research (grant 01GY1605). We acknowledge financial support by Deutsche Forschungsgemeinschaft and Friedrich-Alexander-Universität Erlangen-Nürnberg within the funding programme "Open Access Publication Funding".

  • Competing interests None declared.

  • Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.