STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration

Jérémie F Cohen; Daniël A Korevaar; Douglas G Altman; David E Bruns; Constantine A Gatsonis; Lotty Hooft; Les Irwig; Deborah Levine; Johannes B Reitsma; Henrica C W de Vet; Patrick M M Bossuyt

doi:10.1136/bmjopen-2016-012799



Jérémie F Cohen1,2,
Daniël A Korevaar1,
Douglas G Altman3,
David E Bruns4,
Constantine A Gatsonis5,
Lotty Hooft6,
Les Irwig7,
Deborah Levine8,9,
Johannes B Reitsma10,
Henrica C W de Vet11,
Patrick M M Bossuyt1

¹Department of Clinical Epidemiology, Biostatistics and Bioinformatics, Academic Medical Centre, University of Amsterdam, Amsterdam, The Netherlands
²Department of Pediatrics, INSERM UMR 1153, Necker Hospital, AP-HP, Paris Descartes University, Paris, France
³Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, Centre for Statistics in Medicine, University of Oxford, Oxford, UK
⁴Department of Pathology, University of Virginia School of Medicine, Charlottesville, Virginia, USA
⁵Department of Biostatistics, Brown University School of Public Health, Providence, Rhode Island, USA
⁶Cochrane Netherlands, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, University of Utrecht, Utrecht, The Netherlands
⁷Screening and Diagnostic Test Evaluation Program, School of Public Health, University of Sydney, Sydney, New South Wales, Australia
⁸Department of Radiology, Beth Israel Deaconess Medical Center, Boston, Massachusetts, USA
⁹Radiology Editorial Office, Boston, Massachusetts, USA
¹⁰Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, University of Utrecht, Utrecht, The Netherlands
¹¹Department of Epidemiology and Biostatistics, EMGO Institute for Health and Care Research, VU University Medical Center, Amsterdam, The Netherlands

Correspondence to Professor Patrick M M Bossuyt; p.m.bossuyt{at}amc.uva.nl

Abstract

Diagnostic accuracy studies are, like other clinical studies, at risk of bias due to shortcomings in design and conduct, and the results of a diagnostic accuracy study may not apply to other patient groups and settings. Readers of study reports need to be informed about study design and conduct, in sufficient detail to judge the trustworthiness and applicability of the study findings. The STARD statement (Standards for Reporting of Diagnostic Accuracy Studies) was developed to improve the completeness and transparency of reports of diagnostic accuracy studies. STARD contains a list of essential items that can be used as a checklist, by authors, reviewers and other readers, to ensure that a report of a diagnostic accuracy study contains the necessary information. STARD was recently updated. All updated STARD materials, including the checklist, are available at http://www.equator-network.org/reporting-guidelines/stard. Here, we present the STARD 2015 explanation and elaboration document. Through commented examples of appropriate reporting, we clarify the rationale for each of the 30 items on the STARD 2015 checklist, and describe what is expected from authors in developing sufficiently informative study reports.

Reporting quality
Sensitivity and specificity
Diagnostic accuracy
Research waste
Peer review
Medical publishing

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/


Introduction

Diagnostic accuracy studies are at risk of bias, not unlike other clinical studies. Major sources of bias originate in methodological deficiencies, in participant recruitment, data collection, executing or interpreting the test or in data analysis.1 ,2 As a result, the estimates of sensitivity and specificity of the test that is compared against the reference standard can be flawed, deviating systematically from what would be obtained in ideal circumstances (see key terminology in table 1). Biased results can lead to improper recommendations about testing, negatively affecting patient outcomes or healthcare policy.

Table 1: Key STARD terminology

Diagnostic accuracy is not a fixed property of a test. A test's accuracy in identifying patients with the target condition typically varies between settings and patient groups, and depends on prior testing.2 These sources of variation in diagnostic accuracy are relevant for those who want to apply the findings of a diagnostic accuracy study to answer a specific question about adopting the test in their own environment. Risk of bias and concerns about applicability are the two key components of QUADAS-2, a quality assessment tool for diagnostic accuracy studies.3

Readers can only judge the risk of bias and applicability of a diagnostic accuracy study if they find the necessary information to do so in the study report. The published study report has to contain all the essential information to judge the trustworthiness and relevance of the study findings, in addition to a complete and informative disclosure of the study results.

Unfortunately, several surveys have shown that diagnostic accuracy study reports often fail to transparently describe core elements.4–6 Essential information about included patients, study design and the actual results is frequently missing, and recommendations about the test under evaluation are often overly generous and optimistic.

To facilitate more complete and transparent reporting of diagnostic accuracy studies, the STARD statement was developed: Standards for Reporting of Diagnostic Accuracy Studies.7 Inspired by the Consolidated Standards for the Reporting of Trials or CONSORT statement for reporting randomised controlled trials,8 ,9 STARD contains a checklist of items that should be reported in any diagnostic accuracy study.

The STARD statement was initially released in 2003 and updated in 2015.10 The objectives of this update were to include recent evidence about sources of bias and variability and other issues in complete reporting, and to make the STARD list easier to use. The updated STARD 2015 list now has 30 essential items (table 2).

Table 2: The STARD 2015 list10

Below, we present an explanation and elaboration of STARD 2015. This is an extensive revision and update of a similar document that was prepared for the STARD 2003 version.11 Through commented examples of appropriate reporting, we clarify the rationale for each item and describe what is expected from authors.

We are confident that these descriptions can further assist scientists in writing fully informative study reports, and help peer reviewers, editors and other readers in verifying that submitted and published manuscripts of diagnostic accuracy studies are sufficiently detailed.

STARD 2015 items: explanation and elaboration

Title or abstract

Item 1. Identification as a study of diagnostic accuracy using at least one measure of accuracy (such as sensitivity, specificity, predictive values or AUC)

Example. ‘Main outcome measures: Sensitivity and specificity of CT colonography in detecting individuals with advanced neoplasia (i.e., advanced adenoma or colorectal cancer) 6 mm or larger’.12

Explanation. When searching for relevant biomedical studies on a certain topic, electronic databases such as MEDLINE and Embase are indispensable. To facilitate retrieval of their article, authors can explicitly identify it as a report of a diagnostic accuracy study. This can be done by using terms in the title and/or abstract that refer to measures of diagnostic accuracy, such as ‘sensitivity’, ‘specificity’, ‘positive predictive value’, ‘negative predictive value’, ‘area under the ROC curve (AUC)’ or ‘likelihood ratio’.
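For readers less familiar with the measures named above, the following is a minimal sketch of how they are commonly computed from a 2×2 table that cross-classifies index test results against the reference standard. The class name `TwoByTwo`, the function name `accuracy_measures` and the counts in the usage lines are hypothetical and are not taken from any study cited here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from cross-classifying index test against reference standard."""
    tp: int  # index test positive, target condition present
    fp: int  # index test positive, target condition absent
    fn: int  # index test negative, target condition present
    tn: int  # index test negative, target condition absent


def accuracy_measures(t: TwoByTwo) -> dict:
    """Return common accuracy measures; assumes no denominator is zero."""
    sensitivity = t.tp / (t.tp + t.fn)
    specificity = t.tn / (t.tn + t.fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Predictive values depend on the prevalence in the study group.
        "ppv": t.tp / (t.tp + t.fp),
        "npv": t.tn / (t.tn + t.fn),
        # Likelihood ratios are undefined when specificity is 1 (LR+)
        # or 0 (LR-); this sketch does not handle those edge cases.
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
    }


# Hypothetical counts, for illustration only.
measures = accuracy_measures(TwoByTwo(tp=90, fp=30, fn=10, tn=170))
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

The area under the ROC curve is not covered by this sketch, because it requires results at more than one positivity cut-off.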

In 1991, MEDLINE introduced a specific keyword (MeSH heading) for indexing diagnostic studies: ‘Sensitivity and Specificity.’ Unfortunately, the sensitivity of using this particular MeSH heading to identify diagnostic accuracy studies can be as low as 51%.13 As of May 2015, Embase's thesaurus (Emtree) has 38 check tags for study types; ‘diagnostic test accuracy study’ is one of them, but was only introduced in 2011.

In the example, the authors mentioned the terms ‘sensitivity’ and ‘specificity’ in the abstract. The article will now be retrieved when using one of these terms in a search strategy, and will be easily identifiable as one describing a diagnostic accuracy study.

Abstract

Item 2. Structured summary of study design, methods, results and conclusions (for specific guidance, see STARD for Abstracts)

Example. See STARD for Abstracts (manuscript in preparation; checklist will be available at http://www.equator-network.org/reporting-guidelines/stard/).

Explanation. Readers use abstracts to decide whether they should retrieve the full study report and invest time in reading it. In cases where access to the full study report cannot be obtained or where time is limited, it is conceivable that clinical decisions are based on the information provided in abstracts only.

In two recent literature surveys, abstracts of diagnostic accuracy studies published in high-impact journals or presented at an international scientific conference were found to be insufficiently informative, because key information about the research question, study methods, study results and the implications of findings was frequently missing.14,15

Informative abstracts help readers to quickly appraise critical elements of study validity (risk of bias) and applicability of study findings to their clinical setting (generalisability). Structured abstracts, with separate headings for objectives, methods, results and interpretation, allow readers to find essential information more easily.16

Building on STARD 2015, the newly developed STARD for Abstracts provides a list of essential items that should be included in journal and conference abstracts of diagnostic accuracy studies (list finalised; manuscript under development).

Introduction

Item 3. Scientific and clinical background, including the intended use and clinical role of the index test

Example. ‘The need for improved efficiency in the use of emergency department radiography has long been documented. This need for selectivity has been identified clearly for patients with acute ankle injury, who generally are all referred for radiography, despite a yield for fracture of less than 15%. The referral patterns and yield of radiography for patients with knee injuries have been less well described but may be more inefficient than for patients with ankle injuries. […] The sheer volume of low-cost tests such as plain radiography may contribute as much to rising health care costs as do high-technology, low-volume procedures. […] If validated in subsequent studies, a decision rule for knee-injury patients could lead to a large reduction in the use of knee radiography and significant health care savings without compromising patient care’.17

Explanation. In the introduction of scientific study reports, authors should describe the rationale for their study. In doing so, they can refer to previous work on the topic, remaining uncertainty and the clinical implications of this knowledge gap. To help readers in evaluating the implications of the study, authors can clarify the intended use and the clinical role of the test under evaluation, which is referred to as the index test.

The intended use of a test can be diagnosis, screening, staging, monitoring, surveillance, prognosis, treatment selection or other purposes.18 The clinical role of the test under evaluation refers to its anticipated position relative to other tests in the clinical pathway.19 A triage test, for example, will be used before an existing test because it is less costly or burdensome, but often less accurate as well. An add-on test will be used after existing tests, to improve the accuracy of the total test strategy by identifying false positives or false negatives of the initial test. In other cases, a new test may be used to replace an existing test.

Defining the intended use and clinical role of the test will guide the design of the study and the targeted level of sensitivity and specificity; from these definitions follow the eligibility criteria, how and where to identify eligible participants, how to perform tests and how to interpret test results.19

Specifying the clinical role is helpful in assessing the relative importance of potential errors (false positives and false negatives) made by the index test. A triage test to rule out disease, for example, will need very high sensitivity, whereas one that mainly aims to rule in disease will need very high specificity.

In the example, the intended use is diagnosis of knee fractures in patients with acute knee injuries, and the potential clinical role is triage test; radiography, the existing test, would only be performed in those with a positive outcome of the newly developed decision rule. The authors outline the current scientific and clinical background of the health problem studied, and their reason for aiming to develop a triage test: this would reduce the number of radiographs and, consequently, healthcare costs.

Item 4. Study objectives and hypotheses

Example (1). ‘The objective of this study was to evaluate the sensitivity and specificity of 3 different diagnostic strategies: a single rapid antigen test, a rapid antigen test with a follow-up rapid antigen test if negative (rapid-rapid diagnostic strategy), and a rapid antigen test with follow-up culture if negative (rapid-culture)—the AAP diagnostic strategy—all compared with a 2-plate culture gold standard. In addition, […] we also compared the ability of these strategies to achieve an absolute diagnostic test sensitivity of >95%’.20

Example (2). ‘Our 2 main hypotheses were that rapid antigen detection tests performed in physician office laboratories are more sensitive than blood agar plate cultures performed and interpreted in physician office laboratories, when each test is compared with a simultaneous blood agar plate culture processed and interpreted in a hospital laboratory, and rapid antigen detection test sensitivity is subject to spectrum bias’.21

Explanation. Clinical studies may have a general aim (a long-term goal, such as ‘to improve the staging of oesophageal cancer’), specific objectives (well-defined goals for this particular study) and testable hypotheses (statements that can be falsified by the study results).

In diagnostic accuracy studies, statistical hypotheses are typically defined in terms of acceptability criteria for single tests (minimum levels of sensitivity, specificity or other measures). In those cases, hypotheses generally include a quantitative expression of the expected value of the diagnostic parameter. In other cases, statistical hypotheses are defined in terms of equality or non-inferiority in accuracy when comparing two or more index tests.

A priori specification of the study hypotheses limits the chances of post hoc data-dredging with spurious findings, premature conclusions about the performance of tests or subjective judgement about the accuracy of the test. Objectives and hypotheses also guide sample size calculations. An evaluation of 126 reports of diagnostic test accuracy studies published in high-impact journals in 2010 revealed that 88% did not state a clear hypothesis.22

In the first example, the authors' objective was to evaluate the accuracy of three diagnostic strategies; their specific hypothesis was that the sensitivity of any of these would exceed the prespecified value of 95%. In the second example, the authors explicitly describe the hypotheses they want to explore in their study. The first hypothesis is about the comparative sensitivity of two index tests (rapid antigen detection test vs culture performed in physician office laboratories); the second is about variability of rapid test performance according to patient characteristics (spectrum bias).
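To illustrate what a hypothesis expressed as an acceptability criterion can look like in practice, the sketch below tests whether an observed sensitivity exceeds a prespecified minimum, using a one-sided exact binomial test. The choice of this particular test, the function names and the counts are assumptions made for illustration; the cited studies may have used other methods. Only the 95% minimum is borrowed from the first example.

```python
from math import comb


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def sensitivity_exceeds(tp: int, fn: int, minimum: float, alpha: float = 0.05):
    """One-sided exact test of H0: sensitivity <= minimum.

    tp and fn are the true positives and false negatives among participants
    with the target condition. Returns (p_value, reject_h0).
    """
    n = tp + fn
    p_value = binom_upper_tail(tp, n, minimum)
    return p_value, p_value < alpha


# Hypothetical counts: 198 of 200 participants with the target condition
# had a positive index test result.
p_value, reject = sensitivity_exceeds(tp=198, fn=2, minimum=0.95)
print(f"p = {p_value:.4f}, minimum exceeded: {reject}")
```

Stating the minimum and the significance level before data collection is what makes such a hypothesis prespecified; the same quantities also feed into the sample size calculation.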

Methods

Item 5. Whether data collection was planned before the index test and reference standard were performed (prospective study) or after (retrospective study)

Example. ‘We reviewed our database of patients who underwent needle localization and surgical excision with digital breast tomosynthesis guidance from April 2011 through January 2013. […] The patients’ medical records and images of the 36 identified lesions were then reviewed retrospectively by an author with more than 5 years of breast imaging experience after a breast imaging fellowship’.23

Explanation. There is great variability in the way the terms ‘prospective’ and ‘retrospective’ are defined and used in the literature. We believe it is therefore necessary to describe clearly whether data collection was planned before the index test and reference standard were performed, or afterwards. If authors define the study question before index test and reference standards are performed, they can take appropriate actions for optimising procedures according to the study protocol and for dedicated data collection.24

Sometimes, the idea for a study originates when patients have already undergone the index test and the reference standard. If so, data collection relies on reviewing patient charts or extracting data from registries. Though such retrospective studies can sometimes reflect routine clinical practice better than prospective studies, they may fail to identify all eligible patients, and often result in data of lower quality, with more missing data points.24 A reason for this could be, for example, that in daily clinical practice, not all patients undergoing the index test may proceed to have the reference standard.

In the example, the data were clearly collected retrospectively: participants were identified through database screening and clinical data were abstracted from patients' medical records, although the images were reinterpreted.

Item 6. Eligibility criteria

Example (1). ‘Patients eligible for inclusion were consecutive adults (≥18 years) with suspected pulmonary embolism, based on the presence of at least one of the following symptoms: unexplained (sudden) dyspnoea, deterioration of existing dyspnoea, pain on inspiration, or unexplained cough. We excluded patients if they received anticoagulant treatment (vitamin K antagonists or heparin) at presentation, they were pregnant, follow-up was not possible, or they were unwilling or unable to provide written informed consent’.25

Example (2). ‘Eligible cases had symptoms of diarrhoea and both a positive result for toxin by enzyme immunoassay and a toxigenic C difficile strain detected by culture (in a sample taken less than seven days before the detection round). We defined diarrhoea as three or more loose or watery stool passages a day. We excluded children and adults on intensive care units or haematology wards. Patients with a first relapse after completing treatment for a previous C difficile infection were eligible but not those with subsequent relapses. […] For each case we approached nine control patients. These patients were on the same ward as and in close proximity to the index patient. Control patients did not have diarrhoea, or had diarrhoea but a negative result for C difficile toxin by enzyme immunoassay and culture (in a sample taken less than seven days previously)’.26

Explanation. Since a diagnostic accuracy study describes the behaviour of a test under particular circumstances, a report of the study must include a complete description of the criteria that were used to identify eligible participants. Eligibility criteria are usually related to the nature and stage of the target condition and the intended future use of the index test; they often include the signs, symptoms or previous test results that generate the suspicion about the target condition. Additional criteria can be used to exclude participants for reasons of safety, feasibility and ethical arguments.

Excluding patients with a specific condition or receiving a specific treatment known to adversely affect the way the test works can lead to inflated diagnostic accuracy estimates.27 An example is the exclusion of patients using β blockers in studies evaluating the diagnostic accuracy of exercise ECG.

Some studies have one set of eligibility criteria for all study participants; these are sometimes referred to as single-gate or cohort studies. Other studies have one set of eligibility criteria for participants with the target condition, and (an)other set(s) of eligibility criteria for those without the target condition; these are called multiple-gate or case–control studies.28

In the first example, the eligibility criteria list presenting signs and symptoms, an age limit and exclusion based on specific conditions and treatments. Since the same set of eligibility criteria applies to all study participants, this is an example of a single-gate study.

In the second example, the authors used different eligibility criteria for participants with and without the target condition: one group consisted of patients with a confirmed diagnosis of Clostridium difficile, and one group consisted of healthy controls. This is an example of a multiple-gate study. Extreme contrasts between severe cases and healthy controls can lead to inflated estimates of accuracy.6 ,29

Item 7. On what basis potentially eligible participants were identified (such as symptoms, results from previous tests, inclusion in registry)

Example. ‘We reviewed our database of patients who underwent needle localization and surgical excision with digital breast tomosynthesis guidance from April 2011 through January 2013’.23

Explanation. The eligibility criteria specify who can participate in the study, but they do not describe how the study authors identified eligible participants. This can be done in various ways.30 A general practitioner may evaluate for eligibility every patient seen during office hours. Researchers can go through registries in an emergency department, to identify potentially eligible patients. In other studies, patients are only identified after having been subjected to the index test. Still other studies start with patients in whom the reference standard was performed. Many retrospective studies include participants based on searching hospital databases for patients who underwent the index test and the reference standard.31

Differences in methods for identifying eligible patients can affect the spectrum and prevalence of the target condition in the study group, as well as the range and relative frequency of alternative conditions in patients without the target condition.32 These differences can influence the estimates of diagnostic accuracy.

In the example, participants were identified through searching a patient database and were included if they underwent the index test and the reference standard.

Item 8. Where and when potentially eligible participants were identified (setting, location and dates)

Example. ‘The study was conducted at the Emergency Department of a university-affiliated children's hospital between January 21, 1996, and April 30, 1996’.33

Explanation. The results of a diagnostic accuracy study reflect the performance of a test in a particular clinical context and setting. A medical test may perform differently in a primary, secondary or tertiary care setting, for example. Authors should therefore report the actual setting in which the study was performed, as well as the exact locations: names of the participating centres, city and country. The spectrum of the target condition as well as the range of other conditions that occur in patients suspected of the target condition can vary across settings, depending on which referral mechanisms are in play.34–36

Since test procedures, referral mechanisms and the prevalence and severity of diseases can evolve over time, authors should also report the start and end dates of participant recruitment.

This information is essential for readers who want to evaluate the generalisability of the study findings and their applicability to specific questions, in particular those who would like to use the evidence generated by the study to make informed healthcare decisions.

In the example, study setting and study dates were clearly defined.

Item 9. Whether participants formed a consecutive, random or convenience series

Example. ‘All subjects were evaluated and screened for study eligibility by the first author (E.N.E.) prior to study entry. This was a convenience sample of children with pharyngitis; the subjects were enrolled when the first author was present in the emergency department’.37

Explanation. The included study participants may be either a consecutive series of all patients evaluated for eligibility at the study location and satisfying the inclusion criteria, or a subselection of these. A subselection can be purely random, produced by using a random numbers table, or less random, if patients are only enrolled on specific days or during specific office hours. In that case, included participants may not be considered a representative sample of the targeted population, and the generalisability of the study results may be jeopardised.2 ,29
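The difference between a consecutive series and a purely random subselection can be made concrete with a short sketch. Here a seeded pseudorandom number generator stands in for the random numbers table mentioned above; the function names, the identifiers and the seed are hypothetical.

```python
import random


def consecutive_series(eligible_ids):
    """Include every eligible patient, in order of presentation."""
    return list(eligible_ids)


def random_series(eligible_ids, n, seed):
    """Draw n eligible patients at random, without replacement.

    Recording the seed makes the selection reproducible and auditable.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(list(eligible_ids), n))


# Hypothetical screening log of 50 eligible patients, numbered 1 to 50.
eligible = range(1, 51)
print(len(consecutive_series(eligible)))      # all 50 are included
print(random_series(eligible, n=10, seed=2015))  # 10 chosen at random
```

A convenience series, by contrast, cannot be generated by such a rule: inclusion depends on circumstances such as which days or hours the investigator was present.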

In the example, the authors explicitly described a convenience series where participants were enrolled based on their accessibility to the clinical investigator.

Item 10a. Index test, in sufficient detail to allow replication

Item 10b. Reference standard, in sufficient detail to allow replication

Example. ‘An intravenous line was inserted in an antecubital vein and blood samples were collected into serum tubes before (baseline), immediately after, and 1.5 and 4.5 h after stress testing. Blood samples were put on ice, processed within 1 h of collection, and later stored at −80°C before analysis. The samples had been through 1 thaw–freeze cycle before cardiac troponin I (cTnI) analysis. We measured cTnI by a prototype hs assay (ARCHITECT STAT high-sensitivity troponin, Abbott Diagnostics) with the capture antibody detecting epitopes 24–40 and the detection antibody epitopes 41–49 of cTnI. The limit of detection (LoD) for the high sensitivity (hs) cTnI assay was recently reported by other groups to be 1.2 ng/L, the 99th percentile 16 ng/L, and the assay 10% coefficient of variation (CV) 3.0 ng/L. […] Samples with concentrations below the range of the assays were assigned values of 1.2 […] for cTnI. […]’.38

Explanation. Differences in the execution of the index test or reference standard are a potential source of variation in diagnostic accuracy.39 ,40 Authors should therefore describe the methods for executing the index test and reference standard, in sufficient detail to allow other researchers to replicate the study, and to allow readers to assess (1) the feasibility of using the index test in their own setting, (2) the adequacy of the reference standard and (3) the applicability of the results to their clinical question.

The description should cover key elements of the test protocol, including details of:

the preanalytical phase, for example, patient preparation such as fasting/feeding status prior to blood sampling, the handling of the sample prior to testing and its limitations (such as sample instability), or the anatomic site of measurement;
the analytical phase, including materials and instruments and analytical procedures;
the postanalytical phase, such as calculations of risk scores using analytical results and other variables.

Between-study variability in measures of test accuracy due to differences in test protocol has been documented for a number of tests, including the use of hyperventilation prior to exercise ECG and the use of tomography for exercise thallium scintigraphy.27 ,40

The number, training and expertise of the persons executing and reading the index test and the reference standard may also be critical. Many studies have shown between-reader variability, especially in the field of imaging.41 ,42 The quality of reading has also been shown to be affected in cytology and microbiology by professional background, expertise and prior training to improve interpretation and to reduce interobserver variation.43–45 Information about the amount of training of the persons in the study who read the index test can help readers to judge whether similar results are achievable in their own settings.

In some cases, a study depends on multiple reference standards. Patients with lesions on an imaging test under evaluation may, for example, undergo biopsy with a final diagnosis based on histology, whereas patients without lesions on the index test undergo clinical follow-up as reference standard. This could be a potential source of bias, so authors should specify which patient groups received which reference standard.2 ,3

More specific guidance for specialised fields of testing, or certain types of tests, will be developed in future STARD extensions. Once developed, these extensions will be made available on the STARD pages at the EQUATOR (Enhancing the QUAlity and Transparency Of health Research) website (http://www.equator-network.org/).

In the example, the authors described how blood samples were collected and processed in the laboratory. They also report analytical performance characteristics of the index test device, as obtained in previous studies.

Item 11. Rationale for choosing the reference standard (if alternatives exist)

Example. ‘The MINI [Mini International Neuropsychiatric Inventory] was developed as a short and efficient diagnostic interview to be used in both research and clinical settings (reference supporting this statement provided by the authors). It has good reliability and validity rates compared with other gold standard diagnostic interviews, such as the Structured Clinical Interview for DSM [Diagnostic and Statistical Manual of Mental Disorders] Disorders (SCID) and the Composite International Diagnostic Interview (references supporting this statement provided by the authors)’.46

Explanation. In diagnostic accuracy studies, the reference standard is used for establishing the presence or absence of the target condition in study participants. Several reference standards may be available to define the same target condition. In such cases, authors are invited to provide their rationale for selecting the specific reference standard from the available alternatives. This may depend on the intended use of the index test, the clinical relevance or practical and/or ethical reasons.

Alternative reference standards are not always in perfect agreement. Some reference standards are less accurate than others. In other cases, different reference standards reflect related but different manifestations or stages of the disease, as in confirmation by imaging (first reference standard) versus clinical events (second reference standard).

In the example, the authors selected the MINI, a structured diagnostic interview commonly used for psychiatric evaluations, as the reference standard for identifying depression and suicide risk in adults with epilepsy. As a rationale for their choice, they claimed that the MINI test was short to administer, efficient for clinical and research purposes, reliable and valid when compared with alternative diagnostic interviews.

Item 12a. Definition of and rationale for test positivity cut-offs or result categories of the index test, distinguishing prespecified from exploratory

Item 12b. Definition of and rationale for test positivity cut-offs or result categories of the reference standard, distinguishing prespecified from exploratory

Example. ‘We also compared the sensitivity of the risk-model at the specificity that would correspond to using a fixed FIT [fecal immunochemical test] positivity threshold of 50 ng/ml. We used a threshold of 50 ng/ml because this was the anticipated cut-off for the Dutch screening programme at the time of the study’.47

Explanation. Test results in their original form can be dichotomous (positive vs negative), have multiple categories (as in high, intermediate or low risk) or be continuous (interval or ratio scale).

For tests with multiple categories, or continuous results, the outcomes from testing are often reclassified into positive (disease confirmed) and negative (disease excluded). This is performed by defining a threshold: the test positivity cut-off. Results that exceed the threshold would then be called positive index test results. In other studies, an ROC curve is derived, by calculating the sensitivity–specificity pairs for all possible cut-offs.
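The derivation of sensitivity–specificity pairs across cut-offs can be sketched as follows. This is a minimal illustration, not a method prescribed by STARD: the function name `roc_points` and the data are hypothetical, and a result is called positive only when it strictly exceeds the cut-off, in line with the description above.

```python
def roc_points(values, diseased):
    """Return (cutoff, sensitivity, specificity) for every observed cut-off.

    values   -- continuous index test results
    diseased -- reference standard results (True = target condition present)
    A result counts as positive when it strictly exceeds the cut-off.
    """
    n_pos = sum(diseased)
    n_neg = len(diseased) - n_pos
    points = []
    for cutoff in sorted(set(values)):
        tp = sum(1 for v, d in zip(values, diseased) if d and v > cutoff)
        tn = sum(1 for v, d in zip(values, diseased) if not d and v <= cutoff)
        points.append((cutoff, tp / n_pos, tn / n_neg))
    return points


# Hypothetical data, for illustration only.
values = [12, 35, 48, 55, 70, 90, 20, 60]
diseased = [False, False, False, True, True, True, False, False]
points = roc_points(values, diseased)
for cutoff, sens, spec in points:
    print(f"cut-off {cutoff}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Plotting sensitivity against 1 − specificity for these pairs gives the ROC curve; a single prespecified cut-off corresponds to one point on it.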

To evaluate the validity and applicability of these classifications, readers would like to know these positivity cut-offs or result categories, how they were determined and whether they were defined prior to the study or after collecting the data. Prespecified thresholds can be based on (1) previous studies, (2) cut-offs used in clinical practice, (3) thresholds recommended by clinical practice guidelines or (4) thresholds recommended by the manufacturer. If no such thresholds exist, the authors may be tempted to explore the accuracy for various thresholds after the data have been collected.

If the authors selected the positivity cut-off after performing the test, choosing the one that maximised test performance, there is an increased risk that the resulting accuracy estimates are overly optimistic, especially in small studies.48 ,49 Subsequent studies may fail to replicate the findings.50 ,51

In the example, the authors stated the rationale for their selection of cut-offs.

Item 13a. Whether clinical information and reference standard results were available to the performers or readers of the index test

Item 13b. Whether clinical information and index test results were available to the assessors of the reference standard

Example. ‘Images for each patient were reviewed by two fellowship-trained genitourinary radiologists with 12 and 8 years of experience, respectively, who were blinded to all patient information, including the final histopathologic diagnosis’.52

Explanation. Some medical tests, such as most forms of imaging, require human handling, interpretation and judgement. These actions may be influenced by the information that is available to the reader.1 ,53 ,54 This can lead to artificially high agreement between tests, or between the index test and the reference standard.

If the reader of a test has access to information about signs, symptoms and previous test results, the reading may be influenced by this additional information, but this may still represent how the test is used in clinical practice.2 The reverse may also apply, if the reader does not have enough information for a proper interpretation of the index test outcome. In that case, test performance may be affected downwards, and the study findings may have limited applicability. Either way, readers of the study report should know to what extent such additional information was available to test readers and may have influenced their final judgement.

In other situations, the assessors of the reference standard may have had access to the index test results. In those cases, the final classification may be guided by the index test result, and the reported accuracy estimates for the index test will be too high.1 ,2 ,27 Tests that require subjective interpretation are particularly susceptible to this bias.

Withholding information from the readers of the test is commonly referred to as ‘blinding’ or ‘masking’. The point of this reporting item is not that blinding is desirable or undesirable, but, rather, that readers of the study report need information about blinding for the index test and the reference standard to be able to interpret the study findings.

In the example, the readers of unenhanced CT for differentiating between renal angiomyolipoma and renal cell carcinoma did not have access to clinical information, nor to the results of histopathology, the reference standard in this study.

Item 14. Methods for estimating or comparing measures of diagnostic accuracy

Example. ‘Statistical tests of sensitivity and specificity were conducted by using the McNemar test for correlated proportions. All tests were two sided, testing the hypothesis that stereoscopic digital mammography performance differed from that of digital mammography. A p-value of 0.05 was considered as the threshold for significance’.55

Explanation. Multiple measures of diagnostic accuracy exist to describe the performance of a medical test, and their calculation from the collected data is not always straightforward.56 Authors should report the methods used for calculating the measures that they considered appropriate for their study objectives.

Statistical techniques can be used to test specific hypotheses, following from the study's objectives. In single-test evaluations, authors may want to evaluate whether the diagnostic accuracy of the test exceeds a prespecified level (eg, sensitivity of at least 95%, see Item 4).

Diagnostic accuracy studies can also compare two or more index tests. In such comparisons, statistical hypothesis testing usually involves assessing the superiority of one test over another, or the non-inferiority.57 For such comparisons, authors should indicate which measure they specified to make the comparison; this measure should match their study objectives, and the purpose and role of the index test relative to the clinical pathway. Examples are the relative sensitivity, the absolute gain in sensitivity and the relative diagnostic OR.58

In the example, the authors used McNemar's test statistic to evaluate whether the sensitivity and specificity of stereoscopic digital mammography differed from those of digital mammography in patients with elevated risk for breast cancer. In itself, the resulting p value is not a quantitative expression of the relative accuracy of the two investigated tests. Like any p value, it is influenced by the magnitude of the difference in effect and the sample size. In the example, the authors could have calculated the relative or absolute difference in sensitivity and specificity, including a 95% CI that takes into account the paired nature of the data.
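The contrast between a p value and an effect estimate for paired data can be sketched as follows. This is one possible approach under stated assumptions, not the analysis of the cited study: the counts are hypothetical, the function names are ours, the McNemar p value is the exact binomial version, and the CI is a simple Wald interval for a difference of paired proportions.

```python
from math import comb, sqrt
from statistics import NormalDist


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_difference_ci(a, b, c, d, level=0.95):
    """Difference in sensitivity (test A minus test B) with a Wald CI.

    Counts refer to participants WITH the target condition:
    a = both tests positive, b = only A positive,
    c = only B positive,     d = both tests negative.
    """
    n = a + b + c + d
    diff = (b - c) / n
    se = sqrt(b + c - (b - c) ** 2 / n) / n
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, diff - z * se, diff + z * se


# Hypothetical counts among 100 participants with the target condition.
p_value = mcnemar_exact(b=15, c=5)
diff, lower, upper = paired_difference_ci(a=60, b=15, c=5, d=20)
print(f"p = {p_value:.3f}; difference = {diff:.2f} (95% CI {lower:.3f} to {upper:.3f})")
```

The p value only indicates whether the discordant pairs are unevenly split; the difference and its interval express how much the two sensitivities differ and how precisely that difference was estimated. The same calculation applies to specificity, using participants without the target condition.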

Item 15. How indeterminate index test or reference standard results were handled

Example. ‘Indeterminate results were considered false-positive or false-negative and incorporated into the final analysis. For example, an indeterminate result in a patient found to have appendicitis was considered to have had a negative test result’.59

Explanation. Indeterminate results refer to those that are neither positive nor negative.60 Such results can occur on the index test and the reference standard, and are a challenge when evaluating the performance of a diagnostic test.60–63 The occurrence of indeterminate test results varies from test to test, but frequencies up to 40% have been reported.62

There are many underlying causes for indeterminate test results.62 ,63 A test may fail because of technical reasons or an insufficient sample, for example, in the absence of cells in a needle biopsy from a tumour.43 ,64 ,65 Sometimes test results are not reported as just positive or negative, as in the case of ventilation–perfusion scanning in suspected pulmonary embolism, where the findings are classified in three categories: normal, high probability or inconclusive.66

In itself, the frequency of indeterminate test results is an important indicator of the feasibility of the test, and typically limits the overall clinical usefulness; therefore, authors are encouraged to always report the respective frequencies with reasons, as well as failures to complete the testing procedure. This applies to the index test and the reference standard.

Ignoring indeterminate test results can produce biased estimates of accuracy, if these results do not occur at random. Clinical practice may guide the decision on how to handle indeterminate results.

There are multiple ways for handling indeterminate test results in the analysis when estimating accuracy and expressing test performance.63 They can be ignored altogether, be reported but not accounted for or handled as a separate test result category. Handling these results as a separate category may be useful when indeterminate results occur more often, for example, in those without the target condition than in those with the target condition. It is also possible to reclassify all such results: as false positives or false negatives, depending on the reference standard result (‘worst-case scenario’), or as true positives and true negatives (‘best-case scenario’).

In the example, the authors explicitly chose a conservative approach by considering all indeterminate results from the index test as being false-negative (in those with the target condition) or false-positive (in all others), a strategy sometimes referred to as the ‘worst-case scenario’.
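The effect of these choices on the accuracy estimates can be sketched with a small scenario analysis. The counts and function names below are hypothetical and serve only to illustrate the reclassification rules described above; the number of indeterminate index test results is split by reference standard result.

```python
def sens_spec(tp, fp, fn, tn):
    """Return (sensitivity, specificity) from the four cell counts."""
    return tp / (tp + fn), tn / (tn + fp)


def scenario_analysis(tp, fp, fn, tn, indet_diseased, indet_healthy):
    """Accuracy estimates under three ways of handling indeterminate results.

    indet_diseased -- indeterminate index test results in those with the condition
    indet_healthy  -- indeterminate index test results in those without it
    """
    return {
        # Indeterminate results ignored altogether.
        "excluded": sens_spec(tp, fp, fn, tn),
        # Worst case: counted as false negatives / false positives.
        "worst_case": sens_spec(tp, fp + indet_healthy, fn + indet_diseased, tn),
        # Best case: counted as true positives / true negatives.
        "best_case": sens_spec(tp + indet_diseased, fp, fn, tn + indet_healthy),
    }


# Hypothetical counts, for illustration only.
results = scenario_analysis(tp=80, fp=10, fn=10, tn=90,
                            indet_diseased=10, indet_healthy=10)
for name, (sens, spec) in results.items():
    print(f"{name}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Reporting the estimates under more than one scenario shows readers how sensitive the conclusions are to the handling of indeterminate results.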

Item 16. How missing data on the index test and reference standard were handled

Example. ‘One vessel had missing FFR_CT and 2 had missing CT data. Missing data were handled by exclusion of these vessels as well as by the worst-case imputation’.67

Explanation. Missing data are common in any type of biomedical research. In diagnostic accuracy studies, they can occur for the index test and reference standard. There are several ways to deal with them when analysing the data.68 Many researchers exclude participants without an observed test result. This is known as ‘complete case’ or ‘available case’ analysis. This may lead to a loss in precision and can introduce bias, especially if having a missing index test or reference standard result is related to having the target condition.

Participants with missing test results can be included in the analysis if missing results are imputed.68–70 Another option is to assess the impact of missing test results on estimates of accuracy by considering different scenarios. For the index test, for example, in the ‘worst-case scenario’, all missing index test results are considered false-positive or false-negative depending on the reference standard result; in the ‘best-case scenario’, all missing index test results are considered true-positive or true-negative.

In the example, the authors explicitly reported how many cases with missing index test data they encountered and how they handled these data: they excluded them, but also applied a ‘worst-case scenario’.

Item 17. Any analyses of variability in diagnostic accuracy, distinguishing prespecified from exploratory

Example. ‘To assess the performance of urinary indices or their changes over the first 24 hours in distinguishing transient AKI [acute kidney injury] from persistent AKI, we plotted the receiver-operating characteristic curves for the proportion of true positives against the proportion of false positives, depending on the prediction rule used to classify patients as having persistent AKI. The same strategy was used to assess the performance of indices and their changes over time in two predefined patient subgroups; namely, patients who did not receive diuretic therapy and patients without sepsis’.71

Explanation. The relative proportion of false-positive or false-negative results of a diagnostic test may vary depending on patient characteristics, experience of readers, the setting and previous test results.2 ,3 Researchers may therefore want to explore possible sources of variability in test accuracy within their study. In such analyses, investigators typically assess differences in accuracy across subgroups of participants, readers or centres.

Post hoc analyses, performed after looking at the data, carry a high risk for spurious findings. The results are especially likely not to be confirmed by subsequent studies. Analyses that were prespecified in the protocol, before data were collected, have greater credibility.72

In the example, the authors reported that the accuracy of the urinary indices was evaluated in two subgroups that were explicitly prespecified.

Item 18. Intended sample size and how it was determined

Example. ‘Study recruitment was guided by an expected 12% prevalence of adenomas 6 mm or larger in a screening cohort and a point estimate of 80% sensitivity for these target lesions. We planned to recruit approximately 600 participants to achieve margins of sampling error of approximately 8 percentage points for sensitivity. This sample would also allow 90% power to detect differences in sensitivity between computed tomographic colonography and optical colonoscopy of 18 percentage points or more’.73

Explanation. Performing sample size calculations when developing a diagnostic accuracy study can help to ensure that the accuracy estimates reach sufficient precision. Sample size calculations also take into account the specific objectives and hypotheses of the study.

Readers may want to know how the sample size was determined, and whether the assumptions made in this calculation are in line with the scientific and clinical background, and the study objectives. Readers will also want to learn whether the study authors were successful in recruiting the targeted number of participants. Methods for performing sample size calculations in diagnostic research are widely available,74–76 but such calculations are not always performed or provided in reports of diagnostic accuracy studies.77 ,78

Many diagnostic accuracy studies are small; a systematic survey of studies published in 8 leading journals in 2002 found a median sample size of 118 participants (IQR 71–350).77 Estimates of diagnostic accuracy from small studies tend to be imprecise, with wide CIs around them.

In the example, the authors reported in detail how the sample size was determined to achieve a desired level of precision for an expected sensitivity of 80%.
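A precision-based calculation of this kind can be sketched as follows. This is a minimal illustration using the normal approximation for a single proportion, with hypothetical inputs and a function name of our own; it is not the method of the cited study, and other approaches (for example, exact methods or power-based calculations for comparisons) will give different numbers.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_sensitivity(expected_sens, margin, prevalence, level=0.95):
    """Participants needed so that the CI around sensitivity has the given
    half-width (margin), using the normal approximation.

    Returns (participants with the target condition, total participants).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    n_diseased = ceil(z ** 2 * expected_sens * (1 - expected_sens) / margin ** 2)
    # Only a fraction of recruited participants will have the target condition.
    n_total = ceil(n_diseased / prevalence)
    return n_diseased, n_total


# Hypothetical inputs: expected sensitivity 90%, margin of 5 percentage
# points, expected prevalence 20%.
n_diseased, n_total = sample_size_for_sensitivity(0.90, 0.05, 0.20)
print(n_diseased, n_total)
```

The calculation makes explicit which assumptions (expected sensitivity, desired margin and expected prevalence) readers need in order to judge the planned sample size.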

Results

Item 19. Flow of participants, using a diagram

Example. ‘Between 1 June 2008 and 30 June 2011, 360 patients were assessed for initial eligibility and invited to participate. The figure shows the flow of patients through the study, along with the primary outcome of advanced colorectal neoplasia. Patients who were excluded (and reasons for this) or who withdrew from the study are noted. In total, 229 patients completed the study, a completion rate of 64%’.79 (See figure 1.)

Figure 1

Example of flow diagram from a study evaluating the accuracy of faecal immunochemical testing for diagnosis of advanced colorectal neoplasia (adapted from Collins et al,79 with permission).

Explanation. Estimates of diagnostic accuracy may be biased if not all eligible participants undergo the index test and the desired reference standard.80–86 This includes studies in which not all study participants undergo the reference standard, as well as studies where some of the participants receive a different reference standard.70 Incomplete verification by the reference standard occurs in up to 26% of diagnostic studies; it is especially common when the reference standard is an invasive procedure.84

To allow the readers to appreciate the potential for bias, authors are invited to build a diagram to illustrate the flow of participants through the study. Such a diagram also illustrates the basic structure of the study. An example of a prototypical STARD flow diagram is presented in figure 2.

Figure 2

STARD 2015 flow diagram.

By providing the exact number of participants at each stage of the study, including the number of true-positive, false-positive, true-negative and false-negative index test results, the diagram also helps to identify the correct denominator for calculating proportions such as sensitivity and specificity. The diagram should also specify the number of participants who were assessed for eligibility, the number of participants who did not receive the index test and/or the reference standard and the reasons for that. This helps readers to judge the risk of bias, but also the feasibility of the evaluated testing strategy, and the applicability of the study findings.

In the example, the authors very briefly described the flow of participants, and referred to a flow diagram in which the number of participants and corresponding test results at each stage of the study were provided, as well as detailed reasons for excluding participants (figure 1).

Item 20. Baseline demographic and clinical characteristics of participants

Example. ‘The median age of participants was 60 years (range 18–91), and 209 participants (54.7%) were female. The predominant presenting symptom was abdominal pain, followed by rectal bleeding and diarrhea, whereas fever and weight loss were less frequent. At physical examination, palpation elicited abdominal pain in almost half the patients, but palpable abdominal or rectal mass was found in only 13 individuals (Table X)’.87 (See table 3.)

Table 3

Example of baseline demographic and clinical characteristics of participants in a study evaluating the accuracy of point-of-care fecal tests for diagnosis of organic bowel disease (adapted from Kok et al,87 with permission)

Explanation. The diagnostic accuracy of a test can depend on the demographic and clinical characteristics of the population in which it is applied.2 ,3 ,88–92 These differences may reflect variability in the extent or severity of disease, which affects sensitivity, or in the alternative conditions that are able to generate false-positive findings, affecting specificity.85

An adequate description of the demographic and clinical characteristics of study participants allows the reader to judge whether the study can adequately address the study question, and whether the study findings apply to the reader's clinical question.

In the example, the authors presented the demographic and clinical characteristics of the study participants in a separate table, a commonly used, informative way of presenting key participant characteristics (table 3).

Item 21a. Distribution of severity of disease in those with the target condition

Item 21b. Distribution of alternative diagnoses in those without the target condition

Example. ‘Of the 170 patients with coronary disease, one had left main disease, 53 had three vessel disease, 64 two vessel disease, and 52 single vessel disease. The mean ejection fraction of the patients with coronary disease was 64% (range 37–83). The other 52 men with symptoms had normal coronary arteries or no significant lesions at angiography’.93

Explanation. Most target conditions are not fixed states, either present or absent; many diseases cover a continuum, ranging from minute pathological changes to advanced clinical disease. Test sensitivity is often higher in studies in which more patients have advanced stages of the target condition, as these cases are often easier to identify by the index test.28 ,85 The type, spectrum and frequency of alternative diagnoses in those without the target condition may also influence test accuracy; typically, the healthier the patients without the target condition, the less frequently one would find false-positive results of the index test.28

An adequate description of the severity of disease in those with the target condition and of the alternative conditions in those without it allows the reader to judge both the validity of the study, relative to the study question and the applicability of the study findings to the reader's clinical question.

In the example, the authors investigated the accuracy of exercise tests for diagnosing coronary artery disease. They reported the distribution of severity of disease in terms of the number of vessels involved; the more vessels, the more severe the coronary artery disease would be. Sensitivity of exercise tests was higher in those with more diseased vessels (39% for single vessel disease, 58% for two and 77% for three vessels).91

Item 22. Time interval and any clinical interventions between index test and reference standard

Example. ‘The mean time between arthrometric examination and MR imaging was 38.2 days (range, 0–107 days)’.94

Explanation. Studies of diagnostic accuracy are essentially cross-sectional investigations. In most cases, one wants to know how well the index test classifies patients in the same way as the reference standard, when both tests are performed in the same patients, at the same time.30 When a delay occurs between the index test and the reference standard, the target condition and alternative conditions can change; conditions may worsen or improve in the meantime, due to the natural course of the disease, or due to clinical interventions applied between the two tests. Such changes influence the agreement between the index test and the reference standard, which could lead to biased estimates of test performance.

The bias can be more severe if the delay differs systematically between test positives and test negatives, or between those with a high prior suspicion of having the target condition and those with a low suspicion.1 ,2

When follow-up is used as the reference standard, readers will want to know how long the follow-up period was.

In the example, the authors reported the mean number of days, and a range, between the index test and the reference standard.

Item 23. Cross tabulation of the index test results (or their distribution) by the results of the reference standard

Example. ‘Table X shows pain over speed bumps in relation to diagnosis of appendicitis’.95 (See table 4.)

Table 4

Example of contingency table from a study evaluating the accuracy of pain over speed bumps for diagnosis of appendicitis (adapted from Ashdown et al,95 with permission)

Explanation. Research findings should be reproducible and verifiable by other scientists; this applies to the testing procedures, to the conduct of the study and to the statistical analyses.

A cross tabulation of index test results against reference standard results facilitates recalculating measures of diagnostic accuracy. It also facilitates recalculating the proportion of study group participants with the target condition, which is useful as the sensitivity and specificity of a test may vary with disease prevalence.32 ,96 It also allows for performing alternative or additional analyses, such as meta-analysis.

Preferably, such tables should include actual numbers, not just percentages, because mistakes made by study authors in calculating estimates for sensitivity and specificity are not rare.

In the example, the authors provided a contingency table from which the number of true positives, false positives, false negatives and true negatives can be easily identified (table 4).
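The recalculation that such a table makes possible can be sketched as follows. The counts are hypothetical and are not taken from table 4; the function name is ours.

```python
def accuracy_from_table(tp, fp, fn, tn):
    """Recalculate common accuracy measures from a 2x2 cross tabulation."""
    n = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),   # positives among those with the condition
        "specificity": tn / (tn + fp),   # negatives among those without it
        "ppv": tp / (tp + fp),           # condition present among test positives
        "npv": tn / (tn + fn),           # condition absent among test negatives
        "prevalence": (tp + fn) / n,     # proportion with the target condition
    }


# Hypothetical counts, for illustration only.
measures = accuracy_from_table(tp=40, fp=20, fn=10, tn=130)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With the four cell counts available, readers can verify reported estimates and reuse the data, for example in a meta-analysis.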

Item 24. Estimates of diagnostic accuracy and their precision (such as 95% CIs)

Example. ‘Forty-six patients had pulmonary fibrosis at CT, and sensitivity and specificity of MR imaging in the identification of pulmonary fibrosis were 89% (95% CI 77%, 96%) and 91% (95% CI 76%, 98%), respectively, with positive and negative predictive values of 93% (95% CI 82%, 99%) and 86% (95% CI 70%, 95%), respectively’.97

Explanation. Diagnostic accuracy studies never determine a test's ‘true’ sensitivity and specificity; at best, the data collected in the study can be used to calculate valid estimates of sensitivity and specificity. The smaller the number of study participants, the less precise these estimates will be.98

The most frequently used expression of imprecision is to report not just the estimates—sometimes referred to as point estimates—but also 95% CIs around the estimates. Results from studies with imprecise estimates of accuracy should be interpreted with caution, as overoptimism lurks.22

In the example, where MRI is the index test and CT the reference standard, the authors reported point estimates and 95% CIs around them, for sensitivity, specificity and positive and negative predictive value.
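How a 95% CI around a proportion such as sensitivity can be obtained is sketched below. The Wilson score interval is used here as one common choice; the source does not state which method the cited authors used, and the counts and function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_ci(successes, n, level=0.95):
    """Wilson score interval for a proportion (eg, sensitivity = TP / (TP + FN))."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


# Hypothetical: 45 true positives among 50 participants with the target condition.
lower, upper = wilson_ci(45, 50)
print(f"sensitivity 0.90 (95% CI {lower:.2f} to {upper:.2f})")
```

Repeating the calculation with the same proportion but a larger denominator yields a narrower interval, which is the sense in which larger studies give more precise estimates.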

Item 25. Any adverse events from performing the index test or the reference standard

Example. ‘No significant adverse events occurred as a result of colonoscopy. Four (2%) patients had minor bleeding in association with polypectomy that was controlled endoscopically. Other minor adverse events are noted in the appendix’.79

Explanation. Not all medical tests are equally safe, and in this, they do not differ from many other medical interventions.99 ,100 The testing procedure can lead to complications, such as perforations with endoscopy, contrast allergic reactions in CT imaging or claustrophobia with MRI scanning.

Measuring and reporting adverse events in studies of diagnostic accuracy provides additional information to clinicians, who may be reluctant to use a test if it produces severe or frequent adverse events. Actual application of a test in clinical practice will not just be guided by the test's accuracy, but by several other dimensions as well, including feasibility and safety. This also applies to the reference standard.

In the example, the authors distinguished between ‘significant’ and ‘minor’ adverse events, and explicitly reported how often these were observed.

Discussion

Item 26. Study limitations, including sources of potential bias, statistical uncertainty and generalisability

Example. ‘This study had limitations. First, not all patients who underwent CT colonography (CTC) were assessed by the reference standard methods. […] However, considering that the 41 patients who were eligible but did not undergo the reference standard procedures had negative or only mildly positive CTC findings, excluding them from the analysis of CTC diagnostic performance may have slightly overestimated the sensitivity of CTC (ie, partial verification bias). Second, there was a long time interval between CTC and the reference methods in some patients, predominately those with negative CTC findings. […] If anything, the prolonged interval would presumably slightly underestimate the sensitivity and NPV of CTC for non-cancerous lesions, since some “missed” lesions could have conceivably developed or increased in size since the time of CTC’.101

Explanation. Like other clinical trials and studies, diagnostic accuracy studies are at risk of bias; they can generate estimates of the test's accuracy that do not reflect the true performance of the test, due to flaws or deficiencies in study design and analysis.1 ,2 In addition, imprecise accuracy estimates, with wide CIs, should be interpreted with caution. Because of differences in design, participants and procedures, the findings generated by one particular diagnostic accuracy study may not be obtained in other conditions; their generalisability may be limited.102

In the Discussion section, authors should critically reflect on the validity of their findings, address potential limitations and elaborate on why study findings may or may not be generalisable. As bias can come down to overestimation or underestimation of the accuracy of the index test under investigation, authors should discuss the direction of potential bias, along with its likely magnitude. Readers are then informed of the likelihood that the limitations jeopardise the study's results and conclusions (see also Item 27).103

Some journals explicitly encourage authors to report on study limitations, but many are not specific about which elements should be addressed.104 For diagnostic accuracy studies, we highly recommend that at least potential sources of bias are discussed, as well as imprecision, and concerns related to the selection of patients and the setting in which the study was performed.

In the example, the authors identified two potential sources of bias that are common in diagnostic accuracy studies: not all test results were verified by the reference standard, and there was a time interval between index test and reference standard, allowing the target condition to change. They also discussed the magnitude of this potential bias, and the direction: whether this may have led to overestimations or underestimations of test accuracy.

Item 27. Implications for practice, including the intended use and clinical role of the index test

Example. ‘A Wells score of ≤4 combined with a negative point of care D-dimer test result ruled out pulmonary embolism in 4–5 of 10 patients, with a failure rate of less than 2%, which is considered safe by most published consensus statements. Such a rule-out strategy makes it possible for primary care doctors to safely exclude pulmonary embolism in a large proportion of patients suspected of having the condition, thereby reducing the costs and burden to the patient (for example, reducing the risk of contrast nephropathy associated with spiral computed tomography) associated with an unnecessary referral to secondary care’.25

Explanation. To make the study findings relevant for practice, authors of diagnostic accuracy studies should elaborate on the consequences of their findings, taking into account the intended use (the purpose of testing) and clinical role of the test (how the test will be positioned in the existing clinical pathway).

A test can be proposed for diagnostic purposes, for susceptibility, screening, risk stratification, staging, prediction, prognosis, treatment selection, monitoring, surveillance or other purposes. The clinical role of the test reflects its positioning relative to existing tests for the same purpose, within the same clinical setting: triage, add-on or replacement.19 ,105 The intended use and the clinical role of the index test should have been described in the introduction of the paper (Item 3).

The intended use and the proposed role will guide the desired magnitude of the measures of diagnostic accuracy. For ruling out disease with an inexpensive triage test, for example, high sensitivity is required, and less-than-perfect specificity may be acceptable. If the test is supposed to rule in disease, specificity may become much more important.106

In the Discussion section, authors should elaborate on whether or not the accuracy estimates are sufficient for considering the test to be ‘fit for purpose’.

In the example, the authors concluded that the combination of a Wells score ≤4 and a negative point-of-care D-dimer result could reliably rule out pulmonary embolism in a large proportion of patients seen in primary care.

Other information

Item 28. Registration number and name of registry

Example. ‘The study was registered at http://www.clinicaltrials.org (NCT00916864)’.107

Explanation. Registering study protocols before their initiation in a clinical trial registry, such as ClinicalTrials.gov or one of the WHO Primary Registries, ensures that the existence of the studies can be identified.108–112 This has many advantages, including avoiding overlapping or redundant studies, and allowing colleagues and potential participants to contact the study coordinators.

Additional benefits of study registration are the prospective definition of study objectives, outcome measures, eligibility criteria and data to be collected, allowing editors, reviewers and readers to identify deviations in the final study report. Trial registration also allows reviewers to identify studies that have been completed but have not yet been reported.

Many journals require registration of clinical trials. A low but increasing number of diagnostic accuracy studies are also being registered. In a recent evaluation of 351 test accuracy studies published in high-impact journals in 2012, 15% had been registered.113

Including a registration number in the study report facilitates identification of the trial in the corresponding registry. It can also be regarded as a sign of quality, if the trial was registered before its initiation.

In the example, the authors reported that the study was registered at ClinicalTrials.gov. The registration number was also provided, so that the registered record could be easily retrieved.

Item 29. Where the full study protocol can be accessed

Example. ‘The design and rationale of the OPTIMAP study have been previously published in more detail [with reference to study protocol]’.114

Explanation. Full study protocols typically contain additional methodological information that is not provided in the final study report, because of word limits, or because it has been reported elsewhere. This additional information can be helpful for those who want to thoroughly appraise the validity of the study, for researchers who want to replicate the study and for practitioners who want to implement the testing procedures.

An increasing number of researchers share their original study protocol, often before enrolment of the first participant in the study. They may do so by publishing the protocol in a scientific journal, by posting it on an institutional or sponsor website, or by providing it as supplementary material on the journal website, to accompany the study report.

If the protocol has been published or posted online, authors should provide a reference or a link. If the study protocol has not been published, authors should state from whom it can be obtained.115

In the example, the authors provided a reference to the full protocol, which had been published previously.

Item 30. Sources of funding and other support; role of funders

Example. ‘Funding, in the form of the extra diagnostic reagents and equipment needed for the study, was provided by Gen-Probe. The funders had no role in the initiation or design of the study, collection of samples, analysis, interpretation of data, writing of the paper, or the submission for publication. The study and researchers are independent of the funders, Gen-Probe’.116

Explanation. Sponsorship of a study by a pharmaceutical company has been shown to be associated with results favouring the interests of that sponsor.117 Unfortunately, sponsorship is often not disclosed in scientific articles, making it difficult to assess this potential bias. Sponsorship can consist of direct funding of the study, or of the provision of essential study materials, such as test devices.

The role of the sponsor, including the degree to which that sponsor was involved in the study, varies. A sponsor could, for example, be involved in the design of the study, but also in the conduct, analysis, reporting and decision to publish. Authors are encouraged to be explicit about sources of funding as well as the sponsor's role(s) in the study, as this transparency helps readers to appreciate the level of independence of the researchers.

In the example, the authors were explicit about the contribution from the sponsor, and their independence in each phase of the study.

Acknowledgments

The authors thank the STARD Group for helping to identify essential items for reporting diagnostic accuracy studies.

References

1. Whiting P, Rutjes AW, Reitsma JB, et al. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 2004;140:189–202.
2. Whiting PF, Rutjes AW, Westwood ME, et al. A systematic review classifies sources of bias and variation in diagnostic test accuracy studies. J Clin Epidemiol 2013;66:1093–104. doi:10.1016/j.jclinepi.2013.05.014
3. Whiting PF, Rutjes AW, Westwood ME, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 2011;155:529–36. doi:10.7326/0003-4819-155-8-201110180-00009
4. Korevaar DA, van Enst WA, Spijker R, et al. Reporting quality of diagnostic accuracy studies: a systematic review and meta-analysis of investigations on adherence to STARD. Evid Based Med 2014;19:47–54. doi:10.1136/eb-2013-101637
5. Korevaar DA, Wang J, van Enst WA, et al. Reporting diagnostic accuracy studies: some improvements after 10 years of STARD. Radiology 2015;274:781–9. doi:10.1148/radiol.14141160
6. Lijmer JG, Mol BW, Heisterkamp S, et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA 1999;282:1061–6.
7. Bossuyt PM, Reitsma JB, Bruns DE, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Standards for Reporting of Diagnostic Accuracy. Clin Chem 2003;49:1–6.
8. Begg C, Cho M, Eastwood S, et al. Improving the quality of reporting of randomized controlled trials. The CONSORT statement. JAMA 1996;276:637–9.
9. Schulz KF, Altman DG, Moher D. CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials. BMJ 2010;340:c332.
10. Bossuyt PM, Reitsma JB, Bruns DE, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ 2015;351:h5527.
11. Bossuyt PM, Reitsma JB, Bruns DE, et al. The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Ann Intern Med 2003;138:W1–12.
12. Regge D, Laudi C, Galatola G, et al. Diagnostic accuracy of computed tomographic colonography for the detection of advanced neoplasia in individuals at increased risk of colorectal cancer. JAMA 2009;301:2453–61. doi:10.1001/jama.2009.832
13. Deville WL, Bezemer PD, Bouter LM. Publications on diagnostic test evaluation in family medicine journals: an optimal search strategy. J Clin Epidemiol 2000;53:65–9.
14. Korevaar DA, Cohen JF, Hooft L, et al. Literature survey of high-impact journals revealed reporting weaknesses in abstracts of diagnostic accuracy studies. J Clin Epidemiol 2015;68:708–15. doi:10.1016/j.jclinepi.2015.01.014
15. Korevaar DA, Cohen JF, de Ronde MW, et al. Reporting weaknesses in conference abstracts of diagnostic accuracy studies in ophthalmology. JAMA Ophthalmol 2015;133:1464–7. doi:10.1001/jamaophthalmol.2015.3577
16. A proposal for more informative abstracts of clinical articles. Ad Hoc Working Group for Critical Appraisal of the Medical Literature. Ann Intern Med 1987;106:598–604.
17. Stiell IG, Greenberg GH, Wells GA, et al. Derivation of a decision rule for the use of radiography in acute knee injuries. Ann Emerg Med 1995;26:405–13.
18. Horvath AR, Lord SJ, StJohn A, et al. From biomarkers to medical tests: the changing landscape of test evaluation. Clin Chim Acta 2014;427:49–57.
19. Bossuyt PM, Irwig L, Craig J, et al. Comparative accuracy: assessing new tests against existing diagnostic pathways. BMJ 2006;332:1089–92.
20. Gieseker KE, Roe MH, MacKenzie T, et al. Evaluating the American Academy of Pediatrics diagnostic standard for Streptococcus pyogenes pharyngitis: backup culture versus repeat rapid antigen testing. Pediatrics 2003;111(6 Pt 1):e666–70.
21. Tanz RR, Gerber MA, Kabat W, et al. Performance of a rapid antigen-detection test and throat culture in community pediatric offices: implications for management of pharyngitis. Pediatrics 2009;123:437–44.
22. Ochodo EA, de Haan MC, Reitsma JB, et al. Overinterpretation and misreporting of diagnostic accuracy studies: evidence of ‘spin’. Radiology 2013;267:581–8.
23. Freer PE, Niell B, Rafferty EA. Preoperative tomosynthesis-guided needle localization of mammographically and sonographically occult breast lesions. Radiology 2015;275:377–83.
24. Sorensen HT, Sabroe S, Olsen J. A framework for evaluation of secondary data sources for epidemiological research. Int J Epidemiol 1996;25:435–42.
25. Geersing GJ, Erkens PM, Lucassen WA, et al. Safe exclusion of pulmonary embolism using the Wells rule and qualitative D-dimer testing in primary care: prospective cohort study. BMJ 2012;345:e6564.
26. Bomers MK, van Agtmael MA, Luik H, et al. Using a dog's superior olfactory sensitivity to identify Clostridium difficile in stools and patients: proof of principle study. BMJ 2012;345:e7396.
27. Philbrick JT, Horwitz RI, Feinstein AR. Methodologic problems of exercise testing for coronary artery disease: groups, analysis and bias. Am J Cardiol 1980;46:807–12.
28. Rutjes AW, Reitsma JB, Vandenbroucke JP, et al. Case-control and two-gate designs in diagnostic accuracy studies. Clin Chem 2005;51:1335–41. doi:10.1373/clinchem.2005.048595
29. Rutjes AW, Reitsma JB, Di Nisio M, et al. Evidence of bias and variation in diagnostic accuracy studies. CMAJ 2006;174:469–76. doi:10.1503/cmaj.050090
30. Knottnerus JA, Muris JW. Assessment of the accuracy of diagnostic tests: the cross-sectional study. J Clin Epidemiol 2003;56:1118–28.
31. van der Schouw YT, Van Dijk R, Verbeek AL. Problems in selecting the adequate patient population from existing data files for assessment studies of new diagnostic tests. J Clin Epidemiol 1995;48:417–22.
32. Leeflang MM, Bossuyt PM, Irwig L. Diagnostic test accuracy may vary with prevalence: implications for evidence-based diagnosis. J Clin Epidemiol 2009;62:5–12. doi:10.1016/j.jclinepi.2008.04.007
33. Attia M, Zaoutis T, Eppes S, et al. Multivariate predictive models for group A beta-hemolytic streptococcal pharyngitis in children. Acad Emerg Med 1999;6:8–13.
34. Knottnerus JA, Knipschild PG, Sturmans F. Symptoms and selection bias: the influence of selection towards specialist care on the relationship between symptoms and diagnoses. Theor Med 1989;10:67–81.
35. Knottnerus JA, Leffers P. The influence of referral patterns on the characteristics of diagnostic tests. J Clin Epidemiol 1992;45:1143–54.
36. Melbye H, Straume B. The spectrum of patients strongly influences the usefulness of diagnostic tests for pneumonia. Scand J Prim Health Care 1993;11:241–6.
37. Ezike EN, Rongkavilit C, Fairfax MR, et al. Effect of using 2 throat swabs vs 1 throat swab on detection of group A streptococcus by a rapid antigen detection test. Arch Pediatr Adolesc Med 2005;159:486–90.
38. Rosjo H, Kravdal G, Hoiseth AD, et al. Troponin I measured by a high-sensitivity assay in patients with suspected reversible myocardial ischemia: data from the Akershus Cardiac Examination (ACE) 1 study. Clin Chem 2012;58:1565–73. doi:10.1373/clinchem.2012.190868
39. Irwig L, Bossuyt P, Glasziou P, et al. Designing studies to ensure that estimates of test accuracy are transferable. BMJ 2002;324:669–71.
40. Detrano R, Gianrossi R, Froelicher V. The diagnostic accuracy of the exercise electrocardiogram: a meta-analysis of 22 years of research. Prog Cardiovasc Dis 1989;32:173–206.
41. Brealey S, Scally AJ. Bias in plain film reading performance studies. Br J Radiol 2001;74:307–16. doi:10.1259/bjr.74.880.740307
42. Elmore JG, Wells CK, Lee CH, et al. Variability in radiologists’ interpretations of mammograms. N Engl J Med 1994;331:1493–9. doi:10.1056/NEJM199412013312206
43. Ronco G, Montanari G, Aimone V, et al. Estimating the sensitivity of cervical cytology: errors of interpretation and test limitations. Cytopathology 1996;7:151–8.
44. Cohen MB, Rodgers RP, Hales MS, et al. Influence of training and experience in fine-needle aspiration biopsy of breast. Receiver operating characteristics curve analysis. Arch Pathol Lab Med 1987;111:518–20.
45. Fox JW, Cohen DM, Marcon MJ, et al. Performance of rapid streptococcal antigen testing varies by personnel. J Clin Microbiol 2006;44:3918–22. doi:10.1128/JCM.01399-06
46. Gandy M, Sharpe L, Perry KN, et al. Assessing the efficacy of 2 screening measures for depression in people with epilepsy. Neurology 2012;79:371–5. doi:10.1212/WNL.0b013e318260cbfc
47. Stegeman I, de Wijkerslooth TR, Stoop EM, et al. Combining risk factors with faecal immunochemical test outcome for selecting CRC screenees for colonoscopy. Gut 2014;63:466–71. doi:10.1136/gutjnl-2013-305013
48. Leeflang MM, Moons KG, Reitsma JB, et al. Bias in sensitivity and specificity caused by data-driven selection of optimal cutoff values: mechanisms, magnitude, and solutions. Clin Chem 2008;54:729–37. doi:10.1373/clinchem.2007.096032
49. Ewald B. Post hoc choice of cut points introduced bias to diagnostic research. J Clin Epidemiol 2006;59:798–801. doi:10.1016/j.jclinepi.2005.11.025
50. Justice AC, Covinsky KE, Berlin JA. Assessing the generalizability of prognostic information. Ann Intern Med 1999;130:515–24.
51. Harrell FE Jr, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 1996;15:361–87. doi:10.1002/(SICI)1097-0258(19960229)15:4<361::AID-SIM168>3.0.CO;2-4
52. Hodgdon T, McInnes MD, Schieda N, et al. Can quantitative CT texture analysis be used to differentiate fat-poor renal angiomyolipoma from renal cell carcinoma on unenhanced CT images? Radiology 2015;276:787–96. doi:10.1148/radiol.2015142215
53. Begg CB. Biases in the assessment of diagnostic tests. Stat Med 1987;6:411–23.
54. Doubilet P, Herman PG. Interpretation of radiographs: effect of clinical history. AJR Am J Roentgenol 1981;137:1055–8. doi:10.2214/ajr.137.5.1055
55. D'Orsi CJ, Getty DJ, Pickett RM, et al. Stereoscopic digital mammography: improved specificity and reduced rate of recall in a prospective clinical trial. Radiology 2013;266:81–8. doi:10.1148/radiol.12120382
56. Knottnerus JA, Buntinx F. The evidence base of clinical diagnosis: theory and methods of diagnostic research. 2nd edn. BMJ Books, 2008.
57. Pepe M. Study design and hypothesis testing. The statistical evaluation of medical tests for classification and prediction. Oxford, UK: Oxford University Press, 2003:214–51.
58. Hayen A, Macaskill P, Irwig L, et al. Appropriate statistical methods are required to assess diagnostic tests for replacement, add-on, and triage. J Clin Epidemiol 2010;63:883–91. doi:10.1016/j.jclinepi.2009.08.024
59. Garcia Pena BM, Mandl KD, Kraus SJ, et al. Ultrasonography and limited computed tomography in the diagnosis and management of appendicitis in children. JAMA 1999;282:1041–6.
60. Simel DL, Feussner JR, DeLong ER, et al. Intermediate, indeterminate, and uninterpretable diagnostic test results. Med Decis Making 1987;7:107–14.
61. Philbrick JT, Horwitz RI, Feinstein AR, et al. The limited spectrum of patients studied in exercise test research. Analyzing the tip of the iceberg. JAMA 1982;248:2467–70.
62. Begg CB, Greenes RA, Iglewicz B. The influence of uninterpretability on the assessment of diagnostic tests. J Chronic Dis 1986;39:575–84.
63. Shinkins B, Thompson M, Mallett S, et al. Diagnostic accuracy studies: how to report and analyse inconclusive test results. BMJ 2013;346:f2778.
64. Pisano ED, Fajardo LL, Tsimikas J, et al. Rate of insufficient samples for fine-needle aspiration for nonpalpable breast lesions in a multicenter clinical trial: the Radiologic Diagnostic Oncology Group 5 Study. The RDOG5 investigators. Cancer 1998;82:679–88.
65. Giard RW, Hermans J. The value of aspiration cytologic examination of the breast. A statistical review of the medical literature. Cancer 1992;69:2104–10.
66. PIOPED Investigators. Value of the ventilation/perfusion scan in acute pulmonary embolism. Results of the prospective investigation of pulmonary embolism diagnosis (PIOPED). JAMA 1990;263:2753–9.
67. Min JK, Leipsic J, Pencina MJ, et al. Diagnostic accuracy of fractional flow reserve from anatomic CT angiography. JAMA 2012;308:1237–45. doi:10.1001/2012.jama.11274
68. Naaktgeboren CA, de Groot JA, Rutjes AW, et al. Anticipating missing reference standard data when planning diagnostic accuracy studies. BMJ 2016;352:i402.
69. van der Heijden GJ, Donders AR, Stijnen T, et al. Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. J Clin Epidemiol 2006;59:1102–9. doi:10.1016/j.jclinepi.2006.01.015
70. de Groot JA, Bossuyt PM, Reitsma JB, et al. Verification problems in diagnostic accuracy studies: consequences and solutions. BMJ 2011;343:d4770.
71. Pons B, Lautrette A, Oziel J, et al. Diagnostic accuracy of early urinary index changes in differentiating transient from persistent acute kidney injury in critically ill patients: multicenter cohort study. Crit Care 2013;17:R56. doi:10.1186/cc12582
72. Sun X, Ioannidis JP, Agoritsas T, et al. How to use a subgroup analysis: users’ guide to the medical literature. JAMA 2014;311:405–11. doi:10.1001/jama.2013.285063
73. Zalis ME, Blake MA, Cai W, et al. Diagnostic accuracy of laxative-free computed tomographic colonography for detection of adenomatous polyps in asymptomatic adults: a prospective evaluation. Ann Intern Med 2012;156:692–702. doi:10.7326/0003-4819-156-10-201205150-00005
74. Flahault A, Cadilhac M, Thomas G. Sample size calculation should be performed for design accuracy in diagnostic test studies. J Clin Epidemiol 2005;58:859–62. doi:10.1016/j.jclinepi.2004.12.009
75. Pepe MS. The statistical evaluation of medical tests for classification and prediction. Oxford, New York: Oxford University Press, 2003.
76. Vach W, Gerke O, Hoilund-Carlsen PF. Three principles to define the success of a diagnostic study could be identified. J Clin Epidemiol 2012;65:293–300. doi:10.1016/j.jclinepi.2011.07.004
77. Bachmann LM, Puhan MA, ter Riet G, et al. Sample sizes of studies on diagnostic accuracy: literature survey. BMJ 2006;332:1127–9. doi:10.1136/bmj.38793.637789.2F
78. Bochmann F, Johnson Z, Azuara-Blanco A. Sample size in studies on diagnostic accuracy in ophthalmology: a literature survey. Br J Ophthalmol 2007;91:898–900. doi:10.1136/bjo.2006.113290
79. Collins MG, Teo E, Cole SR, et al. Screening for colorectal cancer and advanced colorectal neoplasia in kidney transplant recipients: cross sectional prevalence and diagnostic accuracy study of faecal immunochemical testing for haemoglobin and colonoscopy. BMJ 2012;345:e4657.
80. Cecil MP, Kosinski AS, Jones MT, et al. The importance of work-up (verification) bias correction in assessing the accuracy of SPECT thallium-201 testing for the diagnosis of coronary artery disease. J Clin Epidemiol 1996;49:735–42.
81. Choi BC. Sensitivity and specificity of a single diagnostic test in the presence of work-up bias. J Clin Epidemiol 1992;45:581–6.
82. Diamond GA. Off Bayes: effect of verification bias on posterior probabilities calculated using Bayes’ theorem. Med Decis Making 1992;12:22–31.
83. Diamond GA, Rozanski A, Forrester JS, et al. A model for assessing the sensitivity and specificity of tests subject to selection bias. Application to exercise radionuclide ventriculography for diagnosis of coronary artery disease. J Chronic Dis 1986;39:343–55.
84. Greenes RA, Begg CB. Assessment of diagnostic technologies. Methodology for unbiased estimation from samples of selectively verified patients. Invest Radiol 1985;20:751–6.
85. Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med 1978;299:926–30. doi:10.1056/NEJM197810262991705
86. Zhou XH. Effect of verification bias on positive and negative predictive values. Stat Med 1994;13:1737–45.
87. Kok L, Elias SG, Witteman BJ, et al. Diagnostic accuracy of point-of-care fecal calprotectin and immunochemical occult blood tests for diagnosis of organic bowel disease in primary care: the Cost-Effectiveness of a Decision Rule for Abdominal Complaints in Primary Care (CEDAR) study. Clin Chem 2012;58:989–98. doi:10.1373/clinchem.2011.177980
88. Harris JM Jr. The hazards of bedside Bayes. JAMA 1981;246:2602–5.
89. Hlatky MA, Pryor DB, Harrell FE Jr, et al. Factors affecting sensitivity and specificity of exercise electrocardiography. Multivariable analysis. Am J Med 1984;77:64–71.
90. Lachs MS, Nachamkin I, Edelstein PH, et al. Spectrum bias in the evaluation of diagnostic tests: lessons from the rapid dipstick test for urinary tract infection. Ann Intern Med 1992;117:135–40.
91. Moons KG, van Es GA, Deckers JW, et al. Limitations of sensitivity, specificity, likelihood ratio, and bayes’ theorem in assessing diagnostic probabilities: a clinical example. Epidemiology 1997;8:12–17.
92. O'Connor PW, Tansay CM, Detsky AS, et al. The effect of spectrum bias on the utility of magnetic resonance imaging and evoked potentials in the diagnosis of suspected multiple sclerosis. Neurology 1996;47:140–4.
93. Deckers JW, Rensing BJ, Tijssen JG, et al. A comparison of methods of analysing exercise tests for diagnosis of coronary artery disease. Br Heart J 1989;62:438–44.
94. Naraghi AM, Gupta S, Jacks LM, et al. Anterior cruciate ligament reconstruction: MR imaging signs of anterior knee laxity in the presence of an intact graft. Radiology 2012;263:802–10. doi:10.1148/radiol.12110779
95. Ashdown HF, D'Souza N, Karim D, et al. Pain over speed bumps in diagnosis of acute appendicitis: diagnostic accuracy study. BMJ 2012;345:e8012.
96. Leeflang MM, Rutjes AW, Reitsma JB, et al. Variation of a test's sensitivity and specificity with disease prevalence. CMAJ 2013;185:E537–544. doi:10.1503/cmaj.121286
97. Rajaram S, Swift AJ, Capener D, et al. Lung morphology assessment with balanced steady-state free precession MR imaging compared with CT. Radiology 2012;263:569–77. doi:10.1148/radiol.12110990
98. Lang TA, Secic M. Generalizing from a sample to a population: reporting estimates and confidence intervals. Philadelphia: American College of Physicians, 1997.
99. Ioannidis JP, Evans SJ, Gotzsche PC, et al. Better reporting of harms in randomized trials: an extension of the CONSORT statement. Ann Intern Med 2004;141:781–8.
100. Ioannidis JP, Lau J. Completeness of safety reporting in randomized trials: an evaluation of 7 medical areas. JAMA 2001;285:437–43.
101. Park SH, Lee JH, Lee SS, et al. CT colonography for detection and characterisation of synchronous proximal colonic lesions in patients with stenosing colorectal cancer. Gut 2012;61:1716–22. doi:10.1136/gutjnl-2011-301135
102. Irwig LM, Bossuyt PM, Glasziou PP, et al. Designing studies to ensure that estimates of test accuracy will travel. In: Knottnerus JA, ed. The evidence base of clinical diagnosis. London: BMJ Publishing Group, 2002:95–116.
103. Ter Riet G, Chesley P, Gross AG, et al. All that glitters isn't gold: a survey on acknowledgment of limitations in biomedical studies. PLoS ONE 2013;8:e73623. doi:10.1371/journal.pone.0073623
104. Ioannidis JP. Limitations are not properly acknowledged in the scientific literature. J Clin Epidemiol 2007;60:324–9. doi:10.1016/j.jclinepi.2006.09.011
105. Lord SJ, Irwig L, Simes RJ. When is measuring sensitivity and specificity sufficient to evaluate a diagnostic test, and when do we need randomized trials? Ann Intern Med 2006;144:850–5.
106. Pewsner D, Battaglia M, Minder C, et al. Ruling a diagnosis in or out with ‘SpPIn’ and ‘SnNOut’: a note of caution. BMJ 2004;329:209–13. doi:10.1136/bmj.329.7459.209
107. Foerch C, Niessner M, Back T, et al. Diagnostic accuracy of plasma glial fibrillary acidic protein for differentiating intracerebral hemorrhage and cerebral ischemia in patients with symptoms of acute stroke. Clin Chem 2012;58:237–45. doi:10.1373/clinchem.2011.172676
108. Altman DG. The time has come to register diagnostic and prognostic research. Clin Chem 2014;60:580–2. doi:10.1373/clinchem.2013.220335
109. Hooft L, Bossuyt PM. Prospective registration of marker evaluation studies: time to act. Clin Chem 2011;57:1684–6. doi:10.1373/clinchem.2011.176230
110. Rifai N, Altman DG, Bossuyt PM. Reporting bias in diagnostic and prognostic studies: time for action. Clin Chem 2008;54:1101–3. doi:10.1373/clinchem.2008.108993
111. Korevaar DA, Ochodo EA, Bossuyt PM, et al. Publication and reporting of test accuracy studies registered in ClinicalTrials.gov. Clin Chem 2014;60:651–9. doi:10.1373/clinchem.2013.218149
112. Rifai N, Bossuyt PM, Ioannidis JP, et al. Registering diagnostic and prognostic trials of tests: is it the right thing to do? Clin Chem 2014;60:1146–52. doi:10.1373/clinchem.2014.226100
113. Korevaar DA, Bossuyt PM, Hooft L. Infrequent and incomplete registration of test accuracy studies: analysis of recent study reports. BMJ Open 2014;4:e004596. doi:10.1136/bmjopen-2013-004596
114. Leeuwenburgh MM, Wiarda BM, Wiezer MJ, et al. Comparison of imaging strategies with conditional contrast-enhanced CT and unenhanced MR imaging in patients suspected of having appendicitis: a multicenter diagnostic performance study. Radiology 2013;268:135–43. doi:10.1148/radiol.13121753
115. Chan AW, Song F, Vickers A, et al. Increasing value and reducing waste: addressing inaccessible research. Lancet 2014;383:257–66. doi:10.1016/S0140-6736(13)62296-5
116. Stewart CM, Schoeman SA, Booth RA, et al. Assessment of self taken swabs versus clinician taken swab cultures for diagnosing gonorrhoea in women: single centre, diagnostic accuracy study. BMJ 2012;345:e8107.
117. Sismondo S. Pharmaceutical company funding and its consequences: a qualitative systematic review. Contemp Clin Trials 2008;29:109–13. doi:10.1016/j.cct.2007.08.001


Footnotes

JFC and DAK contributed equally to this manuscript and share first authorship.
Contributors JFC, DAK and PMMB are responsible for drafting of manuscript. DGA, DEB, CAG, LH, LI, DL, JBR and HCWdV are responsible for critical revision of manuscript.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement No additional data are available.

OpenUrl CrossRef PubMed Web of Science

[62] ↵
Stiell IG,
Greenberg GH,
Wells GA, et al
. Derivation of a decision rule for the use of radiography in acute knee injuries. Ann Emerg Med. 1995;26:405–13.
OpenUrl CrossRef PubMed Web of Science

[63] Stiell IG,

[64] Greenberg GH,

[65] Wells GA, et al

[66] ↵
Horvath AR,
Lord SJ,
StJohn A, et al
. From biomarkers to medical tests: the changing landscape of test evaluation. Clin Chim Acta 2014;427:49–57.
OpenUrl CrossRef Web of Science

[67] Horvath AR,

[68] Lord SJ,

[69] StJohn A, et al

[70] ↵
Bossuyt PM,
Irwig L,
Craig J, et al
. Comparative accuracy: assessing new tests against existing diagnostic pathways. BMJ 2006;332:1089–92.
OpenUrl FREE Full Text

[71] Bossuyt PM,

[72] Irwig L,

[73] Craig J, et al

[74] ↵
Gieseker KE,
Roe MH,
MacKenzie T, et al
. Evaluating the American Academy of Pediatrics diagnostic standard for Streptococcus pyogenes pharyngitis: backup culture versus repeat rapid antigen testing. Pediatrics 2003;111(6 Pt 1):e666–70.
OpenUrl Abstract/FREE Full Text

[75] Gieseker KE,

[76] Roe MH,

[77] MacKenzie T, et al

[78] ↵
Tanz RR,
Gerber MA,
Kabat W, et al
. Performance of a rapid antigen-detection test and throat culture in community pediatric offices: implications for management of pharyngitis. Pediatrics 2009;123:437–44.
OpenUrl Abstract/FREE Full Text

[79] Tanz RR,

[80] Gerber MA,

[81] Kabat W, et al

[82] ↵
Ochodo EA,
de Haan MC,
Reitsma JB, et al
. Overinterpretation and misreporting of diagnostic accuracy studies: evidence of ‘spin’. Radiology 2013;267:581–8.
OpenUrl CrossRef PubMed Web of Science

[83] Ochodo EA,

[84] de Haan MC,

[85] Reitsma JB, et al

[86] ↵
Freer PE,
Niell B,
Rafferty EA
. Preoperative tomosynthesis-guided needle localization of mammographically and sonographically occult breast lesions. Radiology 2015;275:377–83.
OpenUrl

[87] Freer PE,

[88] Niell B,

[89] Rafferty EA

[90] ↵
Sorensen HT,
Sabroe S,
Olsen J
. A framework for evaluation of secondary data sources for epidemiological research. Int J Epidemiol 1996;25:435–42.
OpenUrl Abstract/FREE Full Text

[91] Sorensen HT,

[92] Sabroe S,

[93] Olsen J

[94] ↵
Geersing GJ,
Erkens PM,
Lucassen WA, et al
. Safe exclusion of pulmonary embolism using the Wells rule and qualitative D-dimer testing in primary care: prospective cohort study. BMJ 2012;345:e6564.
OpenUrl Abstract/FREE Full Text

[95] Geersing GJ,

[96] Erkens PM,

[97] Lucassen WA, et al

[98] ↵
Bomers MK,
van Agtmael MA,
Luik H, et al
. Using a dog's superior olfactory sensitivity to identify Clostridium difficile in stools and patients: proof of principle study. BMJ 2012;345:e7396.
OpenUrl Abstract/FREE Full Text

[99] Bomers MK,

[100] van Agtmael MA,

[101] Luik H, et al

[102] ↵
Philbrick JT,
Horwitz RI,
Feinstein AR
. Methodologic problems of exercise testing for coronary artery disease: groups, analysis and bias. Am J Cardiol 1980;46:807–12.
OpenUrl CrossRef PubMed Web of Science

[103] Philbrick JT,

[104] Horwitz RI,

[105] Feinstein AR

[106] ↵
Rutjes AW,
Reitsma JB,
Vandenbroucke JP, et al
. Case-control and two-gate designs in diagnostic accuracy studies. Clin Chem 2005;51:1335–41. doi:10.1373/clinchem.2005.048595
OpenUrl Abstract/FREE Full Text

[107] Rutjes AW,

[108] Reitsma JB,

[109] Vandenbroucke JP, et al

[110] ↵
Rutjes AW,
Reitsma JB,
Di Nisio M, et al
. Evidence of bias and variation in diagnostic accuracy studies. CMAJ 2006;174:469–76. doi:10.1503/cmaj.050090
OpenUrl Abstract/FREE Full Text

[111] Rutjes AW,

[112] Reitsma JB,

[113] Di Nisio M, et al

[114] ↵
Knottnerus JA,
Muris JW
. Assessment of the accuracy of diagnostic tests: the cross-sectional study. J Clin Epidemiol 2003;56:1118–28.
OpenUrl CrossRef PubMed Web of Science

[115] Knottnerus JA,

[116] Muris JW

[117] ↵
van der Schouw YT,
Van Dijk R,
Verbeek AL
. Problems in selecting the adequate patient population from existing data files for assessment studies of new diagnostic tests. J Clin Epidemiol1995;48:417–22.
OpenUrl CrossRef PubMed Web of Science

[118] van der Schouw YT,

[119] Van Dijk R,

[120] Verbeek AL

[121] ↵
Leeflang MM,
Bossuyt PM,
Irwig L
. Diagnostic test accuracy may vary with prevalence: implications for evidence-based diagnosis. J Clin Epidemiol 2009;62:5–12. doi:10.1016/j.jclinepi.2008.04.007
OpenUrl CrossRef PubMed Web of Science

[122] Leeflang MM,

[123] Bossuyt PM,

[124] Irwig L

[125] ↵
Attia M,
Zaoutis T,
Eppes S, et al
. Multivariate predictive models for group A beta-hemolytic streptococcal pharyngitis in children. Acad Emerg Med 1999;6:8–13.
OpenUrl PubMed Web of Science

[126] Attia M,

[127] Zaoutis T,

[128] Eppes S, et al

[129] ↵
Knottnerus JA,
Knipschild PG,
Sturmans F
. Symptoms and selection bias: the influence of selection towards specialist care on the relationship between symptoms and diagnoses. Theor Med 1989;10:67–81.
OpenUrl CrossRef PubMed

[130] Knottnerus JA,

[131] Knipschild PG,

[132] Sturmans F

[133] ↵
Knottnerus JA,
Leffers P
. The influence of referral patterns on the characteristics of diagnostic tests. J Clin Epidemiol 1992;45:1143–54.
OpenUrl CrossRef PubMed Web of Science

[134] Knottnerus JA,

[135] Leffers P

[136] ↵
Melbye H,
Straume B
. The spectrum of patients strongly influences the usefulness of diagnostic tests for pneumonia. Scand J Prim Health Care 1993;11:241–6.
OpenUrl PubMed

[137] Melbye H,

[138] Straume B

[139] ↵
Ezike EN,
Rongkavilit C,
Fairfax MR, et al
. Effect of using 2 throat swabs vs 1 throat swab on detection of group A streptococcus by a rapid antigen detection test. Arch Pediatr Adolesc Med 2005;159:486–90.
OpenUrl CrossRef PubMed Web of Science

[140] Ezike EN,

[141] Rongkavilit C,

[142] Fairfax MR, et al

[143] ↵
Rosjo H,
Kravdal G,
Hoiseth AD, et al
. Troponin I measured by a high-sensitivity assay in patients with suspected reversible myocardial ischemia: data from the Akershus Cardiac Examination (ACE) 1 study. Clin Chem 2012;58:1565–73. doi:10.1373/clinchem.2012.190868
OpenUrl Abstract/FREE Full Text

[144] Rosjo H,

[145] Kravdal G,

[146] Hoiseth AD, et al

[147] ↵
Irwig L,
Bossuyt P,
Glasziou P, et al
. Designing studies to ensure that estimates of test accuracy are transferable. BMJ 2002;324:669–71.
OpenUrl FREE Full Text

[148] Irwig L,

[149] Bossuyt P,

[150] Glasziou P, et al

[151] ↵
Detrano R,
Gianrossi R,
Froelicher V
. The diagnostic accuracy of the exercise electrocardiogram: a meta-analysis of 22 years of research. Prog Cardiovasc Dis 1989;32:173–206.
OpenUrl CrossRef PubMed Web of Science

[152] Detrano R,

[153] Gianrossi R,

[154] Froelicher V

[155] ↵
Brealey S,
Scally AJ
. Bias in plain film reading performance studies. Br J Radiol 2001;74:307–16. doi:10.1259/bjr.74.880.740307
OpenUrl Abstract/FREE Full Text

[156] Brealey S,

[157] Scally AJ

[158] ↵
Elmore JG,
Wells CK,
Lee CH, et al
. Variability in radiologists’ interpretations of mammograms. N Engl J Med 1994;331:1493–9. doi:10.1056/NEJM199412013312206
OpenUrl CrossRef PubMed Web of Science

[159] Elmore JG,

[160] Wells CK,

[161] Lee CH, et al

[162] ↵
Ronco G,
Montanari G,
Aimone V, et al
. Estimating the sensitivity of cervical cytology: errors of interpretation and test limitations. Cytopathology 1996;7:151–8.
OpenUrl CrossRef PubMed Web of Science

[163] Ronco G,

[164] Montanari G,

[165] Aimone V, et al

[166] ↵
Cohen MB,
Rodgers RP,
Hales MS, et al
. Influence of training and experience in fine-needle aspiration biopsy of breast. Receiver operating characteristics curve analysis. Arch Pathol Lab Med 1987;111:518–20.
OpenUrl PubMed Web of Science

[167] Cohen MB,

[168] Rodgers RP,

[169] Hales MS, et al

[170] ↵
Fox JW,
Cohen DM,
Marcon MJ, et al
. Performance of rapid streptococcal antigen testing varies by personnel. J Clin Microbiol 2006;44:3918–22. doi:10.1128/JCM.01399-06
OpenUrl Abstract/FREE Full Text

[171] Fox JW,

[172] Cohen DM,

[173] Marcon MJ, et al

[174] ↵
Gandy M,
Sharpe L,
Perry KN, et al
. Assessing the efficacy of 2 screening measures for depression in people with epilepsy. Neurology 2012;79:371–5. doi:10.1212/WNL.0b013e318260cbfc
OpenUrl CrossRef

[175] Gandy M,

[176] Sharpe L,

[177] Perry KN, et al

[178] ↵
Stegeman I,
de Wijkerslooth TR,
Stoop EM, et al
. Combining risk factors with faecal immunochemical test outcome for selecting CRC screenees for colonoscopy. Gut 2014;63:466–71. doi:10.1136/gutjnl-2013-305013
OpenUrl Abstract/FREE Full Text

[179] Stegeman I,

[180] de Wijkerslooth TR,

[181] Stoop EM, et al

[182] ↵
Leeflang MM,
Moons KG,
Reitsma JB, et al
. Bias in sensitivity and specificity caused by data-driven selection of optimal cutoff values: mechanisms, magnitude, and solutions. Clin Chem 2008;54:729–37. doi:10.1373/clinchem.2007.096032
OpenUrl Abstract/FREE Full Text

[183] Leeflang MM,

[184] Moons KG,

[185] Reitsma JB, et al

[186] ↵
Ewald B
. Post hoc choice of cut points introduced bias to diagnostic research. J Clin Epidemiol 2006;59:798–801. doi:10.1016/j.jclinepi.2005.11.025
OpenUrl CrossRef PubMed Web of Science

[187] Ewald B

[188] ↵
Justice AC,
Covinsky KE,
Berlin JA
. Assessing the generalizability of prognostic information. Ann Intern Med 1999;130:515–24.
OpenUrl CrossRef PubMed Web of Science

[189] Justice AC,

[190] Covinsky KE,

[191] Berlin JA

[192] ↵
Harrell FE Jr.,
Lee KL,
Mark DB
. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 1996;15:361–87. doi:10.1002/(SICI)1097-0258(19960229)15:4<361::AID-SIM168>3.0.CO;2-4
OpenUrl CrossRef PubMed Web of Science

[193] Harrell FE Jr.,

[194] Lee KL,

[195] Mark DB

[196] ↵
Hodgdon T,
McInnes MD,
Schieda N, et al
. Can quantitative CT texture analysis be used to differentiate fat-poor renal angiomyolipoma from renal cell carcinoma on unenhanced CT images? Radiology 2015;276:787–96. doi:10.1148/radiol.2015142215
OpenUrl

[197] Hodgdon T,

[198] McInnes MD,

[199] Schieda N, et al

[200] ↵
Begg CB
. Biases in the assessment of diagnostic tests. Stat Med 1987;6:411–23.
OpenUrl CrossRef PubMed Web of Science

[201] Begg CB

[202] ↵
Doubilet P,
Herman PG
. Interpretation of radiographs: effect of clinical history. AJR Am J Roentgenol 1981;137:1055–8. doi:10.2214/ajr.137.5.1055
OpenUrl PubMed Web of Science

[203] Doubilet P,

[204] Herman PG

[205] ↵
D'Orsi CJ,
Getty DJ,
Pickett RM, et al
. Stereoscopic digital mammography: improved specificity and reduced rate of recall in a prospective clinical trial. Radiology 2013;266:81–8. doi:10.1148/radiol.12120382
OpenUrl CrossRef PubMed

[206] D'Orsi CJ,

[207] Getty DJ,

[208] Pickett RM, et al

[209] ↵
Knottnerus JA,
Buntinx F
. The evidence base of clinical diagnosis: theory and methods of diagnostic research. 2nd edn. BMJ Books, 2008.

[210] Knottnerus JA,

[211] Buntinx F

[212] ↵
Pepe M
. Study design and hypothesis testing. The statistical evaluation of medical tests for classification and prediction. Oxford, UK: Oxford University Press, 2003:214–51.

[213] Pepe M

[214] ↵
Hayen A,
Macaskill P,
Irwig L, et al
. Appropriate statistical methods are required to assess diagnostic tests for replacement, add-on, and triage. J Clin Epidemiol 2010;63:883–91. doi:10.1016/j.jclinepi.2009.08.024
OpenUrl CrossRef PubMed Web of Science

[215] Hayen A,

[216] Macaskill P,

[217] Irwig L, et al

[218] ↵
Garcia Pena BM,
Mandl KD,
Kraus SJ, et al
. Ultrasonography and limited computed tomography in the diagnosis and management of appendicitis in children. JAMA 1999;282:1041–6.
OpenUrl CrossRef PubMed Web of Science

[219] Garcia Pena BM,

[220] Mandl KD,

[221] Kraus SJ, et al

[222] ↵
Simel DL,
Feussner JR,
DeLong ER, et al
. Intermediate, indeterminate, and uninterpretable diagnostic test results. Med Decis Making 1987;7:107–14.
OpenUrl Abstract/FREE Full Text

[223] Simel DL,

[224] Feussner JR,

[225] DeLong ER, et al

[226] ↵
Philbrick JT,
Horwitz RI,
Feinstein AR, et al
. The limited spectrum of patients studied in exercise test research. Analyzing the tip of the iceberg. JAMA 1982;248:2467–70.
OpenUrl CrossRef PubMed Web of Science

[227] Philbrick JT,

[228] Horwitz RI,

[229] Feinstein AR, et al

[230] ↵
Begg CB,
Greenes RA,
Iglewicz B
. The influence of uninterpretability on the assessment of diagnostic tests. J Chronic Dis 1986;39:575–84.
OpenUrl CrossRef PubMed Web of Science

[231] Begg CB,

[232] Greenes RA,

[233] Iglewicz B

[234] ↵
Shinkins B,
Thompson M,
Mallett S, et al
. Diagnostic accuracy studies: how to report and analyse inconclusive test results. BMJ 2013;346:f2778.
OpenUrl FREE Full Text

[235] Shinkins B,

[236] Thompson M,

[237] Mallett S, et al

[238] ↵
Pisano ED,
Fajardo LL,
Tsimikas J, et al
. Rate of insufficient samples for fine-needle aspiration for nonpalpable breast lesions in a multicenter clinical trial: the Radiologic Diagnostic Oncology Group 5 Study. The RDOG5 investigators. Cancer 1998;82:679–88.
OpenUrl CrossRef PubMed Web of Science

[239] Pisano ED,

[240] Fajardo LL,

[241] Tsimikas J, et al

[242] ↵
Giard RW,
Hermans J
. The value of aspiration cytologic examination of the breast. A statistical review of the medical literature. Cancer 1992;69:2104–10.
OpenUrl CrossRef PubMed Web of Science

[243] Giard RW,

[244] Hermans J

[245] ↵
Investigators P
. Value of the ventilation/perfusion scan in acute pulmonary embolism. Results of the prospective investigation of pulmonary embolism diagnosis (PIOPED). JAMA 1990;263:2753–9.
OpenUrl CrossRef PubMed Web of Science

[246] Investigators P

[247] ↵
Min JK,
Leipsic J,
Pencina MJ, et al
. Diagnostic accuracy of fractional flow reserve from anatomic CT angiography. JAMA 2012;308:1237–45. doi:10.1001/2012.jama.11274
OpenUrl CrossRef PubMed Web of Science

[248] Min JK,

[249] Leipsic J,

[250] Pencina MJ, et al

[251] ↵
Naaktgeboren CA,
de Groot JA,
Rutjes AW, et al
. Anticipating missing reference standard data when planning diagnostic accuracy studies. BMJ 2016;352:i402.
OpenUrl FREE Full Text

[252] Naaktgeboren CA,

[253] de Groot JA,

[254] Rutjes AW, et al

[255] ↵
van der Heijden GJ,
Donders AR,
Stijnen T, et al
. Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. J Clin Epidemiol 2006;59:1102–9. doi:10.1016/j.jclinepi.2006.01.015
OpenUrl CrossRef PubMed Web of Science

[256] van der Heijden GJ,

[257] Donders AR,

[258] Stijnen T, et al

[259] ↵
de Groot JA,
Bossuyt PM,
Reitsma JB, et al
. Verification problems in diagnostic accuracy studies: consequences and solutions. BMJ 2011;343:d4770.
OpenUrl FREE Full Text

[260] de Groot JA,

[261] Bossuyt PM,

[262] Reitsma JB, et al

[263] ↵
Pons B,
Lautrette A,
Oziel J, et al
. Diagnostic accuracy of early urinary index changes in differentiating transient from persistent acute kidney injury in critically ill patients: multicenter cohort study. Crit Care 2013;17:R56. doi:10.1186/cc12582
OpenUrl CrossRef PubMed

[264] Pons B,

[265] Lautrette A,

[266] Oziel J, et al

[267] ↵
Sun X,
Ioannidis JP,
Agoritsas T, et al
. How to use a subgroup analysis: users’ guide to the medical literature. JAMA 2014;311:405–11. doi:10.1001/jama.2013.285063
OpenUrl CrossRef PubMed Web of Science

[268] Sun X,

[269] Ioannidis JP,

[270] Agoritsas T, et al

[271] ↵
Zalis ME,
Blake MA,
Cai W, et al
. Diagnostic accuracy of laxative-free computed tomographic colonography for detection of adenomatous polyps in asymptomatic adults: a prospective evaluation. Ann Intern Med 2012;156:692–702. doi:10.7326/0003-4819-156-10-201205150-00005
OpenUrl CrossRef PubMed

[272] Zalis ME,

[273] Blake MA,

[274] Cai W, et al

[275] ↵
Flahault A,
Cadilhac M,
Thomas G
. Sample size calculation should be performed for design accuracy in diagnostic test studies. J Clin Epidemiol 2005;58:859–62. doi:10.1016/j.jclinepi.2004.12.009
OpenUrl CrossRef PubMed Web of Science

[276] Flahault A,

[277] Cadilhac M,

[278] Thomas G

[279] ↵
Pepe MS
. The statistical evaluation of medical tests for classification and prediction. Oxford, New York: Oxford University Press, 2003.

[280] Pepe MS

[281] ↵
Vach W,
Gerke O,
Hoilund-Carlsen PF
. Three principles to define the success of a diagnostic study could be identified. J Clin Epidemiol 2012;65:293–300. doi:10.1016/j.jclinepi.2011.07.004
OpenUrl PubMed

[282] Vach W,

[283] Gerke O,

[284] Hoilund-Carlsen PF

[285] ↵
Bachmann LM,
Puhan MA,
ter Riet G, et al
. Sample sizes of studies on diagnostic accuracy: literature survey. BMJ 2006;332:1127–9. doi:10.1136/bmj.38793.637789.2F
OpenUrl Abstract/FREE Full Text

[286] Bachmann LM,

[287] Puhan MA,

[288] ter Riet G, et al

[289] ↵
Bochmann F,
Johnson Z,
Azuara-Blanco A
. Sample size in studies on diagnostic accuracy in ophthalmology: a literature survey. Br J Ophthalmol 2007;91:898–900. doi:10.1136/bjo.2006.113290
OpenUrl Abstract/FREE Full Text

[290] Bochmann F,

[291] Johnson Z,

[292] Azuara-Blanco A

[293] ↵
Collins MG,
Teo E,
Cole SR, et al
. Screening for colorectal cancer and advanced colorectal neoplasia in kidney transplant recipients: cross sectional prevalence and diagnostic accuracy study of faecal immunochemical testing for haemoglobin and colonoscopy. BMJ 2012;345:e4657.
OpenUrl Abstract/FREE Full Text

[294] Collins MG,

[295] Teo E,

[296] Cole SR, et al

[297] ↵
Cecil MP,
Kosinski AS,
Jones MT, et al
. The importance of work-up (verification) bias correction in assessing the accuracy of SPECT thallium-201 testing for the diagnosis of coronary artery disease. J Clin Epidemiol 1996;49:735–42.
OpenUrl CrossRef PubMed Web of Science

[298] Cecil MP,

[299] Kosinski AS,

[300] Jones MT, et al

[301] ↵
Choi BC
. Sensitivity and specificity of a single diagnostic test in the presence of work-up bias. J Clin Epidemiol 1992;45:581–6.
OpenUrl CrossRef PubMed Web of Science

[302] Choi BC

[303] ↵
Diamond GA
. Off Bayes: effect of verification bias on posterior probabilities calculated using Bayes’ theorem. Med Decis Making 1992;12:22–31.
OpenUrl Abstract/FREE Full Text

[304] Diamond GA

[305] ↵
Diamond GA,
Rozanski A,
Forrester JS, et al
. A model for assessing the sensitivity and specificity of tests subject to selection bias. Application to exercise radionuclide ventriculography for diagnosis of coronary artery disease. J Chronic Dis 1986;39:343–55.
OpenUrl CrossRef PubMed Web of Science

[306] Diamond GA,

[307] Rozanski A,

[308] Forrester JS, et al

[309] ↵
Greenes RA,
Begg CB
. Assessment of diagnostic technologies. Methodology for unbiased estimation from samples of selectively verified patients. Invest Radiol 1985;20:751–6.
OpenUrl CrossRef PubMed Web of Science

[310] Greenes RA,

[311] Begg CB

[312] ↵
Ransohoff DF,
Feinstein AR
. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med 1978;299:926–30. doi:10.1056/NEJM197810262991705
OpenUrl CrossRef PubMed Web of Science

[313] Ransohoff DF,

[314] Feinstein AR

[315] ↵
Zhou XH
. Effect of verification bias on positive and negative predictive values. Stat Med 1994;13:1737–45.
OpenUrl CrossRef PubMed Web of Science

[316] Zhou XH

[317] ↵
Kok L,
Elias SG,
Witteman BJ, et al
. Diagnostic accuracy of point-of-care fecal calprotectin and immunochemical occult blood tests for diagnosis of organic bowel disease in primary care: the Cost-Effectiveness of a Decision Rule for Abdominal Complaints in Primary Care (CEDAR) study. Clin Chem 2012;58:989–98. doi:10.1373/clinchem.2011.177980
OpenUrl Abstract/FREE Full Text

[318] Kok L,

[319] Elias SG,

[320] Witteman BJ, et al

[321] ↵
Harris JM Jr.
. The hazards of bedside Bayes. JAMA 1981;246:2602–5.
OpenUrl CrossRef PubMed Web of Science

[322] Harris JM Jr.

[323] ↵
Hlatky MA,
Pryor DB,
Harrell FE Jr., et al
. Factors affecting sensitivity and specificity of exercise electrocardiography. Multivariable analysis. Am J Med 1984;77:64–71.
OpenUrl CrossRef PubMed Web of Science

[324] Hlatky MA,

[325] Pryor DB,

[326] Harrell FE Jr., et al

[327] ↵
Lachs MS,
Nachamkin I,
Edelstein PH, et al
. Spectrum bias in the evaluation of diagnostic tests: lessons from the rapid dipstick test for urinary tract infection. Ann Intern Med 1992;117:135–40.
OpenUrl CrossRef PubMed Web of Science

[328] Lachs MS,

[329] Nachamkin I,

[330] Edelstein PH, et al

[331] ↵
Moons KG,
van Es GA,
Deckers JW, et al
. Limitations of sensitivity, specificity, likelihood ratio, and bayes’ theorem in assessing diagnostic probabilities: a clinical example. Epidemiology 1997;8:12–17.
OpenUrl CrossRef PubMed Web of Science

[332] Moons KG,

[333] van Es GA,

[334] Deckers JW, et al

[335] ↵
O'Connor PW,
Tansay CM,
Detsky AS, et al
. The effect of spectrum bias on the utility of magnetic resonance imaging and evoked potentials in the diagnosis of suspected multiple sclerosis. Neurology 1996;47:140–4.
OpenUrl

[336] O'Connor PW,

[337] Tansay CM,

[338] Detsky AS, et al

[339] ↵
Deckers JW,
Rensing BJ,
Tijssen JG, et al
. A comparison of methods of analysing exercise tests for diagnosis of coronary artery disease. Br Heart J 1989;62:438–44.
OpenUrl Abstract/FREE Full Text

[340] Deckers JW,

[341] Rensing BJ,

[342] Tijssen JG, et al

[343] ↵
Naraghi AM,
Gupta S,
Jacks LM, et al
. Anterior cruciate ligament reconstruction: MR imaging signs of anterior knee laxity in the presence of an intact graft. Radiology 2012;263:802–10. doi:10.1148/radiol.12110779
OpenUrl CrossRef PubMed

[344] Naraghi AM,

[345] Gupta S,

[346] Jacks LM, et al

[347] ↵
Ashdown HF,
D'Souza N,
Karim D, et al
. Pain over speed bumps in diagnosis of acute appendicitis: diagnostic accuracy study. BMJ 2012;345:e8012.
OpenUrl Abstract/FREE Full Text

[348] Ashdown HF,

[349] D'Souza N,

[350] Karim D, et al

[351] ↵
Leeflang MM,
Rutjes AW,
Reitsma JB, et al
. Variation of a test's sensitivity and specificity with disease prevalence. CMAJ 2013;185:E537–544. doi:10.1503/cmaj.121286
OpenUrl Abstract/FREE Full Text

[352] Leeflang MM,

[353] Rutjes AW,

[354] Reitsma JB, et al

[355] ↵
Rajaram S,
Swift AJ,
Capener D, et al
. Lung morphology assessment with balanced steady-state free precession MR imaging compared with CT. Radiology 2012;263:569–77. doi:10.1148/radiol.12110990
OpenUrl CrossRef PubMed

[356] Rajaram S,

[357] Swift AJ,

[358] Capener D, et al

[359] ↵
Lang TA,
Secic M
. Generalizing from a sample to a population: reporting estimates and confidence intervals. Philadelphia: American College of Physicians, 1997.

[360] Lang TA,

[361] Secic M

[362] ↵
Ioannidis JP,
Evans SJ,
Gotzsche PC, et al
. Better reporting of harms in randomized trials: an extension of the CONSORT statement. Ann Intern Med 2004;141:781–8.
OpenUrl CrossRef PubMed Web of Science

[363] Ioannidis JP,

[364] Evans SJ,

[365] Gotzsche PC, et al

[366] ↵
Ioannidis JP,
Lau J
. Completeness of safety reporting in randomized trials: an evaluation of 7 medical areas. JAMA 2001;285:437–43.
OpenUrl CrossRef PubMed Web of Science

[367] Ioannidis JP,

[368] Lau J

[369] ↵
Park SH,
Lee JH,
Lee SS, et al
. CT colonography for detection and characterisation of synchronous proximal colonic lesions in patients with stenosing colorectal cancer. Gut 2012;61:1716–22. doi:10.1136/gutjnl-2011-301135
OpenUrl Abstract/FREE Full Text

[370] Park SH,

[371] Lee JH,

[372] Lee SS, et al

[373] ↵
Irwig LM,
Bossuyt PM,
Glasziou PP, et al
. Designing studies to ensure that estimates of test accuracy will travel. In: Knottnerus JA, ed. The evidence base of clinical diagnosis. London: BMJ Publishing Group, 2002:95–116.

[374] Irwig LM,

[375] Bossuyt PM,

[376] Glasziou PP, et al

[377] ↵
Ter Riet G,
Chesley P,
Gross AG, et al
. All that glitters isn't gold: a survey on acknowledgment of limitations in biomedical studies. PLoS ONE 2013;8:e73623. doi:10.1371/journal.pone.0073623
OpenUrl

[378] Ter Riet G,

[379] Chesley P,

[380] Gross AG, et al

[381] ↵
Ioannidis JP
. Limitations are not properly acknowledged in the scientific literature. J Clin Epidemiol 2007;60:324–9. doi:10.1016/j.jclinepi.2006.09.011
OpenUrl CrossRef PubMed Web of Science

[382] Ioannidis JP

[383] ↵
Lord SJ,
Irwig L,
Simes RJ
. When is measuring sensitivity and specificity sufficient to evaluate a diagnostic test, and when do we need randomized trials? Ann Intern Med 2006;144:850–5.
OpenUrl CrossRef PubMed Web of Science

[384] Lord SJ,

[385] Irwig L,

[386] Simes RJ

[387] ↵
Pewsner D,
Battaglia M,
Minder C, et al
. Ruling a diagnosis in or out with ‘SpPIn’ and ‘SnNOut’: a note of caution. BMJ 2004;329:209–13. doi:10.1136/bmj.329.7459.209
OpenUrl FREE Full Text

[388] Pewsner D,

[389] Battaglia M,

[390] Minder C, et al

[391] ↵
Foerch C,
Niessner M,
Back T, et al
. Diagnostic accuracy of plasma glial fibrillary acidic protein for differentiating intracerebral hemorrhage and cerebral ischemia in patients with symptoms of acute stroke. Clin Chem 2012;58:237–45. doi:10.1373/clinchem.2011.172676
OpenUrl Abstract/FREE Full Text

[392] Foerch C,

[393] Niessner M,

[394] Back T, et al

[395] ↵
Altman DG
. The time has come to register diagnostic and prognostic research. Clin Chem 2014;60:580–2. doi:10.1373/clinchem.2013.220335
OpenUrl FREE Full Text

[396] Altman DG

[397] ↵
Hooft L,
Bossuyt PM
. Prospective registration of marker evaluation studies: time to act. Clin Chem 2011;57:1684–6. doi:10.1373/clinchem.2011.176230
OpenUrl FREE Full Text

[398] Hooft L,

[399] Bossuyt PM

[400] ↵
Rifai N,
Altman DG,
Bossuyt PM
. Reporting bias in diagnostic and prognostic studies: time for action. Clin Chem 2008;54: 1101–3. doi:10.1373/clinchem.2008.108993
OpenUrl FREE Full Text

[401] Rifai N,

[402] Altman DG,

[403] Bossuyt PM

[404] ↵
Korevaar DA,
Ochodo EA,
Bossuyt PM, et al
. Publication and reporting of test accuracy studies registered in ClinicalTrials.gov. Clin Chem 2014;60:651–9. doi:10.1373/clinchem.2013.218149
OpenUrl Abstract/FREE Full Text

[405] Korevaar DA,

[406] Ochodo EA,

[407] Bossuyt PM, et al

[408] ↵
Rifai N,
Bossuyt PM,
Ioannidis JP, et al
. Registering diagnostic and prognostic trials of tests: is it the right thing to do? Clin Chem 2014;60:1146–52. doi:10.1373/clinchem.2014.226100
OpenUrl FREE Full Text

[409] Rifai N,

[410] Bossuyt PM,

[411] Ioannidis JP, et al

[412] ↵
Korevaar DA,
Bossuyt PM,
Hooft L
. Infrequent and incomplete registration of test accuracy studies: analysis of recent study reports. BMJ Open 2014;4:e004596. doi:10.1136/bmjopen-2013-004596
OpenUrl Abstract/FREE Full Text

[413] Korevaar DA,

[414] Bossuyt PM,

[415] Hooft L

[416] ↵
Leeuwenburgh MM,
Wiarda BM,
Wiezer MJ, et al
. Comparison of imaging strategies with conditional contrast-enhanced CT and unenhanced MR imaging in patients suspected of having appendicitis: a multicenter diagnostic performance study. Radiology 2013;268:135–43. doi:10.1148/radiol.13121753
OpenUrl CrossRef PubMed

[417] Leeuwenburgh MM,

[418] Wiarda BM,

[419] Wiezer MJ, et al

[420] ↵
Chan AW,
Song F,
Vickers A, et al
. Increasing value and reducing waste: addressing inaccessible research. Lancet 2014;383:257–66. doi:10.1016/S0140-6736(13)62296-5
OpenUrl CrossRef PubMed Web of Science

[421] Chan AW,

[422] Song F,

[423] Vickers A, et al

[424] ↵
Stewart CM,
Schoeman SA,
Booth RA, et al
. Assessment of self taken swabs versus clinician taken swab cultures for diagnosing gonorrhoea in women: single centre, diagnostic accuracy study. BMJ 2012;345:e8107.
OpenUrl Abstract/FREE Full Text

[425] Stewart CM,

[426] Schoeman SA,

[427] Booth RA, et al

[428] ↵
Sismondo S
. Pharmaceutical company funding and its consequences: a qualitative systematic review. Contemp Clin Trials 2008;29:109–13. doi:10.1016/j.cct.2007.08.001
OpenUrl CrossRef PubMed Web of Science

[429] Sismondo S
