Evaluating the performance of artificial intelligence software for lung nodule detection on chest radiographs in a retrospective real-world UK population

Objectives Early identification of lung cancer on chest radiographs improves patient outcomes. Artificial intelligence (AI) tools may increase diagnostic accuracy and streamline this pathway. This study evaluated the performance of commercially available AI-based software trained to identify cancerous lung nodules on chest radiographs.
Design This retrospective study included primary care chest radiographs acquired in a UK centre. The software evaluated each radiograph independently and outputs were compared with two reference standards: (1) the radiologist report and (2) the diagnosis of cancer by multidisciplinary team decision. Failure analysis was performed by interrogating the software marker locations on radiographs.
Participants 5722 consecutive chest radiographs were included from 5592 patients (median age 59 years, 53.8% women, 1.6% prevalence of cancer).
Results Compared with radiologist reports for nodule detection, the software demonstrated sensitivity 54.5% (95% CI 44.2% to 64.4%), specificity 83.2% (82.2% to 84.1%), positive predictive value (PPV) 5.5% (4.6% to 6.6%) and negative predictive value (NPV) 99.0% (98.8% to 99.2%). Compared with cancer diagnosis, the software demonstrated sensitivity 60.9% (50.1% to 70.9%), specificity 83.3% (82.3% to 84.2%), PPV 5.6% (4.8% to 6.6%) and NPV 99.2% (99.0% to 99.4%). Normal or variant anatomy was misidentified as an abnormality in 69.9% of the 943 false positive cases.
Conclusions The software demonstrated considerable underperformance in this real-world patient cohort. Failure analysis suggested a lack of generalisability in the training and testing datasets as a potential factor. The low PPV carries the risk of over-investigation and limits the translation of the software to clinical practice. Our findings highlight the importance of training and testing software in representative datasets, with broader implications for the implementation of AI tools in imaging.
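The reported predictive values follow arithmetically from sensitivity, specificity and prevalence. A minimal Python sketch, using only the figures quoted above for the cancer-diagnosis reference standard, reproduces them and shows why the PPV is so low at 1.6% prevalence:

# Sketch: derive PPV/NPV from the sensitivity, specificity and prevalence
# reported in the abstract (cancer-diagnosis reference standard).
sensitivity = 0.609  # 60.9%
specificity = 0.833  # 83.3%
prevalence = 0.016   # 1.6% prevalence of cancer

# Bayes' rule: at low prevalence, even a modest false positive rate swamps
# the true positives, so PPV is small regardless of sensitivity.
ppv = (sensitivity * prevalence) / (
    sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
npv = (specificity * (1 - prevalence)) / (
    specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)
print(f"PPV = {ppv:.1%}, NPV = {npv:.1%}")  # PPV = 5.6%, NPV = 99.2%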

REVIEWER
Lim, Gilbert, National University of Singapore

GENERAL COMMENTS
1. The relevancy/significance of the two reference standards (i.e. radiologist report, MDT cancer diagnosis) might be explained further. Are both reference standards of practical clinical utility? Moreover, a brief analysis of these two reference standards against each other might be considered if possible, to provide additional context.
2. The relative performance as achieved in this study, compared with the original external validation performance, appears to be broadly comparable (Table 1), in the sense that both the sensitivity and FPPI values are within the original external validation range. The summary description of the ALND software as "underperformance" might thus be further justified (e.g. against existing clinical standards).

REVIEWER
Jha, Saurabh, University of Pennsylvania
REVIEW RETURNED: 12-Sep-2023

GENERAL COMMENTS
The authors have answered my concerns about the paper. The paper would benefit from an accompanying editorial.

VERSION 1 - AUTHOR RESPONSE
Reviewer: 1
Dr. Gilbert Lim, National University of Singapore
Comments to the Author: We thank the authors for largely addressing our previous comments. A couple of minor points remain.
We thank Dr Lim for their review and pertinent comments.
1. The relevancy/significance of the two reference standards (i.e. radiologist report, MDT cancer diagnosis) might be explained further. Are both reference standards of practical clinical utility? Moreover, a brief analysis of these two reference standards against each other might be considered if possible, to provide additional context.
As touched upon in our discussion of the study limitations, choosing appropriate reference standards for benchmarking AI tools is not always straightforward. We believe that the two reference standards used here are complementary and a strength of this study.
Expert radiologist opinion is a very common reference standard for diagnostic medical imaging AI tools and is an important benchmark when considering whether an AI tool can be implemented clinically ('is the AI software as good as the person currently carrying out the task?'). This is often a feasible comparison and one we have made by comparing the software outputs against the formal clinical reports for the radiographs.
However, radiologists sometimes disagree and make errors, so this reference standard has flaws.
What is more challenging is determining whether an AI tool is actually detecting cancer. There are various reference standards that could be used (e.g. serial follow-up CT, biopsy results). Our use of an MDT cancer diagnosis reflects real-life practice: MDT diagnosis involves expert opinion and integrates clinical history with imaging and histopathology. While it has flaws, it is what we currently use for cancer diagnosis and therefore made a useful reference standard.
We have reworded some of the discussion to explore these points.
2. The relative performance as achieved in this study, compared with the original external validation performance, appears to be broadly comparable (Table 1), in the sense that both the sensitivity and FPPI values are within the original external validation range. The summary description of the ALND software as "underperformance" might thus be further justified (e.g. against existing clinical standards).
We agree that the software's sensitivity and FPPI values are comparable to those reported in the previous external validation publication. However, we believe that our study highlights the difference between model performance and clinical performance.
The model performance was as previously published. However, the model was trained and tested on enriched datasets and, when applied to our real-world patient population, its PPV was too low to be of clinical utility. We highlight that, based on this cohort, use of the software for its intended purpose may have resulted in around 6 times as many CT scans with no improvement in the number of cancers detected. The "underperformance" we describe is therefore clinical and refers to the software's intended use.
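As a rough illustration of the workload argument, the flagged-case count can be reconstructed from the figures in the abstract. The sketch below is an approximation (it treats prevalence as per radiograph and assumes every software-flagged radiograph triggers a CT referral); it does not reproduce the exact "around 6 times" figure, which also depends on the baseline radiologist referral rate:

# Approximate reconstruction of the CT-workload implication from the
# abstract figures; an illustration, not the authors' exact calculation.
n = 5722             # radiographs (prevalence treated as per radiograph)
prevalence = 0.016   # 1.6% prevalence of cancer
sensitivity = 0.609  # vs cancer diagnosis
specificity = 0.833

cancers = n * prevalence                       # ~92 cancers
true_pos = sensitivity * cancers               # ~56 cancers flagged
false_pos = (1 - specificity) * (n - cancers)  # ~940 flagged in error
flagged = true_pos + false_pos                 # ~996 putative CT referrals
print(f"~{flagged:.0f} CTs to detect ~{true_pos:.0f} cancers "
      f"(~{flagged / true_pos:.0f} CTs per cancer detected)")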