Abstract
Objective To develop and validate a deep learning (DL) system for referable diabetic retinopathy (DR) detection based on a real-world screening guideline.
Design This is a multicentre platform development study based on retrospective, cross-sectional data sets. Images were labelled by two-level certificated graders as the ground truth. According to the UK DR screening guideline, a DL model based on colour retinal images with five-dimensional classifiers, namely image quality, retinopathy, maculopathy gradability, maculopathy and photocoagulation, was developed. Referable decisions were generated by integrating the outputs of all classifiers and reported at the image, eye and patient level. The performance of the DL system was compared with that of DR experts.
Setting DR screening programmes from three hospitals and the Lifeline Express Diabetic Retinopathy Screening Program in China.
Participants 83 465 images of 39 836 eyes from 21 716 patients were annotated, of which 53 211 images were used as the development set and 30 254 images were used as the external validation set, split based on centre and period.
Main outcomes Accuracy, F1 score, sensitivity, specificity, area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), Cohen’s unweighted κ and Gwet’s AC1 were calculated to evaluate the performance of the DL algorithm.
Results In the external validation set, the five classifiers achieved an accuracy of 0.915–0.980, F1 score of 0.682–0.966, sensitivity of 0.917–0.978, specificity of 0.907–0.981, AUROC of 0.9639–0.9944 and AUPRC of 0.7504–0.9949. Referable DR at three levels was detected with an accuracy of 0.918–0.967, F1 score of 0.822–0.918, sensitivity of 0.970–0.971, specificity of 0.905–0.967, AUROC of 0.9848–0.9931 and AUPRC of 0.9527–0.9760. With reference to the ground truth, the DL system showed comparable performance (Cohen’s κ: 0.86–0.93; Gwet’s AC1: 0.89–0.94) with three DR experts (Cohen’s κ: 0.89–0.96; Gwet’s AC1: 0.91–0.97) in detecting referable lesions.
Conclusions The automatic DL system for detection of referable DR based on the UK guideline could achieve high accuracy in multidimensional classifications. It is suitable for large-scale, real-world DR screening.
- diabetic retinopathy
- vitreoretinal
- medical retina
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Strengths and limitations of this study
- The data set in this study was constructed from multiple centres using different devices to aid generalisation.
- The five-dimensional classifiers, namely image quality, retinopathy, maculopathy gradability, maculopathy and photocoagulation, were developed according to a real-world approach to diabetic retinopathy screening.
- The deep learning platform can automatically generate three-level (image, eye and patient level) referable diabetic retinopathy decisions.
- Two dimensions of quality, namely image quality and maculopathy gradability, were given full consideration on images of diverse quality, consistent with screening practice.
- Diabetic macular oedema could be misdiagnosed in some cases without stereoscopic images and optical coherence tomography.
Introduction
Diabetic retinopathy (DR), a common ocular complication with retinal microvascular lesions in patients with diabetes mellitus (DM), is one of the leading causes of irreversible blindness and visual impairment among working-age people worldwide.1 It is estimated that the DM population will increase to approximately 700 million by 2045, with a quarter suffering from DR.2 3 DR screening programmes are an important interventional strategy for early identification of referable DR and allow timely referral and treatment to prevent vision loss due to DR.4 5 Yet the huge screening demand from a large number of patients with DM and the limited human resources hinder the wide adoption and sustainability of screening services.6
Deep learning (DL), a subset of artificial intelligence (AI) powered by recent advances in computation and big data, permits multilayer convolutional neural networks to be trained through back propagation techniques to minimise an error function resulting in a classifier output, which works remarkably well in computer vision (image classification tasks).7 In recent years, multiple DL algorithms for automatic detection of DR have been proposed and have shown high sensitivity and specificity (>90%) in detecting referable DR,8–11 pointing the way to large-scale DR screening with AI assistance. However, in complex real-world screening scenarios, an appropriate referral decision is difficult to make based on only a few classification dimensions. Various features or conditions must be identified and handled simultaneously, including the image quality of the fundus photos, the stage of DR, maculopathy and photocoagulation status.12–14 Hence, a DL algorithm covering multidimensional features should be developed to identify the multiple conditions arising in complex real-world DR screening scenarios. Herein, this study aimed to develop a multidimensional DL platform for detecting referable DR with five independent classifiers (image quality, retinopathy, maculopathy gradability, maculopathy and photocoagulation) using real-world DR screening data sets. Combined heatmaps were generated to visualise and explain the predicted areas of the referable lesions. The performance of our DL platform was further compared with that of retinal specialists.
Methods
The requirement for informed consent was waived for this retrospective study, which used de-identified retinal images to develop a DL system. This study followed the Standards for Reporting of Diagnostic Accuracy reporting guidelines.
Data sets
The images in this study were captured during DR screening programmes using three types of cameras and were collected from three hospitals (Joint Shantou International Eye Center of Shantou University and the Chinese University of Hong Kong (JSIEC; camera: Top-2000, Topcon, Japan); Liuzhou City Red Cross Hospital (Liuzhou; camera: AFC-230, NIDEK, Japan); and the Second Affiliated Hospital of Shantou University Medical College (STU-2nd; camera: Top-2000, Topcon, Japan)) and one event (Lifeline Express Diabetic Retinopathy Screening Program (LEDRSP); cameras: AFC-230, NIDEK, Japan, and Canon CR-DGi, Canon, Japan) from April 2014 to June 2018. Only mydriatic retinal images with two 45° fields (macula-centred and optic disc-centred) were included. Images presenting other ocular diseases, such as glaucoma and age-related macular degeneration, were excluded unless they coexisted with DR. Non-fundus images were also excluded (online supplemental figure 1).
Patient and public involvement
Neither participants nor the public were involved in the design and conduct of the present research.
Labelling and grading
Based on the English National Health Service (NHS) Diabetic Eye Screening Programme (online supplemental table 1),14 15 the retinal images were assessed in four dimensions, namely (1) image quality, (2) retinopathy, (3) maculopathy and (4) photocoagulation status. The labels of the retinal images were annotated as follows:
‘Image quality’ was categorised as Q0 (ungradable quality, defined as an image with more than one-third of its area poorly exposed, with artefact or blur, such that it could not be classified confidently even when a DR feature was observed in the remaining area) and Q1 (gradable quality, with one-third or less of the area affected, such that the image could be classified with confidence).
‘Retinopathy’ was divided into four levels according to severity of lesions: R0 (no DR), R1 (background DR), R2 (preproliferative DR) and R3 (proliferative DR). R0 and R1 were further defined as non-referable retinopathy, while R2 and R3 were defined as referable retinopathy.
‘Maculopathy’ was classified as M0 (absence of any M1 features) and M1 (exudate within 1 disc diameter (DD) of the centre of the fovea, or any microaneurysm/haemorrhage within 1 DD of the centre of the fovea only if associated with a best corrected visual acuity of ≤6/12). Additionally, limited blur or artefact (less than one-third of the whole image) over the macula could render maculopathy ungradable. Evaluation of maculopathy gradability therefore preceded classification of maculopathy, and any image whose maculopathy could not be graded confidently was annotated as maculopathy ungradable (Mu).
‘Photocoagulation’ was categorised as P0 (image without laser spot or scar) and P1 (image presenting laser spot or scar).
Detailed definitions are shown in online supplemental table 2.
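For illustration, the label scheme above can be encoded as a small data structure. The following Python sketch is our own rendering of the published definitions; the type and field names are illustrative assumptions, not the authors' code:

```python
from dataclasses import dataclass
from enum import Enum

class Retinopathy(Enum):
    R0 = 0  # no DR
    R1 = 1  # background DR
    R2 = 2  # preproliferative DR
    R3 = 3  # proliferative DR

class Maculopathy(Enum):
    M0 = 0  # absence of any M1 features
    M1 = 1  # referable maculopathy
    MU = 2  # maculopathy ungradable (Mu)

@dataclass
class ImageLabel:
    gradable: bool            # Q1 = True, Q0 = False
    retinopathy: Retinopathy
    maculopathy: Maculopathy
    photocoagulation: bool    # P1 = True, P0 = False

    @property
    def referable(self) -> bool:
        # R2, R3 or M1 count as referable under the NHS scheme.
        return (self.retinopathy in (Retinopathy.R2, Retinopathy.R3)
                or self.maculopathy is Maculopathy.M1)
```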
The ground truth labels of the images were obtained through two-level grading. All graders had been trained and certificated in NHS retinal screening for DR (https://www.gregcourses.com). The grading workflow, with clinical information available, was as follows: (1) images were first graded by two junior graders (PX and YZ) independently, and consistent labels were assigned as the ground truth; and (2) images with inconsistent labels from primary grading were submitted for final adjudication by a senior retinal ophthalmologist (GZ), whose adjudication was assigned as the ground truth label. Images satisfying the inclusion criteria and annotated with ground truth labels were filed into the data set. The development set was constructed using images from LEDRSP and JSIEC, and further randomly divided into training, validation and test sets in a 75:10:15 ratio at the patient level, while images from Liuzhou and STU-2nd were used as the external validation set.
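A patient-level split, in which all images from one patient fall into the same partition, can be reproduced with grouped splitting. This is a minimal sketch under our own assumptions (a list of image paths with matching patient IDs), not the authors' implementation; note that GroupShuffleSplit applies the split fractions to patients, so image proportions are approximate:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def split_by_patient(image_paths, patient_ids, seed=42):
    """Split images roughly 75:10:15 into train/validation/test,
    keeping each patient in exactly one partition (illustrative)."""
    ids = np.asarray(patient_ids)
    # First carve off ~15% of patients as the test partition.
    gss = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=seed)
    dev_idx, test_idx = next(gss.split(image_paths, groups=ids))
    # Then split the remaining ~85% into train (75) and validation (10).
    gss2 = GroupShuffleSplit(n_splits=1, test_size=0.10 / 0.85,
                             random_state=seed)
    tr, va = next(gss2.split(dev_idx, groups=ids[dev_idx]))
    return dev_idx[tr], dev_idx[va], test_idx
```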
DL algorithm development
The pipeline of the DR screening system is shown in online supplemental figure 2. Briefly, image evaluation was initiated with assessment of image quality: gradable images were passed into the main pipeline, whereas ungradable ones were recommended for ‘rephotography’. To construct the main structure of the system, we proposed four independent binary classifiers (retinopathy, maculopathy gradability, maculopathy and photocoagulation) for any given image. Three different neural networks (Google Inception-V3, Xception and Inception-ResNet-V2) were used as base models, and an unweighted average was used as the model ensemble method. We also adopted a postprocessing method that integrates all single-dimension results into image-level referable results, and further integrates the image-level results into eye-level or patient-level results. The details of the methods are shown in online supplemental method 1.
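For illustration, the unweighted ensemble and the image-to-eye-to-patient aggregation could look like the sketch below. The function names and the 0.5 threshold are our assumptions; the exact postprocessing is given in online supplemental method 1:

```python
import numpy as np

def ensemble_prob(probs_per_model):
    """Unweighted average of per-model probabilities for one binary
    classifier (e.g. retinopathy) over the three base CNNs."""
    return np.mean(probs_per_model, axis=0)

def image_referable(p_retinopathy, p_maculopathy, threshold=0.5):
    """An image is referable if any referable dimension fires."""
    return p_retinopathy >= threshold or p_maculopathy >= threshold

def eye_referable(image_flags):
    """An eye is referable if any of its field images is referable."""
    return any(image_flags)

def patient_referable(eye_flags):
    """A patient is referable if either eye is referable."""
    return any(eye_flags)
```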
The t-distributed stochastic neighbour embedding (t-SNE) heatmaps were used to visualise the features extracted by the neural networks. The SHAP-CAM heatmap, combining Class Activation Mapping (CAM)16 17 and DeepSHAP,18 was used to highlight the important regions that the neural networks used for making predictions (online supplemental method 2).
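The exact fusion of CAM and DeepSHAP is described in online supplemental method 2; one plausible reading, sketched here entirely under our own assumptions, is to let the coarse CAM gate the fine-grained SHAP attributions:

```python
import numpy as np

def shap_cam(cam, shap_map, eps=1e-8):
    """A possible CAM/DeepSHAP fusion (illustrative, not the authors'
    published method). Both inputs are (H, W) arrays at image
    resolution; positive SHAP values indicate evidence for the
    predicted class."""
    # Normalise the coarse CAM to [0, 1].
    cam_n = (cam - cam.min()) / (cam.max() - cam.min() + eps)
    # Keep only positive SHAP attributions and normalise.
    shap_n = np.clip(shap_map, 0, None)
    shap_n = shap_n / (shap_n.max() + eps)
    # The CAM gates the SHAP map: fine lesion detail is kept while
    # dispersed attributions outside the activated region are damped.
    return cam_n * shap_n
```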
Various recommendations were automatically generated by the system in response to the outputs of the different classifiers: (1) patients with more serious lesions (R2, R3 or M1), defined as ‘referable DR’ by the English NHS Diabetic Eye Screening Programme (online supplemental table 1), were advised for referral, whereas those with R0, R1 or M0 were advised for follow-up; (2) images with ungradable maculopathy were generally advised for rephotography, unless R2 or R3 was detected on the same image or any referable DR was found on other field images of the same fundus; and (3) any laser spot or scar recognised on the image triggered a reminder of previous photocoagulation therapy, suggesting consultation with the patient’s previous ophthalmologist. The order of priority of the recommendations was: ‘refer to previous ophthalmologist’ > ‘referable’ > ‘rephotography’ > ‘follow-up’. The referable decision was generated automatically by the system: the image-level decision was integrated from the multiple classifiers, and a positive prediction of a referable lesion in any dimension yielded a referable decision. A referable image further produced a referable recommendation at the eye and patient level.
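The priority ordering lends itself to a simple lookup; a minimal sketch, with the recommendation labels paraphrased from the text above:

```python
# Priority from highest to lowest, per the ordering in the text
# (labels paraphrased; illustrative only).
PRIORITY = [
    "refer to previous ophthalmologist",
    "referable",
    "rephotography",
    "follow-up",
]

def final_recommendation(triggered):
    """Return the highest-priority recommendation among those
    triggered by the individual classifiers."""
    for rec in PRIORITY:
        if rec in triggered:
            return rec
    return "follow-up"

# Example: a laser scar plus a referable lesion on the same patient.
print(final_recommendation({"referable",
                            "refer to previous ophthalmologist"}))
# -> "refer to previous ophthalmologist"
```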
Statistical analysis
The performance of the classifiers was evaluated by true negatives, false positives, false negatives, true positives, F1 score, sensitivity, specificity, area under the receiver operating characteristic curve (AUROC) with 95% CI and area under the precision-recall curve (AUPRC).19 The open source package pROC (V.1.14.0; Xavier Robin) was used to calculate two-sided 95% CIs with the DeLong method for AUROC. Data were analysed from 1 May 2019 to 12 June 2021.
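For reference, the threshold-based metrics can be computed with standard tooling. The Python sketch below mirrors the reported measures; the DeLong 95% CIs were obtained with the R package pROC and are not reproduced here:

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             f1_score, confusion_matrix)

def classifier_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, F1, sensitivity, specificity, AUROC and AUPRC for one
    binary classifier (illustrative; the 0.5 threshold is an
    assumption)."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1": f1_score(y_true, y_pred),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "auroc": roc_auc_score(y_true, y_prob),
        "auprc": average_precision_score(y_true, y_prob),
    }
```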
An extra independent data set of 253 images from JSIEC and STU-2nd, collected between 1 January 2019 and 31 December 2020, was used in a human–machine comparison with three experienced retinal ophthalmologists for further validation. The ground truth labelled by the two-level graders was considered the criterion standard. For the human–system comparison, the consistency of the graders (three experienced retinal ophthalmologists) and of the DL system with the criterion standard was calculated using Cohen’s unweighted κ and Gwet’s AC1.20 21 Both were interpreted on the following scale: 0.2 or less was considered slight agreement, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 strong and 0.81–1.0 near-complete agreement.
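For two raters and a binary label, Cohen’s κ and Gwet’s AC1 share the observed-agreement term and differ only in how chance agreement is estimated; a worked sketch of both statistics:

```python
import numpy as np

def cohen_kappa(a, b):
    """Cohen's unweighted kappa for two binary raters (0/1 arrays)."""
    a, b = np.asarray(a), np.asarray(b)
    po = np.mean(a == b)  # observed agreement
    # Chance agreement from the product of marginal rates.
    pe = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())
    return (po - pe) / (1 - pe)

def gwet_ac1(a, b):
    """Gwet's AC1 for two binary raters: same observed agreement,
    but chance agreement 2*pi*(1-pi), where pi is the mean marginal
    prevalence of the positive class across the two raters."""
    a, b = np.asarray(a), np.asarray(b)
    po = np.mean(a == b)
    pi = (a.mean() + b.mean()) / 2
    pe = 2 * pi * (1 - pi)
    return (po - pe) / (1 - pe)
```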
Results
A total of 85 977 retinal images were collected, and 2512 (2.9%) were excluded due to a non-fundus view or diseases other than DR, which would have reduced the classification performance of the DL system if included in the data sets. A total of 83 465 images of 39 836 eyes from 21 716 patients (mean age of the 20 150 patients with available age: 60.0±12.9 years; 7493 of the 17 042 (44.0%) patients with known sex were male) were eventually annotated and included in the data sets. The development set compiled from JSIEC and LEDRSP included 53 211 images (63.8% of 83 465), and the external validation set from Liuzhou and STU-2nd included 30 254 images (36.2%). The distribution of the data is shown in table 1 and online supplemental table 3.
System performance
In the test set at the image level, the five classifiers achieved an accuracy of 0.935–0.994, F1 score of 0.868–0.969, sensitivity of 0.925–0.976, specificity of 0.914–0.995, AUROC from 0.9768 (95% CI 0.9737 to 0.9798) to 0.9979 (95% CI 0.9958 to 1.0000) and AUPRC of 0.9578–0.9981 (table 2, figure 1 and online supplemental figures 3 and 4). The retinopathy classifier achieved an accuracy of 0.972, F1 score of 0.868, sensitivity of 0.976, specificity of 0.971, AUROC of 0.9962 (95% CI 0.9951 to 0.9972) and AUPRC of 0.9687, whereas the maculopathy classifier achieved an accuracy of 0.967, F1 score of 0.888, sensitivity of 0.925, specificity of 0.974, AUROC of 0.9928 (95% CI 0.9912 to 0.9944) and AUPRC of 0.9578.
In the external validation set at the image level, the five classifiers achieved an accuracy of 0.915–0.980, F1 score of 0.682–0.966, sensitivity of 0.917–0.978, specificity of 0.907–0.981, AUROC from 0.9639 (95% CI 0.9617 to 0.9660) to 0.9944 (95% CI 0.9936 to 0.9952) and AUPRC of 0.7504–0.9949 (table 2, figure 1 and online supplemental figures 3 and 4). The retinopathy classifier achieved an accuracy of 0.966, F1 score of 0.870, sensitivity of 0.978, specificity of 0.965, AUROC of 0.9944 (95% CI 0.9936 to 0.9952) and AUPRC of 0.9617, whereas the maculopathy classifier achieved an accuracy of 0.965, F1 score of 0.885, sensitivity of 0.949, specificity of 0.967, AUROC of 0.9904 (95% CI 0.9888 to 0.9919) and AUPRC of 0.9551.
Three-level (image, eye and patient level) referable DR detection achieved an accuracy of 0.952–0.972, F1 score of 0.886–0.919, sensitivity of 0.942–0.945, specificity of 0.954–0.977, AUROC from 0.9914 (95% CI 0.9884 to 0.9943) to 0.9952 (95% CI 0.9940 to 0.9964) and AUPRC of 0.9679–0.9773 in the test set, and an accuracy of 0.918–0.967, F1 score of 0.822–0.918, sensitivity of 0.970–0.971, specificity of 0.905–0.967, AUROC from 0.9848 (95% CI 0.9819 to 0.9877) to 0.9931 (95% CI 0.9920 to 0.9942) and AUPRC of 0.9527–0.9760 in the external validation set (table 2, figure 1 and online supplemental figures 3 and 4).
Visualisation
t-SNE reduced the high-dimensional features extracted from the neural networks to a two-dimensional map for structure visualisation. Well-separated binary classes of each classifier are shown in online supplemental figure 5.
In the SHAP-CAM heatmaps, the visualisation of predicted referable lesions showed not only their location but also the shape of the lesions, which was more finely discriminative than the CAM heatmaps and less noisy than DeepSHAP (figure 2).
Human–system comparison
Further validation of referable DR lesion detection was conducted by comparing our DL algorithm with three experienced retinal ophthalmologists. The DL algorithm showed higher sensitivity (1.000 in retinopathy, 0.949 in maculopathy and 0.953 in referable DR) than the retinal ophthalmologists (average (range): 0.935 (0.910–0.970) in referable retinopathy, 0.936 (0.910–0.949) in referable maculopathy and 0.933 (0.918–0.953) in referable DR; table 3). Confusion matrices showed near-complete agreement (Cohen’s κ: 0.86–0.93; Gwet’s AC1: 0.89–0.94) between the DL algorithm and the ground truth labels (online supplemental figure 6), comparable with the retinal ophthalmologists (Cohen’s κ: 0.89–0.96; Gwet’s AC1: 0.91–0.97).
False prediction analysis
The false predictions in the external validation set were analysed by visualisation of heatmaps. Most false positives were due to non-referable DR lesions, including background DR predicted as referable retinopathy (646 of 784, 82.4%) and haemorrhage/microaneurysm in the macula with best corrected visual acuity >0.5 predicted as referable maculopathy (178 of 572, 31.1%). Artefacts were also a common interference factor in false positive classifications (7.4% in referable retinopathy and 20.6% in referable maculopathy). A small number of blurred images were observed among the false negative predictions for both referable lesions (online supplemental table 4).
Discussion
In this study, we developed a multidimensional DL platform for DR screening based on a real-world screening guideline. Our results demonstrated that (1) the five-dimensional classifiers (image quality, retinopathy, maculopathy gradability, maculopathy and photocoagulation) achieved high accuracy in each classification; (2) a three-level referable DR decision (image, eye and patient level) could be automatically generated by the DL platform; and (3) visualisation by the SHAP-CAM heatmaps provided the explainability for the referable lesion prediction from the platform.
In this study, the multidimensional classifications were based on the NHS DR classification guideline (NHSDRCG) rather than the International Clinical Diabetic Retinopathy Severity Scale (ICDRSS).22 In previous studies, referable DR was defined as moderate or worse DR, diabetic macular oedema (DME) or both, whereby patients with retinopathy more severe than mild (defined as the presence of microaneurysms only) would be referred to ophthalmologists. Yet there is still no effective management currently available for patients with an early stage of DR. These patients can only be monitored annually, not referred to retinal specialists.23 24 Adopting such clinical criteria for referable DR screening could result in over-referral, increasing the workload on eye care services and the financial burden associated with DR screening programmes. The ICDRSS is based on clinical fundus examination of each quadrant.22 However, only one or two 45° fields of retinal images are taken for DR grading during a DR screening programme.12 15 25 26 This could lead to inaccurate classification of DR or confuse graders when grading is based on only one or two fundus photos. In contrast, the NHSDRCG was specially developed for DR screening and has been used for years in different national DR screening programmes, including the Lifeline Express DR programme in China. In the NHSDRCG, classification is based on multidimensional features of DR lesions rather than on the most severe DR lesion. Moreover, our system could provide a referable decision at the eye level by integrating all image-level decisions of one eye, as well as a patient-level decision by combining the results of the two eyes. Multiple DL algorithms have been developed to detect referable or vision-threatening DR, with robust performance in previous studies.8–10 Although these studies achieved high accuracy, they were designed predominantly to focus on a general classification of referable DR. In daily DR screening practice, complex conditions can be found and need to be handled. The NHSDRCG is thus more suitable to support the development of a multidimensional system.
The two dimensions of quality evaluation in our system, namely image quality and maculopathy gradability, are more consistent with screening practices. First, when fundus photos are sent to the reading centre, image quality evaluation should precede classification of DR severity. Poor image quality can be due to opacity of the refractive media, artefacts, poor contrast, defocus or a small pupil.27 Previous studies assigned these ungradable images to referable DR,9 10 28–30 which can cause unnecessary worry to patients and confound the graders’ judgement between referral and rephotography. Second, maculopathy gradability should be evaluated before grading maculopathy. Even when the overall image quality meets the gradable criteria, the macular area might not be visible due to blur or opacity in that area. Third, our platform can provide the grading outcome of maculopathy alone instead of combining the results of retinopathy and maculopathy. Therefore, the basis of a referral suggestion, whether retinopathy or maculopathy, can be identified. Since DME can now be treated in most primary medical units or hospitals with anti-vascular endothelial growth factor agents, referral to higher-level specialist hospitals for vitrectomy or photocoagulation therapy might not be necessary.31 32
Photocoagulation status on retinal images also receives attention in the NHSDRCG. A corresponding classifier was therefore established in our system to judge whether patients had ever received photocoagulation therapy by detecting laser spots or scars on the fundus photos. Laser spots indicate that a patient received photocoagulation therapy before screening, and the treatment suggested for such patients would differ from that of other cases.
The SHAP-CAM heatmap highlights the predicted referable DR lesions on retinal fundus pictures. Generally, CAM can indicate the proper size but a less precise domain for lesion identification. In contrast, DeepSHAP can depict specific fine lesions,33 but its attributions are more dispersed. Combining the two techniques provides a heatmap of lesions within specific domains, meeting the requirement of distinguishing maculopathy from retinopathy. These visualisations provide explainability and improve the accuracy of and confidence in DR grading.34 35
Limitations
There were several limitations to this study. First, similar to other studies, DME was graded on non-stereoscopic images according to the presence of hard exudate, microaneurysm or haemorrhage in the macular area, so it could be misdiagnosed in some cases without stereoscopic images and optical coherence tomography.36 Second, lesions that are tiny or infrequent, such as intraretinal microvascular abnormalities and venous beading, might not be detected well on the images. More images presenting these lesions would need to be included in training if finer-grained classification than basic screening is required. Third, only two classes (referable and non-referable) and the relevant indicators were adopted in the study. Besides, limited by the retrospective data with some missing information, stratified analysis of the classification performance of the DL system based on age, duration of diabetes and the various devices, which could influence image quality, was not conducted. A prospective study of multiclass classification (eg, DR grades 0–5) and multifactor analyses could be carried out in the future. Additional indicators suitable for multiclass classification (eg, weighted kappa) could also be applied.
Conclusions
This study demonstrated that our DL platform based on a real-world DR screening guideline achieved high sensitivity and specificity with multidimensional classifiers, indicating that AI tools could assist in large-scale screening of referable DR in primary medical units.
Data availability statement
Data are available upon reasonable request. Both the main text and supplementary materials are published.
Ethics statements
Patient consent for publication
Ethics approval
This study was approved by the Human Medical Ethics Committee of the Joint Shantou International Eye Center of Shantou University and the Chinese University of Hong Kong (EC20190612(3)-P10), which is in accordance with the Declaration of Helsinki.
Footnotes
GZ, J-WL, JW and JJ contributed equally.
Contributors GZ and MZ proposed and designed the study. GZ, JJ, J-WL, WC, PX, YZ, YX, HW and DL collected the data and/or executed the research. GZ, J-WL and JW analysed the data. GZ, JW, J-WL, JJ and TKN prepared the manuscript. L-PC and CPP gave critical revision. MZ is responsible for the overall content as guarantor.
Funding This work was supported by the Shantou Medical Health, Science and Technology Project Fund (project code: 190716155262406) and the Grant for Key Disciplinary Project of Clinical Medicine under the Guangdong High-Level University Development Program, China. The funding bodies were not involved in the study, including in the collection, analysis and interpretation of data and in writing the manuscript.
Competing interests None declared.
Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.
Provenance and peer review Not commissioned; externally peer reviewed.