Objectives The microscopic evaluation of slides has been gradually moving towards all digital in recent years, leading to the possibility for computer-aided diagnosis. It is worthwhile to know the similarities between deep learning models and pathologists before we put them into practical scenarios. The simple criteria of colorectal adenoma diagnosis make it to be a perfect testbed for this study.
Design The deep learning model was trained by 177 accurately labelled training slides (156 with adenoma). The detailed labelling was performed on a self-developed annotation system based on iPad. We built the model based on DeepLab v2 with ResNet-34. The model performance was tested on 194 test slides and compared with five pathologists. Furthermore, the generalisation ability of the learning model was tested by extra 168 slides (111 with adenoma) collected from two other hospitals.
Results The deep learning model achieved an area under the curve of 0.92 and obtained a slide-level accuracy of over 90% on slides from two other hospitals. The performance was on par with the performance of experienced pathologists, exceeding the average pathologist. By investigating the feature maps and cases misdiagnosed by the model, we found the concordance of thinking process in diagnosis between the deep learning model and pathologists.
Conclusions The deep learning model for colorectal adenoma diagnosis is quite similar to pathologists. It is on-par with pathologists’ performance, makes similar mistakes and learns rational reasoning logics. Meanwhile, it obtains high accuracy on slides collected from different hospitals with significant staining configuration variations.
- computational pathology
- model interpretability, colorectal adenoma
- digital pathology
- deep learning
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Statistics from Altmetric.com
Strengths and limitations of this study
To study the similarities between deep learning models and pathologists before, we put them into practical scenarios, we used colorectal adenoma diagnosis as a testbed and established a semantic segmentation model for colorectal adenomas diagnosis using a deep convolutional neural networks.
The deep learning model had achieved an area under the curve of 0.92 and obtained a slide-level accuracy of over 90% on the slides from two other hospitals.
The performance of the deep learning model was on par with experienced pathologists.
By investigating the feature maps and cases misdiagnosed by the model, we found the concordance of thinking process in diagnosis between the deep learning model and the pathologists.
The current model was not at clinical grade due to the limited size of the training dataset. We need to include more types of adenomas in the training process and further improve the model performance.
Computer-aided pathological diagnosis is becoming possible with the microscopic evaluation of slides has been gradually moving towards all digital in recent years. In the past 10 years, researchers have proposed various medical diagnosis systems using deep learning.1–7 Deep learning has been widely studied in the field of object detection8–12 and semantic segmentation.13 14 Different from traditional machine learning methods, deep convolutional neural networks (CNNs) can learn directly from raw medical images, avoiding the feature engineering procedure and learn key features during the model training process automatically.15
The ability to interpret and elaborate histological features is crucial for artificial intelligence-powered medical diagnosis systems. Before applying deep learning under practical scenarios, we need to address the following non-trivial issues to understand the similarities between models and pathologists. The first and foremost question is whether the deep learning model can perform as good as pathologists. Second, as different hospitals operate under various staining configurations, the generalisation ability should be an important consideration when building the systems. Third, we want to know when the deep learning model would make mistakes and whether they would be similar to pathologists. Lastly, the parameters of the model should be visualisable to enable interrogation of its reasoning logics.
It is estimated that more than 50% of western people may suffer from colorectal adenoma during their lifetime, among which 5%–16% develop to colorectal cancer (CRC).16–18 The diagnosis and removal of these adenomas through colonoscopy lead to a reduction in the expected incidence of CRC, and individualised surveillance strategy for patients will be made according to the histological diagnosis of the resection specimen.19 20 Analysing H&E-staining slides of colorectal adenoma is easier compared with CRC, making it to be a perfect testbed to understand deep learning models.
In this research, we established a semantic segmentation model for diagnosis of colorectal adenomas using a deep CNN and achieved an area under the curve (AUC) of 0.92, which was on par with the performance of experienced pathologists. The deep learning model achieved a slide-level accuracy of over 90% on the slides from two other hospitals. By investigating the cases misdiagnosed and feature maps of the model, we found a concordant thinking process in diagnosis between the deep learning model and the pathologists.
With the popularity of colonoscopy, the number of colorectal pathological slides occupied a large workload in pathology departments. All the histological colorectal slides used in this study were obtained as a part of surveillance colonoscopy. To effectively train the proof-of-concept deep CNN, we had collected a total of 411 slides from Chinese People's Liberation Army (PLA) General Hospital (PLAGH), of which 232 were diagnosed as colorectal adenomas, and 179 were normal mucosa or chronic inflammation which were categorised as non-neoplasm. We selected 177 cases for the training set, 40 cases for validation and 194 cases as test samples. To further test the generalisation ability of the model, we had also collected 168 slides from two other hospitals, including China-Japan Friendship Hospital (CJFH) and Cancer Hospital, Chinese Academy of Medical Sciences (CH), composing the external test group. The detailed configuration of the datasets was shown in table 1. All slides were digitalised using KF-PRO-005 scanner (KFBio) with ×40 objective (eyepiece magnification fixed as ×10). Different from the traditional way of viewing slides on a microscope with fixed objectives, the whole slide images (WSIs) can be viewed at arbitrary levels via digital zooming.
The detailed labelling was prepared on 156 training and 20 validation slides containing adenomas using a self-developed annotation system based on iPad, by qualified pathologists. When adopting a rigorous definition that an adenomatous case contains one or more adenomatous glands, the diagnosis became remarkably subjective even between experienced pathologists. Therefore, a three-stage procedure was devised which included initial labelling, further verification, and the final expert check. Slides were first allocated to a pathologist, chosen randomly. When the labelling was finished, the slides along with annotations were then passed on to another randomly chosen pathologist for review. Finally, the senior pathologists spot-checked the slides that had passed the second reviewing stage. We were able to achieve a much better quality of training dataset using this elaborately designed labelling procedure.
When preparing training and validation sets, the background areas of the slide were filtered out using the Otsu’s method.21 Then the slides were split into tiles with a stride of half of the tile size to form the training and validation data. For different field of views (FoVs), the tile number ranged from 203 212 to 2 265 945. Specifically, for the best performing model trained with ×10 FOV, we used a total of 113 090 adenomatous tiles and 90 122 normal ones for training.
Deep learning model
We built the model19 based on DeepLab v2 with ResNet-34, which is illustrated in figure 1A, with improvements. We introduced a skip layer fusion approach, in which we combined the upsampled lower layers with the higher layers to retain finer details containing semantic information. We also compared the performance of the improved DeepLab v2 against ResNet-50, DenseNet, Inception v3, U-Net, and DeepLab v3.
Since histological slides have no specific directions, we applied random rotating and mirroring to augment the training data. We used carefully designed data augmentation instead of stain normalisation during a training. Since histopathological slides have no specific orientation, we applied random rotations by 90°, 180° and 270° and random flips (horizontal and vertical) to the training patches. To boost the model stability for WSIs collected from different hospitals, we also applied random scaling from ×1.0 to ×1.5, Gaussian and motion blurs and colour jittering in brightness (0.0–0.2), saturation (0.0–0.25), contrast (0.0–0.2) and hue (0.0–0.04).
All models were trained and tested with TensorFlow on an Ubuntu server with 4 Nvidia GTX1080Ti graphics processing units (GPUs). The Adam optimiser with a fixed learning rate of 0.0001 was used to train the models. The batch size was set to 80 (20 on each GPU) and the training process was stopped after 25 epochs.
A benefit from the fully CNN architecture is that the tile sizes during training and at inference need not be identical. In the inference stage, we cut the WSI into tiles with the size of 2000×2000 pixels. To further retain the environment information for the surrounding areas, we adopted the overlap-patch approach22 by feeding a 2200×2200 pixel tile into the model but only used the centred 2000×2000 pixel area for the final prediction.
We used the 15th largest pixel-level probability for slide-level prediction. The receiver operating characteristic (ROC) curve was derived by applying slide-level thresholding to the probability.
We chose three evaluation metrics to describe the model performance
where TP, FP, TN, FN stood for true positive, false positive, true negative and false negative, respectively. The accuracy represented the ratio of the number of correctly predicted slides to the total number of slides. The sensitivity/specificity indicated the proportion of adenomatous/normal slides that were correctly identified. The statistics were made using self-developed Python scripts and plotted by Matlab.
Interpretability had been an issue to be considered in applying deep learning in medical practice. The deep CNNs were often described as black boxes, making it difficult to be applied for clinical use. This obstacle could be tackled by regarding the model either as a black box functional module or as a white box. From the black box perspective, we could study its input–output behaviour and compare it with expert pathologists. Meanwhile, we could also analyse the false predictions and compare them with mistakes made by pathologists. On the white box perspective, we could open the model and try to visualise what it has learnt. One of the most effective approaches of model visualisation is to output feature maps learnt by the CNN and infer what its reasoning logics look like. We visualised feature maps23 to understand how the input samples went through the CNN. We normalised all the visualisation results to the range (0.0–1.0) according to the maximum and minimum values of all the feature maps derived from the corresponding CNN layer.
Performance of different deep learning models
In table 2, we gave the performance of six models trained and validated with 320×320 pixel patches under ×20 FoV. The improved DeepLab v2 outperformed both classification and segmentation models. Moreover, the segmentation model reveals more interpretable predictions, as shown in figure 1B. In the following, we chose the improved DeepLab v2 as the research object.
Comparison of models trained with different FOVs
We trained six models using ×10, ×20 and ×40 FoV tiles with sizes of 640×640 and 320×320 pixels, as illustrated in figure 2A. It can be easily seen that ×10 FoV captured glandular structure and gland–stromal relationships better than other smaller FoVs. In addition, with the help of larger tile size, the model trained with ×10 FoV and 640×640-pixel tile size outperformed others on the validation set, shown in figure 2B.
The computing speed was another important factor to be considered. It was worth investigating the prediction time under different FoVs. Since our deep learning model was fully convolutional, it was possible to make predictions for arbitrary sizes of input images. In our research, we fixed the tile size in the prediction stage to 2000×2000 pixels. The inference time of different FoVs was given in figure 2C, all numbers were normalised by the time taken at ×40. We could see that at ×10 we got both better accuracy and higher computing speed. The final diagnostic system developed with the trained deep learning model was demonstrated in the (online supplemental file 1).
Best model performance compared with pathologists
The slide-level ROC curve was given in figure 3, the AUC was 0.92. We had invited five pathologists to diagnosis the 194 test slides. As shown in figure 3, five pathologists gave significant different diagnosis results, showing the subjectiveness in the adenoma identification process. One could find the model performance was better than the average pathologist. In the following, the best model was chosen at the inverted triangle spotted in figure 3.
Some qualitative examples were shown in figure 4A. When we focused our attention on the regions with high probabilities (crimson), we could see the wedge-shaped adenomatous regions, which was consistent with a common observation from pathologists.
To further test the generalisation ability of the model, we fed slides from the generation testing group into our system and compared the predictions given by the model against the histological reports. Results were shown in table 3. Without any fine-tuning on the original model, it found 155 out of 168 slides (adenoma: 111; normal: 57) were correctly predicted, indicating the model still maintained high accuracy under different staining configurations. Figure 4B shows three examples.
System efficiency and scalability
Due to the large file size of histological slides, it was clear that building a system supporting multiple GPUs for the automatic diagnosis process was essential. The system completed the analysis of a slide with the size of 500 MB in 30 s on a single GTX1080Ti GPU. As shown in figure 4C, the system performance increased near linearly with the hardware configuration (ie, number of GPUs).
False analysis and model visualisation
As shown in figure 5A, in (I) and (II), the adenomatous grade was low, though the model successfully spots these glands, the probability was too low to reach an affirmative decision. The false positive predictions were closely related to tissue cauterisation and hyperplasia as shown in figure 5A-(III) and (IV), respectively.
We gave three representative examples in figure 5B to reveal the model inference process. Interestingly, as we looked carefully into the final probability map, we could infer where exactly the model put its attention on. The highlighted area matched with the area of adenomatous proliferation.
We found the FoV had a substantial impact on the diagnostic accuracy for both machines and pathologists. Specifically, in additional to targeted lesional cells, the histological environment around the cells was also crucial for the diagnosis process. We discovered that the model showed better performance with the ×10 FoV than with ×20 or ×40 FoV. In the meantime, to further increase the FoV that can be perceived by the model, we enlarged the training tile size from the commonly adopted pixel size of 320×320 to 640×640. The best deep learning model reached an AUC of 0.92, showing comparable performance to the pathologists, even better than the average pathologist.
This methodology could be applied to the detection of other symposiums from the histopathological aspect. From the experience of pathologists, the FoV was specific to the disease type. For instance, for cancer detection, ×20 or ×40 FoV was necessary to make a confirmative diagnosis. Despite all this, increasing the tile size should always be effective with abundant GPU resources.
The model was required to have consistent and robust performance, that is, generalisation ability, in order to deal with the different staining configurations of histological slides across different hospitals. We had collected 168 additional slides from two other hospitals and achieved a slide-level accuracy of over 90%.
The false negative predictions were the cases which we need to be more cautious about. As shown in figure 5A, for false negative cases, when the adenomatous glands were small and hard to differentiate from regenerative changes at the base of crypts, the model tended to miss them. This behaviour was similar to that of junior pathologists who very often underdiagnose these ambiguous areas. False positive cases were related closely to tissue cauterisation and hyperplasia. Coincidentally, these tissue structures often confused the junior pathologists, some may overlook an inflammatory background and mistaken regenerative atypia as adenomas. It was necessary to introduce quality assurance step to filter out low-quality slides, such as shredded and folded sections and marked cauterisation, before feeding them into the segmentation model.
To make a decision on whether there were adenomas, the pathologists mainly focused on gland and cell morphologies. From model visualisation results shown in figure 5B, we could observe that the lower CNN layers extracted the edge and colour information from the raw image. As the network went deeper, some of the feature maps gradually revealed glands and cells, especially gland shapes, nucleus and cell forms. For cases with abnormal gland shape and cell morphology, the model made the final decision and determined they were adenomatous glands. Otherwise, the tile was considered normal. This reasoning path was very similar to that of an experienced pathologist.
To apply the methodology to more disease for various organs, it is necessary to recruit a large number of WSIs in the training phase covering diverse tumour subtypes. These WSI should be labelled with accurate pixel-level annotations by experienced pathologists. The augmented data should be generated from domain-specific features of histopathology to further improve the robustness and generatability under complicated scenarios.
It was necessary to know whether deep learning models were similar to pathologists. To answer this question, we established a semantic segmentation model for colorectal adenomas diagnosis using deep CNNs and achieved an AUC of 0.92, which is on par with the performance of experienced pathologists. By carefully studying the influence of FoV on the model performance, we found that a larger FoV brings better diagnostic accuracy, which was consistent with pathologists’ experience.
The model generalisation ability was proved by the multicenter test by slides collected from two other hospitals. We had discovered that the model made similar mistakes on the samples as junior pathologists would do. Meanwhile, model visualisation showed the reasoning path of the deep CNN was very similar to that of experts. By increasing the number of the training samples to include more types of adenomas in the training process, we could further improve the model performance.
The authors would like to thank Xiang Gao, Lang Wang, Yuefeng Wang, Siqi Zheng, Cunguang Wang, Fangjun Ding at Thorough Images for data processing and helpful discussions.
Contributors ZS, SZ, SW, WX, NL and HS proposed the research, CY, YH, XD, JL, LS, JY, XG, WJ, ZW, XC, YZ and XD performed the slide annotation, WW and HC led the multicentre test, CL, GX, ZS and CK wrote the deep learning code and performed the experiment, ZS and SW wrote the manuscript, WX, NL and HS reviewed the manuscript.
Funding This work is supported by CAMS Innovation Fund for Medical Sciences (CIFMS) (grant number 2018-I2M-AI-008); the National Natural Science Foundation of China (NSFC) [grant number 61532001; and the Tsinghua Initiative Research Programme Grant (grant number 20151080475).
Competing interests SW is the cofounder, chief technology officer, and equity holder of Thorough Images. CL, ZS, CK are algorithm researchers of Thorough Images. All remaining authors have declared no conflicts of interest.
Patient and public involvement Patients and/or the public were involved in the design, or conduct, or reporting, or dissemination plans of this research. Refer to the Methods section for further details.
Patient consent for publication Not required.
Ethics approval The study was approved by the institutional review board of each participating hospital (Medical Ethics Committee, Chinese PLA General Hospital, S2018-163-01; Research Ethics Committee, China-Japan Friendship Hospital Clinical, 2018–106 K75; Ethics Committee of National Cancer Center/Cancer Hospital, Chinese Academy of Medical Sciences, NCC1789).
Provenance and peer review Not commissioned; externally peer reviewed.
Data availability statement Data are available on request to the corresponding authors.
If you wish to reuse any or all of this article please use the link below which will take you to the Copyright Clearance Center’s RightsLink service. You will be able to get a quick price and instant permission to reuse the content in many different ways.