Background A growing number of research studies have reported inter-observer variability in sizes of tumours measured from CT scans. It remains unclear whether the conventional statistical measures correctly evaluate the CT measurement consistency for optimal treatment management and decision-making. We compared and evaluated the existing measures for evaluating inter-observer variability in CT measurement of cancer lesions.
Methods 13 board-certified radiologists independently reviewed 10 CT image sets of lung lesions and hepatic metastases selected through a randomisation process. A total of 130 measurements under the RECIST 1.1 (Response Evaluation Criteria in Solid Tumors) guideline were collected for the demonstration. Intraclass correlation coefficient (ICC), Bland-Altman plotting and outlier counting methods were selected for comparison. Each selected measure was used to evaluate three cases with observed, increased and decreased inter-observer variability.
Results The ICC score yielded weak detection when evaluating different levels of inter-observer variability among radiologists (increased: 0.912; observed: 0.962; decreased: 0.990). The outlier counting method using Bland-Altman plotting with 2SD yielded no detection at all, with the number of outliers unchanged regardless of the level of inter-observer variability. Outlier counting based on domain knowledge was more sensitive to different levels of inter-observer variability than the conventional measures (increased: 0.756; observed: 0.923; decreased: 1.000). Visualisation of pairwise Bland-Altman bias was also sensitive to inter-observer variability, with its pattern changing rapidly in response to different levels of variability.
Conclusions Conventional measures may yield weak or no detection when evaluating different levels of inter-observer variability among radiologists. We observed that outlier counting based on domain knowledge was sensitive to inter-observer variability in CT measurement of cancer lesions. Our study demonstrated that, under certain circumstances, the use of standard statistical correlation coefficients may be misleading and result in a false sense of security about the consistency of measurement for optimal treatment management and decision-making.
- computed tomography
- quality in health care
- protocols & guidelines
- adult oncology
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Strengths and limitations of this study
While several conventional statistical measures are frequently used to evaluate inter-observer variability in radiological measurement, very few comparative studies have been performed to quantify the relative merits of the measures.
The study demonstrated that there is no evidence to support the use of statistical correlation coefficients for the assessment of inter-observer CT measurement variability.
This is a retrospective study conducted in a single academic health centre.
Another limitation is that the measurements were collected under a highly controlled environment in which the radiologists were rarely interrupted throughout the data collection.
Clinical evaluation of cancer therapeutics is based on the assessment of change in tumour burden, an important surrogate marker reflecting the therapeutic efficacy of cancer treatments. A comprehensive evaluation of tumour burden often involves a series of measurements of multiple tumour diameters. Measurement accuracy and consistency are essential; large inter-observer variability in measuring tumour size may interfere with precise assessment of cancer treatment response when serial measurements are performed by multiple radiologists. Some studies suggest there are radiologist-dependent factors (eg, preferred guideline, measurement technique, years of clinical experience) that may contribute to variability in anatomical measurements.1–6 A potentially heightened patient risk associated with inter-observer variability may be present when a patient’s repeat CT imaging is assigned to a radiologist different from the one who originally measured the tumour. As a result, clinical disagreement due to the variability between radiologists may result in an unnecessary change in treatment management.
Predominant methods for evaluation of the inter-observer variability in radiological measurements typically include measures based on statistical correlation coefficient and Bland-Altman plot.2 7–14 Intraclass correlation coefficient (ICC) is a widely used reliability measure comparing the variability of different ratings by the same raters to the total variation across all ratings and all raters.15 This reliability measure can be used for test–retest, intra-rater and inter-rater reliability analyses when the rating scale is continuous or ordinal. The Bland-Altman plotting is another popular exploratory analysis approach for intra-rater and inter-rater reliability when two paired measurements use the same scale.16
While these measures serve as useful assessment instruments in many other fields,17–20 their use in evaluating variability in radiological measurements has not been adequately explored. There is a paucity of research investigating either the absolute or comparative effectiveness of these measures in evaluating inter-observer measurement variability among radiologists. Despite multiple statistical studies containing explicit warnings against the use of correlation-based measures and visualisation in some cases,15 21–25 it remains unclear whether the measures are sufficiently responsive to appropriately evaluate inter-observer variability. Consequently, it is also not known whether these measures can be used for interventional studies aiming to reduce inter-observer variability in measurement.6 Previous studies on inter-observer variability in radiological measurement have reported correlation coefficient scores ranging from 0.860 to 0.999.2 7–11 14 From a radiologist’s perspective, these numbers offer little clinical insight into the level of inter-observer variability other than the fact that the scores are very high. The question of how high a score must be to ensure small inter-observer variability remains open for further investigation.
In this paper, using cases with different levels of inter-observer measurement variability, we compare the sensitivity and clinical usefulness of different evaluation measures for inter-observer variability in CT lesion measurements. Additionally, the cases were assessed using these measures to offer better clinical insight into how high the scores should be to achieve clinically acceptable measurement variability in daily clinical practice.
Our demonstration is based on three cases with increased, observed and decreased inter-observer measurement variability that were generated from real clinically observed data. Descriptions of how data were generated for each case are detailed below. The observed data set was acquired from a single-site, double-blinded, observational study conducted in the Department of Radiology, Prisma Health System, located in the Southeast USA. The study was conducted between July 2017 and December 2017. The Department of Radiology operates in an academic health centre but does not train radiology residents.
Collecting observed data
Data were collected from 13 board-certified radiologists who regularly read CT examinations of lung lesions and hepatic metastases. Five lung lesions and five hepatic metastases were randomly selected from the Picture Archiving and Communication System (PACS) according to two primary criteria: (1) whether the lesions were measurable under the Response Evaluation Criteria in Solid Tumors (RECIST) 1.1 guideline, and (2) whether the lesions were commonly encountered in clinical practice. The selected images are provided in online supplemental material 1. These CT images contained normal anatomy cephalad and caudal to the lesion of interest. No CT image set contained any recommendations regarding measurement. The 13 radiologists independently reviewed the same 10 CT image sets, resulting in a total of 130 measurements (13×10). Individual radiologists adjusted the window level according to their preferences, as they would in clinical practice. In accordance with RECIST 1.1 criteria, only the longest CT axis of a tumour image and its corresponding measurement were collected.
Creating cases with different levels of inter-observer variability
The original observed data were used to generate cases with increased, observed and decreased inter-observer measurement variability. The extent of variability classified as increased, observed or decreased does not indicate the absolute level of measurement variability; the classifications were used to indicate different cases with relatively high or relatively low inter-observer variability. The original observed data served as the data representing the case with observed inter-observer measurement variability.
We generated data representing the case with increased inter-observer variability by moving each measurement in the observed data away from the nearest peer measurements. Specifically, we inflated the inter-observer variability by increasing the deviation of each measurement from the corresponding median by 40% to create a case with evidently unacceptable measurement variability. Similarly, the deviation of each measurement from the corresponding median was decreased by 40% in the case with decreased inter-observer variability, figure 1. The per cent differences between each measurement and the corresponding median were visualised using scatter plots for all CT image sets, figure 2. The raw data for each case can be found in online supplemental material 2.
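The scaling step above can be sketched in a few lines. This is a hypothetical illustration: the function name, array shape (lesions × radiologists) and toy values are assumptions for demonstration, not the study data.

```python
import numpy as np

def scale_variability(measurements, factor):
    """Scale each measurement's deviation from its lesion-wise median.

    measurements: 2D array, rows = lesions, columns = radiologists.
    factor: 1.4 inflates deviations by 40%; 0.6 shrinks them by 40%.
    """
    medians = np.median(measurements, axis=1, keepdims=True)
    return medians + factor * (measurements - medians)

# Toy example: one lesion measured by three readers (values in cm)
obs = np.array([[2.0, 2.2, 2.6]])
increased = scale_variability(obs, 1.4)  # deviations grow by 40%
decreased = scale_variability(obs, 0.6)  # deviations shrink by 40%
```

Note that the median of each lesion is left unchanged by construction; only the spread of the readers around it changes.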
Description of selected measures for comparison
We selected evaluation measures based on the ICC and the Bland-Altman plot, which are commonly used for the assessment of intra-observer and inter-observer variability in CT measurement.2 7–14 While the Bland-Altman plot is a graphical method rather than a statistical measure, some well-respected studies have used the plot to track the number of outlier measurement differences outside the 2SD upper and lower limits of agreement (LOA).2 14 26 Accordingly, we quantified Bland-Altman plots by the number of data points exceeding the upper and lower LOA. The plot compares two radiologists at a time; for each case, we performed a pairwise Bland-Altman analysis for all possible pairs within the group of radiologists and counted the total number of outliers from all pairs, online supplemental material 3. If the number of outliers from the Bland-Altman plot is sensitive to different levels of inter-observer variability, more outliers (ie, a higher proportion of outlier measurement differences) would be observed in the case with increased inter-observer variability.
In the clinical context, this pairwise approach explores how safely a patient can be transferred from one radiologist to another within a group of radiologists. If two radiologists reviewed the same set of CT cases but suggested measurements largely different from each other, there may be concerns associated with the patient transfer between the radiologists. Similarly, if two radiologists reviewed the same set of CT cases and suggested measurements similar to each other, the concerns associated with the patient transfer may be marginal. Having more pairs with fewer outlier measurement differences may imply less concern for inter-observer variability when a patient is reviewed by multiple radiologists.
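The pairwise 2SD outlier count described above might be implemented as follows. This is a sketch under the same lesions × radiologists array assumption; the function names are hypothetical, not the authors' code.

```python
import itertools
import numpy as np

def bland_altman_outliers(a, b):
    """Count per cent differences outside the 2SD limits of agreement
    for one pair of readers measuring the same lesions."""
    pct_diff = 100.0 * (a - b) / ((a + b) / 2.0)
    bias = pct_diff.mean()
    sd = pct_diff.std(ddof=1)
    lower, upper = bias - 2 * sd, bias + 2 * sd
    return int(np.sum((pct_diff < lower) | (pct_diff > upper)))

def total_outliers(measurements):
    """Sum 2SD outliers over all reader pairs (columns = readers)."""
    n_readers = measurements.shape[1]
    return sum(
        bland_altman_outliers(measurements[:, i], measurements[:, j])
        for i, j in itertools.combinations(range(n_readers), 2)
    )
```

Because the 2SD limits are re-estimated from the same differences they are applied to, they widen as the variability grows, which is consistent with the flat response reported in the Results.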
We compared three evaluation measures: (1) ICC, (2) Bland-Altman plot with 2SD LOA and (3) Bland-Altman plot with 20% fixed LOA. ICC scores were estimated with a two-way random-effects model that characterises absolute agreement by incorporating both a lesion-wise effect (target effect) and a radiologist-wise effect (rater effect), applied to both the simulated and observed data.2 19 27 28 The ICC scores were estimated from all 130 measurements for each case (increased, observed and decreased).
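For reference, the two-way random-effects, absolute-agreement, single-rater form, ICC(2,1), can be computed from the standard ANOVA mean squares. The following is a minimal sketch of the textbook formula under the lesions × radiologists array assumption, not the authors' exact implementation.

```python
import numpy as np

def icc_2_1(x):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    x: 2D array, rows = lesions (targets), columns = radiologists (raters)."""
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # lesion-wise (target) means
    col_means = x.mean(axis=0)   # radiologist-wise (rater) means
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols
    msr = ss_rows / (n - 1)              # between-lesion mean square
    msc = ss_cols / (k - 1)              # between-radiologist mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

When all raters agree perfectly the score is 1; systematic rater offsets lower it through the rater-effect term, which is what distinguishes the absolute-agreement form from a consistency-only ICC.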
While the Bland-Altman plot allows data to be analysed both as a unit-differences plot and as a percentage-differences plot,16 we used the per cent difference plot as suggested by previous studies in the literature.2 14 28 The Bland-Altman plot with 2SD LOA was quantified into a score by calculating the proportion of data points within the upper and lower LOA.
The Bland-Altman plot with 20% fixed limits was also quantified into a score to compare with the ICC and the standard Bland-Altman plot with 2SD limits. Several clinical studies have used Bland-Altman plots with fixed limits of agreement evidenced by relevant domain knowledge.29 30 This essentially aligns with other studies that use clinical domain knowledge to define outliers.31–34 We fixed the maximum acceptable LOA for assessing measurement interchangeability between radiologists at 20%, as evidenced by clinical guidelines. The predominant guideline for cancer treatment response evaluation, RECIST 1.1, depends heavily on the per cent difference in lesion diameter, with progression defined as a 20% increase in the sum of longest diameters.35 36 An absolute inter-radiologist difference already exceeding 20% in CT measurements may interfere with the application of the 20% criterion from the guideline when a patient is reviewed by different radiologists. Thus, the 20% measurement difference was used as the fixed LOA for the Bland-Altman plot. In the context of radiological measurement, this means that an outlier measurement difference is explicitly defined as a measurement difference exceeding 20% when a pair of radiologists review the same image.
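The fixed-limit score reduces to counting pairwise per cent differences below the 20% threshold. A sketch under the same array assumption (the function name is hypothetical):

```python
import itertools
import numpy as np

def fixed_limit_score(measurements, limit=20.0):
    """Proportion of pairwise per cent differences within a fixed limit.
    Rows = lesions, columns = radiologists; limit in per cent
    (20% mirrors the RECIST 1.1 progression threshold)."""
    diffs = []
    n_readers = measurements.shape[1]
    for i, j in itertools.combinations(range(n_readers), 2):
        a, b = measurements[:, i], measurements[:, j]
        # absolute per cent difference relative to the pair mean
        diffs.append(100.0 * np.abs(a - b) / ((a + b) / 2.0))
    return float(np.mean(np.concatenate(diffs) < limit))
```

Unlike the 2SD limits, the threshold here is fixed by clinical guideline rather than estimated from the data, so the score cannot adapt itself away as variability grows.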
Bland-Altman plot also allows identification of any systematic difference (mean difference in measurements) between two observers. For each case of inter-observer variability, the mean difference in measurements was calculated for all possible pairs (n=78) and visualised in a heat map, figure 3.
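The heat-map input (the fixed bias for each of the 78 pairs) can be assembled as a matrix of signed per cent biases; a sketch under the same assumptions (the published figure may instead display absolute values):

```python
import numpy as np

def pairwise_bias(measurements):
    """Mean per cent difference (fixed bias) for every reader pair,
    returned as a matrix suitable for heat-map plotting.
    Signed biases make the matrix antisymmetric: bias[i, j] = -bias[j, i]."""
    n_readers = measurements.shape[1]
    bias = np.zeros((n_readers, n_readers))
    for i in range(n_readers):
        for j in range(n_readers):
            a, b = measurements[:, i], measurements[:, j]
            bias[i, j] = np.mean(100.0 * (a - b) / ((a + b) / 2.0))
    return bias
```

Rows with consistently positive entries flag readers who systematically over-measure relative to their peers, which is the pattern the Results section reads off the heat map.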
Patient and public involvement
Patients and/or the public were not involved in the design, conduct, reporting or dissemination plans of this research.
Characteristics of CT image sets included in the study
Each CT image set included in the study consisted of multiple CT slices, with an average of 7.6 images, table 1. The minimum and maximum sizes of the hepatic metastases ranged from 1.68 cm to 2.21 cm and from 5.32 cm to 6.72 cm, respectively. The minimum and maximum sizes of the lung lesions ranged from 1.27 cm to 1.68 cm and from 3.69 cm to 5.02 cm, respectively. In the observed data, the largest lesion-wise per cent difference in measurements occurred in Hepatic Metastasis 5, with a 33.1% difference between the minimum and maximum measurements. The smallest lesion-wise per cent difference occurred in Lung Lesion 2, with a 14.5% difference between the minimum and maximum measurements.
Characteristics of cases with different levels of inter-observer variability
The graph visualisation of the data from each case suggested varying levels of inter-observer variability, figure 2. The visualisation of the original observed data suggested substantial inter-observer variability, with 31 (23.8%) measurements outside the light blue area representing a plus or minus 10% interval from the average measurement value for each case. Additionally, a lesion-wise effect on inter-observer variability was observed, with relatively high measurement variation in some CT image sets. The visualisation of the case of decreased inter-observer variability showed a small number of measurements outside the threshold, with 3 (2.3%) measurements lying outside the plus or minus 10% interval. With the decrease in the deviations of each measurement from the corresponding median, all measurements moved towards the average and closer together, as intended for the demonstration. On the other hand, there was a relatively large number of measurements outside the threshold in the case of increased inter-observer variability, with 50 (38.5%) measurements lying outside the plus or minus 10% interval. It was also observed that all measurements not only shifted away from the median but also moved further away from each other, as intended.
Visualisation of Bland-Altman analysis
The heat map visualisation of the average per cent measurement difference (fixed bias) for all pairs of radiologists suggested varying levels of the difference across all pairs, figure 3. Some pairs of radiologists achieved a lower average per cent difference than others. In the heat map of the original observed data, the smallest systematic difference in measurement was observed in the pair of Radiologist 11 and Radiologist 13; they maintained an average 0.03% difference in their measurements when reviewing the same set of CT images. The largest systematic measurement difference was observed in the pair of Radiologist 1 and Radiologist 6, whose measurements differed systematically by 13.6% when reviewing the same set of CT images. Some radiologists contributed more to inter-observer variability than others; Radiologists 1 and 10 generally overestimated lesion size compared with their peers, while Radiologists 2 and 6 generally underestimated it.
The heat map visualisation from the case of increased inter-observer variability showed increased systematic measurement differences between any two radiologists compared with the other cases. Similarly, the heat map visualisation from the case of decreased inter-observer variability showed decreased systematic measurement differences compared with the other cases. Overall, the cases with relatively high inter-observer variability tended to present increased systematic measurement differences between any two radiologists, as well as more pairs of radiologists with a systematic measurement difference close to 20% when reviewing the same CT image sets.
Comparison of the selected measures
The original observed data achieved an ICC score of 0.962. The ICC scores in the cases of increased and decreased inter-observer variability were 0.912 and 0.990, respectively. The per cent increase in the deviation of each measurement from the corresponding median had a perfect linear relationship with the ICC score (R2=1.00), figure 4. However, the magnitude of the association was extremely low; a 10% increase in the deviation was associated with a 0.01 decrease in the ICC score. As a result, the graph representing the relationship between a per cent increase in the deviation and the corresponding ICC score presented a virtually flat slope, which implies that the score is extremely insensitive to changes in the deviations.
The original observed data achieved a standard Bland-Altman score of 0.937, indicating that 93.7% of data points fell within the lower and upper LOA, with 6.3% outlier data points. The score based on the standard Bland-Altman plot presented a flat slope, its value unchanged regardless of the level of inter-observer variability (standard Bland-Altman score=0.937).
The presented Bland-Altman score with fixed limits was more responsive to the change in case than the other measures. In the case with decreased inter-observer variability, all pairs were identified as having a per cent difference of less than 20% when reviewing the same CT image sets (fixed-limit Bland-Altman score=1.0). The original observed data yielded a fixed-limit Bland-Altman score of 0.923, with 92.3% of all possible pairwise measurements having a per cent difference of less than 20%. In the case with increased inter-observer variability, 75.6% of measurements were identified as having a per cent difference of less than 20% when reviewing the same CT image sets. The Bland-Altman score with fixed limits changed by 0.167 (0.756 to 0.923) between the increased case and the observed data, and by 0.077 (0.923 to 1.000) between the observed data and the decreased case, figure 4.
The importance of consistent measurement of cancer lesions in CT scans has been well documented.10 35 36 We have performed an extensive simulation study using conventional evaluation measures and different cases with varying levels of inter-observer variability. Our study investigated precision of those measures and found that some measures are not sensitive enough to detect the difference between cases with clinically desirable and clinically unacceptable inter-observer variability in radiological measurement.
The previous studies by McErlean et al and Zhao et al used statistical correlation coefficients and the standard Bland-Altman plot as primary measures and concluded that serial CT measurements can be safely performed by different radiologists.2 7 Our study indicated that the correlation-based measures may fail to serve as a true indicator of inter-observer variability. When the observed data were analysed, the radiologists in our study achieved a high ICC score comparable to previous studies.2 13 However, as demonstrated above, a high ICC score does not always guarantee low inter-observer variability in the context of radiological measurement. Our analysis suggests that statistical correlation-based measures may yield high scores regardless of the level of inter-observer variability among radiologists. Therefore, a group of radiologists who achieved a high ICC score within the group could still fail to maintain clinically reasonable measurement consistency. For instance, an ICC score of 0.9 achieved by a group of readers is often considered excellent in many other fields.36 37 However, in the case of cancer treatment response evaluation, an ICC score of 0.9 may raise serious patient safety concerns, with radiologists maintaining an average per cent difference in measurements of at least 10% relative to each other when reviewing the same CT image sets. In the presented case with increased inter-observer variability, the ICC score of 0.91 was still not high enough to ensure clinically acceptable inter-observer variability in CT measurement, as affirmed by the participating radiologists, online supplemental material 2. Despite the unrealistically high increase in variability in the case with increased inter-observer variability, the ICC score failed to provide an adequate warning.
Another measure, outlier counts from the standard Bland-Altman plot with 2SD upper and lower LOA, presented no response to the varying levels of inter-observer variability in CT measurements. Its upper and lower limits were observed to increase in proportion to the measurement variability, figure 5. Our analysis suggested no evidence to support its use for the assessment of CT measurement variability or outlier detection.
While the standard Bland-Altman and ICC scores changed little across the different cases, the presented Bland-Altman score with 20% fixed limits changed rapidly between the cases of increased, observed and decreased inter-observer variability. The presented score is also intuitive to interpret because of its self-descriptive nature; the decrease in the score from 0.923 to 0.756 means that the percentage of pairwise measurements with less than a 20% difference decreased from 92.3% to 75.6%. As documented, the predominant guideline for cancer treatment response evaluation defines a diameter increase of 20% as the cut-off for progression of cancer. If multiple pairs of measurements differ by 20% or more over the same CT image sets, this may interfere with the application of the 20% criterion from the guideline when a patient is reviewed by different radiologists. The Bland-Altman score with fixed limits demonstrated a potential to detect a decrease in the number of pairs with less than a 20% measurement difference when reviewing the same image sets, which may better facilitate the application of the guideline.
The Bland-Altman heat map of pairwise systematic discrepancy offered useful insight into how inter-observer variability can be addressed in interventional studies. The visualisation identified radiologists who largely under-measure or over-measure compared with their peers, a potential target for intervention to reduce the variability. The risk associated with inter-observer variability is realised when a patient is referred from one radiologist to another or reviewed by different radiologists. The pairwise approach to visualising systematic discrepancy may also be useful in addressing this risk by identifying pairs of radiologists whose measurements typically differ greatly from each other.
This was a retrospective study conducted in a single academic health centre. Future studies may extend our approach to more measurements with various response evaluation criteria used by radiologists from multiple institutions. A potential limitation of the study may result from the image selection process. Although the images were randomly selected from the health system PACS, the selection criteria were applied by one senior radiologist. One selection criterion was whether images are commonly encountered in daily clinical practice, which may have introduced a bias in the image selection. Another limitation is that the measurements were collected under a highly controlled environment in which the radiologists were rarely interrupted throughout the data collection. It is commonly believed that, in real-world clinical practice, actual performance may be negatively affected by a heavy workload or various types of interruptions. Lastly, future studies are warranted to explore other existing evaluation approaches. For example, although the reliability of the estimated regression line depends on the sample size and on the homoscedasticity and normality of the distribution of the differences, regressing the difference on the mean could reveal whether the extent of disagreement depends on the mean of the two measurements.
Conventional measures may yield weak or no detection when evaluating different levels of inter-observer variability among radiologists. We observed that outlier counting based on domain knowledge was sensitive to inter-observer variability in CT measurement of cancer lesions. Our study demonstrated that, under certain circumstances, the use of standard statistical correlation coefficients may be misleading and result in a false sense of security about the consistency of measurement. A visualisation based on a pairwise approach to identifying systematic discrepancy may serve as a useful and practical tool for future efforts to reduce inter-observer variability in radiological measurement.
The authors acknowledge and appreciate the participating radiologists of the Prisma Health System. We also acknowledge and appreciate the logistics and regulatory support of Karen Edwards, MS of the Department of Public Health Sciences, Clemson University.
Contributors MW designed the study, analysed the data and developed the manuscript. MH made substantial contributions to the data analysis and critical revisions. SCL and AMD acted as Clinical Investigators and contributed substantially to study development and clinical data interpretation. RWG served as co-Principal Investigator and supervised preparation, conduct and administration of the study. All authors developed, reviewed and approved the manuscript.
Funding This work was supported by the Health Science Center, Prisma Health, Greenville, South Carolina (Grant number: Pro00065670).
Competing interests None declared.
Patient consent for publication Not required.
Ethics approval Institutional Review Board, Prisma Health System, Greenville, South Carolina.
Provenance and peer review Not commissioned; externally peer reviewed.
Data availability statement All data relevant to the study are included in the article or uploaded as supplementary information (online supplemental material 2).
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.