ABSTRACT

Human interaction has become almost mandatory for an automated medical system seeking acceptance by clinical regulatory agencies such as the U.S. Food and Drug Administration. Since this interaction causes variability in the gathered data, interobserver and intraobserver variability must be analyzed in order to validate the accuracy of the system. This study focuses on the variability among different observers who interact with an automated lung delineation system that relies on human interaction in the form of the delineation of lung borders. The database consists of high-resolution computed tomography (HRCT) images of 15 normal and 81 diseased patients, taken retrospectively at five levels per patient. Three observers independently and manually delineated the lung boundaries at all five levels of the three-dimensional lung volume using ImgTracer software (AtheroPoint, Roseville, CA, USA). The three observers were Observer-1, a novice tracer who is a resident in radiology working under the guidance of a radiologist, and Observer-2 and Observer-3, expert lung image scientists trained by a lung radiologist and a biomedical imaging scientist. Interobserver variability was assessed by comparing each observer's tracings with the automated delineation and by comparing the observers' manual tracings with one another. The normality of the tracings was tested using the D'Agostino-Pearson test; for all observers' tracings the p value was higher than 0.05, consistent with a normal distribution. The analysis of variance (ANOVA) test between the three observers and the automated system showed p values higher than 0.89 and 0.81 for the right lung (RL) and the left lung (LL), respectively. The performance of the automated system was evaluated using the Dice similarity coefficient, the Jaccard index, and the Hausdorff distance.
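The three performance measures named above can be illustrated with a minimal sketch. It assumes each delineation is available as a set of (row, column) pixel coordinates. The function names and the toy rectangles are hypothetical and are not taken from the study or from ImgTracer. In practice the Hausdorff distance is usually computed on boundary points; here it is applied to the toy point sets directly.

```python
import math

def dice(a, b):
    """Dice similarity coefficient: 2|A n B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # two empty regions are treated as identical
    return 2.0 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    """Jaccard index: |A n B| / |A u B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def hausdorff(p, q):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(s, t):
        # Largest distance from a point in s to its nearest point in t.
        return max(min(math.dist(x, y) for y in t) for x in s)
    return max(directed(p, q), directed(q, p))

def rectangle(r0, r1, c0, c1):
    """Hypothetical helper: pixel set of a filled rectangle (end-exclusive)."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}

# Toy example: a 4x4 "manual" region and an "automated" region
# shifted right by one pixel. These are illustrative values only.
manual = rectangle(0, 4, 0, 4)
auto = rectangle(0, 4, 1, 5)

print(dice(manual, auto))
print(jaccard(manual, auto))
print(hausdorff(manual, auto))
```

The brute-force Hausdorff computation is quadratic in the number of points, which is adequate for a sketch but slow for full-resolution boundaries.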
Although Observer-1 had less experience than Observer-2 and Observer-3, the Observer Deterioration Factor (ODF) showed that Observer-1 differed from the other two by less than 10%, which is within the acceptable range as per our analysis. To compare the observers, this study used regression plots, Bland-Altman plots, and the two-tailed t test, Mann-Whitney test, and chi-squared test, with the three tests giving the following p values: (1) Observer-1 and Observer-3, 0.55, 0.48, and 0.29 for RL and 0.55, 0.59, and 0.29 for LL; (2) Observer-1 and Observer-2, 0.57, 0.50, and 0.29 for RL and 0.54, 0.59, and 0.29 for LL; and (3) Observer-2 and Observer-3, 0.98, 0.99, and 0.29 for RL and 0.99, 0.99, and 0.29 for LL. Furthermore, the correlation and R² coefficients computed between observers equaled 0.9 for both RL and LL. In addition, the tracings of all three observers showed that diseased lungs are smaller than normal lungs in terms of area.
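As a rough illustration of the pairwise observer comparison, the sketch below computes the Bland-Altman bias with 95% limits of agreement and the Pearson correlation for two lists of paired lung-area measurements. The function names and the area values are made up for illustration and are not data from the study. The 1.96 multiplier assumes approximately normally distributed differences.

```python
import math
import statistics

def bland_altman(x, y):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical lung areas (arbitrary units) traced by two observers.
areas_obs1 = [100.0, 120.0, 90.0, 110.0]
areas_obs2 = [102.0, 118.0, 91.0, 111.0]

bias, lower, upper = bland_altman(areas_obs1, areas_obs2)
r = pearson_r(areas_obs1, areas_obs2)
print(f"bias={bias:.2f}, limits of agreement=({lower:.2f}, {upper:.2f})")
print(f"r={r:.4f}, R^2={r * r:.4f}")
```

A bias near zero with narrow limits of agreement, together with a high correlation, would indicate that the two observers trace similar areas.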