Radiographic Annotation Accuracy: The 'Good Doctor' Performance Competing with AI

Authors
Yuan Chai, A. Mounir Boudali, John Farey, William L. Walter

Introduction: Human error in radiographic annotation is usually evaluated using descriptive statistics. Technological advances have popularized "non-human" landmarking techniques, such as deep learning, in which error is reported in a confidence format that is not comparable to that of the human method. The region-based definition of landmarks makes a single "ground truth" point impossible to define. Differences in patients' anatomies, radiograph quality, and scale make horizontal comparison difficult. There is a need to quantify manual landmarking error in a probability format.

Methods: Taking the measurement of pelvic tilt (PT) as an example, this study used 115 sagittal pelvic radiographs to measure PT under two definitions. We proposed a method to unify the scale of images, which allows horizontal comparisons of landmarks, and calculated the maximum possible error using a density vector. Traditional descriptive statistics were also applied.
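As a rough illustration of the scale-unification idea, the sketch below divides each image's landmark coordinates by a per-image reference length and then finds the radius that contains 95% of the annotations around their centroid. It is not the authors' method: the reference length, the use of the centroid, the nearest-rank percentile, and all function names are assumptions introduced here for illustration.

```python
import math

def normalize_landmarks(points, ref_length):
    """Rescale pixel coordinates by a per-image reference length.

    ref_length is a hypothetical choice (e.g. some anatomical distance
    measured in the same image); the abstract does not say what is used.
    """
    return [(x / ref_length, y / ref_length) for x, y in points]

def deviations_from_centroid(points):
    """Offsets of each annotation from the mean annotated position."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return [(x - cx, y - cy) for x, y in points]

def percentile_radius(offsets, q=0.95):
    """Smallest radius containing a fraction q of the offsets (nearest rank)."""
    radii = sorted(math.hypot(dx, dy) for dx, dy in offsets)
    rank = max(1, math.ceil(q * len(radii)))
    return radii[rank - 1]
```

Once landmarks from different radiographs are expressed in the same unit, their offsets can be pooled and summarized with a single radius, which is one simple way to express landmarking error as a probability statement.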

Results: All measurements showed excellent reliability (intraclass correlation coefficients > 0.9). Eighty-four measurements (6.09%) were classified as wrong landmarks that failed to label the correct locations. Directional bias (systematic error) was identified and attributed to cognitive differences between observers. After removing wrong labels and rotated pelves, the analysis quantified the error density as a "good doctor" performance and found a maximum PT disagreement of 6.77° to 11.76° across 95% of data points.
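To make the notion of PT disagreement concrete, the following sketch computes a tilt angle from two landmarks and the spread of that angle across repeated annotations. It assumes PT is the angle between a two-landmark line and the vertical image axis; the abstract does not specify its two PT definitions, so the function names and the geometry here are illustrative assumptions only.

```python
import math

def tilt_deg(a, b):
    """Angle in degrees between the line a->b and the vertical image axis.

    a and b are (x, y) landmark positions; which landmarks define PT is
    an assumption, not taken from the study.
    """
    dx, dy = b[0] - a[0], b[1] - a[1]
    return math.degrees(math.atan2(dx, dy))

def max_disagreement_deg(angles):
    """Largest difference between any two angle measurements.

    Assumes the angles are close enough together that wrap-around
    at +/-180 degrees is not a concern.
    """
    return max(angles) - min(angles)
```

For example, three observers measuring 10.0°, 12.5°, and 8.0° on the same radiograph would have a maximum disagreement of 4.5°.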

Discussion and Conclusions: Landmarks with excellent reliability can still be placed incorrectly (in at least 6.09% of measurements in our case). Identifying skeletal contours is at least 24.64% more accurate than estimating landmark locations. A landmark at a clear skeletal contour is more likely to generate systematic errors. Due to landmark ambiguity, a very careful surgeon measuring PT could show a random difference of up to 11.76° in 95% of cases, which serves as a "good doctor benchmark" for qualifying good landmarking techniques.