1. Tabár, L. et al. Swedish two-county trial: impact of mammographic screening on breast cancer mortality during 3 decades. Radiology 260, 658–663 (2011).

2. Lehman, C. D. et al. National performance benchmarks for modern screening digital mammography: update from the Breast Cancer Surveillance Consortium. Radiology 283, 49–58 (2017).

3. Bray, F. et al. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 68, 394–424 (2018).

4. The Canadian Task Force on Preventive Health Care. Recommendations on screening for breast cancer in average-risk women aged 40–74 years. CMAJ 183, 1991–2001 (2011).

5. Marmot, M. G. et al. The benefits and harms of breast cancer screening: an independent review. Br. J. Cancer 108, 2205–2240 (2013).

6. Lee, C. H. et al. Breast cancer screening with imaging: recommendations from the Society of Breast Imaging and the ACR on the use of mammography, breast MRI, breast ultrasound, and other technologies for the detection of clinically occult breast cancer. J. Am. Coll. Radiol. 7, 18–27 (2010).

7. Oeffinger, K. C. et al. Breast cancer screening for women at average risk: 2015 guideline update from the American Cancer Society. J. Am. Med. Assoc. 314, 1599–1614 (2015).

8. Siu, A. L. Screening for breast cancer: U.S. Preventive Services Task Force recommendation statement. Ann. Intern. Med. 164, 279–296 (2016).

9. Center for Devices & Radiological Health. MQSA National Statistics (US Food and Drug Administration, 2019; accessed 16 July 2019); http://www.fda.gov/radiation-emitting-products/mqsa-insights/mqsa-national-statistics

10. Cancer Research UK. Breast Screening (CRUK, 2017; accessed 26 July 2019); https://www.cancerresearchuk.org/about-cancer/breast-cancer/screening/breast-screening

11. Elmore, J. G. et al. Variability in interpretive performance at screening mammography and radiologists’ characteristics associated with accuracy. Radiology 253, 641–651 (2009).

12. Lehman, C. D. et al. Diagnostic accuracy of digital screening mammography with and without computer-aided detection. JAMA Intern. Med. 175, 1828–1837 (2015).

13. Tosteson, A. N. A. et al. Consequences of false-positive screening mammograms. JAMA Intern. Med. 174, 954–961 (2014).

14. Houssami, N. & Hunter, K. The epidemiology, radiology and biological characteristics of interval breast cancers in population mammography screening. NPJ Breast Cancer 3, 12 (2017).

15. Gulshan, V. et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. J. Am. Med. Assoc. 316, 2402–2410 (2016).

16. Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).

17. De Fauw, J. et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 24, 1342–1350 (2018).

18. Ardila, D. et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat. Med. 25, 954–961 (2019).

19. Topol, E. J. High-performance medicine: the convergence of human and artificial intelligence. Nat. Med. 25, 44–56 (2019).

20. Moran, S. & Warren-Forward, H. The Australian BreastScreen workforce: a snapshot. Radiographer 59, 26–30 (2012).

21. Wing, P. & Langelier, M. H. Workforce shortages in breast imaging: impact on mammography utilization. AJR Am. J. Roentgenol. 192, 370–378 (2009).

22. Rimmer, A. Radiologist shortage leaves patient care at risk, warns royal college. BMJ 359, j4683 (2017).

23. Nakajima, Y., Yamada, K., Imamura, K. & Kobayashi, K. Radiologist supply and workload: international comparison. Radiat. Med. 26, 455–465 (2008).

24. Rao, V. M. et al. How widely is computer-aided detection used in screening and diagnostic mammography? J. Am. Coll. Radiol. 7, 802–805 (2010).

25. Gilbert, F. J. et al. Single reading with computer-aided detection for screening mammography. N. Engl. J. Med. 359, 1675–1684 (2008).

26. Giger, M. L., Chan, H.-P. & Boone, J. Anniversary paper: history and status of CAD and quantitative image analysis: the role of Medical Physics and AAPM. Med. Phys. 35, 5799–5820 (2008).

27. Fenton, J. J. et al. Influence of computer-aided detection on performance of screening mammography. N. Engl. J. Med. 356, 1399–1409 (2007).

28. Kohli, A. & Jha, S. Why CAD failed in mammography. J. Am. Coll. Radiol. 15, 535–537 (2018).

29. Rodriguez-Ruiz, A. et al. Stand-alone artificial intelligence for breast cancer detection in mammography: comparison with 101 radiologists. J. Natl. Cancer Inst. 111, 916–922 (2019).

30. Wu, N. et al. Deep neural networks improve radiologists’ performance in breast cancer screening. IEEE Trans. Med. Imaging https://doi.org/10.1109/TMI.2019.2945514 (2019).

31. Zech, J. R. et al. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS Med. 15, e1002683 (2018).

32. Becker, A. S. et al. Deep learning in mammography: diagnostic accuracy of a multipurpose image analysis software in the detection of breast cancer. Invest. Radiol. 52, 434–440 (2017).

33. Ribli, D., Horváth, A., Unger, Z., Pollner, P. & Csabai, I. Detecting and classifying lesions in mammograms with deep learning. Sci. Rep. 8, 4165 (2018).

34. Pisano, E. D. et al. Diagnostic performance of digital versus film mammography for breast-cancer screening. N. Engl. J. Med. 353, 1773–1783 (2005).

35. D’Orsi, C. J. et al. ACR BI-RADS Atlas: Breast Imaging Reporting and Data System (American College of Radiology, 2013).

36. Gallas, B. D. et al. Evaluating imaging and computer-aided detection and diagnosis devices at the FDA. Acad. Radiol. 19, 463–477 (2012).

37. Swensson, R. G. Unified measurement of observer performance in detecting and localizing target objects on images. Med. Phys. 23, 1709–1725 (1996).

38. Samulski, M. et al. Using computer-aided detection in mammography as a decision support. Eur. Radiol. 20, 2323–2330 (2010).

39. Brown, J., Bryan, S. & Warren, R. Mammography screening: an incremental cost effectiveness analysis of double versus single reading of mammograms. BMJ 312, 809–812 (1996).

40. Giordano, L. et al. Mammographic screening programmes in Europe: organization, coverage and participation. J. Med. Screen. 19, 72–82 (2012).

41. Sickles, E. A., Wolverton, D. E. & Dee, K. E. Performance parameters for screening and diagnostic mammography: specialist and general radiologists. Radiology 224, 861–869 (2002).

42. Ikeda, D. M., Birdwell, R. L., O’Shaughnessy, K. F., Sickles, E. A. & Brenner, R. J. Computer-aided detection output on 172 subtle findings on normal mammograms previously obtained in women with breast cancer detected at follow-up screening mammography. Radiology 230, 811–819 (2004).

43. Royal College of Radiologists. The Breast Imaging and Diagnostic Workforce in the United Kingdom (RCR, 2016; accessed 22 July 2019); https://www.rcr.ac.uk/publication/breast-imaging-and-diagnostic-workforce-united-kingdom

44. Pinsky, P. F. & Gallas, B. Enriched designs for assessing discriminatory performance—analysis of bias and variance. Stat. Med. 31, 501–515 (2012).

45. Mansournia, M. A. & Altman, D. G. Inverse probability weighting. BMJ 352, i189 (2016).

46. Ellis, I. O. et al. Pathology Reporting of Breast Disease in Surgical Excision Specimens Incorporating the Dataset for Histological Reporting of Breast Cancer, June 2016 (Royal College of Pathologists, accessed 22 July 2019); https://www.rcpath.org/resourceLibrary/g148-breastdataset-hires-jun16-pdf.html

47. Chakraborty, D. P. & Yoon, H.-J. Operating characteristics predicted by models for diagnostic tasks involving lesion localization. Med. Phys. 35, 435–445 (2008).

48. Ellis, R. L., Meade, A. A., Mathiason, M. A., Willison, K. M. & Logan-Young, W. Evaluation of computer-aided detection systems in the detection of small invasive breast carcinoma. Radiology 245, 88–94 (2007).

49. US Food and Drug Administration. Evaluation of Automatic Class III Designation for OsteoDetect (FDA, 2018; accessed 2 October 2019); https://www.accessdata.fda.gov/cdrh_docs/reviews/DEN180005.pdf

50. Hanley, J. A. & McNeil, B. J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36 (1982).

51. DeLong, E. R., DeLong, D. M. & Clarke-Pearson, D. L. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44, 837–845 (1988).

52. Qin, G. & Hotilovac, L. Comparison of non-parametric confidence intervals for the area under the ROC curve of a continuous-scale diagnostic test. Stat. Methods Med. Res. 17, 207–221 (2008).

53. Obuchowski, N. A. On the comparison of correlated proportions for clustered data. Stat. Med. 17, 1495–1507 (1998).

54. Yang, Z., Sun, X. & Hardin, J. W. A note on the tests for clustered matched-pair binary data. Biom. J. 52, 638–652 (2010).

55. Fagerland, M. W., Lydersen, S. & Laake, P. Recommended tests and confidence intervals for paired binomial proportions. Stat. Med. 33, 2850–2875 (2014).

56. Liu, J.-P., Hsueh, H.-M., Hsieh, E. & Chen, J. J. Tests for equivalence or non-inferiority for paired binary data. Stat. Med. 21, 231–245 (2002).

57. Efron, B. & Tibshirani, R. J. An Introduction to the Bootstrap (Springer, 1993).

58. Chihara, L. M., Hesterberg, T. C. & Dobrow, R. P. Mathematical Statistics with Resampling and R & Probability with Applications and R Set (Wiley, 2014).

59. Gur, D., Bandos, A. I. & Rockette, H. E. Comparing areas under receiver operating characteristic curves: potential impact of the “last” experimentally measured operating point. Radiology 247, 12–15 (2008).

60. Metz, C. E. & Pan, X. “Proper” binormal ROC curves: theory and maximum-likelihood estimation. J. Math. Psychol. 43, 1–33 (1999).

61. Chakraborty, D. P. Observer Performance Methods for Diagnostic Imaging: Foundations, Modeling, and Applications with R-Based Examples (CRC, 2017).

62. Obuchowski, N. A. & Rockette, H. E. Hypothesis testing of diagnostic accuracy for multiple readers and multiple tests: an ANOVA approach with dependent observations. Commun. Stat. Simul. Comput. 24, 285–308 (1995).

63. Hillis, S. L. A comparison of denominator degrees of freedom methods for multiple observer ROC analysis. Stat. Med. 26, 596–619 (2007).

64. Aickin, M. & Gensler, H. Adjusting for multiple testing when reporting research results: the Bonferroni vs Holm methods. Am. J. Public Health 86, 726–728 (1996).