An artificial intelligence–optimized Thyroid Imaging Reporting and Data System (TI-RADS) validates the American College of Radiology TI-RADS while slightly improving specificity and maintaining sensitivity. Additionally, it simplifies feature assignments, which may improve ease of use.

AI TI-RADS assigned new point values for eight ACR TI-RADS features. Six features were assigned zero points, which simplified categorization. By using expert reader data, the area under the receiver operating curve was 0.91 for ACR TI-RADS and 0.93 for AI TI-RADS. For the same expert, specificity of AI TI-RADS (65%, 55 of 85) was higher (P < .001) than that of ACR TI-RADS (47%, 40 of 85). For the eight nonexpert radiologists, mean specificity for AI TI-RADS (55%) was also higher (P < .001) than that of ACR TI-RADS (48%). An interactive AI TI-RADS calculator can be viewed at http://deckard.duhs.duke.edu/~ai-ti-rads.

A total of 1425 biopsy-proven thyroid nodules from 1264 consecutive patients (1026 women; mean age, 52.9 years [range, 18–93 years]) were evaluated retrospectively. Expert readers assigned points based on five ACR TI-RADS categories (composition, echogenicity, shape, margin, echogenic foci), and a genetic AI algorithm was applied to a training set (1325 nodules). Point and pathologic data were used to create an optimized scoring system (hereafter, AI TI-RADS). Performance of the systems was compared by using a test set of the final 100 nodules with interpretations from the expert reader, eight nonexpert readers, and an expert panel. Initial performance of AI TI-RADS was calculated by using a test for differences between binomial proportions. Additional comparisons across readers were conducted by using bootstrapping; diagnostic performance was assessed by using area under the receiver operating curve.

Risk stratification systems for thyroid nodules are often complicated and affected by low specificity. Continual improvement of these systems is necessary to reduce the number of unnecessary thyroid biopsies.

Summary

Artificial intelligence modeling suggests that the American College of Radiology Thyroid Imaging Reporting and Data System may be modified to improve ease of use while also improving specificity.

Key Points

■ By using a set of 1425 thyroid nodules, artificial intelligence (AI) modeling was used to optimize the American College of Radiology Thyroid Imaging Reporting and Data System (TI-RADS).

■ The revised TI-RADS (hereafter, AI TI-RADS) assigned new point values for eight features, including a simplified scheme for some categories. For example, only assigning points to solid nodules and eliminating point assignments to other composition features represents one such modification.

■ AI TI-RADS resulted in slightly higher specificity for recommending fine-needle aspiration (mean increase of 7.6% across eight radiologist readers; P < .001).

Introduction

Thyroid nodules are an extremely common finding at US and other imaging studies (1,2). Although most thyroid nodules are benign, many patients are subjected to a costly workup that may include one or more biopsies, follow-up imaging, and even diagnostic lobectomy (3). This contributes to the overdiagnosis of thyroid cancers that are not clinically significant (4). Over the past decade, multiple groups have developed biopsy guidelines for thyroid nodules based on their appearance at US, but some guidelines are difficult to apply and all lead to high false-positive rates (benign nodules for which biopsy is recommended).

With these issues in mind, a committee of the American College of Radiology (ACR) created the Thyroid Imaging Reporting and Data System (TI-RADS) to determine if thyroid nodules depicted at US require biopsy or follow-up (5). Nodules are awarded points based on features in five categories—composition, echogenicity, shape, margin, and echogenic foci. The more suspicious the feature, the higher its point value. Points are summed to categorize a nodule into one of five TI-RADS risk levels, TR1 to TR5 (Table 1). Management recommendations are determined by using the risk level and the maximum size of the nodule.
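The flow from summed points to a risk level and then to a management recommendation can be sketched in a few lines of Python. This is a minimal illustration, not part of the study: the point ranges and size thresholds are those of the published ACR TI-RADS chart (5) as recalled here rather than values stated in this text, a total of one point is assumed to map to TR1, and the function and variable names are hypothetical.

```python
def tirads_level(total_points: int) -> str:
    """Map a summed point total to a TI-RADS risk level (assumed ACR ranges)."""
    if total_points >= 7:
        return "TR5"
    if total_points >= 4:
        return "TR4"
    if total_points == 3:
        return "TR3"
    if total_points == 2:
        return "TR2"
    return "TR1"  # assumption: totals of 0 or 1 are treated as TR1


# Assumed size thresholds in cm: (fine-needle aspiration, follow-up).
SIZE_THRESHOLDS_CM = {
    "TR3": (2.5, 1.5),
    "TR4": (1.5, 1.0),
    "TR5": (1.0, 0.5),
}


def recommendation(total_points: int, max_size_cm: float) -> str:
    """Combine the risk level with the maximum nodule size."""
    level = tirads_level(total_points)
    if level not in SIZE_THRESHOLDS_CM:
        return "no FNA or follow-up"  # TR1 and TR2
    fna_cm, follow_up_cm = SIZE_THRESHOLDS_CM[level]
    if max_size_cm >= fna_cm:
        return "FNA"
    if max_size_cm >= follow_up_cm:
        return "follow-up"
    return "no FNA or follow-up"


print(tirads_level(5), recommendation(5, 1.6))
```

Because the recommendation depends on both the level and the size, lowering a nodule's point total can change its management even when its size is unchanged.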

Table 1: Risk Categories of American College of Radiology Thyroid Imaging Reporting and Data System

The points assigned to each feature in ACR TI-RADS were based on evidence in the literature and expert consensus. Therefore, it is possible that the performance of the system could be improved by optimization of the points assigned to each US feature. Given the problem of overdiagnosis in thyroid imaging, this might improve specificity without sacrificing sensitivity. Indeed, the ACR TI-RADS committee recognized that certain features may warrant higher or lower point values to achieve optimal performance (5).

The aim of this study was to use artificial intelligence (AI) algorithms to optimize TI-RADS feature point assignments. Our hypothesis was that our algorithm (hereafter, AI TI-RADS) could achieve similar or higher specificity than could ACR TI-RADS while maintaining sensitivity. This hypothesis would be tested in part by using a set of 100 thyroid nodules that had been interpreted by multiple radiologists as part of another study (although outcomes for our study would be different and use separate data analysis) (6). These results could both validate the current ACR TI-RADS and inform future revisions of the system.

Materials and Methods

Study Population and Image Annotation

This retrospective study was Health Insurance Portability and Accountability Act compliant, institutional review board approved, and used patients from a single academic medical center. A waiver of consent was obtained due to the anonymous and retrospective nature of the study. The initial population included 1631 thyroid nodules in 1439 consecutive patients who underwent diagnostic thyroid US and subsequent biopsy between August 2006 and May 2010. Sonograms were performed for a variety of clinical indications by using commercially available units (Antares and Elegra [Siemens Healthineers, Erlangen, Germany], ATL HDI 5000 and iU22 [Philips, Best, the Netherlands], and Logiq E9 [General Electric, Andover, Mass]). All were considered high-end units at the time that the images were obtained.

Tissue samples were obtained by using standard fine-needle aspiration techniques, and the cytopathologic slides were reviewed by pathology faculty at the institution where the images were obtained. Diagnosis was based on fine-needle aspiration results and surgical specimens, when available. Only nodules that were malignant or benign were included, unless a nodule underwent repeat fine-needle aspiration or surgical resection that confirmed malignancy or benignity. Two hundred three nodules were excluded for indeterminate or nondiagnostic pathologic results and three were excluded because of incomplete images. The final study population comprised 1425 nodules from 1264 patients (Fig 1) with 151 (10.6%) cancers (95 papillary, 40 follicular variants of papillary, six follicular, one medullary, and nine other cancers).

Figure 1: Flowchart illustrates exclusion criteria and final study population. FNA = fine-needle aspiration.

Sonograms were interpreted by one of two expert readers (reader 1 [W.D.M.], with 20 years of experience and reader 2, with 20 years of experience; both members of ACR TI-RADS committee) who were blinded to the indication and pathologic result. Readers were not blinded to patient age, as this was included on the images. Reader 1 interpreted 1044 (64%) of the nodules and reader 2 interpreted 587 (36%). They jointly interpreted 50 cases at the beginning of the study to standardize their approach and read another 50 cases together in the middle of the study. The expert readers assigned features in the five ACR TI-RADS categories for every nodule. Because this was performed prior to the publication of ACR TI-RADS, there were minor differences in terminology for echogenicity and shape. Hypoechoic nodules were characterized as mildly, moderately, or very hypoechoic rather than hypoechoic or very hypoechoic as in ACR TI-RADS. Therefore, nodules that were originally called mildly hypoechoic were recategorized as hypoechoic, and nodules originally categorized as moderately to very hypoechoic were reclassified as hypoechoic or very hypoechoic by a third reader (reader 3 [B.W.T.], a radiology fellow with 5 years daily practice in thyroid imaging). Nodule shape was also determined by reader 3 because this feature was not part of the original analysis. Recategorizations that were deemed difficult or indeterminate were reviewed by a fourth reader (reader 4 [J.K.H.], with 13 years of experience in thyroid imaging; member of ACR TI-RADS committee). In all other respects, the original analysis followed the recommendations of ACR TI-RADS.

Ultimately, all 1425 nodules had feature assignments for all five ACR TI-RADS categories, which yielded point assignments and corresponding TI-RADS risk levels. Nodules were split into a training set of 1325 nodules (1189 benign, 136 malignant) and a test set of the last 100 nodules (85 benign, 15 malignant). A validation set was not used; rather, cross-validation within the training cases was used to tune the algorithm.

AI TI-RADS Algorithm Development

We used a genetic algorithm to derive an optimized and data-driven version of TI-RADS, which we refer to as AI TI-RADS (7). Genetic algorithms are a part of computational intelligence methods, a subgroup of AI methods that focus on algorithms inspired by natural selection and its genetic underpinnings. Specifically, a population of individuals is simulated by a computer algorithm in which each individual represents a solution to a problem. In this instance, the solution was a set of points for different thyroid nodule features. Each individual (representing a possible solution) was evaluated in terms of its “fitness,” which reflected how accurately the set of points could predict malignancy. Through multiple iterations (“generations”), individuals with better performance were prioritized and multiplied. This process was repeated 50 times, and eventually a single best solution was presented. This optimized set of AI TI-RADS points had the same form as did the original ACR TI-RADS, but with different point values for some features. Therefore, the proposed system could immediately be used in the same manner as ACR TI-RADS. Additional details of the genetic algorithm are presented in Appendix E1 (online) and the following link can be accessed for additional code details: https://github.com/mateuszbuda/AI-TI-RADS/releases/tag/v1.0.
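The general idea of a genetic algorithm over point assignments can be illustrated with a toy example. The sketch below is not the authors' implementation (that is described in Appendix E1 and the linked repository); it assumes a fitness equal to the area under the receiver operating curve of the summed points, truncation selection, uniform crossover, and a single-gene mutation, and it runs on synthetic nodules generated by a hypothetical `make_toy_data` helper. All names and parameter values are illustrative.

```python
import random


def auc(scores, labels):
    """Chance that a random malignant nodule outscores a random benign one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fitness(points, nodules, labels):
    # Each nodule is a list of 0/1 feature indicators; its score is the
    # sum of the points of the features that are present.
    scores = [sum(p for p, present in zip(points, nodule) if present)
              for nodule in nodules]
    return auc(scores, labels)


def evolve(nodules, labels, n_features, pop_size=20, generations=50,
           max_points=3, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, max_points) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population,
                        key=lambda ind: fitness(ind, nodules, labels),
                        reverse=True)
        parents = ranked[: pop_size // 2]  # selection: keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            i = rng.randrange(n_features)                     # mutate one gene
            child[i] = rng.randint(0, max_points)
            children.append(child)
        population = parents + children
    return max(population, key=lambda ind: fitness(ind, nodules, labels))


def make_toy_data(n=100, seed=1):
    """Synthetic nodules: features 0 and 1 are informative, 2 and 3 are noise."""
    rng = random.Random(seed)
    nodules, labels = [], []
    for _ in range(n):
        malignant = rng.random() < 0.2
        nodules.append([
            int(rng.random() < (0.8 if malignant else 0.2)),
            int(rng.random() < (0.7 if malignant else 0.3)),
            int(rng.random() < 0.5),
            int(rng.random() < 0.5),
        ])
        labels.append(int(malignant))
    return nodules, labels


nodules, labels = make_toy_data()
best = evolve(nodules, labels, n_features=4)
print(best, round(fitness(best, nodules, labels), 3))
```

Because the fitter half of each generation is carried over unchanged, the best fitness in the population never decreases from one generation to the next. Restricting each gene to a small integer range keeps the result in the same form as a TI-RADS point table.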

Comparing AI TI-RADS with ACR TI-RADS

After new point values were assigned to some TI-RADS features, the two systems were applied to the test set as interpreted by the expert reader. The test set was also interpreted by 11 other radiologists as part of another previously published study (6). Three radiologists (F.N.T., with 34 years of experience in thyroid imaging; reader 5, with 26 years of experience in thyroid imaging; and reader 6, with 31 years of experience in thyroid imaging; all members of the ACR TI-RADS committee) independently interpreted the 100 nodules and their consensus was taken as the best possible performance for the test set. The eight other radiologist readers (two academic, six general private practice [range, 3–32 years of experience in thyroid imaging]) had not routinely used ACR TI-RADS at the time of interpretation. After initial training, they assigned features to each nodule according to ACR TI-RADS, and points and risk categories were assigned by using both systems.

Statistical Analysis

By using the single-expert reader data, the area under the receiver operating curve, sensitivity, and specificity were calculated for the ACR TI-RADS and AI TI-RADS models, and the two systems were compared by using a test for differences between two binomial proportions. Sensitivity and specificity for detection of malignancy were also calculated for each nonexpert reader and the expert consensus, and comparison across those groups was performed by using bootstrapping methods. Differences in mean age between men and women were tested by using an unpaired t test. Statistical analysis was conducted by using R software (R Foundation for Statistical Computing, Vienna, Austria; https://r-project.org), and P values less than or equal to .05 were considered to indicate statistical significance.
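A bootstrap comparison across readers could take the following form. The text does not specify the resampling unit or the interval type, so this sketch assumes that benign nodules are resampled with replacement, that the same resample is shared by all readers and both systems (a paired design), and that a percentile interval is reported. The analysis itself was run in R; Python is used here only for illustration, and the three readers and 20 nodules are made-up data.

```python
import random


def specificity(recommend_fna, benign_idx):
    """Fraction of benign nodules for which biopsy was NOT recommended."""
    return sum(not recommend_fna[i] for i in benign_idx) / len(benign_idx)


def bootstrap_mean_spec_diff(acr_by_reader, ai_by_reader, benign_idx,
                             n_boot=2000, seed=0):
    """Percentile 95% interval for mean (AI minus ACR) specificity across readers."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample benign nodules with replacement; reuse the same resample
        # for every reader and both systems so the comparison stays paired.
        sample = [rng.choice(benign_idx) for _ in benign_idx]
        per_reader = [specificity(ai, sample) - specificity(acr, sample)
                      for acr, ai in zip(acr_by_reader, ai_by_reader)]
        diffs.append(sum(per_reader) / len(per_reader))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]


# Hypothetical data: 3 readers, 20 benign nodules, True = biopsy recommended.
benign_idx = list(range(20))
acr_by_reader = [[i < 10 for i in benign_idx],
                 [i < 12 for i in benign_idx],
                 [i < 8 for i in benign_idx]]
ai_by_reader = [[i < 7 for i in benign_idx],
                [i < 10 for i in benign_idx],
                [i < 8 for i in benign_idx]]

lo, hi = bootstrap_mean_spec_diff(acr_by_reader, ai_by_reader, benign_idx)
print(lo, hi)
```

In this toy data set every biopsy recommendation under the second system is also a recommendation under the first, so the resampled difference can never be negative.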

Results

Study Population and Nodule Characteristics

The mean of the summed ACR TI-RADS points was 4.23 ± 2.45 (standard deviation) for the 1325 training nodules and 4.22 ± 2.64 for the 100 test nodules. The mean age for male patients was 56.7 years ± 13.5, whereas the mean age for female patients was 52.0 years ± 13.9 (P < .001). Additional basic demographics for the training and test sets can be found in Table 2. The distribution for all TI-RADS imaging features across all included nodules for the training and test sets appears in Table 3.

Table 2: Patient Demographics and Nodule Characteristics in the Training and Test Sets

Table 3: Distribution of TI-RADS Features across the Training and Test Sets

AI TI-RADS Algorithm

Figure 2 displays the AI TI-RADS and ACR TI-RADS point values. AI TI-RADS differed slightly from ACR TI-RADS in all five feature categories. The AI algorithm assigned new point values for eight features. Six of the eight features with new values changed by one point, while the other two features (taller than wide for shape and “cannot tell” for composition) each changed by two points. The overall order of features within each category was preserved. The highest risk features maintained the greatest point values. Figure 3 shows the final classification system including size cutoffs and management recommendations. An interactive AI TI-RADS calculator can be viewed at http://deckard.duhs.duke.edu/~ai-ti-rads.

Figure 2: Image shows comparison of American College of Radiology (ACR) Thyroid Imaging Reporting and Data System (TI-RADS) to Artificial Intelligence (AI) TI-RADS. Multiple new point assignments were designated by algorithm, including changing point values to zero for several features.

Figure 3: Image shows Artificial Intelligence (AI) Thyroid Imaging Reporting and Data System (TI-RADS) classification scheme, including nodule sizes that dictate follow-up recommendations. Nodule size cutoffs were kept the same as American College of Radiology TI-RADS. FNA = fine-needle aspiration, TR = TI-RADS category.

For composition, AI TI-RADS assigned three points for solid or almost solid nodules and no points to the three other features under this category (as well as no points to “cannot tell”). For echogenicity, points for hypoechoic and very hypoechoic were the same under both systems, but AI TI-RADS assigned no points for other features within the category. For shape, AI TI-RADS assigned one point for taller-than-wide shape compared with three points for ACR TI-RADS. For margin, an irregular and/or lobulated margin was assigned the same number of points in each system. For echogenic foci, AI TI-RADS assigned no points to macrocalcifications compared with one point in ACR TI-RADS. The other feature point assignments under the echogenic foci category were the same for both systems.
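The effect of these changes on a nodule's total can be illustrated by scoring the same feature list under both point tables. The sketch below covers only features whose values are named in the text; the ACR values not stated explicitly here (for example, two points for solid composition and one point for mixed cystic and solid) are taken from the published ACR TI-RADS chart (5) and should be treated as assumptions, as should the dictionary keys and function name. The two example nodules mirror the feature descriptions in the captions of Figures 4 and 5.

```python
# Assumed ACR TI-RADS points for the features discussed in the text.
ACR_POINTS = {
    "mixed cystic and solid": 1,
    "solid or almost solid": 2,
    "hyperechoic or isoechoic": 1,
    "hypoechoic": 2,
    "very hypoechoic": 3,
    "taller-than-wide": 3,
    "lobulated or irregular margin": 2,
    "macrocalcifications": 1,
    "peripheral calcifications": 2,
    "punctate echogenic foci": 3,
}

# AI TI-RADS keeps the same table except for the changes described above.
AI_POINTS = {
    **ACR_POINTS,
    "mixed cystic and solid": 0,
    "solid or almost solid": 3,
    "hyperechoic or isoechoic": 0,
    "taller-than-wide": 1,
    "macrocalcifications": 0,
}


def total_points(features, table):
    """Sum the points of the features present in a nodule."""
    return sum(table[name] for name in features)


# Feature lists as described in the captions of Figures 4 and 5.
figure_4 = ["mixed cystic and solid", "hyperechoic or isoechoic", "taller-than-wide"]
figure_5 = ["mixed cystic and solid", "hypoechoic"]

for label, nodule in (("Figure 4", figure_4), ("Figure 5", figure_5)):
    print(label, total_points(nodule, ACR_POINTS), total_points(nodule, AI_POINTS))
```

Under these tables the Figure 4 nodule totals five points with ACR TI-RADS and one point with AI TI-RADS, and the Figure 5 nodule totals three and two points, which matches the totals given in the two captions.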

Comparing AI TI-RADS with ACR TI-RADS

When both systems were applied to the test set of 100 nodules, AI TI-RADS assigned lower TI-RADS risk levels than did ACR TI-RADS for 43 nodules. Specifically, five nodules were downgraded from TR5 to TR4, 11 nodules were changed from TR4 to TR3, three nodules were lowered from TR4 to TR2, two nodules were changed from TR4 to TR1, eight nodules were lowered from TR3 to TR2, and 14 nodules were reassigned from TR2 to TR1. There were no nodules for which AI TI-RADS assigned a higher risk level than did ACR TI-RADS. Ultimately, the new risk level assignments resulted in 15 nodules for which ACR TI-RADS recommended fine-needle aspiration but AI TI-RADS did not, and all 15 nodules were benign (examples shown in Figs 4, 5). There were no nodules for which AI TI-RADS recommended fine-needle aspiration but ACR TI-RADS did not.

Figure 4: Image in a 72-year-old woman with right thyroid nodule. Transverse US image shows mixed cystic and solid, isoechoic, taller-than-wide nodule. Its features earn five total points according to American College of Radiology (ACR) Thyroid Imaging Reporting and Data System (TI-RADS) with risk level of TR4 and recommendation for fine-needle aspiration (FNA). Nodule earns only one point by using Artificial Intelligence (AI) TI-RADS with risk level of TR1 and no recommendation for FNA. Pathologic finding at FNA was benign nodule. Calipers were included in all images as part of automated nodule detection process.

Figure 5: Image in a 66-year-old man with right thyroid nodule. Long US image shows mixed cystic and solid hypoechoic nodule. This nodule earns three points according to American College of Radiology (ACR) Thyroid Imaging Reporting and Data System (TI-RADS) with risk level of TR3 and recommendation for biopsy. Nodule earns only two points with Artificial Intelligence (AI) TI-RADS with risk level of TR2 and no recommendation for fine-needle aspiration (FNA). FNA revealed benign thyroid nodule. Calipers were included in all images as part of automated nodule detection process.

By using the single-expert reader data, the areas under the receiver operating curve for the two systems were similar: 0.91 (95% confidence interval [CI]: 0.82, 0.98) for ACR TI-RADS and 0.93 (95% CI: 0.85, 0.98) for AI TI-RADS (P = .18). Their sensitivities (for detection of malignancy through recommendation of fine-needle aspiration) were the same (14 of 15, 93.3%; 95% CI: 77%, 100% for both), whereas the specificity of AI TI-RADS (55 of 85, 64.7%; 95% CI: 54%, 74%) was higher than that of ACR TI-RADS (40 of 85, 47.1%; 95% CI: 37%, 57%) (P < .001) (Fig 6).
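The paired structure of this specificity comparison can be illustrated with a short calculation. The text does not name the specific test beyond a difference between binomial proportions, so the sketch below assumes an exact McNemar-type test on the discordant benign nodules reported above (15 for which only ACR TI-RADS recommended fine-needle aspiration and none for which only AI TI-RADS did). The function name is illustrative, and this may not be the procedure that was run in R.

```python
from math import comb


def exact_mcnemar_p(b: int, c: int) -> float:
    """Two-sided exact P value for b versus c discordant pairs.

    Under the null hypothesis each discordant pair is equally likely
    to fall in either direction (binomial with probability 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


benign = 85
acr_true_negatives = 40   # benign nodules with no FNA recommendation, ACR TI-RADS
ai_true_negatives = 55    # benign nodules with no FNA recommendation, AI TI-RADS
acr_only_fna = 15         # benign: ACR TI-RADS recommended FNA, AI TI-RADS did not
ai_only_fna = 0           # benign: AI TI-RADS recommended FNA, ACR TI-RADS did not

# The discordant counts account for the whole difference in true negatives.
assert acr_true_negatives + acr_only_fna - ai_only_fna == ai_true_negatives

p_value = exact_mcnemar_p(acr_only_fna, ai_only_fna)
print(ai_true_negatives / benign, acr_true_negatives / benign, p_value)
```

With 15 discordant nodules all in one direction, the two-sided value is 2 × 0.5^15, or about 6 × 10^-5, which is consistent with the reported P < .001 under this assumed test.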

Figure 6: Box and whisker plot shows sensitivity and specificity for American College of Radiology (ACR) Thyroid Imaging Reporting and Data System (TI-RADS) compared with Artificial Intelligence (AI) TI-RADS. Boxes correspond to 25th and 75th percentiles for eight nonexpert readers. Whiskers denote maximum and minimum values, and green line represents median. Red squares represent expert consensus, and blue triangles represent single expert reader. Sensitivity of both systems was similar, while specificity was higher for all groups when using AI TI-RADS.

When ACR TI-RADS and AI TI-RADS were applied to the test set interpretation of the eight nonexpert radiologists, AI TI-RADS had higher specificity than did ACR TI-RADS for every reader. Mean specificity of the eight readers by using AI TI-RADS was 55.3% ± 12.8 compared with 47.5% ± 12.3 by using ACR TI-RADS (P < .001) (Table 4). The sensitivity of AI TI-RADS was lower for five of the eight nonexpert radiologists, although the mean sensitivity was not significantly lower (Table 4). Performance for the expert panel was similar, although the small increase in specificity was not statistically significant (Table 4).

Table 4: Performance Comparison for Three Different Sets of Readers When Using ACR TI-RADS versus AI TI-RADS

Discussion

Risk stratification systems for thyroid nodules at US are often affected by low specificity and poor interobserver agreement. We applied a machine learning technique to American College of Radiology (ACR) Thyroid Imaging Reporting and Data System (TI-RADS) to optimize the performance of the system while still maintaining the TI-RADS lexicon and structure. Our data-driven artificial intelligence (AI)–optimized version of TI-RADS (hereafter, AI TI-RADS) validates ACR TI-RADS: feature point allocations were the same for 15 of 23 features, and the highest risk features maintained the highest point values. However, alterations in point assignments under AI TI-RADS suggest that ACR TI-RADS may be simplified, as six features were assigned new point values of zero. Despite simplification, our results show a modest increase of 7% in mean specificity when applied to eight nonexpert radiologists.

ACR TI-RADS was based on literature review, expert consensus, and partial analysis of a database of proven nodules, and early studies of the system are encouraging. The system was validated in a multi-institutional study of more than 3400 nodules (8), and more recent retrospective studies have shown that it reduces nodule biopsy recommendations and improves accuracy compared with other biopsy guidelines (9–12). The point assignments derived from our AI TI-RADS model were similar to those of the ACR version, adding to the growing body of evidence supporting its use. Although point values in our AI model were different for eight features, most changed by only one point. Moreover, the area under the receiver operating curves for our data set by using ACR TI-RADS and AI TI-RADS were similar (0.91 and 0.93, respectively) and higher than that described in a recent analysis by Pantano et al (9) (area under the receiver operating curve, 0.78). Overall, our data support ACR TI-RADS.

ACR TI-RADS and our AI TI-RADS model had comparable receiver operating characteristic performance by using interpretations by two experts, although AI TI-RADS yielded slightly higher specificity and fewer recommendations for fine-needle aspiration. When applied to eight nonexpert readers, AI TI-RADS again had a small but statistically significant increase in specificity and minimal impact on sensitivity. This achieves a central aim of ACR TI-RADS: to focus on clinically significant thyroid cancers and to reduce fine-needle aspiration of benign nodules (5). It has been reported that overdiagnosis accounts for up to 77% of cases of thyroid cancer (4) and that more thyroid cancer diagnoses do not reduce mortality (13). Therefore, a small reduction in sensitivity seems acceptable in light of a larger gain in specificity. As well, many nodules not biopsied would meet the criteria for follow-up, mitigating the likelihood of missing cancers while potentially reducing health care costs.

Although AI TI-RADS had substantial overlap with and similar performance to ACR TI-RADS, the altered feature values suggest that ACR TI-RADS may be simplified (Fig 2). For example, in the composition category, nodules are assigned four different possible point values in ACR TI-RADS, whereas our AI TI-RADS model assigned three points to solid nodules and zero points for all other types. This simplified scheme, which focuses on solid nodules, aligns with data from Middleton et al (8), who showed that solid nodules had four times higher risk of malignancy than did mixed cystic and solid nodules. Two meta-analyses (14,15) have also shown that solid composition confers some degree of risk, but they did not directly compare them to mixed cystic and solid nodules. This modification would allow a reader to focus on only one feature within the composition category and may improve efficiency.

AI TI-RADS point assignment in the echogenic foci category also differed. Peripheral calcifications and punctate echogenic foci were unchanged, but AI TI-RADS assigned zero points to macrocalcifications (compared with one point for ACR TI-RADS). This would simplify a category that already contains multiple features with low interobserver variability compared with other TI-RADS features (6). This modification highlights ongoing uncertainty regarding the clinical importance of macrocalcifications (16–18). Some studies (19) suggest that they confer a higher degree of risk than originally thought, whereas other studies (8) suggest that this feature is less suspicious than are punctate echogenic foci or peripheral calcifications. Although the allocation of zero points does not imply zero risk, this suggests that macrocalcifications are less suspicious than are punctate echogenic foci or peripheral calcifications based on our data.

The three remaining TI-RADS categories—echogenicity, shape, and margin—were also simplified by AI TI-RADS. The algorithm eliminated points for hyperechoic and isoechoic nodules in the echogenicity category but preserved two and three points for the higher-risk hypoechoic and very hypoechoic features. AI TI-RADS also reduced the number of points for taller-than-wide nodules from three to one, a result that contradicts studies that showed taller-than-wide shape as a high-risk and specific marker of malignancy (14,15,20). The reason for this is unclear, but may be related to low sample size.

Our study had some limitations. The training set was collected from a single institution and feature assignments were based on expert readers. Although there was potential for overfitting given that an expert reader interpreted cases in both the training and test sets, the eight general radiologists were only interpreting the test set and performed relatively similarly to the expert, suggesting a reasonable fit for the model. In addition, a subset of features was assigned by a radiology fellow; however, cases were reviewed with an expert reader (member of the ACR TI-RADS committee), and as before, test data from the eight nonexpert readers and the expert consensus were not modified and helped to validate the model. Another limitation was the possibility of bias due to the test set being taken from the end of the study period, when newer scanners with improved image quality may have become available. However, scanners throughout the study period were considered high quality. We did not use a validation set as part of our training. Rather, we used cross-validation within the training cases. After all the hyperparameters were selected, we fixed them and trained by using the entire training set. We chose this approach because of the relatively limited number of cases available. As well, features of extrathyroidal extension and cystic composition did not have enough data to be analyzed. However, they represent extremes of the risk spectrum; the decision to biopsy or not is clear when either of these findings is present. We used integers to make new point assignments in AI TI-RADS to mimic ACR TI-RADS and to simplify categorization. This resulted in some features earning zero points, which may falsely imply that a given feature confers no risk of malignancy. Because AI TI-RADS removed points for seven features and added points for only one feature, it is unsurprising that more nodules were reassigned to a lower TI-RADS level. This would be expected to improve specificity (which was the case), but it is somewhat surprising that there was not a corresponding decrease in sensitivity, possibly due to the relatively small number of malignant nodules (15) in the test set. There were also limitations related to our use of the computational models. For example, the high-dimensional input space was relatively sparsely represented. Therefore, for unusual inputs, it is possible that the model returned unexpected and difficult-to-explain results.

Nonetheless, to our knowledge, this study represents a unique computational validation of one of the numerous risk stratification systems for thyroid nodules. Continued performance improvement is vital, and subsequent work could focus on further enhancements. Increasing the number of training cases represents a possible avenue for improvement. Future efforts could also include nodules with indeterminate pathologic results to broaden the mix of nodules included, which may enhance generalizability and performance.

In conclusion, artificial intelligence (AI)–optimized Thyroid Imaging Reporting and Data System (TI-RADS) is a data-driven model for risk stratification of thyroid nodules at US that both validates American College of Radiology (ACR) TI-RADS and suggests modifications to it that may improve its performance and enhance applicability. Prospective studies with long-term follow-up will be needed to refine ACR TI-RADS and assess its impact on clinical outcomes.

Author Contributions

disclosed no relevant relationships. disclosed no relevant relationships. disclosed no relevant relationships. disclosed no relevant relationships. disclosed no relevant relationships. disclosed no relevant relationships. Activities related to the present article: disclosed no relevant relationships. Activities not related to the present article: received payment for expert testimony from Davis, Florie, and Starnes; received payment from American College of Radiology for development of educational presentation and for travel/accommodations/meeting expenses unrelated to activities listed. Other relationships: disclosed no relevant relationships. disclosed no relevant relationships.

Author contributions: Guarantors of integrity of entire study, B.W.T., M.A.M.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, B.W.T., M.B., J.K.H., W.D.M.; clinical studies, W.D.M.; experimental studies, M.B., D.T., F.N.T., M.A.M.; statistical analysis, B.W.T., M.B., J.K.H., D.T., M.A.M.; and manuscript editing, all authors.