Abstract The Autism and Developmental Disabilities Monitoring (ADDM) Network conducts population-based surveillance of autism spectrum disorder (ASD) among 8-year-old children in multiple US sites. To classify ASD, trained clinicians review developmental evaluations collected from multiple health and education sources to determine whether the child meets the ASD surveillance case criteria. The number of evaluations collected has dramatically increased since the year 2000, challenging the resources and timeliness of the surveillance system. We developed and evaluated a machine learning approach to classify case status in ADDM using words and phrases contained in children’s developmental evaluations. We trained a random forest classifier using data from the 2008 Georgia ADDM site which included 1,162 children with 5,396 evaluations (601 children met ADDM ASD criteria using standard ADDM methods). The classifier used the words and phrases from the evaluations to predict ASD case status. We evaluated its performance on the 2010 Georgia ADDM surveillance data (1,450 children with 9,811 evaluations; 754 children met ADDM ASD criteria). We also estimated ASD prevalence using predictions from the classification algorithm. Overall, the machine learning approach predicted ASD case statuses that were 86.5% concordant with the clinician-determined case statuses (84.0% sensitivity, 89.4% predictive value positive). The area under the resulting receiver-operating characteristic curve was 0.932. Algorithm-derived ASD “prevalence” was 1.46% compared to the published (clinician-determined) estimate of 1.55%. Using only the text contained in developmental evaluations, a machine learning algorithm was able to discriminate between children who do and do not meet ASD surveillance criteria at one surveillance site.

Citation: Maenner MJ, Yeargin-Allsopp M, Van Naarden Braun K, Christensen DL, Schieve LA (2016) Development of a Machine Learning Algorithm for the Surveillance of Autism Spectrum Disorder. PLoS ONE 11(12): e0168224. https://doi.org/10.1371/journal.pone.0168224 Editor: Valsamma Eapen, University of New South Wales, AUSTRALIA Received: February 18, 2016; Accepted: November 28, 2016; Published: December 21, 2016 This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication. Data Availability: The primary data in this analysis are medical and educational evaluations collected for public health surveillance. Due to the sensitive nature of these documents, we will make these data available (upon request) in the form of the final term-document matrices used to train and test the model’s performance rather than the raw text of the evaluations, as well as the random forest classifier (R object) that was trained on these data. The CDC’s National Center on Birth Defects and Developmental Disabilities (NCBDDD) requires a signed data use agreement by anyone requesting data from the Metropolitan Atlanta Developmental Disabilities Surveillance Program (MADDSP) to ensure that: 1) the data are analyzed for the specific purpose of the proposal submitted, and 2) the investigator will not try to identify any child or present stratified analyses leading to a sample <5 children. These two points are what result in the dataset being considered a restricted public use dataset. All requests for MADDSP public use datasets should be submitted to: ncbddddata@cdc.gov. Funding: The authors received no specific funding for this work. Competing interests: The authors have declared that no competing interests exist.

Introduction Autism spectrum disorder (ASD) refers to a group of neurodevelopmental disorders characterized by impairments in social communication and repetitive behaviors and restricted interests. Like many other conditions described in the Diagnostic and Statistical Manual of Mental Disorders (DSM), a diagnosis of ASD is based on the observation of behavioral features [1] and the specific cause of ASD is not known in many, if not most, cases. [2] A major challenge for ASD surveillance systems—and large studies of ASD in general—is reliable ascertainment of ASD. Although rigorous ASD diagnostic instruments exist, clinicians use a variety of tools and approaches in everyday practice. [3] It is often infeasible for large-scale or population-based studies to classify ASD using the “gold-standard” practices used in clinical settings. Instead, many epidemiological studies rely—sometimes exclusively—upon existing “administrative” designations for ASD classification: International Classification of Diseases, 9th Revision (ICD-9) billing codes, special education categories, or eligibility for disability benefits and services specific to autism (such as Medicaid). [4–9] In the United States, there is considerable variability in the utilization of these classifications and, because their primary intended purposes are to ensure appropriate service provision to individuals rather than to classify disabilities, these systems do not uniformly identify all individuals that meet ASD criteria in the population. [10–12] To address the limitations of relying solely on existing codes or classifications for population-based tracking, the Centers for Disease Control and Prevention (CDC) developed a population-based ASD surveillance protocol that uses information from multiple health and education sources, and does not rely entirely upon existing ASD diagnoses or classifications. 
The Autism and Developmental Disabilities Monitoring (ADDM) Network uses a detailed process in which each site collects developmental evaluations from clinics and schools in their community. ADDM staff abstract verbatim descriptions from the evaluations, and experienced ADDM clinicians review children’s composite information to determine whether the descriptions of symptoms are consistent with ASD diagnostic criteria described in the DSM. [13] Approximately 20% of children meeting ADDM ASD criteria do not have a previously documented ASD diagnosis or classification. [14] Given this labor-intensive review process, the timeliness and scope of the surveillance system are challenged by the continually-increasing volume of information that must be manually reviewed. While the overall reported prevalence of ASD has increased 120% between 2000 and 2010, there has been an even more dramatic increase in the annual number of evaluations that ADDM Network clinicians must review. For example, clinicians from the Georgia ADDM Network site reviewed 1,152 evaluations in 2000 and 9,811 in 2010—an increase of 750%. To potentially improve the efficiency and timeliness of ASD surveillance, we developed and evaluated a machine learning-based algorithm that predicts whether a child will meet ASD surveillance criteria, using the words and phrases contained in a child’s evaluations.

Methods ASD surveillance system and data This study used data collected by the Metropolitan Atlanta Developmental Disabilities Surveillance Program, the Georgia site of the ADDM Network, from the 2008 and 2010 surveillance years. The study area covers five counties in metropolitan Atlanta. Following ADDM Network protocol, health and special education records for children aged 8 years and living in the study area during the surveillance year are requested from multiple clinics and schools in the community. Health records are requested if they are associated with certain ICD-9 billing codes, and special education records are requested from schools if the child is assigned to the autism special education eligibility category or to another category that might overlap with autism. These records are reviewed by trained record abstractors and, if ASD symptoms are present, all of a child’s developmental evaluations are copied into the surveillance database. Trained ADDM study clinicians review all abstracted evaluations and follow a protocol to code each evaluation for descriptions of DSM-IV-TR diagnostic criteria for pervasive developmental disorder-not otherwise specified or autistic disorder and also indicate whether the child had a previous ASD diagnosis. A clinician reviewer may classify a child’s record as an ASD case if there is sufficient description of the required number and pattern of behavioral features and the clinician decides ASD is an appropriate classification. For the purposes of achieving consensus and maintaining reliability, at least 10% of the records are reviewed by a second clinician. Additional reviews are also performed if a clinician reports a low degree of certainty about the ASD classification, and there is a defined process to reach a consensus. 
The clinician review process has high inter-rater reliability (90.7% agreement, kappa = 0.80), and a recent study demonstrated high sensitivity and specificity by having unaffiliated clinicians independently classify ADDM data. [14, 15] Additional details about the ADDM Network methods have been thoroughly described elsewhere. [13–19] In 2008, the Georgia surveillance site abstracted 5,396 evaluations for 1,162 unique children; 601 children met the surveillance ASD case definition. In 2010, the surveillance dataset contained 1,450 unique children with a total of 9,811 evaluations; 754 children met the ASD case definition. This study was submitted to the Centers for Disease Control and Prevention Institutional Review Board and was determined to be a non-research activity (public health practice) and was not required to undergo human subjects review. Processing text from evaluations We used a “bag-of-words” approach, which captures the relative frequency of a word or phrase in a document, and disregards the order of the neighboring words or phrases. We extracted each child’s evaluations into a single body of text, removed punctuation symbols and numbers, converted letters to lower-case, and “stemmed” the words (removed word endings). We considered all words or phrases occurring in at least 3% (or 35 of 1,162) of the children’s files. Instead of simply counting the frequency for each word or phrase, we used term frequency-inverse document frequency (TF-IDF) weighting (a common approach that considers both the frequency of a term in each child’s records and the proportion of children that have this term in their records). The resulting term-document matrix contained 1,162 children (rows), and 13,135 1–3 word phrases (columns). Classification to predict ASD case status We used random forests [20], an ensemble classification method, to accomplish two tasks. The first was to identify the subset of words and phrases that are most important for classifying ASD. 
The second task was to build an algorithm from the useful words and phrases to perform the actual classification. We developed (trained) the algorithm using only the 2008 surveillance data, and evaluated (tested) its predictive ability on the 2010 data. We used random forests’ permutation-based variable importance scores to select the most important words. The scores represent the mean decrease in classification accuracy (over the entire forest) when the values for a word or phrase are replaced with random values. The change in accuracy is estimated by using the sub-sample of data that was not used to build a given tree (i.e., the “out-of-bag” sample). We selected the top 175 words and phrases to determine which to include in the final model (Fig A and Table A in S1 File). Random forests generate many independently-grown decision trees, and the consensus vote of all the trees (‘the forest’) forms the final classification (S1 Fig and S1 File). For a dichotomous classification, the default is to choose the outcome predicted by the majority of the trees. Because we have slightly fewer non-ASD than ASD cases, we adjusted the cut-off to reflect these proportions (561/1,162 = 0.483 versus the default of 0.5). We also explored how alternate classification cut-offs affect performance (Figs B and C in S1 File). We used the algorithms developed with the 2008 data to classify 2010 data. We compared the algorithm to the final clinician-derived classification and calculated percent agreement, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), Cohen’s kappa, and the area under the receiver-operating characteristic (ROC) curve. We also calculated ASD “prevalence” estimates based on the algorithm’s classifications. We performed all analyses with R 3.1.2 (R Foundation for Statistical Computing, Vienna, Austria), and in particular, the randomForest package. 
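As a rough illustration of the cut-off and the agreement statistics described above, the following Python sketch (not the authors' R code) applies a vote-fraction threshold and computes the listed measures from a 2x2 table. The function names are hypothetical. Treating 0.483 as a threshold on the fraction of trees voting ASD is an assumption, since the text does not state how the cut-off was applied. The example counts are also an assumption: they are not reported in the text and were back-calculated so that the outputs land near the published 2010 figures.

```python
def classify(asd_vote_fraction, cutoff=561 / 1162):
    """Return True (ASD) when the share of trees voting ASD exceeds the cut-off.

    Assumption: the cut-off is applied to the ASD vote fraction.
    """
    return asd_vote_fraction > cutoff


def agreement_metrics(tp, fp, fn, tn):
    """Agreement statistics for algorithm (rows) versus clinician (columns)."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals (Cohen's kappa).
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    return {
        "agreement": observed,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "kappa": (observed - expected) / (1 - expected),
    }


# Hypothetical counts (NOT from the paper): 754 clinician-confirmed ASD
# cases among 1,450 children, chosen to approximate the reported results.
example = agreement_metrics(tp=633, fp=75, fn=121, tn=621)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Because kappa corrects observed agreement for the agreement expected from the marginal totals alone, it is a stricter summary than percent agreement when the two classes are nearly balanced, as they are here.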
We generated a forest of 10,000 trees for the initial feature selection and forests of 3,000 trees for the reduced-feature set models, including the final model. We kept the default settings for all other parameters. Finally, we compared the characteristics of children that were concordantly classified by the machine learning algorithm and by the clinicians to those that were discordant. We examined the proportion with a previously-documented ASD diagnosis (note that this is separate from the surveillance classification), number of evaluations collected for each child, sources of evaluations, demographic characteristics, and the proportion of children for whom the ADDM clinicians requested a secondary review (if they were uncertain about case status).
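The text-processing step described earlier in the Methods can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' R pipeline: the function names are hypothetical, stemming is omitted for brevity, and the exact TF-IDF formula the authors used is not given, so the common form (relative term frequency multiplied by log(N / document frequency)) is assumed.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep alphabetic runs only,
    which drops punctuation and numbers."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, max_n=3):
    """Return every 1- to max_n-word phrase in the token list."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf_matrix(documents, min_doc_share=0.03):
    """Return one {term: weight} row per document (one document per child).

    Terms appearing in fewer than min_doc_share of documents are dropped,
    mirroring the 3% threshold described in the text.
    """
    counts = [Counter(ngrams(tokenize(doc))) for doc in documents]
    n_docs = len(documents)
    doc_freq = Counter(term for c in counts for term in c)
    vocab = {t for t, df in doc_freq.items() if df / n_docs >= min_doc_share}
    rows = []
    for c in counts:
        total = sum(c.values()) or 1  # guard against empty documents
        rows.append({t: (c[t] / total) * math.log(n_docs / doc_freq[t])
                     for t in c if t in vocab})
    return rows


# Invented snippets for illustration only; these are not surveillance data.
docs = [
    "Poor eye contact. Repetitive hand flapping noted.",
    "Good eye contact; speech delay.",
    "Speech delay, age 8.",
]
rows = tfidf_matrix(docs, min_doc_share=0.5)
print(sorted(rows[0]))
```

Under this weighting, a phrase that appears in every child's file receives a weight of zero, which is why TF-IDF tends to favor terms that distinguish some children's records from others.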

Discussion These results demonstrate that a machine learning algorithm can discriminate between children that do and do not meet ASD surveillance criteria, among children with developmental concerns. Currently, the ADDM Network employs highly-trained clinicians to manually review each child’s developmental evaluations (often multiple evaluations per child), requiring an average of 45 to 60 minutes per child. Therefore, if the system must review an increasing number of records, it will require proportional increases in the resources needed to complete this task. In contrast, an automated approach requires relatively fixed resources for nearly any amount of information, and offers the potential to improve the efficiency and timeliness of the surveillance system. Using only the words and phrases contained in a child’s records, the algorithm correctly predicted the clinician-assigned ASD case definition for 86.5% (kappa = 0.73) of the children captured by the surveillance system. This is slightly lower than the clinician inter-rater agreement observed for the overall 2010 ADDM Network (90.7%, kappa = 0.80). [14] Because the algorithm is trained on the clinician-assigned ratings, it is unlikely that agreement between the algorithm and a clinician would ever exceed inter-rater clinician agreement. On the other hand, the algorithm will have perfect intra-rater reliability, as it will always make the same classification for a given set of evaluations. An essential question is: what level of performance—if any—would be considered “acceptable” in order to trust the algorithm’s predictions? Of note, the algorithm-clinician agreement was similar to the inter-rater agreement reported by two other groups doing similar ASD classification on the basis of health records (one reported a kappa of 0.73 [21], and the other 88% agreement [22]). The algorithm was more likely to misclassify children with certain characteristics. 
In particular, it was less sensitive to classifying ASD among children with fewer evaluations and those that were older when first evaluated. We also observed that the algorithm was more likely to misclassify children that underwent a secondary review by the ADDM clinicians (compared to those that did not undergo secondary review), suggesting these might be more difficult for the clinicians, as well. It might be possible to address some shortcomings by allowing the algorithm to consider the source or number of evaluations, or the age of the child at each evaluation. Alternately, the current model might serve as a useful “filter” to select the records that need manual review. As shown in Fig 1, the predictive values at the extreme ends of the range are quite high, with more misclassification in the middle. These scores could be used to identify records that need clinician reviews (e.g., a score of 0.50) versus those that are “safe bets” (e.g., scores over 0.80 or below 0.20). If, in the future, the surveillance system is able to electronically receive the contents of medical and educational evaluations, this type of “filter” could be immensely useful. A previous study used an analogous approach using early intervention (birth to three years) records to predict which children would later be diagnosed with ASD. [23] The best-performing model from that study reported 91.4% precision (PPV) and 58.2% recall (sensitivity); the lower sensitivity is possibly due to a highly imbalanced ratio of ASD to non-ASD children. While this study had somewhat different goals from ours, the two studies suggest that text-based machine learning techniques may one day be useful in a variety of public health applications concerning ASD. Other recent studies utilizing electronic health information have focused on using medical billing (ICD) codes to detect individuals with ASD. 
[24,25] These approaches are likely well-suited for case-control studies, where PPV might be more important than sensitivity, but will not detect individuals with ASD that do not have ICD codes. Because the algorithm we developed does not consider ICD-9 codes (as special education records do not assign them), the two approaches could be used to jointly classify ASD when both ICD-9 codes and evaluation text are available. In the future it may be possible to train classification algorithms on ADDM data and distribute them to help identify individuals with ASD from electronic records. Although these results are promising, additional work is needed to evaluate the utility of this approach for ongoing ASD surveillance. For instance, performance characteristics—such as NPV or specificity—could be different in other populations. We trained the algorithm on a single year of data from one ADDM site and tested it on the following year’s data from the same site; we would need to evaluate whether similar performance could be achieved across ADDM sites or in other populations. We would also need to monitor performance so that it does not drift or degrade over time. In particular, the relatively recent changes to the ASD diagnosis in the DSM-5 could affect the terms used to describe ASD symptoms. Likewise, the surveillance case definition for the ADDM Network may change to reflect the DSM-5 criteria. For these reasons—and others—it is likely that any long-running system would require some level of continued manual review to assess the performance and quality of the system. Nevertheless, even a partially automated approach—in which a clinician might confirm or augment the algorithm’s predictions—could result in a substantial reduction in required resources. 
The ADDM clinicians currently code a variety of behavioral symptoms and produce much more information than a dichotomous case classification; it remains to be seen whether these methods could reliably classify specific symptoms in addition to the overall ASD classification. We plan to pursue more granular classification algorithms for specific symptoms or for different populations where our current performance was weakest (such as girls, children only seen after age 6, or children without an intellectual disability). It would also be useful to estimate how well the algorithm (and the ADDM methods in general) compares to other ASD classifications, such as in-person assessments. Quantifying this textual information in a reproducible way will provide novel opportunities to better understand how children are evaluated for ASD in typical community settings. This study is based on a large, population-based surveillance system that has routinely performed ASD surveillance in metropolitan Atlanta for more than a decade. The ASD surveillance case definition uses a well-established protocol for ascertaining ASD from record review, including extensive documentation, training materials, and inter-rater reliability for this procedure. As a by-product of conducting surveillance, the ADDM Network generates information that is useful for training text-based ASD classification algorithms. With relatively small modifications, it could efficiently produce a large volume of very specific examples that could be used to identify particular symptoms or behaviors. Ultimately, the approach piloted in this study could be trained on a much larger sample representing a diversity of community providers and behavioral evaluations.
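The “filter” idea raised earlier in the Discussion could look something like the following Python sketch. It is only an illustration under stated assumptions: the 0.20 and 0.80 thresholds are the examples given in the text rather than validated cut-points, the score is assumed to be the fraction of trees voting ASD, and the function name, labels, and scores are hypothetical.

```python
from collections import Counter


def triage(score, low=0.20, high=0.80):
    """Route a record based on its algorithm score.

    Scores above `high` or below `low` are treated as "safe bets";
    everything in between is sent to a clinician.
    """
    if score > high:
        return "auto: ASD"
    if score < low:
        return "auto: non-ASD"
    return "manual review"


# Hypothetical scores for six records (invented for illustration).
scores = [0.95, 0.50, 0.12, 0.81, 0.33, 0.05]
workload = Counter(triage(s) for s in scores)
print(workload)
```

Narrowing the band between the two thresholds would reduce the clinician workload at the cost of accepting more automated misclassifications, so any such thresholds would need to be chosen against observed predictive values.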

Conclusion Public health surveillance systems are constantly challenged to become faster or better, or to provide the same information at lower cost. [26] We observed that an automated approach could predict, with high agreement, whether a child would meet ASD surveillance criteria. While there are many logistical issues to consider, these results hint at the potential for using machine learning approaches to identify ASD from unstructured text data.

Supporting Information S1 Fig. Application of Random Forests to Autism Surveillance Data. https://doi.org/10.1371/journal.pone.0168224.s001 (PDF) S1 File. Random Forests: identifying important words and phrases and classification cutoffs. Fig A: Plot of random forest (RF) importance scores versus rank of importance scores for 13,135 words and phrases in 2008 training dataset. Vertical line is drawn at the 175th-most important term. Table A: Most important terms in final random forest model. Fig B: Accuracy of model when different classification cut-points are selected. Fig C: Sensitivity (red) and positive predictive value (PPV) (blue) at different cut-off thresholds. Vertical line shows the classification threshold used in the final model. Fig D: Algorithm classification scores and clinician certainty scores for autism spectrum disorder (ASD) surveillance. Fig E: Algorithm classification scores and Autism and Developmental Disabilities Monitoring (ADDM) Network clinician requests for a secondary review. https://doi.org/10.1371/journal.pone.0168224.s002 (DOCX)

Acknowledgments Disclaimer: The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.

Author Contributions Conceptualization: MJM. Data curation: KVNB DLC MYA. Formal analysis: MJM. Investigation: MJM MYA KVNB DLC LAS. Methodology: MJM MYA KVNB DLC LAS. Supervision: LAS. Validation: MJM. Visualization: MJM. Writing – original draft: MJM. Writing – review & editing: MJM MYA KVNB DLC LAS.