We have developed two text mining applications designed to extract quantitative data from psychiatric clinical text. Two distinct NLP approaches are described to identify and classify suicide ideation and attempts, both of which performed well, as indicated by high precision and recall statistics. This is one of the few current studies exploring and applying text mining software specifically to build classification algorithms within a large psychiatric database to detect suicidal ideation and attempts29,30.

Although previous research is sparse, the performance of our NLP tools resonates with other published studies using bespoke NLP techniques and machine learning NLP tools to classify suicide-related ideation and attempt data. We found a precision of 91.7% for identifying suicide ideation and 82.8% for identifying suicide attempts. A study comparing structured ICD-9 codes (E950–959) with suicidality-specific concept unique identifiers generated by an NLP process to identify patients with suicidal thoughts or behaviours found a precision of 97% (50 patient records used in training; 280 patient records used as the gold standard) when ICD-9 codes were combined with the NLP approach, versus 60% when using the NLP algorithm alone26. Another study used accident and emergency records to extract patient demographics, clinical observations and laboratory test data to classify (or predict) suicide attempts using seven machine learning techniques (namely, predictive association rules, decision trees, neural networks, logistic regression, random forest, naïve Bayes and support vector machine). They found that random forest and naïve Bayes (two examples of machine learning classification methods used in NLP) best reproduced the manually calculated suicide attempt prevalence, with precisions of 93.1% and 96.9% respectively22 (483 patient records, with 112 suicide attempts, in the training set and 438 patient records in the test set).

It is difficult to compare these two studies with ours owing to major methodological differences; however, both represent innovative means of detecting suicidal and self-harm behaviour. The relatively high precision values reported in each paper could be owing to the algorithms being designed around a small and precise set of cases, which may not generalize to larger datasets. Metzger et al. reported using low numbers of cases in training sets, while Haerian et al. did not report how many cases were present within their training set of 50 patients. Both studies also reported difficulty in contextualising the identified suicidality concept and hence in ruling out false positive mentions (e.g. distinguishing instances of self-harm from suicide attempts). In addition, the dependence on structured codes alongside the NLP algorithm means the text mining process may perform poorly within datasets that do not use structured ICD codes to record suicidality (as shown in their results). Metzger et al. conducted their study in an emergency site without any available psychiatric services, which may bias the predictive model and prevent it from being applicable to other datasets. Moreover, the reliance on several patient clinical features in this model weakens the usability of the algorithm in the absence of such data. Nevertheless, Metzger et al. and Haerian et al. provide robust means with which to extract or identify suicidal behaviour from observational datasets. We argue that our tools offer further malleability and adaptability to other databases and build on the methods presented by Metzger et al. and Haerian et al.

There are limitations to our approaches: firstly, in the definition of each suicidality concept (i.e. a basic definition of ideation and a comprehensive definition of suicide attempts) and, secondly, in the choice of NLP techniques (i.e. rule-based for suicide ideation, and hybrid machine learning and rule-based for suicide attempt). For the suicidal ideation classifier, the use of basic rules (i.e. sentences containing the terms “suicid*” and “ideat*”) restricts the model from detecting other permutations and variations in text describing suicide ideation. While it is acknowledged that the JAPE code written to identify and classify sentences containing “suicid*” and “ideat*” mentions could be deemed too simplistic, using a lengthy list of terms, as we did to identify suicide attempts, runs the risk of increasing false positive instances, making models harder to train, as we saw with the further refinement required to optimise the suicide attempt classifier. This refinement will vary between databases, making our algorithm less generalizable to databases not accompanied by text mining expertise. On the other hand, using a single permutation such as “suicid*” followed by “ideat*” is justified because “suicide/al ideation” is a standard term in clinical and research settings and hence is recorded in clinical notes alongside “suicidal thoughts”, making the latter redundant. It is also worth noting our attempt to extract suicide attempt data separately from self-harm: while the definitions of suicide attempt and self-harm differ, in clinical recording these differences are not as distinct. Another limitation is the sole use of SVM as a classifier model within a pre-defined framework. Some studies argue that such models are too complex to be understood by untrained researchers and that the underlying classification mechanism is not sufficiently clear to easily allow for edits31.
Moreover, other machine learning algorithms, such as random forest or naïve Bayes, have been applied successfully to similar tasks and could be worth investigating further. Finally, a limitation of both classifiers is the use of non-standardised dictionaries. The dictionary defining a suicide attempt was generated after a thorough analysis of patient records and with clinical advice. These dictionaries could limit the generalisability of the algorithms to other databases.
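To illustrate the rule-based component, the sentence-level rule described above (a sentence is flagged when it contains both a “suicid*” and an “ideat*” token) can be sketched as follows. This is a hedged Python approximation only; the study itself implemented the rule as a JAPE grammar within GATE, and the function name `flag_suicidal_ideation` is hypothetical.

```python
import re

# Illustrative sketch only: the study used GATE/JAPE rules, not Python.
# A sentence is flagged when it contains both wildcard stems.
SUICID = re.compile(r"\bsuicid\w*", re.IGNORECASE)
IDEAT = re.compile(r"\bideat\w*", re.IGNORECASE)

def flag_suicidal_ideation(sentence: str) -> bool:
    """Return True if the sentence contains both a 'suicid*' and an 'ideat*' token."""
    return bool(SUICID.search(sentence)) and bool(IDEAT.search(sentence))
```

As discussed, such a rule only flags the mention itself; distinguishing affirmed from negated ideation (e.g. “denies suicidal ideation”) still requires further context handling.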

We recognize that further training iterations with TextHunter, using larger corpora for the classifier, might have further improved the performance of the tool. However, owing to the excessive number of neutral mentions of suicide attempts (for example, questionnaire items such as “Have you ever attempted suicide?” or subheadings such as “Past suicide attempts”), we chose instead to develop post-processing rules to exclude these generic neutral mentions. We also appreciate that the heuristics could have been applied before generating the training and test sets. Moreover, instead of adding a post-processing step with heuristics to mitigate classifier errors, an alternative would have been to design the classification task with an appropriate sample for the machine learning algorithm to learn from.
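The post-processing idea described above can be sketched as a simple filter over candidate mentions. This is an illustrative Python sketch under our own assumptions: the two patterns shown are hypothetical examples of a questionnaire item and a subheading, not the study's actual exclusion rules.

```python
import re

# Hedged sketch of the post-processing step: drop "neutral" mentions of
# suicide attempt such as questionnaire questions or bare subheadings.
# These patterns are hypothetical examples, not the study's actual rules.
NEUTRAL_PATTERNS = [
    re.compile(r"have you ever attempted suicide\?", re.IGNORECASE),  # questionnaire item
    re.compile(r"^past suicide attempts?:?\s*$", re.IGNORECASE),      # bare subheading
]

def is_neutral_mention(text: str) -> bool:
    """Return True if the text matches a known neutral-mention pattern."""
    return any(p.search(text.strip()) for p in NEUTRAL_PATTERNS)

def filter_mentions(mentions: list) -> list:
    """Keep only mentions that are not generic neutral boilerplate."""
    return [m for m in mentions if not is_neutral_mention(m)]
```

In practice such heuristics could equally be applied before training-set generation, as noted above.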

Though these NLP tools enable us to harness the potential of large longitudinal datasets to explore suicidal behaviour from new perspectives, during their development there is always a decision to be made on how to define the variable of interest (in this case, a symptom or behaviour) and then on the best NLP approach to extract the appropriate data with good sensitivity and specificity. Do we identify all variations of a given variable (quantity) or the most common occurrence of the variable (quality)? We opted for the latter in developing the suicide ideation NLP tool (to take into account the fairly standard terminology used by clinicians to record the presence or absence of suicide ideation) and the former for suicide attempts (to encapsulate the various ways in which suicide attempts are mentioned in the psychiatric clinical notes in this database). We postulate that using a bespoke dictionary makes the classification tool more adaptable to the richness of the dataset and its recording practices. Given their performance, both classifiers fulfil the objective for which they were designed: to identify patients who have ever experienced suicidal ideation or a suicide attempt.

There are clear strengths to both text mining approaches reported here. These NLP tools were built to suit this particular observational dataset, with the primary aim of maximising data capture on suicidal behaviour written in free-text notes with good precision and recall. The results from our evaluation study demonstrate good performance and detection of a larger case sample. The NLP tools developed in our approach do not rely on any codes or structured fields to function, which broadens the scope for retrieving data from other textual fields. Though we would not propose that our tools be applied without modification to another clinical setting, the infrastructure of the classification models allows for development and, more importantly, adaptation to different datasets and researcher objectives, given sufficient clinical knowledge of how suicidal behaviour is recorded. For example, the dictionaries can easily be updated to include typos or common clinical recording phrases. In addition, this study provides some insight into using both an intricate and a simple approach to detecting suicidality; depending on the complexity of data in other databases, either methodology can be used to ensure optimal detection. Finally, the algorithm independently detects mentions of suicidal ideation or attempts, as opposed to using further variables to predict such behaviour (i.e. predictive modelling), which may require further text mining.
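To illustrate how easily such a dictionary might be updated, the sketch below shows a minimal phrase dictionary with a helper for adding typos or local recording phrases. The entries and function names are hypothetical examples; the study's actual dictionary was derived from analysis of patient records and clinical advice.

```python
# Illustrative sketch: a phrase dictionary for suicide attempt mentions that
# can be extended with typos or local recording phrases. All entries below
# are hypothetical examples, not the study's actual dictionary.
ATTEMPT_PHRASES = {
    "attempted suicide",
    "suicide attempt",
    "overdose with intent to die",
}

def add_variants(*phrases: str) -> None:
    """Extend the dictionary, e.g. with common typos seen in notes."""
    ATTEMPT_PHRASES.update(p.lower() for p in phrases)

def mentions_attempt(text: str) -> bool:
    """Return True if the text contains any dictionary phrase."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in ATTEMPT_PHRASES)
```

For example, a recurrent misspelling observed in notes could be registered with `add_variants("suiside attempt")`, after which the classifier's lookup stage would capture that variant too.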

The risk of suicide is many times higher among patients with a mental health diagnosis than in the general population32. Hence our training set is likely to have contained a sufficient range of examples of how suicidal ideation or attempts are mentioned, as well as ample cohort sizes for testing models. The large database and corpus of suicide attempt instances allowed us to distinguish between self-harm and suicide attempt and to exclude the former when developing classification models. The decision to exclude self-harm was based purely on clinical differences between suicide attempt and self-harm behaviour: the former is carried out with intention to die (regardless of lethality), whereas the latter may manifest as a cry for help or attention, among many other possible motivations. There may be different risk factors or motivations for each behaviour, and different treatments are chosen for each; hence it is important to maintain a distinction between the two. On assessing the proportion of self-harm instances detected as false positives, we found nine instances out of 500. Finally, GATE provides a user-friendly platform to refine the performance of the algorithm to achieve optimal precision and recall, allowing improvement on weaknesses in the algorithm, the dictionaries or the definition of the data variable.

Of relevance to the NLP models developed in this project, there are existing structured fields within the patient EHR dedicated to recording suicidal behaviours. In-depth risk assessment schedules, which include binary ‘tick box’ fields indicating the presence or absence of previous/current suicide ideation and suicide attempts, and structured ICD-10 diagnostic fields, are designed to capture patients’ experiences of suicidal ideation or attempts. However, their relatively lower usage compared with Event and Correspondence note fields gives impetus to designing NLP tools to detect suicidal ideation or attempts recorded elsewhere. In a small study conducted to demonstrate this lower usage, we compared the number of patients identified using structured fields only with the number identified using our NLP tools. Figure 2 shows diagrammatically the increased number of patients identified when using the NLP tools versus dedicated structured fields alone. Previously, Neuman et al. described the use of clinical text mining as a proactive approach, overcoming the limits of selective use of questionnaires or structured forms, which need to be administered to patients to screen for depression33. They described text mining software that detects mentions of depression in web-texts and uses further lexical analysis to increase precision in detecting depressive episodes. The algorithm can be used as a screening tool to alert web-users to seek further advice. Sohn et al. described the development of a hybrid rule-based and machine learning approach to identify as many adverse drug reactions from clinical notes as possible, so as not to rely on manual chart reviews34, which are “time-consuming and effort-intensive” for routine use. In addition, Castro et al.
highlight the importance of using NLP techniques within EHRs to avoid labour-intensive and costly methods of validating potentially misclassified diagnoses recorded in structured forms. In this regard, our results also support the use of data extracted by text mining over data extracted from structured forms35. While one approach to improving data availability might be to constrain clinicians to complete structured fields in EHRs, this process can promote automaticity and shift the focus away from deep probing and analytical clinical thinking36. We believe that NLP will continue to grow in value, as, from both clinical communication and medicolegal perspectives, narrative text will remain integral to psychiatric EHRs, particularly for descriptions of each patient’s symptoms and mental state.

Figure 2 Venn diagrams comparing patient numbers obtained when (a) using NLP to identify suicide ideation versus using Risk Assessment (structured) fields only; (b) using NLP to identify suicide attempts versus using Risk Assessment (structured) fields only and (c) using NLP to identify suicide attempt versus using ICD-10 codes for suicide attempt only.

Similar to our study, Castro et al. applied a machine learning approach and a rule-based approach to identifying bipolar disorder35. They also concluded that either method was as good as manual identification of bipolar disorder. Unlike in our findings, their rule-based approach required a more relaxed set of “non-hierarchical” rules to improve performance. This adds to the rationale that NLP tools may require some processing and modification to reach their optimal performance. Bearing in mind future directions for the reported classifiers, there is scope for further work (1) to robustly refine the tools, across both the rule-based and hybrid machine learning approaches, and to evaluate them more rigorously, and (2) to develop temporal reasoning solutions to distinguish suicidal behaviour at different time points, which will be beneficial for longitudinal observation and analysis. The current hybrid machine learning and rule-based classification algorithm for detecting suicide attempts could benefit from the incorporation of lexical and semantic analysis, where further machine learning, dictionaries and exclusion rules can be implemented to improve word-to-concept relations33. Both classification algorithms should be applicable, with or without further adaptation, to databases used for research on emotion detection within suicide notes24,37 and for predicting suicidal behaviour using retrospective risk factor data from EHRs7. Both areas of research have previously incorporated NLP techniques on occasion but have had limitations, including small sample sizes. Moreover, there are calls to use mixed methodology to make progress in suicide research, integrating hypotheses generated from qualitative analyses with quantitative exploration38, using text from large EHR cohorts together with NLP data identifying suicide attempts; such research is potentially now more feasible.
Finally, such NLP classification software might support clinical decision support algorithms, combining patient-reported outcomes, health-related information and clinical observations to assist clinicians in providing optimal suicide prevention care options20. There are potential applications in routine clinical surveillance and service evaluation. For example, Neuman et al. describe one utility of their text mining algorithm (which screens for depression within web-texts) as a flag system for individuals who may have a significant depressive disorder and need referral to psychiatric care services33.

We have described two NLP solutions to identify suicide ideation and suicide attempts in a psychiatric database, both of which performed well as classifiers. Each classifier enables research that would otherwise be limited by small sample sizes and a lack of statistical power. In our case, the data extracted by each classifier will contribute to research investigating the association of antidepressant use with suicide ideation and suicide attempt. There are clear challenges in using and developing NLP processes to identify suicidality within clinical notes, from recording practices and defining suicidality concepts to building classification algorithms, evaluation and refinement. The use of NLP to identify suicidality is still in its infancy, so sharing our methods can contribute to advances in the field and further progress suicide prevention research.