Setting

The data used to develop the NLP algorithm for extracting OCS were obtained from the South London and Maudsley NHS Foundation Trust (SLaM), which is a near-monopoly secondary mental healthcare service provider to 1.36 million residents in four boroughs of south London (Croydon, Lambeth, Lewisham and Southwark), as well as providing some national tertiary mental healthcare services. The SLaM Biomedical Research Centre (BRC), supported by National Institute for Health Research funding, provides anonymised electronic clinical records from the SLaM Case Register for research purposes through the BRC Clinical Record Interactive Search (CRIS) system. The CRIS system was developed in 2008 and accesses full EHRs across all SLaM services since 2007, including both structured and open-text fields, currently on more than 300,000 service users. A detailed description of CRIS and its development is provided elsewhere7.

Ethics statement

The CRIS data resource received appropriate research ethics approval as a de-identified database for secondary analyses from Oxford Research Ethics Committee C (reference 08/H0606/71 + 5) and the authors can confirm that the study presented here was performed in accordance with the guidelines and regulations set out in this approval.

Inclusion criteria

To develop and test the OCS NLP algorithm, data extracts were obtained for individuals who were aged 15 years or older at the time of their first severe mental illness (SMI) diagnosis date within the observation period (from 1 January 2007 to 31 December 2015) and who had received a diagnosis (ICD-10 code) of schizophrenia (F20), schizoaffective disorder (F25), or bipolar disorder (F31) during the observation period. Diagnoses were obtained from structured fields and also from unstructured free text using a previously validated NLP algorithm, described elsewhere13.

Definition of OCS

As mentioned above, in this study OCS were defined according to the Structured Clinical Interview for DSM Disorders–Patient (SCID-P)14 as ‘persistent, repetitive, intrusive, and distressful thoughts (obsessions) not related to the patient’s delusions, or repetitive, goal-directed rituals (compulsions) clinically distinguishable from schizophrenic mannerisms or posturing’. As such, individuals whose obsessional thoughts or compulsions were related to psychotic content of thoughts or delusions were not considered to have comorbid OCS9.

Extracting data for training and validation of the algorithm

Data were extracted from EHRs for training and development of the algorithm from those individuals who met the inclusion criteria. To avoid reading and coding a substantial volume of unrelated documents, we applied a filter such that we only extracted documents containing specific key terms. Although, once developed, an NLP algorithm can be substantially more sophisticated than a key word search, the development process may include keyword searches. In this instance, a set of key terms was selected which was potentially broad enough to cover all the records that mentioned OCS. The Yale Brown Obsessive Compulsive Scale (Y-BOCS)15 was used as a guide to select these key words. The following key terms (as shown in Table 1) were used to filter the EHRs:

OCD (and variations such as O.C.D)

Obsess (and variations such as “obsessional” and “obsessive”)

Compulsive (and variations such as “compulsion”, “compulsiveness” and “compelled”)

Ritual (and variations such as “ritualistic”)

Hoard (and variations such as “hoarding” or “hoarded”)

The presence of any of the Y-BOCS key terms.

The presence of any of the Patient Insight key terms.
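As a rough sketch, the key-term filter described above could be prototyped as follows. The patterns and names here are illustrative stand-ins for the full Table 1 term list, not the application's actual implementation:

```python
import re

# Illustrative key-term patterns based on the key words listed above;
# the real filter used the full Table 1 term list.
KEY_TERM_PATTERNS = [
    r"\bO\.?C\.?D\b",    # OCD and variations such as O.C.D
    r"\bobsess\w*",      # obsess, obsessional, obsessive
    r"\bcompuls\w*",     # compulsive, compulsion, compulsiveness
    r"\bcompelled\b",
    r"\britual\w*",      # ritual, ritualistic
    r"\bhoard\w*",       # hoard, hoarding, hoarded
]
KEY_TERM_RE = re.compile("|".join(KEY_TERM_PATTERNS), re.IGNORECASE)

def contains_key_term(document_text: str) -> bool:
    """Return True if the document mentions at least one key term."""
    return KEY_TERM_RE.search(document_text) is not None
```

A document passes the filter as soon as any one pattern matches, mirroring the "at least one key term" criterion used for extraction.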

Applying this filter, a random sample of 900 documents containing at least one of the terms shown in Table 1 (including patient notes and correspondence), with one document per unique patient, was extracted from the anonymised EHRs. This sample was then divided into a training set (600 documents) and a validation set (300 documents). These documents contained multiple instances of references to OCS, with each document containing at least one instance and no upper limit. Text strings around each key word (described above) were extracted from these documents. Each text string included the sentence containing the key word plus two sentences either side of the key word sentence. This was to ensure that any contextual information contained in the surrounding sentences could be incorporated into the NLP algorithm. In some instances the text strings comprised fewer than five sentences, due to there being fewer than two sentences before and/or after the keyword sentence in that particular document.
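The windowing step could be sketched in Python along these lines; the naive sentence splitter and the function name are our simplifying assumptions, not the study's code:

```python
import re

# Sketch of text-string extraction: for each key-word hit, keep the
# containing sentence plus up to two sentences on either side.
# Splitting on ., ! and ? is a simplification of real clinical text.
def extract_context_windows(document: str, keyword_re, span: int = 2):
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    windows = []
    for i, sentence in enumerate(sentences):
        if keyword_re.search(sentence):
            start = max(0, i - span)                 # fewer than two sentences may precede
            end = min(len(sentences), i + span + 1)  # ...or follow the key-word sentence
            windows.append(" ".join(sentences[start:end]))
    return windows
```

When the key-word sentence falls near the start or end of a document, the window simply shrinks, matching the "fewer than five sentences" cases noted above.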

Table 1 Key modifier words used in the natural language processing application for OCS.

Developing manual coding rules

The training and validation sets of documents were then manually coded according to a predetermined set of manual coding rules which were developed using the Y-BOCS as a guide. The approach taken when developing the manual coding rules for identifying OCS in text strings is outlined in Appendices A and B. For example, if the text mentioned that the patient had both obsessions and compulsions, then the patient was classified as having OCS. However, if the text mentioned only obsessions or only compulsions (but not both), this was considered OCS only if the text also listed specific examples found in the Y-BOCS, such as checking or cleaning, or described intrusive, ego-dystonic thoughts. There were a number of reasons for this conservative approach: firstly, clinical text may be produced by a range of different health professionals or may describe a patient’s belief about themselves; secondly, the terms obsession and compulsion are used in a wide range of contexts beyond OCS.
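A minimal sketch of this conservative coding rule, assuming a hypothetical Y-BOCS example list (the real manual rules in Appendices A and B are richer than this):

```python
# Simplified, hypothetical rendering of the conservative coding rule
# described above; the example set is an illustrative subset only.
YBOCS_EXAMPLES = {"checking", "cleaning", "washing", "counting"}

def code_as_ocs(mentions_obsession: bool, mentions_compulsion: bool,
                ybocs_examples_in_text: set, intrusive_ego_dystonic: bool) -> bool:
    if mentions_obsession and mentions_compulsion:
        return True
    if mentions_obsession or mentions_compulsion:
        # a single term counts only with corroborating Y-BOCS-style evidence
        return bool(ybocs_examples_in_text & YBOCS_EXAMPLES) or intrusive_ego_dystonic
    return False
```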

Annotating training and validation data sets

The training dataset consisted of 600 documents (containing at least one of the key words) which were manually annotated by two annotators (DA, RH), each individually annotating every record. After the annotations were completed, the results were compared and individual points of disagreement were identified. To resolve these, a discussion took place between the two annotators under the supervision of an arbitrator (DC), to ensure that the process did not give either annotator an undue level of input. Inter-annotator reliability between the two annotators produced an observed agreement of 92.0% (Cohen’s κ of 0.80), indicating good inter-annotator agreement in determining the OCS coding rules.
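For reference, observed agreement and Cohen's κ can be computed from a 2×2 table of annotator decisions as below. The counts are invented for illustration (chosen so the sketch lands near the reported 92% / κ = 0.80); they are not the study's actual annotation counts:

```python
# Cohen's kappa for two annotators from a 2x2 agreement table.
def cohens_kappa(both_pos: int, both_neg: int, a_only: int, b_only: int):
    n = both_pos + both_neg + a_only + b_only
    observed = (both_pos + both_neg) / n
    # chance agreement from each annotator's marginal positive/negative rates
    a_pos = (both_pos + a_only) / n
    b_pos = (both_pos + b_only) / n
    expected = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Made-up counts: 150 agreed positives, 402 agreed negatives, 24 + 24 disagreements
observed, kappa = cohens_kappa(150, 402, 24, 24)
```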

Development of an NLP algorithm for extracting OCS

The training data were used to create the classification rules needed to build the algorithm. The algorithm was developed using the General Architecture for Text Engineering (GATE), which includes a suite of tools for developing NLP rules based on JAPE (Java Annotation Patterns Engine), a Java-based NLP scripting language native to GATE that allows users to generate rules of very high complexity. The manual coding rules which had been applied to annotate the training set were combined with observations of the annotated training set data to create a set of broad rules in JAPE, which were then integrated into the application. This involved developing sets of inclusion rules and exclusion rules. Inclusion rules determined the patterns of text required for an instance to be classed as positive, in the absence of exclusion rules. Exclusion rules used sets of exclusion terms which would lead to an instance being classed as negative (terms such as negations, or experiencers other than the patient). The algorithm involved the following steps (Table 2 lists the terms used for exclusion in steps 4–8):

1. Split the text at the sentence level.

2. Find the presence of a possible OCS reference (in the context of the particular app) within the text.

3. Check for a combination of terms that would indicate an instance of OCS in the context of the particular app (e.g. the Hoarding app identifies an instance of hoarding as an OCS symptom).

4. Exclude all instances where the text was characteristic of prompt questions within a clinical questionnaire. Specifically, the algorithm identified all combinations of words and punctuation that were unique to forms (determined through analysis of the training data); any instance containing one of these combinations was excluded. In the extracted data there were very few such cases.

5. Exclude any instance whose sentence contained a negating term. Each of the five apps had a specific set of negating terms: a list of negating words and phrases was determined through examination of the training data, and any instance containing one of them was excluded.

6. Exclude any instance whose sentence contained terms referring to experiencers other than the patient (such as family members or friends). This was done by determining a list of terms that could refer to an individual other than the patient (including family members, friends, and romantic partners) and excluding any instance containing one of these terms.

7. Exclude any instance containing references to uncertainty about the diagnosis (as the aim of the application was to identify definite instances). This was done by creating a list of hedge words and excluding any instance containing one of them.

8. Exclude instances of self-diagnosis (where the text indicates that the patient diagnosed themselves with OCD). This was done by examining the training data, finding terms used in cases where self-diagnosis occurred, and excluding those instances.
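The cascade of steps above could be prototyped outside GATE roughly as follows. The term lists are illustrative stand-ins for the Table 2 lists and the regexes are our assumptions; the actual application implements these rules in JAPE:

```python
import re

# Illustrative stand-in term lists; the actual lists appear in Table 2.
NEGATIONS    = [r"\bno\b", r"\bdenies\b", r"\bwithout\b", r"\bnot\b"]
EXPERIENCERS = [r"\bmother\b", r"\bfather\b", r"\bbrother\b", r"\bsister\b",
                r"\bfriend\b", r"\bpartner\b"]
HEDGES       = [r"\bpossible\b", r"\bquery\b", r"\bmay have\b", r"\?\s*OCD\b"]
SELF_DIAG    = [r"\bself[- ]diagnosed\b", r"\bdiagnosed (him|her|them)sel\w+\b"]
FORM_PROMPT  = [r"\bY/N\b", r":\s*\[\s*\]"]  # punctuation patterns typical of forms

OCS_TERM = re.compile(r"\b(obsess\w*|compuls\w*|ritual\w*|hoard\w*|O\.?C\.?D)\b",
                      re.IGNORECASE)

def _any(patterns, sentence):
    return any(re.search(p, sentence, re.IGNORECASE) for p in patterns)

def classify_instances(text):
    """Return per-sentence positive OCS instances after applying exclusions."""
    positives = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):  # step 1
        if not OCS_TERM.search(sentence):                      # steps 2-3
            continue
        if _any(FORM_PROMPT, sentence):                        # step 4
            continue
        if _any(NEGATIONS, sentence):                          # step 5
            continue
        if _any(EXPERIENCERS, sentence):                       # step 6
            continue
        if _any(HEDGES, sentence):                             # step 7
            continue
        if _any(SELF_DIAG, sentence):                          # step 8
            continue
        positives.append(sentence)
    return positives
```

Ordering the checks as a cascade mirrors the inclusion-then-exclusion structure of the rules: a sentence must first match an inclusion pattern and then survive every exclusion list.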

Table 2 List of terms used to identify candidates for exclusion.

We included lexical variants in the extraction rules, i.e. acronyms (e.g. OCD) and misspellings (e.g. obses* instead of obsessive). We also took into account semantic variants of the terms obsessive and compulsive in the extraction, because these may have alternative meanings beyond their definitions in the context of OCD/OCS. This was done by distinguishing between the different examples of obsessions and compulsions provided in the text. Through application of these rules, records were classified.
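A sketch of variant-tolerant matching, with illustrative patterns for the dotted acronym and a common single-"s" misspelling; these are assumptions for demonstration, not the application's actual extraction rules:

```python
import re

# Variant-tolerant pattern: dotted/spaced acronym forms of OCD, a prefix that
# tolerates the single-"s" misspelling (obses*), and compulsion variants.
VARIANT_RE = re.compile(
    r"""\b(
        O\.?\s?C\.?\s?D          # OCD, O.C.D, O. C. D
      | obses{1,2}\w*            # obsessive, obsessional, obsesive (misspelt)
      | compuls\w* | compelled
    )\b""",
    re.IGNORECASE | re.VERBOSE,
)
```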

Validation of an NLP App for extracting OCS

The validation dataset was used as a final test of how well the algorithm performed compared to manually coded data, providing an indication of how well the algorithm would perform across the remaining 300,000-plus patient records on the CRIS system. To ensure that there was no information bias in the development of the application, the validation data remained unseen by the NLP developer (DC) throughout the development process, until it was used to test the final version of the OCS algorithm. The accuracy of the OCS algorithm was evaluated using measurements of precision (i.e. positive predictive value) and recall (i.e. sensitivity) at the instance level. Precision was measured as the proportion of positive OCS instances identified by the NLP application that were correct according to the manual annotations of the same documents; recall was measured as the proportion of OCS instances in the documents (based on manual annotations) that were correctly identified by the NLP application. The development of the NLP application aimed to maximize precision in order to reduce the likelihood of false positive results. This NLP-based OCS application was then applied across the entire SLaM Case Register. Finally, for research purposes, the OCS algorithm was combined with data from a pre-existing diagnosis algorithm and information on diagnosis from structured fields, and the overall precision and recall produced by combining these approaches were calculated. The 95% confidence intervals were calculated using the exact binomial method16 in the Stata software package.
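As an illustration of the reported metrics, precision, recall, and an exact (Clopper-Pearson) binomial confidence interval can be computed as below. The counts are invented for the example, and the study itself used Stata; this stdlib-only version is a sketch of the same method:

```python
from math import comb

def _binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def _bisect(f, target, increasing, iters=60):
    """Solve f(p) = target for monotone f on [0, 1] by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_binomial_ci(successes, n, alpha=0.05):
    """Exact (Clopper-Pearson) two-sided CI for a binomial proportion."""
    lower = 0.0 if successes == 0 else _bisect(
        lambda p: 1 - _binom_cdf(successes - 1, n, p), alpha / 2, increasing=True)
    upper = 1.0 if successes == n else _bisect(
        lambda p: _binom_cdf(successes, n, p), alpha / 2, increasing=False)
    return lower, upper

def precision_recall(tp, fp, fn):
    """Instance-level precision (PPV) and recall (sensitivity)."""
    return tp / (tp + fp), tp / (tp + fn)
```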