Participants

Participants were 34 help-seeking youths aged 14 to 27 years who were fluent in English (three were immigrants who learned English as children). They were referred from schools and clinicians, or self-referred through the Center of Prevention and Evaluation website. Exclusion criteria included history of threshold psychosis or Axis I psychotic disorder, risk of harm to self or others incommensurate with outpatient care, any major medical or neurological disorder, and Intelligence Quotient<70 (assessed with the Wechsler Abbreviated Scale of Intelligence). The attenuated psychotic symptoms characteristic of the CHR participants could not have occurred solely in the context of substance use or withdrawal. Adults provided written informed consent; participants under 18 provided written assent, with consent provided by a parent. All experiments were performed in accordance with the relevant guidelines and regulations, and all procedures were approved by the Institutional Review Board at the New York State Psychiatric Institute at Columbia University. Five participants transitioned to psychosis within 2.5 years of follow-up (CHR+), whereas 29 did not (CHR−). Demographics for CHR individuals, stratified by psychosis outcome, are presented in Table 1.

Table 1 Demographics

Procedures

Ascertainment and prospective characterization

The Structured Interview for Prodromal Syndromes/Scale of Prodromal Symptoms (SIPS/SOPS)13 was used for ascertainment of CHR status, for baseline and quarterly symptom ratings,10 and to determine psychosis outcome. The SIPS/SOPS evaluates positive (subthreshold psychotic), negative, disorganized, and general symptoms.

Participants had to meet baseline criteria for one of three prodromal syndromes, assessed with the SIPS/SOPS: (i) attenuated positive symptom syndrome (⩾1 SOPS-positive item in the prodromal range with symptoms beginning or worsening in the past year, and symptoms occurring ⩾once/week in the prior month); (ii) genetic risk and deterioration syndrome (psychosis in a first-degree relative or schizotypal disorder accompanied by a 30% drop in global assessment of function over the past year); or (iii) brief intermittent psychotic symptom syndrome (⩾1 SOPS-positive items in the psychotic range with symptoms beginning in the past 3 months, and symptoms occurring ⩾several minutes/day). All CHR participants in this study met criteria for the attenuated positive symptom syndrome. Trained master-level research assistants administered the SIPS/SOPS, with clinical ratings achieved by expert consensus (with CC).

Participants were prospectively characterized for symptoms every 3 months for up to 2.5 years, with transition to psychosis determined using the SIPS/SOPS ‘presence of psychosis’ criteria.

Baseline interviews

Open-ended, narrative interviews of ~1 h were obtained from participants by interviewers trained by an expert in qualitative interviewing and phenomenological research.15 Participants were encouraged to describe changes they had experienced and the impact of these changes, what had been helpful or unhelpful for them, and their expectations for the future. Interviews took place between 2007 and early 2012, and were transcribed by an independent company. The first 27 transcripts were previously subject to thematic analysis using phenomenological procedures, finding gender differences in themes; this earlier qualitative analysis did not assess the predictive value of the interviews for psychosis outcome.16

Speech preprocessing

Interview transcripts were preprocessed as previously described6 using the Natural Language Toolkit (NLTK; http://www.nltk.org/).5 After discarding punctuation, each interview was automatically parsed into phrases. Words were then converted to the roots from which they are inflected, or lemmatized, using the NLTK WordNet lemmatizer. The resultant preprocessed data consisted of a list of lemmatized words, parsed into phrases, maintaining the original order, without punctuation and in lower case.
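The preprocessing steps above can be sketched as follows. This is a minimal illustration only, not the authors' pipeline: the function name `preprocess`, the regular expressions used to find phrase boundaries, and the toy lookup table `TOY_LEMMAS` are all assumptions introduced here. In the study itself lemmatization was done with the NLTK WordNet lemmatizer, which could be passed in through the `lemmatize` argument.

```python
import re
from typing import Callable, List


def preprocess(text: str,
               lemmatize: Callable[[str], str] = lambda w: w) -> List[List[str]]:
    """Return a list of phrases, each a list of lower-case lemmatized words.

    Assumption: phrase boundaries are approximated by sentence-final
    punctuation; the original parsing rules are not specified here.
    """
    phrases = []
    for chunk in re.split(r"[.!?]+", text):
        # Lower-case and keep only word tokens, which discards punctuation.
        words = re.findall(r"[a-z]+(?:'[a-z]+)?", chunk.lower())
        if words:
            phrases.append([lemmatize(w) for w in words])
    return phrases


# Hypothetical stand-in for a real lemmatizer, for illustration only.
TOY_LEMMAS = {"cats": "cat", "tables": "table"}

phrases = preprocess("The cats sat. Tables, chairs!",
                     lambda w: TOY_LEMMAS.get(w, w))
print(phrases)
```

Word order within and across phrases is preserved, which matters because the coherence measures depend on the sequence of phrases.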

Speech analyses

We employed a novel combination of semantic coherence and syntactic assays as predictors of psychosis transition. For the semantic analyses, we used Latent Semantic Analysis (LSA),17 a well-validated approach to automated text analysis previously used to analyze speech in schizophrenia.3 LSA is a high-dimensional associative model that rests on the premise that word meaning is a function of the relationship of each word to every other word in the lexicon. If semantically similar words co-occur in texts with consistent topics more frequently than do unrelated words, then the semantic similarity of two words can be quantitatively indexed by the frequency of their co-occurrence in a sufficiently large corpus of texts.17 LSA thus captures the meaning of words through linear representations in a high-dimensional (300–400 dimensions) semantic space based on word co-occurrence frequencies. Each word in the lexicon is assigned a vector representing its semantic content; the orientation of these vectors can then be used to compare semantic similarity between words.17

Here, LSA was trained on the Touchstone Applied Science Associates (TASA) Corpus, a collection of educational materials compiled by TASA. The semantic coherence measure we developed is similar to that used by Elvevåg et al.,3 which discriminated between established schizophrenia patients and controls. The present measure differs from the earlier approach in that it explicitly incorporates syntactic information: semantic trajectories are represented by similarity among pairs of consecutive phrases, or pairs of phrases separated by an intervening phrase (see Figure 1). Given the speech transcription $D$, the document is split into $n$ phrases $S_i$ and converted into a vectorial representation by replacing each word in the phrase by its corresponding LSA vector, $S_i \to \{ l_{i1}, \ldots, l_{iN} \}$. The phrase vectors are then summarized by taking the mean of their components:

$$L_i = \frac{1}{N} \sum_{k=1}^{N} l_{ik}$$

i.e., the mean of all LSA vectors of every word in the phrase.

Figure 1 Pipeline for automated extraction of the semantic coherence features. Texts were initially split into sentences/phrases. Each word was represented as a vector in high-dimensional semantic space using Latent Semantic Analysis (LSA). Summary vectors were calculated as the mean of each vector in each phrase. Coherence was determined based on the semantic similarity between adjacent phrases, calculated as the cosine of their respective vectors. The semantic coherence feature that best discriminated those who transitioned to psychosis from those who did not was the minimum semantic coherence value (i.e., the coherence at the point of maximal discontinuity) within each transcribed text.

We defined first-order coherence by taking the similarity of consecutive phrase vectors, averaged over all the phrases in the text (represented by $\langle \cdot \rangle$ below):

$$\mathrm{FOC} = \langle \cos(L_i, L_{i+1}) \rangle$$

and second-order coherence by taking the similarity between phrases separated by another intervening phrase, averaged over all the phrases in the text:

$$\mathrm{SOC} = \langle \cos(L_i, L_{i+2}) \rangle$$

With these two features, we were able to characterize semantic coherence by measuring components of the distributions of first- and second-order coherence over the speech samples, including features such as the minimum, mean, median, and s.d.

Thus, we indexed speech coherence by: (i) automated separation of interviews into phrases; (ii) assigning phrases semantic vectors as the mean of the LSA semantic vectors for each word within the phrase; and (iii) assessing semantic similarity (i.e., the cosine) between the phrase vectors of consecutive phrases, or phrases separated by another intervening phrase.
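The three steps summarized above can be illustrated with a short sketch. It is not the authors' code: the helper names (`mean_vector`, `cosine`, `coherence_series`) are introduced here, and the two-dimensional vectors are made-up stand-ins for real LSA word vectors, which would have several hundred dimensions. The sketch also assumes that no phrase vector is all zeros, since the cosine is undefined in that case.

```python
import math
from typing import List, Sequence


def mean_vector(vectors: List[Sequence[float]]) -> List[float]:
    """Phrase vector L_i: component-wise mean of the word vectors in a phrase."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def coherence_series(phrase_vectors: List[List[float]], lag: int) -> List[float]:
    """cos(L_i, L_{i+lag}) for every valid i; lag=1 is first order, lag=2 second."""
    return [cosine(phrase_vectors[i], phrase_vectors[i + lag])
            for i in range(len(phrase_vectors) - lag)]


# Toy word vectors for three phrases (hypothetical values).
phrases = [
    [[1.0, 0.0], [1.0, 0.0]],  # L_1 = [1, 0]
    [[1.0, 1.0]],              # L_2 = [1, 1]
    [[0.0, 1.0], [0.0, 3.0]],  # L_3 = [0, 2]
]
L = [mean_vector(p) for p in phrases]

first_order = coherence_series(L, 1)
second_order = coherence_series(L, 2)

foc = sum(first_order) / len(first_order)    # mean first-order coherence
soc = sum(second_order) / len(second_order)  # mean second-order coherence
min_foc = min(first_order)                   # point of maximal discontinuity
```

Keeping the full series, rather than only its mean, makes it straightforward to compute the other summary statistics mentioned above (minimum, median, s.d.).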

To complement the semantic analysis, we defined another measure for processing the documents, on the basis of Part Of Speech tagging (POS-Tag). This consists of labeling every word by its grammatical function. For example, the sentence ‘The cat is under the table’ is tagged by the POS-Tag procedure as (('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('under', 'IN'), ('the', 'DT'), ('table', 'NN')) where DT is the tag for determiners, NN for singular nouns, VBZ for third-person singular present-tense verbs, and IN for prepositions. For every transcript, we calculated the POS-Tag information (with NLTK5) and used the frequencies of each tag as an additional attribute of the text. Automated taggers are trained on a hand-tagged corpus and apply a variety of heuristics. The NLTK tagger uses the Penn Treebank tag set.
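As a rough illustration of how tag frequencies could be derived from tagged output, the sketch below counts tags in the example sentence. The function name `tag_frequencies` is introduced here, and normalizing by the total number of tagged words is an assumption; the tagged pairs are typed in by hand rather than produced by a tagger.

```python
from collections import Counter
from typing import Dict, List, Tuple


def tag_frequencies(tagged: List[Tuple[str, str]]) -> Dict[str, float]:
    """Relative frequency of each POS tag among the tagged words."""
    counts = Counter(tag for _word, tag in tagged)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}


# Hand-entered tags for the example sentence in the text.
tagged = [("The", "DT"), ("cat", "NN"), ("is", "VBZ"),
          ("under", "IN"), ("the", "DT"), ("table", "NN")]

freqs = tag_frequencies(tagged)
print(freqs)
```

Here DT and NN each account for one third of the six words, and VBZ and IN for one sixth each.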

Code availability

Code for speech preprocessing (WordNet lemmatizer) and POS-Tag (Penn Treebank) is available open access through the NLTK (http://www.nltk.org/).5

Classification

A cross-validated classifier is a Machine Learning algorithm with two stages: in the first stage, it learns the underlying patterns of the data using a subset of samples. The learned model is used in the second stage to predict the labels of samples not used during the learning stage (Figure 2).

Figure 2 Pipeline for cross-validation of the Machine Learning classifier. A vector of features for each participant is extracted and fed into the classifier that was trained on the other participants’ data. The classifier is used to predict outcome for the left-out, or test, participant. Each participant is sequentially left out of the training data set to serve as the test subject once, resulting in accuracy of prediction data for all participants.

We used features derived from the semantic coherence analyses and the POS-Tag extraction, providing a vector of features for each participant's text. With this information, we trained the classifier to learn the features that discriminated participants who did not subsequently develop psychosis (CHR−) from those who did (CHR+).

The convex hull of a set of points is the minimal convex polyhedron that contains them. A convex hull classifier was implemented as follows: during training, we sequentially excluded one CHR+ or CHR− participant to be used for testing (leave-one-subject-out cross-validation). Using the training labels, we computed the convex hull of the CHR− set, and then tested whether the left-out sample was inside the hull (predicting CHR−) or outside (predicting CHR+). Each individual was sequentially excluded from the training set used to compute the convex hull to serve as the test subject, providing accuracy of prediction data for all participants.
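The leave-one-subject-out procedure described above can be sketched in a simplified setting. The example below is an assumption-laden illustration, not the study's implementation: it works in two dimensions only (the study used more features), it uses Andrew's monotone chain algorithm to build the hull, it treats points on the hull boundary as inside, and all function names and data points are made up.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    upper: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(hull: List[Point], p: Point) -> bool:
    """True if p lies inside or on a counter-clockwise convex polygon."""
    n = len(hull)
    if n < 3:  # degenerate hull: treat everything as outside
        return False
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def leave_one_out(samples: List[Tuple[Point, str]]) -> List[str]:
    """Predict each sample from the hull of the remaining CHR- samples."""
    predictions = []
    for i, (point, _label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        negatives = [pt for pt, lab in train if lab == "CHR-"]
        hull = convex_hull(negatives)
        predictions.append("CHR-" if inside_hull(hull, point) else "CHR+")
    return predictions


# Hypothetical two-feature data: seven CHR- points and two CHR+ points.
samples = [
    ((0, 0), "CHR-"), ((4, 0), "CHR-"), ((4, 4), "CHR-"), ((0, 4), "CHR-"),
    ((2, 2), "CHR-"), ((1, 2), "CHR-"), ((3, 1), "CHR-"),
    ((6, 6), "CHR+"), ((5, -1), "CHR+"),
]
predictions = leave_one_out(samples)
print(predictions)
```

One property of this scheme is visible in the toy data: a CHR− point that is a vertex of the full hull falls outside the hull of the remaining points when it is left out, so the four corner points are predicted CHR+. In higher dimensions the inside test would need a different method, such as a linear program or a Delaunay-based point location.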

The semantic coherence feature that best contributed to classification of subsequent psychosis onset was the minimum coherence between two consecutive phrases (i.e., the maximum discontinuity) that occurred in the interview. The syntactic measure included in classification was the frequency of use of determiners (‘that’, ‘what’, ‘whatever’, ‘which’, and ‘whichever’), normalized by the phrase length. Because speech in emergent psychosis often shows marked reductions in verbosity (referred to clinically as poverty of speech), we also included the maximum number of words per phrase in the classification.

Validation

To further probe findings from the CHR analyses, we also conducted the following validation analyses:

Does the coherence measure index ‘disorder’ in a text?

Because the concept of semantic coherence we employed does not have a mathematical definition, in this validation we tested the coherence measure against a corpus of classic literature and assessed how the measure changed when we modified the original texts in a way that is relevant to the concept of semantic coherence.

On the basis of the hypothesis that a text that makes sense will produce a high coherence score, we applied different levels of ‘disorder’ to a range of texts to determine whether the method could detect these modifications. We defined each level of ‘disorder’ as the percent of the text that was moved from its original location. For example, a disorder level of 40% indicates that 4 of 10 sentences were moved and thus were no longer in their original position in the text. For each of 10 disorder levels (10–100%), we created 1,000 samples, randomly shuffling the order of the appropriate proportion of sentences. We performed coherence analysis on randomly selected chapters of the following six classic books: On the Origin of Species by Charles Darwin, A Study in Scarlet by Arthur Conan Doyle, Moby Dick; Or, The Whale by Herman Melville, Pride and Prejudice by Jane Austen, The Adventures of Tom Sawyer by Mark Twain, and The Count of Monte Cristo by Alexandre Dumas.
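One way to produce a text at a given disorder level is sketched below. This is an assumption about the procedure, not the authors' code: the function name `shuffle_fraction` is introduced here, and the sketch guarantees that every selected sentence leaves its original position by rotating the selected sentences along a random cycle. Other shuffling schemes would also satisfy the definition given above.

```python
import random
from typing import List, Optional


def shuffle_fraction(sentences: List[str], disorder: float,
                     rng: Optional[random.Random] = None) -> List[str]:
    """Move round(disorder * n) sentences away from their original positions."""
    rng = rng or random.Random()
    n = len(sentences)
    k = round(disorder * n)
    if k < 2:
        # A single sentence cannot change place on its own.
        return list(sentences)
    # Distinct positions in random order, read as one cycle.
    chosen = rng.sample(range(n), k)
    result = list(sentences)
    for src, dst in zip(chosen, chosen[1:] + chosen[:1]):
        # Each chosen position receives the sentence from a different one.
        result[dst] = sentences[src]
    return result


sentences = [f"s{i}" for i in range(10)]
disordered = shuffle_fraction(sentences, 0.4, random.Random(0))
moved = sum(a != b for a, b in zip(sentences, disordered))
```

Calling the function repeatedly with different random generators would give the multiple samples per disorder level described above, each of which can then be scored with the coherence measure.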

Are the speech features associated with symptoms assessed with standard diagnostic instruments?

To assess the extent to which the text features that best predicted clinical status at follow-up in CHR patients (minimum first-order coherence, density of determiners, and maximum phrase length) carry information with respect to standard clinical prodromal ratings, we computed the canonical correlation between these three text features and two symptom measures on the SIPS/SOPS (total positive symptoms and total negative symptoms). Given two sets of features measured on the same samples, X and Y, canonical correlation analysis estimates the linear combination of the X features and the linear combination of the Y features that are maximally correlated with each other.
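In symbols, the first canonical correlation can be written as below. The notation is introduced here for illustration and does not appear in the original text: $a$ and $b$ are the weight vectors for the three text features and the two symptom scores respectively, and $\Sigma_{XX}$, $\Sigma_{YY}$ and $\Sigma_{XY}$ are the sample covariance and cross-covariance matrices of X and Y.

```latex
\rho = \max_{a,\, b} \operatorname{corr}(Xa,\, Yb)
     = \max_{a,\, b}
       \frac{a^{\top} \Sigma_{XY}\, b}
            {\sqrt{a^{\top} \Sigma_{XX}\, a}\; \sqrt{b^{\top} \Sigma_{YY}\, b}}
```

With three text features and two symptom measures, $a$ has three components and $b$ has two, so at most two canonical correlations exist.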

Ethics statement

The Institutional Review Board at the New York State Psychiatric Institute at Columbia approved these experiments, and informed consent was obtained for all subjects (parental consent with assent for minors).