Our findings indicate that during the prodromal phase of psychosis, the emergence of psychosis was predicted by speech with low levels of semantic density and an increased tendency to talk about voices and sounds. When combined, these two indicators of psychosis enabled the prediction of future psychosis with a high level of accuracy.

Speech samples were drawn from 40 participants of the North American Prodrome Longitudinal Study (NAPLS) at Emory University (see Methods). Participants were followed up for 2 years or to the time of conversion. For training the model, we included 30 participants from the second phase of the NAPLS (NAPLS-2). Seven of these individuals converted to psychosis during follow-up (Converters) and 23 did not (Non-converters). For validating the model, we included 10 participants, five Converters and five Non-converters from the third phase of the NAPLS (NAPLS-3). Transcriptions of the recorded Structured Interview for Prodromal Syndromes (SIPS) were used for language analysis. The demographics and clinical information of the participants are shown in Table 1.

Table 1 Demographic and clinical information of the participants

To perform the vector unpacking method, language samples underwent several pre-processing steps, including lemmatizing the words and tagging them for their part of speech (see Methods). To derive sentence meanings, the content words (i.e., nouns, verbs, adjectives, and adverbs) were re-expressed as word embeddings (see Methods). Word embeddings map the words of a language into a vector space of reduced dimensionality. The word embeddings used in this research were generated using the skip-gram version of Word2Vec.25,26 The goal of Word2Vec is to give words that occur in similar contexts similar embeddings. The algorithm can be viewed as instantiating a simple two-layer neural network. In this network, the input layer uses a one-hot encoding to indicate individual target words. During the feedforward phase, activation travels from the input layer to a hidden layer, and from the hidden units into a softmax function. The softmax function produces a probability distribution, and the system is tuned, using backpropagation, to maximize the probabilities of the words being trained against. These words code for a target word's context and are specified by a window of words around the target word. In the present research, training was based on 25 years of text from the New York Times (NYT), comprising 42,833,581 sentences. The processing pipeline used to generate word embeddings is shown in Fig. 1.
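To make the skip-gram setup concrete, the following toy sketch (not the authors' code; the function name and tokenization are illustrative assumptions) shows how (target, context) training pairs are generated from a window around each word; it is these context words that the network is tuned to predict.

```python
# Sketch of skip-gram training-pair generation: for each target word, the
# words inside a +/-window span become the "context" words the network
# learns to predict. Window size and tokenization here are illustrative.

def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs for a list of tokens."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs
```

In the real training regime, each pair feeds one forward/backward pass through the two-layer network described above.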

Fig. 1 Use of the machine learning technique (skip-gram) Word2Vec to create word embeddings by processing a large body of text through a two-layer neural network. The weights in the first layer of the network constitute the resulting vectors and specify positions in a high-dimensional space (a word embedding). A 2-dimensional projection of this space, showing the 99% most frequent words in English (N = 42,234), is shown above (blue = nouns; red = verbs; orange = adjectives; aqua = prepositions)

The meaning of each sentence was derived by summing the vectors (embeddings) associated with each word in the sentence and normalizing by the magnitude of the vectors. A formal specification of these operations is described in the Methods. To determine semantic density, the number of meaning components expressed in a sentence must be determined. This was accomplished using a vector decomposition technique called vector unpacking. As specified in the Methods, the technique uses gradient descent to discover the linear combination of weighted word vectors (meaning components) that best approximate the observed sentence vectors. When there is minimal semantic overlap among the words in a sentence, all the words in the sentence vector are usually recovered. However, when the semantics of the content words in a sentence overlap in meaning or certain words stand out as unrelated to the semantic emphasis of the sentence, the number of meaning vectors needed to create the sentence is less than the number of content words, resulting in a reduction in semantic density. The process achieved by vector unpacking is depicted in Fig. 2. The word embeddings (black vectors) in a sentence sum to produce a resultant vector for that sentence (blue vector). Vector unpacking finds meaning vectors (red vectors) that, when summed, closely approximate the original sentence vector. In this figure, the number of component vectors (N = 2) is less than the number of words (N = 4) that were used to create the resultant vector.
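The two operations described above, summing and normalizing word embeddings into a sentence vector, and fitting weights by gradient descent so that a weighted combination of word vectors approximates that sentence vector, can be sketched as follows. This is a toy illustration, not the authors' implementation: the 3-dimensional vectors, learning rate, and step count are illustrative stand-ins for the real 200-dimensional embeddings and tuned optimizer.

```python
import math

# Toy sketch of sentence-vector construction and the gradient-descent core
# of vector unpacking, with tiny 3-dimensional embeddings.

def sentence_vector(word_vecs):
    """Sum word embeddings and normalize by the magnitude of the result."""
    s = [sum(v[d] for v in word_vecs) for d in range(len(word_vecs[0]))]
    norm = math.sqrt(sum(x * x for x in s))
    return [x / norm for x in s]

def unpack_weights(word_vecs, sent_vec, lr=0.05, steps=2000):
    """Gradient descent on weights w so that sum_i w_i * v_i ~ sent_vec.

    Minimizes the squared error ||sum_i w_i v_i - s||^2; a word whose
    fitted weight stays near zero contributes no meaning component.
    """
    w = [0.0] * len(word_vecs)
    dim = len(sent_vec)
    for _ in range(steps):
        approx = [sum(w[i] * word_vecs[i][d] for i in range(len(w)))
                  for d in range(dim)]
        err = [approx[d] - sent_vec[d] for d in range(dim)]
        for i, v in enumerate(word_vecs):
            grad = 2 * sum(err[d] * v[d] for d in range(dim))
            w[i] -= lr * grad
    return w
```

Counting how many fitted weights are meaningfully above zero gives the number of meaning components for a sentence.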

Fig. 2 Processes involved in vector unpacking. The word embeddings associated with the words in a sentence (black) when summed produce a resultant sentence vector (blue). Meaning vectors are identified through the learning of weights, which result in a linear combination of vectors that approximates the resultant sentence vector as closely as possible

Let S = {s 1 , …, s n } be the set of sentences in a language sample, indexed by j, and |S| the number of sentences in that sample. The semantic density of a sentence, D j , was calculated by dividing the number of meaning component vectors, m j , by the number of content words, n j , as specified in the formula

$$D_j = \frac{{m_j}}{{n_j}}$$ (1)

The mean density of a participant’s language sample, \(\bar D\), was computed by summing the semantic densities of the individual sentences in that sample and dividing by the total number of sentences, as specified in the formula

$$\bar D = \frac{\sum\nolimits_j D_j}{\left| S \right|}$$ (2)
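Eqs. (1) and (2) translate directly into code; in this minimal sketch (helper names are our own), each sentence is represented simply by its pair of counts (m_j, n_j):

```python
# Direct translation of Eqs. (1) and (2): per-sentence density D_j = m_j / n_j,
# and the sample mean over the |S| sentences. Counts here are illustrative.

def semantic_density(meaning_components, content_words):
    """Eq. (1): number of meaning components over number of content words."""
    return meaning_components / content_words

def mean_density(sentences):
    """Eq. (2): average D_j over all sentences, given (m_j, n_j) pairs."""
    densities = [semantic_density(m, n) for m, n in sentences]
    return sum(densities) / len(densities)
```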

The steps involved in deriving this measure of semantic density are summarized in Fig. 3.

Fig. 3 Pipeline used to determine semantic density. a Sample sentences of the participants. b Original sentences are reduced to their content words (nouns, verbs, adjectives, and adverbs). c Word embeddings for each content word are added together to produce a sentence vector. d Vector unpacking is used to find the weights that can be used to scale the word vectors so that their addition approximates the sentence vector as closely as possible. e The number of meaning component vectors is divided by the number of content words for each sentence to calculate a measure of semantic density. In a semantic smear, the relative weight of the meaning components and final density is specified in the darkness of the surrounding color

Semantic density as a predictor of conversion

Given our ability to measure the semantic density of sentences, the language samples of the participants were analyzed to determine whether this aspect of language might predict conversion to psychosis. Regressing CONVERSION (0 = Non-converter; 1 = Converter) on SEMANTIC DENSITY, we found that semantic density improved the ability of a model to predict conversion to psychosis, Wald's χ2(1) = 4.401, p = 0.036. Figure 4a shows the probability of conversion to psychosis given semantic density as estimated by the logistic regression equation CONVERSION = 19.832 + (−24.022) * SEMANTIC DENSITY. Assuming a probability-of-conversion cutoff of 0.5, the plot shows that conversion to psychosis was associated with semantic densities of 0.825 or less. A model trained on the training set had an accuracy rate of 86.7% (Precision = 1; F1 score = 0.6; Sensitivity/Recall = 0.428; Specificity = 1).
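As a sketch of the arithmetic behind the cutoff, the reported coefficients imply the following conversion probability; the crossover to p > 0.5 occurs at a density of 19.832 / 24.022 ≈ 0.825, matching the cutoff stated above. The coefficients come from the text; everything else is standard logistic-regression arithmetic.

```python
import math

# Probability of conversion from the fitted logistic equation reported above:
# logit = 19.832 - 24.022 * semantic_density.

def p_conversion(density):
    logit = 19.832 + (-24.022) * density
    return 1.0 / (1.0 + math.exp(-logit))
```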

Fig. 4 Predicting conversion to psychosis based on semantic density in the original and shuffled samples. Individual points show (with a small amount of jitter) semantic densities of individual participants who either converted to psychosis (Probability = 1) or did not (Probability = 0). a Probability of conversion to psychosis given semantic density as estimated by binary logistic regression. b Probability of conversion to psychosis estimated by a model derived from the training data. c Probability of conversion to psychosis given semantic density estimated from randomly shuffling the language samples. After shuffling, conversion to psychosis was no longer predicted by semantic density

Validation of SEMANTIC DENSITY on the holdout dataset

An analysis of the validation dataset confirmed that semantic density is a strong predictor of conversion to psychosis, even for observations not included in the training. When the regression equation fitted to the training dataset was applied to the holdout dataset, conversion to psychosis was predicted with 80% accuracy (Precision = 1; F1 score = 0.75; Sensitivity/Recall = 0.60; Specificity = 1). Figure 4b shows the probability of conversion derived from the training dataset and applied to the holdout dataset. As can be seen, the 0.825 semantic density cutoff calculated from the training set resulted in only two misclassifications in the holdout dataset, both involving failures to predict conversion. Figure 4b also shows that if the logistic regression equation had been trained on the holdout dataset alone, the semantic density cutoff would have increased to ~0.88, which would have resulted in a predicted conversion accuracy of 100%.

Poverty of content, poverty of speech, part-of-speech, and demographic variables

Interestingly, in the present study of the prodromal phase of psychosis, a poverty of speech effect was not found: the number of content words used by those who converted to psychosis (M = 5.10, SD = 0.339) was not significantly lower than the number used by those who did not convert (M = 5.27, SD = 0.474), t(28) = 0.893, p = 0.380. The results suggest, then, that the best indicator of conversion during the prodromal period may not be poverty of speech but, rather, poverty of content as measured by semantic density. In this study cohort, we did not find any evidence of correlations between semantic density and IQ, r(28) = 0.22, p = 0.239, age, r(28) = 0.213, p = 0.260, or sex, r(28) = 0.020, p = 0.915. In addition, we did not find a significant correlation between semantic density and sentence length, r(28) = −0.042, p = 0.822. In our dataset, density of determiners was not a significant predictor of psychosis, but the direction of the effect was consistent with that found in previous studies,10,11 Wald's χ2(1) = 2.121, p = 0.115.

Semantic density as a property of sets of words

Semantic density is measured with respect to specific combinations of words. As such, it should depend on the way the words are grouped together into sentences and not simply on the set of words used in the sample, ignoring sentence organization. This prediction was tested by randomly shuffling the content words in the transcripts to disrupt their organization while keeping all other properties of the text the same. Sentence length and syntax were preserved by switching verbs with verbs, nouns with nouns, and so on. As indicated in Fig. 4c, after the words were randomized, conversion to psychosis was no longer predicted by semantic density, Wald's χ2(1) = 0.204, p = 0.652. This approach is similar to prior work that has used shuffling to establish a baseline level of semantic coherence.10 Our results indicate that semantic density is sensitive to the way words are grouped into sentences, and hence to the mental processes used to combine them into sentences.
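The part-of-speech-preserving shuffle described above can be sketched as follows (a hypothetical illustration, not the authors' pipeline): words are permuted only within their POS class, so each slot keeps its original syntactic category and sentence length is unchanged.

```python
import random

# Shuffle content words within their part-of-speech class, preserving
# sentence length and syntax while disrupting word grouping.
# Tokens are (word, pos) pairs; the seed makes the shuffle reproducible.

def shuffle_within_pos(tagged_words, seed=0):
    rng = random.Random(seed)
    by_pos = {}
    for word, pos in tagged_words:
        by_pos.setdefault(pos, []).append(word)
    for words in by_pos.values():
        rng.shuffle(words)
    # Reassemble: each slot draws the next word of its original POS class.
    iters = {pos: iter(words) for pos, words in by_pos.items()}
    return [(next(iters[pos]), pos) for _, pos in tagged_words]
```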

Comparison to alternative approaches to the extraction of semantic density

The technique used to measure poverty of content in this research, vector unpacking, differs from those used in previous research. One such alternative measure is idea density, a quantity that can be measured by dividing the number of verbs, adjectives, adverbs, prepositions, and conjunctions in a sentence by the total number of words.27,28 Idea density is calculated automatically by the program CPIDR (www.covingtoninnovations.com/software.html). We analyzed the training dataset in terms of idea density using CPIDR 5 and found no evidence that Converters (M = 0.561, SD = 0.026) had lower levels of idea density than Non-converters (M = 0.574, SD = 0.023), t(28) = 1.258, p = 0.219. Nor did we find evidence that idea density was related to semantic density, r(28) = 0.053, p = 0.783.

Another approach to the measurement of meaning, information value,23,24 suggests that the notion of semantic density might be represented in the vector length of a set of words, either by calculating the average vector length of a set of words, with vector length being simply the magnitude of a vector (e.g., in the case of the vector [1, 1], the vector length would be \(\sqrt 2\)), or by summing the vectors for a set of words and determining the vector length of their resultant.24 After applying vector length analysis to the training dataset, we found no evidence that Converters used words with shorter vectors (M = 6.865, SD = 0.0318) than Non-converters (M = 6.854, SD = 0.0301), t(28) = 0.839, p = 0.409. Nor did we find evidence that the vector length of the resultant vector of a sentence was shorter for Converters (M = 3.877, SD = 0.534) than Non-converters (M = 4.096, SD = 0.536), t(28) = 0.944, p = 0.353. Lastly, we observed no relationship between semantic density and either average vector length, r(28) = −0.106, p = 0.576, or sentence resultant length, r(28) = −0.089, p = 0.641. The lack of any association with semantic density should not be interpreted as implying that vector length is semantically inert. We found that vector lengths correlated negatively with word frequencies, r = −0.132, p < 0.0001, and positively with the number of content words summed to create a sentence vector, r(70) = 0.414, p < 0.001. It is entirely possible that the notions of idea density and information value capture psychologically interesting dimensions of language, but our analyses suggest that they do not capture the same information as semantic density as measured by vector unpacking and are not predictive of conversion to psychosis.
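The two vector length variants compared above can be sketched directly (illustrative helper names, not the authors' code): the magnitude of a single vector, the average magnitude over a set, and the magnitude of the summed resultant.

```python
import math

# "Information value" measures sketched above: vector length is simply the
# Euclidean magnitude; averaging lengths or taking the length of the summed
# resultant gives the two variants compared in the text.

def vector_length(v):
    return math.sqrt(sum(x * x for x in v))

def mean_vector_length(vectors):
    return sum(vector_length(v) for v in vectors) / len(vectors)

def resultant_length(vectors):
    resultant = [sum(v[d] for v in vectors) for d in range(len(vectors[0]))]
    return vector_length(resultant)
```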

Machine and human ratings of semantic density

The results of a simple validation experiment confirm that the notion of semantic density measured by machine learning resembles the subjective notion of semantic density as understood by humans. In this experiment, human participants (N = 30) rated 72 sentences produced by the participants in the training sample. The machine rating of semantic density was correlated with that of the human raters, r(70) = 0.42, p < 0.001. While only moderate in strength, the correlation between human raters and the vector unpacking algorithm was far better than that between human raters and other automated measures of semantic density. The correlation between idea density, as measured by CPIDR 5,27 and human ratings of semantic density was, in fact, in the opposite direction of what was expected, r(70) = −0.199, p = 0.093, and information value, as measured by vector length,24 showed no relation to human judgments of semantic density, r(70) = 0.061, p = 0.613. Thus, while several measures of semantic density have been proposed in the literature, only vector unpacking generates values related to those of human raters. We further note that in past research, the inter-rater reliability of human judgments of ideational richness has tended to be relatively low,14,29 suggesting that such judgments are difficult for human judges, which might account for the moderate strength of the association between human raters and vector unpacking.

Latent content as a predictor of conversion

The symptoms of full psychosis may not only involve the lack of certain features—as reflected in the absence of certain kinds of content—but also the presence of linguistic content not typically observed in the speech of healthy individuals. While negative symptoms tend to precede positive symptoms,2,19 the early signs of positive symptoms might nevertheless begin to appear in the content of language during the prodromal period.

Such content can be discovered using a set of techniques we call Latent Content Analysis (see Methods). The first step in this analysis involves re-representing the participants' sentences as vectors. This was accomplished by summing the word embeddings associated with the content words of each sentence and normalizing them to a vector length of 1. To identify latent semantic contents, we selected the 95% most commonly written words in English, as reflected in word frequencies in the New York Times corpus (N = 13,592), and used them as semantic probes. This was accomplished by re-expressing the probe words as word embeddings, calculating the cosine between each probe word and each participant's sentences, and retaining the highest cosine for each word across the sentences of each participant. Importantly, the method allows for the discovery of words that were never actually used by the participants; hence, the technique can be used to discover latent meanings. To obtain semantic themes across participants, the cosines to all 13,592 probe words were averaged across the participants in the Converter and Non-converter groups.

The next step in Latent Content Analysis weights words for their informativity. Finding informative words requires identifying the word meanings used more often than normal. This can be accomplished by determining each probe word's base-rate cosine, that is, the degree to which the word is similar to the meaning of sentences found in an average conversation. This was achieved by constructing a corpus from the conversations of 30,000 individuals on the social media platform Reddit. The corpus was roughly 401 million words in size, making it large enough to establish base-rate cosines. Average cosines were obtained by comparing the 13,592 probe words with the sentences in this Reddit corpus.
Once obtained, the average cosines to the 13,592 probe words for the Reddit corpus could be combined with those associated with the Converters and Non-converters to form two 13,592 × 2 (probe word × group) matrices, one for the Converters and the other for the Non-converters. Distinctive content words were identified using the tf-idf (term frequency-inverse document frequency) weighting algorithm,30 a method that weighs the values in a matrix to better specify their diagnostic importance. The algorithm addresses the problem of large cosines due to high frequencies by factoring in the effect of base rates: high cosine values are retained so long as they are high for one group but not the other. The 50 probe words with the largest positive cosines after tf-idf weighting were retained for further analysis.
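The probing step, comparing each probe word's embedding with every sentence vector and retaining the highest cosine per participant, can be sketched as follows (toy 2-dimensional vectors stand in for the real embeddings; function names are our own):

```python
import math

# Cosine similarity between a probe-word embedding and sentence vectors;
# the highest cosine across a participant's sentences is retained.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def max_probe_cosine(probe_vec, sentence_vecs):
    """Highest cosine between a probe word and any of a participant's sentences."""
    return max(cosine(probe_vec, s) for s in sentence_vecs)
```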

It was anticipated that the top probe words might form clusters of meaning. This possibility was investigated by re-expressing the top 50 probe words using the NYT word embeddings described earlier. The dimensionality of the word embeddings was reduced from 200 to 2 dimensions using the t-SNE learning algorithm31 to remove noise and accentuate the most important semantic dimensions. Clusters were identified by applying the k-means++ clustering algorithm, which separates elements into groups by minimizing the within-cluster sum of squares. The number of clusters was determined by running the algorithm for different values of k and choosing the k that maximized the Silhouette Coefficient.32
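A minimal, self-contained sketch of this cluster-selection procedure is given below. It is not the authors' implementation (they presumably used standard t-SNE and k-means libraries); the tiny pure-Python k-means, silhouette computation, and candidate k values are illustrative assumptions.

```python
import random

# Tiny k-means with k-means++ seeding, a mean-silhouette computation, and a
# search over candidate k. Points are 2-D tuples, as after t-SNE projection.

def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def kmeans(points, k, seed=0, iters=20):
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # k-means++ seeding: pick new centers with probability proportional
        # to squared distance from the nearest existing center.
        d2 = [min(dist(p, c) for c in centers) ** 2 for p in points]
        r, acc = rng.random() * sum(d2), 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
        else:
            centers.append(points[-1])
    labels = []
    for _ in range(iters):  # Lloyd iterations: assign, then re-center
        labels = [min(range(k), key=lambda i: dist(p, centers[i])) for p in points]
        for i in range(k):
            members = [p for p, l in zip(points, labels) if l == i]
            if members:
                centers[i] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return labels

def mean_silhouette(points, labels):
    score = 0.0
    for idx, (p, l) in enumerate(zip(points, labels)):
        own = [q for j, (q, m) in enumerate(zip(points, labels))
               if m == l and j != idx]
        others = [o for o in set(labels) if o != l]
        if not own or not others:
            continue  # singleton clusters contribute 0 by convention
        a = sum(dist(p, q) for q in own) / len(own)
        b = min(sum(dist(p, q) for q, m in zip(points, labels) if m == o)
                / labels.count(o) for o in others)
        score += (b - a) / max(a, b)
    return score / len(points)

def best_k(points, k_range=(2, 3, 4)):
    # Keep the k whose clustering maximizes the mean silhouette.
    return max(k_range, key=lambda k: mean_silhouette(points, kmeans(points, k)))
```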

Figure 5 shows the semantic clusters formed out of the probe words that distinguished the language of the Converters from that of the 30,000 Reddit users. As can be seen, the top 50 probe words fell into 14 semantic clusters. Some of the resulting clusters, such as 'yes/no', directly reflect the structured interview context from which the language samples were collected. However, several of the clusters indicate topics of potential diagnostic value. Most notably, the language of the Converters tended to emphasize the topic of auditory perception, with one cluster consisting of the probe words voice, hear, sound, loud, and chant, and the other of the words whisper, utter, and scarcely. Interestingly, many of the words included in these clusters, like the word whisper, were never explicitly used by the Converters but were implied by the overall meaning of their sentences. Such words could be found because the cosines were based on comparisons between probe words and sentence vectors, not individual words. Although the Non-converters were asked the same questions, their responses did not give rise to semantic clusters about voices and sounds.

Fig. 5 Text plot of words that distinguished the language of the Converters from the language of 30,000 Reddit users. Word positions were determined after dimensionality reduction of the word embeddings and clustering the positions using k-means++. The encircled clusters concern concepts related to voices and sounds

Given their clear connection to auditory hallucination, it is possible that the probe words referring to voices and sounds might not only distinguish Converters from Reddit users, but also Converters from Non-converters. To test this possibility, the cluster based on voice, sound, hear, chant, and loud was converted into a predictor variable. This was achieved by summing the word embeddings associated with these probe words, normalizing by the magnitude of the resultant vector, and obtaining the cosine between this cluster vector and all of the sentence vectors from the Converter and Non-converter groups. A VOICES predictor variable was constructed by selecting the largest cosine between the cluster vector and the sentence vectors of each participant.
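The construction of the VOICES predictor can be sketched as follows (toy 2-dimensional embeddings stand in for the real ones; function names are our own): sum the cluster's probe embeddings, normalize to unit length, then take the participant's highest cosine against their sentence vectors.

```python
import math

# Turn a cluster of probe-word embeddings into a single predictor score:
# the participant's maximum cosine to the normalized cluster vector.

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cluster_vector(probe_vecs):
    dim = len(probe_vecs[0])
    return normalize([sum(v[d] for v in probe_vecs) for d in range(dim)])

def voices_score(probe_vecs, sentence_vecs):
    cv = cluster_vector(probe_vecs)
    # Both vectors are unit length, so the dot product is the cosine.
    return max(sum(a * b for a, b in zip(cv, normalize(s)))
               for s in sentence_vecs)
```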

Regressing CONVERSION on VOICES indicated that talk about voices and sounds improved the ability of a model to predict conversion to psychosis, Wald's χ2(1) = 5.546, p = 0.019. Figure 6a shows the probability of conversion to psychosis given the logistic regression equation CONVERSION = −7.047 + (9.744) * VOICES. A model trained on the training set had an accuracy rate of 83.3% (Precision = 0.75; F1 score = 0.55; Sensitivity/Recall = 0.428; Specificity = 0.956). Assuming a probability-of-conversion cutoff of 0.5, the plot shows that conversion was associated with cosine similarities to VOICES greater than 0.742. Interestingly, as also shown in Fig. 6b, had CONVERSION been regressed on VOICES using the holdout data alone, prediction accuracy would have been 100%.

Fig. 6 Probability of conversion to psychosis given VOICES. a Prediction over the training set. b Probability of conversion using VOICES based on training dataset applied to data from the holdout dataset

Validation of VOICES on the holdout dataset

An analysis of the holdout data confirmed that VOICES remained a strong predictor of conversion even on unseen data. When the regression fitted to the training data was applied to the holdout data, conversion to psychosis on the basis of VOICES could be predicted with 70% accuracy (Precision = 1; F1 score = 0.571; Sensitivity/Recall = 0.40; Specificity = 1). As can be seen in Fig. 6b, the 0.742 cutoff calculated from the training set resulted in three false-negative errors in the holdout dataset.

The language samples used in these analyses were drawn from structured interviews. A potential concern is that the effect of voices and sounds may have been more prominent in the Converters than Non-converters because of the structured interview format. To test this possibility, we first analyzed the speech of the interviewers in the same way we analyzed the speech of the participants. A model in which the VOICES vector was tested against sentences generated by the interviewers was not predictive of conversion, Wald's χ2(1) = 2.247, p = 0.134, implying that the tendency to talk about voices was not directly induced by the language of the interviewers. We also examined the possibility that the Converters might have been asked more questions about voices and sounds than Non-converters because the Converters had endorsed perceptual changes. We tested this possibility by analyzing the P4 subscale of the SIPS interview, which contains six questions focusing on auditory distortion, illusion, and hallucination. When a participant endorses experiencing perceptual changes, their P4 scores are increased. We found, however, that P4 scores for Converters (M = 1.20, SD = 2.19) were effectively the same as for Non-converters (M = 1.19, SD = 1.26), t(28) = 0.013, p = 0.989. In sum, two sources of evidence argue strongly against the effect of VOICES being due to the structured interview.

Language samples indicating early stages of change in auditory perception

The references to voices and sounds in our data are consistent with prior observations in the literature. Crucially, the way prodromal participants seem to experience voices and sounds differs from that of patients with overt psychosis. In the early stages of auditory hallucination, individuals realize that there is something wrong with their perceptual experience33 and that their thoughts and perceptions are somewhat mixed.34,35 As auditory hallucinations become fully formed, patients with overt psychosis report hearing multiple distinct voices other than their own.36 The following excerpts from two of the Converters exemplify statements illustrative of the early stage of auditory hallucination.

Patient 1) “…You know I talk to myself but I don’t … I don’t know if it is me. I mean if I talk to myself in the mirror you know. I’m talking to me. But how can I have a conversation with myself? I say stuff in my head as if I am talking to me and it’s funny and I laugh like I didn’t know that I was going to say that…”

Patient 2) “I would hear something that sound like a plane engine or like a really… you know… a really far off motor. It never went away entirely. It’s gone a lot more in the past couple of months since Christmas. It just sounds like that… it sounds like a little flame or a cellular… a digital motor.”

A predictive model based on SEMANTIC DENSITY and VOICES

When SEMANTIC DENSITY and VOICES are combined, the resulting model predicts the emergence of psychosis with 93% accuracy (Precision = 0.86; F1 score = 0.86; Sensitivity/Recall = 0.86; Specificity = 0.96). Both SEMANTIC DENSITY, Wald's χ2(1) = 4.047, p = 0.044, and VOICES, Wald's χ2(1) = 5.323, p = 0.021, contributed to the model's predictive performance. Figure 7a shows the probability of conversion to psychosis for the logistic regression equation CONVERSION = 35.828 + (−57.254) * SEMANTIC DENSITY + (20.483) * VOICES. In this model, all but one of the seven Converters are above the 0.5 probability cutoff.
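The combined equation can be sketched in the same way as the single-predictor models, using the intercept and weights reported above; low density and high VOICES similarity push the probability toward conversion.

```python
import math

# Combined logistic model with the coefficients reported above:
# logit = 35.828 - 57.254 * density + 20.483 * voices.

def p_conversion_combined(density, voices):
    logit = 35.828 + (-57.254) * density + 20.483 * voices
    return 1.0 / (1.0 + math.exp(-logit))
```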

Fig. 7 Probability of conversion to psychosis based on SEMANTIC DENSITY and VOICES. a Prediction over the training set. b Probability of conversion based on training dataset applied to data from the holdout dataset

Validation of SEMANTIC DENSITY and VOICES on the holdout dataset

When the regression equation fitted to the training data was applied to the holdout data, it resulted in 90% prediction accuracy (Precision = 1; F1 score = 0.89; Sensitivity/Recall = 0.80; Specificity = 1). As shown in Fig. 7b, all but one of the converters to psychosis had probabilities greater than the 0.5 cutoff. Also as shown in Fig. 7b, had the model been based on the holdout data alone, regressing CONVERSION on SEMANTIC DENSITY and VOICES would have allowed for 100% prediction accuracy.

Association between computational linguistic features and clinically rated symptoms

Combining SEMANTIC DENSITY and VOICES in a single model improved prediction performance in part because the two variables capture different kinds of information, as reflected in the lack of correlation between them, r(28) = 0.069, p = 0.717. Prior research suggests that semantic density should align with negative symptoms, and VOICES with positive symptoms.18 We investigated this possibility using the negative and positive scores on the SIPS obtained within 6 months of the interview. As predicted, negative symptoms correlated negatively with semantic density, r(28) = −0.446, p = 0.013, but not with VOICES, r(28) = 0.316, p = 0.089, and positive symptoms correlated positively with VOICES, r(28) = 0.411, p = 0.024, but not with semantic density, r(28) = −0.134, p = 0.480. The pattern observed in these individual correlations was further supported by canonical correlations, which indicated that the latent variable associated with semantic density and VOICES correlated positively with the latent variable associated with negative and positive symptoms, r = 0.568, p = 0.012. The positive relation between these variables is reflected in the scatterplot shown in Fig. 8. The correlation implies that the semantic variables extracted from text are related to classic variables on standardized rating scales. Crucially, however, when positive and negative symptoms are combined, the resulting model predicts the emergence of psychosis with only 80% accuracy (Precision = 0.66; F1 score = 0.4; Sensitivity/Recall = 0.286; Specificity = 0.956). Thus, a predictive model based on linguistic features outperforms one based on standardized clinical ratings.

Fig. 8 Scatterplot between Semantic density + Voices on the X-axis and positive + negative symptoms on the Y-axis. Canonical correlations indicated that the linguistic indicators are related to classic variables on standardized rating scales (r = 0.568, p = 0.012)

Appropriateness of the New York Times corpus