This section outlines recent work linking language with personality, gender, and age. In line with the focus of this paper, we predominantly discuss works which sought to gain psychological insights. However, we also touch on increasingly popular attempts at predicting personality from language in social media, which, for our study, offer an empirical means to compare a closed vocabulary analysis (relying on a priori word category human judgments) and an open vocabulary analysis (not relying on a priori word category judgments).

Personality refers to the traits and characteristics that make an individual unique. Although there are multiple ways to classify traits [13], we draw on the popular Five Factor Model (FFM, or “Big 5”), which classifies personality traits into five dimensions: extraversion (e.g., outgoing, talkative, active), agreeableness (e.g., trusting, kind, generous), conscientiousness (e.g., self-controlled, responsible, thorough), neuroticism (e.g., anxious, depressive, touchy), and openness (e.g., intellectual, artistic, insightful) [14]. With work beginning over 50 years ago [15] and journals dedicated to it, the FFM is a well-accepted construct of personality [16].

Automatic Lexical Analysis of Personality, Gender, and Age

By examining what words people use, researchers have long sought a better understanding of human psychology [17]–[19]. As Tausczik & Pennebaker put it:

Language is the most common and reliable way for people to translate their internal thoughts and emotions into a form that others can understand. Words and language, then, are the very stuff of psychology and communication [20].

The typical approach to analyzing language involves counting word usage over pre-chosen categories of language. For example, one might place words like ‘nose’, ‘bones’, ‘hips’, ‘skin’, ‘hands’, and ‘gut’ into a body lexicon, and count how often words in the lexicon are used by extraverts or introverts in order to determine who talks about the body more. Of such word-category lexica, the most widely used is Linguistic Inquiry and Word Count, or LIWC, developed over the last couple of decades by human judges designating categories for common words [11], [19]. The 2007 version of LIWC includes 64 different categories of language ranging from part-of-speech categories (e.g., articles, prepositions, past-tense verbs, numbers) to topical categories (e.g., family, cognitive mechanisms, affect, occupation, body), as well as a few other attributes such as total number of words used [11]. The names of all 64 categories can be seen in Figure 2.

Pennebaker & King conducted one of the first extensive applications of LIWC to personality by examining words in a variety of domains including diaries, college writing assignments, and social psychology manuscript abstracts [21]. Their results were quite consistent across these domains, revealing patterns such as agreeable people using more articles, introverts and those low in conscientiousness using more words signaling distinctions, and neurotic individuals using more negative emotion words. Mehl et al. tracked the natural speech of 96 people over two days [22]. Their findings were similar to those of Pennebaker & King; additionally, they found that neurotic and agreeable people tend to use more first-person singular pronouns, that people low in openness talk more about social processes, and that extraverts use longer words.

The recent growth of online social media has yielded great sources of personal discourse. Besides the advantages of data size, the content is often personal and describes everyday concerns. Furthermore, previous research has suggested that populations in online studies, including on Facebook, are quite representative [23], [24]. Sumner et al. examined the language of 537 Facebook users with LIWC [25], while Holtgraves studied the text messages of 46 students [26]. Findings from these studies largely confirmed past links with LIWC, but also introduced some new links, such as neurotics using more acronyms [26] or those high in openness using more quotations [25].

The larger sample sizes from social media also enabled the first study exploring personality as a function of single-word use. Yarkoni investigated LIWC categories along with single words in connection with Big-5 scores of 406 bloggers [27]. He identified single-word results which would not have been caught with LIWC, such as ‘hug’ correlating positively with agreeableness (there is no physical affection category in LIWC), but, considering the sparse nature of words, 406 blogs do not yield a comprehensive view. For example, he found only 13 significant word correlations for conscientiousness, while we find thousands even after Bonferroni-correcting significance levels. Additionally, he did not control for age or gender, although roughly 75% of his subjects were reported to be female. Still, as the most thorough point of comparison for LIWC results with personality, Figure 2 presents the findings from Yarkoni's study along with LIWC results over our data.

Analogous to a personality construct, work has been done in psychology looking at the latent dimensions of self-expression. Chung and Pennebaker factor analyzed 119 adjectives used in student essays on “who you think you are” and discovered 7 latent dimensions with labels such as “sociability” or “negativity” [28]. They related these factors to the Big-5 and found only weak relations, suggesting the 7 dimensions as an alternative construction. Later, Kramer and Chung ran the same method over 1000 unique words across Facebook status updates, finding three components labeled “positive events”, “informal speech”, and “school” [29]. Although their vocabulary size was somewhat limited, we still see these as previous examples of open-vocabulary language analyses for psychology; no assumptions were made about the categories of words beyond part of speech.

LIWC has also been used extensively for studying gender and age [21]. Many studies have focused on function words (articles, prepositions, conjunctions, and pronouns), finding that females use more first-person singular pronouns, that males use more articles, and that older individuals use more plural pronouns and future-tense verbs [30]–[32]. Other works have found that males use more formal, affirmation, and informational words, while females use more social interaction and deictic language [33]–[36]. For age, the most salient findings include older individuals using more positive emotion and fewer negative emotion words [30], older individuals preferring fewer self-references (i.e., ‘I’, ‘me’) [30], [31], and, stylistically, less use of negation [37]. Similar to our finding of 2000 topics (clusters of semantically related words), Argamon et al. used factor analysis and identified 20 coherent components of word use to link gender and age, showing that male components of language increase with age while female components decrease [32].

Occasionally, studies find contradictory results. For example, multiple studies report that emoticons (e.g., ‘:)’, ‘:-(’) are used more often by females [34], [36], [38], but Huffaker & Calvert found males use them more in a sample of 100 teenage bloggers [39]. This particular discrepancy could be sample-related – differing demographics or a non-representative sample (Huffaker & Calvert looked at 100 bloggers, while later studies have looked at thousands of Twitter users) – or it could be due to differences in the domain of the text (blogs versus Twitter). One should always be careful generalizing new results outside of the domain in which they were found, as language is often dependent on context [40]. In our case we explore language in the broad context of Facebook, and do not claim our results would hold up in other, smaller or larger, contexts. As a starting point for reviewing more psychologically meaningful language findings, we refer the reader to Tausczik & Pennebaker's 2010 survey of computerized text analysis [20].

Eisenstein et al. presented a sophisticated open-vocabulary language analysis of demographics [41]. Their method views language analysis as a multi-predictor to multi-output regression problem, and uses an L1 norm to select the most useful predictors (i.e., words). Part of their motivation was finding interpretable relationships between individual language features and sets of outcomes (demographics), and unlike the many predictive works we discuss in the next section, they test for significance of relationships between individual language features and outcomes. In contrast, our approach considers features and outcomes individually (i.e., an “L0 norm”), which we believe is better suited to our goal of explaining psychological variables (e.g., understanding openness by the words that correlate with it). For example, their method may throw out a word which is strongly predictive for only one outcome or which is collinear with other words, while we want to know all the words most predictive for a given outcome. We also explore other types of open-vocabulary language features, such as phrases and topics.

Similar language analyses have also occurred in many fields outside of psychology or demographics [42], [43]. For example, Monroe et al. explored a variety of techniques that compare two frequencies of words – one number for each of two groups [44]. In particular, they explored frequencies across Democratic versus Republican speeches and settled on a Bayesian model with regularization and shrinkage based on priors of word use. Lastly, Gilbert found words and phrases that distinguish communication up or down a power hierarchy across 2044 Enron emails [45]. He used penalized logistic regression to fit a single model, taking the coefficient of each feature as its “power”; this produces a good single predictive model, but also means words which are highly collinear with others will be missed (we run a separate regression for each word to avoid this).

Perhaps one of the most comprehensive language analysis surveys outside of psychology is that of Grimmer & Stewart [43]. They summarize how automated methods can inexpensively allow systematic analysis and inference from large political text collections, classifying types of analyses into a hierarchy. Additionally, they provide cautionary advice; in relation to this work, they note that dictionary methods (such as the closed-vocabulary analyses discussed here) may signal something different when used in a new domain (for example, ‘crude’ may be a negative word in student essays, but neutral in energy industry reports: ‘crude oil’). For comprehensive surveys on text analyses across fields, see Grimmer & Stewart [43], O'Connor, Bamman, & Smith [42], and Tausczik & Pennebaker [46].