Key Steps in Text Mining Research

TM generally entails three steps, namely, (a) text preprocessing, (b) application of TM operations, and (c) postprocessing (Y. Zhang, Chen, & Liu, 2015). Figure 1 provides a diagram of the steps in the TM process, along which our discussion is organized. Text preprocessing may be further subdivided into text data cleaning and text data transformation (e.g., converting unstructured text into mathematical structures that can serve as input to various TM operations). TM operations refer to the application of pattern mining algorithms with the overriding goal of modeling characteristics of text. Finally, postprocessing involves interpreting and validating the knowledge derived from TM operations. Below we explain each step in detail. Hereafter, all mentions of the word data refer to text data. Moreover, we use the words document and text interchangeably. The appendix provides a glossary of key terms.

The output from word segmentation provides the vocabulary. One may start by creating a document-by-term matrix. In R, the “tm” library has a function that can generate a document-by-term matrix, with an additional option for specifying weights. For example, consider the six preprocessed texts in Table 2. Part of the document-by-term matrix constructed from these texts using raw frequency weighting is shown in Table 2. The complete matrix has 40 columns, equal to the number of unique words found in the six texts.
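As a minimal sketch of this step with the “tm” package, the following code builds a document-by-term matrix with raw frequency weighting; the three short texts are hypothetical stand-ins for the preprocessed texts in Table 2.

```r
# Minimal sketch using the "tm" package; the texts below are
# hypothetical stand-ins for the preprocessed texts in Table 2.
library(tm)

texts <- c("manage product development team",
           "develop mobile apps for aircraft",
           "handle client requests and complaints")

corpus <- VCorpus(VectorSource(texts))

# Document-by-term matrix with raw frequency weighting (the default)
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
```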

Representing text as a document-by-term matrix presupposes that word order information is not crucial in the analysis. Although unsophisticated, transformations that ignore word order information perform better in many applications than transformations that account for it (Song, Liu, & Yang, 2005; W. Zhang, Yoshida, & Tang, 2008). The main computational challenge for the document-by-term matrix representation is how to deal with the resulting dimensionality, which is directly proportional to the size of the vocabulary. One can use different dimensionality reduction methods to reduce the number of variables (e.g., variable selection and variable projection techniques) or employ techniques specifically suited to high-dimensional data. These techniques will be highlighted in the Text Mining Operations section.

Word frequency, in itself, may not be useful if the task is to group or categorize documents (Kobayashi, Mol, Berkers, Kismihók, & Den Hartog, 2018). Consider the word study in a corpus of abstracts of scientific articles. If the objective is to categorize the articles into topics or research themes, then this word is not informative because, in this particular context, almost all documents contain it. A way to prevent the inclusion of terms that possess little discriminatory power is to weight each word with respect to its specificity to some documents in a corpus (Lan, Tan, Su, & Lu, 2009). The most commonly used weighting procedure for this is the inverse document frequency (IDF; Salton & Buckley, 1988). A term is not important for discrimination if its IDF is 0, implying that the word is present in every document. In fact, IDF can also be the basis for selecting stop words for the categorization task at hand. Words that have a low IDF have little discriminatory power and can be discarded. When multiplied, the raw word frequency (TF) and the IDF yield the popular TF-IDF, which simultaneously takes into account the importance of a word and its specificity (Frakes & Baeza-Yates, 1992).
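In the “tm” package, this weighting can be requested directly when building the matrix; the sketch below (reusing the corpus from the previous example) applies weightTfIdf, which multiplies the term frequency (length-normalized by default) by log2(N / document frequency), where N is the number of documents.

```r
# TF-IDF weighting in "tm": term frequency (normalized by document
# length by default) multiplied by log2(N / document frequency).
dtm_tfidf <- DocumentTermMatrix(corpus,
                                control = list(weighting = weightTfIdf))
inspect(dtm_tfidf)
```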

Text transformation is a quantification strategy in which text is transformed into mathematical structures. Most analytical techniques require text to be transformed into a matrix structure, where the columns are the variables (also referred to as features) and the rows are the documents. One way to construct this matrix is to use the words or terms in the vocabulary as variables. The resulting matrix is called a “document-by-term matrix,” in which the values of the variables are the “weights” of the words in that document. In many applications, this is a straightforward choice since words are the basic linguistic units that express meaning. The raw frequency of a word is the count of that word in a document. Thus, in this transformation, each document is transformed into a “vector,” the size of which is equal to the size of the vocabulary, with each element representing the weight of a particular term in that document (Scott & Matwin, 1999).

A popular stemming algorithm was developed by Porter (Porter, 1980; Willett, 2006). Most of the other aforementioned procedures can be performed by applying string processing. For example, in R, the Text Processing section of the R Programming wiki (R Programming/Text Processing, 2014) provides information on how to implement text processing procedures. The “tm” library, the core framework for TM in R, has functions for stop word removal and stemming. The website RANKS NL (n.d.) provides lists of stop words for many human languages. There are cases where it is not appropriate to apply stop word removal and stemming, for example, in short text classification (Faguo, Fan, Bingru, & Xingang, 2010).
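As a sketch of these steps, the following pipeline chains the “tm” cleaning functions on the corpus from the earlier example; stemDocument() applies the Porter stemmer and requires the “SnowballC” package.

```r
# Cleaning pipeline with "tm"; stemDocument() applies the Porter
# stemmer via the "SnowballC" package.
library(tm)
library(SnowballC)

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)
```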

Text segmentation (Huang & Zhang, 2009) is the process of dividing text into sentences and words. Stop words such as conjunctions and prepositions (e.g., and, the, of, for) are words that have low information content and do not contribute much to the meaning of the text. “Stemming” homogenizes the representation of semantically similar words (e.g., representing the words “ensures,” “ensuring,” and “ensured” by “ensure”). Since these techniques delete or merge words, they also serve to reduce the size of the vocabulary.

Data cleaning enhances data quality, which in turn enhances the validity of extracted patterns and relationships. Cleaning is done by retaining only the relevant text elements (Palmer, 2010). Standard cleaning procedures for text include the deletion of unimportant characters (e.g., extra whitespaces and formatting tags), “text segmentation,” “lowercase conversion,” “stop word removal,” and “word stemming.” For open-ended survey responses (and other informally produced texts such as SMS texts or personal emails), in our experience it may be useful to run a spelling check to correct misspelled words. For web documents, HTML or XML tags must be removed since these do not add meaningful content. The end result is text data stripped of all low-content words and characters.

As APIs provide an efficient and legal means of obtaining data from the web, researchers who use text from the web first need to find out whether the target website offers an API. One way to find APIs is to use a search platform for APIs such as ProgrammableWeb (ProgrammableWeb, n.d.). When an API is not available, the next option is web scraping. R has libraries that can automate the web scraping process, such as “rvest.” Other useful packages include “RCurl” (for HTTP requests), “XML” (for parsing HTML and XML documents), and “stringi” (for text manipulation). If web scraping is not allowed, researchers should ask data owners whether they are willing to share their data through remote connections to their databases. Text documents in databases may be fetched using structured query language (SQL).
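As a minimal illustration of web scraping with “rvest,” the sketch below fetches a page and extracts text; the URL and CSS selector are placeholders rather than a real endpoint.

```r
# Web scraping sketch with "rvest"; the URL and CSS selector are
# placeholders, not a real vacancy site.
library(rvest)

page <- read_html("https://example.com/vacancies")
vacancy_text <- page %>%
  html_nodes(".vacancy-description") %>%   # hypothetical selector
  html_text(trim = TRUE)
```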

It is important to be aware of the legal and ethical issues associated with data access, particularly web scraping. Website contents are often protected by copyright law, and lawsuits may ensue if the terms of fair use are violated (see, for instance, Associated Press, 2013). Also, privacy issues may preclude the use of certain types of personal text data without permission, such as web forms, surveys, emails, and performance appraisals (Van Wel & Royakkers, 2004). Another potential issue to be aware of is data storage. In small projects, collected text data can be temporarily stored in a local file system (e.g., on a computer). However, in large-scale text analytics, especially when the data come from different sources, merging, storing, and managing data may require integrated database systems or data warehouses (Inmon, 1996).

Before initiating the TM process, one needs text data, and the first step in data collection is to decide on the most suitable data source(s). Potential sources include the web, enterprise documents (e.g., memos, reports, and hiring offers), personal text (e.g., diaries, emails, SMS messages, and tweets; Inmon & Nesavich, 2007), and open-ended survey responses. TM requires that text be in digital form or that it can be transcribed to this form. Nondigital text (e.g., handwritten or printed documents) may be digitized using optical character recognition techniques (Borovikov, 2014). Web text data are collected from websites either through web application programming interfaces (APIs) or web scraping (i.e., the automatic extraction of web page content; Olston & Najork, 2010).

Dimensionality reduction and distance and similarity computation are usually evaluated by their impact on text classification and text clustering performance (Forman, 2003). That is, an effective dimensionality reduction technique must contribute to improved classification or clustering performance. An analogous comment applies to distance and similarity computation, since these measures often serve as input to clustering (e.g., k-means) and classification tasks (e.g., nearest neighbor), although there are applications where distance and similarity measures are used as a standalone method (Houvardas & Stamatatos, 2006; Lewis, 1992b; Mihalcea et al., 2006). An example of the latter is comparing (parts of) leader and subordinate resumes to operationalize person-supervisor similarity. In information retrieval, where the task is to match queries to document content, performance metrics for distance or similarity measures are precision, recall, and the F-measure.

Model evaluation helps us choose which among competing models best explains the data (Alpaydin, 2014). Model evaluation needs to address issues related to underfitting and overfitting. Underfitting happens when the model does not adequately represent the relationships present in the data (i.e., high bias). Overfitting occurs when a model performs well on the data used to build it but poorly on new data (i.e., high variance). Hence, a model generalizes well if it also demonstrates good performance on new data (Mitchell, 1997). A common way to assess a model’s generalizability is to use hold-out data (Alpaydin, 2014). The procedure involves repeatedly splitting the corpus of documents into a training and a test set, either by randomly sampling documents from the corpus or by partitioning the corpus. Documents in the training set are used to fit the model, and the generalizability of this model is assessed using the documents in the test set. Procedures that evaluate a model by partitioning the corpus are K-fold cross-validation and a resampling procedure called bootstrapping (Kohavi, 1995). Measures used to assess generalizability are commonly referred to as evaluation metrics. Since a different value of the metric is obtained for each unique split, values are usually averaged across splits. Using cross-validation and bootstrapping, one can build confidence intervals and assess the true performance of the model. The choice of metric depends on the task and application domain. It should be kept in mind, however, that the conclusions generated are conditional on the data; that is, a model is good only insofar as the data are representative of the population. Moreover, there are other criteria by which to judge the merit of a model, such as the time it takes to build and its interpretability.
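A schematic K-fold cross-validation loop in base R is sketched below; docs, fit_model(), and evaluate() are placeholders for the data frame of documents, a modeling function, and an evaluation metric, respectively.

```r
# Schematic 10-fold cross-validation; `docs`, fit_model(), and
# evaluate() are placeholders, not functions from a specific package.
k <- 10
folds <- sample(rep(1:k, length.out = nrow(docs)))  # random fold labels

scores <- sapply(1:k, function(i) {
  train <- docs[folds != i, ]
  test  <- docs[folds == i, ]
  model <- fit_model(train)   # placeholder modeling function
  evaluate(model, test)       # placeholder metric, e.g., accuracy
})
mean(scores)  # performance averaged across the k splits
```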

For topic extraction, the recommended initial approach is to try LDA. It may also be useful to investigate the assignment of documents to topics. Code to run topic models is available in the “topicmodels” package in R. For example, we ran LDA and CTM on the example texts in Table 2 (see Table 3 for the results). The top terms listed in each topic form the basis for topic interpretation. For example, the top terms of Topic 1 indicate that this topic is about product management, whereas Topic 2 is more about abilities relating to aircraft apps, and Topic 3 is about the handling of clients. Examining the most likely topic for each document, we observe that Documents 3, 9, and 11 have Topic 3 as the most likely topic since these documents focus on dealing with customers.
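A minimal sketch of this analysis with “topicmodels,” assuming dtm is the document-by-term matrix built earlier; k = 3 mirrors the three topics discussed above.

```r
# Fitting LDA and CTM with "topicmodels" on a document-by-term
# matrix; k = 3 mirrors the three topics discussed above.
library(topicmodels)

lda_fit <- LDA(dtm, k = 3)
ctm_fit <- CTM(dtm, k = 3)

terms(lda_fit, 5)   # top 5 terms per topic (basis for interpretation)
topics(lda_fit)     # most likely topic per document
```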

Perhaps the most popular topic models are the latent Dirichlet allocation (LDA; Blei et al., 2003) model and the correlated topic model (CTM; Blei & Lafferty, 2007). LDA and CTM both operate on the document-by-term matrix (Porteous et al., 2008). CTM will yield almost the same topics as LDA. The main difference between the two is that in LDA topics are assumed to be uncorrelated, whereas in CTM topics can be correlated. In comparing LSA with LDA, the latter has been found to be particularly suitable for documents containing multiple topics (S. Lee, Baker, Song, & Wetherbe, 2010).

Topic models automatically extract topics from documents. These topics can indicate underlying constructs or themes. In machine learning and natural language processing, topic models are probabilistic models that discover topics by examining the pattern of term frequencies (Blei, Ng, & Jordan, 2003). Their mathematical formulation rests on two premises: A topic is characterized by a distribution over terms, and each document contains a mixture of different topics. The most likely topic of a document is therefore determined by its terms. For example, when an open-ended survey response contains words such as pay, compensation, salary, and incentive, one might label its topic as “rewards or pay systems.”

One can start with k-means or a hierarchical approach such as complete linkage or Ward’s method (El-Hamdouchi & Willett, 1989). If a researcher has a clear idea of how many clusters to create, then k-means is a good start. If a researcher has no idea how many clusters to construct, then she may use hierarchical clustering to see whether interpretable groupings emerge. The “cluster” and “mclust” packages in R run most of the clustering techniques described here, and the “proxy” package offers various distance and similarity measures.
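The following sketch shows both starting points in base R, assuming dtm_mat is a (possibly dimension-reduced) document-by-term matrix in dense form.

```r
# k-means and hierarchical clustering sketches; dtm_mat is assumed
# to be a document-by-term matrix in dense form.
dtm_mat <- as.matrix(dtm)

km <- kmeans(dtm_mat, centers = 3)               # k chosen a priori
hc <- hclust(dist(dtm_mat), method = "ward.D2")  # Ward's method
plot(hc)  # inspect the dendrogram for interpretable groupings

# Cosine distances via the "proxy" package, if preferred:
# d <- proxy::dist(dtm_mat, method = "cosine")
```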

Most clustering algorithms are categorized as either hierarchical or partitional (Steinbach et al., 2000). Hierarchical clustering algorithms either treat each object as its own cluster and then gradually merge clusters until all objects belong to a single cluster (i.e., agglomerative) or first put all objects in one cluster and recursively split clusters until each object is in its own cluster (i.e., divisive). The merging (or splitting) of clusters is depicted by a tree or dendrogram. For partitional clustering, the user has to specify the number of clusters a priori, and clusters are formed by optimizing an objective function that is usually based on the distances of the objects to the centers of the clusters to which they have been assigned. The popular k-means algorithm is an example of partitional clustering (Derpanis, 2006). One key challenge in clustering is determining how many clusters to form. Since clustering is an exploratory technique, a common strategy is to experiment with different numbers of clusters and use cluster evaluation measures to decide. Examples of quality measures are the Dunn index and the silhouette coefficient (Rendón, Abundez, Arizmendi, & Quiroz, 2011).

Many tasks in TM involve organizing texts into groups such that documents belonging to the same group are similar and documents from different groups are not (Jain, Murty, & Flynn, 1999; Steinbach, Karypis, & Kumar, 2000). The process of grouping is called clustering. The main use of text clustering is either to organize documents to facilitate search and retrieval or to impose an automatic categorization of documents. For example, text clustering has been used to detect crime patterns (e.g., location, type of crime, weapons) in crime reports (Bsoul, Salim, & Zakaria, 2013), to organize and deepen the taxonomy of legal practice areas (Conrad, Al-Kofahi, Zhao, & Karypis, 2005), and to improve the performance of a document retrieval system or web-based search engine by creating a taxonomy of documents and grouping the search query results (Osinski & Weiss, 2005). To perform text clustering, the researcher needs to define a distance between texts (e.g., Euclidean distance). The distance measure can be computed from the original set of variables or from the reduced set of variables (e.g., after application of dimensionality reduction techniques such as LSA).

Assessing the similarity of two or more documents is a key activity in many applications, such as document retrieval (e.g., document matching) and recommendation systems (e.g., for finding similar products based on product descriptions or reviews). Numerous measures that operate on vector representations may be employed to assess distance or similarity. An example of the latter is the cosine measure, which is used extensively in information retrieval (Frakes & Baeza-Yates, 1992). The values for this measure range from –1 (two vectors point in opposite directions) to 1 (two vectors point in the same direction); 0 means that the two vectors are orthogonal or perpendicular (i.e., uncorrelated). This measure assesses the similarity of two documents based on the frequencies of the terms they share, which are taken to indicate similarity of content. It has been applied to document matching (Frakes & Baeza-Yates, 1992) and detecting semantic similarity (Mihalcea, Corley, & Strapparava, 2006).
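As a sketch, the cosine of two document vectors can be computed directly from their dot product and norms, here applied to two rows of the document-by-term matrix from earlier.

```r
# Cosine similarity: cos(a, b) = sum(a * b) / (||a|| * ||b||)
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# e.g., similarity between the first two documents in dtm_mat
cosine_sim(dtm_mat[1, ], dtm_mat[2, ])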

Consider the 11 texts in Table 2, which, after data cleaning, are transformed into a document-by-term matrix. Running LSA on the transpose of the document-by-term matrix, we retained 2 dimensions. The resulting LSA space, represented as a matrix, is presented in Table 3. Observe that the value for “product” in Document 11 is 0.32 although Document 11 does not contain the word “product.” This is due to the presence of the word “experience” in the two other documents (Document 4 and Document 8) that also contain “product.” Since “experience” is present in Document 11, LSA expects to find “product” in this document. This is how LSA deduces meaning from words (which is also useful for the identification of synonyms).

The dimensionality reduction stage is usually initiated by applying LSA. The LSA results (i.e., the reduced data set) can be used as input to clustering and classification. Alternatively, one can apply one of the filter methods to trim away unimportant variables. One advantage of filters compared to LSA is interpretability, since no new variables are constructed. Moreover, filter methods are faster to run. The R package “lsa” provides functionality for running LSA.
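A minimal sketch with the “lsa” package, which expects a term-by-document matrix (hence the transpose), retaining 2 dimensions as in the example above.

```r
# LSA with the "lsa" package; it operates on a term-by-document
# matrix, so we transpose the document-by-term matrix first.
library(lsa)

tdm <- t(dtm_mat)
lsa_space <- lsa(tdm, dims = 2)        # retain 2 latent dimensions
reduced   <- as.textmatrix(lsa_space)  # reconstructed matrix, as in Table 3
```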

An alternative approach to reducing dimensionality is to eliminate variables using variable selection methods (Guyon & Elisseeff, 2003). In contrast to projection methods, variable selection methods do not create new variables but rather select from the existing variables by eliminating those that are uninformative or redundant (e.g., words that occur in too many documents might not be useful for categorizing documents). Three types of methods are available: filters, wrappers, and embedded methods. Filters assign scores to variables and apply a threshold to the scores to delete irrelevant variables. Popular filters are TF-IDF thresholding, information gain, and the chi-square statistic (Forman, 2003; Yang & Pedersen, 1997). Wrappers select the best subset of variables in conjunction with an analytical method. In embedded methods, the search for the best subset of variables is accomplished by minimizing an objective function that simultaneously takes into account model performance and complexity. Model performance can be measured, for example, by prediction error (in the case of classification), and complexity is quantified by the number of variables in the model. The smallest subset of variables yielding the lowest prediction error is the preferred subset.
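As an illustration of a simple filter, the sketch below scores each term by a chi-square test of association between term presence and the class labels, keeping the top-scoring terms; labels is a hypothetical vector of document classes.

```r
# Chi-square filter sketch; `labels` is a hypothetical vector of
# document class labels aligned with the rows of dtm_mat.
scores <- apply(dtm_mat, 2, function(term_counts) {
  tab <- table(factor(term_counts > 0, levels = c(FALSE, TRUE)), labels)
  suppressWarnings(chisq.test(tab)$statistic)   # term presence vs. class
})

keep <- names(sort(scores, decreasing = TRUE))[1:100]  # top 100 terms
dtm_filtered <- dtm_mat[, keep]
```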

LSA is commonly used to detect synonymy (i.e., different words that have the same meaning) and polysemy (i.e., one word used in different yet related senses) among words. PCA is effective for data reduction as it preserves the variance of the data. Parallel analysis ( Ford, MacCallum, & Tait, 1986 ; Hayton, Allen, & Scarpello, 2004 ; Montanelli & Humphreys, 1976 ) is the recommended strategy to choose how many dimensions to retain in PCA. A disadvantage of both LSA and PCA is that it may be difficult to attach meaning to the constructed dimensions. Another technique is random projection, where data points are projected to a lower dimension while maintaining the distances among points ( Bingham & Mannila, 2001 ).

Singular value decomposition (SVD) is a classic tool that underlies techniques such as latent semantic analysis (Landauer, Foltz, & Laham, 1998) and principal component analysis (PCA; Jolliffe, 2005). The SVD method decomposes a matrix X of size p × n (where p is the number of variables and n is the number of documents) into a product of three matrices, that is, X = UΣVᵀ. One of these is a diagonal square matrix (Σ) that contains the singular values (Klema & Laub, 1980). Reducing the number of dimensions involves retaining only the first few largest singular values. This amounts to choosing the latent dimensions and can recover the underlying dimensionality of the data, which is at times obscured by random noise.
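In base R, a rank-r truncation of the SVD can be sketched as follows, reusing the term-by-document matrix from the LSA example.

```r
# Truncated SVD in base R: keep only the r largest singular values.
X <- t(dtm_mat)   # p terms x n documents
s <- svd(X)       # returns u, d (singular values), and v

r <- 2
X_r <- s$u[, 1:r] %*% diag(s$d[1:r]) %*% t(s$v[, 1:r])  # rank-r approximation
```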

Document-by-term matrices tend to have many variables. It is usually desirable to reduce the size of these matrices by applying dimensionality reduction techniques. Some of the benefits of reducing dimensionality are more tractable analysis, greater interpretability of results (e.g., it is easier to interpret relationships between variables when there are fewer of them), and more efficient representation. Compared to working with the initial document-by-term matrices, dimensionality reduction may also reveal latent dimensions and yield improved performance (Bingham & Mannila, 2001). Two general approaches are commonly used to reduce dimensionality. One is to construct new latent variables, and the second is to eliminate irrelevant variables. In the former case, new variables are modeled as a (non)linear combination of the original variables and may be interpreted as latent constructs (e.g., the words years, experience, and required may be merged to express the concept of work experience in job vacancies).

Though text transformation precedes the application of analytical methods, these two steps are closely intertwined. The document-by-term matrix from the text transformation step serves as the input data for most of the procedures in this section. Sometimes, when results are unsatisfactory, the researcher may consider changing or enlarging the set of variables derived from the transformation step (Lewis, 1992a; Scott & Matwin, 1999) or choosing another analytical method. Usually, different combinations of data transformations and analytical techniques are tried, and the one that yields the highest performance is selected.

Another way to validate TM output is through replication, data triangulation, and an indirect inferential route (Binning & Barrett, 1989). A standard can be established by obtaining external data using accepted measures or instruments that provide theory-based operationalizations that should or should not correlate with the model. Such correlations give an indication of validity. For example, to validate experience requirements extracted from job vacancies, one can administer questionnaires to job incumbents asking them about their experience. Validity is then ascertained through the correlation between the two operationalizations. This can be replicated on various types of text to assess whether the TM model consistently generates valid experience requirements for a particular occupation. In theory, one could even compute full multitrait-multimethod correlation matrices (Campbell & Fiske, 1959) to compare the measurements obtained from TM with established instruments, although in practice it may be difficult to obtain the fully crossed dataset that this requires.

A straightforward practice for construct validation is to have independent experts validate TM output. For example, in text classification, SMEs may be consulted from time to time to assess whether the resulting classifications of text are correct. High agreement between the experts and the model provides an indication of the content-related validity of the model. Agreement is usually quantified using measures such as Cohen’s kappa or the intraclass correlation coefficient.
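As a sketch, such agreement can be computed with the kappa2() function from the “irr” package; expert_labels and model_labels are hypothetical vectors of category assignments.

```r
# Cohen's kappa for expert-model agreement; expert_labels and
# model_labels are hypothetical vectors of category assignments.
library(irr)

ratings <- data.frame(expert = expert_labels, model = model_labels)
kappa2(ratings)
```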

Before TM-based findings are applied to support decision making and knowledge generation, their validity needs to be established. When TM is used to identify and operationalize constructs, using different forms of data triangulation will help generate construct validity evidence. For example, in our job analysis application of TM, which follows below, we enlisted the help of job analysts and subject matter experts (SMEs) in evaluating the output of the TM of vacancy texts. In other cases, TM outcomes could be compared to survey data, such as for the aforementioned study on the role of personality in language use (Yarkoni, 2010). More generally, TM-based models require a comparative evaluation in which (part of) the TM output is correlated with independent data sources or other “standards” (such as the aforementioned survey or expert data). Though it is easy to view TM as a mechanistic means of extracting information from data, the input of domain experts is critically important. Finally, there is no reason why validity assessment procedures, such as those outlined by Binning and Barrett (1989) to establish the validity of personnel decisions, cannot be applied to TM output.

The postprocessing step may involve domain experts to assist in determining how the output of the models can be used to improve existing processes, theory, and/or frameworks. Two major issues are usually addressed here. The first is to find out whether the extracted patterns are real and not just random occurrences due to the sheer size of the data (e.g., by applying Bonferroni’s principle). The second is, as with all empirical research, whether data and results are valid. Establishing the reliability, validity (e.g., content, predictive, and discriminant validity), and credibility of the output of TM models is particularly important for TM to gain legitimacy in organizational research. It is important to note here that it is not the TM procedures that need to be validated but their output, for example, the predictions of a TM-based classifier (in the same manner that we do not validate factor analysis itself).

The SME classified 93.3% of the extracted tasks as representative of actual nursing jobs. This expert validation provided initial support for the content validity of the TM model, as the information collected from the vacancies appears to accurately reflect the job. Second, we compared the TM results with a traditional job analysis, namely a task inventory, to validate our results by data triangulation. The task inventory consisted of four interviews and a two-day observation with SMEs (i.e., nurses and a head nurse) from two German hospitals. (More information about the task inventory is available in the Supplemental Materials.) Tasks from both lists were rated as synonyms (i.e., exactly the same), similar (i.e., different wording, same meaning), or dissimilar (i.e., different wording and meaning) based on the decision rules of Tett, Guterman, Bleier, and Murphy (2005). Based on this comparison, 55.6% of all tasks were found in both lists, whereas 29.1% were unique to the task inventory and 15.2% to the online vacancies. The relatively high correspondence (≥50%) between the list of tasks collected by TM and the list of tasks collected in the task inventory further established convergent validity.

Since it is difficult to find job experts who have expertise across professions, the following discussion of validity is based solely on nursing jobs and experts on those. Specifically, we wanted to assess whether the extracted work activities for nurses correspond to actual nursing tasks. We validated the TM application to job analysis in two ways. First, we asked a nursing expert (i.e., a training coordinator) to examine the condensed list of 76 nursing tasks that we extracted from the nursing vacancies for consistency with the actual tasks executed in practice. The 76 nursing tasks were obtained by first extracting task sentences from vacancies and then applying clustering to group similar tasks together. Hence, we only presented core nursing tasks to the expert.

The parameter set for each technique and the classification results are summarized in Table 5. The means of the two metrics from the 10-fold cross-validation suggest that SVM and RF perform better than NB. A comparison of the mean accuracies using a one-way ANOVA found that at least one mean accuracy differed from the rest, F(2, 27) = 15.94, p < .001. A post hoc analysis using Tukey’s honestly significant difference (HSD) method revealed that the mean accuracy of NB differed significantly from that of the other two techniques (RF, p = .001; SVM, p < .001), whereas SVM and RF did not differ significantly from one another (p = .988). These high accuracies can be explained by the appropriateness of the extracted variables and the suitability of these classifiers for text data. To make predictions even more robust, one can aggregate them (e.g., by means of majority voting).

For the classifier, three techniques were tested, namely naive Bayes (NB), support vector machine (SVM), and random forest (RF). We chose these as they are purportedly the most effective classifiers for text classification (Aggarwal & Zhai, 2012). We built each classifier and assessed its performance through 10-fold cross-validation using accuracy and the F-measure as performance metrics. These metrics reflect our objective of creating an accurate classifier that favors neither of the two categories (attribute or activity).
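A sketch of fitting the three classifiers, using “e1071” for NB and SVM and “randomForest” for RF; train_df and test_df are hypothetical data frames holding the Table 4 variables plus the label column described below.

```r
# Sketch of the three classifiers; train_df/test_df are hypothetical
# data frames with the predictor variables and a `label` column
# coding worker attribute (0) vs. work activity (1).
library(e1071)
library(randomForest)

train_df$label <- factor(train_df$label)

nb_fit  <- naiveBayes(label ~ ., data = train_df)
svm_fit <- svm(label ~ ., data = train_df)
rf_fit  <- randomForest(label ~ ., data = train_df)

pred <- predict(svm_fit, test_df)
mean(pred == test_df$label)  # accuracy on the held-out fold
```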

The data matrix served as the input data for the classification of job information. To construct models, we needed labeled training data. We examined each sentence and employed standard definitions from the job analysis literature (these were accumulated in a coding manual that is available in the Supplemental Materials) to label each sentence as either a work activity or a worker attribute. In establishing the labeled training data, mixed sentences containing both activity and attribute information were split, and buffer sentences not containing any relevant information were dropped. For the construction of the classification model, we added a 169th column to the data matrix. This column contained the classification of sentences into either job attribute (0) or job activity (1) as derived from the manually labeled sentences.

For the purposes of our analysis, we grouped all noun-, verb-, adjective-, and adverb-related tags together. We grouped related tags under one general derived tag because we did not require detailed information about each tag. For example, singular or mass noun (NN), plural noun (NNS), singular proper noun (NNP), and plural proper noun (NNPS) were all subsumed under the “noun” tag. Other noteworthy tags are TO (to), CD (cardinal number), and MD (modal); these tags appeared important for discriminating between work activities and worker attributes. The TO tag is indicative of a job activity (e.g., “to ensure project stays on track for assigned client projects”), because it reflects either the result of an action or the infinitive form of a verb in a task. The presence of a CD tag most often points to the years of education or work experience required from job applicants, and hence is indicative of the worker attribute category. The complete list of variables can be found in Table 4.

For the transformation step, we deviated from the approach of using solely words as variables. We generated a list of variables that would potentially be able to predict the category membership of sentences, that is, either work activities (e.g., tasks) or worker attributes (e.g., skills). We used knowledge from the job analysis field and eyeballing coupled with statistical tests to preselect these variables. Based on definitions of tasks, for example, we deduced that these are often indicated by sentences that consist of an action verb, the object of the action, the source of information or instruction, and the results (Morgeson & Dierdorff, 2011; Voskuijl, 2005). We also expected verbs to be more prevalent in activity sentences than in attribute sentences. Using part-of-speech tagging, we computed features such as the percentage of verbs in a sentence and the part of speech of the first word. For the POS labels, we based the tags on the Penn part-of-speech tags (Penn Part of Speech Tags, n.d.).
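As one possible implementation of such features, the sketch below uses the “udpipe” package, one of several POS taggers available in R (the text does not specify which tagger was used), to compute the percentage of verbs per sentence from the Penn-style tags in the xpos column; sentences is a hypothetical character vector.

```r
# POS-based feature sketch with "udpipe" (one of several taggers in
# R; the tagger actually used is not specified in the text).
library(udpipe)

m        <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(m$file_model)
ud       <- udpipe(x = sentences, object = ud_model)  # `sentences` is hypothetical

# Percentage of verbs per sentence, using Penn-style tags in `xpos`
verb_pct <- tapply(grepl("^VB", ud$xpos), ud$doc_id, mean) * 100
```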

Since our analysis operates at the sentence level and some of our variables are derived from words, we started by applying sentence and word segmentation. Once words and sentences were identified, we converted letters to lowercase and removed stop words. The criteria used to determine whether a word is a stop word were based on the standard English stop word list (RANKS NL, n.d.) and our own inductive identification of words that did not appear to be associated with the types of job information we were interested in detecting. Hence, conjunctions, articles, and prepositions were deleted. We retained the following stop words: “to,” “have,” “has,” “had,” “must,” “can,” “could,” “may,” “might,” “shall,” “should,” “will,” and “would,” because these were useful for the classification task. Specifically, sentences containing “to” and “will” often contain job activities, whereas “have,” “has,” “had,” “should,” and “must” are suggestive of worker attributes. Having deleted the irrelevant stop words, we removed punctuation except for intraword dashes, to avoid separating words that together express a single meaning (e.g., problem-solving, customer-oriented, pro-active). Finally, we stripped the extra whitespaces that resulted from the deletion of particular characters. The output was a collection of lowercase sentences from which the irrelevant stop words and punctuation had been stripped.

To illustrate the key steps in TM, we provide an example from job analysis. Job analysis aims to collect and analyze any type of job-related information to describe and understand a job in terms of the behaviors necessary for performing it (Sanchez & Levine, 2012; Voskuijl, 2005). Job analytic data are traditionally collected through interviews, observations, and surveys among SMEs, including job holders, supervisors, and job analysts (Morgeson & Dierdorff, 2011). Here, we apply a TM approach to automatically classify job information from vacancies and assess whether the worker attributes necessary for effective job performance emerge from the vacancies, to show that TM might be a useful tool for job analysts.

Topic Modeling on Worker Attributes

We now proceed with our second aim of analyzing all of the extracted worker attributes (i.e., not restricted solely to those of the nurses). Our goal is to summarize the worker attributes, find worker attribute constructs, and use these to cluster jobs. For this purpose, we applied topic modeling using LDA to the extracted worker attribute sentences. We set the number of topics to 140 based on two criteria. One criterion is based on topic distances, as discussed in the article by Cao, Xia, Li, Zhang, and Tang (2009), and the other is based on the idea that LDA is a matrix factorization mechanism whose quality depends on choosing the right number of topics (for additional information, we refer the reader to the article by Arun, Suresh, Madhavan, & Murthy, 2010). We used variational expectation maximization to estimate the parameters of the LDA model.

In the interest of space and for the purpose of illustration, we show in Table 5 a subset of twelve topics generated by LDA. Looking at the top 8 words, Topics 75, 18, 45, 108, and 129 appear to point to behavioral/personal qualities. Topic 75 could be interpreted as interpersonal communication skills, Topic 18 as self-motivation, Topic 45 seems to pertain to attention to detail, Topic 108 seems to be about analytical and problem-solving skills, and Topic 129 about teamwork. Topics 132 and 16 are attributes that were seldom considered in job analysis studies (e.g., Harvey, 1986) and may well reflect new worker attributes sought by contemporary organizations. Topic 132 seems to be about willingness to travel and the ability to operate on a flexible work schedule, and Topic 16 about data analytical skills. The rest of the topics seem to concern technical skills specific to certain professions, such as sales for Topic 20 and software/programming for Topics 100 and 60. Topic 61 pertains to a specific requirement, namely having a valid driving license. Interestingly, even without giving LDA prior information about which worker attributes to expect, it still appears to recover both technical and soft skill requirements. Though Topics 86, 105, and 15 are somewhat difficult to interpret, they seem to pertain to generic personal qualities such as the ability to learn new things quickly (86), goal setting and leadership (105), and possessing a positive, energetic, and enthusiastic attitude (15). We can visualize the correlations among words within each topic to aid interpretation. We show this in Figure 3, where an edge between words indicates a correlation of at least 0.1 and the thickness of an edge indicates the strength of the correlation. The word networks are in line with our interpretations and show that a topic can capture more than two worker attributes; the model puts them in one topic because they tend to co-occur. From the topics, we can generate hypotheses about which behavioral/personal characteristics are actually required to carry out a particular job, which could then be tested in an empirical study.

Investigating the relationships between topics provides a way to assess the convergent/divergent validity of the topical content. Here we cannot directly use correlation, since topics are assumed to be uncorrelated; however, we can use the “distance” between topics. To compute distance, we use the Jensen-Shannon divergence, which measures the distance between probability distributions. To get a better idea of whether an association should be judged low or high, we suggest using simulation techniques such as Monte Carlo or permutation tests; what counts as a large magnitude is always application dependent. Here we focus the discussion on Topic 75, which we previously interpreted as interpersonal communication skills. Topic 75 is closest to Topics 13, 30, 51, 88, 111, 129, and 103 (please refer to the Supplemental Materials for the complete list of topics). Topics 13 (effective oral and written communication), 30 (professional demeanor), 129 (teamwork), and 103 (analytical and problem-solving skills) all relate to interpersonal skills, hence these qualities are expected to relate to interpersonal communication. A noteworthy similarity exists between Topics 132, 77, and 119, which are willingness to travel, ability to work on a flexible schedule, and work relocation, respectively. We can further explore this relationship through a more inference-driven investigation by comparing the findings here to results obtained by interviewing SMEs or job holders, which will further help in establishing construct validity. Aside from similar topics, there are also less similar ones; for example, Topic 75 (interpersonal communication skills) is least similar to Topics 31 (finance) and 89 (programming languages). Possible interpretations include range restriction (that is, if job incumbents in a position do not vary on certain characteristics, these characteristics may not be mentioned in the vacancies), but it could also mean that interpersonal communication is not essential to perform jobs requiring those specific technical skills, or that incumbents who excel in those jobs have low interpersonal communication skills.
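For completeness, a small sketch of the Jensen-Shannon divergence between two topic-term probability vectors p and q (e.g., two rows of the topic-term distribution from LDA).

```r
# Jensen-Shannon divergence between two probability vectors p and q
kl_div <- function(p, q) sum(p * log(p / q), na.rm = TRUE)  # 0*log(0) -> 0

js_div <- function(p, q) {
  m <- (p + q) / 2
  0.5 * kl_div(p, m) + 0.5 * kl_div(q, m)
}
```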

To examine the relationships among all topics simultaneously, we applied multidimensional scaling and projected the topics onto 2 dimensions. Figure 4a shows the resulting projections. Topics 7, 8, 9, 6, 25, and 35 (bottom rightmost, fourth quadrant) are close together because they all relate to programming or software skills. This also holds for Topics 123, 124, 128, 107, and 133 (bottom leftmost, third quadrant), which are about written and oral communication skills. Topics 46, 52, 50, 83, and 31 (upper area, spanning the first and second quadrants) are about how someone should work (fast paced and dynamic) and the qualities needed to perform the work (adaptable, able to multitask, and able to work independently or in a team).
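A sketch of this projection using classical MDS in base R, where topic_dist is assumed to be the matrix of pairwise Jensen-Shannon divergences between topics.

```r
# Classical multidimensional scaling of the topic-distance matrix;
# `topic_dist` is assumed to hold pairwise Jensen-Shannon divergences.
coords <- cmdscale(as.dist(topic_dist), k = 2)

plot(coords, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(coords, labels = rownames(coords))
```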

The output from LDA allows us to determine the most likely topic for each document. Here we want to find the most likely worker attribute for each job. Consider Topics 16 and 18. Most jobs under Topic 16 are quantitatively oriented jobs such as data scientist, statistician, and financial analyst. On the other hand, jobs under Topic 18 appear to pertain mostly to sales, marketing, and customer management. Since in LDA each document can have more than one topic (each document is actually a mixture of topics), we can utilize all topic probabilities for each document and construct a hierarchical clustering of jobs. In Figure 4b, we show part of the cluster dendrogram, highlighting medically related jobs.

Terms associated with topics give us an idea of the possible interpretation of topics; however, we need to examine the relationship graph to help us surmise the context in which these words are used. Our analyses also showed that it is not only possible to accurately classify job information from vacancies but also to derive behavioral characteristics that employers value or require from potential or existing job holders. We further made use of the extracted job information by summarizing the worker attributes on 140 dimensions, defining “job similarity” based on topic mixtures, and then clustering the jobs. Further analyses can be performed, such as analyzing trends in the worker attributes required by organizations across time, occupations, companies, and geographical regions, given that these types of information are generally provided in the vacancies. One can also build a network of work activities to examine the relationships among tasks.

Data collection through TM is faster, cheaper, and more reliable than traditional job analytic methods (McEntire et al., 2006). For our work on nursing task extraction, data triangulation showed that a substantial number of the extracted tasks may be characterized as context specific (e.g., caring for patients with spine surgery, caring for mentally ill patients) and that not all nurses perform these tasks. These tasks reflect idiosyncrasies in jobs that may be overlooked when collecting data from SMEs, because it would be impossible to interview, observe, and/or survey all nurses. Due to such context specificity, traditional ways of data collection have compromised the reliability of job-analytic data, causing bias (Dierdorff & Morgeson, 2009; Morgeson & Campion, 1997, 2000; Morgeson et al., 2004; Sanchez & Levine, 2000). Our application of TM, however, showed that this information can be extracted automatically from vacancies to complement, enrich, and strengthen traditional methods of job analysis.

Of course, there are also validity concerns associated with online vacancies as a data source. First, there are noticeable differences in the quality of the information across sources. For example, vacancies posted by recruitment agencies are often lower in quality (e.g., level of detail, clarity of information) than vacancies posted by organizations. Data triangulation for the nurses also showed that specificity varied considerably between the TM and task inventory data. There are, for example, five tasks about medication (i.e., prepare medication, arrange medication for new patients, check medication, and hand out medication), all with extensive descriptions in the task inventory, whereas the TM counterpart is only “administration of medication.” Thus, the level of detail is much lower in the TM output. Second, online data, like all secondary data, are often produced for purposes very different from the research purpose for which they may subsequently be repurposed, in this case job analysis. For example, online vacancies are aimed at recruiting employees, which means that the included information might be biased through advertising only certain, mainly positive, aspects of the job and/or not mentioning very mundane tasks. Tasks unique to the traditional task inventory included, for example, more mundane and less positive, but very frequently occurring tasks in the nursing profession (e.g., washing patients, changing patients, cleaning beds, checking temperature). Third, not all jobs are advertised online (Sodhi & Son, 2007), potentially leaving out relevant information and jobs. Our recommendation for further validating the relationships is to compare the results we obtained with alternative sources of information, such as interviews with SMEs or job incumbents, and to compute measures traditionally used to assess interrater reliability, as we did with the nursing tasks.