Text data are pervasive in organizations. Digitization (Cardie & Wilkerson, 2008) and the ease of creating online information (e.g., e-mail messages; Berry & Castellanos, 2008) contribute to the vast quantities of text generated each day. Embedded in these texts is information that may improve our understanding of organizational processes. Thus, organizational researchers increasingly seek ways to organize, classify, label, and extract opinions, experiences, and sentiments from text (Pang & Lee, 2008; Wiebe, Wilson, & Cardie, 2005). Until recently, the majority of text analyses in organizations relied on time-consuming and labor-intensive manual procedures, which are impractical and less effective for voluminous collections of documents, especially when resources are limited (Kobayashi et al., 2018). Hence, automatic (or computer-assisted) strategies are increasingly employed to accelerate the analysis of text (Berry & Castellanos, 2008).

Similar to content analysis (Duriau, Reger, & Pfarrer, 2007; Hsieh & Shannon, 2005; Scharkow, 2013) and template analysis (Brooks, McCluskey, Turley, & King, 2015), a common objective of text analysis is to assign text to predefined categories. Manually assigning large collections of text to categories is costly and may become inaccurate and unreliable due to cognitive overload. Furthermore, idiosyncrasies among human coders may creep into the labeling process, resulting in coding errors. One workaround is to code only part of the corpus as opposed to coding all documents. However, this comes at the expense of possibly omitting relevant information, which may lead to bias and a degradation of the internal and external validity of the findings. Another option is to hire multiple human coders, but this adds cost (e.g., the cost of hiring and training coders) and effort pertaining to determining interrater reliability and seeking consensus (Sheng, Provost, & Ipeirotis, 2008). A final (and more affordable) option is to solicit the help of the public to label text, for instance through the Amazon Mechanical Turk platform (Buhrmester, Kwang, & Gosling, 2011). However, this may be effective only in labeling objective information (e.g., names of people, events, etc.), since it is often difficult to establish consistency on subjective labels (e.g., sentiments; Wiebe, Wilson, Bruce, Bell, & Martin, 2004). Hence, automatic text analysis procedures that reliably, efficiently, and effectively assign text elements to classes are both necessary and advantageous, especially in dealing with a massive corpus of text.

This article focuses on automatic text classification for several reasons. First, although text classification (henceforth TC) has been applied in various fields, such as in political science (Atteveldt, Kleinnijenhuis, Ruigrok, & Schlobach, 2008; B. Yu, Kaufmann, & Diermeier, 2008), occupational fraud (Holton, 2009), law (Gonçalves & Quaresma, 2005), finance (Chan & Chong, 2017; Chan & Franklin, 2011; Kloptchenko et al., 2004), and personality research (Shen, Brdiczka, & Liu, 2013), so far its uptake in organizational research is limited. Second, the use of TC is economical both in terms of time and cost (Duriau et al., 2007). Third, many of the techniques that have been developed in TC, such as sentiment analysis (Pang & Lee, 2008), genre classification (Finn & Kushmerick, 2006), and sentence classification (Khoo, Marom, & Albrecht, 2006) seem particularly well suited to address contemporary organizational research questions. Fourth, the acceptance and broader use of TC within the organizational research community can stimulate the development of novel TC techniques.

Tutorials or review-tutorials on TC that have been published so far (Harish, Guru, & Manjunath, 2010; Li & Jain, 1998; Sebastiani, 2002) were targeted mainly toward researchers in the field of machine learning and data mining. This has resulted in a skewed focus on technical and methodological details. In this article our goal is to balance the discussion among techniques, theoretical concepts, and validity concerns to increase the accessibility of TC to organizational researchers.

Below we first discuss the TC process, pointing out key concerns and providing concrete recommendations at each step. Previous studies are cited to enrich the discussion and to illustrate different use cases. The second part is a hands-on tutorial using part of our own work as a running example. We applied TC to automatically extract nursing job tasks from nursing vacancies to augment nursing job analysis (Kobayashi, Mol, Kismihók, & Hesterberg, 2016). The findings from this study were used in the EU-funded Pro-Nursing (http://pro-nursing.eu) project, which aimed to understand, among other things, how nursing tasks are embedded in the nursing process. We also address validity assessment because the ability to demonstrate the validity of TC outcomes will likely be critical to its uptake by organizational researchers. Thus, we discuss and illustrate how to establish validity for TC outcomes. Specifically, we address assessing the predictive validity of the classifier and triangulating the output of the classification with other data sources (e.g., expert input and output from alternative analyses).

An ideal classifier would mimic how humans process and deduce meaning from text. However, there are still many challenges before this becomes reality. Natural languages contain high-level semantics and abstract concepts (Harish et al., 2010; Popping, 2012) that are difficult to articulate in computer language. For instance, the meaning of a word may change depending on the context in which it is used (Landauer, Foltz, & Laham, 1998). Also, lexical, syntactic, and structural ambiguities in text are continuing challenges that would need to be addressed (Hindle & Rooth, 1993; Popping, 2012). Another issue is dealing with typographical errors or misspellings, abbreviations, and new lexicons. Strategies for dealing with ambiguities all need to be explicated during classifier development. Before a classifier is deployed it thus needs several rounds of training, testing, fine-tuning (of parameters), and repeated evaluation until acceptable levels of performance and validity are reached. The resulting classifier is expected to approximate the performance of human experts in classification tasks (Cardie & Wilkerson, 2008), but for a large corpus its advantage is that it will be able to do so in a faster, cheaper, and more reliable manner.

TC is defined as the automatic assignment of text to one or more predefined classes (Li & Jain, 1998; Sebastiani, 2002). Formally, the task of TC is stated as follows. Given a set of text and a set of categories, construct a model of the form Y = f(X, θ) + ε from a set of documents with known categories. In the preceding formula, X is a suitably chosen text representation (e.g., a vector), θ is the set of unknown parameters associated with the function f (also known as the classifier or classification model) that need to be estimated using the training data, and ε is the error of the classification. The error is added to account for the fact that f is just an approximation to the true but unknown function h such that Y = h(X). Hence, the smaller ε is, the more effective the classifier f is. The Y term usually takes numerical values indicating the membership of text in a particular category. For example, when there are only two categories, such as in classifying the polarity of relations between political actors and issues as either positive or negative (Atteveldt et al., 2008), Y can take the values of +1 and −1, respectively signifying positive and negative sentiment. We further discuss how to deal with each part of the formula, such as how to choose X and f, below. Once the classification model has been constructed it is then used to predict the category of new text (Aggarwal & Zhai, 2012).

Another approach is to use classification output to help us determine which observations to label. In this way, we take a targeted approach by labeling those observations which are most likely to generate better classifiers. This is called active learning in the machine learning literature (Settles, 2010). Active learning is made possible because some classifiers give membership probabilities or confidences rather than a single decision as to whether to assign to one class or not. For example, if a classifier assigns near-equal membership probabilities to all categories for a new observation, then we call on an expert to label that observation. For a review of active learning techniques we refer the reader to Fu, Zhu, and Li (2013).
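As an illustration, the uncertainty-based selection rule described above can be sketched in a few lines. This is a simplified sketch, not a full active learning loop: the probability matrix is invented here, whereas in practice it would come from a classifier's predicted class probabilities over the unlabeled pool.

```python
import numpy as np

def least_confident(proba):
    """Return the index of the observation whose highest class probability
    is lowest, i.e., the one the classifier is least sure about."""
    proba = np.asarray(proba)
    confidence = proba.max(axis=1)  # top class probability per observation
    return int(confidence.argmin())

# Hypothetical membership probabilities for four unlabeled documents
# over three categories (rows sum to 1).
proba = [[0.90, 0.05, 0.05],
         [0.34, 0.33, 0.33],   # near-uniform: the classifier is uncertain
         [0.70, 0.20, 0.10],
         [0.55, 0.40, 0.05]]

print(least_confident(proba))  # document 1 is sent to an expert for labeling
```

In a full active learning loop, the selected document would be labeled, added to the training data, and the classifier retrained before querying again.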

In many classification problems, labeled data are costly or difficult to obtain. Fortunately, even in this case, principled approaches can be applied. In practice, unlabeled data are plentiful, and we can apply techniques that make use of the structure and patterns in the unlabeled data. This approach of using unlabeled data in classification is called semisupervised classification (Zhu, 2005). Various assumptions are made to make semisupervised classification feasible. Examples are the smoothness assumption, which says that observations near each other are likely to share the same label, and the cluster assumption, which states that if the observations form clusters, then observations in the same cluster are likely to share the same label (Zhu, 2005).

A practical question that often arises is how many documents one should label to ensure a valid classifier. The size of the training dataset depends on many considerations, such as the cost and limitations associated with acquiring prelabeled documents (e.g., ethical and legal impediments) and the kind of learning framework we are using. In the probably approximately correct (PAC) learning framework, which is perhaps the most popular framework for learning concepts (such as the concept of spam emails or party affiliation), training size is determined by the type of classification technique, the representation size, the maximum error rate one is willing to tolerate, and the probability of not exceeding the maximum error rate. Under the PAC learning framework, formulae have been developed to determine the lower bound for the training size, an example being the one by Goldman (2010): Ω((1/ε) log₂(1/δ) + VCD(C)/ε), where ε is the maximum error rate, 1 − δ is the probability that the error will not exceed ε, and VCD(C) is the Vapnik–Chervonenkis dimension of the classifier C, which can be interpreted as the expressive power of the classifier and depends on the representation size and the form of the classifier (e.g., axis-parallel rectangles, closed sets, or half-spaces). As an illustration, suppose we want to learn the concept of positive sentiment from English text. We represent each document as a vector of 50,000 dimensions (the number of commonly used English words), and our classification technique constructs a hyperplane that separates positive and negative observations (e.g., SVM using the ordinary dot product as kernel). If we want to ensure with probability 0.99 that the error rate will not exceed 0.01, then the minimum training size is (1/0.01) log₂(1/0.01) + 50,001/0.01 ≈ 5,000,764. This means we would need at least 5 million documents.
Here we calculated VCD(C) using the formula d + 1, where d is the dimensionality of the representation, since we consider classifiers that construct hyperplane boundaries (half-spaces) in 50,000 dimensions. Of course, in practice dimensionality reduction can be applied while still obtaining an adequate representation. If one managed to reduce the dimensionality to 200, the lower bound for the training size would drop dramatically to about 20,765. We can tweak this lower bound further by adjusting the other parameters.
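This back-of-the-envelope calculation can be scripted so that the trade-off between representation size and training size is easy to explore. The sketch below assumes, as in the text, that VCD(C) = d + 1 for a half-space classifier in d dimensions.

```python
import math

def pac_lower_bound(d, eps, delta):
    """Lower bound on training size for a half-space classifier in d
    dimensions, following the form (1/eps) * log2(1/delta) + VCD / eps
    with VCD = d + 1 (Goldman, 2010)."""
    vcd = d + 1
    return (1 / eps) * math.log2(1 / delta) + vcd / eps

# 50,000-dimensional representation, error <= 0.01 with probability 0.99
print(int(pac_lower_bound(50_000, 0.01, 0.01)))   # about 5,000,764 documents

# The 200-dimensional case from the text: the bound shrinks dramatically
print(int(pac_lower_bound(200, 0.01, 0.01)))
```

Varying `eps` and `delta` shows how quickly the required training size grows as one demands lower error with higher probability.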

There are options for supervised dimensionality reduction for imbalanced classification, such as those provided by Ogura et al. (2011). For the choice of classification techniques, those discussed previously can be used with minor variations, such as adjusting the costs of misclassification, which is known as cost-sensitive classification (Elkan, 2001). Traditional techniques apply equal costs of misclassification to all categories, whereas with cost-sensitive classification we can assign a large cost to the incorrect classification of observations in the minority class. For the choice of evaluation measures, we suggest using the weighted F-measure or balanced accuracy. One last suggestion is to treat imbalanced classification as an anomaly or outlier detection problem where the observations in the minority class are the outliers (Chandola, Banerjee, & Kumar, 2009).
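Cost-sensitive classification can be layered on top of any classifier that outputs class probabilities: rather than picking the most probable class, one picks the class with the lowest expected misclassification cost. A minimal sketch, with invented cost values for illustration:

```python
import numpy as np

def min_expected_cost(proba, cost):
    """proba: (n_samples, n_classes) class membership probabilities.
    cost[i][j]: cost of predicting class j when the true class is i.
    Returns, per sample, the class minimizing expected cost."""
    proba = np.asarray(proba)
    cost = np.asarray(cost)
    expected = proba @ cost        # expected cost of each possible prediction
    return expected.argmin(axis=1)

# Class 1 is the minority class; missing it (a false negative) is
# ten times as costly as a false alarm.
cost = [[0, 1],    # true class 0: correct = 0, false positive = 1
        [10, 0]]   # true class 1: false negative = 10, correct = 0
proba = [[0.95, 0.05],
         [0.85, 0.15]]

print(min_expected_cost(proba, cost))
```

Note how the second document is assigned to the minority class even though its minority-class probability is only .15: the asymmetric costs shift the decision boundary.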

Obvious fixes are to label more observations until the classes are balanced, as was done by Holton (2009), or to disregard some observations in the majority class. In cases where classification problems are inherently imbalanced and labeling additional data is costly and difficult, another approach is to oversample the minority class or to undersample the majority class during classifier training and evaluation. A strategy called the synthetic minority oversampling technique (SMOTE) is based on oversampling, but instead of selecting existing observations in the minority class it creates synthetic samples to increase the number of observations in the minority class (Chawla, Bowyer, Hall, & Kegelmeyer, 2002). Preprocessing and representation remain the same as with balanced classes; the parts that make use of class membership need to be adjusted for imbalanced data.
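The core idea of SMOTE can be sketched in a few lines: for a minority-class observation, pick one of its nearest minority-class neighbors and interpolate a synthetic point at a random position on the line segment between them. This is a simplified illustration of Chawla et al.'s idea, not a full implementation (which would generate many samples and handle edge cases).

```python
import numpy as np

def smote_sample(X_min, k=2, rng=None):
    """Generate one synthetic minority-class observation.
    X_min: (n, d) array of minority-class points; k: neighbors considered."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    i = rng.integers(len(X_min))                   # random seed point
    dists = np.linalg.norm(X_min - X_min[i], axis=1)
    neighbors = dists.argsort()[1:k + 1]           # k nearest, excluding self
    j = rng.choice(neighbors)
    gap = rng.random()                             # position along the segment
    return X_min[i] + gap * (X_min[j] - X_min[i])

# Toy minority class in two dimensions
X_min = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [5.0, 5.0]]
print(smote_sample(X_min, rng=0))
```

Because the synthetic point lies between two existing minority observations, it stays inside the region the minority class already occupies rather than duplicating an existing point.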

By and large, in binary classification, when the number of observations in one class represents less than 20% of the total number of observations, the data can be seen as imbalanced. The main danger of imbalanced classification is that we may train a classifier with high accuracy even if it fails to correctly classify the observations in the minority class. In some cases, we are more interested in detecting the observations in the minority class. At the same time, however, we also want to avoid many false detections.

These evaluation measures can also be extended to classifications with more than two classes by computing them per category, as in the one-against-all approach, and averaging the results. An example is the extension of the F-measure called the macro F-measure, which is obtained by computing the F-measure for each category and then averaging them.
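The macro F-measure can be computed directly from per-category counts. A small sketch, with made-up counts for three categories:

```python
def f_measure(tp, fp, fn):
    """F-measure for one category treated as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f(per_class_counts):
    """per_class_counts: list of (tp, fp, fn) tuples, one per category,
    each obtained by treating that category as positive (one-against-all).
    Returns the unweighted average of the per-category F-measures."""
    scores = [f_measure(*counts) for counts in per_class_counts]
    return sum(scores) / len(scores)

# Hypothetical counts for three categories
counts = [(50, 10, 10), (30, 5, 15), (20, 20, 0)]
print(round(macro_f(counts), 3))  # 0.75
```

Because every category contributes equally to the average, the macro F-measure is not dominated by large categories, which makes it informative when classes are imbalanced.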

Multiclass classification pertains to dealing with more than two categories. The preprocessing and representation parts are the same as in the binary case. The only changes are in the choices of supervised feature selection techniques, classification techniques, and evaluation measures. Most supervised feature selection techniques can be easily generalized to more than two categories. For example, when calculating CHI, we just need to add an extra column to the two-way contingency table. Most of the classification techniques we discussed previously have been extended to multiclass classification. For example, techniques suited for binary classification problems (e.g., SVM) are extended to the multiclass case by breaking the multiclass problem into several binary classification problems, in either a one-against-all or a one-against-one approach. In the former approach we build binary classifiers by taking each category as the positive class and merging the others into the negative class. Hence, if there are K categories, then we build K binary classifiers. In the latter approach, we construct a binary classifier for each pair of categories, resulting in K(K − 1)/2 classifiers. Since several classifiers are built, and thus there are several outputs, final category membership is obtained by choosing the category with the largest value of the decision function in the one-against-all case or by a voting approach in the one-against-one case (Hsu & Lin, 2002).
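The voting step of the one-against-one approach can be sketched as follows: each of the K(K − 1)/2 pairwise classifiers casts one vote, and the category with the most votes wins. The pairwise decisions below are hypothetical placeholders for trained binary classifiers.

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(categories, pairwise_winner):
    """pairwise_winner(a, b) -> the category that the binary classifier
    trained on categories a and b assigns to the new document.
    Each of the K(K-1)/2 pairwise classifiers casts one vote."""
    votes = Counter(pairwise_winner(a, b)
                    for a, b in combinations(categories, 2))
    return votes.most_common(1)[0][0]

# Three categories -> 3 * 2 / 2 = 3 pairwise classifiers.
# Hypothetical decisions for one new document:
decisions = {("earn", "acq"): "earn",
             ("earn", "trade"): "earn",
             ("acq", "trade"): "acq"}
winner = one_vs_one_predict(["earn", "acq", "trade"],
                            lambda a, b: decisions[(a, b)])
print(winner)  # "earn" wins with two of the three votes
```

Ties are possible in one-against-one voting; implementations typically break them using the decision function values of the involved classifiers.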

In this section we discuss how to deal with multiclass classification, where there is an increased likelihood of classes being imbalanced, and provide some suggestions on determining training size and what to do when obtaining labeled data is both expensive and difficult.

Once model validity is established one may start applying the classification model to unlabeled data. However, the model will still need to be reevaluated from time to time. When the performance drops below an acceptability threshold, there are four possible solutions: (a) add more features or change existing features, (b) try other classification algorithms, (c) do both, and/or (d) collect more data or label additional observations.

A useful strategy to further assess the validity of the classification model is to compare the classifications made by the model with those of an independent (group of) human expert(s). Usually, agreement between the model and the human expert(s) is quantified using measures of concordance, that is, measures of how closely the classifications of the two correspond to one another (such as Cohen’s kappa for interrater agreement, where one “rater” is the classifier). Using expert knowledge, labels can also be checked against standards. For example, in job task extraction from a specific set of job vacancies, one can check with experts or job incumbents to verify whether the extracted tasks correspond to those actually carried out on the job and whether specific types of tasks are under- or overrepresented.
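Cohen's kappa can be computed directly from the two sets of labels. A minimal sketch, with invented labels where one "rater" is the classifier and the other a human expert:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences:
    kappa = (observed - expected) / (1 - expected)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Expected agreement under independence of the two raters
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical labels for six documents
model  = ["task", "task", "other", "task", "other", "other"]
expert = ["task", "other", "other", "task", "task", "other"]
print(round(cohens_kappa(model, expert), 3))
```

Here observed agreement is 4/6 but chance agreement is .5, so kappa is only .333, illustrating why chance-corrected agreement is preferred over raw agreement.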

The second component, the classification algorithm, models the relationship between features and class membership. Similar to the features, the validity of the algorithm is ultimately determined from the classification performance and is also for the most part data driven. The validity of both the features and the classification algorithm establishes the validity of the classification model.

Many TC applications use the set of unique words as the feature set (i.e., VSM). For organizational researchers this way of specifying the initial set of features may seem counterintuitive since features are constructed in an ad hoc and inductive manner, that is, without reference to theory. Indeed, specifying the initial set of features, scoring features, transforming features, evaluating features, and modifying the set of features in light of the evaluation constitutes a data-driven approach to feature construction and selection (Guyon, Gunn, Nikravesh, & Zadeh, 2008). The validity of the features is ultimately judged in terms of the classification performance of the resulting classification model. But this does not mean that researchers should abandon theory-based approaches. If there is prior knowledge or theory that supports the choice of features, then this can be incorporated (Liu & Motoda, 1998). Theory can also be used as a basis for assigning scores to features, such as ranking features according to theoretical importance. Our recommendation, however, would be to have theory complement, as opposed to restrict, feature construction, because powerful features (that may even be relevant to subsequent theory building and refinement) may emerge inductively.

Since accuracy may give misleading results when classes are imbalanced, we recommend using measures sensitive to this, such as the F-measure or balanced accuracy (Powers, 2011). For the systematic evaluation of the classifier we advise using k-fold cross-validation, setting k to 5 or 10 when data are large, as this ensures sufficient data for training. For smaller datasets, such as fewer than 100 documents, we suggest bootstrapping or choosing a higher k for cross-validation.

Cross-validation can be applied by computing not only one value for the evaluation measure but several values corresponding to different splits of the data. A systematic strategy to evaluate a classifier is k-fold cross-validation (Kohavi, 1995). This method splits the labeled dataset into k parts. A classifier is trained using k − 1 parts and evaluated on the remaining part. This is repeated until each of the k parts has been used as test data. Thus, for k equal to 10, there are 10 partitions of the labeled data and correspondingly 10 values for a given measure; the final estimate is simply the average of the 10 values. Another strategy is bootstrapping, which is accomplished by computing the average of the evaluation measure over N bootstrap samples of the data (sampling with replacement).
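The splitting logic of k-fold cross-validation can be sketched without any library support. In the sketch below, `evaluate` is a placeholder for training a classifier on the k − 1 parts and scoring it on the held-out part; in practice one would also shuffle the data before splitting.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous, near-equal folds and
    yield (train_indices, test_indices) pairs, one per fold."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def cross_validate(n, k, evaluate):
    """Average the evaluation measure over the k train/test splits."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n, k)]
    return sum(scores) / k

# Sanity check with a constant evaluator: the average equals the constant
print(cross_validate(100, 10, lambda train, test: 0.9))
```

Every observation appears in exactly one test fold, so each labeled document contributes to the performance estimate exactly once.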

Evaluation measures are computed from the labeled data. It is not advisable to use all labeled data to train the classifier, since this might result in overfitting, which occurs when the classifier is good at classifying the observations in the training data but performs poorly on new data. Hence, part of the labeled data should be set aside for evaluation so that we can assess the degree to which the classifier is able to predict accurately on data that were not used for training.

Evaluation measures are useful to compare the performance of several classifiers (Alpaydin, 2014). Thus, one can probe different combinations of feature sets and classification techniques to determine the best combination (i.e., the one which gives the optimal value for the evaluation measure). Apart from classification performance, one can also take the parsimony of the trained classifier into account by examining the relative size of the different feature sets, since they determine the complexity of the trained classifier. In line with Occam’s razor, when two classifiers have the same classification performance, the one with the lower number of features is to be preferred (Shreve, Schneider, & Soysal, 2011).

Alternative measures to accuracy are precision, recall, F-measure (Powers, 2011), specificity, breakeven point, and balanced accuracy (Ogura, Amano, & Kondo, 2011). In binary classification, classes are commonly referred to as positive and negative. Classifiers aim to correctly identify observations in the positive class. A summary table which can be used as a reference for computing these measures is presented in Figure 1. The entries of the table are as follows: TP stands for true positives, TN for true negatives, FP for false positives (i.e., negative cases incorrectly classified into the positive class), and FN for false negatives (i.e., positive cases incorrectly classified into the negative class). Hence the five evaluation measures are computed as follows: precision = TP/(TP + FP), recall = TP/(TP + FN), specificity = TN/(TN + FP), F-measure = (2 × recall × precision)/(recall + precision), and balanced accuracy = (recall + specificity)/2 = (TP/(TP + FN) + TN/(TN + FP))/2.
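These definitions translate directly into code. A small sketch with hypothetical confusion counts:

```python
def evaluation_measures(tp, tn, fp, fn):
    """Compute the binary evaluation measures from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # also called sensitivity
    specificity = tn / (tn + fp)
    f_measure = 2 * recall * precision / (recall + precision)
    balanced_accuracy = (recall + specificity) / 2
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "F": f_measure,
            "balanced_accuracy": balanced_accuracy}

# Hypothetical counts: 40 TP, 45 TN, 5 FP, 10 FN
m = evaluation_measures(tp=40, tn=45, fp=5, fn=10)
print({name: round(value, 3) for name, value in m.items()})
```

With these counts, accuracy would be (40 + 45)/100 = .85, while the F-measure (.842) and balanced accuracy (.85) additionally reveal how the errors are distributed over the two classes.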

Crucial to any classification task is the assessment of the performance of classifiers using evaluation measures (Powers, 2011; Yang, 1999). These measures indicate whether a classifier models the relationship between features and class membership well, and may thus be used to indicate the extent to which the classifier is able to emulate a human coder. The most straightforward evaluation measure is accuracy, which is calculated as the proportion of correct classifications. Accuracy ranges from 0 to 1 (or 0 to 100 when expressed as a percentage). The higher the accuracy the better the classifier (1 corresponds to perfect classification). However, in the case of imbalanced classification (i.e., when there is one class with only a few documents) and/or unequal costs of misclassification, accuracy may not be appropriate. An example is detecting career shocks (cf. Seibert, Kraimer, Holtom, & Pierotti, 2013) in job forums. Since it is likely that only a small fraction of these postings pertain to career shocks (suppose .05), a classifier can still have a high accuracy (equal to .95) even if that classifier classifies all discussions as containing no career shocks content.

Rather than using a single technique, we suggest applying different methods, pairing different algorithms and feature sets (including those obtained from feature selection and transformation) and choosing the pair with the lowest error rate. For example, using the DTM, apply SVM, naive Bayes, random forest, bagging, and gradient boosted trees. When feature transformation has been applied (e.g., LSA and nonnegative matrix factorization), use logistic regression or discriminant analysis. When the training data are large (e.g., hundreds of thousands of cases), use K-nearest neighbors. Rule-based algorithms are seldom used in TC; however, if readability and efficiency are desired in a classifier, then these can be trialed as well.

The third type of algorithm is the logical classifier, which accomplishes classification by means of logical rules (Dumais, Platt, Heckerman, & Sahami, 1998; Rokach & Maimon, 2005). An example of such a rule in online news categorization is: “If an article contains any of the stemmed terms ‘vs’, ‘earn’, ‘loss’ and not the words ‘money’, ‘market open’, or ‘tonn’, then classify the article under category ‘earn’” (Rullo, Cumbo, & Policicchio, 2007). The rules in logical models are readable and thus facilitate revision, and, if necessary, correction of how the classification works. An example of a logical classifier is a decision tree (Rokach & Maimon, 2005).

Probabilistic algorithms compute a joint probability distribution between the observations (e.g., documents) and their classes. Each document is assumed to be an independent random draw from this joint probability distribution. The key point in this case is to estimate the posterior probability P(Y_m | X). Classification is achieved by identifying the class that yields the maximum posterior probability for a given document. The posterior probability is estimated in two ways. Either one can marginalize the joint distribution P(X, Y_m), or one may compute P(X | Y_m) and P(Y_m) separately and apply Bayes’ theorem. Both naive Bayes (Eyheramendy, Lewis, & Madigan, 2003) and logistic regression (J. Zhang, Jin, Yang, & Hauptmann, 2003) are examples of probabilistic algorithms.
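A toy numeric example of the second route, computing P(X | Y) and P(Y) separately and applying Bayes' theorem. The likelihood and prior values are invented; a naive Bayes classifier would additionally obtain the likelihoods by assuming word occurrences are independent given the class.

```python
def posterior(likelihoods, priors):
    """likelihoods[c]: P(X | Y=c) for the document; priors[c]: P(Y=c).
    Returns P(Y=c | X) for each class c via Bayes' theorem."""
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())          # P(X), by marginalization
    return {c: joint[c] / evidence for c in joint}

# Hypothetical document likelihoods under two classes
likelihoods = {"spam": 0.003, "ham": 0.0005}
priors = {"spam": 0.2, "ham": 0.8}

post = posterior(likelihoods, priors)
print(max(post, key=post.get))  # class with maximum posterior probability
```

Note that although "ham" has the larger prior, the much higher likelihood under "spam" tips the posterior (.6 vs. .4) in favor of "spam".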

Geometric algorithms assume that the documents can be represented as points in a hyperspace, the dimensions of which are the features. This means that distances between documents and lengths of the documents can be defined as well. In this representation, nearness implies similarity. An example of a geometric classifier is K-nearest neighbors, in which classification is done by first finding the K closest documents (using a distance measure) in the training data (Jiang, Pang, Wu, & Kuang, 2012); the majority class of the K closest documents is then the class to which the new document is assigned. The parameter K is chosen to be an odd number to prevent ties from occurring. Another geometric classifier is the support vector machine (Joachims, 1998), in which a hyperplane is constructed that provides the best separation among the texts in each class. The hyperplane is constructed in such a way that it provides the widest separation between the nearest observations of each class.
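The K-nearest neighbors rule can be sketched directly from this description. The two-dimensional document vectors below are toy data; real TC would use high-dimensional DTM rows instead.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Assign x_new the majority class among its k closest training points.
    k should be odd to prevent ties in binary classification."""
    X_train = np.asarray(X_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    nearest = dists.argsort()[:k]                  # indices of k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy training documents as 2-D vectors with their labels
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]
y = ["neg", "neg", "neg", "pos", "pos"]

print(knn_predict(X, y, [0.15, 0.15], k=3))  # "neg": all 3 neighbors are negative
```

Because KNN stores the training data and defers all computation to prediction time, it needs no training phase, but each prediction requires computing distances to every training document.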

The transformed text, usually the original DTM or the dimensionality-reduced DTM, serves as input to one or more classification techniques. Most techniques are from the fields of machine learning and statistics. There are three general types of techniques: (a) geometric, (b) probabilistic, and (c) logical (Flach, 2012).

For LSA and nonnegative matrix factorization, we need to decide how many dimensions to retain. For LSA, Fernandes, Artífice, and Fonseca (2017) offered the formula K = N/(1 + log₁₀(N)) as a rough guide, where N is the size of the corpus and K is the number of dimensions to retain. For example, if there are 500 documents, then retain approximately 133 latent dimensions. In the case of nonnegative matrix factorization, an upper bound for K is given by the inequality (N + M)K < NM, where M is the number of original features (Tsuge, Shishibori, Kuroiwa, & Kita, 2001). Hence, if there are 500 documents and 1,000 terms, K should not be greater than 333. Of course, one has to experiment with different dimensionalities and select the size that yields the maximum performance. For example, the formula gave 133 dimensions for 500 documents, but one may also experiment with values within ±30 of 133.
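Both rules of thumb are easy to script. This is a sketch under the formulas as stated above; the LSA heuristic in particular is only a rough guide, so the exact value it returns should be treated as a starting point for experimentation rather than a prescription.

```python
import math

def lsa_dimensions(n_docs):
    """Rough guide for the number of LSA dimensions: K = N / (1 + log10(N))."""
    return n_docs / (1 + math.log10(n_docs))

def nmf_max_dimensions(n_docs, n_terms):
    """Largest integer K satisfying (N + M) * K < N * M, the upper bound
    for nonnegative matrix factorization."""
    return (n_docs * n_terms - 1) // (n_docs + n_terms)

print(round(lsa_dimensions(500)))     # roughly 135 latent dimensions
print(nmf_max_dimensions(500, 1000))  # 333
```

From here one would try a grid of values around the heuristic (e.g., ±30) and keep the dimensionality that maximizes classification performance.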

Our recommendation is to start with the traditional VSM, that is, transform the documents into vectors using single terms as features. For unsupervised scoring, compute the DF of each term and filter out terms with very low and very high DF, customarily those terms belonging to the lower 5th and upper 99th percentiles. For supervised scoring, try CHI and IG, and for feature transformation, try LSA and nonnegative matrix factorization. Compare the effect on classification performance of the different feature sets generated by these methods and choose the feature set that yields the highest performance (e.g., accuracy). We also suggest trying combinations of scoring and transformation methods. For example, one can first run CHI and then perform LSA on the terms selected by CHI. Note that the quality of the feature set (and of the representation) is assessed based on the resulting classification performance (Forman, 2003).

An alternative to scoring methods is to create latent orthogonal features by combining existing features. Methods that construct new features from existing ones are known as feature transformation methods. Techniques include principal component analysis (PCA; Sirbu et al., 2016; Zu, Ohyama, Wakabayashi, & Kimura, 2003), latent semantic analysis (LSA; Landauer et al., 1998), and nonnegative matrix factorization (Zurada, Ensari, Asl, & Chorowski, 2013). These methods construct high-level features as a (non)linear combination of the original features with the property that the new features are uncorrelated. They operate on the DTM by applying a matrix factorization method. The text is scored (or projected) on the new features, or factors, and these new features are used in the subsequent analysis. LSA improves upon the VSM through its ability to detect synonymy (Landauer et al., 1998). Words that appear together and load highly on a single factor may be considered to be synonyms.

Another group of strategies to score features is to make use of class membership information in the training data. These methods are called supervised scoring methods. Examples of these methods are mutual information (MI), chi-squared (CHI), Gini index (GI), and information gain (IG; Yang & Pedersen, 1997). Supervised scoring methods are expected to be superior to unsupervised ones (e.g., DF), although in some cases DF thresholding has yielded performance comparable to supervised scoring methods such as CHI and GI (Yang, 1999) and even exceeded the performance of MI.

One way to eliminate features is to first assign scores to each feature and then remove features by setting a cutoff value. This is called thresholding (Lan, Tan, Su, & Lu, 2009; Salton & Buckley, 1988). Weights from the transformation steps are sometimes used to score features. An example is to remove rare terms, that is, terms with high IDF or low DF, since they are noninformative for category prediction or not influential in global performance. In some cases, rare terms are noise terms (e.g., misspellings).

Even after preprocessing, transformation through the VSM is still likely to result in a large feature set. Too large a number of features is undesirable because it may increase computational time and may degrade classification performance, especially when there are many redundant and noisy features (Forman, 2003; Guyon & Elisseeff, 2003; Joachims, 1998). The size of the vector, and hence the size of the feature set, is referred to as the dimensionality of the VSM representation. When possible, one should reduce dimensionality either by selectively eliminating features or by creating latent features from existing ones without sacrificing classification performance (Burges, 2010; Fodor, 2002; van der Maaten, Postma, & van den Herik, 2009). A reduced feature set has advantages such as higher efficiency and, in some cases, improved classification performance.

Text transformation plays a critical role in determining classification performance. Inevitably, some aspects of the text are lost in the transformation phase. Thus, when the resulting classification performance is poor, we recommend that the researcher reexamine this step. For example, while term-based features are popular, if performance is poor one could also consider developing features derived from linguistic information (e.g., parts of speech) contained in text (Gonçalves & Quaresma, 2005; Kobayashi et al., 2017; Moschitti & Basili, 2004) or using sequences of consecutive characters instead of whole words (e.g., character n-grams; Cavnar & Trenkle, 1994).

Although the VSM ignores word order information, it is popular due to its simplicity and effectiveness. Ignoring word order means losing some information regarding the semantic relationships between words. Also, words alone may not always express true atomic units of meaning. Some researchers improve the VSM by adding adjacent word pairs or trios (bigrams and trigrams) as features. For example, "new" followed by "york" becomes "new york" in a bigram. Although this incorporates some level of word order information, it also leads to feature explosion, thereby increasing noise and redundancy. Also, many bigrams and trigrams do not occur often; thus their global contributions to the classification are negligible and will only contribute to sparsity and computational load. A workaround is to use only the most informative phrases (e.g., frequent phrases; Scott & Matwin, 1999). Strategies for selecting key phrases include the noun phrase (Lewis, 1992) and key phrase (Turney, 1999) extraction algorithms. However, this adds complexity to the analysis, which may again not result in a significant improvement in the classification. Studies have consistently shown that using bigrams only marginally improved classification performance and in some cases degraded it, whereas the use of trigrams typically yielded no improvement (Dave et al., 2003; Ragas & Koster, 1998). Using syntactic phrases typically does not improve performance much compared to single-term features (Moschitti & Basili, 2004; Scott & Matwin, 1999). Thus, the recommendation is to rely on single-term features rather than phrases unless there is a strong rationale to use phrases.
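
The following short sketch (Python, toy tokens) shows both how word n-grams are formed and why adding them inflates the feature set:

```python
def ngrams(tokens, n):
    """All n-grams in a token sequence, joined into single feature strings."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["new", "york", "is", "very", "busy"]
print(ngrams(tokens, 2))   # ['new_york', 'york_is', 'is_very', 'very_busy']

# Adding bigrams and trigrams on top of unigrams inflates the feature set:
features = tokens + ngrams(tokens, 2) + ngrams(tokens, 3)
print(len(tokens), len(features))   # 5 unigrams grow to 12 features
```

For a realistic vocabulary the growth is far steeper, since the number of distinct n-grams grows much faster than the number of distinct words, while most n-grams remain rare.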

Other weighting options can be derived from basic count weighting. One can take the logarithm of the counts to dampen the effect of highly frequent terms. Here we need to add 1 to the counts to avoid taking the logarithm of zero counts. It is also possible to normalize with respect to document length by dividing each count by the maximum term count in a given document. This ensures that frequent terms in long documents are not overrepresented. Apart from the weights of the terms in each document, terms can also be weighted with respect to the corpus. Common corpus-based weights include the inverse document frequency (IDF), which assesses the specificity of terms in a corpus (Algarni & Tairan, 2014). Terms that occur in too few (large IDF) or in too many (IDF close to zero) documents have low discriminatory power and are therefore not useful for classification purposes. The formula for IDF is IDF(i) = log(N/df(i)), where N is the number of documents in the corpus and df(i) stands for the document frequency of term i, that is, the number of documents containing term i. Document-based and corpus-based weights may also be combined so that the weights simultaneously reflect the importance of a term in a document and its specificity to the corpus. The most popular combined weight measure is the product of term frequency (TF) and IDF, x_ji = TF(j,i) × IDF(i) (Aizawa, 2003).
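
These weighting options can be sketched in a few lines (Python, invented toy documents; real toolkits such as the tm package in R implement the same quantities):

```python
from math import log

docs = [
    ["care", "nurse", "care"],
    ["nurse", "schedule"],
    ["patient", "care", "record"],
]
N = len(docs)

# Document frequency of each term.
df = {}
for doc in docs:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

def idf(term):
    """IDF(i) = log(N / df(i)): close to zero for ubiquitous terms."""
    return log(N / df[term])

def tf_idf(doc, term):
    """Combined weight x_ji = TF(j, i) * IDF(i)."""
    return doc.count(term) * idf(term)

def damped_tf(doc, term):
    """Log-dampened count; the +1 avoids taking the logarithm of zero."""
    return log(1 + doc.count(term))

print(tf_idf(docs[0], "care"))      # frequent in document 0, but in 2 of 3 documents
print(tf_idf(docs[1], "schedule"))  # rarer term, hence a higher IDF
```

Note that "schedule," though it occurs only once, receives a larger weight in its document than the twice-occurring "care," because it is specific to that document.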

The most common way to transform text is to use the so-called vector space model (VSM), where documents are modeled as elements in a vector space (Raghavan & Wong, 1986; Salton, Wong, & Yang, 1975). The features in this representation are the individual terms found in the corpus. This makes sense under the assumption that words are the smallest independently meaningful units of a language. The size of the vector is therefore equal to the size of the vocabulary (i.e., the set of unique terms in a corpus). Hence, we can represent document j as X_j = (x_j1, x_j2, ..., x_jM), where M is the size of the vocabulary and the element x_ji is the weight of term i in document j. Weights can be the count of the terms in a document (x_ji = TF(j,i)) or, when using binary weighting, a 1 (presence of a term) or 0 (absence of a term). Applying the transformation to the entire corpus leads to a document-by-term matrix (DTM), where the rows are the documents, the columns are the terms, and the entries are the weights of the terms in each document.
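
A minimal sketch of the construction (Python, two invented documents, raw counts as weights) makes the DTM concrete:

```python
corpus = [
    "the nurse documents patient care",
    "the nurse plans care",
]

# Vocabulary: the set of unique terms in the corpus.
vocabulary = sorted({term for doc in corpus for term in doc.split()})

# Document-by-term matrix: rows are documents, columns are terms,
# entries are raw term counts (x_ji = TF(j, i)).
dtm = [[doc.split().count(term) for term in vocabulary] for doc in corpus]

print(vocabulary)
print(dtm)
```

Each row is the vector X_j for one document; with binary weighting the counts would simply be replaced by 1 (present) or 0 (absent).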

Text transformation is about representing documents so that they form a suitable input to a classification algorithm. In essence, this means imposing structure on previously unstructured text. Most classification algorithms accept vectors or matrices as input. Thus, the most straightforward approach is to represent a document as a vector and the corpus as a matrix.

For English documents, our general recommendation is to apply word tokenization, convert upper case letters to lower case, and apply stopword removal (except for short text such as email messages and product titles; Méndez et al., 2006; H.-F. Yu, Ho, Arunachalam, Somaiya, & Lin, 2012). Because the effects of normalization have been mixed, we suggest applying it only when it does not substantially degrade classification performance; its benefit is that it can increase classification efficiency by reducing the number of terms. When in doubt whether to remove numbers or punctuation (or other symbols), our advice is to retain them and apply the dimensionality reduction techniques discussed in the section on text transformation below.

During preprocessing, stemming, which is defined as the process of obtaining the base or stem form of words (Frakes, 1992; Porter, 1980), is also commonly applied. A key assumption in stemming is that words that have similar root forms are identical in meaning. Stemming is performed by removing suffixes, and the resulting stem may not correspond to an actual base form of the word (Willett, 2006). For example, the words calculates, calculating, and calculated will be rewritten to calculat, although the actual base form is calculate (Toman, Tesar, & Jezek, 2006). If one wants to recover the actual base form, then one can use lemmatization instead of stemming. However, lemmatization is more challenging than stemming (Toman et al., 2006) and the added complexity of applying lemmatization may offset its benefits. Both lemmatization and stemming lead to a loss of inflection information in words (e.g., tense, gender, and voice). Inflection information may be important in some applications, such as in identifying the sentiment of product reviews since, as it turns out, most negative reviews are written in the past tense (Dave, Lawrence, & Pennock, 2003). Stemming and lemmatization are part of a broad class of preprocessing techniques called normalization (Dave et al., 2003; Toman et al., 2006). The aim of normalization is to merge terms that express the same idea or concept under a single code called a template. For example, another normalization strategy is to use the template POST_CODE to replace all occurrences of postcodes in a collection of documents. This can be useful when it is important to consider whether a document does or does not contain a postcode (i.e., contains an address), but the actual postcode is irrelevant.
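
The template strategy is easy to sketch (Python; a five-digit postcode pattern is assumed purely for illustration, since real postcode formats vary by country):

```python
import re

# Normalization via a template: replace every postcode with the code POST_CODE.
def normalize_postcodes(text):
    # \b\d{5}\b matches a standalone run of exactly five digits.
    return re.sub(r"\b\d{5}\b", "POST_CODE", text)

print(normalize_postcodes("Send applications to 10115 Berlin or 80331 Munich."))
```

After this substitution, a classifier can still use the presence of an address as a feature without the vocabulary being inflated by every distinct postcode.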

Punctuation marks and numbers, if deemed irrelevant to the classification task at hand, are removed, although in some cases these may be informative and thus retained (exclamation marks or emoticons, for instance, may be indicative of sentiment). Dictionaries or lexicons are used to correct spelling and to resolve typos and abbreviations. Words that are known to have low information content, such as conjunctions and prepositions, are typically deleted. These words are called stopwords (Fox, 1992); examples of pre-identified stopwords in the English language are "and," "the," and "of" (see http://www.ranks.nl/stopwords for lists of stopwords in various languages). When the case of the letters is irrelevant, it is advisable to transform all upper case letters into lower case.

The purpose of preprocessing is to remove irrelevant bits of text, as these may obscure meaningful patterns and lead to poor classification performance and redundancy in the analysis (Uysal & Gunal, 2014). During preprocessing we first apply tokenization to separate individual terms. Terms may be words, punctuation marks, numbers, tags, and other symbols (e.g., an emoticon). In written English, terms are usually separated by spaces.
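
The basic preprocessing pipeline can be sketched in a few lines (Python, with a deliberately tiny stopword list; real lists, such as those linked above, are much longer):

```python
import string

STOPWORDS = {"and", "the", "of"}   # tiny example list for illustration only

def preprocess(text):
    """Tokenize on whitespace, lowercase, strip punctuation, drop stopwords."""
    tokens = text.lower().split()
    tokens = [t.strip(string.punctuation) for t in tokens]
    return [t for t in tokens if t and t not in STOPWORDS]

print(preprocess("The nurse, and the doctor, review records."))
```

Production tokenizers handle more cases (hyphenation, contractions, emoticons), but the sequence of operations is the same.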

The TC process consists of six interrelated steps, namely (a) text preprocessing, (b) text representation or transformation, (c) dimensionality reduction, (d) selection and application of classification techniques, (e) classifier evaluation, and (f) classifier validation. As with any research activity, before starting the TC process, we begin by formulating the research question and identifying text of interest. Here, we assume that classes are predefined and that the researcher has access to, or can gather, documents with known classes, that is, the training data. For example, in a study about identifying disgruntled employee communications, researchers used posts from intracompany discussion groups. Subsequently, using criteria on employee disgruntlement, two people manually classified 80 messages into either disgruntled or nondisgruntled communication (Holton, 2009). Another study focused on detecting the personality of users from their email messages. Researchers first administered a 120-item questionnaire to 486 users to identify their personalities, after which their email messages over a 12-month period were collected (Shen et al., 2013). Compared to the study on disgruntlement, it is more straightforward to label the associated text in this latter study because the labels are based on the questionnaire. Researchers are often faced with the decision of how many documents to label, an issue we will return to in the "Other TC issues" section below. Once the training dataset has been compiled, the next step is to preprocess the documents.

Tutorial

We developed the following tutorial to provide a concrete treatment of TC. Here we demonstrate TC using actual data and code. Our intended audience is researchers who have little or no experience with TC. This tutorial is a scaled-down version of our work on using TC to automatically extract job tasks from job vacancies. Our objective is to build a classifier that automatically classifies sentences into task or nontask categories. The sentences were obtained from German-language nursing job vacancies.

We set out to automate the process of classification because one can then deal with huge numbers (i.e., millions) of vacancies. The output of the text classifier can be used as input to other research or tasks such as job analysis or the development of tools to facilitate personnel decision making. We used the R software since it has many ready-to-use facilities that automate most TC procedures. We provide annotated R scripts and data to run each procedure. Both the code and data can be downloaded as a Zip file from Github; the URL is https://github.com/vkobayashi/textclassificationtutorial. The R scripts are named in the following format: CodeListing_<number>.R; in this tutorial we reference them as CL <number>. Thus, CL 1 refers to the script CodeListing_1.R. Note that the CL files contain detailed descriptions of each command, and that each command should be run sequentially.

All the scripts were tested and are expected to work on any computer (PC or Mac) with R, RStudio, and the required libraries installed. However, basic knowledge of how to start R, open R projects, run R commands, and install packages in RStudio is needed to run and understand the code. For those new to R we recommend following an introductory R tutorial (see, for example, DataCamp [www.datacamp.com/courses/free-introduction-to-r] or tutorialspoint [www.tutorialspoint.com/index.htm] for free R tutorials).

This tutorial covers each of the previously enumerated TC steps in sequence. For each step we first explain the input, elaborate on the process, and describe the output, which is often the input for the subsequent step. Table 2 provides a summary of the input, process, and output for each step in this tutorial. Finally, after downloading the code and data, open the text_classification_tutorial.Rproj file. The reader should then run the code for each step as we go along, so as to be able to examine the input and the corresponding output.

Table 2. Text Classification Based on the Input-Process-Output Approach.

Preparing Text

The input for this step consists of the raw German job vacancies. These vacancies were obtained from Monsterboard (www.monsterboard.nl). Since the vacancies are webpages, they are in hypertext markup language (HTML), the standard markup language for representing content in web documents (Graham, 1995). Apart from the relevant text (i.e., content), raw HTML pages also contain elements used for layout. Therefore, a technique known as HTML parsing is used to separate the content from the layout. In R, parsing HTML pages can be done using the XML package. This package contains two functions, namely htmlTreeParse(), which parses HTML documents, and xpathSApply(), which extracts specific content from parsed HTML documents. CL 1 (see the annotations in the file for further details as to what each command does) installs and loads the XML package, and applies the htmlTreeParse() and xpathSApply() functions. In addition, the contents of the HTML file sample_nursing_vacancy.html in the folder data are imported as a string object and stored in the variable htmlfile. Subsequently, this variable is provided as an argument to the htmlTreeParse() function. The parsed content is then stored in the variable rawpagehtml, which in turn is the doc argument to the xpathSApply() function, which searches for the tags in the text that we are interested in. In our case this text can be found in the div tag of the class content. Tags are keywords surrounded by angle brackets (e.g., <div> and </div>). The xmlValue in the xpathSApply() function means that we are obtaining the content of the HTML element between the corresponding tags. Finally, the writeLines() function writes the text content to a text file named sample_nursing_vacancy.txt (in the folder parsed). To extract text from several HTML files, the code in CL 1 is put in a loop in CL 2. The function htmlfileparser() in CL 2 accepts two arguments and applies the procedures in CL 1 to each HTML file in a particular folder.
The first argument is the name of the folder and the second argument is the name of the destination folder where the extracted text content is to be written. Supposing these HTML files are in the folder vacancypages and the extracted text content is to be saved in the folder parsedvacancies, these are the arguments we provide to htmlfileparser(). The number of text files generated should correspond to the number of HTML files, provided that all HTML files are well-formed (e.g., correct formatting). The text files comprise the output for this step.
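
The underlying idea, separating content from layout by keeping only the text inside the relevant tag, is language-agnostic. The sketch below (Python standard library, with a made-up HTML snippet; the tutorial itself uses R's XML package) extracts the text inside a div tag of class content:

```python
from html.parser import HTMLParser

class ContentExtractor(HTMLParser):
    """Collect the text inside <div class="content"> ... </div>."""
    def __init__(self):
        super().__init__()
        self.depth = 0          # nesting depth inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth > 0:
                self.depth += 1                       # nested div inside the target
            elif dict(attrs).get("class") == "content":
                self.depth = 1                        # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())          # keep only content text

html = ('<html><body><div class="nav">Menu</div>'
        '<div class="content">Nurse wanted. Tasks include patient care.</div>'
        '</body></html>')
parser = ContentExtractor()
parser.feed(html)
print(" ".join(parser.chunks))
```

Layout elements (here, the navigation div) are discarded, which corresponds to the xpathSApply()/xmlValue step in CL 1.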

Preprocessing Text

The preprocessing step consists of two stages. The first identifies sentences in the vacancies, since the sentence is our unit of analysis, and the second applies text preprocessing operations on the sentences. We used sentences as our unit of analysis since our assumption is that the sentence is the right resolution level to detect job task information. We did not use the vacancy as our unit of analysis since a vacancy may contain more than one task. In fact, if we chose to treat the vacancy as the unit of analysis, it would still be important to identify which of the sentences contain task information. Another reason to select the sentence as the unit of analysis is to minimize variance in document length. Input for the first stage are the text files generated from the previous step, and the output sentences from this stage serve as input to the second stage. CL 3 contains functions that can detect sentences in the parsed HTML file from the previous section (i.e., sample_nursing_vacancy.html). The code loads the openNLP package. This package contains functions that run many popular natural language processing (NLP) routines, including a sentence segmentation algorithm for the German language. Although the German sentence segmenter in openNLP generally works well, at times it may fail. Examining such failures in the output can provide ideas for the inclusion of new arguments in the algorithm. For example, if the segmenter encounters the words bzw. and inkl. (which are abbreviations of "or" and "including," respectively, in German), then the algorithm will treat the next word as the start of a new sentence. This is because the algorithm has a rule that when there is a space after a period, the next word is the start of a new sentence. To adjust for these and other similar cases, we created a wrapper function named sent_tokens().
Another function, sent_split(), searches for matches of the provided pattern within the string and, when a match is found, separates the two sentences at this match. For example, some vacancies use bullet points or symbols such as "|" to enumerate tasks or responsibilities. To separate these items we supply the symbols as arguments to the function. Finally, once the sentences are identified, the code writes the sentences to a text file where one line corresponds to one sentence. For multiple text files, the code should again be run in a loop. One large text file will then be generated containing the sentences from all parsed vacancies. Since we put all sentences from all vacancies in a single file, we attached the names of the corresponding text vacancy files to the sentences to facilitate tracing back the source vacancy of each sentence. Thus, the resulting text file containing the sentences has two columns: the first column contains the file names of the vacancies from which the sentences in the second column were derived. After applying sentence segmentation to the parsed vacancy in sample_nursing_vacancy.txt, the sentences are written to the file sentencelines_nursing_vacancy.txt located in the folder sentences_from_sample_vacancy. The next task is to import the sentences into R so that additional preprocessing (e.g., text cleaning) can be performed. Other preprocessing steps that may be applied are lower case transformation, punctuation removal, number removal, stopword removal, and stemming. For this we use the tm package in R. This package automatically applies word tokenization, so we do not need to create separate commands for that. The sentences are imported as a data frame in R (see CL 4). Since the sentence is our unit of analysis, hereafter we refer to these sentences as documents. The first column is temporarily ignored since it contains only the names of the vacancy files.
Since the sentences are now stored in a vector (in the second column of the data frame), the VectorSource() function is used. The source determines where to find the documents. In this case the documents are in mysentences[,2]. If the documents are stored in another source, for example in a directory rather than in a vector, one can use DirSource(). For a list of supported sources, invoke the function getSources(). Once the source has been set, the next step is to create a corpus from this source using the VCorpus() function. In the tm package, the corpus is the main structure for managing documents. Several preprocessing procedures can be applied to the documents once they are collected in the corpus. Many popular preprocessing procedures are available in this package. Apart from the existing procedures, users can also specify their own via user-defined functions. The procedures we applied are encapsulated in the transformCorpus() function. They include number, punctuation, and extra whitespace removal, and lower case conversion. We did not apply stemming since previous work recommends not using stemming for short documents (H.-F. Yu et al., 2012). The output consists of the cleaned sentences in the corpus, with numbers, punctuation, and superfluous whitespace removed.
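
The abbreviation problem handled by sent_tokens() can be illustrated with a deliberately naive splitter (Python, invented example sentences; this is a conceptual sketch, not the openNLP algorithm): break after sentence-final punctuation unless the token is a known abbreviation such as bzw. or inkl.:

```python
# German abbreviations that end in a period but should not end a sentence.
ABBREVIATIONS = {"bzw.", "inkl."}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

text = "Pflege der Patienten inkl. Dokumentation. Planung bzw. Koordination der Arbeit."
print(split_sentences(text))
```

Without the abbreviation list, this text ("care of patients including documentation; planning or coordination of the work") would be wrongly cut into four fragments instead of two sentences.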

Text Transformation

CL 5 details the structural transformation of the documents. The input in this step is the output from the preceding step (i.e., the cleaned sentences in the training data). To quantify text characteristics, we use the VSM because it is the simplest and perhaps most straightforward approach to quantify text and thus forms an appropriate starting point in the application of TC (Frakes & Baeza-Yates, 1992; Salton et al., 1975). For this transformation, the tm package provides the DocumentTermMatrix() function, which builds features based on the individual words in the corpus. The DocumentTermMatrix() function transforms a corpus into a matrix where the rows are the documents, the columns are features, and the entries are the weights of the features in each document. The default behavior of the DocumentTermMatrix() function is to ignore terms with fewer than 3 characters. Hence, it is possible that some rows consist entirely of 0's, because after preprocessing it may be the case that in some sentences all remaining terms have fewer than 3 characters. The output in this step is the constructed DTM. This matrix is then used as a basis for further analysis. We can further manipulate the DTM, for instance by adjusting the weights. We mentioned previously that for word features one can use raw counts as weights. The idea of using raw counts is that the higher the count of a term in a document, the more important it is in that document. The DocumentTermMatrix() function uses the raw count as the default weighting option. One can specify other weights through the weighting option of the control argument. To take into account document sizes, for example, we can apply a normalization to the weights, although in this case it is not an issue because sentences are short. Let us assign a "weight" to a feature that reflects its importance with respect to the entire corpus using the DF.
Another useful feature of DF is that it provides us with an idea of what the corpus is about. For our example, the word with the highest DF (excluding stopwords) is pflege (which translates to "care"), which makes sense because nursing is about the provision of care. Terms that are extremely common are not useful for classification. Another common text analysis strategy is to find keywords in documents. The keywords may be used as a heuristic to determine the most likely topic in each document. For this we can use the TF-IDF measure. The keyword for each document is the word with the maximum TF-IDF weight (ties are resolved through random selection). The code in CL 6 computes the keyword for each document. For example, the German keyword for Document 4 is aufgabenschwerpunkte, which translates in English to "task focal points." The final DTM can be used as input to dimensionality reduction techniques or directly to the classification algorithms. The process from text preprocessing to text transformation culminated in the DTM that is depicted in Figure 3.
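
The keyword heuristic itself is compact: pick, per document, the term maximizing TF-IDF. A minimal sketch (Python, invented two-document corpus; CL 6 performs the equivalent computation in R on the real data):

```python
from math import log

docs = [
    ["pflege", "station", "pflege"],
    ["station", "aufgabenschwerpunkte"],
]
N = len(docs)

# Document frequency of each term.
df = {}
for doc in docs:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

def keyword(doc):
    """Keyword = the term with the maximum TF-IDF weight in the document."""
    return max(set(doc), key=lambda t: doc.count(t) * log(N / df[t]))

print([keyword(d) for d in docs])
```

Note how "station," despite occurring in both documents, is never a keyword: its IDF is zero because it appears in every document.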

Dimensionality Reduction

Before running classification algorithms on the data, we first investigate which among the features are likely most useful for classification. Since the initial features were selected in an ad hoc manner, that is, without reference to specific background knowledge or theory, it may be possible that some of the features are irrelevant. In this case, we applied dimensionality reduction to the DTM. LSA is commonly applied to reduce the size of the feature set (Landauer et al., 1998). The output of LSA yields new dimensions that reveal underlying patterns in the original features. The new features can be interpreted as new terms that summarize the contextual similarity of the original terms. Thus, LSA partly addresses issues of synonymy and, in some circumstances, polysemy (i.e., when a single meaning of a word is used predominantly in a corpus). In R, the lsa package contains a function that runs LSA. To illustrate LSA we need additional vacancies. For illustrative purposes we used 11 job vacancies (see the parsedvacancies folder). We applied sentence segmentation to all the vacancies and obtained a text file containing 425 sentences that were extracted from the 11 vacancies (see the sentences_from_vacancies folder). After applying preprocessing and transforming the sentences in sentenceVacancies.txt into a DTM, we obtained 1,079 features and retained 422 sentences. We selected all sentences and ran LSA on the transposed DTM (i.e., the term-by-document matrix; see CL 7). We applied normalization to the term frequencies to minimize the effect of longer sentences. Documents and terms are projected onto the constructed LSA space in the projdocterms matrix. The entries in this matrix are readjustments of the original entries in the term-by-document matrix. The readjustments take into account patterns of co-occurrence between terms.
Hence, terms that often occur together will have roughly the same values in documents where they are expected to appear. We can apply the cosine measure to identify similar terms. Similarity is interpreted in terms of having the same pattern of occurrence. For example, terms that have the same pattern of occurrence as sicherstellung can be found by running the corresponding commands in CL 8. The German word sicherstellung (which means "to guarantee" or "to make sure" in English) is found to be contextually similar to patientenversorgung (patient care) and reibungslosen (smooth or trouble-free) because these two words appeared together with sicherstellung in the selected documents. Another interesting property of LSA is that it can uncover similarity between two terms even though the two terms may never be found to co-occur in a single document. Consider, for example, the word koordinierung (coordination): we find that kooperative (cooperative) is a term with which it is associated even though there is not one document in the corpus in which the two terms co-occur. This happens because both terms are found to co-occur with zusammenarbeit (collaboration); thus, when either one of the terms occurs, LSA expects that the other should also be present. This is the way LSA addresses the issue of synonymy and polysemy. One can also find the correlation among documents and among terms by running the corresponding commands in CL 8. Since our aim is to reduce dimensionality, we project the documents onto the new dimensions. This is accomplished through the corresponding code in CL 8. From the LSA, we obtain a total of 107 new dimensions from the original 1,079 features. It is usually not easy to attach natural language interpretations to the new dimensions. In some scenarios, we can interpret a new dimension by examining the scaled coefficients of the terms on the new dimensions (much like in PCA).
Terms with higher loadings on a dimension have a greater impact on that dimension. Figure 4 visualizes the terms with high numerical coefficients on the first 6 LSA dimensions (see CL 8 for the relevant code). Here we distinguish between terms found to occur in a task sentence (red) or not (blue). In this way, an indication is provided of which dimensions are indicative of each class (note that distinguishing between tasks and nontasks requires the training data, which is discussed in greater detail below). Another approach is to downsize the feature set by eliminating those features that are not (or less) relevant. Such techniques are collectively called filter methods (Guyon & Elisseeff, 2003). They work by assigning scores to features and setting a threshold whereby features having scores below the threshold are filtered out. Both the DF and IDF can be used as scoring methods. However, one main disadvantage of DF and IDF is that they do not use class membership information in the training data. Including class membership (i.e., through supervised scoring methods) ought to be preferred, as it capitalizes on the discriminatory potential of features (Lan et al., 2009). For supervised scoring methods, we need to rely on the labels of the training data. In this example, the labels indicate whether a sentence expresses task information (1) or not (0). These labels were obtained by having experts manually label each sentence. For our example, experts manually assigned labels to the 425 sentences. We applied three scoring methods, namely, Information Gain, Gain Ratio, and Symmetric Uncertainty (see CL 12). Due to the limited number of labeled documents, these scoring methods yielded less than optimal results.
However, they still managed to detect one feature that may be useful for identifying the class of task sentences, that is, the word zusammenarbeit (collaboration), as this word most often occurred in task sentences. The output from this step is a column-reduced matrix that is either the reduced version of the DTM or the matrix with the new dimensions. In our example we applied LSA and the output is a matrix in which the columns are the LSA dimensions.

Classification

The reduced matrix from the preceding section can be used as input for classification algorithms. The output from this step is a classification model that we can then use to automatically classify sentences in new vacancies. We mentioned earlier that reducing dimensionality is an empirically driven decision rather than one guided by specific rules of thumb. Thus, we will test whether the new dimensions lead to an improvement in performance compared to the original set by running separate classification algorithms, namely support vector machines (SVMs), naive Bayes, and random forest, on each set. These three have been shown to work well on text data (Dong & Han, 2004; Eyheramendy et al., 2003; Joachims, 1998). Accuracy is not a good performance metric in this case since the proportion of task sentences in our example data is low (less than 10%). The baseline accuracy (computed from the model which assigns all sentences to the dominant class) would be 90%, which is high and thus difficult to improve upon. More suitable performance metrics are the F-measure (Ogura et al., 2011; Powers, 2011) and balanced accuracy (Brodersen, Ong, Stephan, & Buhmann, 2010). We use these two measures here since the main focus is on the correct classification of task sentences and we also want to control for misclassifications (nontask sentences put into the task class or task sentences put into the nontask class). In assessing the generalizability of the classifiers, we employed 10 times 10-fold cross-validation. We repeated 10-fold cross-validation 10 times because of the limited training data. We use one part of the data to train a classifier and test its performance by applying the classifier to the remaining part and computing the F-measure and balanced accuracy. For the 10 times 10-fold cross-validation, we performed 100 runs for each classifier using the reduced and original feature sets.
Hence, for the example we ran about 600 trainings since we trained 6 classifiers in total. All performance results reported are computed using the test sets (see CL 10). From the results we see how classification performance varies across the choice of features, classification algorithms, and evaluation measures. Figure 5 presents the results of the cross-validation. Based on the F-measure, random forest yielded the best performance using the LSA reduced feature set. The highest F-measure obtained is 1.00 and the highest average F-measure is 0.40, both from random forest. SVM and naive Bayes have roughly the same performance. This suggests that among the three classifiers random forest is the best classifier when using the LSA reduced feature set and the F-measure as the evaluation metric. If we favor the correct detection of task sentences and we want a relatively small dimensionality, then random forest should thus be favored over the other methods. When using the original features, SVM and random forest exhibit comparable performance. Hence, when using the F-measure and the original feature set, either SVM or random forest would be the preferred classifier. The low values of the F-measures can be accounted for by the limited amount of training data. For each fold, there are about 3-4 task sentences; thus, a single misclassification of a task sentence leads to a sizeable reduction in precision and recall, which in turn results in a low F-measure value. When balanced accuracy is the evaluation measure, SVM and random forest consistently yield similar performance when using either the LSA reduced feature set or the original feature set, although random forest yielded slightly higher performance than SVM using the LSA reduced feature set.
This seems to suggest that, with balanced accuracy and the original features, one can choose between SVM and random forest, whereas with the LSA-reduced feature set random forest is to be preferred. Notice, moreover, that the numerical values for balanced accuracy are higher than those for the F-measure: balanced accuracy is inflated by the accuracy on the dominant class, in this case the nontask class.

This classification example reveals the many issues one may face in building a suitable classification model. First is the central role of features in classification. Second is how to model the relationship between the features and class membership. Third is the crucial role of choosing an appropriate evaluation measure or performance metric. This choice should be guided by the nature of the problem, the objectives of the study, and the amount of error we are willing to tolerate. In our example, we assign equal importance to both classes and therefore have a slight preference for balanced accuracy. In applications where the misclassification cost for the positive class is greater than that for the other class, the F-measure may be preferred. For a discussion of alternative evaluation measures, see Powers (2011). Other issues include how to set a cutoff value for the evaluation measure to judge whether a model is good enough, and, relatedly, how much training data are needed for the classification model to generalize well (i.e., how to avoid overfitting). These questions are best answered empirically through systematic model evaluation, for instance by trying different training sizes and varying the threshold and then observing the effect on classifier performance. One strategy is to treat this as a factorial experiment in which the choices of training size and evaluation measure are considered as factor combinations. In addition, one has to perform repeated evaluation (e.g., cross-validation) and validation.
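One way to set up such a factorial evaluation is sketched below, assuming scikit-learn and synthetic data. Training size and decision threshold are crossed as two factors and balanced accuracy is recorded for each cell; the grid values are illustrative, not recommendations.

```python
# Factorial sketch: training size x decision threshold, scored with
# balanced accuracy on a held-out test set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

results = {}
for train_size in (0.3, 0.5, 0.7):        # factor 1: amount of training data
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_size, stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=50,
                                 random_state=0).fit(X_tr, y_tr)
    prob_task = clf.predict_proba(X_te)[:, 1]
    for threshold in (0.3, 0.5, 0.7):     # factor 2: decision threshold
        y_pred = (prob_task >= threshold).astype(int)
        results[(train_size, threshold)] = balanced_accuracy_score(y_te, y_pred)

for cell, score in sorted(results.items()):
    print(cell, round(score, 3))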
Aside from modeling issues there are also practical concerns such as the cost of acquiring training data and the interpretability of the resulting model. Classification models with high predictive performance are not always the ones that yield the greatest insight. Insofar as the algorithm is to be used to support decision making, the onus is on the researcher to be able to explain and justify its workings.

Classification for Job Information Extraction For our work on job task information extraction three people hand labeled a total of 2,072 out of 60,000 sentences. It took a total of 3 days to label, verify and relabel 2,072 sentences. From this total, 132 sentences were identified as task sentences (note that the task sentences were not unique). The proportion of task sentences in vacancy texts was only 6%. This means that the resulting training data are imbalanced. This is because not all tasks that are part of a particular job will be written in the vacancies, likely only the essential and more general ones. This partly explains their low proportion. Since labeling additional sentences will be costly and time-consuming we employed a semisupervised learning approach called label propagation (Zhu & Ghahramani, 2002). For the transformation and dimensionality reduction we respectively constructed the DTM and applied LSA. Once additional task sentences were obtained via semisupervised learning we ran three classification algorithms, namely, SVM, random forest, and naive Bayes. Instead of choosing a single classifier we combined the predictions of the three in a simple majority vote. For the evaluation measure we used the Recall measure since we wanted to obtain as many task sentences as possible. Cross-validation was used to assess the generalization property of the model. The application of classification resulted to identification of 1,179 new task sentences. We further clustered these sentences to obtain unique nursing tasks since some sentences pointed to the same tasks.