Introduction

Text mining, the analysis of online text sources, is a vital tool for gaining new knowledge and insights into people's habits and sentiments, and for monitoring social progress. While search engines give internet users easier access to existing information, text mining provides the opportunity to identify new knowledge and insights.

As many corporations store the majority of their data in text form, cost-efficient text mining can be a valuable asset. One major use of text mining is in marketing, where trends can be analysed using text transcripts of interactions with customers. For example, Hewlett Packard uses SAS Text Miner to analyse transcripts of telesales calls, partitioning notes by themes which can be used for subsequent analysis.

In addition, text mining of research literature has proven to be enormously useful, particularly in medical research. A notable success is that of the Children's Memorial Hospital in Chicago, where SPSS text mining software has been used to identify drug targets with the potential to cure cancers.

Comparison to Traditional Knowledge Extraction

Consider an analysis done by the University of Regensburg, Germany, of bond markets between 1913 and 1919. This shows that German investors remained relatively positive until months before the end of World War I. However, British and Dutch investors had a vastly different view, as the price of German government bonds fell on British and Dutch markets.

The full paper highlights many of the weaknesses of such analysis: typically the information is incomplete, often originally in paper-based format, and confined to a narrow section of people.

With social networking, deeper analysis is possible. The following is a social network graph of Twitter data where terms relating to R and data mining are clustered together. Terms used a lot in the same tweets are closer together while the most common terms are in the center.

Such text mining has proven enormously useful commercially: for search engine optimisation, for identifying hot topics in newsprint, and in recommender algorithms.

Data Input

Often data can come from many inconvenient sources, so data input and scrubbing tends to be an important and time-consuming task. Consider the Reuters-21578 dataset, a collection of Reuters articles from 1987. The following is a sample of the first news item in one of the files:

<!DOCTYPE lewis SYSTEM "lewis.dtd">
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN> C T f0704reute u f BC-BAHIA-COCOA-REVIEW 02-26 0105</UNKNOWN>
<TEXT>
<TITLE>BAHIA COCOA REVIEW</TITLE>
<DATELINE> SALVADOR, Feb 26 - </DATELINE>
<BODY>Showers continued throughout the week in the Bahia cocoa zone, alleviating the drought since early January and improving prospects for the coming temporao, although normal humidity levels have not been restored, Comissaria Smith said in its weekly review.
... lines omitted for brevity and illustrative purposes ...
</BODY></TEXT>
</REUTERS>

Luckily, the data has been prepared in a standard format which is explained in an accompanying README file. The article content is contained between the tags <BODY> and </BODY>, while additional category data is included, e.g. topic, location, people and organisations. From this, a basic script can be written to extract the news articles relating to business acquisitions:

con <- file("stdin", open = "r")
#initialise document list
document <- list()
#boolean to check if in news article body
body <- FALSE
#boolean to check if article relates to chosen topic
on_topic <- FALSE
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  #decode any escaped angle brackets
  line <- gsub("&lt;", "<", line)
  if (grepl("<TOPICS>", line)) {
    #filter so only acquisition related news is looked at
    on_topic <- grepl("acq", line)
  }
  if (!body & grepl("<BODY>", line)) {
    #check if reached article start
    body <- TRUE
  } else if (body & grepl("</BODY>", line)) {
    #check if reached article end
    body <- FALSE
    on_topic <- FALSE
  }
  if (grepl("<BODY>", line) & on_topic) {
    #create new document in list and add first line of article
    document[[length(document) + 1]] <- strsplit(line, "<BODY>")[[1]][2]
  } else if (body & on_topic) {
    #or add next line of article
    document[[length(document)]] <- paste(document[[length(document)]], line)
  }
}
close(con)

Scrubbing and Preparing Data

Firstly, text is converted to lowercase while whitespace, numbers and punctuation are removed. This ensures multiple forms of a word, e.g. "Word," and "word", are now the same. Also, small but common words ("stopwords"), e.g. "a" and "is", need to be removed to get useful results.
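As a rough illustration of these steps, sketched here in Python for brevity (the stopword list below is a tiny made-up sample, far shorter than real lists such as tm's English one):

```python
import re

# tiny illustrative stopword list (real lists contain hundreds of words)
STOPWORDS = {"a", "is", "the", "and", "in"}

def scrub(text):
    """Lowercase, strip numbers and punctuation, collapse whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # remove numbers and punctuation
    tokens = text.split()                    # splitting also collapses whitespace
    return [t for t in tokens if t not in STOPWORDS]

print(scrub('The price is 10 pct higher, "Word" and word now match.'))
# -> ['price', 'pct', 'higher', 'word', 'word', 'now', 'match']
```

Note how "Word" and "word" collapse to the same token once case and punctuation are stripped.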

Next, stemming is performed where similar words with different spellings are grouped, e.g. instances of ‘work’, ‘worker’, ‘worked’ and ‘working’ might come under the stem-word ‘work’. Stemming is a challenging problem given languages’ complexity e.g. “share” as a verb is vastly different from “shares” as a noun. In addition, software will often produce unusual stem words e.g. “company” stems to “compani”.
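A crude suffix-stripping stemmer, sketched as a Python toy (real algorithms such as Porter's are far more careful, which is also why they produce stems like "compani"), shows the basic idea:

```python
def naive_stem(word):
    """Toy stemmer: strip a few common English suffixes.

    Purely illustrative; it ignores the grammatical subtleties
    (e.g. verb vs noun) discussed in the text.
    """
    for suffix in ("ing", "ed", "er", "s"):
        # only strip if a reasonable-length stem remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

words = ["work", "worker", "worked", "working"]
print([naive_stem(w) for w in words])  # all four reduce to "work"
```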

R provides support for text preparation via the package tm which can be used for texts in a wide variety of languages. The following code shows how to implement text scrubbing on the list of news articles extracted in the previous section:

library(tm)
myStopwords <- c(stopwords('english'), "available", "via", "the", "said",
                 "reut", "reuter", "pct", "mln", "dlrs", "inc", "it", "ab")
corp <- Corpus(VectorSource(document))
parameters <- list(minDocFreq = 1,
                   wordLengths = c(2, Inf),
                   tolower = TRUE,
                   stripWhitespace = TRUE,
                   removeNumbers = TRUE,
                   removePunctuation = TRUE,
                   stemming = TRUE,
                   stopwords = myStopwords,
                   tokenize = NULL,
                   weighting = function(x) weightSMART(x, spec = "ltn"))
myTdm <- TermDocumentMatrix(corp, control = parameters)

This creates a term document matrix, which shows the number of occurrences for each term in each document, an essential tool in text analytics.
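To make the structure concrete, a raw-count term document matrix for two invented toy documents can be built by hand (a Python sketch; tm's version additionally applies the SMART weighting chosen above):

```python
from collections import Counter

# two made-up toy documents
docs = ["company agreed to acquire shares",
        "shares of the company rose"]

tokenised = [d.split() for d in docs]
terms = sorted(set(t for doc in tokenised for t in doc))

# rows = terms, columns = documents, entries = occurrence counts
tdm = {term: [Counter(doc)[term] for doc in tokenised] for term in terms}
for term, counts in tdm.items():
    print(f"{term:10s} {counts}")
```

Terms such as "company" and "shares" get a non-zero count in both columns, which is exactly the co-occurrence information the later clustering and graph steps exploit.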

Generating Results

Word Cloud

The easiest result to generate from the data prepared by the above code is a word cloud. The following code does just this:

library(wordcloud)
m <- as.matrix(myTdm)
#frequency of words in descending order
wordFreq <- sort(rowSums(m), decreasing = TRUE)
#set colour in accordance with word frequency
grayLevels <- gray((wordFreq + 10) / (max(wordFreq) + 10))
#build word cloud based on word frequency
wordcloud(words = names(wordFreq), freq = wordFreq, min.freq = 3,
          random.order = FALSE, colors = grayLevels)

This generates the following image:

Clustering

A useful tool in text mining is 'clustering', the grouping of similar data, e.g. words like 'shares' and 'acquisition' might fall into the same cluster. This has enormous use in search engine optimisation and in observing user behaviour on social media.

Before clustering, a look at the term document matrix in R reveals this:

> myTdm
<<TermDocumentMatrix (terms: 1828, documents: 100)>>
Non-/sparse entries: 5261/177539
Sparsity           : 97%
Maximal term length: 17
Weighting          : SMART ltn (SMART)

As shown, the term document matrix is quite sparse: 97% of the entries are 0. In other words, many terms appear in only a handful of documents, and working with the whole matrix can be unnecessarily time-consuming. The function removeSparseTerms removes such terms.
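The idea behind removeSparseTerms can be mimicked on a plain matrix by dropping rows whose fraction of zero entries exceeds the sparsity threshold (a Python sketch of the concept; drop_sparse_rows is a hypothetical helper, not a library function, and the counts are invented):

```python
def drop_sparse_rows(matrix, sparse=0.75):
    """Drop rows (terms) whose fraction of zero entries exceeds `sparse`."""
    return {term: counts for term, counts in matrix.items()
            if counts.count(0) / len(counts) <= sparse}

tdm = {"share":    [2, 1, 0, 3, 1],   # zero in 1 of 5 docs (0.2)  -> kept
       "temporao": [1, 0, 0, 0, 0]}   # zero in 4 of 5 docs (0.8)  -> dropped
print(list(drop_sparse_rows(tdm, sparse=0.75)))  # -> ['share']
```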

Hierarchical cluster analysis is where the relationship between textual terms is evaluated based on similarity i.e. how similar their rows in a term document matrix are:

#remove sparse terms
myTdm2 <- removeSparseTerms(myTdm, sparse = 0.75)
m2 <- as.matrix(myTdm2)
#create distance matrix, which shows the distance between textual terms,
#where each term's position is represented by its row in the
#term document matrix, showing how similar terms are
distMatrix <- dist(scale(m2))
#apply hierarchical clustering
fit <- hclust(distMatrix, method = "ward.D")
#plot dendrogram
plot(fit)
#mark out clusters with rectangles, default colour is red
rect.hclust(fit, k = 10)

This easily reveals relationships between common terms, e.g. trading-related terms such as "stock", "offer", "common" and "share" sit together on one side of the graph. However, it is a time-consuming algorithm for large datasets. More efficient approaches are given by the k-means and k-medoids algorithms.

K-means and K-medoids Examples

Consider 4 users and their ratings of 2 movies, each rated from 1 to 5 stars, and try to partition the users into 2 clusters.

K-means Algorithm

Step 1: Randomly pick 2 users and let the centroid of each cluster be at their position.

Step 2: Assign each user to the cluster with the closest centroid.

Step 3: Calculate a new centroid for each cluster i.e. the average position of users in each cluster.



Step 4: Repeat steps 2 and 3 until the centroids converge.
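The steps above can be traced in a toy Python sketch (the four rating pairs are invented for illustration, and fixed starting users replace the random pick in step 1 so the run is reproducible):

```python
# hypothetical ratings of 2 movies (1-5 stars) by 4 users
users = [(1.0, 2.0), (1.5, 1.5), (4.5, 5.0), (5.0, 4.0)]

def dist2(a, b):
    """Squared Euclidean distance between two rating pairs."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans(points, centroids):
    """Basic 2-cluster k-means; assumes no cluster ever empties out."""
    while True:
        # step 2: assign each user to the cluster with the closest centroid
        clusters = [[], []]
        for p in points:
            nearest = min(range(2), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # step 3: recompute each centroid as the average position of its cluster
        new_centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            for c in clusters]
        # step 4: stop once the centroids converge
        if new_centroids == centroids:
            return clusters, centroids
        centroids = new_centroids

# step 1: start from two users as the initial centroids
clusters, centroids = kmeans(users, [users[0], users[2]])
print(clusters)   # low raters together, high raters together
print(centroids)
```

The two low-rating users and the two high-rating users end up in separate clusters, with centroids at their respective average positions.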

K-medoids Partitioning Around Medoids (PAM) Algorithm

Step 1: Randomly pick 2 users and let the medoid of each cluster be at their position.

Step 2: Assign non-medoids to the cluster of the closest medoid and find the partition’s “cost”. The cost is the sum of the distance of each user to the nearest medoid.

Step 3: For each medoid, try swapping it with a non-medoid so that the total cost of the partition is minimised. Repeat this step until the set of medoids converges.
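The PAM cost computation can be sketched in Python on the same kind of invented ratings; with only 4 users, step 3's swap search can simply be done exhaustively over all medoid pairs:

```python
from itertools import combinations

# hypothetical ratings of 2 movies (1-5 stars) by 4 users
users = [(1.0, 2.0), (1.5, 1.5), (4.5, 5.0), (5.0, 4.0)]

def dist(a, b):
    """Euclidean distance between two rating pairs."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def cost(medoids):
    # step 2: each user joins its nearest medoid;
    # cost = sum of each user's distance to that medoid
    return sum(min(dist(u, m) for m in medoids) for u in users)

# steps 1 and 3, done exhaustively for this tiny case:
# evaluate every pair of users as medoids and keep the cheapest
best = min(combinations(users, 2), key=cost)
print(best, round(cost(best), 3))
```

Unlike k-means centroids, the medoids are always actual users, which makes PAM less sensitive to outliers.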

Now the following illustrates the code for k-means clustering:

#remove sparse terms
myTdm2 <- removeSparseTerms(myTdm, sparse = 0.75)
m2 <- as.matrix(myTdm2)
#take a guess at number of clusters
k <- 8
#apply k-means
kmeansResult <- kmeans(m2, k)
#print out clusters and terms in each cluster
for (i in 1:k) {
  cat(paste("cluster ", i, ": ", sep = ""))
  cat(rownames(m2)[which(table(rownames(m2), kmeansResult$cluster)[, i] == 1)], "\n")
}

This outputs the following clusters:

cluster 1: will
cluster 2: acquisit
cluster 3: offer
cluster 4: group
cluster 5: common share stock
cluster 6: acquir
cluster 7: unit
cluster 8: compani corp

And for the PAM algorithm:

#remove sparse terms
myTdm2 <- removeSparseTerms(myTdm, sparse = 0.75)
m2 <- as.matrix(myTdm2)
library(cluster)
#take a guess at number of clusters
k <- 8
#apply k-medoids pam algorithm
pamResult <- pam(m2, k)
#print out clusters and terms in each cluster
for (i in 1:k) {
  cat(paste("cluster ", i, ": ", sep = ""))
  cat(rownames(m2)[which(table(rownames(m2), pamResult$clustering)[, i] == 1)], "\n")
}

This outputs:

cluster 1: acquir compani corp
cluster 2: acquisit
cluster 3: common share
cluster 4: group
cluster 5: offer
cluster 6: stock
cluster 7: unit
cluster 8: will

The results seen above are random in nature, as kmeans and pam initially choose random centroids/medoids, so running the above code will give different results each time. If more stable results are required, particularly when developing code, the set.seed() function can be used.

Social Network Graphs

R has a powerful package, igraph, to generate social network graphs such as the Twitter analysis presented earlier. The following code generates a graph of terms, highlighting the key terms in the documents:

#remove sparse terms
termDocMatrix <- removeSparseTerms(myTdm, sparse = 0.75)
termDocMatrix <- as.matrix(termDocMatrix)
#create boolean matrix
termDocMatrix[termDocMatrix >= 1] <- 1
#transform into a term-term adjacency matrix
termMatrix <- termDocMatrix %*% t(termDocMatrix)
library(igraph)
#create graph based on adjacency matrix
g <- graph.adjacency(termMatrix, weighted = T, mode = "undirected")
#remove loops
g <- simplify(g)
#set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)
#show graph
plot(g, layout = layout.fruchterman.reingold(g))
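The key step above is the matrix product: with a boolean term document matrix, multiplying it by its transpose counts, for each pair of terms, the number of documents in which both occur. A plain-Python check of that arithmetic on a made-up 3x3 example (no igraph involved):

```python
# boolean term document matrix: rows = terms, columns = documents
terms = ["share", "stock", "unit"]
tdm = [[1, 1, 0],   # "share" occurs in docs 1 and 2
       [1, 0, 1],   # "stock" occurs in docs 1 and 3
       [0, 0, 1]]   # "unit"  occurs in doc 3

n = len(tdm)
# adjacency[i][j] = number of documents containing both term i and term j
# (this is the product tdm . tdm^T, written out by hand)
adjacency = [[sum(tdm[i][d] * tdm[j][d] for d in range(3)) for j in range(n)]
             for i in range(n)]
for term, row in zip(terms, adjacency):
    print(term, row)
```

The diagonal gives each term's document frequency, and the off-diagonal entries are the co-occurrence counts that become edge weights in the graph (the loops on the diagonal are what simplify() removes).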

Similarly, by numbering documents, a graph can be generated showing which documents are closely related by terms used and which aren’t. There are 100 articles related to acquisitions in the dataset and a graph of how they relate would be too dense to print. The analysis below is instead on articles relating to grain, which produces a clearer network graph.

#remove most common terms, so graph hasn't got too many edges
idx <- which(dimnames(termDocMatrix)$Terms %in% c("tonn", "trade", "price", "week"))
M <- termDocMatrix[-idx, ]
#create document-document adjacency matrix
docsMatrix <- t(M) %*% M
library(igraph)
#create graph based on adjacency matrix
g <- graph.adjacency(docsMatrix, weighted = T, mode = "undirected")
#remove loops
g <- simplify(g)
#remove edges of low weight, i.e. don't connect
#documents that share few terms
g <- delete.edges(g, E(g)[E(g)$weight <= 1])
#remove isolated vertices
g <- delete.vertices(g, V(g)[degree(g) == 0])
#show graph
plot(g, layout = layout.fruchterman.reingold)

Inspecting elements at the centre, e.g. documents 9, 13, 22 and 24, and comparing them to documents on the exterior, e.g. documents 14, 15 and 18, reveals that documents on the interior of the graph relate directly to grain and agriculture and are quite similar in content, while those at the edges have less relevance to grain and are a little more varied in content:

> document[c(9,13,22,24)]
[[1]]
[1] "The U.S. Agriculture Department is not actively considering offering subsidized wheat to the Soviet Union \"The grain companies are trying to get this fired up again,\" an aide to Agriculture Secretary Richard Lyng said. ... REUTER"

[[2]]
[1] "Indonesia's agriculture sector will grow by just 1.0 pct in calendar 1987 ... Production of Indonesia's staple food, rice, is forecast to fall to around 26.3 mln tonnes ... REUTER"

[[3]]
[1] "All major grain producing countries must do their part to help reduce global surpluses ... REUTER"

[[4]]
[1] "Grain trade representatives continued to speculate that the Reagan administration will offer subsidized wheat to the Soviet Union ... REUTER"

> document[c(14,15,18)]
[[1]]
[1] "China's wheat crop this year is seriously threatened ... REUTER"

[[2]]
[1] "Canadian and Egyptian wheat negotiators failed to conclude an agreement on Canadian wheat exports ... REUTER"

[[3]]
[1] "The French Cereals Intervention Board, ONIC, left its estimate of French 1986/87 (July/June) soft wheat deliveries ... Reuter"

Limitations and Issues with Text Mining

While techniques such as stemming and the removal of stopwords or profanity have moved on since the 1990s, these processes can never be unambiguously defined. It therefore remains at the user's discretion, and subject to their bias, how they are implemented and whether or not key data patterns are noticed.

Text mining can be prone to bias relating to the age, sex and occupation of typical social network users, and it is difficult to account for sarcasm, or for the fact that some users only write online when in one particular mood, e.g. only reviewing a hotel if an experience is bad. These are only some of the factors that can skew text analysis results, and they need to be borne in mind when reporting results.

Conclusion

The above examples represent only a sample of what can be done with text mining. Given the level of textual information many companies store, text mining can significantly enhance data mining results. Text mining techniques also have uses in other areas of computer science, e.g. clustering can be used to process images and improve their quality.

Given the complexities of language, text mining can be quite prone to biases in data preparation and analysis. Nevertheless, the knowledge, patterns and results obtained by text mining can often reveal new insights that can be corroborated and checked against other forms of analysis, helping to mitigate any biases.

Authored by:

Liam Murray

Liam Murray is a data driven individual with a passion for Mathematics, Machine Learning, Data Mining and Business Analytics. Most recently, Liam has focused on Big Data Analytics – leveraging Hadoop and statistically driven languages such as R and Python to solve complex business problems. Previously, Liam spent more than six years within the finance industry working on power, renewables & PFI infrastructure sector projects with a focus on the financing of projects as well as the ongoing monitoring of existing assets. As a result, Liam has an acute awareness of the needs & challenges associated with supporting the advanced analytics requirements of an organization.