Do you want to get started with text mining, but find that most tutorials get pretty complex very quickly? Or can’t you find a proper data set to work on?

DataCamp’s latest post will walk you through 8 tips and tricks that will help you to start text mining and to stay hooked on it.

1. Get Curious About Text

The first step to almost anything in data science is to get curious, and text mining is no exception. Get curious about text the way David Robinson, data scientist at Stack Overflow, described in his blog a couple of weeks ago: “I saw a hypothesis […] that simply begged to be investigated with data”. For those of you who are wondering, the hypothesis was this tweet by Todd Vaziri (@tvaziri), posted on August 6, 2016: “Every non-hyperbolic tweet is from iPhone (his staff). Every hyperbolic tweet is from Android (from him).” (pic.twitter.com/GWr6D8h5ed)

Or maybe, if you’re not really into verifying hypotheses, you should get curious about that cool word cloud you saw and realize that you want to reproduce it for yourself.

Do you still need to be convinced of how cool text mining can be? Get inspired by one of the many text mining use cases that recently got a lot of attention in the media, like the text mining and analysis of South Park dialogue, film dialogue, …

2. Get The Skills and Knowledge You Need

Once you have gotten curious, it’s time to step up your game and start developing your text mining knowledge and skills. You can easily do this by completing some tutorials and courses. Look for courses that introduce you to at least some of the steps of a data science workflow, such as data preparation or preprocessing, data exploration, data analysis, …

DataCamp offers some material for those who are looking to get started with text mining: recently, Ted Kwartler wrote a guest tutorial on mining data from Google Trends and Yahoo’s stock service. This easy-to-follow R tutorial lets you learn text mining by doing and is a great start for any text mining beginner. In addition, Ted Kwartler is the instructor of DataCamp’s R course “Text Mining: Bag of Words”, which introduces you to a variety of essential topics for analyzing and visualizing data and lets you practice your newly acquired text mining skills on a real-world case study.

There is also material out there that is not limited to R. For an introduction to text analysis in Python, you can go to this tutorial. Or you can go through this introductory Kaggle tutorial.

Are you more interested in other resources? Go to DataCamp’s Learn Data Science - Resources for Python & R tutorial!

3. Words, Words, Words - Finding Your Data

Once you have gotten the hang of the essential concepts and topics that you need to analyze and visualize your data, it is time to go and find the data! And believe us when we tell you that there are a lot of ways to get your data. Besides Google Trends and Yahoo, which were mentioned above, you can also access data from:

Twitter! Both R and Python offer packages or libraries that allow you to connect to the Twitter API and retrieve tweets. You will learn more about this in the next section.

The Internet Archive, a non-profit library of millions of free books, movies, software, music, websites, and more.

Project Gutenberg offers over 55,000 free ebooks. Most of them are established literature, which makes it a good source if you want to analyze the works of authors like Shakespeare, Jane Austen, or Edgar Allan Poe.

For an academic approach to text mining, you can use the contents of JSTOR’s data for research. It is a free, self-service tool that allows computer scientists, digital humanists, and other researchers to select and interact with content on JSTOR.

If you’re looking to do text mining on series or movies, just like in the examples that were given above, you might want to consider downloading the subtitles. A simple Google search will usually provide you with what you need to form your own corpus and get started on text mining.

You can also get your data from corpora. Two well-known corpora are:

The Reuters Text Corpus. Some will argue that this is not the most diverse corpus to use, but it is excellent if you’re just starting to learn text mining.

The Brown Corpus, which contains text from 500 sources, categorized by genre.



As you can see, the possibilities are endless. Everything that contains text can become the topic of your text mining case study.
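To make the subtitle idea a bit more concrete, here is a minimal sketch of how you might turn a subtitle file into plain lines of dialogue. It assumes the common SubRip (.srt) layout of a cue number, a timing line, and one or more text lines; the function name `srt_to_lines` and the sample dialogue are made up for illustration.

```python
import re

# Matches SubRip timing lines such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}")


def srt_to_lines(srt_text):
    """Return the dialogue lines of an .srt file, without cue numbers, timings or tags."""
    lines = []
    for raw in srt_text.splitlines():
        line = raw.strip()
        # Skip blank separators, cue numbers and timing lines.
        # Caveat: a dialogue line consisting only of digits is dropped as well.
        if not line or line.isdigit() or TIMING.match(line):
            continue
        line = re.sub(r"<[^>]+>", "", line).strip()  # drop simple markup such as <i>...</i>
        if line:
            lines.append(line)
    return lines


# Made-up sample in SubRip layout (an assumption, not taken from a real file).
sample = """1
00:00:01,000 --> 00:00:03,500
<i>Where are we going?</i>

2
00:00:04,000 --> 00:00:06,000
To the library.
"""

corpus = srt_to_lines(sample)
print(corpus)
```

Each element of `corpus` can then be treated as one small document, or you can join the lines per episode or per movie, depending on what you want to analyze.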

5. Preparation Is Half The Battle - Preprocessing Your Data

It probably doesn’t come as a surprise when I tell you that data scientists are often said to spend 80% of their time cleaning their data. Text mining is no exception in this respect. Textual data can be dirty, so you should make sure that you spend enough time cleaning it. If you’re unsure of what preprocessing your data means, some of the standard preprocessing steps include:

Extracting text and structure so that you have the textual format you want to process,

Removing stopwords such as “that” or “and”,

Stemming, which you use to reduce words to their root form (the stem). This can be done with the help of a dictionary or with linguistic rules or algorithms such as Porter’s Algorithm.

These steps may seem hard, but preprocessing your data doesn’t need to be. For the most part, the libraries and packages that were mentioned in the previous section can already help you a lot. For example, the tm library in R allows you to do some preprocessing with its built-in functions: you can stem words, remove stop words, strip extra whitespace, and convert the words to lowercase. Similarly, the nltk package in Python covers much of the preprocessing with its built-in functions.

However, you can still go a step further and do some preprocessing based on regular expressions, which describe the character patterns that interest you. This way, you will also speed up the process of data cleaning a bit. For Python, you can make use of the re library and for R, there are a bunch of functions that can help you out, such as grep(), grepl(), regexpr(), gregexpr(), sub(), gsub(), and strsplit(). If you want to know more about these functions and regular expressions in R, you can always check out this page.
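As a rough illustration of these steps, the sketch below uses only Python’s re library to lowercase a text, strip URLs, mentions, and hashtags with regular expressions, remove stop words, and apply a very crude stemmer. The stop word list is a tiny made-up sample, and `naive_stem` is a deliberately simple stand-in for illustration: it is not Porter’s Algorithm, and a library stemmer would be the better choice in practice.

```python
import re

# Tiny illustrative stop word list (an assumption); real lists are much longer.
STOPWORDS = {"a", "an", "and", "are", "is", "of", "that", "the", "to"}


def clean(text):
    """Lowercase the text, strip tweet-style noise, and split it into word tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)       # remove @mentions and #hashtags
    return re.findall(r"[a-z']+", text)        # keep runs of letters and apostrophes


def naive_stem(word):
    """Crude suffix stripping for illustration only; this is NOT Porter's Algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Clean the text, drop stop words, and stem what is left."""
    return [naive_stem(token) for token in clean(text) if token not in STOPWORDS]


tokens = preprocess(
    "The miners are cleaning texts and @bob tweeted that: https://example.com #nlp"
)
print(tokens)
```

The order of the steps matters: stop words are removed before stemming here, so that the stop word list can contain ordinary, unstemmed words.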

6. Data Scientist’s Adventures in Wonderland - Exploring Your Data

By now, you will be excited to get started on your analysis. It is, however, always a good idea to take a look at your data before you start. Here are some ideas to quickly get started on exploring your data with the help of the base packages or the libraries that have been mentioned above:

Create a document term matrix: elements in this matrix represent the occurrence of a term (a word or an n-gram) in a document of the corpus.

After you have made the document term matrix, you can use a histogram to visualize the frequency of the words in your corpus.

You might also be interested in knowing the correlation between two or more terms in your corpus.

To visualize your corpus, you can also make a word cloud. In R, you can make use of the wordcloud library. A Python package with the same name also exists if you want to do the same in Python.
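To give you a feel for the first three ideas, here is a small sketch in plain Python, without tm or nltk. It builds a document term matrix from three made-up documents, sums the columns to get term frequencies, prints a rough text-based frequency chart, and computes the Pearson correlation between two term columns. The documents and the helper names `pearson` and `column` are assumptions for illustration.

```python
from collections import Counter
from math import sqrt

# Three made-up, already preprocessed documents.
docs = [
    "text mining is fun",
    "text mining needs clean text",
    "word clouds are fun",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({term for doc in tokenized for term in doc})

# Document term matrix: one row per document, one column per term, raw counts as cells.
dtm = []
for doc in tokenized:
    counts = Counter(doc)
    dtm.append([counts[term] for term in vocab])

# Term frequencies over the whole corpus are the column sums of the matrix.
term_freq = {term: sum(row[j] for row in dtm) for j, term in enumerate(vocab)}


def column(term):
    """Return the counts of one term across all documents."""
    j = vocab.index(term)
    return [row[j] for row in dtm]


def pearson(xs, ys):
    """Pearson correlation of two equally long count vectors (0.0 if one is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


# A rough text-based frequency chart, most frequent terms first.
for term, count in sorted(term_freq.items(), key=lambda item: -item[1]):
    print(f"{term:<8}{'#' * count}")

print(round(pearson(column("text"), column("mining")), 3))
```

With only three documents the correlation is not very meaningful, but the same calculation applies to a real corpus with many more rows.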

The nice thing about exploring your data before diving into your analysis is that you already have an idea of what you’ll be working with. If you see in the document term matrix or the histogram that you have a lot of sparse terms, that is, terms that occur in only a few documents, you can decide to remove them from your corpus.
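Removing sparse terms could look something like the sketch below, which keeps only the terms that appear in a minimum number of documents. The function name `remove_sparse_terms`, the `min_docs` threshold, and the sample documents are assumptions for illustration; the right threshold depends on the size of your corpus.

```python
from collections import Counter

# Made-up tokenized documents.
tokenized = [
    ["text", "mining", "is", "fun"],
    ["text", "mining", "needs", "clean", "text"],
    ["word", "clouds", "are", "fun"],
]


def remove_sparse_terms(docs, min_docs=2):
    """Keep only the terms that appear in at least `min_docs` documents."""
    # Count each term once per document, so this is a document frequency.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [[term for term in doc if doc_freq[term] >= min_docs] for doc in docs]


filtered = remove_sparse_terms(tokenized)
print(filtered)
```

Note that the filter looks at the number of documents a term occurs in, not at its total count, so a word repeated many times in a single document still counts as sparse.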

7. Level Up Your Text Mining Skills

When you have preprocessed your data and have done a basic textual analysis with the tools that were mentioned in the previous step, you might also consider using your data set to broaden your text mining skills. Because there is so much more: you have only seen the tip of the iceberg when it comes to text mining.

Firstly, you should consider exploring the difference between text mining and Natural Language Processing (NLP). With NLP, you will discover Named Entity Recognition, POS tagging and parsers, sentiment analysis, … More NLP libraries in R can be found on this page. For Python, you can make use of the nltk package. You can find a full tutorial on sentiment analysis with the nltk package here.

Besides these packages, you can check out more tools to get started on topics such as deep learning and statistical topic detection modeling (such as Latent Dirichlet Allocation or LDA), among the many others that exist. Some of the packages that you can use to approach these topics are listed below:

Python packages: use gensim to implement word2vec, among others, and GloVe for word embeddings. theano should probably also be on your list if you want to explore deep learning further. Lastly, use gensim if you want to implement LDA.

R packages: for an approach to vectorization and word embeddings, use text2vec. If, however, you’re more interested in getting into sentiment analysis, the syuzhet library in combination with the tm library is probably the way to go. Finally, the topicmodels library for R is ideal for statistical topic detection modeling.
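To get a first feel for sentiment analysis before you dive into these packages, here is a deliberately tiny sketch of the lexicon-based idea: every word carries a score and a text gets the sum of the scores of its words. The lexicon below is a made-up toy example and the function names are hypothetical. This is not how nltk or syuzhet are implemented, and it ignores negation (“not good”) entirely.

```python
import re

# Hypothetical toy lexicon; real sentiment lexicons contain thousands of scored words.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "awful": -2, "hate": -2}


def sentiment_score(text):
    """Sum the lexicon scores of all words in the text; unknown words count as 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(token, 0) for token in tokens)


def label(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for sentence in ["I love this great course", "An awful, bad tutorial", "Just some text"]:
    score = sentiment_score(sentence)
    print(sentence, "->", score, label(score))
```

Once this simple approach stops being good enough, that is a good moment to move on to the dedicated packages listed above.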

And these packages are by no means all that exists. Since text mining is a pretty hot topic, research has produced a lot to discover in the past few years, and you can expect it to stay important in the years to come with multimedia mining, multilingual text mining, …

8. More Than Words - Visualizing Your Results

Don’t forget to communicate the results of your analysis! This is probably one of the most wonderful things that you can do, since visual representations attract people. Your visualizations are your story. So don’t hesitate to visualize the correlations or topics you have found in your analysis.

For both Python and R, there are specific packages that will help you to do this, so complete your list of packages with these data visualization libraries to present your results:

For Python, you might consider using the NetworkX package to visualize complex networks. The matplotlib package can come in handy for other types of visualizations. The plotly package, which allows you to make interactive, publication-quality graphs online, is also one of the go-to packages for presenting your results visually. A tip for all of those who are huge fans of data visualization: try linking Python and D3, the JavaScript library for dynamic data manipulation and visualization that allows your audience to become active participants in the data visualization process.

For R, besides the libraries that you will already know, such as ggplot2, which is always a good idea to use, you can also use the igraph library to analyze follower, following, and retweet relationships. Do you want even more? Consider checking out plotly and networkD3 to link R and JavaScript, or the LDAvis library to interactively visualize topic models.
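Network visualizations need a list of edges before anything can be drawn. As a rough sketch of that preparation step, the code below counts how often two terms occur in the same document and returns weighted edges. The sample documents, the function name `cooccurrence_edges`, and the `min_weight` threshold are assumptions for illustration; the actual drawing is left to the packages mentioned above.

```python
from collections import Counter
from itertools import combinations

# Made-up tokenized documents.
tokenized = [
    ["text", "mining", "fun"],
    ["text", "mining", "clean"],
    ["word", "cloud", "fun"],
]


def cooccurrence_edges(docs, min_weight=1):
    """Return (term_a, term_b, weight) for term pairs that share at least `min_weight` documents."""
    weights = Counter()
    for doc in docs:
        # Sort the unique terms so that each pair is always stored in the same order.
        for term_a, term_b in combinations(sorted(set(doc)), 2):
            weights[(term_a, term_b)] += 1
    return [(a, b, w) for (a, b), w in weights.items() if w >= min_weight]


edges = cooccurrence_edges(tokenized, min_weight=2)
print(edges)
```

Tuples in this shape should be straightforward to load into a graph library; NetworkX, for example, accepts them through `Graph.add_weighted_edges_from`.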