1. Preprocessing:

Just as Tim has his drawer, we have different data sources holding different types of data. In a way, we humans are drawers too, filled with our biometrics, phone calls, contact lists, YouTube views, films, series, songs, text messages, tweets, books, activities, commutes, and so on.

That said, we can refer to two main types of data:

Figure 1: Example of relational data (SQL table). Source

Structured data: information that has a pre-defined structure, which is typically represented numerically but can also include text (to denote classes, for example). A good example is an SQL table. SQL is a standard language for storing, manipulating and retrieving data in databases. This means that the data is structured and related through different columns. In Figure 1 we can see two columns that take numerical values (EmployeeId and DepartmentId) and two others that hold textual information (LastName and Country).

Unstructured data: information that does not have a well-defined, systematic structure. It makes up a large share of the data that we consume and produce every day. When we deal with natural language, we are dealing with unstructured data: we can’t specify a universal structure or a fixed range of values that a sentence can take (in contrast with the example above).
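To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module. The column names follow Figure 1, but the table name, the rows and the example sentence are invented for illustration and are not taken from the figure.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Structured data: every column has a name and a declared type.
conn.execute(
    "CREATE TABLE Employee ("
    "EmployeeId INTEGER, LastName TEXT, Country TEXT, DepartmentId INTEGER)"
)

# Hypothetical rows, made up for this example.
rows = [
    (1, "Smith", "Australia", 10),
    (2, "Garcia", "Spain", 20),
]
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?, ?)", rows)

# The schema itself can be inspected: column name -> declared type.
column_types = {
    row[1]: row[2] for row in conn.execute("PRAGMA table_info(Employee)")
}

# Because the structure is known, we can ask precise questions by column.
spanish_employees = conn.execute(
    "SELECT LastName FROM Employee WHERE Country = ?", ("Spain",)
).fetchall()

# Unstructured data: a plain string. Similar facts may be in here,
# but there are no columns or types to query against.
sentence = "Garcia moved from Spain to join the new department last spring."

print(column_types)
print(spanish_employees)
print(sentence)
```

With the table, answering "who works in Spain?" is a single query. With the sentence, we would first have to work out where the name and the country are, which is exactly the kind of structuring that NLP tools aim to provide.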

I like to think of this type of information as information whose (cognitive) processing and manipulation we do not yet fully understand. We can think of NLP as the set of tools that can be applied to structure natural language for different purposes. As mentioned in the first part of the series, our dataset will be called a corpus (the plural is corpora), since it is composed of a set of texts. We can think of this corpus as the set of Lego© pieces that Tim selects initially, after discarding pieces from other puzzles.

Preprocessing (source)

When Tim removes the damaged or useless pieces from his set, he is carrying out the preprocessing step, in which he tries to keep only the pieces that are useful for the building process. In NLP, preprocessing the data is called text normalization or data preparation, since we are trying to ‘normalize’ the elements of our corpus in some way.

The following posts will cover the different tools that can be applied in each step. As a quick illustration, if we have both cat and cats in our corpus, we would be interested in normalizing them by unifying both terms into cat (this, as we will see, is called stemming). Other examples would be splitting the corpus into sentences, or removing URLs and other elements that may be present in our corpus but are not of interest for the task.
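The three normalization examples just mentioned can be sketched in a few lines of Python using only the standard library. This is a rough illustration under stated assumptions, not the method the later posts will necessarily use: the function names, the regular expressions and the sample text are all invented here, and the "stemmer" is a toy rule that only strips a trailing s, far cruder than a real stemming algorithm.

```python
import re

# Assumed, simplified pattern: anything starting with http:// or https://
URL_PATTERN = re.compile(r"https?://\S+")


def remove_urls(text):
    """Delete URLs from the text."""
    return URL_PATTERN.sub("", text)


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def naive_stem(token):
    """Toy stemming rule: drop a final 's' from longer words (cats -> cat)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalize(text):
    """Return one list of lowercased, stemmed tokens per sentence."""
    sentences = split_sentences(remove_urls(text))
    return [
        [naive_stem(token) for token in re.findall(r"[a-z]+", sentence.lower())]
        for sentence in sentences
    ]


# Hypothetical mini-corpus, made up for this example.
corpus = "My cat sleeps all day. Cats are great! See https://example.com for more."
normalized = normalize(corpus)

for tokens in normalized:
    print(tokens)
```

After normalization, both cat and cats end up as the same token, and the URL no longer appears in the output. A rule this simple will also mangle some words it should leave alone, which is one reason dedicated stemming tools exist.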