Parsing Options

We can guide MLDataTable to make parsing easier. MLDataTable.ParsingOptions lets us describe the custom format of our data, with settings like the delimiter, the end-of-line character (lineTerminator), or whether the data contains a header row (containsHeader). We can even choose which columns to parse by passing their names to the selectColumns parameter. With the right options, more lines parse successfully.
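As a rough sketch of these options, the snippet below builds a ParsingOptions value for a comma-delimited file with a header row and carriage-return line breaks. The file name reviews.csv, the column names author and text, and the helper loadTable are assumptions made for illustration, not part of the original data set.

```swift
import CreateML
import Foundation

// Parsing hints for a CSV that has a header row, uses commas between
// columns and "\r" (carriage return) at the end of each line.
// The column names "author" and "text" are hypothetical.
let options = MLDataTable.ParsingOptions(
    containsHeader: true,
    delimiter: ",",
    lineTerminator: "\r",
    selectColumns: ["author", "text"]
)

// Hypothetical helper: loads a table from a file using the options above.
func loadTable(from url: URL) throws -> MLDataTable {
    return try MLDataTable(contentsOf: url, options: options)
}

// Usage (the path is a placeholder):
// let table = try loadTable(from: URL(fileURLWithPath: "reviews.csv"))
```

Any option we leave out keeps its default value, so we only need to spell out the settings that differ from a standard CSV file.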

Once we indicate that the file uses \r (carriage return) for line breaks, the parser has a much easier time. With this guidance, the number of lines that fail to parse drops to 2.

In this case, the problem is the commas used inside the sentences. The parser can't tell them apart, because commas serve both as column delimiters and as punctuation. We’d better remove the punctuation before training.

When we click the eye icon in the assistant editor, it shows us the summary of the MLDataTable.

Creating the MLDataTable with a Dictionary

We can create the MLDataTable with a dictionary as well. Types that conform to the MLDataValueConvertible protocol can be converted to a value in a data table. Array and Dictionary structures already conform to this protocol.

The code below shows converting a dictionary to the data table:
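Here is a minimal sketch of that conversion. The column names text and label and the sample sentences are made up for illustration.

```swift
import CreateML

// Each key becomes a column name and each array becomes that column's values.
// All arrays must have the same length. The sample data is hypothetical.
let data: [String: MLDataValueConvertible] = [
    "text": ["I love this phone", "Terrible service", "It was fine"],
    "label": ["positive", "negative", "neutral"]
]

let dictionaryTable = try MLDataTable(dictionary: data)
print(dictionaryTable)
```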

Removing a Column

We can use the removeColumn method to remove the columns that we don’t want to include in the training process.
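A short sketch of this, assuming a small in-memory table with a hypothetical author column that we don't need for training:

```swift
import CreateML

// Hypothetical sample table; "author" is the column we want to drop.
var reviews = try MLDataTable(dictionary: [
    "author": ["Ann", "Bob"],
    "text": ["Great product", "Not worth it"],
    "label": ["positive", "negative"]
])

// removeColumn mutates the table in place, so the table must be a var.
reviews.removeColumn(named: "author")
print(reviews)
```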

Adding a New Column

We can create a new column in our data table by creating an MLDataColumn and adding it with the addColumn method:
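For instance, a minimal sketch with a made-up table and a hypothetical source column could look like this:

```swift
import CreateML

// Hypothetical sample table with two rows.
var words = try MLDataTable(dictionary: [
    "text": ["good", "bad"],
    "label": ["positive", "negative"]
])

// The new column needs one value per existing row.
let sourceColumn = MLDataColumn(["survey", "forum"])

// addColumn also mutates the table in place.
words.addColumn(sourceColumn, named: "source")
print(words)
```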

Okay, so far so good. Let’s have a look at another sample. In this case, we want to create a new column by merging two other columns. How can we do that?

To show this use case, let’s say we have two subjective sentiment scores for the words in our data table.

The scores are between -2 and +2 (negative to positive). We want to calculate the average of these two scores and label each word as positive, negative, or neutral. We’ll write this label into a new column.
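One way to sketch this is shown below. To keep the example simple, the averages are computed from the plain Swift arrays that feed the table rather than from the table's columns. The column names score1, score2 and sentiment, the sample words, and the thresholds (above zero is positive, below zero is negative, exactly zero is neutral) are all assumptions.

```swift
import CreateML

// Hypothetical words with two subjective scores each, in the range -2...+2.
let sampleWords = ["great", "awful", "okay"]
let firstScores: [Double] = [2, -1, 1]
let secondScores: [Double] = [1, -2, -1]

var sentimentTable = try MLDataTable(dictionary: [
    "word": sampleWords,
    "score1": firstScores,
    "score2": secondScores
])

// Average the two scores per word and map the average to a label.
// The zero threshold is an assumption made for this sketch.
let sentimentLabels: [String] = zip(firstScores, secondScores).map { first, second in
    let average = (first + second) / 2
    if average > 0 { return "positive" }
    if average < 0 { return "negative" }
    return "neutral"
}

// Wrap the labels in a column and attach it to the table.
sentimentTable.addColumn(MLDataColumn(sentimentLabels), named: "sentiment")
print(sentimentTable)
```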

After running this code in Create ML, the data table will have a new column like in the image below.

Extra Methods

dropDuplicates: removes the duplicate rows in the table.

dropMissing: removes the rows that have missing values.

fillMissing: fills the missing values in the named column.

randomSplit: splits the data into two sets, which is useful for creating training and testing data sets. It takes a value between 0.0 and 1.0, indicating the fraction of rows that go into one subset or the other.
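The methods above can be chained into a small clean-up step. This is only a sketch: the sample rows, the fill value "neutral", the 0.8 proportion and the seed are assumptions.

```swift
import CreateML

// Hypothetical table in which the first and last rows are identical.
let rawTable = try MLDataTable(dictionary: [
    "text": ["good", "bad", "fine", "good"],
    "label": ["positive", "negative", "neutral", "positive"]
])

// Each of these methods returns a new table instead of mutating the original.
let cleanedTable = rawTable
    .dropDuplicates()                                    // remove repeated rows
    .dropMissing()                                       // remove rows with missing values
    .fillMissing(columnNamed: "label", with: .string("neutral"))

// Split into two tables; with 0.8, roughly 80% of the rows are expected
// in the first one. The split is random, so exact sizes can vary.
let (trainingData, testingData) = cleanedTable.randomSplit(by: 0.8, seed: 5)
print(trainingData.rows.count, testingData.rows.count)
```

Since dropMissing already removes the rows with missing values, the fillMissing call has nothing left to fill here. In practice we would pick one of the two strategies for a given column.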

After finishing the pre-processing, we can give this MLDataTable directly to a model and start the training process of our machine learning model.

We use MLTextClassifier to train a text classification model. It takes the text column and the label column as parameters. We hit the play button to start training and track the progress in the debug area.
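A minimal sketch of that last step might look like the following. The helper trainSentimentClassifier and the column names text and label are assumptions, and a real data set needs far more rows than a toy table to train a useful model.

```swift
import CreateML

// Hypothetical helper: trains a text classifier on a pre-processed table
// that has a "text" column (input) and a "label" column (target).
func trainSentimentClassifier(on table: MLDataTable) throws -> MLTextClassifier {
    return try MLTextClassifier(
        trainingData: table,
        textColumn: "text",
        labelColumn: "label"
    )
}

// Usage, assuming preprocessedTable is the MLDataTable prepared earlier:
// let classifier = try trainSentimentClassifier(on: preprocessedTable)
```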