This article was peer reviewed by Wern Ancheta. Thanks to all of SitePoint’s peer reviewers for making SitePoint content the best it can be!

As of late, it seems everyone and their proverbial grandma is talking about Machine Learning. Your social media feeds are inundated with posts about ML, Python, TensorFlow, Spark, Scala, Go and so on; and if you are anything like me, you might be wondering, what about PHP?

Yes, what about Machine Learning and PHP? Fortunately, someone was crazy enough not only to ask that question, but also to develop a generic machine learning library that we can use in our next project. In this post we are going to take a look at PHP-ML, a machine learning library for PHP, and we’ll write a sentiment analysis class that we can later reuse for our own chat or tweet bot. The main goals of this post are:

Explore the general concepts around Machine learning and Sentiment Analysis

Review the capabilities and shortcomings of PHP-ML

Define the problem we are going to work on

Prove that trying to do Machine learning in PHP isn’t a completely crazy goal (optional)

What is Machine Learning?

Machine learning is a subset of Artificial Intelligence that focuses on giving “computers the ability to learn without being explicitly programmed”. This is achieved by using generic algorithms that can “learn” from a particular set of data.

For example, one common usage of machine learning is classification. Classification algorithms are used to put data into different groups or categories. Some examples of classification applications are:

Email spam filters

Market segmentation

Fraud detection

Machine learning is something of an umbrella term that covers many generic algorithms for different tasks, and there are two main algorithm types classified on how they learn – supervised learning and unsupervised learning.

Supervised Learning

In supervised learning, we train our algorithm using labelled data in the form of an input object (vector) and a desired output value; the algorithm analyzes the training data and produces what is referred to as an inferred function which we can apply to a new, unlabelled dataset.

For the remainder of this post we will focus on supervised learning, simply because it’s easier to see and validate the relationship between input and output. Keep in mind that both approaches are equally important and interesting; one could even argue that unsupervised learning is more useful because it removes the need for labelled data.

Unsupervised Learning

This type of learning on the other hand works with unlabelled data from the get-go. We don’t know the desired output values of the dataset and we are letting the algorithm draw inferences from datasets; unsupervised learning is especially handy when doing exploratory data analysis to find hidden patterns in the data.

PHP-ML

Meet PHP-ML, a library that claims to be a fresh approach to Machine Learning in PHP. The library implements algorithms, neural networks, and tools to do data pre-processing, cross validation, and feature extraction.

I’ll be the first to admit PHP is an unusual choice for machine learning, as the language’s strengths are not that well suited for Machine Learning applications. That said, not every machine learning application needs to process petabytes of data and do massive calculations – for simple applications, we should be able to get away with using PHP and PHP-ML.

The best use case that I can see for this library right now is the implementation of a classifier, be it something like a spam filter or even sentiment analysis. We are going to define a classification problem and build a solution step by step to see how we can use PHP-ML in our projects.

The Problem

To exemplify the process of implementing PHP-ML and adding some machine learning to our applications, I wanted to find a fun problem to tackle, and what better way to showcase a classifier than building a tweet sentiment analysis class?

One of the key requirements needed to build successful machine learning projects is a decent starting dataset. Datasets are critical since they will allow us to train our classifier against already classified examples. As there has recently been significant noise in the media around airlines, what better dataset to use than tweets from customers to airlines?

Fortunately, a dataset of tweets is already available to us thanks to Kaggle. The Twitter US Airline Sentiment dataset can be downloaded from their site.

The Solution

Let’s begin by taking a look at the dataset we will be working on. The raw dataset has the following columns:

tweet_id

airline_sentiment

airline_sentiment_confidence

negativereason

negativereason_confidence

airline

airline_sentiment_gold

name

negativereason_gold

retweet_count

text

tweet_coord

tweet_created

tweet_location

user_timezone

And it looks like the following example (side-scrollable table):

| tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
| 570306133677760513 | neutral | 1.0 | | | Virgin America | | cairdin | | 0 | @VirginAmerica What @dhepburn said. | | 2015-02-24 11:35:52 -0800 | | Eastern Time (US & Canada) |
| 570301130888122368 | positive | 0.3486 | | 0.0 | Virgin America | | jnardino | | 0 | @VirginAmerica plus you’ve added commercials to the experience… tacky. | | 2015-02-24 11:15:59 -0800 | | Pacific Time (US & Canada) |
| 570301083672813571 | neutral | 0.6837 | | | Virgin America | | yvonnalynn | | 0 | @VirginAmerica I didn’t today… Must mean I need to take another trip! | | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 570301031407624196 | negative | 1.0 | Bad Flight | 0.7033 | Virgin America | | jnardino | | 0 | @VirginAmerica it’s really aggressive to blast obnoxious “entertainment” in your guests’ faces & they have little recourse | | 2015-02-24 11:15:36 -0800 | | Pacific Time (US & Canada) |
| 570300817074462722 | negative | 1.0 | Can’t Tell | 1.0 | Virgin America | | jnardino | | 0 | @VirginAmerica and it’s a really big bad thing about it | | 2015-02-24 11:14:45 -0800 | | Pacific Time (US & Canada) |
| 570300767074181121 | negative | 1.0 | Can’t Tell | 0.6842 | Virgin America | | jnardino | | 0 | @VirginAmerica seriously would pay $30 a flight for seats that didn’t have this playing. it’s really the only bad thing about flying VA | | 2015-02-24 11:14:33 -0800 | | Pacific Time (US & Canada) |
| 570300616901320704 | positive | 0.6745 | | 0.0 | Virgin America | | cjmcginnis | | 0 | @VirginAmerica yes nearly every time I fly VX this “ear worm” won’t go away :) | | 2015-02-24 11:13:57 -0800 | San Francisco CA | Pacific Time (US & Canada) |
| 570300248553349120 | neutral | 0.634 | | | Virgin America | | pilot | | 0 | @VirginAmerica Really missed a prime opportunity for Men Without Hats parody there. https://t.co/mWpG7grEZP | | 2015-02-24 11:12:29 -0800 | Los Angeles | Pacific Time (US & Canada) |

The file contains 14,640 tweets, so it’s a decent dataset for us to work with. It also has far more columns than we need for our example; for practical purposes we only care about the following two:

text

airline_sentiment

Here, text will become our feature and airline_sentiment will become our target. The rest of the columns can be discarded as they will not be used in our exercise. Let’s start by creating the project and initializing Composer with the following composer.json file:

{
    "name": "amacgregor/phpml-exercise",
    "description": "Example implementation of a Tweet sentiment analysis with PHP-ML",
    "type": "project",
    "require": {
        "php-ai/php-ml": "^0.4.1"
    },
    "license": "Apache License 2.0",
    "authors": [
        {
            "name": "Allan MacGregor",
            "email": "amacgregor@allanmacgregor.com"
        }
    ],
    "autoload": {
        "psr-4": {"PhpmlExercise\\": "src/"}
    },
    "minimum-stability": "dev"
}

composer install

If you need an introduction to Composer, see the official Composer documentation.

To make sure we are set up correctly, let’s create a quick script that will load our Tweets.csv data file and make sure it has the data we need. Copy the following code as reviewDataset.php in the root of our project:

<?php
namespace PhpmlExercise;
require __DIR__ . '/vendor/autoload.php';
use Phpml\Dataset\CsvDataset;
$dataset = new CsvDataset('datasets/raw/Tweets.csv', 1);
foreach ($dataset->getSamples() as $sample) {
    print_r($sample);
}

Now, run the script with php reviewDataset.php, and let’s review the output:

Array
(
    [0] => 569587371693355008
)
Array
(
    [0] => 569587242672398336
)
Array
(
    [0] => 569587188687634433
)
Array
(
    [0] => 569587140490866689
)

Now that doesn’t look useful, does it? Let’s take a look at the CsvDataset class to get a better idea of what’s happening internally:

<?php
public function __construct(string $filepath, int $features, bool $headingRow = true)
{
    if (!file_exists($filepath)) {
        throw FileException::missingFile(basename($filepath));
    }
    if (false === $handle = fopen($filepath, 'rb')) {
        throw FileException::cantOpenFile(basename($filepath));
    }
    if ($headingRow) {
        $data = fgetcsv($handle, 1000, ',');
        $this->columnNames = array_slice($data, 0, $features);
    } else {
        $this->columnNames = range(0, $features - 1);
    }
    while (($data = fgetcsv($handle, 1000, ',')) !== false) {
        $this->samples[] = array_slice($data, 0, $features);
        $this->targets[] = $data[$features];
    }
    fclose($handle);
}

The CsvDataset constructor takes 3 arguments:

A file-path to the source CSV

An integer that specifies the number of features in our file

A boolean to indicate if the first row is a header

If we look a little closer we can see that the class is mapping out the CSV file into two internal arrays: samples and targets. Samples contains all the features provided by the file and targets contains the known values (negative, positive, or neutral).

Based on the above, our CSV file needs to follow this format:

| feature_1 | feature_2 | feature_n | target |
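To make that mapping concrete, here is a minimal sketch in plain PHP of the same slicing logic applied to rows that are already in memory. The splitRows() function and the two sample tweets are made up for illustration and are not part of PHP-ML. It also explains our earlier output: with the raw file and a feature count of 1, the first column (tweet_id) became the sample.

```php
<?php
// Hypothetical in-memory rows standing in for a parsed CSV file.
$rows = [
    ['text', 'airline_sentiment'], // header row
    ['@airline thanks for the upgrade!', 'positive'],
    ['@airline two hours late again', 'negative'],
];

/**
 * Split parsed CSV rows into samples and targets.
 * The first $features columns form the sample; the next column is the target.
 */
function splitRows(array $rows, int $features, bool $headingRow = true): array
{
    if ($headingRow) {
        array_shift($rows); // discard the header row
    }

    $samples = [];
    $targets = [];
    foreach ($rows as $row) {
        $samples[] = array_slice($row, 0, $features);
        $targets[] = $row[$features];
    }

    return [$samples, $targets];
}

[$samples, $targets] = splitRows($rows, 1);
// $samples holds one single-element array per tweet,
// $targets holds ['positive', 'negative'].
```

Note that every sample is itself an array, even when there is only one feature.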

We will need to generate a clean dataset with only the columns we need to continue working. Let’s call this script generateCleanDataset.php:

<?php
namespace PhpmlExercise;
require __DIR__ . '/vendor/autoload.php';
use Phpml\Exception\FileException;
$sourceFilepath = __DIR__ . '/datasets/raw/Tweets.csv';
$destinationFilepath = __DIR__ . '/datasets/clean_tweets.csv';
$rows = [];
$rows = getRows($sourceFilepath, $rows);
writeRows($destinationFilepath, $rows);
// Keep only the tweet text (column 10) and its sentiment (column 1).
function getRows($filepath, $rows)
{
    $handle = checkFilePermissions($filepath);
    while (($data = fgetcsv($handle, 1000, ',')) !== false) {
        $rows[] = [$data[10], $data[1]];
    }
    fclose($handle);
    return $rows;
}
function checkFilePermissions($filepath, $mode = 'rb')
{
    // Only files opened for reading must already exist; 'wb' creates the file.
    if ($mode[0] === 'r' && !file_exists($filepath)) {
        throw FileException::missingFile(basename($filepath));
    }
    if (false === $handle = fopen($filepath, $mode)) {
        throw FileException::cantOpenFile(basename($filepath));
    }
    return $handle;
}
function writeRows($filepath, $rows)
{
    $handle = checkFilePermissions($filepath, 'wb');
    foreach ($rows as $row) {
        fputcsv($handle, $row);
    }
    fclose($handle);
}

Nothing too complex, just enough to do the job. Let’s execute it with php generateCleanDataset.php.

Now, let’s go ahead and point our reviewDataset.php script to the clean dataset and run it again:

Array
(
    [0] => @AmericanAir That will be the third time I have been called by 800-433-7300 an hung on before anyone speaks. What do I do now???
)
Array
(
    [0] => @AmericanAir How clueless is AA. Been waiting to hear for 2.5 weeks about a refund from a Cancelled Flightled flight & been on hold now for 1hr 49min
)

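BAM! This is data we can work with! So far, we have been creating simple scripts to manipulate the data. Next, we are going to create a new class in src/Classification/SentimentAnalysis.php. The directory name is capitalized so that it matches the namespace under PSR-4 autoloading on case-sensitive filesystems.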

<?php
namespace PhpmlExercise\Classification;
class SentimentAnalysis
{
    public function train()
    {
    }
    public function predict()
    {
    }
}

Our sentiment analysis class will need two functions:

A train function, which will take our dataset training samples and labels and some optional parameters.

A predict function, which will take an unlabelled dataset and assign a set of labels based on the training data.

In the root of the project, create a script called classifyTweets.php. We will use this script to instantiate and test our sentiment analysis class. Here is the template that we will use:

<?php
namespace PhpmlExercise;
use PhpmlExercise\Classification\SentimentAnalysis;
require __DIR__ . '/vendor/autoload.php';

Step 1: Load the Dataset

We already have the basic code that we can use for loading a CSV into a dataset object from our earlier examples. We are going to use the same code with a few tweaks:

<?php
...
use Phpml\Dataset\CsvDataset;
...
$dataset = new CsvDataset('datasets/clean_tweets.csv', 1);
$samples = [];
foreach ($dataset->getSamples() as $sample) {
    $samples[] = $sample[0];
}

This generates a flat array with only the features – in this case the tweet text – which we are going to use to train our classifier.

Step 2: Prepare the Dataset

Now, having the raw text and passing that to a classifier wouldn’t be useful or accurate since every tweet is essentially different. Fortunately, there are ways of dealing with text when trying to apply classification or machine learning algorithms. For this example, we are going to make use of the following two classes:

Token Count Vectorizer: This will transform a collection of text samples into vectors of token counts. Essentially, every word in our tweets is assigned a unique number, and the vectorizer keeps track of the number of occurrences of each word in a specific text sample.

Tf-idf Transformer: Short for term frequency–inverse document frequency, this is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

Let’s start with our text vectorizer:

<?php
...
use Phpml\FeatureExtraction\TokenCountVectorizer;
use Phpml\Tokenization\WordTokenizer;
...
$vectorizer = new TokenCountVectorizer(new WordTokenizer());
$vectorizer->fit($samples);
$vectorizer->transform($samples);
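If you are curious about what token counting looks like under the hood, here is a rough sketch in plain PHP. It is a simplification written for this explanation, not PHP-ML's actual code: the tokenize(), buildVocabulary() and countTokens() helpers are hypothetical, and the lowercasing and the word regex are my own assumptions about reasonable tokenization.

```php
<?php
/** Split a text into lowercase word tokens (a simplification, not PHP-ML's tokenizer). */
function tokenize(string $text): array
{
    preg_match_all('/\w+/u', strtolower($text), $matches);

    return $matches[0];
}

/** Map each distinct token to a column index, in order of first appearance. */
function buildVocabulary(array $texts): array
{
    $vocabulary = [];
    foreach ($texts as $text) {
        foreach (tokenize($text) as $token) {
            if (!isset($vocabulary[$token])) {
                $vocabulary[$token] = count($vocabulary);
            }
        }
    }

    return $vocabulary;
}

/** Turn each text into a vector of token counts, one column per vocabulary entry. */
function countTokens(array $texts, array $vocabulary): array
{
    $vectors = [];
    foreach ($texts as $text) {
        $vector = array_fill(0, count($vocabulary), 0);
        foreach (tokenize($text) as $token) {
            if (isset($vocabulary[$token])) {
                $vector[$vocabulary[$token]]++;
            }
        }
        $vectors[] = $vector;
    }

    return $vectors;
}

$texts = ['late flight late again', 'great flight'];
$vocabulary = buildVocabulary($texts); // late => 0, flight => 1, again => 2, great => 3
$vectors = countTokens($texts, $vocabulary);
// $vectors is [[2, 1, 1, 0], [0, 1, 0, 1]]
```

Every tweet ends up as a vector of the same length, which is what lets a classifier compare them.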

Next, apply the Tf-idf Transformer:

<?php
...
use Phpml\FeatureExtraction\TfIdfTransformer;
...
$tfIdfTransformer = new TfIdfTransformer();
$tfIdfTransformer->fit($samples);
$tfIdfTransformer->transform($samples);
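As a rough illustration of the idea, the sketch below applies one common textbook form of tf-idf to a pair of count vectors: each count is multiplied by log10(N / df), where N is the number of documents and df is the number of documents that contain the token. The tfIdf() function is hypothetical, and the choice of logarithm base and the lack of smoothing are assumptions on my part; libraries differ on those details.

```php
<?php
/**
 * Weight token counts by inverse document frequency.
 * idf(t) = log10(N / df(t)), with N documents and df(t) documents containing token t.
 */
function tfIdf(array $vectors): array
{
    $documents = count($vectors);

    // Count in how many documents each token (column) appears.
    $documentFrequency = [];
    foreach ($vectors as $vector) {
        foreach ($vector as $index => $count) {
            $documentFrequency[$index] = ($documentFrequency[$index] ?? 0) + ($count > 0 ? 1 : 0);
        }
    }

    $weighted = [];
    foreach ($vectors as $vector) {
        $row = [];
        foreach ($vector as $index => $count) {
            $df = $documentFrequency[$index];
            $idf = $df > 0 ? log10($documents / $df) : 0.0;
            $row[$index] = $count * $idf;
        }
        $weighted[] = $row;
    }

    return $weighted;
}

$counts = [
    [2, 1, 1, 0], // "late flight late again"
    [0, 1, 0, 1], // "great flight"
];
$weighted = tfIdf($counts);
// Column 1 ("flight") appears in both documents, so its weight drops to 0.
```

The takeaway is that words appearing in every tweet carry little information, while words concentrated in a few tweets are weighted up.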

Our samples array is now in a format that can easily be understood by our classifier. We are not done yet, though: we still need to label each sample with its corresponding sentiment.

Step 3: Generate the Training Dataset

Fortunately, PHP-ML has this need already covered and the code is quite simple:

<?php
...
use Phpml\Dataset\ArrayDataset;
...
$dataset = new ArrayDataset($samples, $dataset->getTargets());

We could go ahead and use this dataset and train our classifier. We are missing a testing dataset to use as validation, however, so we are going to “cheat” a little bit and split our original dataset into two: a training dataset and a much smaller dataset that will be used for testing the accuracy of our model.

<?php
...
use Phpml\CrossValidation\StratifiedRandomSplit;
...
$randomSplit = new StratifiedRandomSplit($dataset, 0.1);
$trainingSamples = $randomSplit->getTrainSamples();
$trainingLabels = $randomSplit->getTrainLabels();
$testSamples = $randomSplit->getTestSamples();
$testLabels = $randomSplit->getTestLabels();

This approach is a simple form of cross-validation, in which part of the data is held out for testing. The term comes from statistics and can be defined as follows:

Cross-validation, sometimes called rotation estimation, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. — Wikipedia.com
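To show what "stratified" means in the split we used above, here is a minimal sketch in plain PHP. The stratifiedSplit() function is hypothetical and only illustrates the idea of sampling each label separately so that both sets keep roughly the same class balance; details such as the rounding with ceil() and the optional seed are my own assumptions and may differ from what PHP-ML does.

```php
<?php
/**
 * Split samples into train and test sets while keeping each label's share
 * roughly equal in both sets.
 */
function stratifiedSplit(array $samples, array $labels, float $testSize, ?int $seed = null): array
{
    if ($seed !== null) {
        mt_srand($seed); // make the shuffle repeatable
    }

    // Group sample indexes by label.
    $byLabel = [];
    foreach ($labels as $index => $label) {
        $byLabel[$label][] = $index;
    }

    $split = ['trainSamples' => [], 'trainLabels' => [], 'testSamples' => [], 'testLabels' => []];
    foreach ($byLabel as $indexes) {
        shuffle($indexes);
        // Take the same fraction of every label for the test set.
        $testCount = (int) ceil(count($indexes) * $testSize);
        foreach ($indexes as $position => $index) {
            $target = $position < $testCount ? 'test' : 'train';
            $split[$target . 'Samples'][] = $samples[$index];
            $split[$target . 'Labels'][] = $labels[$index];
        }
    }

    return $split;
}

// Twelve dummy samples: eight negative, four positive.
$samples = range(1, 12);
$labels = array_merge(array_fill(0, 8, 'negative'), array_fill(0, 4, 'positive'));
$split = stratifiedSplit($samples, $labels, 0.25, 42);
// The test set gets 2 negative and 1 positive sample; the rest is used for training.
```

A plain random split could, by bad luck, leave a rare label out of the test set entirely; splitting per label avoids that.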

Step 4: Train the Classifier

Finally, we are ready to go back and implement our SentimentAnalysis class. If you haven’t noticed by now, a huge part of machine learning is about gathering and manipulating the data; the actual implementation of the Machine learning models tends to be a lot less involved.

To implement our sentiment analysis class, we have three classification algorithms available:

Support Vector Classification

KNearestNeighbors

NaiveBayes

For this exercise we are going to use the simplest of them all, the NaiveBayes classifier, so let’s go ahead and update our class to implement the train method:

<?php
namespace PhpmlExercise\Classification;
use Phpml\Classification\NaiveBayes;
class SentimentAnalysis
{
    protected $classifier;
    public function __construct()
    {
        $this->classifier = new NaiveBayes();
    }
    public function train($samples, $labels)
    {
        $this->classifier->train($samples, $labels);
    }
}

As you can see, we are letting PHP-ML do all the heavy lifting for us. We are just creating a nice little abstraction for our project. But how do we know if our classifier is actually training and working? Time to use our $testSamples and $testLabels.
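For the curious, here is a toy sketch of what a Naive Bayes classifier does with token counts. It uses the multinomial variant with add-one (Laplace) smoothing, which is a common choice for text but not necessarily the variant PHP-ML implements. The TinyNaiveBayes class and the training vectors are made up for illustration.

```php
<?php
/** A tiny multinomial Naive Bayes over token-count vectors (illustrative only). */
class TinyNaiveBayes
{
    private $logPriors = [];
    private $logLikelihoods = [];

    public function train(array $samples, array $labels): void
    {
        $featureCount = count($samples[0]);
        $totals = [];    // label => summed token counts per feature
        $documents = []; // label => number of training samples

        foreach ($samples as $i => $sample) {
            $label = $labels[$i];
            $documents[$label] = ($documents[$label] ?? 0) + 1;
            $totals[$label] = $totals[$label] ?? array_fill(0, $featureCount, 0);
            foreach ($sample as $feature => $count) {
                $totals[$label][$feature] += $count;
            }
        }

        foreach ($totals as $label => $counts) {
            $this->logPriors[$label] = log($documents[$label] / count($samples));
            $denominator = array_sum($counts) + $featureCount; // add-one smoothing
            foreach ($counts as $feature => $count) {
                $this->logLikelihoods[$label][$feature] = log(($count + 1) / $denominator);
            }
        }
    }

    public function predict(array $samples): array
    {
        $predictions = [];
        foreach ($samples as $sample) {
            $best = null;
            $bestScore = -INF;
            foreach ($this->logPriors as $label => $score) {
                // Add the log-likelihood of every token occurrence to the prior.
                foreach ($sample as $feature => $count) {
                    $score += $count * $this->logLikelihoods[$label][$feature];
                }
                if ($score > $bestScore) {
                    $bestScore = $score;
                    $best = $label;
                }
            }
            $predictions[] = $best;
        }

        return $predictions;
    }
}

// Columns: late, flight, again, great
$model = new TinyNaiveBayes();
$model->train(
    [[2, 1, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0], [0, 0, 0, 2]],
    ['negative', 'positive', 'negative', 'positive']
);
$predicted = $model->predict([[1, 1, 0, 0], [0, 1, 0, 1]]);
// "late flight" is classified as negative, "great flight" as positive.
```

Working with log probabilities instead of multiplying raw probabilities avoids numeric underflow once the vocabulary grows to thousands of words.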

Step 5: Test the Classifier’s Accuracy

Before we can proceed with testing our classifier, we do have to implement the prediction method:

<?php
...
class SentimentAnalysis
{
    ...
    public function predict($samples)
    {
        return $this->classifier->predict($samples);
    }
}

And again, PHP-ML is doing us a solid and doing all the heavy lifting for us. Let’s update our classifyTweets.php script accordingly:

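<?php
...
// Instantiate and train our classifier before asking it for predictions.
$classifier = new SentimentAnalysis();
$classifier->train($trainingSamples, $trainingLabels);
$predictedLabels = $classifier->predict($testSamples);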

Finally, we need a way to test the accuracy of our trained model; thankfully PHP-ML has that covered too, and they have several metrics classes. In our case, we are interested in the accuracy of the model. Let’s take a look at the code:

<?php
...
use Phpml\Metric\Accuracy;
...
echo 'Accuracy: ' . Accuracy::score($testLabels, $predictedLabels);

We should see something along the lines of:

Accuracy: 0.73651877133106
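The score is a fraction between 0 and 1, so the value above means that roughly 74% of the test tweets were labelled correctly. A minimal plain PHP sketch of the same calculation could look like the following; the accuracy() function and the sample labels are hypothetical and only illustrate the metric.

```php
<?php
/** Fraction of predictions that match the actual labels, between 0 and 1. */
function accuracy(array $actualLabels, array $predictedLabels): float
{
    if ($actualLabels === [] || count($actualLabels) !== count($predictedLabels)) {
        throw new InvalidArgumentException('Label arrays must be non-empty and the same size.');
    }

    $correct = 0;
    foreach ($actualLabels as $index => $label) {
        if ($label === $predictedLabels[$index]) {
            $correct++;
        }
    }

    return $correct / count($actualLabels);
}

$actual = ['negative', 'positive', 'neutral', 'negative'];
$predicted = ['negative', 'negative', 'neutral', 'negative'];
$score = accuracy($actual, $predicted);
printf("Accuracy: %.2f (%.0f%%)\n", $score, $score * 100);
// prints "Accuracy: 0.75 (75%)"
```

Keep in mind that accuracy alone can be misleading when one label dominates the dataset, which is worth remembering when a large share of the tweets are complaints.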

Conclusion

This article fell a bit on the long side, so let’s do a recap of what we’ve learned so far:

Having a good dataset from the start is critical for implementing machine learning algorithms.

The difference between supervised learning and unsupervised learning.

The meaning and use of cross-validation in machine learning.

That vectorization and transformation are essential to prepare text datasets for machine learning.

How to implement a Twitter sentiment analysis by using PHP-ML’s NaiveBayes classifier.

This post also served as an introduction to the PHP-ML library and hopefully gave you a good idea of what the library can do and how it can be embedded in your own projects.

Finally, this post is by no means comprehensive and there is plenty to learn, improve and experiment with; here are some ideas to get you started on how to improve things further:

Replace the NaiveBayes algorithm with the Support Vector Classification algorithm.

If you tried running against the full dataset (more than 14,000 rows), you probably noticed how memory intensive the process can be. Try implementing model persistence so the model doesn’t have to be trained on each run.

Move the dataset generation to its own helper class.
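As a starting point for the persistence idea, here is a minimal sketch of saving and restoring an object with PHP's built-in serialize() and unserialize(). The TrainedModel class, the helper functions and the file name are all hypothetical stand-ins. PHP-ML also ships its own ModelManager class for this purpose, if I recall correctly, so check the library's documentation before rolling your own.

```php
<?php
/** Hypothetical stand-in for a trained model; any serializable object works. */
class TrainedModel
{
    public $weights;

    public function __construct(array $weights)
    {
        $this->weights = $weights;
    }
}

/** Write a serialized copy of the model to disk. */
function saveModel($model, string $filepath): void
{
    if (file_put_contents($filepath, serialize($model)) === false) {
        throw new RuntimeException("Could not write model to {$filepath}");
    }
}

/** Load a previously saved model; only use this on files you created yourself. */
function restoreModel(string $filepath)
{
    if (!is_readable($filepath)) {
        throw new RuntimeException("Could not read model from {$filepath}");
    }

    $model = unserialize(file_get_contents($filepath));
    if ($model === false) {
        throw new RuntimeException("File {$filepath} does not contain a valid model");
    }

    return $model;
}

$filepath = sys_get_temp_dir() . '/sentiment-model.bin';
saveModel(new TrainedModel(['late' => -0.8, 'great' => 0.9]), $filepath);
$restored = restoreModel($filepath);
```

With something like this in place, the training step only needs to run when the saved file is missing or the dataset has changed.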

I hope you found this article useful. If you have some application ideas regarding PHP-ML or any questions, don’t hesitate to drop them below into the comments area!