“Machine learning” is a mystical term. Most developers don’t need it at all in their daily work, and the only details about it we know are from some university course 5 years ago (which is already forgotten). I’m not a machine learning expert, but I happened to work in a company that does a bit of that, so I got to learn the basics. I never programmed actual machine learning tasks, but got a good overview.

But what is machine learning? It’s instructing the computer to make sense of large amounts of data (#bigdata hashtag – check). In what ways?

classifying a new entry into existing classes – is this email spam, is this news article about sport or politics, is this symbol the letter “a”, or “b”, or “c”, is this object in front of the self-driving car a pedestrian or a road sign.

predicting a value of a new entry (regression problems) – how much does my car cost, how much will the stock price be tomorrow.

grouping entries into classes that are not known in advance (clustering) – what are your market segments, what are the communities within a given social network (and many more applications).

How? With many different algorithms and data structures. Fortunately, these have already been written by computer scientists, and developers can just reuse them (with a fair amount of understanding, of course).

But if the algorithms are already written, then it must be easy to use machine learning? No. Ironically, the hardest part of machine learning is the part where a human tells the machine what is important about the data. This process is called feature selection. The features are the properties that describe the data in a way that lets the computer identify meaningful patterns. I am no machine learning expert, but the way I see it, this step is what most machine learning engineers (or data scientists) are doing on a day-to-day basis. They aren’t inventing new algorithms; they are trying to figure out which combination of features gives the best results for a given data set. And it’s a process with many “heuristics” that I have no experience with. (That’s an oversimplification, of course, as my colleagues were indeed doing research and proposing improvements to algorithms, but that’s the scientific aspect of things.)

I’ll now limit myself only to classification problems and leave the rest. And when I say “best results”, how is that measured? There are the metrics of “precision” and “recall” (they are most easily used for classification into two groups, but there are ways to apply them to multi-class or multi-label classification). If you have to classify an email as spam or not spam, your precision is the percentage of emails correctly marked as spam out of all the emails marked as spam. And the recall is the percentage of emails correctly marked as spam out of all the emails that actually are spam. So if you have 200 emails, 100 of them are spam, and your program marks 100 emails as spam, 80 of them correctly and 20 incorrectly, you have an 80% precision (80/(80+20)) and an 80% recall (80/100 actual spam emails). Good results are achieved when you score high on both metrics, i.e. your spam filter is good if it correctly detects most spam emails, and it also doesn’t mark non-spam emails as spam.
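To make the arithmetic concrete, here is a minimal sketch in plain Java that computes both metrics from raw counts. The class and method names are made up for illustration; the numbers are the ones from the spam example (80 spam emails caught, 20 legitimate emails wrongly flagged, 20 spam emails missed).

```java
// Minimal sketch: precision and recall from raw counts (all names are illustrative).
final class PrecisionRecall {

    // Share of the items marked as spam that really are spam.
    static double precision(int truePositives, int falsePositives) {
        return (double) truePositives / (truePositives + falsePositives);
    }

    // Share of the actual spam that was marked as spam.
    static double recall(int truePositives, int falseNegatives) {
        return (double) truePositives / (truePositives + falseNegatives);
    }

    public static void main(String[] args) {
        int truePositives = 80;   // spam correctly marked as spam
        int falsePositives = 20;  // legitimate emails wrongly marked as spam
        int falseNegatives = 20;  // spam that was not caught (100 spam emails, 80 caught)

        System.out.println("precision = " + precision(truePositives, falsePositives));
        System.out.println("recall    = " + recall(truePositives, falseNegatives));
    }
}
```

Note that the two denominators differ: precision divides by everything the filter flagged, recall by everything that was actually spam.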

The process of feeding data into the algorithm is simple. You usually have two sets of data: the training set and the evaluation set. You normally start with one set and split it in two (the training set should be the larger one). These sets contain the values for all the features that you have identified for the data in question. You first “train” your classifier (a statistical model) with the training set (if you want to know how training happens, read about the various algorithms), and then run it on the evaluation set to see how many items were correctly classified (the evaluation set has the right answers in it, so you compare them to what the classifier produced as output).
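The split itself can be sketched in a few lines of plain Java. This is only an illustration under my own assumptions: the 80/20 ratio, the fixed random seed and the class name are not something the article or any tool prescribes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Minimal sketch: shuffle the entries and cut them into a training and an evaluation part.
final class TrainTestSplit {

    // Returns two lists: index 0 is the training set, index 1 is the evaluation set.
    static <T> List<List<T>> split(List<T> entries, double trainFraction, long seed) {
        List<T> shuffled = new ArrayList<>(entries);
        // Shuffle first, so the split does not depend on the original order of the data.
        Collections.shuffle(shuffled, new Random(seed));

        int trainSize = (int) Math.round(shuffled.size() * trainFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, trainSize)));
        parts.add(new ArrayList<>(shuffled.subList(trainSize, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> entries = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            entries.add(i);
        }
        // 0.8 is an assumed ratio; the only requirement is that the training part is larger.
        List<List<Integer>> parts = split(entries, 0.8, 42L);
        System.out.println("training:   " + parts.get(0));
        System.out.println("evaluation: " + parts.get(1));
    }
}
```

The fixed seed only makes the split reproducible between runs.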

Let me illustrate that with my first actual machine learning code (with the big disclaimer, that the task is probably not well-suited for machine learning, as there is a very small data set). I am a member (and currently chair) of the problem committee (and jury) of the International Linguistics Olympiad. We construct linguistics problems, combine them into problem sets and assign them at the event each year. But we are still not good at assessing how hard a problem is for high-school students. Even though many of us were once competitors in such olympiads, we now know “too much” to be able to assess the difficulty. So I decided to apply machine learning to the problem.

As mentioned above, I had to start with selecting the right features. After a couple of iterations, I ended up using: the number of examples in a problem, the average length of an example, the number of assignments, the number of linguistic components to discover as part of the solution, and whether the problem data is scrambled or not. The complexity (easy, medium, hard) comes from the actual scores of competitors at the olympiad (an average score of 0-8 points is hard, 8-12 is medium, above 12 is easy). I am not sure whether these features are actually related to problem complexity, hence I experimented with adding and removing some. I put the feature data into a Weka ARFF file, which looks like this (attributes = features):

    @RELATION problem-complexity

    @ATTRIBUTE examples NUMERIC
    @ATTRIBUTE avgExampleSize NUMERIC
    @ATTRIBUTE components NUMERIC
    @ATTRIBUTE assignments NUMERIC
    @ATTRIBUTE scrambled {true,false}
    @ATTRIBUTE complexity {easy,medium,hard}

    @DATA
    34,6,11,8,false,medium
    12,21,7,17,false,medium
    14,11,11,17,true,hard
    13,16,9,14,false,hard
    16,35,7,17,false,hard
    20,9,7,10,false,hard
    24,5,8,6,false,medium
    9,14,13,4,false,easy
    18,7,17,7,true,hard
    18,7,12,10,false,easy
    10,16,9,11,false,hard
    11,3,17,13,true,easy
    ...

The evaluation set looks exactly like that, but smaller (in my case, only 7 entries).
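The complexity label in the last column is derived from the average score. A minimal sketch of that mapping is below. The score ranges are the ones given earlier, but they overlap at 8 and 12, so how the boundaries are handled here (exactly 8 and exactly 12 both count as medium) is my assumption, and the class and method names are made up.

```java
// Minimal sketch: turn an average olympiad score into a complexity label.
final class ComplexityLabel {

    // Assumption: boundary scores of exactly 8 and exactly 12 are treated as "medium".
    static String fromAverageScore(double averageScore) {
        if (averageScore < 8) {
            return "hard";
        }
        if (averageScore <= 12) {
            return "medium";
        }
        return "easy";
    }

    public static void main(String[] args) {
        // Made-up scores, only to exercise each branch.
        double[] sampleScores = {5.5, 8.0, 12.0, 14.2};
        for (double score : sampleScores) {
            System.out.println(score + " -> " + fromAverageScore(score));
        }
    }
}
```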

Weka was recommended as a good tool (at least for starting), and it has a lot of algorithms included, which one can simply reuse.

Following the getting started guide, I produced the following simple code:

    import java.io.File;

    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.LMT;
    import weka.core.Instances;
    import weka.core.converters.ArffLoader;

    public static void main(String[] args) throws Exception {
        ArffLoader loader = new ArffLoader();
        loader.setFile(new File("problem_complexity_train_3.arff"));
        Instances trainingSet = loader.getDataSet();

        // index of the complexity attribute (the sixth column); here we specify
        // which attribute holds the classes into which we want to classify the data
        int classIdx = 5;

        ArffLoader loader2 = new ArffLoader();
        loader2.setFile(new File("problem_complexity_test_3.arff"));
        Instances testSet = loader2.getDataSet();

        trainingSet.setClassIndex(classIdx);
        testSet.setClassIndex(classIdx);

        // using the LMT classification algorithm. Many more are available
        Classifier classifier = new LMT();
        classifier.buildClassifier(trainingSet);

        // evaluate the trained classifier against the held-out test set
        Evaluation eval = new Evaluation(trainingSet);
        eval.evaluateModel(classifier, testSet);
        System.out.println(eval.toSummaryString());

        // Get the confusion matrix
        double[][] confusionMatrix = eval.confusionMatrix();
        ....
    }

A comment about the choice of the algorithm – having insufficient knowledge, I just tried a few and selected the one that produced the best result.
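The confusion matrix retrieved at the end of the code above ties back to precision and recall. In Weka's matrix the rows are the actual classes and the columns are the predicted ones, so both metrics can be read off per class. Here is a minimal sketch in plain Java; the counts are invented for illustration and are not results from my experiment, and the class name is made up.

```java
// Minimal sketch: per-class precision and recall from a confusion matrix
// where matrix[actual][predicted] holds the number of entries.
final class ConfusionMatrixMetrics {

    // Correct predictions of the class divided by all predictions of that class (column sum).
    static double precision(double[][] matrix, int classIndex) {
        double predictedAsClass = 0;
        for (double[] row : matrix) {
            predictedAsClass += row[classIndex];
        }
        return predictedAsClass == 0 ? 0 : matrix[classIndex][classIndex] / predictedAsClass;
    }

    // Correct predictions of the class divided by all entries really in that class (row sum).
    static double recall(double[][] matrix, int classIndex) {
        double actuallyInClass = 0;
        for (double count : matrix[classIndex]) {
            actuallyInClass += count;
        }
        return actuallyInClass == 0 ? 0 : matrix[classIndex][classIndex] / actuallyInClass;
    }

    public static void main(String[] args) {
        String[] labels = {"easy", "medium", "hard"};
        // Invented counts for a 7-entry evaluation set.
        double[][] matrix = {
            {2, 1, 0},  // actual easy
            {0, 1, 1},  // actual medium
            {0, 0, 2},  // actual hard
        };
        for (int i = 0; i < labels.length; i++) {
            System.out.println(labels[i]
                    + ": precision=" + precision(matrix, i)
                    + ", recall=" + recall(matrix, i));
        }
    }
}
```

Weka's Evaluation class also offers precision(int) and recall(int) for a given class index, so in practice you would rarely need to compute these by hand.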

After performing the evaluation, you can get the so-called “confusion matrix” (eval.confusionMatrix(), as in the code above), which shows how many entries of each actual class were assigned to each predicted class, and which you can use to judge the quality of the result. When you are satisfied with the results, you can proceed to classify new entries whose complexity you don’t know. To do that, you have to provide a data set, and the only difference from the other two is that you put a question mark instead of the class (easy, medium, hard). E.g.:

    ...
    @DATA
    34,6,11,8,false,?
    12,21,7,17,false,?

Then you can run the classifier:

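    // additionally needs java.text.DecimalFormat, java.util.Enumeration and weka.core.Instance
    ArffLoader loader = new ArffLoader();
    loader.setFile(new File("unclassified.arff"));
    Instances dataSet = loader.getDataSet();
    // the class attribute has to be set here too (same index as for the training set),
    // even though its values are unknown ("?")
    dataSet.setClassIndex(5);

    DecimalFormat df = new DecimalFormat("#.##");
    for (Enumeration<Instance> en = dataSet.enumerateInstances(); en.hasMoreElements();) {
        // one probability per class, in the order the classes are declared in the ARFF file
        double[] results = classifier.distributionForInstance(en.nextElement());
        for (double result : results) {
            System.out.print(df.format(result) + " ");
        }
        System.out.println();
    }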

This will print the probabilities for each of your entries to fall into each of the classes. As we are going to use this output only as a hint towards the complexity, and won’t use it as a final decision, it is fine to yield wrong results sometimes. But in many machine learning problems there isn’t a human evaluation of the result, so getting higher accuracy is the most important task.
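If you want a single suggested label rather than the raw probabilities, you can pick the class with the highest probability. A minimal sketch in plain Java follows; the class name and the sample distribution are made up, and the label order assumes the {easy,medium,hard} declaration from the ARFF file.

```java
// Minimal sketch: pick the most probable class from a probability distribution.
final class MostLikelyClass {

    // Returns the index of the largest value; on ties the first one wins.
    static int argMax(double[] distribution) {
        int best = 0;
        for (int i = 1; i < distribution.length; i++) {
            if (distribution[i] > distribution[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Same order as the complexity attribute declaration: {easy,medium,hard}.
        String[] labels = {"easy", "medium", "hard"};
        // Made-up distribution, standing in for one row of classifier output.
        double[] distribution = {0.12, 0.55, 0.33};

        int best = argMax(distribution);
        System.out.println("suggested complexity: " + labels[best]
                + " (probability " + distribution[best] + ")");
    }
}
```

Since the output is only a hint, it is worth keeping the probability next to the label: a 0.55 is a much weaker suggestion than a 0.95.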

How does this approach scale, however? Can I reuse the code above for a high-volume production system? On the web you normally do not run machine learning tasks in real time (you run them as scheduled tasks instead), so probably the answer is “yes”.

I am still a novice in the field, but having done one actual task made me share my tiny experience and knowledge. Meanwhile I’m following the Stanford machine learning course on Coursera, which can give you way more details.

Can we, as developers, use machine learning in our work projects? If we have large amounts of data – yes. It’s not that hard to get started, and although probably we will be making stupid mistakes, it’s an interesting thing to explore and may bring value to the product we are building.