Despite all the flashy headlines from Musk and Hawking on the impending doom to be visited on us mere mortals by killer robots from the skies, machine learning and artificial intelligence are here to stay. More importantly, machine learning (ML) is quickly becoming a critical skill for developers who want to enhance their applications and their careers, better understand data, and help users be more effective.

What is machine learning? It is the use of both historical and current data to make predictions, organize content, and learn patterns about data without being explicitly programmed to do so. This is typically done using statistical techniques that look for significant events in the data, such as co-occurrences and anomalies, and then factor their likelihood into a model. That model is queried at a later time to provide a prediction for some new piece of data.
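To make the idea of counting co-occurrences and later querying a model more concrete, here is a minimal sketch of a naive Bayes text classifier in plain Python. The technique is my choice for illustration, not something prescribed by any of the projects below, and the function names (train, predict) and the tiny "historical" data set are invented for this example.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count how often each word co-occurs with each label."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    label_counts = Counter()             # label -> number of examples
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return word_counts, label_counts, vocabulary

def predict(model, text):
    """Return the label with the highest log-likelihood for the text."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Prior: how common the label was in the historical data.
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical "historical" data, invented for illustration only.
history = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train(history)
print(predict(model, "claim your free prize"))   # spam
print(predict(model, "team meeting on monday"))  # ham
```

Nothing in the code says which words signal spam. The counts gathered during training carry that knowledge, which is what "without being explicitly programmed" means in practice.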

Common machine learning tasks include classification (applying labels to items), clustering (grouping items automatically), and topic detection. It is also commonly used in natural language processing. Machine learning is increasingly applied to a wide variety of use cases, including content recommendation, fraud detection, image analysis, and ecommerce. It is useful across many industries, and most popular programming languages have at least one open source library implementing common ML techniques.
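As a rough illustration of clustering, the following sketch implements a bare-bones k-means loop in plain Python. It assumes 2-D points and a naive initialization (the first k points become the starting centroids), and the sample points are invented. The libraries covered later use more careful initialization and stopping rules.

```python
def kmeans(points, k, iterations=10):
    """Group 2-D points into k clusters with a basic k-means loop."""
    # Naive initialization (an assumption for brevity): use the first k points.
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return assignments, centroids

# Hypothetical sample data: two visually obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
assignments, centroids = kmeans(points, k=2)
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

No labels are supplied here. The grouping emerges from repeating the assign and update steps, which is also a small example of why many ML algorithms are iterative.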

Reflecting the broader push in software towards open source, there are now many vibrant machine learning projects available to experiment with as well as a plethora of books, articles, tutorials, and videos to get you up to speed. Let's look at a few projects leading the way in open source machine learning and a few primers on related ML terminology and techniques.

Primers

Beyond the project home pages and documentation, there are several excellent sources available to teach the core concepts behind machine learning. While there are hundreds (even thousands) of books and tutorials on ML, I've tried to focus on those targeted towards programmers and less on those that are more rigorous or focused too much on the math behind the scenes. While that material is important in the long run, it often keeps engineers who are just getting started from trying out real systems with real data.

Projects

While there are many great open source machine learning projects out there, the following projects combine strong technical capabilities with good documentation and accessible communities for asking questions and troubleshooting.

Weka

Weka, from the University of Waikato in New Zealand, has long set the standard for open source machine learning with a rich set of tools, lots of algorithms to try out, and user interfaces for exploring data and results. It also has an excellent accompanying book that explains a lot of the ML concepts while showing examples using Weka. While it isn't necessarily up on the latest craze of deep learning and the like, it is a solid project to get started with in understanding the concepts.

Mahout

Near and dear to my own heart as a co-founder of the project, Apache Mahout has retooled itself in the past year to focus on Apache Spark as well as on overhauling the way one builds ML models while shipping implementations of commonly used ML algorithms. For those still using Hadoop MapReduce, Mahout continues to maintain implementations of key algorithms for classification, clustering, and recommendations using the MapReduce paradigm.

Spark's MLlib

Built from day one for Apache Spark, MLlib is focused on delivering commonly used machine learning algorithms for clustering and classification in a scalable manner. By leveraging Spark, MLlib is able to take advantage of large-scale cluster optimizations for processing big data. This can be especially important in machine learning, since many of the algorithms used are iterative in nature and data hungry.

Scikit-learn

Building on other solid Python libraries like NumPy and SciPy, scikit-learn brings many of the algorithms and tools covered in the above Java/Scala libraries to the Python stack. Add in a nice set of tutorials, and you have a library poised to have you up and learning in no time.
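For a sense of what getting started looks like, here is a minimal scikit-learn sketch that trains a classifier and checks it on held-out data. The choice of the bundled iris data set, the k-nearest-neighbors classifier, and the parameter values are my assumptions for illustration, and the import paths reflect recent scikit-learn releases.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small labeled data set that ships with scikit-learn.
iris = load_iris()

# Hold back a quarter of the data to measure how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

# Fit a k-nearest-neighbors classifier on the training portion only.
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

# Score the model on examples it has never seen.
accuracy = classifier.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Swapping in a different estimator usually means changing only the import and the constructor line, since scikit-learn estimators share the same fit and score methods.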

DeepLearning4J

Capitalizing on the latest buzzword within the buzzword-laden field of ML, Deep Learning for Java brings to open source a strong set of algorithms designed to do single-machine and distributed deep learning on Hadoop and Spark. It has a range of utilities for working with data and also has GPU (graphics processing unit) support.

What is deep learning? Increasingly used at places like Google, Facebook, and Amazon, deep learning is a new, large-scale approach to neural networks designed to significantly reduce the amount of human intervention needed to train and maintain models while also providing significantly better results. DL4J, as it is called, also has a book in the works (available for preorder) from Adam Gibson and Josh Patterson.

Bonus projects

As with any overview article, there simply isn't enough room to cover all the great projects in a space, so be sure to also check out H2O, Vowpal Wabbit, and PredictionIO, as well as the MLOSS archive of open source machine learning libraries.

Next steps

The real key to getting started in machine learning is to download some sample data and the code from one of the projects above. Be prepared for lots of trial and error as you explore the different approaches. You will quickly find that, despite all the hype about artificial intelligence, building these applications still requires a good dose of human intelligence to get good results.

This article is part of the Apache Quill column coordinated by Rikki Endsley. Share your success stories and open source updates within projects at Apache Software Foundation by submitting your story to Opensource.com.