How open source search and machine learning are driving insight into what’s important

“What is important is seldom urgent and what is urgent is seldom important.” —attributed to Dwight “Ike” Eisenhower, 34th President of the United States

In today’s data-driven, constantly connected, technology-centric world, we are awash in attention-grabbing content, events, and requests. I suspect most Opensource.com readers, like me, relish this way of life—after all, we technologists created it—except, that is, when we don’t. We like it when, like Ike, we can properly separate the urgent from the important and use this tidal wave of data to our personal and professional advantage. As data volumes grow, that becomes harder and harder.

Lest this become yet another “me too” article on the glories of all things big data, I’m going to focus instead on another trend that I think is the real prize in this race to be all things to all data: automatically identifying and acting on what is important. Imagine a phone company automatically distinguishing a localized cell phone traffic spike caused by an emergency from one caused by a big event like a concert, by combining its real-time network monitoring with real-time social media analysis, and reacting accordingly when there is excess load or an outage.

The former situation is important; the latter is merely urgent. Identifying importance is, above all other use cases, the most valuable aspect of the current data movement. Storing and aggregating all of your data in next-gen data warehouses are great use cases, but eventually you’ll run out of disk (even if it is cheap) or the ability to look at and comprehend all those reports. You will never, however, run out of the need to know what’s important in your data and how it impacts you, your clients, and your company.

What does all of this have to do with open source? Innovation in open source libraries, specifically search, machine learning, and natural language processing (NLP) libraries, is paving the way for a deeper, more fulfilling understanding of what data is important. Before we get into what these tools do, let’s take a step back to think about what “being important” means in the context of data.

At first blush, knowing what is important seemingly falls into Potter Stewart’s “I know it when I see it” category of things most people take for granted that they understand. Most of us can tell whether an email or a phone call is important with a quick read of the subject line or a glance at the caller ID. Few of us understand how we come to those conclusions so quickly or what factors into those simple decisions, much less the more complex judgments of what is important in more difficult situations. Therein lies the challenge of building software and systems to help solve the problem. Determining importance covers a broad range of factors, including, but not limited to: timeliness, personalization, past behavior, social impact, content, meaning, and whether or not something is actionable. For example, a security attack on your infrastructure happening right now is likely more important than one that happened last year. I say likely, because the one last year may hold clues to stopping the one happening now and is thus hugely important.
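To make the idea concrete, here is a minimal sketch of how several of the factors above—timeliness, personalization, and content—could be combined into a single importance score. The factor weights, keyword list, and sender list are illustrative assumptions, not a production model.

```python
from datetime import datetime, timedelta

# Illustrative assumptions: these keyword and sender lists, and the
# weights below, are made up for the sketch.
URGENT_KEYWORDS = {"outage", "attack", "deadline", "failure"}
KNOWN_SENDERS = {"boss@example.com": 1.0, "alerts@example.com": 0.8}

def importance_score(sender, subject, received_at, now):
    """Combine weighted factors into a single score, roughly in [0, 1]."""
    # Timeliness: a message's weight decays linearly over 48 hours.
    age_hours = (now - received_at).total_seconds() / 3600
    timeliness = max(0.0, 1.0 - age_hours / 48)
    # Personalization: how much we weight this particular sender.
    sender_weight = KNOWN_SENDERS.get(sender, 0.2)
    # Content: fraction of the urgent keywords present in the subject.
    words = set(subject.lower().split())
    content = len(words & URGENT_KEYWORDS) / len(URGENT_KEYWORDS)
    return 0.4 * timeliness + 0.3 * sender_weight + 0.3 * content

now = datetime(2016, 1, 1, 12, 0)
fresh_alert = importance_score(
    "alerts@example.com", "Network outage detected",
    now - timedelta(hours=1), now)
old_note = importance_score(
    "stranger@example.com", "Lunch next month?",
    now - timedelta(hours=40), now)
print(fresh_alert > old_note)  # the fresh alert ranks higher
```

A real system would learn these weights from user behavior rather than hard-coding them, which is exactly where the machine learning libraries discussed below come in.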

Despite all these challenges, exceptions, and subtleties, we’ve made good strides in separating the wheat from the chaff when it comes to identifying important data, in no small part thanks to open source. In particular, gains in search engine technology like Apache Lucene and Solr have revolutionized our ability to deal with multi-structured content at scale, rank it, and return it in a timely manner. Search engines have evolved significantly in recent years to seamlessly collect, collate, and curate data across a wide variety of data types (text, numeric, time-series, spatial, and more) and are no longer just about fast keyword lookups. Combined with large-scale data processing frameworks (Hadoop, Spark, et al.), R for statistical analysis, machine learning libraries like Apache Mahout, Vowpal Wabbit, and MLlib, and NLP libraries like Stanford’s NLP libraries, Apache OpenNLP, NLTK, and more, it is now possible to build sophisticated solutions that take in your data, model it, serve it up to your users, and then learn from their behavior.
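The ranked retrieval that engines like Lucene and Solr perform at scale can be illustrated with a tiny TF-IDF sketch: score each document by how often it contains the query terms, discounted by how common each term is across the whole collection. This is a didactic toy, not how Lucene is actually implemented, and the sample documents are invented.

```python
import math
from collections import Counter

# A toy document collection (invented for the example).
docs = {
    "d1": "cell tower outage reported downtown",
    "d2": "concert tonight heavy cell traffic expected",
    "d3": "quarterly report on network traffic trends",
}

def tokenize(text):
    return text.lower().split()

def tf_idf_rank(query, docs):
    """Rank docs by summed TF-IDF weight of the query terms."""
    n = len(docs)
    tokenized = {d: tokenize(t) for d, t in docs.items()}
    # Document frequency: how many docs each term appears in.
    df = Counter()
    for toks in tokenized.values():
        for term in set(toks):
            df[term] += 1
    scores = {}
    for d, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                idf = math.log(n / df[term]) + 1.0  # smoothed IDF
                score += (tf[term] / len(toks)) * idf
        scores[d] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = tf_idf_rank("cell outage", docs)
print(ranking[0][0])  # d1 scores highest: it matches both query terms
```

The rarer term (“outage”) contributes more weight than the common one (“cell”), which is the core intuition behind relevance ranking: distinctive terms carry more signal about what matters.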

Google, Amazon, Facebook and others have been doing this for years. And through the power of open source, many of these techniques are now widely available to the rest of us.

The end goal of all of this work is to create a virtuous cycle between your users and your data. The more your users interact with you, the smarter your system gets. The smarter your system gets, the more your users will want to interact with you.

This article is part of the Apache Quill column coordinated by Jason Hibbets. Share your success stories and open source updates from projects at the Apache Software Foundation by contacting us at open@opensource.com.