
Machine learning is a powerful paradigm many organizations are utilizing to derive insights and add features to their applications, but using it requires skills, data, and effort. Explorium, a startup from Israel, has just announced $19 million of funding to lower the barrier on all of the above.

The funding announced today comprises a seed round of $3.6 million led by Emerge with the participation of F2 Capital and a $15.5 million Series A led by Zeev Ventures with the involvement of the seed investors. Explorium was founded by Maor Shlomo, Or Tamir, and Omer Har, three Israeli tech entrepreneurs, who previously led large-scale data mining and optimization platforms for big data-based marketing leaders.



"We are doing for machine learning data what search engines did for the web," said Explorium co-founder and CEO Maor Shlomo. "Just as a search engine scours the web and pulls in the most relevant answers for your need, Explorium scours data sources inside and outside your organization to generate the features that drive accurate models."

Explorium's platform works in three stages: Data enrichment, feature engineering, and predictive modeling.

Data enrichment

The first part of the process involves finding appropriate data for the task at hand. To train machine learning algorithms, relevant datasets are needed. Let's say, for example, an organization is interested in devising a predictive model for HR, to help reduce churn by generating alerts and recommendations for action.

To train this model, data from the organization's HR will have to be used. But for the data to be useful, they have to be sufficient in quantity and quality, which is not always the case. This is where Explorium comes in.

Initially, users connect a dataset with a target column, indicating what they would like to predict. Multiple internal sources can be connected, as long as one of them contains the target column. Then Explorium detects the meaning of the columns for each input dataset and enriches the dataset with additional data sources.



For example, if the engine identifies location coordinates (latitude and longitude) among the columns of the customer's data, it enriches that data with geospatial tags (such as competitors in the area), demographic sources, and so on.
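To make the coordinate example concrete, here is a minimal sketch of that kind of enrichment. It is an illustration under stated assumptions, not Explorium's implementation: the function names, the `competitors_within_radius` column, and the in-memory list of competitor locations are all hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def enrich_with_nearby_count(rows, places, radius_km=1.0):
    """Return copies of rows with a count of places within radius_km.

    `rows` and `places` are lists of dicts with "latitude" and
    "longitude" keys. The output column name is hypothetical.
    """
    enriched = []
    for row in rows:
        nearby = sum(
            1 for place in places
            if haversine_km(row["latitude"], row["longitude"],
                            place["latitude"], place["longitude"]) <= radius_km
        )
        enriched.append({**row, "competitors_within_radius": nearby})
    return enriched

# Hypothetical customer data and a hypothetical external source.
stores = [{"store_id": 1, "latitude": 32.0853, "longitude": 34.7818}]
competitors = [
    {"latitude": 32.0860, "longitude": 34.7820},  # a short walk away
    {"latitude": 32.8000, "longitude": 35.0000},  # a different city
]
enriched_stores = enrich_with_nearby_count(stores, competitors)
```

The original rows are left untouched and the new column is added to copies, so the same input could be enriched against several sources independently.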



Before getting to how this identification works, it's worth asking where that data comes from, and how its relevance and reliability are assessed. Explorium sources data from multiple channels. Some of it comes from open and public datasets, but there's more.


Shlomo said Explorium had built an extensive data partnership network to enrich its data catalog and create integrated views. This also includes what he calls premium providers, allowing Explorium to purchase data from companies and commercial entities looking to monetize their data safely (e.g., aggregated usage statistics).



Shlomo concluded by mentioning that Explorium uses machine learning methods to aggregate multiple data sources into a single coherent and meaningful piece of data. It also structures untapped data from online assets such as photos, and extracts entities and actions from web text (e.g., articles).



This raises several questions, the most obvious ones being about data protection, compliance, and security. Shlomo emphasized that security is a big focus for the company, mentioning that Explorium is SOC 2 compliant and on track for additional accreditations.



"We take great care to support our customers in complying with relevant regulations such as GDPR. For example, our customers can choose to only work with a subset of our data sources according to the regulations they need to comply with," said Shlomo.

Feature engineering

Besides compliance and security, however, the gist of it all is whether the data is useful and relevant. This is something for which you'll have to trust Explorium's process, which Shlomo describes as follows:



"Data quality, dependability, and research are ensured by multiple functions across the organization in diverse methodologies. For example, expert teams research sources prior to initial use and ascertain their value, origins, and relevancy for different needs. We deploy automated quality tests across all data sources to make sure we're delivering the best data product possible."



But there's more to this than aggregating data. Explorium promises it can automatically detect the meaning of columns for each input dataset, which is a core feature in its offering. It enables Explorium to understand which data sources the user can connect to, which sources the platform can automatically explore, and which features can be automatically generated later on.
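Explorium's detection algorithms are proprietary and not described in detail. Purely to illustrate the idea of inferring a column's meaning from its name and values, here is a toy heuristic. The function `guess_column_meaning`, its rules, and its labels are assumptions made for this sketch.

```python
import re
from datetime import datetime

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def _is_float_in(value, low, high):
    """True if value parses as a number within [low, high]."""
    try:
        return low <= float(value) <= high
    except (TypeError, ValueError):
        return False

def _is_iso_date(value):
    """True if value is a string formatted as YYYY-MM-DD."""
    try:
        datetime.strptime(str(value), "%Y-%m-%d")
        return True
    except ValueError:
        return False

def guess_column_meaning(name, values):
    """Guess a semantic label for a column (toy rules, not exhaustive)."""
    lowered = name.lower()
    non_null = [v for v in values if v is not None and v != ""]
    if not non_null:
        return "unknown"
    if "lat" in lowered and all(_is_float_in(v, -90, 90) for v in non_null):
        return "latitude"
    if (("lon" in lowered or "lng" in lowered)
            and all(_is_float_in(v, -180, 180) for v in non_null)):
        return "longitude"
    if all(isinstance(v, str) and EMAIL_RE.match(v) for v in non_null):
        return "email"
    if all(_is_iso_date(v) for v in non_null):
        return "date"
    return "unknown"

# Hypothetical input dataset, column by column.
columns = {
    "store_lat": [32.0853, 31.7683],
    "store_lng": [34.7818, 35.2137],
    "contact": ["ops@example.com", None],
    "opened_on": ["2019-01-02", "2018-11-30"],
    "notes": ["hello", "world"],
}
meanings = {name: guess_column_meaning(name, vals) for name, vals in columns.items()}
```

Once a label such as `latitude` or `date` is attached to a column, later steps can decide which external sources and which feature generators apply to it.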



Explorium then generates auto-engineered features for every connected source, resulting in hundreds of thousands of candidate features from different enrichment sources. The meaning of the data is used to extract complex features out of the raw data. This is based on proprietary algorithms that understand multiple characteristics, structures, and entities behind the data.

The purpose of this stage is to generate as many candidates as possible for the elimination phase. On average, hundreds of features are generated per data source (which ends up being hundreds of thousands of features overall). Shlomo said users could extend Explorium's engine with their own custom functions (such as NLP embeddings or advanced time-series features), and leverage it to explore their own ideas and introduce domain knowledge into the process.
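One way to picture candidate generation with user extensions is a registry of small functions keyed by column meaning. The sketch below assumes that design for illustration only: `register`, `generate_candidates`, and the generator functions are hypothetical names, not part of Explorium's engine.

```python
from datetime import datetime

# Maps a column meaning (e.g. "date") to the generators that apply to it.
GENERATORS = {}

def register(meaning):
    """Decorator that adds a feature generator for a column meaning."""
    def wrap(fn):
        GENERATORS.setdefault(meaning, []).append(fn)
        return fn
    return wrap

@register("date")
def day_of_week(value):
    # Monday is 0, Sunday is 6.
    return datetime.strptime(value, "%Y-%m-%d").weekday()

@register("date")
def month(value):
    return datetime.strptime(value, "%Y-%m-%d").month

@register("text")
def word_count(value):
    return len(value.split())

def generate_candidates(column_name, meaning, values):
    """Apply every generator registered for `meaning` to the column."""
    return {
        f"{column_name}__{fn.__name__}": [fn(v) for v in values]
        for fn in GENERATORS.get(meaning, [])
    }

# A user-defined extension, standing in for custom domain knowledge.
@register("date")
def is_weekend(value):
    return datetime.strptime(value, "%Y-%m-%d").weekday() >= 5

candidates = generate_candidates("hired_on", "date", ["2019-09-10", "2019-09-14"])
```

Each generator stays tiny, and the volume comes from applying many of them across many sources, which is why an elimination phase is needed afterwards.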



Explorium then evaluates hundreds of models on different subsets of features and sources to introduce automated feedback on the data sources and features. Different features are ranked, and the weak ones (94% on average) are eliminated. The impact of auto-generated features on the predictive models' accuracy is evaluated, and then the feedback is used to improve search results.
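The ranking and elimination step can be sketched with a much simpler stand-in. Instead of evaluating hundreds of models, the example below scores each candidate by its absolute Pearson correlation with the target and keeps only the top fraction. The scoring rule, the `rank_and_prune` name, and the default `keep_fraction` of 0.06 (mirroring the 94% average elimination rate quoted above) are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def rank_and_prune(features, target, keep_fraction=0.06):
    """Rank features by |correlation| with target and keep the top share.

    `features` maps a feature name to a list of numeric values.
    At least one feature is always kept.
    """
    scores = {name: abs(pearson(vals, target)) for name, vals in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, round(len(ranked) * keep_fraction))
    return ranked[:keep], scores

# Hypothetical candidates and target.
target = [1.0, 2.0, 3.0, 4.0]
features = {
    "signal": [10.0, 20.0, 30.0, 40.0],
    "noise": [1.0, -1.0, 1.0, -1.0],
    "constant": [5.0, 5.0, 5.0, 5.0],
}
kept, scores = rank_and_prune(features, target, keep_fraction=0.34)
```

A correlation filter only catches linear, single-feature effects. A model-based evaluation like the one the article describes can also credit features that matter only in combination.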

Predictive modeling

The process gives each feature an Explorium Score (indicating impact and relevance to the problem), while each data source receives an aggregated score based on the features generated from it. Eventually, Explorium converges on the best subset of features for a specific model. As for the models themselves, Shlomo's view is that they have been largely commoditized:

"The main challenge was, and still is, the data that is fed into those algorithms. We find open-source implementations of models (Sklearn, Xgboost, LibSVM, Tensorflow...) powerful and beautifully designed. We don't see a reason to reinvent the wheel. We leverage off-the-shelf libraries to build our AutoML component."

Explorium also integrates with leading AutoML providers to serve customers who prefer using their existing stack, including cloud providers. Different models are tested against different subsets of auto-generated features from many different sources, which is a hard optimization problem.

Explorium uses optimization methods to help converge into the best performing features set, models, and parameters. To enhance this process and converge quicker, statistics across different use cases, projects, and customers are collected to keep track of which models work best with which types of data sources/features.
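Explorium does not disclose its optimization methods. As a rough illustration of searching over models and feature subsets under a fixed budget, with historical statistics deciding what to try first, consider the sketch below. The `search` function, the `prior` dictionary of past win rates, and the toy `evaluate` callback are all hypothetical.

```python
from itertools import combinations

def search(models, features, evaluate, prior=None, budget=10, subset_size=2):
    """Evaluate up to `budget` (model, feature subset) pairs, best first.

    `evaluate(model, subset)` returns a score where higher is better.
    `prior` maps a model name to a historical success rate, and models
    with a higher prior are tried earlier.
    """
    prior = prior or {}
    candidates = [
        (model, subset)
        for model in models
        for subset in combinations(sorted(features), subset_size)
    ]
    # Stable sort: ties keep their original order.
    candidates.sort(key=lambda c: prior.get(c[0], 0.0), reverse=True)
    best = None
    for model, subset in candidates[:budget]:
        score = evaluate(model, subset)
        if best is None or score > best[0]:
            best = (score, model, subset)
    return best

def toy_evaluate(model, subset):
    """Stand-in for training and validating a real model."""
    score = 1.0 if model == "tree" else 0.5
    score += 0.2 if "b" in subset else 0.0
    score += 0.1 if "c" in subset else 0.0
    return score

models = ["linear", "tree"]
features = ["a", "b", "c"]

# With a budget of 3, the prior determines which model family gets explored.
with_prior = search(models, features, toy_evaluate,
                    prior={"tree": 0.9, "linear": 0.2}, budget=3)
without_prior = search(models, features, toy_evaluate, budget=3)
```

In this toy setup the prior steers the limited budget toward the stronger model family, which is the intuition behind collecting statistics across use cases to converge faster.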



The platform aims to provide an end-to-end data science solution: from data discovery to models in production. The goal is to help users automatically discover the right datasets, generate high-impact features, and feed them into predictive models.



The optimal feature set can either be consumed directly by data science teams as they require it (e.g., via a real-time API or batch pipelines), or used to train and serve a machine learning model using Explorium's open source-based AutoML, integrations with cloud vendors' AutoML, or custom models that users are free to build themselves.



For example, said Shlomo, a user who has just discovered dozens of new impactful data sources doesn't need to integrate with each one of them, as the platform does it automatically. Users can choose whether to consume a finalized prediction result from one of the platform's models or to consume the raw, enriched features and feed them into their own models.

AutoML and data marketplace - a BI killer?

Explorium's business model is subscription-based and driven mainly by the number of "use cases." A use case is a project where the customer consumes a model's predictions or auto-generated features. Explorium can be installed either on-prem or in the cloud. User datasets are transferred to the platform, and users can choose to modify or delete data at any point.



If there's a connection to the outside world, Explorium automatically uses the most up-to-date data sources in its catalog. In case of a closed on-prem setup, a replica of the enrichment database is installed. This means missing out on high-frequency updated datasets like stock information or news, so Explorium recommends cloud-based deployment whenever possible.

Shlomo said Explorium customers range from Fortune 100 companies and leading financial institutions all the way down to hyper-growth startups. The team, largely made up of engineers and data scientists, is growing quickly, with offices in Tel Aviv, San Francisco, and Kiev. Explorium has been named one of the top five fastest-growing companies in Tel Aviv.



The combination of AutoML and data marketplace that Explorium offers seems compelling. But is it unique? Could an AutoML (cloud) vendor add data marketplace features, or a data marketplace add AutoML, to the same effect? Shlomo believes existing tools are not built to deal with a large variety of data sources and goes as far as to predict the demise of traditional BI tools: