Large labeled datasets are critical for developing machine learning applications and training modern machine learning models. Creating these datasets, however, requires considerable time and money. A new Google AI paper published in collaboration with Stanford and Brown University introduces Snorkel DryBell, an experimental internal system that leverages the open-sourced Snorkel framework to harness various existing organizational knowledge resources and generate training data for web-scale machine learning models.

“Labeling training data is one of the most costly bottlenecks in developing or modifying machine learning-based applications. We survey how resources from across an organization can be used as weak supervision sources for three classification tasks at Google, in order to bring development time and cost down by an order of magnitude. We build on the Snorkel framework, extending it as a new system, Snorkel DryBell, which integrates with Google’s distributed production systems and enables engineers to develop and execute weak supervision strategies over millions of examples in less than thirty minutes. We find that Snorkel DryBell creates classifiers of comparable quality to ones trained using up to tens of thousands of hand-labeled examples, in part by leveraging organizational resources not servable in production which contribute an average 52% performance improvement to the weakly supervised classifiers.” (arXiv).

Synced invited Bert Huang, an assistant professor in the Department of Computer Science at Virginia Tech, to share his thoughts on Snorkel DryBell.

How would you describe Snorkel DryBell?

Snorkel DryBell is a system for large-scale integration of multiple information sources to perform weakly supervised machine learning. Google highlights its capabilities in transferring information across different parts of Google’s ecosystem. The system allows experts to encode relationships across different information sources by programming labeling functions. It then reasons about these labeling functions and uses them to estimate the true labels for arbitrarily large unlabeled datasets. These automatically labeled datasets can then train large complex models.
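As a rough illustration of the idea (a minimal sketch, not DryBell's actual implementation), a labeling function is any heuristic that votes on an example's label or abstains; combining the votes, here with a simple majority, yields estimated labels for unlabeled data. The task, function names, and keywords below are hypothetical:

```python
ABSTAIN = -1

# Hypothetical heuristics acting as labeling functions for a toy spam task:
# each votes 1 (spam), 0 (not spam), or ABSTAIN when it has no opinion.
def lf_contains_link(text):
    return 1 if "http://" in text or "https://" in text else ABSTAIN

def lf_long_message(text):
    # Long messages tend to be legitimate in this toy setup.
    return 0 if len(text.split()) > 20 else ABSTAIN

def lf_spam_keyword(text):
    return 1 if "free money" in text.lower() else ABSTAIN

def majority_vote(text, lfs):
    """Combine labeling-function votes by simple majority, ignoring abstentions."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN  # no function voted
    return max(set(votes), key=votes.count)

lfs = [lf_contains_link, lf_long_message, lf_spam_keyword]
print(majority_vote("Click here for free money http://spam.example", lfs))  # 1
```

In practice, systems like Snorkel replace the majority vote with a learned generative model that weights each labeling function by its estimated accuracy.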

Why does Snorkel DryBell matter?

Snorkel DryBell is a large-scale, in-production example of the importance of weak supervision as a machine learning paradigm. Weak supervision allows experts to provide training signals to learning algorithms without exhaustively labeling individual examples. As methods for supervised learning are maturing, this requirement of individually labeling huge amounts of data is revealing itself to be a critical bottleneck in the development of machine learning models. Snorkel, Snorkel DryBell, and other weak supervision techniques can be a path around this bottleneck.

Snorkel DryBell, and the Snorkel system supporting it, also reason about dependencies among the weak supervision signals. If these dependencies are not properly handled, the estimated labels can be biased or systematically wrong. Snorkel uses probabilistic modeling to discover and compensate for potential dependencies, and by handling this possible trap, it is able to benefit from the full power of weak supervision.
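To see why dependencies matter, here is a toy sketch (not Snorkel's actual probabilistic model; all names are hypothetical). Two near-duplicate labeling functions should not count as two independent votes, yet a naive majority vote treats them that way:

```python
# Votes from three hypothetical labeling functions on one example:
# lf_b is a near-copy of lf_a, so their votes are highly correlated.
votes = {
    "lf_a": 1,  # keyword heuristic
    "lf_b": 1,  # near-duplicate of lf_a
    "lf_c": 0,  # independent heuristic
}

# Naive majority vote counts the correlated pair twice: label = 1 (2 vs 1).
vals = list(votes.values())
naive = max(set(vals), key=vals.count)

# A dependency-aware model that detects the lf_a/lf_b correlation can
# down-weight the redundant vote so each independent source counts once.
weights = {"lf_a": 1.0, "lf_b": 0.0, "lf_c": 1.0}
score = sum(weights[k] * (1 if v == 1 else -1) for k, v in votes.items())
# score == 0.0: really a tie (one independent vote each way), not a
# confident "spam" label as the naive count suggested.
```

The naive estimate is confidently wrong in a systematic way; correcting for the correlation reveals the genuine uncertainty, which is the kind of bias the probabilistic model guards against.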

What impact might Snorkel DryBell bring to the research community?

Snorkel DryBell is a concrete example of weak supervision used in a practical, high-impact, high-visibility application. While weak supervision methods have been demonstrated in the past, Snorkel DryBell is a real example of a large-scale system of this type running in production at one of the world's largest computing organizations.

A lot of machine learning methods are important mathematical ideas or technical demonstrations on benchmark problems. Snorkel DryBell appears to be more than that. It’s always important to see real deployed machine learning systems based on modern research ideas.

Can you identify any bottlenecks in the research?

As methods for learning from weak supervision begin to mature, one bottleneck that arises is the need for sources of weak supervision. Google is especially well positioned here, since relating data across its various products yields many types of weak supervision. Not all practitioners will have as much opportunity to find relationships among their data sources to serve as weak supervision. Then again, if they lack resources for weak supervision, they almost certainly lack the resources for full supervision.

Can you predict any potential future developments related to this research?

Those of us thinking about weak supervision are all in agreement about its importance. As more researchers become aware of the promise of this learning paradigm, and as we develop more reliable methods for weakly supervised learning, it’s possible that practitioners may choose weakly supervised methods over fully supervised methods in the future.

The paper Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale is on arXiv.