Machine learning systems are built from both code and data. It’s easy to reuse the code but hard to reuse the data, so building AI mostly means doing annotation. This is good, because the examples are how you program the behaviour – the learner itself is really just a compiler. What’s not good is the current technology for creating the examples. That’s why we’re pleased to introduce Prodigy, a downloadable tool for radically efficient machine teaching.

We’ve been working on Prodigy since we first launched Explosion last year, alongside our open-source NLP library spaCy and our consulting projects (it’s been a busy year!). During that time, spaCy has grown into the most popular library of its type, giving us a lot of insight into what’s driving success and failure for language understanding technologies. Most of those insights have been used to make spaCy better: AI DevOps was hard, so we made sure models could be installed via pip. Large models made CI tricky, so the new models are less than 1/10th the size.

Prodigy addresses the big remaining problem: annotation and training. The typical approach to annotation forces projects into an uncomfortable waterfall process. The experiments can’t begin until the first batch of annotations is complete, but the annotation team can’t start until they receive the annotation manuals. To produce the annotation manuals, you need to know what statistical models will be required for the features you’re trying to build. Machine learning is an inherently uncertain technology, but the waterfall annotation process relies on accurate upfront planning. The net result is a lot of wasted effort.

Why not cloud? Active learning works best when you have a lot of raw input to stream through the model, so that more informative examples can be chosen for annotation. The model must be updated during the annotation session, and the updates must be specific to each user. Solutions to these problems could surely be developed – but… why? As attractive as SaaS is to investors, it only makes sense if the hosted component is adding value, instead of removing it.

Prodigy solves this problem by letting data scientists conduct their own annotations, for rapid prototyping. Ideas can be tested faster than the first planning meeting could even be scheduled. We also expect Prodigy to reduce costs for larger projects, but it’s the increased agility we’re most excited about. Data science projects are said to have uneven returns, like start-ups: a minority of projects are very successful, recouping costs for a larger number of failures. If so, the most important problem is to find more winners. Prodigy helps you do that, because you get to try things much faster.

Update (June 2020)

Most annotation tools avoid making any suggestions to the user, to avoid biasing the annotations. Prodigy takes the opposite approach: ask the user as little as possible, and try to guess the rest. Prodigy puts the model in the loop, so that it can actively participate in the training process and learn as you go. The model uses what it already knows to figure out what to ask you next. As you answer the questions, the model is updated, influencing which examples it asks you about next. In order to take full advantage of this strategy, Prodigy is provided as a Python library and command line utility, with a flexible web application. There’s a thin, optional hosted component to make it easy to share annotation queues, but the tool itself is entirely under your control.
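One way to picture the model-in-the-loop idea is an uncertainty sort: within each small batch of incoming examples, ask first about the ones the model is least sure of. The following is a minimal sketch under that assumption, not Prodigy's actual implementation. The names `most_uncertain_first` and `toy_score` are hypothetical, and the keyword scorer merely stands in for a real model.

```python
from typing import Callable, Iterable, Iterator, List


def most_uncertain_first(
    stream: Iterable[str],
    score: Callable[[str], float],
    batch_size: int = 4,
) -> Iterator[str]:
    """Yield texts batch by batch, most uncertain (score nearest 0.5) first."""
    batch: List[str] = []
    for text in stream:
        batch.append(text)
        if len(batch) == batch_size:
            yield from sorted(batch, key=lambda t: abs(score(t) - 0.5))
            batch = []
    # Flush the final, possibly smaller batch.
    yield from sorted(batch, key=lambda t: abs(score(t) - 0.5))


def toy_score(text: str) -> float:
    """Hypothetical stand-in for a model: a crude keyword-based probability."""
    words = text.lower().split()
    if "docs" in words:
        return 0.9
    if "readme" in words:
        return 0.55
    return 0.2


titles = [
    "fix docs typo",
    "crash on import",
    "update readme badge",
    "segfault in parser",
]
ordered = list(most_uncertain_first(titles, toy_score))
```

Working in batches keeps the stream lazy, so the full input never has to be scored up front. In a real session the scores would also change as the model is updated between batches.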

Prodigy comes with built-in recipes for training and evaluating text classification, named entity recognition, image classification and word vector models. There’s also a neat built-in component for doing A/B evaluations, which we expect to be particularly useful for developing generative models and translation systems. To keep the system requirements to a minimum, data is stored in an SQLite database by default. It’s easy to use a different SQL backend, or to specify a custom storage solution.
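To illustrate what an A/B evaluation involves, here is a rough sketch of a blind comparison between two systems. It is an assumption about how such a component could work, not a description of Prodigy's built-in one. The function names, the task keys and the "left"/"right" answers are all hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List


def make_ab_tasks(
    outputs_a: List[str], outputs_b: List[str], seed: int = 0
) -> List[Dict[str, object]]:
    """Pair up two systems' outputs and hide which system produced which."""
    rng = random.Random(seed)
    tasks = []
    for out_a, out_b in zip(outputs_a, outputs_b):
        flipped = rng.random() < 0.5
        left, right = (out_b, out_a) if flipped else (out_a, out_b)
        tasks.append({"left": left, "right": right, "flipped": flipped})
    return tasks


def tally(tasks: List[Dict[str, object]], choices: List[str]) -> Counter:
    """Map each 'left'/'right' choice back to the system that produced it."""
    wins: Counter = Counter()
    for task, choice in zip(tasks, choices):
        picked_left = choice == "left"
        # The left slot shows system A unless the pair was flipped.
        system = "A" if picked_left != task["flipped"] else "B"
        wins[system] += 1
    return wins


tasks = make_ab_tasks(["a1", "a2", "a3"], ["b1", "b2", "b3"])
# Simulate an annotator who always prefers system A's output.
choices = ["right" if task["flipped"] else "left" for task in tasks]
wins = tally(tasks, choices)
```

Randomising the display order stops the annotator from learning which side belongs to which system.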

The components are wired together into a recipe, by adding the @recipe decorator to any Python function. The decorator lets you invoke your function from the command line, as a prodigy subcommand. Recipes can start the web service by returning a dictionary of components. The recipe system provides a good balance of declarative and procedural approaches. If you just need to wire together built-in components, returning a Python dictionary is no more typing than the equivalent JSON representation. But the Python function also lets you implement more complicated behaviours, and reuse logic across your recipes.

recipe.py

import prodigy
import your_arbitrary_ETL_logic


@prodigy.recipe('custom_stream', dataset=("Dataset"), db=("Database"),
                label=("Label", "option"))
def custom_stream(dataset, db=None, label=''):
    # Load rows from any source; the ETL module is a placeholder.
    DB = your_arbitrary_ETL_logic.load(db)
    # Returning a dictionary of components starts the web service.
    return {
        'dataset': dataset,
        # Lazily turn each row into an annotation task.
        'stream': ({'text': row.text, 'label': label} for row in DB),
        'view_id': 'classification',
    }
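For readers curious how a decorator can turn a function into a named subcommand, the sketch below shows the general registry pattern. It is illustrative only and makes no claim about Prodigy's internals. `RECIPES`, `recipe` and `run` are hypothetical names, and the hard-coded rows stand in for real ETL logic.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a subcommand name to the function behind it.
RECIPES: Dict[str, Callable[..., dict]] = {}


def recipe(name: str) -> Callable:
    """Register the decorated function under `name` (illustrative only)."""
    def decorator(func: Callable[..., dict]) -> Callable[..., dict]:
        RECIPES[name] = func
        return func
    return decorator


@recipe("custom_stream")
def custom_stream(dataset: str, label: str = "") -> dict:
    rows = ["first example", "second example"]  # stand-in for real ETL logic
    return {
        "dataset": dataset,
        "stream": ({"text": text, "label": label} for text in rows),
        "view_id": "classification",
    }


def run(name: str, *args, **kwargs) -> dict:
    """Look up a recipe by name and call it, as a CLI dispatcher might."""
    return RECIPES[name](*args, **kwargs)


components = run("custom_stream", "my_dataset", label="DOCS")
first_task = next(components["stream"])
```

Because the function is returned unchanged by the decorator, it stays importable and testable as ordinary Python.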

When humans interact with machines, their experience decides the success of the interaction. This is why Prodigy keeps each question as simple as possible. The more complicated the structure your model has to produce, the more benefit you can get from Prodigy’s binary interface. The web app lets you annotate text, entities, classification, images and custom HTML tasks straight from your browser – even on mobile devices.

Human time and attention are precious. Instead of presenting the annotators with a span of text that contains an entity, asking them to highlight it, select one of many labels from a dropdown and confirm, you can break the whole interaction down into a simple binary decision. You’ll have to ask more questions to get the same information, but each question will be simple and focused. You’ll collect more user actions, giving you lots of smaller pieces to learn from, and a much tighter feedback loop between the human and the model.
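The decomposition into binary decisions can be sketched as follows: each candidate entity becomes its own yes/no question, instead of one task that asks for highlighting and label selection. This is a simplified illustration. The task keys, the example sentence and the candidate spans are made up, and in practice the candidates would come from a model's suggestions.

```python
from typing import Dict, Iterator, List, Tuple


def binary_questions(
    text: str, candidates: List[Tuple[int, int, str]]
) -> Iterator[Dict[str, object]]:
    """Turn each (start, end, label) candidate span into one yes/no task."""
    for start, end, label in candidates:
        yield {
            "text": text,
            "span": text[start:end],
            "start": start,
            "end": end,
            "label": label,
        }


text = "Acme opened an office in Paris"
# Hypothetical model suggestions, including one deliberate mistake.
candidates = [(0, 4, "ORG"), (15, 21, "ORG"), (25, 30, "GPE")]
questions = list(binary_questions(text, candidates))

# Simulated annotator decisions, one per question.
answers = ["accept", "reject", "accept"]
accepted = [
    (q["span"], q["label"])
    for q, answer in zip(questions, answers)
    if answer == "accept"
]
```

Three quick decisions replace one slower, multi-step interaction, and the rejected suggestion is itself a useful training signal.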

Most AI systems today rely on supervised learning: you provide labelled input and output pairs, and get a program that can perform analogous computation for new data. Supervised learning algorithms have been improving quickly, leading many people to anticipate a new wave of entirely unsupervised algorithms: algorithms so “advanced” they can compute whatever you want, without you specifying what that might be. This is like hoping for a programming language so advanced you don’t even need to write a program.

Rather than spending a month figuring out an unsupervised machine learning problem, just label some data for a week and train a classifier. — Richard Socher (@RichardSocher) March 10, 2017

The beauty of ML is that the complexity of the final system comes much more from the data than from the human-written code. — Andrew Ng (@AndrewYNg) July 7, 2017

Unsupervised algorithms return meaning representations, based on the internal structure of the data. By definition, you can’t directly control what the process returns. Sometimes the meaning representation will directly address a useful question. If you’re looking for suspicious activity on your platform, you might find that an outlier detection process is all you need. However, the unsupervised algorithm won’t usually return clusters that map neatly to the labels you care about. With the right feature weightings, you might be able to come up with a model that sorts your data more usefully, but doing this by hand is unproductive: this is exactly the problem supervised learning is designed to solve.

Text classification models can be trained to perform a wide variety of useful tasks, including sentiment analysis, chatbot intent detection, and flagging abusive or fraudulent content. One of the problems with text classification is that it’s usually hard to guess how accurately the system will perform. Some problems turn out to be unexpectedly easy, while others are so difficult the intended functionality needs to be redesigned. Prodigy lets you perform very rapid prototyping, so that you can quickly find out which ideas are worth further exploration.

Workflow and data set

For more info on how to do text classification with Prodigy, see the detailed text classification workflow. You can also download the annotated data set we’ve created with Prodigy for this example.

Text classification really shines when the task would otherwise be performed by hand. For instance, we regularly categorise GitHub issues for our library, spaCy. Keeping the issue tracker tidy is something many open source projects struggle with – so automated tools could definitely be helpful. How easy would it be to create a bot to tag the issues automatically?

Prodigy is a Python library, so it’s easy to stream in data from any source – all you have to do is create a generator that yields your examples. Prodigy also includes several built-in API loaders, including one for the GitHub API. To get started, we’ll want to search for a query that returns a decent number of documentation issues. The model can’t know what we’re looking for until we’ve said “yes” to some examples. To find a good query, it’s useful to pipe the stream into less, so we can look at the results:
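A generator of this kind might look like the sketch below, which lazily converts JSON lines into tasks with the same `text` and `label` keys used in the recipe example above. The function name `stream_titles`, the `title` field and the sample records are assumptions made for illustration.

```python
import io
import json
from typing import Dict, Iterator, TextIO


def stream_titles(lines: TextIO, label: str) -> Iterator[Dict[str, str]]:
    """Lazily yield one task per JSON line, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield {"text": record["title"], "label": label}


# In-memory stand-in for a file or API response (hypothetical data).
raw = io.StringIO('{"title": "Typo in README"}\n\n{"title": "Crash on load"}\n')
tasks = list(stream_titles(raw, label="DOCS"))
```

Since nothing is read until the consumer asks for the next task, the same function works for a small file or an effectively endless feed.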

prodigy textcat.print-stream "docs" --api github --label DOCS | less -r

Custom recipes

The subcommand system is fully extensible. All you have to do is add the @recipe decorator to your function, and you’ll be able to call it from the command line. To start the annotation server, your recipe just has to return a dictionary of components, like the stream of examples, the annotation interface, and optional callbacks to update and save your model.

Now it’s time to start annotating. We first initialise a new dataset, adding a quick description for future reference. The next command starts the annotation server. The textcat.teach subcommand tells Prodigy to run the built-in recipe function teach(), using the rest of the arguments supplied on the command line.

prodigy dataset gh_issues "Classify issues on GitHub"
prodigy textcat.teach gh_issues en_core_web_sm "docs" --api github --label DOCS

Opening localhost:8080, we get a sequence of recent GitHub issue titles, displayed with our category as the title. If the category is correct, click accept, press a, or swipe left on a touch interface. If the category does not apply, click reject, press x, or swipe right. Some examples are unclear, or are exceptions that you don’t want the model to learn from. In these cases, you can click ignore or press space.

Prodigy trains a model during annotation, on the answers you’re providing. This lets Prodigy rank the examples in the stream, to ask fewer redundant questions. Learning from streaming data is a tricky problem, so we can usually get better results by training a new model from scratch, once all the annotations are collected. This also lets us study the model in more detail, and try different hyper-parameters.

After around 40 minutes of annotating the stream of issue titles for the search queries “docs”, “documentation”, “readme” and “instructions”, we end up with a total of 830 annotations that break down as follows:

Decision   Count
accept       261
reject       525
ignore        44
total        830

prodigy textcat.print-dataset gh_issues | less -r

By default, Prodigy uses spaCy v2.0’s new text classification system (currently in alpha). The model is a convolutional neural network stacked with a unigram bag-of-words. The bag-of-words model learns quickly, while the convolutional network lets the model pick up cues from longer phrases, once a few hundred examples are available.

Using a different text classification strategy with Prodigy is very easy. If you want to keep using spaCy, you can simply pass a new model instance to the TextClassifier component. For an entirely custom NLP solution, you only need to provide two functions: one which assigns scores to the text, and another which updates the model on a new batch of examples. If your text classification solution only supports batch training, you can use the built-in model during annotation, and then export the annotations to train your solution separately.
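To make the two-function idea concrete, here is a toy model that exposes one method for scoring a stream and one for updating on a batch of answers. It is a sketch under assumptions: the class name, the method names `predict` and `update`, and the `answer` key with "accept" and "reject" values are illustrative, not a documented Prodigy interface. The learner is a plain online logistic regression over unigrams.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


class BagOfWordsModel:
    """Toy online logistic regression over unigram features."""

    def __init__(self, learn_rate: float = 0.5) -> None:
        self.weights: Dict[str, float] = defaultdict(float)
        self.learn_rate = learn_rate

    def _score(self, text: str) -> float:
        total = sum(self.weights[w] for w in text.lower().split())
        return 1.0 / (1.0 + math.exp(-total))

    def predict(self, stream: Iterable[dict]) -> Iterator[Tuple[float, dict]]:
        """Function one: assign a score to each incoming example."""
        for example in stream:
            yield self._score(example["text"]), example

    def update(self, examples: List[dict]) -> None:
        """Function two: learn from a batch of accept/reject answers."""
        for example in examples:
            target = 1.0 if example["answer"] == "accept" else 0.0
            error = target - self._score(example["text"])
            for word in example["text"].lower().split():
                self.weights[word] += self.learn_rate * error


model = BagOfWordsModel()
before, _ = next(model.predict([{"text": "missing docs"}]))
model.update([
    {"text": "missing docs", "answer": "accept"},
    {"text": "crash on import", "answer": "reject"},
])
after, _ = next(model.predict([{"text": "missing docs"}]))
crash, _ = next(model.predict([{"text": "crash on import"}]))
```

An untrained model scores everything at 0.5. After a single batch, the accepted text scores above that and the rejected one below it.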

Within the first hour of annotation, the system classified 140 out of the 156 evaluation issues correctly. To put this into some context, we have to look at the class balance of the data. In the evaluation data, 65% of the examples were labelled reject, i.e. they were tagged as not being documentation issues. This gives a baseline accuracy of 65%, which the classifier easily exceeded. We can get some sense of how the system will improve as more data is annotated by retraining the system with fewer examples.
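The comparison against the baseline is simple arithmetic, spelled out below using only the figures quoted above. The error-reduction measure is an extra illustration that does not appear in the original evaluation.

```python
correct, total = 140, 156
accuracy = correct / total

# Always predicting the majority class ("reject") scores the majority share.
reject_share = 0.65
baseline = max(reject_share, 1 - reject_share)

# Fraction of the baseline's errors that the classifier removes.
error_reduction = (accuracy - baseline) / (1 - baseline)

print(
    f"accuracy={accuracy:.3f} baseline={baseline:.2f} "
    f"error reduction={error_reduction:.2f}"
)
```

Reporting the gain relative to the majority-class baseline matters because raw accuracy on imbalanced data can look better than it is.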

prodigy textcat.train-curve gh_issues --label DOCS --eval-split 0.2

%      ACCURACY
25%    0.73    +0.73
50%    0.82    +0.09
75%    0.84    +0.02
100%   0.87    +0.03

Interpreting the curve

Each row of the table shows an experiment where the model was evaluated on 20% of the data, and trained with a subset of the remaining examples. This lets you see the relationship between the data set size and accuracy, so you can predict how much accuracy might improve if you collect more annotations.

The train curve shows the accuracy achieved with 25%, 50%, 75% and 100% of the training data. The last 25% of the training data improved accuracy by 3 percentage points (from 0.84 to 0.87), which suggests that collecting more annotations will improve the system further. Similar logic is used to estimate the progress indicator during training.
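The procedure behind such a curve can be sketched in a few lines: hold out a fixed evaluation split, then retrain on growing fractions of the rest and record the change in accuracy. This is a simplified illustration, not the built-in command's implementation. The function names, the toy keyword learner and the tiny data set are assumptions, and shuffling is omitted to keep the result deterministic.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, bool]  # (text, is_docs_issue)


def train_curve(
    examples: Sequence[Example],
    train: Callable[[Sequence[Example]], Callable[[str], bool]],
    eval_split: float = 0.2,
    fractions: Sequence[float] = (0.25, 0.5, 0.75, 1.0),
) -> List[Tuple[float, float, float]]:
    """Return (fraction, accuracy, change) rows for growing training sets."""
    n_eval = max(1, int(len(examples) * eval_split))
    eval_data, train_data = examples[:n_eval], examples[n_eval:]
    rows, previous = [], 0.0
    for fraction in fractions:
        subset = train_data[: max(1, int(len(train_data) * fraction))]
        classify = train(subset)
        hits = sum(classify(text) == gold for text, gold in eval_data)
        accuracy = hits / len(eval_data)
        rows.append((fraction, accuracy, accuracy - previous))
        previous = accuracy
    return rows


def train_keyword_model(data: Sequence[Example]) -> Callable[[str], bool]:
    """Toy learner: treat words seen only in positive examples as cues."""
    positive = {w for text, gold in data if gold for w in text.lower().split()}
    negative = {w for text, gold in data if not gold for w in text.lower().split()}
    cues = positive - negative
    return lambda text: any(w in cues for w in text.lower().split())


examples = [
    ("docs typo", True), ("crash on start", False),       # evaluation split
    ("install fails", False), ("slow parser", False),
    ("docs missing", True), ("readme outdated", True),
    ("memory leak", False), ("tutorial unclear", True),
    ("wrong tokens", False), ("docs link broken", True),
]
rows = train_curve(examples, train_keyword_model)
```

If the change column is still clearly positive in the final row, more annotations are probably worth collecting. If it has flattened out, the effort may be better spent elsewhere.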

After training the model, Prodigy outputs a ready-to-use spaCy model, making it easy to put into production. spaCy comes with a handy package command that converts a model directory into a Python package, allowing the data dependency to be specified in your requirements.txt. This gives a smooth path from prototype to production, making it easy to really test the model, in the context of a larger system.

Model training and packaging

prodigy textcat.batch-train gh_issues /tmp/gh_docs --label DOCS
spacy package /tmp/gh_docs /tmp
cd /tmp/gh_docs
python setup.py sdist
pip install dist/gh_docs-1.0.0.tar.gz

Usage in spaCy v2.0.0 alpha+

import gh_docs

nlp = gh_docs.load()
texts = ['missing documentation',
         'docker container not loading',
         'installation not working on windows']
for doc in nlp.pipe(texts):
    print(doc.cats)
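Since `doc.cats` is a dictionary mapping each label to a score, an issue-tagging bot would still need a rule for turning scores into decisions. The sketch below shows one simple option, a fixed threshold. The function name, the 0.5 cut-off and the example scores are assumptions rather than output from the trained model.

```python
from typing import Dict, List


def labels_to_apply(cats: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return the labels whose score clears the threshold, highest first."""
    chosen = [label for label, score in cats.items() if score >= threshold]
    return sorted(chosen, key=lambda label: cats[label], reverse=True)


# Hypothetical scores in the shape of `doc.cats` (label -> score).
likely_docs = {"DOCS": 0.92}
unlikely_docs = {"DOCS": 0.08}

tags_a = labels_to_apply(likely_docs)
tags_b = labels_to_apply(unlikely_docs)
```

Raising the threshold trades missed tags for fewer wrong ones, which is often the right trade for a bot that acts without review.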

If annotation projects are expensive to start, you have to guess which ideas look promising. These guesses will often be wrong, because it’s difficult to predict the performance of a statistical model before the data has been collected. Prodigy helps you break through this bottleneck by dramatically reducing the cost of investigating new ideas. The whole annotation process is cheaper with Prodigy, but it’s the time-to-first-evidence that’s most important. There’s no shortage of ideas that would be incredibly valuable if they could be made to work. The shortage is in time to investigate those opportunities – which is exactly what Prodigy gives you more of.

Try Prodigy!