ML/NLP Tools and Datasets ⚙️

Here I highlight stories related to software and datasets that have assisted in enabling NLP and machine learning research and engineering.

Hugging Face released a popular Transformer library based on Pytorch names pytorch-transformers. It allows NLP practitioners and researchers to easily use state-of-the-art general-purpose architectures such as BERT, GPT-2, and XLM, among others. If you are interested in how to use pytorch-transformers there are a few places to start but I really liked this detailed tutorial by Roberto Silveira showing how to use the library for machine comprehension.

TensorFlow 2.0 was released with a bunch of new features. Read more about best practices here. François Chollet also wrote an extensive overview of the new features in this Colab notebook.

PyTorch 1.3 was released with a ton of new features including named tensors and other front-end improvements.

The Allen Institute for AI released Iconary which is an AI system that can play Pictionary-style games with a human. This work incorporates visual/language learning systems and commonsense reasoning. They also published a new commonsense reasoning benchmark called Abductive-NLI.

spaCy releases a new library to incorporate Transformer language models into their own library so as to be able to extract features and used them in spaCy NLP pipelines. This effort is built on top of the popular Transformers library developed by Hugging Face. Maximilien Roberti also wrote a nice article on how to combine fast.ai code with pytorch-transformers.

The Facebook AI team released PHYRE which is a benchmark for physical reasoning aiming to test the physical reasoning of AI systems through solving various physics puzzles.

StanfordNLP released StanfordNLP 0.2.0 which is a Python library for natural language analysis. You can perform different types of linguistic analysis such as lemmatization and part of speech recognition on over 70 different languages.

GQA is a visual question answering dataset for enabling research related to visual reasoning.

exBERT is a visual interactive tool to explore the embeddings and attention of Transformer language models. You can find the paper here and the demo here.

exBERT — source

Distill published an article on how to visualize memorization in Recurrent Neural Networks (RNNs).

Mathpix is a tool that lets you take a picture of an equation and then it provides you with the latex version.

Parl.ai is a platform that hosts many popular datasets for all works involving dialog and conversational AI.

Uber researchers released Ludwig, an open-source tool that allows users to easily train and test deep learning models with just a few lines of codes. The whole idea is to avoid any coding while training and testing models.

Google AI researchers release “Natural Questions” which is a large-scale corpus for training and evaluating open-domain question answering systems.

Articles and Blog posts ✍️

This year witnessed an explosion of data science writers and enthusiasts. This is great for our field and encourages healthy discussion and learning. Here I list a few interesting and must-see articles and blog posts I came across:

Christian Perone provides an excellent introduction to maximum likelihood estimation (MLE) and maximum a posteriori (MAP) which are important principles to understand how parameters of a model are estimated.

Reiichiro Nakano published a blog post discussing neural style transfer with adversarially robust classifiers. A Colab notebook was also provided.

Saif M. Mohammad started a great series discussing a diachronic analysis of ACL anthology.

“Graphs showing average academic age, median academic age, and percentage of first-time publishers in AA over time.” — source

The question is: can a language model learn syntax? Using structural probes, this work aims to show that it is possible to do so using contextualized representations and a method for finding tree structures.

Andrej Karpathy wrote a blog post summarizing best practices and a recipe on how to effectively train neural networks.

Google AI researchers and other researchers collaborated to improve the understanding of search using BERT models. Contextualized approaches like BERT are adequate to understand the intent behind search queries.

Rectified Adam (RAdam) is a new optimization technique based on Adam optimizer that helps to improve AI architectures. There are several efforts in coming up with better and more stable optimizers but the authors claim to focus on other aspects of optimizations that are just as important for delivering improved convergence.

With a lot of development of machine learning tools recently, there are also many discussions on how to implement ML systems that enable solutions to practical problems. Chip Huyen wrote an interesting chapter discussing machine learning system design emphasizing on topics such as hyperparameter tuning and data pipeline.

NVIDIA breaks the record for creating the biggest language model trained on billions of parameters.

Abigail See wrote this excellent blog post about what makes a good conversation in the context of systems developed to perform natural language generation task.

Google AI published two natural language dialog datasets with the idea to use more complex and natural dialog datasets to improve personalization in conversational applications like digital assistants.

Deep reinforcement learning continues to be one of the most widely discussed topics in the field of AI and it has even attracted interest in the space of psychology and neuroscience. Read more about some highlights in this paper published in Trends in Cognitive Sciences.

Samira Abner wrote this excellent blog post summarizing the main building blocks behind Transformers and capsule networks and their connections. Adam Kosiorek also wrote this magnificent piece on stacked capsule-based autoencoders (an unsupervised version of capsule networks) which was used for object detection.

Researchers published an interactive article on Distill that aims to show a visual exploration of Gaussian Processes.

Through this Distill publication, Augustus Odena makes a call to researchers to address several important open questions about GANs.

Here is a PyTorch implementation of graph convolutional networks (GCNs) used for classifying spammers vs. non-spammers.

At the beginning of the year, VentureBeat released a list of predictions for 2019 made by experts such as Rumman Chowdury, Hilary Mason, Andrew Ng, and Yan LeCun. Check it out to see if their predictions were right.

Learn how to finetune BERT to perform multi-label text classification.

Due to the popularity of BERT, in the past few months, many researchers developed methods to “compress” BERT with the idea to build faster, smaller and memory-efficient versions of the original. Mitchell A. Gordon wrote a summary of the types of compressions and methods developed around this objective.

Superintelligence continued to be a topic of debate among experts. It’s an important topic that needs a proper understanding of frameworks, policies, and careful observations. I found this interesting series of comprehensive essays (in the form of a technical report by K. Eric Drexler) to be useful to understand some issues and considerations around the topic of superintelligence.

Eric Jang wrote a nice blog post introducing the concept of meta-learning which aims to build and train machine learning models that not only predict well but also learn well.

A summary of AAAI 2019 highlights by Sebastian Ruder.

Graph neural networks were heavily discussed this year. David Mack wrote a nice visual article about how they used this technique together with attention to perform shortest path calculations.

Bayesian approaches remain an interesting subject, in particular how they can be applied to neural networks to avoid common issues like over-fitting. Here is a list of suggested reads by Kumar Shridhar on the topic.

“Network with point-estimates as weights vs Network with probability distribution as weights” - Source

Ethics in AI 🚨

Perhaps one of the most highly discussed aspects of AI systems this year was ethics which include discussions around bias, fairness, and transparency, among others. In this section, I provide a list of interesting stories and papers around this topic:

The paper titled “Does mitigating ML’s impact disparity require treatment disparity?” discusses the consequences of applying disparate learning processes through experiments conducted on real-world datasets.

HuggingFace published an article discussing ethics in the context of open-sourcing NLP technology for conversational AI.

Being able to quantify the role of ethics in AI research is an important endeavor going forward as we continue to introduce AI-based technologies to society. This paper provides a broad analysis of the measures and “use of ethics-related research in leading AI, machine learning and robotics venues.”

This work presented at NAACL 2019 discusses how debiasing methods can cover up gender bias in word embeddings.

Listen to Zachary Lipton presenting his paper “Troubling Trends in ML Scholarship”. I also wrote a summary of this interesting paper which you can find here.

Gary Marcus and Ernest Davis published their book on “Rebooting AI: Building Artificial Intelligence We Can Trust”. The main theme of the book is to talk about the steps we must take to achieve robust artificial intelligence. On the topic of AI progression, François Chollet also wrote an impressive paper making a case for better ways to measure intelligence.

Check out this Udacity course created by Andrew Trask on topics such as differential privacy, federated learning, and encrypted AI. On the topic of privacy, Emma Bluemke wrote this great post discussing how one may go about training machine learning models while preserving patient privacy.

At the beginning of this year, Mariya Yao posted a comprehensive list of research paper summaries involving AI ethics. Although the list of paper reference was from 2018, I believe they are still relevant today.

ML/NLP Education 🎓

Here I will feature a list of educational resources, writers and people doing some amazing work educating others about difficult ML/NLP concepts/topics:

CMU released materials and syllabus for their “Neural Networks for NLP” course.

Elvis Saravia and Soujanya Poria released a project called NLP-Overview that is intended to help students and practitioners to get a condensed overview of modern deep learning techniques applied to NLP, including theory, algorithms, applications, and state of the art results — Link

Microsoft Research Lab published a free ebook on the foundation of data science with topics ranging from Markov Chain Monte Carlo to Random Graphs.

“Mathematics for Machine Learning” is a free ebook introducing the most important mathematical concepts used in machine learning. It also includes a few Jupyter notebook tutorials describing the machine learning parts. Jean Gallier and Jocelyn Quaintance wrote an extensive free ebook covering mathematical concepts used in machine learning.

Stanford releases a playlist of videos for its course on “Natural Language Understanding”.

On the topic of learning, OpenAI put together this great list of suggestions on how to keep learning and improving your machine learning skills. Apparently, their employees use these methods on a daily basis to keep learning and expanding their knowledge.

Adrian Rosebrock published an 81-page guide on how to do computer vision with Python and OpenCV.

Emily M. Bender and Alex Lascarides published a book titled “Linguistic Fundamentals for NLP”. The main idea behind the book is to discuss what

“meaning” is in the field of NLP by providing a proper foundation on semantics and pragmatics.

Elad Hazan published his lecture notes on “Optimization for Machine Learning” which aims to present machine training as an optimization problem with beautiful math and notations. Deeplearning.ai also published a great article that discusses parameter optimization in neural networks using a more visual and interactive approach.

Andreas Mueller published a playlist of videos for a new course in “Applied Machine Learning”.

Fast.ai releases its new MOOC titled “Deep Learning from the Foundations”.

MIT published all videos and syllabus for their course on “Introduction to Deep Learning”.

Chip Huyen tweeted an impressive list of free online courses to get started with machine learning.

Andrew Trask published his book titled “Grokking Deep Learning”. The book serves as a great starter for understanding the fundamental building blocks of neural network architectures.

Sebastian Raschka uploaded 80 notebooks about how to implement different deep learning models such as RNNs and CNNs. The great thing is that the models are all implemented in both PyTorch and TensorFlow.

Here is a great tutorial that goes deep into understanding how TensorFlow works. And here is one by Christian Perone for PyTorch.

Fast.ai also published a course titled “Intro to NLP” accompanied by a playlist. Topics range from sentiment analysis to topic modeling to the Transformer.

Learn about Graph Convolutional Neural Networks for Molecular Generation in this talk by Xavier Bresson. Slides can be found here. And here is a paper discussing how to pre-train GNNs.

On the topic of graph networks, some engineers use them to predict the properties of molecules and crystal. The Google AI team also published an excellent blog post explaining how they use GNNs for odor prediction. If you are interested in getting started with Graph Neural Networks, here is a comprehensive overview of the different GNNs and their applications.

Here is a playlist of videos on unsupervised learning methods such as PCA by Rene Vidal from Johns Hopkins University.

If you are ever interested in converting a pretrained TensorFlow model to PyTorch, Thomas Wolf has you covered in this blog post.

Want to learn about generative deep learning? David Foster wrote a great book that teaches data scientists how to apply GANs and encoder-decoder models for performing tasks such as painting, writing, and composing music. Here is the official repository accompanying the book, it includes TensorFlow code. There is also an effort to convert the code to PyTorch as well.

A Colab notebook containing code blocks to practice and learn about causal inference concepts such as interventions, counterfactuals, etc.

Here are the materials for the NAACL 2019 tutorial on “Transfer Learning in Natural Language Processing” delivered by Sebastian Ruder, Matthew Peters, Swabha Swayamdipta and Thomas Wolf. They also provided an accompanying Google Colab notebook to get started.

Another great blog post from Jay Alammar on the topic of data representation. He also wrote many other interesting illustrated guides that include GPT-2 and BERT. Peter Bloem also published a very detailed blog post explaining all the bits that make up a Transformer.

A visual illustration of basic self-attention — source

Here is a nice overview of trends in NLP at ACL 2019, written by Mihail Eric. Some topics include infusing knowledge into NLP architectures, interpretability, and reducing bias among others. Here are a couple more overviews if you are interested: link 2 and link 3.

The full syllabus for CS231n 2019 edition was released by Stanford.

David Abel posted a set of notes for ICLR 2019. He was also nice to provide an impressive summary of NeurIPS 2019.

This is an excellent book that provides learners with a proper introduction to deep learning with notebooks provided as well.

An illustrated guide to BERT, ELMo, and co. for transfer learning NLP.

Fast.ai releases its 2019 edition of the “Practical Deep Learning for Coders” course.

Learn about deep unsupervised learning in this fantastic course taught by Pieter Abbeel and others.

Gilbert Strang released a new book related to Linear Algebra and neural networks.

Caltech provided the entire syllabus, lecture slides, and video playlist for their course on “Foundation of Machine Learning”.

The “Scipy Lecture Notes” is a series of tutorials that teach you how to master tools such as matplotlib, NumPy, and SciPy.

Here is an excellent tutorial on understanding Gaussian processes. (Notebooks provided).

This is a must-read article in which Lilian Weng provides a deep dive into generalized language models such as ULMFit, OpenAI GPT-2, and BERT.

Papers with Code is a website that shows a curated list of machine learning papers with code and state-of-the-art results.

Christoph Molnar released the first edition of “Interpretable Machine Learning” which is a book that touches on important techniques used to better interpret machine learning algorithms.

David Bamman releases the full syllabus and slides to the NLP courses offered at UC Berkley.

Berkley releases all materials for their “Applied NLP” class.

Aerin Kim is a senior research engineer at Microsoft and writes about topics related to applied Math and deep learning. Some topics include intuition to conditional independence, gamma distribution, perplexity, etc.

Tai-Danae Bradley wrote this blog post discussing ways on how to think about matrices and tensors. The article is written with some incredible visuals which help to better understand certain transformations and operations performed on matrices.