Highlights from My First NIPS

As a machine learning practitioner in the Los Angeles area, I was ecstatic to learn that NIPS 2017 would be in Long Beach this year. The conference sold out in a day or two. It was held at the Long Beach Convention Center (and Performing Arts Center), very close to the Aquarium of the Pacific and about a mile from the Queen Mary. The venue itself was beautiful, and probably the nicest place I’ve ever attended a conference. It’s also the most expensive place I’ve ever attended a conference. $5 for a bottle of Coke? $11 for two cookies? But I digress.

I attended most of the conference, but as someone who has attended many conferences, I’ve learned that attending everything is not necessary, and is counterproductive to one’s sanity. I attended the main conference and one workshop day, but skipped the tutorials, the Saturday workshops and the industry demos. The conference talks were livestreamed via Facebook Live at the NIPS Foundation’s Facebook page, and the recordings are also archived there.

This may make some question why one would actually want to attend the conference in person, but there are several reasons:

- to talk with the authors of interesting papers during the poster sessions;
- to meet up with likeminded people, a reunion of sorts. I had dinner with the LA Data Science crew;
- to be surrounded by likeminded people and perhaps get to meet some of the big names in machine learning, or people whose work has been valuable. During the week, I saw Yann LeCun, Ian Goodfellow, Hal Daume III, Judea Pearl etc. There were so many people at this NIPS that I did not see many others that I knew were present;
- As my friend Rob pointed out, “THE WORKSHOPS!” Yes, the workshops are legit and are not recorded. You can also buy a special ticket for only the tutorials or only the workshops. That is something to keep in mind if your time is limited;
- The sponsor and employer expo can be useful for those looking for internships, full time jobs, or post-docs. Unfortunately, the opportunities were heavily focused on research fellow positions, and positions in research labs as a researcher, not the standard applied roles that I usually gravitate towards. This was a real bummer.

There are also plenty of parties for the TensorBros. I kid. If convex optimization, TPUs and functional programming are too boring for you, you could have chilled with Flo Rida instead. Wait, who??

I usually write very long, drawn out blog posts about these conferences, but I am getting old, so I will try to just summarize some of the sessions and research I found the most interesting. It looks like it is just as long as usual.

Keynotes

Usually I nod off during the keynote and plenary talks as I tend to find them too general. Honestly, this time I found these talks to be the most interesting and motivating of the entire conference. The presenters spoke more about issues facing the community without getting hung up on deep learning and particular ways of doing machine learning/AI.

Ali Rahimi, this year’s winner of the Test of Time award, delivered an acceptance speech that earned a standing ovation, and it gave all of us a reality check about the direction of machine learning/AI. He described a “self-congratulatory” aura in the AI community. He further likened our current deep learning discourse to alchemy and encouraged a return to rigor, which NIPS seemed to be quite religious about in earlier days. He seemed to take issue with Andrew Ng’s tweet, “AI is the new electricity.” My take is that we are currently in a hype cycle, one that I believe transformed from the “data science” hype cycle. I admit I have not embraced deep learning in my own work, and Ali’s claim that we are treating AI as alchemy really struck me to the point that I feel a bit vindicated. I am not a proofs or theory person, but I cannot use methods that do not seem to have some sort of mathematical basis… and to use such methods for life changing decisions would be unethical and irresponsible. Yann LeCun posted a response disagreeing with some of Ali’s points.

Kate Crawford spoke about fairness and bias in machine learning models, and how many models are biased against particular groups because they are trained on data that is biased by preconceived notions about race, gender roles, and more. Her concern is that if we allow these biases to affect models that make life-changing decisions, machine learning will suffer negative backlash leading to another AI winter. Kate listed several examples, but a few of them stood out to me as very surprising. She noted that one study showed that when Googling a name that sounds African-American, Google’s ad server chose to display an ad for criminal background checks. To approach resolving the problem, Kate suggested building pre-release models and carefully studying how the model treats each subpopulation. This is commonplace in the world of educational testing (I originally studied psychometrics), where a field test procedure is always performed on new test items. If a particular subpopulation performs significantly better or worse than the others net of all other factors, the test item is dropped. This phenomenon can be described mathematically as differential item functioning (DIF). Anyway, back to Kate. What I appreciated about her talk is that the problem was obvious to anyone who works in machine learning, but she went into a level of detail that we have not heard before.

Main Conference

The main conference was divided into two parallel tracks that started with 4-6 15-minute talks followed by 12-20 “spotlight” (i.e. lightning) talks of 5 minutes each. The tracks were: Algorithms, Optimization, Algorithms/Optimization (a 2 for 1!), Theory (goodness no), Deep Learning Applications, Probabilistic Methods Theory, Deep Learning and Reinforcement Deep Learning. The tracks were very blurred – I mean, the entire conference was theory and there is a lot of optimization involved with deep learning, so the main conference was sort of a grab bag involving a lot of walking back and forth between rooms depending on the topic… or for me, whether or not the air conditioning was on.

Most of the talks involved deep learning obviously. I found that the majority of applications focused on images, video and speech… the usual. I would love to see more talks focused on language/text, music and motion, though I am sure those are coming. There was some discussion about art and style transfer, which is cool, but, well, cool. There were a lot of interesting talks, but the one that stood out to me (and many others) was actually a 5 minute spotlight/lightning talk on interpreting models using a technique called Shapley Additive Explanations, or SHAP (paper, code). The method boils down to an importance score for each feature and each observation which can be studied after model prediction to determine why a particular observation was labeled as it was and which feature(s) was/were responsible. There was a similar talk focused on image processing, where a proposed algorithm would “highlight” parts of an image that “encouraged” the model to attach a certain label (such as the ears and nose for a dog).
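To make the "importance score for each feature and each observation" idea concrete, here is a minimal sketch of exact Shapley values for a single observation. This is not the algorithm used by the SHAP package, which relies on approximations to scale. It is a brute-force enumeration over feature subsets, and it assumes that a "missing" feature can be simulated by swapping in a baseline value. The names `shapley_values`, `toy_model` and `baseline` are made up for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attribution of predict(x) - predict(baseline) to each feature.

    Assumption: a feature that is "absent" from a coalition takes its baseline value.
    Cost grows exponentially with the number of features, so this is for toy inputs only.
    """
    n = len(x)

    def value(subset):
        # Features in `subset` keep their real values; the rest fall back to the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                # Marginal contribution of feature i when added to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    # Hypothetical model: two linear terms plus one interaction term.
    return 2.0 * z[0] + 1.0 * z[1] + 3.0 * z[0] * z[2]


x = [1.0, 2.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The scores add up to the difference between the prediction for `x` and the prediction for the baseline, which is what lets you read them as "how much each feature pushed this one prediction." The interaction term is split evenly between the two features involved.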

Many, if not all, of the spotlight/lightning talks are associated with a poster and also have slides, code, and sometimes a video associated with them. Check the NIPS 2017 schedule to find resources for each poster.

Symposia

The symposia just seemed like another “main conference” track but with a panel discussion. I attended the symposium on Interpretable Machine Learning, which seems like a hot topic right now… but we statisticians have been doing it for years, and have stuck to the unsexy methods “regression” and “decision trees.” Many of the talks involved causality and interventions, which initially came out of left field to me, but make sense in the grand scheme of things. If one can “prove” that x causes y, interpreting models becomes easier. Although we can prove correlations, many are spurious and meaningless, and thus the model likely is not interpretable. This whole issue seems to have arisen from the medical community (my opinion/observation) as machine learning/AI is being used more and more in medicine for diagnoses and recommendations. If we are going to deploy models that prescribe certain medicines or procedures, we (and doctors) need to be able to debug model errors or we will injure or kill many people. For machine learning practitioners, this “debugging” is conceptual and mechanical. For doctors, this debugging must be in terms of their original training… in other words, the model must be interpretable. Another area where AI-in-a-box can cause problems is with driverless cars. Honk, honk!

One talk I found very interesting in this session was a talk called On Fairness and Calibration (unfortunately I don’t remember which author spoke). The speaker basically rehashed the importance of looking at metrics other than accuracy such as true positive rate, false positive rate, true negative rate etc., particularly among subpopulations. He suggested analyzing performance across groups and looking at the gap between how we expect the model to perform for a subpopulation and what performance we actually observe. What was amusing to me as a statistician is that this paper basically “rediscovers” ROC and PR curves (calibration in general), hypothesis testing (observed vs. expected results), and the mixed effects model (different analysis for each group) used in statistics. Of course, the audience was not from statistics and it was a very impactful talk.
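The per-group breakdown the speaker described is easy to sketch. The snippet below is not the method from the paper. It is just a plain-Python illustration of computing true positive rate, false positive rate and a crude calibration gap (mean predicted probability minus observed positive rate) for each subpopulation. The function name `group_report`, the 0.5 threshold and the toy data are all assumptions made up for this example.

```python
from collections import defaultdict


def group_report(y_true, y_prob, groups, threshold=0.5):
    """Per-group TPR, FPR and a one-number calibration gap.

    Assumption: binary labels (0/1) and predicted probabilities of the positive class.
    """
    buckets = defaultdict(list)
    for y, p, g in zip(y_true, y_prob, groups):
        buckets[g].append((y, p))

    report = {}
    for g, rows in buckets.items():
        tp = sum(1 for y, p in rows if y == 1 and p >= threshold)
        fn = sum(1 for y, p in rows if y == 1 and p < threshold)
        fp = sum(1 for y, p in rows if y == 0 and p >= threshold)
        tn = sum(1 for y, p in rows if y == 0 and p < threshold)
        mean_pred = sum(p for _, p in rows) / len(rows)
        base_rate = sum(y for y, _ in rows) / len(rows)
        report[g] = {
            "tpr": tp / (tp + fn) if tp + fn else float("nan"),
            "fpr": fp / (fp + tn) if fp + tn else float("nan"),
            # Expected positive rate (per the model) minus what was actually observed.
            "calibration_gap": mean_pred - base_rate,
        }
    return report


# Made-up toy data: the model is sharper on group "a" than on group "b".
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.8, 0.2, 0.1, 0.6, 0.4, 0.9, 0.3]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

report = group_report(y_true, y_prob, groups)
for g, metrics in sorted(report.items()):
    print(g, metrics)
```

Overall accuracy on this toy data hides the fact that group "b" gets half of its positives missed and half of its negatives flagged, which is the kind of gap the talk was asking people to look for.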

In statistics we are taught that interpretable models are extremely important. This is why some machine learning competitions on platforms such as Kaggle are bothersome for aspiring data scientists. The problem statements and datasets often encourage extremely complicated models that really have no meaning but seem to “just work.” I suppose these models are fine for products, but they are dangerous in high stakes situations.

The panel discussion showed that we have a long way to go in terms of interpretable models. Much of the discussion involved participants’ definitions of the word interpretability and statements of “it depends on your definition of interpretable.” I think this whole issue is going to end up being resolved by, “try to make models interpretable, but if you can’t, just don’t make them so complicated that nobody knows what they’re doing.”

I ended up leaving early and grabbing dinner with some fellow attendees. By the time I left, my ears had become completely numb to the words interpretable and interpretability, a cacophony of syllables that just ran into each other.

Friday Workshop: Machine Learning Systems

I’ve never been to a conference where there were 27 workshops going on at the same time, on the same day. Then there were another 26 the following day, all at the same time. This was a bummer because there were so many good ones to choose from. One might as well just throw one’s hands in the air and go to the one with the best seating.

Since I mainly work as a machine learning engineer, and have experienced the usual issues building and monitoring machine learning systems, I decided to attend the ML Systems workshop. Of course, that’s not what the workshop was about, but it was still very interesting. Ion Stoica presented on Ray, a distributed execution system for AI and reinforcement learning. Another interesting talk was on DLVM, which is a compiler framework for creating neural network DSLs. There was also a series of talks giving updates on current AI systems: TensorFlow (project), PyTorch (project), Caffe2 (project), CNTK (project), MXNet (project), TVM (project), Clipper (project), MacroBase (project) and ModelDB (project). There was also a presentation about ONNX (project), an ecosystem for interchangeable models that can be used across deep learning systems (which reminds me of the seldom used PMML). Most, if not all, of these systems are based on Python. Woof!

There were two very interesting talks that did not involve deep learning frameworks. Alex Beutel from Google presented on The Case for Learned Index Structures, which focused on using the distribution of the data within an index to speed up common database operations such as selects, presumably by using percentiles and other statistical measures. The premise of Alex’s talk was that the B-Tree induced by the index can be considered a machine learning model over an assumed uniform distribution. Further work is required for data that changes over time. Virginia Smith presented on Federated Multi-Task Learning (paper), which discussed a framework for building models from data provided by several heterogeneous devices, all of which have their own failure rates and communication limitations.
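The "index as a model" premise can be sketched in a few lines. The class below is a deliberately simplified illustration and not the architecture from the talk: it fits a single least-squares line that maps a key to its position in a sorted list, records the worst-case prediction error, and then only searches inside that error window. The class name `LinearLearnedIndex` is made up, and the sketch assumes a static, non-empty, sorted list of numeric keys.

```python
import bisect


class LinearLearnedIndex:
    """Toy learned index: a linear model predicts a key's position in a sorted list."""

    def __init__(self, sorted_keys):
        # Assumption: `sorted_keys` is non-empty, sorted ascending, and never changes.
        self.keys = sorted_keys
        n = len(sorted_keys)
        mean_k = sum(sorted_keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in sorted_keys)
        cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(sorted_keys))
        # Least-squares fit: position is approximately slope * key + intercept.
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # The worst prediction error over the stored keys bounds the search window.
        self.max_err = max(abs(p - self._predict(k)) for p, k in enumerate(sorted_keys))

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of `key`, or None if it is not stored."""
        n = len(self.keys)
        guess = self._predict(key)
        lo = min(max(0, guess - self.max_err), n)
        hi = max(lo, min(n, guess + self.max_err + 1))
        # Binary search only inside the window the model points at.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < n and self.keys[i] == key else None


# Made-up keys that are clearly not uniformly spaced.
keys = [i * i for i in range(100)]
index = LinearLearnedIndex(keys)
print(index.lookup(49), index.lookup(50), index.max_err)
```

The closer the keys are to evenly spaced, the smaller `max_err` becomes and the less searching is left to do. With skewed keys like the squares above, a single line fits poorly and the window gets wide, which hints at why a richer model of the key distribution would help.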

Takeaway Lessons and Learnings

I had a good time at NIPS, but because I have not yet embraced deep learning, I did not dive into it as much as I could have. My first learning is that I can no longer put off reading Ian Goodfellow’s book and some of the other deep learning books I’ve collected, such as Josh Patterson’s book and Francois Chollet’s new book. NIPS is a very academic conference, and I do not believe I have been to a conference this academic before (and I’ve been to IJCAI, KDD, CIKM, and JSM). That is not a bad thing, but as a more applied person, I think KDD et al. are more my cup of tea. With those conferences, I felt the application was the star, and that the methods and theory were discussed in detail as a means to an end. At NIPS, I feel the methods and theory are the star.