“Understanding” is the New Hotness

Since this is a wrap-up post, it makes sense to try to describe the current “hot topics” and trends. An easy way to spot the hot new trends is to compare the oral session titles from CVPR 2016 and 2015. These were the topics that remained the same:

Computational photography

3D “stuff” — recognition, matching, reconstruction, etc.

Segmentation — in both images and video

Image processing and restoration

Action recognition

But here were the major new topics for 2016:

Image captioning and question answering (these were new for CVPR, but were also major themes at NIPS 2015 and ICML 2016)

Video understanding

The trend here should be obvious. Another name for image captioning and visual question answering is image understanding. So between image and video understanding, it is clear that vision researchers (like natural language processing researchers before them) are trying to tackle increasingly complex problems. That isn’t to say that traditional problems such as image segmentation or object recognition are solved. Rather, researchers are choosing to push the boundaries of what questions computer vision algorithms can answer. And I have to admit that the talks in these “harder” areas were the ones I paid the most attention to.

DenseCap generally does an awesome job — but I found a case where it performs poorly!

Sidebar: Whenever there is a new problem, researchers create new metrics to score their solutions. You can imagine that with image captioning and question answering, there are thousands of “right” answers. The problem is that the audience/reader doesn’t have any intuition about these new metrics. And so when you read or hear that the solution has a Gleichauf score (and yes, I just made up that metric) of 6.3, you’re left wondering if that is good or not. I’d argue that a lot of the current interest in computer vision is due to ImageNet, and the thing that captivated the general public’s interest was that algorithms surpassed human performance. Authors of papers should consider giving the reader a similar reference point. FWIW, I left CVPR thinking that image captioning seems to work well and question answering has a long way to go.
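To make that concrete, here is a minimal sketch of how a captioning metric aggregates over multiple reference answers, using BLEU via NLTK. BLEU is just one of several metrics used in captioning papers, and the captions below are invented for illustration:

```python
# A minimal sketch of scoring one generated caption against several human
# reference captions with BLEU. The captions are invented for illustration.
from nltk.translate.bleu_score import sentence_bleu

references = [
    "a man dunks a basketball during a game".split(),
    "a basketball player dunking the ball".split(),
    "a player jumps up and dunks a basketball".split(),
]
candidate = "a man dunks a basketball".split()

# BLEU measures n-gram overlap against ALL of the references at once,
# so any of the "right" answers can earn the candidate credit.
score = sentence_bleu(references, candidate)
print(round(score, 2))  # ~0.82 here, but is 0.82 good? Without a human
                        # baseline as a reference point, it is hard to say.
```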

Stop Using ImageNet

A few months ago, I had a conversation with Chris Re about the direction of Lab41. In particular, we discussed my desire to get academia and industry to participate in our challenges. He urged me to reconsider my definition of “participation.” His contention was that the way you get researchers interested in your problem is to create a dataset for others to download and use.

That point really struck home at CVPR this year. Researchers released a ton of new and interesting datasets (far too many for my team to download and try). Here were some of my favorites:

MegaFace — This dataset is awesome. The idea behind it is that face recognition researchers have been limited to datasets with only thousands of images and identities (LFW has about 13,000 images). So the MegaFace team set out to build the first available face recognition dataset with over a million photos. In the paper released alongside the dataset, they show that algorithms with similar performance on LFW have significantly different performance on MegaFace.

Sample Images from MegaFace

Multi-Person Video — As a big basketball fan, I’m really excited about the multi-person video dataset created by Vignesh and his colleagues over at Stanford. The dataset includes over 250 basketball games and 14,000 annotations in those games. The goal is for researchers to develop algorithms that can detect the key actors and actions within the videos. I must admit that counting 3-pointers made isn’t a focus of Lab41, but I want to find some excuse to play around with this data :).

Any guesses on who the key actor is from this clip?

TGIF — I talked about this dataset in my Day 4 post, but I like the idea of using animated GIFs (instead of short videos) for action recognition research. People who make animated GIFs are pretty careful to ensure that every frame relates to the point they are trying to make, which means the GIFs contain little extraneous information.

Train Networks End-to-End

A long time ago, when I was still an undergraduate, I spent a summer working at The Ohio State University for Ashok Krishnamurthy doing audio processing work. It was that summer that I was introduced to Matlab, Simulink, cepstral coefficients, and Star Wars: The Phantom Menace. In between watching clips of Qui-Gon Jinn fighting Darth Maul, I learned a lot about the speech processing pipeline.

From raw audio there was a feature extraction step, followed by acoustic modeling, phoneme prediction, language modeling, and so on. This was well before the advent of deep learning, so while the pipeline worked, the models were rather brittle. A few years ago, CNNs started replacing the traditional algorithms within this pipeline. And performance soared.
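As a toy illustration of that staged design, here is a sketch in which every stage is a separate, hand-built component. Each function below is a hypothetical stand-in, not anyone’s real model:

```python
# A toy sketch of the staged pipeline: each box is designed and tuned
# separately, and mistakes in an early stage are frozen in for the rest.
# Every function here is a hypothetical stand-in, not a real library call.
import numpy as np

def extract_features(raw_audio):
    # stand-in for hand-engineered features (e.g. cepstral coefficients)
    return np.abs(np.fft.rfft(raw_audio))[:40]

def acoustic_model(features):
    # stand-in for a separately trained acoustic model
    return features / features.sum()  # fake phoneme probabilities

def language_model(phoneme_probs):
    # stand-in for a separately trained language model
    return int(np.argmax(phoneme_probs))  # fake word index

raw_audio = np.random.randn(16000)  # a fake one-second clip at 16 kHz
word = language_model(acoustic_model(extract_features(raw_audio)))
```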

Fast forward to modern times. I attended ICML last year and the best talk (IMO) was Towards End-to-End Speech Recognition using Deep Neural Nets given by Tara Sainath. In her presentation, Tara discussed how Google experimented with replacing these boxes on a step-by-step basis. The eventual goal was to build a single neural net that could replace the entire pipeline (instead of developing individual nets to replace each box one-at-a-time).

At CVPR this year, end-to-end training was no longer an eventual goal. From DenseCap to image question answering, training end-to-end was the new norm. And the new norm happens to deliver straight-up better performance. It’s amazing how quickly that change occurred.
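For contrast, here is a minimal sketch of the end-to-end idea (written in PyTorch, which postdates this post; the layers and sizes are arbitrary, not any paper’s model): a single differentiable stack from raw input to output scores, trained with one loss.

```python
# A minimal end-to-end sketch: one network maps raw audio straight to
# output scores, and a single loss trains every "stage" jointly via
# backprop. Layer sizes are arbitrary; this is not any paper's model.
import torch
import torch.nn as nn

end_to_end = nn.Sequential(
    nn.Conv1d(1, 64, kernel_size=9, stride=2),    # plays the role of feature extraction
    nn.ReLU(),
    nn.Conv1d(64, 128, kernel_size=9, stride=2),  # plays the role of acoustic modeling
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),                      # pool over time
    nn.Flatten(),
    nn.Linear(128, 40),                           # scores over, e.g., phonemes or words
)

waveform = torch.randn(8, 1, 16000)  # a batch of fake one-second clips
scores = end_to_end(waveform)        # gradients flow through every stage
print(scores.shape)                  # torch.Size([8, 40])
```

Because the loss at the end back-propagates through every layer, the “feature extraction” is learned to serve the final task rather than being fixed up front.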

Students — Pick Computer Vision

The state of computer vision is strong. That was the clear takeaway from my week in Vegas. CVPR 2016 had a ton of speakers (with the usual split of unintelligible, not good, and reasonable), great papers, and crowded poster sessions. But most importantly, it had a crazy number of recruiters from companies. Computer vision has never been a hotter topic in academia and industry alike.

So if you’re a computer science undergraduate or graduate student, take some computer vision classes. And if your school doesn’t have an awesome class to take, watch the lectures from Stanford’s CS231n. The class is taught by Fei-Fei Li and her graduate students (Andrej “the Jiant” Karpathy and Justin Johnson).

People are Awesome

The human mind is an amazing thing. And I’m reminded of that every time I attend an academic conference. Yes, the people who do this cutting-edge research are brilliant — but that’s not what I’m talking about here. I’m talking about the fact that all of us can do some pretty amazing things — things that even the deepest neural nets can’t do.

In their paper on depth-based person identification, Haque et al. write: “A quick, partial view of a person is often sufficient for a human to recognize an individual. This remarkable ability has proven to be an elusive task for modern computer vision systems.” And in A Comparison of Human and Automated Face Verification Accuracy, Austin Blanton writes: “We examine the impact on performance when human recognizers are presented with varying amounts of imagery per subject, immutable attributes such as gender, and circumstantial attributes such as occlusion, illumination, and pose. Results indicate that humans greatly outperform state of the art automated face recognition algorithms.”

So while we aren’t all this awesome, take heart. Our AI overlords are still decades away!