Research, development and resources

Deep3D: Fully Automatic 2D-to-3D Video Conversion with Deep Convolutional Neural Networks, University of Washington, code here. 3D movies are growing in popularity (remember Avatar in 2009?), but they’re expensive to produce using either 3D cameras or manual 2D-to-3D conversion. To convert 2D to 3D automatically, one needs to infer a depth map for an image (i.e. how far each pixel is from the camera) so that a view for the opposing eye can be synthesized. Existing automated neural network-based pipelines require image-depth pairs for training, which are hard to procure. Here, the authors exploit the stereo-frame pairs that already exist in produced 3D movies to train a deep convolutional neural network to predict the novel view (right eye’s view) from the given view (left eye’s view) via an internally estimated soft (probabilistic) disparity map.
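The rendering step, producing a right-eye view from the left-eye image and a soft disparity map, can be sketched in NumPy. This is a hedged illustration of the selection idea rather than the authors' implementation; the function name is mine, and the wrap-around shift via `np.roll` is a simplification of proper border handling:

```python
import numpy as np

def render_right_view(left, disparity_probs):
    """Synthesize a right-eye view from a left-eye image using a soft
    (probabilistic) disparity map: each output pixel is an expectation
    over shifted copies of the input, weighted by disparity probability.

    left:            (H, W) grayscale image
    disparity_probs: (H, W, D) per-pixel distribution over D candidate
                     disparity shifts (sums to 1 along the last axis)
    """
    H, W, D = disparity_probs.shape
    right = np.zeros((H, W))
    for d in range(D):
        # Shift the left image by d pixels (np.roll wraps at the border,
        # a simplification) and weight by the probability of disparity d.
        shifted = np.roll(left, -d, axis=1)
        right += disparity_probs[:, :, d] * shifted
    return right
```

Because the disparity map is soft, this whole rendering step is differentiable, which is what lets the network train end-to-end on stereo pairs without explicit depth labels.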

“Why Should I Trust You?” Explaining the Predictions of Any Classifier, University of Washington. Code here. A key hurdle to the mass adoption of machine learning models in fault-intolerant commercial settings (e.g. finance, healthcare, security) is the ability to explain why certain predictions were made. Many models, especially neural networks, are today functionally black boxes, with trust in their performance resting on cross-validation accuracy. The authors present a model-agnostic algorithm that produces textual or visual artifacts, built on interpretable representations of the underlying data (not necessarily the model’s features), to give the user a qualitative understanding of what a given model bases its classification predictions on. This is very nifty work. Further explanation here.
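The core idea, fitting a simple interpretable surrogate to the black box's behaviour in the neighbourhood of a single instance, can be sketched as follows. This is a loose NumPy illustration, not the authors' released code; the function name, kernel and on/off perturbation scheme are assumptions for the sketch:

```python
import numpy as np

def explain_locally(predict_fn, x, num_samples=500, kernel_width=0.75, seed=0):
    """Approximate a black-box model around instance x with a weighted
    linear surrogate over on/off feature perturbations (LIME-style sketch).

    predict_fn: maps an (n_samples, n_features) array to a prediction each
    x:          1-D feature vector of the instance to explain
    Returns one surrogate weight per feature.
    """
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    # Binary masks over interpretable components: 1 keeps a feature,
    # 0 switches it off (zeroes it out).
    masks = rng.integers(0, 2, size=(num_samples, n))
    samples = masks * x                      # perturbed neighbours of x
    preds = predict_fn(samples)              # query the black box
    # Weight each sample by proximity to the original (all-ones mask).
    distances = np.sqrt(((masks - 1) ** 2).sum(axis=1)) / np.sqrt(n)
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)
    # Weighted least squares for the local linear surrogate.
    sw = np.sqrt(weights)[:, None]
    design = np.hstack([masks, np.ones((num_samples, 1))])  # + intercept
    coef, *_ = np.linalg.lstsq(design * sw, preds * sw.ravel(), rcond=None)
    return coef[:-1]                         # drop the intercept term
```

The returned weights indicate which interpretable components push the prediction up or down locally, which is the qualitative explanation the paper is after.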

Dynamic Memory Networks for Visual and Textual Question Answering, MetaMind. A year ago, the MetaMind team published the dynamic memory network, a neural network architecture that processes input sequences and questions, forms episodic memories, and generates relevant answers. In this work, the team introduces a new input module that handles images instead of text, so the network can answer natural language questions from its understanding of features in the image. Specifically, the input module splits an image into small local regions and treats each region as the equivalent of a sentence in the text input module.

The Curious Robot: Learning Visual Representations via Physical Interactions, Carnegie Mellon University. Learning visual representations in the real world with CNNs typically requires a large dataset of labeled image examples. This group instead explores whether a Baxter robotic arm can learn visual representations purely by performing four physical interactions: push, poke, grasp and active vision. They show that by experiencing 130k such interactions with household objects (e.g. cups, bowls, bottles) and using each data point for back-propagation through a CNN, the network learns generalised features that help it classify household object images on ImageNet without ever having seen a labeled image.
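MetaMind's region-as-sentence trick can be sketched as below. The shapes and projection matrix are hypothetical stand-ins (the paper derives region features from a pretrained CNN and feeds the resulting sequence to a recurrent layer); this only illustrates turning a 2D feature map into an ordered sequence of region embeddings:

```python
import numpy as np

def image_regions_as_sequence(feature_map, proj):
    """Treat each local region of a CNN feature map as a 'sentence':
    order the regions into a sequence and project them into the same
    embedding space the text input module uses.

    feature_map: (H, W, C) activations from a convolutional layer
    proj:        (C, d) projection into the embedding space
    Returns an (H*W, d) sequence of region embeddings, in a snake
    traversal so spatially adjacent regions stay adjacent in the sequence.
    """
    H, W, C = feature_map.shape
    # Reverse every other row to get a snake (boustrophedon) ordering.
    snaked = feature_map.copy()
    snaked[1::2] = snaked[1::2, ::-1]
    regions = snaked.reshape(H * W, C)
    return regions @ proj
```

The episodic memory module then attends over this sequence exactly as it would over sentences, which is what lets the same architecture answer questions about either text or images.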

Deep learning for chatbots, part 1 — Introduction. Given the excitement around chat interfaces and their ability to evolve user experiences for today’s generation of technophiles, here’s a piece that describes where the technology stands, what’s possible, and what will remain nearly impossible for at least a little while. Upcoming posts in the series will cover implementation details.