Perceptual tasks such as vision and audition require the construction of good features, or good internal representations of the input. Deep Learning designates a set of supervised and unsupervised methods to construct feature hierarchies automatically by training systems composed of multiple stages of trainable modules.The recent history of OCR, speech recognition, and image analysis indicates that deep learning systems yield higher accuracy than systems that rely on hand-crafted features or "shallow" architectures whenever more training data and more computational resources become available. Deep learning systems, particularly convolutional nets, hold the performances record in a wide variety of benchmarks and competition, including object recognition in image, semantic image labeling (2D and 3D), acoustic modeling for speech recognition, drug design, handwriting recognition, pedestrian detection, road sign recognition, etc. The most recent speech recognition and image analysis systems deployed by Google, IBM, Microsoft, Baidu, NEC and others all use deep learning and many use convolutional nets.While the practical successes of deep learning are numerous, so are the theoretical questions that surround it. What can circuit complexity theory tell us about deep architectures with their multiple sequential steps of computation, compared to, say, kernel machines with simple kernels that have only two steps? What can learning theory tell us about unsupervised feature learning? What can theory tell us about the properties of deep architectures composed of layers that expand the dimension of their input (e.g. like sparse coding), followed by layers that reduce it (e.g. like pooling)? What can theory tell us about the properties of the non-convex objective functions that arise in deep learning? Why is it that the best-performing deep learning systems happen to be ridiculously over-parameterized with regularization so aggressive that it borders on genocide?