Most recent feed-forward deep neural networks for artificial vision are trained with supervision on large collections of static images and their labels. These networks never see the time variable present in video streams, and are never exposed to the smooth transformations of scenes over time. As a result, when applied to video streams, standard feed-forward networks produce unstable outputs. This issue is a direct consequence of both their feed-forward architecture and their training framework. This project addresses both shortcomings by proposing a novel network model and two training schemes. Inspired by the human visual system, CortexNet provides robust visual temporal representations by adding top-down feedback and lateral connections to the bottom-up feed-forward connections, all of which are present in our visual cortex.

In the figure above we see (a) the full CortexNet architecture, which is made of several (b) discriminative and (c) generative blocks. The logits are a linear transformation of the embedding, which is obtained by (d) spatially averaging the output of the last discriminative block.
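To make the block structure concrete, here is a minimal PyTorch sketch of a two-level network in this style. All module names, channel counts, and layer choices are my assumptions for illustration, not the released CortexNet code, and the recurrent top-down feedback between consecutive video frames is omitted for brevity: the sketch only shows the bottom-up (discriminative) path, the top-down (generative) path with a lateral connection, and the spatially averaged embedding that feeds the linear logits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiscriminativeBlock(nn.Module):
    """Bottom-up block: a strided convolution halves the spatial resolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)))


class GenerativeBlock(nn.Module):
    """Top-down block: a transposed convolution doubles the spatial resolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, stride=2,
                                       padding=1, output_padding=1)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x, lateral=None):
        # Lateral connection: add the matching discriminative activation.
        if lateral is not None:
            x = x + lateral
        return F.relu(self.bn(self.conv(x)))


class TinyCortexNet(nn.Module):
    """Toy two-level sketch (hypothetical sizes, not the paper's model)."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.d1 = DiscriminativeBlock(3, 16)
        self.d2 = DiscriminativeBlock(16, 32)
        self.g2 = GenerativeBlock(32, 16)
        self.g1 = GenerativeBlock(16, 3)
        self.fc = nn.Linear(32, n_classes)  # logits = linear map of embedding

    def forward(self, x):
        h1 = self.d1(x)          # bottom-up pass
        h2 = self.d2(h1)
        # (d) embedding: spatial average of the last discriminative output
        emb = h2.mean(dim=(2, 3))
        logits = self.fc(emb)
        # top-down pass with a lateral link from the first block
        x_hat = self.g1(self.g2(h2), lateral=h1)
        return logits, x_hat, emb
```

Running a 32x32 RGB batch through this toy model yields per-class logits, a full-resolution generative output of the same shape as the input, and a 32-dimensional embedding.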

CortexNet can be trained in two ways, yielding either MatchNet or TempoNet. Details below.
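As a rough sketch of the two training signals, assuming the model emits per-frame logits and a generative prediction of the next frame (the specific loss functions below are my assumptions for illustration, not the exact recipes used by the project): one scheme matches the generative output against the actual next frame, the other supervises the logits with class labels over video frames.

```python
import torch
import torch.nn.functional as F


def matchnet_style_loss(x_hat, next_frame):
    # Unsupervised matching: penalize the distance between the generative
    # prediction and the actual next video frame (MSE assumed here).
    return F.mse_loss(x_hat, next_frame)


def temponet_style_loss(logits, label):
    # Supervised scheme: per-frame classification loss on the logits
    # (cross-entropy assumed here).
    return F.cross_entropy(logits, label)
```

Both losses return scalar tensors, so they can be combined or used separately with a standard optimizer loop.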