Team name (with project link where available) Team members Abstract

1-HKUST Cewu Lu (Hong Kong University of Science and Technology)

Hei Law* (Hong Kong University of Science and Technology)

Hao Chen* (The Chinese University of Hong Kong)

Qifeng Chen* (Stanford University)

Yao Xiao* (Hong Kong University of Science and Technology)

Chi Keung Tang (Hong Kong University of Science and Technology)

(* indicates equal contribution, listed alphabetically)





For the detection task, we first generate candidate bounding boxes, and then our system recognizes objects on these candidate proposals. We try to improve both localization and recognition. On the localization side, initial candidate proposals are generated by selective search [1], and a novel bounding box regression method is used for better object localization. On the recognition side, to represent a candidate proposal, we adopt a variety of features, including RCNN features [2], IFV features [3], and DPM features [4]. Given these features, category-specific combination functions are learnt to improve object recognition. Background priors and object interaction priors are also learnt and applied in our system. In addition, our framework involves some other novel techniques. The pertinent technical details for the submission are in preparation. In the ILSVRC2014 competition, we do not use any outside training data.





[1]Uijlings J R R, van de Sande K E A, Gevers T, et al. Selective search for object recognition[J]. International journal of computer vision, 2013, 104(2): 154-171.



[2]Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[J]. arXiv preprint arXiv:1311.2524, 2013.



[3]Perronnin F, Sánchez J, Mensink T. Improving the fisher kernel for large-scale image classification[M]//Computer Vision–ECCV 2010. Springer Berlin Heidelberg, 2010: 143-156.



[4]Felzenszwalb P, McAllester D, Ramanan D. A discriminatively trained, multiscale, deformable part model[C]//Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008: 1-8.


Adobe-UIUC Hailin Jin (Adobe)

Zhaowen Wang (UIUC)

Jianchao Yang (Adobe)

Zhe Lin (Adobe) Our algorithm is based on an integrated convolutional neural network framework for both classification and localization. We train several 6-layer convnets using 3000 ImageNet classes for classification and then adapt one model for bounding box regression. At test time, we use k-means to find bounding box clusters and rank the clusters according to the classification scores.

Andrew Howard Andrew Howard - Howard Vision Technologies Deep convolutional neural networks are very costly to train so my submission focuses on reusing networks through retraining and by using the same network to make multiple predictions.



I started with a deeper and wider Zeiler/Fergus net (ZF) [1]. The differences from the base ZF model are that I use 7 convolutional layers, with convolutional layers 3-7 having 512 filters. It took over 6 weeks to train on a GTX Titan using cuda-convnet [2]. This base model is trained using 224x224 crops from the full 256xN image [3] with random horizontal flips [4]. Each training crop is further perturbed with color channel noise [4] and random variation in photometric properties (lighting, contrast, color) [3]. This base model is then adapted to build a high-resolution [3] and a low-resolution model. The high-resolution model is retrained on 224x224 crops from a 448xN sized image with random variation in size (448 ± 10%) and no dropout, due to the large number of training crops available. The low-resolution model embeds the entire image, resized to 150xN, at a random location in the 224x224 crop for retraining. I also retrain the base model to increase the size of the fully connected layers beyond what would fit in GPU memory if the model were trained together (the fully connected layers are grown from 4096x4096 to 12288x12288 and trained from scratch while the convolutional layers are held fixed). When the new fully connected layers are retrained, I use a slow form of Polyak averaging which averages the model parameters after each epoch rather than after each iteration. Each retrained model takes roughly 1/3 of the time that training a model from scratch would.
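The epoch-wise Polyak averaging described above can be sketched as follows; this is a minimal illustration with parameters as flat lists of floats (the per-epoch snapshots would come from the actual training loop, which is not shown):

```python
def polyak_average_epochs(snapshots):
    """Average model parameters across per-epoch snapshots, rather than
    across individual SGD iterations."""
    n = len(snapshots)
    dim = len(snapshots[0])
    return [sum(s[i] for s in snapshots) / n for i in range(dim)]

# Three epoch-end snapshots of a 2-parameter "model".
snapshots = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
averaged = polyak_average_epochs(snapshots)
```

Averaging snapshots taken once per epoch is much cheaper than updating a running average after every iteration, at the cost of a coarser average.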



At test time, predictions are made at 6 resolutions, each roughly 30% larger than the next smaller size. Each of the 3 models is responsible for 2 resolutions. The base-resolution model acts on images scaled to 256xN and 340xN, the high-resolution model acts on 448xN and 576xN, and the low-resolution model acts on 150xN and 200xN. Each resolution uses locations selected on a dense spatial grid over the entire image, similar to [5]. Predictions at each spatial location are averaged into a prediction for a given resolution, and then predictions at each resolution are combined evenly.



I further build a KNN model on the validation set, as suggested by the NUS team last year [6]. For features, I use the final 1000-dimensional aggregate predictions. I use leave-one-out cross-validation on the validation set to choose K (the number of neighbors) and the weighting between the final neural network prediction and the KNN prediction.
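A sketch of the leave-one-out selection of K, assuming Euclidean distance on the prediction vectors and majority voting (neither is stated in the submission); the toy data here is 2-dimensional rather than 1000-dimensional:

```python
def knn_predict(features, labels, query_idx, k):
    """Leave-one-out KNN: predict validation item query_idx from its k
    nearest neighbours among the remaining validation items."""
    dists = []
    for i, f in enumerate(features):
        if i == query_idx:
            continue  # leave the query itself out
        d = sum((a - b) ** 2 for a, b in zip(f, features[query_idx]))
        dists.append((d, labels[i]))
    dists.sort()
    votes = {}
    for _, lab in dists[:k]:
        votes[lab] = votes.get(lab, 0) + 1
    return max(votes, key=votes.get)

def choose_k(features, labels, candidate_ks):
    """Pick K by leave-one-out accuracy over the whole validation set."""
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        correct = sum(knn_predict(features, labels, i, k) == labels[i]
                      for i in range(len(features)))
        acc = correct / len(features)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k

# Toy example with 2-d "prediction vectors" instead of 1000-d ones.
feats = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
labs = [0, 0, 1, 1]
best_k = choose_k(feats, labs, [1, 3])
```

The weighting between the network and KNN predictions would be chosen by the same leave-one-out procedure over a grid of blend weights.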



Finally, I adapt the neural networks to the validation set distribution, as suggested by the NUS team last year [6]. To do this, I hold the convolutional layers fixed and adapt the fully connected layers to the validation set. Each neural network model is adapted on a different random 80% subset of the validation set, with early stopping based on the remaining 20%.



The final submission is made up of 2 sets of 3 networks plus 1 KNN prediction. The second set of networks is a smaller, earlier version and adds only a little value.



[1] M.D. Zeiler, R. Fergus, "Visualizing and Understanding Convolutional Networks." ECCV 2014.



[2] https://code.google.com/p/cuda-convnet



[3] A.G. Howard, "Some Improvements on Deep Convolutional Neural Network Based Image Classification." ICLR 2014.



[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks." NIPS 2012.



[5] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "Overfeat: Integrated recognition, localization and detection using convolutional networks." ICLR 2014.



[6] M. Lin, Q. Chen, J. Dong, J. Huang, W. Xia, "Adaptive Non-parametric Rectification of Shallow and Deep Experts." ILSVRC 2013.

BDC-I2R,UPMC Big Deep Computing Team



Olivier Morère (1,2),

Hanlin Goh (1),

Antoine Veillard (2),

Vijay Chandrasekhar (1)



1: Institute for Infocomm Research, Singapore

2: Université Pierre et Marie Curie, Paris, France Multiple deep convolutional neural networks (CNN) [Krizhevsky et al. 2012], each trained with a different set of parameters. The deep representations are extracted across multiple scales and positions within an image. Model fusion is adaptively performed within each CNN model, and subsequently across the different models. Class distribution priors are used to rectify the outputs of the model. The CNN features are extracted across a GPU cluster, while a CPU cluster is used to optimize parameters in a MapReduce framework.



We submit three runs for the classification-only task. No external data was used in our models.

Run 1: A single CNN model.

Run 2: Adaptive fusion of multiple CNN models.

Run 3: Adaptive fusion of multiple CNN models with output rectification.

Berkeley Vision Ross Girshick, UC Berkeley

Jeff Donahue, UC Berkeley

Sergio Guadarrama, UC Berkeley

Trevor Darrell, UC Berkeley

Jitendra Malik, UC Berkeley Our detection entry is a baseline for R-CNN [1] on the expanded ILSVRC 2014 detection dataset. We followed the approach for training on ILSVRC 2013 detection described in the R-CNN tech report [2], but with two small changes.



1) We used the additional training annotations for the 2014 detection dataset.



2) We used a slightly larger convolutional neural network than in [1, 2]. In this network, convolutional layers one through five have 96, 384, 512, 512, and 384 filters, respectively. The two fully connected layers (before the linear classifiers) both have 4096 output units. This network was pre-trained on the ILSVRC 2013 CLS dataset before fine-tuning for detection.



We performed control experiments to compare these changes to the results in [2]. On the val2 validation set (see [2]), the new training data added for 2014 improved results from 29.7% to 31.2% mAP, using the same CNN as in [2] in both cases. Using the slightly larger CNN improved results on val2 to 32.1%. Bounding-box regression further increased this to 33.4% (compared to 31.0% in [2]).



[1] Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.



[2] Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. Technical report. http://arxiv.org/abs/1311.2524v4.

BREIL_KAIST KAIST department of EE



Jun-Cheol Park, Yunhun Jang, Hyungwon Choi, JaeYoung Jun Our team trained a deep convolutional neural network with an architecture similar to the one introduced in [1]. The overall training details are based on [2]. We used Caffe [3] as our development environment. For localization, we computed image-specific class saliency as in [4].



[1] Chatfield, Ken, et al. "Return of the Devil in the Details: Delving Deep into Convolutional Nets." arXiv preprint arXiv:1405.3531 (2014).

[2] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.

[3] Jia, Yangqing. "Caffe: An open source convolutional architecture for fast feature embedding." http://caffe.berkeleyvision.org (2013).

[4] Simonyan, Karen, Andrea Vedaldi, and Andrew Zisserman. "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps." arXiv preprint arXiv:1312.6034 (2013).

Brno University of Technology Martin Kolář, Michal Hradiš, Pavel Svoboda Our method is based on calculating the weighted average of multiple architectures of standard Convolutional Neural Networks (Krizhevsky et al. 2012) on randomly transformed images (color and geometry). Results were optimised using textual associations between synsets (Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.). We used code based on Caffe by Yangqing Jia on the IT4I computing cluster, and trained 17 CNNs on Kepler K20 GPUs.

CASIA_CRIPAC_2 Peihao Huang, Institute of Automation, Chinese Academy of Sciences

Yongzhen Huang, Institute of Automation, Chinese Academy of Sciences

Feng Liu, School of Automation, Southeast University

Zifeng Wu, Institute of Automation, Chinese Academy of Sciences

Fang Zhao, Institute of Automation, Chinese Academy of Sciences

Liang Wang, Institute of Automation, Chinese Academy of Sciences

Tieniu Tan, Institute of Automation, Chinese Academy of Sciences

Our method is mainly based on the R-CNN framework for object detection. However, the object proposals differ from those used in R-CNN, as follows.

(1) We train a part classification model using a CNN to judge whether a proposal (obtained by the selective search algorithm) belongs to an object.

(2) We train an object regression model using a CNN to estimate the location and size of the object from a part.

(3) For each image, we use the K-means algorithm to cluster the locations and sizes estimated in (2).

(4) We choose the proposals closest to the cluster centers.
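Steps (3) and (4) above can be sketched as follows; this is a minimal pure-Python k-means over (cx, cy, w, h) box vectors, with the distance metric, K, and number of kept proposals per cluster all being illustrative assumptions:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means over box vectors such as (cx, cy, w, h)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each box to its nearest center
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):  # recompute centers as cluster means
            if cl:
                centers[j] = [sum(v) / len(cl) for v in zip(*cl)]
    return centers

def pick_proposals(proposals, centers, per_center=1):
    """Step (4): keep the proposals closest to each cluster center."""
    kept = []
    for c in centers:
        ranked = sorted(proposals,
                        key=lambda p: sum((a - b) ** 2 for a, b in zip(p, c)))
        kept.extend(ranked[:per_center])
    return kept

# Toy example: two tight groups of regressed boxes.
boxes = [[0, 0, 10, 10], [1, 0, 10, 10], [50, 50, 20, 20], [51, 50, 20, 20]]
centers = kmeans(boxes, 2)
kept = pick_proposals(boxes, centers)
```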



Another difference is that, to obtain the pre-trained CNN model, we train on images of the 200 detection categories (dataset 1) rather than the 1000 classification categories (dataset 2).



CASIA_CRIPAC_Weak_Supervision Weiqiang Ren, CRIPAC, CASIA

Chong Wang, CRIPAC, CASIA

Yanhua Cheng, CRIPAC, CASIA

Kaiqi Huang, CRIPAC, CASIA

Tieniu Tan, CRIPAC, CASIA We use weakly supervised object localization, learned from classification labels only, to enhance the classification task. First, an MCG proposal model pre-trained on PASCAL VOC 2012 is used to extract region proposals, and each region proposal is represented using pre-trained convolutional networks.

Then, a multiple-instance learning strategy is adopted to learn object detectors with weak supervision. Using the learned object detectors, we are able to learn object classifiers, instead of global image classifiers, using a multi-class softmax model. Finally, the detection models and classification models are fused to produce the final classification results.



Cldi-KAIST Kyunghyun Paeng (KAIST), Donggeun Yoo (KAIST), Sunggyun Park (KAIST), Jungin Lee (Cldi Inc.), Anthony S. Paek (Cldi Inc.), In So Kweon (KAIST), Seong Dae Kim (KAIST) Our submission is based on a combination of two methodologies – the Deep Convolutional Neural Network (DCNN) framework [1] as a global expert and the DCNN-based Fisher framework as a local expert. Simple reweighting techniques are used as well. Our localization method is a bounding box regression.



To train the global expert, we used 10 networks under different settings: various preprocessing methods and/or different network architectures. We selected the ensemble of networks that achieved the best accuracy on the validation set.



Our local expert is trained using local features composed of DCNN responses from mid-layers. We encoded the local features into Fisher vectors [2] and trained SVM classifiers. To prevent overfitting, we trained our network using 0.9 million of the training images, and used the remaining 0.3 million for Fisher encoding and SVM training.



[1] Krizhevsky, Alex and Sutskever, Ilya and Hinton, Geoff, "Imagenet classification with deep convolutional neural networks." NIPS 2012.



[2] Perronnin, Florent, Jorge Sánchez, and Thomas Mensink. "Improving the fisher kernel for large-scale image classification." ECCV 2010.







CUHK DeepID-Net Wanli Ouyang, Ping Luo, Xingyu Zeng, Shi Qiu, Yonglong Tian, Hongsheng Li, Shuo Yang, Zhe Wang, Yuanjun Xiong, Chen Qian, Zhenyao Zhu, Ruohui Wang, Chen-Change Loy, Xiaogang Wang, Xiaoou Tang



Multimedia Laboratory, The Chinese University of Hong Kong

The work uses the ImageNet classification training set (1000 classes) to pre-train features, and fine-tunes them on the ImageNet detection training set (200 classes). This detection work is based on a deep CNN with proposed new deformation layers, a feature pre-training strategy, sub-region pooling, and model combination.

The effectiveness of learning deformation models of object parts has been proven in object detection by many existing non-deep-learning detectors, e.g. [a]. However, it is missing from current deep learning models. In deep CNN models, max pooling and average pooling are useful in handling deformation but cannot learn the deformation penalty and geometric model of object parts. We design a deformation layer for deep models so that the deformation penalty of objects can be learned. The deformation layer was first proposed in our recently published work [b], which showed significant improvement in pedestrian detection. In this submission, we extend it to general object detection on ImageNet. In [b], the deformation layer was applied only to a single level corresponding to body parts, while in this work it is applied to every convolutional layer to capture geometric deformation at all levels. In [b], it was assumed that a pedestrian has only one instance of each body part, so each part filter has only one optimal response in a detection window. In this work, an object is assumed to have multiple instances of a part (e.g. a building has many windows), so each part filter is allowed to have multiple response peaks in a detection window. This new model is more suitable for general object detection.



The whole detection pipeline is much more complex than [b]. In addition to the above improvements, we added several new components to the pipeline: feature pre-training on the ImageNet classification dataset (the objective is image classification), feature fine-tuning on the ImageNet detection dataset (the objective is object detection), a proposed new sub-region pooling step, contextual modeling (which uses the whole-image prediction scores over 1000 classes as contextual features to combine with deep CNN features extracted from a detection window), and SVM classification using the extracted features. We also adopted bounding box regression [c].



A new sub-region pooling strategy is proposed. It divides the detection window into sub-regions, and applies max-pooling or average pooling across feature vectors extracted from different sub-regions. It improves the performance and also increases the model diversity.



Different from the state-of-the-art deep learning detection framework [c], which pre-trains the net on ImageNet classification data (1000 classes), we propose a new strategy for pre-training on the same data, such that the pre-trained features are much more effective on the detection task, with better discriminative power for object localization.



By changing the configuration of each component of the detection pipeline, multiple models with large diversity are generated. Multiple models are automatically selected and combined to generate the final detection result.

We have submitted the results of five different approaches. The first two results report the best performance achievable with a single model; they differ in whether contextual features from image classification are used. The remaining three results report the best performance achievable with model combination; they differ in whether contextual modeling is used and whether the ImageNet validation 2 set is used as part of training.





[a] P. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Trans. PAMI, 32:1627–1645, 2010.



[b] Wanli Ouyang, Xiaogang Wang, "Joint Deep Learning for Pedestrian Detection ", In Proc. IEEE ICCV 2013.



[c] R. Girshick, J. Donahue, T. Darrell, J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation", CVPR 2014.

CUHK DeepID-Net2 Wanli Ouyang, Xingyu Zeng, Shi Qiu, Ping Luo, Yonglong Tian, Hongsheng Li, Shuo Yang, Zhe Wang, Yuanjun Xiong, Chen Qian, Zhenyao Zhu, Ruohui Wang, Chen-Change Loy, Xiaogang Wang, Xiaoou Tang



Multimedia Laboratory, The Chinese University of Hong Kong The work uses the ImageNet classification training set (1000 classes) to pre-train features, and fine-tunes them on the ImageNet detection training set (200 classes). This detection work is based on a multi-stage deep CNN and model combination. Multi-stage classifiers have been widely used in object detection and have achieved great success. With a cascaded structure, each classifier processes a different subset of data. However, these classifiers are usually trained sequentially, without joint optimization. In this submission, we propose a new deep architecture that can jointly train multiple classifiers through several stages of back-propagation. Each stage handles samples at a different difficulty level: the first stage of the deep CNN handles easy samples, the second stage processes more difficult samples that cannot be handled by the first stage, and so on. Through a specific design of the training strategy, this deep architecture is able to simulate cascaded classifiers by mining hard samples to train the network stage by stage. The group of classifiers in the deep model chooses training samples stage by stage. The training is split into several back-propagation (BP) stages. Due to the design of our training procedure, the gradients of classifier parameters at the current stage are mainly influenced by the samples misclassified by the classifiers at the previous stages. At each BP stage, the whole deep model has been initialized with a good starting point learned at the previous stage, and the additional classifiers focus on the misclassified hard samples. Direct back-propagation on the multi-stage deep CNN easily leads to overfitting, so we design stage-wise supervised training to regularize the optimization problem. At each BP stage, classifiers at the previous stages jointly work with the classifier at the current stage in dealing with misclassified samples.

Existing cascaded classifiers pass only a single score to the next stage, while our deep model keeps the score map within a local region, where it serves as contextual information to support the decision at the next stage. Our recent work [1] explored the idea of multi-stage deep learning, but it was applied only to pedestrian detection. In this submission, we apply it to general object detection on ImageNet.



The detection pipeline is much more complex than [1]. It includes feature pre-training, multi-stage deep CNN fine-tuning, sub-region pooling, contextual modeling, SVM classification, and bounding box regression. The state-of-the-art deep learning object detection framework in [2] pre-trains the net on ImageNet classification data (1000 classes) and then fine-tunes on ImageNet detection data (200 classes). We propose a new strategy for pre-training on the ImageNet classification data (1000 classes), such that the pre-trained features are much more effective on the detection task, with better discriminative power for object localization. A new sub-region pooling strategy is proposed: it divides the detection window into sub-regions and applies max-pooling or average pooling across feature vectors extracted from different sub-regions. Contextual modeling uses the whole-image prediction scores over 1000 classes as contextual features, combined with deep CNN features extracted from a detection window.



By changing the configuration of each step, we can generate multiple deep models. For example, features can be pre-trained with AlexNet or the Clarifai net. With the extracted features, bounding boxes can be classified with fully connected networks with hinge loss or with an SVM, with or without sub-region pooling. Therefore, different models can be generated. The top N models with the highest accuracies are combined by averaging. The work uses the ImageNet classification training set (1000 classes) to pre-train features and fine-tunes them on the ImageNet detection training set (200 classes). No other training data is used.



[1] Xingyu Zeng, Wanli Ouyang, Xiaogang Wang, "Multi-Stage Contextual Deep Learning for Pedestrian Detection ", In Proc. IEEE ICCV 2013.



[2] Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation", In Proc. CVPR, 2014.

Deep Insight Junjie Yan (NLPR)

Naiyan Wang (HKUST)

Stan Z. Li (NLPR)

Dit-Yan Yeung (HKUST)



We use region proposals, CNN features, and an SVM classifier for object detection (similar to the R-CNN framework). In our entry, we use selective search and structured edges to generate around 4000 object proposals for each image. The features of each object proposal are extracted from three CNNs, which are trained on the classification task and fine-tuned on the detection task. The three CNNs differ in the depth of their convolutional layers; deeper models consistently achieve better results on the validation set. Bounding box regression uses the output of the final layer as input to refine the result. For context, we train 200 binary classifiers on the detection data and use them to re-score the detections.

DeepCNet Ben Graham-University of Warwick We trained a deep convolutional network with the architecture



(input=768x768x3)-200C3-MP2-400C2-MP2-600C2-MP2-800C2-MP2-1000C2-MP2-1750C2-MP2-2500C2-MP2-3250C2-MP2-4000C2-(output=1000N softmax layer)



The architecture is inspired by the paper (Ciresan, et al. Multi-column deep neural networks for image classification, 2012).

Input images are scaled to have approximately 2^16 pixels, maintaining aspect ratio, and placed in the centre of the input field.

Sparsity is used to accelerate the training process (Graham, Sparse arrays of signatures for online character recognition http://arxiv.org/abs/1308.0371, 2013).

For training, affine transformations are used. For testing, each image is fed forward through the network only once.
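The input rescaling rule above (approximately 2^16 pixels with aspect ratio preserved) amounts to scaling both sides by sqrt(target_pixels / current_pixels):

```python
def target_size(w, h, total=2 ** 16):
    """Scale (w, h) so the image has approximately `total` pixels,
    preserving the aspect ratio."""
    scale = (total / (w * h)) ** 0.5
    return round(w * scale), round(h * scale)

# A 640x480 image is rescaled to roughly 296x222 (~2^16 pixels).
new_w, new_h = target_size(640, 480)
```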



Regarding Q3 in the FAQ "Do teams have to submit both classification and localization results in order to participate in Task 2?"

Due to lack of time, I have not attempted the localization part of the challenge, but I hope to work on that in future.



Thank you to all the organisers.



DeeperVision DeeperVision We use a very deep convolutional neural network consisting of 10+ layers in the competition. To fully optimize such a deep model, we adopt a Nesterov-based optimization method, which is shown to be superior to conventional SGD. We also exploit more advanced data augmentation techniques, such as varying the resolution, lightness, and contrast. For the model ensemble, we directly use discrete optimization to optimize the top-5 error rate.

Fengjun Lv Fengjun Lv - Fengjun Lv Consulting We followed the approach of Krizhevsky et al. in their NIPS 2012 paper, but with a different pre-processing step. For non-square images, instead of using a central crop (which in many cases does not contain the object of interest at all, or contains it only partially), we apply Graph-Based Visual Saliency (Harel et al., NIPS 2006) to the original image (both in training and testing) and use an integral image to find the square crop that maximizes the visual saliency. One of the two submissions is from a single CNN; the other combines multiple CNNs.
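The integral-image search for the maximally salient square crop can be sketched as below; the GBVS saliency computation itself is not shown, and the saliency map is assumed to be a plain 2D grid of non-negative values:

```python
def integral_image(sal):
    """ii[y][x] holds the sum of sal over the rectangle [0, y) x [0, x)."""
    h, w = len(sal), len(sal[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = sal[y][x] + ii[y][x + 1] + ii[y + 1][x] - ii[y][x]
    return ii

def best_square_crop(sal):
    """Slide a square of side min(h, w) over the saliency map and return
    the (x, y) offset with maximal saliency sum, O(1) per position."""
    h, w = len(sal), len(sal[0])
    s = min(h, w)
    ii = integral_image(sal)
    best, best_xy = -1.0, (0, 0)
    for y in range(h - s + 1):
        for x in range(w - s + 1):
            total = ii[y + s][x + s] - ii[y][x + s] - ii[y + s][x] + ii[y][x]
            if total > best:
                best, best_xy = total, (x, y)
    return best_xy

# A wide map whose saliency mass sits on the right: crop should shift right.
sal = [[0, 0, 0, 1, 1],
       [0, 0, 0, 1, 1],
       [0, 0, 0, 1, 1]]
crop_xy = best_square_crop(sal)
```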



GoogLeNet Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Drago Anguelov, Dumitru Erhan, Andrew Rabinovich We explore an improved convolutional neural network architecture which combines the multi-scale idea with intuitions gained from the Hebbian principle. Additional dimension-reduction layers, based on intuitions from embedding learning, allow us to increase both the depth and the width of the network significantly without incurring significant computational overhead. Combining these ideas allows the number of parameters in the convolutional layers to be increased significantly while cutting the total number of parameters, resulting in improved generalization. Various incarnations of this architecture are trained for and applied at various scales, and the resulting scores are averaged for each image.

lffall Feng Liu, Southeast University, China This track is just for testing some off-the-shelf algorithms to provide a baseline for our subsequent research. In particular, we want to compare the results of different algorithms that produce region proposals, and to find out which factor most influences the subsequent classification.

DET entry 1 is our reproduction of the R-CNN [1] algorithm, trained on the val + train1k set, with region proposals provided by selective search [2].

[1] Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." arXiv preprint arXiv:1311.2524 (2013).

[2] Van de Sande, Koen EA, et al. "Segmentation as selective search for object recognition." Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.



libccv Liu Liu, libccv.org Open-source implementation of MattNet (Visualizing and Understanding Convolutional Networks, Matthew D. Zeiler, and Rob Fergus) trained with 1 convnet, detailed in: http://libccv.org/doc/doc-convnet/

MIL Senthil Purushwalkam (The Univ. of Tokyo[intern] and IIT Guwahati)

Yuichiro Tsuchiya (The Univ. of Tokyo)

Atsushi Kanehira (The Univ. of Tokyo)

Asako Kanezaki (The Univ. of Tokyo)

Tatsuya Harada (The Univ. of Tokyo) Classification-Localisation Task



We combine two models: one based on Fisher vectors extracted from two feature descriptors, and the other using a special classifier trained on CNN features extracted from selective search boxes.

For the Fisher-based model [1], Fisher vectors were extracted using local feature descriptors. Linear classifiers were trained on these Fisher vectors using the averaged passive-aggressive algorithm.

For the CNN-based model, CNN features were extracted on selective search windows. The classifier was trained using [2], which trains a multiclass classifier by creating 'negative classes' for each class. This optimises the separation between positive and negative features while simultaneously optimising the separation between classes.





Detection Task:

We use RCNN [3] as the base detector. We train separate Fisher-based classifiers for each class using the passive-aggressive algorithm. The scores from these classifiers for each image are collected and used to re-score the detections.





1) N. Gunji, T. Higuchi, K. Yasumoto, H. Muraoka, Y. Ushiku, T. Harada, and Y. Kuniyoshi. Scalable Multiclass Object Categorization with Fisher Based Features. ILSVRC2012, 2012.



2) Asako Kanezaki, Sho Inaba, Yoshitaka Ushiku, Yuya Yamashita, Hiroshi Muraoka, Yasuo Kuniyoshi, and Tatsuya Harada. Hard Negative Classes for Multiple Object Detection. 2014 IEEE International Conference on Robotics and Automation (ICRA 2014), pp.3066-3073, 2014.



3) Girshick, Ross, Jeff Donahue, Trevor Darrell, and Jitendra Malik. "Rich feature hierarchies for accurate object detection and semantic segmentation." arXiv preprint arXiv:1311.2524 (2013).

MPG_UT Riku Togashi (The University of Tokyo)

Keita Iwamoto (The University of Tokyo)

Tomoaki Iwase (The University of Tokyo)

Hideki Nakayama (The University of Tokyo)

In this challenge, we focused on integrating object region proposals obtained from different methods to use as inputs to the RCNN system [1]. Namely, we used objectness (OB) [2], selective search (SS) [3], and bounding box transfer (TR) [4]. We used the public code of RCNN, OB, and SS (bundled with RCNN). To implement TR, we extracted 4096-dimensional global CNN features with Caffe [5] and retrieved the nearest training samples in terms of L2 distance.

We computed 500 to 1000 windows for each object region proposal method and then pooled them together for RCNN. Using the pre-trained CNN and SVM models provided with the RCNN software, we computed scores for each proposal and ran non-maximum suppression (without distinguishing proposal methods) to determine the final predictions. We did not perform bounding box regression (refinement) as the original RCNN paper does.
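The non-maximum suppression step over the pooled proposals can be sketched as follows; the greedy algorithm and the 0.3 IoU threshold are common defaults, not values stated by the team:

```python
def iou(a, b):
    """Intersection-over-union of boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.3):
    """Greedy NMS: keep the highest-scoring box, drop any box overlapping
    a kept box by more than `thresh`, and repeat."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

# Two heavily overlapping boxes plus one separate box.
kept = nms([(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)], [0.9, 0.8, 0.7])
```

Because the boxes from OB, SS, and TR are pooled into one list before this step, suppression happens across proposal methods, matching the "without distinguishing proposal methods" remark.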



We observed that combining different object proposal methods worked better than simply computing more proposals with one method. In particular, the TR method greatly improved performance over the original RCNN (based on SS), probably because TR can implicitly exploit global dataset statistics and is conceptually very different from OB and SS.





[1] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, In Proc. IEEE CVPR, 2014.



[2] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari , Measuring the objectness of image windows, IEEE Trans. PAMI, vol. 34, no. 11, pp. 2189-2202, 2012.



[3] Jasper R. R. Uijlings, Koen E. A. van de Sande, Theo Gevers, and Arnold W. M. Smeulders, Selective Search for Object Recognition, International Journal of Computer Vision, Volume 104 (2), page 154-171, 2013.



[4] Jose A. Rodriguez-Serrano and Diane Larlus, Predicting an Object Location using a Global Image Representation, In Proc. IEEE ICCV, 2013.



[5] Yangqing Jia, Caffe: An Open Source Convolutional Architecture for Fast Feature Embedding, 2013.



MSRA Visual Computing Kaiming He (Microsoft Research)

Xiangyu Zhang (Xi'an Jiaotong University)

Shaoqing Ren (University of Science and Technology of China)

Jian Sun (Microsoft Research) Our CLS and DET methods are both based on the SPP-net in our ECCV 2014 paper “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition”. SPP (SPM) is a flexible solution for handling image scales/sizes, and is also robust to deformations. The usage of the SPP layer is independent of the CNN designs, and we show that SPP improves the classification accuracy of various CNNs, regardless of the network depth, width, strides, and other designs.



The SPP-net is also a fast and accurate solution to object detection. We compute the convolutional feature maps from the images only once, and use SPP to pool features from arbitrary proposal windows for training SVM detectors. Our method is tens of times faster than R-CNN. Our network is pre-trained only using the DET-200 data (without outside data such as CLS-1000). A few strategies are proposed to improve the pre-training, driven by the different statistical properties of the DET-200 set.
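
The key pooling operation can be illustrated with a minimal NumPy sketch (our own illustration, not the authors' implementation): a fixed pyramid of grids is max-pooled over a convolutional feature map of arbitrary spatial size, always producing the same output length.

```python
import numpy as np

def spp_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a C x H x W conv feature map into a fixed-length vector,
    independent of H and W, using a pyramid of pooling grids."""
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:                                        # an n x n grid per level
        for i in range(n):
            for j in range(n):
                y0, y1 = (i * h) // n, -(-((i + 1) * h) // n)   # floor / ceil bin edges
                x0, x1 = (j * w) // n, -(-((j + 1) * w) // n)
                pooled.append(feature_map[:, y0:y1, x0:x1].max(axis=(1, 2)))
    return np.concatenate(pooled)                           # length = C * sum(n * n)
```

Since the output length does not depend on the window size, features can be pooled from arbitrary proposal windows on feature maps computed once per image, which is the source of the speed-up over R-CNN.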



The algorithm details are described in our ECCV paper. An extended technical report will follow. The code will be released.



NUS Jian DONG(1), Yunchao WEI(1), Min LIN(1), Qiang CHEN(2), Wei XIA(1), Shuicheng YAN(1)



(1) National University of Singapore

(2) IBM Research, Australia There are four major components for improving detection performance:



Network In Network (NIN) [Key Contribution]:

We trained an NIN, a special modification of a CNN [1] with 14 parameterized layers. NIN uses a shared multilayer perceptron as the convolution kernel to convolve the underlying input; the resulting structure is equivalent to adding cascaded cross-channel parametric (CCCP) pooling on top of a convolutional layer. Adding CCCP layers significantly improves performance compared with vanilla convolution.
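
The equivalence between CCCP pooling and 1x1 convolution can be sketched as follows (an illustrative NumPy example with made-up layer sizes, not the team's network):

```python
import numpy as np

def conv1x1(x, w, b):
    """Cross-channel parametric (CCCP) pooling: a 1x1 convolution that mixes
    channels at each spatial position. x: (C_in, H, W), w: (C_out, C_in)."""
    c_out = w.shape[0]
    out = np.tensordot(w, x, axes=([1], [0])) + b.reshape(c_out, 1, 1)
    return np.maximum(out, 0)                         # ReLU

# An mlpconv layer = an ordinary convolution followed by cascaded CCCP layers;
# `feat` stands in for an already-computed conv response (sizes are arbitrary).
rng = np.random.default_rng(0)
feat = rng.standard_normal((96, 14, 14))              # conv layer output
w1, b1 = rng.standard_normal((64, 96)), np.zeros(64)
w2, b2 = rng.standard_normal((48, 64)), np.zeros(48)
out = conv1x1(conv1x1(feat, w1, b1), w2, b2)          # two cascaded CCCP layers
```

Each CCCP layer re-weights channels at every position, acting as a small shared perceptron slid across the feature map.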



Augmented training and testing sample:

This improvement was first described by Andrew Howard [Andrew 2014]. Instead of resizing and cropping the image to 256x256, the image is proportionally resized to 256xN (or Nx256) with the short edge set to 256. Subcrops of 224x224 are then randomly extracted for training.
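
A minimal sketch of this augmentation (our own illustration; nearest-neighbor resizing is used for brevity, whereas real pipelines use bilinear or cubic interpolation):

```python
import numpy as np

def resize_short_side(img, target=256):
    """Proportionally resize an (H, W, 3) image so the short edge equals `target`
    (nearest-neighbor for brevity)."""
    h, w = img.shape[:2]
    if h < w:
        nh, nw = target, max(target, round(w * target / h))
    else:
        nh, nw = max(target, round(h * target / w)), target
    ys = (np.arange(nh) * h / nh).astype(int)
    xs = (np.arange(nw) * w / nw).astype(int)
    return img[ys][:, xs]

def random_crop(img, size=224, rng=np.random.default_rng()):
    """Extract a random size x size training subcrop."""
    h, w = img.shape[:2]
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    return img[y:y + size, x:x + size]
```

Compared with a fixed 256x256 resize, the proportional resize preserves the aspect ratio, so random 224x224 subcrops see undistorted object shapes.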



Traditional framework with SVM:

The traditional classification framework can provide complementary information, such as scene-level information, to the CNN. Hence, we integrate the outputs from the traditional framework (based on our PASCAL VOC2012 winning solutions, with the new extension of high-order parametric coding, in which the first- and second-order parameters of the adapted GMM for each instance are both considered) to further improve performance.



Kernel regression for rescoring:

Finally, we employ a non-parametric rectification method to correct the outputs from multiple models and obtain a more accurate prediction. For each sample in the training and validation sets, we have a pair consisting of the outputs from multiple models and the ground-truth label. For a test sample, regularized kernel regression determines the affinities between the test sample and its automatically selected training/validation samples; these affinities are then used to fuse the ground-truth labels of the selected samples into a rectified prediction.
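
The rescoring idea can be sketched as follows (a simplified illustration; the neighbour count `k`, the bandwidth `gamma`, and the RBF affinity are our assumptions, not the team's exact formulation):

```python
import numpy as np

def kernel_rectify(test_scores, train_scores, train_labels, k=5, gamma=1.0):
    """Rectify a test sample's multi-model score vector by kernel regression
    on its nearest training/validation samples."""
    d2 = ((train_scores - test_scores) ** 2).sum(axis=1)  # squared L2 distances
    nn = np.argsort(d2)[:k]                               # auto-select k neighbours
    w = np.exp(-gamma * d2[nn])                           # RBF affinities
    w /= w.sum()
    return w @ train_labels[nn]                           # fused ground-truth labels
```

The prediction for the test sample is thus pulled toward the ground-truth labels of training samples whose multi-model outputs look similar, correcting systematic model errors.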



Detection (Task 1) ------

The basic method is based on Ross Girshick's RCNN framework. We employ Network in Network as the feature extractor to improve the model discriminative capability. Features from multiple NINs are concatenated for both model training and bounding box regression. Raw detection scores are calculated based on the features from the refined bounding boxes.

To integrate the global context information beyond the information within the target bounding box, we concatenate all the raw detection scores and then combine them with the outputs from the traditional classification framework by context refinement [2]. Finally, the refined detection results are further updated through the adaptive kernel regression.



[1] Min Lin, Qiang Chen, Shuicheng Yan. Network In Network. In ICLR 2014.

[2] Qiang Chen, Zheng Song, Jian Dong, Zhongyang Huang, Yang Hua, Shuicheng Yan. Contextualizing Object Detection and Classification. In TPAMI 2014.



NUS-BST Min Lin(1), Jian Dong(1), Hanjiang Lai(1), Junjun Xiong(2), Shuicheng Yan(1)



(1) National University of Singapore

(2) Beijing Samsung Telecom R&D Center This submission is based on our recent ICLR’14 work called “Network in Network”, and there are four major components for the whole solution:



Network In Network (NIN) [key contribution]:

We trained an NIN, a special modification of a CNN [Min et al. 2014] with 14 parameterized layers. NIN uses a shared multilayer perceptron as the convolution kernel to convolve the underlying input; the resulting structure is equivalent to adding cascaded cross-channel parametric (CCCP) pooling on top of a convolutional layer. Adding CCCP layers significantly improves performance compared with vanilla convolution.



Augmented training and testing sample:

This improvement was first described by Andrew Howard [Andrew 2014]. Instead of resizing and cropping the image to 256x256, the image is proportionally resized to 256xN (or Nx256) with the short edge set to 256. Subcrops of 224x224 are then randomly extracted for training. During testing, 3 views of 256x256 are extracted and each view goes through the 10-view testing described in [Alex et al. 2012].
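
The 10-view extraction applied to each 256x256 test view can be sketched as follows (our own illustration of the standard scheme of [Alex et al. 2012], not the team's code):

```python
import numpy as np

def ten_views(img, size=224):
    """Extract the classic 10 test views from an H x W x 3 image:
    four corner crops, a centre crop, and their horizontal flips."""
    h, w = img.shape[:2]
    ys = [0, 0, h - size, h - size, (h - size) // 2]
    xs = [0, w - size, 0, w - size, (w - size) // 2]
    crops = [img[y:y + size, x:x + size] for y, x in zip(ys, xs)]
    crops += [c[:, ::-1] for c in crops]   # horizontal flips
    return crops
```

With 3 views per image, this yields 30 crops whose class probabilities are averaged at test time.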



Traditional features with SVM:

The traditional classification framework can provide complementary information, such as scene-level information, to the NIN network. Hence, we integrate the outputs from the traditional framework (based on our PASCAL VOC2012 winning solutions, with the new extension of high-order parametric coding, in which the first- and second-order parameters of the adapted GMM for each instance are both considered) to further improve performance.



Kernel regression for fusion of results:

Finally, we employ a non-parametric rectification method to correct the outputs from multiple models and obtain a more accurate prediction. For each sample in the training and validation sets, we have a pair consisting of the outputs from multiple models and the ground-truth label. For a test sample, regularized kernel regression determines the affinities between the test sample and its automatically selected training/validation samples; these affinities are then used to fuse the ground-truth labels of the selected samples into a rectified prediction.



Min Lin, Qiang Chen, and Shuicheng Yan. "Network In Network." International Conference on Learning Representations. 2014.



Howard, Andrew G. "Some Improvements on Deep Convolutional Neural Network Based Image Classification." International Conference on Learning Representations. 2014.



Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.

ORANGE-BUPT Hongliang BAI, Orange Labs Beijing

Yinan LIU, Orange Labs Beijing

Bo LIU, BUPT, CHINA

Yanchao FENG, BUPT, CHINA

Kun TAO, Orange Labs Beijing

Yuan DONG, Orange Labs Beijing

This is the second time we have participated in ILSVRC. This year, we submitted up to ten runs for the DET and LOC tasks. In DET, inspired by Ross Girshick's R-CNN method, we detect the 200 classes in test images using selective search, CNN models pre-trained on the LOC training set, fine-tuning on the detection training set, neural-network-based classification (201 classes including background), and bounding box regression. On the validation dataset, we obtain 0.272 mAP. Three steps are conducted in LOC: (1) train seven classification models by deep learning with different network structures and parameters, and test with data augmentation (crop, flip and scale); (2) segment test images into ~2000 regions with the selective search algorithm, then classify the regions into one of the 1000 classes with the above classifiers; (3) select the regions with the highest-probability classes produced by the classification model as the final output. On the classification validation set, the top-1/top-5 error rates are 0.3680 and 0.1526, compared with last year's 0.25194. For the localization task, the best performance is about 0.45 on the validation set.

PassBy Lin SUN(LENOVO/HKUST)

Zhanghui Kuang(LENOVO)

Cong Zhao(LENOVO)

Kui Jia (University of Macao)

Oscar C. Au (HKUST) Due to limited time, we did not obtain a strong CNN baseline (about 80% on the validation dataset). However, we want to show that traditional computer vision methods can boost performance even when the tools at hand are weak. In this submission, we propose a saliency-based method to better represent the images when a single CNN fails. Average and novel weighted-average methods are applied to obtain the final prediction. We believe our method would perform better given enough time to train and tune.



Reference:

1.DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition, ICML, 2014

2. ImageNet Classification with Deep Convolutional Neural Networks, NIPS, 2012

SCUT_GLH Guo Lihua (South China University of Technology)

Liao Qijun (South China University of Technology)

Ma Qianli (South China University of Technology)

Lin Junbin (South China University of Technology)

Deep neural networks have much stronger power to automatically learn the complex relation between input and output than traditional shallow models such as SVM and PCA. Currently, the most widely used network achieving the best performance is the CNN, which has been successfully applied to image classification, scene recognition, natural speech analysis and other areas. Our method trains a CNN on the ImageNet training images. We calculated the top-20 average accuracy on the validation set and found it to be above 90%. Based on this, we first establish the semantic relations among all labels, then use the CNN to extract the top 20 candidate labels, and finally re-rank the results based on the semantic relations of the candidate labels.

Southeast-CASIA Feng Liu, School of Automation, Southeast University

Zifeng Wu, Institute of Automation, Chinese Academy of Sciences

Yongzhen Huang, Institute of Automation, Chinese Academy of Sciences Our algorithm is composed of five components:

(1) Using the selective search algorithm to generate about 2400 proposals for every image.

(2) Training a two-category proposal classification model using a CNN on dataset 1 to remove proposals that are more likely to be background. 700 proposals are preserved after this step.

(3) Training an initial 200-category image classification model using CNN on dataset 1.

(4) Fine-tuning the initial model using 700 proposals. We consider two strategies: with sample balance and without sample balance over categories in fine-tuning, and accordingly obtain two proposal representation models. The final proposal representation is the combination of these two models.

(5) Training 200 two-category proposal classification models using SVM, and using bounding box regression to obtain the final detection results.



SYSU_Vision Liliang Zhang, Tianshui Chen, Shuye Zhang, Wanglan He, Liang Lin, Dengguang Pang, Lingbo Liu. Sun Yat-Sen University, China. Solution 1:

Our solution 1 employs a classification-localization framework. For classification, we train a 1000-class classification model based on the AlexNet architecture published at NIPS 2012. For localization, we first train a 1000-class localization model based on the same network. However, such a localization model is inclined to localize the salient region, which does not work well for ImageNet localization. We therefore fine-tune one thousand class-specific models from the pre-trained 1000-class localization model, one for each class. But because of the shortage of training images for each class, over-fitting is very serious. To reduce this problem, we design a similarity-sorted fine-tuning method. First, we choose one class and fine-tune the pre-trained 1000-class localization model to get a localization model for that class. Then we choose the class most similar to the previously chosen class and fine-tune it starting from that class's localization model. In this way, the training images of similar classes are shared.

Solution 2:

Our solution 2 was inspired by the R-CNN framework. For each test image we: first, use the classification model from solution 1 to get the top-5 class predictions; second, apply selective search to get candidate regions; third, fine-tune another classification model, based on the classification model above, specifically for classifying regions, and use it to score each region; fourth, take the highest-scoring region for each of the top-5 class predictions to form the final result.

Solution 3:

We compared the class-specific localization accuracy of solutions 1 and 2 on the validation set, and then chose the better solution for each class based on that accuracy. Generally speaking, solution 2 outperformed solution 1 when there were multiple objects in the image or the objects were relatively small.

Solution 4:

We simply averaged the results of solutions 1 and 2 to form our solution 4.



Trimps-Soushen Jie Shao, Xiaoteng Zhang, JianYing Zhou, Jian Wang, Jian Chen, Yanfeng Shang, Wenfei Wang, Lin Mei, Chuanping Hu.

The Third Research Institute of the Ministry of Public Security, P.R. China. Task 1: Detection

Our work is based on the R-CNN paper from CVPR 2014. We use another region proposal method, called RP (Randomized Prim's), from an ICCV 2013 paper; it generates fewer regions without a significant reduction in precision. We use these new regions to train a new model with less space and time. Besides this, we try several combination methods. First, we combine the regions generated by selective search and RP in a single model. We also train R-CNN separately on selective search regions and on RP regions, and then merge the results of the different models using NMS. In the training stage, we fine-tune the CNN model trained on ILSVRC2012 classification data with ILSVRC2014 detection data. We do not use any other outside data. We also try a simple method which uses our localization pipeline plus NMS for object detection.



Task 2: Classification and localization

Our model is based on large deep convolutional neural networks. We use several methods to improve performance. 1. Data augmentation. Some of our models are trained on the original data plus about 396,000 external images from the ILSVRC2010 and ILSVRC2011 training data; all training data belong to the original 1000 object categories. Other augmentation methods include random crops from Nx256 resized images, contrast and color jittering, and Gaussian noise. We resize images with OpenCV using cubic interpolation, which we found very useful. 2. Model details. The biggest model we trained has about 120M parameters. To encourage model diversity, we use different normalization and pooling methods, with partly randomly selected external data. We also train two kinds of complementary models: a supervised CNN pre-trained model and a varying-resolution model (normal resolution --> high resolution (fine-tuning) --> normal resolution (fine-tuning)). Both of these models have lower accuracy, but play a very important role in model voting. 3. Testing. We make predictions at multiple scales, each scale with 7 cropped images and their horizontal flips.



For the localization task, a simple pipeline is taken. First, we use RP to extract region proposals; regions with IoU greater than 0.8 are used as positive samples, and regions with IoU between 0.2 and 0.3 (localization data are not fully annotated) are used as background. Second, we fine-tune a classification model with these regions. Finally, for a test image, the extracted region proposals are fed to the fine-tuned model to get region confidences and corresponding coordinates. Based on the result from the classification task, we select the top-k regions and average their coordinates as output.
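
The IoU-based label assignment described above can be sketched as follows (illustrative code; the thresholds are those stated in the text):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def assign_label(proposal, gt_box):
    """Positive if IoU > 0.8, background if 0.2 < IoU < 0.3, else ignored
    (the gap guards against unannotated objects in the localization data)."""
    v = iou(proposal, gt_box)
    if v > 0.8:
        return "positive"
    if 0.2 < v < 0.3:
        return "background"
    return "ignore"
```

Proposals in the 0.3 to 0.8 band are left unused, since localization images are not exhaustively annotated and such regions may still contain objects.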





[1] Rich feature hierarchies for accurate object detection and semantic segmentation. Girshick, Ross and Donahue, Jeff and Darrell, Trevor and Malik, Jitendra. Computer Vision and Pattern Recognition 2014.

[2]Prime Object Proposals with Randomized Prim's Algorithm, Santiago Manen, Matthieu Guillaumin, Luc Van Gool, International Conference on Computer Vision (ICCV) 2013.

[3] Some Improvements on Deep Convolutional Neural Network Based Image Classification. Andrew G. Howard. http://arxiv.org/abs/1312.5402

TTIC_ECP - EpitomicVision George Papandreou, Toyota Technological Institute at Chicago (TTIC)

Iasonas Kokkinos, Ecole Centrale Paris (ECP) These entries showcase deep epitomic neural nets [1]. An epitomic convolution layer replaces a pair of consecutive convolution and max-pooling layers found in standard deep convolutional neural networks (CNNs). The model uses mini-epitomes [2] in place of filters and computes responses invariant to small translations by epitomic search instead of max-pooling over image positions. Epitomic search returns the maximum response of each image patch with all patches extracted from a larger epitome [3]. The model parameters (mini-epitome filters) are learned by error backpropagation in a supervised fashion, similar to standard CNNs [4, 5]. We have submitted the following entries:
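
The epitomic search step can be sketched as follows (an illustrative brute-force single-channel NumPy version; the real model operates on learned mini-epitome filters and is far more efficient):

```python
import numpy as np

def epitomic_response(patch, epitome):
    """Correlate a k x k patch with every k x k sub-window of a larger epitome
    and return the maximum response (translation-invariant matching)."""
    k = patch.shape[0]
    eh, ew = epitome.shape
    best = -np.inf
    for y in range(eh - k + 1):
        for x in range(ew - k + 1):
            best = max(best, float((patch * epitome[y:y + k, x:x + k]).sum()))
    return best
```

Taking the maximum over epitome positions plays the role that max-pooling over image positions plays in a standard CNN, which is why one epitomic layer can replace a convolution plus max-pooling pair.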



EpitomicVision1 (vanilla epitomic NN):



This entry was obtained with the EPITOMIC-NORM variant of the epitomic model described in detail in [1]. The only difference from [1] is that the current network has more hidden units in layers 1 to 6. A single large net was used (no averaging over different nets). No attempt was made at localization (we report the whole image as the bounding box prediction).



EpitomicVision2 (+ scale and position search):



This model also searches over scale and position for the best match. This is implemented by building a mosaic with multiple versions of the image at different scales [6, 7], running the epitomic classifier in a convolutional fashion similar to [5], and selecting the position on the mosaic that gives the maximum response. The parameters of the model were initialized from a model similar to EpitomicVision1 and were fine-tuned. A single large net was used (no averaging over different nets). No attempt was made at localization (we report the whole image as the bounding box prediction).



EpitomicVision3 (fusion of EpitomicVision1 + EpitomicVision2):



The class probabilities for this model are weighted averages of the EpitomicVision1 (w=0.4) and EpitomicVision2 (w=0.6) models. No attempt was made at localization (we report the whole image as the bounding box prediction).



EpitomicVision4 (EpitomicVision2 with fixed mapping of the best matching mosaic position to bounding box):



This is a simple attempt to equip the EpitomicVision2 predictions with localization estimates.





All models have been trained using the supplied CLOC training set alone.



Acknowledgments:



We implemented the methods by extending the excellent Caffe software framework [8]. We gratefully acknowledge the support of NVIDIA Corporation with the donation of GPUs used for this research.



References:



[1] G. Papandreou, "Deep Epitomic Convolutional Neural Networks,"

arXiv:1406.2732, June 2014.



[2] G. Papandreou, L.-C. Chen, and A. Yuille, "Modeling image patches with a generic dictionary of mini-epitomes," in Proc. CVPR 2014.



[3] N. Jojic, B. Frey, and A. Kannan, "Epitomic analysis of appearance and shape", in Proc. ICCV 2003.



[4] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. NIPS 2012.



[5] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "Overfeat: Integrated recognition, localization and detection using convolutional networks," in Proc. ICLR 2014.



[6] C. Dubout and F. Fleuret, "Exact acceleration of linear object detectors," in Proc. ECCV 2012.



[7] F. Iandola, M. Moskewicz, S. Karayev, R. Girshick, T. Darrell, K. Keutzer, "DenseNet: Implementing efficient ConvNet descriptor pyramids," arXiv:1404.1869, April 2014.



[8] Y. Jia, "Caffe: An open source convolutional architecture for fast feature embedding," 2013.

UI Fatemeh Shafizadegan, Msc student of Artificial Intelligence, University of Isfahan.

Elham Shabaninia, PhD candidate of Artificial Intelligence, University of Isfahan. Our model is based on Spatial Pyramid Matching (SPM), similar to [1], which extends SPM with sparse codes of SIFT features and a linear kernel. SIFT features are robust to rotation, scale, affine transformation and intensity changes. The approach reduces the SVM training complexity to O(n), while the testing complexity is unchanged. It uses max spatial pooling, which is robust to local spatial translations, and the resulting image representation works well with linear SVM classifiers.
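
The max spatial pooling used in this representation can be sketched as follows (our own illustration under the assumption of non-negative sparse codes and normalized descriptor positions, not the submission's code):

```python
import numpy as np

def spm_max_pool(codes, positions, levels=(1, 2, 4)):
    """Max-pool K-dim sparse codes over spatial pyramid cells and concatenate.
    codes: (N, K) codes of N local descriptors; positions: (N, 2) in [0, 1)."""
    n, k = codes.shape
    pooled = []
    for g in levels:                                          # a g x g grid per level
        cell = np.minimum((positions * g).astype(int), g - 1) # cell index per descriptor
        idx = cell[:, 0] * g + cell[:, 1]
        for c in range(g * g):
            mask = idx == c
            pooled.append(codes[mask].max(axis=0) if mask.any() else np.zeros(k))
    return np.concatenate(pooled)                             # length K * sum(g * g)
```

Because each pyramid cell keeps only the strongest activation per code dimension, the representation tolerates small translations of local features, and the concatenated vector feeds directly into a linear SVM.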







[1] Linear Spatial Pyramid Matching Using Sparse Coding for Image Classification, J.Yang, K.Yu, Y.Gong, T.Huang, CVPR 2009.







UvA-Euvision Koen van de Sande

Daniel Fontijne

Cees Snoek

Harro Stokman

Arnold Smeulders



University of Amsterdam and Euvision Technologies Task 1 Detection

================

Our first run is based on deep learning in combination with selective search. It is trained using some additional data from ImageNet.

Our second run is based on deep learning in combination with selective search. It is trained on just the provided data.

Our third run is Fisher with FLAIR. It is the equivalent of our top entry in 2013 with improved training procedure. See Van de Sande et al., "Fisher and VLAD with FLAIR", CVPR 2014 for algorithm details. It is trained on just the provided data. This run has a speed advantage over the previous two runs.



Task 2 CLS+LOC

==============

We participate in just the classification task using deep learning. No outside data is used.

VGG Karen Simonyan, University of Oxford

Andrew Zisserman, University of Oxford In this submission we explore the effect of the convolutional network (ConvNet) depth on its accuracy. We have used three ConvNet architectures with the following weight layer configurations:

1) ten 3x3 convolutional layers, three 1x1 convolutional layers, and three fully-connected layers - 16 weight layers in total;

2) thirteen 3x3 convolutional layers and three fully-connected layers - 16 weight layers in total;

3) sixteen 3x3 convolutional layers and three fully-connected layers - 19 weight layers in total.

All convolutional layers have stride 1 and are followed by ReLU non-linearity. The fully-connected layers are regularised with dropout. The networks were trained on fixed-size image crops, but at test time they were applied densely over the whole uncropped images.
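
A small note on why stacks of 3x3 layers suffice (our own illustration, not from the submission): with stride 1, each 3x3 layer grows the receptive field by 2 pixels, so deep stacks cover large fields while keeping few parameters per layer.

```python
def receptive_field(num_3x3_layers):
    """Receptive field of a stack of stride-1 3x3 convolutions:
    each layer adds kernel - 1 = 2 pixels, so n layers see (2n + 1) x (2n + 1)."""
    rf = 1
    for _ in range(num_3x3_layers):
        rf += 2
    return rf

# Three stacked 3x3 layers cover the same 7x7 field as a single 7x7 layer,
# with fewer weights and two extra non-linearities in between.
```

For example, receptive_field(3) is 7, matching a single 7x7 convolution but with 3 * (3*3) = 27 weights per channel pair instead of 49.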



For localisation, we used per-class bounding box regression similar to OverFeat, but over a smaller number of scales and without multiple max-pooling offsets.



Our implementation is derived from the Caffe toolbox, but contains a number of significant modifications, including parallel training on multiple GPUs installed in a single system. Training a single ConvNet on 4 NVIDIA Titan GPUs took from 2 to 3 weeks (depending on the ConvNet configuration).

Virginia Tech Akrit Mohapatra, Neelima Chavali



Virginia Tech An undergraduate summer research project by Akrit Mohapatra in collaboration with Neelima Chavali, based on the R-CNN paper (arXiv:1311.2524v4; Ross B. Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik: "Rich feature hierarchies for accurate object detection and semantic segmentation"). The algorithm and code from the paper were used, and models were created by varying several hyper-parameters.