Team name Team members Abstract

360+MCG-ICT-CAS_SP Rui Zhang (1,2)

Min Lin (1)

Sheng Tang (2)

Yu Li (1,2)

YunPeng Chen (3)

YongDong Zhang (2)

JinTao Li (2)

YuGang Han (1)

ShuiCheng Yan (1,3)



(1) Qihoo 360

(2) Multimedia Computing Group, Institute of Computing Technology, Chinese Academy of Sciences (MCG-ICT-CAS), Beijing, China

(3) National University of Singapore (NUS)

Technique Details for the Scene Parsing Task:



Our scene parsing system makes two core and general contributions: 1) a local-refinement-network for object boundary refinement, and 2) an iterative-boosting-network for overall parsing refinement.

These two networks collaboratively refine the parsing results from two perspectives, and the details are as below:

1) Local-refinement-network for object boundary refinement. This network takes the original image and the K object probability maps (one per class) as inputs, and outputs m*m feature maps indicating how each of the m*m neighbors propagates its probability vector to the center point for local refinement. In spirit it works similarly to bounding-box refinement in the object detection task, but here it locally refines the object boundary instead of the object bounding box.

2) Iterative-boosting-network for overall parsing refinement. This network takes the original image and the K object probability maps (one per class) as inputs, and outputs the refined probability maps for all classes. It iteratively boosts the parsing results in a global way.
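As a rough sketch of the local-refinement step in 1), the propagation of probability vectors from the m*m neighbors to each center pixel could look like the following (shapes, names, and the renormalization step are our own assumptions, not the authors' code):

```python
import numpy as np

def local_refine(probs, weights, m=3):
    """Hypothetical sketch of local refinement.

    probs:   (K, H, W) class probability maps.
    weights: (m*m, H, W) network outputs, one map per neighbor,
             telling how strongly each of the m*m neighbors
             propagates its probability vector to the center pixel.
    """
    K, H, W = probs.shape
    r = m // 2
    # Pad so border pixels have a full m*m neighborhood.
    padded = np.pad(probs, ((0, 0), (r, r), (r, r)), mode="edge")
    refined = np.zeros_like(probs)
    idx = 0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            # Probability vectors of the neighbor at offset (dy, dx).
            shifted = padded[:, r + dy : r + dy + H, r + dx : r + dx + W]
            refined += weights[idx] * shifted
            idx += 1
    # Renormalize so each pixel's class probabilities sum to 1.
    return refined / refined.sum(axis=0, keepdims=True)
```

With uniform weights this degenerates to simple smoothing; the learned, spatially varying weights are what let the network sharpen object boundaries.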



Two other tricks are also used:

1) Global context aggregation: Scene classification information can provide global context for the parsing decision, as well as capture the co-occurrence relationship between a scene and the objects/stuff within it. Thus, we add features from an independent scene classification model, trained on the ILSVRC 2016 Scene Classification dataset, into our scene parsing system as context.

2) Multi-scale scheme: Considering the limited amount of training data and the varying scales of objects across training samples, we use multi-scale data augmentation in both the training and inference stages. High-resolution models are also trained on magnified images to capture details and small objects.





360+MCG-ICT-CAS_DET Yu Li (1,2),

Sheng Tang (2),

Min Lin (1),

Rui Zhang (1,2),

YunPeng Chen (3),

YongDong Zhang (2),

JinTao Li (2),

YuGang Han (1),

ShuiCheng Yan (1,3),



(1) Qihoo 360,

(2) Multimedia Computing Group, Institute of Computing Technology, Chinese Academy of Sciences (MCG-ICT-CAS), Beijing, China,

(3) National University of Singapore (NUS)

Technique Details for Object Detection Task



The new contributions of this system are three-fold: 1) implicit sub-categories of the background class, 2) a sink class when necessary, and 3) new semantic segmentation features.



For training:



1) Implicit sub-categories of the background class: in Faster R-CNN [1], the "background" class is treated as ONE class, on a par with the individual object classes, but it is quite diverse and impossible to describe as a single pattern. Thus we use K output nodes, namely K patterns, to implicitly represent sub-categories of the background class, which considerably improves the identification capability for background.

2) Sink class when necessary: It is often the case that the ground-truth class receives low probability, and the result is then incorrect since the probabilities of all classes sum to 1. To address this issue and give a low-probability ground-truth class a better chance to win, we add a so-called "sink" class, which absorbs some of the probability mass when the ground-truth class scores low, making the other classes score even lower than the ground truth so that the ground truth wins. We further propose to use the sink class in the loss function only when necessary, namely when the ground-truth class is not in the top-k list.

3) New semantic segmentation features: On one hand, motivated by [2], we generate a weakly supervised segmentation feature that is used to train region proposal scoring functions and lets gradients flow among all branches. On the other hand, an independent segmentation model trained on the ILSVRC Scene Parsing dataset provides features for our detection network, which is expected to bring both stuff and object information into the decision.

4) Dilation as context: Motivated by the dilated convolutions [3] widely used in segmentation, we introduce dilated convolutional layers (initialized as identity mappings) to obtain effective context for training.
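The "sink class when necessary" scheme in 2) can be sketched as follows (a minimal plain-softmax formulation on raw logits; the function names and the top-k test are our assumptions, not the system's actual loss layer):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D logit vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def sink_loss(logits, sink_logit, gt, k=5):
    """Hypothetical sketch of the 'sink class when necessary' loss.

    logits:     scores for the K real classes (background included).
    sink_logit: score of the extra sink class.
    gt:         index of the ground-truth class.
    k:          the sink class is used only when gt is outside top-k.
    """
    topk = np.argsort(logits)[::-1][:k]
    if gt in topk:
        # Ground truth already competitive: ordinary cross-entropy.
        return -np.log(softmax(logits)[gt])
    # Otherwise append the sink logit so it can absorb probability
    # mass from the wrongly confident classes.
    return -np.log(softmax(np.append(logits, sink_logit))[gt])
```

When the ground truth is outside the top-k, the appended sink logit drains probability from all classes, so the gradient pushes the over-confident wrong classes down harder relative to the ground truth.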
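The identity initialization in 4) can be illustrated with a naive NumPy dilated convolution (a sketch only; the real system would use a deep learning framework's dilated conv layer, and the helper names here are ours):

```python
import numpy as np

def identity_dilated_kernel(channels, ksize=3):
    """Kernel initialized as an identity mapping: a 1 at the center
    of each channel's own filter, 0 elsewhere. Adding the layer thus
    initially leaves features unchanged, and context is learned
    gradually during fine-tuning."""
    w = np.zeros((channels, channels, ksize, ksize))
    c = ksize // 2
    for i in range(channels):
        w[i, i, c, c] = 1.0
    return w

def dilated_conv2d(x, w, dilation=2):
    """Naive dilated convolution with 'same' padding, stride 1.
    x: (C_in, H, W), w: (C_out, C_in, k, k)."""
    cin, H, W = x.shape
    cout, _, k, _ = w.shape
    pad = dilation * (k // 2)
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((cout, H, W))
    for o in range(cout):
        for i in range(cin):
            for a in range(k):
                for b in range(k):
                    dy, dx = a * dilation, b * dilation
                    out[o] += w[o, i, a, b] * xp[i, dy:dy + H, dx:dx + W]
    return out
```

With the identity kernel, the output equals the input exactly, which is the point of the initialization: the new context layer cannot hurt the pretrained features at the start of training.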



For testing:

We utilize box refinement, box voting, multi-scale testing, co-occurrence refinement, and model ensembling to benefit the inference stage.



References:

[1] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015.

[2] Gidaris, Spyros, and Nikos Komodakis. "Object detection via a multi-region and semantic segmentation-aware cnn model." Proceedings of the IEEE International Conference on Computer Vision. 2015.

[3] Yu, Fisher, and Vladlen Koltun. "Multi-scale context aggregation by dilated convolutions." International Conference on Learning Representations. 2016.

ABTEST Ankan Bansal We have used a 22-layer GoogLeNet [1] model to classify scenes. The model was trained on the LSUN [2] dataset and then fine-tuned on the Places dataset for 365 categories. We did not use any intelligent data selection techniques. The network is simply trained using all the available data, without considering the data distribution across classes.



Before training on LSUN, this network was trained using the Places205 dataset. The model was trained until it saturated at around 85% (top-1) accuracy on the validation set of the LSUN challenge. Then the model was fine-tuned on the 365 categories in the Places2 challenge.



We did not use the trained models provided by the organisers to initialise our network.



References:

[1] Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

[2] Yu, Fisher, et al. "Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop." arXiv preprint arXiv:1506.03365 (2015).

ACRV-Adelaide Guosheng Lin;

Chunhua Shen;

Anton van den Hengel;

Ian Reid;



Affiliations: ACRV; University of Adelaide

Our method is based on multi-level information fusion. We generate a multi-level representation of the input image and develop a number of fusion networks with different architectures.

Our models are initialized from the pre-trained residual nets [1] with 50 and 101 layers. A part of the network design in our system is inspired by the multi-scale network with pyramid pooling which is described in [2] and the FCN network in [3].



Our system achieves good performance on the validation set. The IoU score on the validation set is 40.3 using a single model, clearly better than the results of the baseline methods reported in [4]. Applying DenseCRF [5] slightly improves the result.





We are preparing a technical report on our method, and it will be available on arXiv soon.





References:

[1] "Deep Residual Learning for Image Recognition", Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. CVPR 2016.

[2] "Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation", Guosheng Lin, Chunhua Shen, Anton van den Hengel, Ian Reid; CVPR 2016

[3] "Fully convolutional networks for semantic segmentation", J Long, E Shelhamer, T Darrell; CVPR 2015

[4] "Semantic Understanding of Scenes through ADE20K Dataset" B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba. arXiv:1608.05442

[5] "Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials", Philipp Krahenbuhl, Vladlen Koltun; NIPS 2012.

Adelaide Zifeng Wu, University of Adelaide

Chunhua Shen, University of Adelaide

Anton van den Hengel, University of Adelaide

We have trained networks with several newly designed structures. One of them performs as well as the Inception-ResNet-v2 network on the classification task. It was further tuned for several epochs using the Places365 dataset, which finally yielded even better results on the validation set of the segmentation task. As for FCNs, we mostly followed the settings in our previous technical reports [1, 2]. The best result was obtained by combining FCNs initialized from two pre-trained networks.



[1] High-performance Semantic Segmentation Using Very Deep Fully Convolutional Networks. https://arxiv.org/abs/1604.04339

[2] Bridging Category-level and Instance-level Semantic Image Segmentation. https://arxiv.org/abs/1605.06885

ASTAR_VA Romain Vial (VA Master Intern Student)

Zhu Hongyuan (VA Scientist)

Su Bolan (ex ASTAR Scientist)

Shijian Lu (VA Head)

The problem of object detection in videos is an important part of computer vision that has yet to be solved. The diversity of scenes combined with the presence of movement makes this task very challenging.



Our system localizes and recognizes objects from various scales, positions and classes. It takes into account spatial (local and global) and temporal information from several previous frames.



The model has been trained on both the training and validation set. We achieve a final score on the validation set of 76.5% mAP.

BSC-UPC Andrea Ferri This is the result of my thesis: setting up a deep learning environment on a computational server and developing an object tracking in video system with TensorFlow suitable for the ImageNet VID challenge.

BUAA ERCACAT Biao Leng (Beihang University), Guanglu Song (Beihang University), Cheng Xu (Beihang University), Jiongchao Jin (Beihang University), Zhang Xiong (Beihang University)

Our group utilizes two image object detection architectures, namely Fast R-CNN [2] and Faster R-CNN [1], for the object detection task. The Faster R-CNN detection system can be divided into two modules: an RPN (region proposal network), a fully convolutional network that proposes regions telling the Faster R-CNN modules where to focus in an image, and a Fast R-CNN detector that uses the region proposals and classifies the objects within them.

Our training model is based on the VGG_16 model, and we utilize a combined model for higher RPN recall.



[1] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” arXiv:1506.01497.

[2] Ross Girshick. "Fast R-CNN: Fast Region-based Convolutional Networks for object detection", ICCV 2015.

CASIA_IVA Jun Fu, Jing Liu, Xinxin Zhu, Longteng Guo, Zhenwei Shen, Zhiwei Fang, Hanqing Lu We implement image semantic segmentation based on the fused results of three deep models: DeepLab [1], OA-Seg [2], and the official public model in this challenge. DeepLab is trained with a ResNet-101 backbone and is further improved with object proposals and multiscale prediction combination. OA-Seg is trained with VGG, in which object proposals and multiscale supervision are considered. We augment the training data with multiscale and mirrored variants for both models. We additionally employ multi-label image annotations to refine the segmentation results.

[1] Liang-Chieh Chen et al., DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, arXiv:1606.00915, 2016

[2] Yuhang Wang et al., Objectness-aware Semantic Segmentation, accepted by ACM Multimedia, 2016.



Choong Choong Hwan Choi (KAIST)

An ensemble of deep learning models based on VGG16 & ResNet.

Based on VGG16, features are extracted from multiple layers. No ROI proposal network is applied: every neuron in each feature layer is the center of an ROI candidate.

References:

[1] Liu, Wei, et. al. "SSD: Single Shot Multibox Detector"

[2] K. Simonyan, A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition"

[3] Kaiming He, et. al., "Deep Residual Learning for Image Recognition"





CIGIT_Media Youji Feng, Jiangjing Lv, Xiaohu Shao, Pengcheng Liu, Cheng Cheng



Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences

We present a simple method combining still-image object detection and object tracking for the ImageNet VID task. Object detection is first performed on each frame of the video, and the detected targets are then tracked through the nearby frames. Each tracked target is also assigned a detection score by the object detector. According to these scores, non-maximum suppression (NMS) is applied to all detected and tracked targets on each frame to obtain the VID results. To improve performance, we employ two state-of-the-art detectors for still-image object detection, i.e. the R-FCN detector and the SSD detector. We run the above steps for both detectors independently and combine their respective results into the final ones through NMS.
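The per-frame merging of the two detectors' outputs can be sketched as follows (a simplified single-class version; the box format, IoU threshold, and function names are our assumptions, not the authors' implementation):

```python
import numpy as np

def iou(a, b):
    """IoU between one box a and an array of boxes b; boxes are (x1, y1, x2, y2)."""
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression; returns the kept indices."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Drop remaining boxes that overlap the kept box too much.
        order = rest[iou(boxes[i], boxes[rest]) <= thresh]
    return keep

def merge_detectors(boxes_a, scores_a, boxes_b, scores_b, thresh=0.5):
    """Pool both detectors' boxes and scores, then run one NMS pass."""
    boxes = np.vstack([boxes_a, boxes_b])
    scores = np.concatenate([scores_a, scores_b])
    keep = nms(boxes, scores, thresh)
    return boxes[keep], scores[keep]
```

In the actual system this merging would be done per class and per frame, on the union of detected and tracked targets.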



[1] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object Detection via Region-based Fully Convolutional Networks. arXiv 2016.

[2] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. Berg. SSD: Single Shot MultiBox Detector. arXiv 2016.

[3] K. Kang, W. Ouyang, H. Li, and X. Wang. Object Detection from Video Tubelets with Convolutional Neural Networks. CVPR 2016.

CIL Seongmin Kang

Seonghoon Kim

Yusun Lim

Kibum Bae

Heungwoo Han

Our model is based on Faster RCNN [1].

A pre-activation residual network [2] trained on the ILSVRC 2016 dataset is modified for the detection task.

Heavy data augmentation is applied. OHEM [3] and atrous convolution are also applied.

All of them are implemented in TensorFlow with multi-GPU training [4].



To meet the deadline, the detection model was trained for only one third of the training epochs we had planned.



[1] Shaoqing Ren et al., Faster R-CNN: Towards real-time object detection with region proposal networks, NIPS, 2015

[2] Kaiming He et al., Identity Mappings in Deep Residual Networks, ECCV, 2016

[3]Abhinav Shrivastava et al., Training Region-based Object Detectors with Online Hard Example Mining, CVPR, 2016

[4]Martín Abadi et al., TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org





CU-DeepLink Major team members

-------------------



Xingcheng Zhang ^1

Zhizhong Li ^1

Yang Shuo ^1

Yuanjun Xiong ^1

Yubin Deng ^1

Xiaoxiao Li ^1

Kai Chen ^1

Yingrui Wang ^2

Chen Huang ^1

Tong Xiao ^1

Wanshen Feng ^2

Xinyu Pan ^1

Yunxiang Ge ^1

Hang Song ^1

Yujun Shen ^1

Boyang Deng ^1

Ruohui Wang ^1



Supervisors

------------



Dahua Lin ^1

Chen Change Loy ^1

Wenzhi Liu ^2

Shengen Yan ^2



1 - Multimedia Lab, The Chinese University of Hong Kong.

2 - SenseTime Inc.

Our efforts are divided into two relatively independent directions, namely classification and localization. Specifically, the classification framework would predict five distinct class labels for each image, while the localization framework would produce bounding boxes, one for each predicted class label.



Classification

------------------



Our classification framework is built on top of Google's Inception-ResNet-v2 (IR-v2) [1]. We combined several important techniques, which together lead to a substantial performance gain.



1. We developed a novel building block, called “PolyInception”. Each PolyInception can be considered as a meta-module that integrates multiple inception modules via K-way polynomial composition. In this way, we substantially improve a module's expressive power. Also, to facilitate the propagation of gradients across a very deep network, we retain an identity path [2] for each PolyInception.

2. At the core of our framework are the Grand Models. Each grand model comprises three sections operating at different spatial resolutions. Each section is a stack of multiple PolyInception modules. To achieve optimal overall performance (within a certain computational budget), we rebalance the number of modules across the sections.

3. Most of our grand models contain over 500 layers. While they demonstrate remarkable model capacity, we observed notable overfitting at later stages of the training process. To overcome this difficulty, we adopted Stochastic Depth [3] for regularization.

4. We trained 20+ Grand Models, some deeper and others wider. These models constitute a performant yet diverse ensemble. The single most powerful Grand Model reached a top-5 classification error of 4.27% (single crop) on the validation set.

5. Given an image, the class label predictions are produced in two steps. First, multiple crops at 8 scales are generated, and predictions are made on each crop; these are then combined via a novel scheme called selective pooling. The multi-crop predictions from individual models are finally integrated to reach the final prediction. In particular, we explored two different integration strategies, namely ensemble-net (a two-layer neural network designed to integrate predictions) and class-dependent model reweighting. With these ensemble techniques, we reached a top-5 classification error below 2.8% on the validation set.



Localization

-----------------



Our localization framework is a pipeline comprised of Region Proposal Networks (RPN) and R-CNN models.



1. We trained two RPNs with different design parameters based on ResNet.

2. Given an image, 300 bounding box proposals are derived based on the RPNs, using multi-scale NMS pooling.

3. We also trained four R-CNN models, based respectively on ResNet-101, ResNet-269, Extended IR-v2, and one of our Grand Models. These R-CNNs are used to predict how likely a bounding box is to belong to each class, as well as to refine the bounding box (via bounding box regression).

4. The four R-CNN models form an ensemble. Their predictions (both class scores and refined bounding boxes) are integrated via average pooling. Given a class label, the refined bounding box with the highest score for that class is used as the result.



Deep Learning Framework

-----------------



Both our classification and localization frameworks are implemented using Parrots, a new deep learning framework developed internally from scratch. Parrots features a highly scalable distributed training scheme, a memory manager that supports dynamic memory reuse, and a parallel preprocessing pipeline. With this framework, the training time is substantially reduced; also, with the same GPU memory capacity, much larger networks can be accommodated.



References

-----------------



[1] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning". arXiv:1602.07261. 2016.

[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv:1603.05027. 2016.

[3] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, Kilian Weinberger. "Deep Networks with Stochastic Depth". arXiv:1603.09382. 2016.



CUImage Wanli Ouyang, Junjie Yan, Xingyu Zeng, Hongsheng Li, Tong Xiao, Kun Wang, Xin Zhu, Yucong Zhou, Yu Liu, Buyu Li, Zhiwei Fang, Changbao Wang, Zhe Wang, Hui Zhou, Liping Zhang, Xingcheng Zhang, Zhizhong Li, Hongyang Li, Ruohui Wang, Shengen Yan, Dahua Lin, Xiaogang Wang

Compared with the CUImage submission in ILSVRC 2015, the new components are as follows.

(1) The models are pretrained for the 1000-class object detection task using the approach in [a], but adapted to Fast R-CNN for faster detection speed.

(2) The region proposal is obtained using the improved version of CRAFT in [b].

(3) A GBD network [c] with 269 layers is fine-tuned on the 200 detection classes with the gated bidirectional network (GBD-Net), which passes messages between features from different support regions during both feature learning and feature extraction. GBD-Net is found to bring ~3% mAP improvement over the baseline 269-layer model and ~5% mAP improvement over the batch-normalized GoogLeNet.

(4) To handle the long-tail distribution problem, the 200 classes are clustered. Unlike the original implementation in [d], which learns several models, a single model is learned in which different clusters have both shared and distinct feature representations.

(5) An ensemble of the models using the approaches above leads to the final result in the provided-data track.

(6) For the external data track, we propose object detection with landmarks. Compared to the standard bounding-box-centric approach, our landmark-centric approach provides more structural information and can be used to improve both the localization and classification steps in object detection. Based on the landmark annotations provided in [e], we annotate 862 landmarks across the 200 categories on the training set. We then use them to train a CNN regressor to predict the landmark position and visibility for each proposal in test images. In the classification step, we use landmark pooling on top of the fully convolutional network, where features around each landmark are mapped to a confidence score for the corresponding category. Landmark-level classification can be naturally combined with standard bounding-box-level classification to get the final detection result.

(7) An ensemble of the models using the approaches above leads to the final result in the external data track.





Our work is strongly supported by the fastest publicly available multi-GPU Caffe code [f].





[a] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, C. Loy, X. Tang, “DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection,” CVPR 2015.

[b] Yang, B., Yan, J., Lei, Z., Li, S. Z. "Craft objects from images." CVPR 2016.

[c] X. Zeng, W. Ouyang, B. Yang, J. Yan, X. Wang, “Gated Bi-directional CNN for Object Detection,” ECCV 2016.

[d] Ouyang, W., Wang, X., Zhang, C., Yang, X. Factors in Finetuning Deep Model for Object Detection with Long-tail Distribution. CVPR 2016.

[e] Wanli Ouyang, Hongyang Li, Xingyu Zeng, and Xiaogang Wang, "Learning Deep Representation with Large-scale Attributes", In Proc. ICCV 2015.

[f] https://github.com/yjxiong/caffe



CUVideo Hongsheng Li*, Kai Kang* (* indicates equal contribution), Wanli Ouyang, Junjie Yan, Tong Xiao, Xingyu Zeng, Kun Wang, Xihui Liu, Qi Chu, Junming Fan, Yucong Zhou, Yu Liu, Ruohui Wang, Shengen Yan, Dahua Lin, Xiaogang Wang



The Chinese University of Hong Kong, SenseTime Group Limited

We utilize several deep neural networks with different structures for the VID task.



(1) The models are pretrained for the 200-class detection task using the approach in [a], but adapted to Fast R-CNN for faster detection speed.

(2) The region proposal is obtained by a separately-trained ResNet-269 model.

(3) A GBD network [b] with 269 layers is fine-tuned on the 200 detection classes of the DET task and then on the 30 classes of the VID task. It passes messages between features from different support regions during both feature learning and feature extraction. GBD-Net is found to bring ~3% mAP improvement over the baseline 269-layer model.

(4) Based on the detection boxes of individual frames, tracklet proposals are efficiently generated by trained bounding box regressors. An LSTM network is integrated to learn temporal appearance variation.

(5) Multi-context suppression and motion-guided propagation [c] are utilized to post-process the per-frame detection results. They yield a ~3.5% mAP improvement on the validation set.

(6) An ensemble of the models using the approaches above leads to the final result in the provided-data track.

(7) For the VID with tracking task, we modified an online multiple object tracking algorithm [d]. The tracking-by-detection algorithm utilizes our per-frame detection results and generates tracklets for different objects.



Our work is strongly supported by the fastest publicly available multi-GPU Caffe code [e].





[a] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, C. Loy, X. Tang, “DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection,” CVPR 2015.

[b] X. Zeng, W. Ouyang, B. Yang, J. Yan, X. Wang, “Gated Bi-directional CNN for Object Detection,” ECCV 2016.

[c] K. Kang, H. Li, J. Yan, X. Zeng, B. Yang, T. Xiao, C. Zhang, Z. Wang, R. Wang, X. Wang, W. Ouyang, “T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos”, arXiv:1604.02532

[d] J. H. Yoon, C.-R. Lee, M.-H. Yang, K.-J. Yoon, “Online Multi-Object Tracking via Structural Constraint Event Aggregation”, CVPR 2016

[e] https://github.com/yjxiong/caffe

Deep Cognition Labs Mandeep Kumar, Deep Cognition Labs

Krishna Kishore, Deep Cognition Labs

Rajendra Singh, Deep Cognition Labs

We present results for the scene parsing task acquired using a modified DeepLab VGG-16 network along with a CRF.

DEEPimagine Sung-soo Park(DEEPimagine corp.)

Hyoung-jin Moon(DEEPimagine corp.)



Contact email: sspark@deepimagine.com

1. Model design

- Wide Residual SWAPOUT network

- Inception Residual SWAPOUT network

- We focused on model multiplicity with many shallow networks

- We adopted a SWAPOUT architecture



2. Ensemble

- Fully convolutional dense crop

- Variant parameter model ensemble





[1] " Swapout: Learning an ensemble of deep architectures"

Saurabh Singh, Derek Hoiem, David Forsyth



[2] " Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning"

Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi



[3] " Deep Residual Learning for Image Recognition "

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun

DeepIST Heechul Jung*(DGIST/KAIST), Youngsoo Kim*(KAIST), Byungju Kim(KAIST), Jihun Jung(DGIST), Junkwang Kim(DGIST), Junho Yim(KAIST), Min-Kook Choi(DGIST), Yeakang Lee(KAIST), Soon Kwon(DGIST), Woo Young Jung(DGIST), Junmo Kim(KAIST)

* indicates equal contribution. We use nine networks in total: one 200-layer ResNet, one Inception-ResNet-v2, one Inception-v3, two 212-layer ResNets, and four Branched-ResNets.

The networks are trained for 95 epochs, except for Inception-ResNet-v2 and Inception-v3.

Ensemble A takes an average of one 212-layer ResNet, two Branched-ResNets and one Inception-ResNet v2.

Ensemble B takes a weighted sum over one 212-layer ResNet, two Branched-ResNets and one Inception-ResNet v2.

Ensemble C takes an average of one 200-layer ResNet, two 212-layer ResNets, two Branched-ResNets, one Inception-v3, and one Inception-ResNet-v2. It achieves a top-5 error rate of 3.16% on the 20,000 validation images.

Ensemble D averages the results of all nine networks.



We submit only classification results.



References:

[1] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).

[2] He, Kaiming, et al. "Identity mappings in deep residual networks." arXiv preprint arXiv:1603.05027 (2016).

[3] Szegedy, Christian, Sergey Ioffe, and Vincent Vanhoucke. "Inception-v4, inception-resnet and the impact of residual connections on learning." arXiv preprint arXiv:1602.07261 (2016).

[4] Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." arXiv preprint arXiv:1512.00567 (2015).

[5] Sermanet, Pierre, et al. "Overfeat: Integrated recognition, localization and detection using convolutional networks." arXiv preprint arXiv:1312.6229 (2013).

[6] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).



Acknowledgement

- DGIST was funded by the Ministry of Science, ICT and Future Planning.

- KAIST was funded by Hanwha Techwin CO., LTD.

DGIST-KAIST Heechul Jung(DGIST/KAIST), Jihun Jung(DGIST), Junkwang Kim(DGIST), Min-Kook Choi(DGIST), Soon Kwon(DGIST), Junmo Kim(KAIST), Woo Young Jung(DGIST) We use an ensemble of state-of-the-art architectures [1,2,3,4], as follows:

[1] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).

[2] He, Kaiming, et al. "Identity mappings in deep residual networks." arXiv preprint arXiv:1603.05027 (2016).

[3] Szegedy, Christian, Sergey Ioffe, and Vincent Vanhoucke. "Inception-v4, inception-resnet and the impact of residual connections on learning." arXiv preprint arXiv:1602.07261 (2016).

[4] Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." arXiv preprint arXiv:1512.00567 (2015).



We train five deep neural networks: two 212-layer ResNets, a 224-layer ResNet, an Inception-v3, and an Inception-ResNet-v2. The models are linearly combined by a weighted sum of class probabilities, with the weights tuned on the validation set to obtain an appropriate contribution from each model.
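A weighted-sum combination of this kind can be sketched as follows (the helper names and example weights are illustrative; the actual weights were tuned on the validation set):

```python
import numpy as np

def weighted_ensemble(prob_list, weights):
    """Combine models by a weighted sum of class probabilities.

    prob_list: list of (N, K) arrays, one per model (rows sum to 1).
    weights:   one non-negative weight per model.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so each model's contribution is a fraction
    return sum(wi * p for wi, p in zip(w, prob_list))

def top5_correct(probs, label):
    """Check whether `label` is among the 5 highest-probability classes."""
    return label in np.argsort(probs)[::-1][:5]
```

Since the normalized weights sum to 1 and each model's rows sum to 1, the combined rows remain valid probability distributions.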



- This work was funded by the Ministry of Science, ICT and Future Planning.

DPAI Vision Object detection: Chris Li, Savion Zhao, Bin Liu, Yuhang He, Lu Yang, Cena Liu

Scene classification: Lu Yang, Yuhang He, Cena Liu, Bin Liu, Bo Yu

Scene parsing: Bin Liu, Lu Yang, Yuhang He, Cena Liu, Bo Yu, Chris Li, Xiongwei Xia

Object detection from video: Bin Liu, Cena Liu, Savion Zhao, Yuhang He, Chris Li

Object detection: Our method is based on Faster R-CNN and an extra classifier. (1) Data processing: data equalization by deleting many examples from three dominating classes (person, dog, and bird), and adding extra data for classes with fewer than 1000 training images; (2) COCO pre-training; (3) iterative bounding box regression + multi-scale (train/test) + random image flipping (train/test); (4) multi-model ensemble: ResNet-101 and Inception-v3; (5) an extra 200-class classifier, which helps improve recall and refines the detection scores of the final boxes.

[1] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[J]. arXiv preprint arXiv:1512.03385, 2015.

[2] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]//Advances in neural information processing systems. 2015: 91-99.



Scene classification: We trained the models with Caffe [1], using an ensemble of Inception-V3 [2] and Inception-V4 [3]. We integrated four models in total. The top-1 error on validation is 0.431 and the top-5 error is 0.129. The single model is modified from Inception-V3 [2]; its top-1 error on validation is 0.434 and its top-5 error is 0.133.

[1] Jia, Yangqing and Shelhamer, Evan and Donahue, Jeff and Karayev, Sergey and Long, Jonathan and Girshick, Ross and Guadarrama, Sergio and Darrell, Trevor. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093. 2014.

[2] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.

[3] C. Szegedy, S. Ioffe, V. Vanhoucke. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. arXiv preprint arXiv:1602.07261, 2016.



Scene parsing: We trained 3 models with a modified DeepLab [1] (Inception-v3, ResNet-101, ResNet-152), using only the ADEChallengeData2016 [2] data. Multi-scale, image cropping, image flipping, and contrast transformations are used for data augmentation, and DenseCRF is used as post-processing to refine object boundaries. On validation, combining the 3 models achieved 0.3966 mIoU and 0.7924 pixel accuracy.

[1] L. Chen, G. Papandreou, I. K.; Murphy, K.; and Yuille, A. L. 2016. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. In arXiv preprint arXiv:1606.00915.

[2] B. Zhou, H. Zhao, X. P. S. F. A. B., and Torralba, A. 2016. Semantic understanding of scenes through the ade20k dataset. In arXiv preprint arXiv:1608.05442.



Object detection from video: Our method is based on Faster R-CNN and an extra classifier. We train Faster R-CNN based on ResNet-101 with the provided training data. We also train an extra 30-class classifier, which helps improve recall and refines the detection scores of the final boxes.

DPFly Savion DP.co We fine-tune the detection models using the DET training set and the val1 set. The val2 set is used for validation.

Data balancing: since some categories have many more images than others, we process the initial data so that the number of images per category is nearly equal.

Use ResNet-101 + Faster R-CNN: the networks are pre-trained on the 1000-class ImageNet classification set and fine-tuned on the DET data.

Use box refinement: in Faster R-CNN, the final output is a regressed box that differs from its proposal box. So for inference, we pool a new feature from the regressed box and obtain a new classification score and a new regressed box. We combine these 300 new predictions with the original 300 predictions. Non-maximum suppression (NMS) is applied on the union set of predicted boxes using an IoU threshold of 0.3.
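The NMS step applied to the union of predictions can be sketched with a minimal NumPy implementation of greedy suppression (illustrative only, not the team's actual code):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression over [x1, y1, x2, y2] boxes:
    repeatedly keep the highest-scoring box and drop all remaining
    boxes that overlap it by more than iou_thresh."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of the top box with each remaining box.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]
    return keep

# Two heavily overlapping boxes plus one disjoint box.
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores, iou_thresh=0.3)
```

With the toy boxes above, the second box overlaps the first with IoU ≈ 0.68 and is suppressed, leaving the first and third.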

Use multi-scale testing: in our current implementation, we compute conv feature maps on an image pyramid, where the image's shorter sides are 300, 450, and 600.

Use multi-scale anchors: we add two anchor scales to the original anchor scales of Faster R-CNN.

Use test-time flipping: we flip each image and combine the results with those from the original image.
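Combining detections from the flipped image with the originals requires mapping the flipped boxes back to the unflipped coordinate frame before merging; a minimal sketch (the function name is illustrative, not from the team's code):

```python
import numpy as np

def flip_boxes_back(boxes, image_width):
    """Map [x1, y1, x2, y2] detections found on a horizontally
    flipped image back into the original image's coordinates:
    x becomes image_width - x, and x1/x2 swap roles."""
    out = boxes.copy().astype(float)
    out[:, 0] = image_width - boxes[:, 2]
    out[:, 2] = image_width - boxes[:, 0]
    return out

# A box at x in [10, 30] on the flipped 100px-wide image
# corresponds to x in [70, 90] on the original image.
mapped = flip_boxes_back(np.array([[10, 5, 30, 25]]), image_width=100)
```

After mapping, the two detection sets can be pooled and passed through the same NMS step used for single-image results.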

Everphoto Yitong Wang, Zhonggan Ding, Zhengping Wei, Linfu Wen



Everphoto Our method is based on DCNN approaches.



We use 5 models with different input scales and different network structures as basic models. They are derived from GoogleNet, VGGNet and ResNet.



We also utilize the idea of dark knowledge [1] to train several specialist models, and use these specialist models to reassign probability scores and refine the basic outputs.
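The dark-knowledge idea in [1] trains specialist models against temperature-softened teacher outputs; a minimal NumPy sketch of the distillation loss (the temperature value here is an assumption, not Everphoto's setting):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax with the usual max-shift for stability."""
    z = np.asarray(z, float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between the temperature-softened teacher
    distribution and the student distribution, as in Hinton et al. [1]."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))

# A student matching the teacher incurs a lower loss than a mismatched one.
t = np.array([2.0, 1.0, 0.0])
loss_same = distillation_loss(t, t, T=1.0)
loss_diff = distillation_loss(np.array([0.0, 1.0, 2.0]), t, T=1.0)
```

The specialists' softened outputs can then be used to reassign probability mass over their confusable subset of classes.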



Our final results are based on the ensemble of refined outputs.



[1] Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network[J]. arXiv preprint arXiv:1503.02531, 2015.

F205_CV Cheng Zhou

Li Jiancheng

Lin Zhihui

Lin Zhiguan

Yang Dali

All came from Tsinghua University, Graduate School at Shenzhen, Lab F205, China Our team has five student members from Lab F205, Graduate School at Shenzhen, Tsinghua University, China. We joined two sub-tasks of the ILSVRC2016 & COCO challenge: Scene Parsing and Object Detection from Video. This is the first time we have attended this competition.

Two of the members focused on Scene Parsing. They mainly applied several model-fusion algorithms to well-known and effective CNN models such as ResNet [1], FCN [2] and DilatedNet [3, 4], and used a CRF to capture more context and improve the classification accuracy and mean IoU. Since the image size is large, images are downsampled before being fed to the network. In addition, we used vertical mirroring for data augmentation. The Places2 scene classification 2016 pretrained model was used to fine-tune ResNet-101 and FCN, while DilatedNet was fine-tuned from the Places2 scene parsing 2016 pretrained model [5]. Late fusion and a CRF were applied at the end.

For object detection from video, the biggest challenge is that there are more than 2 million very high-resolution images in total. We did not consider Fast R-CNN [6]-style models, since they need much more training and testing time. Instead, we chose SSD [7], an effective and efficient framework for object detection. We used ResNet-101 as the base model, although it is slower than VGGNet [8]; at test time it achieves about 10 FPS on a single GTX TITAN X GPU. However, with more than 700 thousand images in the test set, testing still took a long time. For tracking, we have a dynamic adjustment algorithm, but it needs a ResNet-101 model for scoring each patch and runs at less than 1 FPS, so we could not apply it to the test set. For the submission, we used a simple method to filter noisy proposals and track the objects.



References:

[1] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[J]. arXiv preprint arXiv:1512.03385, 2015.

[2] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 3431-3440.

[3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with

deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv:1606.00915, 2016.

[4] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.

[5] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba. arXiv:1608.05442

[6] Girshick R. Fast r-cnn[C]//Proceedings of the IEEE International Conference on Computer Vision. 2015: 1440-1448.

[7] Liu W, Anguelov D, Erhan D, et al. SSD: Single Shot MultiBox Detector[J]. arXiv preprint arXiv:1512.02325, 2015.

[8] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.



Faceall-BUPT Xuankun HUANG, BUPT, CHINA

Jiangqi ZHANG, BUPT, CHINA

Zhiqun HE, BUPT, CHINA

Junfei ZHUANG, BUPT, CHINA

Zesang HUANG, BUPT, CHINA

Yongqiang Yao, BUPT, CHINA

Kun HU, BUPT, CHINA

Fengye XIONG, BUPT, CHINA

Hongliang BAI, Beijing Faceall co., LTD

Wenjian FENG, Beijing Faceall co., LTD

Yuan DONG, BUPT, CHINA # Classification/Localization

We trained ResNet-101, ResNet-152 and Inception-v3 models for object classification. Multi-view testing and model ensembling are used to generate the final classification results.

For the localization task, we trained a Region Proposal Network (RPN) to generate proposals for each image, and we fine-tuned two models with object-level annotations of the 1,000 classes, adding a background class to the network. At test time, the RPN extracts 300 regions per image, and these regions are classified by the fine-tuned models into one of the 1,001 classes. The final bounding box is generated by merging the bounding rectangles of three regions.



# Object detection

We utilize Faster R-CNN with the publicly available ResNet-101. Beyond the baseline, we adopt multi-scale RoI pooling to obtain features containing richer context information. For testing, we use 3 scales and merge the results using the simple strategy introduced last year.



No validation data is used for training, and flipped images are used in only a third of the training epochs.



# Object detection from video

We use Faster R-CNN with ResNet-101, as in the object detection task. One fifth of the images are tested with 2 scales. No tracking techniques are used because of some mishaps.



# Scene classification

We trained a single Inception-v3 network with multi-scale data and tested with multi-view inference over 150 crops.

On validation the top-5 error is about 14.56%.
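Multi-view testing averages predictions over many crops of each image. The exact 150-crop layout is not described, so here is a minimal sketch of the classic 10-crop building block (4 corners + center, each with its horizontal flip), which such schemes repeat over several scales:

```python
import numpy as np

def ten_crops(img, c):
    """Return the 4 corner crops and the center crop of size c x c
    from a 2-D image array, each paired with its horizontal flip."""
    h, w = img.shape[:2]
    ys = [0, 0, h - c, h - c, (h - c) // 2]
    xs = [0, w - c, 0, w - c, (w - c) // 2]
    crops = []
    for y, x in zip(ys, xs):
        patch = img[y:y + c, x:x + c]
        crops.append(patch)
        crops.append(patch[:, ::-1])  # horizontal flip
    return crops

# 10 crops of size 4x4 from a toy 8x8 image.
img = np.arange(64).reshape(8, 8)
crops = ten_crops(img, 4)
```

At test time the softmax outputs over all crops (and scales) are averaged to form the final prediction.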



# Scene parsing

We trained 6 models with network structures inspired by FCN-8s and DilatedNet at 3 scales (256, 384, 512), then tested with flipped images using the pre-trained FCN-8s and DilatedNet. The pixel-wise accuracy is 76.94% and the mean class-wise IoU is 0.3552.

fusionf Nina Narodytska (Samsung Research America)

Shiva Kasiviswanathan (Samsung Research America)

Hamid Maei (Samsung Research America) We used several modifications of modern CNNs, including VGG [1], GoogLeNet [2,4], and ResNet [3], together with several fusion strategies, including a standard averaging and scoring scheme. We also used different subsets of models in different submissions. Training was performed on the low-resolution dataset. We used balanced loading to take into account the different numbers of images in each class.



[1] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition.



[2] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions.



[3] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition.



[4] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning.





Future Vision Gautam Kumar Singh(independent)

Kunal Kumar Singh(independent)

Priyanka Singh(independent) Future Vision Project (based on Matcovnet )

========================================





This is an extremely simple CNN model which may not be competitive in the ILSVRC competition. Our main goal was to get a working CNN model that can later be enhanced to work efficiently at ILSVRC standards.



This project ran on following configurations :



processor : Intel core i3-4005U CPU @ 1.70GHz (4cpus)

RAM: 4GB



As we had no advanced hardware resources such as GPUs or high-speed CPUs, cuDNN could not be used either. So we could not train on this vast dataset and were forced to use this very simple, shallow model. We also discarded about 90% of the data; only 10% was used in this project, equally distributed between training and test data.



Places2 validation data : NOT USED

Places2 training data : 90% discarded



This 10% of the training data was further divided equally into two sets ('train' and 'test'), which were used for training and testing in this project.





The output text file on the Places2 test images could not be produced, as we faced some technical difficulties and ran out of time.





Reference: we referenced this project: http://www.cc.gatech.edu/~hays/compvision/proj6






Hikvision Qiaoyong Zhong*, Chao Li, Yingying Zhang(#), Haiming Sun*, Shicai Yang*, Di Xie, Shiliang Pu (* indicates equal contribution)



Hikvision Research Institute

(#)ShanghaiTech University, work is done at HRI [DET]

Our work on object detection is based on Faster R-CNN. We design and validate the following improvements:

* Better network. We find that the identity-mapping variant of ResNet-101 is superior for object detection over the original version.

* Better RPN proposals. A novel cascade RPN is proposed to refine proposals' scores and locations. A constrained negative/positive anchor ratio further increases proposal recall dramatically.

* Pretraining matters. We find that a pretrained global context branch increases mAP by over 3 points. Pretraining on the 1000-class LOC dataset further increases mAP by ~0.5 point.

* Training strategies. To attack the imbalance problem, we design a balanced sampling strategy over different classes. With balanced sampling, the provided negative training data can be safely added for training. Other training strategies, like multi-scale training and online hard example mining are also applied.

* Testing strategies. During inference, multi-scale testing, horizontal flipping and weighted box voting are applied.

The final mAP is 65.1 (single model) and 67 (ensemble of 6 models) on val2.
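The balanced sampling strategy over classes described above can be sketched as follows (an illustrative class-then-instance sampler, not Hikvision's implementation):

```python
import random
from collections import defaultdict

def balanced_sampler(samples, n, seed=0):
    """Draw n (item, label) pairs by first picking a class uniformly
    at random and then an item uniformly within that class, so that
    rare classes are not drowned out by frequent ones."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in samples:
        by_class[label].append(item)
    classes = sorted(by_class)
    picks = []
    for _ in range(n):
        cls = rng.choice(classes)
        picks.append((rng.choice(by_class[cls]), cls))
    return picks

# Class "a" has 100 images, class "b" only 2, yet both are sampled
# roughly equally often.
samples = [(f"img_a_{i}", "a") for i in range(100)] + [("img_b_0", "b"), ("img_b_1", "b")]
picks = balanced_sampler(samples, 1000, seed=1)
```

Under such a scheme, negative training images can be added as one more "class" without skewing the per-class sampling frequency.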



[CLS-LOC]

A combination of 3 Inception networks and 3 residual networks is used to make the class prediction. For localization, the same Faster R-CNN configuration described above for DET is applied. The top-5 classification error rate is 3.46%, and the localization error is 8.8% on the validation set.



[Scene]

For the scene classification task, drawing support from our newly built M40-equipped GPU clusters, we trained more than 20 models with various architectures, such as VGG, Inception, ResNet and variants of them, over the past two months. Fine-tuning very deep residual networks (e.g., ResNet-101/152/200) from pre-trained ImageNet models did not perform as well as we expected; Inception-style networks achieved better performance in considerably less training time in our experiments. Based on this observation, we used deep Inception-style networks and moderately deep residual networks. We also made several improvements for training and testing. First, a new data augmentation technique is proposed to better utilize the information in the original images. Second, a new learning-rate schedule is adopted. Third, label shuffling and label smoothing are used to tackle the class imbalance problem. Fourth, some small tricks are used to improve performance in the test phase. Finally, we achieved a very good top-5 error rate, below 9% on the validation set.
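Label smoothing, mentioned as the third training improvement, replaces one-hot targets with slightly softened ones; a minimal sketch (the smoothing factor eps=0.1 is the common default, assumed here, not Hikvision's reported value):

```python
import numpy as np

def smooth_labels(label, num_classes, eps=0.1):
    """Return a smoothed target: (1 - eps) on the true class plus
    eps spread uniformly over all num_classes classes."""
    target = np.full(num_classes, eps / num_classes)
    target[label] += 1.0 - eps
    return target

# True class 2 out of 4: target is [0.025, 0.025, 0.925, 0.025].
t = smooth_labels(2, 4, eps=0.1)
```

Training with cross-entropy against these softened targets discourages overconfident logits, which is particularly helpful when class labels are noisy or imbalanced.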



[Scene Parsing]

We utilize a fully convolutional network transferred from the VGG-16 net, with a module called the mixed context network and a refinement module appended to the end of the net. The mixed context network is constructed from a stack of dilated convolutions and skip connections. The refinement module generates predictions by making use of the output of the mixed context network and feature maps from early layers of the FCN. The predictions are then fed into a sub-network designed to simulate a message-passing process. Compared with the baseline, our first major improvement is the mixed context network, which we find provides better features for dealing with stuff, big objects and small objects all at once. The second improvement is a memory-efficient sub-network that simulates the message-passing process. The proposed system can be trained end-to-end. On the validation set, the mean IoU of our system is 0.4099 (single model) and 0.4156 (ensemble of 3 models), and the pixel accuracy is 79.80% (single model) and 80.01% (ensemble of 3 models).



References

[1] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015.

[2] Shrivastava, Abhinav, Abhinav Gupta, and Ross Girshick. "Training region-based object detectors with online hard example mining." arXiv preprint arXiv:1604.03540 (2016).

[3] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).

[4] He, Kaiming, et al. "Identity mappings in deep residual networks." arXiv preprint arXiv:1603.05027 (2016).

[5] Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015).

[6] Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." arXiv preprint arXiv:1512.00567 (2015).

[7] Szegedy, Christian, Sergey Ioffe, and Vincent Vanhoucke. "Inception-v4, inception-resnet and the impact of residual connections on learning." arXiv preprint arXiv:1602.07261 (2016).

[8] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," in ICLR, 2016.

[9] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in CVPR, 2015.

[10] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr, "Conditional random fields as recurrent neural networks," in ICCV, 2015.

[11] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. Yuille, "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs", arXiv:1606.00915, 2016.

[12] P. O. Pinheiro, T. Lin, R. Collobert, P. Dollar, "Learning to Refine Object Segments", arXiv:1603.08695, 2016.



Hitsz_BCC Qili Deng,Yifan Gu,Mengdie Chu,Shuai Wu,Yong Xu

Harbin Institute of Technology, Shenzhen We combined a residual learning framework with the Single Shot MultiBox Detector for object detection. When using ResNet-152, we froze all batch-normalization layers as well as conv1 and conv2_x in ResNet. Inspired by HyperNet, we exploit multi-layer features to detect objects.

Reference:

[1] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[J]. arXiv preprint arXiv:1512.03385, 2015.

[2] Kong T, Yao A, Chen Y, et al. HyperNet: Towards Accurate Region Proposal Generation and Joint Object Detection[J]. arXiv preprint arXiv:1604.00600, 2016.

[3] Liu W, Anguelov D, Erhan D, et al. SSD: Single Shot MultiBox Detector[J]. arXiv preprint arXiv:1512.02325, 2015.

hustvision Xinggang Wang, Huazhong University of Science and Technology

Kaibin Chen, Huazhong University of Science and Technology We propose a very fast and accurate object detection method based on deep neural networks. The core of the method is an object detection loss layer named ConvBox, which directly regresses object bounding boxes. The ConvBox loss layer can be plugged into any deep neural network. In this competition, we chose GoogLeNet as the base network. Running on an Nvidia GTX 1080, training on ILSVRC 2016 takes about one day; testing speed is about 60 FPS. In the training images of this competition, many positive instances are not labelled. To deal with this problem, the proposed ConvBox loss tolerates hard negatives, which improves detection performance to some extent.

iMCB *Yucheng Xing,

*Yufeng Zhang,

Zhiqin Chen,

Weichen Xue,

Haohua Zhao,

Liqing Zhang

@Shanghai Jiao Tong University (SJTU)



(* indicates equal contribution) In this competition, we submit five entries.



The first is a single model, which achieved 15.24% top-5 error on the validation set. It is an Inception-V3 [1] model, modified and trained on both the challenge and standard datasets [2]. At test time, images are resized to 337x337 and a 12-crop scheme is used to obtain the 299x299 inputs to the model, which improves performance.



The second is a feature-fusion model (FeatureFusion_2L), which achieved 13.74% top-5 error on the validation set. It is a two-layer feature-fusion network whose input is the combination of fully-connected-layer features extracted from several well-performing CNNs (i.e., pretrained models [3] such as ResNet, VGG, and GoogLeNet). It proves effective in reducing the error rate.



The third is also a feature-fusion network (FeatureFusion_3L), which achieved 13.95% top-5 error on the validation set. Compared with the second model, it is a three-layer feature-fusion network containing two fully-connected layers.



The fourth is a combination of CNN models weighted by validation accuracy, which achieved 13% top-5 error on the validation set. It combines the probabilities produced by the softmax layers of three CNNs, where the weighting factor of each CNN is determined by its validation accuracy.



The fifth is a combination of CNN models based on tuned weighting factors, which achieved 12.65% top-5 error on the validation set. Six CNNs are considered: four of them (Inception-V2, Inception-V3, FeatureFusion_2L and FeatureFusion_3L) trained by us and the other two pretrained. The weighting factors of these models are optimized through extensive experiments.
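The accuracy-weighted combinations in the fourth and fifth entries can be sketched as a weighted average of per-model softmax outputs (a minimal NumPy version with illustrative weights, not the team's tuned factors):

```python
import numpy as np

def weighted_ensemble(prob_list, weights):
    """Combine per-model class-probability vectors into one
    prediction, weighting each model by its (normalized) factor,
    e.g. its validation accuracy."""
    w = np.asarray(weights, float)
    w = w / w.sum()
    probs = np.stack([np.asarray(p, float) for p in prob_list])
    return np.tensordot(w, probs, axes=1)

# Two toy models with validation accuracies 0.75 and 0.25.
combined = weighted_ensemble([[1.0, 0.0], [0.0, 1.0]], [0.75, 0.25])
```

Because the weights are normalized, the combined vector remains a valid probability distribution.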





[1] Szegedy, Christian, et al. "Rethinking the Inception Architecture for Computer Vision." arXiv preprint arXiv:1512.00567 (2015).



[2]B. Zhou, A. Khosla, A. Lapedriza, A. Torralba and A. Oliva. "Places: An Image Database for Deep Scene Understanding." Arxiv, 2016.



[3] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba and A. Oliva. "Learning Deep Features for Scene Recognition using Places Database."Advances in Neural Information Processing Systems 27 (NIPS), 2014.

isia_ICT Xinhan Song, Institute of Computing Technology

Chengpeng Chen, Institute of Computing Technology

Shuqiang Jiang, Institute of Computing Technology For convenience, we use the 4 provided models as our basic models for the subsequent fine-tuning and network adaptation. Besides, considering the non-uniform distribution and the tremendous number of images in the Challenge Dataset, we only use the Standard Dataset for all the following steps.

First, we fuse these models with an averaging strategy as the baseline. Then, we add an SPP layer to VGG16 and ResNet-152 respectively, so that the models can be fed images at larger scales. After fine-tuning the models, we again fuse them with the averaging strategy, and we only submit the result for size 288.

We also perform spectral clustering on the confusion matrix computed on the validation data to obtain 20 clusters, meaning that the 365 classes are separated into 20 clusters mainly based on how often they are confused with each other. To classify the classes within the same cluster more precisely, we train an extra classifier for each cluster, implemented by fine-tuning the network with all layers fixed except the fc8 layer and finally combining them into one network.
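Spectral clustering of a confusion matrix can be sketched in pure NumPy. For brevity this illustrative version produces only 2 clusters by thresholding the Fiedler vector; the team's 20 clusters would instead require running k-means on several Laplacian eigenvectors:

```python
import numpy as np

def spectral_bipartition(confusion):
    """Split classes into 2 groups from a confusion matrix:
    symmetrize it into an affinity matrix, build the graph
    Laplacian, and cut on the sign of the Fiedler vector
    (the eigenvector of the second-smallest eigenvalue)."""
    A = (confusion + confusion.T) / 2.0
    np.fill_diagonal(A, 0.0)
    D = np.diag(A.sum(axis=1))
    L = D - A
    vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    return (fiedler > 0).astype(int)

# Toy confusion matrix: classes {0, 1} confuse each other heavily,
# classes {2, 3} likewise, with only weak cross-confusion.
C = np.array([[0, 5, 0.2, 0],
              [5, 0, 0, 0.2],
              [0.2, 0, 0, 5],
              [0, 0.2, 5, 0]], dtype=float)
labels = spectral_bipartition(C)
```

Classes that are frequently mistaken for one another end up in the same cluster, which is exactly the grouping the per-cluster specialist classifiers exploit.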



D. Yoo, S. Park, J. Lee and I. Kweon. “Multi-scale pyramid pooling for deep convolutional representation”. In CVPR Workshop 2015

ITLab-Inha Byungjae Lee, Inha University,

Songguo Jin, Inha University,

Enkhbayar Erdenee, Inha University,

Mi Young Nam, NaeulTech,

Young Giu Jung, NaeulTech,

Phill Kyu Rhee, Inha University. We propose a robust multi-class multi-object tracking (MCMOT) method formulated in a Bayesian framework [1]. Multi-object tracking for unlimited object classes is conducted by combining detection responses with a changing point detection (CPD) algorithm. The CPD model is used to observe abrupt or abnormal changes due to drift and occlusion, based on the spatiotemporal characteristics of track states.



The ensemble of object detectors is based on Faster R-CNN [2], using VGG16 [3] and ResNet [4] adaptively. For parameter optimization, a POMDP-based parameter learning approach is adopted, as described in our previous work [5].



[1] “Multi-Class Multi-Object Tracking using Changing Point Detection”, Byungjae Lee, Enkhbayar Erdenee, Songguo Jin, Phill Kyu Rhee. arXiv 2016.

[2] “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. TPAMI 2016.

[3] “Very Deep Convolutional Networks for Large-Scale Image Recognition”, Karen Simonyan, Andrew Zisserman. arXiv 2015.

[4] “Deep Residual Learning for Image Recognition”, Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. CVPR 2016.

[5] “Adaptive Visual Tracking using the Prioritized Q-learning Algorithm: MDP-based Parameter Learning Approach”, Sarang Khim, Sungjin Hong, Yoonyoung Kim, Phill Kyu Rhee. Image and Vision Computing 2014.

KAIST-SLSP Sunghun Kang*(KAIST)

Jae Hyun Lim*(ETRI)

Houjeung Han(KAIST)

Donghoon Lee(KAIST)

Junyeong Kim(KAIST)

Chang D. Yoo(KAIST)

(* indicates equal contribution) For both image and video object detection, the Faster R-CNN detection algorithm proposed by Shaoqing Ren et al. is integrated in Torch with various other state-of-the-art key techniques described below. For both image and video, post-processing techniques such as box refinement and classification rescoring via a global context feature are applied; classification rescoring is conducted per prediction by combining the global context feature with the feature outputs. To further enhance detection performance for video, classification probabilities within tracklets obtained by multiple-object tracking are re-scored by combining feature responses weighted over various combinations of tracklet lengths. Our architecture is an ensemble of several independently trained networks. The Faster R-CNN based on a deep residual net is trained end-to-end, and for inference, model ensembling and box refinement are integrated into the two Faster R-CNN architectures.



For both image and video object detection, the following three key components (1-3) that include three post-processing techniques (pp1-3) are integrated in torch for end-to-end learning and inferencing:

(1) Deep residual net[1]

(2) Faster-R-CNN[2] with end2end training

(3) post-processing

(pp1) box refinement[3]

(pp2) model ensemble

(pp3) classification re-scoring via SVM using global context features





For only video object detection, the following post-processing techniques (pp4-5) are additionally included in conjunction with the above three post-processing techniques:

(3) post-processing

(pp4) multiple object tracking[4]

(pp5) tracklets re-scoring



[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep Residual Learning for Image Recognition", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016



[2] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, "Faster {R-CNN}: Towards Real-Time Object Detection with Region Proposal Networks", Advances in Neural Information Processing Systems (NIPS), 2015





[3] Spyros Gidaris and Nikos Komodakis, "Object detection via a multi-region & semantic segmentation-aware CNN model", International Conference on Computer Vision (ICCV), 2015



[4] Hamed Pirsiavash, Deva Ramanan, and Charless C.Fowlkes, “Globally-optimal greedy algorithms for tracking a variable number of objects,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011

KAISTNIA_ETRI Keun Dong Lee(ETRI)

Seungjae Lee(ETRI)

Yunhun Jang(KAIST)

Hankook Lee(KAIST)

Hyung Kwan Son(ETRI)

Jinwoo Shin(KAIST)







For the localization task, we use a variant of Faster R-CNN with ResNet, where the overall training procedure is similar to that in [1]. For the classification task, we used an ensemble of ResNet and GoogLeNet [2] with various data augmentations. We then recursively obtained attention regions in the input images to adjust the localization outputs, which were further tuned by class-dependent regression models.



[1] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.





[2] Szegedy, C., Ioffe, S., Vanhoucke, V.: Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv:1602.07261, 2016.

KPST_VB Nguyen Hong Hanh

Seungjae Lee

Junhyeok Lee In this work, we used a ResNet-200 pre-trained on ImageNet [1] and retrained the network on the Places365 Challenge data (256 by 256). We also estimated scene probability using the output of the pretrained ResNet-200 and the scene vs. object (ImageNet 1000-class) distribution on the training data. For classification, we used an ensemble of the two networks with multiple crops, adjusted by the scene probability.



[1] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.



*Our work was performed with a deep learning analysis tool (Deep SDK by KPST).

Lean-T Yuechao Gao, Nianhong Liu, Sen Li @ Tsinghua University For the object detection task, our detector is based on Faster R-CNN [1]. We used a pre-trained VGG16 [2] to initialize the net and Caffe [3] to train our model; only 230K iterations were conducted. The images from the DET dataset that serve as negative training data were not used.

[1]Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (pp. 91-99).

[2]K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.

[3]Jia, Yangqing, et al. "Caffe: Convolutional architecture for fast feature embedding." Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014.

LZDTX Liu Yinan (Independent Individual);

Zhou Yusong (Beijing University of Posts and Telecommunications);

Deng Guanghui (BeijingUniversity of Technology);

Tuqiang (Beijing Insititute of Technology);

Xing Zongheng (University of Science & Technology Beijing);

This year, we focus on the Object Detection task because it is widely used in our projects and in areas such as self-driving, robotics and image analysis. Researchers have long sought a real-time object detection algorithm with relatively high accuracy. However, most proposed algorithms are proposal-based, reducing detection to classification by classifying proposals selected from the image. Sliding windows are a widely used approach but produce too many proposals. In recent years, some methods combine traditional proposal selection with deep learning, such as R-CNN; others accelerate feature extraction, such as Fast R-CNN and Faster R-CNN, but these are still too slow for most real-time applications. Recently, proposal-free detection methods such as YOLO and SSD have been proposed, which are much faster than proposal-based methods. The drawback of YOLO and SSD is that they perform poorly on small objects, because both directly map a box from the image to the target object. To overcome this drawback, we add a deconvolutional structure to the SSD network. The basic idea of our network structure is to enlarge the output feature maps: we believe larger feature maps provide more detailed prediction information and better cover small target objects. We add 3 deconvolutional layers to the basic SSD network, and these layers output predictions just like the other SSD extra layers. Experimental results show that our Deconv-SSD network improves on the baseline 300x300 SSD model on the validation sets of ILSVRC 2016 and PASCAL VOC. We submit one model with input size 300x300, trained on an Nvidia Titan GPU with a batch size of 32.
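The feature-map enlargement from a deconvolutional (transposed convolution) layer follows a simple size formula, the inverse of the convolution case; a minimal sketch (the parameter values below are illustrative, not the team's layer settings):

```python
def deconv_output_size(n_in, kernel, stride, pad):
    """Spatial output size of a transposed convolution:
    the inverse of conv's floor((n + 2p - k) / s) + 1."""
    return (n_in - 1) * stride - 2 * pad + kernel

# A stride-2 deconv roughly doubles the feature-map side length,
# which is how added layers can recover detail for small objects.
doubled = deconv_output_size(10, kernel=4, stride=2, pad=1)
```

Stacking three such layers enlarges an SSD feature map by roughly a factor of eight, giving the prediction layers finer spatial resolution.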



[1] Uijlings J R R, Sande K E A V D, Gevers T, et al. Selective Search for Object Recognition[J]. International Journal of Computer Vision, 2013, 104(2):154-171.

[2] Russakovsky O, Deng J, Su H, et al. ImageNet Large Scale Visual Recognition Challenge[J]. International Journal of Computer Vision, 2015, 115(3):211-252.

[3] Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition[J]. Computer Science, 2015.

[4] Sermanet P, Eigen D, Zhang X, et al. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks[J]. Eprint Arxiv, 2013.

[5] Girshick R. Fast R-CNN[J]. Computer Science, 2015.

[6] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2015:1-1.

[7] Redmon J, Divvala S, Girshick R, et al. You Only Look Once: Unified, Real-Time Object Detection[J]. Computer Science, 2015.

[8] Liu W, Anguelov D, Erhan D, et al. SSD: Single Shot MultiBox Detector[J]. Computer Science, 2015.



MCC Lei You, Harbin Institute of Technology Shenzhen Graduate School

Yang Zhang, Harbin Institute of Technology Shenzhen Graduate School

Lingzhi Fu, Harbin Institute of Technology Shenzhen Graduate School

Tianyu Wang, Harbin Institute of Technology Shenzhen Graduate School

Huamen He, Harbin Institute of Technology Shenzhen Graduate School

Yuan Wang, Harbin Institute of Technology Shenzhen Graduate School We combined and modified ResNet and Faster R-CNN for image classification, then constructed several detection models for target localization according to the classification results, and finally integrated the two steps to obtain the final results.



Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.

Neural Information Processing Systems (NIPS), 2015

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, 2016.

Russell Stewart, Mykhaylo Andriluka. End-to-end people detection in crowded scenes. CVPR, 2016.





MCG-ICT-CAS Sheng Tang (Corresponding email: ts@ict.ac.cn),

Bin Wang,

JunBin Xiao,

Yu Li,

YongDong Zhang,

JinTao Li



Multimedia Computing Group,Institute of Computing Technology,Chinese Academy of Sciences (MCG-ICT-CAS), Beijing, China



Technique Details for the Object Detection from Video (VID) Task:



For this year’s VID task, our primary contribution is that we propose a novel tracking framework based on two complementary kinds of tubelet generation methods which focus on precision and recall respectively, followed by a novel tubelet merging method. Under this framework, our main contributions are two-fold:

(1) Tubelet generation based on detection and tracking: We propose to sequentialize the detection bounding boxes of the same object with different tracking methods to form two complementary kinds of tubelets. One uses the detection bounding boxes to refine optical-flow-based tracking for precise tubelet generation. The other integrates the detection bounding boxes with MDNet-based multi-target tracking to recall missing tubelets.

(2) Overlapping and successive tubelet fusion: Based on the above two complementary tubelet generation methods, we propose a novel effective union method to merge two overlapping tubelets, and a concatenation method to merge two successive tubelets, which improves the final AP by a substantial margin.
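The union and concatenation merges above can be sketched roughly as follows; the IoU threshold, the box-averaging rule, and the frame-to-box data layout are illustrative assumptions, not the submission's exact procedure.

```python
# Hedged sketch of merging two tubelets (frame -> [x1, y1, x2, y2] boxes).
# The averaging and concatenation rules here are illustrative assumptions.

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def merge_tubelets(t1, t2, iou_thresh=0.5):
    """Union-merge two temporally overlapping tubelets if they agree
    spatially, otherwise concatenate them if they are successive."""
    shared = sorted(set(t1) & set(t2))
    if shared:
        # Overlapping tubelets: require spatial agreement, then average boxes.
        if min(iou(t1[f], t2[f]) for f in shared) < iou_thresh:
            return None  # not the same object track
        merged = dict(t1)
        merged.update(t2)
        for f in shared:
            merged[f] = [(p + q) / 2.0 for p, q in zip(t1[f], t2[f])]
        return merged
    if max(t1) < min(t2) or max(t2) < min(t1):
        # Successive tubelets: simple temporal concatenation.
        merged = dict(t1)
        merged.update(t2)
        return merged
    return None
```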



Also three other tricks are used as below:

(1) Non-co-occurrence filtering: Based on the co-occurrence relationships mined from the training dataset, we filter out false detections that have lower detection scores and whose categories never co-occur with the objects of highest detection scores.

(2) Coherent reclassification: After generating the object tubelets based on detection results and optical flow, we propose a coherent reclassification method to get coherent categories throughout a tubelet.

(3) Efficient multi-target tracking with MDNet: we first choose an anchor frame and exploit adjacent-frame information to determine reliable anchor targets for efficient tracking. Then, we track each anchor target with an MDNet tracker in parallel. Finally, we use still-image detection results to recall missing tubelets.
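As a rough illustration of trick (1), non-co-occurrence filtering might look like the sketch below; the score threshold and the `cooccur` table layout are assumptions for illustration only.

```python
# Hedged sketch of non-co-occurrence filtering: drop low-scoring detections
# whose class never co-occurs (in the training set) with the video's
# highest-scoring class. Threshold and data layout are assumptions.

def filter_by_cooccurrence(detections, cooccur, low_score=0.3):
    """detections: list of (class_name, score); cooccur: dict mapping a class
    to the set of classes seen together with it in training videos."""
    if not detections:
        return []
    anchor = max(detections, key=lambda d: d[1])[0]  # most confident class
    allowed = cooccur.get(anchor, set()) | {anchor}
    return [d for d in detections
            if d[1] >= low_score or d[0] in allowed]
```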



In our implementation, we use Faster R-CNN [1] with ResNet [2] for still-image detection, optical flow [3] and MDNet [4] for tracking.



References:

[1] Ren S, He K, Girshick R, Sun J. “Faster R-CNN: Towards real-time object detection with region proposal networks”, NIPS 2015: 91-99.

[2] He K, Zhang X, Ren S, Sun J. “Deep residual learning for image recognition”, CVPR 2016.

[3] Kang K, Ouyang W, Li H, Wang X. “Object Detection from Video Tubelets with Convolutional Neural Networks”, CVPR 2016.

[4] Nam H, Han B. “Learning multi-domain convolutional neural networks for visual tracking”, CVPR 2016.



MIL_UT Kuniaki Saito

Shohei Yamamoto

Masataka Yamaguchi

Yoshitaka Ushiku

Tatsuya Harada



All members are from the University of Tokyo. We used Faster R-CNN [1] as the basic detection system.



We implemented Faster R-CNN based on ResNet-152 and ResNet-101 [2], starting from ResNet models pretrained on the 1000-class classification task.



We placed the Region Proposal Network after conv4 in both models and froze the weights before conv3 during training. We trained these models end-to-end with Online Hard Example Mining [3], choosing the 64 proposals with the largest loss out of 128 for the loss calculation.
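The hard-example selection described above (keeping the 64 highest-loss proposals out of 128) reduces to a simple top-k by loss; this sketch uses placeholder loss values rather than a real forward pass.

```python
# Minimal sketch of OHEM-style selection: from 128 candidate proposals,
# keep the 64 with the largest loss for backpropagation. Real OHEM first
# runs a forward pass to obtain the per-proposal losses.

def select_hard_examples(losses, num_hard=64):
    """Return indices of the proposals with the largest loss."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(order[:num_hard])
```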



Our submission is an ensemble of the Faster R-CNN models on ResNet-152 and ResNet-101. To ensemble them, we shared region proposals between the two models, merging the proposals and the separately computed scores.



Our result scored 54.3 mAP on the validation dataset.





[1] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." NIPS 2015.

[2] He, Kaiming, et al. "Deep residual learning for image recognition." CVPR 2016.

[3] Shrivastava, Abhinav, Abhinav Gupta, and Ross Girshick. "Training region-based object detectors with online hard example mining." CVPR 2016.



MIPAL_SNU Sungheon Park and Nojun Kwak (Graduate School of Convergence Science and Technology, Seoul National University) We trained two ResNet-50 [1] networks. One network used 7x7 mean pooling, and the other used multiple mean poolings of various sizes and positions. We also used a balanced sampling strategy similar to [2] to deal with the imbalanced training set.
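A balanced sampling strategy in the spirit of [2] can be sketched as below: draw the class uniformly first, then an image from that class, so rare classes are seen as often as frequent ones. The exact scheme used by the team is not specified.

```python
# Illustrative class-balanced sampling (an assumption, not the team's
# exact method): sample a class uniformly, then an image from that class.
import random

def balanced_sample(images_by_class, n, rng=None):
    """images_by_class: dict class -> list of image ids; returns n pairs."""
    rng = rng or random.Random(0)
    classes = sorted(images_by_class)
    batch = []
    for _ in range(n):
        c = rng.choice(classes)  # uniform over classes, not over images
        batch.append((c, rng.choice(images_by_class[c])))
    return batch
```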



[1] He, Kaiming, et al. "Deep residual learning for image recognition." CVPR, 2016.



[2] Shen, Li, et al. "Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks." arXiv, 2015.

mmap-o Qi Zheng, Wuhan University

Cheng Tong, Wuhan University

Xiang Li, Wuhan University We use fully convolutional networks [1] with the VGG 16-layer net to parse the scene images. The model is adopted with an 8-pixel-stride net.



Initial results contain some labels irrelevant to the scene. High-confidence labels are exploited to group the images into different scenes so that irrelevant labels can be removed. Here we use a data-driven classification strategy to refine the results.



[1] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]// IEEE Conference on Computer Vision and Pattern Recognition. 2015:1337-1342.



Multiscale-FCN-CRFRNN Shuai Zheng, Oxford

Anurag Arnab, Oxford

Philip Torr, Oxford This submission is trained based on Conditional Random Fields as Recurrent Neural Networks (described in Zheng et al., ICCV 2015), with a multi-scale training pipeline. Our base model is built on ResNet101, which is pre-trained only on ImageNet. The model is then built within a Fully Convolutional Network (FCN) structure and fine-tuned only on the MIT Scene Parsing dataset, using a multi-scale training pipeline similar to Farabet et al. 2013. In the end, this FCN-ResNet101 model is combined with CRF-RNN and trained in an end-to-end pipeline.

MW Gang Sun (Institute of Software, Chinese Academy of Sciences)

Jie Hu (Peking University) We leverage the theory called CNA [1] (capacity and necessity analysis) to guide the design of CNNs. We add more layers on the larger feature maps (e.g., 56x56) to increase capacity, and remove some layers on the smaller feature maps (e.g., 14x14) to avoid ineffective architectures. We have verified the effectiveness on the models in [2], ResNet-like models [3], and Inception-ResNet-like models [4]. In addition, we apply cropped patches from the original images as training samples, selecting random areas and aspect ratios. To improve generalization, we prune the model weights periodically. Moreover, we utilize a balanced sampling strategy [2] and label smoothing regularization [5] during training to alleviate the bias from the non-uniform sample distribution among categories and from partially incorrect training labels. We use only the provided data (Places365) for training, without any additional data, and train all models from scratch. The algorithm and architecture details will be described in our arXiv paper (available online shortly).



[1] Xudong Cao. A practical theory for designing very deep convolutional neural networks, 2014. (unpublished)

[2] Li Shen, Zhouchen Lin, Qingming Huang. Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks. In ECCV 2016.

[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. In CVPR 2016.

[4] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. ArXiv:1602.07261,2016.

[5] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna. Rethinking the Inception Architecture for Computer Vision. ArXiv:1512.00567,2016.



NEU_SMILELAB YUE WU, YU KONG, JUN LI, LANCE BARCELONA, RAMI SALEH, SHANGQIAN GAO, RYAN BIRKE, HONGFU LIU, JOSEPH ROBINSON, TALEB ALASHKAR, YUN FU



Northeastern University, MA, USA

We focus on the object classification problem. The 1000 classes are split into two parts based on an analysis of the WordNet structure and a visualization of features from a ResNet-200 model [1]. The first part has 417 classes and is annotated as "LIVING THINGS". The second part has the remaining 583 classes and is annotated as "ARTIFACTS and OTHERS". Two ResNet-200 models are trained for the two parts separately. The model for "LIVING THINGS" has a top-5 error of 3.174% on the validation set, and the model for "ARTIFACTS and OTHERS" has a top-5 error of 7.874%, both with only center-crop testing. However, we could not find a proper way to combine these two models to get a good result for the full 1000 classes. Our combination of the two models [2] gets a top-5 error of 7.62% for 1000 classes with 144-crop testing. We also train several ResNet models with different numbers of layers. Our submission is based on an ensemble of these models. Our best result achieves a top-5 error of 3.92% on the validation set. For localization, we simply take the center of the image as the object bounding box.
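One hedged way to read an independent-softmax combination in the spirit of [2]: scale each expert's softmax by a gating probability for its class subset and concatenate. The gate `p_living` is an assumption introduced for illustration; the team's exact combination rule may differ.

```python
# Hedged sketch of combining two disjoint-class experts into one 1000-way
# distribution. `p_living` is an assumed gating probability for the
# "LIVING THINGS" subset; each expert's softmax already sums to 1.

def combine_split_models(p_living, probs_living, probs_other):
    """Return a single distribution over all classes (living first)."""
    combined = [p_living * p for p in probs_living]
    combined += [(1.0 - p_living) * p for p in probs_other]
    return combined
```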



[1] Identity Mappings in Deep Residual Networks, ECCV, 2016

[2] Deep Convolutional Neural Network with Independent Softmax for Large Scale Face Recognition, ACM Multimedia (MM), 2016



NQSCENE Chen Yunpeng ( NUS )

Jin Xiaojie ( NUS )

Zhang Rui ( CAS )

Li Yu ( CAS )

Yan Shuicheng ( Qihoo/NUS ) Technique Details for the Scene Classification:



For the scene classification task, we propose the following methods to address the data imbalance issue (a.k.a. the long-tail distribution issue), which boost the final performance:



1) Category-wise Data Augmentation:

We applied a category-wise data augmentation strategy that associates each category with an adaptive augmentation level, which is updated iteratively during training.



2) Multi-task Learning:

We proposed a multipath learning architecture to jointly learn feature representations from the ImageNet-1000 and Places-365 datasets.



Vanilla ResNet-200 [1] is adopted with the following elementary tricks: scale and aspect-ratio augmentation, over-sampling, and multi-scale (224, 256, 288, 320) dense testing. In total, we trained four models and fused them by averaging their scores. Training each model takes about three days using MXNet [2] on a cluster of forty NVIDIA M40 GPUs (12 GB).
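The multi-scale dense-testing fusion above can be sketched as a per-class average of scores across input scales; the `predict` callable is a stand-in for the trained network.

```python
# Simple sketch of multi-scale score fusion: run the model at several
# input scales and average the per-class scores elementwise.

def multiscale_predict(predict, image, scales=(224, 256, 288, 320)):
    """predict(image, scale) -> list of per-class scores."""
    score_sets = [predict(image, s) for s in scales]
    n = float(len(score_sets))
    return [sum(scores) / n for scores in zip(*score_sets)]
```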



------------------------------

[1] He, Kaiming, et al. "Identity mappings in deep residual networks." arXiv preprint arXiv:1603.05027 (2016).

[2] Chen, Tianqi, et al. "Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems." arXiv preprint arXiv:1512.01274(2015).

NTU-SC Jason Kuen, Xingxing Wang, Bing Shuai, Xiangfei Kong, Jianxiong Yin, Gang Wang*, Alex C Kot





Rapid-Rich Object Search Lab, Nanyang Technological University, Singapore. All of our scene classification models are built upon pre-activation ResNets [1]. For scene classification using the provided RGB images, we train from scratch a ResNet-200, as well as a relatively shallow Wide-ResNet [2]. In addition to RGB images, we make use of class activation maps [3] and (scene) semantic segmentation masks [4] as complementary cues, obtained from models pre-trained for ILSVRC image classification [5] and scene parsing [6] tasks respectively. Our final submissions consist of ensembles of multiple models.



References

[1] He, K., Zhang, X., Ren, S., & Sun, J. “Identity Mappings in Deep Residual Networks”. ECCV 2016.

[2] Zagoruyko, S., & Komodakis, N. “Wide Residual Networks”. BMVC 2016.

[3] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. “Learning Deep Features for Discriminative Localization”. CVPR 2016.

[4] Shuai, B., Zuo, Z., Wang, G., & Wang, B. "Dag-Recurrent Neural Networks for Scene Labeling". CVPR 2016.

[5] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Berg, A. C. “Imagenet large scale visual recognition challenge”. International Journal of Computer Vision, 115(3), 211-252.

[6] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. “Semantic Understanding of Scenes through the ADE20K Dataset”. arXiv preprint arXiv:1608.05442.

NTU-SP Bing Shuai (Nanyang Technological University)

Xiangfei Kong (Nanyang Technological University)

Jason Kuen (Nanyang Technological University)

Xingxing Wang (Nanyang Technological University)

Jianxiong Yin (Nanyang Technological University)

Gang Wang* (Nanyang Technological University)

Alex Kot (Nanyang Technological University) We train our improved fully convolutional network (IFCN) for the scene parsing task. More specifically, we use a Convolutional Neural Network pre-trained on the ILSVRC CLS-LOC task as the encoder, and then add a multi-branch deep convolutional network to perform multi-scale context aggregation. Finally, a simple deconvolution network (without unpooling layers) is used as the decoder to generate the high-resolution label prediction maps. IFCN subsumes the above three network components. The network is trained with the class-weighted loss proposed in [Shuai et al, 2016].



[Shuai et al, 2016] Bing Shuai, Zhen Zuo, Bing Wang, Gang Wang. DAG-Recurrent Neural Network for Scene Labeling

NUIST Jing Yang, Hui Shuai, Zhengbo Yu, Rongrong Fan, Qiang Ma, Qingshan Liu, Jiankang Deng 1. Inception v2 [1] is used in the VID task, which runs almost in real time on a GPU.

2. Cascaded region regression is used to detect and track different instances.

3. Context inference is performed between instances within each video.

4. The detector and tracker are updated online to improve recall.



[1]Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

[2]Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015.

[3]Dai, Jifeng, et al. "R-FCN: Object Detection via Region-based Fully Convolutional Networks." arXiv preprint arXiv:1605.06409 (2016).

NuistParsing Feng Wang:B-DAT Lab, Nanjing University of Information Science and Technology, China

Zhi Li:B-DAT Lab, Nanjing University of Information Science and Technology, China

Qingshan Liu:B-DAT Lab, Nanjing University of Information Science and Technology, China

The scene parsing problem is extremely challenging due to the diversity of appearance and the complexity of configuration, layout, and occlusion. We mainly adopt the SegNet architecture [1] for scene parsing. We first extract edge information from the ground truth and treat the edges as a new class. Then we re-compute the weights of all classes to overcome the imbalance between classes, and train the model with the new ground truth and new weights. In addition, we employ super-pixel smoothing to optimize the results.
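The class re-weighting step could, for example, use median-frequency balancing, a common choice for SegNet-style training; the abstract does not name the exact scheme, so this is an assumption.

```python
# Hedged sketch of re-computing per-class loss weights after adding the
# edge class. Median-frequency balancing is assumed here; the team's
# actual weighting scheme is not specified.

def median_frequency_weights(pixel_counts):
    """pixel_counts: per-class pixel totals; returns per-class loss weights
    equal to median(count) / count, so rare classes get larger weights."""
    freqs = sorted(pixel_counts)
    n = len(freqs)
    median = freqs[n // 2] if n % 2 else (freqs[n // 2 - 1] + freqs[n // 2]) / 2.0
    return [median / c if c > 0 else 0.0 for c in pixel_counts]
```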

[1] V. Badrinarayanan, A. Handa, and R. Cipolla. SegNet:a deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv preprint arXiv:1505.07293,2015.

[2]Wang F, Li Z, Liu Q. Coarse-to-fine human parsing with Fast R-CNN and over-segment retrieval[C]//2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016: 1938-1942.



NUS-AIPARSE XIAOJIE JIN (NUS)

YUNPENG CHEN (NUS)

XIN LI (NUS)

JIASHI FENG (NUS)

SHUICHENG YAN (360 AI INSTITUTE, NUS) The submissions are based on our proposed Multi-Path Feedback recurrent neural network (MPF-RNN) [1]. MPF-RNN aims to enhance the capability of RNNs in modeling long-range context information at multiple levels and to better distinguish pixels that are easy to confuse in pixel-wise classification. In contrast to CNNs without feedback and RNNs with only a single feedback path, MPF-RNN propagates the contextual features learned at top layers through weighted recurrent connections to multiple bottom layers to help them learn better features with such "hindsight". Besides, we propose a new training strategy which considers the loss accumulated at multiple recurrent steps to improve the performance of MPF-RNN on parsing small objects as well as to stabilize the training procedure.



In this contest, ResNet-101 is used as the baseline model, with multi-scale input data augmentation and multi-scale testing.



[1] Jin, Xiaojie, Yunpeng Chen, Jiashi Feng, Zequn Jie, and Shuicheng Yan. "Multi-Path Feedback Recurrent Neural Network for Scene Parsing." arXiv preprint arXiv:1608.07706 (2016).

NUS_FCRN Li Xin, Tsinghua University;

Jin xiaojie, National University of Singapore;

Jiashi Feng, National University of Singapore.

We trained a single fully convolutional neural network with ResNet-101 as frontend model.



We did not use any multi-scale data augmentation in either training or testing.



NUS_VISENZE Kyaw Zaw Lin(dcskzl@nus.edu.sg)

Shangxuan Tian(shangxuan@visenze.com)

JingYuan Chen(a0117039@u.nus.edu) Fusion of three models: SSD (with VGG and ResNet backbones) [2] and Faster R-CNN [4] with ResNet [3]. Context suppression is applied, and tracking is then performed according to [1]. Tracklets are greedily merged after tracking.



[1]Danelljan, Martin, et al. "Accurate scale estimation for robust visual tracking." Proceedings of the British Machine Vision Conference BMVC. 2014.

[2]Liu, Wei, et al. "SSD: Single Shot MultiBox Detector." arXiv preprint arXiv:1512.02325 (2015).

[3]He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).

[4]Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015.

OceanVision Zhibin Yu Ocean University of China

Chao Wang Ocean University of China

ZiQiang Zheng Ocean University of China

Haiyong Zheng Ocean University of China Our homepage: http://vision.ouc.edu.cn/~zhenghaiyong/



We are interested in scene classification and aim to build a network for this problem.

OutOfMemory Shaohua Wan, UT Austin

Jiapeng Zhu, BIT, Beijing The Faster R-CNN [1] object detection framework with the ResNet-152 [2] network configuration is used in our object detection algorithm. Much effort was put into optimizing the network so that it consumes far less GPU memory.



[1] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NIPS, 2015.

[2] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

Rangers Y. Q. Gao,

W. H. Luo,

X. J. Deng,

H. Wang,

W. D. Chen,

---

ResNeXt Saining Xie, UCSD

Ross Girshick, FAIR

Piotr Dollar, FAIR

Kaiming He, FAIR





We present a simple, modularized multi-way extension of ResNet for ImageNet classification. In our network, each residual block consists of multiple ways that share the same architectural shape, and the network is a simple stack of such residual blocks sharing the same template, following the design of the original ResNet. Our model is highly modularized and thus reduces the burden of exploring the design space. We carefully conducted ablation experiments showing the improvements of this architecture. More details will be available in a technical report. In the submissions we exploited multi-way ResNet-101 models. We submitted no localization result.

RUC_BDAI Peng Han, Renmin University of China

An Zhao, Renmin University of China

Wenwu Yuan, Renmin University of China

Zhiwu Lu, Renmin University of China

Jirong Wen, Renmin University of China

Lidan Yang, Renmin University of China

Aoxue Li, Peking University We use a well-trained Faster R-CNN [1] to generate bounding boxes for every frame of the video, training the model on only a few frames of each video. To reduce the effect of class imbalance, we keep the number of training samples for every category roughly the same. We then utilize the contextual information of the video to reduce noise and recover missing detections.



[1] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.

S-LAB-IIE-CAS Ou Xinyu [1,2]

Ling Hefei [2]

Liu Si [1]



1. Chinese Academy of Sciences, Institute of Information Engineering;

2. Huazhong University of Science and Technology

(This work was done when the first author worked as an intern at the S-Lab of CASIIE.)

We exploit object-based contextual enhancement strategies to improve the performance of deep convolutional neural network over scene parsing task. Increasing the weights of objects on local proposal regions can enhance the structure characteristics of the object and correct the ambiguous areas which are wrongly judged as stuff. We have verified its effectiveness on ResNet101-like architecture [1], which is designed with multi-scale, CRF, atrous convolutional [2] technologies. We also apply various technologies (such as RPN [3], black hole padding, visual attention, iterative training) to this ResNet101-like architecture. The algorithm and architecture details will be described in our paper (available online shortly).

In this competition, we submitted five entries. The first (model A) is a multi-scale ResNet101-like model with a fully connected CRF and atrous convolutions, which achieved 0.3486 mIoU and 75.39% pixel-wise accuracy on the validation dataset. The second (model B) is a multi-scale deep CNN modified by object proposals, which achieved 0.3809 mIoU and 75.69% pixel-wise accuracy. A black hole restoration strategy is attached to model B to generate model C. Model D applies attention strategies in the deep CNN model, and model E combines the results of the other four models.



[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition.IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

[2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan L. Yuille:

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. CoRR abs/1606.00915 (2016)

[3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Conference on Neural Information Processing Systems (NIPS), 2015



SamExynos Qian Zhang(Beijing Samsung Telecom R&D Center)

Peng Liu(Beijing Samsung Telecom R&D Center)

Jinbin Lin(Beijing Samsung Telecom R&D Center)

Junjun Xiong(Beijing Samsung Telecom R&D Center) Object localization:



The submission is based on [1] and [2], but we modified the model; the network has 205 layers. Due to limited time and GPUs, we trained only three CNN models for classification. The top-5 accuracy on the validation set with dense crops (scales: 224, 256, 288, 320, 352, 384, 448, 480) is 96.44% for the best single model, and 96.88% for the three-model ensemble.



places365 classification:



The submission is based on [3] and [4]; we added 5 layers to ResNet-50 and modified the network. Due to limited time and GPUs, we trained only three CNN models for the scene classification task. The top-5 accuracy on the validation set with 72 crops is 87.79% for the best single model, and 88.70% with multiple crops for the three-model ensemble.



[1]Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun,Identity Mappings in Deep Residual Networks. ECCV 2016.

[2]Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning". arXiv preprint arXiv:1602.07261 (2016)

[3]Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna. "Rethinking the Inception Architecture for Computer Vision". arXiv preprint arXiv:1512.00567 (2015)

Samsung Research America: General Purpose Acceleration Group Dr. S. Eliuk (Samsung), C. Upright (Samsung), Dr. H. Vardhan (Samsung), T. Gale (Intern, Northeastern University), S. Walsh (Intern, University of Alberta). The General Purpose Acceleration Group is focused on accelerating training via HPC and distributed computing. We present Distributed Training Done Right (DTDR), where standard open-source models are trained effectively via a multitude of techniques involving strong/weak scaling and strict distributed training modes. Several different models are used, from standard Inception v3 to Inception v4 res2, along with ensembles of such models. The training environment is unique in that we can explore extremely deep models given the model-parallel nature of our data partitioning.



scnu407 Li Shiqi South China Normal University

Zheng Weiping South China Normal University

Wu Jinhui South China Normal University We believe that the spatial relationships between objects in an image form a kind of time-series data. Therefore, we first use VGG16 to extract image features, then append four LSTM layers, each scanning the feature map in one of four directions.

SegModel Falong Shen, Peking Univerisity

Rui Gan, Peking University

Gang Zeng, Peking Univerisity Abstract

Our models are finetuned from resnet152[1] and follow the methods introduced in [2].











References

[1] K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition.

[2] F. Shen, G. Zeng. Fast Semantic Image Segmentation with High Order Context and Guided Filtering.

SenseCUSceneParsing Hengshuang Zhao* (SenseTime, CUHK), Jianping Shi* (SenseTime), Xiaojuan Qi (CUHK), Xiaogang Wang (CUHK), Tong Xiao (CUHK), Jiaya Jia (CUHK) [* equal contribution] We have employed FCN-based semantic segmentation for scene parsing. We propose a context-aware semantic segmentation framework; the additional image-level information significantly improves performance on complex scenes in the natural distribution. Moreover, we find that deeper pretrained models are better. Our pretrained models include ResNet269 and ResNet101 from the ImageNet dataset, and ResNet152 from the Places2 dataset. Finally, we utilize a deeply supervised structure to assist in training the deeper models. Our best single model reaches 44.65 mIoU and 81.58% pixel accuracy on the validation set.



[1]. Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." CVPR. 2015.

[2]. He, Kaiming, et al. "Deep residual learning for image recognition." arXiv:1512.03385, 2015.

[3]. Lee, Chen-Yu, et al. "Deeply-Supervised Nets." AISTATS, 2015.



SIAT_MMLAB Sheng Guo, Linjie Xing,

Shenzhen Institutes of Advanced Technology, CAS.

Limin Wang,

Computer Vision Lab, ETH Zurich.

Yuanjun Xiong,

Chinese University of Hong Kong.

Jiaming Liu and Yu Qiao,

Shenzhen Institutes of Advanced Technology, CAS. We propose a modular framework for large-scale scene recognition, called multi-resolution CNN (MR-CNN) [1]. This framework addresses the difficulty of characterizing scene concepts, which may be based on multi-level visual information, including local objects, spatial layout, and global context. Specifically, in this challenge submission, we utilize four resolutions (224, 299, 336, 448) as the input sizes of the MR-CNN architectures. For coarse resolutions (224, 299), we exploit the existing powerful Inception architectures (Inception v2 [2], Inception v4 [3], and Inception-ResNet [3]), while for fine resolutions (336, 448), we propose new inception architectures by making the original inception network deeper and wider. Our final submission is the prediction result of MR-CNNs obtained by fusing the outputs of CNNs at different resolutions.



In addition, we propose several principled techniques to reduce the over-fitting risk of MR-CNNs, including class balancing and hard sample mining. These simple yet effective training techniques enable us to further improve the generalization performance of MR-CNNs on the validation dataset. Meanwhile, we use an efficient parallel version of Caffe toolbox [4] to allow for the fast training of our proposed deeper and wider Inception networks.





[1] L. Wang, S. Guo, W. Huang, Y. Xiong, and Y. Qiao, Knowledge guided disambiguation for large-scale scene classification with Multi-Resolution CNNs, in arXiv, 2016.



[2] S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in ICML, 2015.



[3] C. Szegedy, S. Ioffe, and V. Vanhouche, Inception-v4, Inception-ResNet and the impact of residual connections on learning, in arXiv, 2016.



[4] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, Temporal segment networks: towards good practices for deep action recognition, in ECCV, 2016.

SIIT_KAIST Sihyeon Seong (KAIST)

Byungju Kim (KAIST)

Junmo Kim (KAIST) We used ResNet [1] (101 layers / 4 GPUs) as our baseline model. Starting from the model pre-trained on the ImageNet classification dataset (provided by [2]), we re-tuned it on the Places365 dataset (the 256-resized small dataset). Then, we further fine-tuned the model based on the following ideas:



i) Analyzing correlations between labels: we calculated the correlation between each pair of predictions p(i), p(j), where i, j are classes. Highly correlated label pairs are then extracted by thresholding the correlation coefficients.



ii) Additional semantic label generation: using the correlation table from i), we generated super-/sub-class labels by clustering. Additionally, we generated 170 binary labels for separating confusing classes, which maximize the margins between highly correlated label pairs.



iii) Boosting-like multi-loss terms: a large number of loss terms are combined for classifying the labels generated in ii).
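Step i) can be sketched as plain Pearson correlation between per-class score columns over a validation set, thresholded to extract pairs; the actual pipeline and threshold value are not specified, so both are assumptions here.

```python
# Rough sketch of correlation-based label-pair mining: compute Pearson
# correlation between per-class prediction scores across images and keep
# pairs above a threshold. The threshold (0.8) is an arbitrary example.
import math

def correlated_pairs(preds, thresh=0.8):
    """preds: list of per-image score vectors (one entry per class)."""
    n_cls = len(preds[0])
    cols = list(zip(*preds))  # scores grouped per class
    def corr(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        vx = sum((a - mx) ** 2 for a in x)
        vy = sum((b - my) ** 2 for b in y)
        return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0
    return [(i, j) for i in range(n_cls) for j in range(i + 1, n_cls)
            if corr(cols[i], cols[j]) > thresh]
```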



[1] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).

[2] https://github.com/facebook/fb.resnet.torch



SIIT_KAIST-TECHWIN Byungju Kim (KAIST),

Youngsoo Kim (KAIST),

Yeakang Lee (KAIST),

Junho Yim (KAIST),

Sangji Park (Techwin),

Jaeho Jang (Techwin),

Shimin Yin (Techwin),

Soonmin Bae (Techwin),

Junmo Kim (KAIST) Our methods for classification and localization are based on ResNet[1].



We used Branched-200-layer ResNets, based on the original 200-layer ResNet and Label Smoothing method [2].



The networks are trained from scratch on the ILSVRC 2016 localization dataset.



For testing, the dense sliding window method [3] was used at six scales, with horizontal flips.



'Single model' is one Branched-ResNet with the label smoothing method; its validation top-5 classification error rate is 3.7240%.



'Ensemble A' consists of one 200-layer ResNet, one Branched-ResNet without label smoothing, and the 'Single model' with label smoothing.



'Ensemble B' consists of three Branched-ResNets without label smoothing and the 'Single model' with label smoothing.



'Ensemble C' consists of 'Ensemble B' and an original 200-layer ResNet.



Ensembles A and B are averaged over soft targets distilled at a high temperature, similar to the method in [4].



Ensemble C is averaged on softmax outputs.
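The high-temperature softening used when averaging soft targets [4] is just a softmax over scaled logits; dividing the logits by T > 1 flattens the distribution before the ensemble average. The temperature value below is an arbitrary example, not the one used by the team.

```python
# Illustrative temperature-softened softmax for distillation-style
# averaging: larger T yields a flatter ("softer") distribution.
import math

def softmax_with_temperature(logits, T=4.0):
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]
```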



This work was supported by Hanwha Techwin CO., LTD.





[1] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).

[2] Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." arXiv preprint arXiv:1512.00567 (2015).

[3] Sermanet, Pierre, et al. "Overfeat: Integrated recognition, localization and detection using convolutional networks." arXiv preprint arXiv:1312.6229 (2013).

[4] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).

SIS ITMO University --- Single-Shot Detector



SJTU-ReadSense Qinchuan Zhang, Shanghai Jiao Tong University

Junxuan Chen, Shanghai Jiao Tong University

Thomas Tong, ReadSense

Leon Ding, ReadSense

Hongtao Lu, Shanghai Jiao Tong University We train two CNN models from scratch. Model A, based on Inception-BN [1] with one auxiliary classifier, is trained on the Places365-Challenge dataset [2] and achieved a 15.03% top-5 error on the validation dataset. Model B, based on ResNet [3] with a depth of 50 layers, is trained on the Places365-Standard dataset and fine-tuned for 2 epochs on the Places365-Challenge dataset due to time limits; it achieved a 16.3% top-5 error on the validation dataset. We also fuse features extracted from 3 baseline models [2] on the Places365-Challenge dataset and train two fully connected layers with a softmax classifier. Moreover, we adopt the "class-aware" sampling strategy proposed by [4] for models trained on the Places365-Challenge dataset to tackle the non-uniform distribution of images over the 365 categories. We implement model A using Caffe [5] and conduct all other experiments using MXNet [6] to deploy a larger batch size on a GPU.



We train all models with a 224x224 crop randomly sampled from a 256x256 image or its horizontal flip, with the per-pixel mean subtracted. We apply 12-crops [7] for evaluation on the validation and test datasets.
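A plausible reading of "12-crops" is four corners, the center, and the full image, each with its horizontal flip (6 x 2 = 12); the exact crop layout used by the team is not specified, so this sketch is an assumption.

```python
# Assumed 12-crop layout: 4 corners + center + full image, each mirrored.
# Each spec is (x, y, width, height, flipped); the full-image entry would
# be resized to the crop size before inference.

def twelve_crop_boxes(w, h, cw, ch):
    """Return 12 crop specs for a w x h image and a cw x ch crop size."""
    boxes = [
        (0, 0, cw, ch),                          # top-left
        (w - cw, 0, cw, ch),                     # top-right
        (0, h - ch, cw, ch),                     # bottom-left
        (w - cw, h - ch, cw, ch),                # bottom-right
        ((w - cw) // 2, (h - ch) // 2, cw, ch),  # center
        (0, 0, w, h),                            # full image
    ]
    return [b + (flip,) for b in boxes for flip in (False, True)]
```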



We ensemble multiple models with weights (either learned on the validation set or set to the models' top-5 validation accuracies), and achieve 12.79% (4 models), 12.69% (5 models), and 12.57% (6 models) top-5 error on the validation set.
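A weighted ensemble of softmax outputs of this kind can be sketched as below; the weights here are placeholders, whereas the team tuned theirs on the validation set.

```python
import numpy as np

def weighted_ensemble(prob_list, weights):
    """Combine per-model softmax outputs with scalar weights,
    renormalising so the result is again a class distribution."""
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()
    stacked = np.stack(prob_list)             # (models, samples, classes)
    return np.tensordot(w, stacked, axes=1)   # (samples, classes)

# Two fake models' softmax outputs over 4 samples and 10 classes.
rng = np.random.default_rng(0)
p1 = rng.random((4, 10)); p1 /= p1.sum(axis=1, keepdims=True)
p2 = rng.random((4, 10)); p2 /= p2.sum(axis=1, keepdims=True)
fused = weighted_ensemble([p1, p2], [0.7, 0.3])
```

Since the weights are normalised, each fused row is still a valid probability distribution.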



[1] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

[2] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva. Places: An image database for deep scene understanding. arXiv, 2016.

[3] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.

[4] L. Shen, Z. Lin , Q. Huang. Relay backpropagation for effective learning of deep convolutional neural networks. arXiv:1512.05830, 2015.

[5] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.

[6] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In NIPS, 2015.

[7] C. Szegedy,W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.

SRA Hojjat Seyed Mousavi, The Pennsylvania State University, Samsung Research America



Da Zhang, University of California at Santa Barbara, Samsung Research America



Nina Narodytska, Samsung Research America



Hamid Maei, Samsung Research America



Shiva Kasiviswanathan, Samsung Research America Object detection from video is a challenging task in computer vision. These challenges sometimes stem from the temporal aspects of videos or from the nature of the objects present in them. For example, detecting objects that disappear and reappear in the camera's field of view, or detecting non-rigid objects that change appearance, are common difficulties in video object detection. In this work, we specifically focus on incorporating temporal and contextual information to address some of these challenges. In our proposed method, initial object candidates are first detected in each frame of the video sequence. Then, based on information from adjacent frames as well as contextual information from the whole video sequence, object detections and categories are recalculated for each video sequence. We made two submissions to this year's competition. One corresponds to our algorithm using information from still video frames, temporal information from adjacent frames, and contextual information from the whole video sequence. The other submission does not use the contextual information present in the video.
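One simple way to propagate evidence between adjacent frames, in the spirit of the recalculation step above, is a moving average over per-frame class scores; this is a minimal stand-in, not the team's actual method.

```python
import numpy as np

def temporal_smooth(frame_scores, window=2):
    """Average each frame's class scores with those of its +/- window
    neighbours, so a momentary detection dropout is filled in by the
    surrounding frames."""
    frame_scores = np.asarray(frame_scores, dtype=np.float64)
    n = len(frame_scores)
    out = np.empty_like(frame_scores)
    for t in range(n):
        lo, hi = max(0, t - window), min(n, t + window + 1)
        out[t] = frame_scores[lo:hi].mean(axis=0)
    return out

# Toy sequence: the object "flickers" out in frame 2; smoothing restores it.
scores = np.array([[0.9], [0.8], [0.1], [0.9], [0.8]])
smoothed = temporal_smooth(scores, window=1)
```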

SunNMoon Moon Hyoung-jin.

Park Sung-soo. We ensembled two object detectors, Faster R-CNN and the Single Shot Detector (SSD).

We used a pre-trained ResNet-101 classification model.



Faster R-CNN is combined with an RPN and the SPOPnet (Scale-aware Pixel-wise Object Proposal network) algorithm to find better RoIs. We trained both Faster R-CNN and SSD; the final result is the combination of multi-scale Faster R-CNN and SSD300x300.
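One common way to combine two detectors' outputs is to pool their detections and run greedy non-maximum suppression across the pooled set; whether the team used exactly this scheme is not stated, so the sketch below is illustrative.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def merge_detections(dets_a, dets_b, thresh=0.5):
    """Pool (box, score) detections from two detectors and apply greedy
    NMS across the pooled set, keeping the higher-scoring duplicate."""
    pooled = sorted(dets_a + dets_b, key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in pooled:
        if all(iou(box, k[0]) < thresh for k in kept):
            kept.append((box, score))
    return kept

# Both detectors fire on the same object; detector B adds one more.
a = [((0, 0, 10, 10), 0.9)]
b = [((1, 1, 11, 11), 0.8), ((50, 50, 60, 60), 0.7)]
merged = merge_detections(a, b)
```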



The single Faster R-CNN model achieves 42.8% mAP.

The SSD300x300 model achieves 43.7% mAP.