Beijing-based computer vision unicorn Megvii Technology runs the world’s largest face-recognition technology platform, Face++. The company provides innovative solutions for object detection and image recognition using AI-powered techniques.

This week, Megvii (Face++) Chief Scientist Dr. Jian Sun and his research team will present multiple projects at the European Conference on Computer Vision (ECCV) 2018, one of the world’s top three international image processing and computer vision gatherings.

AI systems have made impressive progress on individual visual recognition tasks, but they still lack the human ability to take in many kinds of visual information at a glance. For example, when a human looks at a living room they can easily parse concepts at multiple perceptual levels, e.g., scene, objects, parts, textures, materials, as well as the compositional structures linking the detected concepts. Megvii defines this ability as Unified Perceptual Parsing (UPP). Accomplishing UPP with a learning framework called “UPerNet” is the subject of the recent paper Unified Perceptual Parsing for Scene Understanding by Dr. Sun et al., one of the projects that will be presented at ECCV.

UPP Task Demonstration

The research team’s first challenge was creating a high-quality training dataset, which is the foundation of a learning network. No single existing image dataset could provide all the levels of visual information required for UPP, so the authors merged and standardized several labeled image datasets built for specific tasks: ADE20K, Pascal-Context and Pascal-Part for scene, object and part parsing; OpenSurfaces for material and surface recognition; and the Describable Textures Dataset (DTD) for texture recognition. The result was the working dataset Broden+, with 57,095 images for model training.
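Merging datasets in this way means mapping each source's label names onto one shared vocabulary per task. The following is a minimal sketch of that idea, not the authors' actual pipeline: the dataset names come from the article, while the sample records, the `ALIASES` table and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-dataset annotations: (image id, task, raw label).
# Dataset names are from the article; the records themselves are invented.
RAW = {
    "ADE20K": [("ade_001", "object", "Sofa"), ("ade_001", "scene", "living_room")],
    "Pascal-Context": [("pc_001", "object", "couch")],
    "OpenSurfaces": [("os_001", "material", "Wood")],
    "DTD": [("dtd_001", "texture", "striped")],
}

# Hypothetical alias table mapping synonyms to one canonical name.
ALIASES = {"couch": "sofa"}


def canonical(label):
    """Lowercase a raw label and resolve known synonyms."""
    key = label.strip().lower()
    return ALIASES.get(key, key)


def merge(raw):
    """Build one label vocabulary per task and a unified sample list."""
    vocab = defaultdict(dict)  # task -> {canonical label: integer id}
    samples = []
    for source, records in raw.items():
        for image_id, task, label in records:
            name = canonical(label)
            # The first time a label is seen it receives the next free id.
            class_id = vocab[task].setdefault(name, len(vocab[task]))
            samples.append((source, image_id, task, class_id))
    return dict(vocab), samples


vocab, samples = merge(RAW)
```

Here "Sofa" from ADE20K and "couch" from Pascal-Context end up with the same object id, which is the kind of standardization a merged dataset needs before a single model can train on it.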

The authors overcame the annotation heterogeneity challenge (e.g., some annotations are image-level while others are pixel-level) by designing a multi-task framework that detects various visual concepts simultaneously. UPerNet is built on a Feature Pyramid Network (FPN), which uses a top-down architecture with lateral connections to extract multi-level feature representations from the inherent pyramidal hierarchy of a convolutional network.
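One common way to train a multi-task model on heterogeneous annotations is to draw each batch from a single data source and compute losses only for the heads that source can supervise. The sketch below illustrates that routing under stated assumptions: the `HEADS_BY_SOURCE` mapping and the `training_step` function are hypothetical, and the loss values are placeholders rather than real model outputs.

```python
import random

# Hypothetical mapping from data source to the heads it can supervise,
# loosely following the dataset roles described in the article.
HEADS_BY_SOURCE = {
    "ADE20K": ["scene", "object", "part"],
    "OpenSurfaces": ["material"],
    "DTD": ["texture"],
}


def training_step(source, head_losses):
    """Sum losses only over the heads annotated by the sampled source."""
    active = HEADS_BY_SOURCE[source]
    total = sum(head_losses[head] for head in active)
    return total, active


# Placeholder per-head losses for one batch (invented numbers).
losses = {"scene": 1.0, "object": 2.0, "part": 3.0, "material": 4.0, "texture": 5.0}

# Pick the source for this iteration at random, then route the losses.
rng = random.Random(0)
source = rng.choice(sorted(HEADS_BY_SOURCE))
total, active = training_step(source, losses)
```

Heads without annotations in the sampled source contribute nothing to the total, so a texture-only batch never pushes the object or part predictions in an arbitrary direction.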

Because the empirical receptive field of the FPN is not large enough, a Pyramid Pooling Module (PPM) is applied to the last layer of the backbone network before its output is fed into the FPN top-down branch. In addition, the object and part predictions use a fusion of the FPN feature maps to improve model performance.
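Pyramid pooling widens the receptive field by averaging a feature map over grids of several sizes and then resizing each pooled map back to the original resolution. The following is a simplified single-channel sketch in plain Python, not the UPerNet implementation: the bin sizes (1, 2, 3, 6) are an assumption borrowed from common PPM configurations, nearest-neighbour upsampling stands in for the interpolation a real network would use, and the convolutions that normally follow are omitted.

```python
def adaptive_avg_pool(grid, bins):
    """Average-pool a 2-D list into a bins x bins grid."""
    h, w = len(grid), len(grid[0])
    pooled = []
    for i in range(bins):
        # Region bounds: floor for the start, ceil for the end.
        r0, r1 = (i * h) // bins, -((-(i + 1) * h) // bins)
        row = []
        for j in range(bins):
            c0, c1 = (j * w) // bins, -((-(j + 1) * w) // bins)
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def upsample_nearest(grid, h, w):
    """Nearest-neighbour upsample a square grid back to h x w."""
    bins = len(grid)
    return [[grid[(r * bins) // h][(c * bins) // w] for c in range(w)]
            for r in range(h)]


def pyramid_pool(feature, bin_sizes=(1, 2, 3, 6)):
    """Return the input map plus one upsampled context map per bin size."""
    h, w = len(feature), len(feature[0])
    context = [upsample_nearest(adaptive_avg_pool(feature, b), h, w)
               for b in bin_sizes]
    return [feature] + context


# A 6 x 6 toy feature map whose values increase row by row.
feature = [[r * 6 + c for c in range(6)] for r in range(6)]
maps = pyramid_pool(feature)
```

The map pooled with a single bin carries the global average to every position, while finer bins keep progressively more local detail. In a real network these maps would be concatenated along the channel dimension.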

A total of 36,500 images from 365 scenes in the Places-365 dataset were used for model validation. Both quantitative and qualitative results suggest that UPerNet is effective at unifying multi-level visual attributes simultaneously, with performance and training time competitive with current state-of-the-art methods.

Qualitative and quantitative results of UPP using UPerNet with Broden+ training dataset

Analysis of UPerNet vs. state-of-the-art methods on the ADE20K dataset.

UPerNet can also support a deeper understanding of a scene by identifying compositional relations such as scene-object, object/part-material, and material-texture relations from input images. The results indicate that the extracted information is reasonable and matches human understanding of the compositional relations between these concepts.

Discovered Visual Knowledge by UPerNet Trained for UPP

The paper Unified Perceptual Parsing for Scene Understanding was published on arXiv in July 2018. Related open source code is available on GitHub.

Source: Synced China