2.1.1 Siamese Tower

The first step of the pipeline finds a meaningful representation of image patches that can be accurately matched in later stages. ActiveStereoNet uses a feature network with shared weights between the two input images (also known as a Siamese network).

The Siamese tower is designed to preserve more of the high-frequency signal from the input image. This is accomplished by running several residual blocks at full resolution and only later reducing resolution with strided convolutions. The output is a 32-dimensional feature vector at each pixel of the downsampled image of size W/8 x H/8.
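A minimal PyTorch sketch of such a tower is shown below. The exact layer counts and activation choices here are illustrative assumptions, not the paper's exact architecture; what it demonstrates is the structure described above: residual blocks at full resolution, followed by three stride-2 convolutions for the 8x downsampling, ending in a 32-channel feature map.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Simple residual block run at full resolution."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        return self.act(x + self.conv2(self.act(self.conv1(x))))

class SiameseTower(nn.Module):
    """Illustrative feature tower: residual blocks at full resolution,
    then three stride-2 convolutions for an overall 8x downsampling.
    Layer counts are assumptions, not the authors' exact configuration."""
    def __init__(self, in_ch=1, feat_ch=32):
        super().__init__()
        layers = [nn.Conv2d(in_ch, feat_ch, 3, padding=1)]
        layers += [ResBlock(feat_ch) for _ in range(3)]   # full resolution
        for _ in range(3):                                # 3 x stride-2 -> /8
            layers.append(nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

tower = SiameseTower()
left = torch.randn(1, 1, 64, 128)
feat = tower(left)   # the same weights are applied to the right image
print(feat.shape)    # (1, 32, H/8, W/8) = (1, 32, 8, 16)
```

Because the weights are shared, the two images are embedded into the same feature space, which is what makes the later matching step meaningful.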

Fig. 5. Detailed Network Architecture of Siamese Tower

2.1.2 Cost Volume

At this point, a cost-volume is formed at the coarse resolution by taking the difference between the feature vector of a pixel and the feature vectors of its matching candidates. The main idea behind this procedure is to slide one image over the other and find, at each pixel, the disparity that yields the best match (lowest cost). In contrast to the standard approach, ActiveStereoNet builds the cost-volume over the feature maps rather than over the high-resolution input images.
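A sketch of the difference-based cost-volume construction is below. This is a straightforward reading of the description above, not the paper's exact implementation: for each candidate disparity d, the right feature map is shifted by d pixels and subtracted from the left one. Pixels with no valid match at a given shift are left at zero here for simplicity.

```python
import torch

def build_cost_volume(feat_l, feat_r, max_disp):
    """Difference cost volume over max_disp candidate disparities.

    feat_l, feat_r: (B, C, H, W) feature maps from the Siamese tower.
    Returns a (B, C, max_disp, H, W) volume where entry d holds
    feat_l(x) - feat_r(x - d). Out-of-bounds candidates stay zero
    (a simplification; other border policies are possible).
    """
    B, C, H, W = feat_l.shape
    volume = feat_l.new_zeros(B, C, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, :, d] = feat_l - feat_r
        else:
            volume[:, :, d, :, d:] = feat_l[..., d:] - feat_r[..., :-d]
    return volume

fl = torch.randn(2, 32, 8, 16)
fr = torch.randn(2, 32, 8, 16)
vol = build_cost_volume(fl, fr, 18)
print(vol.shape)   # (2, 32, 18, 8, 16)
```

Because the volume lives at 1/8 resolution, both the number of shifts and the per-shift cost are 64x cheaper than they would be at full resolution, which is the main efficiency argument for matching in feature space.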

The animation below illustrates the procedure quite nicely. If you’re interested, you can refer to the original paper for more details on the cost-volume filter.

Fig. 6. Cost-Volume for stereo matching illustration

The input to the cost-volume is a feature map of dimensions W/8 x H/8 x 32, evaluated over a maximum of D candidate disparities. The disparity at each pixel is then selected using a differentiable softmax-weighted combination of all the candidate disparities (a soft argmin), rather than a hard, non-differentiable minimum.

Eq.2 Differentiable softmax-weighted operator

This output of size W/8 x H/8 x 1 is then upsampled using bi-linear interpolation to the original resolution W x H.
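The selection-plus-upsampling step can be sketched as follows. Here I assume the per-channel cost volume has already been reduced to a scalar cost per pixel and candidate disparity (e.g. by a norm over channels or a small filtering network), which is an assumption about the interface, not a claim about the paper's exact layers. Note that disparity values computed at 1/8 resolution are measured in coarse pixels, so they are typically rescaled by the upsampling factor when brought to full resolution.

```python
import torch
import torch.nn.functional as F

def soft_argmin(cost, scale=8):
    """Differentiable disparity selection and upsampling (illustrative).

    cost: (B, D, H/8, W/8) matching cost for D candidate disparities.
    Returns a (B, 1, H, W) disparity map at full resolution.
    """
    # Softmax over negated costs: low cost -> high weight.
    prob = F.softmax(-cost, dim=1)
    disps = torch.arange(cost.shape[1], dtype=cost.dtype,
                         device=cost.device).view(1, -1, 1, 1)
    # Expected disparity under the softmax distribution (soft argmin).
    disp = (prob * disps).sum(dim=1, keepdim=True)   # (B, 1, H/8, W/8)
    # Bilinear upsampling to full resolution; values are rescaled because
    # they were computed in coarse-resolution pixel units (an assumption
    # consistent with matching at 1/8 resolution).
    disp = F.interpolate(disp, scale_factor=scale, mode='bilinear',
                         align_corners=False) * scale
    return disp

cost = torch.rand(2, 18, 8, 16)
disp = soft_argmin(cost)
print(disp.shape)   # (2, 1, 64, 128)
```

The softmax weighting is what keeps the whole pipeline end-to-end trainable: a hard argmin over disparities would have zero gradient almost everywhere.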

2.1.3 Residual Refinement Network

The downside of relying on coarse matching is that the resulting output lacks fine details. To recover them while maintaining a compact design, ActiveStereoNet learns an edge-preserving refinement network.

The refinement network takes as input the disparity bi-linearly upsampled to the output size, as well as the color image resized to the same dimensions.
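A minimal sketch of the residual refinement idea is below. The layer sizes and the use of a dilated convolution are illustrative assumptions (the paper's refinement block is more elaborate); the point is the interface: the network sees the upsampled disparity concatenated with the resized image, and predicts a residual that is added back to the disparity, so color edges can guide where depth discontinuities are sharpened.

```python
import torch
import torch.nn as nn

class RefinementNet(nn.Module):
    """Illustrative residual refinement network (not the paper's exact
    architecture). Predicts a disparity residual from the concatenation
    of the upsampled disparity and the resized color image."""
    def __init__(self, color_ch=1, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1 + color_ch, ch, 3, padding=1), nn.LeakyReLU(0.2),
            # Dilated convolution to enlarge the receptive field cheaply.
            nn.Conv2d(ch, ch, 3, padding=2, dilation=2), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, 1, 3, padding=1),
        )

    def forward(self, disp, image):
        residual = self.net(torch.cat([disp, image], dim=1))
        return disp + residual   # edge-aware correction of the coarse disparity

net = RefinementNet()
disp = torch.rand(1, 1, 64, 128)
image = torch.rand(1, 1, 64, 128)
refined = net(disp, image)
print(refined.shape)   # (1, 1, 64, 128)
```

Predicting a residual rather than the disparity itself means the network only has to learn small, high-frequency corrections, which keeps it compact.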

Fig. 7. Detailed Network Architecture of Residual Refinement Network

2.3 Loss

The architecture described is composed of a low-resolution disparity and a final refinement step to retrieve high-frequency details. A natural choice is to have a loss function for each of these two steps.

Due to a lack of ground truth data, ActiveStereoNet is trained in an unsupervised manner. The authors note that while the photometric error is a viable choice of training loss, it isn't suitable for the active case because of the high correlation between intensity and disparity: brighter pixels tend to be closer, and since pixel noise is proportional to intensity, a photometric loss would bias the network towards close-up scenes.

Eq.3 Standard reconstruction loss between the left image I_l and the reconstructed left image I^. The reconstructed image I^ is obtained by sampling pixels from the right image I_r using the predicted disparity d
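The reconstruction in Eq.3 can be sketched with a differentiable warp: the left view is rebuilt by sampling the right image at x - d(x). The snippet below uses bilinear sampling via `grid_sample` and an L1 photometric error as the loss; the choice of L1 here is an illustrative assumption for the standard photometric loss being discussed, not the robustified loss ActiveStereoNet ultimately adopts.

```python
import torch
import torch.nn.functional as F

def reconstruct_left(img_r, disp):
    """Reconstruct the left view by sampling the right image at x - d(x).

    img_r: (B, C, H, W) right image; disp: (B, 1, H, W) left disparity.
    """
    B, _, H, W = img_r.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=img_r.dtype),
                            torch.arange(W, dtype=img_r.dtype), indexing='ij')
    xs = xs.unsqueeze(0).expand(B, -1, -1) - disp.squeeze(1)  # shift by -d
    ys = ys.unsqueeze(0).expand(B, -1, -1)
    # grid_sample expects coordinates normalized to [-1, 1].
    grid = torch.stack([2 * xs / (W - 1) - 1,
                        2 * ys / (H - 1) - 1], dim=-1)
    return F.grid_sample(img_r, grid, align_corners=True)

def photometric_loss(img_l, img_r, disp):
    """L1 photometric error between left image and its reconstruction."""
    return (img_l - reconstruct_left(img_r, disp)).abs().mean()

img_l = torch.rand(1, 1, 16, 32)
img_r = torch.rand(1, 1, 16, 32)
disp = torch.zeros(1, 1, 16, 32)
loss = photometric_loss(img_l, img_r, disp)
```

With zero disparity the "reconstruction" is just the right image itself, so the loss reduces to the mean absolute difference between the two views; the bilinear sampling keeps the whole expression differentiable with respect to the predicted disparity.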

In Fig. 8 below, the reconstruction error for a given disparity map under the photometric loss is shown. Notice how bright pixels on the pillow exhibit high reconstruction error due to the input-dependent nature of the noise.