Let’s break down a few key details here:

A variety of deep learning models (but no classic machine learning):

We were struck by how many different deep learning model architectures were used, and how similar the scores were among them. Some competitors used substantial ensembles — over 20 independently trained models — to generate predictions of which pixels corresponded to buildings, and then averaged those predictions for their final results. Another competitor (number13) trained a different set of weights for each individual collect, then generated predictions for each image using that collect’s model weights.
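
For illustration, this kind of ensembling reduces to averaging per-pixel probability maps. Here is a minimal sketch, assuming each trained model exposes a hypothetical predict_proba(image) method that returns a per-pixel building probability map:

```python
import numpy as np

def ensemble_predict(models, image):
    """Average per-pixel building probabilities across an ensemble of
    independently trained segmentation models (hypothetical interface)."""
    probs = np.stack([m.predict_proba(image) for m in models], axis=0)
    return probs.mean(axis=0)  # averaged probability map, thresholded downstream
```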

We weren’t terribly surprised that every winning algorithm used deep learning. This is consistent with general trends in computer vision: almost all high-performance segmentation algorithms utilize deep learning. The only “classical” machine learning algorithms used were Gradient Boosted Trees that the top two competitors — cannab and selim_sef — used to filter out “bad” building footprint predictions from the neural nets.
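
As a rough sketch of that filtering step (not the competitors' actual code), one could train a gradient boosted classifier on simple per-footprint features, such as polygon area and mean pixel confidence, labeled by whether each candidate matched a ground truth building:

```python
from sklearn.ensemble import GradientBoostingClassifier

def fit_footprint_filter(features, matched_ground_truth):
    """Train a gradient boosted tree model to reject low-quality candidate
    footprints. `features` is an (n_candidates, n_features) array of
    hand-crafted footprint descriptors; labels mark true detections."""
    clf = GradientBoostingClassifier(n_estimators=200, max_depth=3)
    clf.fit(features, matched_ground_truth)
    return clf

def filter_footprints(clf, candidates, features, keep_prob=0.5):
    """Keep only the candidate footprints the classifier scores above keep_prob."""
    keep = clf.predict_proba(features)[:, 1] >= keep_prob
    return [c for c, k in zip(candidates, keep) if k]
```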

Model tailoring to geospatial-specific (and related) problems

These algorithms taught us a lot about tailoring models to overhead imagery. Building pixels account for only 9.5% of the pixels in the training dataset. Segmentation algorithms are trained to classify individual pixels as belonging to an object (here, a building) or to background, so an algorithm can achieve roughly 90% pixel-wise accuracy here by predicting “non-building” for everything! This causes algorithms trained with “standard” loss functions — such as binary cross-entropy — to collapse to predicting zeros (background) everywhere. Competitors overcame this through two approaches: 1. using the relatively new Focal Loss, a cross-entropy variant that down-weights easy, confident predictions so that training concentrates on hard, low-confidence pixels, and 2. combining this loss function with an IoU-based loss such as the Jaccard Index or Dice Coefficient. These loss functions guard against the “all-zero valley” by strongly penalizing under-prediction.
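
A minimal PyTorch-style sketch of such a combined loss (the exact formulations and weights varied from solution to solution) might look like this:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights easy, well-classified pixels so
    training focuses on hard examples. alpha balances the two classes."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

def dice_loss(logits, targets, smooth=1.0):
    """Soft Dice loss: penalizes low overlap between prediction and target,
    which guards against all-background predictions."""
    probs = torch.sigmoid(logits)
    intersection = (probs * targets).sum()
    return 1 - (2 * intersection + smooth) / (probs.sum() + targets.sum() + smooth)

def combined_loss(logits, targets, dice_weight=1.0):
    # Hypothetical weighting; the winning solutions tuned their own mixes.
    return focal_loss(logits, targets) + dice_weight * dice_loss(logits, targets)
```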

An additional challenge with overhead imagery (and related problems like instance segmentation of small, densely packed objects) is object merging. The semantic segmentation approach described above does nothing to separate individual objects (a task called “instance segmentation”, which is what competitors were asked to do in this challenge). Instances are usually extracted from semantic masks by labeling contiguous objects as a single instance; however, semantic segmentation can produce pixel masks where nearby objects are erroneously connected to one another (see the example below). This can cause problems:

Building instance segmentation from attached (left) vs. separated (right) semantic segmentation output masks. Red arrows mark where poor predictions connect very closely apposed buildings.

This is a problem if the use case requires an understanding of how many objects exist in an image or where their precise boundaries are. Several competitors addressed this challenge by creating multi-channel learning objective masks, like the one below:

A sample pixel mask taken from cannab’s solution description. Black is background, blue is the first channel (building footprints), pink is the second channel (building boundaries), and green is the third channel (points very close to two or more different buildings). Cannab’s algorithm learned to predict outputs in this shape, and in post-processing he subtracted the boundaries and contact points from the predicted footprints to separate instances more effectively.

Rather than just predicting building/no building for each pixel, the algorithm is now effectively predicting three things: 1. building/no building, 2. edge of a building/no edge, 3. contact point between buildings/no contact point. Post-processing to subtract the edges and contact points can allow “cleanup” of apposed objects, improving instance segmentation.
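
A simplified sketch of this post-processing idea, not any competitor's exact pipeline, is to threshold the three channels, remove boundary and contact-point pixels from the footprint mask, and label the remaining connected components as instances:

```python
import numpy as np
from scipy import ndimage

def extract_instances(footprint, boundary, contact, threshold=0.5):
    """Separate building instances from a three-channel prediction of
    footprint, boundary, and contact-point probabilities (HxW arrays).
    Naive labeling of `footprint > threshold` alone would merge touching
    buildings; removing boundary and contact pixels first keeps them apart."""
    seeds = (footprint > threshold) & (boundary < threshold) & (contact < threshold)
    labels, n_buildings = ndimage.label(seeds)
    # In practice the labeled seeds would be grown back out to the full
    # footprint mask (e.g. with a watershed); omitted here for brevity.
    return labels, n_buildings
```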

Training and test time varied

The competition rules required that competitors’ algorithms could train in 7 days on 4 Titan Xp GPUs, and complete inference in no more than 1 day. The table above breaks down training and testing time. It’s noteworthy that many of these solutions are likely too slow to deploy in a production environment that requires constant, timely updates. Interestingly, individual models from the large ensembles could perhaps be used on their own without substantial degradation in performance (and with a dramatic increase in speed) — for example, cannab noted in his solution description that his best individual model scored nearly as well as the prize-winning ensemble.

Strengths and weaknesses of algorithms for off-nadir imagery analysis

We asked a few questions about the SpaceNet Off-Nadir Challenge winning algorithms:

- What fraction of each building did winning algorithms identify? I.e., how precise were the footprints?
- How did each algorithm perform across different look angles?
- How similar were the predictions from the different algorithms?
- Did building size influence the likelihood that a building would be identified?

These questions are explored in more detail at the CosmiQ Works blog, The DownlinQ. A summary of interesting points is below.

How precise were the footprints?

When we ran the SpaceNet Off-Nadir Challenge, we set an IoU threshold of 0.5 for building detection — meaning that, of all the pixels covered by either a ground truth footprint or a prediction (their union), more than 50% had to lie in both (their intersection) for the prediction to count as a success. Depending upon the use case, this threshold may be higher or lower than is actually necessary. A low IoU threshold means that you don’t care how much of a building is labeled correctly, only that some part of it is identified. This works for counting objects, but doesn’t work if you need precise outlines (for example, to localize damage after a disaster). It’s important to consider this threshold when evaluating computer vision algorithms for deployment: how precisely must objects be labeled for the use case?
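
Concretely, the per-building criterion reduces to an intersection-over-union test between a ground truth polygon and a predicted polygon; a minimal sketch using shapely:

```python
from shapely.geometry import Polygon

def is_detection(ground_truth: Polygon, prediction: Polygon,
                 iou_threshold: float = 0.5) -> bool:
    """Return True if a predicted footprint overlaps the ground truth
    footprint well enough (IoU >= threshold) to count as a detection."""
    union = ground_truth.union(prediction).area
    if union == 0:
        return False
    iou = ground_truth.intersection(prediction).area / union
    return iou >= iou_threshold
```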

We asked what would have happened to algorithms’ building recall — the fraction of ground truth buildings they identified — if we had changed this threshold. The results were striking:

Recall, or the fraction of actual buildings identified by algorithms, depends on the IoU threshold. Some algorithms identified part of many buildings, but not enough of each to be counted as a successful identification at our threshold of 0.5. The inset shows the range of IoU thresholds where XD_XD’s algorithm (orange) went from one of the best of the top five to one of the worst among the prize-winners.

There is little change in competitor performance if the threshold is set below 0.3 or so — of the buildings competitors found, most were localized at least that well. Above that point, however, performance begins to drop, and once the threshold reaches ~0.75, scores have fallen by 50%. This stark decline highlights another area where computer vision algorithms could be improved: instance-level segmentation accuracy for small objects.
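
One way to reproduce this kind of threshold sweep, sketched under the assumption that each ground truth building has already been assigned the IoU of its best-matching prediction (zero if unmatched):

```python
import numpy as np

def recall_by_iou_threshold(best_ious, thresholds=np.arange(0.05, 1.0, 0.05)):
    """For each IoU threshold, report the fraction of ground truth buildings
    whose best-matching prediction clears that threshold (i.e. recall)."""
    best_ious = np.asarray(best_ious, dtype=float)
    return {round(float(t), 2): float((best_ious >= t).mean()) for t in thresholds}
```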

Performance by look angle

Next, let’s examine how each competitor’s algorithm performed at every different look angle. We’ll look at three performance metrics: recall (the fraction of actual buildings identified), precision (the fraction of predicted buildings that corresponded to real buildings, not false positives), and F1 score, the competition metric that combines both of these features:

F1 score, recall, and precision for the top five competitors stratified by look angle. Though F1 scores and recall are relatively tightly packed except in the most off-nadir look angles, precision varied dramatically among competitors.
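
As a reminder of how these metrics relate, here is a minimal sketch computing all three from true positive, false positive, and false negative building counts (the function name is ours, for illustration):

```python
def building_scores(true_pos, false_pos, false_neg):
    """Precision, recall, and F1 from matched-building counts."""
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```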

Unsurprisingly, the competitors had very similar performance in these graphs, consistent with their tight packing at the top of the leaderboard. What little separation there was arose at the extremes: the competitors were very tightly packed in the “nadir” range (0–25 degrees), and indeed the only look angles with substantial separation between the top two (cannab and selim_sef) were those >45 degrees. cannab seems to have won on his algorithm’s performance on very off-nadir imagery!

One final note from these graphs: there are some odd spiking patterns in the middle look angle ranges. The angles with lower scores correspond to images taken facing South, where shadows obscure many features, whereas North-facing images had brighter sunlight reflections off of buildings (below figure reproduced from earlier as a reminder):

Two looks at the same buildings at nearly the same look angle, but from different sides of the city. It’s visually much harder to see buildings in the South-facing imagery, and apparently the same is true for neural nets! Imagery courtesy of DigitalGlobe.

This pattern was even stronger in our baseline model. Look angle isn’t all that matters — look direction is also important!

Similarity between winning algorithms

We examined each building in the imagery and asked how many competitors successfully identified it. The results were striking:

Histograms showing how many competitors identified each building in the dataset, stratified by look angle subset. The vast majority of buildings were identified by all or none of the top five algorithms — very few were identified by only some of the top five.

Over 80% of buildings were identified by either zero or all five competitors in the nadir and off-nadir bins! This means that the algorithms only differed in their ability to identify about 20% of the buildings. Given the substantial difference in neural network architecture (and computing time needed to train and generate predictions from the different algorithms), we found this notable.

Performance vs. building size

The size of building footprints in this dataset varied dramatically. We scored competitors on their ability to identify everything larger than 20 square meters in extent, but did competitors perform equally well through the whole range? The graph below answers that question.

Building recall (left y axis) stratified by building footprint size (x axis). The blue, orange, and green lines represent the fraction of building footprints of a given size that were identified; the red line denotes the number of building footprints of that size in the dataset (right y axis).

Even the best algorithm performed relatively poorly on small buildings. cannab identified only about 20% of buildings smaller than 40 square meters, even in images with look angle under 25 degrees off-nadir. This algorithm achieved its peak performance on buildings over 105 square meters in extent, but this only corresponded to about half of the objects in the dataset. It is notable, though, that this algorithm correctly identified about 90% of buildings with footprints larger than 105 square meters in nadir imagery.

Conclusion

The top five competitors solved this challenge very well, achieving excellent recall with relatively few false positive predictions. Though their neural net architectures varied, their solutions generated strikingly similar predictions, emphasizing that advancements in neural net architectures have diminishing returns for building footprint extraction and similar tasks — developing better loss functions, pre- and post-processing techniques, and optimizing solutions to specific challenges may provide more value. Object size can be a significant limitation for segmentation in overhead imagery, and look angle and direction dramatically alter performance. Finally, much more can be learned from examining the winning competitors’ code on GitHub and their descriptions of their solutions, and we encourage you to explore them further!

What’s next?

We hope you enjoyed learning about off-nadir building footprint extraction from this challenge, and we hope you will explore the dataset for yourselves! There will be more SpaceNet Challenges coming soon — follow us for updates, and thank you for reading.

Model references: