A human heart is an astounding machine that is designed to continually function for up to a century without failure. One of the key ways to measure how well your heart is functioning is to compute its ejection fraction: after your heart relaxes at its diastole to fully fill with blood, what percentage does it pump out upon contracting to its systole? The first step toward computing this metric is segmenting (delineating the area of) the ventricles in cardiac images.

During my time at the Insight AI Program in NYC, I decided to tackle the right ventricle segmentation challenge from the calls for research hosted by the AI Open Network. I managed to achieve state of the art results with over an order of magnitude fewer parameters; below is a brief account of how.

Problem description

From the call for research:

Develop a system capable of automatic segmentation of the right ventricle in images from cardiac magnetic resonance imaging (MRI) datasets. Until now, this has been mostly handled by classical image processing methods. Modern deep learning techniques have the potential to provide a more reliable, fully-automated solution.

All three winners of the left ventricle segmentation challenge sponsored by Kaggle in 2016 were deep learning solutions. However, segmenting the right ventricle (RV) is more challenging, because of:

[the] presence of trabeculations in the cavity with signal intensities similar to that of the myocardium; the complex crescent shape of the RV, which varies from the base to the apex; difficulty in segmenting the apical image slices; considerable variability in shape and intensity of the chamber among subjects, notably in pathological cases, etc.

Medical jargon aside, it’s simply more difficult to identify the RV. The left ventricle is a thick-walled circle while the right ventricle is an irregularly shaped object with thin walls that sometimes blends in with the surrounding tissue. Here are the manually drawn contours for the inner and outer walls (endocardium and epicardium) of the right ventricle in an MRI snapshot:

That was an easy example. This one is more difficult:

And this one is downright challenging to the untrained eye:

Human physicians in fact take twice as long to determine the RV volume, and produce results with 2–3 times the variability, compared to the left ventricle [1]. The goal of this work is to build a deep learning model that automates right ventricle segmentation with high accuracy. The output of the model is a segmentation mask: a pixel-by-pixel mask that indicates whether each pixel is part of the right ventricle or the background.

The dataset

The biggest challenge facing a deep learning approach to this problem is the small size of the dataset. The dataset (accessible here) contains only 243 physician-segmented images like those shown above, drawn from the MRIs of 16 patients. There are 3697 additional unlabeled images, which may be useful for unsupervised or semi-supervised techniques, but I set them aside since I treated segmentation as a supervised learning problem. The images are 216×256 pixels in size.

Given the small dataset, one would suspect generalization to unseen images would be hopeless! This unfortunately is the typical situation in medical settings where labeled data is expensive and hard to come by. The standard procedure is to apply affine transformations to the data: random rotations, translations, zooms and shears. In addition, I implemented elastic deformations, which locally stretch and compress the image [2].

The goal of such augmentations is to prevent the network from memorizing just the training examples, and to force it to learn that the RV is a solid, crescent-shaped object that can appear in a variety of orientations. In my training framework, I apply the transformations on the fly so the network sees new random transformations during each epoch.
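As a sketch of the elastic deformation step, here is one common recipe (following Simard et al. [2]): draw a random displacement field, smooth it with a Gaussian filter so that neighboring pixels move together, and resample the image along the displaced coordinates. The function name and the `alpha`/`sigma` values below are illustrative choices, not the settings used in training.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, alpha=34.0, sigma=4.0, rng=None):
    """Warp a 2-D image with a smoothed random displacement field."""
    rng = rng if rng is not None else np.random.default_rng()
    h, w = image.shape
    # Smoothing the random fields makes nearby pixels move coherently
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    y, x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Bilinear resampling at the displaced coordinates
    return map_coordinates(image, [y + dy, x + dx], order=1, mode="reflect")

image = np.random.rand(216, 256)   # dummy image at the dataset's resolution
warped = elastic_deform(image)
```

For on-the-fly augmentation, the same displacement field would be applied to both the image and its segmentation mask so the two stay aligned.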

As is also common in medical imaging, there is a large class imbalance, since most of the pixels are background. Normalizing the pixel intensities to lie between 0 and 1, we find that across the entire dataset, only 5% of the pixels are part of the RV cavity.

In constructing the loss functions, I experimented with reweighting schemes to balance the class distributions, but ultimately found that the unweighted average performed best.

During training, 20% of the images were split out as a validation set. The organizers of the RV segmentation challenge have a separate test set consisting of another 514 MRI images derived from a separate set of 32 patients, for which I submitted predicted contours for final evaluation.

Also needed is a way to quantify model performance on the dataset. The organizers of the segmentation challenge chose the dice coefficient. The model outputs a mask X delineating what it thinks is the RV, and the dice coefficient compares it to the mask Y produced by a physician via:

dice(X, Y) = 2 |X ∩ Y| / (|X| + |Y|)

The metric is (twice) the ratio of the area of the intersection to the sum of the two areas. It is 0 for disjoint masks, and 1 for perfect agreement. Throughout, model performance is written as, e.g., 0.82 (0.23), where the parentheses contain the standard deviation across images.
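A minimal numpy implementation of the metric might look like this (the `smooth` term is my own addition to avoid division by zero when both masks are empty):

```python
import numpy as np

def dice(x, y, smooth=1e-8):
    """Dice coefficient between two binary masks."""
    x, y = x.astype(bool), y.astype(bool)
    intersection = np.logical_and(x, y).sum()
    return 2.0 * intersection / (x.sum() + y.sum() + smooth)

a = np.zeros((216, 256), dtype=bool)
b = np.zeros((216, 256), dtype=bool)
a[50:100, 60:120] = True   # toy "predicted" mask
b[50:100, 60:120] = True   # toy "physician" mask, identical here
```

With identical masks `dice(a, b)` is 1 (up to the smoothing term); shifting one rectangle until the two no longer overlap drives it to 0.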

Let’s look at model architectures.

U-net: the baseline model

Since we had only a four-week timeframe to complete our projects at Insight, I wanted to get a baseline model up and running as quickly as possible. I chose to implement a u-net model, proposed by Ronneberger, Fischer and Brox [3], since it had been quite successful in biomedical segmentation tasks. U-net models are promising, as the authors were able to train their network with only 30 images by using aggressive image augmentation combined with pixel-wise reweighting. (Interested readers: here are reviews for CNN [4] and conventional [5] approaches.)

The u-net architecture consists of a contracting path, which collapses an image down into a set of high level features, followed by an expanding path which uses the feature information to construct a pixel-wise segmentation mask. The unique aspect of the u-net is its “copy and concatenate” connections, which pass information from early feature maps to the later portions of the network tasked with constructing the segmentation mask. The authors propose that these connections allow the network to incorporate high level features and pixel-wise detail simultaneously.
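The “copy and concatenate” idea can be sketched with plain numpy arrays standing in for channels-last feature maps; this is a toy illustration of the data flow, not the actual Keras model:

```python
import numpy as np

def downsample(x):
    """2x2 max pooling on a channels-last feature map (contracting path)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample(x):
    """Nearest-neighbor 2x upsampling (expanding path)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

early = np.random.rand(8, 8, 4)                # an early feature map
deep = downsample(early)                       # contracting path: (4, 4, 4)
up = upsample(deep)                            # expanding path:   (8, 8, 4)
merged = np.concatenate([early, up], axis=-1)  # skip connection:  (8, 8, 8)
```

The concatenation gives the expanding path access to both the coarse features from the bottleneck and the fine pixel-level detail of the early layer.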

The architecture we used is shown here:

We adapted the u-net to our purposes by reducing the number of downsampling layers in the original model from four to three, since our images were roughly half the size of those considered by the u-net authors. We also zero-pad our convolutions (rather than using unpadded convolutions) to keep the feature maps the same size as the input. The model was implemented in Keras.

Without image augmentation, the u-net reaches a dice coefficient of 0.99 (0.01) on the training dataset, which means the model has sufficient capacity to capture the complexity of the RV segmentation problem. However, the validation dice score is 0.79 (0.24), so the u-net is overfitting pretty strongly. Image augmentation improves generalization and raises the validation accuracy to 0.82 (0.23), at the cost of decreasing the training accuracy to 0.91 (0.06).

How can we further reduce the training-validation gap? As Andrew Ng describes in this excellent talk, we can get more data (not possible), regularize (dropout and batch normalization did not help), or try a new model architecture.

Dilated u-nets: global receptive fields

Segmenting organs requires some knowledge of global context: how organs are arranged relative to one another. It turned out that the neurons in even the deepest part of the u-net had receptive fields spanning only 68×68 pixels. No part of the network could “see” the entire image and integrate global context when producing the segmentation mask, so the network had no way of knowing that a human has only one right ventricle. For example, it misclassifies the blob marked with an arrow in the following image:

Rather than adding two more downsampling layers at the cost of a huge increase in network parameters, I used dilated convolutions [6] to increase the receptive fields of the network.

Dilated convolutions space out the pixels summed over in the convolution by a dilation factor. In the diagram above, the convolutions in the bottom layer are regular 3×3 convolutions. The next layer up, we have dilated the convolutions by a factor of 2, so their effective receptive field in the original image is 7×7. The top layer convolutions are dilated by 4, producing 15×15 receptive fields. Dilated convolutions produce exponentially expanding receptive fields with depth, in contrast to linear expansion for stacked conventional convolutions.
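The exponential growth is easy to check with a few lines of arithmetic. A stack of stride-1, 3×3 convolutions with dilations d₁, d₂, … has a receptive field of 1 + Σ 2dᵢ:

```python
def receptive_field(dilations, kernel=3):
    """Receptive field of stacked stride-1 convolutions with given dilations."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

receptive_field([1])        # 3:  a single ordinary 3x3 convolution
receptive_field([1, 2])     # 7:  matches the 7x7 field in the diagram
receptive_field([1, 2, 4])  # 15: matches the 15x15 field in the diagram
```

Doubling the dilation at each of 8 layers (1, 2, 4, …, 128) yields a 511-pixel receptive field, more than enough to cover a 256×256 image; 8 conventional 3×3 layers would reach only 17 pixels.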

Schematically, the convolutional layers producing the feature maps marked in yellow are replaced with dilated convolutions in the u-net. The innermost neurons now have receptive fields spanning the entire input image. I call this a “dilated u-net”.

Quantitatively, the dilated u-net does improve performance, reaching a validation dice score of 0.85 (0.19), while maintaining training performance at 0.92 (0.08)!

Dilated densenets: multiple scales at once

Loosely inspired by tensor networks used in physics, I decided to experiment with a novel architecture for image segmentation, which I will call a “dilated densenet”. It combines two ideas, dilated convolutions and densenets [7], to drastically reduce network depth and parameters.

For segmentation tasks, we need both global context and information from multiple scales to produce the pixel-wise mask. What if we relied entirely on dilated convolutions to generate global context, rather than downsampling to “smash” the image down to a small height and width? Now that the convolutional layers all have the same size, we can apply the key idea of the densenet architecture and use “copy and concatenate” connections between all the layers. The result is a dilated densenet:

In densenets, the output of the first convolutional layer is fed as input into all subsequent layers, and similarly with the second, third, and so forth. The authors show that densenets have several advantages:

they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters.

At publication, densenets had surpassed the state of the art on the CIFAR and ImageNet classification benchmarks. However, densenets have a serious drawback: they are extremely memory intensive, since each layer takes all earlier feature maps as input, so the number of feature-map connections grows quadratically with network depth. The authors used “transition layers” to cut down on the number of feature maps midway through the network in order to train their 40-, 100- and 250-layer densenets.

Dilated convolutions eliminate the need for such deep networks: the exponentially expanding receptive fields mean only 8 layers are needed to “see” an entire 256×256 image. In the final convolutional layer of a dilated densenet, the neurons have access to global context as well as features produced at every prior scale in the network. In our work, we use an 8-layer dilated densenet with a growth rate of 24. Here’s the astounding aspect: the dilated densenet is extremely parameter efficient. Our final model uses only 190K parameters, a point we’ll come back to when discussing results.
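A back-of-the-envelope count shows why the parameter budget is so small. Under simplifying assumptions of my own (3×3 convolutions on a single-channel input, no batch norm, a 1×1 convolution producing the final mask), an 8-layer dilated densenet with growth rate 24 comes out in the same ballpark as the 190K figure above; the exact number depends on details like normalization layers:

```python
growth, layers, kernel, in_channels = 24, 8, 3, 1

total = 0
channels = in_channels
for _ in range(layers):
    # dense connectivity: each layer sees the input plus all earlier outputs
    total += kernel * kernel * channels * growth + growth  # weights + biases
    channels += growth  # concatenation adds `growth` feature maps per layer

# final 1x1 convolution from all concatenated features to a 1-channel mask
total += channels * 1 + 1
print(total)
```

Even though the input channel count grows linearly with depth, eight layers of dense 3×3 convolutions still total well under 200K parameters, versus millions for a typical downsampling network.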

The dilated densenets do well, achieving a dice score of 0.87 (0.15) on the validation set, with a training accuracy of 0.91 (0.10), while remaining extremely parameter efficient!

Results

Having an estimate of human performance provides a roadmap for evaluating model performance. Researchers estimate that humans achieve dice scores of 0.90 (0.10) on RV segmentation tasks [8]. The leading published model is a fully convolutional network (FCN) by Tran [9], with a dice score of 0.84 (0.21) on the test set.

The models I developed had surpassed the state of the art on the validation set, and were approaching human performance! However, the real benchmark was an evaluation of their performance on the held-out test set. Also, the numbers quoted above were for the endocardium — what would the performance be on the epicardium? I trained a separate model on the epicardium, submitted my segmentation contours to the organizers, and hoped for the best.

Here are the results, first for the endocardium (bolded numbers are state of the art):

and for the epicardium:

The dilated u-net matches the state of the art on the endocardium, and exceeds it on the epicardium. The dilated densenet is close behind, with only 190K parameters. That’s a 60× reduction in model size from the published benchmark!

Interested readers can find the details of the training methodology, learning curves, analysis of edge cases, as well as the loss functions, regularization schemes and hyperparameters considered, in the original post located here.

Summary and future directions

The performance of deep learning models can sometimes seem magical, but they are the result of careful engineering. Even in regimes with small datasets, well-selected data augmentation schemes allow deep learning models to generalize well. Reasoning through how data flows through a model leads to architectures well-matched to the problem domain.

Following these ideas, I was able to create models that achieve state of the art for segmenting the right ventricle in cardiac MRIs. I’m especially excited to see how dilated densenets will perform on other image segmentation benchmarks and explore permutations of its architecture.

I’ll end with some thoughts for the future: