OCT data

The dataset used consists of spectral domain OCT (SD-OCT) scans from a longitudinal study that has been described in detail in a number of previous publications5,6. In this study, OCT scans were collected from 101 children at four different visits over an 18-month period. Approval from the Queensland University of Technology human research ethics committee was obtained before the study, and written informed consent was provided by all participating children and their parents. All participants were treated in accordance with the tenets of the Declaration of Helsinki. At each visit, two sets of six foveal centred radial chorio-retinal scans were taken on each subject, however, only the data from the first visit is used in this paper. The scans were acquired using the Heidelberg Spectralis (Heidelberg Engineering, Heidelberg, Germany) SD-OCT instrument using the enhanced depth imaging mode. To improve the signal to noise ratio, automatic real time tracking was used with 30 frames averaged for each scan. The acquired images each measure 1536 × 496 pixels (width x height). With a vertical scale of 3.9 µm per pixel and a horizontal scale of 5.7 µm per pixel which corresponds to an approximate physical area of 8.8 × 1.9 mm. These images were exported as bmp (lossless) images with other related data stored in an accompanying xml file, and subsequently analysed using custom software where an automated graph based method was used to segment three layer boundaries for each image. This segmented data was then assessed by an expert human observer who manually corrected any segmentation errors. The three layer boundaries within the labelled data include the outer boundary of the retinal pigment epithelium (RPE), the inner boundary of the inner limiting membrane (ILM), and the CSI. An example of the positions of these boundaries is shown in Fig. 1.

Figure 1 Illustration of the steps involved in each of the two deep learning methods (patch-based and fully-convolutional) proposed in this work for segmentation of the retina and choroid. Where applicable the ILM is marked in red, the RPE in green and the CSI in blue. Full size image

For computational reasons, only a subset of the dataset described above is utilised here. This consists of a single set of scans (six scans) for 99 participants from their first visit only. These participants are randomly divided into two sets; set A for neural network training and validation (50 participants, 300 B-scans in total) and set B for evaluation (49 participants, 294 B-scans in total). Within set A, an 80/20 split is used for training (40 participants, 240 B-scans) and validation (10 participants, 60 B-scans) with participants selected randomly for each. There is no overlap of participants between the training and validation sets or between sets A and B. Henceforth; ‘A-scan’ refers to a single-column of an OCT image while ‘B-scan’ refers to a full-size OCT image.

Overview

The deep learning automatic segmentation methods considered in this work are comprised of two main types: patch-based and semantic segmentation. Each method involves a number of steps. Firstly, a set of OCT scans (set A) is used to train a neural network for patch classification (patch-based method) or for area segmentation on full-size B-scans (semantic segmentation method). Next, a second set of OCT scans is used to evaluate the network (set B). For each scan in set B, per-boundary probability maps are constructed by classifying each pixel in the scan (patch-based method) or segmenting the scan and then applying the Sobel filter (semantic segmentation method). In both cases, each probability map is then used to construct a graph, and a boundary position prediction is obtained by performing a shortest-path graph search. The following sections provide greater detail of the two methods while Fig. 1 illustrates the various steps involved in each. Some of the patch based methods have been presented elsewhere32. The software environment used throughout this work consists of Keras 2.2.446 using Tensorflow47 (GPU) 1.8.0 backend in Python 3.6.4. For the purposes of evaluating the speed of each method an identical hardware and software setup is used. Here, the hardware consists of an Intel Xeon W-2125 CPU, Nvidia GeForce GTX 1080Ti GPU, Samsung SM961 SSD and 32GB 2400 MHz DDR4 ECC RAM.

Patch-based networks

Convolutional neural network (CNN) architecture

Convolutional neural networks (CNNs) have had considerable use and demonstrated success for a range of image classification48, and segmentation tasks49. CNNs consist of a number of different layers with a set of parameters associated with each layer. Convolutional layers take a number of equal sized kernels (filters) which are convolved with the input and stacked together to produce an output. The parameters include: the kernel size (height × width), the stride lengths (vertical, horizontal), the quantity of zero-padding (top, bottom, left, right) applied to the input, and the number of kernels. Pooling layers takes a single window sliding step-by-step over the input. At each step, an operation is performed to pool the input to a smaller size. Such operations that are commonly used include max pooling (where the maximum value is taken from within the window), and average pooling (where the average of the values is taken). The parameters of this layer include: the window size (height × width), the stride (step) lengths (vertical, horizontal), the quantity of zero-padding applied to the input (top, bottom, left, right) and the pooling operation (max or average). Activation layers are used to introduce non-linearity into neural networks where the rectified linear unit (ReLU)50 is a common choice for CNNs and has been shown to outperform other variants such as tanh and sigmoid51. Fully-connected (FC) layers are equivalent to convolutional layers where the kernel size is equal to the spatial size of the input and there is no zero-padding applied to the input. Two CNNs with a variety of different patch sizes and complexity are used within this work with the architectures listed in Supplementary Table S1. These include: the Cifar CNN (CNN 1) introduced by Fang et al.20, and the Complex CNN (CNN 2) presented by Hamwood et al.30, with variants for a range of patch sizes. Dropout for regularisation has not been used for the CNNs in this work, consistent with previous approaches20,30.

Recurrent neural network (RNN) architecture

Recurrent neural networks (RNNs) have been widely applied to, and have shown to be useful for, problems involving sequential data such as speech recognition52,53, and handwriting recognition54. However, there are just a handful of examples of their application to images. To perform OCT image classification using a recurrent neural network, the architecture to be used here is that introduced by Kugelman et al.31. This network, partially inspired by the ReNet architecture55, possesses a number of parameters associated with each layer including: the direction of operation (vertical or horizontal), number of passes (1: unidirectional, 2: bidirectional), number of filters, dropout percentage and receptive field size (height, width). The size of the receptive field represents the size of the region of the input which is processed by the RNN at each step. The direction of operation corresponds to whether the RNN will process each row of a column (vertical) or each column of a row (horizontal) before moving to the next column or row respectively. A unidirectional layer will pass over the input only in a single direction (left to right or top to bottom) whereas a bidirectional layer will additionally pass over the input in the opposite direction (right to left or bottom to top) with the outputs for each pass concatenated along the feature axis. The number of filters in each layer indicates the depth of the output, with the addition of more filters enabling the network to learn an increased number of patterns from the input. The dropout percentage56 corresponds to the number of units within a layer that are randomly turned off at each epoch. The RNN architecture used within this work is described in Supplementary Table S2.

Training

The Cifar CNN, Complex CNN and RNN networks are trained to perform classification using specific sized (height × width pixels) patches of the OCT images. Here, each patch is assigned to a class based on the layer boundary that it is centred upon, with classes constructed for each of the three layer boundaries of interest (ILM, RPE and CSI) as well as an additional background class (BG) for patches that are not centred upon any of the three layer boundaries. This is a similar procedure to that used in previous work20,30. In their work, Fang et al.20, utilised 33 × 33 patches while Hamwood et al.30, extended upon this and, using 33 × 33 and 65 × 65 patch sizes, showed that utilising a larger patch size can improve performance. Kugelman et al.31 also experimented with the patch size using 32 × 32 and 64 × 64 patch sizes as well as 64 × 32 and 32 × 64 sized rectangular patches. Of their tested sizes, the vertically oriented patch size (64 × 32) provided the best trade-off between accuracy and complexity in the context of retinal segmentation using RNNs. With this in mind, to assess the effect on choroidal segmentation, patches of various sizes including 32 × 32, 64 × 32, 64 × 64 and 128 × 32 (height × width pixels) are utilised with layer boundaries centred one pixel above and to the left of the central point.

Patches are constructed for training (~1,200,000 patches) and validation (~300,000 patches) from the data in set A. In each scan, three boundary patches and one random background patch are sampled from each column ensuring equally balanced classes. However, patches are only created within a cropped region of each scan (approximately 100 pixels from the left to 250 pixels from the right) due to the lack of true boundary locations present as a result of the optic nerve head as well as shadowing within this region for some scans. The Adam algorithm57 with default parameters (\(\alpha =0.001,\,{\beta }_{1}=0.9,\,{\beta }_{2}=0.999,\,{\epsilon }=1\times {10}^{-8})\) is used for training to minimise cross-entropy loss with each network trained until convergence is observed with respect to the validation loss. No early-stopping is employed. Here, convergence is determined based on the inspection of the validation losses. No transfer learning is performed. Instead, each network is trained from scratch with weights initialised using small random values. Afterwards, the model with the highest validation accuracy (percentage of patches correctly classified) is chosen for evaluation. It should also be noted that no learning rate schedule is used.

Semantic segmentation networks

Architecture

Semantic segmentation network architectures have evolved over time with a number of modifications proposed. Supplementary Table S3 summarises some of the key features presented, which are used to inform the choice of network architectures in this study. Building upon previous work58,59 in the area of semantic segmentation using fully-convolutional neural networks, the U-Net60 was proposed for biomedical image segmentation. Architectures based on the U-Net have been used previously for OCT retinal segmentation22,23,31, and as such, a similar standard U-net architecture (referred to as ‘Standard’) will be used in this work, along with a number of modified variants to assess the potential for performance improvement in choroidal segmentation. These modifications include the incorporation of residual learning61,62,63,64 (referred to as ‘Residual’), the replacement of the bottleneck with RNN layers65 (referred to as ‘RNN bottleneck’), and the addition of squeeze-excitation blocks66,67,68 (referred to as ‘Squeeze + Excitation’). Additionally, the combination of all three modifications is also considered (referred to as ‘Combined’). There are three squeeze and excitation block variants considered: spatial squeeze and channel excitation (cSE), channel squeeze and spatial excitation (sSE) and concurrent spatial and channel squeeze and channel excitation (scSE). Note that the ‘Combined’ network utilises the ‘scSE’ squeeze and excitation block variant. An illustration of each architecture used is provided in Fig. 2. Note that, in each network, convolutional layers incorporate zero-padding such that the input and output of each are the same size and no cropping is required. Batch-normalization69, is utilised at the input to each rectified linear unit in an effort to enhance training performance. A dropout of 50%56, is used at the output of the bottleneck of the network for regularisation. Each network used consists of four pooling layers and four up sampling layers. The first layer contains eight filters with this number doubled at each subsequent pooling layer and halved in a similar fashion for each up sampling layer.

Figure 2 Illustration of the various network architectures used for the semantic segmentation method in this work. Due to space constraints, illustrated networks are shown with just two pooling layers, however this is by no means a restriction on the architectures. Note that the specific implementation of the squeeze-excite block may vary (one of cSE, sSE, scSE). Full size image

Training

Each of the networks illustrated in Fig. 2 and described above are trained to perform semantic segmentation on full-size OCT images. To do this, a network is tasked with classifying each pixel in an image into one of four area classes. These area classes are defined as the vitreous (top of the image to ILM), retina (ILM to RPE), choroid (RPE to CSI) and sclera (CSI to bottom of the image). Therefore, each image has an associated area mask which is the target output for the FCNs. As described in set A in the data, 240 full-size OCT images are used for training while a separate 60 images are used for validation. For each column where at least one true boundary location is not present in the data (normally associated with shadows at the edge of some images), the corresponding column in the area mask is set to be the top area class (vitreous) and the same column in the image is zeroed. Due to the relatively small number of images, the data was augmented using horizontal flips (left to right/right to left). For each epoch, each image was randomly flipped horizontally with a 50% chance.

The Adam algorithm57, with default parameters \((\alpha =0.001,\,{\beta }_{1}=0.9,\,{\beta }_{2}=0.999,\,{\epsilon }=1\times {10}^{-8})\) is used for training to minimise the sum of cross-entropy loss and Dice overlap loss70. This loss combination is similar to that used in previous work22, although no additional weighting scheme is employed here. Each network is trained until convergence is observed with respect to the validation loss while the epoch with model with the highest validation accuracy (Dice overlap percentage) is chosen for evaluation. No early-stopping is employed, with convergence determined based on the inspection of the validation losses. No transfer learning is performed and no learning rate schedule is used. Instead, each network is trained from scratch with weights initialised with small random values.

Image pre-processing

The choroid is a vascular layer of the eye. Its vascular nature, combined with the fact that is located behind a hyper-reflective layer (RPE), means that the contrast and visibility of the posterior boundary tends to be weak. The use of OCT image contrast enhancement techniques71, also known as attenuation coefficients72, was therefore considered in this work since it may improve the visibility of the boundaries, especially for the CSI, and also reduces the effect of shadows caused by the retinal blood vessels. This method has been used previously for improving visibility of the CSI73. The technique works under the assumption that local backscattering can be related to that of the corresponding attenuation, and therefore can be compensated. In this work the effect of the attenuation compensation was tested with two different network-input options; the standard OCT intensity image and the contrast enhanced (attenuation coefficient) equivalent.

Boundary prediction and model evaluation

Given a scan and a trained network, probability maps for each of the boundaries can be calculated. For a patch-based method the probability maps are obtained by classifying patches centred on each pixel in the scan20. For a fully-convolutional method, the boundary probability maps are acquired by applying the Sobel filter to the area probability output of the FCN37. In both cases, the boundary positions may then be delineated by performing a graph search using Dijkstra’s shortest path algorithm74, where each pixel in the probability map corresponds to a vertex in the graph. This is inspired by the approach originally used by Chiu et al.33. Directed edges associated with each vertex are connected to neighbouring vertices to the immediate right (horizontally, diagonally above and diagonally below). To remove the need for manual start and end point initialisation, columns of maximum probability vertices, connected top to bottom, are appended to each end of the graph, with additional left to right connections made to the existing graph as required. The edge weights between each pair of vertices are determined by the respective probabilities and are given by Eq. (1).:

$$\begin{array}{c}{{\rm{w}}}_{{\rm{sd}}}=2-({{\rm{P}}}_{{\rm{s}}}+{{\rm{P}}}_{{\rm{d}}})+{{\rm{w}}}_{{\rm{\min }}}\end{array}$$ (1)

where P s and P d are the probabilities (0–1) of the source and destination vertices respectively, and \({{\rm{w}}}_{{\rm{\min }}}={1\times \mathrm{10}}^{-{\rm{5}}}\) is a small positive number added for system stability.

This step is performed using all scans in set B. To evaluate the performance, the delineated boundary positions for each image were compared to the true positions (the boundary position from manual segmentation of an expert human observer), from which the Dice overlap percentage is calculated for the four regions of interest, including the vitreous, retina, choroid, and sclera, as well as the mean pixel error and mean absolute pixel error (for the ILM, RPE and CSI) for each scan. Because the patch-based networks do not output area maps, Dice values cannot be calculated directly from the network output. Due to this and for the purposes of consistency between the methods, all Dice overlap values are calculated post-segmentation. Note that these values will be greater than Dice values obtained directly from the network output (in the semantic segmentation case) for cases where misclassifications do not affect the boundary errors.

In an effort to obtain a fair indication of the performance of the models, the full-width scans are used for input to the networks with a graph search performed on the corresponding full-size probability map. However, final error calculations and comparisons are only performed on a cropped region of all scans (approximately 100 pixels from the left and 250 pixels from the right) due to the presence of artefacts with this region (i.e. optic nerve head and shadows).