Deep neural network achieves a highly accurate classification of embryo images

We used time-lapse images from 10,148 human embryos, obtained from the Center for Reproductive Medicine at Weill Cornell Medicine, to train and validate our DNN. The 10,148 embryos (WCM-NY dataset) were classified into three major quality groups: good-quality (n = 1345 embryos), fair-quality (n = 4062 embryos), and poor-quality (n = 4741 embryos) (Fig. 2a, b), based on their assigned grades (see Methods). The time-lapse images for each embryo consisted of several time points, with seven focal depths per time point (+45, +30, +15, 0, −15, −30, and −45; Fig. 2a, b) and one 500 × 500 pixel black-and-white image per focal depth. After preprocessing, removal of images with readability issues (e.g., those with a dark background), and random selection of a balanced set of images (see Methods), we were left with a total of 12,001 images from up to seven focal depths: 6000 images from 877 good-quality embryos and 6001 images from 887 poor-quality embryos.

Fig. 2 Embryologists’ evaluation: a Three examples of Veeck and Zaninovic grades and their corresponding quality labels across seven focal depths. b Embryologists evaluate embryo quality using an internal scoring system and subsequently classify the embryos into three major groups (good-quality, fair-quality, poor-quality) based on the pregnancy rate.

We then trained an Inception-V1 DNN–based algorithm using the two quality groups at both ends of the spectrum, i.e., good-quality and poor-quality. We used Inception-V1 in a transfer learning setting, in which we fine-tuned the parameters of all layers. We trained the DNN for 50,000 steps and subsequently evaluated the performance of the resulting model (called STORK) on a randomly selected independent test set with 964 good-quality images from 141 embryos and 966 poor-quality images from 142 embryos. The trained algorithm identified good-quality and poor-quality images with 96.94% accuracy (1871 correct predictions out of 1930 images).

To measure the accuracy of STORK for individual embryos, we used a simple voting system across multiple image focal depths. If the majority of images from the same embryo were predicted to be of good-quality, then the final quality of the embryo was considered good. For a small number of cases in which the number of good-quality and poor-quality images was equal (e.g., three good-quality and three poor-quality for six focal depths), we used STORK’s output probability scores to break the tie. At the embryo level, we obtained 97.53% accuracy with 276 correct predictions out of 283 embryos.
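The embryo-level voting rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the 0.5 decision threshold, and the tie-break rule (comparing the mean good-quality probability with the threshold) are assumptions, since the text states only that the output probability scores were used to break ties.

```python
from typing import Sequence


def embryo_label(good_probs: Sequence[float], threshold: float = 0.5) -> str:
    """Aggregate per-image predictions for one embryo into a single label.

    good_probs holds hypothetical per-image probabilities of the
    good-quality class, one per available focal depth.
    """
    if not good_probs:
        raise ValueError("need at least one image per embryo")

    # Each image votes "good" if its probability reaches the threshold.
    good_votes = sum(p >= threshold for p in good_probs)
    poor_votes = len(good_probs) - good_votes

    if good_votes != poor_votes:
        return "good" if good_votes > poor_votes else "poor"

    # Tie (e.g., three vs. three images): assumed tie-break rule that
    # compares the mean probability with the same threshold.
    mean_prob = sum(good_probs) / len(good_probs)
    return "good" if mean_prob >= threshold else "poor"
```

With seven focal depths a tie cannot occur; the tie-break branch matters only when some images were removed during preprocessing and an even number remains.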

At the image level, we observed an average area under the curve (AUC) of 0.987 (Fig. 3a) on the blind test set. We also found that training an Inception-V1 model without parameter fine-tuning did not affect performance (accuracy; Fig. 3b). This observation is in agreement with previous studies using these deep learning techniques.20,24,25

Fig. 3 Deep neural network results: a Inception-V1 (fine-tuning the parameters for all layers) results for three datasets. b Inception-V1 via two different training methods (fine-tuning the parameters for all layers and training from scratch) in good-quality and poor-quality embryo quality discrimination dataset. WCM-NY: data from the Center for Reproductive Medicine and Infertility at Weill Cornell Medicine of New York; IRDB-IC: data from the Institute of Reproduction and Developmental Biology of Imperial College; Universidad de Valencia: data from the Institute Valenciano de Infertilidad, Universidad de Valencia

When applied to images of the fair-quality embryos (the intermediate group; Figs. 2 and 4; 4480 images from 640 embryos), STORK classified 82% of the embryos as good-quality (526 embryos) and 18% as poor-quality (114 embryos). As Inception-V1 was trained on good-quality and poor-quality classes with different pregnancy probabilities (approximately 58% and 35% chance of pregnancy for the good-quality and poor-quality classes, respectively), we wondered whether STORK nonetheless produced relevant predictions (an association between embryo quality and pregnancy rate) within the fair-quality class. A closer look showed that fair-quality embryos whose images were classified as poor-quality by STORK had a lower likelihood of positive live birth (50.9%) than those classified as good-quality (61.4% positive live birth; p < 0.05 by the two-tailed Fisher’s exact test). Note that STORK alone cannot estimate the pregnancy rate. However, it can detect the association between embryo quality and pregnancy rate based on morphological classification.

Fig. 4 STORK vs. embryologists classification: STORK classifies the fair-quality images into existing good-quality and poor-quality classes. For example, panels “a” and “b” are labeled 3A-B (fair-quality) according to the Veeck and Zaninovic grading system, while STORK classified them as poor-quality and good-quality, respectively. Also, panels “c” and “d” are both labeled 3BB (fair-quality). However, the algorithm correctly classified panel “c” as poor-quality and panel “d” as good-quality. As the figure shows, the outcome in the embryos in “b” and “d” is positive live birth, whereas it is negative live birth in “a” and “c”

In addition, we found that fair-quality embryos predicted to be good-quality by STORK came from younger patients (33.9 years old on average) than those predicted to be poor-quality (34.25 years old on average). Interestingly, these numbers are similar to the age of patients with good-quality and poor-quality embryos: 33.86 and 34.72 years old on average, respectively. This suggests that STORK finds sufficient structure within embryos classified as fair-quality to make clinically relevant predictions (Fig. 4).

STORK is robust when applied to datasets from other clinics

To evaluate STORK’s robustness, we tested its performance on additional datasets of embryo images obtained from two other IVF centers, Universidad de Valencia and IRDB-IC, comprising 127 (74 good-quality, 53 poor-quality) and 87 (61 good-quality, 26 poor-quality) embryos, respectively (Supplementary Table 2). Our experimental results demonstrate that, although the scoring systems used by these centers differ from the system used to train our model, STORK can successfully identify and register score variations and robustly discriminate between them, with an AUC of 0.90 and 0.76 for the IRDB-IC and Universidad de Valencia datasets, respectively (Fig. 3a). The lower concordance of STORK’s classification results for the Universidad de Valencia dataset could be related to the different grading system used by that clinic. The images of the Universidad de Valencia dataset are labeled using Asebir26 while those of IRDB-IC are labeled using the Gardner system.27 The Veeck and Zaninovic grading system is a slightly modified version of the Gardner system (Supplementary Table 4).

STORK outperforms individual embryologists for embryo selection

It is well known that embryo scoring frequently varies among embryologists,28 mainly due to the subjectivity of the scoring process and different interpretations of embryo quality. We, therefore, sought to create a small but robust benchmark embryo dataset that would represent the consensus of several embryologists. We asked five embryologists from three different clinics to provide scores for each of 394 embryos generated in different labs (Supplementary Table 6). Note that these images were not used in the training phase of our algorithm. The embryo images were scored using the Gardner scoring system27 and then mapped onto our simplified three groups (good-quality, fair-quality, and poor-quality; see Supplementary Table 4 for the mapping method).

As expected, we found a low level of agreement among the embryologists (Supplementary Fig. 1b), with only 89 embryos out of the 394 classified as the same quality by all five embryologists (Supplementary Fig. 1a). Therefore, to create a larger and more accurate gold standard dataset, we used an embryologist majority voting procedure (i.e., the quality of each image was determined by the score given by at least three out of the five embryologists) to classify 239 images (32 good-quality and 207 poor-quality).
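The majority-vote labeling can be sketched as below. This is an illustrative assumption about how the rule might be coded, not the authors' procedure: the function name, the label strings, and the use of None for an uncertain or missing score are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(scores: Sequence[Optional[str]],
                   min_votes: int = 3) -> Optional[str]:
    """Return the label given by at least min_votes raters, else None.

    scores holds one quality label per embryologist; None stands for a
    score withheld because of uncertainty (hypothetical encoding).
    """
    counts = Counter(s for s in scores if s is not None)
    if not counts:
        return None
    label, votes = counts.most_common(1)[0]
    # With five raters, at most one label can reach three votes.
    return label if votes >= min_votes else None
```

Under this sketch, an image would enter the benchmark only when the returned label is good or poor; images without a three-vote majority return None and are left out.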

When we applied STORK to these 239 images, we found that it predicted the embryologist majority vote with a precision of 95.7% (Cohen’s kappa = 0.63). In comparison, STORK’s agreement with the five individual embryologists gave Cohen’s kappa scores of 0.69, 0.54, 0.25, 0.62, and 0.54. These results indicate that STORK may outperform individual embryologists when assessing embryo image quality (Fig. 5).
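For readers unfamiliar with the statistic, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance from their label frequencies. The following is a minimal sketch of the standard formula; the function name and the example labels in the note are hypothetical and are not data from this study.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)

    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is 1
    return (observed - expected) / (1.0 - expected)
```

For instance, two raters who agree on three of four items, with chance agreement 0.5, obtain kappa = (0.75 − 0.5) / (1 − 0.5) = 0.5.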

Fig. 5 Assessment comparison of STORK with five embryologists: This circular heatmap shows the predictions of STORK and five embryologists in the labeling of the same images from 394 embryos. STORK outputs good and poor grades. The heatmap compares STORK’s result with the majority vote results from all of the embryologists for 239 embryos in which the majority (i.e., at least three out of five embryologists) gives good or poor. The embryologists assess embryo quality using the Gardner grading system. Then, they convert the grades to three different quality scores, good-quality (orange), fair-quality (gray), and poor-quality (navy), based on the pregnancy rate. For a few embryos, an embryologist used a “?” sign (e.g., 3A?) to indicate low certainty about the exact label (red). The heatmap illustrates the results of STORK, Majority vote, Embryologist-V, Embryologist-IV, Embryologist-III, Embryologist-II, and Embryologist-I from the outer circle to the inner ones. Orange: embryos with good-quality; navy: embryos with poor-quality; gray: embryos with fair-quality; red: embryos that are not labeled due to uncertainty

A decision tree predicts likelihood of successful pregnancy based on embryo quality and clinical parameters

It is known that other factors, besides embryo quality, such as patient age, the patient’s genetic background, clinical diagnosis, and treatment-related characteristics, can affect pregnancy outcome.29,30 As embryo quality is one of the most important of these factors, the ultimate aim of any embryo assessment approach is to identify embryos that have the highest implantation potential resulting in live birth.27,31,32 However, embryo quality alone is not enough to accurately determine the pregnancy probability (see Supplementary Method 1, Supplementary Method 2 and Supplementary Fig. 2).

Therefore, in this section we present an alternative method for predicting successful pregnancy probability based on a state-of-the-art decision tree method that integrates clinical information and embryo quality. We wondered if we could assess the successful pregnancy rate by using a combination of embryo quality and patient age, as age is one of the most important clinical variables. For this purpose, we used a hierarchical decision tree method known as the chi-squared automatic interaction detection (CHAID) algorithm.33

We designed a CHAID34,35 decision tree using 2182 embryos from the WCM-NY database (Supplementary Table 6) with available clinical information and pregnancy outcome results (Fig. 6). We then investigated the interaction between patient age (consisting of seven classes: ≤30, 31–32, 33–34, 35–36, 37–38, 39–40, and ≥41; Supplementary Fig. 3a) and embryo quality (consisting of two classes: good-quality and poor-quality). The fully de-identified data come from a very diverse population of patients (Supplementary Fig. 3b). The effect on live birth outcome is shown in Supplementary Fig. 3c. The CHAID algorithm can reveal interactions between variables and non-linear effects, which are generally missed by traditional statistical techniques. CHAID builds a tree to determine how variables can explain an outcome in a statistically meaningful way.34,35 It uses chi-squared statistics to identify optimal multi-way splits, identifies the set of characteristics (e.g., patient age and embryo quality) that best differentiates individuals based on a categorical outcome (here, live birth), and creates exhaustive and mutually exclusive subgroups of individuals. It chooses the best partition on the basis of statistical significance and uses Bonferroni-adjusted p-values to determine significance, with a predetermined minimum size of end nodes. We used a 1% Bonferroni-adjusted p-value, a maximum tree depth (n = 5), and a minimum size of end nodes (n = 20) as the stopping criteria. Applying a tree-based algorithm to the embryo data helps to define more precisely the effect of patient age and embryo quality (good-quality or poor-quality) on live birth outcome, and to better understand any interactions between these two clinical variables.
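To make the splitting criterion concrete, the sketch below evaluates one candidate binary split with a Pearson chi-squared test, a Bonferroni adjustment, and the minimum node size. It is a simplified illustration under stated assumptions, not the CHAID implementation used in this study: full CHAID also merges categories and allows multi-way splits, the maximum-depth criterion is omitted, the Bonferroni multiplier is taken here simply as a hypothetical number of candidate splits, and all function names are invented for the example.

```python
import math
from typing import Sequence

# Rows: two candidate subgroups; columns: (live birth, no live birth).
Table = Sequence[Sequence[int]]


def chi2_statistic_2x2(table: Table) -> float:
    """Pearson chi-squared statistic for a 2 x 2 contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        return 0.0  # a degenerate table carries no evidence for a split
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_one_df(stat: float) -> float:
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def accept_split(table: Table, n_candidate_splits: int,
                 alpha: float = 0.01, min_node_size: int = 20) -> bool:
    """Accept a split if both child nodes are large enough and the
    Bonferroni-adjusted p-value falls below alpha."""
    if any(sum(row) < min_node_size for row in table):
        return False
    p = p_value_one_df(chi2_statistic_2x2(table))
    adjusted_p = min(1.0, p * n_candidate_splits)
    return adjusted_p < alpha
```

The defaults mirror the stopping criteria stated above (1% adjusted p-value, end nodes of at least 20); any counts passed to these functions would be illustrative rather than taken from the WCM-NY data.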

Fig. 6 Interactions between age and embryo quality: The decision tree shows the interactions between IVF patient age and embryo quality using CHAID

Note that while several other classification algorithms could have been employed for the prediction, CHAID enabled a user-friendly visualization of the resulting decision tree.36,37

As Fig. 6 shows, patients were automatically classified into three age groups: (i) ≤36, (ii) 37 and 38, and (iii) ≥39 years old due to age data distribution. For each age group, embryos were classified in good- and poor-quality groups (Supplementary Fig. 3c).

The results confirm the association between probability of successful pregnancy and patient age. The live birth probability for patients with good-quality embryos is significantly (1% Bonferroni-adjusted p-value) higher than that for patients with poor-quality embryos across different ages. Figure 6 indicates that patients ≤36 years old have a higher successful pregnancy rate compared to patients in the other two age groups. The CHAID decision tree analysis also indicates that the chance of favorable outcome using IVF varies from 13.8% (e.g., when the embryo is of poor-quality as assessed by STORK and the patient is ≥41 years old) to 66.3% (e.g., when the embryo is of good-quality and the patient is <37 years old) (Fig. 6).

Probability analysis optimizes embryo selection and maximizes likelihood of single pregnancy

It is common practice in IVF clinics to select and transfer more than one embryo in order to increase the chance of a successful pregnancy, as the success rate of an individual embryo is typically <50% and transferring two or more embryos can increase the success probability. However, when the number of transferred embryos increases, the chance of multiple pregnancies (twins or even triplets) and associated complications also increases. For example, assume that we transfer three embryos, each with an independent success probability of 1/4. The chance of at least one pregnancy is then p = 1 − (3/4)³ ≈ 0.58. However, in this scenario, the chances of twin and triplet pregnancies would be 3 × (1/4)² × (3/4) ≈ 0.14 and (1/4)³ ≈ 0.02, respectively. The chance of a single pregnancy would be 3 × (1/4) × (3/4)² ≈ 0.42.
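The arithmetic in this example follows the binomial distribution and can be checked with a short sketch. It assumes, as the example does, that every transferred embryo has the same success probability and that outcomes are independent; the function name is hypothetical.

```python
from math import comb
from typing import List


def pregnancy_count_probs(n_embryos: int, p_success: float) -> List[float]:
    """Probability of exactly k successful implantations for k = 0..n_embryos,
    assuming independent embryos with equal success probability."""
    return [
        comb(n_embryos, k) * p_success ** k * (1 - p_success) ** (n_embryos - k)
        for k in range(n_embryos + 1)
    ]


probs = pregnancy_count_probs(3, 0.25)
any_pregnancy = 1 - probs[0]                    # at least one success
single, twins, triplets = probs[1], probs[2], probs[3]
```

For three embryos with p = 1/4 the list is [27/64, 27/64, 9/64, 1/64], which matches the rounded values 0.42 (single), 0.14 (twins), and 0.02 (triplets), and gives 37/64 ≈ 0.58 for at least one pregnancy.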