Microscopy is a central method in life sciences. Many popular methods, such as antibody labeling, are used to add physical fluorescent labels to specific cellular constituents. However, these approaches have significant drawbacks, including inconsistency; limitations in the number of simultaneous labels because of spectral overlap; and necessary perturbations of the experiment, such as fixing the cells, to generate the measurement. Here, we show that a computational machine-learning approach, which we call “in silico labeling” (ISL), reliably predicts some fluorescent labels from transmitted-light images of unlabeled fixed or live biological samples. ISL predicts a range of labels, such as those for nuclei, cell type (e.g., neural), and cell state (e.g., cell death). Because prediction happens in silico, the method is consistent, is not limited by spectral overlap, and does not disturb the experiment. ISL generates biological measurements that would otherwise be problematic or impossible to acquire.

Here, we sought to determine whether computers can find and predict features in unlabeled images that normally become visible only with invasive labeling. We designed a deep neural network and trained it on paired sets of unlabeled and labeled images. Using additional unlabeled images of fixed or live cells never seen by the network, we show that it can accurately predict the location and texture of cell nuclei, the health of a cell, the type of cell in a mixture, and the type of subcellular structure. We also show that the trained network exhibits transfer learning: once trained to predict a set of labels, it could learn new labels with a small amount of additional data, resulting in a highly generalizable algorithm that is adaptable across experiments.

We hypothesized that microscopic images of unlabeled cells contain more information than is readily apparent, information that traditionally requires immunohistochemistry to reveal. To test this, we leveraged major advances in deep learning (DL), a type of machine learning that has produced deep neural networks capable of superhuman performance on specialized tasks (). Prior work using deep learning to analyze microscopy images has been limited, often relying on known cell locations () or the imposition of special and somewhat artificial sample preparation procedures, such as the requirement for low plating density (). As such, it has been unclear whether deep learning approaches can provide a significant and broad-based advance in image analysis and whether they are capable of extracting useful, not readily apparent, information from unlabeled images.

Nevertheless, fluorescence labeling has limitations. Specificity varies; labeling is time consuming; specialized reagents are required; labeling protocols can kill cells; and even live cell protocols can be phototoxic. The reagents used for immunocytochemistry commonly produce non-specific signals because of antibody cross-reactivity, have significant batch-to-batch variability, and have limited time windows for image acquisition in which they maintain signal. Lastly, measuring the label requires an optical system that can reliably distinguish it from other signals in the sample while coping with fluorophore bleaching.

Microscopy offers a uniquely powerful way to observe cells and molecules across time and space. However, visualizing cellular structure is challenging, as biological samples are mostly water and are poorly refractile. Optical and electronic techniques amplify contrast and make small signals visible to the human eye, but resolving certain structural features or functional characteristics requires different techniques. In particular, fluorescence labeling with dyes or dye-conjugated antibodies provides unprecedented opportunities to reveal macromolecular structures, metabolites, and other subcellular constituents.

Does the network require large training datasets to learn to predict new things? Or does the generic model represented by a trained network enable it to learn new relationships in different datasets more quickly or with less training data than an untrained network? To address these questions, we used transfer learning to learn a label from a single well, demonstrating that the network can share learned features across tasks. To further emulate the experience of a new practitioner adapting this technique to their research, we chose data with a new label, from a different cell type, imaged with a different transmitted-light technology, and produced by a laboratory other than those that provided the previous training data. In condition E, differential interference contrast imaging was used to collect transmitted-light data from unlabeled cancer cells, and CellMask, a membrane label, was used to collect foreground data ( Table 1 ). With only the 1,100 × 1,100 μm center of the single training well, regularized by simultaneously training on conditions A, B, C, and D, the network learned to predict cell foreground with a Pearson ρ of 0.95 ( Figures S1 and S5 ). Though that metric was computed on a single test well, the test images of the well contain 12 million pixels each and hundreds of cells. This suggests that the generic model represented by the trained network could continue to improve with additional training examples and could increase the ability and speed with which it learns to perform new tasks.
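The mechanics of this kind of transfer are straightforward to sketch. The code below assumes a Keras-style functional model with a shared trunk and one convolutional head per fluorescent label; `shared_trunk`, `old_heads`, and the head name are illustrative assumptions, not the paper's code.

```python
import tensorflow as tf

# A minimal sketch, assuming a functional Keras model with a shared trunk and
# per-label heads. The new 1x1-convolution head emits 256 logits per pixel,
# matching the 256-way intensity distribution predicted for each output pixel.

def add_cellmask_head(shared_trunk, old_heads):
    new_head = tf.keras.layers.Conv2D(256, 1, name="cellmask_head")
    features = shared_trunk.output
    outputs = [head(features) for head in old_heads] + [new_head(features)]
    # Training this model on mixed batches from conditions A-E updates the
    # shared trunk on all tasks, regularizing the data-poor new head.
    return tf.keras.Model(shared_trunk.input, outputs)
```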

(B) Pixel intensity heatmaps and the calculated Pearson ρ coefficient for the correlation between the pixel intensities of the actual and predicted label. Although very good, the predictions have visual artifacts such as clusters of very dark or very bright pixels (e.g., boxes 3 and 4, second row). These may be a product of a paucity of training data.

(A) Upper-left-corner crops of nuclear (DAPI) and foreground (CellMask) label predictions on the Condition E dataset, representing 9% of the full image. The unlabeled image used for the prediction and the images of the true and predicted fluorescent labels are organized similarly to Figure 4 . Predicted pixels that are too bright (false positives) are magenta and those too dim (false negatives) are shown in teal. In the second row, the true and predicted nuclear labels have been added to the true and predicted images in blue for visual context. Outset 2 for the nuclear label task shows a false negative, in which the network entirely misses a nucleus, below a false positive, in which it overestimates the size of a nucleus. Outset 3 in the same row shows the network overestimating the sizes of nuclei. Outsets 3 and 4 for the foreground label task show prediction artifacts; Outset 3 is a false positive in a field that contains no cells, and Outset 4 is a false negative at a point that is clearly within a cell. All other outsets show correct predictions. The scale bars are 40 μm.

Given the success of the network in predicting whether a cell is a neuron, we wondered whether it could also accurately predict whether a neurite extending from a cell was an axon or a dendrite. The task suffers from a global coherence problem ( STAR Methods ), and it was also unclear to us a priori whether transmitted-light images contained enough information to distinguish dendrites from axons. Surprisingly, the final network could predict independent dendrite and axon labels ( Figures S1 and S4 ). It does well in predicting dendrites at both low (condition B) and high (condition D) plating densities, whereas the axon predictions are much better at low plating density (condition B).

With TuJ1 labels for the condition A culture, the performance of biologists annotating whether an object is a neuron was highly variable, consistent with the prevailing view that determining cell type based on human judgment is difficult. We found that humans disagree on whether an object is a neuron ∼10% of the time, and ∼2% of the time they disagree on whether an object is one cell or several cells. When a biologist was presented with true and predicted labels of the same sample, the type of cell was scored differently from one occasion to the next 11%–15% of the time, and the number of cells was scored differently 2%–3% of the time. Thus, the frequency of inconsistency introduced by using the predicted labels instead of the true labels is comparable to the frequency of inconsistency between biologists evaluating the same true labels.

We tested the network’s ability to predict which cells were neurons in mixed cultures of cells containing neurons, astrocytes, and immature dividing cells ( Figures 6 and S1 ). Four biologists independently annotated real and predicted TuJ1 labeling, an indication that the cell is a neuron. We compared the annotations of each biologist ( Figure 6 ) and assessed variability among biologists by conducting pairwise comparisons of their annotations on the real labels only.

(C) A further categorization of the errors and the percentage of time they occurred. The error categories of split, merged, added, and missed are the same as in Figure 4 . An additional “human vs. human” column shows the expected disagreement between expert humans predicting which cells were neurons from the true fluorescence image, treating a random expert’s annotations as ground truth.

(B) The heatmap compares the true fluorescence pixel intensity to the network’s predictions, with inset Pearson ρ values, on the full condition A test set. The bin width is 0.1 on a scale of zero to one ( STAR Methods ). The numbers in the bins are frequency counts per 1,000.

(A) Upper-left corner crops of neuron label (TuJ1) predictions, shown in green, on the condition A data ( Table 1 ). The unlabeled image that is the basis for the prediction and the images of the true and predicted fluorescent labels are organized similarly to Figure 4 . Predicted pixels that are too bright (false positives) are magenta and those too dim (false negatives) are shown in teal. The true and predicted nuclear (Hoechst) labels have been added in blue to the true and predicted images for visual context. Outset 3 in (A) shows a false positive: a cell with a neuronal morphology that was not TuJ1 positive. The other outsets show correct predictions, though exact intensity is rarely predicted perfectly. Scale bars, 40 μm.

(A–C) The network was tested for its ability to predict from unlabeled images which cells are neurons. The neurons come from cultures of induced pluripotent stem cells differentiated toward the motor neuron lineage but which contain mixtures of neurons, astrocytes, and immature dividing cells.

To further evaluate the utility and biological significance of the quantitative pixel-wise predictions of the network, we wondered whether network predictions of DAPI/Hoechst labeling could be used to perform morphological analysis of nuclei and accurately detect and distinguish live cells from dead ones. We showed previously that neurons in vitro tend to die by apoptosis, a programmed cell death process that causes nuclei to shrink and round up (). To perform the analysis, we used the transmitted-light images above to make predictions of nuclear labels, used those collections of pixel predictions to define nuclear objects, and measured their dimensions. We then compared the dimensions of nuclei among cells determined to be dead or alive based on propidium iodide labeling. We found that the mean size of nuclei of live cells quantified from morphological analysis of pixel-wise predictions was very similar to that measured from actual labels (7.0 ± 1.4 μm vs. 6.8 ± 1.3 μm) ( Figure S3 ). Likewise, the nuclear sizes of dead cells from predicted labels were very similar to actual measurements (4.9 ± 1.0 μm versus 4.7 ± 1.1 μm). Importantly, quantitative analysis of nuclear morphology based on pixel predictions sensitively and accurately identified and distinguished a subset of dead cells from neighboring live cells based on a change in the size of their nucleus. The result corroborates the predictions based on propidium iodide staining and demonstrates the utility of the network for making biologically meaningful quantitative morphological measurements from pixel predictions.
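The analysis can be sketched in a few lines. Assuming the predicted nuclear-label image is normalized to [0, 1], nuclei can be extracted as connected components above a threshold and reduced to equal-area-disk radii; the threshold and debris cutoff below are illustrative assumptions, not the paper's parameters.

```python
import numpy as np
from scipy import ndimage, stats

# A minimal sketch of the morphological analysis: threshold the predicted
# nuclear-label image, take connected components as nuclei, and report the
# radius of the equal-area disk for each nucleus.

def nuclear_radii_um(predicted_label, um_per_pixel, threshold=0.5, min_pixels=20):
    mask = predicted_label > threshold
    labeled, n = ndimage.label(mask)
    areas = ndimage.sum(mask, labeled, index=range(1, n + 1))  # pixels per object
    areas = areas[areas >= min_pixels]                         # drop tiny debris
    return np.sqrt(areas / np.pi) * um_per_pixel               # equal-area-disk radius

# Live vs. dead distributions can then be compared as in Figure S3, e.g.:
# stat, p = stats.ks_2samp(live_radii, dead_radii)  # two-sample KS test
```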

(B) Radii measured from the predicted DAPI label. The sample mean and standard deviation of the living and dead cells were 7.0 ± 1.4 μm and 4.9 ± 1.0 μm, respectively. For both true and predicted labels, dead cell nuclei are on average smaller than live cell nuclei. Cell debris was not excluded from these histograms, so the very small radii might be overcounted. All distributions are statistically distinct from one another; though the predicted distributions may be similar to their true counterparts, they are distinguishable using the two-sample Kolmogorov-Smirnov test.

To understand the network’s ability to recognize cell death and how it compared to that of a trained biologist, we had the real and predicted propidium iodide-labeled images annotated, following the same method as for the nuclear labels ( Figure 5 C). A subset of the discrepancies between the two annotations, in which a biologist inspecting the phase contrast images determined that an “added” error was in fact a correct prediction of DNA-free cell debris, was reclassified into a new category ( Figure S2 ; STAR Methods ). The network has an empirical precision and recall of 98% and 97%, respectively, with a 1% chance that two dead cells will be predicted to be one dead cell.

To determine whether transmitted-light images contain sufficient information to predict whether a cell is alive or dead, we trained the network with images of live cells treated with propidium iodide (PI), a dye that preferentially labels dead cells. We then made predictions on withheld images of live cells ( Figures 5 A and S1 ). The network was remarkably accurate, though not as accurate as it was for nuclear label prediction. For example, it correctly determined that an entity ( Figure 5 A, second magnified outset) was DNA-free cell debris and not a proper cell, and it picked out a single dead cell in a mass of live cells (third outset). To obtain a quantitative grasp of the network’s behavior, we created heatmaps and calculated linear fits ( Figure 5 B). The Pearson ρ value of 0.85 for propidium iodide indicated a strong linear relationship between the true and predicted labels.

(C) A further categorization of the errors and the percentage of time they occurred. Split is when the network mistakes one cell as two or more cells. Merged is when the network mistakes two or more cells as one. Added is when the network predicts a cell when there is none (i.e., a false positive), and missed is when the network fails to predict a cell when there is one (i.e., a false negative).

(B) The heatmap compares the true fluorescence pixel intensity to the network’s predictions, with an inset Pearson ρ value, on the full condition C test set. The bin width is 0.1 on a scale of zero to one ( STAR Methods ). The numbers in the bins are frequency counts per 1,000.

(A) Upper-left corner crops of cell death predictions on the datasets from condition C ( Table 1 ). Similarly to Figure 4 , the first column is the center phase contrast image of the z-stack of images of unlabeled cells used by the network to make its prediction. The second and third columns are the true and predicted fluorescent labels, respectively, shown in green. Predicted pixels that are too bright (false positives) are magenta and those too dim (false negatives) are shown in teal. The true (Hoechst) and predicted nuclear labels have been added in blue to the true and predicted images for visual context. Outset 1 in (A) shows a misprediction of the extent of a dead cell, and outset 3 in (A) shows a true positive adjacent to a false positive. The other outsets show correct predictions, though exact intensity is rarely predicted perfectly. Scale bars, 40 μm.

To assess the utility of the per-pixel predictions, we gave a team of biologists real and predicted nuclear label images and asked them to annotate the images with the locations of the cell centers. With annotations on real images as ground truth, we used the methodology of prior work () to classify the network’s errors into four categories ( Figures 4 B and S2 A). Under conditions where the amount of cellular debris was high (e.g., condition B) or distortions in image quality were evident (e.g., condition C), the network’s precision and recall drop to the mid-90% range. In other cases, the network was nearly perfect, even with dense cell clumps (e.g., condition D).
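For intuition, the "added" and "missed" categories reduce to unmatched points after pairing predicted and true cell-center annotations. The greedy pairing and the match radius below are assumptions for illustration, not the cited methodology itself; split and merged errors would additionally require tracking match multiplicity.

```python
import numpy as np

# A minimal sketch: pair each true center with the nearest unused predicted
# center within a match radius. Unmatched predictions are "added" errors
# (false positives); unmatched truths are "missed" errors (false negatives).

def added_and_missed(true_xy, pred_xy, radius=10.0):
    true_xy = np.asarray(true_xy, dtype=float)
    pred_xy = np.asarray(pred_xy, dtype=float)
    used = np.zeros(len(pred_xy), dtype=bool)
    missed = 0
    for t in true_xy:
        d = np.linalg.norm(pred_xy - t, axis=1) if len(pred_xy) else np.array([])
        if len(d):
            d[used] = np.inf
            j = int(np.argmin(d))
            if d[j] <= radius:
                used[j] = True
                continue
        missed += 1
    added = int((~used).sum())
    return added, missed
```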

(B) Sample manual error annotations for the cell death label (propidium iodide) prediction task on the Condition C data. The unlabeled image that is the basis for the prediction and the images of the true and predicted fluorescent labels are organized similarly to Figures 4 and 5 , but the fourth column instead displays manual annotations, and the true and predicted nuclear (DAPI) labels have been added for visual context. Merge errors are shown as red dots, add errors are shown as light blue dots, miss errors are shown as pink dots, and add errors which were reclassified as correct debris predictions are shown as yellow dots. There are no split errors. Outset 2 shows an add error at the bottom and a reclassified add error at the top. The top error was reclassified because of the visible debris in the phase contrast image. Outset 5 shows an add error at the top and a reclassified add error at the left. Outset 7 shows a reclassified add error. Outset 8 shows a merge error at the top and a reclassified add error at the bottom. All other dots in the outsets show correct predictions. Note, the dead cell on the left in Outset 3 is slightly positive for the true death label, though it is very dim. The scale bars are 40 μm.

(A) Sample manual error annotations for the nuclear label (DAPI) prediction task on the Condition C data. The unlabeled image that is the basis for the prediction and the images of the true and predicted fluorescent labels are organized similarly to Figure 4 , but the fourth column instead displays manual annotations. Merge errors are shown as red dots, add errors are shown as light blue dots, and miss errors are shown as pink dots. There are no split errors. All other dots indicate agreement between the true and predicted labels. Outset 1 shows an add error in the upper left, a miss error in the center, and six correct predictions. Outset 2 shows a merge error. Outset 4 shows an add error and four correct predictions. Outset 3 shows one correct prediction, and a cell clump excluded from consideration because the human annotators could not determine where the cells are in the true label image. The scale bars are 40 μm.

We asked whether we could train a network to predict the labeling of cell nuclei with Hoechst or DAPI in transmitted-light images of fixed and live cells. With our trained network, we made predictions of nuclear labels ( Figures 4 and S1 ) on the test images ( Table 1 ) (i.e., images withheld during network development and training). Qualitatively, the true and predicted nuclear labels looked nearly identical, and the network’s few mistakes appeared to be special cases (e.g., cell-like debris lacking DNA). We created heatmaps of true versus predicted pixel intensities and quantified the correlation. Pearson correlation (ρ) values of 0.87 or higher indicated that the network accurately predicted the extent and level of labeling and that the predicted pixel intensities reflect the true intensities on a per-pixel basis. The network learned features that could be generalized, given that these predictions were made using different cell types and image acquisition methods.
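The evaluation reported in the heatmap panels can be reproduced with a short function: flatten both images, bin intensities at width 0.1 on the [0, 1] scale, scale counts to per-1,000 frequencies, and compute the Pearson ρ over pixels. This is a sketch of the reported metric, not the authors' code.

```python
import numpy as np

# A minimal sketch of the true-vs.-predicted intensity heatmap and Pearson rho
# used throughout the figures. Both images are assumed normalized to [0, 1].

def intensity_heatmap(true_img, pred_img, bin_width=0.1):
    t, p = true_img.ravel(), pred_img.ravel()
    n_bins = int(round(1.0 / bin_width))
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    counts, _, _ = np.histogram2d(t, p, bins=[edges, edges])
    per_thousand = 1000.0 * counts / counts.sum()  # frequency counts per 1,000
    rho = np.corrcoef(t, p)[0, 1]                  # per-pixel Pearson correlation
    return per_thousand, rho
```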

Unlike the images in all other figures, pixel intensities here were not cropped at the approximate noise floor; this is why background regions are not black. Each of the two blocks shows ground truth, predicted, and error images in the style of Figure 4 . The color bars in the fluorescence images indicate the color of zero brightness at the bottom and color of full brightness at the top. The color bars in the error images indicate the color of a full false negative (true intensity 1.0, predicted 0.0) at the bottom and of a full false positive (true intensity 0.0, predicted 1.0) at the top. The inset text indicates the fluorescent labels, and the condition names at the sides indicate the source conditions. Unlike in Figures 5 and 6 , nuclear labels are not provided as context for the predictions. In Condition B, the predicted MAP2 image lacks the stitching artifact (vertical boundary in the lower right) and the disk-shaped dust artifact present in the ground truth. It also contains dim neurites which are not visible above the noise in the ground truth. The scale bars are 40 μm.

(B) The heatmaps compare the true fluorescence pixel intensity to the network’s predictions, with inset Pearson ρ values. The bin width is 0.1 on a scale of zero to one ( STAR Methods ). The numbers in the bins are frequency counts per 1,000. Under each heatmap plot is a further categorization of the errors and the percentage of time they occurred. Split is when the network mistakes one cell as two or more cells. Merged is when the network mistakes two or more cells as one. Added is when the network predicts a cell when there is none (i.e., a false positive), and missed is when the network fails to predict a cell when there is one (i.e., a false negative).

(A) Upper-left corner crops of test images from datasets in Table 1 ; please note that images in all figures are small crops from much larger images and that the crops were not cherry-picked. The first column is the center transmitted image of the z-stack of images of unlabeled cells used by the network to make its prediction. The second and third columns are the true and predicted fluorescent labels, respectively. Predicted pixels that are too bright (false positives) are magenta and those too dim (false negatives) are shown in teal. Condition A, outset 4, and condition B, outset 2, show false negatives. Condition C, outset 3, and condition D, outset 1, show false positives. Condition B, outsets 3 and 4, and condition C, outset 2, show a common source of error, where the extent of the nuclear label is predicted imprecisely. Other outsets show correct predictions, though exact intensity is rarely predicted perfectly. Scale bars, 40 μm.

The final network ( STAR Methods ) produces a discrete probability distribution over 256 intensity values (corresponding to 8-bit pixels) for each pixel of the output image. It reads z-stacks of transmitted-light images collected with bright field, phase contrast, or differential interference contrast methods and outputs simultaneous predictions for every label kind that appeared in the training datasets. It achieves a lower loss on our data than other popular models while using fewer parameters ( Figure S4 B; STAR Methods ).
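In code, this output parameterization amounts to 256 logits per pixel, trained with cross-entropy against the quantized true intensity, with a point estimate for display taken from the distribution. The shapes and function names below are illustrative assumptions, not the paper's implementation.

```python
import tensorflow as tf

# A minimal sketch of a 256-way discrete per-pixel output distribution,
# corresponding to 8-bit intensity values.

def per_pixel_loss(true_intensity, logits):
    """true_intensity: (batch, H, W) floats in [0, 1].
    logits: (batch, H, W, 256) unnormalized per-pixel scores."""
    target = tf.cast(tf.round(true_intensity * 255.0), tf.int32)  # quantize to 8 bits
    ce = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=target, logits=logits)
    return tf.reduce_mean(ce)

def predicted_image(logits):
    probs = tf.nn.softmax(logits, axis=-1)             # (batch, H, W, 256)
    levels = tf.range(256, dtype=tf.float32) / 255.0   # intensity of each bin
    return tf.reduce_sum(probs * levels, axis=-1)      # expected intensity per pixel
```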

(A) Upper-left-corner crops of dendrite (MAP2) and axon (neurofilament) label predictions on the Conditions B and D datasets. The unlabeled image that is the basis for the prediction and the images of the true and predicted fluorescent labels are organized similarly to Figure 4 . Predicted pixels that are too bright (false positives) are magenta and those too dim (false negatives) are shown in teal. The true and predicted nuclear (DAPI) labels have been added to the true and predicted images in blue for visual context. Outset 4 for the axon label prediction task in Condition B shows a false positive, where an axon label was predicted to be brighter than it actually was. Outset 1 for the dendrite label prediction task in Condition D shows a false negative, where a dendrite was predicted to be an axon. Outset 4 in the same row shows an error in which the network underestimates the extent and brightness of the dendrite label. Outsets 1 and 2 for the axon label prediction task in Condition D are false negatives, where the network underestimated the brightness of the axon labels. All outsets in this row show that the network does a poor job of predicting fine axonal structures in Condition D. All other outsets show essentially correct predictions. Scale bars are 40 μm.

The learned part of the deep neural network is primarily made up of convolutional kernels, small filters that convolve over prior layers to compute the next layers. These kernels are restricted to the interiors of the input layers (i.e., the convolutions are valid or not zero-padded) ( Table S1 ) (), making the network approximately translation invariant. As such, each predicted pixel of the network’s final output is computed by approximately the same function, but using different input data, improving the scalability and accuracy while minimizing boundary effects.
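The bookkeeping implied by valid convolutions is simple: each k × k valid convolution shrinks the output by k − 1 pixels per spatial dimension, so the input crop must exceed the predicted region by the network's total shrink. The kernel sizes in the example are illustrative, not the paper's architecture.

```python
# A minimal sketch of output-size accounting for a stack of valid
# (un-zero-padded) convolutions, where every output pixel is computed
# from real input data only.

def output_size(input_size, kernel_sizes):
    size = input_size
    for k in kernel_sizes:
        size -= k - 1  # valid convolution: no zero padding, output shrinks
    return size

# Example: a 250-pixel crop passed through eight 5x5 valid convolutions:
# output_size(250, [5] * 8) -> 218, i.e., 16 border pixels consumed per side.
```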

The network is composed of repeated modules, as in the popular Inception network used in computer vision (), but with the Inception module optimized for performance ( STAR Methods ; Figure 2 in Data S1 ) using Google Hypertune (). Hypertune is an automatic function optimizer that tries to find a minimum of a function in a bounded space. We expressed module design choices as parameters and the prediction error as the function to be optimized, and used Hypertune to select the design, optimizing over the training dataset with the test set withheld.
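Hypertune's API is not reproduced here, but the framing it implements can be sketched with a generic random search: module design choices become a parameter dictionary and held-out prediction error becomes the objective. The search space and `train_and_evaluate` are placeholders, not the actual parameters or service.

```python
import random

# A generic random-search sketch of black-box module optimization. The actual
# work used Google Hypertune; this illustrates only the framing: design
# choices as parameters, validation error as the function to minimize.

SEARCH_SPACE = {      # illustrative module design choices
    "expansion": [1, 2, 4],
    "kernel": [3, 5],
    "depth": [2, 4, 8],
}

def random_search(train_and_evaluate, n_trials=50, seed=0):
    rng = random.Random(seed)
    best_error, best_params = float("inf"), None
    for _ in range(n_trials):
        params = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        error = train_and_evaluate(params)  # error on held-out training folds
        if error < best_error:
            best_error, best_params = error, params
    return best_error, best_params
```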

Our deep neural network performs the task of non-linear pixel-wise classification. It has a multi-scale input ( Figure 3 ). This endows it with five computational paths: a path for processing fine detail that operates on a small length scale near the center of the network’s input, a path for processing coarse context that operates on a large length scale in a broad region around the center of the network’s input, and three paths in between. Inspired by U-Net () and shown in the leftmost path of Figure 3 in Data S1 , the computational path with the finest detail stays at the original length scale of the input so that local information can flow from the input to the output without being blurred. Multi-scale architectures are common in animal vision systems and have been reported to be useful in vision networks (). We took a multi-scale approach (), in which intermediate layers at multiple scales are aligned by resizing, but used transposed convolutions () to learn the resizing function rather than fixing it as in prior work. This lets the network learn the spatial interpolation rule that best fits its task.
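A two-path reduction of this idea (the paper uses five paths) can be sketched in Keras: the fine path stays at native resolution, the coarse path is processed at half resolution for context and resized back with a learned transposed convolution before the paths are concatenated. Channel counts, kernel sizes, and the two-path simplification are illustrative assumptions.

```python
import tensorflow as tf

# A minimal sketch of aligning computational paths at two spatial scales,
# with the upsampling learned (Conv2DTranspose) rather than fixed.
# Assumes x_fine and x_coarse share even spatial dimensions.

def two_scale_block(x_fine, x_coarse, channels=32):
    # Fine path: native resolution, so local detail is never blurred.
    fine = tf.keras.layers.Conv2D(
        channels, 3, padding="same", activation="relu")(x_fine)
    # Coarse path: half resolution for broader context...
    coarse = tf.keras.layers.Conv2D(
        channels, 3, strides=2, padding="same", activation="relu")(x_coarse)
    # ...then a learned spatial interpolation back to native resolution.
    coarse = tf.keras.layers.Conv2DTranspose(
        channels, 3, strides=2, padding="same")(coarse)
    return tf.keras.layers.Concatenate()([fine, coarse])
```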

(C) Predicted images at an intermediate stage of image prediction. The network has already predicted pixels to the upper left of its fixation point but has not yet predicted pixels for the lower right part of the image. The input and output fixation points are kept in lockstep and are scanned in raster order to produce the full predicted images (a code sketch of this scanning loop follows this figure’s caption).

(B) Simplified network architecture. The network composes six serial sub-networks (towers) and one or more pixel-distribution-valued predictors (heads). The first five towers process information at one of five spatial scales and then, if needed, rescale to the native spatial scale. The sixth and last tower processes the information from these towers.

(A) Example z-stack of transmitted-light images with five colored squares showing the network’s multi-scale input. The squares range in size, increasing from 72 × 72 pixels to 250 × 250 pixels, and they are all centered on the same fixation point. Each square is cropped out of the transmitted-light image from the z-stack and input to the network component of the same color in (B).
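The raster-scan inference described in (C) can be sketched as a sliding-window loop: pad the input so every fixation point has full context, then tile the output in raster order. The crop and patch sizes are illustrative, and the image dimensions are assumed to be multiples of the output patch size.

```python
import numpy as np

# A minimal sketch of lockstep raster-scan inference. `predict_patch` is a
# placeholder for the network: it maps an in_size x in_size crop to the
# out_size x out_size patch centered on the same fixation point.

def raster_predict(image, predict_patch, in_size=250, out_size=8):
    pad = (in_size - out_size) // 2          # context consumed per side
    padded = np.pad(image, pad, mode="reflect")
    h, w = image.shape                        # assumed multiples of out_size
    out = np.zeros((h, w), dtype=np.float32)
    for y in range(0, h, out_size):           # raster order: rows...
        for x in range(0, w, out_size):       # ...then columns
            crop = padded[y:y + in_size, x:x + in_size]
            out[y:y + out_size, x:x + out_size] = predict_patch(crop)
    return out
```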

With these training sets, we used supervised machine learning (ML) ( Table S1 ) to determine if predictive relationships could be found between transmitted-light and fluorescence images of the same cells. We used the unprocessed transmitted-light z-stack as the input for machine-learning algorithm development. The fluorescence images, in contrast, were preprocessed to accommodate constraints imposed by the samples, data acquisition, and the network. For example, we normalized their pixel values ( STAR Methods ) to make the pixel-prediction problem well defined. In addition, we aimed to predict the maximum projection of the fluorescence images along the z axis. This accounts for the fact that pairs of transmitted and fluorescence images were not perfectly registered along the z axis and exhibited differences in depth of field and optical sectioning.
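Both preprocessing steps are easy to sketch; the percentile scaling below is an illustrative stand-in for the normalization detailed in STAR Methods.

```python
import numpy as np

# A minimal sketch of the fluorescence preprocessing described above,
# assuming data arrive as a (z, height, width) stack of raw intensities.

def preprocess_fluorescence(stack, low_pct=1.0, high_pct=99.9):
    # Max-project along z to absorb small z-registration and depth-of-field
    # differences between transmitted-light and fluorescence acquisitions.
    projected = stack.max(axis=0)
    # Normalize to [0, 1] so per-pixel intensity prediction is well defined.
    lo, hi = np.percentile(projected, [low_pct, high_pct])
    return np.clip((projected - lo) / (hi - lo + 1e-8), 0.0, 1.0)
```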

During collection, the microscope stage was kept fixed in x and y while all images in a set were acquired, preserving (x, y) registration of pixels between the transmitted-light and fluorescence images ( Figure 2 ; Table 1 ).

Each row is a typical example of labeled and unlabeled images from the datasets described in Table 1 . The first column is the center image from the z-stack of unlabeled transmitted-light images from which the network makes its predictions. Subsequent columns show fluorescence images of labels that the network will use to learn correspondences with the unlabeled images and eventually try to predict from unlabeled images. The numbered outsets show magnified views of subregions of images within a row. The training data are diverse: sourced from two independent laboratories using two different cell types, six fluorescent labels, and both bright-field and phase-contrast methods to acquire transmitted-light images of unlabeled cells. Scale bars, 40 μm.

To improve the accuracy of the network, we collected multiple transmitted-light images with varying focal planes. Monolayer cultures are not strictly two dimensional, so any single image plane contains limited information about each cell. Translating the focal plane through the sample captures features that are in sharp focus in some images while out of focus in others ( Figure 1 in Data S1 ). Normally, out-of-focus features are undesirable, but we hypothesized that the implicit three-dimensional information in these blurred features could be an additional source of information. We thus collected sets of images (z-stacks) of the same microscope field from several planes at equidistant intervals along the z axis, centered at the plane that was most in focus for the majority of the cell bodies.
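Centering a stack on its most in-focus plane can be sketched with a standard focus proxy; the variance-of-Laplacian criterion below is an assumption for illustration, not the authors' stated method.

```python
import numpy as np
from scipy import ndimage

# A minimal sketch of selecting the most in-focus plane of a z-stack.
# Variance of the Laplacian is a common sharpness proxy: in-focus planes
# have stronger high-frequency content than defocused ones.

def most_in_focus_index(zstack):
    """zstack: (z, height, width) array of transmitted-light images."""
    sharpness = [ndimage.laplace(plane.astype(np.float64)).var()
                 for plane in zstack]
    return int(np.argmax(sharpness))
```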

The training datasets ( Table 1 ) include different cell types with different labels made by different laboratories. We used human motor neurons from induced pluripotent stem cells (iPSCs), primary murine cortical cultures, and a breast cancer cell line. Hoechst or DAPI was used to label cell nuclei; CellMask was used to label plasma membrane; and propidium iodide was used to label cells with compromised membranes. Some cells were immunolabeled with antibodies against the neuron-specific β-tubulin III (TuJ1) protein, the Islet1 protein for identifying motor neurons, the dendrite-localized microtubule associated protein-2 (MAP2), or pan-axonal neurofilaments.

The color code, denoted in parentheses in the first column, refers to the border color added in the figures to enhance readability.

This condition purposely contains only a single well of training data to demonstrate that the model can learn new tasks from very little data through multi-task learning.

To train a deep neural network to predict fluorescence images from transmitted-light images, we first created a dataset of training examples consisting of pairs of transmitted-light z-stack images and fluorescence images that are pixel registered. The training pairs come from numerous experiments across various labs, samples, imaging modalities, and fluorescent labels. This is a means to improve the network via multi-task learning: having it learn across several tasks ( Figure 1 A). Multi-task learning can improve networks when the tasks are similar, because common features can be learned and refined across the tasks. We chose deep neural networks ( Figure 1 B) as the statistical model to learn from the dataset because they can express many patterns and have produced systems with substantially superhuman performance. We trained the network to learn the correspondence rule ( Figure 1 C): a function mapping from the set of z-stacks of transmitted-light images to the set of images of all fluorescent labels in the training set. If our hypothesis is correct, the trained network should examine an unseen z-stack of transmitted-light images ( Figure 1 D) and generate images of the corresponding fluorescent signals ( Figure 1 E). Performance is measured by the similarity of the predicted fluorescence images and the true images for held-out examples.
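One practical detail of such a multi-task dataset is that most examples lack most labels (the checkerboard images in Figure 1A). A common remedy, sketched here under assumed names and with a simple L2 surrogate in place of the paper's per-pixel cross-entropy, is to mask missing labels out of the loss.

```python
import tensorflow as tf

# A minimal sketch of multi-task training with missing labels, assuming one
# output head per fluorescent label. Labels not acquired for an example are
# masked out, so each example contributes gradients only for the labels it has.

LABELS = ["DAPI", "TUJ1", "MAP2", "NFL", "PI", "CELLMASK"]  # illustrative names

def multi_task_loss(y_true, y_pred, present):
    """y_true, y_pred: dicts of label -> (batch, H, W) tensors in [0, 1].
    present: dict of label -> (batch,) float mask, 1.0 if the label was acquired."""
    total = 0.0
    for name in LABELS:
        per_pixel = tf.square(y_true[name] - y_pred[name])    # L2 surrogate loss
        per_example = tf.reduce_mean(per_pixel, axis=[1, 2])  # (batch,)
        total += tf.reduce_sum(per_example * present[name]) / (
            tf.reduce_sum(present[name]) + 1e-8)              # ignore missing labels
    return total
```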

(E) The trained network, C, was used to predict fluorescence labels learned from (A) for each pixel in the novel images (D). The accuracy of the predictions was then evaluated by comparing the predictions to the actual images of fluorescence labeling from (D) (data not shown).

(D) To test whether the system could make accurate predictions from novel images, a z-stack of images from a novel scene was generated with one of the transmitted-light microscopy methods used to produce the training dataset (A).

(A) Dataset of training examples: pairs of transmitted-light images from z-stacks of a scene with pixel-registered sets of fluorescence images of the same scene. The scenes contain varying numbers of cells; they are not crops of individual cells. The z-stacks of transmitted-light microscopy images were acquired with different methods for enhancing contrast in unlabeled images. Several different fluorescent labels were used to generate fluorescence images and were varied between training examples; the checkerboard images indicate fluorescent labels that were not acquired for a given example.

Discussion

Here, we report a new approach: in silico labeling (ISL), a deep learning system that can predict fluorescent labels from transmitted-light images. The deep neural network we developed could be trained on unlabeled images to make accurate per-pixel predictions of the location and intensity of nuclear labeling with DAPI or Hoechst dye and to indicate whether cells were dead or alive by predicting propidium iodide labeling. We further show that the network could be trained to accurately distinguish neurons from other cells in mixed cultures and to predict whether a neurite is an axon or dendrite. These predictions showed a high correlation between the location and intensity of the actual and predicted pixels. They were accurate for live cells, enabling longitudinal fluorescence-like imaging with no additional sample preparation and minimal impact on cells. Thus, we conclude that unlabeled images contain substantial information that can be used to train deep neural networks to predict labels in both live and fixed cells, including labels that normally require invasive approaches to reveal or that cannot be revealed using current methods.

Our deep neural network comprises repeated modules, as in the Inception network, but the modules differ in important ways ( STAR Methods ). Inspired by U-Net (Ronneberger et al., 2015), it is constructed so that fine-grain information can flow from the input to the output without being degraded by locality-destroying transformations. It is multi-scale to provide context, and it preserves approximate translation invariance by avoiding zero-padding in the convolutions ( STAR Methods ), which minimizes boundary effects in the predicted images. Finally, it is specified as the repeated application of a single parameterized module, which simplifies the design space and makes it tractable to automatically search over network architectures.

We also gained insights into the strengths, limitations, and potential applications of deep learning for biologists. The accurate predictions at a per-pixel level indicate that direct correspondences exist between unlabeled images and at least some fluorescent labels. The high correlation coefficients for several labels indicate that the unlabeled images contain the information needed for a deep neural network to accurately predict the location and intensity of the fluorescent label. Importantly, we were able to show, in at least one case ( Figure S3 ), that the predicted label could be used to accurately quantify the dimensions of the cellular structure it represented and thereby correctly classify the biological state of the cell, which we validated with independent direct measurements. This shows that labels predicted by a deep learning network may be useful for accurately inferring measurements of the underlying biological structures, concentrations, and other properties they are trained to represent. Lastly, the fact that successful predictions were made under differing conditions suggests that the approach is robust and may have wide applications.

ISL may offer, at negligible additional cost, a computational approach to reliably predict more labels than would be feasible to collect otherwise from an unlabeled image of a single sample. Also, because ISL works on unlabeled images of live cells, repeated predictions can be made for the same cell over time without invasive labeling or other perturbations. Many-label (multi-plexed) methods exist that partially overcome the barrier imposed by spectral overlap, notably via iterative labeling or hyperspectral imaging. However, the iterative methods are lethal to cells, and the hyperspectral methods require a specialized setup and are limited by the distinctiveness of the fluorophores’ spectra.

That successful predictions could be made by a singly trained network on data from three laboratories suggests that the learned features are robust and generalizable. We showed that the trained network could learn a new fluorescent label from a very limited set of labeled data collected with a different microscopy method. This suggests that the trained network exhibited transfer learning. In transfer learning, the more a model has learned, the less data it needs to learn a new similar task. It applies previous lessons to new tasks. Thus, this network could improve with additional training data and might make accurate predictions on a broader set of data than we measured.