Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems of these fields. We examine applications of deep learning to a variety of biomedical problems—patient classification, fundamental biological processes and treatment of patients—and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.

1. Introduction to deep learning

Biology and medicine are rapidly becoming data-intensive. A recent comparison of genomics with social media, online videos and other data-intensive disciplines suggests that genomics alone will equal or surpass other fields in data generation and analysis within the next decade [1]. The volume and complexity of these data present new opportunities, but also pose new challenges. Automated algorithms that extract meaningful patterns could lead to actionable knowledge and change how we develop treatments, categorize patients or study diseases, all within privacy-critical environments.

The term deep learning has come to refer to a collection of new techniques that, together, have demonstrated breakthrough gains over existing best-in-class machine learning algorithms across several fields. For example, over the past 5 years, these methods have revolutionized image classification and speech recognition due to their flexibility and high accuracy [2]. More recently, deep learning algorithms have shown promise in fields as diverse as high-energy physics [3], computational chemistry [4], dermatology [5] and translation among written languages [6]. Across fields, ‘off-the-shelf’ implementations of these algorithms have produced comparable or higher accuracy than previous best-in-class methods that required years of extensive customization, and specialized implementations are now being used at industrial scales.

Deep learning approaches grew from research on artificial neurons, which were first proposed in 1943 [7] as a model for how the neurons in a biological brain process information. The history of artificial neural networks—referred to as ‘neural networks’ throughout this article—is interesting in its own right [8]. In neural networks, inputs are fed into the input layer, which feeds into one or more hidden layers, which eventually link to an output layer. A layer consists of a set of nodes, sometimes called ‘features’ or ‘units’, which are connected via edges to the immediately earlier and the immediately deeper layers. In some special neural network architectures, nodes can connect to themselves with a delay. The nodes of the input layer generally consist of the variables being measured in the dataset of interest—for example, each node could represent the intensity value of a specific pixel in an image or the expression level of a gene in a specific transcriptomic experiment. The neural networks used for deep learning have multiple hidden layers. Each layer essentially performs feature construction for the layers before it. The training process used often allows layers deeper in the network to contribute to the refinement of earlier layers. For this reason, these algorithms can automatically engineer features that are suitable for many tasks and customize those features for one or more specific tasks.

Deep learning does many of the same things as more familiar machine learning approaches. In particular, deep learning approaches can be used both in supervised applications—where the goal is to accurately predict one or more labels or outcomes associated with each data point—in the place of regression approaches, as well as in unsupervised, or ‘exploratory’ applications—where the goal is to summarize, explain or identify interesting patterns in a dataset—as a form of clustering. Deep learning methods may, in fact, combine both of these steps. When sufficient data are available and labelled, these methods construct features tuned to a specific problem and combine those features into a predictor. In fact, if the dataset is ‘labelled’ with binary classes, a simple neural network with no hidden layers and no cycles between units is equivalent to logistic regression if the output layer is a sigmoid (logistic) function of the input layer. Similarly, for continuous outcomes, linear regression can be seen as a single-layer neural network. Thus, in some ways, supervised deep learning approaches can be seen as an extension of regression models that allow for greater flexibility and are especially well suited for modelling nonlinear relationships among the input features. Recently, hardware improvements and very large training datasets have allowed these deep learning techniques to surpass other machine learning algorithms for many problems. In a famous and early example, scientists from Google demonstrated that a neural network ‘discovered’ that cats, faces and pedestrians were important components of online videos [9] without being told to look for them. What if, more generally, deep learning takes advantage of the growth of data in biomedicine to tackle challenges in this field? Could these algorithms identify the ‘cats’ hidden in our data—the patterns unknown to the researcher—and suggest ways to act on them? 
In this review, we examine deep learning's application to biomedical science and discuss the unique challenges that biomedical data pose for deep learning methods.
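
The equivalence noted above, between logistic regression and a neural network with no hidden layers and a sigmoid output, can be sketched in a few lines of Python. This is a minimal illustration rather than code from any cited work. The function names (`sigmoid`, `predict`, `train`), the learning rate, the number of epochs and the toy data are all assumptions made for the example.

```python
import math


def sigmoid(z):
    """Logistic function that maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Forward pass: the single output node is a sigmoid of a weighted sum of the inputs."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(samples, labels, lr=0.5, epochs=2000):
    """Fit the weights by stochastic gradient descent on the cross-entropy loss.

    For a sigmoid output, the gradient of the loss with respect to the
    pre-activation is (prediction - label), exactly as in logistic regression.
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = predict(weights, bias, x) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Toy, linearly separable data (hypothetical): two low points and two high points.
X = [[0.0, 0.0], [0.2, 0.1], [0.9, 0.8], [1.0, 1.0]]
y = [0, 0, 1, 1]
w, b = train(X, y)
```

Inserting one or more hidden layers with nonlinear activations between the inputs and this output node is what turns the model into the multi-layer networks discussed in the rest of this review.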

Several important advances make the current surge of work done in this area possible. Easy-to-use software packages have brought the techniques of the field out of the specialist's toolkit to a broad community of computational scientists. Additionally, new techniques for fast training have enabled their application to larger datasets [10]. Dropout of nodes, edges and layers makes networks more robust, even when the number of parameters is very large. Finally, the larger datasets now available are also sufficient for fitting the many parameters that exist for deep neural networks. The convergence of these factors currently makes deep learning extremely adaptable and capable of addressing the nuanced differences of each domain to which it is applied.

This review discusses recent work in the biomedical domain, and most successful applications select neural network architectures that are well suited to the problem at hand. We sketch out a few simple example architectures in figure 1. If data have a natural adjacency structure, a convolutional neural network (CNN) can take advantage of that structure by emphasizing local relationships, especially when convolutional layers are used in early layers of the neural network. Other neural network architectures such as autoencoders require no labels and are now regularly used for unsupervised tasks. In this review, we do not exhaustively discuss the different types of deep neural network architectures; an overview of the principal terms used herein is given in table 1. Table 1 also provides select example applications, though in practice each neural network architecture has been broadly applied across multiple types of biomedical data. A recent book from Goodfellow et al. [11] covers neural network architectures in detail, and LeCun et al. [2] provide a more general introduction.

Figure 1. Neural networks come in many different forms. Left: A key for the various types of nodes used in neural networks. Simple FFNN: a feed-forward neural network in which inputs are connected via some function to an output node and the model is trained to produce some output for a set of inputs. MLP: the multi-layer perceptron is a feed-forward neural network in which there is at least one hidden layer between the input and output nodes. CNN: the convolutional neural network is a feed-forward neural network in which the inputs are grouped spatially into hidden nodes. In the case of this example, each input node is only connected to hidden nodes alongside their neighbouring input node. Autoencoder: a type of MLP in which the neural network is trained to produce an output that matches the input to the network. RNN: a deep recurrent neural network is used to allow the neural network to retain memory over time or sequential inputs. This figure was inspired by the Neural Network Zoo by Fjodor Van Veen.

Table 1. Glossary.

| term | definition | example applications |
| --- | --- | --- |
| supervised learning | machine learning approaches with goal of prediction of labels or outcomes | |
| unsupervised learning | machine learning approaches with goal of data summarization or pattern identification | |
| neural network (NN) | machine learning approach inspired by biological neurons where inputs are fed into one or more layers, producing an output layer | |
| deep neural network | NN with multiple hidden layers. Training happens over the network, and consequently such architectures allow for feature construction to occur alongside optimization of the overall training objective | |
| feed-forward neural network (FFNN) | NN that does not have cycles between nodes in the same layer | most of the examples below are special cases of FFNNs, except recurrent neural networks |
| MLP | type of FFNN with at least one hidden layer where each deeper layer is a nonlinear function of each earlier layer | MLPs do not impose structure and are frequently used when there is no natural ordering of the inputs (e.g. as with gene expression measurements) |
| CNN | an NN with layers in which connectivity preserves local structure. If the data meet the underlying assumptions performance is often good, and such networks can require fewer examples to train effectively because they have fewer parameters and also provide improved efficiency | CNNs are used for sequence data, such as DNA sequences, or grid data, such as medical and microscopy images |
| recurrent neural network (RNN) | a neural network with cycles between nodes within a hidden layer | the RNN architecture is used for sequential data, such as clinical time series and text or genome sequences |
| LSTM neural network | this special type of RNN has features that enable models to capture longer-term dependencies | LSTMs are gaining a substantial foothold in the analysis of natural language, and may become more widely applied to biological sequence data |
| autoencoder (AE) | an NN where the training objective is to minimize the error between the output layer and the input layer. Such neural networks are unsupervised and are often used for dimensionality reduction | autoencoders have been used for unsupervised analysis of gene expression data as well as data extracted from the EHR |
| variational autoencoder (VAE) | this special type of generative AE learns a probabilistic latent variable model | VAEs have been shown to often produce meaningful reduced representations in the imaging domain, and some early publications have used VAEs to analyse gene expression data |
| denoising autoencoder (DA) | this special type of AE includes a step where noise is added to the input during the training process. The denoising step acts as smoothing and may allow for effective use on input data that is inherently noisy | like AEs, DAs have been used for unsupervised analysis of gene expression data as well as data extracted from the EHR |
| generative neural network | neural networks that fall into this class can be used to generate data similar to input data. These models can be sampled to produce hypothetical examples | a number of the unsupervised learning neural network architectures that are summarized here can be used in a generative fashion |
| RBM | a generative NN that forms the building block for many deep learning approaches, having a single input layer and a single hidden layer, with no connections between the nodes within each layer | RBMs have been applied to combine multiple types of omic data (e.g. DNA methylation, mRNA expression and miRNA expression) |
| DBN | generative NN with several hidden layers, which can be obtained from combining multiple RBMs | DBNs can be used to predict new relationships in a drug–target interaction network |
| generative adversarial network (GAN) | a generative NN approach where two neural networks are trained. One neural network, the generator, is provided with a set of randomly generated inputs and tasked with generating samples. The second, the discriminator, is trained to differentiate real and generated samples. After the two neural networks are trained against each other, the resulting generator can be used to produce new examples | GANs can synthesize new examples with the same statistical properties of datasets that contain individual-level records and are subject to sharing restrictions. They have also been applied to generate microscopy images |
| adversarial training | a process by which artificial training examples are maliciously designed to fool an NN and then input as training examples to make the resulting NN robust (no relation to GANs) | adversarial training has been used in image analysis |
| data augmentation | a process by which transformations that do not affect relevant properties of the input data (e.g. arbitrary rotations of histopathology images) are applied to training examples to increase the size of the training set | data augmentation is widely used in the analysis of images because rotation transformations for biomedical images often do not change relevant properties of the image |
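
To illustrate how a convolutional layer preserves local structure in sequence data, as described in the CNN entry of table 1, the following sketch one-hot encodes a DNA sequence and slides a single filter along it. It is a minimal illustration under stated assumptions: the helper names (`one_hot`, `conv1d`) and the hand-set filter that scores the motif TATA are hypothetical, whereas a real CNN would learn its filter weights during training.

```python
BASES = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as a list of 4-element indicator vectors."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence]


def conv1d(encoded, kernel):
    """Slide a kernel (width x 4) along the sequence with stride 1 and no padding.

    Each output depends only on `width` neighbouring positions, and the same
    weights are reused at every position.
    """
    width = len(kernel)
    outputs = []
    for start in range(len(encoded) - width + 1):
        window = encoded[start:start + width]
        score = sum(
            k * v
            for k_row, v_row in zip(kernel, window)
            for k, v in zip(k_row, v_row)
        )
        outputs.append(score)
    return outputs


# Hand-set filter (assumption): rewards T, A, T, A at four consecutive positions.
tata_filter = one_hot("TATA")
scores = conv1d(one_hot("GGTATACC"), tata_filter)
best_position = scores.index(max(scores))
```

Here the filter responds most strongly at offset 2, where the motif begins. Because the filter weights are shared across positions, the number of parameters does not grow with sequence length, which is one reason such networks can need fewer training examples.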

While deep learning shows increased flexibility over other machine learning approaches, as seen in the remainder of this review, it requires large training sets in order to fit the hidden layers, as well as accurate labels for the supervised learning applications. For these reasons, deep learning has recently become popular in some areas of biology and medicine, while having lower adoption in other areas. At the same time, this highlights the potentially even larger role that it may play in future research, given the increases in data in all biomedical fields. It is also important to see it as a branch of machine learning and acknowledge that it has the same limitations as other approaches in that field. In particular, the results are still dependent on the underlying study design and the usual caveats of correlation versus causation still apply—a more precise answer is only better than a less precise one if it answers the correct question.

1.1. Will deep learning transform the study of human disease?

With this review, we ask the question: what is needed for deep learning to transform how we categorize, study and treat individuals to maintain or restore health? We choose a high bar for ‘transform’. Grove [12], the former CEO of Intel, coined the term Strategic Inflection Point to refer to a change in technologies or environment that requires a business to be fundamentally reshaped. Here, we seek to identify whether deep learning is an innovation that can induce a Strategic Inflection Point in the practice of biology or medicine.

There are already a number of reviews focused on applications of deep learning in biology [13–17], healthcare [18–20] and drug discovery [4,21–23]. Under our guiding question, we sought to highlight cases where deep learning enabled researchers to solve challenges that were previously considered infeasible or made difficult, tedious analyses routine. We also identified approaches that researchers are using to sidestep challenges posed by biomedical data. We find that domain-specific considerations have greatly influenced how to best harness the power and flexibility of deep learning. Model interpretability is often critical. Understanding the patterns in data may be just as important as fitting the data. In addition, there are important and pressing questions about how to build networks that efficiently represent the underlying structure and logic of the data. Domain experts can play important roles in designing networks to represent data appropriately, encoding the most salient prior knowledge and assessing success or failure. There is also great potential to create deep learning systems that augment biologists and clinicians by prioritizing experiments or streamlining tasks that do not require expert judgement. We have divided the large range of topics into three broad classes: disease and patient categorization, fundamental biological study and treatment of patients. Below, we briefly introduce the types of questions, approaches and data that are typical for each class in the application of deep learning.

1.1.1. Disease and patient categorization

A key challenge in biomedicine is the accurate classification of diseases and disease subtypes. In oncology, current ‘gold standard’ approaches include histology, which requires interpretation by experts, or assessment of molecular markers such as cell surface receptors or gene expression. One example is the PAM50 approach to classifying breast cancer where the expression of 50 marker genes divides breast cancer patients into four subtypes. Substantial heterogeneity still remains within these four subtypes [24,25]. Given the increasing wealth of molecular data available, a more comprehensive subtyping seems possible. Several studies have used deep learning methods to better categorize breast cancer patients: for instance, denoising autoencoders, an unsupervised approach, can be used to cluster breast cancer patients [26], and CNNs can help count mitotic divisions in histological images, a feature that is highly correlated with disease outcome [27]. Despite these recent advances, a number of challenges exist in this area of research, most notably the integration of molecular and imaging data with other disparate types of data such as electronic health records (EHRs).

1.1.2. Fundamental biological study

Deep learning can be applied to answer more fundamental biological questions; it is especially suited to leveraging large amounts of data from high-throughput ‘omics’ studies. One classic biological problem where machine learning, and now deep learning, has been extensively applied is molecular target prediction. For example, deep recurrent neural networks (RNNs) have been used to predict gene targets of microRNAs (miRNAs) [28], and CNNs have been applied to predict protein residue–residue contacts and secondary structure [29–31]. Other recent exciting applications of deep learning include recognition of functional genomic elements such as enhancers and promoters [32–34] and prediction of the deleterious effects of nucleotide polymorphisms [35].

1.1.3. Treatment of patients

Although the application of deep learning to patient treatment is just beginning, we expect new methods to recommend patient treatments, predict treatment outcomes and guide the development of new therapies. One type of effort in this area aims to identify drug targets and interactions or predict drug response. Another uses deep learning on protein structures to predict drug interactions and drug bioactivity [36]. Drug repositioning using deep learning on transcriptomic data is another exciting area of research [37]. Restricted Boltzmann machines (RBMs) can be combined into deep belief networks (DBNs) to predict novel drug–target interactions and formulate drug repositioning hypotheses [38,39]. Finally, deep learning is also prioritizing chemicals in the early stages of drug discovery for new targets [23].

2. Deep learning and patient categorization

In healthcare, individuals are diagnosed with a disease or condition based on symptoms, the results of certain diagnostic tests, or other factors. Once diagnosed with a disease, an individual might be assigned a stage based on another set of human-defined rules. While these rules are refined over time, the process is evolutionary and ad hoc, potentially impeding the identification of underlying biological mechanisms and their corresponding treatment interventions.

Deep learning methods applied to a large corpus of patient phenotypes may provide a meaningful and more data-driven approach to patient categorization. For example, they may identify new shared mechanisms that would otherwise be obscured due to ad hoc historical definitions of disease. Perhaps deep neural networks, by reevaluating data without the context of our assumptions, can reveal novel classes of treatable conditions.

In spite of such optimism, the ability of deep learning models to indiscriminately extract predictive signals must also be assessed and operationalized with care. Imagine a deep neural network is provided with clinical test results gleaned from EHRs. Because physicians may order certain tests based on their suspected diagnosis, a deep neural network may learn to ‘diagnose’ patients simply based on the tests that are ordered. For some objective functions, such as predicting an International Classification of Diseases (ICD) code, this may offer good performance even though it does not provide insight into the underlying disease beyond physician activity. This challenge is not unique to deep learning approaches; however, practitioners should be aware of it and of the possibility, in this domain, of constructing highly predictive classifiers of questionable utility.

Our goal in this section is to assess the extent to which deep learning is already contributing to the discovery of novel categories. Where it is not, we focus on barriers to achieving these goals. We also highlight approaches that researchers are taking to address challenges within the field, particularly with regard to data availability and labelling.

2.1. Imaging applications in healthcare

Deep learning methods have transformed the analysis of natural images and video, and similar examples are beginning to emerge with medical images. Deep learning has been used to classify lesions and nodules; localize organs, regions, landmarks and lesions; segment organs, organ substructures and lesions; retrieve images based on content; generate and enhance images; and combine images with clinical reports [19,40].

Though there are many commonalities with the analysis of natural images, there are also key differences. In all cases that we examined, fewer than one million images were available for training, and datasets are often many orders of magnitude smaller than collections of natural images. Researchers have developed subtask-specific strategies to address this challenge.

Data augmentation provides an effective strategy for working with small training sets. The practice is exemplified by a series of papers that analyse images from mammographies [41–45]. To expand the number and diversity of images, researchers constructed adversarial [44] or augmented [45] examples. Adversarial training examples are constructed by selecting targeted small transformations to input data that cause a model to produce very different outputs. Augmented training applies perturbations to the input data that do not change the underlying meaning, such as rotations for pathology images. An alternative in the domain is to train towards human-created features before subsequent fine-tuning [42], which can help to sidestep this challenge though it does give up deep learning techniques' strength as feature constructors.
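
The rotation-based augmentation described above can be sketched as follows. This is a minimal illustration that represents an image as a list of pixel rows; the function names (`rotate90`, `flip_horizontal`, `augment`) and the toy patch are assumptions for the example. Every variant would keep the label of the original image, which is only appropriate when orientation carries no diagnostic meaning for the task.

```python
def rotate90(image):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image):
    """Mirror a 2D image left to right."""
    return [row[::-1] for row in image]


def augment(image):
    """Return the 8 rotation and reflection variants of an image."""
    variants = []
    current = [list(row) for row in image]
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate90(current)
    return variants


# Toy 2x2 "image" with hypothetical pixel intensities.
patch = [[1, 2],
         [3, 4]]
augmented = augment(patch)
```

For an image with no internal symmetry, such as this patch, the eight variants are all distinct, so a training set of N images grows to 8N examples without any new annotation effort.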

A second strategy repurposes features that deep learning models have extracted from natural images, such as those in ImageNet [46], for new purposes. Diagnosing diabetic retinopathy through colour fundus images became an area of focus for deep learning researchers after a large labelled image set was made publicly available during a 2015 Kaggle competition [47]. Most participants trained neural networks from scratch [47–49], but Gulshan et al. [50] repurposed a 48-layer Inception-v3 deep architecture pre-trained on natural images and surpassed the state-of-the-art specificity and sensitivity. Such features were also repurposed to detect melanoma, the deadliest form of skin cancer, from dermoscopic [51,52] and non-dermoscopic images of skin lesions [5,53,54] as well as age-related macular degeneration [55]. Pre-training on natural images can enable very deep networks to succeed without overfitting. For the melanoma task, reported performance was competitive with or better than a board of certified dermatologists [5,51]. Reusing features from natural images is also an emerging approach for radiographic images, where datasets are often too small to train large deep neural networks without these techniques [56–59]. A deep CNN trained on natural images boosts performance in radiographic images [58]. However, the target task required either re-training the initial model from scratch with special preprocessing or fine-tuning of the whole network on radiographs with heavy data augmentation to avoid overfitting.

The technique of reusing features from a different task falls into the broader area of transfer learning (see Discussion). Though we have mentioned numerous successes for the transfer of natural image features to new tasks, we expect that a lower proportion of negative results have been published. The analysis of magnetic resonance images is also faced with the challenge of small training sets. In this domain, Amit et al. [60] investigated the trade-off between pre-trained models from a different domain and a small CNN trained only with MRI images. In contrast with the other selected literature, they found a smaller network trained with data augmentation on a few hundred images from a few dozen patients can outperform a pre-trained out-of-domain classifier.

Another way of dealing with limited training data is to divide rich data (e.g. 3D images) into numerous reduced projections. Shin et al. [57] compared various deep network architectures, dataset characteristics and training procedures for computed tomography (CT)-based abnormality detection. They concluded that networks as deep as 22 layers could be useful for 3D data, despite the limited size of training datasets. However, they noted that the choice of architecture, parameter settings and model fine-tuning needed is very problem- and dataset-specific. Moreover, this type of task often depends on both lesion localization and appearance, which poses challenges for CNN-based approaches. Straightforward attempts to capture useful information from full-size images in all three dimensions simultaneously via standard neural network architectures were computationally infeasible. Instead, two-dimensional models were used to either process image slices individually (2D) or aggregate information from a number of 2D projections in the native space (2.5D).
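
The difference between 2D and 2.5D inputs can be made concrete with a small sketch. It assumes, for illustration only, that a volume is stored as nested lists indexed `volume[z][y][x]` and that the 2.5D input consists of three orthogonal views through a voxel of interest; the cited studies may construct their projections differently, and all function names here are hypothetical.

```python
def axial_slice(volume, z):
    """2D slice at depth z; the volume is indexed as volume[z][y][x]."""
    return [row[:] for row in volume[z]]


def coronal_slice(volume, y):
    """2D slice at row y, spanning depth (z) and width (x)."""
    return [plane[y][:] for plane in volume]


def sagittal_slice(volume, x):
    """2D slice at column x, spanning depth (z) and height (y)."""
    return [[row[x] for row in plane] for plane in volume]


def views_2_5d(volume, z, y, x):
    """Three orthogonal 2D views through the voxel (z, y, x).

    A 2.5D model would receive these views (for example as channels or as
    separate input streams) instead of the full 3D volume.
    """
    return (
        axial_slice(volume, z),
        coronal_slice(volume, y),
        sagittal_slice(volume, x),
    )


# Toy 2x2x2 volume (hypothetical): each voxel stores 100*z + 10*y + x.
volume = [[[100 * z + 10 * y + x for x in range(2)] for y in range(2)]
          for z in range(2)]
axial, coronal, sagittal = views_2_5d(volume, z=1, y=0, x=1)
```

A purely 2D approach would pass each axial slice through the network independently, whereas the 2.5D variant gives the model some context from the other two axes at a small fraction of the cost of a full 3D convolution.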

Roth et al. [61] compared 2D, 2.5D and 3D CNNs on a number of tasks for computer-aided detection from CT scans and showed that 2.5D CNNs performed comparably well to 3D analogues, while requiring much less training time, especially on augmented training sets. Another advantage of 2D and 2.5D networks is the wider availability of pre-trained models. However, reducing the dimensionality is not always helpful. Nie et al. [62] showed that a multimodal, multi-channel 3D deep architecture was successful at learning high-level brain tumour appearance features jointly from MRI, functional MRI and diffusion MRI images, outperforming single-modality or 2D models. Overall, the variety of modalities, properties and sizes of training sets, the dimensionality of input and the importance of end goals in medical image analysis are driving the development of specialized deep neural network architectures, training and validation protocols, and input representations that are not characteristic of widely-studied natural images.

Predictions from deep neural networks can be evaluated for use in workflows that also incorporate human experts. In a large dataset of mammography images, Kooi et al. [63] demonstrated that deep neural networks outperform a traditional computer-aided diagnosis system at low sensitivity and perform comparably at high sensitivity. They also compared network performance to certified screening radiologists on a patch level and found no significant difference between the network and the readers. However, using deep methods for clinical practice is challenged by the difficulty of assigning a level of confidence to each prediction. Leibig et al. [49] estimated the uncertainty of deep networks for diabetic retinopathy diagnosis by linking dropout networks with approximate Bayesian inference. Techniques that assign confidences to each prediction should aid physician–computer interactions and improve uptake by physicians.
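
One way to attach an uncertainty to each prediction, in the spirit of the dropout-based approach mentioned above, is to leave dropout switched on at prediction time and summarize many stochastic forward passes. The sketch below is an illustration only and is not the implementation of Leibig et al. [49]: the tiny network, its fixed weights, the dropout rate and the number of samples are all assumptions.

```python
import math
import random
import statistics


def sigmoid(z):
    """Logistic function that maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def stochastic_forward(x, hidden_weights, output_weights, drop_prob, rng):
    """One forward pass of a tiny MLP with dropout left on at prediction time."""
    hidden = []
    for weights in hidden_weights:
        activation = max(0.0, sum(w * xi for w, xi in zip(weights, x)))  # ReLU
        keep = rng.random() >= drop_prob
        # Inverted dropout: rescale kept units so the expected activation is unchanged.
        hidden.append(activation / (1.0 - drop_prob) if keep else 0.0)
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))


def mc_dropout_predict(x, hidden_weights, output_weights,
                       drop_prob=0.5, n_samples=200, seed=0):
    """Mean prediction and its standard deviation over repeated stochastic passes."""
    rng = random.Random(seed)
    samples = [
        stochastic_forward(x, hidden_weights, output_weights, drop_prob, rng)
        for _ in range(n_samples)
    ]
    return statistics.mean(samples), statistics.stdev(samples)


# Hypothetical fixed weights standing in for a trained network.
hidden_w = [[0.8, -0.2], [0.3, 0.9], [-0.5, 0.6]]
output_w = [1.2, -0.7, 0.5]
mean_prob, uncertainty = mc_dropout_predict([1.0, 2.0], hidden_w, output_w)
```

The mean serves as the prediction and the spread as a rough confidence signal: cases with a large spread could be routed to a human reader, while cases with a small spread could be handled with less scrutiny.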

Systems to aid in the analysis of histology slides are also promising use cases for deep learning [64]. Ciresan et al. [27] developed one of the earliest approaches for histology slides, winning the 2012 International Conference on Pattern Recognition's Contest on Mitosis Detection while achieving human-competitive accuracy. In more recent work, Wang et al. [65] analysed stained slides of lymph node slices to identify cancers. On this task, a pathologist had about a 3% error rate. The pathologist did not produce any false positives but did have a number of false negatives. The algorithm had about twice the error rate of a pathologist, but the errors were not strongly correlated. Combining pre-trained deep network architectures with multiple augmentation techniques enabled accurate detection of breast cancer from a very small set of histology images with fewer than 100 images per class [66]. In this area, these algorithms may be ready to be incorporated into existing tools to aid pathologists and reduce the false negative rate. Ensembles of deep learning and human experts may help overcome some of the challenges presented by data limitations.

One source of training examples with rich phenotypical annotations is the EHR. Billing information in the form of ICD codes provides simple annotations, but phenotypic algorithms can combine laboratory tests, medication prescriptions and patient notes to generate more reliable phenotypes. Recently, Lee et al. [67] developed an approach to distinguish individuals with age-related macular degeneration from control individuals. They trained a deep neural network on approximately 100 000 images extracted from structured EHRs, reaching greater than 93% accuracy. The authors used their test set to evaluate when to stop training. In other domains, this has resulted in a minimal change in the estimated accuracy [68], but we recommend the use of an independent test set whenever feasible.
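
The recommendation to keep an independent test set can be sketched as a three-way split, in which a validation set (rather than the test set) is used to decide when to stop training. This is a minimal illustration; the function name `three_way_split`, the split fractions and the integer image identifiers are assumptions for the example.

```python
import random


def three_way_split(items, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then carve out disjoint training, validation and test sets.

    The validation set is used for decisions made during training, such as
    when to stop; the test set is used only once, for the final estimate.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical image identifiers.
image_ids = list(range(1000))
train_ids, val_ids, test_ids = three_way_split(image_ids)
```

Because the test set plays no part in any training decision, the accuracy measured on it is not biased by the choice of stopping point.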

Rich clinical information is stored in EHRs. However, manually annotating a large set requires experts and is time-consuming. For chest X-ray studies, a radiologist usually spends a few minutes per example. Generating the number of examples needed for deep learning is infeasibly expensive. Instead, researchers may benefit from using text mining to generate annotations [69], even if those annotations are of modest accuracy. Wang et al. [70] proposed to build predictive deep neural network models through the use of images with weak labels. Such labels are automatically generated and not verified by humans, so they may be noisy or incomplete. In this case, they applied a series of natural language processing (NLP) techniques to the associated chest X-ray radiological reports. They first extracted all diseases mentioned in the reports using a state-of-the-art NLP tool, then applied a new method, NegBio [71], to filter negative and equivocal findings in the reports. Evaluation on four independent datasets demonstrated that NegBio is highly accurate for detecting negative and equivocal findings (approx. 90% in the F1 score, which balances precision and recall [72]). The resulting dataset [73] consisted of 112 120 frontal-view chest X-ray images from 30 805 patients, and each image was associated with one or more text-mined (weakly labelled) pathology categories (e.g. pneumonia and cardiomegaly) or ‘no finding’ otherwise. Further, Wang et al. [70] used this dataset with a unified weakly supervised multi-label image classification framework to detect common thoracic diseases. The framework showed superior performance over a benchmark using fully labelled data.
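
For readers unfamiliar with the F1 score mentioned above, the following sketch computes it from raw counts as the harmonic mean of precision and recall. The function name and the counts are hypothetical and are not taken from the NegBio evaluation.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from raw counts."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for a detector of negated findings.
score = f1_score(true_positives=90, false_positives=10, false_negatives=10)
```

With these counts, precision and recall are both 0.9, so the F1 score is also 0.9. Because it is a harmonic mean, the score drops sharply if either precision or recall is low, which makes it a stricter summary than their simple average.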

Another example of semi-automated label generation for hand radiograph segmentation employed positive mining, an iterative procedure that combines manual labelling with automatic processing [74]. First, the initial training set was created by manually labelling 100 of 12 600 unlabelled radiographs that were used to train a model and predict labels for the rest of the dataset. Then, poor-quality predictions were discarded through manual inspection, the initial training set was expanded with the acceptable segmentations, and the process was repeated. This procedure had to be repeated six times to obtain good quality segmentation labelling for all radiographs, except for 100 corner cases that still required manual annotation. These annotations allowed accurate segmentation of all hand images in the test set and boosted the final performance in radiograph classification [74].
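The iterative procedure described above can be sketched as a simple loop. This is a schematic under stated assumptions, not the implementation used in [74]: the `train` and `accept` callables are hypothetical placeholders for model fitting and for the manual quality inspection, and the default of six rounds simply mirrors the number reported in the text.

```python
from typing import Callable, Dict, Hashable, List, Tuple


def positive_mining(
    labelled: Dict[Hashable, object],
    unlabelled: List[Hashable],
    train: Callable[[Dict[Hashable, object]], Callable[[Hashable], object]],
    accept: Callable[[Hashable, object], bool],
    max_rounds: int = 6,
) -> Tuple[Dict[Hashable, object], List[Hashable]]:
    """Grow a labelled set by keeping only predictions that pass review.

    train  : hypothetical function that fits a model on the current labels
             and returns a predictor.
    accept : hypothetical stand-in for manual inspection of a prediction.
    Returns the expanded labelled set and the items still lacking labels.
    """
    labelled = dict(labelled)      # do not mutate the caller's data
    remaining = list(unlabelled)
    for _ in range(max_rounds):
        if not remaining:
            break
        model = train(labelled)    # retrain on everything accepted so far
        still_unlabelled = []
        for item in remaining:
            prediction = model(item)
            if accept(item, prediction):
                labelled[item] = prediction     # keep acceptable output
            else:
                still_unlabelled.append(item)   # retry in the next round
        remaining = still_unlabelled
    return labelled, remaining
```

Items left in the second return value correspond to the corner cases that still need manual annotation after the final round.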

With the exception of natural image-like problems (e.g. melanoma detection), biomedical imaging poses a number of challenges for deep learning. Datasets are typically small, annotations can be sparse, and images are often high-dimensional, multimodal and multi-channel. Techniques like transfer learning, heavy dataset augmentation and the use of multi-view and multi-stream architectures are more common than in the natural image domain. Furthermore, high model sensitivity and specificity can translate directly into clinical value. Thus, prediction evaluation, uncertainty estimation and model interpretation methods are also of great importance in this domain (see Discussion). Finally, there is a need for better pathologist–computer interaction techniques that will allow combining the power of deep learning methods with human expertise and lead to better-informed decisions for patient treatment and care.

2.2. Text applications in healthcare

Owing to the rapid growth of scholarly publications and EHRs, biomedical text mining has become increasingly important in recent years. The main tasks in biological and clinical text mining include, but are not limited to, named entity recognition (NER), relation/event extraction and information retrieval (figure 2). Deep learning is appealing in this domain because of its competitive performance versus traditional methods and ability to overcome challenges in feature engineering. Relevant applications can be stratified by the application domain (biomedical literature versus clinical notes) and the actual task (e.g. concept or relation extraction).

Figure 2. Deep learning applications, tasks and models based on NLP perspectives.

NER is a task of identifying text spans that refer to a biological concept of a specific class, such as disease or chemical, in a controlled vocabulary or ontology. NER is often needed as a first step in many complex text mining systems. The current state-of-the-art methods typically reformulate the task as a sequence labelling problem and use conditional random fields [75–77]. In recent years, word embeddings that contain rich latent semantic information of words have been widely used to improve the NER performance. Liu et al. [78] studied the effect of word embeddings on drug name recognition and compared them with traditional semantic features. Tang et al. [79] investigated word embeddings in the gene, DNA and cell line mention detection tasks. Moreover, Wu et al. [80] examined the use of neural word embeddings for clinical abbreviation disambiguation. Liu et al. [81] exploited task-oriented resources to learn word embeddings for clinical abbreviation expansion.
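To make the sequence labelling formulation concrete, the sketch below converts entity spans into per-token tags using the widely used BIO scheme (B = beginning of an entity, I = inside, O = outside). The scheme is a common convention rather than something prescribed by the cited studies, and the example sentence, spans and class names are illustrative assumptions.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Turn entity spans into one BIO tag per token.

    Each span is (start, end_exclusive, entity_class) over token indices.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags


# Hypothetical example sentence with one chemical and one disease mention.
tokens = ["Aspirin", "reduces", "risk", "of", "myocardial", "infarction"]
tags = spans_to_bio(tokens, [(0, 1, "Chemical"), (4, 6, "Disease")])
```

A sequence model, whether a conditional random field or a neural network, is then trained to predict one such tag per token.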

Relation extraction involves detecting and classifying semantic relationships between entities from the literature. At present, kernel methods or feature-based approaches are commonly applied [82–84]. Deep learning can relieve the feature sparsity and engineering problems. Some studies focused on jointly extracting biomedical entities and relations [85,86], while others applied deep learning to relation classification given the relevant entities. For example, both multi-channel dependency-based CNNs [87] and shortest path-based CNNs [88,89] are well suited for sentence-based protein–protein interaction extraction. Jiang et al. [90] proposed a biomedical domain-specific word embedding model to reduce the manual labour of designing semantic representation for the same task. Gu et al. [91] employed a maximum-entropy model and a CNN model for chemical-induced disease relation extraction at the inter- and intra-sentence level, respectively. For drug–drug interactions, Zhao et al. [92] used a CNN that employs word embeddings with the syntactic information of a sentence as well as features of part-of-speech tags and dependency trees. Asada et al. [93] experimented with an attention CNN, and Yi et al. [94] proposed an RNN model with multiple attention layers. In both cases, a single model with an attention mechanism is used, which allows the decoder to focus on different parts of the source sentence. As a result, neither approach requires dependency parsing or training multiple models. The attention CNN and RNN achieve comparable results, but the CNN model has the advantage that it can be easily computed in parallel, making it faster on recent graphics processing units (GPUs).

For biotopes event extraction, Li et al. [95] employed CNNs and distributed representation, while Mehryary et al. [96] used long short-term memory (LSTM) networks to extract complicated relations. Li et al. [97] applied word embedding to extract complete events from the biomedical text and achieved results comparable to the state-of-the-art systems. There are also approaches that identify event triggers rather than the complete event [98,99]. Taken together, deep learning models outperform traditional kernel methods or feature-based approaches by 1–5% in F-score. Among various deep learning approaches, CNNs stand out as the most popular model both in terms of computational complexity and performance, while RNNs have achieved continuous progress.

Information retrieval is a task of finding relevant text that satisfies an information need from within a large document collection. While deep learning has not yet achieved the same level of success in this area as seen in others, the recent surge of interest and work suggests that this may be quickly changing. For example, Mohan et al. [100] described a deep learning approach to modelling the relevance of a document's text to a query, which they applied to the entire biomedical literature.

To summarize, deep learning has shown promising results in many biomedical text mining tasks and applications. However, to realize its full potential in this domain, either large amounts of labelled data or technical advancements in current methods coping with limited labelled data are required.

2.3. Electronic health records

EHR data include substantial amounts of free text, which remains challenging to approach [101]. Often, researchers developing algorithms that perform well on specific tasks must design and implement domain-specific features [102]. These features capture unique aspects of the literature being processed. Deep learning methods are natural feature constructors. In recent work, Chalapathy et al. evaluated the extent to which deep learning methods could be applied on top of generic features for domain-specific concept extraction [103]. They found that performance was in line with, but lower than, that of the best domain-specific method [103]. This raises the possibility that deep learning may impact the field by reducing the researcher time and cost required to develop specific solutions, but it may not always lead to performance increases.

In recent work, Yoon et al. [104] analysed simple features using deep neural networks and found that the patterns recognized by the algorithms could be re-used across tasks. Their aim was to analyse the free text portions of pathology reports to identify the primary site and laterality of tumours. The only features the authors supplied to the algorithms were unigrams (counts for single words) and bigrams (counts for two-word combinations) in a free text document. They subset the full set of words and word combinations to the 400 most common. The machine learning algorithms that they employed (naive Bayes, logistic regression and deep neural networks) all performed relatively similarly on the task of identifying the primary site. However, when the authors evaluated the more challenging task, evaluating the laterality of each tumour, the deep neural network outperformed the other methods. Of particular interest, when the authors first trained a neural network to predict the primary site and then repurposed those features as a component of a secondary neural network trained to predict laterality, the performance was higher than a laterality-trained neural network. This demonstrates how deep learning methods can repurpose features across tasks, improving overall predictions as the field tackles new challenges. The Discussion further reviews this type of transfer learning.
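The unigram and bigram count features described above can be illustrated with a short sketch. It assumes that unigrams and bigrams are pooled before the most common terms are selected, which the text does not specify; the function names and the toy report fragments in the usage note are hypothetical and are not taken from Yoon et al. [104].

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> List[str]:
    """All contiguous n-word sequences in a tokenized document."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(docs: List[List[str]], top_k: int = 400) -> List[str]:
    """Keep the top_k most frequent unigrams and bigrams across documents.

    Assumption: unigrams and bigrams compete in one pooled ranking.
    """
    counts: Counter = Counter()
    for tokens in docs:
        counts.update(ngrams(tokens, 1))
        counts.update(ngrams(tokens, 2))
    return [term for term, _ in counts.most_common(top_k)]


def featurize(tokens: List[str], vocabulary: List[str]) -> List[int]:
    """Count vector for one document, restricted to the vocabulary."""
    counts = Counter(ngrams(tokens, 1) + ngrams(tokens, 2))
    return [counts[term] for term in vocabulary]
```

For example, `featurize(["left", "left", "mass"], ["left", "mass", "left mass"])` yields one count per vocabulary term. Vectors of this kind are what a naive Bayes model, a logistic regression or a neural network would receive as input.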

Several authors have created reusable feature sets for medical terminologies using NLP and neural embedding models, as popularized by word2vec [105]. Minarro-Giménez et al. [106] applied the word2vec deep learning toolkit to medical corpora and evaluated the efficiency of word2vec in identifying properties of pharmaceuticals based on mid-sized, unstructured medical text corpora without any additional background knowledge. A goal of learning terminologies for different entities in the same vector space is to find relationships between different domains (e.g. drugs and the diseases they treat). It is difficult for us to provide a strong statement on the broad utility of these methods. Manuscripts in this area tend to compare algorithms applied to the same data but lack a comparison against overall best practices for one or more tasks addressed by these methods. Techniques have been developed for free text medical notes [107], ICD and National Drug Codes [108,109] and claims data [110]. Methods for neural embeddings learned from EHRs have at least some ability to predict disease–disease associations and implicate genes with a statistical association with a disease [111], but the evaluations performed did not differentiate between simple predictions (i.e. the same disease in different sites of the body) and non-intuitive ones. Jagannatha & Yu [112] further employed a bidirectional LSTM structure to extract adverse drug events from EHRs, and Lin et al. [113] investigated using CNNs to extract temporal relations. While promising, a lack of rigorous evaluation of the real-world utility of these kinds of features makes current contributions in this area difficult to evaluate. Comparisons need to be performed to examine the true utility against leading approaches (i.e. algorithms and data) as opposed to simply evaluating multiple algorithms on the same potentially limited dataset.

Identifying consistent subgroups of individuals and individual health trajectories from clinical tests is also an active area of research. Approaches inspired by deep learning have been used for both unsupervised feature construction and supervised prediction. Early work by Lasko et al. [114] combined sparse autoencoders and Gaussian processes to distinguish gout from leukaemia from uric acid sequences. Later work showed that unsupervised feature construction of many features via denoising autoencoder neural networks could dramatically reduce the number of labelled examples required for subsequent supervised analyses [115]. In addition, it pointed towards features learned during unsupervised training being useful for visualizing and stratifying subgroups of patients within a single disease. In a concurrent large-scale analysis of EHR data from 700 000 patients, Miotto et al. [116] used a deep denoising autoencoder architecture applied to the number and co-occurrence of clinical events to learn a representation of patients (DeepPatient). The model was able to predict disease trajectories within 1 year with over 90% accuracy, and patient-level predictions were improved by up to 15% when compared to other methods. Choi et al. [117] attempted to model the longitudinal structure of EHRs with an RNN to predict future diagnosis and medication prescriptions on a cohort of 260 000 patients followed for 8 years (Doctor AI). Pham et al. [118] built upon this concept by using an RNN with an LSTM architecture enabling explicit modelling of patient trajectories through the use of memory cells. The method, DeepCare, performed better than shallow models or plain RNN when tested on two independent cohorts for its ability to predict disease progression, intervention recommendation and future risk prediction. Nguyen et al. [119] took a different approach and used word embeddings from EHRs to train a CNN that could detect and pool local clinical motifs to predict unplanned readmission after six months, with performance better than the baseline method (Deepr). Razavian et al. [120] used a set of 18 common laboratory tests to predict disease onset using both CNN and LSTM architectures and demonstrated an improvement over baseline regression models. However, numerous challenges including data integration (patient demographics, family history, laboratory tests, text-based patient records, image analysis, genomic data) and better handling of streaming temporal data with many features will need to be overcome before we can fully assess the potential of deep learning for this application area.

Still, recent work has also revealed domains in which deep networks have proven superior to traditional methods. Survival analysis models the time leading to an event of interest from a shared starting point, and in the context of EHR data, often associates these events to subject covariates. Exploring this relationship is difficult, however, given that EHR data types are often heterogeneous, covariates are often missing and conventional approaches require the covariate–event relationship be linear and aligned to a specific starting point [121]. Early approaches, such as the Faraggi–Simon feed-forward network, aimed to relax the linearity assumption, but performance gains were lacking [122]. Katzman et al. [123] in turn developed a deep implementation of the Faraggi–Simon network that, in addition to outperforming Cox regression, was capable of comparing the risk between a given pair of treatments, thus potentially acting as a recommender system. To overcome the remaining difficulties, researchers have turned to deep exponential families, a class of latent generative models that are constructed from any type of exponential family distributions [124]. The result was a deep survival analysis model capable of overcoming challenges posed by missing data and heterogeneous data types, while uncovering nonlinear relationships between covariates and failure time. The authors showed that their model more accurately stratified patients as a function of disease-risk score compared to the current clinical implementation.
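To illustrate what relaxing the linearity assumption means in practice, the sketch below computes the negative log partial likelihood of a Cox-type model from arbitrary per-subject risk scores. In standard Cox regression the score is a linear function of the covariates; in a Faraggi–Simon-style network it would be the output of a neural network. This is a minimal sketch under stated assumptions (right-censored data, ties handled in the simplest Breslow-style way), not the loss implementation of any cited method, and the function and argument names are hypothetical.

```python
import math
from typing import Sequence


def neg_log_partial_likelihood(
    risk: Sequence[float],
    time: Sequence[float],
    event: Sequence[int],
) -> float:
    """Negative log partial likelihood for right-censored survival data.

    risk  : risk score per subject (linear predictor or network output)
    time  : observed follow-up time per subject
    event : 1 if the event was observed, 0 if the subject was censored
    """
    loss = 0.0
    for i in range(len(risk)):
        if not event[i]:
            continue  # censored subjects only contribute via risk sets
        # Risk set: everyone still under observation at subject i's time.
        denominator = sum(
            math.exp(risk[j]) for j in range(len(risk)) if time[j] >= time[i]
        )
        loss -= risk[i] - math.log(denominator)
    return loss
```

The loss is lower when subjects who experience the event earlier receive higher risk scores, which is the ordering a trained model is expected to learn.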

There is a computational cost for these methods, however, when compared to traditional, non-neural network approaches. For the exponential family models, despite their scalability [125], an important question for the investigator is whether he or she is interested in estimates of posterior uncertainty. Given that these models are effectively Bayesian neural networks, much of their utility simplifies to whether a Bayesian approach is warranted for a given increase in computational cost. Moreover, as with all variational methods, future work must continue to explore just how well the posterior distributions are approximated, especially as model complexity increases [126].

2.4. Challenges and opportunities in patient categorization

2.4.1. Generating ground-truth labels can be expensive or impossible

A dearth of true labels is perhaps among the biggest obstacles for EHR-based analyses that employ machine learning. Popular deep learning (and other machine learning) methods are often used to tackle classification tasks and thus require ground-truth labels for training. For EHRs, this can mean that researchers must hire multiple clinicians to manually read and annotate individual patients' records through a process called chart review. This allows researchers to assign ‘true’ labels, i.e. those that match our best available knowledge. Depending on the application, sometimes the features constructed by algorithms also need to be manually validated and interpreted by clinicians. This can be time-consuming and expensive [127]. Because of these costs, much of this research, including the work cited in this review, skips the process of expert review. Clinicians' skepticism for research without expert review may greatly dampen their enthusiasm for the work and consequently reduce its impact. To date, even well-resourced large national consortia have been challenged by the task of acquiring enough expert-validated labelled data. For instance, in the eMERGE consortia and PheKB database [128], most samples with expert validation contain only 100–300 patients. These datasets are quite small even for simple machine learning algorithms. The challenge is greater for deep learning models with many parameters. While unsupervised and semi-supervised approaches can help with small sample sizes, the field would benefit greatly from large collections of anonymized records in which a substantial number of records have undergone expert review. This challenge is not unique to EHR-based studies. Work on medical images, omics data in applications for which detailed metadata are required, and other applications for which labels are costly to obtain will be hampered as long as abundant curated data are unavailable.

Successful approaches to date in this domain have sidestepped this challenge by making methodological choices that either reduce the need for labelled examples or use transformations to training data to increase the number of times it can be used before overfitting occurs. For example, the unsupervised and semi-supervised methods that we have discussed reduce the need for labelled examples [115]. The anchor and learn framework [129] uses expert knowledge to identify high-confidence observations from which labels can be inferred. If transformations are available that preserve the meaningful content of the data, the adversarial and augmented training techniques discussed above can reduce overfitting. While these can be easily imagined for certain methods that operate on images, it is more challenging to figure out equivalent transformations for a patient's clinical test results. Consequently, it may be hard to employ such training examples with other applications. Finally, approaches that transfer features can also help use valuable training data most efficiently. Rajkomar et al. [58] trained a deep neural network using generic images before tuning using only radiology images. Datasets that require many of the same types of features might be used for initial training, before fine-tuning takes place with the more sparse biomedical examples. Though the analysis has not yet been attempted, it is possible that analogous strategies may be possible with EHRs. For example, features learned from the EHR for one type of clinical test (e.g. a decrease over time in a laboratory value) may transfer across phenotypes. Methods to accomplish more with little high-quality labelled data arose in other domains and may also be adapted to this challenge, e.g. data programming [130]. In data programming, noisy automated labelling functions are integrated.
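The idea of integrating noisy labelling functions can be sketched as follows. Note the simplification: data programming as described in [130] estimates how reliable each labelling function is, whereas this sketch swaps in an unweighted majority vote as a stand-in. The labelling functions, their keywords and the phenotype are hypothetical examples, not validated phenotyping rules.

```python
from typing import Callable, List, Optional

LabellingFunction = Callable[[str], Optional[int]]


# Hypothetical labelling functions for a diabetes phenotype. Each returns
# 1 (case), 0 (control) or None (abstain) for a free-text note.
def lf_mentions_metformin(note: str) -> Optional[int]:
    return 1 if "metformin" in note.lower() else None


def lf_explicit_diagnosis(note: str) -> Optional[int]:
    return 1 if "type 2 diabetes" in note.lower() else None


def lf_negated_history(note: str) -> Optional[int]:
    return 0 if "no history of diabetes" in note.lower() else None


def majority_vote(note: str, functions: List[LabellingFunction]) -> Optional[int]:
    """Combine noisy votes; abstain when there are no votes or a tie.

    Simplified stand-in: real data programming weights each function
    by its estimated accuracy instead of counting votes equally.
    """
    votes = [v for v in (lf(note) for lf in functions) if v is not None]
    if not votes:
        return None
    positives = sum(votes)
    negatives = len(votes) - positives
    if positives == negatives:
        return None
    return 1 if positives > negatives else 0
```

Labels produced this way are weak labels in the same sense as the text-mined image annotations discussed earlier, so downstream models should be evaluated against expert-reviewed records where these are available.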

Numerous commentators have described data as the new oil [131,132]. The idea behind this metaphor is that data are available in large quantities, valuable once refined, and this underlying resource will enable a data-driven revolution in how work is done. Contrasting with this perspective, Ratner et al. [133] described labelled training data, instead of data, as ‘The New New Oil’. In this framing, data are abundant and not a scarce resource. Instead, new approaches to solving problems arise when labelled training data become sufficient to enable them. Based on our review of research on deep learning methods to categorize disease, the latter framing rings true.

We expect improved methods for domains with limited data to play an important role if deep learning is going to transform how we categorize states of human health. We do not expect that deep learning methods will replace expert review. We expect them to complement expert review by allowing more efficient use of the costly practice of manual annotation.

2.4.2. Data sharing is hampered by standardization and privacy considerations

To construct the types of very large datasets that deep learning methods thrive on, we need robust sharing of large collections of data. This is, in part, a cultural challenge. We touch on this challenge in the Discussion section. Beyond the cultural hurdles around data sharing, there are also technological and legal hurdles related to sharing individual health records or deep models built from such records. This subsection deals primarily with these challenges.

EHRs are designed chiefly for clinical, administrative and financial purposes, such as patient care, insurance and billing [134]. Science is at best a tertiary priority, presenting challenges to EHR-based research, in general, and to deep learning research, in particular. Although there is significant work in the literature around EHR data quality and the impact on research [135], we focus on three types of challenges: local bias, wider standards and legal issues. Note these problems are not restricted to EHRs but can also apply to any large biomedical dataset, e.g. clinical trial data.

Even within the same healthcare system, EHRs can be used differently [136,137]. Individual users have unique documentation and ordering patterns, with different departments and different hospitals having different priorities that code patients and introduce missing data in a non-random fashion [138]. Patient data may be kept across several ‘silos’ within a single health system (e.g. separate nursing documentation, registries, etc.). Even the most basic task of matching patients across systems can be challenging due to data entry issues [139]. The situation is further exacerbated by the ongoing introduction, evolution and migration of EHR systems, especially where reorganized and acquired healthcare facilities have to merge. Furthermore, even the ostensibly least-biased data type, laboratory measurements, can be biased by both the healthcare process and patient health state [140]. As a result, EHR data can be less complete and less objective than expected.

In the wider picture, standards for EHRs are numerous and evolving. Proprietary systems, indifferent and scattered use of health information standards, and controlled terminologies make combining and comparing data across systems challenging [141]. Further diversity arises from variation in languages, healthcare practices and demographics. Merging EHRs gathered in different systems (and even under different assumptions) is challenging [142].

Combining or replicating studies across systems thus requires controlling for both the above biases and dealing with mismatching standards. This has the practical effect of reducing cohort size, limiting statistical significance, preventing the detection of weak effects [143], and restricting the number of parameters that can be trained in a model. Furthermore, rule-based algorithms have been popular in EHR-based research, but because these are developed at a single institution and trained with a specific patient population, they do not transfer easily to other healthcare systems [144]. Genetic studies using EHR data are subject to even more bias, as the differences in population ancestry across health centres (e.g. proportion of patients with African or Asian ancestry) can affect algorithm performance. For example, Wiley et al. [145] showed that warfarin dosing algorithms often under-perform in African Americans, illustrating that some of these issues are unresolved even at a treatment best practices level. Lack of standardization also makes it challenging for investigators skilled in deep learning to enter the field, as numerous data processing steps must be performed before algorithms are applied.

Finally, even if data were perfectly consistent and compatible across systems, attempts to share and combine EHR data face considerable legal and ethical barriers. Patient privacy can severely restrict the sharing and use of EHR data [146]. Here again, standards are heterogeneous and evolving, but often EHR data cannot be exported or even accessed directly for research purposes without appropriate consent. In the USA, research use of EHR data is subject both to the Common Rule and the Health Insurance Portability and Accountability Act. Ambiguity in the regulatory language and individual interpretation of these rules can hamper use of EHR data [147]. Once again, this has the effect of making data gathering more laborious and expensive, reducing sample size and study power.

Several technological solutions have been proposed in this direction, allowing access to sensitive data while satisfying privacy and legal concerns. Software like DataShield [148] and ViPAR [149], although not EHR-specific, allows querying and combining of datasets and calculation of summary statistics across remote sites by ‘taking the analysis to the data’. The computation is carried out at the remote site. Conversely, the EHR4CR project [141] allows analysis of private data by use of an intermediation layer that interprets remote queries across internal formats and datastores and returns the results in a de-identified standard form, thus giving real-time consistent but secure access. Continuous analysis [150] can allow reproducible computing on private data. Using such techniques, intermediate results can be automatically tracked and shared without sharing the original data. While none of these have been used in deep learning, the potential is there.
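The principle of ‘taking the analysis to the data’ can be illustrated with a toy example in which each site shares only aggregate quantities and a coordinator combines them. This is a conceptual sketch and does not reproduce the interface of any of the systems cited above; the function names and the laboratory values in the usage note are hypothetical.

```python
from typing import Iterable, List, Tuple


def local_summary(values: Iterable[float]) -> Tuple[int, float]:
    """Run at the remote site: only a count and a sum leave the site."""
    values = list(values)
    return len(values), sum(values)


def pooled_mean(summaries: List[Tuple[int, float]]) -> float:
    """Run by the coordinator: combine site-level aggregates."""
    total_count = sum(count for count, _ in summaries)
    total_sum = sum(value_sum for _, value_sum in summaries)
    return total_sum / total_count
```

For two hypothetical sites holding the values `[5.0, 7.0]` and `[6.0, 8.0, 9.0]`, the coordinator recovers the same mean it would obtain from the pooled records without ever seeing an individual measurement.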

Even without sharing data, algorithms trained on confidential patient data may present security risks or accidentally allow for the exposure of individual-level patient data. Tramer et al. [151] showed the ability to steal trained models via public application programming interfaces (APIs). Dwork & Roth [152] demonstrate the ability to expose individual-level information from accurate answers in a machine learning model. Attackers can use similar attacks to find out if a particular data instance was present in the original training set for the machine learning model [153], in this case, whether a person's record was present. To protect against these attacks, Simmons et al. [154] developed the ability to perform genome-wide association studies in a differentially private manner, and Abadi et al. [155] show the ability to train deep learning classifiers under the differential privacy framework.

These attacks also present a potential hazard for approaches that aim to generate data. Choi et al. [156] propose generative adversarial neural networks (GANs) as a tool to make sharable EHR data, and Esteban et al. [157] showed that recurrent GANs could be used for time-series data. However, in both cases the authors did not take steps to protect the model from such attacks. There are approaches to protect models, but they pose their own challenges. Training in a differentially private manner provides a limited guarantee that an algorithm's output will be equally likely to occur regardless of the participation of any one individual. The limit is determined by parameters which provide a quantification of privacy. Beaulieu-Jones et al. [158] demonstrated the ability to generate data that preserved properties of the SPRINT clinical trial with GANs under the differential privacy framework. Both Beaulieu-Jones et al. and Esteban et al. train models on synthetic data generated under differential privacy and observe performance from a transfer learning evaluation that is only slightly below models trained on the original, real data. Taken together, these results suggest that differentially private GANs may be an attractive way to generate sharable datasets for downstream reanalysis.
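For readers unfamiliar with the formal guarantee, the standard definition of differential privacy can be stated as follows. The symbols are the usual ones from the differential privacy literature and are introduced here for illustration: a randomized algorithm $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private if, for every pair of datasets $D$ and $D'$ that differ in the record of a single individual and for every set $S$ of possible outputs,

```latex
\Pr\left[\mathcal{M}(D) \in S\right] \;\le\; e^{\varepsilon}\,\Pr\left[\mathcal{M}(D') \in S\right] + \delta .
```

Smaller values of $\varepsilon$ and $\delta$ correspond to a stronger guarantee, and these are the parameters that quantify privacy in the sense described above.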

Federated learning [159] and secure aggregations [160] are complementary approaches that reinforce differential privacy. Both aim to maintain privacy by training deep learning models from decentralized data sources such as personal mobile devices without transferring actual training instances. This is becoming of increasing importance with the rapid growth of mobile health applications. However, the training process in these approaches places constraints on the algorithms used and can make fitting a model substantially more challenging. It can be trivial to train a model without differential privacy, but quite difficult to train one within the differential privacy framework [158]. This problem can be particularly pronounced with small sample sizes.
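The aggregation step at the heart of federated learning can be sketched as a weighted average of model parameters that were trained locally. This is a minimal illustration that assumes each client reports a flat list of parameters together with its number of training examples; it omits local training, communication and the secure aggregation protocol of [160], and the names are hypothetical.

```python
from typing import List


def federated_average(
    client_weights: List[List[float]],
    client_sizes: List[int],
) -> List[float]:
    """Average client parameters, weighted by local sample count.

    client_weights : one flat parameter vector per client
    client_sizes   : number of local training examples per client
    Raw training examples never leave the clients; only the parameter
    vectors are combined here.
    """
    total = sum(client_sizes)
    dimension = len(client_weights[0])
    return [
        sum(w[k] * n for w, n in zip(client_weights, client_sizes)) / total
        for k in range(dimension)
    ]
```

Weighting by sample count lets clients with more data contribute proportionally more to the shared model, which is one reason small or unevenly sized sites can make fitting harder.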

While none of these problems are insurmountable or restricted to deep learning, they present challenges that cannot be ignored. Technical evolution in EHRs and data standards will doubtless ease—although not solve—the problems of data sharing and merging. More problematic are the privacy issues. Those applying deep learning to the domain should consider the potential of inadvertently disclosing the participants' identities. Techniques that enable training on data without sharing the raw data may have a part to play. Training within a differential privacy framework may often be warranted.

2.4.3. Discrimination and ‘right to an explanation’ laws

In April 2016, the European Union adopted new rules regarding the use of personal information, the General Data Protection Regulation [161]. A component of these rules can be summed up by the phrase ‘right to an explanation’. Those who use machine learning algorithms must be able to explain how a decision was reached. For example, a clinician treating a patient who is aided by a machine learning algorithm may be expected to explain decisions that use the patient's data. The new rules were designed to target categorization or recommendation systems, which inherently profile individuals. Such systems can do so in ways that are discriminatory and unlawful.

As datasets become larger and more complex, we may begin to identify relationships in data that are important for human health but difficult to understand. The algorithms described in this review and others like them may become highly accurate and useful for various purposes, including within medical practice. However, to discover and avoid discriminatory applications it will be important to consider interpretability alongside accuracy. A number of properties of genomic and healthcare data will make this difficult.

First, research samples are frequently non-representative of the general population of interest; they tend to be disproportionately sick [162], male [163] and European in ancestry [164]. One well-known consequence of these biases in genomics is that penetrance is consistently lower in the general population than would be implied by case–control data, as reviewed in [162]. Moreover, real genetic associations found in one population may not hold in other populations with different patterns of linkage disequilibrium (even when population stratification is explicitly controlled for [165]). As a result, many genomic findings are of limited value for people of non-European ancestry [164] and may even lead to worse treatment outcomes for them. Methods have been developed for mitigating some of these problems in genomic studies [162,165], but it is not clear how easily they can be adapted for deep models that are designed specifically to extract subtle effects from high-dimensional data. For example, differences in the equipment that tended to be used for cases versus controls have led to spurious genetic findings (e.g. Sebastiani et al.'s retraction [166]). In some contexts, it may not be possible to correct for all of these differences to the degree that a deep network is unable to use them. Moreover, the complexity of deep networks makes it difficult to determine when their predictions are likely to be based on such nominally irrelevant features of the data (called ‘leakage’ in other fields [167]). When we are not careful with our data and models, we may inadvertently say more about the way the data were collected (which may involve a history of unequal access and discrimination) than about anything of scientific or predictive value. This fact can undermine the privacy of patient data [167] or lead to severe discriminatory consequences [168].

There is a small but growing literature on the prevention and mitigation of data leakage [167], as well as a closely related literature on discriminatory model behaviour [169], but it remains difficult to predict when these problems will arise, how to diagnose them and how to resolve them in practice. There is even disagreement about which kinds of algorithmic outcomes should be considered discriminatory [170]. Despite the difficulties and uncertainties, machine learning practitioners (and particularly those who use deep neural networks, which are challenging to interpret) must remain cognizant of these dangers and make every effort to prevent harm from discriminatory predictions. To reach their potential in this domain, deep learning methods will need to be interpretable. Researchers need to consider the extent to which biases may be learned by the model and whether a model is sufficiently interpretable to identify bias. We discuss the challenge of model interpretability more thoroughly in Discussion.

2.4.4. Applications of deep learning to longitudinal analysis

Longitudinal analysis follows a population across time, for example, prospectively from birth or from the onset of particular conditions. In large patient populations, longitudinal analyses such as the Framingham Heart Study [171] and the Avon Longitudinal Study of Parents and Children [172] have yielded important discoveries about the development of disease and the factors contributing to health status. Yet, a common practice in EHR-based research is to take a snapshot at a point in time and convert patient data to a traditional vector for machine learning and statistical analysis. This results in loss of information, as the timing and order of events can provide insight into a patient's disease and treatment [173]. Efforts to model sequences of events have shown promise [174] but require exceedingly large patient cohorts because of discrete combinatorial bucketing. Lasko et al. [114] used autoencoders on longitudinal sequences of serum uric acid measurements to identify population subtypes. More recently, deep learning has shown promise both for working with sequences (CNNs) [175] and for incorporating past and current state (RNNs, LSTMs) [118]. This may be a particular area of opportunity for deep neural networks. The ability to recognize relevant sequences of events from a large number of trajectories requires powerful and flexible feature construction methods, an area in which deep neural networks excel.
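To make the information loss concrete, the following minimal sketch contrasts a snapshot encoding (event counts) with an order-preserving sequence encoding. The event codes, patient records and function names are invented for illustration and are not drawn from any of the cited studies.

```python
from collections import Counter

# Hypothetical event vocabulary; real EHR data would use coded diagnoses,
# laboratory results and prescriptions.
VOCAB = ["dx_gout", "rx_allopurinol", "lab_uric_acid_high"]


def snapshot_vector(events):
    """Count each event type, discarding timing and order."""
    counts = Counter(code for _, code in events)
    return [counts[code] for code in VOCAB]


def ordered_sequence(events):
    """Sort events by day and keep (days_since_previous_event, code) pairs."""
    sequence = []
    previous_day = None
    for day, code in sorted(events):
        gap = 0 if previous_day is None else day - previous_day
        sequence.append((gap, code))
        previous_day = day
    return sequence


# Two invented patients with the same events in a different order and tempo.
patient_a = [(0, "lab_uric_acid_high"), (30, "dx_gout"), (31, "rx_allopurinol")]
patient_b = [(0, "rx_allopurinol"), (200, "dx_gout"), (400, "lab_uric_acid_high")]

same_snapshot = snapshot_vector(patient_a) == snapshot_vector(patient_b)
same_sequence = ordered_sequence(patient_a) == ordered_sequence(patient_b)
print(same_snapshot, same_sequence)
```

The two patients are indistinguishable in the snapshot representation but differ in the sequence representation, which is the kind of input that sequence models such as RNNs can consume.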

3. Deep learning to study the fundamental biological processes underlying human disease

The study of cellular structure and core biological processes—transcription, translation, signalling, metabolism, etc.—in humans and model organisms will greatly impact our understanding of human disease over the long horizon [176]. Predicting how cellular systems respond to environmental perturbations and are altered by genetic variation remain daunting tasks. Deep learning offers new approaches for modelling biological processes and integrating multiple types of omic data [177], which could eventually help predict how these processes are disrupted in disease. Recent work has already advanced our ability to identify and interpret genetic variants, study microbial communities and predict protein structures, which also relates to the problems discussed in the drug development section. In addition, unsupervised deep learning has enormous potential for discovering novel cellular states from gene expression, fluorescence microscopy and other types of data that may ultimately prove to be clinically relevant.

Progress has been rapid in genomics and imaging, fields where important tasks are readily adapted to well-established deep learning paradigms. One-dimensional CNNs and RNNs are well suited for tasks related to DNA- and RNA-binding proteins, epigenomics and RNA splicing. Two-dimensional CNNs are ideal for segmentation, feature extraction and classification in fluorescence microscopy images [17]. Other areas, such as cellular signalling, are biologically important but have been studied less frequently to date, with some exceptions [178]. This may be a consequence of data limitations or greater challenges in adapting neural network architectures to the available data. Here, we highlight several areas of investigation and assess how deep learning might move these fields forward.

3.1. Gene expression

Gene expression technologies characterize the abundance of many thousands of RNA transcripts within a given organism, tissue or cell. This characterization can represent the underlying state of the given system and can be used to study heterogeneity across samples as well as how the system reacts to perturbation. While gene expression measurements were traditionally made by quantitative polymerase chain reaction, low-throughput fluorescence-based methods and microarray technologies, the field has shifted in recent years to primarily performing RNA sequencing (RNA-seq) to catalogue whole transcriptomes. As RNA-seq continues to fall in price and rise in throughput, sample sizes will increase and training deep models to study gene expression will become even more useful.

Already several deep learning approaches have been applied to gene expression data with varying aims. For instance, many researchers have applied unsupervised deep learning models to extract meaningful representations of gene modules or sample clusters. Denoising autoencoders have been used to cluster yeast expression microarrays into known modules representing cell cycle processes [179] and to stratify yeast strains based on chemical and mutational perturbations [180]. Shallow (one hidden layer) denoising autoencoders have also been fruitful in extracting biological insight from thousands of Pseudomonas aeruginosa experiments [181,182] and in aggregating features relevant to specific breast cancer subtypes [26]. These unsupervised approaches applied to gene expression data are powerful methods for identifying gene signatures that may otherwise be overlooked. An additional benefit of unsupervised approaches is that ground-truth labels, which are often difficult to acquire or are incorrect, are non-essential. However, the genes that have been aggregated into features must be interpreted carefully. Attributing each node to a single specific biological function risks over-interpreting models. Batch effects could cause models to discover non-biological features, and downstream analyses should take this into consideration.

Deep learning approaches are also being applied to gene expression prediction tasks. For example, a deep neural network with three hidden layers outperformed linear regression in inferring the expression of over 20 000 target genes based on a representative, well-connected set of about 1000 landmark genes [183]. However, while the deep learning model outperformed existing algorithms in nearly every scenario, its absolute performance remained poor. The study was also limited by computational bottlenecks that required the data to be split randomly between two distinct models that were trained separately. It is unclear how much performance would have improved without these computational restrictions.

Epigenomic data, combined with deep learning, may have sufficient explanatory power to infer gene expression. For instance, the DeepChrome CNN [184] improved the prediction accuracy of high or low gene expression from histone modifications over existing methods. AttentiveChrome [185] added a deep attention model to further enhance DeepChrome. Deep learning can also integrate different data types. For example, Liang et al. [186] combined RBMs to integrate gene expression, DNA methylation and miRNA data to define ovarian cancer subtypes. While these approaches are promising, many convert gene expression measurements to categorical or binary variables, thus ablating many complex gene expression signatures present in intermediate and relative numbers.
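The cost of discretization can be shown with a small sketch. The expression values below are invented, and thresholding each gene at its median across samples is an assumption made for illustration rather than the procedure of any cited method.

```python
import statistics


def binarize(values):
    """Label each sample 1 if expression exceeds the gene's median, else 0."""
    cutoff = statistics.median(values)
    return [1 if value > cutoff else 0 for value in values]


# Invented expression values for two genes across six samples.
gene_graded = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # steady, dose-like increase
gene_switch = [1.0, 1.0, 1.0, 9.0, 9.0, 9.0]  # abrupt on/off behaviour

graded_binary = binarize(gene_graded)
switch_binary = binarize(gene_switch)
print(graded_binary == switch_binary)
```

After binarization, the graded gene and the switch-like gene become identical, so any model trained on the binary values cannot distinguish the two expression patterns.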

Deep learning applied to gene expression data is still in its infancy, but the future is bright. Many previously untestable hypotheses can now be interrogated as deep learning enables analysis of increasing amounts of data generated by new technologies. For example, the effects of cellular heterogeneity on basic biology and disease aetiology can now be explored by single-cell RNA-seq and high-throughput fluorescence-based imaging, techniques we discuss below that will benefit immensely from deep learning approaches.

3.2. Splicing

Pre-mRNA transcripts can be spliced into different isoforms by retaining or skipping subsets of exons or including parts of introns, creating enormous spatio-temporal flexibility to generate multiple distinct proteins from a single gene. This remarkable complexity can lend itself to defects that underlie many diseases. For instance, splicing mutations in the lamin A (LMNA) gene can lead to specific variants of dilated cardiomyopathy and limb-girdle muscular dystrophy [187]. A recent study found that quantitative trait loci that affect splicing in lymphoblastoid cell lines are enriched within risk loci for schizophrenia, multiple sclerosis and other immune diseases, implicating mis-splicing as a more widespread feature of human pathologies than previously thought [188]. Therapeutic strategies that aim to modulate splicing are also currently being considered for disorders such as Duchenne muscular dystrophy and spinal muscular atrophy [187].

Sequencing studies routinely return thousands of unannotated variants, but which cause functional changes in splicing and how are those changes manifested? Prediction of a ‘splicing code’ has been a goal of the field for the past decade. Initial machine learning approaches used a naive Bayes model and a two-layer Bayesian neural network with thousands of hand-derived sequence-based features to predict the probability of exon skipping [189,190]. With the advent of deep learning, more complex models provided better predictive accuracy [191,192]. Importantly, these new approaches can take in multiple kinds of epigenomic measurements as well as tissue identity and RNA-binding partners of splicing factors. Deep learning is critical in furthering these kinds of integrative studies where different data types and inputs interact in unpredictable (often nonlinear) ways to create higher-order features. Moreover, as in gene expression network analysis, interrogating the hidden nodes within neural networks could potentially illuminate important aspects of splicing behaviour. For instance, tissue-specific splicing mechanisms could be inferred by training networks on splicing data from different tissues, then searching for common versus distinctive hidden nodes, a technique employed by Qin et al. [193] for tissue-specific transcription factor (TF) binding predictions.

A parallel effort has been to use more data with simpler models. An exhaustive study using readouts of splicing for millions of synthetic intronic sequences uncovered motifs that influence the strength of alternative splice sites [194]. The authors built a simple linear model using hexamer motif frequencies that successfully generalized to exon skipping. In a limited analysis using single-nucleotide polymorphisms (SNPs) from three genes, it predicted exon skipping with three times the accuracy of an existing deep learning-based framework [191]. This case is instructive in that clever sources of data, not just more descriptive models, are still critical.
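The general shape of such a model, a weighted sum of hexamer counts, can be sketched as follows. The weights and the example sequence are hypothetical; in the study described above the weights would be estimated from the measured synthetic sequences, and the details of that model may differ from this sketch.

```python
from collections import Counter

K = 6  # hexamers


def kmer_counts(seq, k=K):
    """Count every overlapping k-mer in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def linear_score(seq, weights, bias=0.0):
    """Score a sequence as bias + sum over hexamers of weight * count."""
    counts = kmer_counts(seq)
    return bias + sum(weights.get(kmer, 0.0) * n for kmer, n in counts.items())


# Hypothetical weights: positive values favour splice-site use, negative oppose it.
weights = {"GGTAAG": 1.5, "TTTTTT": -0.5}
example = "AAGGTAAGTTTTTTT"
score = linear_score(example, weights)
print(score)
```

Because each hexamer contributes independently, every weight can be read directly as the estimated effect of that motif, which is part of the appeal of the simpler model.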

We already understand how mis-splicing of a single gene can cause diseases such as limb-girdle muscular dystrophy. The challenge now is to uncover how genome-wide alternative splicing underlies complex, non-Mendelian diseases such as autism, schizophrenia, Type 1 diabetes and multiple sclerosis [195]. As a proof of concept, Xiong et al. [191] sequenced five autism spectrum disorder and 12 control samples, each with an average of 42 000 rare variants, and identified mis-splicing in 19 genes with neural functions. Such methods may one day enable scientists and clinicians to rapidly profile thousands of unannotated variants for functional effects on splicing and nominate candidates for further investigation. Moreover, these nonlinear algorithms can deconvolve the effects of multiple variants on a single splice event without the need to perform combinatorial in vitro experiments. The ultimate goal is to predict an individual's tissue-specific, exon-specific splicing patterns from their genome sequence and other measurements to enable a new branch of precision diagnostics that also stratifies patients and suggests targeted therapies to correct splicing defects. However, to achieve this we expect that methods to interpret the ‘black box’ of deep neural networks and integrate diverse data sources will be required.

3.3. Transcription factors

TFs are proteins that bind regulatory DNA in a sequence-specific manner to modulate the activation and repression of gene transcription. High-throughput in vitro experimental assays that quantitatively measure the binding specificity of a TF to a large library of short oligonucleotides [196] provide rich datasets to model the naked DNA sequence affinity of individual TFs in isolation. However, in vivo TF binding is affected by a variety of other factors beyond sequence affinity, such as competition and cooperation with other TFs, TF concentration and chromatin state (chemical modifications to DNA and to the packaging proteins around which DNA is wrapped) [196]. TFs can thus exhibit highly variable binding landscapes over the same genomic DNA sequence across diverse cell types and states. Several experimental approaches such as chromatin immunoprecipitation followed by sequencing (ChIP-seq) have been developed to profile in vivo binding maps of TFs [196]. Large reference compendia of ChIP-seq data are now freely available for a large collection of TFs in a small number of reference cell states in humans and a few other model organisms [197]. Owing to fundamental material and cost constraints, it is infeasible to perform these experiments for all TFs in every possible cellular state and species. Hence, predictive computational models of TF binding are essential to understand gene regulation in diverse cellular contexts.

Several machine learning approaches have been developed to learn generative and discriminative models of TF binding from in vitro and in vivo TF binding datasets that associate collections of synthetic DNA sequences or genomic DNA sequences to binary labels (bound/unbound) or continuous measures of binding. The most common class of TF binding models in the literature comprises those that model only the DNA sequence affinity of TFs from in vitro and in vivo binding data. The earliest models were based on deriving simple, compact, interpretable sequence motif representations such as position weight matrices (PWMs) and other biophysically inspired models [198–200]. These models were outperformed by general k-mer-based models including support vector machines (SVMs) with string kernels [201,202].
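As a point of reference for the motif-based models, the sketch below builds a log-odds PWM from a handful of aligned binding sites and scans a sequence for its best match. The binding sites, the pseudocount and the uniform background frequency are all assumptions chosen for illustration.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Return one {base: log2 odds} dictionary per motif position."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def best_match(seq, pwm):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(pwm)
    windows = [
        (sum(pwm[i][seq[start + i]] for i in range(width)), start)
        for start in range(len(seq) - width + 1)
    ]
    return max(windows)


# Invented aligned binding sites for a hypothetical TF.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
pwm = build_pwm(sites)
score, start = best_match("CCCTGACTCAGGG", pwm)
print(start, round(score, 2))
```

A PWM scores each position independently, which makes it compact and interpretable but unable to capture dependencies between positions; this is one reason the k-mer-based and neural models can outperform it.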

In 2015, Alipanahi et al. [203] developed DeepBind, the first CNN to classify bound DNA sequences based on in vitro and in vivo assays against random DNA sequences matched for dinucleotide sequence composition. The convolutional layers learn pattern detectors reminiscent of PWMs from a one-hot encoding of the raw input DNA sequences. DeepBind outperformed several state-of-the-art methods from the DREAM5 in vitro TF-DNA motif recognition challenge [200]. Although DeepBind was also applied to RNA-binding proteins, in general, RNA binding is a separate problem [204] and accurate models will need to account for RNA secondary structure. Following DeepBind, several optimized convolutional and recurrent neural network architectures as well as novel hybrid approaches that combine kernel methods with neural networks have been proposed that further improve performance [205–208]. Specialized layers and regularizers have also been proposed to reduce parameters and learn more robust models by taking advantage of specific properties of DNA sequences such as their reverse complement equivalence [209,210].
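The link between a convolutional filter and a PWM can be illustrated with a minimal sketch: a filter is a width-by-4 weight matrix that is slid along the one-hot encoded sequence, and the maximum rectified activation acts as a motif detector. The filter weights here are set by hand rather than learned, and taking the maximum over both strands is only one simple way to obtain reverse complement equivalence; it is not necessarily the approach of the specialized layers cited above.

```python
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def one_hot(seq):
    """Encode a DNA string as a list of 4-element indicator rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def conv1d_max(encoded, filt):
    """Slide the filter along the sequence, rectify and global-max-pool."""
    width = len(filt)
    best = 0.0  # starting at zero folds the ReLU into the max
    for start in range(len(encoded) - width + 1):
        activation = sum(
            encoded[start + i][j] * filt[i][j]
            for i in range(width)
            for j in range(4)
        )
        best = max(best, activation)
    return best


def strand_invariant_score(seq, filt):
    """Score both strands and keep the larger activation."""
    forward = conv1d_max(one_hot(seq), filt)
    reverse = conv1d_max(one_hot(reverse_complement(seq)), filt)
    return max(forward, reverse)


# Hand-set filter that rewards each base matching the motif GATA.
gata_filter = one_hot("GATA")
seq = "CCTATCCC"  # contains TATC, the reverse complement of GATA
print(conv1d_max(one_hot(seq), gata_filter), strand_invariant_score(seq, gata_filter))
```

Scanning only the forward strand gives a partial match, whereas the strand-invariant score recovers the full motif on the opposite strand.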

While most of these methods learn independent models for different TFs, in vivo multiple TFs compete or cooperate to occupy DNA binding sites, resulting in complex combinatorial co-binding landscapes. To take advantage of this shared structure in in vivo TF binding data, multi-task neural network architectures have been developed that explicitly share parameters across models for multiple TFs [208,211,212]. Some of these multi-task models train and evaluate classification performance relative to an unbound background set of regulatory DNA sequences sampled from the genome rather than using synthetic background sequences with matched dinucleotide composition.

The above-mentioned TF binding prediction models that use only DNA sequences as inputs have a fundamental limitation. Because the DNA sequence of a genome is the same across different cell types and states, a sequence-only model of TF binding cannot predict different in vivo TF binding landscapes in new cell types not used during training. One approach for generalizing TF binding predictions to new cell types is to learn models that integrate DNA sequence inputs with other cell-type-specific data modalities that modulate in vivo TF binding such as surrogate measures of TF concentration (e.g. TF gene expression) and chromatin state. Arvey et al. [213] showed that combining the predictions of SVMs trained on DNA sequence inputs and cell-type specific DNase-seq data, which measures genome-wide chromatin accessibility, improved in vivo TF binding prediction within and across cell types. Several ‘footprinting’-based methods have also been developed that learn to discriminate bound from unbound instances of known canonical motifs of a target TF based on high-resolution footprint patterns of chromatin accessibility that are specific to the target TF [214]. However, the genome-wide predictive performance of these methods in new cell types and states has not been evaluated.

Recently, a community challenge known as the ‘ENCODE-DREAM in vivo TF Binding Site Prediction Challenge’ was introduced to systematically evaluate the genome-wide performance of methods that can predict TF binding across cell states by integrating DNA sequence and in vitro DNA shape with cell-type-specific chromatin accessibility and gene expression [215]. A deep learning model called FactorNet was among the top three performing methods in the challenge [216]. FactorNet uses a multimodal hybrid convolutional and recurrent architecture that integrates DNA sequence with chromatin accessibility profiles, gene expression and evolutionary conservation of sequence. It is worth noting that FactorNet was slightly outperformed by an approach that does not use neural networks [217]. This top ranking approach uses an extensive set of curated features in a weighted variant of a discriminative maximum conditional likelihood model in combination with a novel iterative training strategy and model stacking. There appears to be significant room for improvement because none of the current approaches for cross cell-type prediction explicitly account for the fact that TFs can co-bind with distinct cofactors in different cell states. In such cases, sequence features that are predictive of TF binding in one cell state may be detrimental to predicting binding in another.

Singh et al. [218] developed transfer string kernels for SVMs for cross-context TF binding. Domain adaptation methods that allow training neural networks which are transferable between differing training and test set distributions of sequence features could be a promising avenue going forward [219,220]. These approaches may also be useful for transferring TF binding models across species.

Another class of imputation-based cross cell type in vivo TF binding prediction methods leverages the strong correlation between combinatorial binding landscapes of multiple TFs. Given a partially complete panel of binding profiles of multiple TFs in multiple cell types, a deep learning method called TFImpute learns to predict the missing binding profile of a target TF in some target cell type in the panel based on the binding profiles of other TFs in the target cell type and the binding profile of the target TF in other cell types in the panel [193]. However, TFImpute cannot generalize predictions beyond the training panel of cell types and requires TF binding profiles of related TFs.

It is worth noting that TF binding prediction methods in the literature based on neural networks and other machine learning approaches choose to sample the set of bound and unbound sequences in a variety of different ways. These choices and the choice of performance evaluation measures significantly confound systematic comparison of model performance (see Discussion).

Several methods have also been developed to interpret neural network models of TF binding. Alipanahi et al. [203] visualize convolutional filters to obtain insights into the sequence preferences of TFs. They also introduced in silico mutation maps for identifying important predictive nucleotides in input DNA sequences by exhaustively forward propagating perturbations to individual nucleotides to record the corresponding change in output prediction. Shrikumar et al. [221] proposed efficient backpropagation-based approaches to simultaneously score the contribution of all nucleotides in an input DNA sequence to an output prediction. Lanchantin et al. [206] developed tools to visualize TF motifs learned from TF binding site classification tasks. These and other general interpretation techniques (see Discussion) will be critical to improve our understanding of the biologically meaningful patterns learned by deep learning models of TF binding.
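The idea behind in silico mutation maps can be sketched in a few lines: substitute every alternative base at every position, re-run the model and record the change in its output. Here `toy_model` is a hypothetical stand-in for a trained network, not DeepBind itself, and the motif it detects is chosen arbitrarily.

```python
BASES = "ACGT"
MOTIF = "GATA"  # arbitrary motif used by the stand-in model


def toy_model(seq):
    """Stand-in for a trained network: best fractional match to MOTIF in any window."""
    width = len(MOTIF)
    best = 0.0
    for start in range(len(seq) - width + 1):
        matches = sum(seq[start + i] == MOTIF[i] for i in range(width))
        best = max(best, matches / width)
    return best


def mutation_map(seq, model):
    """For each position, map each alternative base to the change in model output."""
    reference = model(seq)
    deltas = []
    for pos, ref_base in enumerate(seq):
        row = {}
        for alt in BASES:
            if alt == ref_base:
                continue
            mutated = seq[:pos] + alt + seq[pos + 1:]
            row[alt] = model(mutated) - reference
        deltas.append(row)
    return deltas


def importance(deltas):
    """Summarize each position by its largest absolute change."""
    return [max(abs(delta) for delta in row.values()) for row in deltas]


seq = "CCGATACC"
scores = importance(mutation_map(seq, toy_model))
print(scores)
```

This exhaustive approach needs three forward passes per nucleotide, which is the cost that backpropagation-based scoring methods aim to avoid by attributing the output to all positions in a single pass.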

3.4. Promoters and enhancers

3.4.1. From transcription factor binding to promoters and enhancers

Multiple TFs act in concert to coordinate changes in gene regulation at the genomic regions known as promoters and enhancers. Each gene has an upstream promoter, essential for initiating that gene's transcription. The gene may also interact with multiple enhancers, which can amplify transcription in particular cellular contexts. These contexts include different cell types in development or environmental stresses.

Promoters and enhancers provide a nexus where clusters of TFs and binding sites mediate downstream gene regulation, starting with transcription. The gold standard to identify an active promoter or enhancer requires demonstrating its ability to affect transcription or other downstream gene products. Even extensive biochemical TF binding data has thus far proven insufficient on its own to accurately and comprehensively locate promoters and enhancers. We lack sufficient understanding of these elements to derive a mechanistic ‘promoter code’ or ‘enhancer code’. But extensive labelled data on promoters and enhancers lends itself to probabilistic classification. The complex interplay of TFs and chromatin leading to the emergent properties of promoter and enhancer activity seems particularly apt for representation by deep neural networks.

3.4.2. Promoters

Despite decades of work, computational identification of promoters remains a stubborn problem [222]. Researchers used neural networks for promoter recognition as early as 1996 [223]. Recently, a CNN recognized promoter sequences with sensitivity and specificity exceeding 90% [224]. Most activity in computational prediction of regulatory regions, however, has moved to enhancer identification. Because one can identify promoters with straightforward biochemical assays [225,226], the direct rewards of promoter prediction alone have decreased. But the reliable ground truth provided by these assays makes promoter identification an appealing test bed for deep learning approaches that can also identify enhancers.

3.4.3. Enhancers

Recognizing enhancers presents additional challenges. Enhancers may be up to 1 000 000 bp away from the affected promoter, and even within introns of other genes [227]. Enhancers do not necessarily operate on the nearest gene and may affect multiple genes. Their activity is frequently tissue- or context-specific. No biochemical assay can reliably identify all enhancers. Distinguishing them from other regulatory elements remains difficult, and some believe the distinction somewhat artificial [228]. While these factors make the enhancer identification problem more difficult, they also make a solution more valuable.

Several neural network approaches have yielded promising results in enhancer prediction. Both Basset [229] and DeepEnhancer [230] used CNNs to predict enhancers. DECRES used a feed-forward neural network [231] to distinguish between different kinds of regulatory elements, such as active enhancers and promoters, but had difficulty distinguishing between inactive enhancers and promoters. The authors also investigated the power of sequence features to drive classification, finding that beyond CpG islands, few were useful.

Comparing the performance of enhancer prediction methods illustrates the problems in using metrics created with different benchmarking procedures. Both the Basset and DeepEnhancer studies include comparisons to a baseline SVM approach, gkm-SVM [202]. The Basset study reports gkm-SVM attains a mean area under the precision-recall curve (AUPR) of 0.322 over 164 cell types [229]. The DeepEnhancer study reports for gkm-SVM a dramatically different AUPR of 0.899 on nine cell types [230]. This large difference means it is impossible to directly compare the performance of Basset and DeepEnhancer based solely on their reported metrics. DECRES used a different set of metrics altogether. To drive further progress in enhancer identification, we must develop a common and comparable benchmarking procedure (see Discussion).
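One way in which benchmark construction alone can move AUPR is through the ratio of negative to positive examples. The sketch below uses invented scores and is not a reconstruction of either study's benchmark; it applies the same classifier scores to a roughly balanced set and to a set with ten times as many negatives, computing AUPR as the step-wise average precision.

```python
def average_precision(labels, scores):
    """Area under the precision-recall curve via the average-precision sum."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_positive = sum(labels)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / n_positive


# Invented classifier scores for positives and negatives.
positive_scores = [0.9, 0.7, 0.5]
negative_scores = [0.8, 0.6, 0.4, 0.3]


def benchmark(negative_copies):
    """Build a benchmark with the negatives repeated negative_copies times."""
    negatives = negative_scores * negative_copies
    labels = [1] * len(positive_scores) + [0] * len(negatives)
    return average_precision(labels, positive_scores + negatives)


aupr_balanced = benchmark(1)
aupr_imbalanced = benchmark(10)
print(round(aupr_balanced, 3), round(aupr_imbalanced, 3))
```

The classifier is unchanged, yet its AUPR falls substantially on the more imbalanced benchmark. Metrics reported on differently constructed test sets are therefore not comparable without a shared procedure.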

3.4.4. Promoter–enhancer interactions

In addition to the location of enhancers, identifying enhancer–promoter interactions in three-dimensional space will provide critical knowledge for understanding transcriptional regulation. SPEID used a CNN to predict these interactions with only sequence and the location of putative enhancers and promoters along a one-dimensional chromosome [232]. It compared well to other methods using a full complement of biochemical data from ChIP-seq and other epigenomic methods. Of course, the putative enhancers and promoters used were themselves derived from epigenomic methods. But one could easily replace them with the output of one of the enhancer or promoter prediction methods above.

3.5. MicroRNA binding

Prediction of miRNAs and miRNA targets is of great interest, as they are critical components of gene regulatory networks and are often conserved across great evolutionary distance [233,234]. While many machine learning algorithms have been applied to these tasks, they currently require extensive feature selection and optimization. For instance, one of the most widely adopted tools for miRNA target prediction, TargetScan, trained multiple linear regression models on 14 hand-curated features including structural accessibility of the target site on the mRNA, the degree of site conservation and predicted thermodynamic stability of the miRNA–mRNA complex [235]. Some of these features, including structural accessibility, are imperfect or empirically derived. In addition, current algorithms suffer from low specificity [236].

As in other applications, deep learning promises to achieve equal or better performance in predictive tasks by automatically engineering complex features to minimize an objective function. Two recently published tools use different recurrent neural network-based architectures to perform miRNA and target prediction with solely sequence data as input [236,237]. Though the results are preliminary and still based on a validation set rather than a completely independent test set, they were able to predict microRNA target sites with higher specificity and sensitivity than TargetScan. Excitingly, these tools seem to show that RNNs can accurately align sequences and predict bulges, mismatches and wobble base pairing without requiring the user to input secondary structure predictions or thermodynamic calculations. Further incremental advances in deep learning for miRNA and target prediction will likely be sufficient to meet the current needs of systems biologists and other researchers who use prediction tools mainly to nominate candidates that are then tested experimentally.

3.6. Protein secondary and tertiary structure

Proteins play fundamental roles in almost all biological processes, and understanding their structure is critical for basic biology and drug development. UniProt currently has about 94 million protein sequences, yet fewer than 100 000 proteins across all species have experimentally solved structures in the Protein Data Bank (PDB). As a result, computational structure prediction is essential for the majority of proteins. However, this is very challenging, especially when similar solved structures, called templates, are not available in PDB. Over the past several decades, many computational methods have been developed to predict aspects of protein structure such as secondary structure, torsion angles, solvent accessibility, inter-residue contact maps, disorder regions and side-chain packing. In recent years, multiple deep learning architectures have been applied, including DBNs, LSTMs, CNNs and deep convolutional neural fields [31,238].

Here, we focus on deep learning methods for two representative sub-problems: secondary structure prediction and contact map prediction. Secondary structure refers to local conformation of a sequence segment, while a contact map contains information on all residue–residue contacts. Secondary structure prediction is a basic problem and an almost essential module of any protein structure prediction package. Contact prediction is much more challenging than secondary structure prediction, but it has a much larger impact on tertiary structure prediction. In recent years, the accuracy of contact prediction has greatly improved [29,239–241].
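For readers unfamiliar with contact maps, the sketch below derives one from three-dimensional coordinates. The 8 Å distance threshold and the minimum sequence separation of six residues are common conventions assumed here rather than values taken from the text, and the hairpin-like coordinates are invented.

```python
import math


def contact_map(coords, threshold=8.0, min_separation=6):
    """Return a symmetric 0/1 matrix marking residue pairs closer than threshold.

    Pairs closer than min_separation in sequence are skipped, because
    neighbouring residues are trivially close in space.
    """
    n = len(coords)
    contacts = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                contacts[i][j] = contacts[j][i] = 1
    return contacts


# Invented hairpin: residues 0-4 run along x, residues 5-9 run back at y = 5.
coords = [(3.8 * i, 0.0, 0.0) for i in range(5)]
coords += [(3.8 * (9 - i), 5.0, 0.0) for i in range(5, 10)]

contacts = contact_map(coords)
print(contacts[0][9], contacts[0][7])
```

A contact predictor outputs an estimate of this matrix from sequence alone, which is why contact prediction constrains tertiary structure much more strongly than per-residue secondary structure labels do.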

One can represent protein secondary structure with three different states (α-helix, β-strand and loop regions) or eight finer-grained states. The accuracy of a three-state prediction is called Q3, and accuracy of an eight-state prediction is called Q8. Several groups [30,242,243] applied deep learning to protein secondary structure prediction but were unable to achieve significant improvement over the de facto standard method PSIPRED [244], which uses two shallow feed-forward neural networks. In 2014, Zhou & Troyanskaya [245] demonstrated that they could improve Q8 accuracy by using a deep supervised and convolutional generative stochastic network. In 2016, Wang et al. developed a DeepCNF model that improved Q3 and Q8 accuracy as well as prediction of solvent accessibility and disorder regions [31,238]. DeepCNF achieved a higher Q3 accuracy than the standard maintained by PSIPRED for more than 10 years. This improvement may be mainly due to the ability of convolutional neural fields to capture long-range sequential information, which is important for β-strand prediction. Nevertheless, the improvements in secondary structure prediction from DeepCNF are unlikely to result in a commensurate improvement in tertiary structure prediction because secondary structure mainly reflects coarse-grained local conformation of a protein structure.
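The two accuracy measures can be computed as in the following sketch. The mapping from eight states to three is one common reduction (other conventions exist) and should be read as an assumption, as should the use of `L` for loop; the label strings are invented.

```python
# One common reduction of the eight states to three; other conventions exist.
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",  # helices
    "E": "E", "B": "E",            # strands and bridges
    "T": "C", "S": "C", "L": "C",  # turns, bends and loops
}


def accuracy(predicted, true):
    """Fraction of residues whose predicted state matches the true state."""
    if len(predicted) != len(true):
        raise ValueError("label strings must have equal length")
    return sum(p == t for p, t in zip(predicted, true)) / len(true)


def to_three_state(states):
    return "".join(EIGHT_TO_THREE[state] for state in states)


# Invented eight-state labels for a 15-residue segment.
true8 = "LHHHHGGTEEEEBSL"
pred8 = "LHHHHHHTEEEELSL"

q8 = accuracy(pred8, true8)
q3 = accuracy(to_three_state(pred8), to_three_state(true8))
print(round(q8, 3), round(q3, 3))
```

Q3 is never lower than Q8 for the same prediction under such a mapping, because errors between states that collapse into the same coarse class are no longer counted.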

Protein contact prediction and contact-assisted folding (i.e. folding proteins using predicted contacts as restraints) represent a promising new direction for ab initio folding of proteins without good templates in PDB. Coevolution analysis is effective for proteins with a very large number (more than 1000) of sequence homologues [241], but fares poorly for proteins without many sequence homologues. By combining coevolution information with a few other protein features, shallow neural network methods such as MetaPSICOV [239] and CoinDCA-NN [246] have shown some advantage over pure coevolution analysis for proteins with few sequence homologues, but their accuracy is still far from satisfactory. In recent years, deeper architectures have been explored for contact prediction, such as CMAPpro [247], DNCON [248] and PConsC [249]. However, blindly tested in the well-known CASP competitions, these methods did not show any advantage over MetaPSICOV [239].

Recently, Wang et al. [29] proposed the deep learning method RaptorX-Contact, which significantly improves contact prediction over MetaPSICOV and pure coevolution methods, especially for proteins without many sequence homologues. It employs a network architecture formed by one one-dimensional residual neural network and one two-dimensional residual neural network. Blindly tested in the latest CASP competition (i.e. CASP12 [250]), RaptorX-Contact ranked first in F1 score on free-modelling targets as well as the whole set of targets. In CAMEO (which can be interpreted as a fully automated CASP) [251], its predicted contacts were also able to fold proteins with a novel fold and only 65–330 sequence homologues. This technique also worked well on membrane proteins even when trained on non-membrane proteins [252]. RaptorX-Contact performed better mainly due to the introduction of residual neural networks and exploitation of contact occurrence patterns by simultaneously predicting all the contacts in a single p