This August, I heard the words that no one wants to hear from their doctor: “You have cancer.” I was diagnosed with a rare non-Hodgkin’s lymphoma. After a tumultuous couple of weeks of testing and second opinions it was clear that my prognosis was good. The months of treatment had me thinking about my luck; even though I had to live with cancer, I was fortunate to have a good prognosis. I found myself pondering the age-old question, “is there some reason for this?”

During this time, I was really starting to dive into data science. I began thinking about the impact of machine learning (ML) and artificial intelligence (AI) on cancer treatment and their effect on prognosis. Given my situation, I wondered if I should begin to focus on studying this area of data science. This article is the first step in doing just that.

On Friday, January 12, 2018, my doctor was happy to deliver the news that my PET scan was clear. This series of articles is a celebration and catharsis for me. In this first article I’m going to introduce deep learning.

Deep learning is being developed to help read radiology scans (PET/CT, MRI, X-ray, etc.) to improve early detection of cancer, to reduce false positives and unnecessary surgery in breast cancer screenings, and to enable near-real-time analysis of polyps during colorectal cancer screenings, among other ways of improving detection. I will be looking at these in subsequent posts.

What is Deep Learning?

Since most of the work relevant to cancer diagnosis I’ll be exploring depends heavily on deep learning, I decided that the series should start with a short introduction to some of the basics of deep learning.

Disclaimer: I’m not an expert on deep learning, but I’ve had the article reviewed for accuracy.

Deep learning is a subfield of machine learning, which is itself a subfield of artificial intelligence. It uses large neural networks to learn which features are important in data and how those features can be used to make predictions about new data [1, 2, 3].

What do we mean by (supervised) learning? Well, machine learning algorithms need three things to function: input data, examples of the expected output (training data), and a feedback signal to indicate how well the algorithm performs. Essentially a machine learning algorithm works to transform data in a useful way such that it produces meaningful output. For example, one of the papers I will look at takes a cell phone picture of a skin lesion and outputs a prediction of whether or not the lesion is malignant. Therefore learning in this context is an automatic search for better representations [4].

FIG 1: Diagram of a multi-layer feedforward artificial neural network. By Chrislb (derivative work: — HELLKNOWZ, TALK, enWP TALK ), CC BY-SA 3.0 via Wikimedia Commons

I said deep learning uses large neural networks, but what is a neural network? A neural network is loosely based on our understanding of the brain. Unlike the brain, where neurons can make a connection to any other neuron, neural networks are organized into discrete layers with a fixed direction of information flow [4, 5]. This is illustrated in FIG 1. Data is passed into the first layer of neurons and processed. The processed output is passed onward to another layer, and so on, until the final layer returns a result.

The Basis of Neural Networks — Neurons

Neural networks are composed of units called neurons, which are loose analogues of the neurons in the brain. We’ll look at two common implementations of neurons: the perceptron and the sigmoid neuron [1]. Historically, neural networks originated with the perceptron, which was developed in the 1950s and 1960s by Frank Rosenblatt [1, 2].

A perceptron essentially behaves like a logic gate [3, 4, 5, 6]. If you’re unfamiliar with logic gates, they are digital circuits that implement boolean logic [3, 4, 6]. They take input values and process them to determine whether certain conditions are met [FIG 2].

FIG 2: A schematic representation of a neuron. The inputs, X, are weighted (indicated by the varying width of the arrows) and processed by the neuron’s activation function producing the output, y.

A gate of particular importance is the NAND gate (a NOT-AND gate, whose output is 0 only when both inputs are 1). The NAND gate is universal in computing, which means any other boolean function can be represented as a combination of NAND gates [1, 7, 8]. An implication of this idea is that, since a perceptron can implement a NAND gate, networks of perceptrons are also universal in computation [1].
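To make the NAND idea concrete, here is a minimal sketch in Python. The function names and the specific weights (-2, -2) and bias (3) are illustrative assumptions; they are one choice that happens to reproduce NAND, not the only one. The `xor` function is built purely from NAND perceptrons to hint at what universality means.

```python
def perceptron(inputs, weights, bias):
    """Output 1 if the weighted sum of the inputs plus the bias is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def nand(a, b):
    # Assumed weights and bias: one possible setting that makes a perceptron act as NAND.
    return perceptron([a, b], weights=[-2, -2], bias=3)


def xor(a, b):
    # XOR composed only of NAND gates, as a small illustration of universality.
    c = nand(a, b)
    return nand(nand(a, c), nand(b, c))


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "NAND:", nand(a, b), "XOR:", xor(a, b))
```

Only the input pair (1, 1) drives the weighted sum below zero, so that is the only case where the NAND perceptron outputs 0.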

FIG 3: Illustration of the threshold function. This variation has a third component: if the weighted sum of the inputs is exactly equal to the threshold, that is, the net input is zero, the output is 0.5. By PAR~commonswiki, CC-BY-SA-3.0 via Wikimedia Commons

The two implementations of neurons differ in their activation/transfer functions (the equation that determines if the neuron is on or off) [1, 5]. The beauty of Rosenblatt’s perceptron is in its simplicity. The perceptron takes inputs, combines them according to their weights, and outputs a 1 if the inputs indicate the object being analyzed is a member of the class or a 0 if it is not [FIG 2]. The perceptron uses a threshold function that outputs 0 unless the sum of the weighted inputs is greater than the perceptron’s threshold, in which case it outputs 1 [FIG 3, FIG 4]. Equivalently, the threshold can be moved to the other side of the inequality as a bias (the negative of the threshold), so the perceptron outputs 1 when the weighted sum plus the bias is greater than zero.

The problem with a threshold function is that a small change in the weights or bias of a perceptron can flip its output completely, from 0 to 1 or back, which makes the behavior of the network hard to control [1]. This is because of the all-or-nothing nature of the activation function. It becomes a problem when trying to optimize a network of neurons.
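The all-or-nothing behavior is easy to see in a small sketch. The inputs, weights, and threshold below are hypothetical values chosen so that the weighted sum sits exactly on the threshold; nudging one weight by 0.001 is then enough to flip the output.

```python
def threshold_perceptron(inputs, weights, threshold):
    """Output 1 only when the weighted sum of the inputs exceeds the threshold."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0


inputs = [1.0, 1.0]   # hypothetical feature values
threshold = 1.0       # hypothetical threshold

# Weighted sum is exactly 1.0, which is not above the threshold.
before = threshold_perceptron(inputs, [0.5, 0.5], threshold)

# A tiny nudge to one weight pushes the sum just over the threshold.
after = threshold_perceptron(inputs, [0.5, 0.501], threshold)

print("before:", before, "after:", after)
```

A change of one part in a thousand moves the output across its whole range, which is what makes a network of such units hard to tune gradually.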

FIG 4: Vector representations of the activation functions of the two main types of neurons.

We can avoid this problem by using a sigmoid function [FIG 5] rather than the threshold function.

FIG 5: Illustration of the sigmoid function. Public Domain via Wikimedia Commons

The sigmoid function (also called the logistic function, as it is the particular form of sigmoid function used) has a flattened S shape [FIG 5]. This contrasts with the ‘on/off’ shape of the curve for the threshold function [FIG 3].

The sigmoid function produces a much smoother transition in the neuron’s output. Toward the edges of the graph, the neuron behaves like the perceptron, outputting approximately 0 or 1: the output approaches 1 (on) as the value of wx + b gets larger and 0 (off) as it gets smaller. The area around zero is where the output changes fastest. Because this region is smooth and continuous, we can make small changes in the weights and bias of the neuron without producing the large change in output that the threshold function could produce. In fact, for small changes, the change in a sigmoid neuron’s output is approximately a linear function of the changes in the weights and bias [1]. Another difference between these two functions is that the sigmoid function can output any value between 0 and 1, whereas the threshold function outputs only 0 or 1 [1]. This is useful for determining certain features but is less clear-cut when applied to classification [1]. One way to interpret the output of this single neuron is as the probability of belonging to the class in question.
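As a rough illustration, the sketch below implements a sigmoid neuron using the logistic function 1 / (1 + e^(-z)), where z = wx + b. The input values, weights, and bias are hypothetical and chosen so that z starts at zero, the steepest part of the curve; the same 0.001 nudge to a weight now moves the output only slightly.

```python
import math


def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, passed through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


inputs = [1.0, 1.0]   # hypothetical feature values
bias = -1.0           # hypothetical bias

before = sigmoid_neuron(inputs, [0.5, 0.5], bias)    # z = 0
after = sigmoid_neuron(inputs, [0.5, 0.501], bias)   # z is about 0.001

print("output before:", before)
print("change in output:", after - before)

# Far from zero the neuron saturates and behaves much like a perceptron.
print("large positive z:", sigmoid(10.0))
print("large negative z:", sigmoid(-10.0))
```

The small weight change produces a correspondingly small change in output, which is the property an optimizer relies on when it adjusts weights a little at a time.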

Combining Neurons into Networks

Single neurons are useful for making a single decision but how can we extend their usefulness? In FIG 1, the network shows that a number of neurons can be combined to make up a layer of neurons. Just looking at the input layer in the figure, we can see it is made up of three neurons and so it is capable of making three very simple decisions (although since there are two output neurons the network itself will only make two classifications). We can scale the number of simple decisions by increasing the number of neurons in the layer. The output of the layer is a representation of the data [1, 2]. Ideally, the representation provides some insight into the problem being solved [FIG 6].
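A layer can be sketched as a list of neurons that all look at the same inputs, each with its own weights and bias. The example below assumes sigmoid neurons and uses made-up numbers for a layer of three neurons; the names `layer`, `weights`, and `biases` are illustrative only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """Apply every neuron in the layer to the same inputs.

    Each row of `weights` holds the weights of one neuron, so the layer
    returns one output value per neuron.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


x = [0.2, 0.7, 0.1]            # hypothetical input values
weights = [                    # hypothetical weights, one row per neuron
    [0.5, -0.3, 0.8],
    [-0.6, 0.1, 0.4],
    [0.9, 0.2, -0.5],
]
biases = [0.0, 0.1, -0.2]      # hypothetical biases, one per neuron

representation = layer(x, weights, biases)
print(representation)
```

The list returned by `layer` is the layer's representation of the input: three numbers, one simple decision per neuron.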

FIG 6: Example mean activation shapes in a learned hierarchy. These are the representations of the data that are learned by the network. ViCoS Lab

The layers of neurons can be stacked to further process the data. One can think of each successive layer as taking the output of the previous layer and further distilling the information contained in the data [2]. A multi-layer network is capable of abstracting the data enough to make sophisticated decisions [1].

FIG 7: A schematic representation of a multi-layer neural network. The input layer takes data from X, processes it, and passes it to the first hidden layer. Each hidden layer processes the data and passes it to the next. The output layer makes a prediction <y> that is an array of probabilities of the input belonging to each of the possible classes. The result is checked by the loss function to determine the network’s accuracy, and this is passed to the optimizer, which adjusts the weights of the information passing to each layer.

An example of a multi-layer network is shown in FIG 7. Some input data is transformed by the first layer of the neural network (the input layer). The transformed data is then processed by a second layer (called a hidden layer because it is neither an input nor an output) [1]. This output will either go to the next hidden layer or to the output layer. The output layer will return an array of probability scores (which sum to 1) [2].
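The flow through a small multi-layer network might look like the sketch below. Two things here are assumptions rather than anything stated above: the hidden layer uses sigmoid neurons, and the output layer uses a softmax function, which is one common way to turn raw scores into probabilities that sum to 1. All weights and inputs are made-up numbers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weighted_sums(inputs, weights, biases):
    """Weighted sum plus bias for every neuron in a layer (no activation yet)."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 (assumed output activation)."""
    largest = max(scores)  # subtracting the max keeps the exponentials well behaved
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Pass the data through one hidden layer and then the output layer."""
    hidden = [sigmoid(z) for z in weighted_sums(x, hidden_w, hidden_b)]
    return softmax(weighted_sums(hidden, out_w, out_b))


x = [0.2, 0.7, 0.1]                                   # hypothetical input
hidden_w = [[0.5, -0.3, 0.8], [-0.6, 0.1, 0.4]]       # two hidden neurons
hidden_b = [0.0, 0.1]
out_w = [[1.2, -0.7], [-0.4, 0.9]]                    # two output classes
out_b = [0.05, -0.05]

probabilities = forward(x, hidden_w, hidden_b, out_w, out_b)
print(probabilities, "sum:", sum(probabilities))
```

Adding more hidden layers would mean repeating the `hidden = ...` step, feeding each layer's output into the next.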

The output of the network is analyzed by a loss function (the feedback signal), which measures how far the prediction is from the expected output [2]. An optimizer then uses this signal to adjust the network’s weights [2].
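As one hedged example of a loss function, the sketch below uses cross-entropy, a common choice for classification; the text above does not say which loss is used, so treat this as an assumption. The probability arrays are invented for illustration.

```python
import math


def cross_entropy(probabilities, true_class):
    """Loss is small when the network assigns high probability to the true class."""
    return -math.log(probabilities[true_class])


# Hypothetical predictions for an example whose true class is index 0.
confident_and_right = cross_entropy([0.9, 0.1], true_class=0)
confident_and_wrong = cross_entropy([0.2, 0.8], true_class=0)

print("loss when mostly right:", confident_and_right)
print("loss when mostly wrong:", confident_and_wrong)
```

The worse the prediction, the larger the loss, and it is this number that the optimizer tries to push down.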

FIG 8: A schematic of the information flow during optimization. The data loss shows the difference between the scores and the labels. The regularization loss is only dependent on the weights. Gradient descent allows us to find the slope of the loss with respect to the weights and update them. Optimization: Stochastic Gradient Descent.

The optimizer uses gradient descent to determine how to update the weights. Essentially, the algorithm follows the slope of the surface of the loss function downhill until a valley is found [3, 4, FIG 8]. The slope itself is computed through the process of backpropagation: “Backpropagation can […] be thought of as gates communicating to each other (through the gradient signal) whether they want their outputs to increase or decrease (and how strongly), so as to make the final output value higher” [5]. Backpropagation provides detailed information about how changing the weights and biases will affect the entire network [6].
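Here is a minimal sketch of gradient descent on the simplest possible network: a single sigmoid neuron. Everything specific in it is an assumption for illustration: the toy dataset (the logical OR of two inputs), the cross-entropy loss, the learning rate, and the number of steps. With only one neuron, backpropagation reduces to a single application of the chain rule; a real multi-layer network repeats that step layer by layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Toy dataset (assumed for illustration): inputs and the OR of those inputs.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]


def predict(x, weights, bias):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def mean_loss(weights, bias):
    """Average cross-entropy loss over the dataset."""
    total = 0.0
    for x, y in data:
        p = predict(x, weights, bias)
        total += -math.log(p) if y == 1.0 else -math.log(1.0 - p)
    return total / len(data)


def gradients(weights, bias):
    """Slope of the mean loss with respect to each weight and the bias."""
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for x, y in data:
        # Chain rule: for a sigmoid output with cross-entropy loss,
        # the derivative of the loss with respect to z is (prediction - label).
        error = predict(x, weights, bias) - y
        for i, xi in enumerate(x):
            grad_w[i] += error * xi / len(data)
        grad_b += error / len(data)
    return grad_w, grad_b


weights, bias = [0.0, 0.0], 0.0
learning_rate = 1.0   # assumed step size
initial_loss = mean_loss(weights, bias)

for _ in range(5000):
    grad_w, grad_b = gradients(weights, bias)
    # Step downhill: move each parameter against its slope.
    weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
    bias -= learning_rate * grad_b

final_loss = mean_loss(weights, bias)
print("loss before:", initial_loss, "loss after:", final_loss)
print("predictions:", [round(predict(x, weights, bias)) for x, _ in data])
```

Each pass computes the slope at the current position and takes a small step downhill, so the loss shrinks as the neuron's predictions move toward the labels.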

FIG 9: Detailed activation of neurons in five layers. The activations show the representations that are deemed important by the network. Zeiler and Fergus, 2013.

Wrap Up

In summary:

Deep learning plays an important role in cancer detection and biopsy analysis, and it opens the possibility of increasing screening in regions where it is hard to find a doctor

The basic building block of a neural network is a neuron

A neuron can use different activation functions, each of which has its own benefits and drawbacks

Neurons can be stacked into layers and the layers can be stacked together adding to the sophistication of the decisions that can be made by the network

These decisions provide new representations of the data [FIG 6, FIG 9] that hopefully provide insight

The weights and biases of the neurons can be optimized using the loss function and backpropagation to improve the accuracy of the network

Thank you to Irhum Shafkat and John Bjorn Nelson for reading my drafts and making great suggestions and ensuring accuracy. I’d also like to thank my wife for proof-reading and making suggestions to improve the flow and clarity of this article. A special thank you to my friends whose varying levels of technical familiarity allowed me to be sure this was accessible to a wide audience.

I hope you enjoyed this introduction to deep learning and will come back to read the next article in the series. Follow me to get notifications of new posts; leave a comment to let me know what you thought; send a tweet with feedback or to start a conversation; or check out my portfolio of projects.