by Vinay Rao Dandin and Avijit Chatterjee

Today, many industries are using machine learning to empower their businesses and solve unique challenges. And it’s not just businesses. The medical field is using machine learning to solve problems for the general betterment of humanity. Let’s walk through a medical use case on an infectious disease, sepsis, where we’ll use an unsupervised learning technique to discover interesting insights.

Sepsis is a disease caused by the body’s extreme response to an infection. It has a high mortality rate when diagnosed late. There’s no miracle treatment for this disease, but when it is diagnosed in time, doctors can administer intravenous antibiotics to fight the infection. Research on sepsis is active, and infectious disease specialists regularly review published literature to keep up with early diagnosis techniques and new interventions.

Getting started

We worked with medical experts in this field to define the search criteria, and then looked at a collection of 10,000 abstracts from PubMed, the popular online repository for medical literature data. After gathering and preparing the data containing medical publications, we then created a Python notebook for text analytics using an open source package called Gensim.

We did a case-insensitive match of medical terms in the abstracts to their canonical form, mapping every synonym occurrence to a single term. For this we used the 2017 MeSH (Medical Subject Headings) taxonomy, a hierarchical organization of medical terms created, maintained, and provided by the US National Library of Medicine. To apply machine learning algorithms to text data, we then had to convert the abstracts to numeric representations. We used a bag-of-words approach: every term is a feature, and each abstract becomes a feature vector whose values are the counts of those terms in the abstract (term frequency).
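The two steps in this paragraph can be sketched in plain Python. This is a minimal illustration rather than the notebook code from the study: the `SYNONYMS` table is a tiny hypothetical stand-in for the MeSH taxonomy, and the two abstracts are invented.

```python
import re
from collections import Counter

# Hypothetical synonym table (lowercase surface form -> canonical term).
# The study used the 2017 MeSH taxonomy, which is far larger.
SYNONYMS = {
    "septic shock": "septic shock",
    "endotoxic shock": "septic shock",
    "newborn": "newborn infant",
    "neonate": "newborn infant",
    "hmgb1 protein": "hmgb1 protein",
    "hmgb1": "hmgb1 protein",
}

# Longest surface forms first, so "hmgb1 protein" wins over "hmgb1".
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(s) for s in sorted(SYNONYMS, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)

def canonical_terms(abstract):
    """Return the canonical term for each synonym occurrence, case-insensitively."""
    return [SYNONYMS[m.group(0).lower()] for m in PATTERN.finditer(abstract)]

# Invented abstracts for illustration only.
abstracts = [
    "Septic shock in the neonate: HMGB1 levels rise late. Endotoxic shock was also observed.",
    "Newborn outcomes after septic shock.",
]

term_counts = [Counter(canonical_terms(a)) for a in abstracts]

# One feature per canonical term; the value is the term frequency in the abstract.
vocabulary = sorted({term for counts in term_counts for term in counts})
vectors = [[counts[term] for term in vocabulary] for counts in term_counts]

print(vocabulary)
print(vectors)
```

In the first abstract, "Septic shock" and "Endotoxic shock" both count toward the same feature, which is the point of mapping synonyms before counting.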

Creating LDA Clusters

Using the LDA (Latent Dirichlet Allocation) technique, we discovered eight topic clusters in the collection. To explore the clusters and their salient terms further, we put together a visualization with the pyLDAvis package. See Figure 1.

Figure 1: Topic clusters visualized using pyLDAvis, showing salient terms across the collection

The interactive visualization allowed us to explore the salient terms that define each cluster. The clusters on the left show the bag-of-words-based feature space reduced to two dimensions by PCA. The size of each cluster is proportional to the number of abstracts it contains, and a clear separation can be seen between the clusters. The right side shows the top 30 salient terms across the whole collection (such as mice, newborn infant, and septic shock), with their respective frequencies displayed as a bar chart.
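For readers wondering what "salient" means here: as far as we can tell from the notes shown in the pyLDAvis display, a term's saliency is its overall probability weighted by how unevenly it is spread across topics. With $P(w)$ the overall probability of term $w$, $P(t)$ the overall probability of topic $t$, and $P(t \mid w)$ the probability of topic $t$ given the term, it can be written as:

```latex
\mathrm{saliency}(w) = P(w) \sum_{t} P(t \mid w) \, \log \frac{P(t \mid w)}{P(t)}
```

A term that is frequent and concentrated in one or two topics scores high. A term that is frequent but spread evenly across all topics has $P(t \mid w) \approx P(t)$, so the sum, and therefore the saliency, stays close to zero.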

Careful analysis of the topic clusters exposed a few interesting insights. In cluster 7 (on the far left of the PC1 axis), one term that popped up was “hmgb1 protein”. Other terms in the cluster were “mice”, “lung”, “inflammation” and “cytokines” (Figure 2). The terms are related to one another through the published literature, so the cluster makes sense: several studies have been done on this protein and its links to sepsis. High-mobility group box 1 (HMGB1) protein is a cytokine responsible for mediating inflammation, and its levels increase during the late stages of sepsis [1]. It can cause acute lung injury because the lungs are the most susceptible organ during a sepsis attack, and one in two patients is affected [2]. A growing list of HMGB1-inhibiting interventions for sepsis is available today, such as intravenous immunoglobulin (IVIG), anti-coagulant agents, endogenous hormones, and small molecules [3].

Figure 2: Cluster 7 selected with terms like ‘hmgb1 protein’, ‘lung’ and ‘inflammation’.

We noticed that terms like “hydroxymethylglutaryl-coa reductase inhibitors” (statins) and “acute kidney injury” appear in cluster 8. Statins are used in the prevention and treatment of severe sepsis [4]. Experiments have shown that statins such as simvastatin can help improve survival rates in sepsis and acute kidney injury [5].

Using LDA clustering, we found two topics that turned out to be research areas actively being studied for the treatment of sepsis.

Figure 3: Cluster 8 selected with “hydroxymethylglutaryl-coa reductase inhibitors”

Using word embedding

The bag-of-words model ignores the context of a word, but an emerging approach called the word embedding method captures word context. The word embedding method represents each word as a vector in a low-dimensional space (for example, 300 dimensions) instead of the tens of thousands of dimensions required by the bag-of-words model. These low-dimensional vectors have been found to group together contextually similar words and are used in many complex NLP tasks. This seminal paper from Mikolov et al. explains the popular word embedding or word2vec model.
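To make "group together contextually similar words" concrete, here is a small sketch of how closeness between word vectors is usually measured, using cosine similarity. The three-dimensional vectors are made up for the example and the helper names are hypothetical. Real embeddings would have a few hundred dimensions and come from a trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-d vectors for illustration; NOT trained embeddings.
vectors = {
    "sepsis": [0.9, 0.8, 0.1],
    "septicemia": [0.85, 0.75, 0.2],
    "fracture": [0.1, 0.2, 0.9],
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, best first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

neighbors = most_similar("sepsis", vectors)
print(neighbors)
```

With these toy values, "septicemia" ranks well ahead of "fracture" as a neighbor of "sepsis", which is the kind of grouping a trained embedding is expected to show.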

To train a word2vec model, we needed a larger collection of documents. Our original collection of 10K abstracts on sepsis was not going to be enough. So, working with experts in the field, we enhanced our search term to crawl a broader collection of 170K PubMed abstracts on infectious disease. We trained a word2vec model using this new collection in a Python notebook. We used a totally unsupervised approach on unigrams (single words) and hence did not map the terms to the MeSH ontology this time. We then exported the word vectors for visualization in TensorBoard, and within TensorBoard we applied PCA and then generated the visualization.

Helpfully, TensorBoard allowed us to project the word vectors onto a custom axis representing a synthetic dimension. We projected the word vectors trained on our collection onto an x-axis running from “recovery” to “death”, using a random vector for the y-axis (Figure 4). The word vectors were now aligned to the x-axis such that words on the right side had more affinity with “death” and those on the left were more aligned with “recovery”. As expected, the word “remissions” appeared on the far left because it’s related to both “recovery” and “illness”. In other words, remission can be a precursor to either recovery or death. Our term of interest, “sepsis”, was on the right side of the x-axis. The terms closest to “sepsis” in the original space (by cosine distance) were on the right side as well. This finding aligns with the observation that sepsis has a high mortality rate.
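The idea behind the custom axis can be sketched without TensorBoard: take the direction from the "recovery" vector to the "death" vector and measure how far each word lies along it. This is an illustration of the concept under simple assumptions, not a reproduction of TensorBoard's internal computation. The two-dimensional vectors are made up and `axis_scores` is a hypothetical helper.

```python
import math

# Made-up 2-d vectors for illustration; these are NOT the trained embeddings.
vectors = {
    "recovery": [-1.0, 0.2],
    "death": [1.0, 0.1],
    "sepsis": [0.6, 0.5],
    "remission": [-0.7, 0.4],
}

def axis_scores(vectors, left_word, right_word):
    """Project every vector onto the unit direction from left_word to right_word.

    Larger scores lie closer to right_word's end of the axis.
    """
    left, right = vectors[left_word], vectors[right_word]
    axis = [r - l for l, r in zip(left, right)]
    length = math.sqrt(sum(a * a for a in axis))
    unit = [a / length for a in axis]
    return {
        word: sum(x * u for x, u in zip(vec, unit))
        for word, vec in vectors.items()
    }

scores = axis_scores(vectors, "recovery", "death")

# Print words from the "recovery" end of the axis to the "death" end.
for word, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{word:10s} {score:+.3f}")
```

With these toy values, "remission" lands on the "recovery" side and "sepsis" on the "death" side, mirroring the layout described for Figure 4.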

Figure 4: word2vec visualization in TensorBoard with the vectors projected onto a custom axis of ‘death’ — ‘recovery’

You can watch a demo of the above analysis in this YouTube video. We are extremely grateful to Dr. Shravan Kethireddy and Dr. Vida Abedi from Geisinger for their kind advice and input on this study.

In summary, machine learning on text data such as medical literature is invaluable in the medical field for discovering interesting insights and new relations between medical terms. Single-user tools such as IBM DSX Desktop provide an enriched download-and-go experience and a familiar notebook environment for data scientists, supporting three popular languages: Python, R, and Scala. Very soon, data scientists will be able to save models built using DSX Desktop directly into IBM DSX Local to create a scoring endpoint and to manage the end-to-end lifecycle of the model.