Part 1 Overview:

In this post, I plan on: 1. explaining what decision trees and random forests are and what they have to do with machine learning, 2. explaining the metric used to judge how good a decision tree is, and 3. implementing a program that uses a decision tree myself.

Part 2 Decision Trees:

A decision tree is a decision support tool built entirely out of conditional control statements that branch from an upper node. Modeled graphically, it looks like a branching flow chart. Fundamentally, a decision tree is a type of classification program whose goal is to sort data into different labels such as “spam” or “not spam”.

Qualitatively, a decision tree works by chaining many conditional statements that help the program decide what to do or suggest. Decision trees are useful because they are simple to understand and can be built from relatively little data. On the other hand, even small changes in the data can completely change the structure of the tree. Calculations for decision trees can also get complex, especially when outcomes are not independent of each other. Fundamentally, decision trees are Boolean/binary in nature, and they work best in scenarios that fit that mold.

An example of a spam filter is shown below, where the data has 20 emails, 10 of which are spam and 10 of which are not. The goal is to ask a question such that the resulting branches help us better predict whether an email is spam. The question posed here is: “Does the title include a “$” sign?” In this case, 10 of the emails did include a “$” sign and the other 10 did not. Of the 10 emails that did include a “$” sign, 8 were spam and the other 2 were not. Of the emails that did not have a “$” sign, 3 were spam and 7 were not. Intuitively, it becomes apparent that asking whether the title has a “$” sign gives us a better sense of whether an email is spam or not (in the next section I will explain how we can mathematically describe whether a question helps us).

Figure 1: Example of a decision tree

Part 3 How to determine how good your question is in a Decision Tree:

It is often fairly easy for humans to tell whether a question yields useful information, as in the example above. However, translating this into a mathematical model that a computer can evaluate is tricky. Luckily, computer scientists have come up with a way of doing this; it’s called information entropy.

Conceptually, one should imagine entropy as a quantity measuring the amount of disorder in a system. The higher the entropy value, the more disorder there is in your system, and the less information you have. The goal is to ask a question that reduces this entropy as much as possible. For the example given, you can imagine that the entropy of the starting position is pretty high, because out of the 20 emails you have an even split of 10 spam and 10 non-spam emails. This means that if you were to guess randomly, your chance of predicting whether a mail is spam or not is 50/50.

By contrast, if the 20 emails were split so that 18 were spam and 2 were not, guessing “spam” would give you an 18/20 chance of being correct. The entropy of such a system should therefore be lower than in the 50/50 split case!

Another property this entropy should have is that when you have perfect information (say, all 20 emails are spam), the entropy value is 0; there is no disorder left to remove!

The final property this entropy should have is that a spam/not-spam ratio of 11/9 should give the same value as a ratio of 9/11. While you now know which side you would want to guess, you are no more confident than you would be if the ratio were inverted!

It turns out this idea can be encapsulated in the simple equation:

S = −Σᵢ pᵢ · log₂(pᵢ)

Equation 1: Mathematical definition of information entropy

Here, S is the entropy, and pᵢ is the probability of outcome i. So in the case that the spam/not-spam ratio is 1 (10/10), the entropy is calculated as:

S = −(10/20 · log₂(10/20) + 10/20 · log₂(10/20)) = 1

In this case, entropy turns out to be a big fat 1, or equivalently you literally have the most disorder possible in the system!

In the case that the spam/not-spam ratio is 11/9 we would get:

S = −(11/20 · log₂(11/20) + 9/20 · log₂(9/20))

If you calculate this value you get an entropy of S = 0.993!

If you were to do a spam/not-spam ratio of 9/11:

S = −(9/20 · log₂(9/20) + 11/20 · log₂(11/20))

Even without calculating, you can see that the entropy is S = 0.993, the same as for the 11/9 ratio!

The last case we have to check is the 20/0 ratio, the ideal case. In this case you get:

S = −(20/20 · log₂(20/20) + 0/20 · log₂(0/20)) = −(1 · 0 + 0) = 0

(using the convention that 0 · log₂(0) = 0). And as you can see, in this case the entropy value is S = 0!
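The four cases above can be checked with a short Python function. This is just a sketch using only the standard library; the `entropy` helper name is my own:

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a label distribution given as raw counts."""
    total = sum(counts)
    s = 0.0
    for c in counts:
        p = c / total
        if p > 0:  # convention: 0 * log2(0) = 0
            s -= p * log2(p)
    return s

print(round(entropy([10, 10]), 3))  # 1.0   (maximum disorder)
print(round(entropy([11, 9]), 3))   # 0.993
print(round(entropy([9, 11]), 3))   # 0.993 (symmetric, as promised)
print(round(entropy([20, 0]), 3))   # 0.0   (perfectly sorted)
```

Note the `if p > 0` guard: it implements the 0 · log₂(0) = 0 convention, since `log2(0)` would otherwise raise an error.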

So now that we have defined entropy, a measure of the disorder in a given state, our next goal is to find the question that maximizes our information gain, i.e. reduces the entropy as much as possible!

To do this, we must first explain how the entropy value is recalculated after a decision has been made. Look at the image below for an example of what one decision might lead to (and yes, it is the same as the one above):

Figure 1: Same example of decision tree

In this case, we already know that the initial entropy value is 1, because we already calculated it.

The next thing we do is separately calculate the entropy values of each of the outcomes. For the YES case we have:

S_yes = −(8/10 · log₂(8/10) + 2/10 · log₂(2/10))

This gives us a value of S_yes = 0.72.

For the NO case we have:

S_no = −(3/10 · log₂(3/10) + 7/10 · log₂(7/10))

This gives us a value of S_no = 0.88.

But now we have to find the total entropy of the second state, and we can’t simply add the two values: that would give us a number greater than 1 and imply that we actually lost information, when we know the opposite happened. Instead, we weight each branch’s entropy by the fraction of emails the question sent down that branch. In this case, asking whether there is a “$” in the title split the emails into 10 yes and 10 no. So equivalently:

S_after = (10/20) · 0.72 + (10/20) · 0.88 = 0.80

This gives us an entropy value of 0.8 and shows that the entropy decreased by 0.2 from the original state!
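The weighted average above can be reproduced in a few lines of Python. This is a sketch; `entropy` is the same helper as before, redefined here so the snippet stands alone:

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a label distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

s_initial = entropy([10, 10])   # 1.0, the 50/50 starting state
s_yes = entropy([8, 2])         # ~0.72, emails whose title has a "$"
s_no = entropy([3, 7])          # ~0.88, emails whose title does not

# Weight each branch by the fraction of emails that fell into it (10/20 each)
s_after = (10 / 20) * s_yes + (10 / 20) * s_no
info_gain = s_initial - s_after

print(round(s_after, 2), round(info_gain, 2))  # 0.8 0.2
```

The information gain of 0.2 is exactly the entropy drop computed by hand above, and it is the quantity the tree-building algorithm tries to maximize at each step.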

Now, the role of a decision tree program is literally to go through every feature/possibility, ask a yes-or-no question, look at the resulting state, and see how much the entropy decreased. The machine then chooses the question that produced the most information gain. It uses the state resulting from that question as its next starting state and repeats this for however many decisions the user wants.

The problem is that, while having many decisions results in better separation, overfitting may occur when the questions get too specific, like: “Was there a $ sign after the letter “e” in the 5th line of the text?” While that question may better separate the emails in the training set, it is unlikely to generalize to new emails.

To remedy this, people have done things like saying:

1. Don’t have decisions longer than 10 layers in length; in this case a layer means the number of consecutive branching questions

2. Use a random forest ensemble model: create “x” decision trees, each trained on a different resampling of the training data, and use the average of their results as our confidence

N.B. Aside from the entropy-gain method, there is something called the Gini index, an alternative impurity measure that plays the same role as entropy and is the default criterion in many implementations (including scikit-learn).
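For the curious, Gini impurity is defined as 1 − Σᵢ pᵢ², and behaves much like entropy: 0 for a pure node, maximal for an even split. A minimal sketch (the `gini` helper name is my own):

```python
def gini(counts):
    """Gini impurity: 1 - sum(p_i^2). 0 for a pure node, 0.5 for a 50/50 binary split."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([10, 10]))           # 0.5  (maximum impurity for two classes)
print(round(gini([8, 2]), 2))   # 0.32 (the "$"-sign YES branch)
print(gini([20, 0]))            # 0.0  (pure node)
```

The split-selection logic is identical to the entropy version: weight each branch’s impurity by its size and pick the question with the lowest weighted result. Gini is often preferred in practice because it avoids computing logarithms.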

Part 4 Implementing a Decision Tree:

For this machine learning program, I plan on making my algorithm classify whether a chemical seems likely to target either Cytochrome P450 3A4 or multidrug resistance protein 1.

To do this, I go onto DrugBank to get the dataset I plan on working with.

Part 1: Getting the data

I simply went to https://www.drugbank.ca/unearth/advanced/drug, set the search condition to target with the target being Cytochrome P450 3A4, and asked the display fields to include the name and SMILES string. The reason I want the SMILES string is that it encodes the 2D structural information of the compound, which I plan on later using to derive features about the structural properties of the chemical.

I did the same as above, but for multidrug resistance protein 1. I will upload both onto my GitHub page here: https://github.com/Tacosushi/DecisionTree_1

Part 2: Cleaning the data

Now my next goal is to make sure that the two datasets have no drugs in common, and that none of the fields are left blank.

I plan on doing this using Python:

import pandas as pd

# Here I read from the csv files I downloaded from DrugBank
df = pd.read_csv('cytochrome.csv')
df2 = pd.read_csv('multidrug.csv')

# Here I combine the dataframes together so it's easier to work with
frames = [df, df2]
df3 = pd.concat(frames)

# I drop any of the rows with NaN values
df3 = df3.dropna()

# Drop any duplicate rows
df3 = df3.drop_duplicates(subset=['Name'])

# Save the file into a csv
df3.to_csv('cleandata.csv')

Part 3: Add structural data of drug

To do this, I first need to install a library called RDKit. Because I have a Mac, I do this by going into my terminal and running:

conda install -c rdkit rdkit

With that, I have RDKit installed. Pretty easy, right? If you are on Linux, it’s just as easy, but if you are on Windows you might suffer a bit trying to get it. Anyways.

The next thing I did was use RDKit to get more data about the physical structure of each chemical. I counted the number of carboxylic acid groups and of nitrogen, sulfur, phosphorus, carbon, and oxygen atoms. I also wanted the length of the molecule, but RDKit wouldn’t let me do it effectively (you can try if you have more time), so I used the length of the SMILES string as a pseudo-indicator of length. I also used the number of @ and = signs as pseudo-indicators of length and aspect ratio. The program is kind of janky, but I’m just trying to get a working prototype.

import pandas as pd
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

# Read the data in
df = pd.read_csv('cleandata.csv')

# The features I want to include are here:
numcarboxyl = [0]*len(df)
numnitrogen = [0]*len(df)
numcarbon = [0]*len(df)
numphospho = [0]*len(df)
numsulf = [0]*len(df)
numoxygen = [0]*len(df)
numhydroxyl = [0]*len(df)
numhalogen = [0]*len(df)
numalkyl = [0]*len(df)
molwt = [0]*len(df)
numRing = [0]*len(df)
lengg = [0]*len(df)
equal = [0]*len(df)
atsign = [0]*len(df)
target = [0]*len(df)

for j in range(len(df)):
    smiles = df['SMILES'][j]
    m = Chem.MolFromSmiles(smiles)

    # This target value simply tells me if it's a positive or negative result
    target[j] = df['Target'][j]

    # Here I check what type of functional group exists, and whether an = or @
    # sign appears in the SMILES string of the compound.
    # I plan on using the = and @ signs as a pseudo measure of how fat the molecule is.
    functionalgroup = "[CX3](=O)[OX1H0-,OX2H1]"  # SMARTS pattern for carboxylic acid
    smarts = Chem.MolFromSmarts(functionalgroup)
    functional_amount = m.GetSubstructMatches(smarts, uniquify=True)
    numcarboxyl[j] = len(functional_amount)

    numnitrogen[j] = smiles.count('N')
    numcarbon[j] = smiles.count('C')
    numphospho[j] = smiles.count('P')
    numsulf[j] = smiles.count('S')
    numoxygen[j] = smiles.count('O')
    lengg[j] = len(smiles)
    equal[j] = smiles.count('=')
    atsign[j] = smiles.count('@')

    functionalgroup = "[OX2H]"  # SMARTS pattern for hydroxyl group
    smarts = Chem.MolFromSmarts(functionalgroup)
    functional_amount = m.GetSubstructMatches(smarts, uniquify=True)
    numhydroxyl[j] = len(functional_amount)

    functionalgroup = "[F,Cl,Br,I]"  # SMARTS pattern for halogens
    smarts = Chem.MolFromSmarts(functionalgroup)
    functional_amount = m.GetSubstructMatches(smarts, uniquify=True)
    numhalogen[j] = len(functional_amount)

    functionalgroup = "[CX4]"  # SMARTS pattern for alkyl carbons
    smarts = Chem.MolFromSmarts(functionalgroup)
    functional_amount = m.GetSubstructMatches(smarts, uniquify=True)
    numalkyl[j] = len(functional_amount)

    # These are two extra features: one is the MW of the compound, and the other
    # has to do with the number of rings in the compound
    molwt[j] = Descriptors.MolWt(m)
    numRing[j] = Descriptors.RingCount(m)

# Here I create my dataframe and save it to an external csv file
d = {'Target': target, 'Carboxyl': numcarboxyl, 'Nitrogen': numnitrogen,
     'Carbon': numcarbon, 'Phosphorous': numphospho, 'Sulfur': numsulf,
     'Oxygen': numoxygen, 'Hydroxyl': numhydroxyl, 'Halogen': numhalogen,
     'Alkyl': numalkyl, 'MW': molwt, 'Length': lengg, 'Equal': equal,
     'AtSign': atsign}
df2 = pd.DataFrame(data=d)
df2.to_csv('finalData.csv')

Part 4: Creating a decision tree using SciKitlearn

Scikit-learn is a library that makes it quick to implement machine learning techniques, and you can install it easily using pip or conda (Anaconda). With pip you do:

pip install -U scikit-learn

or from Conda you do:

conda install scikit-learn

Anyways, while I could go into making my own decision tree from scratch, instead I use scikit-learn, which has all of the functionality I want, such as the Gini index (another method of tracking information gain).

Also, I recommend getting something called Graphviz (which you can install with conda using conda install graphviz, or with Homebrew using brew install graphviz). This will allow you to actually see what the tree looks like.

I also plan on retrofitting my code according to this great person’s post, if you want to follow along: http://benalexkeen.com/decision-tree-classifier-in-python-using-scikit-learn/

import pandas as pd
import numpy as np

# We import the function below so we can split the data into train and test sets
from sklearn.model_selection import train_test_split

# We import tree because we are making a decision tree
from sklearn import tree

# We import accuracy to see how well we do
from sklearn.metrics import accuracy_score

# The confusion matrix will tell us how we misplaced our values
from sklearn.metrics import confusion_matrix

# Read the data in
df = pd.read_csv('finalData.csv')

# Reorder the df into a more easily read df
df = df[['Alkyl', 'AtSign', 'Carbon', 'Carboxyl', 'Equal', 'Halogen',
         'Hydroxyl', 'Length', 'MW', 'Nitrogen', 'Oxygen', 'Phosphorous',
         'Sulfur', 'Target']]

# Get all the features into one df, and the classification into another
X = df.drop('Target', axis=1)
y = df['Target']

# Split the data into a test and train set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# State what type of machine learning model we will use
model = tree.DecisionTreeClassifier()

# Next use the model, and use the training data to train it
model.fit(X_train, y_train)

# Let's now see how well our model did (pretty simple right?)
y_predict = model.predict(X_test)
accuracy = accuracy_score(y_test, y_predict)
print(accuracy)

# Here we actually use the confusion matrix to see what we did right and wrong
blah = pd.DataFrame(
    confusion_matrix(y_test, y_predict),
    columns=['Predicted Not target', 'Predicted target'],
    index=['True Not target', 'True target'])

# The following allows us to visualize the tree
tree.export_graphviz(model, out_file='tree.dot')
# We later have to go to the command line and type in:
# dot -Tpng tree.dot -o tree.png

Anyways with that all set up, I run my script to train my decision tree.

I get an accuracy of around 63%, and my confusion matrix looks like the following. Not so good.

However, with that you have now implemented your first decision tree!

Part 5: How does the code work

While you now know how to implement your own decision tree, you may be confused as to what exactly scikit-learn did under the hood.

The first thing we did in the program was to import some of the data. Below is a partial example of what the dataset includes:

Table 1: Partial example of what the imported dataset looks like

In the example, the first 6 columns are “features” of a given drug, and the last column tells us whether the drug targets Cytochrome P450 3A4 or the multidrug resistance protein. If it’s 1, it’s the former; if it’s 0, it’s the latter.

Scikit-learn basically goes through each feature column, and each row within each column, and tries that value as a cutoff. For example, it first uses the Length feature with a value of 37 (first row, first column) and sorts the drugs into two piles depending on whether the length of the molecule is greater than 37 or not. It then calculates the Gini index (its measure of information gain) for that decision and saves the value. Next it takes the second row, first column value (a length of 35) and uses that as the cutoff: if a drug’s length is greater than 35 it goes into the yes bin, otherwise into the no bin, and the Gini index is recalculated. The program repeats this for every row and column, so “n” features times “m” drugs. It then chooses the decision question that resulted in the most information gain (the best Gini index) and uses the resulting state to build its next branch. The program repeats all of the above as many times as the user specifies. For my example I did not set a cutoff, so the program kept making decisions until it could not split any further.
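A toy version of that exhaustive search might look like the sketch below. It is a simplification: every observed value is tried as a threshold (scikit-learn actually tries midpoints between sorted values, among other optimizations), and the rows and labels here are made up for illustration:

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    p1 = sum(labels) / total          # fraction of class-1 rows
    return 1.0 - p1 ** 2 - (1 - p1) ** 2

def best_split(rows, labels):
    """Try every (feature, threshold) pair; return the one with lowest weighted Gini."""
    best = None
    n_features = len(rows[0])
    for f in range(n_features):
        for threshold in {row[f] for row in rows}:
            left = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or weighted < best[0]:
                best = (weighted, f, threshold)
    return best

# Hypothetical rows: (Length, MW), with target 1 = Cytochrome, 0 = multidrug
rows = [(37, 300), (35, 410), (60, 550), (72, 610)]
labels = [0, 0, 1, 1]
print(best_split(rows, labels))  # (weighted_gini, feature_index, threshold)
```

On this tiny dataset a perfect split exists (weighted Gini of 0.0), and the search finds it. A real tree builder then recurses into each resulting pile until a stopping condition is met.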

Part 6: Final comments

It is important to note that while my ML script did create a decision tree with 63% accuracy, there are many issues with it. They are:

1. Overfitting due to not specifying the max depth of branches (the decisions become really, really specific); I can resolve this by either specifying the max depth, and/or by using a random forest, an ensemble of decision trees each trained on a different resampling of the training data

2. Not having features that are useful. I did this to get a prototype working, but a lot of the values I used were only pseudo-representative of the physical features of the molecule. This can be resolved by using the DeepChem or RDKit libraries to compute the “fingerprints” you desire. One that I recommend is the circular (Morgan) fingerprint; it gives much more representative structural data about the molecule.

I recommend you play around by trying to find the best “max depth”, or by using RDKit to find better features.
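As a starting point for that playing around, here is a minimal sketch of both remedies in scikit-learn. It uses a synthetic dataset as a stand-in, since the real feature table requires the DrugBank steps above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the drug feature table (500 rows, 13 features)
X, y = make_classification(n_samples=500, n_features=13, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Remedy 1: cap the tree depth so the questions can't get absurdly specific
shallow = DecisionTreeClassifier(max_depth=5, random_state=1)
shallow.fit(X_train, y_train)

# Remedy 2: a random forest averages many trees, each built on a bootstrap
# resample of the training data (and a random subset of features per split)
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)

print("depth-capped tree:", accuracy_score(y_test, shallow.predict(X_test)))
print("random forest:   ", accuracy_score(y_test, forest.predict(X_test)))
```

Swapping `finalData.csv` in for the synthetic data only requires replacing the `make_classification` line with the same `pd.read_csv` / `df.drop('Target', axis=1)` steps used earlier.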

Finally, it is important to note that the limitation of this model is that it only really tells you whether a chemical is more likely to target one protein or the other, not whether the compound can even be a drug. If you are really interested in using AI for drug discovery, a parameter more indicative of how good a drug is would be useful. For example, target binding affinity would be a useful quantity, and a regression model for predicting it would be even more useful.