Introduction

Let’s walk through a concise demo of fitting a Naive Bayes classifier on the Titanic data set.

Description:

On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This tragedy has led to better safety regulations for ships.

Machine learning problem:

To predict which passengers survived the tragedy, based on the data given.

What we will do:

1. Basic cleaning of missing values in the train and test data sets

2. 5-fold cross-validation

3. Fitting a Naive Bayes classifier

4. Predicting on the test data set

Importing libraries

Let’s import the libraries we need.

import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn import cross_validation  # deprecated since scikit-learn 0.18; use sklearn.model_selection in newer versions
import matplotlib.pyplot as plt
%matplotlib inline

Reading training and testing data set

Let’s import the data set

train = pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\train.csv')
test = pd.read_csv('C:\\Users\\Arpan\\Desktop\\titanic data set\\test.csv')
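Before cleaning, it helps to see how many values are actually missing in each column. Here is a minimal, self-contained sketch of that check; the toy frame below is invented for illustration and stands in for the real train.csv:

```python
import numpy as np
import pandas as pd

# toy stand-in for the Titanic training frame (values are illustrative)
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "Q"],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})

print(demo.shape)               # rows x columns
missing = demo.isnull().sum()   # per-column count of missing entries
print(missing["Age"], missing["Embarked"], missing["Fare"])
```

Running the same `isnull().sum()` on the real training frame shows which columns the cleaning function below has to impute.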

Data Cleaning

Let’s create a function to clean the training and testing data. Here we are doing two things:

1. Encoding the categorical variables manually

2. Imputing the missing values.

def data_cleaning(train):
    # impute missing values
    train["Age"] = train["Age"].fillna(train["Age"].median())
    train["Fare"] = train["Fare"].fillna(train["Fare"].median())
    train["Embarked"] = train["Embarked"].fillna("S")
    # encode the categorical variables manually
    train.loc[train["Sex"] == "male", "Sex"] = 0
    train.loc[train["Sex"] == "female", "Sex"] = 1
    train.loc[train["Embarked"] == "S", "Embarked"] = 0
    train.loc[train["Embarked"] == "C", "Embarked"] = 1
    train.loc[train["Embarked"] == "Q", "Embarked"] = 2
    return train
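To sanity-check what the cleaning function does, here is a standalone sketch run on a tiny hand-made DataFrame (the column names match the Titanic CSVs; the toy values themselves are invented for illustration):

```python
import numpy as np
import pandas as pd

def data_cleaning(df):
    # impute numeric columns with their medians, Embarked with "S"
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Fare"] = df["Fare"].fillna(df["Fare"].median())
    df["Embarked"] = df["Embarked"].fillna("S")
    # manual label encoding for the categorical columns
    df.loc[df["Sex"] == "male", "Sex"] = 0
    df.loc[df["Sex"] == "female", "Sex"] = 1
    df.loc[df["Embarked"] == "S", "Embarked"] = 0
    df.loc[df["Embarked"] == "C", "Embarked"] = 1
    df.loc[df["Embarked"] == "Q", "Embarked"] = 2
    return df

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0],
    "Fare": [7.25, 71.28, np.nan],
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", np.nan, "Q"],
})
cleaned = data_cleaning(toy)
print(cleaned["Age"].tolist())       # NaN replaced by the median of [22, 30], i.e. 26
print(cleaned["Sex"].tolist())       # [0, 1, 0]
print(cleaned["Embarked"].tolist())  # [0, 0, 2] (the NaN became "S", then 0)
```

Note that the function mutates the frame it is given; that is fine here because we reassign the result to the same variable anyway.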

Let’s clean the data

train = data_cleaning(train)
test = data_cleaning(test)

Selecting predictor Variables

Let’s choose the predictor variables. We will not use the Cabin and PassengerId columns (nor, in this demo, Pclass, Name, Ticket, or Embarked).

predictor_Vars = [ "Sex", "Age", "SibSp", "Parch", "Fare"]

X & y

Let’s separate predictors and target. X is the array of predictor variables and y is the target variable. We will use these when modelling.

X, y = train[predictor_Vars], train.Survived

Let’s check X

X.iloc[:5]

   Sex   Age  SibSp  Parch     Fare
0    0  22.0      1      0   7.2500
1    1  38.0      1      0  71.2833
2    1  26.0      0      0   7.9250
3    1  35.0      1      0  53.1000
4    0  35.0      0      0   8.0500

Let’s check y

y.iloc[:5]

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

Model Initialization

Let’s initialize the Naive Bayes classifier model; you can set model parameters here if you want.

modelNaiveBayes= GaussianNB()

Cross validation

Let’s run 5-fold cross-validation.

modelNaiveBayesCV = cross_validation.cross_val_score(modelNaiveBayes, X, y, cv=5)

Let’s check the accuracy metric of each of the five folds

modelNaiveBayesCV

array([ 0.79329609, 0.81005587, 0.80337079, 0.78089888, 0.79661017])
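The same cross-validation call can be reproduced end-to-end on synthetic data. The sketch below uses the newer `sklearn.model_selection` module (the `sklearn.cross_validation` module used in this article was deprecated in scikit-learn 0.18 and later removed); the synthetic problem is only a stand-in for the Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# synthetic binary classification problem standing in for the Titanic data
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     random_state=0)

# 5-fold CV: the data is split into 5 parts; each part is held out once
# while the model is trained on the other 4/5 of the rows
scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=5)
print(scores.shape)   # one accuracy score per fold
print(scores.mean())  # average accuracy across the five folds
```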

Let’s see the same information on the plot

plt.plot(modelNaiveBayesCV,"p")


Let’s check the mean model accuracy of all five folds

print(modelNaiveBayesCV.mean())

0.796846357544





Model Fitting

Let’s now fit the model, with the same parameters, on the whole data set instead of the 4/5 of it used in each cross-validation fold.

modelNaiveBayes = GaussianNB()
modelNaiveBayes = modelNaiveBayes.fit(X, y)

Predictions on test data set

Let’s get the prediction values for the test data set

predictions = modelNaiveBayes.predict(test[predictor_Vars])
predictions
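For a Kaggle submission, the predictions are usually paired with the test set's PassengerId column and written to a two-column CSV. A hedged sketch of that final step (the toy IDs, predictions, and the file name are illustrative):

```python
import numpy as np
import pandas as pd

# toy stand-ins for the real test frame and the model's predictions
test = pd.DataFrame({"PassengerId": [892, 893, 894]})
predictions = np.array([0, 1, 0])

submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": predictions,
})
submission.to_csv("submission.csv", index=False)  # two-column CSV for Kaggle
print(submission.shape)  # one row per test passenger, two columns
```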