The following problems are taken from the projects / assignments in the edX course Python for Data Science (UCSanDiegoX) and the Coursera course Applied Machine Learning in Python (UMich).

1. Exploratory Analysis to Find Trends in Average Movie Ratings for different Genres

Dataset

● The MovieLens 20M movie-ratings dataset is used for the analysis.

● The dataset is downloaded from here .

● This dataset contains 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users; it was released in April 2015.

● The csv files movies.csv and ratings.csv are used for the analysis.

Motivation

● Understand the trends in average ratings for different movie genres over the years (from 1995 to 2015) and the correlation between the trends for different genres (seven genres are considered: Animation, Comedy, Romance, Thriller, Horror, Sci-Fi and Musical).

● This will give us insight into how people’s liking for the different movie genres changes over time, and into the strength of association between the trends for different genres; such insights are possibly useful for critics.

Research Questions

The following research questions will be answered using exploratory analysis and visualization:

1) The trend in average ratings for different movie genres: How do the average ratings for seven movie genres (namely Animation, Comedy, Romance, Thriller, Horror, Sci-Fi and Musical) change over time (by year, from 1995 to 2015)? How do the average ratings for the different genres compare with one another?

2) Correlation between the trends for different genres: How are the trends for the genres correlated? For example, are the average ratings for Comedy and Sci-Fi movies positively associated with each other? What is the strength of association?

Pre-processing the dataset

The next table shows the first few rows of the movies dataset, loaded in a pandas DataFrame.

   movieId  title                                genres
0        1  Toy Story (1995)                     Adventure|Animation|Children|Comedy|Fantasy
1        2  Jumanji (1995)                       Adventure|Children|Fantasy
2        3  Grumpier Old Men (1995)              Comedy|Romance
3        4  Waiting to Exhale (1995)             Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)   Comedy

print(movies.shape)  # (27278, 3)

With an explode() function implemented similarly to this stackoverflow post, the DataFrame is transformed to the one shown below.

movies.genres = movies.genres.str.split('|')
movies = explode(movies, ['genres'])
movies.head()

   movieId  title             genres
0        1  Toy Story (1995)  Adventure
1        1  Toy Story (1995)  Animation
2        1  Toy Story (1995)  Children
3        1  Toy Story (1995)  Comedy
4        1  Toy Story (1995)  Fantasy
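The explode() helper itself might look like the following minimal sketch (an assumed implementation, not the exact code from the stackoverflow post; recent pandas versions provide DataFrame.explode natively):

```python
import pandas as pd

def explode(df, cols):
    """Repeat each row once per element of its list-valued columns.
    (Modern pandas ships DataFrame.explode, which does this natively.)"""
    # Repeat each index label as many times as its list has elements.
    idx = df.index.repeat(df[cols[0]].str.len())
    out = df.loc[idx].copy()
    for c in cols:
        # Flatten the lists column-wise to fill the repeated rows.
        out[c] = [v for row in df[c] for v in row]
    return out.reset_index(drop=True)

movies = pd.DataFrame({
    'movieId': [1, 2],
    'title': ['Toy Story (1995)', 'Jumanji (1995)'],
    'genres': [['Adventure', 'Animation'], ['Adventure']],
})
movies = explode(movies, ['genres'])
```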

The next table shows the first few rows of the ratings dataset, again loaded with pandas.

   userId  movieId  rating   timestamp
0       1        2     3.5  1112486027
1       1       29     3.5  1112484676
2       1       32     3.5  1112484819
3       1       47     3.5  1112484727
4       1       50     3.5  1112484580

The input tables are pre-processed using the following code to get the data into the desired format, ready for the analysis.

import time

ratings['timestamp'] = ratings['timestamp'].apply(lambda x: time.strftime('%Y', time.localtime(x)))
ratings.head()

   userId  movieId  rating  timestamp
0       1        2     3.5       2005
1       1       29     3.5       2005
2       1       32     3.5       2005
3       1       47     3.5       2005
4       1       50     3.5       2005
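A vectorized alternative to the apply/strftime conversion (note that pd.to_datetime interprets epoch seconds as UTC, while time.localtime uses the local timezone, so years can differ near a New Year boundary):

```python
import pandas as pd

# Two sample MovieLens epoch timestamps (seconds since 1970-01-01 UTC).
ratings = pd.DataFrame({'timestamp': [1112486027, 851527569]})

# Convert epoch seconds to datetimes in one vectorized call, then take the year.
ratings['year'] = pd.to_datetime(ratings['timestamp'], unit='s').dt.year
```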

movies = movies.drop('title', axis=1)
movies.head()

   movieId  genres
0        1  Adventure
1        1  Animation
2        1  Children
3        1  Comedy
4        1  Fantasy

ratings = ratings.merge(movies, left_on='movieId', right_on='movieId', how='inner')
ratings.head()

   userId  movieId  rating  timestamp  genres
0       1        2     3.5       2005  Adventure
1       1        2     3.5       2005  Children
2       1        2     3.5       2005  Fantasy
3       5        2     3.0       1996  Adventure
4       5        2     3.0       1996  Children

ratings.shape  # (53177059, 5)

import numpy as np

ratings = ratings.loc[ratings['genres'].isin(['Sci-Fi', 'Animation', 'Comedy', 'Romance', 'Thriller', 'Horror', 'Musical'])]
mean_ratings = ratings.groupby(['timestamp', 'genres'], as_index=False)['rating'].aggregate(np.mean)
mean_ratings.rename(columns={'timestamp': 'year'}, inplace=True)
sd_ratings = ratings.groupby(['timestamp', 'genres'], as_index=False)['rating'].aggregate(np.std)
sd_ratings.rename(columns={'timestamp': 'year'}, inplace=True)
sd_ratings.head()

   year     genres    rating
0  1995     Comedy  0.000000
1  1995    Romance       NaN
2  1995   Thriller  1.414214
3  1996  Animation  0.953820
4  1996     Comedy  1.010918
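The two groupby passes can also be collapsed into one using named aggregations; a sketch on toy data:

```python
import pandas as pd

# Toy stand-in for the filtered ratings table.
ratings = pd.DataFrame({
    'timestamp': ['1996', '1996', '1996', '1997'],
    'genres':    ['Comedy', 'Comedy', 'Horror', 'Comedy'],
    'rating':    [3.0, 4.0, 2.5, 5.0],
})

# Mean and sample standard deviation per (year, genre) in a single groupby.
stats = (ratings.groupby(['timestamp', 'genres'])['rating']
                .agg(mean_rating='mean', sd_rating='std')
                .reset_index()
                .rename(columns={'timestamp': 'year'}))
```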

Visualizations

Let’s first visualize (using seaborn) the distributions of the movie ratings across the different genres, as shown in the following figure.

The next figure shows the trends of the average ratings by users for different genres across different years. The vertical error bars represent the standard deviations of the average ratings (ratings for different movies averaged over users) for the same genres (s.d. across the movies belonging to the same genre).

ratings2 = ratings.groupby(['movieId', 'timestamp', 'genres'], as_index=False)['rating'].aggregate(np.mean)
ratings2.head()

   movieId  year     genres    rating
0        1  1996  Animation  4.132947
1        1  1996     Comedy  4.132947
2        1  1997  Animation  3.875221
3        1  1997     Comedy  3.875221
4        1  1998  Animation  3.883929

The next figure shows the trends of the ratings (averaged over users and movies for each genre) for different genres across different years.

ratings3 = ratings.groupby(['userId', 'genres'], as_index=False)['rating'].aggregate(np.mean)
ratings3.head()

   userId     genres    rating
0       1  Animation  3.650000
1       1     Comedy  3.731707
2       1     Horror  3.744444
3       1    Musical  3.666667
4       1    Romance  3.954545

pivot = ratings3.pivot(index='userId', columns='genres', values='rating')
pivot.head()

genres  Animation    Comedy    Horror   Musical   Romance    Sci-Fi  Thriller
userId
1        3.650000  3.731707  3.744444  3.666667  3.954545  3.712500  3.761905
2        3.000000  3.900000  3.555556  3.000000  3.833333  4.608696  4.263158
3        3.750000  4.057692  3.937500  4.000000  4.062500  4.000000  4.260000
4        4.000000  3.545455       NaN  4.000000  3.500000  3.000000  3.461538
5        4.666667  4.083333  3.000000  4.375000  3.937500  4.600000  4.333333

The next figures show the trends for different genres for different sub-windows of time and with the variations (inter-quartile ranges) in average ratings for each genre.

The next figures show how the trends in average ratings for the different genres are correlated with one another.
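The genre-to-genre correlations behind these figures can be computed directly from a pivot of average ratings with DataFrame.corr; a sketch on illustrative toy values:

```python
import pandas as pd

# Toy per-year average ratings for three genres (illustrative values only).
pivot = pd.DataFrame({
    'Comedy': [3.7, 3.9, 4.0, 3.5],
    'Sci-Fi': [3.6, 4.1, 4.2, 3.4],
    'Horror': [3.2, 3.0, 2.9, 3.3],
})

# Pairwise Pearson correlation matrix between the genre trends.
corr = pivot.corr(method='pearson')
```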

Some Findings

● There is a decreasing trend in the average ratings for all seven genres during 1995-98; the ratings then remain stable during 1999-2007, and increase again afterwards.

● Horror movies always have the lowest average ratings. Sci-Fi and Comedy movies also get low average ratings.

● Musical, Animation and Romance movies get the highest average ratings.

● Sci-Fi and Animation movies show very similar trends; both become popular again during 2009-2013.

● Trends in the average ratings of Romance and Horror movies show positive association between them.

2. Fraud Detection with Classification

Abstract

In this project, we aim to build machine learning models to automatically detect frauds in credit card transactions. Several supervised binary classification models are trained with a 75-25 train-test split on this credit card transaction dataset from Kaggle. Given a transaction instance, a model predicts whether it is a fraud or not. The models are then evaluated on the held-out subset of the data by measuring how effectively they predict instances of credit card fraud. The model with the best recall value (the one able to detect the highest number of true frauds) will be selected for prediction. We found that the k-Nearest Neighbors (k=3) and LogisticRegression models perform best when only recall (sensitivity) is concerned, whereas the RandomForest model achieves high specificity and sensitivity at the same time.

Motivation

Using the credit card transaction dataset, we want to train a few machine learning models that can predict whether an unseen transaction in the future is likely to be fraud or not. This automatic prediction / detection of fraud can immediately raise an alarm and the transaction could be stopped before it completes. This can also help by expediting the process of manual inspection / examination / investigation of the predicted fraudulent transactions, thereby ensuring safety for the financial institution / bank / customers.

Dataset

● This dataset from Kaggle is used for credit card fraud detection. The dataset has been collected and analyzed during a research collaboration of Worldline and the Machine Learning Group of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available here and here.

● The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions. Each row in creditcard.csv corresponds to a credit card transaction; the file has 284,807 rows, with 30 input feature columns plus the Class target.

● Features (Columns) include an integer variable Time (which represents the seconds elapsed between each transaction and the first transaction in the dataset), 28 confidential variables V1 through V28 (these are numerical input variables which are the result of a PCA transformation – unfortunately, due to confidentiality issues, the original features and more background information about the data could not be provided), as well as the variable Amount which is the amount of the transaction.

● The target is stored in the Class column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to an instance of not fraud.

● The next table shows the first few rows.

Data Preparation and Cleaning

● The main problem with this dataset is that it is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions. There are only 492 frauds out of 284,807 transactions: too many negative instances and too few positive (fraud) instances.

● In order to mitigate this high imbalance, so that the models can see enough fraud examples during training, the following techniques were used.

Oversampling: the Synthetic Minority Oversampling Technique (SMOTE) is used to generate new fraud (minority-class) samples by interpolating between k-nearest-neighbor minority samples. Using this technique, the number of positive examples was increased to 5,000.

Under-sampling: the NearMiss-1 algorithm is used to select those samples from the majority (non-fraud) class for which the average distance to the k nearest minority-class samples is smallest. Using this technique, the number of negative (non-fraud) examples was decreased to 10,000.

● The input feature Time is not particularly relevant for predicting fraud, so that feature is dropped from the input.

Research Questions

Predict whether a given transaction is fraudulent or not.

● Given a credit card transaction (represented by the values of its input features), the goal is to answer the following question: is the transaction a fraud?

● More mathematically, given the labelled data, we want to learn a function f that maps a transaction x (its feature vector) to a class label:

    f : x ↦ {0, 1},   f(x) = 1 if the transaction x is a fraud, and 0 otherwise,

from the data. We shall also be interested in computing the conditional probability Prob(fraud | transaction).

● We want to use the function learnt to predict new transactions (not seen while learning the function f ).

● We want to evaluate how correctly we can find frauds from the unseen data and find which model performs the best (model selection).
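As a sketch of what learning f and Prob(fraud | transaction) looks like in sklearn (toy synthetic data stands in for the real transactions and fraud labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy stand-in for fraud labels

clf = LogisticRegression().fit(X, y)

label = clf.predict([[1.0, 1.0]])[0]           # hard decision f(x)
proba = clf.predict_proba([[1.0, 1.0]])[0, 1]  # Prob(fraud | transaction)
```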

Methods

● The dataset is first split (with sklearn) into random train (75%) and test (25%) subsets. Both subsets roughly maintain the ratio of majority vs. minority class.

Split the data into train and test

from sklearn.model_selection import train_test_split

df = pd.read_csv(path + 'creditcard.csv')
df.drop('Time', axis=1, inplace=True)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

● The training dataset is highly imbalanced w.r.t. class ratio (only 372 fraud instances out of 213,607 total instances), so it is balanced using SMOTE (oversampling: the number of fraud instances is increased to 5,000) and NearMiss-1 (under-sampling: the number of non-fraud instances is decreased to 10,000). The test dataset is left untouched.

Preprocessing: Resample to balance the dataset
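SMOTE and NearMiss-1 come from the imbalanced-learn package; as a dependency-free sketch of the same rebalancing idea, plain random over- and under-sampling can stand in (SMOTE synthesizes new minority points by interpolation, and NearMiss-1 picks majority points closest to the minority class, rather than sampling at random):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)

# Toy imbalanced training set: 95 negatives, 5 positives.
train = pd.DataFrame({'V1': rng.randn(100)})
train['Class'] = 0
train.loc[:4, 'Class'] = 1

pos = train[train['Class'] == 1]
neg = train[train['Class'] == 0]

# Oversample the minority class (SMOTE would interpolate new points
# between nearest neighbours instead of copying existing ones).
pos_up = pos.sample(n=20, replace=True, random_state=0)

# Undersample the majority class (NearMiss-1 would pick the negatives
# closest on average to the minority class instead of at random).
neg_down = neg.sample(n=40, random_state=0)

balanced = pd.concat([pos_up, neg_down]).sample(frac=1, random_state=0)
```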

● A few sklearn models (kNN, SVM, LogisticRegression, RandomForest, DecisionTree, AdaBoost, Naive Bayes) are then trained separately on the training dataset, and each learnt model is used to predict the classes of the hitherto-unseen test dataset.

● Given the class imbalance ratio, one of the recommended measures for model evaluation is the Area Under the Precision-Recall Curve (AUPRC), since plain accuracy is not meaningful for unbalanced classification. For example, a dummy classifier trained to output the majority class of the training data achieves an accuracy of 0.996643 but a very poor recall of 0.008333, as shown below.

Using X_train, X_test, y_train, and y_test (as defined above), let’s train a dummy classifier that classifies everything as the majority class of the training data, and compute its accuracy and recall.

from sklearn.metrics import recall_score, precision_score
from sklearn.dummy import DummyClassifier

clf = DummyClassifier(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = float(sum(y_pred == y_test)) / len(y_test)
recall = recall_score(y_test, y_pred)
print(accuracy, recall)  # (0.9966433527148114, 0.0083333333333333332)

● We shall be particularly interested in a high recall, since ideally we want all fraud instances to be predicted correctly as fraud by the model, with zero false negatives.
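Recall is exactly the fraction of true frauds caught, TP / (TP + FN); a quick sanity check with sklearn on toy labels:

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]   # 4 true fraud instances
y_pred = [0, 0, 1, 0, 1, 1, 0, 1]   # 3 of the 4 frauds detected, 1 false alarm

# ravel() unpacks the 2x2 binary confusion matrix in row-major order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)             # 3 / (3 + 1)
```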

Findings

● The next figure shows the prediction evaluation results on the test dataset using the python sklearn Support Vector Classifier with RBF kernel. As can be seen, with resampled training data the recall becomes quite high (~91.7%), although the accuracy and precision drop.

from sklearn.svm import SVC
from sklearn.metrics import precision_recall_curve, roc_curve, auc, confusion_matrix
import matplotlib.pyplot as plt

clf = SVC(random_state=0, C=1e9, gamma=1e-07, probability=True)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)
precision, recall, _ = precision_recall_curve(y_test, y_pred)
fpr, tpr, _ = roc_curve(y_test, y_prob[:, 1], pos_label=1)
print(auc(fpr, tpr))
cnf_matrix = confusion_matrix(y_test, y_pred)
plt.figure()
# plot_confusion_matrix is a plotting helper defined elsewhere (as in the sklearn docs example)
plot_confusion_matrix(cnf_matrix, classes=['0', '1'], title='Confusion matrix, without normalization')

● The next figure shows the prediction evaluation results on the test dataset using the python sklearn LogisticRegression classifier. As can be seen, with resampled training data the recall becomes even higher (~94.2%), although the accuracy and precision drop further.

● The next figure again shows the prediction recall values on the test dataset using the sklearn LogisticRegression classifier, this time using grid search over different hyper-parameter values (for C and the regularization penalty).

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np

parameters = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}
lr = LogisticRegression(solver='liblinear')  # liblinear supports both l1 and l2 penalties
clf = GridSearchCV(lr, parameters, scoring='recall')
clf.fit(X_train, y_train)
print(np.reshape(clf.cv_results_['mean_test_score'], (5, 2)))



● As can be seen, C = 100 with L1 penalty obtains the best recall on the test dataset among the values searched on the grid.

● Also, there are 120 fraud instances in the test dataset, of which all but 7 are detected correctly by the best Logistic Regression model. The next table highlights in red a few instances for which the model failed to predict a fraud.



● The next figure shows visualizations of the classification decision functions learnt from the training dataset and the class labels predicted for the test dataset, projected on the variables V1 and V2, for different classifiers.

● For every classifier, the (specificity, sensitivity) metrics on the test dataset are also computed and shown.

● As can be seen, the RandomForest classifier does a decent job of keeping both specificity and sensitivity quite high (89% and 88% respectively), but its precision is very poor.

● The k-NearestNeighbors (k = 3) and LogisticRegression classifiers do best in terms of recall (95% and 94% respectively), but their specificity is quite poor and their precision is very poor.

Limitations and Future Work

● The models are learnt from this particular credit card fraud dataset and hence may not generalize to other fraud datasets.

● The dataset being highly imbalanced, other methods (such as different variants of SMOTE and ADASYN oversampling, or Tomek’s links / variants of the Edited Nearest Neighbours methods, e.g. AllKNN) might be tried as a pre-processing step to create a more balanced dataset. Also, the resampling produced a final training dataset with a 1:2 ratio of minority to majority instances; different ratios (e.g., 1:1 or 2:3) could be tried.

● A 75-25 train-test split is used to evaluate the classifier models; instead, k-fold (e.g., k=10) cross-validation could be used to obtain more generalizable models.
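A sketch of such k-fold evaluation with sklearn (toy synthetic data stands in for the real transactions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(300, 3)
y = (X[:, 0] > 1.0).astype(int)   # imbalanced toy labels (~16% positives)

# Stratified folds keep the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring='recall')
```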

● Hyper-parameter fine-tuning was done only for LogisticRegression (with grid search); it could be done for all models. For probabilistic classifiers (e.g., Logistic Regression), different classification thresholds could be tried instead of the default.
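Such threshold tuning might be sketched as follows (the probabilities are hypothetical stand-ins for a classifier's predict_proba output):

```python
import numpy as np

# Hypothetical fraud probabilities, e.g. from clf.predict_proba(X_test)[:, 1].
probs = np.array([0.05, 0.20, 0.45, 0.70, 0.90])

# A threshold below the default 0.5 flags more transactions as fraud,
# trading precision for recall.
threshold = 0.3
preds = (probs >= threshold).astype(int)
```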

● We assumed the transactions to be independent of each other (not time-dependent) and ignored the Time feature; one could instead use Time to build a sequential model for prediction.

● The entire dataset might be used with data augmentation for the minority fraud class and deep learning models (CNN or RNN if the Time feature is used) can be learnt and later used for prediction to improve specificity and sensitivity (recall) simultaneously.

● We could try to find the patterns in the input features for a probable fraud instance.

Conclusions

● As can be seen, the LogisticRegression and k-NearestNeighbors (k=3) classifiers obtained the best recall on the held-out dataset (although the precision / specificity of these models is really poor). Hence, if recall is the only metric of concern, we can go ahead with either of these models for fraud detection / prediction of a fraud instance.

References

● Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.

● Credit card transaction dataset from Kaggle.

● https://www.coursera.org/learn/python-machine-learning/

● http://scikit-learn.org/stable/supervised_learning.html#supervised-learning

● http://contrib.scikit-learn.org/imbalanced-learn/stable/