Welcome to the world of machine learning with scikit-learn. Machine learning can be overwhelming at times, partly because of the large number of tools available on the market. This post narrows the choice of tools down to one: scikit-learn.

In this series, you will learn how to construct an end-to-end machine learning pipeline using some of the most popular algorithms that are widely used in industry and in professional competitions, such as those on Kaggle.

However, in this introductory post, we will go through the following topics:

Algorithms that you will learn to implement with scikit-learn in this series.

Now, let's begin this fun journey into the world of machine learning with scikit-learn!

A brief introduction to machine learning!

Machine learning has generated quite the buzz, from Elon Musk fearing the role of unregulated artificial intelligence in society to Mark Zuckerberg holding a view that contradicts Musk's. 💥

So, what exactly is machine learning? Simply put, machine learning is a set of methods that can detect patterns in data and use those patterns to make future predictions. Machine learning has found immense value in a wide range of industries, from finance to healthcare. This translates into a growing demand for people with machine learning skills.



Here is a quick overview of the Google Trends data for machine learning. 😧

Broadly speaking, machine learning can be categorized into three main types:

Supervised learning

Unsupervised learning

Reinforcement learning

Supervised learning

Supervised learning is a form of machine learning in which our data comes with a set of labels or a numeric target variable. These labels/categories usually belong to one feature/attribute, which is commonly known as the target variable. For instance, each row of your data could belong to either the Healthy or the Not Healthy category. Given a set of features such as weight, blood sugar levels, and age, we can use a supervised machine learning algorithm to predict whether the person is healthy or not.

In simple mathematical terms, if S is the supervised learning algorithm, X is the set of input features, such as weight and age, and Y is the target variable with the labels Healthy or Not Healthy, then the algorithm learns the mapping Y = S(X).

Although supervised machine learning is the most common type of machine learning implemented with scikit-learn and in industry, most datasets do not come with predefined labels. In such cases, unsupervised learning algorithms are first used to cluster the unlabeled data into distinct groups, to which we can then assign labels. This is discussed in detail in the Unsupervised learning section.

Supervised learning algorithms
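To make the roles of S, X, and Y concrete, here is a minimal sketch. The data is made up for illustration, and the k-nearest neighbors classifier is just one possible choice for S, not the only one.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up data: each row of X is [weight_kg, blood_sugar_mg_dl, age_years].
X = [
    [60, 85, 25],
    [65, 90, 30],
    [70, 95, 35],
    [95, 160, 55],
    [100, 170, 60],
    [105, 180, 65],
]
# Y: one label per row of X (the target variable).
y = ["Healthy", "Healthy", "Healthy",
     "Not Healthy", "Not Healthy", "Not Healthy"]

# S: the supervised learning algorithm, fitted on (X, y) pairs.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

# Predict labels for two new people the model has not seen.
new_people = [[62, 88, 28], [102, 175, 62]]
predictions = model.predict(new_people)
print(predictions)
```

Each new person receives the label held by the majority of their three nearest neighbors in the training data.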



Supervised learning algorithms can be used to solve both classification and regression problems. You will learn how to implement some of the most popular supervised machine learning algorithms: the ones that are widely used in industry and research and that have helped solve problems across a wide range of domains. They are as follows:









Linear regression: This supervised learning algorithm is used to predict continuous numeric outcomes such as house prices, stock prices, and temperature, to name a few

Logistic regression: The logistic regression algorithm is a popular classification algorithm that is especially used in the credit industry to predict loan defaults

k-Nearest Neighbors: The k-NN algorithm is a classification algorithm that is used to classify data into two or more categories, for example, to classify houses into expensive and affordable categories based on price, area, number of bedrooms, and a range of other features

Support vector machines: The SVM algorithm is a popular classification algorithm that is used in image and face detection, along with applications such as handwriting recognition

Tree-based algorithms: Tree-based algorithms such as decision trees, random forests, and boosted trees are used to solve both classification and regression problems

Naive Bayes: The Naive Bayes classifier is a machine learning algorithm that uses probability theory (Bayes' theorem) to solve classification problems

Unsupervised learning

Unsupervised learning is a form of machine learning in which the algorithm tries to find patterns in data that does not have an outcome/target variable. In other words, the data does not come with pre-existing labels. The algorithm will therefore typically use a metric such as distance to group data points together depending on how close they are to each other.

As discussed in the previous section, most of the data that you will encounter in the real world does not come with a set of predefined labels and, as such, only has a set of input features without a target attribute. In simple terms, if U is the unsupervised learning algorithm and X is a set of input features, such as weight and age, then U works on X alone; there is no target variable Y.

Given this data, our objective is to create groups that could potentially be labeled as Healthy or Not Healthy. The unsupervised learning algorithm will use a metric such as distance to identify how close a set of points are to each other and how far apart two such groups are.
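As a rough illustration of what "a metric such as distance" means here, the sketch below computes the Euclidean distance between a few made-up points. Euclidean distance is an assumption for this example; other metrics can be used as well.

```python
import numpy as np

# Made-up points: each row is [weight_kg, age_years]; there is no label column.
X = np.array([
    [60.0, 25.0],
    [63.0, 29.0],
    [100.0, 60.0],
])


def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


# The first two points are close together; the third is far from both.
d_close = euclidean_distance(X[0], X[1])
d_far = euclidean_distance(X[0], X[2])
print(d_close, d_far)
```

Points separated by a small distance are candidates for the same group, while points separated by a large distance likely belong to different groups.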

Unsupervised learning algorithms

Unsupervised machine learning algorithms are typically used to cluster points of data based on distance. The unsupervised learning algorithm that you will learn is as follows:

k-means : The k-means algorithm is a popular algorithm that is typically used to segment customers into unique categories based on a variety of features, such as their spending habits. This algorithm is also used to segment houses into categories based on their features, such as price and area.
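A minimal sketch of customer segmentation with k-means might look like the following. The customer data and the choice of two clusters are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up customers: each row is [annual_spend, visits_per_month].
customers = np.array([
    [200, 1], [220, 2], [210, 1],    # low spenders
    [900, 8], [950, 9], [920, 10],   # high spenders
])

# Ask for two segments; random_state makes the run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

print(labels)                   # cluster index assigned to each customer
print(kmeans.cluster_centers_)  # the center of each segment
```

The cluster indices themselves are arbitrary; what matters is which customers end up sharing an index.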

Reinforcement learning

Reinforcement learning is an area of machine learning concerned with taking suitable actions to maximize reward in a particular situation. It is employed by various software and machines to find the best possible behavior or path to take in a specific situation. Reinforcement learning differs from supervised learning in that supervised training data comes with an answer key, so the model is trained on the correct answers, whereas in reinforcement learning there is no answer key and the agent decides what to do to perform the given task. In the absence of a training dataset, it has to learn from its own experience.
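To give a feel for learning from experience, here is a small sketch of an epsilon-greedy agent choosing between two actions. The environment, its reward probabilities, and the value of epsilon are all hypothetical, and the example uses NumPy only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical environment: two actions with hidden reward probabilities.
true_reward_prob = [0.2, 0.8]

estimates = [0.0, 0.0]  # the agent's running estimate of each action's value
counts = [0, 0]         # how many times each action has been tried
epsilon = 0.1           # fraction of steps spent exploring at random

for step in range(2000):
    if rng.random() < epsilon:
        action = int(rng.integers(2))       # explore: pick a random action
    else:
        action = int(np.argmax(estimates))  # exploit: pick the best so far
    # The environment returns a reward of 1 or 0; there is no answer key.
    reward = 1.0 if rng.random() < true_reward_prob[action] else 0.0
    counts[action] += 1
    # Update the running average reward for the chosen action.
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = int(np.argmax(estimates))
print(best_action, estimates)
```

The agent is never told which action is correct; it only observes rewards and gradually favors the action that pays off more often.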

Prerequisites for machine learning (must read 🙏):

How to Setup Jupyter Notebook perfectly for Data Analysis

Pandas in Python for Data Analysis with Example (Step-by-Step guide)

Data Visualization

Let's dive in!



How are we going to do it?

-- Scikit-learn

Scikit-learn is free and open source software that helps you tackle supervised and unsupervised machine learning problems. It is written mostly in Python and builds on some of the most popular libraries that Python has to offer, namely NumPy and SciPy.

The main reason why scikit-learn is so popular is that most of the world's most popular machine learning algorithms can be implemented quite quickly in a plug-and-play fashion once you know what the core pipeline looks like. Another reason is that popular classification algorithms such as logistic regression and support vector machines are implemented with Cython, which gives them C-like performance and makes scikit-learn quite efficient.
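The plug-and-play idea can be sketched as follows: different estimators share the same fit and predict methods, so swapping one algorithm for another takes a single line. The toy data and the choice of three classifiers are for illustration only.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Made-up, well-separated two-class data.
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
     [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]]
y = [0, 0, 0, 1, 1, 1]

# Three different algorithms behind the same interface.
models = {
    "logistic regression": LogisticRegression(),
    "support vector machine": SVC(),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=3),
}

results = {}
for name, model in models.items():
    model.fit(X, y)  # same call for every estimator
    results[name] = model.predict([[0.1, 0.1], [1.0, 0.9]]).tolist()
    print(name, results[name])
```

Only the constructor changes between algorithms; the rest of the pipeline stays the same.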



Scikit-learn is designed to tackle problems pertaining to supervised and unsupervised learning only and does not support reinforcement learning at present.

Installing the scikit-learn package

There are two ways in which you can install scikit-learn on your personal device:

By using the pip method

By using the Anaconda method

The pip method can be implemented on the macOS/Linux Terminal or the Windows PowerShell, while the Anaconda method will work with the Anaconda prompt.

Choosing between these two methods of installation is pretty straightforward: use the Anaconda method if you work in the Anaconda prompt, and the pip method otherwise.

The pip method

    pip3 install numpy
    pip3 install scipy
    pip3 install scikit-learn
    pip3 install -U scikit-learn



The Anaconda method

    conda install numpy
    conda install scipy
    conda install scikit-learn
    conda update scikit-learn
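Whichever method you use, a quick way to check that the installation worked is to import the three packages in Python and print their versions. This is a small sanity-check sketch; the version numbers you see will depend on your environment.

```python
import numpy
import scipy
import sklearn

# Collect the installed version of each package.
versions = {
    module.__name__: module.__version__
    for module in (numpy, scipy, sklearn)
}

for name, version in versions.items():
    print(name, version)
```

If any of the imports fails with an ImportError, that package was not installed into the Python environment you are running.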

So far, this post has given a brief introduction to what machine learning is, for those of you who are just beginning your journey into the world of machine learning. You have learned how scikit-learn fits into the context of machine learning and how to install the necessary software.

Now, we'll put this into practice💪 and do some data exploration and analysis.

The dataset we'll look at in this section is the so-called Boston housing dataset.



Loading the Data into Jupyter Using a Pandas DataFrame

Data is often stored in tables, which means it can be saved as a comma-separated values (CSV) file. This format, and many others, can be read into Python as a DataFrame object using the Pandas library. Other common formats include tab-separated values (TSV), SQL tables, and JSON data structures, and Pandas has support for all of these. In this example, however, we are not going to load the data this way, because the dataset is available directly through scikit-learn.
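For reference, reading such formats with Pandas might look like the sketch below. The data is made up and read from in-memory strings so that the example is self-contained; in practice you would pass a file path instead.

```python
import io

import pandas as pd

# A tiny, made-up CSV document held in memory instead of a file on disk.
csv_text = """weight,age,label
60,25,Healthy
100,60,Not Healthy
"""

# CSV: comma-separated values.
df_csv = pd.read_csv(io.StringIO(csv_text))

# TSV: the same reader, with a tab as the separator.
tsv_text = csv_text.replace(",", "\t")
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# JSON: round-trip the same records through a JSON string.
json_text = df_csv.to_json(orient="records")
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
```

All three readers return a DataFrame with the same columns, so the rest of an analysis does not depend on the original file format.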



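The Boston housing dataset can be accessed from the sklearn.datasets module using the load_boston function. Note that load_boston was deprecated in scikit-learn 1.0 and removed in version 1.2, so the code in this post requires an earlier version of scikit-learn.

    from sklearn import datasets

    # Load the dataset (a dictionary-like object) and inspect it
    boston = datasets.load_boston()
    type(boston)
    print(boston['DESCR'])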





    import pandas as pd

    # Load the data into a pandas DataFrame
    df = pd.DataFrame(data=boston['data'], columns=boston['feature_names'])

    # Check the first 5 rows of the DataFrame
    df.head()

          CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
    0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
    1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
    2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
    3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
    4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33
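For readers on a scikit-learn version that no longer includes load_boston, the same loading steps can be tried on a dataset that still ships with the library. The sketch below uses load_diabetes purely as a stand-in; it is not the dataset used in this post, and its columns are different.

```python
import pandas as pd
from sklearn import datasets

# Stand-in dataset bundled with scikit-learn (not the Boston housing data).
diabetes = datasets.load_diabetes()

# Same pattern as above: feature matrix plus feature names into a DataFrame.
df_alt = pd.DataFrame(data=diabetes["data"], columns=diabetes["feature_names"])

# Check the first 5 rows of the DataFrame.
print(df_alt.head())
```

The returned object exposes the same 'data', 'feature_names', 'target', and 'DESCR' keys, so the loading pattern carries over unchanged.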

In machine learning, the variable that is being modeled is called the target variable; it's what you are trying to predict given the features. For this dataset, the suggested target is MEDV, the median house value in 1,000s of dollars.



    # Add the target as a temporary column of the DataFrame
    df['MEDV'] = boston['target']

    # Create a copy of the target values
    y = df['MEDV'].copy()

    # Delete the newly created column
    del df['MEDV']

    # Concatenate the target column to the front of the existing DataFrame
    df = pd.concat((y, df), axis=1)

       MEDV     CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO      B  LSTAT
    0  24.0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.9   4.98
    1  21.6  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.9   9.14
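The effect of the steps above is to move the target column to the front of the DataFrame. As a sketch of a shorter alternative, the same reordering can be done with pop and insert. The toy DataFrame here reuses a few values from the output above so that the example runs on its own.

```python
import pandas as pd

# Toy DataFrame with the target column (MEDV) in the last position.
toy = pd.DataFrame({
    "CRIM": [0.00632, 0.02731],
    "RM": [6.575, 6.421],
    "MEDV": [24.0, 21.6],
})

# pop removes the column and returns it; insert puts it back at position 0.
target = toy.pop("MEDV")
toy.insert(0, "MEDV", target)

print(list(toy.columns))
```

Keeping the target in the first column makes it easy to spot when inspecting the data and to separate from the features later on.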