If you’re looking to transition to a Data Science career, you’ll most likely want to practice collecting, cleaning and analysing data to answer a variety of challenging business questions.

We’ve listed a number of projects you can learn from and practice your skills on. We’ll be adding more to the series over time, so it’s worth following us or subscribing to our newsletter at the bottom so you don’t miss out!

1. Turkiye Student Evaluation

The problem: Use classification and clustering techniques to deal with the data.

This is a great beginner’s tutorial, featuring a dataset based on an evaluation form filled out by students for various courses. The attributes include attendance, difficulty and the score for each evaluation question, among others. This is an unsupervised learning problem. The dataset has 5,820 rows and 33 columns.

Dataset: Get Data | Tutorial: Get Here

2. Height and Weight Prediction

The problem: Predict the height or weight of a person.

An ideal beginner’s project for anyone looking to get into Data Science. It’s a regression problem. The dataset has 25,000 rows and 3 columns (index, height and weight).

Dataset: Get Data | Tutorial: Get Here

3. Predicting the Activity Category of a Human

The problem: Predict the activity category of a human.

This is a multi-class classification problem, featuring a dataset taken from recordings of 30 subjects captured via smartphones with embedded inertial sensors. The dataset has 10,299 rows and 561 columns.

Dataset: Get Data | Tutorial: Get Here

4. Classifying Documents According to Labels

The problem: Classify documents according to their labels.

This is an intermediate-level project, featuring a dataset originally from the SIAM Text Mining Competition (2007). The data consists of aviation safety reports describing problems that occurred on certain flights. It’s a multi-classification and high-dimensional problem. The dataset has 21,519 rows and 30,438 columns.

Dataset: Get Data | Tutorial: Get Here

5. Which Celebrity’s Voice is it?

The problem: Figure out which celebrity the voice belongs to.

Audio processing is rapidly growing into an important field in deep learning, so this is a great (advanced) tutorial to learn from and practice your skills on. The dataset is for large-scale speaker identification and contains words spoken by celebrities, taken from various YouTube videos. It contains 100,000 phrases spoken by 1,251 celebrities.

Dataset: Get Data | Tutorial: Get Here

6. Implementing your own Cluster

The problem: Determine the optimal number of clusters for k-means clustering.

K-means is an unsupervised learning method and one of the most popular ways of clustering unlabelled data into k clusters. This tutorial provides an overview of how k-means works and discusses how to implement your own clusters.

You’ll also learn how to use the elbow method to estimate the value of k. Another popular method of estimating k is silhouette analysis; a scikit-learn example can be found here.
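As a rough illustration of both approaches, the sketch below runs k-means for several values of k and records the inertia (the within-cluster sum of squares that the elbow method plots) alongside the silhouette score. It is not the tutorial's code: the data is a synthetic set of three blobs generated with scikit-learn's `make_blobs`, and the blob centres, the range of k and the variable names are assumptions chosen for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (an assumption, not the tutorial's dataset):
# three well-separated 2-D blobs.
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.8,
    random_state=0,
)

inertias = {}     # k -> within-cluster sum of squares (for the elbow plot)
silhouettes = {}  # k -> mean silhouette score (only defined for k >= 2)

for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
    if k > 1:
        silhouettes[k] = silhouette_score(X, model.labels_)

# Silhouette analysis: pick the k with the highest mean score.
best_k = max(silhouettes, key=silhouettes.get)

for k in inertias:
    score = f"{silhouettes[k]:.3f}" if k in silhouettes else "n/a"
    print(f"k={k}  inertia={inertias[k]:10.1f}  silhouette={score}")
print("k chosen by silhouette:", best_k)
```

For the elbow method you would plot the inertia values against k and look for the point where the curve stops dropping sharply. On real data the bend is often less obvious than on clean blobs like these, which is one reason to check the silhouette score as well.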

Dataset: Get Data | Tutorial: Get Here