I believe that the best predictor of future work quality is past work quality. Many data science managers share this belief. Not everyone has held a data science job, but EVERYONE can put together data science projects that demonstrate the quality of their work. If you produce great projects and showcase them on your GitHub and resume, you will drastically improve your chances of getting a job.

Projects are great for a few reasons:

They show that you are a self-starter who is willing to tackle large problems

They demonstrate that you can apply data science techniques to real problems that you encounter every day

They allow you to “show” that you understand a concept rather than “tell”

You have complete control over the topic of your projects, so you can tailor them to illustrate specific skill sets

Project-based learning is one of the quickest ways to actually acquire data science skills and learn new tools

*If they are robust and meaningful, they can be as good as or better than on-the-job experience

*more on this in the final thoughts

By the end of this article, I hope that you understand how to choose projects and how their life-cycles work. I also hope that you explore the 4 projects that I recommend to position yourself for success.

This article expands on one of my most popular YouTube videos. If you’re interested, you can check that out here: The Projects You Should Do to Get a Data Science Job

How do you choose a project topic?

One of the most important things about a data science project is that it should be unique to you. The more specific the project and the better you can explain its meaning, the better. Unique projects are great because they showcase some of your personality and are difficult to copy. It is unfortunate, but I have come across candidates who have copied projects on their GitHub or used large portions of code without giving credit.

I think that you should work on projects that fit into one of the following categories (or both):

(1) They are interesting or important to you — If you are interested in the subject of your project, you will be significantly more inclined to work on it and do a good job. This really shows when you are interviewing and asked to talk about the work. When candidates are proud of a project, you can see them visibly light up when asked about it.

(2) They are targeted at an industry or job that you are looking to get into — Doing projects like these demonstrates why you are applying to a specific position. They also illustrate that you have some familiarity with the subject area you could potentially be working in (this article explains why that is important).

As it happens, I do most of my projects on sports, which sit at the intersection of my work and my passion. In my opinion, that is the best-case scenario.

What are the components of a data science project?

All data science projects should have a few things in common. Your project should roughly follow the life-cycle below, and you should be able to speak at length about each step.

Step 1: Planning and Establishing a Reason for the Project — This is what sets everything in motion. You don’t always know what you will find, but you should be trying to answer a question or solve a problem with your analysis. Have a concrete question that you want to answer before starting Step 2.

(e.g. is it possible to predict NBA scores with enough accuracy to create a betting edge?)

Step 2: Data Collection — There are plenty of great places to find data online (Kaggle, Google, Reddit, etc.). You can either choose a dataset from one of these places or find data on your own. I find that it sets candidates apart if they pull data from an API, scrape it, or have another, more unique way of collecting it. Having data that others are not privy to adds to your uniqueness and “wow” factor.

(e.g. used Python to scrape data from Basketball Reference)
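As a sketch of what that scraping might look like, Python's standard library can parse an HTML table on its own, with no scraping dependency. The table markup below is invented for illustration; a real scrape would first fetch the page (e.g. with the requests library) and feed the response body to the parser:

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collects the text of each <td>/<th> cell, one list per <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []          # start a fresh row
        elif tag in ("td", "th"):
            self._in_cell = True    # capture text until the cell closes

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

# A made-up stand-in for a page of box-score data
page = "<table><tr><th>Team</th><th>PTS</th></tr><tr><td>BOS</td><td>112</td></tr></table>"
parser = TableParser()
parser.feed(page)
print(parser.rows)  # [['Team', 'PTS'], ['BOS', '112']]
```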

Step 3: Data Aggregation & Cleaning — This step is commonly overlooked, but it is one of the most important. The way you format and clean your data can have large implications for the outcome of an analysis. You should be able to explain the decisions you made when handling null values, choosing to include or remove certain features, and dealing with outliers.

(e.g. removed games where star players were resting for load management; this is a newer phenomenon and would skew our historic results)
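In pandas, those cleaning decisions might look like the sketch below (the numbers are made up): drop the load-management games from the example above, then impute a missing score with the median instead of discarding the row:

```python
import numpy as np
import pandas as pd

games = pd.DataFrame({
    "pts": [110, 98, np.nan, 150, 104],
    "star_rested": [False, False, False, True, False],
})

# Remove load-management games, which would skew historic results
games = games[~games["star_rested"]].copy()

# Impute the missing score with the median rather than dropping the game
games["pts"] = games["pts"].fillna(games["pts"].median())
print(list(games["pts"]))  # [110.0, 98.0, 104.0, 104.0]
```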

Step 4: Data Exploration — In this part of the analysis, it is important to show that you understand the specifics of your data. You want to dive into the distribution of each feature and also evaluate how the features are related to each other. To show these relationships, you should use visuals like box plots, histograms, and correlation plots. This process helps inform you about which variables will be relevant to the overall question that you are trying to answer.

(e.g. histograms of the points scored per game, number of shots taken, etc.)
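A quick pandas sketch of this step, on made-up numbers: describe() summarizes a feature's distribution, and corr() quantifies how strongly two features are related:

```python
import pandas as pd

# Invented per-game numbers for illustration
games = pd.DataFrame({"shots": [80, 85, 90, 95, 100],
                      "pts": [100, 104, 111, 113, 120]})

summary = games["pts"].describe()          # count, mean, std, quartiles
corr = games["shots"].corr(games["pts"])   # Pearson correlation of shots vs. points
print(summary["mean"], round(corr, 2))
```

The same frame feeds directly into the visuals the step mentions, e.g. `games["pts"].plot.hist()` with matplotlib installed.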

Step 5: Data Analysis — Here, you start evaluating your data set for trends. I recommend using pivot tables to understand if there are differences between groups or over time. Visualization tools should also be heavily used in this portion of the analysis. Much like the previous step, this one helps you to understand which variables to test in your models.

(e.g. points scored per game per team, scatter plot of shots taken vs. points scored, etc.)
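A pivot table along these lines takes one line of pandas; the data below is invented for illustration and groups average points by team and home/away:

```python
import pandas as pd

games = pd.DataFrame({
    "team": ["BOS", "BOS", "LAL", "LAL"],
    "home": [True, False, True, False],
    "pts": [115, 105, 118, 108],
})

# Mean points per team, split into home and away columns
pivot = games.pivot_table(values="pts", index="team", columns="home", aggfunc="mean")
print(pivot)
```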

Step 6: Feature Engineering — This part of your analysis is extremely important (so it has its own step); however, it should usually be done in parallel with the data analysis phase. Feature engineering comes in two flavors: (1) creating new features that could improve the quality of predictions or (2) changing the nature of the data so it is more suitable for analysis.

You should be creative when building new features. You can use composites of others, convert from numeric to categorical (or vice versa), or apply a transformative function to a feature. My favorite example: if you had geographic data points, instead of simply discarding the latitude/longitude, you could use them to compute a distance from a common location.

(e.g. calculated player efficiency rating, a composite metric, from existing data to be used in the model)
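For the geographic example above, the standard trick is the haversine formula, which turns two latitude/longitude pairs into a great-circle distance that can serve as a feature. A self-contained sketch:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# e.g. distance a visiting team travels: New York to Los Angeles, roughly 3,900 km
print(round(haversine_km(40.7128, -74.0060, 34.0522, -118.2437)))
```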

The other type of feature engineering makes data more suitable for your analysis. Many people use principal component analysis (PCA) or factor analysis to reduce the number of features in their data. For some types of models, this can improve results and reduce multicollinearity. For other analyses, you also have to scale the data; this is important when the algorithm relies on geometric distance.

(e.g. used PCA on a dataset with many correlated variables so that we could use a linear model to predict season points)
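A minimal sklearn sketch of this flavor of feature engineering, on synthetic data with two highly correlated features and one independent one: scale first (PCA is variance-based), then keep two components:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
# Three features: columns 0 and 1 are nearly collinear, column 2 is independent
X = np.hstack([base,
               2 * base + rng.normal(scale=0.1, size=(100, 1)),
               rng.normal(size=(100, 1))])

X_scaled = StandardScaler().fit_transform(X)  # scale before PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```

Because two of the three features carry almost the same information, two components retain nearly all of the variance.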

Step 7: Model Building and Evaluation — I will go into this more in the next section, but you should compare multiple models to determine which has the best results for your specific problem. Cross-validate using training and test data so that you can see which model generalizes best. You should also pay particular attention to how you evaluate your model, and be able to explain why you chose your evaluation metric(s).

(e.g. compared a random forest, lasso regression, and SVM regression for predicting NBA scores)
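A sketch of that comparison with sklearn, on synthetic regression data standing in for the NBA dataset. Each model is cross-validated with the same folds and metric (mean absolute error), so the scores are directly comparable:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Synthetic data as a stand-in for scraped game features
X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)

models = {
    "random forest": RandomForestRegressor(random_state=0),
    "lasso": Lasso(),
    "svm": SVR(),
}
# neg_mean_absolute_error: closer to 0 is better
scores = {name: cross_val_score(m, X, y, cv=5,
                                scoring="neg_mean_absolute_error").mean()
          for name, m in models.items()}
print(scores)
```

On this linear synthetic data the lasso wins; on real game data the ranking could easily differ, which is the point of running the comparison.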

Step 8: Put Model into Production (optional) — If I see someone who has made their model “live” through a web page or an API, I am always impressed. This shows that they are comfortable with more advanced programming techniques or packages. I am partial to Python, so I usually use Flask to do this, but I have seen others use R Shiny.

(e.g. made a web page that gives you a projected score after you choose a team, an opponent, and the location)
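A minimal Flask sketch of such an endpoint; the prediction function here is a fixed stub standing in for a trained model, and the route and parameter names are made up for illustration:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_score(team, opponent, home):
    # Stub: in a real app this would call your trained model's predict()
    return 108.5

@app.route("/predict")
def predict():
    team = request.args.get("team")
    opponent = request.args.get("opponent")
    home = request.args.get("home", "true") == "true"
    return jsonify(team=team, opponent=opponent,
                   projected=predict_score(team, opponent, home))

# Run locally with: app.run()  then GET /predict?team=BOS&opponent=LAL&home=true
```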

Step 9: Retrospective — You should always look back on the project to see what you could have done better. Not all projects go perfectly (most don’t), so you should be able to speak to any holes that an interviewer may be able to poke in your analysis. I would also recommend thinking about the next project that you would do based on your findings from the current one.

(e.g. I should have considered pace in this analysis; I would like to see if I can find games where the referee influenced the outcome by building on this methodology)

If you want more tips on separating your data science projects from the pack, check out: Make Memorable Data Science Projects.

The 4 Projects You Should Do

Following the life-cycle steps above, these are the projects that I recommend. You should absolutely not limit yourself to these projects, but doing them will illustrate that you have experience with most of the fundamental data science concepts.

Project 1: Predict a continuous outcome (Regression) — For starters, you should create a question that has a numeric outcome. Then you should compare how various linear and non-linear regression models answer that question (OLS, lasso, SVM, decision tree, random forest, etc.). You should be able to explain the benefits and drawbacks of the techniques that you use. You should also consider combining them (ensemble) to see what results you get.
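One way to try the ensemble idea with sklearn is a VotingRegressor, which averages the predictions of its base models. A sketch on synthetic data, combining a linear and a non-linear model:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for your own continuous-outcome question
X, y = make_regression(n_samples=200, n_features=8, noise=15, random_state=1)

ensemble = VotingRegressor([
    ("ols", LinearRegression()),                       # linear model
    ("rf", RandomForestRegressor(random_state=1)),     # non-linear model
])
r2 = cross_val_score(ensemble, X, y, cv=5, scoring="r2").mean()
print(round(r2, 3))
```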

Project 2: Predict a categorical outcome (Classifier) — The steps here are quite similar to those for the regression project. This time, you should choose a classification problem to solve (binary or multiclass). Again, you should compare the performance of various algorithms on this problem (Naive Bayes, KNN, SVM, decision tree, random forest, etc.).
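A compact sklearn sketch of that comparison on synthetic classification data, cross-validating each algorithm with the same folds:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem as a stand-in for your own
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

accuracies = {name: cross_val_score(model, X, y, cv=5).mean()
              for name, model in [("naive bayes", GaussianNB()),
                                  ("knn", KNeighborsClassifier()),
                                  ("decision tree",
                                   DecisionTreeClassifier(random_state=0))]}
print(accuracies)
```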

Project 3: Group data based on similarity (Clustering) — Clustering can help you make sense out of unlabeled data. It is one of the most useful ways to establish categories from the noise. I recommend doing a project using this technique to show that you have an understanding of unsupervised learning.
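A k-means sketch with sklearn on synthetic unlabeled data; the silhouette score is one common way to judge how clean the resulting groups are:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Unlabeled synthetic data with three hidden groups
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
score = silhouette_score(X, km.labels_)  # closer to 1 = tighter, better-separated clusters
print(round(score, 3))
```

In a real project you would not know the true number of groups; re-running with different `n_clusters` values and comparing silhouette scores is a standard way to choose it.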

Project 4: Use an advanced technique (Neural Net, XGBoost, etc.) — You are welcome to use advanced techniques in any of the previous projects, but I believe that you should have one project that specifically focuses on them. Not all data scientists use deep learning, but you should be familiar with how the concepts work and how they are applied.
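XGBoost requires a separate install, but sklearn's GradientBoostingRegressor implements the same boosted-tree idea and keeps the sketch self-contained. On synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset
X, y = make_regression(n_samples=400, n_features=10, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Boosted trees: each new tree corrects the residual errors of the ensemble so far
gb = GradientBoostingRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
r2 = gb.score(X_test, y_test)
print(round(r2, 3))
```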

In these projects, the model with the best accuracy or MSE may not actually be optimal for answering the question posed. Be sure that you understand the other reasons (interpretability, training cost, etc.) for recommending one algorithm over another.

Final Thoughts*

In my list of why data science projects are great, I note that they can actually be as good as or better than real job experience. I say this because I have seen many data science projects generate traffic, generate revenue, or even serve as the foundation for a new venture. Projects can help you learn concepts and get a job, but they also have the potential to replace the need for a job altogether.