Lean Startup has been a breakthrough in the entrepreneurial landscape in recent years. But it is more than a methodology for creating successful businesses: it can also be applied to other domains, such as Data Science. In this post, I will discuss how to apply the Lean Startup method to a Data Science project. As an example, I created a project that visualizes all the football matches played in the UEFA Champions League since 1955.

What is Lean Startup

Lean Startup is a methodology for developing successful businesses and products by reducing uncertainty through process iteration and customer development. The term was coined by Eric Ries and comes from the idea of Lean Manufacturing developed by Toyota, whose objective is to minimize waste in the car production chain.

The Lean Startup method is based on a cycle of three steps. The first step is Learn, where you propose your ideas, set out your hypothesis and decide whether to pivot or persevere in your strategy. The next step is Build, where you turn your ideas into a product or make an iteration to develop your product further. The final step is Measure, where you validate your hypothesis and see how your audience responds. This is an iterative process whose main focus is to develop products or businesses that people actually want, while minimizing the expenditure of time, money and effort and managing uncertainty. This method can also be applied when developing a Data Science project.

A Lean Data Science Process

There are several methodologies for developing data science projects. At Microsoft, we developed and intensively use the Team Data Science Process (TDSP). This methodology enables us to implement projects effectively while collaborating with teams inside and outside the company.

The TDSP is composed of four elements: a data science lifecycle definition; a standard project structure, publicly available on GitHub; a shared and distributed analytics infrastructure, provided by Azure; and productivity tools and utilities for data scientists. A full description of the method is given in this post.

In some situations, when starting a Data Science project, we have a clear view of the business case, access to all the customer data and a clear roadmap of what the customer wants. In that situation, the Lean Startup process has little value. However, when there is uncertainty, when the customer doesn't know what they want or when we don't know whether a product is going to be successful, the Lean Startup method can prove its benefit.

Implementing the Lean Startup method in the TDSP is easy, since we can use all the tools the TDSP proposes. In the Lean Startup method, the priority is to reduce (or eliminate) uncertainty in order to understand what the customer really wants. Following the three steps of the Lean Startup method, we first have to set the hypothesis. The next step is to build a Minimum Viable Product (MVP), using the TDSP. The MVP has two important features: it helps us validate the hypothesis or ideas we proposed, and it is a complete, end-to-end product with the minimum number of features. The final step is to show the MVP to the customer and measure its impact.

This process should ideally be completed in days or weeks. The sooner we show the MVP to the customer, the sooner we get feedback, reduce the uncertainty and can iterate on our project. Next, I'm going to show a practical example of how to create a data science project using the Lean Startup method.

Visualization of Champions Football Matches since 1955

My lean project starts with a hypothesis: "my blog readers would like to see a post about visualization". In the initial commits, I focused on visualization using the library datashader. This was my first MVP, a notebook where I showed two visualization examples. The next step is to get metrics from my audience. That's difficult, because sometimes the audience just doesn't reply and it is not always straightforward to attract their attention. Luckily, I have geek friends and relatives who are interested in giving me feedback (or have been forced to). After some thought, I realized that speaking only about visualization does not have much value.

So I decided to set a second hypothesis: "my blog readers would like to read about how to apply lean startup to data science and see an end-to-end example related to visualization". The project consists of a visualization of all the football matches in the Champions League since 1955, where each match is represented as a line between the coordinates of the two teams' stadiums.
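The idea of drawing each match as a line between two stadiums can be sketched in a few lines of Python. This is only a minimal illustration, not the code of the project: the `STADIUMS` and `MATCHES` sample data, the approximate coordinates and the helper name `match_segments` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Coord = Tuple[float, float]  # (longitude, latitude)

# Hypothetical sample data; coordinates are approximate and for illustration only.
STADIUMS: Dict[str, Coord] = {
    "Real Madrid": (-3.69, 40.45),
    "Bayern Munich": (11.62, 48.22),
    "Liverpool": (-2.96, 53.43),
}

# Hypothetical (home, away) pairs, not real results.
MATCHES: List[Tuple[str, str]] = [
    ("Real Madrid", "Bayern Munich"),
    ("Liverpool", "Real Madrid"),
    ("Bayern Munich", "Unknown FC"),  # no coordinates available for this team
]


def match_segments(
    matches: List[Tuple[str, str]], stadiums: Dict[str, Coord]
) -> List[Tuple[Coord, Coord]]:
    """Return one (home_coord, away_coord) segment per match.

    Matches where either team has no known stadium are skipped.
    """
    segments = []
    for home, away in matches:
        if home in stadiums and away in stadiums:
            segments.append((stadiums[home], stadiums[away]))
    return segments


segments = match_segments(MATCHES, STADIUMS)
for start, end in segments:
    print(f"line from {start} to {end}")
```

A list of segments like this is the kind of input a plotting library can then render as lines; the rendering itself is left out of the sketch.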

The first step is to gather the data. I was able to get a dataset of all the football matches and the stadium coordinates of Spanish, English, French and German clubs. There were obviously not enough teams: important teams from Italy, Portugal, the Netherlands and other countries, some of which won the competition several times, were missing. However, the available teams were enough to create a second MVP.
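A quick way to quantify a gap like this is to check how many matches can actually be drawn with the coordinates at hand and which teams are still missing. The following is a rough sketch under assumed inputs: the function name `coverage` and the sample teams are hypothetical, not part of the original project.

```python
from typing import Iterable, List, Set, Tuple


def coverage(
    matches: Iterable[Tuple[str, str]], known_teams: Set[str]
) -> Tuple[float, List[str]]:
    """Return (fraction of drawable matches, sorted list of teams without coordinates)."""
    matches = list(matches)
    missing: Set[str] = set()
    drawable = 0
    for home, away in matches:
        absent = {team for team in (home, away) if team not in known_teams}
        if absent:
            missing |= absent
        else:
            drawable += 1
    fraction = drawable / len(matches) if matches else 0.0
    return fraction, sorted(missing)


# Hypothetical example: coordinates are known for three teams only.
known = {"Team A", "Team B", "Team C"}
sample_matches = [
    ("Team A", "Team B"),
    ("Team B", "Team D"),
    ("Team C", "Team A"),
    ("Team E", "Team D"),
]

fraction, missing_teams = coverage(sample_matches, known)
print(f"{fraction:.0%} of matches can be drawn; missing: {missing_teams}")
```

A report like this makes it easy to decide whether the current data is good enough for an MVP or whether more teams need to be gathered first.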

The next iteration (the third MVP) needed all the teams that actually participated in the competition. The data was gathered from Wikidata and filtered afterwards. Finally, I ended up with this visualization.

Each MVP has been published on different websites, such as Reddit, Hacker News or StumbleUpon, to get feedback. Developing the project using Lean Startup helped me realize that people are more interested in learning about a new methodology than in learning about visualization in general. I was also able to detect problems in the whole pipeline at the very beginning, since I had an end-to-end version of the project very soon. Finally, this post is my fourth MVP, so I'm waiting for your comments.