I was recently asked: What was the most challenging part of your last Data Science/ML project and why?

In my experience? Estimating project timelines.

The real world is messy, and a data scientist's job is to try to make sense of it. Data science is often about doing novel things with data and using that output to move the needle for your organization. Given the way the role has been hyped, leaders are expecting money to be saved or made. They want data-science-driven solutions and they want them now -- the clock is ticking and time is money, after all.

But data science is often a sea of unknown unknowns. How do you best estimate how long a task might take if no one on your team has ever done it? Or your organization? Or ever?

Maybe your training set doesn't have enough examples of each class... Maybe X isn't really related to Y... Maybe your data is too noisy and you need to supplement it with another set... Maybe the whole process takes too long to compute...

These are real examples I've faced, and I didn't know that I didn't know about them until I had already agreed to a timeline and sunk some major hours into the project. But just because accurately estimating project completion is hard doesn't give you carte blanche to not track it or try. Compromises must be struck between the business and the data science team. Both sides need to understand the nature of the beast: progress is made step by step -- sometimes forward, sometimes backward. You can only learn by trying.

So what works? Our project team has found that Scrum matches the iterative and incremental tasks often done in data science. For the uninitiated, Scrum is an Agile framework used to manage the completion of complex tasks. These tasks are usually broken down into bite-sized pieces that one can reasonably expect to get done in a day or less, each with a complexity score to help gauge risk or uncertainty. Short meetings called stand-ups let others know what you did, what you are planning to do, and whether you are stuck. Anyone with follow-ups can connect outside of the stand-up and let everyone else get back to work. (The only thing a data scientist hates more than cleaning data is meetings.)
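To make the task breakdown concrete, here is a minimal sketch of a backlog in Python. Everything in it is an assumption for illustration: the Task fields, the 1-to-5 complexity scale, the thresholds, and the example tasks are hypothetical, not something Scrum itself prescribes or the exact setup our team uses.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    est_days: float   # rough effort estimate; the goal is one day or less
    complexity: int   # hypothetical scale: 1 (routine) to 5 (never done before)


def needs_splitting(task: Task, max_days: float = 1.0) -> bool:
    """A task longer than a day should be broken into smaller pieces."""
    return task.est_days > max_days


def is_risky(task: Task, threshold: int = 4) -> bool:
    """High complexity signals uncertainty worth raising at stand-up."""
    return task.complexity >= threshold


# Hypothetical backlog for illustration only.
backlog = [
    Task("Pull pricing table into a dataframe", 0.5, 1),
    Task("Check class balance in training set", 0.5, 2),
    Task("Prototype similarity metric for matching", 1.0, 4),
    Task("Build end-to-end forecasting model", 5.0, 5),
]

to_split = [t.name for t in backlog if needs_splitting(t)]
to_flag = [t.name for t in backlog if is_risky(t)]

print("Split further:", to_split)
print("Flag as risky:", to_flag)
```

The point is not the tooling (a whiteboard works too) but the habit: anything that can't be done in about a day gets split, and anything with a high complexity score gets talked about before it eats the timeline.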

But that's not all, folks -- these Scrum stand-up meetings can also be applied to a data science team whose members are all working on different projects. Your skeptical data scientists might point out that most of a 15-minute meeting is spent rehashing updates, but it's during these meetings that the magic happens. Data scientists talking to each other will step in and share knowledge when a teammate is stuck, possibly saving the team DAYS of reinventing the wheel. This knowledge sharing is not only a huge boon to team efficiency and productivity (which is much more valuable than a mere status update for the manager), it also helps the team reach its full potential by leveraging the total knowledge of the team and not just that of an individual.

Recent examples of the benefits of Scrum collaboration:

One teammate was looking into similarity distance metrics. Two others met with them to discuss the pros and cons of different options, which led to a better-performing model.

One team member mentioned that a table was loading slowly. A fellow data scientist met with them, showed them how to do a bulk insert, and then made it a standard practice for the rest of the team.

Someone was struggling with a process that took too long to compute. They connected with a few people who recommended the best ways to profile their code.

I have also benefited personally: when I was looking to source pricing and promotion data, a data scientist on another project connected me to the data subject matter expert, cutting out the game of telephone to find the data.
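For readers curious about the bulk insert tip from the examples above, here is a minimal sketch of the general idea in Python. The article's actual database and tooling are not stated, so the standard-library sqlite3 module is used purely as a stand-in, and the table name, columns, and data are hypothetical.

```python
import sqlite3

# Hypothetical data: 10,000 (id, sku, price) rows to load.
rows = [(i, f"sku-{i}", round(1.0 + i * 0.01, 2)) for i in range(10_000)]

# In-memory database as a stand-in for whatever the team really uses.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (id INTEGER PRIMARY KEY, sku TEXT, price REAL)")

# Bulk insert: hand the driver all rows in one executemany call inside a
# single transaction, instead of issuing and committing one INSERT per row.
with conn:  # commits on success, rolls back on error
    conn.executemany(
        "INSERT INTO prices (id, sku, price) VALUES (?, ?, ?)",
        rows,
    )

row_count = conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
print(f"Loaded {row_count} rows")

conn.close()
```

Most database drivers offer some equivalent of this pattern, though the exact call and the size of the speedup depend on the database, so treat this as an illustration rather than a benchmark.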

In my experience, Scrum pushes the envelope of our team's capacity, expands our team's knowledge, and, most importantly, generates better results for our clients. So don't shoehorn data science into other project management methodologies -- let it be what it is: iterative and incremental. Tackle your unknown timelines by dividing the work into tangible pieces and noting their complexity to mitigate risk. It's time for data scientists to put away for good the irony of being able to forecast and estimate everything except their own project timelines.

Special thanks to @Jason Stevens for enlightening me on the ways of Scrum!

Opinions in the article are my own and do not represent my employer.