The Need for a Process

Creating a new product from scratch is a long, complicated endeavor. Creating a new data-product from scratch is at least as long and complicated, and it brings its own set of very specific challenges.

So that we’re on the same page, I’m using the term data-product to refer to any digital product (or even service) that leverages advanced analytics built on top of a dataset that needs to be maintained indefinitely.

When you’re building one of these beasts, it’s easy to lose track of where you are and where you’re going. You can easily get caught up in the minutiae of optimizing a sub-system that ultimately doesn’t matter, or spend tons of time on a data-engineering task that may ultimately be unnecessary.

At Apteo, we’ve done a lot of work to get our initial software up and running, and in the process, we’ve learned a lot about what works and what doesn’t. One of the things that has helped us through this complexity is to have a defined workflow that provides us with some structure around where we should spend our time and efforts.

This framework helps us understand where we are in our process, and it allows us to course-correct if we find ourselves proceeding down a path that’s not ideal, urgent, or important.

Over time, we’ve developed a process that has worked for us. Of course, the process wasn’t immediately obvious or intuitive. Unlike pure software development, data-products require specialized resources and additional steps within the development lifecycle.

There are a lot of blockers related to the very nature of data science. Oftentimes you don’t know whether it will be worth it to go down a path of investigation. You have no idea if adding new features will result in better or worse performance. With any data-product, there’s always going to be some amount of research that needs to be done in order to guide the team’s efforts.

Sometimes you don’t know if you can actually accomplish what you hope to, either because you don’t have the data you need, or the data you have isn’t predictive of the objective you’re trying to optimize. You also need a good mix of analysis, ML engineering, data engineering, and model tweaking and experimentation. Things like exploratory data analysis (EDA), building a baseline model, and maintaining and updating a golden set are all crucial to success.
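As a concrete illustration of the baseline-model idea mentioned above: a baseline can be as trivial as always predicting the majority class, which gives you a floor that any real model must beat. The sketch below uses only the standard library; the churn-style labels and feature rows are hypothetical, purely for illustration.

```python
from collections import Counter

def majority_baseline(train_labels):
    """Return a 'model' that always predicts the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda features: most_common

def accuracy(predict, rows, labels):
    """Fraction of rows where the prediction matches the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)

# Hypothetical labels: 1 = churned, 0 = retained.
train_labels = [0, 0, 0, 1, 0, 1, 0, 0]
test_rows = [{"visits": 3}, {"visits": 0}, {"visits": 7}]
test_labels = [0, 1, 0]

baseline = majority_baseline(train_labels)
print(accuracy(baseline, test_rows, test_labels))
```

If a feature-rich model can’t outperform this number, that’s an early signal that either the features aren’t predictive of the objective or more investigation is needed before investing further.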

Finally, all of this may vary depending on whether you’re developing a brand new product or you’re trying to improve an existing one. Suffice it to say, there are a lot of considerations when you’re working through an ML/data task. That’s why a defined workflow helps.

Our Workflow

A fellow named Eren Golge created a handy data science workflow that he posted in this article. It’s a useful piece that discusses his understanding of the workflow proposed in one of Andrew Ng’s courses.

Our workflow incorporates a lot of the same ideas; however, it also covers the additional work required to create a productionized data-product. The image below shows a graphical representation of our workflow, followed by a very brief summary of the key points.