In this post, I am going to describe the life of a newly hired data scientist. The use case is that the data scientist is given a project where he needs to build an online learning model.

By Harish Doddi, Datatron. Sponsored Post.





When a new data scientist is hired into your team, it takes some time for the person to become productive. The data science onboarding process takes longer than other onboarding processes inside an enterprise. In this post, I am going to describe the life of a newly hired data scientist, using the example of a project where he needs to build an online learning model. He needs to understand the problem, write experiments for it, and deploy a model to production. Also, to make the model useful, he needs to boot up a service that gives live predictions using this model. To keep things simple, I am going to narrate the story of Alex, a recently hired data scientist.

The new hire’s first week



Being new to the team, new hires are usually assigned simple tasks so that they can focus on understanding the interfaces and infrastructure as a first step. Since data science is a relatively time-consuming and involved field with respect to understanding the problems to solve and the required domain knowledge, walking through a simple problem gives a new hire enough time to understand the team's framework and how to progress. As a next step, the data scientist can focus on domain-related problems.

Because data science is a fairly involved field, quite a bit of time is required to understand the problem at hand and try out different experiments. For a problem at enterprise scale, e.g., a recommendation system for users coming to the website, or price prediction for an item for a given user, it takes time to arrive at a model that predicts exactly which items to recommend or which prices to set, since the data scientist needs to run through several experiments. It usually takes more than a couple of experiments to arrive at a model that gives results close to expectations.

The data scientist needs to write several scripts, test whether the model is working, and make changes based on the results of the previous experiment, and this keeps going in a loop. It continues until a satisfactory model is obtained. Most importantly, data scientists are time-constrained and can only do so much in the time allotted. It becomes important to deliver results quickly and accurately. This restrains data scientists' productivity: they could have tried more alternatives if they had more time at their disposal.
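The experiment loop above can be sketched with a minimal online learning model. This is a toy, pure-Python sketch; the synthetic feature stream, labeling rule, and learning rate are invented for illustration, and a real project would run partial updates against production data and compare metrics between experiment runs.

```python
import math
import random

# Toy sketch of the experiment loop for an online learning model:
# stream one example at a time, take an SGD step, then evaluate and
# decide whether another experiment (e.g., a new learning rate) is needed.

class OnlineLogisticRegression:
    def __init__(self, n_features, lr=0.5):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, x, y):
        # one SGD step on the log-loss gradient for a single example
        err = self.predict_proba(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err

# Synthetic stream: label is 1 iff x0 + x1 > 1 (invented for the demo).
random.seed(0)
model = OnlineLogisticRegression(n_features=2)
for _ in range(2000):
    x = [random.random(), random.random()]
    y = 1 if x[0] + x[1] > 1 else 0
    model.partial_fit(x, y)

# Evaluate on fresh samples from the same stream; in practice this score
# decides whether the next experiment iteration is needed.
correct = 0
for _ in range(500):
    x = [random.random(), random.random()]
    y = 1 if x[0] + x[1] > 1 else 0
    correct += int((model.predict_proba(x) > 0.5) == bool(y))
accuracy = correct / 500
```

Each pass through this loop is one "experiment"; the data scientist inspects `accuracy`, tweaks the setup, and repeats, which is exactly the time sink described above.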

After carrying out the experiments, data scientists typically want to deploy the model to production and turn it into a service, which is a pretty exciting task. They typically need help from data engineers for data access, from DevOps for deployment of models, and from the software engineering team for any task related to booting up a service.

Frustrations of a new hire



Alex comes into work and starts on the first project his manager gave him. Typically, he wants to grab data from some database or data lake, write some code to operate on it, and understand the context of the problem. The next step after understanding the problem is to perform experiments on it, which involves writing code. Also, there is not just one single experiment to run in production: you may want to try out several experiments at the same time, split traffic, perform A/B testing, etc. And after the model is deployed to production, you want to hit the model to get live predictions for the user. But today, a typical data scientist like Alex faces the following problems:

- Setup discovery: Every newly hired data scientist needs to learn the infrastructure on which he is going to perform experiments and deploy model code. This is common regardless of which company he joins.
- Access to data: Data access for every new hire adds to the cost of the company. Every new hire has to ask about credentials, how to access the data, where it lies, and what the data means. This adds a lot of time spent understanding things that already exist, and there is no framework that surfaces this information and makes it apparent.
- Repeated code: Since there is no central code repository system for data scientists, as there is in software development engineering, each new data scientist tends to write his own scripts. It is also pretty difficult to check with the team whether a piece of code already exists. A central software module for the bigger project is lacking. Every data scientist has his own tools for his own objectives, which makes the pipeline spaghetti-like and hard to understand and scale. This also adds a lot of cost to maintain scripts in different places.
- Hard to deploy models: Data scientists can perform experiments and write good machine learning models. They spend enough time on training, testing, cross-validation, etc. Once the results come close to expectations, they typically want to push the model to production and try it on live traffic to see whether the model performs for real. The challenging part is that data scientists are not infrastructure people; they need help from DevOps. There is continuous back-and-forth communication between the two teams to deploy a model, which is hard work.
- Hard to deploy multiple models: Deploying one model to production is fine; DevOps can help out there. But a request to deploy multiple models to production is a nearly impossible task today. It is not feasible to flip multiple models back and forth manually within minutes for live traffic in production.
- Publishing an endpoint: For online models, a data scientist typically wants to expose the model as a service, so that live traffic can be routed to it and model results, for example final prices, can be returned to the end customer. Writing the prediction logic rests completely with the data scientist today, but to make it real in production, he needs help from the software dev teams.
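To make the last point concrete, here is a minimal sketch of what "publishing an endpoint" involves, using only Python's standard library. The prediction logic, JSON payload shape, and port are invented for illustration; a real deployment would sit behind whatever serving infrastructure DevOps provides.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    # Placeholder prediction logic; a trained model would be loaded here.
    return {"price": 10.0 + 0.5 * sum(features)}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expects a JSON body like {"features": [1.0, 2.0, 3.0]}.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(predict(payload["features"])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

def serve(port=8080):
    # Blocks forever; run this to expose the model as a live service.
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Calling `serve()` would expose the model over HTTP. Note that everything here except `predict()` is serving boilerplate, which is exactly the part the data scientist ends up asking the software team to own.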

There is currently no proper framework that acknowledges the above problems and offers a production-ready solution in which data scientists, alongside their own work, have complete control over model deployment, over multiple models running at the same time, and over the prediction-logic service.

Let’s look at an example of how such a platform, if it existed, would make a difference in a data scientist’s life. We will again narrate the story from the perspective of Alex, coming in as a new hire and working on an online learning model as his first project.

How to make life easy for a newly hired data scientist



Imagine a centralized user experience where data scientists have complete control over their models and are able to control how they operate on it. They don’t have to worry about dependency on any sister teams for deploying their models or for accessing data etc.

Let’s say that when Alex joins the team, he is introduced to a user experience portal where he is supposed to operate independently on his first project. His time goes into understanding the existing project: what data it includes, what code is part of it, and so on. This gives him a huge kick start in understanding the current state of the project. Also, when he needs to perform experiments, he already has a place to start from and can progress from there. This is because all scripts by Alex’s teammates are pushed to a central repo and are reflected on a centralized dashboard or workspace.

The data scientist also does not need to worry about where the data is, because the platform provides complete abstraction of the data. This removes the blockers of figuring out where the data is and how to access it. Deployment is made so easy that the data scientist now just writes prediction logic and drops it into the project, and that’s all; the framework then makes it available as a service. Let’s look at how this would make a difference.
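The kind of data abstraction described above can be sketched as a tiny catalog: the data scientist asks for a dataset by name, and the platform resolves credentials and storage details behind the scenes. The class, method, and dataset names here are invented for illustration.

```python
class DatasetCatalog:
    """Maps dataset names to loader functions configured once by an admin."""

    def __init__(self):
        self._loaders = {}

    def register(self, name, loader):
        # An admin wires up the source (credentials, location) exactly once.
        self._loaders[name] = loader

    def load(self, name):
        # The data scientist never sees connection strings or credentials.
        if name not in self._loaders:
            raise KeyError(f"unknown dataset: {name}")
        return self._loaders[name]()

# Admin-side, one-time configuration (a real loader would query a database
# or data lake; a stub list stands in here).
catalog = DatasetCatalog()
catalog.register("user_clicks", lambda: [{"user": 1, "item": 42}])

# Data-scientist-side: just ask for the dataset by name.
rows = catalog.load("user_clicks")
```

The point of the sketch is the split of responsibilities: credentials and storage technology live behind `register`, while the data scientist only ever calls `load`.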

Centralized user experience for data scientists



Making code, data, credentials, models, prediction logic, etc. centrally available in a way that addresses all data science development hassles is tough and far from trivial. This is because a data scientist’s life differs from company to company, and there are different frameworks and niche details that need to be looked at. There is no generalization or standard process for performing a task in the data science lifecycle of a project.

Think of a system which is able to overcome the above mentioned obstacles and offer much more than that via a very good user experience. Think of a web interface where you can do the following:

- Setup discovery: This still stays where it is and is inevitable. But the big difference is that rather than deciphering the spaghetti system we had before, we now get a one-shot big picture of the complete framework. It is far easier to miss a detail in a spaghetti system than when everything is centralized in one place.
- Centralized code repository per project: There is no more code repetition, because Alex can find the code in the project itself. He can easily tell what code is live in the pipeline and where each piece lies, and everything is neatly organized per model. This saves time whenever the data scientist needs to write and test scripts; almost half the work is already there.
- Centralized data access across projects: The admin is the only person responsible for configuring credentials for a data source, and this becomes a one-time process. This saves every new hire the time spent hunting down data access and makes the data generically available as a dataset. Data scientists also do not need to write logic for the underlying storage technology; it becomes an abstracted layer in code. The framework can also provide details about the data, saving data scientists further time.
- Model training and deployment: If the framework also provides deployment capabilities to data scientists, that would be pretty amazing. Data scientists can do whatever they want at their desks and do not need to communicate with DevOps for their projects; they are completely independent in doing their job.
- Prediction logic and publishing results: Today the data scientist’s role in the complete project is to write prediction logic and hand it over to the software dev team; he does not get enough admin capabilities to own the complete service himself. Think of a user interface platform that gives the data scientist complete control over starting and stopping the service. That means a tremendous cost cut in communication time between the data scientist and the software dev team.
- Monitoring dashboards: Think of a platform that also provides complete visibility into incoming traffic, service latency, autoscaling, the number of incoming requests per second, etc. If this comes bundled with the software out of the box, it is all that is needed to see whether the service is healthy, predicting correctly, and so on.
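The multiple-model capability above can be sketched as weighted traffic routing: the platform keeps several models live and flips the share of traffic each one receives. The model implementations, names, and traffic shares here are stand-ins for illustration.

```python
import random

def champion(features):
    return sum(features)        # stand-in for the current production model

def challenger(features):
    return 2 * sum(features)    # stand-in for the model under A/B test

# (name, model, traffic share); editing the shares "flips" live traffic
# between models without redeploying anything.
ROUTES = [("champion", champion, 0.9), ("challenger", challenger, 0.1)]

def route(features, rng=random.random):
    """Pick a model by traffic weight; returns (model_name, prediction)."""
    r = rng()
    cumulative = 0.0
    for name, model, share in ROUTES:
        cumulative += share
        if r < cumulative:
            return name, model(features)
    name, model, _ = ROUTES[-1]  # guard against float rounding
    return name, model(features)
```

With 90/10 shares, roughly one request in ten hits the challenger; comparing the two models’ live results is the A/B test the post describes.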

This essentially means the data scientist has the project under his complete control, with all data and code residing in the project he is working on. He does not need to worry about hassles; his only focus is how to move the project ahead. Since he usually wants to run new live experiments, he can do that without any problems. He is now also able to deploy the model to production and take care of the service once it is up. Pair programming and sharing domain knowledge via a common framework would make a group of data scientists even more productive.

Now, Alex enjoys solving problems at a quick pace with his mentor Bob and is able to focus on the problems at hand. The hurdles of getting work done through several other teams, which added time to the complete process, are completely eliminated. He and his data science teammates focus only on the problems to solve and nothing else, since they now own the complete pipeline end to end.

To learn more about the product, you may contact info[AT]datatron.io.