The merger

With this conflict arising, a need was born. A need for a person who could reunite the two warring parties. Someone fluent enough in both fields to get the product up and running. Somebody taking data scientists' code and making it more efficient and scalable. Introducing them to programming rules and good practices. Abstracting away parts of the code that might be reused in the future. Joining the results of potentially unrelated tasks to enhance the model's performance even further. Explaining the reasoning behind architectural decisions to the devops team. Sparing software developers from learning concepts far beyond their scope of interest.

That need has been met with the emergence of the machine learning engineer role.

What is always missing from all the articles, tutorials and books concerning ML is the production environment. It literally does not exist. Data is loaded from CSVs, models are created in Jupyter, ROC curves are drawn and voilà — your machine learning product is up and running. Time for another round of seed funding!

Hold on.

In reality, the majority of your code is not tied to machine learning. In fact, the code that is usually takes just a few percent of your entire codebase! Your pretrained black box returns only a tiny JSON answer — there are thousands of lines of code required to act on that prediction. Or maybe all you get is a generated database table with insights. Again, an entire system needs to be built on top of it to make it useful! You have to get the data, transform and munge it, automate your jobs, and present the insights to the end user somewhere. No matter how small the problem is, the amount of work to be done around the machine learning itself is tremendous, even if you bootstrap your project with technologies such as Apache Airflow or NiFi.
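A minimal sketch of that imbalance: one hypothetical batch job where the extraction, cleaning and downstream action dominate, and the actual ML is a single line. Every name here (`fetch_rows`, `Model`, `act_on`) is illustrative, not taken from any real system.

```python
def fetch_rows():
    """Stand-in for data extraction (a DB query, an API call, a CSV dump)."""
    return [{"user_id": 1, "clicks": 12}, {"user_id": 2, "clicks": None}]

def clean(rows):
    """Transformation and munging: fill gaps, normalize types."""
    return [{**r, "clicks": r["clicks"] or 0} for r in rows]

class Model:
    """Stand-in for the pretrained black box."""
    def predict(self, row):
        return "churn" if row["clicks"] < 5 else "active"

def act_on(user_id, label):
    """The downstream system that makes the prediction useful."""
    return f"user {user_id} -> campaign for {label}"

def run_pipeline(model):
    results = []
    for row in clean(fetch_rows()):
        label = model.predict(row)  # the only ML line in the whole job
        results.append(act_on(row["user_id"], label))
    return results
```

Count the lines: the prediction itself is one call; everything else is plumbing, and a real pipeline adds scheduling, retries and monitoring on top.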

That was not on Coursera, was it?

Yet, somebody has to glue all the “data science” and “software” parts together. Take the trained model and make it work in a production-quality environment. Schedule batch jobs recalculating insight tables. Serve the model in real time and monitor its performance in the wild. And this is the exact area in which the machine learning engineer shines.

When creating software, developers naturally look for all the possible outcomes in every part of the application. What you get from a data scientist is just the happy path that leads to model creation for particular data at a particular moment in time. Unless it is a one-time analysis, the model will live for a long time after it gets productionized. And as time flies, bugs and edge cases pop up (many of them were not even possible when the code was written). Suddenly a new, unknown value shows up in one of the columns and the entire model starts to perform much worse.
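The unknown-value scenario is concrete enough to sketch a defense for. Below is one illustrative way to encode categorical columns so an unseen value is mapped to a sentinel and reported, instead of crashing the job or silently shifting the distribution. `SafeEncoder` is a made-up name, not an API from any library.

```python
class SafeEncoder:
    """Categorical encoder that survives values unseen at training time."""

    UNKNOWN = -1  # sentinel code for categories the model never saw

    def __init__(self):
        self.mapping = {}

    def fit(self, values):
        # Assign a stable integer code to every known category.
        self.mapping = {v: i for i, v in enumerate(sorted(set(values)))}
        return self

    def transform(self, values, on_unknown=None):
        encoded = []
        for v in values:
            if v not in self.mapping:
                if on_unknown:
                    on_unknown(v)  # hook for logging / alerting / counting
                encoded.append(self.UNKNOWN)
            else:
                encoded.append(self.mapping[v])
        return encoded
```

The `on_unknown` hook is the production-minded part: the happy-path version would simply raise a `KeyError` months after deployment.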

As a machine learning engineer you prepare your applications for such events. You provide logging and monitoring pipelines not only around machine learning tasks but also inside them. You try to preserve all the information needed to answer the very important questions: What is the cause of the model's bad performance? Since when has it been happening?
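Answering “what caused it, and since when” requires capturing the model version, the inputs and the output at prediction time. A minimal sketch of such a logging helper, assuming a structured (JSON-lines) log sink; the function name and record shape are my own choices, not a standard:

```python
import datetime
import json

def log_prediction(sink, model_version, features, prediction):
    """Persist everything needed to reconstruct a bad prediction later."""
    record = {
        "ts": datetime.datetime.utcnow().isoformat(),  # when it happened
        "model_version": model_version,                # which model said it
        "features": features,                          # what it saw
        "prediction": prediction,                      # what it answered
    }
    sink(json.dumps(record, sort_keys=True))  # one JSON object per line
    return record
```

With records like these in place, “since when does it happen?” becomes a query over timestamps and model versions instead of guesswork.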

It is just another API

Because you do not treat ML as magic, you are aware of all the other typical programming dangers that may arise when a machine learning job is executed. The database might refuse a connection. A group-by may blow up on a large dataset. Memory or disk can fill up. A combination of parameters specified by the user might be illegal for a certain algorithm. An external service could respond with a timeout instead of credentials. A column may no longer exist. While nobody blinks an eye when such events take place in a safe lab environment on a daily basis, it is your responsibility to ensure they won't happen when the end product is actually delivered.
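For transient failures like refused connections and timeouts, the standard engineering answer is a retry wrapper around the job. A hedged sketch under the assumption that the job is an idempotent callable; the helper name and defaults are illustrative:

```python
import time

def run_with_retries(job, retries=3, backoff=0.0,
                     recoverable=(ConnectionError, TimeoutError)):
    """Run a batch ML job, retrying transient failures, failing loudly otherwise."""
    for attempt in range(1, retries + 1):
        try:
            return job()
        except recoverable:
            if attempt == retries:
                raise                    # out of retries: surface the error
            time.sleep(backoff * attempt)  # linear backoff; tune per job
```

Non-recoverable errors (an illegal parameter combination, a missing column) deliberately fall through and crash the run: retrying them would only hide a bug.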

machine learning project roles

Your data science team is always full of ideas. You have to make sure that no technology limits them. As good and customizable as current ML frameworks are, sooner or later your teammates will have an intriguing use case that is not achievable with any of them. Well, not with their standard APIs. But when you dig into their internals, tweak them a little and mix in another library or two, you make it possible. You abuse the frameworks and use them to their full potential. That requires both extensive programming and machine learning knowledge, something quite unique to your role in the team.

And even when a framework provides all you need programming-wise, there still might be a lack of computing power. Large neural networks take a long time to train. That precious time could be reduced by an order of magnitude with GPU frameworks running on powerful machines. You are the one to scout the possibilities, weigh the pros and cons of the various cloud options and choose the best-suited one.

You may also be responsible for picking other tools and ecosystems, always taking into consideration the whole project lifecycle (not just the reckless research part). For example, Azure ML Workbench or IBM Watson might be great tools for bootstrapping the project and conducting research, but they will not necessarily meet all the requirements of the final version of the product when it comes to custom scheduling and monitoring.

You must stay up to date with state-of-the-art technologies and constantly look for places where the overall product performance could be improved. Be it a battle-tested programming language, a new cloud technology, or a smart scheduling or monitoring system — by seeing the bigger picture of your product and knowing it well from the engineering, business and science sides, you are often the only person with the opportunity to spot a potential area of improvement.

This frequently means taking working code and rewriting it entirely in another technology and language. Thankfully, as soon as you get a grip on what all this fuss is actually about and what steps are always taken in the process of training and productionizing models, you realize that most of these APIs do not differ at all. As you juggle various frameworks, the vast majority of the whole process stays the same. You bring in the best software craftsmanship practices and quickly begin to build an abstraction over the many repetitive tasks that the data science team fails to automate and the software development team is afraid to look at. A strong bridge between two worlds. A solid, robust foundation for working software.
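The kind of abstraction described above can be as small as one facade that gives every framework's model the same train/predict surface. A sketch under the assumption that each framework only differs in how its fit and predict calls are spelled; `EstimatorAdapter` and the toy mean-predictor below are invented for illustration:

```python
class EstimatorAdapter:
    """Uniform train/predict facade over heterogeneous model APIs."""

    def __init__(self, train_fn, predict_fn):
        # train_fn(X, y) -> model; predict_fn(model, X) -> predictions.
        # These thin lambdas are the only framework-specific code left.
        self._train, self._predict = train_fn, predict_fn
        self._model = None

    def train(self, X, y):
        self._model = self._train(X, y)
        return self  # allow chaining: adapter.train(...).predict(...)

    def predict(self, X):
        if self._model is None:
            raise RuntimeError("train() must be called before predict()")
        return self._predict(self._model, X)
```

Swapping frameworks then means rewriting two lambdas, not the whole pipeline — which is exactly what makes the repeated rewrites bearable.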