I spent 7 months as an intern at craft ai, a machine learning API focused on learning habits. My main project was predicting the arrival times of waste collection trucks in the streets of Paris. If you’re interested, I wrote a blog post explaining how the overall solution works.

Looking back, I want to sum up the 9 main things I learned: first, 3 technical lessons about the realities of machine learning to keep in mind; second, the 3 main skills every data scientist should have; and lastly, 3 ways of learning data science that helped me.

Three machine learning lessons

There is no “best model ever”. Some models perform better on some problems, yes. For instance, neural networks have shown their superiority in computer vision and natural language processing. But in general, no one can assert that one model will always outperform the others. Model performance metrics are the only way to get to the truth. That’s why we shouldn’t let biases creep in, like “neural networks are awesome on complex problems like computer vision, so let’s use them for our sales predictions”. That’s a bad way of thinking. To find the best learning algorithm, an easy approach is to run a first learning phase with a bunch of different algorithms on a smaller subset of the data: train quickly, cross-validate, and compare the results. Repeat that process on different subsets and identify the algorithm that performs best on average before training it on the whole dataset.
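The subset-and-compare procedure above can be sketched with scikit-learn. This is a minimal illustration, not the exact setup from my internship: the synthetic dataset, the three candidate models, and the subset size are all assumptions.

```python
# Compare candidate models on random subsets of the data via cross-validation,
# then pick the one that performs best on average.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative synthetic dataset standing in for the real one.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "knn": KNeighborsClassifier(),
}

rng = np.random.default_rng(0)
scores = {name: [] for name in models}
for _ in range(3):  # repeat on different random subsets
    idx = rng.choice(len(X), size=500, replace=False)
    for name, model in models.items():
        cv = cross_val_score(model, X[idx], y[idx], cv=5)
        scores[name].append(cv.mean())

# Best algorithm on average across subsets; train this one on the full data.
best = max(scores, key=lambda name: np.mean(scores[name]))
print(best, {name: round(float(np.mean(s)), 3) for name, s in scores.items()})
```

The winner depends on the data, which is precisely the point: let the metrics decide, not prior preferences.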

The learning phase is nothing. Data preprocessing is everything. Usually, we don’t spend much time implementing learning algorithms: plenty of frameworks and libraries have already done that work, and they are well optimized, with time and memory performance sufficient for most use cases. The only task left in the learning phase is to play with hyperparameters and run cross-validation to measure improvements. That is not the most energy-consuming part, though. Preprocessing the data is. By preprocessing I mean adding complementary features from external datasets, cleaning the data (removing “bad” rows), deriving new features from existing ones (feature combination or decomposition), selecting the best ones (feature selection or dimensionality reduction), and so on. The only way to gauge the preprocessing work that has been done is, once again, to measure improvements with cross-validation or by comparing learning curves. Transforming the input data and then computing validation metrics (again and again) is where we really spend our time on a machine learning problem.
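One way to picture that transform-then-measure loop: compare cross-validation scores with and without a preprocessing step. The dataset and the particular steps (scaling plus dimensionality reduction) are illustrative assumptions, not the preprocessing from my project.

```python
# Measure whether a preprocessing step actually helps by comparing
# cross-validation scores before and after applying it.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Baseline: the raw features go straight into the model.
raw = make_pipeline(LogisticRegression(max_iter=5000))

# Candidate preprocessing: scale features, then reduce dimensionality.
prep = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    LogisticRegression(max_iter=5000),
)

raw_score = cross_val_score(raw, X, y, cv=5).mean()
prep_score = cross_val_score(prep, X, y, cv=5).mean()
print(f"raw: {raw_score:.3f}  preprocessed: {prep_score:.3f}")
```

Putting the preprocessing inside a pipeline also guarantees it is refit on each cross-validation fold, so the measured improvement is honest.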

The model is not enough. Safeguards are compulsory. A machine learning model aims to deliver the best prediction given what it has seen during the learning phase. Its predictions won’t always be accurate, for several reasons, so an application delivering predictions to its users shouldn’t rely exclusively on the model behind it. For instance, at the beginning of my internship, my model predicted that the recycling truck would drive down the streets every day of the week. The predicted times were consistent, but recycling trucks actually show up only three times a week, not every day. That last fact is what I call a business rule, in other words a rule that depends on the problem being solved. It’s not easy to implement such a rule directly in the core learning algorithm, so an additional layer that filters predictions against written rules was unavoidably required. Safeguards must be implemented on top of the predictions to check their validity; otherwise the application could end up in stupid or even dangerous situations.
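Such a safeguard layer can be as simple as a filter between the model and the user. The rule below (recycling pickups only on certain days) and the prediction format are hypothetical stand-ins for the real business rules of the project.

```python
# A safeguard layer: filter raw model predictions against a hand-written
# business rule before they reach the user.
COLLECTION_DAYS = {"monday", "wednesday", "friday"}  # assumed schedule

def apply_safeguard(predictions):
    """Keep only predictions whose day matches a known collection day."""
    return [p for p in predictions if p["day"] in COLLECTION_DAYS]

raw_predictions = [
    {"day": "monday", "arrival": "08:15"},
    {"day": "tuesday", "arrival": "08:20"},  # rejected: no pickup on Tuesday
    {"day": "friday", "arrival": "08:10"},
]

safe_predictions = apply_safeguard(raw_predictions)
print(safe_predictions)
```

The model stays untouched; the rule lives in a separate, easily auditable layer that can evolve with the business.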

Three data scientist skills

Maths and statistics. To my mind, one can get one’s hands into machine learning without advanced maths or statistics knowledge. With programming skills, frameworks, and a basic understanding of machine learning principles, a lot can still be achieved. But a data scientist should preferably have a deep understanding of their models. For instance, they should be able to answer “why did the model produce output Y given input X?”, mainly because users may legitimately want to know why a decision was made by the service. Also, accuracy improvements usually come from tuning hyperparameters, and those are part of the maths of the model; tuning them well requires a very good knowledge of the maths and statistics behind the scenes. That’s why, to my mind, data scientists should keep learning them.

Computer science and programming. Yes, data science requires being comfortable with computer science skills, mainly for manipulating data (files, databases) and operating on it (programming, or using high-level software). During the analysis phase, scripting is useful for handling the data and producing some visualizations before learning. But later in the project, when it comes to building a full service, software engineering skills are required. While a machine learning engineer keeps their focus on the learning phase, a data scientist should work on the whole pipeline, from data manipulation to delivering predictions into the application that uses them.

Communication. Communicating results and the discoveries hidden within the data is the main job of a data scientist. They should get people to understand what lies in the data and what can be useful for their company. That can be a talk, a written report, or, even better, a visual report using data visualization techniques. When dealing with people outside the field, business-related vocabulary and business metrics should be used to facilitate understanding, instead of lower-level statistical metrics that are less meaningful from their point of view.

Three ways to learn data science

Online resources. The obvious starting point, at least for machine learning, is Coursera: its machine learning and deep learning classes are considered among the best ways to start. More generally, online course platforms can be a great way to learn complex theory that takes a long time to study. Framework documentation, like Scikit-learn’s, is also like a bible: not only does it provide code snippets for using the algorithms, it also covers their theoretical side (maths, papers) and benchmarks describing when they perform well or not. Finally, a lot of specialized websites produce very high-quality content for data scientists, such as KDnuggets, Siraj’s YouTube channel, and Open Data Science.

Kaggle. It is the place to be as a data scientist. The platform hosts many datasets on which data scientists compete to build the best predictive model, so there is a lot of data to play with. Many code examples, usually with explanations, show how to use particular languages, frameworks, libraries, and so on. One can learn a lot of data science there while competing against experienced data scientists. The Kaggle team also has a blog where they regularly publish interviews with competitors alongside more technical articles.

Meetups. Paris is famous for its AI ecosystem, with many companies and experts, and they often appear as speakers at Meetups in Paris. A large part of the artificial intelligence culture I acquired during my internship comes from attending data science related Meetups.

The most famous one is obviously Paris Machine Learning, with the biggest network: more than 6,000 members. Speakers are usually seasoned experts who give top-quality talks on either applied or theoretical machine learning. In the same vein, Deep Learning Paris is more about state-of-the-art theory, but I’ve seen awesome speakers there who are very good at explaining even the most complex things. Likewise, Paris Intelligence Artificielle and ParisAI are very good ones, with the particularity, based on what I’ve seen, that they focus more on projects and high-level problems.

The Kaggle Paris Meetup is also a great place to get explanations of the latest learning algorithms and how they work. Speakers usually go into depth so that you understand what role the hyperparameters play. The goal of this Meetup group is to share how to get the best accuracy and compete in Kaggle contests; there you can meet top competitors sharing their best secrets about feature engineering and model configuration.

Conclusion

Those are the main points I learned. That content may quickly become outdated, with new best practices, new job skill requirements, and new ways of learning data science. Still, the most important thing after 7 months as an intern was the personal reflection it prompted. It was a mind-opening experience that definitely changed me and made me even more enthusiastic about data science, and machine learning in particular.