A Case for OSDS

By now you should be convinced that OSS is valuable. Let’s talk about how these concepts transfer to data science. To do that, let’s paint a more detailed picture of how OSDS can work, using an example that satisfies the ‘Open Source Criterion’: face recognition.

You can probably think of a few applications that use face recognition. For this example, let’s look at face recognition in two contexts: disease diagnosis and fashion recommendation. A quick Google search for each will show that there are companies doing exactly these things.

In the first use case, a photo of a patient would be taken as part of a medical checkup (or at home), which would then be analyzed in order to recognize possible diseases, helping doctors prioritize urgent cases and treat more people.

In the second use case, an app would recommend clothes based not only on the customer’s purchase history, but also on their facial features or body structure.

These companies are definitely NOT in competition with one another. Their IP lies in providing a system for disease diagnosis or fashion recommendation, respectively. Yet today, both are most likely developing a crucial part of their system, face recognition, in parallel. This means duplicated work and a lot of wasted data scientist time and money.

In an OSDS world, both companies would work on an open face recognition project, contributing code and data to handle edge cases and create a more robust solution. They could dedicate fewer data scientists to the task than they do today, and have them focus on tougher, more critical problems.

New data scientists who want to learn and build their portfolios would jump in to identify bugs and inefficiencies, and to create alternative models that prioritize different metrics for different use cases. They could see what a serious data science project looks like.

On top of all that, companies looking for prospective data scientists could find and reach out to those that have already contributed to the projects they care about, thus shortening the hiring and on-boarding process for new team members.

Recently, AI transparency, diversity and inclusion have become a central focus in tech. OSDS can have a significant impact in these areas. For example, let’s say a user of one of these products, who also happens to be a data scientist, discovers that the face recognition performs poorly on her image. She reviews the dataset and realizes it contains no images of a certain ethnic minority, one she belongs to. She adds images from another dataset that fill the gap and submits a pull request. Voilà: the new model performs much better for her, and the companies enjoy a more diverse and improved model.

A true win-win(-win-win) situation.

When I started writing this article, this was a theoretical example. Now, it is no longer the case.

To finish off this piece, I’d like to write about two examples of how Open Source Data Science should and shouldn’t look in practice.

Open Source Software disguised as Open Source Data Science

Now, I can already hear your objection: “But I cloned tensor2tensor and BERT from GitHub, it’s already Open Sourced Data Science!” or “A lot of papers I read on arXiv have their code posted online”.

The Trinity — code, pre-trained models, and an arXiv paper. Photo on GitHub

This is not a jab at BERT’s authors and this section is not meant to disrespect these projects. These projects are extremely useful, and it takes a lot of hard work to make them accessible to the public. Their authors do everything they can given the tools they have and industry standards.

The argument I would like to make is that they are OSS but not OSDS.

If the code in one of these projects has a bug, say one that breaks the API for the published model, an independent programmer can usually contribute a fix in the form of a pull request. On the other hand, if an independent data scientist realizes there is a problem in the data, say the model is biased against certain ethnic groups, in most cases they cannot modify the dataset and create an improved model.

This results in contributions made to the code, but not to the data science parts of the project. In other words, this project effectively becomes a software project.
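To make the data problem above concrete, here is a minimal sketch of the kind of check an outside data scientist might run before proposing a fix: a per-group breakdown of how many evaluation examples exist and how accurate the model is on each. The function name `group_report`, the group labels, and the toy records are all hypothetical and chosen for illustration; they do not come from any real project or dataset.

```python
from collections import Counter, defaultdict


def group_report(records):
    """Summarize example count and accuracy per group.

    records: iterable of (group, true_label, predicted_label) tuples.
    """
    counts = Counter()
    correct = defaultdict(int)
    for group, truth, predicted in records:
        counts[group] += 1
        correct[group] += int(truth == predicted)
    return {
        group: {"n": counts[group], "accuracy": correct[group] / counts[group]}
        for group in counts
    }


# Hypothetical toy evaluation results: group_b is both under-represented
# and served worse by the model.
records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 0),
]

report = group_report(records)
for group, stats in sorted(report.items()):
    print(group, stats)
```

A report like this only helps if the contributor can then act on it. When the dataset itself is not open to changes, the finding stays a complaint instead of becoming a pull request.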

OSDS can also have an important side effect for data science research.

Improving the quality of data science research

Improving the quality of research is an important goal to aspire to. A few notable efforts towards this goal are:

A lot has been said about the problematic nature of some state-of-the-art (SOTA) results in data science research. Usually, when a paper is published, you see only the end result: the percentage improvement over the former SOTA. Often, small improvements can be attributed to a lucky choice of random seed, or to running many experiments until one improves on a metric, without enough consideration of the statistical significance of the result.

In simpler terms, since the research process is not transparent, we don’t know if a result actually represents an advancement in research or just a fluke.
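The random seed effect is easy to illustrate with a small simulation. The sketch below is a toy model, not a real training run: it assumes a model whose true accuracy is fixed at 0.90, evaluates it on 1,000 simulated test examples, and lets the seed stand in for all the randomness in training and evaluation. The function names and the numbers are assumptions made for illustration.

```python
import random
import statistics


def simulated_accuracy(true_accuracy, n_test, seed):
    """Simulate one evaluation run; the seed stands in for training randomness."""
    rng = random.Random(seed)
    correct = sum(rng.random() < true_accuracy for _ in range(n_test))
    return correct / n_test


def summarize(true_accuracy, n_test, seeds):
    """Return (mean, standard deviation, best score) across seeds."""
    scores = [simulated_accuracy(true_accuracy, n_test, s) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores), max(scores)


# Same underlying model every time (assumed true accuracy 0.90); only the seed changes.
mean_score, std_score, best_score = summarize(0.90, 1000, range(20))
print(f"mean={mean_score:.3f} std={std_score:.3f} best seed={best_score:.3f}")
```

Reporting only the best seed makes the model look better than its average, even though nothing about the model changed. A claimed improvement that is smaller than the spread across seeds deserves skepticism, and that spread is only visible when the full set of runs is shared.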

In an OSDS world, research teams could publish their whole research history along with the paper submitted for review. Reviewers would be able to provide high-quality feedback and ensure a rigorous research method. Everyone would enjoy higher-quality data science research.

Final Thoughts

I am a co-founder of DAGsHub, a platform that enables data scientists to collaborate, manage and share their projects. At DAGsHub, we spend a lot of time thinking about OSDS and talking to data scientists.

This article is a summary of some of the conversations we’ve had with data scientists in the community. Its purpose was to explain why Open Source is an important part of software development today, and to argue that it will be an important part of data science in the near future.

The next article will dive into technologies and technicalities: why creating an OSDS community requires different tools from the ones used by OSS communities, what the difficulties are, and how they can be overcome. For more details on our approach to solving this challenge, you’re invited to visit DAGsHub.

If you have had the chance to collaborate on an OSDS project, I’d love to hear in the comments (and get a link to what you’re building).