A few weeks ago I ran into a problem and could not find a good solution online. It was the first time I found a niche where I could contribute and fix my own problem at the same time. Full of excitement, I got to work. From the outside there was no visible change: hands still on the keyboard. I’m back after 12 days to report the first working version of kono data.

The problem

I had a growing dataset of more than fifteen thousand images and no quick way to label it. The goal was to train a deep learning model, but without labels assigned to each image there was nothing to learn. I hesitated to outsource the processing to a random company or people that I didn’t know, because assigning labels required some specific domain knowledge. Plus paying other people to do the work costs a lot of money.

What is the idea?

This problem led me to the idea of building a place to label the data myself. I wanted a website where I could track the progress, change the list of possible labels on the fly and work on the labelling with others at the same time. As a bonus, these requirements make the setup fault tolerant: nothing is lost if my laptop goes missing. Some other demands I had for myself were: a one-button export of the labels at any point in time, both binary and multi-label classification and keyboard shortcuts to speed up the labelling process.
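To make the export requirement a bit more concrete, here is a minimal sketch of how a label export covering both binary and multi-label classification could look. It is not the actual kono data code: the `LabelledImage` class, the `export_labels_csv` function and the file names are made up for illustration. The idea is one row per image and one 0/1 column per possible label, so a binary task is simply the case with a single label column.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class LabelledImage:
    """One image plus the labels assigned to it so far (hypothetical structure)."""
    key: str                                  # e.g. the path of the image in the bucket
    labels: set = field(default_factory=set)  # names of all labels assigned to the image


def export_labels_csv(images, possible_labels):
    """Return a CSV string with one row per image and one 0/1 column per label."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # The header is built from the current label list, so labels that were
    # added on the fly show up as new columns in the next export.
    writer.writerow(["image", *possible_labels])
    for image in images:
        flags = [int(label in image.labels) for label in possible_labels]
        writer.writerow([image.key, *flags])
    return buffer.getvalue()


# Example data, invented for this sketch.
images = [
    LabelledImage("img/001.jpg", {"hotdog"}),
    LabelledImage("img/002.jpg", set()),
    LabelledImage("img/003.jpg", {"hotdog", "blurry"}),
]
csv_text = export_labels_csv(images, ["hotdog", "blurry"])
print(csv_text)
```

One thing this simple layout cannot express is the difference between an image that has not been processed yet and one that was processed but received no label. A real export would probably need an extra column for that.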

What did you do?

I built a framework to label data and the first version is ready now. While all my wishes are fulfilled, it is limited in two ways: it only supports images, and they have to be stored in an S3 bucket.

Some of the todos done in the last 12 days are:

- use bootstrap theme for process view (DONT WASTE TIME HERE)

- dataset creation: add form for dataset creation

- create dataset csv export, basic version (minimal to no options)

These 3 and the 29 other todos were solved in 22 commits in the last 12 days. The project is small with 13514 lines.

A few screenshots of what it actually looks like:

The processing view with the image on the left side and possible labels on the right side

The admin view with export, fetch data and details buttons

First version

The goal of this first version is to prove the concept, find possible bugs and collect feedback. There are probably hundreds of ways this could have been implemented. I chose the most straightforward one to get a working application as soon as possible.

Why open-source

Most of my everyday work is based on open-source software. I wanted to make this project freely available to give back to the community and invite others to contribute. I tried to make the project beginner-friendly from the start. Let me know if you have ideas or feedback on this! If even one other person can learn from or use this project, it would be a great success.

Which license

Everyone can modify, distribute or build a business from this project. I chose the MIT license to be as open as possible while not being liable when others use the software.

Problems along the way

The hardest part in this sprint for the first version was to focus on essentials. I could see hundreds of fancy features from the beginning. It was a great exercise in prioritisation to list all the todos I could come up with and choose the most important ones.

While running quickly I still had to step back sometimes and look at what I was building in a bigger picture. I wanted to set up a foundation in the data model and code which could be extended in the future. This required some changes and reverts and is still not perfect.

I got stuck several times on technical details which were really unimportant but seemed like a great way to spend the time. Looking back, I could have chosen any solution to these problems and it would have been fine. Sometimes this kept me from maintaining a high pace, but overall I’m happy with the progress I made in this short time period.

What’s next?

The basic functionality of the site is in place. There are a few obvious feature candidates such as adding leaderboards, time measurement and other content types like audio or video. Most of my ideas are listed in the Readme of the project.

Contribute

You’re encouraged to contribute to kono data! There are several ways to expand the project: you can fix a bug, translate content or work on new features. Using the website and giving feedback is a great way to contribute as well. Your ideas on what should be done next are very welcome!

Links

Go to the kono data homepage to see the project live in action and label a dataset into Hotdog vs. Not Hotdog. Visit kono data on GitHub to see the code and more technical information.

This post appeared first on runningco_de

Photo by Alexandre Godreau on Unsplash