KUNGFU.AI has decided to open source Kaishi (Chinese for “begin”), a tool to help automate exploratory data analysis and data cleaning. One of KUNGFU.AI’s core values is to be open, and in that spirit, we’ve decided to release our first open-source project under the MIT License, one of the most permissive licenses available.

Short on time? Check out the image dataset tutorial for a crash course.

When moving from one problem to another, there is always a balance to strike between reusable code and code custom-built for the task at hand. In practice, data scientists often end up rewriting a lot of code with each new task, because it’s hard to find the time to make that code reusable. We wanted to tackle this problem.

What is being contributed here?

We’ve seen a lot of good progress in open-source libraries for creating and interacting with cloud deployments, building machine learning models (e.g. scikit-learn), and performing granular data analysis and transformation (e.g. pandas). However, our datasets are often delivered as a plain directory of files that requires a set of common operations to prepare for these machine learning tasks, and we found ourselves redoing that repetitive work over and over. This is where we focused Kaishi.

What does Kaishi do?

Simply put, Kaishi takes a directory of files, manipulates them in some prescribed way, and then saves a new directory of the modified files. The dataset object, as loaded in memory, can also be used for downstream tasks, but that’s entirely up to the data scientist.

We plan to release functionality for more data types in the future; for now, the dataset types are File (generic operations), Tabular, and Image. Image is the most mature of the three, but all of them share core pipeline functionality that enforces reusability by design. Each custom pipeline component ends up being a single, portable class.
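
The “single, portable class” idea can be sketched as follows. This is a conceptual illustration of the design, not Kaishi’s real class hierarchy — the names `FilterEmptyFiles` and `run_pipeline`, and the dict-based dataset, are all our assumptions:

```python
class FilterEmptyFiles:
    """Illustrative pipeline component: a self-contained class that
    inspects a dataset and drops items. Because all of its logic lives
    in one class, it can be copied between projects as a unit."""

    def __call__(self, dataset):
        # Here the dataset is assumed to be a dict of {filename: bytes}.
        return {name: data for name, data in dataset.items() if len(data) > 0}


def run_pipeline(dataset, components):
    """Apply each component in order; the pipeline is just a list of classes."""
    for component in components:
        dataset = component(dataset)
    return dataset
```

The payoff of this design is that reusing a cleaning step in a new project means moving one class, not untangling a script.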

Use cases

This library supports a variety of use cases, especially when augmented with your own functionality. Here are some examples:

Standardize sizes and convert a set of images to grayscale

Deduplicate rows in a directory of .csv files

Detect photos in a directory of document images

Auto-rectify an image dataset

Concatenate a directory of .csv files with the same schema

There are also more specific use cases you might run into that don’t necessarily warrant inclusion in the core library:

Quantize images based on a regular expression in their filenames (done in the docs in 13 lines of code)

Use your own algorithm to make predictions on data points and filter by class label

Create PSD plots from time-series data and save as images

Of course, if you create a pipeline component that you think could be used by the community, feel free to contribute to the project! We’d love to hear from you.

Want to know more?