How I built a spreadsheet app with Python to make data science easier

67,559 reads

@ ricklamers Rick Lamers Co-Founder of Orchest. Improving data science tools one commit at a time.

Today I'm open sourcing "Grid studio", a web-based spreadsheet application with full integration of the Python programming language.

reactions

About a year ago I started tinkering with the idea of building the data science IDE that I had always wanted. Having worked extensively with Microsoft Excel, R (Studio) and Python, I envisioned how some integrated version of those would make my life easier.

reactions

Why?

The main issue that I set out to solve with Grid studio is the scattered workflow that I was experiencing while going back and forth between multiple tools such as R studio and Excel while working on data science projects.

reactions

While exporting that CSV file for the gazillionth-time, running into freezing up of application windows when my row count was too high or trying to do something straightforward such as reading in a JSON file I had had enough. The existing tools did not provide me with the environment and associated workflow that enabled me to be productive.

reactions

That's why I decided to build something that would integrate my workflow into a single, modern and easy to use application fit for my data science needs.

reactions

How does it work?

Grid studio is a web-based application that looks remarkably similar to an ordinary spreadsheet program such as Google Sheets or Microsoft Excel. However, its killer feature is the deep integration of the Python language.

reactions

Viewing data in a tabular structure and manipulating it directly feels naturally to almost everybody who has used a computer.

reactions

Combining this simple UI with the power of a full fledged programming language such as Python really makes it stand out.

reactions

Scripting with Python is as straightforward as possible: just write a few lines and execute it directly.

reactions

Core integration: Reading and Writing to sheets

At the core of the Python integration is the read and write interface to your spreadsheet. A high performance connection between your sheets' data and data in your Python process.

reactions

Simply write to the sheet like so:

reactions

sheet( "A1:A3" , [ 1 , 2 , 3 ])

And read from the sheet like this:

reactions

my_matrix = sheet( "A1:A3" )

With this simple yet powerful function you can read and write directly from and to sheets in order to automate data entry, extraction, visualization and more.

reactions

Writing custom spreadsheet functions

While reading and writing gives you a lot of flexibility through a simple interface, sometimes it makes a lot of sense to write custom functions that can be called directly in your spreadsheet.

reactions

Common spreadsheet functions such as AVERAGE, SUM, IF, etc. are already available by default. But what if you need more?

reactions

Simply write the function you need!

reactions

def UPPERCASE (a) : return str(a).uppercase()

Now call this function in your spreadsheet, just like you would a regular function.

reactions

Leveraging Python ecosystem

By leveraging the power of the Python ecosystem you get instant access to state of the art data science tools:

reactions

This enables simple access to powerful models, such as a linear regression and SVMs for modelling your data.

reactions

Docker runtime

The application runs in a Docker container which gives you easy access to a fully packed and isolated UNIX environment (even on Windows!) with everything ready to go: Python, scikit-learn, numpa, pandas, terminal, wget, zip, and much more.

reactions

This makes installing Grid studio as simple as downloading the prebuilt docker image and running a single command.

reactions

Data visualization

A common task in data science is visualizing your data. Given its importance Grid studio has built in support for advanced plotting by integrating interactive plotting library Plotly.js and Python's standard Matplotlib. This provides you with advanced plotting capabilities in vector sharp format.

reactions

To give you some ideas about how to use Grid studio's features we show how they can combined with some concrete examples.

reactions

Example: Scraping the web

This example shows you the power of having Python at your fingertips. Something that typically requires some back- and forth between tools and files can now be integrated into a single script.

reactions

Above, you see how a short script easily loads news articles from Hacker News directly into the sheet.

reactions

Source: scrape.py

reactions

Example: Estimating a normal distribution

This example shows a somewhat silly use case of estimating the normal distribution with ever higher fidelity visualized with Plotly.js. Here you can see how interactive plotting can give you a sense of what is going on.

reactions

Source: estimate_normal.py

reactions

How can I use it?

Installing Grid studio locally is very simple:

reactions

(Make sure you have Docker installed) 1. Clone the repository with this command: git clone https://github.com/ricklamers/gridstudio 2. Run the bash script (on Windows use e.g. Git Bash) with this command: cd gridstudio && ./run.sh 3. Go to http://127.0.0.1:8080 in your browser

Git Bash for Windows - Docker install

reactions

Note: if you run into issues feel free to open a issue on GitHub, I'll try to assist/fix as quickly as possible.

reactions

Note: on Linux you might need to run (for step 2):

reactions

cd gridstudio && sudo ./run.sh

as Docker requires sudo access to run.

reactions

Release + Future development

As mentioned in the introduction, today Grid studio will be available for free and open source through the GitHub repository.

reactions

If you made it here you are very welcome to try it out yourself and submit any feedback and/or contributions to the project on GitHub.

reactions

I have some ideas about which functionality or features could be added in the future to improve Grid studio. However, since the project is now open source I think it would be wise to track these on GitHub and see which have the highest priority according to all involved.

reactions

- Expanding the number of 'native' functions available in the spreadsheet (like AVERAGE, SUM, IF, etc), maybe even reaching parity (and consistency) with some existing packages like Libre Office's Calc or Excel

reactions

- Syntax highlight/function tooltips when typing formulas in the spreadsheet

reactions

- Advanced sorting and filtering in the spreadsheet

reactions

- Extended controls for interactive plotting with Plotly.js

reactions

- Sharing of workspaces/code more easily (i.e. export workspace)

reactions

- Forms of real-time collaboration (this might be too difficult)

reactions

- Some sort of API/interface for add-ons/extensions

reactions

- Upgrade formula parser to a real grammar based parser

reactions

- Python autocompletion

reactions

- Performance optimization

reactions

- Core Python/sheet integration robustness (no character/sequence breakage)

reactions

Open source motivation

While this project was originally intented for commercial release, I've decided it might be better off as an open source project for everyone to experiment with and potentially be developed by a small community of interested data enthousiasts.

reactions

The reason for this is that during the initial development of the project I've discovered a number of projects that offer similar functionality to Grid studio.

reactions

First, there is an open source plug-in that integrates Python directly into Microsoft Excel called xlwings. Although it does not really integrate spreadsheets and Python into a single coherent product, it does offer the advantage of giving users access to the 'real' fully loaded Excel environment they are already familiar with.

reactions

Second, Python has evolved from IPython to Jupyter Notebooks to JupyterLab. It enjoys a lot of popularity and rightly so, it offers a very nice work environment for data scientists with a strong emphasis on explainable code through long form Notebooks. Although, it lacks any kind of spreadsheet functionality that, in my opinion, is so appealing to novice data scientists because of its intuitive behaviour.

reactions

Overall, projects like these meant that commercializing Grid studio would mean competing with these product substitutes that are frankly available for the incredibly low price of free.

reactions

Regardless, I sincerely believe that Grid studio does have something unique to offer over existing alternatives and could be the tool of choice for quite a few use cases.

reactions

Grid studio is available for free and open source through the GitHub repository.

reactions

reactions

reactions

reactions

Share this story @ ricklamers Rick Lamers Read my stories Co-Founder of Orchest. Improving data science tools one commit at a time.

Tags